Getting oriented #
- Variational Autoencoder (VAE)
- Generative Adversarial Network (GAN)
- Auto-regressive models
- Diffusion models
Variational Autoencoder (VAE) #
One of the earlier techniques, introduced in 2013. It's implemented as two neural networks (stacks of layers): an encoder and a decoder.
- VAE training:
- Encoder: a neural network that maps an input image into a lower-dimensional space (a latent vector). Early on these were trained on the MNIST dataset, which is a large collection of handwritten digits.
- Decoder: another neural network that maps the latent vector back into an image. A loss function measures the difference between the original image and the reconstructed image, plus a regularization term that keeps the latent space close to a Gaussian distribution (which is what makes sampling possible later). The goal of training is to minimize this loss, i.e., get better at representing the input image.
- VAE inference:
- We don't need the encoder at inference time; we just need the decoder. The input is a random vector sampled from a multivariate Gaussian distribution, and the decoder maps it to an image that resembles the training data (see the sketch after this list).
- How a VAE can be implemented, and a visualizer that helps you develop some intuition for how VAEs work.
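Here's a minimal sketch of how the encoder/decoder pair and the loss could look, assuming PyTorch and flattened 28x28 MNIST-style images in [0, 1]; the layer sizes, latent dimension, and names are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: encoder maps a 28x28 image to a small latent vector,
    decoder maps a latent vector back to an image."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of the latent Gaussian
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of the latent Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),       # pixel values in [0, 1]
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def loss_fn(x, x_hat, mu, logvar):
    # Reconstruction term + KL term that keeps the latent space close to N(0, I)
    recon = nn.functional.binary_cross_entropy(x_hat, x.flatten(1), reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Inference: only the decoder is needed -- sample a random latent vector and decode it.
model = VAE()
z = torch.randn(1, 16)
image = model.decoder(z).view(28, 28)
```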
There's an interesting article that shows how someone generated X-ray data using a VAE.
But… VAEs are not typically used for consumer-oriented applications because the outputs are often blurry and not very realistic. The pixel-wise reconstruction loss and the smooth, Gaussian-shaped latent space push the model toward "average" images rather than sharp, detailed ones.
Generative Adversarial Network (GAN) #
The Generative Adversarial Network paper (2014) quickly became one of the most popular papers in computer vision history :-)
In GAN training there are two neural networks, each a stack of layers, trained against each other.
- The Generator tries to generate better and better images.
- The Discriminator becomes better and better at distinguishing between real and fake (AI-generated) images.
- This is just a classifier: Fake or Real.
The loss function rewards the Generator the more it can fool the Discriminator, and rewards the Discriminator for catching fakes. The training process constantly tweaks both networks simultaneously (see the sketch below).
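A rough sketch of one GAN training step, again assuming PyTorch and flattened MNIST-style images; the architectures and hyperparameters are placeholders chosen for illustration, not from any specific GAN paper.

```python
import torch
import torch.nn as nn

latent_dim = 64

# Generator: random vector -> fake 28x28 image (flattened)
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
# Discriminator: image -> probability that it is real (a Fake/Real classifier)
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):
    # real_images: (batch, 784), assumed normalized to [-1, 1]
    batch = real_images.size(0)
    real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator step: classify real images as real, generated images as fake.
    z = torch.randn(batch, latent_dim)
    generated = G(z).detach()                # don't backprop into G here
    d_loss = bce(D(real_images), real) + bce(D(generated), fake)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator step: try to make D label generated images as real (fool it).
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), real)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```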
Like VAEs, GANs are not commonly used for advanced text-to-image generation. Google has some training material that explains the basics of GANs, including the difficulties of training them. Still, GANs made a lot of progress, e.g. StyleGAN2 can do things like adjusting the facial expression in a photo.
Auto-regressive modeling #
Unlike VAEs and GANs, which each have two components, auto-regressive modeling uses a single model that treats an image as a sequence of pixels (or patches of pixels), so the image can be generated pixel by pixel. This started out with convolutional neural networks (CNNs, e.g. PixelCNN) and has since been extended to transformers (the self-attention mechanism).
Real image -> Convert to sequence -> Vector -> MODEL -> vector.
Each pixel is conditionally generated based on all the previous pixels.
The advantage of using transformers (self-attention) is that the generation isn't random: concepts like "cat" or "dog" are learned from the data, and the model can be conditioned on a text input. As with LLMs, this conditioning is really key in text-to-image (t2i) generation. That's what the original DALL·E was based on.
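A sketch of the idea, assuming the image has already been quantized into discrete tokens (e.g. by a VQ-VAE-style codebook) and ignoring positional embeddings and text conditioning for brevity; the vocabulary size, grid size, and layer counts below are made-up examples, not DALL·E's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a 16x16 grid of image tokens drawn from a codebook of 1024 entries,
# and a small transformer that predicts the next token from all previous ones.
vocab_size, seq_len, d_model = 1024, 256, 256

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=4)
to_logits = nn.Linear(d_model, vocab_size)

def generate(prefix_tokens):
    """Generate image tokens one at a time, each conditioned on all previous tokens."""
    tokens = prefix_tokens  # in a t2i model, this could start with text tokens
    for _ in range(seq_len - tokens.size(1)):
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = transformer(embed(tokens), mask=causal_mask)   # positional embeddings omitted
        logits = to_logits(h[:, -1])                       # distribution over the next token
        next_token = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_token], dim=1)    # append and repeat
    return tokens

sample = generate(torch.zeros(1, 1, dtype=torch.long))     # start from a single start token
```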
Diffusion Models #
First introduced in 2015, this is an iterative process: during training, you take an image, add noise to it, and train the model to predict a slightly denoised version of the image (or, equivalently, the noise that was added). This step is repeated across many noise levels. The goal of training is to reduce the loss function such that the model can reverse the noising process and generate images that are indistinguishable from real original images.
At inference time, you start with a random noise vector and iteratively denoise it until you get a high-quality image (a rough sketch of both phases is below).
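A DDPM-style sketch of both phases, assuming flattened images and a deliberately over-simplified denoiser (real models are U-Nets or transformers that are also conditioned on the timestep); the noise schedule and shapes are illustrative.

```python
import torch
import torch.nn as nn

T = 1000                                   # number of noise steps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
alphas_bar = torch.cumprod(1 - betas, 0)   # cumulative signal-retention factors

# Stand-in denoiser; real models also take the timestep as input.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 784))

def train_step(x0):                        # x0: (batch, 784) clean images
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward process: add noise
    pred = model(xt)                              # model predicts the added noise
    return nn.functional.mse_loss(pred, noise)    # minimize prediction error

@torch.no_grad()
def sample():
    x = torch.randn(1, 784)                       # start from pure noise
    for t in reversed(range(T)):                  # iteratively denoise
        a, a_bar = 1 - betas[t], alphas_bar[t]
        pred_noise = model(x)
        x = (x - betas[t] / (1 - a_bar).sqrt() * pred_noise) / a.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # add back a little noise
    return x
```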
This was transformational in terms of the complexity, quality, and size of images. For example, with DALL·E 2, OpenAI switched from the original DALL·E's autoregressive approach to diffusion models, with a significant improvement in image quality.
Comparison #
Each approach has its trade-offs: VAEs and GANs are faster but produce lower-quality images, while autoregressive and diffusion models produce better results at the cost of speed and computational resources.
| Characteristics | VAE | GAN | Autoregressive | Diffusion |
|---|---|---|---|---|
| Quality | Low | Moderate | High | Exceptional |
| Speed | Fast | Fast | Slow | Slow |
| Training stability | Stable | Unstable | Stable | Stable |
| Control over generation | Limited | Limited | Flexible | Moderate |
| Facial manipulation | No | Yes | No | No |
| Novelty | Limited | Limited | High | High |
| Resource intensity | Moderate | Moderate | High | High |
Text to Image #
Diffusion models start from random noise and work towards a clear image that approximates the training data. But with text-to-image, the whole point is to avoid pure randomness: the output has to match the prompt!
- Random (unconditional) generation: the model samples from p(x), the distribution of images it was trained on.
- Text-to-image (T2I) generation is conditioned on the input text: the model samples from p(x | text), so the prompt steers every denoising step (see the sketch below).
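A sketch of what that conditioning might look like, building on the diffusion sketch above; the concatenation-based conditioning, layer sizes, and text embedding size are illustrative simplifications (production systems typically use a pretrained text encoder plus cross-attention over the text tokens).

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Denoiser that sees the text embedding at every step, so the text
    steers the whole denoising trajectory instead of leaving it random."""
    def __init__(self, image_dim=784, text_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, image_dim),
        )

    def forward(self, noisy_image, text_embedding):
        # Simplest possible conditioning: concatenate the text embedding onto the
        # noisy image. Real models use cross-attention between image and text features.
        return self.net(torch.cat([noisy_image, text_embedding], dim=-1))

# In the sampling loop from the diffusion sketch above, the unconditional model(x)
# call would become denoiser(x, text_embedding): same loop, but every denoising
# step is now conditioned on the prompt.
denoiser = ConditionalDenoiser()
text_embedding = torch.randn(1, 512)   # stand-in for a CLIP/T5-style text encoder output
pred_noise = denoiser(torch.randn(1, 784), text_embedding)
```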
Resources #
- Auto-Encoding Variational Bayes
- Kaggle MNIST dataset
- TensorFlow MNIST
- VAE implementation
- VAE latent space visualization
- GAN paper
- Overview of GAN structure
- StyleGAN2
- Conditional Image Generation with PixelCNN Decoders
- Image Transformer
- DALL·E
- Zero-Shot Text-to-Image Generation
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics
- Denoising Diffusion Probabilistic Models
- DALL·E 2 paper
- This person does not exist
- DALL·E 2
- DALL·E 3
- Introducing 4o Image Generation
- GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
- Imagen 4
- LMArena leaderboard
- LAION
- High-Resolution Image Synthesis with Latent Diffusion Models
- U-Net paper
- DiT paper
- Video diffusion models
- Video generation models as world simulators
- Veo
- Video generation leaderboard