Getting oriented #
- Variational Autoencoder (VAE)
- Generative Adversarial Network (GAN)
- Auto-regressive models
- Diffusion models
Variational Autoencoder (VAE) #
One of the earlier techniques, introduced in 2013. It's implemented as two neural networks (stacks of layers): an encoder and a decoder.
- VAE training:
- Encoder: a neural network that maps an input image into a lower-dimensional space (a latent vector). Early on these were trained on the MNIST dataset, which is a large collection of handwritten digits.
- Decoder: another neural network that maps the latent vector back into an image. A loss function measures the difference between the original image and the reconstructed image, plus a regularization term that keeps the latent space close to a Gaussian distribution (which is what makes sampling possible later). The goal of training is to minimize this loss, i.e., get better at representing the input image.
- VAE inference:
- We don't need the encoder at inference time; we just need the decoder. The input is a random vector sampled from a multivariate Gaussian distribution, and the decoder maps it to an image that resembles the training data (see the sketch after this list).
- How a VAE can be implemented, and a visualizer that helps you develop some intuition for how VAEs work.
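Here's a minimal sketch of how the encoder/decoder pair and the loss could look, assuming PyTorch and flattened 28x28 MNIST-style images in [0, 1]; the layer sizes, latent dimension, and names are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: encoder maps a 28x28 image to a small latent vector,
    decoder maps a latent vector back to an image."""
    def __init__(self, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of the latent Gaussian
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of the latent Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),       # pixel values in [0, 1]
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.decoder(z), mu, logvar

def loss_fn(x, x_hat, mu, logvar):
    # Reconstruction term + KL term that keeps the latent space close to N(0, I)
    recon = nn.functional.binary_cross_entropy(x_hat, x.flatten(1), reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Inference: only the decoder is needed -- sample a random latent vector and decode it.
model = VAE()
z = torch.randn(1, 16)
image = model.decoder(z).view(28, 28)
```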
There's an interesting article that shows how someone generated X-ray data using a VAE.
But… VAEs are not typically used for consumer-oriented applications because the outputs are often blurry and not very realistic. The pixel-wise reconstruction loss and the smooth, Gaussian-shaped latent space push the model toward "average" images rather than sharp, detailed ones.
Generative Adversarial Network (GAN) #
The Generative Adversarial Network paper (2014) quickly became one of the most popular papers in computer vision history :-)
In GAN training there are two neural networks, each a stack of layers, trained against each other.
- The Generator tries to generate better and better images.
- The Discriminator becomes better and better at distinguishing between real and fake (AI-generated) images.
- This is just a classifier: Fake or Real.
The loss function rewards the Generator the more it can fool the Discriminator, and rewards the Discriminator for catching fakes. The training process constantly tweaks both networks simultaneously (see the sketch below).
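A rough sketch of one GAN training step, again assuming PyTorch and flattened MNIST-style images; the architectures and hyperparameters are placeholders chosen for illustration, not from any specific GAN paper.

```python
import torch
import torch.nn as nn

latent_dim = 64

# Generator: random vector -> fake 28x28 image (flattened)
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
# Discriminator: image -> probability that it is real (a Fake/Real classifier)
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real_images):
    # real_images: (batch, 784), assumed normalized to [-1, 1]
    batch = real_images.size(0)
    real, fake = torch.ones(batch, 1), torch.zeros(batch, 1)

    # 1) Discriminator step: classify real images as real, generated images as fake.
    z = torch.randn(batch, latent_dim)
    generated = G(z).detach()                # don't backprop into G here
    d_loss = bce(D(real_images), real) + bce(D(generated), fake)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Generator step: try to make D label generated images as real (fool it).
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), real)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```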
Like VAEs, GANs are not commonly used for advanced text-to-image generation. Google has some training material that explains the basics of GANs, including the difficulties of training them. Still, GANs made a lot of progress, e.g. StyleGAN2 can do things like adjusting the facial expression in a photo.
Auto-regressive modeling #
Unlike VAEs and GANs, which each have two components, auto-regressive modeling uses a single model that treats an image as a sequence of pixels (or patches of pixels), so the image can be generated pixel by pixel. This started out with convolutional neural networks (CNNs, e.g. PixelCNN) and has since been extended to transformers (the self-attention mechanism).
Real image -> Convert to sequence -> Vector -> MODEL -> vector.
Each pixel is conditionally generated based on all the previous pixels.
The advantage of using transformers (self-attention) is that the generation isn't random: concepts like "cat" or "dog" are learned from the data, and the model can be conditioned on a text input. As with LLMs, this conditioning is really key in text-to-image (t2i) generation. That's what the original DALL·E was based on.
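A sketch of the idea, assuming the image has already been quantized into discrete tokens (e.g. by a VQ-VAE-style codebook) and ignoring positional embeddings and text conditioning for brevity; the vocabulary size, grid size, and layer counts below are made-up examples, not DALL·E's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical setup: a 16x16 grid of image tokens drawn from a codebook of 1024 entries,
# and a small transformer that predicts the next token from all previous ones.
vocab_size, seq_len, d_model = 1024, 256, 256

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
transformer = nn.TransformerEncoder(layer, num_layers=4)
to_logits = nn.Linear(d_model, vocab_size)

def generate(prefix_tokens):
    """Generate image tokens one at a time, each conditioned on all previous tokens."""
    tokens = prefix_tokens  # in a t2i model, this could start with text tokens
    for _ in range(seq_len - tokens.size(1)):
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = transformer(embed(tokens), mask=causal_mask)   # positional embeddings omitted
        logits = to_logits(h[:, -1])                       # distribution over the next token
        next_token = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_token], dim=1)    # append and repeat
    return tokens

sample = generate(torch.zeros(1, 1, dtype=torch.long))     # start from a single start token
```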
Diffusion Models #
First introduced in 2015, this is an iterative process: during training, you take an image, add noise to it, and train the model to predict a slightly denoised version of the image (or, equivalently, the noise that was added). This step is repeated across many noise levels. The goal of training is to reduce the loss function such that the model can reverse the noising process and generate images that are indistinguishable from real original images.
At inference time, you start with a random noise vector and iteratively denoise it until you get a high-quality image (a rough sketch of both phases is below).
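A DDPM-style sketch of both phases, assuming flattened images and a deliberately over-simplified denoiser (real models are U-Nets or transformers that are also conditioned on the timestep); the noise schedule and shapes are illustrative.

```python
import torch
import torch.nn as nn

T = 1000                                   # number of noise steps
betas = torch.linspace(1e-4, 0.02, T)      # noise schedule
alphas_bar = torch.cumprod(1 - betas, 0)   # cumulative signal-retention factors

# Stand-in denoiser; real models also take the timestep as input.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 784))

def train_step(x0):                        # x0: (batch, 784) clean images
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward process: add noise
    pred = model(xt)                              # model predicts the added noise
    return nn.functional.mse_loss(pred, noise)    # minimize prediction error

@torch.no_grad()
def sample():
    x = torch.randn(1, 784)                       # start from pure noise
    for t in reversed(range(T)):                  # iteratively denoise
        a, a_bar = 1 - betas[t], alphas_bar[t]
        pred_noise = model(x)
        x = (x - betas[t] / (1 - a_bar).sqrt() * pred_noise) / a.sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # add back a little noise
    return x
```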
This was transformational in terms of the complexity, quality, and size of images. For example, with DALL·E 2, OpenAI switched from the original DALL·E's autoregressive approach to diffusion models, with a significant improvement in image quality.
Comparison #
Each approach has its trade-offs: VAEs and GANs are faster but produce lower-quality images, while autoregressive and diffusion models produce better results at the cost of speed and computational resources.
| Characteristics | VAE | GAN | Autoregressive | Diffusion |
|---|---|---|---|---|
| Quality | Low | Moderate | High | Exceptional |
| Speed | Fast | Fast | Slow | Slow |
| Training stability | Stable | Unstable | Stable | Stable |
| Control over generation | Limited | Limited | Flexible | Moderate |
| Facial manipulation | No | Yes | No | No |
| Novelty | Limited | Limited | High | High |
| Resource intensity | Moderate | Moderate | High | High |
Text to Image #
Diffusion models start from random noise and work towards a clear image that approximates the training data. But with text-to-image, the whole point is to avoid pure randomness: the output has to match the prompt!
- Random (unconditional) generation: the model samples from p(x), the distribution of images it was trained on.
- Text-to-image (T2I) generation is conditioned on the input text: the model samples from p(x | text), so the prompt steers every denoising step (see the sketch below).
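A sketch of what that conditioning might look like, building on the diffusion sketch above; the concatenation-based conditioning, layer sizes, and text embedding size are illustrative simplifications (production systems typically use a pretrained text encoder plus cross-attention over the text tokens).

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Denoiser that sees the text embedding at every step, so the text
    steers the whole denoising trajectory instead of leaving it random."""
    def __init__(self, image_dim=784, text_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, image_dim),
        )

    def forward(self, noisy_image, text_embedding):
        # Simplest possible conditioning: concatenate the text embedding onto the
        # noisy image. Real models use cross-attention between image and text features.
        return self.net(torch.cat([noisy_image, text_embedding], dim=-1))

# In the sampling loop from the diffusion sketch above, the unconditional model(x)
# call would become denoiser(x, text_embedding): same loop, but every denoising
# step is now conditioned on the prompt.
denoiser = ConditionalDenoiser()
text_embedding = torch.randn(1, 512)   # stand-in for a CLIP/T5-style text encoder output
pred_noise = denoiser(torch.randn(1, 784), text_embedding)
```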
Resources #
- Auto-Encoding Variational Bayes
- Kaggle MNIST dataset
- TensorFlow MNIST
- VAE implementation
- VAE latent space visualization
- GAN paper
- Overview of GAN structure
- StyleGAN2
- Conditional Image Generation with PixelCNN Decoders
- Image Transformer
- DALL·E
- Zero-Shot Text-to-Image Generation
- Deep Unsupervised Learning using Nonequilibrium Thermodynamics
- Denoising Diffusion Probabilistic Models
- DALL·E 2 paper
- This person does not exist
- DALL·E 2
- DALL·E 3
- Introducing 4o Image Generation
- GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
- Imagen 4
- LMArena leaderboard
- LAION
- High-Resolution Image Synthesis with Latent Diffusion Models
- U-Net paper
- DiT paper
- Video diffusion models
- Video generation models as world simulators
- Veo
- Video generation leaderboard