CSE5519 Advances in Computer Vision (Topic D: 2021 and before: Image and Video Generation)
High-Resolution Image Synthesis with Latent Diffusion Models.
Image synthesis in high resolution.
Novelty in Latent Diffusion Models
Transformer encoder for LDMs
Uses cross-attention to integrate the text embedding into the UNet's latent feature space.
Tip
How are the transformer encoder and decoder embedded in the UNet? How does the implementation work? How does cross-attention help improve image generation?
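A minimal sketch of what such a cross-attention layer could look like (assumptions: this is a simplified single module with illustrative dimensions, not the paper's exact code; queries come from the UNet's flattened spatial features, keys/values from the text encoder's token embeddings):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Sketch: spatial features attend to text-token embeddings."""
    def __init__(self, dim, ctx_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)      # queries from UNet features
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)  # keys from text embeddings
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)  # values from text embeddings
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, ctx):
        # x:   (B, H*W, dim)   flattened UNet feature map
        # ctx: (B, T, ctx_dim) text-token embeddings (e.g. from a transformer encoder)
        b, n, d = x.shape
        h, dh = self.heads, d // self.heads
        q = self.to_q(x).view(b, n, h, dh).transpose(1, 2)
        k = self.to_k(ctx).view(b, ctx.shape[1], h, dh).transpose(1, 2)
        v = self.to_v(ctx).view(b, ctx.shape[1], h, dh).transpose(1, 2)
        # Each spatial location attends to every prompt token.
        attn = torch.softmax(q @ k.transpose(-2, -1) / dh ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```

As far as I can tell from the released code, such layers sit inside transformer blocks interleaved with the UNet's residual blocks at several resolutions: the feature map is flattened to a token sequence, attends to the prompt tokens, and is reshaped back to spatial form.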
New takes from the lecture
Variational Autoencoder (VAE)
- Maps input data into a probabilistic latent space and then reconstructs the original data from it.
- The probabilistic latent space is smoother than a deterministic one, so we can sample from it.
- Each input is mapped to a Gaussian distribution in the latent space.
- The exact posterior is intractable, so we approximate it with a Gaussian whose mean and variance the encoder predicts, regularized toward a standard Gaussian prior (see the sketch after the drawbacks list below).
Drawbacks:
- Reconstructions lose high-frequency detail.
- The joint latent distribution is usually not Gaussian, so the Gaussian approximation is imperfect.
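A minimal sketch of the encode → reparameterize → decode path described above (assumptions: a one-layer MLP encoder/decoder and an MSE reconstruction term, just to show the mechanics):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)  # predicts mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)             # q(z|x) = N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        x_rec = self.dec(z)
        # KL(q(z|x) || N(0, I)) pulls the latent space toward the Gaussian prior,
        # which is what keeps it smooth and sample-able.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return F.mse_loss(x_rec, x) + kl                      # training loss
```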
Diffusion models:
Can be viewed as a stack of learnable VAE-like decoders: each denoising step acts as a small conditional decoder, applied repeatedly.
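A minimal sketch of the standard DDPM training objective that each such step optimizes (assumptions: `model` is any noise-prediction network, and the linear schedule values are illustrative, not tuned):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product \bar{alpha}_t

def ddpm_loss(model, x0):
    # Noise the clean data in closed form:
    #   x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    # Each step's denoiser acts like a small learnable decoder: predict the noise.
    return torch.mean((model(xt, t) - eps) ** 2)
```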
Latent Diffusion Models (Stable Diffusion)
Ok, that’s the name I recognize.
Vanilla diffusion models operate in pixel space, which is expensive.
LDMs perform the diffusion process in latent space instead.
First train a powerful VAE to encode the data, then run diffusion on the VAE latent codes, and finally decode the denoised latents back into an image with the VAE decoder.
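A sketch of the resulting generation pipeline (assumptions: `vae`, `unet`, and `sampler` are hypothetical stand-ins for the pretrained components and the iterative sampling loop):

```python
import torch

@torch.no_grad()
def generate(vae, unet, sampler, text_emb, latent_shape):
    # 1. Diffusion runs entirely in the VAE's latent space (cheap),
    #    starting from pure Gaussian noise.
    z = torch.randn(latent_shape)
    # 2. The sampler iteratively denoises, conditioning the UNet on the
    #    text embedding via cross-attention at every step.
    z = sampler(unet, z, context=text_emb)
    # 3. A single VAE decoder pass maps latents back to pixel space.
    return vae.decode(z)
```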
VAE training
Semantic compression: handled by the LDM (the diffusion model captures semantic structure in latent space).
Perceptual compression: handled by the autoencoder, trained with a perceptual loss plus a patch-based GAN (adversarial) loss.
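A sketch of how these losses could be combined for the autoencoder (assumptions: the weights are illustrative, and plain L1 stands in for the paper's LPIPS perceptual loss so the snippet stays self-contained):

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_rec, mu, logvar, disc_logits_fake,
                     kl_weight=1e-6, adv_weight=0.5):
    """Sketch of the autoencoder objective: reconstruction + small KL + adversarial."""
    rec = F.l1_loss(x_rec, x)                # stand-in for L1 + LPIPS perceptual loss
    # Small KL penalty toward N(0, I): regularizes without crushing detail.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Generator side of the patch-based GAN loss: make the discriminator
    # score reconstructions highly (simplified, non-saturating form).
    adv = -torch.mean(disc_logits_fake)
    return rec + kl_weight * kl + adv_weight * adv
```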
Limitations
Lack of contextual understanding.