
CSE5519 Advances in Computer Vision (Topic D: 2021 and before: Image and Video Generation)

High-Resolution Image Synthesis with Latent Diffusion Models.

link to the paper 

Image synthesis in high resolution.

Novelty in Latent Diffusion Models

Transformer encoder for LDMs

Use cross-attention to integrate the text embedding into the latent space.

Tip

How are the transformer encoder and decoder embedded in UNet? How does the implementation go? How does the cross-attention help improve the image generation?
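One way to answer the cross-attention part of this question is to trace the shapes: queries come from the UNet's (flattened) latent feature map, while keys and values come from the text encoder's token embeddings, so every spatial location can attend over all text tokens. Below is a minimal numpy sketch of that mechanism; all shapes, dimensions, and weight matrices are hypothetical placeholders, not the actual Stable Diffusion configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent, text_emb, W_q, W_k, W_v):
    """Queries from the flattened UNet latent map; keys/values from text tokens."""
    Q = latent @ W_q            # (n_pixels, d)
    K = text_emb @ W_k          # (n_tokens, d)
    V = text_emb @ W_v          # (n_tokens, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_pixels, n_tokens)
    return softmax(scores) @ V                # (n_pixels, d)

rng = np.random.default_rng(0)
latent = rng.normal(size=(64, 32))    # e.g. an 8x8 latent map with 32 channels, flattened
text_emb = rng.normal(size=(77, 48))  # e.g. 77 text tokens, 48-dim embeddings (made up)
W_q = rng.normal(size=(32, 16))
W_k = rng.normal(size=(48, 16))
W_v = rng.normal(size=(48, 16))
out = cross_attention(latent, text_emb, W_q, W_k, W_v)
print(out.shape)  # (64, 16)
```

In the real model this block sits inside transformer layers interleaved with the UNet's convolutional stages, so the conditioning is injected at multiple resolutions rather than once.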

On lecture new takes

Variational Autoencoder (VAE)

  • Map input data into a probabilistic latent space and then reconstruct back the original data.
  • Probabilistic latent space allows model to operate on smoother latent space from which we can sample.
  • Each sample is mapped to a Gaussian distribution in the latent space.
  • The exact posterior is intractable, so it is approximated with a Gaussian variational posterior, with a standard Gaussian prior on the latent.
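The sampling step in the bullets above is usually implemented with the reparameterization trick, and the Gaussian assumption gives the KL term in closed form. A minimal numpy sketch, with made-up encoder outputs standing in for a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend encoder outputs for one input: mean and log-variance of q(z|x).
mu = np.array([0.5, -0.2])
logvar = np.array([-1.0, 0.3])

# Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
# so gradients can flow through mu and logvar.
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * logvar) * eps

# Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I):
# 0.5 * sum(mu^2 + sigma^2 - logvar - 1)
kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0)
print(z.shape, kl)
```

The training loss would add this KL term to a reconstruction term from the decoder.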

Drawbacks:

  • Lose high-frequency information.
  • The aggregate latent distribution is usually not Gaussian.

Diffusion models:

Can be viewed as a stack of learnable VAE-style decoders, each one denoising a single step of the reverse process.

Latent Diffusion Models (Stable diffusion)

Ok, that’s the name I recognize.

Vanilla diffusion models operate in pixel space, which is expensive.

Perform diffusion process in latent space.

First train a powerful VAE to encode the data. Then run diffusion on these VAE latent codes. Finally, decode the latents back into an image with the VAE decoder.
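The "diffusion on latent codes" part of this pipeline can be sketched as the standard forward-noising process, just applied to a VAE latent instead of pixels. A minimal numpy sketch, assuming a made-up 4x8x8 latent and a simple linear beta schedule (the specific numbers are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical VAE latent code for one image (much smaller than pixel space).
z0 = rng.normal(size=(4, 8, 8))

# Linear noise schedule; alpha_bar_t is the cumulative product of (1 - beta).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(z0, t, rng):
    """Forward diffusion in latent space:
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.normal(size=z0.shape)
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * noise

z_t = q_sample(z0, t=500, rng=rng)
print(z_t.shape)  # (4, 8, 8)
```

Training then fits a UNet to predict the added noise from z_t, and generation reverses the process in latent space before a single VAE decode at the end, which is where the cost savings come from.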

VAE training

Semantic compression: LDM

Perceptual compression: Autoencoder+GAN

Limitations

Lack of contextual understanding.
