CSE5519 Advances in Computer Vision (Topic D: 2023: Image and Video Generation)

Scalable Diffusion Models with Transformers

Create a diffusion model with transformers.

Train conditional DiT (Diffusion Transformer) models on patches of the latent representation, replacing the U-Net backbone with a transformer.
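To make "latent patches" concrete, here is a minimal sketch of how a latent feature map might be cut into a sequence of tokens before it reaches the transformer. The function name `patchify`, the nested-list representation, and the toy sizes are assumptions for illustration; they are not taken from the paper's code, and a real model would also apply a learned linear projection and positional embeddings to each token.

```python
def patchify(latent, patch_size):
    """Split a latent of shape (C, H, W), given as nested lists, into a
    row-major sequence of flattened patches.

    Each token has length C * patch_size * patch_size.
    """
    channels = len(latent)
    height = len(latent[0])
    width = len(latent[0][0])
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")

    tokens = []
    # Walk the patch grid top-to-bottom, left-to-right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for c in range(channels):
                for y in range(top, top + patch_size):
                    token.extend(latent[c][y][left:left + patch_size])
            tokens.append(token)
    return tokens


# Toy latent: 4 channels on an 8 x 8 grid (illustrative sizes only).
C, H, W, P = 4, 8, 8, 2
latent = [[[c * 100 + y * W + x for x in range(W)] for y in range(H)]
          for c in range(C)]
tokens = patchify(latent, P)
print(len(tokens), len(tokens[0]))  # 16 16
```

The sequence length is (H / P) * (W / P), so halving the patch size quadruples the number of tokens the transformer must process.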

Tip

This paper shows that conditional DiT models operating on latent patches can replace the U-Net backbone, and that scaling up the transformer improves image generation performance.

I wonder how classifier-free guidance is used when training the DiT, and whether the model also has in-context learning ability, as other transformer models do.
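As background for the first question, the following is a minimal sketch of the standard classifier-free guidance recipe: during training the condition is randomly swapped for a "null" label so that one network learns both conditional and unconditional predictions, and at sampling time the two predictions are combined. Whether DiT follows exactly this recipe is not confirmed by these notes; `NULL_CLASS`, `drop_labels`, `guided_noise`, and the numeric values are hypothetical names and settings chosen for illustration.

```python
import random

NULL_CLASS = 1000  # hypothetical id reserved for the "no label" condition


def drop_labels(labels, drop_prob, rng):
    """Training side: replace each label with NULL_CLASS with probability
    drop_prob, so the model also sees unconditional examples."""
    return [NULL_CLASS if rng.random() < drop_prob else y for y in labels]


def guided_noise(eps_cond, eps_uncond, scale):
    """Sampling side: move the unconditional noise prediction toward the
    conditional one. scale = 1 recovers the conditional prediction."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


rng = random.Random(0)
batch_labels = [3, 7, 42, 7]
# drop_prob here is an illustrative value, not a setting from the paper.
train_labels = drop_labels(batch_labels, drop_prob=0.1, rng=rng)

# Toy noise predictions standing in for two forward passes of the model.
eps = guided_noise([1.0, 2.0], [0.0, 1.0], scale=2.0)
print(eps)  # [2.0, 3.0]
```

A scale above 1 pushes samples further toward the condition, typically trading diversity for fidelity.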

Last updated on March 9, 2026
