CSE5519 Advances in Computer Vision (Topic D: 2024: Image and Video Generation)

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

link to the paper 

This paper shows that autoregressive models can outperform diffusion models at image generation.

Novelty in the autoregressive model

Uses the Llama architecture as the autoregressive backbone.

Uses a codebook (a vector-quantized image tokenizer) together with spatial downsampling to reduce the memory footprint.
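To make the codebook-plus-downsampling idea concrete, here is a minimal sketch (not the paper's implementation; the codebook size, code dimension, and downsampling factor below are illustrative assumptions): the encoder downsamples the image into a small grid of latent vectors, and each latent is replaced by the index of its nearest codebook entry, giving a short discrete token sequence for the autoregressive model.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents:  (N, d) array of encoder outputs (after spatial downsampling)
    codebook: (K, d) array of learned code vectors
    returns:  (N,) array of integer token ids in [0, K)
    """
    # Squared Euclidean distance from every latent to every code vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Downsampling a 256x256 image by a factor of 16 leaves a 16x16 grid,
# i.e. 256 discrete tokens instead of 256*256*3 raw pixel values.
H = W = 256
f = 16                                  # hypothetical downsampling factor
num_tokens = (H // f) * (W // f)        # 256 tokens

codebook = np.random.randn(1024, 8)     # hypothetical K=1024 codes, dim 8
latents = np.random.randn(num_tokens, 8)
tokens = quantize(latents, codebook)    # sequence fed to the AR model
```

The memory saving comes from the sequence length: the transformer attends over the token grid rather than raw pixels, so the cost is set by the downsampling factor and codebook size, not the image resolution directly.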

Later works have shown that an image can often be represented by only a few code words; for example, 32 tokens may be enough to represent most images. However, I doubt this result generalizes to more complex image generation tasks, such as generating human faces, since I find it difficult to describe the people around me distinctively without using their names.

For real-life videos, ensuring contextual consistency may require many more code words. Is such a method scalable to video generation while still producing realistic results, or will the memory cost grow exponentially?
