CSE5519 Advances in Computer Vision (Topic B: 2024: Vision-Language Models)

Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)

This paper shows that visual instruction tuning can improve the performance of vision-language models.

Novelty in LLaVA-1.5

Scaling to high-resolution images by dividing each image into a grid of patches while maintaining data efficiency.
Compositional ability: training on long-form language reasoning together with shorter visual reasoning can improve the model’s writing ability.
Randomly downsampling the training data does not significantly degrade performance.
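The grid-splitting idea in the first point can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `grid_boxes` and `visual_token_count` are hypothetical, the tile size of 336 and ViT patch size of 14 assume a CLIP ViT-L/14 encoder at 336px, and padding the image up to a whole number of tiles is an assumed policy. The extra global view stands for the downsampled copy of the full image whose features are concatenated for global context.

```python
import math


def grid_boxes(width, height, tile=336):
    """Return (left, top, right, bottom) crop boxes that tile the image.

    Assumption: the image is padded up to a whole number of tiles,
    so boxes in the last row/column may extend past the original size.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


def visual_token_count(width, height, tile=336, vit_patch=14, global_view=True):
    """Count visual tokens passed to the LLM for one image.

    Each tile is encoded independently and yields (tile / vit_patch)^2 tokens;
    an optional downsampled global view adds one more tile's worth.
    """
    tokens_per_tile = (tile // vit_patch) ** 2
    n_tiles = len(grid_boxes(width, height, tile))
    return tokens_per_tile * (n_tiles + (1 if global_view else 0))


if __name__ == "__main__":
    print(len(grid_boxes(672, 672)))      # 4 tiles in a 2x2 grid
    print(visual_token_count(672, 672))   # 576 * (4 + 1) = 2880
```

Under these assumptions the number of visual tokens grows linearly with the number of tiles, which is the main cost of this approach to higher resolution.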

Tip

This paper shows that LLaVA-1.5 obeys the scaling law and that splitting high-resolution images into grids maintains data efficiency. I wonder why this method is not applicable to multi-image understanding tasks. Why can't we assign an index embedding to each image and feed the whole image set to the model for better understanding? What are the technical challenges in implementing this idea?