CSE5519 Advances in Computer Vision (Topic B: 2023: Vision-Language Models)

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Tip

This paper introduces InstructBLIP, an instruction-tuning framework for vision-language models that trains the model to follow text instructions.

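It builds on BLIP-2 and consists of three submodules: an image encoder, an LLM, and a Query Transformer (Q-Former) that bridges the two. The image encoder and the LLM stay frozen; the Q-Former also receives the instruction text, so the visual features it extracts are conditioned on the instruction.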
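The data flow between the three submodules can be sketched roughly as follows. This is a toy, standard-library-only illustration and not the paper's implementation: the attention has no learned projections, residual connections, or feed-forward layers, the projection into the LLM's embedding space is omitted, and all names (`cross_attend`, `q_former_sketch`, `rand_matrix`) and sizes are hypothetical. It assumes the Q-Former's queries first interact with the instruction tokens and then cross-attend to the image features.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Toy single-head dot-product attention (no learned weights)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(dim)])
    return out


def q_former_sketch(image_feats, instruction_embs, query_embs):
    """Hypothetical stand-in for an instruction-aware Q-Former."""
    # Assumption: queries attend jointly to themselves and the instruction
    # tokens, which makes the later feature extraction instruction-dependent.
    conditioned = cross_attend(query_embs, query_embs + instruction_embs)
    # The conditioned queries then read from the frozen image features.
    return cross_attend(conditioned, image_feats)


rng = random.Random(0)


def rand_matrix(rows, dim):
    """Random vectors standing in for real embeddings."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(rows)]


DIM = 8                                  # toy embedding size
image_feats = rand_matrix(16, DIM)       # patch features from the frozen image encoder
instruction_embs = rand_matrix(5, DIM)   # embedded instruction tokens
query_embs = rand_matrix(4, DIM)         # learnable queries (toy count)

visual_prompt = q_former_sketch(image_feats, instruction_embs, query_embs)
# The LLM sees the query outputs as soft visual prompts ahead of the instruction.
llm_input = visual_prompt + instruction_embs
print(len(visual_prompt), len(llm_input))
```

The point of the sketch is the shape of the interface: however many image patches come in, the LLM receives a fixed number of visual tokens (one per query), and those tokens change when the instruction changes.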

The qualitative results give some hints that the model is following the text instructions, but I wonder whether this framework could also be extended to image editing and generation tasks. What difficulties might arise in migrating this framework to context-aware image generation?

Last updated on March 9, 2026
