CSE5519 Advances in Computer Vision (Topic B: 2022: Vision-Language Models)
BLIP
Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BLIP is a unified vision-language pre-training framework that learns from noisy web image-text pairs.
Novelty in BLIP
MED
MED is a multimodal mixture of encoder-decoder architecture, which can operate in one of three modes:
- Unimodal encoder
- separately encodes image and text
- Image-grounded text encoder
- injects visual information by inserting one additional cross-attention layer into each transformer block of the text encoder
- Image-grounded text decoder
- replaces the bidirectional self-attention layers of the image-grounded text encoder with causal self-attention layers
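The encoder/decoder distinction above comes down to the self-attention mask: the image-grounded text encoder lets every token attend to every other token, while the decoder uses a causal mask so each token only sees its predecessors. A minimal sketch (my own illustration, not code from the paper):

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Build a (seq_len, seq_len) mask where 1 = may attend, 0 = blocked."""
    if causal:
        # Decoder mode: token i attends only to tokens 0..i (lower triangle).
        return np.tril(np.ones((seq_len, seq_len), dtype=int))
    # Encoder mode: bidirectional, every token attends to every token.
    return np.ones((seq_len, seq_len), dtype=int)

bidirectional = attention_mask(4, causal=False)
causal = attention_mask(4, causal=True)
```

Swapping this mask (and the task-specific tokens) is what lets the three modes share most of their parameters.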
Pre-training objectives
- Image-text contrastive loss
- aligns the feature spaces of the vision transformer and the text transformer
- Image-text matching loss
- learns a multimodal image-text representation that captures the fine-grained alignment between the image and text
- Language modeling loss
- generates textual descriptions given an image, optimizing a cross-entropy loss over the predicted text tokens
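To make the contrastive objective concrete, here is a minimal InfoNCE-style sketch of an image-text contrastive loss (my own simplification; BLIP additionally uses momentum encoders and soft labels, which are omitted here). Matched pairs sit on the diagonal of the similarity matrix:

```python
import numpy as np

def itc_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of B image and B text embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B); entry (i, j) = sim(img_i, txt_j)
    labels = np.arange(len(logits))             # image i is matched with text i

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)    # stabilize softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Well-aligned batches produce a dominant diagonal and hence a low loss; misaligned pairs push probability mass off the diagonal and the loss grows.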
CapFilt
CapFilt is a method to improve the quality of the text corpus.
A captioner generates synthetic captions for web images.
A filter removes noisy image-text pairs, both original web texts and synthetic captions.
- Diversity is key for synthetic captions.
- The improvement is not simply due to longer training on more data.
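The CapFilt procedure can be sketched as a simple loop (the `captioner` and `filt` callables below are hypothetical stand-ins for the fine-tuned decoder and ITM-based filter, not the paper's API):

```python
def capfilt(web_pairs, captioner, filt):
    """Return a cleaned corpus from noisy (image, web_text) pairs.

    captioner(image) -> a synthetic caption for the image.
    filt(image, text) -> True if the text is judged to match the image.
    """
    cleaned = []
    for image, web_text in web_pairs:
        if filt(image, web_text):          # keep web texts that pass the filter
            cleaned.append((image, web_text))
        synthetic = captioner(image)       # generate a synthetic caption
        if filt(image, synthetic):         # keep it only if it also passes
            cleaned.append((image, synthetic))
    return cleaned
```

The cleaned corpus (plus human-annotated pairs) is then used to pre-train a new model, which is how the method "bootstraps" its own training data.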
This paper shows a new way to pre-train a unified vision-language model from noisy image-text pairs. By combining the MED architecture with the three pre-training objectives, the model learns from noisy image-text pairs, generates high-quality captions, and handles a variety of image-text tasks. CapFilt is a simple but effective method to improve the quality of the text corpus over noisy image-alt-text pairs.
I wonder how this method applies to general unlabeled images, since in extreme cases the captions are generated solely by CapFilt. Will performance continue to improve? Going a few steps further, could image-only data serve as an alignment signal, with an image generation model integrated into the framework so that the model self-improves?