CSE5519 Advances in Computer Vision (Topic B: 2022: Vision-Language Models)
BLIP
Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BLIP is a unified vision-language pre-training framework that learns from noisy web image-text pairs.
Novelty in BLIP
MED
MED is a multimodal mixture of encoder-decoder architecture, which can operate in one of three modes:
- Unimodal encoder
- separately encodes image and text
- Image-grounded text encoder
- injects visual information by inserting one additional cross-attention layer into each transformer block of the text encoder
- Image-grounded text decoder
- replaces the bidirectional self-attention layers of the image-grounded text encoder with causal self-attention layers
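The encoder/decoder distinction above comes down to the self-attention mask: the image-grounded text encoder lets every token attend to every other token, while the decoder uses a causal mask so each token only sees its predecessors. A minimal sketch (my own illustration, not code from the paper):

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Build a (seq_len, seq_len) mask where 1 = may attend, 0 = blocked."""
    if causal:
        # Decoder mode: token i attends only to tokens 0..i (lower triangle).
        return np.tril(np.ones((seq_len, seq_len), dtype=int))
    # Encoder mode: bidirectional, every token attends to every token.
    return np.ones((seq_len, seq_len), dtype=int)

bidirectional = attention_mask(4, causal=False)
causal = attention_mask(4, causal=True)
```

Swapping this mask (and the task-specific tokens) is what lets the three modes share most of their parameters.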
Pre-training objectives
- Image-text contrastive loss
- aligns the feature spaces of the vision transformer and the text transformer
- Image-text matching loss
- learns a multimodal image-text representation that captures the fine-grained alignment between the image and text
- Language modeling loss
- generates textual descriptions given an image, optimizing a cross-entropy loss over the predicted text tokens
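To make the contrastive objective concrete, here is a minimal InfoNCE-style sketch of an image-text contrastive loss (my own simplification; BLIP additionally uses momentum encoders and soft labels, which are omitted here). Matched pairs sit on the diagonal of the similarity matrix:

```python
import numpy as np

def itc_loss(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of B image and B text embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B); entry (i, j) = sim(img_i, txt_j)
    labels = np.arange(len(logits))             # image i is matched with text i

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)    # stabilize softmax
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-log_probs[labels, labels].mean())

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Well-aligned batches produce a dominant diagonal and hence a low loss; misaligned pairs push probability mass off the diagonal and the loss grows.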
CapFilt
CapFilt is a method to improve the quality of the text corpus.
A captioner generates synthetic captions for web images.
A filter removes noisy image-text pairs, both original web texts and synthetic captions.
- Diversity is key for synthetic captions.
- The improvement is not simply due to longer training on more data.
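The CapFilt procedure can be sketched as a simple loop (the `captioner` and `filt` callables below are hypothetical stand-ins for the fine-tuned decoder and ITM-based filter, not the paper's API):

```python
def capfilt(web_pairs, captioner, filt):
    """Return a cleaned corpus from noisy (image, web_text) pairs.

    captioner(image) -> a synthetic caption for the image.
    filt(image, text) -> True if the text is judged to match the image.
    """
    cleaned = []
    for image, web_text in web_pairs:
        if filt(image, web_text):          # keep web texts that pass the filter
            cleaned.append((image, web_text))
        synthetic = captioner(image)       # generate a synthetic caption
        if filt(image, synthetic):         # keep it only if it also passes
            cleaned.append((image, synthetic))
    return cleaned
```

The cleaned corpus (plus human-annotated pairs) is then used to pre-train a new model, which is how the method "bootstraps" its own training data.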
This paper shows a new way to pre-train a unified vision-language model from noisy image-text pairs. By combining the MED architecture with the three pre-training objectives, the model learns from noisy image-text pairs, generates high-quality captions, and handles a variety of image-text tasks. CapFilt is a simple but effective method to improve the quality of the text corpus over noisy image-alt-text pairs.
I wonder how this method applies to general unlabeled images, since in extreme cases the captions are generated solely by CapFilt. Will performance continue to improve? Going a few steps further, could image-only data serve as an alignment signal, with an image generation model integrated into the framework so that the model self-improves?