CSE5519 Advances in Computer Vision (Topic B: 2021 and before: Vision-Language Models)
Learning Transferable Visual Models From Natural Language Supervision
By OpenAI. That’s sick…

Novelty in CLIP
CLIP (Contrastive Language-Image Pre-training) is a simplified version of ConVIRT trained on large-scale image-text pairs, using natural language supervision for image representation learning.
Uses more general image-caption pairs as supervision: 400 million (image, text) pairs collected from the internet.
No need to pre-train the model to fit ImageNet, yet it still predicts with high accuracy on ImageNet (zero-shot).
This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text.
Uses 5 versions of ResNet and 3 versions of ViT as the image encoder, plus a Transformer text encoder, and trains with a standard contrastive objective: a symmetric cross-entropy loss over image-text cosine similarities.
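A minimal sketch of this symmetric contrastive objective (my own PyTorch approximation, not the paper's exact implementation; function and variable names are assumptions):

```python
# Sketch of a CLIP-style symmetric contrastive loss (assumed implementation).
# image_features / text_features: encoder outputs for one batch of N aligned
# (image, text) pairs, shape (N, d).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # L2-normalize so dot products equal cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # N x N logits; entry (i, j) compares image i with caption j
    logits = image_features @ text_features.t() / temperature

    # The matching caption for image i sits on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: images over captions and captions over images
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```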
Prompt Engineering
Using the template “A photo of a {label}.” is a good default that helps specify that the text is about the content of the image. This often improves performance over the baseline of using only the label text; for instance, this prompt alone improves accuracy on ImageNet by 1.3%.
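A minimal zero-shot classification sketch using this prompt template, assuming OpenAI's open-source clip package (the label list and image path below are hypothetical):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["dog", "cat", "car"]  # hypothetical label set
prompts = [f"A photo of a {label}." for label in labels]

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    # Embed the image and the prompted label texts, then compare by cosine similarity
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```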
Limitations
The authors estimate around a 1000x increase in compute is required for zero-shot CLIP to reach overall state-of-the-art performance, which is infeasible to train with current hardware. Further research into improving the computational and data efficiency of CLIP will be necessary.
Zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution for it.
From how the general task that CLIP can solve is defined, and from the experimental comparison of Zero-Shot CLIP vs. a Linear Probe on ResNet-50, I can see that Zero-Shot CLIP performs better on tasks whose concepts humans frequently describe in captions, e.g., the car brand or the location where the image was taken. It performs badly when humans rarely caption the relevant property or when the idea is more abstract, e.g., the distance to the camera, the number of objects in the image, or a satellite image of terrain.
Is the CLIP model really learning enough knowledge from general natural-language descriptions of images? If the descriptions were more comprehensive, would CLIP outperform the Linear Probe on ResNet-50?
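For reference, a linear probe on ResNet-50 is typically set up by freezing an ImageNet-pretrained ResNet-50, extracting features, and fitting a logistic regression classifier on top. This is a sketch assuming torchvision and scikit-learn; the paper's exact probe hyperparameters may differ, and the data loaders are hypothetical:

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression

# Frozen feature extractor: ResNet-50 with its classification head removed
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for images, targets in loader:  # loader: a standard torch DataLoader
        feats.append(backbone(images))
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# train_loader / test_loader are hypothetical DataLoaders for the target task
# X_train, y_train = extract_features(train_loader)
# X_test, y_test = extract_features(test_loader)
# probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# print("linear-probe accuracy:", probe.score(X_test, y_test))
```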
New takeaways from the lecture
Flickr image label selection.
Visual Commonsense Reasoning dataset. (VisualBERT)
Loss based on cosine similarity between caption pairs.
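A tiny illustration of a cosine-similarity term between two embedding vectors (e.g., two caption embeddings); this is a generic sketch, not the exact loss discussed in lecture:

```python
import torch
import torch.nn.functional as F

a = torch.randn(1, 512)  # hypothetical embedding of caption A
b = torch.randn(1, 512)  # hypothetical embedding of caption B
similarity = F.cosine_similarity(a, b, dim=-1)  # value in [-1, 1]
print(similarity.item())
```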