CSE5519 Advances in Computer Vision (Topic B: 2025: Vision-Language Models)

Molmo and PixMo:

link to paper 

Novelty in Molmo and PixMo

PixMo dataset (712k images, each with a long description of 200+ words)

  • Simplified two-stage training pipeline
    • Standard architecture: a tokenizer, a ViT image encoder (CLIP), and pooled image embeddings passed to a decoder-only LLM.
  • Overlapping multi-crop policy
    • Split a large image into crops with overlapping border regions so that each crop fits the image encoder.
  • Training over multiple annotations
    • Text-only residual dropout
  • Optimizer setup
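
The overlapping multi-crop idea from the list above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `overlapping_crops`, the 336-pixel crop size, and the 56-pixel overlap are all assumptions chosen for the example, and the sketch omits any resizing or padding the real pipeline may apply before cropping.

```python
def overlapping_crops(width, height, crop=336, overlap=56):
    """Return (left, top, right, bottom) boxes that cover the image.

    Neighbouring boxes share at least `overlap` pixels along each axis.
    `crop` and `overlap` are illustrative values (assumptions), not
    settings taken from the paper.
    """
    if not 0 <= overlap < crop:
        raise ValueError("overlap must satisfy 0 <= overlap < crop")
    stride = crop - overlap

    def starts(size):
        # An axis that already fits in one crop needs a single window.
        if size <= crop:
            return [0]
        # Step by `stride`, then add a last window flush with the far edge
        # so that no pixels are dropped.
        positions = list(range(0, size - crop, stride))
        positions.append(size - crop)
        return positions

    return [
        (x, y, min(x + crop, width), min(y + crop, height))
        for y in starts(height)
        for x in starts(width)
    ]


if __name__ == "__main__":
    for box in overlapping_crops(616, 336):
        print(box)
```

For a 616×336 image with these assumed settings, the sketch yields two crops, (0, 0, 336, 336) and (280, 0, 616, 336), which share a 56-pixel-wide vertical strip. Each crop would then be encoded separately, and the overlap gives patches near a crop border some context from the neighbouring crop.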
Tip

This paper provides an interesting dataset and a refined training pipeline whose performance is comparable to that of current closed-source SOTA models. What is the paper's contribution from an algorithmic perspective? It seems to be mainly a test of a new dataset with a slightly altered training pipeline.
