CSE5519 Advances in Computer Vision (Topic J: 2023 - 2024: Open-Vocabulary Object Detection)

Grounding DINO

link to the paper 

Novelty in Grounding DINO

  • Use CLIP-style text features to enhance the features of a DETR-style detector (DINO)
  1. Contrastive loss for text-region alignment
  2. Localization loss: box regression (DINO style)
  3. Auxiliary loss across decoder layers
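
The three loss terms listed above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: it assumes predictions are already matched to ground truth, uses plain binary cross-entropy for the text-region alignment term, and uses L1 plus GIoU for box regression (the usual DETR-style choice). All function names and the weights `w_align` and `w_box` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def alignment_loss(query_feat, token_feats, target):
    """Text-region alignment: one logit per text token (query . token),
    scored with binary cross-entropy against a 0/1 target per token.
    Assumption: plain BCE is a simplification chosen for this sketch."""
    loss = 0.0
    for tok, t in zip(token_feats, target):
        p = 1.0 / (1.0 + math.exp(-dot(query_feat, tok)))
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss / len(token_feats)


def giou(a, b):
    """Generalized IoU for boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def box_loss(pred, gt):
    """Localization: L1 distance plus (1 - GIoU)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return l1 + (1.0 - giou(pred, gt))


def total_loss(layer_outputs, token_feats, target, gt_box, w_align=1.0, w_box=1.0):
    """Auxiliary loss: apply the same two terms to every decoder layer's
    output and sum them. layer_outputs is a list of (query_feat, pred_box)."""
    total = 0.0
    for query_feat, pred_box in layer_outputs:
        total += w_align * alignment_loss(query_feat, token_feats, target)
        total += w_box * box_loss(pred_box, gt_box)
    return total
```

Summing over decoder layers means every intermediate layer receives a direct training signal, not only the final one.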

The top 900 bounding boxes are used for inference.
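
A minimal sketch of how such a top-k selection could look. It assumes each candidate box carries one logit per text token and that a box's score is the maximum sigmoid over those logits; the function name `select_top_boxes` and the `box_threshold` parameter are hypothetical and only illustrate the idea.

```python
import math


def select_top_boxes(boxes, token_logits, k=900, box_threshold=0.0):
    """Return up to k (score, box) pairs, highest score first.

    Assumption: a box's score is the max over text tokens of
    sigmoid(logit); boxes scoring at or below box_threshold are dropped."""
    scored = []
    for box, logits in zip(boxes, token_logits):
        score = max(1.0 / (1.0 + math.exp(-l)) for l in logits)
        if score > box_threshold:
            scored.append((score, box))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Usage: three candidate boxes, keep the best two.
boxes = [(0, 0, 1, 1), (1, 1, 2, 2), (2, 2, 3, 3)]
logits = [[-2.0, 0.0], [3.0, -1.0], [-5.0, -4.0]]
kept = select_top_boxes(boxes, logits, k=2)
```

With k fixed at 900, most images will yield many low-scoring boxes, so in practice a score threshold decides how many detections are actually reported.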

Tip

This paper presents a novel approach to open-vocabulary object detection by marrying DINO with CLIP-style language grounding. The authors use a DINO model to produce the query features, then use a grounding head to predict the bounding box and class label for each query.

I’m really interested in the number of bounding boxes used for inference. How fine-grained are the bounding boxes? Could they serve as a good reference for counting problems and for logical reasoning, for example detecting a hand with six fingers?
