
CSE5519 Advances in Computer Vision (Topic J: 2021 and before: Open-Vocabulary Object Detection)

MDETR - Modulated Detection for End-to-End Multi-Modal Understanding

link to the paper 

MDETR uses a convolutional backbone to extract visual features, and a language model such as RoBERTa to extract text features. The features of both modalities are projected to a shared embedding space, concatenated and fed to a transformer encoder-decoder that predicts the bounding boxes of the objects and their grounding in text.
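Below is a minimal PyTorch-style sketch of that flow. The module names, dimensions, and the two-argument transformer call are illustrative assumptions; positional encodings, the matching procedure, and the grounding heads are omitted, so treat this as a rough outline rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision
from transformers import RobertaModel


class MDETRSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=100):
        super().__init__()
        # Convolutional backbone for visual features (ResNet-50 trunk without avgpool/fc).
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # -> (B, 2048, H/32, W/32)
        self.visual_proj = nn.Conv2d(2048, d_model, kernel_size=1)     # project to shared space

        # Pre-trained language model for text features, projected to the same space.
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, d_model)

        # Joint transformer encoder-decoder over the concatenated image+text sequence.
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)          # learned object queries
        self.bbox_head = nn.Linear(d_model, 4)                         # box regression head

    def forward(self, images, input_ids, attention_mask):
        # Visual tokens: flatten the spatial grid of backbone features.
        feat = self.visual_proj(self.backbone(images))                 # (B, d, h, w)
        vis_tokens = feat.flatten(2).transpose(1, 2)                   # (B, h*w, d)

        # Text tokens from RoBERTa.
        txt = self.text_encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        txt_tokens = self.text_proj(txt)                               # (B, T, d)

        # Concatenate both modalities and decode with the object queries.
        src = torch.cat([vis_tokens, txt_tokens], dim=1)               # (B, h*w + T, d)
        queries = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, queries)                            # (B, num_queries, d)
        return self.bbox_head(hs).sigmoid()                            # normalized box coordinates
```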

DETR

Our approach to modulated detection builds on the DETR system.
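As in DETR, the predictions from a fixed set of object queries are matched to the ground-truth boxes by bipartite (Hungarian) matching before any loss is computed. A simplified sketch of that matching step, assuming only an L1 box cost plus a class-probability cost (the actual matcher also weights in a generalized-IoU term):

```python
import torch
from scipy.optimize import linear_sum_assignment


def hungarian_match(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """pred_boxes: (Q, 4), pred_logits: (Q, C), gt_boxes: (N, 4), gt_labels: (N,)."""
    prob = pred_logits.softmax(-1)                       # (Q, C) class probabilities
    cost_class = -prob[:, gt_labels]                     # (Q, N): reward high prob of the true class
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # (Q, N): L1 distance between boxes
    cost = cost_bbox + cost_class                        # combined matching cost
    q_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return q_idx, gt_idx                                 # query q_idx[k] is assigned ground truth gt_idx[k]
```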

Novelty in MDETR

We present the two additional loss functions used by MDETR, which encourage alignment between the image and the text. Both use the same source of annotations: free-form text with aligned bounding boxes.

The first, which we term the soft token prediction loss, is a non-parametric alignment loss.
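Concretely, instead of predicting a class label, each matched object query predicts a distribution over text token positions (capped at a maximum length, with an extra "no object" slot), and the target spreads uniform mass over the tokens that refer to that object. A rough sketch under those assumptions, with hypothetical names and an assumed cap of 256 token positions:

```python
import torch
import torch.nn.functional as F

L_MAX = 256  # assumed maximum number of token positions


def soft_token_loss(token_logits, target_spans):
    """
    token_logits: (Q, L_MAX + 1) per-query logits over token positions; the last
                  slot stands for "no object".
    target_spans: list of length Q; entry i holds the token positions referring to
                  the object matched to query i, or [] if the query is unmatched.
    """
    Q = token_logits.size(0)
    target = torch.zeros(Q, L_MAX + 1, device=token_logits.device)
    for i, span in enumerate(target_spans):
        if span:
            target[i, span] = 1.0 / len(span)    # uniform mass over the referring tokens
        else:
            target[i, -1] = 1.0                  # unmatched query -> "no object" slot
    # Soft cross-entropy between the target distribution and the predicted one.
    log_prob = F.log_softmax(token_logits, dim=-1)
    return -(target * log_prob).sum(-1).mean()
```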

The second, termed the text-query contrastive alignment, is a parametric loss function enforcing similarity between aligned object queries and tokens.

While the soft token prediction loss uses positional information to align the objects to the text, the contrastive alignment loss enforces similarity between the embedded representation of each object at the output of the decoder and the corresponding text representation at the output of the cross encoder.
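A sketch of such a contrastive term between the decoder's object embeddings and the cross encoder's token embeddings. The symmetric InfoNCE-style form and the temperature value below are assumptions in the spirit of the description above, not the paper's exact formula:

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(obj_emb, tok_emb, pos_mask, tau=0.07):
    """
    obj_emb:  (Q, d) object query embeddings from the decoder output
    tok_emb:  (T, d) token embeddings from the cross encoder output
    pos_mask: (Q, T) boolean, True where query i should be aligned with token j
    """
    pos = pos_mask.float()
    obj = F.normalize(obj_emb, dim=-1)
    tok = F.normalize(tok_emb, dim=-1)
    logits = obj @ tok.t() / tau                                 # (Q, T) scaled similarities

    # For each object, pull its positive tokens up against all tokens ...
    obj_to_tok = -(F.log_softmax(logits, dim=1) * pos).sum(1) / pos.sum(1).clamp(min=1)
    # ... and symmetrically, for each token, against all objects.
    tok_to_obj = -(F.log_softmax(logits, dim=0) * pos).sum(0) / pos.sum(0).clamp(min=1)

    return (obj_to_tok.mean() + tok_to_obj.mean()) / 2
```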

During MDETR pre-training, the model is trained to detect all objects mentioned in the question. To extend it to question answering, we provide QA-specific queries, in addition to the object queries, as input to the transformer decoder. We use specialized heads for different question types.
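A small sketch of that extension: extra learned queries are appended to the object queries, decoded by the same transformer decoder, and their outputs are routed to per-question-type heads. The question types and head sizes below are made up for illustration; the paper uses dataset-specific sets of heads.

```python
import torch
import torch.nn as nn


class QAQueriesSketch(nn.Module):
    """Appends QA-specific queries to the object queries and applies a separate
    head per (illustrative) question type to the corresponding decoder outputs."""

    QA_TYPES = ["question_type", "binary", "count"]   # hypothetical question types

    def __init__(self, d_model=256):
        super().__init__()
        self.qa_queries = nn.Embedding(len(self.QA_TYPES), d_model)   # learned QA queries
        self.heads = nn.ModuleDict({
            "question_type": nn.Linear(d_model, len(self.QA_TYPES)),
            "binary":        nn.Linear(d_model, 2),    # yes / no
            "count":         nn.Linear(d_model, 16),   # e.g. answers 0..15
        })

    def extend_queries(self, obj_queries):
        # obj_queries: (B, Q, d) -> (B, Q + num_qa, d); the extra slots are decoded
        # by the same transformer decoder as the object queries.
        B = obj_queries.size(0)
        qa = self.qa_queries.weight.unsqueeze(0).expand(B, -1, -1)
        return torch.cat([obj_queries, qa], dim=1)

    def answer_logits(self, decoder_out):
        # decoder_out: (B, Q + num_qa, d); the last num_qa slots carry the QA outputs.
        qa_out = decoder_out[:, -len(self.QA_TYPES):, :]
        return {name: self.heads[name](qa_out[:, i]) for i, name in enumerate(self.QA_TYPES)}
```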

Tip

This paper really shows the power of the transformer architecture in object detection.

I was shocked by the model's in-context ability to figure out which object "what" refers to in the text query.

I wonder if it is possible to use this model in reverse to generate a concise and comprehensive description of an image. Maybe it could be combined with an image generation model, with one acting as generator and the other as discriminator, to capture the essence of the topology of the image. Would that train a better image generation model and a better transformer for object description?
