CSE5519 Advances in Computer Vision (Topic J: 2022: Open-Vocabulary Object Detection)

MViT is a proposed Multimodal Vision Transformer.

It achieves state-of-the-art performance on various downstream tasks, such as class-agnostic object detection.

Novelty in MViT

GPV: A unified architecture for multi-task learning. Trained on data from five different vision-language tasks.
MDETR: A modulated transformer trained to detect objects in an image conditioned on a text query.
MAVL: Multi-scale Attention Vision Transformer, which uses multi-scale spatial context to achieve efficient training.
- MSDA: Multi-scale Deformable Attention. Samples a small set of keys around a reference query image location.
- Late multi-modal fusion: Uses the spatial structure of an image to sparsely sample keys for each query point.
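
The sampling idea behind MSDA can be sketched as follows. This is a minimal, illustrative NumPy version under stated assumptions, not the paper's implementation: it handles one query and one attention head, and it omits the value and output projections a full layer would have. The names `deformable_attention`, `bilinear_sample`, `W_off`, and `W_attn` are hypothetical and chosen only for this example. Each query predicts K offsets per feature level, samples the feature map at those K points around its reference location, and combines the samples with softmax weights, so the cost depends on the number of sampled points rather than on the image size.

```python
import numpy as np


def bilinear_sample(feat, x, y):
    """Bilinearly sample feat (H, W, C) at continuous pixel coords (x, y)."""
    H, W, _ = feat.shape
    # Clamp to the valid range so out-of-bounds offsets hit the border.
    x = float(np.clip(x, 0, W - 1))
    y = float(np.clip(y, 0, H - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def deformable_attention(query, ref_xy, feats, W_off, W_attn):
    """Simplified single-head multi-scale deformable attention for one query.

    query:  (C,) query embedding
    ref_xy: reference point (x, y), normalized to [0, 1]
    feats:  list of L feature maps, each of shape (H_l, W_l, C)
    W_off:  (C, L*K*2) hypothetical projection predicting sampling offsets
    W_attn: (C, L*K)   hypothetical projection predicting attention logits
    """
    L = len(feats)
    K = W_attn.shape[1] // L
    offsets = (query @ W_off).reshape(L, K, 2)
    # Weights are normalized jointly over all levels and sampling points.
    weights = softmax(query @ W_attn).reshape(L, K)
    out = np.zeros(feats[0].shape[-1])
    for l, feat in enumerate(feats):
        H, W, _ = feat.shape
        # Map the normalized reference point to this level's pixel grid.
        cx, cy = ref_xy[0] * (W - 1), ref_xy[1] * (H - 1)
        for k in range(K):
            dx, dy = offsets[l, k]
            out += weights[l, k] * bilinear_sample(feat, cx + dx, cy + dy)
    return out


# Example: two feature levels, three sampling points per level.
rng = np.random.default_rng(0)
C, L, K = 4, 2, 3
feats = [rng.normal(size=(8, 8, C)), rng.normal(size=(4, 4, C))]
query = rng.normal(size=C)
out = deformable_attention(
    query, (0.5, 0.5), feats,
    rng.normal(size=(C, L * K * 2)), rng.normal(size=(C, L * K)),
)
print(out.shape)
```

In this sketch each query touches only L*K locations instead of every pixel at every scale, which is the source of the efficiency the notes attribute to MSDA.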

The model has strong generalization ability: it can detect objects that occur only a few times in the training datasets (lynx, humidifier, and armadillo). However, it does not generalize to medical imaging.

The model has enhanced interactability: it can comprehend phrases such as “all objects” or “long objects” in the query and select the matching objects.

Tip

This is an interesting paper that provides a comprehensive framework for multimodal vision-language models. It uses different components that specialize in each task to achieve SOTA performance and proposes multi-scale deformable attention to speed up the training process. The final model has strong generalization ability and impressive interactability in understanding abstract concepts such as “all” and “tall”.

I wonder where MViT's understanding of abstract natural-language concepts comes from. Does the model learn correlations, such as “tall” with a man or “all” with a large bounding box, or does it perform some form of logical reasoning? Suppose we give the model plush toys of a monkey and a whale that are about the same size, preferably with the monkey slightly larger. If we ask the model for “large objects”, will it select the monkey plush or the whale plush?

Class-agnostic Object Detection with Multi-modal Transformer
