CSE5519 Advances in Computer Vision (Topic A: 2022: Semantic Segmentation)

Masked-Attention Mask Transformer for Universal Image Segmentation

Definitions in Semantic Segmentation

Semantic Segmentation: Classify each pixel into a semantic class (return one class label per pixel). Instances of the same class are not distinguished.
Panoptic Segmentation: Assign each pixel both a semantic class and an instance ID, so every pixel is labeled and individual objects are separated.
Instance Segmentation: Detect each object instance and return a binary mask with a single class label for each instance.
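To make the three output formats concrete, here is a toy sketch in NumPy. The image, the class IDs (0 = sky, 1 = car), and all variable names are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Toy 3x4 image with two classes: 0 = sky ("stuff"), 1 = car ("thing").
# Semantic output: one class label per pixel.
semantic = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 1],
])

# Instance IDs: 0 means "no instance" (stuff); the two cars are numbered 1 and 2.
instance_id = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 2],
    [1, 1, 0, 2],
])

# Panoptic output: every pixel carries a (class, instance ID) pair.
panoptic = np.stack([semantic, instance_id], axis=-1)  # shape (3, 4, 2)

# Instance output: one binary mask plus one class label per object.
instances = [
    {"class": int(semantic[instance_id == i][0]), "mask": instance_id == i}
    for i in np.unique(instance_id)
    if i != 0
]
```

Note that the semantic map alone cannot tell the two cars apart, while the instance output says nothing about the sky pixels. The panoptic map covers both.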

Novelty in Masked-Attention Mask Transformer

The authors propose Mask2Former, a universal architecture that handles panoptic, instance, and semantic segmentation with the same model design.

Masked-Attention in the Transformer

Masked attention is a variant of cross-attention in which each query attends only within the foreground region of the mask predicted for it by the previous decoder layer. This accelerates convergence, under the assumption that local features inside the region are sufficient to update the query features, while context information can be gathered through self-attention.
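A minimal NumPy sketch of the idea follows. It is a simplification, not the authors' implementation: it uses a single head, omits the learned query/key/value projections, and the function name, the 0.5 threshold default, and the fallback to full attention for an empty mask are assumptions made for this illustration.

```python
import numpy as np

def masked_cross_attention(queries, keys, values, prev_masks, threshold=0.5):
    """Single-head masked cross-attention (illustrative sketch).

    queries:    (N, C)  one feature vector per object query
    keys:       (HW, C) flattened image features
    values:     (HW, C) flattened image features
    prev_masks: (N, HW) mask probabilities in [0, 1] from the previous layer
    """
    scale = 1.0 / np.sqrt(queries.shape[-1])
    logits = (queries @ keys.T) * scale              # (N, HW)

    # Each query may only look at pixels inside its own predicted foreground.
    foreground = prev_masks >= threshold             # (N, HW) booleans
    # Assumption: a query with an empty mask falls back to full attention,
    # otherwise its softmax row would be all -inf.
    empty = ~foreground.any(axis=1, keepdims=True)
    foreground = foreground | empty

    logits = np.where(foreground, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Residual update of the query features.
    return queries + weights @ values, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
kv = rng.normal(size=(6, 4))
masks = np.array([
    [0.9, 0.8, 0.1, 0.1, 0.1, 0.1],   # query 0: foreground is pixels 0 and 1
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],   # query 1: empty mask, attends everywhere
])
updated, attn = masked_cross_attention(q, kv, kv, masks)
```

Here `attn[0]` is zero outside the first two pixels, while `attn[1]` spreads over all six because its mask was empty.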

Multi-scale high-resolution feature

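Use the feature pyramid produced by the pixel decoder, which contains features at several resolutions of the original image. A positional embedding and a learnable scale-level embedding are added to each resolution, and the resolutions are fed to successive transformer decoder layers one at a time.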
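The sketch below illustrates one way this can be set up. The strides (32, 16, 8), the channel count, the sinusoidal form of the positional embedding, and the helper names are assumptions chosen for the example. The level embedding is random here but would be a learned parameter in a real model.

```python
import numpy as np

def sine_position_embedding(height, width, channels, temperature=10000.0):
    """2D sinusoidal embedding of shape (height*width, channels).

    Assumes channels is divisible by 4 (sin/cos for y and for x).
    """
    quarter = channels // 4
    freqs = temperature ** (-np.arange(quarter) / quarter)       # (quarter,)
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    y_ang = ys.reshape(-1, 1) * freqs                            # (HW, quarter)
    x_ang = xs.reshape(-1, 1) * freqs
    return np.concatenate(
        [np.sin(y_ang), np.cos(y_ang), np.sin(x_ang), np.cos(x_ang)], axis=1
    )

def prepare_levels(features, level_embed):
    """Flatten each pyramid level and add position + scale-level embeddings.

    features:    list of (H_i, W_i, C) arrays, coarsest level first
    level_embed: (num_levels, C) array, one vector per level
    """
    tokens = []
    for i, feat in enumerate(features):
        h, w, c = feat.shape
        flat = feat.reshape(h * w, c)
        tokens.append(flat + sine_position_embedding(h, w, c) + level_embed[i])
    return tokens

def level_for_layer(layer_index, num_levels):
    """Round-robin schedule: decoder layer l reads pyramid level l mod num_levels."""
    return layer_index % num_levels

rng = np.random.default_rng(0)
channels = 8
# A 64x64 image at strides 32, 16, 8 gives 2x2, 4x4, and 8x8 feature maps.
pyramid = [rng.normal(size=(64 // s, 64 // s, channels)) for s in (32, 16, 8)]
level_embed = rng.normal(size=(len(pyramid), channels))
tokens = prepare_levels(pyramid, level_embed)
schedule = [level_for_layer(l, len(pyramid)) for l in range(6)]
```

With three levels, `schedule` cycles coarse to fine and then repeats, so each decoder layer only attends over one resolution instead of the concatenation of all of them.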

Improvements on the training and inference

Dropout is not necessary and usually decreases performance.

Compute the mask loss on a set of points chosen by importance sampling rather than on the full mask, which reduces training memory usage.
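As a rough sketch of how a point-sampled mask loss can work: oversample candidate pixels uniformly, keep the most uncertain ones (logit closest to zero), fill the remainder uniformly, and evaluate the loss only at those pixels. The function names, the default ratios, and the use of integer pixel indices (instead of interpolated sub-pixel points) are assumptions made for this illustration.

```python
import numpy as np

def sample_point_indices(mask_logits, num_points, oversample_ratio=3.0,
                         importance_ratio=0.75, rng=None):
    """Pick flat pixel indices, favouring uncertain pixels (logit near 0)."""
    rng = np.random.default_rng() if rng is None else rng
    flat = mask_logits.reshape(-1)

    # 1) Oversample candidate pixels uniformly at random.
    num_candidates = int(num_points * oversample_ratio)
    candidates = rng.integers(0, flat.size, size=num_candidates)

    # 2) Keep the candidates whose logits are closest to zero.
    num_uncertain = int(importance_ratio * num_points)
    order = np.argsort(np.abs(flat[candidates]))
    uncertain = candidates[order[:num_uncertain]]

    # 3) Fill the remaining slots with uniformly random pixels.
    rest = rng.integers(0, flat.size, size=num_points - num_uncertain)
    return np.concatenate([uncertain, rest])

def point_bce_loss(mask_logits, target_mask, indices):
    """Binary cross-entropy evaluated only at the sampled pixels."""
    x = mask_logits.reshape(-1)[indices]
    t = target_mask.reshape(-1)[indices].astype(float)
    # Numerically stable BCE with logits.
    loss = np.maximum(x, 0.0) - x * t + np.log1p(np.exp(-np.abs(x)))
    return loss.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(32, 32))       # predicted mask logits
target = logits > 0                      # toy target that agrees with the prediction
idx = sample_point_indices(logits, num_points=64, rng=rng)
loss = point_bce_loss(logits, target, idx)
```

Only 64 of the 1024 pixels contribute to the loss here, so the memory needed for the loss computation scales with the number of sampled points rather than with the mask resolution.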

Tip

Compared with previous works, this paper shows the potential of a universal architecture for image segmentation (panoptic, instance, and semantic) by replacing the standard cross-attention in the transformer decoder with masked attention.

However, compared with other transformer-based works, this paper does not demonstrate that the model generalizes across datasets: additional training is required to adapt the model to each new dataset. Is this due to the lack of a sufficiently general dataset? If we increase the variance of the training data, will the model generalize better? Will its performance on specialized datasets degrade compared with a model trained on a single dataset?