CSE5519 Advances in Computer Vision (Topic A: 2021 and before: Semantic Segmentation)
SETR
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Treating semantic segmentation as a sequence-to-sequence prediction task.
Novelty in SETR
FCN-based semantic segmentation
An FCN encoder consists of a stack of sequentially connected convolutional layers.
Because the receptive field grows only linearly with layer depth, limited receptive fields for context modeling are thus an intrinsic limitation of the vanilla FCN architecture.
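To make the receptive-field argument concrete, here is a minimal sketch (my own, not from the paper) that computes the effective receptive field of a stack of convolutions using the standard recurrence; it shows that stride-1 3x3 layers grow the field only linearly with depth:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input to output.
    Uses the standard recurrence r_out = r_in + (k - 1) * j_in,
    j_out = j_in * s, where r is the receptive field and j the
    cumulative stride ("jump") of the feature map."""
    r, j = 1, 1
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# Ten stacked 3x3, stride-1 conv layers only see a 21x21 window
# per output pixel: linear growth in depth.
print(receptive_field([(3, 1)] * 10))  # -> 21
```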
Segmentation transformers (SETR)
Image to sequence.
By further mapping each vectorized patch $p$ into a latent $C$-dimensional embedding space using a linear projection function $f: p \rightarrow e \in \mathbb{R}^C$, we obtain a 1D sequence of patch embeddings for an image $x$. To encode the patch spatial information, we learn a specific embedding $p_i$ for every location $i$, which is added to $e_i$ to form the final sequence input $E = \{e_1 + p_1, e_2 + p_2, \ldots, e_L + p_L\}$. This way, spatial information is kept despite the orderless self-attention nature of transformers.
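As a concrete illustration, here is a minimal PyTorch sketch of this image-to-sequence step (my own rendering under assumed hyperparameters, not the authors' code): a 480x480 input, 16x16 patches, and embedding dimension C = 1024, giving a sequence of L = (480/16)^2 = 900 tokens.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=480, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection f: each flattened patch -> a C-dim embedding.
        # A strided conv is the standard equivalent of unfold + nn.Linear.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learned positional embedding p_i, one per patch location i.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                  # x: (B, 3, H, W)
        e = self.proj(x)                   # (B, C, H/16, W/16)
        e = e.flatten(2).transpose(1, 2)   # (B, L, C) sequence of embeddings
        return e + self.pos_embed          # E = {e_i + p_i}

tokens = PatchEmbedding()(torch.randn(1, 3, 480, 480))
print(tokens.shape)  # torch.Size([1, 900, 1024])
```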
Decoder for segmentation
Three choices:
Naive upsampling (SETR-Naive):
- Reshape the sequence back into a 2D feature map and project it to the number of classes with 1x1 convolutions.
- Then bilinearly upsample the logits to the original image size (see the sketch after this list).
Progressive upsampling (SETR-PUP): alternate convolutions with 2x upsampling steps to reach the final segmentation map.
Multi-level feature aggregation (SETR-MLA): fuse features from multiple transformer layers to get the final segmentation map.
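For the naive head, here is a minimal PyTorch sketch (my own simplification of SETR-Naive, reusing the shapes assumed in the embedding sketch above: L = 900 tokens, C = 1024, 16x16 patches, 480x480 images, and an assumed 19 classes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveDecoder(nn.Module):
    def __init__(self, embed_dim=1024, num_classes=19, grid=30, scale=16):
        super().__init__()
        self.grid, self.scale = grid, scale
        # Project the C-dim features to per-class logits with 1x1 convs.
        self.head = nn.Sequential(
            nn.Conv2d(embed_dim, 256, kernel_size=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, tokens):                       # tokens: (B, L, C)
        B, L, C = tokens.shape
        # Reshape the 1D token sequence back into a 2D feature map.
        z = tokens.transpose(1, 2).reshape(B, C, self.grid, self.grid)
        logits = self.head(z)                        # (B, classes, 30, 30)
        # One-step bilinear upsampling back to full image resolution.
        return F.interpolate(logits, scale_factor=self.scale,
                             mode='bilinear', align_corners=False)

out = NaiveDecoder()(torch.randn(1, 900, 1024))
print(out.shape)  # torch.Size([1, 19, 480, 480])
```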
This paper shows a remarkable success of transformers in semantic segmentation. The authors split the image into fixed-size patches, map each patch to an embedding with a linear projection, feed the resulting sequence through a transformer encoder, and decode the encoder output into the final segmentation map.
I’m really interested in the linear projection function $f$. How does it preserve spatial information across the patches? What would happen if the square patch windows overlapped on the image? And how does the transformer encoder handle occlusion, or is that out of the paper's scope?
New takeaways from the lecture
DeepLabv3+
Atrous (dilated) convolutions: enlarge the receptive field without adding parameters.
Depthwise separable convolutions: a depthwise convolution followed by a 1x1 pointwise convolution.
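A minimal sketch contrasting the two building blocks (my own, not from the DeepLabv3+ code; channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Atrous (dilated) 3x3 conv: dilation=2 samples a 5x5 neighborhood with
# only 3x3 weights, enlarging the receptive field at no extra parameter cost.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

# Depthwise separable conv: a per-channel (depthwise) 3x3 conv followed by
# a 1x1 pointwise conv, factoring a standard convolution into two cheaper ones.
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1),                       # pointwise
)

x = torch.randn(1, 64, 56, 56)
print(atrous(x).shape)     # torch.Size([1, 64, 56, 56])
print(separable(x).shape)  # torch.Size([1, 128, 56, 56])
```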
SETR
Learned positional embeddings.