CSE5519 Advances in Computer Vision (Topic A: 2021 and before: Semantic Segmentation)
SETR
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Treating semantic segmentation as a sequence-to-sequence prediction task.
Novelty in SETR
FCN-based semantic segmentation
An FCN encoder consists of a stack of sequentially connected convolutional layers.
Because the receptive field grows only linearly with layer depth, limited receptive fields for context modeling are thus an intrinsic limitation of the vanilla FCN architecture.
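To make the receptive-field argument concrete, here is a minimal sketch (my own, not from the paper) that computes the effective receptive field of a stack of convolutions using the standard recurrence; it shows that stride-1 3x3 layers grow the field only linearly with depth:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input to output.
    Uses the standard recurrence r_out = r_in + (k - 1) * j_in,
    j_out = j_in * s, where r is the receptive field and j the
    cumulative stride ("jump") of the feature map."""
    r, j = 1, 1
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# Ten stacked 3x3, stride-1 conv layers only see a 21x21 window
# per output pixel: linear growth in depth.
print(receptive_field([(3, 1)] * 10))  # -> 21
```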
Segmentation transformers (SETR)
Image to sequence.
By further mapping each vectorized patch $p$ into a latent $C$-dimensional embedding space using a linear projection function $f: p \rightarrow e \in \mathbb{R}^C$, we obtain a 1D sequence of patch embeddings for an image $x$. To encode the patch spatial information, we learn a specific embedding $p_i$ for every location $i$, which is added to $e_i$ to form the final sequence input $E = \{e_1 + p_1, e_2 + p_2, \ldots, e_L + p_L\}$. This way, spatial information is kept despite the orderless self-attention nature of transformers.
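As a concrete illustration, here is a minimal PyTorch sketch of this image-to-sequence step (my own rendering under assumed hyperparameters, not the authors' code): a 480x480 input, 16x16 patches, and embedding dimension C = 1024, giving a sequence of L = (480/16)^2 = 900 tokens.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=480, patch_size=16, in_chans=3, embed_dim=1024):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection f: each flattened patch -> a C-dim embedding.
        # A strided conv is the standard equivalent of unfold + nn.Linear.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learned positional embedding p_i, one per patch location i.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                  # x: (B, 3, H, W)
        e = self.proj(x)                   # (B, C, H/16, W/16)
        e = e.flatten(2).transpose(1, 2)   # (B, L, C) sequence of embeddings
        return e + self.pos_embed          # E = {e_i + p_i}

tokens = PatchEmbedding()(torch.randn(1, 3, 480, 480))
print(tokens.shape)  # torch.Size([1, 900, 1024])
```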
Decoder for segmentation
Three choices:
Naive upsampling (SETR-Naive):
- Reshape the sequence back into a 2D feature map and project it to the number of classes with 1x1 convolutions.
- Then bilinearly upsample the logits to the original image size (see the sketch after this list).
Progressive upsampling (SETR-PUP): alternate convolutions with 2x upsampling steps to reach the final segmentation map.
Multi-level feature aggregation (SETR-MLA): fuse features from multiple transformer layers to get the final segmentation map.
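For the naive head, here is a minimal PyTorch sketch (my own simplification of SETR-Naive, reusing the shapes assumed in the embedding sketch above: L = 900 tokens, C = 1024, 16x16 patches, 480x480 images, and an assumed 19 classes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveDecoder(nn.Module):
    def __init__(self, embed_dim=1024, num_classes=19, grid=30, scale=16):
        super().__init__()
        self.grid, self.scale = grid, scale
        # Project the C-dim features to per-class logits with 1x1 convs.
        self.head = nn.Sequential(
            nn.Conv2d(embed_dim, 256, kernel_size=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),
        )

    def forward(self, tokens):                       # tokens: (B, L, C)
        B, L, C = tokens.shape
        # Reshape the 1D token sequence back into a 2D feature map.
        z = tokens.transpose(1, 2).reshape(B, C, self.grid, self.grid)
        logits = self.head(z)                        # (B, classes, 30, 30)
        # One-step bilinear upsampling back to full image resolution.
        return F.interpolate(logits, scale_factor=self.scale,
                             mode='bilinear', align_corners=False)

out = NaiveDecoder()(torch.randn(1, 900, 1024))
print(out.shape)  # torch.Size([1, 19, 480, 480])
```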
This paper shows a remarkable success of transformers in semantic segmentation. The authors split the image into fixed-size patches, map each patch to an embedding with a linear projection, feed the resulting sequence through a transformer encoder, and decode the encoder output into the final segmentation map.
I’m really interested in the linear projection function $f$. How does it preserve spatial information across the patches? What would happen if the square patch windows overlapped on the image? And how does the transformer encoder handle occlusion, or is that out of the paper's scope?
New takeaways from the lecture
DeepLabv3+
Atrous (dilated) convolutions: enlarge the receptive field without adding parameters.
Depthwise separable convolutions: a depthwise convolution followed by a 1x1 pointwise convolution.
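A minimal sketch contrasting the two building blocks (my own, not from the DeepLabv3+ code; channel counts are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Atrous (dilated) 3x3 conv: dilation=2 samples a 5x5 neighborhood with
# only 3x3 weights, enlarging the receptive field at no extra parameter cost.
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

# Depthwise separable conv: a per-channel (depthwise) 3x3 conv followed by
# a 1x1 pointwise conv, factoring a standard convolution into two cheaper ones.
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1),                       # pointwise
)

x = torch.randn(1, 64, 56, 56)
print(atrous(x).shape)     # torch.Size([1, 64, 56, 56])
print(separable(x).shape)  # torch.Size([1, 128, 56, 56])
```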
SETR
Learned positional embeddings.