
CSE559A Lecture 13

Positional Encodings

Fixed Positional Encodings

Set of sinusoids of different frequencies.

$$f(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d}}\right) \qquad f(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d}}\right)$$

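A minimal PyTorch sketch of building this encoding table from the formula above (sizes and the function name are illustrative, not from the lecture):

```python
import torch

def sinusoidal_encoding(num_positions: int, d: int) -> torch.Tensor:
    """Return a (num_positions, d) table of fixed sinusoidal positional encodings."""
    p = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # positions, shape (P, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)                 # even dimension indices 2i
    denom = 10000 ** (two_i / d)                                       # 10000^(2i/d)
    enc = torch.zeros(num_positions, d)
    enc[:, 0::2] = torch.sin(p / denom)   # f(p, 2i)   = sin(p / 10000^(2i/d))
    enc[:, 1::2] = torch.cos(p / denom)   # f(p, 2i+1) = cos(p / 10000^(2i/d))
    return enc
```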

Positional Encodings in Reconstruction

An MLP has difficulty learning high-frequency information from scalar inputs $(x, y)$.

Example: a network mapping $(x, y)$ to $(r, g, b)$.
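A rough sketch of the usual workaround: encode the scalar coordinates with sinusoids at several frequencies before the MLP. The class name FourierFeatures, the frequency schedule, and the layer sizes here are illustrative choices, not from the lecture.

```python
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Map a coordinate pair (x, y) to sin/cos features at several frequencies."""
    def __init__(self, num_freqs: int = 10):
        super().__init__()
        # Frequencies pi * 2^0, 2^1, ..., 2^(num_freqs - 1); an illustrative choice.
        self.register_buffer("freqs", torch.pi * 2.0 ** torch.arange(num_freqs))

    def forward(self, xy: torch.Tensor) -> torch.Tensor:       # xy: (N, 2)
        angles = xy.unsqueeze(-1) * self.freqs                  # (N, 2, F)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1) # (N, 2, 2F)
        return feats.flatten(1)                                 # (N, 4F)

# Small MLP from encoded coordinates to RGB.
model = nn.Sequential(FourierFeatures(10), nn.Linear(40, 256), nn.ReLU(),
                      nn.Linear(256, 3), nn.Sigmoid())
```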

Generalized Positional Encodings

  • Dependence on location, scale, metadata, etc.
  • Can just be fully learned (use nn.Embedding and optimize based on a categorical position index); see the sketch below.
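A minimal sketch of a fully learned positional embedding, assuming 196 patch tokens plus a class token and a 768-dimensional model (illustrative sizes):

```python
import torch
import torch.nn as nn

num_positions, d = 197, 768                  # e.g. 196 patch tokens + 1 class token

# One learned vector per position index, trained jointly with the rest of the model.
pos_embed = nn.Embedding(num_positions, d)

tokens = torch.randn(8, num_positions, d)    # (batch, sequence, dim)
positions = torch.arange(num_positions)      # categorical position index
tokens = tokens + pos_embed(positions)       # broadcast over the batch
```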

Vision Transformer (ViT)

Class Token

In Vision Transformers, a special token called the class token is added to the input sequence to aggregate information for classification tasks.
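A minimal sketch of prepending such a token, assuming 196 patch tokens of width 768 (illustrative sizes):

```python
import torch
import torch.nn as nn

batch, num_patches, d = 8, 196, 768
patch_tokens = torch.randn(batch, num_patches, d)

# A single learned vector (nn.Parameter), shared across images, prepended to every sequence.
cls_token = nn.Parameter(torch.zeros(1, 1, d))
tokens = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)  # (batch, 197, d)

# After the transformer encoder, the output at position 0 feeds the classification head.
```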

Hidden CNN Modules

  • P×P convolution with stride P (splits the image into patches, to which positional encodings are added); see the sketch below.
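A minimal sketch of this patch-embedding convolution, assuming 16×16 patches, 224×224 RGB inputs, and a 768-dimensional token width (illustrative values):

```python
import torch
import torch.nn as nn

P, d = 16, 768
patchify = nn.Conv2d(3, d, kernel_size=P, stride=P)   # P×P convolution with stride P

images = torch.randn(8, 3, 224, 224)
patches = patchify(images)                             # (8, d, 14, 14): one vector per patch
tokens = patches.flatten(2).transpose(1, 2)            # (8, 196, d): a sequence of patch tokens
```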

ViT + ResNet Hybrid

Build a hybrid model that runs the vision transformer on feature maps from a ResNet-50 backbone.
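A rough sketch of such a hybrid, assuming torchvision's resnet50 as the backbone and a 1×1 convolution to match the token width; the stage at which the feature map is taken is an illustrative choice.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Use the ResNet-50 convolutional stages as the feature extractor (drop pooling and fc head).
backbone = nn.Sequential(*list(resnet50().children())[:-2])

d = 768
project = nn.Conv2d(2048, d, kernel_size=1)            # map ResNet channels to token dimension

images = torch.randn(8, 3, 224, 224)
feats = backbone(images)                                # (8, 2048, 7, 7)
tokens = project(feats).flatten(2).transpose(1, 2)      # (8, 49, d), fed to the transformer
```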

Moving Forward

At least for now, CNN and ViT architectures achieve similar performance, at least on ImageNet.

  • General consensus: once an architecture is big enough and not designed terribly, it can do well.
  • Differences remain:
    • Computational efficiency
    • Ease of use in other tasks and with other input data
    • Ease of training

Wrap up

  • Self-attention as a key building block
  • Flexible input specification using tokens with positional encodings
  • A wide variety of architectural styles

Up Next: Training deep neural networks
