CSE5519 Advances in Computer Vision (Topic F: 2022: Representation Learning)
Masked Autoencoders Are Scalable Vision Learners
Novelty in MAE
Masked Autoencoders
Masked autoencoders are autoencoders trained to reconstruct the original input from a partially masked version of it. MAE reaches its best performance by masking out 75% of the input image patches (see the masking sketch below).
Even with a lightweight decoder of a single Transformer block, a masked autoencoder can perform strongly after fine-tuning.
Because the encoder processes only the visible (unmasked) patches, this design speeds up pre-training by a factor of 3-4.
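A minimal sketch of MAE-style random patch masking, assuming PyTorch and a ViT-style patch embedding. The function name and shapes are illustrative placeholders, not the authors' released code.

import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patches; return kept patches, mask, and shuffle indices.

    patches: (batch, num_patches, embed_dim)
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Per-sample uniform noise decides which patches are kept (uniform sampling).
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = torch.argsort(noise, dim=1)   # ascending: smallest noise = kept
    ids_keep = ids_shuffle[:, :num_keep]

    # Gather only the visible patches; the encoder never sees the masked ones,
    # which is where the pre-training speedup comes from.
    visible = torch.gather(
        patches, dim=1, index=ids_keep.unsqueeze(-1).expand(-1, -1, D)
    )

    # Binary mask in the original patch order (1 = masked), used later by the loss.
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_shuffle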
This paper presents a new way to pre-train vision models using masked autoencoders. The authors mask out 75% of the input patches and train the model to reconstruct the original image, motivated by the insight that images carry far more spatial redundancy than text, so a Transformer can recover heavily masked content from the remaining context.
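A hedged sketch of the reconstruction objective: mean-squared error computed only on the masked patches. It assumes pred and target are per-patch pixel values of shape (batch, num_patches, patch_dim) and mask comes from random_masking above (1 = masked); the helper name is illustrative.

import torch

def masked_mse_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor):
    per_patch = ((pred - target) ** 2).mean(dim=-1)  # (B, N): loss per patch
    return (per_patch * mask).sum() / mask.sum()     # average over masked patches only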
Currently, the sampling method is simple uniform random masking, yet it already gives surprisingly strong results. I wonder whether a smarter sampling method, for example weighting the masking of each patch by its information entropy, would yield better results (a speculative sketch follows below).
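A purely speculative sketch of the question above: score each patch by a simple information measure (here, grayscale histogram entropy) that could bias the masking noise instead of using it uniformly. This is not in the MAE paper; the function name, the entropy heuristic, and the weighting idea are assumptions for illustration only.

import torch

def patch_entropy(images: torch.Tensor, patch_size: int = 16, bins: int = 32):
    """Approximate per-patch entropy from grayscale histograms.

    images: (batch, 3, H, W) with values in [0, 1]; returns (batch, num_patches).
    """
    gray = images.mean(dim=1)                 # (B, H, W) crude grayscale
    B, H, W = gray.shape
    ph, pw = H // patch_size, W // patch_size
    patches = gray.reshape(B, ph, patch_size, pw, patch_size)
    patches = patches.permute(0, 1, 3, 2, 4).reshape(B, ph * pw, -1)

    entropies = torch.zeros(B, ph * pw)
    for b in range(B):
        for p in range(ph * pw):
            hist = torch.histc(patches[b, p], bins=bins, min=0.0, max=1.0)
            prob = hist / hist.sum()
            prob = prob[prob > 0]
            entropies[b, p] = -(prob * prob.log()).sum()
    return entropies

# One hypothetical use: bias the noise in random_masking so high-entropy patches
# are more (or less) likely to be kept, e.g.
#   noise = torch.rand_like(entropy) - temperature * entropy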