CSE5519 Advances in Computer Vision (Topic F: 2023: Representation Learning)
Self-supervised learning from images with a joint-embedding predictive architecture
Novelty in Joint-Embedding Predictive Architecture
- Sample target blocks at a sufficiently large scale
- Use a sufficiently informative (spatially distributed) context block to predict the target blocks (a sketch of the sampling is given after this list)
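As a concrete illustration, here is a minimal Python sketch (not the authors' code) of the multi-block masking idea: several fairly large target blocks are sampled on the ViT patch grid, and a large context block is sampled with all target patches removed. The grid size, scale ranges, and aspect ratios are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import math
import random

def sample_block(grid_h, grid_w, scale_range, aspect_range):
    """Return the (row, col) patch indices covered by one randomly placed block."""
    scale = random.uniform(*scale_range)    # fraction of the patch grid to cover
    aspect = random.uniform(*aspect_range)  # height/width ratio of the block
    num_patches = scale * grid_h * grid_w
    h = max(1, min(grid_h, round(math.sqrt(num_patches * aspect))))
    w = max(1, min(grid_w, round(math.sqrt(num_patches / aspect))))
    top = random.randint(0, grid_h - h)
    left = random.randint(0, grid_w - w)
    return {(r, c) for r in range(top, top + h) for c in range(left, left + w)}

def sample_masks(grid_h=14, grid_w=14, num_targets=4):
    # Several moderately large target blocks with varied aspect ratios.
    targets = [sample_block(grid_h, grid_w, (0.15, 0.2), (0.75, 1.5))
               for _ in range(num_targets)]
    # One large, spatially distributed context block, with all target patches
    # removed so that predicting the targets is not trivial.
    context = sample_block(grid_h, grid_w, (0.85, 1.0), (1.0, 1.0))
    context -= set().union(*targets)
    return context, targets

context, targets = sample_masks()
print(len(context), [len(t) for t in targets])
```

Because the target patches are removed from the context, the predictor has to infer each target block's content from the surrounding region rather than copy it.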
The method is built on Vision Transformers (ViT).
A motivating idea: representation learning in biological systems is the adaptation of an internal model that predicts responses to sensory inputs.
The I-JEPA model predicts the missing information in an abstract latent space.
A ViT predictor is used to predict the representations of the different target blocks from the context representation, similar to masked autoencoders (MAE). However, the prediction is made on the abstract representation of each target block, with the targets produced by a separate target encoder rather than a pixel decoder.
(Recall that in MAE, the prediction is made in the pixel space of the target block.)
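To make the contrast with MAE concrete, below is a minimal sketch of where the I-JEPA loss lives, using toy linear layers as stand-ins for the context ViT, the target ViT, and the predictor; the shapes, patch indices, and module names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
context_encoder = nn.Linear(dim, dim)  # stand-in for the context ViT encoder
target_encoder = nn.Linear(dim, dim)   # stand-in for the target encoder (an EMA of the context encoder in the paper)
predictor = nn.Linear(dim, dim)        # stand-in for the narrow predictor ViT

patches = torch.randn(2, 196, dim)     # (batch, num_patches, patch_dim), e.g. a 14x14 patch grid
context_idx = torch.arange(0, 150)     # indices of visible context patches (assumed)
target_idx = torch.arange(150, 196)    # indices of one target block's patches (assumed)

ctx = context_encoder(patches[:, context_idx])      # encode only the context block
with torch.no_grad():                               # the target branch receives no gradient
    tgt = target_encoder(patches)[:, target_idx]    # abstract representation of the target block

# In the paper the predictor is conditioned on the target positions via mask
# tokens; pooling the context and broadcasting keeps this sketch short.
pred = predictor(ctx.mean(dim=1, keepdim=True)).expand(-1, target_idx.numel(), -1)

loss = F.mse_loss(pred, tgt)   # L2 loss in representation space, not pixel space
print(loss.item())
```

The point of the sketch is the last line: the regression target is the target encoder's output, so the model never has to reconstruct pixel-level detail.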
This paper presents a simple and effective self-supervised method for image representation learning. The key seems to be the multi-block masking strategy: learning to predict the abstract representations of the target blocks given the representation of the context block.
In the ablation study, the authors found that multi-block masking is more effective than single-block or random masking strategies. I wonder whether increasing the number of blocks used in the multi-block masking would continue to improve performance. Is the improvement mainly due to the fine-grained prediction of the target block's representation learned from the context block, or simply due to the larger context available for each ViT prediction? How is consistency across the multiple block predictions guaranteed? I may have missed this, but if we enforce consistency on the intersections of the multiple predicted blocks, would performance continue to improve?
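As a purely hypothetical illustration of that last question (nothing like this is in the paper), one could penalize disagreement between the predicted representations of two target blocks on the patches they share; all names and shapes below are assumptions.

```python
import torch
import torch.nn.functional as F

def overlap_consistency(pred_a, idx_a, pred_b, idx_b):
    """Penalty encouraging two target-block predictions to agree on shared patches.

    pred_a, pred_b: (B, N_a, D) and (B, N_b, D) predicted patch representations.
    idx_a, idx_b: 1-D tensors of the patch indices each block covers.
    """
    list_a, list_b = idx_a.tolist(), idx_b.tolist()
    shared = [i for i in list_a if i in set(list_b)]
    if not shared:
        return torch.tensor(0.0)
    pos_a = [list_a.index(i) for i in shared]  # positions of shared patches within block A
    pos_b = [list_b.index(i) for i in shared]  # positions of shared patches within block B
    return F.mse_loss(pred_a[:, pos_a], pred_b[:, pos_b])

# Toy usage: two overlapping target blocks of 46 and 40 patches.
idx_a, idx_b = torch.arange(150, 196), torch.arange(140, 180)
pred_a, pred_b = torch.randn(2, 46, 64), torch.randn(2, 40, 64)
print(overlap_consistency(pred_a, idx_a, pred_b, idx_b).item())
```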