CSE5519 Advances in Computer Vision (Topic F: 2024: Representation Learning)

Long-CLIP: Unlocking the long-text capability of CLIP

Link to the paper

Novelty in Long-CLIP

  1. A knowledge-preserving stretching of positional embeddings.
  2. A primary component matching of CLIP features.

Knowledge-preserving stretching of positional embeddings

Retain the embeddings of the first 20 positions unchanged, since most of CLIP's training text is far shorter than the 77-token limit and these positions are the well-trained ones; for the remaining 57 positions, apply linear interpolation with a large stretching ratio.

$$\operatorname{PE}^*(pos)=\begin{cases} \operatorname{PE}(pos) & \text{if } pos \leq 20 \\ (1-\alpha)\times \operatorname{PE}\left(\left\lfloor \frac{pos}{\lambda_2}\right\rfloor\right) + \alpha \times \operatorname{PE}\left(\left\lceil \frac{pos}{\lambda_2}\right\rceil\right) & \text{if } pos > 20 \end{cases}$$

where $\alpha=\frac{pos \% \lambda_2}{\lambda_2}$ and $\lambda_2$ is the stretching ratio applied to the interpolated positions.
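A minimal PyTorch sketch of this stretching, assuming $\lambda_2 = 4$ (stretching the remaining 57 positions to 228, for 248 in total); `stretch_positional_embedding` is a hypothetical helper, and the mapping back into the original index space is offset by the preserved prefix so the kept and interpolated ranges stay disjoint:

```python
import torch

def stretch_positional_embedding(pe: torch.Tensor, keep: int = 20, lam2: int = 4) -> torch.Tensor:
    """Knowledge-preserving stretching (sketch).

    pe:   original positional embeddings, shape (77, d) for CLIP.
    keep: number of leading positions copied verbatim (20 here).
    lam2: stretching ratio lambda_2 for the remaining positions.
    """
    n, d = pe.shape
    new_len = keep + (n - keep) * lam2            # 20 + 57 * 4 = 248
    out = pe.new_empty(new_len, d)
    out[:keep] = pe[:keep]                        # PE*(pos) = PE(pos) for pos <= keep
    for pos in range(keep, new_len):
        # map the new position back into the original index space
        src = keep + (pos - keep) / lam2
        lo, hi = int(src), min(int(src) + 1, n - 1)
        alpha = src - lo                          # fractional part, alpha in the equation
        out[pos] = (1 - alpha) * pe[lo] + alpha * pe[hi]
    return out
```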

Primary component matching of CLIP features

Fine-tuning on long text alone may degrade CLIP's performance on short text.

Instead, match CLIP features at two granularities: align the fine-grained image feature with the long caption, and align a coarse-grained feature, obtained by keeping only the primary components of the image embedding, with the short caption. A sketch of the coarse-graining step follows.
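The paper's exact decomposition is not reproduced here; below is a minimal PCA-style sketch of how a coarse-grained feature could be obtained by keeping only the top-$k$ principal components of a batch of image features. `primary_component_extract` and the choice of `k` are illustrative assumptions:

```python
import torch

def primary_component_extract(feat: torch.Tensor, k: int) -> torch.Tensor:
    """Coarse-grain a batch of image features by keeping the top-k
    principal components (sketch; the paper's decomposition may differ).

    feat: (B, d) batch of fine-grained image features.
    k:    number of principal components to retain.
    """
    mean = feat.mean(dim=0, keepdim=True)
    centered = feat - mean
    # SVD of the centered batch; rows of vh are principal directions
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    basis = vh[:k]                                 # (k, d) top-k directions
    # project onto the top-k subspace, then map back to feature space
    coarse = centered @ basis.t() @ basis + mean
    return coarse
```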

Tip

This paper shows an interesting approach to increasing the long-text capability of CLIP. The authors use a knowledge-preserving stretching of positional embeddings and a primary component matching of CLIP features to achieve this.

However, the primary component matching is not an entirely satisfying solution, as it may fail to capture detail carried by high-frequency components, for example the texture of the main character's clothes when multiple textures appear in the image. How does the model resolve this and align the feature to the correct object in the description? Or does it simply assume that larger objects in the image are more important for the captioning task?
