
CSE5519 Advances in Computer Vision (Topic D: 2022: Image and Video Generation)

An Image is Worth One Word

link to the paper 

Personalizing Text-to-Image Generation using Textual Inversion

Goal: Enable language-guided generation of new, user-specific concepts.

Novelty in Textual Inversion

Use pseudo-words (e.g. $S_*$) whose embeddings are learned so that prompts such as "A photo of $S_*$" can guide generation.

Textual inversion:

$$v_* = \arg\min_{v}\ \mathbb{E}_{z\sim \mathcal{E}(x),\, y,\, \epsilon\sim \mathcal{N}(0,1),\, t}\left[\left\|\epsilon-\epsilon_\theta\!\left(z_t, t, c_\theta(y)\right)\right\|_2^2\right]$$
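To make the objective concrete, the following is a minimal PyTorch-style sketch of one optimization step, assuming access to a frozen pre-trained latent diffusion model. The callables `vae_encode`, `text_encoder`, `unet`, and `add_noise` are illustrative placeholders (not a specific library's API); only the single pseudo-word embedding `v` receives gradients.

```python
import torch

# Placeholder components of a frozen, pre-trained latent diffusion model
# (names are assumptions for illustration, not a real library API):
#   vae_encode(x)            -> latent z = E(x)
#   text_encoder(tokens, v)  -> conditioning c_theta(y), with the pseudo-word
#                               token mapped to the learnable embedding v
#   unet(z_t, t, c)          -> predicted noise epsilon_theta(z_t, t, c)
#   add_noise(z, eps, t)     -> noisy latent z_t at timestep t

def textual_inversion_step(v, images, tokens, vae_encode, text_encoder,
                           unet, add_noise, num_timesteps=1000):
    """One step of v_* = argmin_v E[ ||eps - eps_theta(z_t, t, c_theta(y))||_2^2 ]."""
    z = vae_encode(images)                                   # z ~ E(x)
    eps = torch.randn_like(z)                                # eps ~ N(0, 1)
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device)
    z_t = add_noise(z, eps, t)                               # forward diffusion
    c = text_encoder(tokens, v)                              # e.g. "A photo of S_*"
    eps_pred = unet(z_t, t, c)                               # denoiser prediction
    return torch.nn.functional.mse_loss(eps_pred, eps)       # reconstruction of eps

# Usage sketch: the diffusion model stays frozen; only v is optimized.
# v = torch.nn.Parameter(initial_embedding.clone())
# opt = torch.optim.AdamW([v], lr=5e-3)
# loss = textual_inversion_step(v, images, tokens, vae_encode, text_encoder,
#                               unet, add_noise)
# loss.backward(); opt.step(); opt.zero_grad()
```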
Tip

This paper shows that we can use pseudo-words to guide the generation of new, user-specific concepts.

However, some technical details are not fully explained in the paper: for example, how is the loss function constructed, and how does it maximize the model's ability to generalize across different styles and concepts?

For example, what does $v_* = \arg\min_{v}\mathbb{E}_{z\sim \mathcal{E}(x),\, y,\, \epsilon\sim \mathcal{N}(0,1),\, t}\left[\|\epsilon-\epsilon_\theta(z_t,t,c_\theta(y))\|_2^2\right]$ mean, and how is it used to generate the new, user-specific concepts?
