
CSE5519 Advances in Computer Vision (Topic D: 2022: Image and Video Generation)

An Image is Worth One Word

link to the paper 

Personalizing Text-to-Image Generation using Textual Inversion

Goal: Enable language-guided generation of new, user-specific concepts.

Novelty in Textual Inversion

Use pseudo-words (e.g. $S_*$) whose embeddings are learned so that prompts such as "A photo of $S_*$" can guide generation.

Textual inversion:

$$v_* = \arg\min_{v}\ \mathbb{E}_{z\sim \mathcal{E}(x),\, y,\, \epsilon\sim \mathcal{N}(0,1),\, t}\left[\left\|\epsilon-\epsilon_\theta\!\left(z_t, t, c_\theta(y)\right)\right\|_2^2\right]$$
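To make the objective concrete, the following is a minimal PyTorch-style sketch of one optimization step, assuming access to a frozen pre-trained latent diffusion model. The callables `vae_encode`, `text_encoder`, `unet`, and `add_noise` are illustrative placeholders (not a specific library's API); only the single pseudo-word embedding `v` receives gradients.

```python
import torch

# Placeholder components of a frozen, pre-trained latent diffusion model
# (names are assumptions for illustration, not a real library API):
#   vae_encode(x)            -> latent z = E(x)
#   text_encoder(tokens, v)  -> conditioning c_theta(y), with the pseudo-word
#                               token mapped to the learnable embedding v
#   unet(z_t, t, c)          -> predicted noise epsilon_theta(z_t, t, c)
#   add_noise(z, eps, t)     -> noisy latent z_t at timestep t

def textual_inversion_step(v, images, tokens, vae_encode, text_encoder,
                           unet, add_noise, num_timesteps=1000):
    """One step of v_* = argmin_v E[ ||eps - eps_theta(z_t, t, c_theta(y))||_2^2 ]."""
    z = vae_encode(images)                                   # z ~ E(x)
    eps = torch.randn_like(z)                                # eps ~ N(0, 1)
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device)
    z_t = add_noise(z, eps, t)                               # forward diffusion
    c = text_encoder(tokens, v)                              # e.g. "A photo of S_*"
    eps_pred = unet(z_t, t, c)                               # denoiser prediction
    return torch.nn.functional.mse_loss(eps_pred, eps)       # reconstruction of eps

# Usage sketch: the diffusion model stays frozen; only v is optimized.
# v = torch.nn.Parameter(initial_embedding.clone())
# opt = torch.optim.AdamW([v], lr=5e-3)
# loss = textual_inversion_step(v, images, tokens, vae_encode, text_encoder,
#                               unet, add_noise)
# loss.backward(); opt.step(); opt.zero_grad()
```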
Tip

This paper shows that we can use pseudo-words to guide the generation of new, user-specific concepts.

However, some technical details are not fully explained in the paper: for example, how is the loss function constructed, and how does it maximize the model's ability to generalize across different styles and concepts?

For example, what does $v_* = \arg\min_{v}\mathbb{E}_{z\sim \mathcal{E}(x),\, y,\, \epsilon\sim \mathcal{N}(0,1),\, t}\left[\|\epsilon-\epsilon_\theta(z_t,t,c_\theta(y))\|_2^2\right]$ mean, and how is it used to generate the new, user-specific concepts?
