Anna Kovalenko

StyleGAN-NADA: Blind Training and Other Wonders

Continuing the series of articles about the AI system DALL·E 2 and the models used in it, this time I will talk about the StyleGAN-NADA model, a method for CLIP-guided domain adaptation of image generators. If you want to learn more about the CLIP model, you can check out my other article!

Introduction & Basics
Imagine how cool it would be if you could describe a domain shift to a GAN with a text prompt (for example, Dog → The Joker) and get back a complete generator that synthesizes images matching that text query. Imagine how cool it would be if such a generative model could be trained to produce those images without ever seeing a single image from the target domain.

It is actually possible with the StyleGAN-NADA model. And it is really cool.

[Image: Dog → The Joker]

Leveraging the semantic power of large-scale CLIP (Contrastive Language-Image Pre-training) models, Rinon Gal and his colleagues present a text-driven method that shifts a generative model to new domains without collecting even a single image from those domains. In other words, the StyleGAN-NADA model is trained blindly. All it takes is a natural-language text prompt and a few minutes of training, and with that alone the method can adapt a generator to a great number of domains characterized by diverse styles and shapes.

The domains that StyleGAN-NADA covers are very specific and fun (or maybe a little bit creepy):

[Image: Human → Mark Zuckerberg]

[Image: Church → New York City]

[Image: Human → Zombie]

Why StyleGAN-NADA matters
Training a GAN requires collecting a multitude of images from a specific domain, and that is usually a pretty difficult task. Of course, you can leverage the information learned by vision-language models such as CLIP, yet applying these models to manipulate pre-trained generators so that they synthesize out-of-domain images is not that easy. That’s why the authors of StyleGAN-NADA propose to use dual generators and an adaptive layer-selection procedure to increase training stability. Unlike other models and methods, StyleGAN-NADA works in a zero-shot manner and automatically selects a subset of layers to update at each iteration.

Pre-training Setup
It all starts with a pre-trained generator and two text prompts that describe a direction of change (for example, from “Dog” to “The Joker”). Instead of editing a single image, the authors of StyleGAN-NADA use the signal from the CLIP model to train the generator itself. So there is no need for training data at all, and the process is really fast: training takes minutes or even less.
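
To make this concrete, here is a minimal sketch (not the authors’ code) of how the two prompts can be turned into a single direction vector in CLIP’s embedding space, assuming the official openai/CLIP package; the prompt strings are just the example from above.

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

source_prompt = "Dog"        # describes the source domain
target_prompt = "The Joker"  # describes the target domain

with torch.no_grad():
    tokens = clip.tokenize([source_prompt, target_prompt]).to(device)
    text_features = clip_model.encode_text(tokens).float()
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# The textual direction of change: target minus source, normalized.
text_direction = text_features[1] - text_features[0]
text_direction = text_direction / text_direction.norm()
```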

[Image: overview of the training setup]

If you’re interested in a more detailed overview of the training setup, here it is:

The authors of the StyleGAN-NADA model initialize two intertwined generators, G_frozen and G_train, using the weights of a generator pre-trained on images from a source domain. The weights of G_frozen remain fixed throughout the whole process, while the weights of G_train are modified through optimization and an iterative layer-freezing scheme. The process shifts the domain of G_train according to a user-provided textual direction while maintaining a shared latent space.
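
As a rough illustration of this setup (again, not the official implementation), the dual generators amount to two copies of the same pre-trained weights, only one of which receives gradients. `Generator`, its arguments and `checkpoint_path` are placeholders for whatever StyleGAN2 port and checkpoint you are using:

```python
import copy
import torch

# Placeholders: any StyleGAN2 port with a Generator class and a checkpoint will do
# (the "g_ema" key layout depends on the implementation).
g_frozen = Generator(size=1024, style_dim=512, n_mlp=8).to(device)
g_frozen.load_state_dict(torch.load(checkpoint_path)["g_ema"], strict=False)
g_frozen.eval()
for p in g_frozen.parameters():
    p.requires_grad_(False)        # G_frozen: the fixed reference, never updated

g_train = copy.deepcopy(g_frozen)  # identical weights, hence a shared latent space
g_train.train()                    # G_train: gradually shifted toward the target domain
```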

How StyleGAN-NADA works
The main goal of the method is to shift a pre-trained generator from a given source domain to a new target domain using only textual prompts, without any images of the target domain. Here’s the training scheme that helps achieve that goal:

Network Architecture
The model consists of two pre-trained StyleGAN2 generators with a shared mapping network and the same latent space. The goal is to shift the domain of one of the paired generators with a CLIP-based loss while keeping the other fixed as a reference, using an adaptive layer-freezing scheme that selects which layers to update at each iteration.

CLIP-based Guidance
Three different types of losses are used:

Global target loss
The global loss is the most intuitive CLIP loss: it minimizes the CLIP-space cosine distance between the generated images and the given target text prompt. Used on its own, however, it tends to collapse to a single image or to fool CLIP by adding adversarial per-pixel noise to the images.
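
A minimal sketch of such a global loss, reusing `clip_model` from the earlier snippet; the preprocessing is reduced to a resize plus CLIP’s published normalization constants:

```python
import torch
import torch.nn.functional as F

# CLIP's published image normalization constants.
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def encode_images(images):
    """Map generator outputs in [-1, 1] to normalized CLIP image embeddings."""
    images = (images + 1) / 2                                   # to [0, 1]
    images = F.interpolate(images, size=224, mode="bicubic", align_corners=False)
    images = (images - CLIP_MEAN.to(images.device)) / CLIP_STD.to(images.device)
    features = clip_model.encode_image(images).float()
    return features / features.norm(dim=-1, keepdim=True)

def global_clip_loss(generated_images, target_text_features):
    """1 - cosine similarity between each image and the target prompt embedding (shape (512,))."""
    image_features = encode_images(generated_images)
    return (1 - image_features @ target_text_features).mean()
```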

Directional loss
It’s a more advanced loss that aligns the direction between the CLIP embeddings of images from the two domains with the CLIP-space direction between the corresponding text prompts.
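
A sketch of the directional loss in the same style, assuming `encode_images` from the previous snippet and the `text_direction` vector computed earlier:

```python
def directional_clip_loss(frozen_images, trained_images, text_direction):
    """Align the CLIP-space shift from G_frozen's image to G_train's image with the text direction."""
    src_features = encode_images(frozen_images)    # images from G_frozen
    tgt_features = encode_images(trained_images)   # images from G_train, same latent codes
    image_direction = tgt_features - src_features
    image_direction = image_direction / image_direction.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    cos = F.cosine_similarity(image_direction, text_direction.unsqueeze(0), dim=-1)
    return (1 - cos).mean()
```

Because this loss only cares about the direction of change between paired images, it does not push every sample toward the same point in CLIP space, which helps it avoid the collapse issues of the global loss.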

Embedding-norm loss
The embedding-norm loss comes from a regularized version of StyleCLIP’s latent mapper and is used to reduce the number of semantic artifacts in the synthesized images.

Layer-Freezing
It turns out that some layers of the generator are more important for a specific domain shift than others. Hence, at each iteration a set of W+ vectors is generated: a separate style vector for each layer in the generator. A number of StyleCLIP-style global optimization steps are then performed on these vectors to measure which layers change the most. Only the most-changed layers are updated, while all other layers are frozen for that iteration.
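
Here is a rough sketch of that selection step; the mapping call `style(...)` and the generator call signature are placeholders for your StyleGAN2 port, and the hyperparameters (batch size, k, number of steps) are made up for illustration:

```python
def select_trainable_layers(g_train, target_text_features, n_latents, k=3, opt_steps=5):
    """Return the indices of the k W+ entries that move the most under CLIP guidance."""
    z = torch.randn(8, 512, device=device)
    with torch.no_grad():
        w = g_train.style(z)                          # mapping network (placeholder call)
    w_plus = w.unsqueeze(1).repeat(1, n_latents, 1)   # one style vector per layer
    w_opt = w_plus.detach().clone().requires_grad_(True)

    optimizer = torch.optim.Adam([w_opt], lr=0.01)
    for _ in range(opt_steps):                        # a few StyleCLIP-style global steps
        images, _ = g_train([w_opt], input_is_latent=True)
        loss = global_clip_loss(images, target_text_features)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    per_layer_change = (w_opt.detach() - w_plus).norm(dim=-1).mean(dim=0)
    return per_layer_change.topk(k).indices.tolist()
```

At each iteration, only the generator layers that consume the selected W+ entries would be unfrozen; every other layer keeps its current weights.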

Latent-Mapper
During the last step, it is noted that for some shape changes the generator does not undergo a complete transformation. For a domain shift such as “Dog” to “The Joker”, the resulting generator can output both dogs, Jokers and everything that lies in between. Therefore, a StyleCLIP latent mapper can be trained to map all latent codes to the region of the latent space that corresponds to the target domain (the Joker in this example).
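
As a very rough sketch of that idea (not StyleCLIP’s actual mapper architecture), the mapper is just a small network that predicts an offset for each latent code and would be trained with the same CLIP-based guidance while the adapted generator stays frozen; all sizes and the 0.1 scale are assumptions:

```python
import torch.nn as nn

class LatentMapper(nn.Module):
    """Tiny stand-in for a StyleCLIP-style latent mapper: predicts an offset in W+ space."""
    def __init__(self, style_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim, style_dim), nn.LeakyReLU(0.2),
            nn.Linear(style_dim, style_dim), nn.LeakyReLU(0.2),
            nn.Linear(style_dim, style_dim),
        )

    def forward(self, w_plus):                  # (batch, n_latents, style_dim)
        return w_plus + 0.1 * self.net(w_plus)  # small learned shift toward the target region
```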

Conclusion
So this is how StyleGAN-NADA, a CLIP-guided zero-shot method for Non-Adversarial Domain Adaptation of image generators, works. Although StyleGAN-NADA is focused on StyleGAN, it can be applied to other generative architectures such as OASIS and many others.

The ability to blindly train intertwined generators opens up cool new possibilities. For example, with the StyleGAN-NADA model you can edit images in ways constrained almost only by your own creativity, or synthesize paired cross-domain data and labeled images for downstream applications such as image-to-image translation. And it's only the beginning! The method will surely be developed further. Maybe this article inspired you to explore the world of text-guided generation and the abilities of the CLIP model yourself.
