Image: Rinon Gal et al.
Thanks to OpenAI's CLIP, the StyleGAN-NADA AI system generates images from text instructions – even for categories and styles it has never seen before. Why is that so special?
GANs generate images such as portraits, animals, buildings or vehicles. Like many other AI systems, however, these neural networks are specialists: a system trained on cat images cannot generate dog images. Such a cat GAN would first have to see lots of dog photos before it could paint both favorite pets in a single picture.
This training principle also applies to style changes: if the GAN is to draw cats in oil instead of producing photo-realistic pictures of cats, it must first be trained on the specific visual characteristics of oil paintings.
So far, this rule has been pretty immovable. But at the beginning of 2021, OpenAI showed CLIP, a multimodal AI model, alongside DALL-E. CLIP is multimodal because it was trained on both images and texts. For example, it can determine whether a caption matches the content of an image.
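Conceptually, this caption matching works by embedding image and text into a shared vector space and comparing them by cosine similarity. The sketch below illustrates the idea with made-up embedding vectors; the real CLIP encoders are large neural networks, and the numbers here are purely illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock embeddings standing in for CLIP's image and text encoders:
# in the real model, both encoders map into a shared space where
# matching image/caption pairs score a high cosine similarity.
image_embedding = np.array([0.9, 0.1, 0.2])
caption_embeddings = {
    "a photo of a cat": np.array([0.8, 0.2, 0.1]),
    "a photo of a dog": np.array([0.1, 0.9, 0.3]),
}

# Pick the caption whose embedding best matches the image.
best_caption = max(
    caption_embeddings,
    key=lambda c: cosine_similarity(image_embedding, caption_embeddings[c]),
)
print(best_caption)  # -> "a photo of a cat"
```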
Since the publication of CLIP, AI researchers and artists have been experimenting with OpenAI's model. Some combine the image-language model with generative networks such as Nvidia's StyleGAN – creating a new generation of generative AI.
StyleCLIP changes image details by entering text
Israeli researchers, working with Adobe, showed StyleCLIP in March 2021, a GAN that can be controlled through text input. Using a simple interface, the researchers change the color of a person's hair, make a cat look cute or turn a tiger into a lion.
However, StyleCLIP first needed sample images and additional training for such changes within the respective domain (e.g. cat pictures or human portraits).
But could CLIP also steer GANs to generate images outside of their originally trained domain? Can a GAN trained only on cat images generate dog images?
Thanks to multimodal orientation: Image AI is becoming more versatile
This is exactly what StyleGAN-NADA sets out to do. Like StyleCLIP, the new system relies on Nvidia's StyleGAN for image generation and on CLIP as the control mechanism. The "NADA" (Spanish for "nothing") in the name alludes to the training data that is no longer required.
Unlike StyleCLIP, StyleGAN-NADA networks can generate images and styles outside their domain without additional training: artificial people in photo-realistic portraits turn into werewolves. Drawings or paintings appear in the style of selected artists. Dogs become bears, or dogs with Nicolas Cage's face. The image of a church becomes the urban landscape of New York.
The special thing: none of these image networks has ever seen pictures of werewolves, drawings by these artists, bears, Nicolas Cage or New York.
This progress is made possible by a special architecture of the AI model: the researchers use two generators that start out with identical capabilities. The weights of one generator's neural network are frozen and serve as a reference for the second generator, which adjusts its own weights until the initial image is transformed into new images that match CLIP's specifications.
Source and target categories are supplied as input, such as "human" and "werewolf" or "dog" and "Nicolas Cage". To increase the quality of the generated images, the second generator changes its weights layer by layer.
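The two-generator setup can be sketched as a directional loss in CLIP's embedding space: the shift from the frozen generator's image to the trained generator's image should point in the same direction as the shift from the source prompt to the target prompt. The sketch below uses tiny mock embedding vectors in place of real CLIP encoder outputs; the vectors and numbers are assumptions for illustration only.

```python
import numpy as np

def directional_clip_loss(img_frozen, img_trained, txt_source, txt_target):
    """Directional loss sketch: the shift between the two generators'
    images (in CLIP embedding space) should align with the shift between
    the source and target text prompts. Returns 0 when the two shifts
    point in exactly the same direction."""
    delta_img = img_trained - img_frozen   # e.g. frozen "dog" image -> edited image
    delta_txt = txt_target - txt_source    # e.g. "dog" -> "bear"
    cos = np.dot(delta_img, delta_txt) / (
        np.linalg.norm(delta_img) * np.linalg.norm(delta_txt))
    return 1.0 - cos

# Mock CLIP embeddings (assumed values, not real model output):
txt_dog = np.array([1.0, 0.0])
txt_bear = np.array([0.0, 1.0])
img_dog = np.array([0.9, 0.1])          # output of the frozen generator
img_shifted = np.array([0.1, 0.9])      # trained generator drifting toward "bear"

loss = directional_clip_loss(img_dog, img_shifted, txt_dog, txt_bear)
print(loss)  # close to 0: the image shift mirrors the text shift
```

In training, the second generator's weights would be updated to minimize this loss while the frozen generator anchors the starting domain.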
Huge models make artificial intelligence more flexible
Even if the GANs described here never explicitly saw their target categories or styles, all this information is implicitly contained in CLIP. OpenAI trained the model on large amounts of images and text from the internet: Nicolas Cage, Tron, bears, paintings by Picasso – all of these visual motifs were part of CLIP's training data and are linked to their respective linguistic terms. A study by OpenAI demonstrated this phenomenon, which is reminiscent of the grandmother neuron.
StyleGAN-NADA uses the comprehensive representations of visual motifs contained in CLIP as targets for its own specifically trained GANs. Compared with other GAN systems that rely on explicit training images as templates, StyleGAN-NADA is significantly more effective; the researchers write that this also holds against variants that get by with little training data. Shifting the domain takes only a few minutes. The following video illustrates the process.
StyleGAN-NADA is another example of the versatility of large AI models such as CLIP or GPT-3 that have been trained over the past two years. Due to their extensive pre-training with huge amounts of data, they serve as the basis for specific AI applications, which can then be developed more quickly through comparatively less complex fine-tuning.
The code for StyleGAN-NADA is available on GitHub.