Tuesday, September 28, 2021

New image AI shows the potential of huge AI models
Image: Rinon Gal et al.


Thanks to OpenAI's CLIP, the StyleGAN-NADA AI system generates images from text instructions – even for categories and styles it has never seen before. Why is that so special?

GANs generate images such as portraits, animals, buildings or vehicles. Like many other AI systems, however, these neural networks are specialists: a system trained on cat images cannot generate dog images. The cat GAN would first have to see many photos of dogs before it could paint both pets together in one picture.

This training principle also applies to style changes: if the GAN is to draw cats in oil instead of photo-realistic pictures of cats, it must first be trained on the specific visual characteristics of oil paintings.

So far, this rule has been pretty immovable. But at the beginning of 2021, OpenAI showed CLIP, a multimodal AI model, alongside DALL-E. CLIP is multimodal because it was trained on both images and text. For example, it can determine whether a caption matches the content of an image.
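The core idea behind this matching can be sketched in a few lines: CLIP encodes images and texts into a shared embedding space, and the matching caption is the one whose embedding points in roughly the same direction as the image's. The following toy sketch uses mock numpy vectors in place of CLIP's real encoders (the embeddings and captions here are illustrative assumptions, not actual CLIP outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    # CLIP scores an image/text pair by the cosine of the angle
    # between their embeddings in the shared space.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Mock embeddings standing in for CLIP's image and text encoders.
image_embedding = np.array([0.9, 0.1, 0.2])
caption_match   = np.array([0.8, 0.2, 0.1])   # e.g. "a photo of a cat"
caption_miss    = np.array([0.1, 0.9, 0.3])   # e.g. "a photo of a car"

# The matching caption scores higher than the mismatched one.
print(cosine_similarity(image_embedding, caption_match) >
      cosine_similarity(image_embedding, caption_miss))  # True
```

The real model arrives at such embeddings through contrastive training on hundreds of millions of image-text pairs, which is what gives it its broad visual vocabulary.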

Since the publication of CLIP, AI researchers and artists have been experimenting with OpenAI's image AI. Some combine the image-language model with generative networks such as Nvidia's StyleGAN – creating a new generation of generative AI.

StyleCLIP changes image details via text input

In March 2021, Israeli researchers working with Adobe showed StyleCLIP, a GAN that can be controlled by text input. Using a simple interface, the researchers change the color of a person's hair, make a cat cute or turn a tiger into a lion.

However, StyleCLIP first needed sample images and additional training for such changes, which stay within the respective domain (e.g. cat pictures or human portraits).

AI-generated images, with specific details such as hair color or eyes that have been changed by entering text
StyleCLIP can be controlled by text input. However, changes can only be made within the respective domain, for example from "cat" to "cute cat". | Image: Patashnik et al.

But could CLIP also steer GANs to generate images outside of their originally trained domain? Could a GAN trained only on cat images generate dog images?

Thanks to multimodal orientation: Image AI is becoming more versatile

This is exactly what StyleGAN-NADA does. Like StyleCLIP, the new system relies on Nvidia's StyleGAN for image generation and on CLIP as the control mechanism. The "NADA" (Spanish for "nothing") in the name alludes to the training data that is no longer required.

Unlike StyleCLIP, StyleGAN-NADA can generate images and styles outside of its trained domain without additional training data: artificial people in photo-realistic portraits turn into werewolves. Drawings or paintings appear in the style of selected artists. Dogs become bears or dogs with Nicolas Cage's face. The image of a church becomes the urban landscape of New York.

AI-generated images, for example of a dog wearing Nicolas Cage's eye area
StyleGAN-NADA can make numerous changes to images. The dog model has never seen Nicolas Cage, the car model has never seen Tron. Nevertheless, it can produce images that clearly contain elements of these two visual motifs. | Image: Gal et al.

The special part: none of these image-generating networks has ever seen pictures of werewolves, drawings by these artists, bears, Nicolas Cage or New York.

This progress is made possible by a special architecture of the AI model: the researchers rely on two generators whose capabilities are identical at the start. The weights of one generator's neural network are frozen and serve as a reference for the second generator, which adjusts its weights until an initial image is transformed into new images that match CLIP's specifications.

Some animal images generated by the web
All of the animal pictures shown in this image were generated with a StyleGAN variant that was trained exclusively on pictures of dogs. | Image: Gal et al.

Text prompts specify the input and target categories, such as "human" and "werewolf" or "dog" and "Nicolas Cage". To increase the quality of the generated images, the second generator adapts its weights layer by layer.
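The guiding signal behind this setup can be sketched as a directional loss: the shift from the source text to the target text in CLIP's text space ("human" → "werewolf") should line up with the shift from the frozen generator's image to the trained generator's image in CLIP's image space. The following is a minimal numpy sketch of that idea with mock embeddings; the vectors and the exact loss form are illustrative assumptions, not the paper's verbatim implementation:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def directional_loss(text_src, text_tgt, img_frozen, img_trained):
    # Direction in CLIP text space ("human" -> "werewolf") ...
    text_dir = normalize(text_tgt - text_src)
    # ... should match the direction in CLIP image space
    # (frozen generator's output -> trained generator's output).
    img_dir = normalize(img_trained - img_frozen)
    return 1.0 - np.dot(text_dir, img_dir)  # 0 when perfectly aligned

# Mock CLIP embeddings (assumptions, not real CLIP outputs).
e_human    = np.array([1.0, 0.0, 0.0])
e_werewolf = np.array([0.0, 1.0, 0.0])
e_frozen   = np.array([1.0, 0.1, 0.0])   # image from the frozen generator
e_trained  = np.array([0.1, 1.0, 0.0])   # image from the adapted generator

loss = directional_loss(e_human, e_werewolf, e_frozen, e_trained)
print(loss < 1e-6)  # True: the two shifts point the same way
```

Minimizing such a loss nudges the trainable generator's weights so its outputs drift toward the target concept while the frozen twin anchors everything else about the image.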

Huge models make artificial intelligence more flexible

Even if the GAN networks described did not explicitly see their target categories or styles, all this information is implicitly contained in CLIP. OpenAI trained the model on vast amounts of images and text from the Internet: Nicolas Cage, Tron, bears, paintings by Picasso – all of these visual motifs were included in CLIP's training data and are linked to their respective linguistic terms. A study by OpenAI demonstrated this phenomenon, reminiscent of the grandmother neuron.

StyleGAN-NADA uses the comprehensive representations of visual motifs contained in CLIP as specifications for its own specifically trained GANs. Compared with other GAN systems that rely on explicit training images as a template, StyleGAN-NADA is significantly more efficient. The researchers write that this also applies to variants that rely on little training data. Shifting the domain takes only a few minutes.

StyleGAN-NADA is another example of the versatility of large AI models such as CLIP or GPT-3 that have emerged over the past two years. Thanks to their extensive pre-training on huge amounts of data, they serve as the basis for specific AI applications, which can then be developed more quickly through comparatively simple fine-tuning.

The code for StyleGAN-NADA is available on GitHub.


Sources: StyleCLIP (arXiv), StyleGAN-NADA (arXiv)
