
Nvidia developed a radically different way to compress video calls

Nvidia Maxine uses Generative Adversarial Networks to re-create video frames.

Instead of transmitting an image for every frame, Maxine sends keypoint data that allows the receiving computer to re-create the face using a neural network.
Nvidia

Last month, Nvidia announced a new platform called Maxine that uses AI to enhance the performance and functionality of video conferencing software. The software uses a neural network to create a compact representation of a person's face. This compact representation can then be sent across the network, where a second neural network reconstructs the original image—possibly with helpful modifications.

Nvidia says that its technique can reduce the bandwidth needs of video conferencing software by a factor of 10 compared to conventional compression techniques. It can also change how a person's face is displayed. For example, if someone appears to be looking off to one side because of where her camera is positioned, the software can rotate her face so that she appears to look straight at the camera instead. Software can also replace someone's real face with an animated avatar.

Maxine is a software development kit, not a consumer product. Nvidia is hoping third-party software developers will use Maxine to improve their own video conferencing software. And the software comes with an important limitation: the device receiving a video stream needs an Nvidia GPU with tensor core technology. To support devices without an appropriate graphics card, Nvidia recommends that video frames be generated in the cloud—an approach that may or may not work well in practice.

But regardless of how Maxine fares in the marketplace, the concept seems likely to be important for video streaming services in the future. Before too long, most computing devices will be powerful enough to generate real-time video content using neural networks. Maxine and products like it could allow for higher-quality video streams with much lower bandwidth consumption.

Dueling neural networks

A generative adversarial network turns sketches of handbags into photorealistic images of handbags.

Maxine is built on a machine-learning technique called a generative adversarial network (GAN).

A GAN is built from neural networks: complex mathematical functions that take numerical inputs and produce numerical outputs. For visual applications, the input to a neural network is typically a pixel-by-pixel representation of an image. One famous neural network, for example, took images as inputs and output the estimated probability that the image fell into each of 1,000 categories like "dalmatian" and "mushroom."
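To make that concrete, here's a minimal sketch (in PyTorch) of a network treated as a function from pixel values to category scores. The architecture and layer sizes are invented for illustration, not taken from any real classifier:

```python
import torch
import torch.nn as nn

# Illustrative only: a tiny "classifier" mapping a 224x224 RGB image to
# scores for 1,000 categories. Real image classifiers are far larger.
classifier = nn.Sequential(
    nn.Flatten(),                      # 3 x 224 x 224 pixels -> 150,528 numbers
    nn.Linear(3 * 224 * 224, 512),
    nn.ReLU(),
    nn.Linear(512, 1000),              # one score per category
)

image = torch.rand(1, 3, 224, 224)                 # numerical input: one stand-in RGB image
probabilities = classifier(image).softmax(dim=1)   # estimated probability per category
print(probabilities.shape)                         # torch.Size([1, 1000])
```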

Neural networks have thousands—often millions—of tunable parameters. The network is trained by evaluating its performance against real-world data. The network is shown a real-world input (like a picture of a dog) whose correct classification is known to the training software (perhaps "dalmatian"). The training software then uses a technique called back-propagation to optimize the network's parameters. Values that pushed the network toward the right answer are boosted, while those that contributed to a wrong answer get dialed back. After repeating this process on thousands—even millions—of examples, the network may become quite effective at the task it's being trained for.
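As a rough illustration of that loop (again with made-up sizes and random stand-in data), the core of supervised training with back-propagation looks something like this:

```python
import torch
import torch.nn as nn

# Stand-in model and labeled data; the point here is the training loop, not the model.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.rand(8, 3, 64, 64)        # a batch of training images
labels = torch.randint(0, 1000, (8,))    # their known correct categories

for step in range(100):                  # in practice: many passes over huge datasets
    logits = model(images)               # forward pass: the network's current guesses
    loss = loss_fn(logits, labels)       # how far the guesses are from the right answers
    optimizer.zero_grad()
    loss.backward()                      # back-propagation: compute parameter gradients
    optimizer.step()                     # nudge parameters toward better answers
```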

Training software needs to know the correct answer for each input. For this reason, classic machine-learning projects often required people to label thousands of examples by hand. But the training process can be greatly sped up if there's a way to automatically generate training data.

A generative adversarial network is a clever way to train a neural network without the need for human beings to label the training data. As the name implies, a GAN is actually two networks that "compete" against one another.

The first network is a generator that takes random data as an input and tries to produce a realistic image. The second network is a discriminator that takes an image and tries to determine whether it's a real image or a forgery created by the first network.

The training software runs these two networks simultaneously, with each network's results being used to train the other (a bare-bones code sketch of this loop follows the list):

  • The discriminator's answers are used to train the generator. When the discriminator wrongly classifies a generator-created photo as genuine, that means the generator is doing a good job of creating realistic images—so parameters that led to that result are reinforced. On the other hand, if the discriminator classifies an image as a forgery, that's treated as a failure for the generator.
  • Meanwhile, training software shows the discriminator a random selection of images that are either real or created by the generator. If the discriminator guesses right, that's treated as a success, and the discriminator network's parameters are updated to reflect that.
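
Here is a bare-bones sketch of that two-network loop in PyTorch. The architectures, sizes, and random stand-in "real" data are all invented for illustration; production GANs like StyleGAN are vastly more elaborate:

```python
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28     # illustrative sizes (e.g. tiny 28x28 images)

# Generator: random noise in, fake image out.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, image_dim), nn.Tanh())
# Discriminator: image in, single "is this real?" score out.
D = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_images = torch.rand(32, image_dim) * 2 - 1   # stand-in for real training photos

for step in range(1000):
    # Train the discriminator: real images are labeled 1, the generator's fakes 0.
    fakes = G(torch.randn(32, latent_dim)).detach()      # don't update G on this pass
    d_loss = bce(D(real_images), torch.ones(32, 1)) + \
             bce(D(fakes), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator: it succeeds when the discriminator calls its fakes real.
    g_loss = bce(D(G(torch.randn(32, latent_dim))), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```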

At the start of training, both networks are bad at their jobs, but they improve over time. As the quality of the generator's images improves, the discriminator has to become more sophisticated to detect fakes. And as the discriminator becomes more discriminating, the generator gets trained to make photos that look more and more realistic.

The results can be spectacular. A website called ThisPersonDoesNotExist.com does exactly what it sounds like: it generates realistic photographs of human beings that don't exist.

The site is powered by a generative neural network called StyleGAN that was developed by researchers at Nvidia. Over the last decade, as Nvidia's graphics cards have become one of the most popular ways to do neural network computations, Nvidia has invested heavily in academic research into neural network techniques.

Applications for GANs have proliferated

Researchers used a conditional GAN to project how a face would age over time.

The earliest GANs just tried to produce random realistic-looking images within a broad category like human faces. These are known as unconditional GANs. More recently, researchers have developed conditional GANs—neural networks that take an image (or other input data) and then try to produce a corresponding output image.

In some cases, the training algorithm provides the same input information to both the generator and the discriminator. In other cases, the generator's loss function—the measure of how well the network did for training purposes—combines the output of the discriminator with some other metric that judges how well the output fits the input data.
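For example, a common recipe (used in pix2pix-style models; Nvidia hasn't confirmed Maxine's exact losses) combines the discriminator's judgment with an L1 term that measures how closely the output matches the target image. A sketch of such a generator loss:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial term: did the discriminator buy the fake?
l1 = nn.L1Loss()               # reconstruction term: does the output match the target?

def generator_loss(d_score_on_fake, fake_image, target_image, l1_weight=100.0):
    # l1_weight is an assumed hyperparameter; 100 is a common choice in the literature.
    adversarial = bce(d_score_on_fake, torch.ones_like(d_score_on_fake))
    reconstruction = l1(fake_image, target_image)
    return adversarial + l1_weight * reconstruction
```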

This approach has a wide range of applications. Researchers have used conditional GANs to generate works of art from textual descriptions, to generate photographs from sketches, to generate maps from satellite images, to predict how people will look when they're older, and a lot more.

This brings us back to Nvidia Maxine. Nvidia hasn't provided full details on how the technology works, but it did point us to a 2019 paper that described some of the underlying algorithms powering Maxine.

The paper describes a conditional GAN that takes as input a video of one person's face talking and a few photos of a second person's face. The generator creates a video of the second person making the same motions as the person in the original video.

Nvidia's experimental GAN created videos that showed one person (top) making the motions of a second person in an input video (left).
Ting-Chun Wang et al, Nvidia.

Nvidia's new video conferencing software uses a slight modification of this technique. Instead of taking a video as input, Maxine takes a set of keypoints extracted from the source video—data points specifying the location and shape of the subject's eyes, mouth, nose, eyebrows, and other facial features. This data can be represented far more compactly than an ordinary video, which means it can be transmitted across the network using very little bandwidth. The sending computer also transmits a high-resolution reference frame so that the recipient knows what the subject looks like. The receiver's computer then uses a conditional GAN to reconstruct the subject's face from the keypoints and that reference image.
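To see why keypoints are so cheap to send, here's a back-of-the-envelope sketch. The keypoint count, coordinate precision, and packing format are all assumptions for illustration, not Nvidia's actual wire format:

```python
import numpy as np

NUM_KEYPOINTS = 68                                   # assumed facial-landmark count

def pack_keypoints(keypoints: np.ndarray) -> bytes:
    """Sender side: serialize (x, y) landmark coordinates for transmission."""
    return keypoints.astype(np.float16).tobytes()    # 68 points * 2 coords * 2 bytes = 272 bytes

def unpack_keypoints(payload: bytes) -> np.ndarray:
    """Receiver side: recover the coordinates to feed the conditional GAN."""
    return np.frombuffer(payload, dtype=np.float16).reshape(NUM_KEYPOINTS, 2)

keypoints = np.random.rand(NUM_KEYPOINTS, 2).astype(np.float32)
payload = pack_keypoints(keypoints)

raw_frame_bytes = 1280 * 720 * 3                     # one uncompressed 720p RGB frame
print(f"{len(payload)} bytes of keypoints vs {raw_frame_bytes:,} bytes of raw pixels")
```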

A key feature of the network Nvidia researchers described in 2019 is that it wasn't specific to one face. A single network could be trained to generate videos of different people based on the photos provided as inputs. The practical benefit for Maxine is that there's no need to train a new network for each user. Instead, Nvidia can provide a pre-trained generator network that can draw anyone's face. Using a pre-trained network requires far less computing power than training a new network from scratch.

Nvidia's approach makes it easy to manipulate output video in a number of useful ways. For example, a common problem with video conferencing is that the camera sits off-center from the screen, which makes a person appear to be looking to the side. Nvidia's neural network can fix this by rotating the keypoints of a user's face so that they are centered. Nvidia isn't the first company to do this. Apple has been working on its own version of this feature for FaceTime. But it's possible that Nvidia's GAN-based approach will be more powerful, allowing modifications to the entire face rather than just the eyes.
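As a toy illustration of that kind of correction (not Nvidia's implementation): if the keypoints are treated as 3D coordinates centered on the head, turning the face back toward the camera amounts to applying a rotation to every point before the GAN redraws the frame.

```python
import numpy as np

def rotate_keypoints_yaw(keypoints_3d: np.ndarray, degrees: float) -> np.ndarray:
    """Rotate (x, y, z) keypoints about the vertical (y) axis by `degrees`."""
    theta = np.radians(degrees)
    rotation = np.array([
        [ np.cos(theta), 0.0, np.sin(theta)],
        [ 0.0,           1.0, 0.0          ],
        [-np.sin(theta), 0.0, np.cos(theta)],
    ])
    return keypoints_3d @ rotation.T

# Hypothetical head-centered landmarks; undo a 15-degree off-center gaze.
keypoints = np.random.rand(68, 3) - 0.5
centered = rotate_keypoints_yaw(keypoints, degrees=-15.0)
```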

Nvidia Maxine can also replace a subject's real head with an animated character who performs the same actions. Again, this isn't new—Snapchat popularized the concept a few years ago, and it has become common on video chat apps. But Nvidia's GAN-based approach could enable more realistic images that work in a wider range of head positions.

Maxine in the cloud?

Nvidia CEO Jen-Hsun Huang.
Patrick T. Fallon/Bloomberg via Getty Images

Maxine isn't a consumer product. Rather, it's a software development kit for building video conferencing software. Nvidia is providing developers with a number of different capabilities and letting them decide how to put them together into a usable product.

And at least the initial version of Maxine will come with an important limitation: it requires a recent Nvidia GPU on the receiving end of the video stream. Maxine is built atop tensor cores, compute units in newer Nvidia graphics cards that are optimized for machine-learning operations. This poses a challenge for a video-conferencing product, since customers are going to expect support for a wide variety of hardware.

When I asked an Nvidia rep about this, he argued that developers could run Maxine on a cloud server equipped with the necessary Nvidia hardware, then stream the rendered video to client devices. This approach allows developers to capture some but not all of Maxine's benefits. Developers can use Maxine to re-orient a user's face to improve eye contact, replace a user's background, and perform effects like turning a subject's face into an animated character. Using Maxine this way can also save bandwidth on a user's video uplink, since Maxine's keypoint extraction technology doesn't require an Nvidia GPU.

Still, Maxine's strongest selling point is probably its dramatically smaller bandwidth requirements. And the full bandwidth savings can only be realized if video generation occurs on client devices. That would require Maxine to support devices without Nvidia GPUs.

When I asked Nvidia whether it planned to add support for non-Nvidia GPUs, it declined to comment on future product plans.

Right now, Maxine is in the "early access" stage of development. Nvidia is offering access to a select group of early developers who are helping Nvidia refine Maxine's APIs. At some point in the future—again, Nvidia wouldn't say when—Nvidia will open the platform to software developers generally.

And of course, Nvidia is unlikely to maintain a monopoly on this approach to video conferencing. As far as I can tell, other major tech companies have not yet announced plans to use GANs to improve video conferencing. But Google, Apple, and Qualcomm have all been working to build more powerful chips to support machine learning on smartphones. It's a safe bet that engineers at these companies are exploring the possibility of Maxine-like video compression using neural networks. Apple may be particularly well-positioned to develop software like this given the tight integration of its hardware and software.
