
Nvidia developed a radically different way to compress video calls

Nvidia Maxine uses Generative Adversarial Networks to re-create video frames.

Instead of transmitting an image for every frame, Maxine sends keypoint data that allows the receiving computer to re-create the face using a neural network.
Nvidia

Last month, Nvidia announced a new platform called Maxine that uses AI to enhance the performance and functionality of video conferencing software. The software uses a neural network to create a compact representation of a person's face. This compact representation can then be sent across the network, where a second neural network reconstructs the original image—possibly with helpful modifications.

Nvidia says that its technique can reduce the bandwidth needs of video conferencing software by a factor of 10 compared to conventional compression techniques. It can also change how a person's face is displayed. For example, if someone appears to be looking off to one side because of where her camera is positioned, the software can rotate her face so that she appears to look straight at the camera instead. Software can also replace someone's real face with an animated avatar.

Maxine is a software development kit, not a consumer product. Nvidia is hoping third-party software developers will use Maxine to improve their own video conferencing software. And the software comes with an important limitation: the device receiving a video stream needs an Nvidia GPU with tensor core technology. To support devices without an appropriate graphics card, Nvidia recommends that video frames be generated in the cloud—an approach that may or may not work well in practice.

But regardless of how Maxine fares in the marketplace, the concept seems likely to be important for video streaming services in the future. Before too long, most computing devices will be powerful enough to generate real-time video content using neural networks. Maxine and products like it could allow for higher-quality video streams with much lower bandwidth consumption.

Dueling neural networks

A generative adversarial network turns sketches of handbags into photorealistic images of handbags.

Maxine is built on a machine-learning technique called a generative adversarial network (GAN).

A GAN is built from neural networks: complex mathematical functions that take numerical inputs and produce numerical outputs. For visual applications, the input to a neural network is typically a pixel-by-pixel representation of an image. One famous neural network, for example, took images as inputs and output the estimated probability that the image fell into each of 1,000 categories like "dalmatian" and "mushroom."
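To make that concrete, here's a minimal sketch (in PyTorch) of a network treated as a function from pixel values to category scores. The architecture and layer sizes are invented for illustration, not taken from any real classifier:

```python
import torch
import torch.nn as nn

# Illustrative only: a tiny "classifier" mapping a 224x224 RGB image to
# scores for 1,000 categories. Real image classifiers are far larger.
classifier = nn.Sequential(
    nn.Flatten(),                      # 3 x 224 x 224 pixels -> 150,528 numbers
    nn.Linear(3 * 224 * 224, 512),
    nn.ReLU(),
    nn.Linear(512, 1000),              # one score per category
)

image = torch.rand(1, 3, 224, 224)                 # numerical input: one stand-in RGB image
probabilities = classifier(image).softmax(dim=1)   # estimated probability per category
print(probabilities.shape)                         # torch.Size([1, 1000])
```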

Neural networks have thousands—often millions—of tunable parameters. The network is trained by evaluating its performance against real-world data. The network is shown a real-world input (like a picture of a dog) whose correct classification is known to the training software (perhaps "dalmatian"). The training software then uses a technique called back-propagation to optimize the network's parameters. Values that pushed the network toward the right answer are boosted, while those that contributed to a wrong answer get dialed back. After repeating this process on thousands—even millions—of examples, the network may become quite effective at the task it's being trained for.
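As a rough illustration of that loop (again with made-up sizes and random stand-in data), the core of supervised training with back-propagation looks something like this:

```python
import torch
import torch.nn as nn

# Stand-in model and labeled data; the point here is the training loop, not the model.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1000))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.rand(8, 3, 64, 64)        # a batch of training images
labels = torch.randint(0, 1000, (8,))    # their known correct categories

for step in range(100):                  # in practice: many passes over huge datasets
    logits = model(images)               # forward pass: the network's current guesses
    loss = loss_fn(logits, labels)       # how far the guesses are from the right answers
    optimizer.zero_grad()
    loss.backward()                      # back-propagation: compute parameter gradients
    optimizer.step()                     # nudge parameters toward better answers
```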

Training software needs to know the correct answer for each input. For this reason, classic machine-learning projects often required people to label thousands of examples by hand. But the training process can be greatly sped up if there's a way to automatically generate training data.

A generative adversarial network is a clever way to train a neural network without the need for human beings to label the training data. As the name implies, a GAN is actually two networks that "compete" against one another.

The first network is a generator that takes random data as an input and tries to produce a realistic image. The second network is a discriminator that takes an image and tries to determine whether it's a real image or a forgery created by the first network.

The training software runs these two networks simultaneously, with each network's results being used to train the other (a bare-bones code sketch of this loop follows the list):

  • The discriminator's answers are used to train the generator. When the discriminator wrongly classifies a generator-created photo as genuine, that means the generator is doing a good job of creating realistic images—so parameters that led to that result are reinforced. On the other hand, if the discriminator classifies an image as a forgery, that's treated as a failure for the generator.
  • Meanwhile, training software shows the discriminator a random selection of images that are either real or created by the generator. If the discriminator guesses right, that's treated as a success, and the discriminator network's parameters are updated to reflect that.
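
Here is a bare-bones sketch of that two-network loop in PyTorch. The architectures, sizes, and random stand-in "real" data are all invented for illustration; production GANs like StyleGAN are vastly more elaborate:

```python
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28     # illustrative sizes (e.g. tiny 28x28 images)

# Generator: random noise in, fake image out.
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, image_dim), nn.Tanh())
# Discriminator: image in, single "is this real?" score out.
D = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real_images = torch.rand(32, image_dim) * 2 - 1   # stand-in for real training photos

for step in range(1000):
    # Train the discriminator: real images are labeled 1, the generator's fakes 0.
    fakes = G(torch.randn(32, latent_dim)).detach()      # don't update G on this pass
    d_loss = bce(D(real_images), torch.ones(32, 1)) + \
             bce(D(fakes), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator: it succeeds when the discriminator calls its fakes real.
    g_loss = bce(D(G(torch.randn(32, latent_dim))), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```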

At the start of training, both networks are bad at their jobs, but they improve over time. As the quality of the generator's images improves, the discriminator has to become more sophisticated to detect fakes. And as the discriminator becomes more discriminating, the generator gets trained to make photos that look more and more realistic.

The results can be spectacular. A website called ThisPersonDoesNotExist.com does exactly what it sounds like: it generates realistic photographs of human beings that don't exist.

The site is powered by a generative neural network called StyleGAN that was developed by researchers at Nvidia. Over the last decade, as Nvidia's graphics cards have become one of the most popular ways to do neural network computations, Nvidia has invested heavily in academic research into neural network techniques.

Applications for GANs have proliferated

Researchers used a conditional GAN to project how a face would age over time.

The earliest GANs just tried to produce random realistic-looking images within a broad category like human faces. These are known as unconditional GANs. More recently, researchers have developed conditional GANs—neural networks that take an image (or other input data) and then try to produce a corresponding output image.

In some cases, the training algorithm provides the same input information to both the generator and the discriminator. In other cases, the generator's loss function—the measure of how well the network did for training purposes—combines the output of the discriminator with some other metric that judges how well the output fits the input data.
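For example, a common recipe (used in pix2pix-style models; Nvidia hasn't confirmed Maxine's exact losses) combines the discriminator's judgment with an L1 term that measures how closely the output matches the target image. A sketch of such a generator loss:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial term: did the discriminator buy the fake?
l1 = nn.L1Loss()               # reconstruction term: does the output match the target?

def generator_loss(d_score_on_fake, fake_image, target_image, l1_weight=100.0):
    # l1_weight is an assumed hyperparameter; 100 is a common choice in the literature.
    adversarial = bce(d_score_on_fake, torch.ones_like(d_score_on_fake))
    reconstruction = l1(fake_image, target_image)
    return adversarial + l1_weight * reconstruction
```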

This approach has a wide range of applications. Researchers have used conditional GANs to generate works of art from textual descriptions, to generate photographs from sketches, to generate maps from satellite images, to predict how people will look when they're older, and a lot more.

This brings us back to Nvidia Maxine. Nvidia hasn't provided full details on how the technology works, but it did point us to a 2019 paper that described some of the underlying algorithms powering Maxine.

The paper describes a conditional GAN that takes as input a video of one person's face talking and a few photos of a second person's face. The generator creates a video of the second person making the same motions as the person in the original video.

Nvidia's experimental GAN created videos that showed one person (top) making the motions of a second person in an input video (left).
Ting-Chun Wang et al, Nvidia.

Nvidia's new video conferencing software uses a slight modification of this technique. Instead of taking a video as input, Maxine takes a set of keypoints extracted from the source video—data points specifying the location and shape of the subject's eyes, mouth, nose, eyebrows, and other facial features. This data can be represented far more compactly than an ordinary video, which means it can be transmitted across the network using very little bandwidth. The sending computer also transmits a high-resolution reference frame so that the recipient knows what the subject looks like. The receiver's computer then uses a conditional GAN to reconstruct the subject's face from the keypoints and that reference image.
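To see why keypoints are so cheap to send, here's a back-of-the-envelope sketch. The keypoint count, coordinate precision, and packing format are all assumptions for illustration, not Nvidia's actual wire format:

```python
import numpy as np

NUM_KEYPOINTS = 68                                   # assumed facial-landmark count

def pack_keypoints(keypoints: np.ndarray) -> bytes:
    """Sender side: serialize (x, y) landmark coordinates for transmission."""
    return keypoints.astype(np.float16).tobytes()    # 68 points * 2 coords * 2 bytes = 272 bytes

def unpack_keypoints(payload: bytes) -> np.ndarray:
    """Receiver side: recover the coordinates to feed the conditional GAN."""
    return np.frombuffer(payload, dtype=np.float16).reshape(NUM_KEYPOINTS, 2)

keypoints = np.random.rand(NUM_KEYPOINTS, 2).astype(np.float32)
payload = pack_keypoints(keypoints)

raw_frame_bytes = 1280 * 720 * 3                     # one uncompressed 720p RGB frame
print(f"{len(payload)} bytes of keypoints vs {raw_frame_bytes:,} bytes of raw pixels")
```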

A key feature of the network Nvidia researchers described in 2019 is that it wasn't specific to one face. A single network could be trained to generate videos of different people based on the photos provided as inputs. The practical benefit for Maxine is that there's no need to train a new network for each user. Instead, Nvidia can provide a pre-trained generator network that can draw anyone's face. Using a pre-trained network requires far less computing power than training a new network from scratch.

Nvidia's approach makes it easy to manipulate output video in a number of useful ways. For example, a common problem with video conferencing is that the camera sits off-center from the screen, which makes a person appear to be looking to the side. Nvidia's neural network can fix this by rotating the keypoints of a user's face so that they are centered. Nvidia isn't the first company to do this. Apple has been working on its own version of this feature for FaceTime. But it's possible that Nvidia's GAN-based approach will be more powerful, allowing modifications to the entire face rather than just the eyes.
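As a toy illustration of that kind of correction (not Nvidia's implementation): if the keypoints are treated as 3D coordinates centered on the head, turning the face back toward the camera amounts to applying a rotation to every point before the GAN redraws the frame.

```python
import numpy as np

def rotate_keypoints_yaw(keypoints_3d: np.ndarray, degrees: float) -> np.ndarray:
    """Rotate (x, y, z) keypoints about the vertical (y) axis by `degrees`."""
    theta = np.radians(degrees)
    rotation = np.array([
        [ np.cos(theta), 0.0, np.sin(theta)],
        [ 0.0,           1.0, 0.0          ],
        [-np.sin(theta), 0.0, np.cos(theta)],
    ])
    return keypoints_3d @ rotation.T

# Hypothetical head-centered landmarks; undo a 15-degree off-center gaze.
keypoints = np.random.rand(68, 3) - 0.5
centered = rotate_keypoints_yaw(keypoints, degrees=-15.0)
```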

Nvidia Maxine can also replace a subject's real head with an animated character who performs the same actions. Again, this isn't new—Snapchat popularized the concept a few years ago, and it has become common on video chat apps. But Nvidia's GAN-based approach could enable more realistic images that work in a wider range of head positions.

Maxine in the cloud?

Nvidia CEO Jen-Hsun Huang.
Patrick T. Fallon/Bloomberg via Getty Images

Maxine isn't a consumer product. Rather, it's a software development kit for building video conferencing software. Nvidia is providing developers with a number of different capabilities and letting them decide how to put them together into a usable product.

And at least the initial version of Maxine will come with an important limitation: it requires a recent Nvidia GPU on the receiving end of the video stream. Maxine is built atop tensor cores, compute units in newer Nvidia graphics cards that are optimized for machine-learning operations. This poses a challenge for a video-conferencing product, since customers are going to expect support for a wide variety of hardware.

When I asked an Nvidia rep about this, he argued that developers could run Maxine on a cloud server equipped with the necessary Nvidia hardware, then stream the rendered video to client devices. This approach allows developers to capture some but not all of Maxine's benefits. Developers can use Maxine to re-orient a user's face to improve eye contact, replace a user's background, and perform effects like turning a subject's face into an animated character. Using Maxine this way can also save bandwidth on a user's video uplink, since Maxine's keypoint extraction technology doesn't require an Nvidia GPU.

Still, Maxine's strongest selling point is probably its dramatically smaller bandwidth requirements. And the full bandwidth savings can only be realized if video generation occurs on client devices. That would require Maxine to support devices without Nvidia GPUs.

When I asked Nvidia whether it planned to add support for non-Nvidia GPUs, it declined to comment on future product plans.

Right now, Maxine is in the "early access" stage of development. Nvidia is offering access to a select group of early developers who are helping Nvidia refine Maxine's APIs. At some point in the future—again, Nvidia wouldn't say when—Nvidia will open the platform to software developers generally.

And of course, Nvidia is unlikely to maintain a monopoly on this approach to video conferencing. As far as I can tell, other major tech companies have not yet announced plans to use GANs to improve video conferencing. But Google, Apple, and Qualcomm have all been working to build more powerful chips to support machine learning on smartphones. It's a safe bet that engineers at these companies are exploring the possibility of Maxine-like video compression using neural networks. Apple may be particularly well-positioned to develop software like this given the tight integration of its hardware and software.
