Future Tense

What if Zoom Could Read Your Facial Expression?

A man sits at a desk in front of a laptop, on a video call with one person. (Photo: LinkedIn Sales Solutions/Unsplash)

Right now, companies are developing and selling AI products intended to tell your boss, or your teacher, how you’re feeling while on camera. Emotion AI supposedly captures our expressions and micro-expressions via computer vision and then spits out some sort of score that says whether someone is engaged with what’s being said in a virtual classroom, or even whether they’re responding well to a sales pitch. But what’s unclear is how well it really works, or whether it works at all.

On Friday’s episode of What Next: TBD, I spoke with Kate Kaye, a reporter for Protocol, about whether AI can really know what you’re feeling. Our conversation has been edited and condensed for clarity.

Lizzie O’Leary: Can you explain what emotion AI is?

Kate Kaye: What most of these technologies do is take facial expression data, using a camera of some sort that is ingesting imagery, and apply computer vision to detect what kind of facial expression it might be categorized as. Other types of data used in emotion AI include tone of voice, and a system might also use things like the text of what’s being said.
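
To make that description a little more concrete, here’s a minimal, hypothetical sketch in Python of how those pieces might fit together. Every name in it is invented for illustration, and the three classifiers are stubs standing in for the proprietary computer-vision, audio, and text models a vendor would actually ship. The only point is the shape of the pipeline: separate guesses from face, voice, and words, fused into a single label.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the proprietary models an emotion AI vendor
# would actually use; each one returns a label and a confidence score.

@dataclass
class Signal:
    label: str         # e.g. "happy", "confused", "neutral"
    confidence: float   # between 0.0 and 1.0

def classify_face(frame_pixels: bytes) -> Signal:
    """Stand-in for a computer-vision model that maps a video frame to an expression."""
    return Signal(label="neutral", confidence=0.55)   # dummy output

def classify_voice(audio_chunk: bytes) -> Signal:
    """Stand-in for an audio model that maps a snippet of speech to a tone of voice."""
    return Signal(label="calm", confidence=0.70)      # dummy output

def classify_text(transcript: str) -> Signal:
    """Stand-in for a sentiment model that looks at the words themselves."""
    return Signal(label="positive", confidence=0.62)  # dummy output

def fuse(signals: list) -> Signal:
    """Naive fusion rule: keep whichever single signal the models are most sure about."""
    return max(signals, key=lambda s: s.confidence)

if __name__ == "__main__":
    frame, audio, words = b"...", b"...", "That price seems a little high."
    estimate = fuse([classify_face(frame), classify_voice(audio), classify_text(words)])
    print(f"Inferred state: {estimate.label} ({estimate.confidence:.0%} confidence)")
```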

Emotion AI is being used to track whether customer service workers are getting through to their clients, how audiences respond to ads, and whether drivers seem distracted behind the wheel. I was really struck by one of your stories about the human problem of sales during the pandemic. Sales is an occupation that requires so much attention to these little cues—to tone of voice, to the smile, to the eye that wanders—and it’s really difficult to do it online. Can you walk me through how these companies propose to use emotion AI in sales?

One company that I write about is called Uniphore. They sell software for sales, and their system works in real time. Let’s say you’re on a Zoom call. The first thing it does is ask to record. Then, if I’m the salesperson, I see a box on my screen that is gauging in real time the engagement level and the sentiment of either the one person or the room, and it’s measuring, “Oh, engagement just went up,” or “Engagement just totally plummeted when you mentioned the price of the software you’re selling.”
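
What “gauging engagement in real time” could look like under the hood is, in rough and hypothetical outline, something like the Python sketch below: a rolling average of per-frame engagement scores, with an alert whenever the average jumps or drops sharply. The scoring model a real product would rely on is deliberately left out; the caller just feeds in numbers.

```python
from collections import deque
from typing import Optional

class EngagementGauge:
    """Hypothetical rolling gauge of per-frame engagement scores (0.0 to 1.0).

    A real product would feed this from a computer-vision model scoring each
    video frame; here the caller just supplies the numbers.
    """

    def __init__(self, window: int = 30, jump: float = 0.15):
        self.scores = deque(maxlen=window)  # most recent `window` frame scores
        self.jump = jump                    # change big enough to flag to the rep

    def update(self, score: float) -> Optional[str]:
        previous = sum(self.scores) / len(self.scores) if self.scores else None
        self.scores.append(score)
        current = sum(self.scores) / len(self.scores)
        if previous is None:
            return None
        if current - previous >= self.jump:
            return "engagement just went up"
        if previous - current >= self.jump:
            return "engagement just plummeted"
        return None

gauge = EngagementGauge(window=3, jump=0.1)
for score in [0.6, 0.62, 0.61, 0.2]:      # simulated per-frame scores
    alert = gauge.update(score)
    if alert:
        print(alert)                      # prints "engagement just plummeted"
```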

Is it just engagement that they’re measuring or are they looking at other things as well?

This is just how this one company decided to productize it, where they’re measuring the engagement level and the sentiment. They also provide reports. Maybe you’re a sales manager, and you want to know what all of your salespeople did this month in terms of what kind of engagement level they had. This company provides a report with a little blurb that says, “Wow, engagement spiked up 10 percent this month!” If you’re managing hundreds of salespeople, and you want not only to help them but also to keep tabs on who’s better at what they do beyond just their actual sales numbers, it’s a way to have some of this stuff quantified.

The idea of recording and evaluating sales experiences isn’t that new. Let’s say you call some customer service line—you’re used to that little thing that says this interaction may be recorded for quality control. But this goes further in part because it is real-time. What does the evidence say about whether machine learning can accurately read emotions?

Well, there has been research done recently that suggests it can’t.

In a large study released in 2019, a group of psychologists found that inferring emotions from facial movements is incredibly difficult. They spent two years looking at data and examined more than 1,000 underlying studies. Human feelings, they said, are simply too complex to be gauged from expression alone. If I scowl, I might be angry, but I also might just be confused or having trouble seeing something. Moreover, different expressions can mean different things depending on culture.

Whether people can even assess this stuff accurately is highly questionable. And I think we have to remember that on its own, as a baseline question to ask before we ask whether or not technology can do it.

What were these models trained on in order to recognize what my face is doing, and what I might be thinking on the inside?

In the world of AI, there’s a whole labor force based all over the world in countries where the labor’s a lot less expensive, and people are hired to label and annotate individual pieces of data that are then fed into AI systems to train them. These companies hired people and gave them guidance for what would be considered a happy face versus a sad face, or a confused look versus an engaged look, or whatever it might be. If there’s a discrepancy among multiple data labelers about how a piece of data should be labeled, it only goes into the training set if enough of them agree. The example that’s often used when we talk about data labeling is people looking at images of apples and bananas. Yeah, we know the difference between a dog and a cat, and an apple and a banana—nobody’s going to argue about that one. Facial expressions are a little different. Shouldn’t they be treated as something different?
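
Here is a hypothetical Python sketch of that kind of agreement rule: each example keeps only the label enough annotators converged on, and contested examples are thrown out of the training set. The threshold and the data are made up; the takeaway is simply that ambiguous expression frames get discarded far more often than apples and bananas would.

```python
from collections import Counter

def filter_by_agreement(annotations, min_agreement=2):
    """Keep only examples where at least `min_agreement` labelers chose the same label.

    `annotations` maps an example ID to the list of labels its annotators assigned.
    This is a generic majority-style rule, not any particular vendor's actual process.
    """
    kept = {}
    for example_id, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]
        if count >= min_agreement:
            kept[example_id] = label   # this label goes into the training set
    return kept

# Apples vs. bananas: labelers rarely disagree, so nearly everything survives.
# Facial expressions: the same frame can read as confused, bored, or neutral.
annotations = {
    "frame_001": ["happy", "happy", "happy"],
    "frame_002": ["confused", "bored", "neutral"],   # no agreement: thrown out
    "frame_003": ["engaged", "engaged", "confused"],
}
print(filter_by_agreement(annotations))
# {'frame_001': 'happy', 'frame_003': 'engaged'}
```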

But it sounds like the train has already left the station in terms of products being created, whether or not those readings are accurate.

The train has left the station in terms of products being introduced, and you can probably assume that there’s going to be more. Zoom itself is considering integrating emotion AI-type assessments into sales software that it just released. The software doesn’t have it right now, but they told me, “We’re seriously considering doing this.” If Zoom does it, that’s a game-changer.

It would especially be game changing for online classrooms, which are a priority for the companies making this technology. Intel and a company called Classroom Technologies are developing an emotion AI-based system to run on top of Zoom.

The idea is: Let’s help the teacher gauge whether or not students are picking up on what he or she is teaching, whether or not a student might be confused about something, or whether a student is bored. At this stage, they told me, “We don’t even know how it might be integrated. We just feel like what Intel is providing here is something that could be used as an additional signal for a teacher to use. And so, we don’t really know what it’s going to look like yet.”

Right now, this all feels like early proof of concept testing. What is the likelihood that something like what Intel is working on is going to become a product sold to educational markets?

A company like Intel can call it a proof of concept, but that is part of the process of turning something into a product, ultimately, and vetting whether it should be or could be. It seems like whenever there are technologies that have been adopted, if there’s a way to add a new feature or find a new market for them, it’s going to happen. If we look at the fact that Zoom is possibly going to integrate emotion AI into its sales software … Zoom’s used for classrooms all over the world! If Zoom sees value in having this emotion AI stuff in a sales setting, maybe sometime down the road, it ends up as a feature in your classroom.

The companies making emotion AI software have been attracting a lot of deep-pocketed funders. Classroom Technologies is backed by investors including NFL quarterback Tom Brady, AOL co-founder Steve Case, and Salesforce Ventures. In all your reporting, I was struck by the valuations on these software companies, and the investors, and the amount of money that is behind this. It feels like it has momentum. Do you think that’s correct?

Just the idea of emotion AI being incorporated into this stuff isn’t necessarily what’s driving the momentum. But you can look at a company like Uniphore, which I used as an example, which recently got a Series E round of funding of $400 million. That’s a lot of money. I think you could look at the fact that they’re out there really promoting this emotion AI component of their technology as a key selling point, and the fact that they have all this money behind them, as a sign of that momentum. It just feels so different from other worlds of technology that I’ve seen.

This is your beat. Does emotion AI feel different from the other kinds of AI you cover?

I cover things that businesses use. A lot of times, what enterprises use, it’s like they’re doing data analytics and they just want to improve their efficiency as a company, or maybe they’re in manufacturing, and they want to predict when a piece of equipment needs maintenance. A lot of times, AI is somewhat mundane in terms of its use, and it might also incorporate data that has nothing to do with people, at least in any direct way. In this case, it feels different to me because we’re talking about a very physical, very personal component of who we are as people and our bodies. We think of this stuff as biometric data, and it has this really sterile terminology associated with it. The fact is, it’s usually referring to how we walk, how we talk, what our face looks like. This is who we are.

Future Tense is a partnership of Slate, New America, and Arizona State University that examines emerging technologies, public policy, and society.