One day in the fall of their sophomore year, Matthew Fernandez and Akash Krishnan were at Akash’s house in Portland, Ore., trying to come up with an idea for their school’s science fair. At Oregon Episcopal School, all students in 7th to 11th grade are required to enter a project in the Aardvark Science Expo (the aardvark is the school’s mascot), and these two had teamed up for the last three years. Temporarily defeated, they popped in a DVD of “I, Robot.”
There’s a scene in the movie when Will Smith, who plays a robot-hating cop, visits Bridget Moynahan, the impossibly gorgeous scientist, and they begin to argue. She gets angry. Her personal robot immediately walks into the room and asks: “Is everything all right, Ma’am? I detected elevated stress patterns in your voice.” It’s a minor exchange — a computer recognizing emotion in a human voice — in a movie full of futuristic robots wreaking havoc, but it was an aha moment for a desperate research team. Their reaction, as Matt describes it, was: “ ‘Hey, that’s really cool. I wonder if there’s any science there.’ ”
There was — it was just really hard. With emotion recognition, they stumbled onto a thorny problem. Computers have become very good at parsing an audio signal into specific words and identifying their meaning. But spoken language is more than just semantics. “If I say ‘happy’ and you say ‘happy,’ it’ll sound kind of similar, and a computer can try to match that up,” Matt explains. But it’s far from clear what elements in an audio signal indicate happiness or anger as a quality of voice. Trying to figure that out quickly consumed them. Matt stayed up late reading research papers, ignoring his other homework. Akash was up until 3 a.m. many nights, reading and programming. They spent long hours at each other’s houses or talking on Skype.
The research paper they submitted for the school expo was 30 pages of code and 60 pages of writing to explain it. “Emotion is innately meta information,” Matt says, “and that’s why it’s a real challenge. A lot of people base their algorithms off of speech-recognition systems because those have been established. But emotion is a really different task, and it’s a different goal.” For one, in speech recognition, sequence is essential; get the sounds out of order, and you mess up the words. In emotion recognition, the order isn’t nearly as important as various measures of energy and pitch. Determining what information to pay attention to in the audio signal and how to process it involves imagination, some sticky calculus and a lot of trial and error. “We tried to think of something new,” Akash says of the algorithm they built, “instead of using what other people tried to do.” The algorithm they came up with allows them to determine the emotion of a speaker by measuring 57 different features of an audio signal against a prerecorded signal that’s already been defined by a human listener as, say, “happy” or “angry.” Their algorithm doesn’t yet recognize confidence, or sarcasm, but what it does do (imperfectly, but better than the rest of the field) is detect fear, anger, joy and sadness in real time, without eating up so much processing power as to be impractical in a handheld device.
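To make the idea of matching a clip against prerecorded, human-labeled recordings concrete, here is a minimal sketch in Python. It is not Matt and Akash's algorithm, which the article does not detail. It uses three illustrative measures (average energy, energy variation, and zero-crossing rate as a rough stand-in for pitch) instead of their 57. The function names, the synthetic tones standing in for speech, and the nearest-match rule are all assumptions made for illustration.

```python
import math


def frame_energies(samples, frame_size=256):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs that change sign (a crude pitch proxy)."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return crossings / (len(samples) - 1)


def features(samples):
    """Summary statistics that ignore the order in which frames occur."""
    energies = frame_energies(samples)
    mean_e = sum(energies) / len(energies)
    var_e = sum((e - mean_e) ** 2 for e in energies) / len(energies)
    return (mean_e, math.sqrt(var_e), zero_crossing_rate(samples))


def classify(samples, labeled):
    """Return the label whose stored feature vector is closest (Euclidean)."""
    f = features(samples)
    return min(labeled, key=lambda item: math.dist(f, item[1]))[0]


def tone(freq_hz, amplitude, n=4096, rate=16000):
    """Hypothetical stand-in for a recorded voice clip: a pure sine tone."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]


# Reference clips that a human listener has (hypothetically) already labeled.
labeled = [
    ("angry", features(tone(300, 0.9))),  # loud, higher-pitched
    ("sad", features(tone(120, 0.2))),    # quiet, lower-pitched
]

print(classify(tone(280, 0.8), labeled))
print(classify(tone(130, 0.25), labeled))
```

Because these features are averages and counts, playing a clip backward leaves them essentially unchanged, which mirrors the point that order matters far less here than in speech recognition. A real system would need many more features, scaled to comparable ranges, and actual labeled speech.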