Machine-learning system doesn’t require costly hand-annotated data.
In recent years, computers have gotten remarkably good at recognizing speech and images: Think of the dictation software on most cellphones, or the algorithms that automatically identify people in photos posted to Facebook.
But recognition of natural sounds, such as crowds cheering or waves crashing, has lagged behind. That’s because most automated recognition systems, whether they process audio or visual information, are the result of machine learning, in which computers search for patterns in huge compendia of training data. Usually, the training data first has to be annotated by hand, which is prohibitively expensive for all but the highest-demand applications.
Sound recognition may be catching up, however, thanks to researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). At the Neural Information Processing Systems conference next week, they will present a sound-recognition system that outperforms its predecessors but didn’t require hand-annotated data during training.
Instead, the researchers trained the system on video. First, existing computer vision systems that recognize scenes and objects categorized the images in the video. The new system then found correlations between those visual categories and natural sounds.
“Computer vision has gotten so good that we can transfer it to other domains,” says Carl Vondrick, an MIT graduate student in electrical engineering and computer science and one of the paper’s two first authors. “We’re capitalizing on the natural synchronization between vision and sound. We scale up with tons of unlabeled video to learn to understand sound.”
The researchers tested their system on two standard databases of annotated sound recordings, and it was between 13 and 15 percent more accurate than the best-performing previous system. On a data set with 10 different sound categories, it could categorize sounds with 92 percent accuracy, and on a data set with 50 categories it performed with 74 percent accuracy. On those same data sets, humans are 96 percent and 81 percent accurate, respectively.
“Even humans are ambiguous,” says Yusuf Aytar, the paper’s other first author and a postdoc in the lab of MIT professor of electrical engineering and computer science Antonio Torralba. Torralba is the final co-author on the paper.
“We did an experiment with Carl,” Aytar says. “Carl was looking at the computer monitor, and I couldn’t see it. He would play a recording and I would try to guess what it was. It turns out this is really, really hard. I could tell indoor from outdoor, basic guesses, but when it comes to the details — ‘Is it a restaurant?’ — those details are missing. Even for annotation purposes, the task is really hard.”
Because it takes far less power to collect and process audio data than it does to collect and process visual data, the researchers envision that a sound-recognition system could be used to improve the context sensitivity of mobile devices.
When coupled with GPS data, for instance, a sound-recognition system could determine that a cellphone user is in a movie theater and that the movie has started, and the phone could automatically route calls to a prerecorded outgoing message. Similarly, sound recognition could improve the situational awareness of autonomous robots.
“For instance, think of a self-driving car,” Aytar says. “There’s an ambulance coming, and the car doesn’t see it. If it hears it, it can make future predictions for the ambulance — which path it’s going to take — just purely based on sound.”
The researchers’ machine-learning system is a neural network, so called because its architecture loosely resembles that of the human brain. A neural net consists of processing nodes that, like individual neurons, can perform only rudimentary computations but are densely interconnected. Information — say, the pixel values of a digital image — is fed to the bottom layer of nodes, which processes it and feeds it to the next layer, which processes it and feeds it to the next layer, and so on. The training process continually modifies the settings of the individual nodes, until the output of the final layer reliably performs some classification of the data — say, identifying the objects in the image.
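As a rough illustration of that layer-by-layer flow, here is a minimal sketch in plain Python. The layer sizes, random weights, and four-value "image" are hypothetical placeholders, not the researchers' architecture, and the sketch shows only the forward pass; training, which would adjust the weights, is left out.

```python
import math
import random

def layer_forward(inputs, weights, biases):
    """One layer: each node computes a weighted sum of its inputs, then a ReLU."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(max(0.0, total))  # rudimentary per-node computation
    return outputs

def softmax(scores):
    """Turn the final layer's scores into probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_layer(n_in, n_out, rng):
    """Hypothetical untrained layer: random weights, zero biases."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(pixels, layers):
    """Feed the input to the bottom layer and pass each result upward."""
    activations = pixels
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return softmax(activations)

rng = random.Random(0)
layers = [make_layer(4, 5, rng), make_layer(5, 3, rng)]  # 4 inputs -> 5 nodes -> 3 classes
pixels = [0.1, 0.9, 0.4, 0.7]  # stand-in for pixel values
class_probs = forward(pixels, layers)
```

With untrained weights the three output probabilities are meaningless; the point is only that each layer consumes the previous layer's output.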
Vondrick, Aytar, and Torralba first trained a neural net on two large, annotated sets of images: one, the ImageNet data set, contains labeled examples of images of 1,000 different objects; the other, the Places data set created by Torralba’s group, contains labeled images of 401 different scene types, such as a playground, bedroom, or conference room.
Once the network was trained, the researchers fed it 26 terabytes of video downloaded from the photo-sharing site Flickr. “It’s about 2 million unique videos,” Vondrick says. “If you were to watch all of them back to back, it would take you about two years.” Then they trained a second neural network on the audio from the same videos. The second network’s goal was to correctly predict the object and scene tags produced by the first network.
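The idea of one network learning from another's output can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the "audio features" are two made-up numbers per clip, the vision network's tags are made-up probabilities over three categories, the student is a single linear layer rather than a deep network, and the cross-entropy loss is a common choice for matching soft targets that the article does not confirm.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def student_predict(weights, features):
    """Audio model: one linear layer followed by softmax (a stand-in for a deep net)."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)

def cross_entropy(target, predicted):
    """How far the audio model's guess is from the vision model's tags."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

def mean_loss(weights, clips):
    return sum(cross_entropy(t, student_predict(weights, x)) for x, t in clips) / len(clips)

def train_step(weights, clips, learning_rate):
    """One full-batch gradient-descent step on the mean cross-entropy."""
    grads = [[0.0] * len(weights[0]) for _ in weights]
    for features, teacher_probs in clips:
        predicted = student_predict(weights, features)
        for k, (p, t) in enumerate(zip(predicted, teacher_probs)):
            for d, x in enumerate(features):
                grads[k][d] += (p - t) * x / len(clips)
    return [[w - learning_rate * g for w, g in zip(row, grad_row)]
            for row, grad_row in zip(weights, grads)]

# Hypothetical data: (audio features, visual-category probabilities from the vision net).
# No human labels appear anywhere; the vision net's output is the only supervision.
clips = [
    ([1.0, 0.0], [0.8, 0.1, 0.1]),
    ([0.0, 1.0], [0.1, 0.8, 0.1]),
    ([0.7, 0.7], [0.1, 0.1, 0.8]),
]
weights = [[0.0, 0.0] for _ in range(3)]
initial_loss = mean_loss(weights, clips)
for _ in range(200):
    weights = train_step(weights, clips, learning_rate=0.5)
final_loss = mean_loss(weights, clips)
```

After training, `final_loss` is lower than `initial_loss`, meaning the audio model's predictions have moved toward the tags the vision model produced for the same clips.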
The result was a network that could interpret natural sounds in terms of image categories. For instance, it might determine that the sound of birdsong tends to be associated with forest scenes and pictures of trees, birds, birdhouses, and bird feeders.
To compare the sound-recognition network’s performance to that of its predecessors, however, the researchers needed a way to translate its language of images into the familiar language of sound names. So they trained a simple machine-learning system to associate the outputs of the sound-recognition network with a set of standard sound labels.
For that, the researchers did use a database of annotated audio — one with 50 categories of sound and about 2,000 examples. Those annotations had been supplied by humans. But it’s much easier to label 2,000 examples than to label 2 million. And the MIT researchers’ network, trained first on unlabeled video, significantly outperformed all previous networks trained solely on the 2,000 labeled examples.
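The article does not say which simple machine-learning system was used for this last step, so the sketch below substitutes a nearest-centroid classifier as one plausible stand-in. The three-number vectors (imagined scores for forest, street, and beach scenes) and the sound labels are hypothetical.

```python
def centroid(vectors):
    """Average a list of equal-length vectors, component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def fit(examples):
    """Group labeled network outputs by sound name and average each group."""
    grouped = {}
    for vector, label in examples:
        grouped.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in grouped.items()}

def predict(centroids, vector):
    """Return the sound name whose centroid is closest to the given output."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(center, vector))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical outputs of the sound network (scores for forest, street, beach),
# paired with human-supplied sound labels from a small annotated set.
labeled_examples = [
    ([0.90, 0.05, 0.05], "birdsong"),
    ([0.80, 0.10, 0.10], "birdsong"),
    ([0.10, 0.80, 0.10], "car horn"),
    ([0.05, 0.90, 0.05], "car horn"),
]
centroids = fit(labeled_examples)
guess = predict(centroids, [0.70, 0.20, 0.10])
```

Because the heavy lifting was done on unlabeled video, this final step needs only a small number of labeled recordings per category.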
“With the modern machine-learning approaches, like deep learning, you have many, many trainable parameters in many layers in your neural-network system,” says Mark Plumbley, a professor of signal processing at the University of Surrey. “That normally means that you have to have many, many examples to train that on. And we have seen that sometimes there’s not enough data to be able to use a deep-learning system without some other help. Here the advantage is that they are using large amounts of other video information to train the network and then doing an additional step where they specialize the network for this particular task. That approach is very promising because it leverages this existing information from another field.”
Plumbley says that both he and colleagues at other institutions have been involved in efforts to commercialize sound-recognition software for applications such as home security, where it might, for instance, respond to the sound of breaking glass. Other uses might include eldercare, where it could identify potentially alarming deviations from ordinary sound patterns, or the control of sound pollution in urban areas. “I really think that there’s a lot of potential in the sound-recognition area,” he says.