Deep learning shows its thinking by explaining the reasoning behind its predictions

A Duke team trained a computer to identify up to 200 species of birds from just a photo. Given a photo of a mystery bird (top), the A.I. spits out heat maps showing which parts of the image are most similar to typical species features it has seen before

New research aims to open the ‘black box’ of computer vision

It can take years of birdwatching experience to tell one species from the next. But using an artificial intelligence technique called deep learning, Duke University researchers have trained a computer to identify up to 200 species of birds from just a photo.

The real innovation, however, is that the A.I. tool also shows its thinking, in a way that even someone who doesn’t know a penguin from a puffin can understand.

The team trained their deep neural network — algorithms based on the way the brain works — by feeding it 11,788 photos of 200 bird species to learn from, ranging from swimming ducks to hovering hummingbirds.

The researchers never told the network “this is a beak” or “these are wing feathers.” Given a photo of a mystery bird, the network is able to pick out important patterns in the image and hazard a guess by comparing those patterns to typical species traits it has seen before.

Along the way it spits out a series of heat maps that essentially say: “This isn’t just any warbler. It’s a hooded warbler, and here are the features — like its masked head and yellow belly — that give it away.”

Duke computer science Ph.D. student Chaofan Chen and undergraduate Oscar Li led the research, along with other team members of the Prediction Analysis Lab directed by Duke professor Cynthia Rudin.

They found their neural network is able to identify the correct species up to 84% of the time — on par with some of its best-performing counterparts, which don’t reveal how they are able to tell, say, one sparrow from the next.

Rudin says their project is about more than naming birds. It’s about visualizing what deep neural networks are really seeing when they look at an image.

Similar technology is used to tag people on social networking sites, spot suspected criminals in surveillance cameras, and train self-driving cars to detect things like traffic lights and pedestrians.

The problem, Rudin says, is that most deep learning approaches to computer vision are notoriously opaque. Unlike traditional software, deep learning software learns from the data without being explicitly programmed. As a result, exactly how these algorithms ‘think’ when they classify an image isn’t always clear.

Rudin and her colleagues are trying to show that A.I. doesn’t have to be that way. She and her lab are designing deep learning models that explain the reasoning behind their predictions, making it clear exactly why and how they came up with their answers. When such a model makes a mistake, its built-in transparency makes it possible to see why.

For their next project, Rudin and her team are using their algorithm to classify suspicious areas in medical images like mammograms. If it works, their system won’t just help doctors detect lumps, calcifications and other symptoms that could be signs of breast cancer. It will also show which parts of the mammogram it’s homing in on, revealing which specific features most resemble the cancerous lesions it has seen before in other patients.

In that way, Rudin says, their network is designed to mimic the way doctors make a diagnosis. “It’s case-based reasoning,” Rudin said. “We’re hoping we can better explain to physicians or patients why their image was classified by the network as either malignant or benign.”

Learn more: THIS A.I. BIRDWATCHER LETS YOU ‘SEE’ THROUGH THE EYES OF A MACHINE

 

The Latest on: Deep learning

via Google News

 

The Latest on: Deep learning

via  Bing News

 

Mimicking how humans visualize and identify objects with artificial intelligence

The “computer vision” system can identify objects based on only partial glimpses, like by using these photo snippets of a motorcycle.

Researchers from UCLA Samueli School of Engineering and Stanford have demonstrated a computer system that can discover and identify the real-world objects it “sees” based on the same method of visual learning that humans use.

The system is an advance in a type of technology called “computer vision,” which enables computers to read and identify visual images. It is an important step toward general artificial intelligence systems–computers that learn on their own, are intuitive, make decisions based on reasoning and interact with humans in a more human-like way. Although current AI computer vision systems are increasingly powerful and capable, they are task-specific, meaning their ability to identify what they see is limited by how much they have been trained and programmed by humans.

The “computer vision” system can identify objects based on only partial glimpses, like by using these photo snippets of a motorcycle.

Even today’s best computer vision systems cannot create a full picture of an object after seeing only certain parts of it–and the systems can be fooled by viewing the object in an unfamiliar setting. Engineers are aiming to make computer systems with those abilities–just like humans can understand that they are looking at a dog, even if the animal is hiding behind a chair and only the paws and tail are visible. Humans, of course, can also easily intuit where the dog’s head and the rest of its body are, but that ability still eludes most artificial intelligence systems.

Current computer vision systems are not designed to learn on their own. They must be trained on exactly what to learn, usually by reviewing thousands of images in which the objects they are trying to identify are labeled for them.

Computers, of course, also cannot explain their rationale for determining what the object in a photo represents: AI-based systems do not build an internal picture or a common-sense model of learned objects the way humans do.

The engineers’ new method, described in the Proceedings of the National Academy of Sciences, shows a way around these shortcomings.

The approach is made up of three broad steps. First, the system breaks up an image into small chunks, which the researchers call “viewlets.” Second, the computer learns how these viewlets fit together to form the object in question. And finally, it looks at what other objects are in the surrounding area, and whether or not information about those objects is relevant to describing and identifying the primary object.

To help the new system “learn” more like humans, the engineers decided to immerse it in an internet replica of the environment humans live in.

“Fortunately, the internet provides two things that help a brain-inspired computer vision system learn the same way humans do,” said Vwani Roychowdhury, a UCLA professor of electrical and computer engineering and the study’s principal investigator. “One is a wealth of images and videos that depict the same types of objects. The second is that these objects are shown from many perspectives–obscured, bird’s eye, up-close–and they are placed in different kinds of environments.”

To develop the framework, the researchers drew insights from cognitive psychology and neuroscience.

“Starting as infants, we learn what something is because we see many examples of it, in many contexts,” Roychowdhury said. “That contextual learning is a key feature of our brains, and it helps us build robust models of objects that are part of an integrated worldview where everything is functionally connected.”

The researchers tested the system with about 9,000 images, each showing people and other objects. The platform was able to build a detailed model of the human body without external guidance and without the images being labeled.

The engineers ran similar tests using images of motorcycles, cars and airplanes. In all cases, their system performed better or at least as well as traditional computer vision systems that have been developed with many years of training.

Learn more: New AI system mimics how humans visualize and identify objects

 

The Latest on: Computer vision

via Google News

 

The Latest on: Computer vision

via  Bing News

 

Robots could one day be able to see well enough to be useful in people’s homes and offices

PhD student Lucas Manuelli worked with lead author Pete Florence to develop a system that uses advanced computer vision to enable a Kuka robot to pick up virtually any object.
Photo: Tom Buehler/CSAIL

Breakthrough CSAIL system suggests robots could one day be able to see well enough to be useful in people’s homes and offices.

Humans have long been masters of dexterity, a skill that can largely be credited to the help of our eyes. Robots, meanwhile, are still catching up.

Certainly there’s been some progress: For decades, robots in controlled environments like assembly lines have been able to pick up the same object over and over again. More recently, breakthroughs in computer vision have enabled robots to make basic distinctions between objects. Even then, though, the systems don’t truly understand objects’ shapes, so there’s little the robots can do after a quick pick-up.

In a new paper, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), say that they’ve made a key development in this area of work: a system that lets robots inspect random objects, and visually understand them enough to accomplish specific tasks without ever having seen them before.

The system, called Dense Object Nets (DON), looks at objects as collections of points that serve as sort of visual roadmaps. This approach lets robots better understand and manipulate items, and, most importantly, allows them to even pick up a specific object among a clutter of similar — a valuable skill for the kinds of machines that companies like Amazon and Walmart use in their warehouses.

For example, someone might use DON to get a robot to grab onto a specific spot on an object, say, the tongue of a shoe. From that, it can look at a shoe it has never seen before, and successfully grab its tongue.

“Many approaches to manipulation can’t identify specific parts of an object across the many orientations that object may encounter,” says PhD student Lucas Manuelli, who wrote a new paper about the system with lead author and fellow PhD student Pete Florence, alongside MIT Professor Russ Tedrake. “For example, existing algorithms would be unable to grasp a mug by its handle, especially if the mug could be in multiple orientations, like upright, or on its side.”

The team views potential applications not just in manufacturing settings, but also in homes. Imagine giving the system an image of a tidy house, and letting it clean while you’re at work, or using an image of dishes so that the system puts your plates away while you’re on vacation.

What’s also noteworthy is that none of the data was actually labeled by humans. Instead, the system is what the team calls “self-supervised,” not requiring any human annotations.

Two common approaches to robot grasping involve either task-specific learning, or creating a general grasping algorithm. These techniques both have obstacles: Task-specific methods are difficult to generalize to other tasks, and general grasping doesn’t get specific enough to deal with the nuances of particular tasks, like putting objects in specific spots.

The DON system, however, essentially creates a series of coordinates on a given object, which serve as a kind of visual roadmap, to give the robot a better understanding of what it needs to grasp, and where.

The team trained the system to look at objects as a series of points that make up a larger coordinate system. It can then map different points together to visualize an object’s 3-D shape, similar to how panoramic photos are stitched together from multiple photos. After training, if a person specifies a point on a object, the robot can take a photo of that object, and identify and match points to be able to then pick up the object at that specified point.

This is different from systems like UC-Berkeley’s DexNet, which can grasp many different items, but can’t satisfy a specific request. Imagine a child at 18 months old, who doesn’t understand which toy you want it to play with but can still grab lots of items, versus a four-year old who can respond to “go grab your truck by the red end of it.”

In one set of tests done on a soft caterpillar toy, a Kuka robotic arm powered by DON could grasp the toy’s right ear from a range of different configurations. This showed that, among other things, the system has the ability to distinguish left from right on symmetrical objects.

When testing on a bin of different baseball hats, DON could pick out a specific target hat despite all of the hats having very similar designs — and having never seen pictures of the hats in training data before.

“In factories robots often need complex part feeders to work reliably,” says Florence. “But a system like this that can understand objects’ orientations could just take a picture and be able to grasp and adjust the object accordingly.”

In the future, the team hopes to improve the system to a place where it can perform specific tasks with a deeper understanding of the corresponding objects, like learning how to grasp an object and move it with the ultimate goal of say, cleaning a desk.

Learn more: Robots can now pick up any object after inspecting it

 

 

The Latest on: Computer vision

via Google News

 

The Latest on: Computer vision

via  Bing News

 

Taking real words that someone spoke and turning them into realistic video of that individual saying what you want

via University of Washington

University of Washington researchers have developed new algorithms that solve a thorny challenge in the field of computer vision: turning audio clips into a realistic, lip-synced video of the person speaking those words.

As detailed in a paper to be presented Aug. 2 at  SIGGRAPH 2017, the team successfully generated highly-realistic video of former president Barack Obama talking about terrorism, fatherhood, job creation and other topics using audio clips of those speeches and existing weekly video addresses that were originally on a different topic.

“These type of results have never been shown before,” said Ira Kemelmacher-Shlizerman, an assistant professor at the UW’s Paul G. Allen School of Computer Science & Engineering. “Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. This is the kind of breakthrough that will help enable those next steps.”

In a visual form of lip-syncing, the system converts audio files of an individual’s speech into realistic mouth shapes, which are then grafted onto and blended with the head of that person from another existing video.

The team chose Obama because the machine learning technique needs available video of the person to learn from, and there were hours of presidential videos in the public domain. “In the future video, chat tools like Skype or Messenger will enable anyone to collect videos that could be used to train computer models,” Kemelmacher-Shlizerman said.

Because streaming audio over the internet takes up far less bandwidth than video, the new system has the potential to end video chats that are constantly timing out from poor connections.

“When you watch Skype or Google Hangouts, often the connection is stuttery and low-resolution and really unpleasant, but often the audio is pretty good,” said co-author and Allen School professor Steve Seitz. “So if you could use the audio to produce much higher-quality video, that would be terrific.”

By reversing the process — feeding video into the network instead of just audio — the team could also potentially develop algorithms that could detect whether a video is real or manufactured.

The new machine learning tool makes significant progress in overcoming what’s known as the “uncanny valley” problem, which has dogged efforts to create realistic video from audio. When synthesized human likenesses appear to be almost real — but still manage to somehow miss the mark — people find them creepy or off-putting.

“People are particularly sensitive to any areas of your mouth that don’t look realistic,” said lead author Supasorn Suwajanakorn, a recent doctoral graduate in the Allen School. “If you don’t render teeth right or the chin moves at the wrong time, people can spot it right away and it’s going to look fake. So you have to render the mouth region perfectly to get beyond the uncanny valley.”

graphic showing how the conversion from audio to video works

A neural network first converts the sounds from an audio file into basic mouth shapes. Then the system grafts and blends those mouth shapes onto an existing target video and adjusts the timing to create a new realistic, lip-synced video.University of Washington

Previously, audio-to-video conversion processes have involved filming multiple people in a studio saying the same sentences over and over to try to capture how a particular sound correlates to different mouth shapes, which is expensive, tedious and time-consuming. By contrast, Suwajanakorn developed algorithms that can learn from videos that exist “in the wild” on the internet or elsewhere.

“There are millions of hours of video that already exist from interviews, video chats, movies, television programs and other sources. And these deep learning algorithms are very data hungry, so it’s a good match to do it this way,” Suwajanakorn said.

Rather than synthesizing the final video directly from audio, the team tackled the problem in two steps. The first involved training a neural network to watch videos of an individual and translate different audio sounds into basic mouth shapes.

By combining previous research from the UW Graphics and Image Laboratory team with a new mouth synthesis technique, they were then able to realistically superimpose and blend those mouth shapes and textures on an existing reference video of that person. Another key insight was to allow a small time shift to enable the neural network to anticipate what the speaker is going to say next.

The new lip-syncing process enabled the researchers to create realistic videos of Obama speaking in the White House, using words he spoke on a television talk show or during an interview decades ago.

Currently, the neural network is designed to learn on one individual at a time, meaning that Obama’s voice — speaking words he actually uttered — is the only information used to “drive” the synthesized video. Future steps, however, include helping the algorithms generalize across situations to recognize a person’s voice and speech patterns with less data – with only an hour of video to learn from, for instance, instead of 14 hours.

“You can’t just take anyone’s voice and turn it into an Obama video,” Seitz said. “We very consciously decided against going down the path of putting other people’s words into someone’s mouth. We’re simply taking real words that someone spoke and turning them into realistic video of that individual.”

Learn more: Lip-syncing Obama: New tools turn audio clips into realistic video

 

The Latest on: Audio-to-video conversion

via Google News and Bing News