Helping computers learn to tackle big-data problems outside their comfort zones
Imagine combing through thousands of mugshots desperately looking for a match. If time is of the essence, the faster you can do this, the better. A*STAR researchers have developed a framework that could help computers learn how to process and identify these images both faster and more accurately [1].
Peng Xi of the A*STAR Institute for Infocomm Research notes that the framework can be used for numerous applications, including image segmentation, motion segmentation, data clustering, hybrid system identification and image representation.
One conventional way for computers to process data is representation learning. This involves identifying a feature that allows the program to quickly extract relevant information from the dataset and categorize it, a bit like a shortcut. Supervised and unsupervised learning are two of the main methods used in representation learning. Unlike supervised learning, which relies on costly labeling of data prior to processing, unsupervised learning involves grouping or ‘clustering’ data in a similar manner to our brains, explains Peng.
Subspace clustering is a form of unsupervised learning that seeks to fit each data point into a low-dimensional subspace to find an intrinsic simplicity that makes complex, real-world data tractable. Existing subspace clustering methods struggle to handle ‘out-of-sample’, or unknown, data points and the large datasets that are common today.
“One of the challenges of the big-data era is to organize out-of-sample data using a machine learning model based on ‘in-sample’, or known, observational data,” explains Peng, who, with his colleagues, has proposed three methods as part of a unified framework to tackle this issue. These methods differ in how they implement representation learning: one focuses on sparsity, while the other two focus on low rank and grouping effects. “By solving the large-scale data and out-of-sample clustering problems, our method makes big-data clustering and online learning possible,” notes Peng.
The framework devised by the team splits input data into ‘in-sample’ data or ‘out-of-sample’ data during an initial ‘sampling’ step. Next, the in-sample data is grouped into subspaces during the ‘clustering’ step, after which the out-of-sample data is assigned to the nearest subspace. These points are then designated as cluster members.
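The three steps described above (sampling, clustering, nearest-subspace assignment) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's implementation: all function names are hypothetical, subspaces are taken to be linear and of a known dimension, and the in-sample step uses a simple K-subspaces loop as a stand-in for the sparse, low-rank and grouping-effect representation methods, which the article does not spell out.

```python
import numpy as np


def fit_bases(X, labels, n_clusters, dim):
    """Return one orthonormal basis (D x dim) per cluster, from its top right singular vectors."""
    bases = []
    for k in range(n_clusters):
        members = X[labels == k]
        if len(members) == 0:  # empty cluster: fall back to the zero subspace
            bases.append(np.zeros((X.shape[1], 0)))
            continue
        _, _, vt = np.linalg.svd(members, full_matrices=False)
        bases.append(vt[:dim].T)
    return bases


def nearest_subspace(X, bases):
    """Label each row of X with the subspace that leaves the smallest projection residual."""
    residuals = [np.linalg.norm(X - X @ B @ B.T, axis=1) for B in bases]
    return np.argmin(np.stack(residuals, axis=1), axis=1)


def k_subspaces(X, n_clusters, dim, n_iter=30, seed=0):
    """Stand-in in-sample clusterer (an assumption): alternate basis fitting and reassignment."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(X))  # random initial labels
    for _ in range(n_iter):
        new_labels = nearest_subspace(X, fit_bases(X, labels, n_clusters, dim))
        if np.array_equal(new_labels, labels):  # converged
            break
        labels = new_labels
    return labels


def cluster_with_out_of_sample(X, n_in_sample, n_clusters, dim,
                               clusterer=k_subspaces, seed=0):
    """Sample, cluster the in-sample points, then assign the rest to the nearest subspace."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    in_idx, out_idx = order[:n_in_sample], order[n_in_sample:]  # 'sampling' step

    labels = np.empty(len(X), dtype=int)
    labels[in_idx] = clusterer(X[in_idx], n_clusters, dim)  # 'clustering' step
    bases = fit_bases(X[in_idx], labels[in_idx], n_clusters, dim)
    labels[out_idx] = nearest_subspace(X[out_idx], bases)  # out-of-sample assignment
    return labels, in_idx, out_idx
```

Only the in-sample subset goes through the expensive clustering routine, while every remaining point costs one projection per subspace, which is where the reduction in running time would come from in a scheme like this. The `clusterer` argument is left pluggable so that any in-sample subspace clustering routine with the same signature can be swapped in.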
The team tested their approach on a range of datasets covering different types of information: facial images, handwritten and digital text, poker hands and forest coverage. They found that their methods outperformed existing algorithms and successfully reduced the computational complexity (and hence running time) of the task while still ensuring cluster quality.
Learn more: Thinking outside the sample