Helping computers learn to tackle big-data problems outside their comfort zones
Imagine combing through thousands of mugshots desperately looking for a match. If time is of the essence, the faster you can do this, the better. A*STAR researchers have developed a framework that could help computers learn how to process and identify these images both faster and more accurately [1].
Peng Xi of the A*STAR Institute for Infocomm Research notes that the framework can be used for numerous applications, including image segmentation, motion segmentation, data clustering, hybrid system identification and image representation.
A conventional way for computers to process data is called representation learning. This involves identifying features that allow a program to quickly extract relevant information from a dataset and categorize it, a bit like a shortcut. Supervised and unsupervised learning are two of the main methods used in representation learning. Unlike supervised learning, which relies on costly labeling of data prior to processing, unsupervised learning involves grouping or ‘clustering’ data in a similar manner to our brains, explains Peng.
Subspace clustering is a form of unsupervised learning that seeks to fit each data point into a low-dimensional subspace to find an intrinsic simplicity that makes complex, real-world data tractable. Existing subspace clustering methods struggle to handle ‘out-of-sample’, or unknown, data points and the large datasets that are common today.
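As a toy illustration of what “fitting data into a low-dimensional subspace” means (this is not the authors’ method, just a minimal sketch using a singular value decomposition), consider points in three dimensions that secretly lie near a one-dimensional subspace, i.e. a line through the origin. The SVD exposes that hidden structure: one singular value dominates, and the leading right-singular vector recovers the line’s direction.

```python
import numpy as np

rng = np.random.default_rng(1)

# 200 points in 3-D that actually lie near a 1-D subspace (a line
# through the origin): a random coefficient times a fixed unit
# direction, plus a little noise.
direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)
points = rng.normal(size=(200, 1)) * direction \
    + rng.normal(scale=0.01, size=(200, 3))

# An SVD reveals the low-dimensional structure: the first singular
# value is far larger than the rest, and the first right-singular
# vector approximates the true direction of the subspace.
_, s, vt = np.linalg.svd(points, full_matrices=False)
recovered = vt[0]
print(s)  # singular values in descending order
```

Real subspace-clustering methods must do this for several subspaces at once, without knowing in advance which point belongs to which subspace, which is what makes the problem hard.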
“One of the challenges of the big-data era is to organize out-of-sample data using a machine learning model based on ‘in-sample’, or known, observational data,” explains Peng, who, with his colleagues, has proposed three methods as part of a unified framework to tackle this issue. These methods differ in how they implement representation learning: one focuses on sparsity, while the other two focus on low rank and grouping effects. “By solving the large-scale data and out-of-sample clustering problems, our method makes big-data clustering and online learning possible,” notes Peng.
The framework devised by the team splits input data into ‘in-sample’ and ‘out-of-sample’ sets during an initial ‘sampling’ step. Next, the in-sample data is grouped into subspaces during the ‘clustering’ step, after which each out-of-sample point is assigned to the nearest subspace and designated a member of that cluster.
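The three steps above can be sketched with a deliberately simplified stand-in: plain k-means plays the role of the subspace-clustering step, and nearest-centroid assignment plays the role of the out-of-sample step. The paper’s actual methods use sparse and low-rank representations rather than k-means; every name and number below is illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated Gaussian blobs standing in for
# two low-dimensional subspaces.
a = rng.normal(loc=(-5.0, 0.0), scale=0.5, size=(100, 2))
b = rng.normal(loc=(5.0, 0.0), scale=0.5, size=(100, 2))
data = np.vstack([a, b])
rng.shuffle(data)

# Step 1 (sampling): split the input into in-sample and
# out-of-sample portions.
in_sample, out_sample = data[:150], data[150:]

# Step 2 (clustering): group only the in-sample data.
# Deterministic init: one centroid at the leftmost point, one at
# the rightmost, then standard k-means iterations.
k = 2
centroids = in_sample[[np.argmin(in_sample[:, 0]),
                       np.argmax(in_sample[:, 0])]]
for _ in range(20):
    dists = np.linalg.norm(in_sample[:, None] - centroids[None], axis=2)
    labels = np.argmin(dists, axis=1)
    centroids = np.array([in_sample[labels == j].mean(axis=0)
                          for j in range(k)])

# Step 3 (assignment): each out-of-sample point joins the nearest
# cluster, with no need to re-run the clustering step.
out_dists = np.linalg.norm(out_sample[:, None] - centroids[None], axis=2)
out_labels = np.argmin(out_dists, axis=1)
print(np.bincount(out_labels, minlength=k))
```

The computational saving comes from step 3: the expensive clustering runs only on the in-sample subset, while new points are handled by a cheap nearest-subspace lookup, which is what makes large-scale and online settings feasible.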
The team tested their approach on a range of datasets spanning different types of information: facial images, handwritten and digital text, poker hands and forest coverage. They found that their methods outperformed existing algorithms, reducing the computational complexity (and hence running time) of the task while still ensuring cluster quality.
Learn more: Thinking outside the sample