Helping computers learn to tackle big-data problems outside their comfort zones
Imagine combing through thousands of mugshots looking for a match when time is of the essence: the faster the search, the better. A*STAR researchers have developed a framework that could help computers learn to process and identify such images both faster and more accurately [1].
Xi Peng of the A*STAR Institute for Infocomm Research notes that the framework can be used for numerous applications, including image segmentation, motion segmentation, data clustering, hybrid system identification and image representation.
A conventional way for computers to process data is representation learning. This involves learning a feature representation that lets a program quickly extract relevant information from a dataset and categorize it, a bit like a shortcut. Supervised and unsupervised learning are two of the main approaches to representation learning. Unlike supervised learning, which relies on costly labeling of the data prior to processing, unsupervised learning groups, or 'clusters', data in a similar manner to our brains, explains Peng.
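To make the distinction concrete, here is a minimal sketch contrasting the two paradigms, using scikit-learn on synthetic data. It is illustrative only and is not the authors' code:

```python
# Illustrative only: supervised vs. unsupervised learning with scikit-learn
# on synthetic data (not the authors' implementation).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))     # 200 samples, 10 features
y = (X[:, 0] > 0).astype(int)      # labels: needed only for supervision

# Supervised learning: every training sample must be labeled in advance.
clf = LogisticRegression().fit(X, y)

# Unsupervised learning: the data are grouped ('clustered') with no labels.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
```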
Subspace clustering is a form of unsupervised learning that seeks to fit each data point into a low-dimensional subspace to find an intrinsic simplicity that makes complex, real-world data tractable. Existing subspace clustering methods struggle to handle ‘out-of-sample’, or unknown, data points and the large datasets that are common today.
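Subspace clustering methods are often built on a 'self-expressive' model: each data point is coded as a linear combination of the other points, and the coefficient magnitudes define an affinity graph that a standard clustering step then partitions. The toy sketch below uses a ridge-regularized least-squares variant of this idea; it illustrates the general approach rather than the paper's specific algorithms:

```python
# Toy self-expressive subspace clustering (ridge/least-squares variant);
# illustrative only, not the paper's algorithm.
import numpy as np
from sklearn.cluster import SpectralClustering

def subspace_cluster(X, n_clusters, lam=0.1):
    """X: (n_samples, n_features). Code each sample as a linear combination
    of all samples, then spectrally cluster the resulting affinity graph."""
    G = X @ X.T                                        # sample Gram matrix
    C = np.linalg.solve(G + lam * np.eye(len(X)), G)   # ridge-regularized codes
    A = np.abs(C) + np.abs(C).T                        # symmetric affinity
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(A)
```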
“One of the challenges of the big-data era is to organize out-of-sample data using a machine learning model based on ‘in-sample’, or known, observational data,” explains Peng, who, with his colleagues, has proposed three methods as part of a unified framework to tackle this issue. These methods differ in how they implement representation learning: one enforces sparsity, while the other two exploit low rank and grouping effects. “By solving the large-scale data and out-of-sample clustering problems, our method makes big-data clustering and online learning possible,” notes Peng.
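Roughly speaking, the three variants share one self-expressive objective and differ only in the penalty placed on the coefficient matrix. The snippet below names the penalties conventionally associated with sparsity, low rank and the grouping effect; the paper's exact formulations may differ in detail:

```python
# Standard penalties conventionally tied to each flavor, applied to the
# coefficient matrix C of the self-expressive model (assumption: the
# paper's exact objectives may differ).
import numpy as np

def penalty(C, kind):
    if kind == 'sparse':      # l1 norm: favors few nonzero coefficients
        return np.abs(C).sum()
    if kind == 'low_rank':    # nuclear norm: sum of singular values
        return np.linalg.svd(C, compute_uv=False).sum()
    if kind == 'grouping':    # squared Frobenius norm: grouping effect
        return (C ** 2).sum()
    raise ValueError(kind)
```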
The framework devised by the team splits input data into ‘in-sample’ and ‘out-of-sample’ data during an initial ‘sampling’ step. Next, the in-sample data are grouped into subspaces during the ‘clustering’ step, after which each out-of-sample data point is assigned to its nearest subspace and designated a member of that cluster.
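A hedged end-to-end sketch of that sample-cluster-assign pipeline follows. Here PCA fitted per cluster stands in for the learned subspaces, and reconstruction error decides the nearest subspace; the function names are illustrative, not from the paper:

```python
# Hedged sketch of the sample -> cluster -> assign pipeline. PCA per cluster
# stands in for the learned subspaces; names are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def fit_subspaces(X_in, labels, dim=3):
    """Fit one low-dimensional subspace (via PCA) per in-sample cluster."""
    return {k: PCA(n_components=dim).fit(X_in[labels == k])
            for k in np.unique(labels)}

def assign_out_of_sample(X_out, subspaces):
    """Assign each out-of-sample point to the cluster whose subspace
    reconstructs it with the smallest error."""
    keys = list(subspaces)
    errs = np.stack([((X_out - p.inverse_transform(p.transform(X_out))) ** 2)
                     .sum(axis=1) for p in subspaces.values()])
    return np.array([keys[i] for i in errs.argmin(axis=0)])
```

Because only the smaller in-sample set goes through the expensive clustering step, each new point costs just a handful of projections, which is what makes large-scale and online use feasible.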
The team tested their approach on a range of datasets spanning different types of information: facial images, handwritten and digital text, poker hands and forest-cover records. They found that their methods outperformed existing algorithms, reducing the computational complexity (and hence running time) of the task while still ensuring cluster quality.
Learn more: Thinking outside the sample