Helping computers learn to tackle big-data problems outside their comfort zones
Imagine combing through thousands of mugshots desperately looking for a match. If time is of the essence, the faster you can do this, the better. A*STAR researchers have developed a framework that could help computers learn how to process and identify these images both faster and more accurately [1].
Peng Xi of the A*STAR Institute for Infocomm Research notes that the framework can be used for numerous applications, including image segmentation, motion segmentation, data clustering, hybrid system identification and image representation.
One conventional way for computers to process data is representation learning. This involves identifying a feature that allows the program to quickly extract relevant information from the dataset and categorize it, a bit like a shortcut. Supervised and unsupervised learning are two of the main methods used in representation learning. Unlike supervised learning, which relies on costly labeling of data prior to processing, unsupervised learning involves grouping or ‘clustering’ data in a similar manner to our brains, explains Peng.
Subspace clustering is a form of unsupervised learning that seeks to fit each data point into a low-dimensional subspace to find an intrinsic simplicity that makes complex, real-world data tractable. Existing subspace clustering methods struggle to handle ‘out-of-sample’, or unknown, data points and the large datasets that are common today.
“One of the challenges of the big-data era is to organize out-of-sample data using a machine learning model based on ‘in-sample’, or known, observational data,” explains Peng who, with his colleagues, has proposed three methods as part of a unified framework to tackle this issue. These methods differ in how they implement representation learning; one focuses on sparsity, while the other two focus on low rank and grouping effects. “By solving the large-scale data and out-of-sample clustering problems, our method makes big-data clustering and online learning possible,” notes Peng.
The framework devised by the team splits input data into ‘in-sample’ and ‘out-of-sample’ data during an initial ‘sampling’ step. Next, the in-sample data is grouped into subspaces during the ‘clustering’ step, after which each out-of-sample data point is assigned to its nearest subspace and designated a member of the corresponding cluster.
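The sample, cluster and assign steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's method: a plain k-subspaces loop stands in for their sparsity, low-rank and grouping-based representation learning, subspaces are assumed to pass through the origin, and every function name and parameter here is hypothetical.

```python
import numpy as np

def fit_subspace(points, dim):
    """Orthonormal basis (dim x n_features) of the best-fit linear subspace."""
    # The leading right singular vectors span the directions of largest variation.
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    return vt[:dim]

def residuals(points, subspaces):
    """Distance from each point to each subspace, shape (n_points, n_subspaces)."""
    out = np.empty((len(points), len(subspaces)))
    for j, basis in enumerate(subspaces):
        projected = points @ basis.T @ basis
        out[:, j] = np.linalg.norm(points - projected, axis=1)
    return out

def cluster_in_sample(points, n_clusters, dim, n_iter=20, seed=0):
    """Stand-in clustering step (k-subspaces): alternate fitting and reassigning."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(points))
    for _ in range(n_iter):
        subspaces = []
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) < dim:
                # Too few members to span a subspace: re-seed from random points.
                members = points[rng.choice(len(points), dim, replace=False)]
            subspaces.append(fit_subspace(members, dim))
        new_labels = residuals(points, subspaces).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, subspaces

def cluster_large_dataset(data, n_clusters, dim, n_in_sample, seed=0):
    """Sampling, clustering, then nearest-subspace assignment of the rest."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    in_idx, out_idx = order[:n_in_sample], order[n_in_sample:]

    # Only the small in-sample subset goes through the expensive clustering step.
    in_labels, subspaces = cluster_in_sample(data[in_idx], n_clusters, dim, seed=seed)

    labels = np.empty(len(data), dtype=int)
    labels[in_idx] = in_labels
    # Each out-of-sample point joins the cluster whose subspace is nearest.
    labels[out_idx] = residuals(data[out_idx], subspaces).argmin(axis=1)
    return labels
```

Calling `cluster_large_dataset(data, n_clusters=2, dim=1, n_in_sample=50)` on an array with one row per data point returns one cluster label per row. The iterative clustering only ever sees the in-sample subset, while each remaining point costs a single projection per subspace, which is the intuition behind the reduced running time.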
The team tested their approach on a range of datasets covering different types of information: facial images, handwritten and digital text, poker hands and forest coverage. They found that their methods outperformed existing algorithms and successfully reduced the computational complexity (and hence running time) of the task while still ensuring cluster quality.
Learn more: Thinking outside the sample