Crowdsourcing big-data analysis


Web-based system automatically evaluates proposals from far-flung data scientists.

In the analysis of big data sets, the first step is usually the identification of “features”: data points with particular predictive power or analytic utility. Choosing features usually requires some human intuition. For instance, a sales database might contain revenues and date ranges, but it might take a human to recognize that average revenue (revenue divided by the length of each range) is the really useful metric.
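
As a rough illustration of the sales example, the sketch below derives an average daily revenue from made-up records. The record layout, the field names, and the choice to count date ranges inclusively are assumptions for illustration, not details from the researchers' work.

```python
from datetime import date

# Hypothetical sales records: total revenue over an inclusive date range.
sales = [
    {"revenue": 3000.0, "start": date(2017, 1, 1), "end": date(2017, 1, 30)},
    {"revenue": 3100.0, "start": date(2017, 3, 1), "end": date(2017, 3, 31)},
    {"revenue": 700.0, "start": date(2017, 5, 1), "end": date(2017, 5, 7)},
]

def average_daily_revenue(record):
    """Derived feature: revenue divided by the length of the date range in days."""
    days = (record["end"] - record["start"]).days + 1  # count both endpoints
    return record["revenue"] / days

features = [average_daily_revenue(r) for r in sales]
print(features)  # [100.0, 100.0, 100.0]
```

The raw revenues differ by a factor of more than four, yet the derived feature shows that all three periods performed identically per day, which is the kind of signal a raw column can hide.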

MIT researchers have developed a new collaboration tool, dubbed FeatureHub, intended to make feature identification more efficient and effective. With FeatureHub, data scientists and experts on particular topics could log on to a central site and spend an hour or two reviewing a problem and proposing features. Software then tests myriad combinations of features against target data, to determine which are most useful for a given predictive task.
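
The idea of testing combinations of features against target data can be sketched in a few lines. The example below is not FeatureHub's evaluation code: it assumes a tiny made-up data set, swaps in a simple nearest-centroid classifier for the machine-learning model, and scores every subset of two hypothetical candidate features on held-out rows.

```python
from itertools import combinations

# Made-up training and held-out rows: (feature values, label).
train = [
    ({"avg_revenue": 1.0, "noise": 10.0}, 0),
    ({"avg_revenue": 2.0, "noise": -8.0}, 0),
    ({"avg_revenue": 8.0, "noise": -10.0}, 1),
    ({"avg_revenue": 9.0, "noise": 8.0}, 1),
]
held_out = [
    ({"avg_revenue": 2.0, "noise": -6.0}, 0),
    ({"avg_revenue": 7.0, "noise": 5.0}, 1),
]

def centroids(rows, names):
    """Mean of each selected feature, per class label."""
    sums, counts = {}, {}
    for values, label in rows:
        acc = sums.setdefault(label, [0.0] * len(names))
        for i, name in enumerate(names):
            acc[i] += values[name]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def accuracy(names):
    """Fit a nearest-centroid classifier on `train`, then score it on `held_out`."""
    cents = centroids(train, names)
    correct = 0
    for values, label in held_out:
        point = [values[n] for n in names]
        predicted = min(
            sorted(cents),
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, cents[c])),
        )
        correct += predicted == label
    return correct / len(held_out)

candidates = ["avg_revenue", "noise"]
scores = {
    subset: accuracy(subset)
    for r in range(1, len(candidates) + 1)
    for subset in combinations(candidates, r)
}
best = max(scores, key=scores.get)  # first subset with the top score
print(scores)
print(best)  # ('avg_revenue',)
```

Exhaustive enumeration like this grows exponentially with the number of candidates, so a system handling many contributed features would need a smarter search; the sketch only shows the scoring loop.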

In tests, the researchers recruited 32 analysts with data science experience, who spent five hours each with the system, familiarizing themselves with it and using it to propose candidate features for each of two data-science problems.

The predictive models produced by the system were tested against those submitted to competitions on Kaggle, a data-science competition platform. The Kaggle entries had been scored on a 100-point scale, and the FeatureHub models came within three and five points of the winning entries for the two problems.

But where the top-scoring entries were the result of weeks or even months of work, the FeatureHub entries were produced in a matter of days. And while 32 collaborators on a single data science project is a lot by today’s standards, Micah Smith, an MIT graduate student in electrical engineering and computer science who helped lead the project, has much larger ambitions.

FeatureHub — like its name — was inspired by GitHub, an online repository of open-source programming projects, some of which have drawn thousands of contributors. Smith hopes that FeatureHub might someday attain a similar scale.

“I do hope that we can facilitate having thousands of people working on a single solution for predicting where traffic accidents are most likely to strike in New York City or predicting which patients in a hospital are most likely to require some medical intervention,” he says. “I think that the concept of massive and open data science can be really leveraged for areas where there’s a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses.”

Smith and his colleagues presented a paper describing FeatureHub at the IEEE International Conference on Data Science and Advanced Analytics. His coauthors on the paper are his thesis advisor, Kalyan Veeramachaneni, a principal research scientist at MIT’s Laboratory for Information and Decision Systems, and Roy Wedge, who began working with Veeramachaneni’s group as an MIT undergraduate and is now a software engineer at Feature Labs, a data science company based on the group’s work.

FeatureHub’s user interface is built on top of a common data-analysis software suite called the Jupyter Notebook, and the evaluation of feature sets is performed by standard machine-learning software packages. Features must be written in the Python programming language, but their design has to follow a template that intentionally keeps the syntax simple. A typical feature might require between five and 10 lines of code.
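
The article does not reproduce the template itself, so the following is only a guess at what a short, template-style feature might look like. It assumes a feature is a function that takes the whole data set and returns one numeric value per row; the column names, the `visits_per_year` feature, and the `validate_feature` helper are all hypothetical.

```python
# Made-up patient table: a dict mapping column names to equal-length lists.
dataset = {
    "num_visits": [12, 3, 8],
    "years_enrolled": [4, 3, 2],
}

def visits_per_year(dataset):
    """Candidate feature: one numeric value per row of the data set."""
    return [
        visits / years
        for visits, years in zip(dataset["num_visits"], dataset["years_enrolled"])
    ]

def validate_feature(feature, dataset):
    """Check a candidate against the assumed template rules; return its values."""
    n_rows = len(next(iter(dataset.values())))
    values = feature(dataset)
    if len(values) != n_rows:
        raise ValueError("feature must return exactly one value per row")
    if not all(isinstance(v, (int, float)) for v in values):
        raise ValueError("feature values must be numeric")
    return values

print(validate_feature(visits_per_year, dataset))  # [3.0, 1.0, 4.0]
```

Keeping every feature to the same call signature is what would let a system run contributions from many authors through one evaluation pipeline without custom glue code.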

The MIT researchers wrote code that mediates between the other software packages and manages data, pooling features submitted by many different users and tracking those collections of features that perform best on particular data analysis tasks.
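
A toy version of that bookkeeping might look like the sketch below. The `FeaturePool` class, its method names, the contributor names, and the scores are all invented for illustration; they are placeholders rather than FeatureHub's actual design or results.

```python
class FeaturePool:
    """Toy registry: pools submitted features and remembers the best set per task."""

    def __init__(self):
        self.features = {}  # feature name -> (author, function)
        self.best = {}      # task name -> (score, tuple of feature names)

    def submit(self, author, name, function):
        """Add one contributor's feature to the shared pool."""
        self.features[name] = (author, function)

    def record(self, task, feature_names, score):
        """Keep `feature_names` as the task's leader if `score` beats the current best."""
        current = self.best.get(task)
        if current is None or score > current[0]:
            self.best[task] = (score, tuple(feature_names))

    def leader(self, task):
        """Return (score, feature names) for the best set seen so far, or None."""
        return self.best.get(task)

pool = FeaturePool()
pool.submit("alice", "avg_revenue", lambda row: row["revenue"] / row["days"])
pool.submit("bob", "is_weekend", lambda row: row["weekday"] >= 5)

# Placeholder scores, as if produced by a separate evaluation step.
pool.record("sales_forecast", ["avg_revenue"], 0.81)
pool.record("sales_forecast", ["avg_revenue", "is_weekend"], 0.86)
pool.record("sales_forecast", ["is_weekend"], 0.60)

print(pool.leader("sales_forecast"))  # (0.86, ('avg_revenue', 'is_weekend'))
```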


In the past, Veeramachaneni’s group has developed software that automatically generates features by inferring relationships between data from the manner in which they’re organized. When that organizational information is missing, however, the approach is less effective.

Still, Smith imagines, automatic feature synthesis could be used in conjunction with FeatureHub, getting projects started before volunteers have begun to contribute to them, saving the grunt work of enumerating the obvious features, and augmenting the best-performing sets of features contributed by humans.
