Web-based system automatically evaluates proposals from far-flung data scientists.
In the analysis of big data sets, the first step is usually the identification of “features,” data points with particular predictive power or analytic utility. Choosing features usually requires some human intuition. For instance, a sales database might contain revenues and date ranges, but it might take a human to recognize that average revenue (revenue divided by the size of each date range) is the really useful metric.
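The sales example can be made concrete with a short Python sketch. The records, field names, and the `average_daily_revenue` helper are invented for illustration and do not come from the researchers' work; the sketch only shows how a derived feature turns two raw columns into a more useful one.

```python
from datetime import date

# Hypothetical sales records: total revenue earned over a date range.
sales = [
    {"revenue": 3000.0, "start": date(2017, 1, 1), "end": date(2017, 1, 31)},
    {"revenue": 3000.0, "start": date(2017, 2, 1), "end": date(2017, 2, 11)},
]


def average_daily_revenue(record):
    """Derived feature: revenue divided by the length of the range in days."""
    days = (record["end"] - record["start"]).days
    return record["revenue"] / days


# One feature value per record.
features = [average_daily_revenue(r) for r in sales]
print(features)
```

Both records report the same raw revenue, but the derived feature separates them because the second range is much shorter.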
MIT researchers have developed a new collaboration tool, dubbed FeatureHub, intended to make feature identification more efficient and effective. With FeatureHub, data scientists and experts on particular topics could log on to a central site and spend an hour or two reviewing a problem and proposing features. Software then tests myriad combinations of features against target data, to determine which are most useful for a given predictive task.
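As a rough illustration of what testing combinations of features against target data can look like, the sketch below scores every small subset of candidate features and ranks them. Everything in it is an assumption made for the example: the toy data, the function names, and the scoring method (leave-one-out accuracy of a one-nearest-neighbor classifier) stand in for whatever models and metrics FeatureHub actually uses.

```python
from itertools import combinations


def loo_accuracy(rows, target):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i, row in enumerate(rows):
        # Closest other row by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )
        correct += target[nearest] == target[i]
    return correct / len(rows)


def rank_feature_sets(columns, target, max_size=2):
    """Score every feature subset up to max_size, best score first."""
    results = []
    for size in range(1, max_size + 1):
        for subset in combinations(sorted(columns), size):
            rows = list(zip(*(columns[name] for name in subset)))
            results.append((loo_accuracy(rows, target), subset))
    # sorted() is stable, so ties keep the smaller subsets first.
    return sorted(results, key=lambda item: -item[0])


# Hypothetical candidate features and a binary target.
columns = {
    "avg_revenue": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    "noise": [3.0, 1.0, 4.0, 1.5, 9.0, 2.6],
}
target = [0, 0, 0, 1, 1, 1]

ranked = rank_feature_sets(columns, target)
for score, subset in ranked:
    print(f"{score:.2f}  {subset}")
```

Exhaustive enumeration like this only works for a handful of features; a real system would need a cheaper search strategy as the pool of submitted features grows.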
In tests, the researchers recruited 32 analysts with data science experience, who spent five hours each with the system, familiarizing themselves with it and using it to propose candidate features for each of two data-science problems.
The predictive models produced by the system were tested against those submitted to a data-science competition called Kaggle. The Kaggle entries had been scored on a 100-point scale, and the FeatureHub models came within three and five points of the winning entries for the two problems.
But where the top-scoring entries were the result of weeks or even months of work, the FeatureHub entries were produced in a matter of days. And while 32 collaborators on a single data science project is a lot by today’s standards, Micah Smith, an MIT graduate student in electrical engineering and computer science who helped lead the project, has much larger ambitions.
FeatureHub, as its name suggests, was inspired by GitHub, an online repository of open-source programming projects, some of which have drawn thousands of contributors. Smith hopes that FeatureHub might someday attain a similar scale.
“I do hope that we can facilitate having thousands of people working on a single solution for predicting where traffic accidents are most likely to strike in New York City or predicting which patients in a hospital are most likely to require some medical intervention,” he says. “I think that the concept of massive and open data science can be really leveraged for areas where there’s a strong social impact but not necessarily a single profit-making or government organization that is coordinating responses.”
Smith and his colleagues presented a paper describing FeatureHub at the IEEE International Conference on Data Science and Advanced Analytics. His coauthors on the paper are his thesis advisor, Kalyan Veeramachaneni, a principal research scientist at MIT’s Laboratory for Information and Decision Systems, and Roy Wedge, who began working with Veeramachaneni’s group as an MIT undergraduate and is now a software engineer at Feature Labs, a data science company based on the group’s work.
FeatureHub’s user interface is built on top of a common data-analysis software suite called the Jupyter Notebook, and the evaluation of feature sets is performed by standard machine-learning software packages. Features must be written in the Python programming language, but their design has to follow a template that intentionally keeps the syntax simple. A typical feature might require between five and 10 lines of code.
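To give a sense of scale, a feature of that size might look something like the following sketch. The template shown here (a function that receives a dictionary of tables and returns one value per row) is an assumption for illustration, as are the table layout and the `hour_of_day` and `is_valid_feature` names; the article does not spell out FeatureHub's actual template.

```python
from datetime import datetime


def hour_of_day(dataset):
    """Hypothetical feature: the hour at which each event occurred."""
    timestamps = dataset["events"]["timestamp"]
    return [ts.hour for ts in timestamps]


def is_valid_feature(feature, dataset, table="events", key="timestamp"):
    """Check that a feature returns exactly one numeric value per row."""
    values = feature(dataset)
    n_rows = len(dataset[table][key])
    return len(values) == n_rows and all(
        isinstance(v, (int, float)) for v in values
    )


# Hypothetical dataset: each table maps column names to lists of values.
dataset = {
    "events": {
        "timestamp": [
            datetime(2017, 10, 1, 8, 30),
            datetime(2017, 10, 1, 17, 5),
        ]
    }
}

values = hour_of_day(dataset)
print(values, is_valid_feature(hour_of_day, dataset))
```

Keeping every feature to a single function with a fixed signature is what would let a central system pool submissions from many users and run them all the same way.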
The MIT researchers wrote code that mediates between the other software packages and manages data, pooling features submitted by many different users and tracking those collections of features that perform best on particular data analysis tasks.
In the past, Veeramachaneni’s group has developed software that automatically generates features by inferring relationships between data from the manner in which they’re organized. When that organizational information is missing, however, the approach is less effective.
Still, Smith imagines, automatic feature synthesis could be used in conjunction with FeatureHub, getting projects started before volunteers have begun to contribute to them, saving the grunt work of enumerating the obvious features, and augmenting the best-performing sets of features contributed by humans.
Learn more: Crowdsourcing big-data analysis