Tool detects patterns hidden in vast data sets
16 December 2011 - HARVARD

The new tool, developed by SEAS computer scientist Michael Mitzenmacher and colleagues at Harvard and other institutions, offers a new way to recognize relationships in large data sets.
Researchers from the Broad Institute and Harvard University have developed a tool that can tackle large data sets in a way that no other software program can.
Part of a suite of statistical tools called MINE, it can tease out multiple patterns hidden in health information from around the globe, statistics amassed from a season of major league baseball, data on the changing bacterial landscape of the gut, and much more.
The researchers report their findings in a paper appearing today in the journal Science.
From Facebook to physics to the global economy, the world is filled with data sets that could take a person hundreds of years to analyze by eye. Sophisticated computer programs can search these data sets with great speed, but fall short when researchers attempt to even-handedly detect different kinds of patterns in large data collections.
"There are massive data sets that we want to explore, and within them, there may be many relationships that we want to understand," said senior author Pardis Sabeti, Assistant Professor in the Department of Organismic and Evolutionary Biology and the Center for Systems Biology at Harvard and an associate member of the Broad Institute. "The human eye is the best way to find these relationships, but these data sets are so vast that we can’t do that. This toolkit gives us a way of mining the data to look for relationships."
The researchers tested their analytical toolkit on several large data sets, including one provided by Peter Turnbaugh, a Bauer Fellow at the Harvard Center for Systems Biology, who is interested in the trillions of microorganisms that live in the gut. Working with Turnbaugh, the research team harnessed MINE to make more than 22 million comparisons and narrowed in on a few hundred patterns of interest that had not been observed before.
"The goal of this statistic is to take data with a lot of different dimensions and many possible correlations and pick out the top ones," said senior author Michael Mitzenmacher, Gordon McKay Professor of Computer Science at the Harvard School of Engineering and Applied Sciences. Mitzenmacher’s research involves developing randomized algorithms to analyze complex systems and networks.
"We view this as an exploration tool," he said. "It can find patterns and rank them in an equitable way."
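The idea of scanning many variable pairs and keeping the top-ranked ones can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`abs_pearson`, `top_pairs`) and the toy columns are invented here, and absolute Pearson correlation is a stand-in score, not the statistic used by MINE.

```python
from heapq import nlargest
from itertools import combinations
from math import sqrt


def abs_pearson(xs, ys):
    """Placeholder score: absolute Pearson correlation (not MINE's statistic)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return abs(sxy) / sqrt(sxx * syy)


def top_pairs(columns, score, k):
    """Score every unordered pair of columns and return the k highest."""
    scored = (
        (score(columns[a], columns[b]), a, b)
        for a, b in combinations(sorted(columns), 2)
    )
    return nlargest(k, scored)


# Synthetic toy columns, invented for illustration only.
columns = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],  # exactly 2 * a
    "c": [3, 1, 4, 1, 5, 9],    # arbitrary digits
    "d": [6, 5, 4, 3, 2, 1],    # a in reverse order
}

for value, a, b in top_pairs(columns, abs_pearson, k=3):
    print(f"{a}-{b}: {value:.3f}")
```

Because the score is passed in as a function, a different measure of dependence can be swapped in without changing the scan. The number of pairs grows as n(n-1)/2 for n variables, which is why a comparison count in the millions is reached quickly.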
One of the tool’s greatest strengths is that it can detect a wide range of patterns and characterize them according to a number of different parameters a researcher might be interested in. Other statistical tools work well for searching for a specific pattern in a large data set, but they cannot score and compare different kinds of possible relationships. MINE, which stands for Maximal Information-based Nonparametric Exploration, is able to analyze a broad spectrum of patterns.
"Standard methods will see one pattern as signal and others as noise," said David Reshef, a co-first author of the paper who is currently a graduate student in the Harvard-MIT Health Sciences and Technology (HST) program and also worked on this project as a graduate student in the department of statistics at the University of Oxford. "There can potentially be a variety of different types of relationships in a given data set. What’s exciting about our method is that it looks for any type of clear structure within the data, attempting to find all of them."
Not only does MINE attempt to identify any pattern within the data, but it also attempts to do so with an eye toward capturing different types of patterns equally well.
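A small Python sketch can show why a score tuned to one kind of pattern treats other patterns as noise. It compares Pearson correlation with a mutual-information score computed on a fixed grid, using a straight line and a parabola. The grid score is a simplified stand-in chosen for illustration, not the MINE algorithm, and the function names and synthetic data are assumptions introduced here.

```python
from collections import Counter
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def equal_count_bins(values, bins):
    """Label each value with a bin index so bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * bins // len(values)
    return labels


def grid_mutual_information(xs, ys, bins=4):
    """Mutual information in bits after binning each axis (fixed grid)."""
    bx, by = equal_count_bins(xs, bins), equal_count_bins(ys, bins)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


# Synthetic data, invented for illustration only.
xs = [i / 10 for i in range(-20, 21)]
line = [2 * x + 1 for x in xs]
parabola = [x * x for x in xs]

for name, ys in (("line", line), ("parabola", parabola)):
    r = pearson(xs, ys)
    mi = grid_mutual_information(xs, ys)
    print(f"{name}: pearson={r:.3f} grid_mi={mi:.3f} bits")
```

On this data Pearson correlation is close to 1 for the line and close to 0 for the parabola, while the grid score is clearly positive for both. The two grid scores are still not equal, so this fixed-grid sketch does not score different shapes equally well.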
"This ability to search for patterns in an equitable way offers tremendous exploratory potential in terms of searching for patterns without having to know ahead of time what to search for," said Reshef, whose brother Yakir is also a co-lead author of the paper.
MINE is especially powerful in exploring data sets in which a single relationship may harbor more than one important pattern. As a proof of concept, the researchers applied MINE to social, economic, health, and political data from the World Health Organization and its partners. When they examined the relationship between household income and female obesity, they found two contrasting trends in the data. Many countries follow a parabolic trend, with obesity rates rising with income but peaking and tapering off after income reaches a certain level. But in the Pacific Islands, where female obesity is a sign of status, countries follow a steep trend, with the rate of obesity climbing as income increases.
"Many data sets will contain these types of complicated relationships that are guided by multiple drivers," said Sabeti. MINE is able to identify these. "This greatly extends our capability to find interesting relationships in data."
Researchers can use MINE to generate new ideas and connections that no one has thought to look for before.
"Our tool is a hypothesis generator," said co-lead author Yakir Reshef ’09, a Fulbright scholar at the Weizmann Institute of Science. "The standard paradigm is hypothesis-driven science, where you come up with a hypothesis based on your personal observations. But by exploring the data, you get ideas for hypotheses that would never have occurred to you otherwise."
In addition to testing the ability of the suite of tools to detect patterns in biological and health data, the researchers examined data collected from the 2008 baseball season.
"One question that we thought would be particularly interesting would be to see what things were most strongly associated with salary," said David Reshef. The researchers generated a list of relationships, finding that the strongest associations with salary were hits, total bases, and an aggregate statistic that reflects how many runs a player generated for a team.
"Given the stakes, baseball is so well documented. We’re curious to see what can be done in this realm with tools like MINE," David Reshef added.
Researchers from many different fields, including systems biology, computer science, statistics, and mathematics, all contributed to this project.
"People are getting better at combining data from different sources, and in some ways, this project is in the spirit of that," said Yakir Reshef. "The project brought together authors from many disciplines. It symbolizes the kind of collaborations that we hope people will use this for in the future."
Other authors who contributed to this work include Sharon Grossman ’08, a graduate student in the Harvard-MIT HST program; Hilary Finucane ’09, of the Weizmann Institute; Gilean McVean, of the University of Oxford; and Eric Lander, Professor of Systems Biology at Harvard Medical School, Professor of Biology at MIT, and founding director of the Broad Institute.