Simulating natural phenomena, mapping the human genome, and discovering ways to improve product quality all have one thing in common: They generate tremendous amounts of data.
Having terabytes of data at your disposal greatly increases the chances that you can find the answer to even the toughest questions—if you don't mind searching for a needle in a giant digital haystack.
For an increasing number of scientists and engineers, the method of choice is data mining. Data mining draws upon extensive work in areas such as statistics, machine learning, pattern recognition, databases, and high-performance computing to discover interesting and previously unknown information in data.
More specifically, data mining is the analysis of often large data sets to find relationships and patterns that aren't readily apparent, and to summarize the data in new and useful ways.
Data mining technology has enabled earth scientists from NASA to discover changes in the global carbon cycle and climate system, and biologists to map and explore the human genome.
"We can create and collect complex data at tremendous rates, but unless we can manage and analyze that data, it's useless," said Ron Musick, formerly the lead researcher on Lawrence Livermore National Laboratory's Large-Scale Data Mining Pilot Project in Human Genome. "To maximize the value of this data, we have to be able to access it and manipulate it as structured, coherent, integrated information. Data mining helps automate the exploration and characterization of the data being generated, so the scientist or engineer is free to concentrate on interpreting and using the data."
Musick is currently employed as the lead scientist at iKuni Inc., a machine-learning start-up focused on behavior capture for the entertainment industry. As a member of Lawrence Livermore's Center for Applied Scientific Computing, he was the group leader for the Data Science Group, and the project lead for the DataFoundry project, which is an ongoing research effort to improve scientists' interactions with large data sets.
NASA scientists are using satellite data gathered from remote locations (such as Kilimanjaro) to discover changes in the global climate system.
Watching the Weather
Scientists working with NASA Ames Research Center in Moffett Field, Calif., are using satellite data to develop a detailed global picture of the interplay of natural disasters, human activities, and the rise of carbon dioxide in the Earth's atmosphere over the past 20 years.
"The new results come from data mining, which sorts through a huge amount of satellite and scientific data to detect patterns and events that otherwise would be overlooked," said Vipin Kumar, professor of computer science and engineering at the University of Minnesota, and director of the Army's High Performance Computing Center. Kumar was the principal investigator on a recently ended joint project among the University of Minnesota; California State University-Monterey Bay in Seaside; Boston University; and NASA Ames to develop data mining techniques to help earth scientists discover changes in the global carbon cycle and climate system.
"Our goal was to explore teleconnections, to figure out what causes disturbances in the global carbon cycle," Kumar said. Teleconnections are atmospheric interactions between widely separated regions that have been identified through statistical correlations. For example, the El Niño teleconnection with the southwestern United States involves large-scale changes in climatic conditions that are linked to increased winter rainfall.
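Since teleconnections are identified through statistical correlations, the basic test can be sketched simply: compare anomaly series from two distant regions and look for a strong correlation. The data and threshold below are illustrative assumptions, not the project's actual values.

```python
# Illustrative sketch: a candidate teleconnection shows up as a strong
# statistical correlation between climate anomaly series recorded at
# widely separated locations. All numbers here are made up.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical monthly anomalies: Pacific sea-surface temperature vs.
# winter rainfall in the southwestern United States.
sst_anomaly = [0.2, 0.5, 1.1, 1.4, 0.9, 0.3, -0.2, -0.6]
rain_anomaly = [0.1, 0.4, 0.9, 1.2, 0.8, 0.2, -0.1, -0.5]

r = pearson(sst_anomaly, rain_anomaly)
if abs(r) > 0.8:
    print(f"candidate teleconnection, r = {r:.2f}")
```

In practice the search runs over every pair of grid cells, which is what makes scalable algorithms necessary.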
The project drew on data collected since roughly 1950. Since 1980, the data has been collected from NASA's Earth Observing Satellites and the Advanced Very High Resolution Radiometer aboard National Oceanic and Atmospheric Administration satellites. Prior to 1980, farmers, fishermen, and small regional development centers collected the data, which was largely anecdotal, Kumar said.
Data from the NOAA satellites was used to measure monthly changes in leafy plant cover worldwide. Boston University used unique NASA computer codes to produce global greenness values. These codes removed interfering data from atmospheric effects. When statistics showed there was much less greenness in specific areas for more than a year, scientists found a high probability of ecological disturbances.
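The disturbance test described above can be sketched as a simple rule: flag a location when its greenness stays well below normal for more than a year. The threshold, drop fraction, and data here are illustrative assumptions, not NASA's actual codes.

```python
# Sketch of the test described in the text: flag a probable ecological
# disturbance when a location's greenness index stays far below its
# long-term baseline for more than 12 consecutive months.

def disturbance_months(greenness, baseline, drop_fraction=0.5):
    """Longest run of consecutive months with greenness below
    drop_fraction * baseline."""
    longest = run = 0
    for g in greenness:
        run = run + 1 if g < drop_fraction * baseline else 0
        longest = max(longest, run)
    return longest

# 24 months of a made-up greenness index; a fire knocks plant cover
# down mid-series, and recovery takes more than a year.
series = [0.8] * 6 + [0.2] * 14 + [0.7] * 4
if disturbance_months(series, baseline=0.8) > 12:
    print("probable ecological disturbance")
```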
"Watching for changes in the amount of absorption of sunlight by green plants is an effective way to look for ecological disasters," said Christopher Potter, a scientist at NASA Ames.
The satellites generate roughly 1 terabyte of raw data per day, according to Kumar. The data consists of a sequence of global snapshots of the Earth, typically at monthly intervals, and includes various atmospheric, land, and ocean variables, such as sea surface temperature, pressure, precipitation, and net primary production, the net photosynthetic accumulation of carbon by plants. Sudden changes in net primary production can have a direct impact on a region's ecology.
The vast amount of data, and its complexity, meant that the project required specially designed algorithms to mine it. "When you have a small amount of data, it's possible to do the bulk of the analysis manually," Kumar said. "With terabytes of data, the analysis process has to be more data-driven and automated. There's no way to go through all that data manually."
Part of the challenge in analyzing this data is that it's not all in a form that can be easily analyzed, according to Kumar. This means that a fair amount of preprocessing must be performed to get the data into shape to be analyzed. "We needed to see what we could do with a mix of modest-quality data and very high-quality data," Kumar said.
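One common preprocessing step for mixed-quality records of the kind Kumar describes is filling short gaps in a time series so it can be analyzed alongside complete satellite data. A minimal sketch, using linear interpolation; the project's actual pipeline is not described in the source.

```python
# Illustrative preprocessing: fill short gaps (None values) in a monthly
# series by linear interpolation between the surrounding valid readings.
# Gaps at the ends of the series are left untouched.

def fill_gaps(series):
    """Linearly interpolate runs of None bounded by valid readings."""
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1                       # j = first valid index after gap
            if 0 < i and j < len(out):       # gap bounded on both sides
                lo, hi = out[i - 1], out[j]
                for k in range(i, j):
                    out[k] = lo + (hi - lo) * (k - i + 1) / (j - i + 1)
            i = j
        else:
            i += 1
    return out

print(fill_gaps([1.0, None, None, 4.0]))  # gap filled as 2.0, 3.0
```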
The sheer size of the data warehouse created an additional challenge. All the data mining algorithms needed to be scalable, since more data is continuously added. A combination of desktop PCs equipped with 2 to 3 gigabytes of memory each, and supercomputers with 100 times the memory of a desktop PC, run the algorithms.
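One standard way to keep an algorithm scalable as data is continuously added is to use one-pass, streaming statistics, so each new observation updates the result without re-reading the warehouse. A sketch using Welford's online update (offered as a general technique, not the project's specific method):

```python
# Streaming mean and variance via Welford's algorithm: O(1) memory no
# matter how much data arrives, and no need to revisit old records.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)       # each new satellite reading updates stats
print(stats.mean, stats.variance())   # 5.0 and about 4.57
```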
The typical data in the project are spatial time series data, where each time series references a location on Earth. The data mining algorithms look for spatio-temporal patterns that correlate extremes in weather with an ecological disturbance, according to Kumar. An ecological disturbance is an event that disrupts the physical makeup of an ecosystem and how it works for longer than one growing season of native plants. Natural disturbances may include fires, hurricanes, floods, droughts, lava flows, and ice storms. Other natural disturbances are due to plant-eating insects and mammals, and disease-causing microorganisms. Human-caused disturbances could happen as a result of logging, deforestation, draining wetlands, clearing, chemical pollution, and introducing non-native species to an area.
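The spatio-temporal search described above can be caricatured for one grid cell: find months of extreme low rainfall and check whether a sharp drop in net primary production follows within a short lag. Thresholds, lags, and data below are illustrative assumptions.

```python
# Toy spatio-temporal pattern search for a single grid cell: which
# extreme-low-rainfall months are followed, within max_lag months, by
# a collapse in net primary production (NPP)?

def extremes_precede_drops(precip, npp, precip_thresh, drop_thresh, max_lag=3):
    """Indices of low-rain months followed by an NPP drop within max_lag."""
    hits = []
    for t, p in enumerate(precip):
        if p < precip_thresh:
            window = npp[t:t + max_lag + 1]
            if window and min(window) < drop_thresh:
                hits.append(t)
    return hits

# Made-up cell: a drought around months 3-4 precedes an NPP collapse.
precip = [50, 48, 52, 10, 8, 47, 51, 49]
npp    = [1.0, 1.0, 0.9, 0.8, 0.3, 0.2, 0.6, 0.9]
print(extremes_precede_drops(precip, npp, precip_thresh=20, drop_thresh=0.5))
```

The real algorithms run this kind of test over every location on the globe, which is why the analysis cannot be done manually.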
According to Potter, "Ecosystem disturbances can contribute to the current rise of carbon dioxide in the atmosphere." Nine billion metric tons of carbon may have moved from the Earth's soil and surface life forms into the atmosphere in 18 years beginning in 1982 due to wildfires and other disturbances, according to the study. A metric ton is 2,205 pounds, equivalent to the weight of a small car. In comparison, fossil fuel emission of carbon dioxide to the atmosphere was about seven billion metric tons in 1990.
"In the new era of worldwide carbon accounting and management, we need an accurate method to tell us how much carbon dioxide is moving from the biosphere and into the atmosphere," Potter said. "Global satellite images go beyond the capability of human eyesight. All we need to do is look at the data with the proper formulas to filter out just what we need."
According to Kumar, "The more data we have, the more information we have, and the more we can correlate changes in weather to human actions. If we have the data, and the tools to mine it, we might be able to avoid ruining the planet."
Finding a single gene amid the vast stretches of DNA (left) that make up the human genome requires a set of powerful tools, such as data mining. Satellite images and data, such as the one above of the Vatnajokull Glacier in Iceland, assist researchers in tracking teleconnections.
Researchers are using data mining to track the impact of natural disasters, like hurricanes (above), on the global carbon cycle. The discovery of how various genetic components work, such as DNA proteins (facing page, top and bottom) was made possible through data mining.
Big Data, Little Source
Mapping the human genome meant creating a vast database of information that researchers needed to easily update, query, retrieve components from, and ultimately analyze. None of that would be possible without data mining technology, according to researchers.
Lawrence Livermore National Laboratory worked on a pilot project to address many of the mission-critical questions in large-scale data mining, as applied to the Human Genome study.
Among the challenges were the size of the data sets, the complexity of the data and algorithms, the data management demand, and the need to qualify the results of the data mining.
"Large-scale data sets pose a particular challenge," said Ron Musick. "The genomic research branched off from work we did with physics and astrophysics data. Astrophysics data sets range up to 1 or 2 terabytes. They're large enough so that they won't fit into a computer's core memory. This creates a need for scalable interfaces and input-output architectures for moving the data."
In the Livermore Lab's project, data mining queries against the genomics database were run on desktop PCs. When working with physics data, the data mining was performed on mini-clusters of PCs, not because computing power was limited, but rather to get results more quickly. Musick said they striped the data across as many as 30 hard disks, so that a query could be run more quickly.
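The idea behind striping can be sketched in a few lines: partition the data into independent chunks (standing in for the 30 disks), run the same query against each chunk in parallel, and merge the partial results. Disk and file handling are omitted; this only shows the access pattern, not Livermore's implementation.

```python
# Sketch of a striped query: each "stripe" is one chunk of the data,
# queried concurrently, with partial results merged in stripe order.
from concurrent.futures import ThreadPoolExecutor

def query_chunk(chunk, predicate):
    """Run the query (here, a simple filter) against one stripe."""
    return [row for row in chunk if predicate(row)]

def striped_query(stripes, predicate):
    with ThreadPoolExecutor(max_workers=len(stripes)) as pool:
        partials = pool.map(lambda s: query_chunk(s, predicate), stripes)
        return [row for part in partials for row in part]

# Toy data striped across three "disks"; find values above a threshold.
stripes = [[1, 9, 3], [8, 2, 7], [4, 10, 5]]
print(striped_query(stripes, lambda v: v > 6))  # results in stripe order
```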
The pilot project relied on custom-built data mining algorithms because of the size of the data sets and complexity of the information.
"We couldn't use commercial software, unless we broke the dataset into independent chunks," Musick said. "In a lot of ways, we had to first define the language the system would use to solve the problem, in much the same way you need the appropriate terminology to be able to describe how you hit a ball with a bat."
The National Center for Biotechnology Information has built on the research done at Livermore and elsewhere to create a variety of publicly available database and data mining tools, developed by a multidisciplinary group of computer scientists, mathematicians, biologists, physicians, and researchers. These tools allow researchers to generate testable hypotheses regarding the function or structure of a gene or protein by identifying similar sequences in better-characterized organisms.
One NCBI data mining tool, the Basic Local Alignment Search Tool, or BLAST, is used for comparing gene and protein sequences against others in public databases. BLAST performs sequence similarity searches against a DNA database of more than two million sequences in less than 15 seconds, according to an NCBI spokesperson. In essence, it's all about looking for patterns and correlations.
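Much of BLAST's speed comes from word-based seeding: short fixed-length "words" (k-mers) from the database are indexed, so a query only needs to examine sequences that share a word with it. A toy version of that indexing step follows; real BLAST then extends each seed into a scored local alignment, which this sketch omits, and the sequences here are invented.

```python
# Toy k-mer index in the spirit of BLAST's seeding step: map each short
# word to the database sequences containing it, then look up candidates
# for a query by its own words instead of scanning every sequence.
from collections import defaultdict

def build_index(database, k=3):
    """Map each k-mer to the ids of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def candidates(query, index, k=3):
    """Sequences sharing at least one k-mer word with the query."""
    hits = set()
    for i in range(len(query) - k + 1):
        hits |= index.get(query[i:i + k], set())
    return hits

# Hypothetical mini-database of gene fragments.
db = {"geneA": "ATGGCCATTG", "geneB": "TTTTTTTTTT", "geneC": "CCATGGCA"}
index = build_index(db)
print(candidates("GGCCAT", index))  # geneB shares no words, so it is skipped
```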
Finding those patterns can help researchers identify the genes linked to a particular disease or those that help suppress a disease, for example. The number of BLAST queries sent to the server continues to increase, growing from about 100,000 per weekday at the beginning of 2002 to about 140,000 per weekday in early 2004, according to the NCBI.
Data on a Smaller Scale
Data mining isn't restricted solely to vast banks of data with unlimited ways of analyzing it. Manufacturers such as W.L. Gore (the maker of GoreTex) use commercially available data mining tools to warehouse and analyze their data, and improve their manufacturing process.
Gore uses data mining tools from analytic software vendor SAS for statistical modeling in its manufacturing process. According to SAS, the software allows Gore to tweak the manufacturing process at a remote plant and to identify variations in output at specific points during the manufacturing process to fix or avoid production quality problems.
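Spotting variation at specific points in a process, as described above, is classically done with control charts. A minimal sketch, assuming Shewhart-style 3-sigma limits computed from an in-control baseline; this is an illustration of the general technique, not SAS's or Gore's actual model.

```python
# Shewhart-style control check: compute limits from a baseline sample of
# in-control measurements, then flag new readings outside mean +/- 3 sigma.
from statistics import mean, stdev

def out_of_control(baseline, new_points, sigma_limit=3.0):
    """Indices of new_points outside sigma_limit standard deviations
    of the baseline mean."""
    m, s = mean(baseline), stdev(baseline)
    return [i for i, x in enumerate(new_points)
            if abs(x - m) > sigma_limit * s]

# Hypothetical thickness readings from one station on the line.
baseline = [5.01, 4.99, 5.02, 5.00, 4.98, 5.01, 4.99, 5.00]
new = [5.00, 5.02, 5.61, 4.99]
print(out_of_control(baseline, new))  # flags index 2, the bad reading
```

Computing limits from a reference sample, rather than from the new data itself, keeps a single bad run from inflating the limits and masking itself.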
The University of Minnesota's Kumar is in the early stages of developing data mining techniques that could be used in the detection of defects in mechanical structures. He sees data mining technology as another tool for engineers to use in non-destructive analysis.
"Data mining has many uses," Kumar said. "The key is coming up with the right questions to ask. Once you have the question, you can develop the language to ask that question, then build a feature set and the algorithms to carry out your search. But it all starts with the data. You can't do data mining without the data."
And somewhere, in all that data, may just lie the needle you're looking for, or another you never knew existed.