Difference between revisions of "Data Mining"

Revision as of 07:42, 18 December 2011

This illustration shows the role Data Mining plays in processing information for business use.

Data mining is the analysis of data from various perspectives in order to summarize it into useful information. It combines aspects of artificial intelligence, machine learning, statistics and database systems. Data mining software is one of many analytical tools used to analyze data. Through data mining, data is presented to users from many different angles, in various categories and relationships. In more technical terms, data mining is the process of finding correlations or patterns among the fields of large relational databases.^[1]

Process of Data Mining

Data mining requires the extraction of useful information from large sets of data stored in a uniform way, often in a database. The type of information extracted depends on the type of data available and on the purpose of the data mining project. For example, if the objective is to draw a conclusion about the data set as a whole, the extracted information could be a model or representation of the entire data set.^[2]

Data mining consists of multiple sub-tasks as outlined below.^[3]^[4]

Anomaly Detection

Also called deviation detection, this is the “data cleaning” process in which data miners attempt to identify unusual records within the data that may indicate anomalies or errors. The process includes program-aided searches of the data, passing the data through filters that check for certain cases, and searching for unexplainable patterns within the data. Anomaly detection often requires checking how the data was aggregated and whether human error affected the data present.
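One simple filter of this kind can be sketched in Python. This is only an illustrative sketch, not a method taken from the cited sources: it assumes the data is a list of numeric readings and flags any value whose z-score (distance from the mean in standard deviations) exceeds a threshold. The function name `flag_anomalies`, the threshold of 2.0 and the sample readings are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold.

    Assumption: values are numeric and roughly bell-shaped, so distance
    from the mean is a reasonable signal of an unusual record.
    """
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical sensor readings with one suspicious entry at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0, 10.2]
suspects = flag_anomalies(readings)
print(suspects)
```

A flagged value is only a candidate for review. Whether it is an error or a genuine but rare observation still has to be decided by checking how the data was collected.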

Clustering

A database of songs visually represented as clusters that best represent the categories of songs present (e.g. hip-hop, punk).

Situations arise in which there is a desire to know whether similarities exist within a data set, but no predefined categories or labels are available for detecting them. Clustering procedures aim to group similar objects within the data or to analyze correlations between attributes of the data. Clustering is also used to find topographical similarities within data sets.
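As a rough illustration of grouping similar objects, the sketch below implements k-means, one common clustering procedure (the article does not name a specific algorithm, so this choice is an assumption). The song features (tempo, energy), the function name `k_means` and the seeding rule are hypothetical.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Group points into k clusters by repeatedly moving each centroid
    to the mean of the points assigned to it."""
    # Assumption: seed with the first k points so the run is deterministic.
    centroids = [points[i] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for c in range(k):
            if clusters[c]:
                size = len(clusters[c])
                new_centroids.append(
                    tuple(sum(axis) / size for axis in zip(*clusters[c]))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, clusters


# Hypothetical songs described by (tempo in BPM, energy from 0 to 1).
songs = [(90, 0.4), (180, 0.9), (95, 0.5), (175, 0.95), (88, 0.45), (185, 0.85)]
centroids, clusters = k_means(songs, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

In this toy data the tempo values dominate the distance because the features are not scaled. A real analysis would usually normalize each attribute before clustering.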

Classification

Mined data is useful only when its input and output information are expressed in discrete, well-defined categories. Classifying traits of both the input and the output minimizes ambiguity and helps define the unique qualities and the meaning of the data. This process also helps discover relationships between the data present in the database (the input data) and the information presented as conclusions about the data set (the output data). Classification can take many forms, from labeling records to building models that fit the particular data set explicitly.
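A minimal sketch of classification by labeling, assuming a small set of already-labeled records: a new record receives the majority label of its k nearest labeled neighbours (k-nearest-neighbours, chosen here only as an example technique). The function name `knn_classify`, the feature values and the labels are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(training, query, k=3):
    """Label query with the majority label among its k nearest training records.

    training is a list of (features, label) pairs, where features is a
    tuple of numbers with the same length as query.
    """
    neighbours = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labeled input data: (feature vector, output label).
training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.2), "high"),
    ((5.1, 4.9), "high"),
    ((4.8, 5.0), "high"),
]

print(knn_classify(training, (1.1, 1.0)))
print(knn_classify(training, (5.0, 5.0)))
```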

Regression

The objective of the regression step of data mining is to build a transparent model that fits the data and adequately conveys the information contained within the data set. Statistical methods are the most common approach within the regression step because they yield quantitative output. Regression modeling also attempts to minimize error, typically the difference between the values the model predicts and the values observed in the data.
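The simplest statistical model of this kind is a straight line fitted by ordinary least squares. The sketch below is an illustration under that assumption rather than a method named in the article, and the function name `fit_line` and the sample data are hypothetical. It chooses the slope and intercept that minimize the squared difference between predicted and observed values.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The fitted line is transparent in the sense described above: the slope states directly how much the output changes for each unit of input.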

Summarization

This task aims at producing representative descriptions of the data that provide context for it. Summaries can take multiple forms. The most common is numerical analysis, which quantifies patterns in the data with statistics such as the mean or standard deviation of the data set. These results are often presented graphically, for example as histograms or scatter plots. Qualitative data can be summarized by listing trends or frequencies within the data. Such summaries are useful because they represent the entire data set in a much more digestible format.
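Both kinds of summary can be sketched with Python's standard library. The function names `summarize_numeric` and `summarize_categories` and the sample values are hypothetical: the first computes the numerical statistics mentioned above, and the second reports the relative frequency of each category in qualitative data.

```python
from collections import Counter
from statistics import mean, median, stdev


def summarize_numeric(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def summarize_categories(labels):
    """Return the proportion of records that fall in each category."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Hypothetical data: track lengths in minutes and genre labels.
lengths = [2, 4, 4, 4, 5, 5, 7, 9]
genres = ["hip-hop", "punk", "hip-hop", "jazz"]

print(summarize_numeric(lengths))
print(summarize_categories(genres))
```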

Data Mining Tools

As data mining becomes a more popular way to sift through the massive amounts of data available online, new tools have been appearing to make the job simpler. Google, for example, has released a free tool called Correlate^[5], aimed at helping corporations use Google's search data to draw conclusions about consumer behavior. Researchers at MIT and Harvard have recently collaborated to create MINE, a tool reported to analyze more data for more complex patterns than previously available tools. It is said to approach the quality of human examination in its ability to pick up on nonstandard or deceptive patterns.^[6]

Applications of Data Mining

Examples

A simple example of data mining is analyzing a large population, such as University of Michigan students, and determining simple characteristics of the data, such as the proportion of the student body that comes from each ethnic background.

Before the term "data mining" came into popular use, many businesses had already adopted the technology behind it. They used powerful computers to comb through quantitative data from supermarket scanners and analyzed the results for market research purposes. This process has greatly increased the precision of analysis while decreasing the cost of research.^[1]

Business

Data mining is commonly used in business^[7] to determine which demographics are buying which products, and to help firms predict customer preferences. Online retailers often track the ways in which their clients interact with the content on their site. This yields an extensive collection of data that data mining procedures can use to model representative "personas" of future clients who will visit the site. This has meaningful value for companies, because information generated by data mining can lead to more "click-throughs" and, potentially, more "conversions" (sales).

Science/Academics

Data mining also plays a crucial role in scientific analysis. The objective of many large research studies is to find patterns in experimental data. Many data mining procedures make it possible to model such data much faster and more easily than doing so by hand.

Ethical Implications

Data mining is inherently a morally neutral action: it is simply the practice of working with large sets of stored data, and it does not account for the way in which the resulting information will be used. The ethics of data mining come under scrutiny when the data being mined is something that the individuals subject to the mining feel should not be disclosed to others. Issues also arise when precautions are not taken to hide an individual’s personal identity, or when firms do not take adequate measures to protect individuals' personal data within a larger data set.

In an attempt to build accurate models, data miners in online environments occasionally use personal information about the individuals who visit their site. The information used includes, but is not limited to, geographic identifiers, records of past behavior, and information that identifies the individual. Many users do not realize that their information is being used, and many question the ethics of using an individual’s private information without their consent.^[8]

In systems that provide data on individuals, it is important to keep the data anonymous and to inform the people concerned about which of their data will be used and how. Allowing individuals the freedom to opt in or out of data mining processes is one way to ensure some form of ethical responsibility.

Mining data found on social networking sites poses a particular challenge, as a great deal of personal information is often found on an individual’s social media page. Social media firms like Facebook often sell advertisers information mined from individuals' personal pages. Firms often explicitly state that they may sell or use an individual's personal information if the individual agrees to the site's terms and conditions. Many users still feel that this is not something in which a social media firm should engage.^[9]^[10]

As more information related to individuals' health records becomes digitized, data mining will become significantly more useful within the healthcare field. Patients' health records, when used in aggregate, will provide significant insight into the causes of specific illnesses. The success of data mining in this instance hinges on converting large amounts of data into standardized digital formats.

@@ Line 29: / Line 29: @@
 As data mining is becoming more popular to sift through the massive amounts of data available online, new tools have been cropping up to make the job simpler. Google, for example, has released a free tool called Correlate<ref>http://www.fastcompany.com/1755287/google-correlate-tool-gives-marketers-powerful-new-data-mining-tools</ref> aimed at helping corporations utilize the Google search database to draw conclusions on consumer behavior. MIT and Harvard have recently collaborated in the creation of MINE, a tool of unprecedented power which can analyze more data for more complex patterns than any tool previously available. It is said to approach the quality of human examination in terms of being able to pick up on nonstandard or deceptive patterns<ref>http://www.decodedscience.com/data-mining-tool-advances-mine-ranks-multiple-patterns/7959</ref>.
+===See Also===
+*[[Geographic Information Systems]]
 ==Applications of Data Mining==
