Data Mining

From SI410
Revision as of 00:36, 18 December 2011 by Cvanderc (Talk | contribs) (Process of Data Mining)


This illustration shows the role Data Mining plays in processing information for business use.


Data mining is the act of analyzing data from various perspectives and summarizing it into useful information. It combines aspects of artificial intelligence, machine learning, statistics and database systems. Data mining software is one of many analytical tools used to analyze data. Through data mining, data is presented to users from many different angles, in various categories and relationships. In more technical terms, data mining is the process of finding correlations or patterns among many fields in large relational databases.[1]

Process of Data Mining

The process of data mining requires the extraction of useful information from large sets of data stored in a uniform way, often in a database. The type of information extracted depends on the type of data available and on the purpose of the data mining project itself. For example, if the objective is to draw a conclusion about the data set as a whole, the extracted information could be a model or representation of the entire data set. [2]

Data mining consists of multiple sub-tasks as outlined below. [3][4]

Anomaly Detection

Also called deviation detection, this is the “data cleaning” process in which data miners attempt to identify unusual circumstances within the data that may indicate anomalies or errors. The process includes program-aided searches of the data, passing the data through filters that check for certain cases, and searching for unexplainable patterns within the data. Anomaly detection often requires checking how the data was aggregated and whether human error affected the data present.
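As an illustration of a filter that checks for unusual values, the following minimal sketch flags numbers that lie far from the median of a data set. The function name flag_anomalies, the sample ages, and the threshold of 3.5 are assumptions made for this example; the modified z-score is only one of many possible checks.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a measure of spread that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    # 0.6745 rescales the deviation so it is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical ages, where 210 is probably a typing error for 21.
ages = [21, 19, 22, 20, 23, 210, 18, 21]
print(flag_anomalies(ages))  # prints [210]
```

A flagged value is only a candidate for review; deciding whether it is a real error still requires checking how the data was collected.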

Association Rule Learning

This task searches for general relationships between variables, such as items that frequently occur together in the same records.
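Two measures commonly used to describe such a relationship are support (how often a set of items appears) and confidence (how often the rule holds when its first part is present). The sketch below computes both for a small, made-up list of shopping baskets. The function names and the data are assumptions for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given that `antecedent` is present.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

print(support(baskets, {"bread", "butter"}))       # prints 0.5
print(confidence(baskets, {"bread"}, {"butter"}))  # about 0.67
```

Here the rule "bread implies butter" holds in two of the three baskets that contain bread.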

Clustering

A database of songs visually represented as clusters of data that best reflect the categories of songs present (e.g., hip-hop, punk).

Situations arise in which there is a desire to know whether similarities exist within a data set, but no predefined mechanism for detecting them exists. Clustering procedures within data mining aim to find similar objects within the data or to analyze correlations between attributes of the data. Clustering is also used to find topographical similarities within data sets.
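One widely used clustering procedure is k-means, which repeatedly assigns each point to its nearest cluster center and then moves each center to the average of its points. The sketch below is a simplified version under stated assumptions: the points are two-dimensional, the starting centers are supplied by the caller, and the number of iterations is fixed. The function name k_means and the sample points are hypothetical.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Group points around centroids; return (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical two-dimensional data with two visible groups.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8, 9.5)]
centers, groups = k_means(points, centroids=[(0, 0), (10, 10)])
print(groups[0])  # the three points near the origin
print(groups[1])  # the three points near (8, 9)
```

The result depends on the starting centers, so practical implementations usually try several starting positions and keep the best grouping.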

Classification

Use of mined data requires the input and output information to be formatted in a discrete way in order to be useful. Classifying traits of both the input and the output minimizes ambiguity and helps define the unique qualities and meaning of the data. This process also helps discover relationships between the data present in the database (the input data) and the information presented as conclusions about the data set (the output data). Classification can take many forms, from labeling to building models that fit a particular data set explicitly.
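A very simple classifier is the nearest-neighbor rule, which gives a new record the label of the most similar record that has already been labeled. The sketch below assumes numeric two-dimensional records and uses hypothetical labels ("small", "large"). It is meant only to show how input data is mapped to a discrete output.

```python
from math import dist


def classify(sample, labeled_examples):
    """Return the label of the labeled example closest to the sample."""
    _, label = min(labeled_examples, key=lambda example: dist(sample, example[0]))
    return label


# Hypothetical training data: (features, label) pairs.
training = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

print(classify((2.0, 1.5), training))  # prints small
print(classify((7.5, 8.0), training))  # prints large
```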

Regression

The objective of the regression step of data mining is to build a transparent model that fits the data and adequately conveys the information contained within the data set. Statistical methods are the most common approach within the regression step because they yield quantitative output. Regression modeling also attempts to minimize error within the other steps of the data mining process.
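The most familiar statistical model of this kind is a straight line fitted by ordinary least squares. The sketch below fits such a line to a few made-up points. The function name fit_line and the data are assumptions for illustration, and real data sets rarely fit a line this exactly.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = covariance of x and y divided by the variance of x.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical observations that follow y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)         # prints 2.0 1.0
print(slope * 5 + intercept)    # predicted y for x = 5
```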

Summarization

This task aims at producing representative descriptions of the data that provide context for it. The summary can take multiple formats. The most common is numerical analysis, which quantifies patterns in the data with statistics such as the mean or standard deviation of the data set. These results are often represented in graphical formats, such as histograms or scatter plots. Qualitative data can be summarized by listing trends or frequencies within the data. These results are useful because they represent the entirety of the data set in a much more digestible format.
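Both kinds of summary can be sketched in a few lines: a mean and standard deviation for numerical data, and a frequency list for qualitative data. The function names and sample values below are assumptions for illustration. The standard deviation shown is the population form; a sample-based summary would use statistics.stdev instead.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_numeric(values):
    """Return the count, mean, and population standard deviation."""
    return {"count": len(values), "mean": mean(values), "std_dev": pstdev(values)}


def summarize_categories(values):
    """Return (category, frequency) pairs, most frequent first."""
    return Counter(values).most_common()


# Hypothetical data sets.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
genres = ["pop", "rock", "pop", "jazz", "pop", "rock"]

print(summarize_numeric(scores))
print(summarize_categories(genres))  # prints [('pop', 3), ('rock', 2), ('jazz', 1)]
```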

Applications

Data mining is commonly used in business to determine which demographics are buying which products and to try to predict customer decisions, and in science to find patterns in experimental data.

Examples

A simple example of data mining is analyzing a large population, such as University of Michigan students, and determining simple characteristics that the data has, such as the proportion of the student body that is from each ethnic background.

Before the term "data mining" came into popular use, many businesses had already implemented its technology. They used powerful computers to comb through quantitative data from supermarket scanners, and analyzed the resulting data for market research purposes. This process has greatly increased the precision of analysis while decreasing the cost of research.[5]

Ethical Implications

Data mining is the development of models of accumulated data. Sometimes, in an attempt to build an accurate statistical model, data miners pry into private information in personal data records. While data mining itself is not inherently an ethical or unethical process, it has many applications that are ethically charged.

Particularly in mining social networking sites, a lot of personal information can be, and often is, accrued about an individual. Facebook uses these techniques to sell advertisers very specific target audiences. [6] Because data mining also has useful applications within the medical field, patient records could be accessed in the same way. This raises issues about patient confidentiality and breach of privacy with regard to ordinarily private areas of people's personal lives.

In systems that provide data from humans for such applications, a good way to keep the process ethical is to maintain the anonymity of the data, inform those involved of exactly what will happen to their data and how it will be used, and allow them to opt out.

References

  1. Palace, Bill. "What Is Data Mining?" Data Mining. Anderson Graduate School of Management at UCLA, Mar. 1996. Web. 16 Dec. 2011 <http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/index.htm>.
  2. http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf
  3. Olaru, C., & Wehenkel, L. (1999). Data mining. IEEE Computer Applications in Power, 12(3), 19-25. doi:10.1109/67.773801
  4. Shen Bin, Liu Yuan, & Wang Xiaoyi. (2010). Research on data mining models for the internet of things. 2010 International Conference on Image Analysis and Signal Processing (IASP) (pp. 127-132). Presented at the 2010 International Conference on Image Analysis and Signal Processing (IASP), IEEE. doi:10.1109/IASP.2010.5476146
  5. Palace, Bill. "What Is Data Mining?" Data Mining. Anderson Graduate School of Management at UCLA, Mar. 1996. Web. 16 Dec. 2011 <http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/index.htm>.
  6. http://www.facebook.com/advertising/

See also