Data Mining

From SI410
Revision as of 06:26, 16 December 2011 by Guo (Ethical Implications)

Data mining is the act of discovering patterns in large sets of data retrieved from a digital environment. It combines aspects of artificial intelligence, machine learning, statistics and database systems.


Example

An example of data mining is looking at a group of people, say a sample of students from the University of Michigan, and making inferences about the population of the University as a whole.
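The kind of inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population size, the 30% share, and the sample size of 500 are made-up numbers, and `estimate_proportion` is a hypothetical helper, not part of any library.

```python
import math
import random

def estimate_proportion(sample):
    """Return (proportion, standard error) for a list of True/False values."""
    n = len(sample)
    p = sum(sample) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 40,000 people, 30% of whom share some attribute.
population = [i < 12000 for i in range(40000)]

# Look at only 500 of them and infer the share for the whole population.
sample = rng.sample(population, 500)
p_hat, se = estimate_proportion(sample)
print(f"estimated share: {p_hat:.3f} +/- {1.96 * se:.3f}")
```

The estimate comes with a margin of error that shrinks as the sample grows, which is one reason data mining depends on large datasets.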

Process

Data mining can only occur on a dataset large enough to contain patterns to discover. The data must first be aggregated and stored in a database. Data cleaning then takes place to remove noisy or partially missing entries.
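A minimal sketch of the cleaning step, assuming records are plain Python dictionaries. The field names (`age`, `gpa`), the valid ranges, and the `clean` function are all hypothetical and chosen only for illustration; a real project would define them from its own data.

```python
def clean(records, required, valid_ranges):
    """Drop records with missing required fields or out-of-range values."""
    cleaned = []
    for rec in records:
        # Partially missing entry: a required field is absent or None.
        if any(rec.get(field) is None for field in required):
            continue
        # Noisy entry: a present value falls outside its plausible range.
        noisy = False
        for field, (low, high) in valid_ranges.items():
            value = rec.get(field)
            if value is not None and not (low <= value <= high):
                noisy = True
                break
        if not noisy:
            cleaned.append(rec)
    return cleaned

raw = [
    {"id": 1, "age": 20, "gpa": 3.4},
    {"id": 2, "age": None, "gpa": 3.9},   # missing age
    {"id": 3, "age": 19, "gpa": 41.0},    # implausible GPA (noise)
    {"id": 4, "age": 22},                 # missing GPA
    {"id": 5, "age": 21, "gpa": 2.8},
]

cleaned = clean(raw, required=("age", "gpa"),
                valid_ranges={"age": (0, 120), "gpa": (0.0, 4.0)})
print([rec["id"] for rec in cleaned])
```

Dropping records is the simplest policy; filling in missing values is another option, but it adds assumptions about the data.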

Data mining consists of six sub-tasks:[1]

  • Anomaly detection: Identifying unusual records that may be anomalies or errors.
  • Association rule learning: Searching for general relationships between variables.
  • Clustering: Detecting groups or structures of similar records within the data.
  • Classification: Applying known structures to new data.
  • Regression: Finding a function to model the data with the least error.
  • Summarization: Providing context and reporting findings.
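Three of the sub-tasks listed above can be sketched with the Python standard library alone. These are deliberately simple stand-ins, not the methods used in the cited overview: anomaly detection is approximated with a z-score threshold, regression with an ordinary least-squares line, and classification with a nearest-neighbour rule. All function names and sample values are hypothetical.

```python
import math
import statistics

def find_anomalies(values, threshold=2.0):
    """Anomaly detection: flag values far from the mean (z-score rule)."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]

def fit_line(xs, ys):
    """Regression: least-squares line y = intercept + slope * x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def classify(point, labeled_points):
    """Classification: give a new point the label of its nearest neighbour."""
    nearest = min(labeled_points, key=lambda item: math.dist(point, item[0]))
    return nearest[1]

# Anomaly detection: one reading is far from the rest.
anomalies = find_anomalies([10, 11, 9, 10, 12, 10, 50])

# Regression: points that lie exactly on y = 1 + 2x.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Classification: known structure (two labeled groups) applied to new data.
known = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.9), "B")]
label = classify((4.8, 5.1), known)

print(anomalies, slope, intercept, label)
```

Each function is a toy version of its task; practical systems use more robust methods, but the inputs and outputs have the same shape.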

Applications

Data mining is commonly used in business to determine which demographics buy which products and to try to predict customer decisions, and in science to find patterns in experimental data.
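As a rough illustration of the business case, the sketch below counts purchases per demographic group and reports the most purchased product in each. The age groups, products, and the `top_product_by_group` function are invented for the example.

```python
from collections import Counter, defaultdict

def top_product_by_group(purchases):
    """Map each demographic group to its most frequently bought product."""
    counts = defaultdict(Counter)
    for group, product in purchases:
        counts[group][product] += 1
    return {group: c.most_common(1)[0][0] for group, c in counts.items()}

# Hypothetical purchase log: (age group, product) pairs.
purchases = [
    ("18-24", "headphones"),
    ("18-24", "headphones"),
    ("18-24", "textbook"),
    ("25-34", "coffee maker"),
    ("25-34", "headphones"),
    ("25-34", "coffee maker"),
]

favourites = top_product_by_group(purchases)
print(favourites)
```

A table like this is a starting point for prediction: a new customer in a given group can be shown the product that group buys most often.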

Ethical Implications

Data mining is the development of models from accumulated data. In an attempt to build an accurate statistical model, data miners sometimes pry into private information held in personal data records. While data mining is not inherently ethical or unethical, many of its applications are ethically charged.

Mining social networking sites in particular can accrue, and often does accrue, a great deal of personal information about an individual. Facebook uses these techniques to sell advertisers very specific target audiences.[2] Because data mining also has useful applications in the medical field, patient records could be accessed in the same way. This raises issues of patient confidentiality and of breaches of privacy in ordinarily private areas of people's lives.

In systems that provide data from humans for such applications, three practices help keep the process ethical: maintaining the anonymity of the data, informing those involved of exactly what will happen to their data and how it will be used, and allowing them to opt out.
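The opt-out and anonymity practices described above might be applied before any mining takes place, roughly as follows. This is a sketch under assumptions: the field names (`user_id`, `name`, `email`, `opted_out`), the salt, and the `prepare_for_mining` function are hypothetical. Replacing an identifier with a salted hash is pseudonymization, which is weaker than true anonymity, since the remaining fields can sometimes still identify a person.

```python
import hashlib

def prepare_for_mining(records, salt, identifying_fields=("name", "email")):
    """Honour opt-outs and strip direct identifiers from each record."""
    prepared = []
    for rec in records:
        # Respect the person's choice: skip anyone who opted out.
        if rec.get("opted_out"):
            continue
        # Replace the user id with a salted hash so records by the same
        # person can still be linked without exposing the id itself.
        digest = hashlib.sha256((salt + str(rec["user_id"])).encode("utf-8"))
        kept = {
            key: value for key, value in rec.items()
            if key not in identifying_fields
            and key not in ("user_id", "opted_out")
        }
        kept["pseudonym"] = digest.hexdigest()[:16]
        prepared.append(kept)
    return prepared

records = [
    {"user_id": 1, "name": "Alice", "email": "alice@example.com",
     "age_group": "18-24", "opted_out": False},
    {"user_id": 2, "name": "Bob", "email": "bob@example.com",
     "age_group": "25-34", "opted_out": True},
]

prepared = prepare_for_mining(records, salt="example-salt")
print(prepared)
```

The salt should be kept secret and separate from the mined data; otherwise the pseudonyms can be reversed by hashing candidate ids.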

References

  1. http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf
  2. http://www.facebook.com/advertising/

See also