Data Mining

From SI410
Figure: The role data mining plays in processing information for business use.

Data mining is the act of analyzing data from various perspectives and summarizing it into useful information. It combines aspects of artificial intelligence, machine learning, statistics, and database systems. Data mining software is one of many analytical tools used to analyze data: it presents data to users from many different angles, in various categories and relationships. In more technical terms, data mining is the process of finding correlations or patterns among the fields of large relational databases.[1]

Data mining is used in many fields, such as business, marketing, engineering, medicine, and the music and gaming industries. Businesses employ data mining techniques to gather information on their customers in order to advertise more effectively and target future customers. Data mining is also used to recognize patterns and designate categories for work in gaming, medicine, and music. Because personal information is gathered and studied, ethical concerns arise around the lack of anonymity and privacy of the subjects whose data is mined, as well as around obtaining user consent.

Process of Data Mining

The process of data mining requires the extraction of useful information from large sets of data stored in a uniform way, often in a database. The type of information extracted depends on the type of data available and on the purpose of the data mining project itself. For example, if the objective is to draw a conclusion about the dataset as a whole, the information extracted could be a model or representation of the entire set of data.[2]

Data mining consists of multiple sub-tasks:[3][4]

Anomaly Detection

Also called deviation detection, this is the “data cleaning” process in which data miners attempt to identify unusual cases within the data that may indicate anomalies or errors. The process includes program-aided searches of the data, passing the data through filters that check for certain cases, and searching for unexplainable patterns within the data. Anomaly detection often requires checking how the data was aggregated and whether human error affected the data present.
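One such filter can be sketched in Python. This is a minimal illustration, not a method taken from the cited sources: it assumes a single numeric field, uses the modified z-score (based on the median absolute deviation) as the test, and the function name, threshold, and sample values are all hypothetical.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical daily purchase totals; 9800 looks like a data-entry error.
purchases = [102, 98, 110, 95, 105, 9800, 101, 99]
outliers = flag_outliers(purchases)
print(outliers)
```

A flagged value is only a candidate error. As noted above, deciding whether it is a genuine anomaly still requires checking how the data was collected.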


Clustering

Figure: A database of songs visually represented as clusters that best represent the categories of songs present (e.g. hip-hop, punk).

Situations arise within datasets where there is a desire to know whether similarities exist within the data, but no predefined mechanism for detecting them exists. Clustering procedures aim to find similar objects within the data or to analyze correlations between attributes of the data. Clustering is also used to find topographical similarities within data sets.
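As a rough illustration of how a clustering procedure can find similar objects without being told the categories in advance, the sketch below implements a bare-bones k-means loop. K-means is only one of many clustering algorithms and is chosen here as an assumption. The song features, the values, and the deliberately naive starting centroids are hypothetical.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly moving centroids to cluster means."""
    centroids = list(points[:k])  # naive deterministic start: the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the cluster with the closest centroid.
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Hypothetical songs described by (tempo in BPM, loudness score).
songs = [(90, 0.8), (170, 0.9), (95, 0.7), (165, 0.95), (88, 0.75), (175, 0.85)]
groups = k_means(songs, k=2)
print(groups)
```

In practice the features would usually be rescaled first, since here tempo dominates the distance, and the starting centroids would be chosen more carefully.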


Classification

Use of mined data requires the input and output information to be formatted in a discrete way in order for the information to be useful. Classifying traits of both the input and the output minimizes ambiguity and helps define the unique qualities and the meaning of the data. This process also helps discover relationships between the data present in the database (the input data) and the information presented as a conclusion about the data set (the output data). Classification can take many forms, from labeling to building models that fit the particular data set explicitly.
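To make the idea of mapping input data to discrete output labels concrete, here is a small sketch of a k-nearest-neighbour classifier. The choice of algorithm is an assumption made for illustration, and the training examples, feature names, and labels are hypothetical.

```python
from collections import Counter
from math import dist

def classify(sample, labelled, k=3):
    """Predict a label by majority vote among the k nearest labelled examples."""
    nearest = sorted(labelled, key=lambda item: dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (tempo in BPM, loudness score) -> genre label.
training = [
    ((90, 0.8), "hip-hop"), ((95, 0.7), "hip-hop"), ((88, 0.75), "hip-hop"),
    ((170, 0.9), "punk"), ((165, 0.95), "punk"), ((175, 0.85), "punk"),
]
prediction = classify((160, 0.9), training)
print(prediction)
```

Unlike clustering, this approach needs examples that already carry labels. The model simply carries those labels over to new, unlabelled input.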


Regression

The objective of the regression step of data mining is to build a transparent model that fits the data and adequately conveys the information contained within the data set. Statistical methods are the most common approach within the regression step because they yield quantitative output. Regression modeling also attempts to minimize error within the other steps of the data mining process.
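A simple statistical model of this kind is a straight line fitted by ordinary least squares. The sketch below assumes one input variable and one output variable. The variable names and figures are hypothetical and chosen so that the fit is exact.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend versus sales, in arbitrary units.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]
slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept)
```

The fitted slope and intercept are the transparent part of the model: each can be read directly as a statement about the data.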


Summarization

This task aims at producing representative descriptions of the data that provide context for it. Summaries can take multiple formats, the most common of which is numerical analysis, which quantifies patterns in the data with statistics such as the mean or standard deviation of the data set. These results are often presented graphically, for example as histograms or scatter plots. Qualitative data can be summarized by listing trends or frequencies within a set of data. These results are useful because they represent the entire data set in a much more digestible format.
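Both kinds of summary can be produced with Python's standard library, as the following sketch shows. The field names and values are hypothetical. The numeric summary uses the sample standard deviation, and the qualitative summary is a frequency list.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical records: a numeric field and a qualitative field.
ages = [19, 20, 20, 21, 22, 24]
majors = ["Information", "Engineering", "Information",
          "Business", "Information", "Engineering"]

summary = {
    "mean_age": mean(ages),
    "stdev_age": stdev(ages),  # sample standard deviation
    "major_counts": Counter(majors).most_common(),  # frequencies, largest first
}
print(summary)
```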

Data Mining Tools

As data mining becomes a more popular way to sift through the massive amounts of data available online, new tools have appeared to make the job simpler. Google, for example, has released a free tool called Correlate[5], aimed at helping corporations use the Google search database to draw conclusions about consumer behavior. MIT and Harvard have collaborated on MINE, a tool reported to analyze more data for more complex patterns than previously available tools. It is said to approach the quality of human examination in its ability to pick up on nonstandard or deceptive patterns.[6]

Applications of Data Mining


A simple example of data mining is analyzing a large population, such as University of Michigan students, and determining simple characteristics of the data, such as the proportion of the student body that is from each ethnic background.

Before the term "data mining" came into popular use, many businesses had already implemented its technology. They used powerful computers to comb through quantitative data from supermarket scanners and analyzed the results for market research purposes. This process has greatly increased the precision of analysis while decreasing the cost of research.[1]
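A proportion calculation of this kind takes only a few lines. In the sketch below the records are hypothetical placeholders, with neutral group labels standing in for whatever categories the real data would contain.

```python
from collections import Counter

def proportions(labels):
    """Return each category's share of the total as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = len(labels)
    return {category: count / total for category, count in counts.items()}

# Hypothetical records, one category label per student.
students = ["group_a"] * 5 + ["group_b"] * 3 + ["group_c"] * 2
shares = proportions(students)
print(shares)
```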


Data mining is commonly used in business[7] to determine which demographics are buying which products and to help firms predict customer preferences. Online retailers often track the ways in which their clients interact with the content on their sites. This yields an extensive collection of data that data mining procedures can use to model representative "personas" of future clients who will visit the site. This is valuable to companies because information generated by data mining can lead to more "click-throughs" and, potentially, more "conversions" (sales).


Data mining also plays a crucial role in scientific analysis. The objective of many large research studies is to find patterns in experimental data. Many data mining procedures allow such data to be modeled much faster and more easily than by hand.

Other Uses

Data mining techniques have also been used in a variety of other settings, such as university admissions processes and political campaigns.[8][9]

Ethical Implications

Data mining is an inherently morally neutral activity: it is simply the practice of working with large stored data sets, and the process itself does not account for how the data was generated or how it will eventually be used. The ethics of data mining come under scrutiny when the data being mined is something that the individuals or entities subject to the mining feel should not be disclosed to others. Issues also arise when sufficiently robust restrictions are not in place to hide the private characteristics of an individual’s personal identity in the online environment. Likewise, when firms neglect to take adequately tested precautions to protect a customer’s data as part of a larger collective set of consumer information, they effectively expose all their clients’ data to unprotected digital consumption and exploitation.

User Profiling and Identification

In an attempt to build accurate models, data miners in online environments sometimes use the personal information of individuals who visit their sites. The information collected generally includes, but is not limited to, geographic identifiers, records of past behavior, and information that can identify an individual through other intimate attributes such as user account details or IP addresses. Many users do not realize that their information is being collected, and many question the ethics of using an individual’s private information without their explicit, formal consent.[10] Additionally, data mining practices and tools have made the automated profiling of groups and individuals an easy task for a growing number of technically literate people. Using data mining to build customer profiles also raises ethical concerns because of the risks it creates for discrimination, de-individualization, and information asymmetries.[11]

Data collection can also reveal unintended information about subjects when separately collected identifying details are combined. While a data collection policy may explicitly exclude personally identifying information, the data that is collected can still identify a person once it is combined with other behavioral information, because each intersection of the collected data narrows the set of people it could describe.[12]
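The way combined attributes increase specificity can be demonstrated with a small sketch. It measures what share of records become unique as more fields are considered together. The visitor records and field names are hypothetical, and none of the fields is a direct identifier on its own.

```python
from collections import Counter

def unique_fraction(records, fields):
    """Share of records whose combination of the given fields is unique."""
    keys = [tuple(r[f] for f in fields) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)

# Hypothetical visitor records with no names or account details.
visitors = [
    {"zip": "48104", "birth_year": 1990, "browser": "Firefox"},
    {"zip": "48104", "birth_year": 1990, "browser": "Chrome"},
    {"zip": "48104", "birth_year": 1985, "browser": "Firefox"},
    {"zip": "48105", "birth_year": 1990, "browser": "Firefox"},
]

by_zip = unique_fraction(visitors, ["zip"])
by_zip_and_year = unique_fraction(visitors, ["zip", "birth_year"])
by_all = unique_fraction(visitors, ["zip", "birth_year", "browser"])
print(by_zip, by_zip_and_year, by_all)
```

With these sample records, each additional field makes more visitors distinguishable, until every record is unique even though no single field identifies anyone.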

Distortion of Truth

After privacy, the other main ethical issue in data mining is the distortion or misrepresentation of data, including the reporting of trends that are not actually there.[13] Whether they result from human or programmatic mishaps, flawed associations between data points can be severely costly to reputations, to economic standing, and to continued confidence in the reliability of the analysis method.

Anonymity and Ownership

Many information systems contain data on individuals, so maintaining the anonymity of the data and informing those whose information is being used of exactly how it is used and to what effect is not only important but in many cases legally required. Allowing individuals the freedom to opt in to or out of data mining processes is one way to ensure some form of ethical responsibility.

Discussion of data ownership emerges as a natural consequence of collecting data from disparate sources and then processing, analyzing, and summarizing it as a composite. A contested point in some circumstances is the distinction between raw source data (input) and the proprietary information product (output). Because the former, in aggregate, forms the building blocks of the latter, the resulting interdependency spawns many debates about who, if anyone, rightfully owns each intermediate information good and the finished analysis. These debates come down to where lines are drawn with respect to voluntary (or involuntary) data relinquishment and the perception of residual value, leaving the ground ripe for dispute and ethical ambiguity.

Use Case #1: Social Networking

Mining data found on social networking sites poses a particular challenge, as a great deal of personal information can be found on individuals’ social media pages. It is not uncommon for social media firms like Facebook and Twitter to sell the information mined from users’ personal pages to advertisers and academic researchers. Although this data is anonymized and/or collected from a large pool of public-facing accounts, there are still a number of valid reasons for apprehension about how the mining is carried out and its overall legitimacy.

To reduce users’ qualms about this type of activity and to ensure that such experimentation meets a sound ethical standard, committees known as institutional review boards (IRBs) are organized across both industry and the social science disciplines of higher education. They are entrusted with reviewing, monitoring, and approving any biomedical or behavioral research involving humans.[14] Because the study of human behavior now extends naturally into the digital realm, the same oversight is needed there, and IRBs serve a vital role as third-party ethical assessors of data collection and data manipulation for research-driven advertising or academic inquiry.

Firms often explicitly state in the terms and conditions of their sites that they reserve the right to sell or use an individual's personally volunteered information to serve their own business needs. Although this activity is limited by consumer privacy laws, vigilant monitoring by the Federal Trade Commission (FTC) and other such organizations is necessary to ensure that businesses conduct themselves both fairly and lawfully. Because a user agreement is usually displayed only once (typically during registration), if at all, users frequently undervalue the seriousness of its contents and treat it as unimportant and low priority.

As a majority of avid social media users are concerned only with a platform’s (superficially free) software and community assets, expressed through its ability to cultivate social capital or the manner in which it advocates for a “more open and connected world” (e.g. Facebook), the embedded costs that data mining poses to users go largely unnoticed, though they are certainly not without ramifications.[15] Many users feel that the pursuit is an irresponsible practice, irrespective of the safeguards and stipulations found in user agreements, and consider it an activity in which social media firms should not engage.[16] Many people who choose not to participate on Web-based social platforms do so for that reason, or because the sum total of one’s social contributions could be aggregated and packaged for sale to an outside party at any time, without due notification or an option to opt out. Increasing the transparency of what data is mined, and identifying when a user is particularly vulnerable (e.g. for disregarding privacy measures), has the potential to restore the necessary confidence in wary users.

Use Case #2: Business

Just as value can be derived from mining social sites, a wide array of businesses can leverage data mining techniques to improve their processes and, ultimately, their economic returns. Reducing expenses and maximizing profits is a shared challenge among for-profit institutions. As a result, it is not uncommon for corporations spanning various disciplines to use mined data in conjunction with analytic and predictive methods to gain a more informed business perspective. Ethically, many of the issues that confront businesses with respect to data mining are similar to those in other use cases: they revolve around the notion of privacy and the extent to which the resulting information can be shared with third parties. Because the benefits of such activity can be used to “...reduce fraud, anticipate resource demand, increase acquisition and curb customer attrition,” the attractive nature of the process generally outweighs any lingering moral uncertainties.[17]

Use Case #3: Government

Privacy is a theme that pertains to almost every application of data mining as a mechanism for revealing information, and government use is no different. There are hundreds of available uses, with some of the most recently relevant being the prediction and prevention of acts of terrorism. The broad and sweeping nature of data mining, especially in the pursuit of foreign intelligence, can have unintended effects despite its many arguably useful applications for combating terror and crimes against humanity. Among other issues, this exceptionally thorough process has been shown to threaten the privacy of United States citizens who, without any legitimate underlying connection, were mined and falsely found to have direct ties to criminal or extremist organizations. These individuals were swept into the mining through accidental or indirect associations; such incorrect identification, and the practice itself, which is predicated on a more-is-better mentality, can have dangerous repercussions when not safely managed and implemented.[18] Furthermore, correlative models built from the massive accumulation of mined data may, in some circumstances, display a statistically significant relationship where no meaningful one exists. A pitfall of data analysis on a very large sample is that nearly any difference in a quantitative measure, even a minor one, can appear to reflect a causal association without a realistic or practical underpinning.[19] The potential for flawed outcomes must be actively considered when drawing conclusions from comprehensive data mining activities and analysis.
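The large-sample pitfall can be illustrated numerically. The sketch below assumes a two-sample z-test with equal group sizes and a known, shared standard deviation. This is a textbook simplification chosen for illustration, not a method described in the cited sources, and all figures are hypothetical.

```python
from math import erfc, sqrt

def two_sided_p_value(mean_diff, std_dev, n_per_group):
    """p-value of a two-sample z-test with equal group sizes and a known, shared standard deviation."""
    standard_error = std_dev * sqrt(2 / n_per_group)
    z = abs(mean_diff) / standard_error
    return erfc(z / sqrt(2))  # two-sided tail probability of a standard normal

# The same tiny difference (0.1 on a scale whose standard deviation is 10)
# tested on a small sample and on a very large one.
small_sample = two_sided_p_value(0.1, 10, 100)
large_sample = two_sided_p_value(0.1, 10, 1_000_000)
print(small_sample, large_sample)
```

The difference between the groups is identical in both calls and only the sample size changes, yet the large sample drives the p-value toward zero. Statistical significance alone therefore says nothing about whether a difference is large enough to matter.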

Use Case #4: Healthcare

As more information related to individuals’ health records becomes digitized, data mining will become significantly more useful within the healthcare field. Patients’ health records, when used in aggregate, will provide significant insight into the causes of specific illnesses. The success of data mining in this context hinges on the process of converting significant amounts of data into standardized digital formats on software systems with low storage and search costs.

References


  1. Palace, Bill. "What Is Data Mining?" Data Mining. Anderson Graduate School of Management at UCLA, Mar. 1996. Web. 16 Dec. 2011.
  3. Olaru, C., & Wehenkel, L. (1999). Data mining. IEEE Computer Applications in Power, 12(3), 19-25. doi:10.1109/67.773801
  4. Shen Bin, Liu Yuan, & Wang Xiaoyi. (2010). Research on data mining models for the internet of things. 2010 International Conference on Image Analysis and Signal Processing (IASP) (pp. 127-132). Presented at the 2010 International Conference on Image Analysis and Signal Processing (IASP), IEEE. doi:10.1109/IASP.2010.5476146
  7. Data Mining in Sales Marketing and Finance. (2006). Introduction to Data Mining and its Applications (Vol. 29, pp. 411-438). Berlin, Heidelberg: Springer Berlin Heidelberg.
  11. Schermer, B.W. (2011). The limits of privacy in automated profiling and data mining. Computer Law and Security Report, 27(1), pp. 45-52.
  12. Furnas, Alexander (2012). Everything You Wanted to Know About Data Mining but Were Afraid to Ask. The Atlantic. Retrieved from
  13. Data Mining: Study Guide. (nd). North Carolina State University. Retrieved from
  14. Wikipedia: Institutional review board
  15. Facebook, Advertising
  17. SAS Institute. (2003). SAS data mining tools drive growing segment of business intelligence market [Press release]. Retrieved from
  18. Harris, Shane. Army project illustrates promise, shortcomings of data mining. (2005). Government Executive. Retrieved from
  19. Helberg, Clay (1996). Pitfalls of data analysis. Practical Assessment, Research & Evaluation, 5(5). Retrieved April 21, 2016 from