What is data mining?
Data mining is the process of sifting through enormous amounts of data kept in repositories, utilizing pattern recognition technology as well as mathematical and statistical tools, to identify new significant connections, patterns, and trends.
The practice of detecting patterns in data is known as data mining. The procedure must be fully automated or more commonly semi-automated. The patterns revealed must be relevant in the sense that they lead to some benefit, usually a financial benefit. The information is always present in large amounts.
"We are swamped in information yet famished for knowledge," wrote John Naisbitt in his book Megatrends  in 1984. Today's issue isn't that there isn't enough data and information coming in. In most fields, we are overloaded with information. Rather, the issue is that there aren't enough qualified human analysts capable of converting all of this data into knowledge, and then wisdom further up the classification tree. A lucky convergence of circumstances has propelled the continued amazing progress in the field of data mining and knowledge discovery.
- The rapid expansion of data collection
- The data is stored in data warehouses so that the entire company has access to a current, accurate database.
- The availability of expanded data access via intranets and Web navigation
- In a globalized economy, the drive to grow market share is intense
- The creation of commercial data mining software suites that are available off-the-shelf
- Massive increases in computing power and storage capacity
FALLACIES OF DATA MINING
Jen Que Louie, president of Nautilus Systems, Inc., highlighted four data mining fallacies before the U.S House of Representatives Subcommittee on Technology, Information Policy, Intergovernmental Relations, and Census.
- We can employ data mining techniques to uncover answers to our problems by letting them loose on our data repositories.
- The data mining process is self-contained and requires little to no human intervention.
- Data mining quickly pays for itself.
- Data mining software packages are simple to use and straightforward.
- Data mining will reveal the root reasons for our company's or research's problems.
- Data mining will automatically clean up a messed-up database.
Data mining process
Researchers and analysts are merely attempting to characterize the patterns and trends that exist within data. A pollster might find evidence indicating persons who have been laid off are less inclined to vote for the current president in the presidential election. Patterns and trends are frequently described in such a way that possible explanations are suggested. Those who have been laid off, for example, are now in a worse financial position than they were when the incumbent was elected and would prefer an alternative.
The only difference between estimation and classification is that the goal variable is numerical rather than categorical. Models are constructed using "full" records, which include both the target variable's value and the predictors. Then, based on the values of the predictors, estimations of the value of the target variable are created for new observations.
ü Calculating how much a randomly selected family of four will spend on back-to-school shopping this autumn.
ü Calculating the percentage loss in rotary movement suffered by a running back in the National Football League who has had a knee injury.
ü Estimating how many points Patrick Ewing will score per game if he is double-teamed in the playoffs.
ü Estimating a graduate student's grade-point average (GPA) based on the student's undergraduate GPA
Prediction is identical to categorization and estimate, with the exception that the outcomes are in the future. The following are some examples of prediction problems in business and research:
-Trying to forecast the price of a stock three months out
-Estimating the percentage increase in road deaths that will occur next year if the speed limit is raised.
-Based on a comparison of club statistics, predicting the victor of the baseball World Series this fall.
-Predicting whether a drug discovery molecule will result in a profitable new drug for a pharmaceutical company.
There is a goal categorical variable in classification, such as income bracket, which can be divided into three classes or categories: high income, moderate-income, and low income. The data mining model looks at a huge number of records, each of which includes data on the target variable as well as a collection of input or predictor factors.
Trying to figure out if a credit card transaction is fake.
-Assigning a new student to a specific track based on their special needs
-Determining if a mortgage application poses a favorable or bad credit risk
-Determining whether or if a disease is present
-Determining whether a will was drafted by the deceased or by someone else fraudulently
-Determining whether or not particular financial or personal behaviors signal the presence of a potential terrorist threat
Clustering is the classification of records, observations, or cases into groups of related items. A cluster is a group of records that are similar to one another but not to those in other clusters. Clustering varies from classification in that it does not have a target variable. A target variable's value is not classified, estimated, or predicted by the clustering job.
Clustering algorithms on the other way attempt to divide the entire data set into relatively homogeneous subgroups or clusters, with the similarity of records inside the cluster maximized and the similarity of records outside the cluster minimized.
The effort of discovering which attributes “go together” in data mining is known as the association task. The task of association, also known as affinity analysis or market basket analysis in the business world, aims to discover criteria for quantifying the relationship between two or more attributes. The association rules are of the form "If antecedent, then consequent," along with a measure of the rule's support and confidence.
In business and research, examples of association tasks include:
ü Investigating the percentage of a company's cell phone plan subscribers that accept an offer of a service upgrade.
ü Examining the percentage of children who are good readers because their parents read to them.
ü Predicting telecommunications network degradation
ü Identifying which commodities are frequently purchased together and which are never purchased together in a supermarket
ü estimating the number of cases in whom a new medicine may have serious side effects