Data mining is the process of automatically discovering useful information in large data repositories. It blends traditional data analysis methods with sophisticated algorithms for processing large volumes of data. It is an integral part of knowledge discovery in databases (KDD), the overall process of converting raw data into useful information.
Data mining techniques can be used to support a wide range of business intelligence applications such as customer profiling, targeted marketing, workflow management, store layout, and fraud detection.
Looking up individual records using a database management system, or finding particular Web pages via a query to an Internet search engine, is not data mining but information retrieval. Data mining techniques have, however, been used to enhance information retrieval systems.
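The distinction can be illustrated with a small sketch. The transaction data, the `min_support` threshold, and the pair-counting approach are all assumptions chosen for illustration; real data mining algorithms are considerably more sophisticated. The lookup fetches a record whose key is already known, while the mining step discovers co-occurring items without being told which ones to look for.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each transaction is a set of purchased items.
transactions = {
    "t1": {"bread", "milk"},
    "t2": {"bread", "diapers", "beer"},
    "t3": {"milk", "diapers", "beer"},
    "t4": {"bread", "milk", "diapers", "beer"},
}

# Information retrieval: fetch a known record by its key.
record = transactions["t2"]

# Data mining (greatly simplified): count how often each pair of items
# appears together, then keep the pairs that occur often enough.
pair_counts = Counter(
    pair
    for items in transactions.values()
    for pair in combinations(sorted(items), 2)
)
min_support = 3  # assumed threshold: pair must occur in at least 3 transactions
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}

print(frequent_pairs)  # {('beer', 'diapers'): 3}
```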
The data mining process consists of a series of transformation steps, from data preprocessing to postprocessing of data mining results.
Some of the methods used for preprocessing include:
- sampling, which selects a representative subset from a large population of data.
- transformation, which manipulates raw data to produce a single input.
- normalization, which rescales attribute values to a common range so that they can be compared on an equal footing.
- feature extraction, which pulls out specified data that is significant in some particular context.
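A minimal sketch of three of these methods follows. The record layout, the sample size, and the helper `min_max_scale` are hypothetical. Normalization is taken here in the min-max rescaling sense, which is one common choice among several.

```python
import random

# Hypothetical raw records: (customer_id, age, annual_spend)
random.seed(0)  # fixed seed so the sample is reproducible
population = [(i, 18 + i % 50, 100.0 * (i % 7)) for i in range(1000)]

# Sampling: draw a simple random sample without replacement.
sample = random.sample(population, k=100)

# Feature extraction: keep only the attribute relevant to the task.
ages = [age for _, age, _ in sample]

def min_max_scale(values):
    """Rescale values linearly into [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Normalization (min-max variant): bring ages into a common [0, 1] range.
scaled_ages = min_max_scale(ages)
```

Simple random sampling is only representative when the population is reasonably homogeneous; skewed data may call for stratified sampling instead.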
Preprocessing
Raw data are highly susceptible to noise, missing values, and inconsistency. The purpose of preprocessing is to transform the raw input data into an appropriate format for subsequent analysis.
Data preprocessing methods are divided into four categories:
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
Raw data tend to be incorrect, inconsistent, incomplete, duplicated, and noisy. Data cleaning is the process of removing incorrect, duplicate, or incomplete data from a dataset. Data integration is the process of combining data from multiple sources to create unified sets of information. Data transformation is the process of converting data from one format into another that is appropriate for mining. Data reduction is the process of obtaining a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
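The four categories can be sketched as a small pipeline. The records, field names, and helper functions below are hypothetical, and each step shows only one simple form of its category: cleaning drops incomplete and duplicate rows, and reduction is done by attribute selection.

```python
# Hypothetical records from two sources; None marks a missing value.
crm = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": None},  # incomplete
    {"id": 1, "name": "Ann", "age": 34},    # duplicate
    {"id": 3, "name": "Cy", "age": 51},
]
sales = {1: 120.0, 3: 80.0, 4: 15.0}  # id -> total spend

def clean(rows):
    """Data cleaning: drop incomplete rows and exact duplicates."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if None in row.values() or key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out

def integrate(rows, spend_by_id):
    """Data integration: join the two sources on the shared id."""
    return [{**r, "spend": spend_by_id[r["id"]]}
            for r in rows if r["id"] in spend_by_id]

def transform(rows):
    """Data transformation: derive a categorical age band from numeric age."""
    return [{**r, "age_band": "under_40" if r["age"] < 40 else "40_plus"}
            for r in rows]

def reduce_columns(rows, keep):
    """Data reduction: keep only the attributes needed for mining."""
    return [{k: r[k] for k in keep} for r in rows]

prepared = reduce_columns(
    transform(integrate(clean(crm), sales)), ["age_band", "spend"]
)
print(prepared)
# [{'age_band': 'under_40', 'spend': 120.0}, {'age_band': '40_plus', 'spend': 80.0}]
```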
Postprocessing
Postprocessing methods can be categorized into knowledge filtering, interpretation and explanation, evaluation, and knowledge integration. Data visualization and data summarization are techniques used to interpret the extracted knowledge and to gain insight into the data for better decision making.
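As a rough illustration of knowledge filtering and summarization, the sketch below assumes the mining step produced association rules with support and confidence scores. The rules, thresholds, and function names are hypothetical.

```python
from statistics import mean

# Hypothetical mined rules: (antecedent, consequent, support, confidence)
rules = [
    ("diapers", "beer", 0.30, 0.75),
    ("bread", "milk", 0.40, 0.50),
    ("milk", "eggs", 0.05, 0.90),
]

def filter_rules(rules, min_support, min_confidence):
    """Knowledge filtering: keep rules that meet both thresholds."""
    return [r for r in rules
            if r[2] >= min_support and r[3] >= min_confidence]

def summarize(rules):
    """Data summarization: report rule count and average confidence."""
    return {
        "count": len(rules),
        "mean_confidence": mean(r[3] for r in rules) if rules else None,
    }

kept = filter_rules(rules, min_support=0.10, min_confidence=0.60)
summary = summarize(kept)
print(kept)     # [('diapers', 'beer', 0.3, 0.75)]
print(summary)  # {'count': 1, 'mean_confidence': 0.75}
```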
The practical difficulties encountered by traditional data analysis techniques are:
- Scalability
- High Dimensionality
- Heterogeneous & Complex Data
- Data Ownership & Distribution
- Non-traditional analysis