DATA MINING
Lecture # 02
Instructor: Mr. Sharjeel Ahmed
Slide Elements
• Data Preprocessing
• Data Representation
• Data Summarization
• Data Cleaning
• Data Integration and Transformation
DATA PREPROCESSING
• Data Preprocessing is the process of transforming raw data into an
understandable format.
• It involves cleaning, transforming, and organizing raw data into a
format suitable for analysis.
• The goal of data preprocessing is to improve the quality of the data
and to make it more suitable for the specific data mining task.
Data Preprocessing - Steps
1. Data Representation:
• Data representation involves selecting the appropriate format or
structure for your data.
• This step includes choosing the data types (e.g., numerical,
categorical) for each attribute or feature in your dataset.
• It also includes determining how to encode and represent text, dates,
and other information.
• Common techniques for data representation include one-hot encoding
for categorical data, normalization for numerical data, and text
vectorization for text data.
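The three techniques named above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names (`one_hot`, `min_max`, `bag_of_words`) and the sample values are assumptions made for this example, and "normalization" is taken here to mean min-max rescaling to [0, 1].

```python
from collections import Counter


def one_hot(values):
    """Map each category to a 0/1 vector with one position per distinct category."""
    categories = sorted(set(values))
    vectors = [[1 if v == c else 0 for c in categories] for v in values]
    return vectors, categories


def min_max(values):
    """Rescale numbers linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bag_of_words(texts):
    """Turn each text into a vector of word counts over a shared vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for words in tokenized for w in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[w] for w in vocab])
    return vectors, vocab


encoded, categories = one_hot(["red", "green", "red"])
print(categories)              # ['green', 'red']
print(encoded)                 # [[0, 1], [1, 0], [0, 1]]
print(min_max([10, 20, 30]))   # [0.0, 0.5, 1.0]

vectors, vocab = bag_of_words(["data mining", "mining raw data data"])
print(vocab)                   # ['data', 'mining', 'raw']
print(vectors)                 # [[1, 1, 0], [2, 1, 1]]
```

In practice, libraries such as pandas and scikit-learn provide equivalents of all three.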
2. Data Summarization:
• Data summarization focuses on reducing the complexity of the data
while retaining important information. This can be useful for
understanding the data's characteristics and detecting outliers.
• Summarization techniques include:
• Descriptive statistics: Calculating measures like mean, median,
standard deviation, and quartiles to provide an overview of the data.
• Data visualization: Creating plots and charts to visualize data
distributions and patterns.
• Dimensionality reduction: Techniques like Principal Component
Analysis (PCA) to reduce the number of features while preserving
as much of the data's variance as possible.
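The descriptive statistics listed above can be computed with Python's standard `statistics` module. The sample values below are made up for illustration.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

mean = statistics.mean(values)
median = statistics.median(values)
stdev = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
q1, q2, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points

print("mean:", mean)        # mean: 6
print("median:", median)    # median: 5
print("std dev:", round(stdev, 3))
print("quartiles:", q1, q2, q3)
print("interquartile range:", q3 - q1)
```

Comparing the mean with the median is a quick check for skew: a mean well above the median suggests a few large values are pulling it up.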
3. Data Cleaning:
• Data cleaning is the process of identifying and correcting errors,
inconsistencies, and missing values in the dataset.
• Common data cleaning tasks include:
• Handling missing data: Imputing missing values or removing rows
or columns with too many missing values.
• Handling outliers: Detecting and addressing outliers that may
skew analysis results.
• Consistency checks: Ensuring data consistency and resolving
conflicting or duplicate records.
• Noise reduction: Reducing random or irrelevant variations in the
data.
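A minimal sketch of three of the cleaning tasks above. The helper names and sample values are assumptions for this example. Mean imputation and the 1.5 x IQR rule (Tukey's rule) are each only one of several common choices.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


ages = [25, None, 31, 28, None, 36]
print(impute_mean(ages))       # [25, 30, 31, 28, 30, 36]

readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))  # [95]

rows = [("NY", "2024-01-01"), ("LA", "2024-01-01"), ("NY", "2024-01-01")]
print(drop_duplicates(rows))   # [('NY', '2024-01-01'), ('LA', '2024-01-01')]
```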
4. Data Integration and Transformation:
• Data integration involves combining data from multiple sources into a single,
unified dataset. This is often necessary when working with real-world data
collected from various systems or databases.
• Types of integration:
• Tight coupling: Data from the sources is copied into a single physical
location (e.g., a data warehouse). Queries then run against the combined
copy rather than against the individual sources.
• Loose coupling: Data is not physically integrated. An interface accepts
queries, passes them to the source databases, and combines the results.
The data remains in the source databases only.
• Data transformation includes converting data into a different format or
structure to make it more suitable for analysis.
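The difference between tight and loose coupling can be illustrated with a toy sketch. The two in-memory "databases" and the function names below are hypothetical. A real system would use a data warehouse or a query mediator instead.

```python
# Two hypothetical source systems, each holding its own records.
sales_db = [{"customer": "C1", "amount": 120}, {"customer": "C2", "amount": 80}]
web_db = [{"customer": "C1", "amount": 40}]

# Tight coupling: copy everything into one physical store up front.
warehouse = [dict(row, source="sales") for row in sales_db]
warehouse += [dict(row, source="web") for row in web_db]


def total_from_warehouse(customer):
    """Answer the query from the combined copy."""
    return sum(r["amount"] for r in warehouse if r["customer"] == customer)


# Loose coupling: nothing is copied; an interface queries each source on demand.
def total_via_interface(customer, sources=(sales_db, web_db)):
    """Answer the query by asking every source and combining the results."""
    return sum(
        r["amount"] for src in sources for r in src if r["customer"] == customer
    )


# A later change in a source is visible through the interface immediately,
# while the warehouse copy stays stale until it is reloaded.
web_db.append({"customer": "C1", "amount": 10})
print(total_from_warehouse("C1"))  # 160
print(total_via_interface("C1"))   # 170
```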
• Some common data integration and transformation techniques include:
• Merging datasets: Combining data from different sources based on common
keys or attributes.
• Aggregation: Summarizing data by grouping it based on certain attributes (e.g.,
calculating the total sales for each product category).
• Feature engineering: Creating new features by combining, transforming, or
extracting information from existing features.
• Scaling and normalization: Rescaling features so that they have comparable
ranges (e.g., min-max scaling to [0, 1], or standardization to zero mean and
unit variance).
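The four techniques above can be sketched on a small, made-up sales dataset. All field names and values are assumptions for illustration, and scaling is shown as min-max scaling.

```python
from collections import defaultdict

# Hypothetical order records and a product lookup from a second source.
orders = [
    {"product_id": 1, "quantity": 2, "unit_price": 10.0},
    {"product_id": 2, "quantity": 1, "unit_price": 25.0},
    {"product_id": 1, "quantity": 3, "unit_price": 10.0},
    {"product_id": 3, "quantity": 4, "unit_price": 5.0},
]
categories = {1: "stationery", 2: "electronics", 3: "stationery"}

# Merging: attach the category via the common key product_id.
# Feature engineering: derive revenue from quantity and unit_price.
merged = [
    {
        **order,
        "category": categories[order["product_id"]],
        "revenue": order["quantity"] * order["unit_price"],
    }
    for order in orders
]

# Aggregation: total revenue per product category.
totals = defaultdict(float)
for row in merged:
    totals[row["category"]] += row["revenue"]
print(dict(totals))  # {'stationery': 70.0, 'electronics': 25.0}

# Scaling: min-max scale revenue into [0, 1].
revenues = [row["revenue"] for row in merged]
lo, hi = min(revenues), max(revenues)
scaled = [(r - lo) / (hi - lo) for r in revenues]
print(scaled)        # [0.0, 0.5, 1.0, 0.0]
```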
Data Preprocessing – Example
Scenario: You have a dataset of daily temperature records for different cities.
Step 1: Data Representation
• Convert city names into numerical codes (e.g., 1 for New York, 2 for Los
Angeles).
• Convert temperatures to a common unit (e.g., Celsius) for consistency.
Step 2: Data Summarization
• Calculate the average temperature for each city to understand typical
temperatures.
• Create a simple line chart to visualize temperature variations over time.
Step 3: Data Cleaning
• Identify and address missing temperature values.
• Handle extreme outliers (e.g., incorrect temperature readings).
• Ensure date formats are consistent and valid.
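Steps 1 to 3 can be sketched as follows. The records, the city codes, and the plausible range of -90 to 60 °C are assumptions made for this example. Cleaning runs before the averages are computed so that a bad reading does not distort them.

```python
import statistics
from datetime import date

# Hypothetical raw records: (city, ISO date, temperature, unit).
raw = [
    ("New York", "2024-01-01", 41.0, "F"),
    ("New York", "2024-01-02", None, "F"),      # missing value
    ("New York", "2024-01-03", 5.0, "C"),
    ("Los Angeles", "2024-01-01", 18.0, "C"),
    ("Los Angeles", "2024-01-02", 999.0, "C"),  # incorrect reading
    ("Los Angeles", "2024-01-03", 20.0, "C"),
]
city_codes = {"New York": 1, "Los Angeles": 2}


def to_celsius(value, unit):
    """Convert Fahrenheit to Celsius; Celsius values pass through unchanged."""
    return (value - 32) * 5 / 9 if unit == "F" else value


records = []
for city, day, temp, unit in raw:
    parsed_day = date.fromisoformat(day)  # raises ValueError for an invalid date
    if temp is not None:
        temp = to_celsius(temp, unit)
        if not -90 <= temp <= 60:  # assumed plausible range; treat as missing
            temp = None
    records.append({"city": city_codes[city], "date": parsed_day, "temp_c": temp})

# Fill each missing value with the mean of that city's observed values.
for code in sorted(city_codes.values()):
    observed = [
        r["temp_c"] for r in records
        if r["city"] == code and r["temp_c"] is not None
    ]
    city_mean = statistics.mean(observed)
    for r in records:
        if r["city"] == code and r["temp_c"] is None:
            r["temp_c"] = city_mean

# Summarize: average temperature per city code.
averages = {
    code: statistics.mean([r["temp_c"] for r in records if r["city"] == code])
    for code in sorted(city_codes.values())
}
print(averages)  # {1: 5.0, 2: 19.0}
```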
Step 4: Data Integration and Transformation
• Combine the temperature data with additional information, such as city
populations.
• Create a new feature, like "temperature change from the previous day," to
capture trends.
• Scale temperature values to a common range (e.g., 0 to 1) for fair comparisons.
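Step 4 can be sketched on already-cleaned data. The temperatures are made up, and the population figures are placeholders, not real census values.

```python
# Hypothetical cleaned daily temperatures in Celsius, one list per city.
temps = {
    "New York": [5.0, 7.0, 4.0],
    "Los Angeles": [18.0, 19.0, 22.0],
}
# Placeholder populations (not real figures), standing in for a second source.
population = {"New York": 8_000_000, "Los Angeles": 4_000_000}

# Bounds for min-max scaling across all cities.
all_values = [t for series in temps.values() for t in series]
lo, hi = min(all_values), max(all_values)

rows = []
for city, series in temps.items():
    for i, temp in enumerate(series):
        rows.append({
            "city": city,
            "day": i + 1,
            "temp_c": temp,
            # Integration: attach the city's population.
            "population": population[city],
            # New feature: change from the previous day (undefined on day 1).
            "change": None if i == 0 else temp - series[i - 1],
            # Scaling: map temperatures into [0, 1].
            "temp_scaled": (temp - lo) / (hi - lo),
        })

for row in rows:
    print(row)
```

Scaling over all cities together keeps values comparable between cities. Scaling each city separately would instead highlight variation within a city.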
Following these steps, you've prepared the dataset for analysis, allowing you to
gain insights into temperature trends, make predictions, or conduct data mining
tasks. Data preprocessing enhances data quality and makes it suitable for
various analytical and machine learning applications.