Extracting Knowledge from Data
Data Preparation, Enrichment, Encoding, and Standardization
Presented by: Bejaoui Ahmed
Plan
• Why is Data Preparation Important?
• Data Preparation and Cleaning
• Data Enrichment
• Data Encoding
• Data Standardization
• Data Normalization
• Challenges in Data Preparation
• Future Trends
Introduction
Extracting knowledge from data involves going beyond basic analysis; it
requires that data be carefully prepared, enriched, encoded, and
standardized. This process improves data quality, increases model
accuracy, and enhances decision-making. Today, we’ll explore key steps
like data cleaning, enrichment, encoding, and standardization.
Why is Data Preparation Important?
Data often comes in raw form with inconsistencies, missing values, and errors.
Properly prepared data:
• Increases model accuracy: Clean data improves prediction outcomes.
• Saves time and resources: Reduces the need for troubleshooting during analysis.
• Prevents biased or misleading results: Inaccurate or unclean data can distort findings and decisions.
Data Preparation and Cleaning
1. Handling Missing Data: Use deletion to remove incomplete entries, or imputation to fill gaps with statistical estimates, balancing data integrity and completeness (see the sketch after this list).
2. Handling Outliers: Use statistical methods like the IQR (interquartile range) or Z-score to identify extreme values, then treat outliers by removing, transforming, or replacing them as appropriate based on domain knowledge.
3. Data Consistency: Ensure uniform formats (e.g., dates, currencies) across the dataset.
4. Removing Duplicates: Identify and eliminate duplicate records that may distort analysis results.
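A minimal pandas sketch of steps 1, 2, and 4 (step 3 would be format coercion, e.g. pd.to_datetime); the table, column names, and values are hypothetical:

```python
import pandas as pd

# Hypothetical table: one missing age, one extreme income, one duplicated row.
df = pd.DataFrame({
    "age":    [25, 32, None, 41, 29, 29],
    "income": [41_000, 45_000, 52_000, 900_000, 48_000, 48_000],
})

# 1. Handling missing data: impute gaps with the column median.
df = df.fillna(df.median(numeric_only=True))

# 2. Handling outliers: keep rows within 1.5 * IQR of the income quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 4. Removing duplicates: drop exact duplicate records.
df = df.drop_duplicates()
print(df)
```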
Data Enrichment
Adding new relevant data to enhance the existing dataset and improve
analysis.
Types of Data Enrichment:
External Data: Adding information from other sources (e.g., social media, weather data).
Feature Engineering: Creating new features from the existing data (e.g., combining date and time into one feature; sketched below).
Benefits:
Enriched data provides deeper insights.
Improves model performance by adding relevant context or features.
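As a quick illustration of feature engineering, here is a minimal pandas sketch (the sales table, column names, and values are hypothetical) that combines separate date and time columns into one timestamp and derives new features from it:

```python
import pandas as pd

# Hypothetical sales records with separate date and time columns.
sales = pd.DataFrame({
    "date":   ["2024-03-01", "2024-03-01", "2024-03-02"],
    "time":   ["09:15", "17:40", "11:05"],
    "amount": [120.0, 80.5, 199.9],
})

# Feature engineering: combine date and time into a single timestamp feature...
sales["timestamp"] = pd.to_datetime(sales["date"] + " " + sales["time"])

# ...then derive new features that a model can exploit.
sales["hour"] = sales["timestamp"].dt.hour
sales["day_of_week"] = sales["timestamp"].dt.day_name()
print(sales[["timestamp", "hour", "day_of_week"]])
```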
Data Encoding
Converting categorical (non-numerical) data into numerical form so that
machine learning algorithms can process it.
Techniques:
Label Encoding:
• Assigns an integer to each category.
• Example: "Red" = 1, "Green" = 2, "Blue" = 3. Best suited for ordinal data, where the integer order is meaningful.
One-Hot Encoding:
• Creates binary columns for each category.
• Example: a "Color" column with values "Red," "Green," and "Blue" becomes three binary columns (both techniques are sketched below).
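A minimal pandas sketch of both techniques, using the slide's color example (the integer mapping and column names are illustrative):

```python
import pandas as pd

colors = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# Label encoding: one integer per category (hypothetical mapping, as on the slide).
colors["color_label"] = colors["color"].map({"Red": 1, "Green": 2, "Blue": 3})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(colors["color"], prefix="color")
print(colors.join(one_hot))
```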
Frequency Encoding:
Replaces categories with their frequency in the dataset.
Example: in a column where "Red" appears in 50% of rows, "Green" in 30%, and "Blue" in 20%, the values are replaced by 0.5, 0.3, and 0.2 respectively (sketched below).
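A minimal pandas sketch of frequency encoding, assuming a hypothetical color column with a 5:3:2 split:

```python
import pandas as pd

# Hypothetical column where Red, Green, and Blue appear in a 5:3:2 ratio.
colors = pd.Series(["Red"] * 5 + ["Green"] * 3 + ["Blue"] * 2, name="color")

# Frequency encoding: replace each category with its relative frequency.
freq = colors.value_counts(normalize=True)  # Red 0.5, Green 0.3, Blue 0.2
encoded = colors.map(freq)
print(encoded.head())
```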
Data Standardization
Rescaling data so that each feature has a mean of zero and a standard deviation of one: z = (x − μ) / σ.
Why It’s Important:
Algorithms like k-Means, SVM (Support Vector Machines), and Gradient Descent are sensitive to data scaling.
Standardization ensures that large-scale features don’t dominate smaller-
scale features.
Example of Data Standardization
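A minimal sketch using scikit-learn's StandardScaler (the feature matrix below is hypothetical): each column is rescaled to mean 0 and standard deviation 1, so the large-scale income column no longer dwarfs age.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: income (large scale) next to age (small scale).
X = np.array([[30_000.0, 25.0],
              [45_000.0, 32.0],
              [52_000.0, 41.0],
              [48_000.0, 29.0]])

# Standardize each column: z = (x - mean) / std.
X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0).round(6))  # ~[0. 0.]
print(X_std.std(axis=0).round(6))   # ~[1. 1.]
```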
Data Normalization
Rescaling data to the range [0, 1] without changing the shape of its
distribution: x′ = (x − min) / (max − min).
When to Use:
• It is preferred when working with distance-based algorithms such as
k-NN, and with neural networks, which train more stably on inputs in
a bounded range.
Example of Data Normalization
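A minimal sketch using scikit-learn's MinMaxScaler on the same hypothetical features; each column is mapped to the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Min-max normalization maps each column to [0, 1]:
# x_norm = (x - min) / (max - min)
X = np.array([[30_000.0, 25.0],
              [45_000.0, 32.0],
              [52_000.0, 41.0],
              [48_000.0, 29.0]])

X_norm = MinMaxScaler().fit_transform(X)
print(X_norm)  # every column now spans exactly 0..1
```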
Challenges in Data Preparation
• High Dimensionality:
Datasets with many features can lead to overfitting or long processing times.
• Incomplete or Inconsistent External Data:
Data enrichment may introduce inconsistencies or new missing values.
• Complexity in Encoding:
Some categorical features have too many levels, making encoding
computationally expensive.
Future Trends
• Automated Data Cleaning (AutoML): Uses AI to automatically clean and prepare data, saving time and improving data quality.
• Data-Centric AI: Prioritizes data quality improvements over model tuning, ensuring better model performance from well-prepared data.
• Real-Time Data Preparation: Enables on-the-fly data cleaning and transformation, essential for streaming analytics and IoT.
• Synthetic Data Generation: Creates artificial, privacy-safe data to supplement real datasets, improving model training without compromising sensitive information.
Conclusion
Data preparation, enrichment, encoding, and standardization are
foundational to effective data analysis and machine learning.
Prioritizing these steps ensures cleaner, more consistent data and
enhances model performance.