Lecture 2: Data Preprocessing
PRESENTED BY: HALIMA TAHIR
Data Preprocessing
Before data can be analyzed or used to build models, it must be prepared
properly. Raw data is often messy, incomplete, or scattered across sources.
Data preprocessing is like cleaning and organizing your room before starting
work: it makes the data ready for use.
It has four main steps:
1. Data Representation
2. Data Summarization
3. Data Cleaning
4. Data Integration and Transformation
Data Representation
Data representation is how data is stored and presented.
Examples: numbers, text, images, tables, graphs.
If the data is not in a usable form, we convert it into a standard format that a
computer can process.
👉 Think of it as writing notes neatly in one notebook instead of random papers.
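The conversion to a standard format can be sketched in a few lines of Python. This is only an illustration: the sample records, the field names `name` and `marks`, and the helper `to_standard` are all made up for this example.

```python
# Hypothetical raw records: the same kind of information written in different forms.
raw_records = [
    "Ali, 85",                        # comma-separated text
    ("Sara", "92"),                   # tuple with the mark stored as text
    {"name": "Omar", "marks": 78},    # already a dictionary
]

def to_standard(record):
    """Convert one raw record into a dict with a text name and an integer mark."""
    if isinstance(record, str):
        name, marks = record.split(",")
    elif isinstance(record, tuple):
        name, marks = record
    else:
        name, marks = record["name"], record["marks"]
    return {"name": str(name).strip(), "marks": int(marks)}

# Every record now has the same shape, so later steps can treat them alike.
students = [to_standard(r) for r in raw_records]
print(students)
```

After this step every record has the same fields and types, which is what the later steps rely on.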
Data Summarization
Datasets can be huge, so we compute summaries to understand them better.
Example: Instead of reading the marks of 1,000 students one by one, we calculate
the average, highest, and lowest marks.
Summaries help us see patterns quickly without going through every record.
👉 Like making short notes from a big chapter.
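A minimal sketch of such a summary, using Python's built-in functions and the standard `statistics` module. The list of marks is made-up sample data, kept short so the result is easy to check by hand.

```python
import statistics

# Made-up sample marks (a real class might have 1,000 of these).
marks = [72, 85, 90, 64, 58, 77, 93, 81]

summary = {
    "count": len(marks),
    "average": statistics.mean(marks),
    "highest": max(marks),
    "lowest": min(marks),
}
print(summary)
```

Four numbers now describe the whole list, and the same code works unchanged for a list of any length.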
Data Cleaning
Real-world data usually has mistakes or missing values.
Example:
Some entries are empty (missing age).
Some are wrong (age written as 500).
Some are duplicates (same person added twice).
In cleaning, we fix errors, fill missing values, and remove duplicates.
👉 It’s like washing vegetables before cooking.
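The three cleaning actions can be sketched as follows. The records are made up, and two choices here are assumptions for illustration rather than fixed rules: an age above 120 is treated as an error, and missing ages are filled with the median of the valid ages (the mean or another rule could be used instead).

```python
import statistics

# Made-up records showing the three problems: missing, wrong, duplicate.
records = [
    {"name": "Ali", "age": 21},
    {"name": "Sara", "age": None},   # missing age
    {"name": "Omar", "age": 500},    # impossible age
    {"name": "Ali", "age": 21},      # duplicate of the first record
    {"name": "Hina", "age": 23},
]

MAX_AGE = 120  # assumed upper limit for a valid age

# 1. Remove exact duplicates, keeping the first occurrence.
seen = set()
unique = []
for r in records:
    key = (r["name"], r["age"])
    if key not in seen:
        seen.add(key)
        unique.append(dict(r))  # copy so the original records stay unchanged

# 2. Treat out-of-range ages as missing.
for r in unique:
    if r["age"] is not None and not (0 <= r["age"] <= MAX_AGE):
        r["age"] = None

# 3. Fill missing ages with the median of the valid ages.
valid_ages = [r["age"] for r in unique if r["age"] is not None]
fill_value = statistics.median(valid_ages)
for r in unique:
    if r["age"] is None:
        r["age"] = fill_value

cleaned = unique
print(cleaned)
```

Wrong values are first marked as missing so that they do not distort the fill value computed in the last step.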
Data Integration and Transformation
Data often comes from many different sources (databases, Excel files,
websites).
We combine (integrate) them into one dataset.
Transformation means converting the data into a common format or scale.
Examples:
Changing all dates to the same style (DD/MM/YYYY).
Scaling numbers (for example, marks out of 50 converted to a percentage).
👉 It’s like collecting ingredients from different shops and then
cutting/adjusting them before cooking.
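A small sketch of integration and transformation together, using only the Python standard library. The two sources, their date styles, and the field names (`exam_date`, `marks`, `out_of`) are hypothetical and chosen only to show the idea.

```python
from datetime import datetime

# Two made-up sources that describe students in different formats.
source_a = [  # e.g. exported from a database: YYYY-MM-DD dates, marks out of 50
    {"name": "Ali", "exam_date": "2024-03-15", "marks": 40, "out_of": 50},
]
source_b = [  # e.g. typed into a spreadsheet: MM/DD/YYYY dates, marks out of 100
    {"name": "Sara", "exam_date": "03/18/2024", "marks": 92, "out_of": 100},
]

def transform(record, date_format):
    """Return a new record with the date as DD/MM/YYYY and marks as a percentage."""
    parsed = datetime.strptime(record["exam_date"], date_format)
    return {
        "name": record["name"],
        "exam_date": parsed.strftime("%d/%m/%Y"),
        "percentage": record["marks"] * 100 / record["out_of"],
    }

# Integrate: transform each source with its own date format, then combine.
combined = (
    [transform(r, "%Y-%m-%d") for r in source_a]
    + [transform(r, "%m/%d/%Y") for r in source_b]
)
print(combined)
```

Each source is transformed with knowledge of its own format before combining, because a date such as 03/04/2024 cannot be interpreted safely once records from different sources are mixed.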