Data Basics for ML
TYPES OF DATA
Unstructured Data
No predefined format or structure.
Harder to store and analyze compared to structured data.
Examples:
Text data (emails, social media posts, customer reviews).
Multimedia files (images, videos, audio).
Sensor data, logs, and web pages.
E.g. "The delivery was late, and the product was damaged. Very disappointed!" (customer review).
DATASETS AND FEATURES
Dataset
A dataset is a collection of data points organized in a structured format. It can be:
Tabular (structured datasets): Excel files, CSVs, SQL databases.
Non-tabular (unstructured datasets): Image collections, text files, video archives.
Features (Attributes/Variables)
Features are individual measurable properties or characteristics of the data points in a dataset.
Numerical Features: Age, income, temperature.
Categorical Features: Gender, product category, country.
Textual Features: Customer reviews, chat messages.
Date/Time Features: Transaction timestamps, event logs.
EXAMPLE OF A DATASET WITH FEATURES
Customer ID = Identifier (not a feature).
Age, Gender, Country = Features.
Purchase Amount = Target variable (if predicting spending).
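The roles listed above can be sketched in a few lines of Python. This is a minimal illustration only: the rows, the column names, and the helper `split_features_and_target` are hypothetical and invented for this example.

```python
# Hypothetical rows; every value here is invented purely for illustration.
rows = [
    {"customer_id": 101, "age": 34, "gender": "F", "country": "USA", "purchase_amount": 120.50},
    {"customer_id": 102, "age": 45, "gender": "M", "country": "Canada", "purchase_amount": 75.00},
    {"customer_id": 103, "age": 29, "gender": "F", "country": "UK", "purchase_amount": 210.25},
]

ID_COLUMN = "customer_id"                     # identifier, not a feature
TARGET_COLUMN = "purchase_amount"             # what we want to predict
FEATURE_COLUMNS = ["age", "gender", "country"]


def split_features_and_target(rows):
    """Return (X, y): feature dicts without the identifier, and the target values."""
    X = [{name: row[name] for name in FEATURE_COLUMNS} for row in rows]
    y = [row[TARGET_COLUMN] for row in rows]
    return X, y


X, y = split_features_and_target(rows)
print(X[0])  # features of the first customer
print(y)     # target values for all customers
```

The identifier is dropped from the model inputs because it only labels a row and says nothing about the customer's spending.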
DATA PREPROCESSING
Before analyzing data, preprocessing is essential to ensure it is clean and ready for modeling.
2. Normalization
Many machine learning algorithms, especially distance-based and gradient-based ones, perform better when numerical features are on a consistent scale.
Normalization (Min-Max Scaling)
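Min-max scaling is commonly defined as x' = (x - min) / (max - min), which maps the smallest value to 0 and the largest to 1. A minimal sketch in plain Python follows; the function name `min_max_scale` and the sample ages are assumptions made for illustration.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the formula would divide by zero.
        # Returning zeros is one possible convention, chosen here as an assumption.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40, 60]           # hypothetical ages
scaled_ages = min_max_scale(ages)
print(scaled_ages)                # [0.0, 0.25, 0.5, 1.0]
```

In practice the minimum and maximum are usually computed on the training data only and then reused for new data, so that both are scaled the same way.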
Example of one-hot encoding - If country had values USA, Canada, UK, it would become:
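As a rough sketch of what one-hot encoding does to the three country values above: each category becomes its own 0/1 column, and exactly one column is 1 per row. The helper `one_hot_encode` and the `country_` column prefix are hypothetical names chosen for this example.

```python
def one_hot_encode(values, prefix):
    """Turn a list of category labels into one 0/1 column per distinct label."""
    # dict.fromkeys keeps the first-seen order of the categories.
    categories = list(dict.fromkeys(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]


countries = ["USA", "Canada", "UK"]
encoded = one_hot_encode(countries, prefix="country")
for original, row in zip(countries, encoded):
    print(original, row)
# USA {'country_USA': 1, 'country_Canada': 0, 'country_UK': 0}
# Canada {'country_USA': 0, 'country_Canada': 1, 'country_UK': 0}
# UK {'country_USA': 0, 'country_Canada': 0, 'country_UK': 1}
```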
EXPLORATORY DATA ANALYSIS
EDA helps in understanding data patterns and distributions using visualization and summarizing techniques.
Visualization Techniques:
Right-skewed distribution: The tail extends towards higher values.
Mean (Red Dashed Line) is greater than the Median (Green Dashed Line), indicating positive skewness.
The majority of the data is concentrated on the left, while some extreme values pull the mean to the right.
SUMMARIZING DATA
Skewness
Measures asymmetry of data.
Positive: Right-skewed.
Negative: Left-skewed.
Left-skewed distribution: The tail extends towards lower values.
Mean (Red Dashed Line) is less than the Median (Green Dashed Line), indicating negative skewness.
Most values are concentrated on the right, but some extreme low values pull the mean to the left.
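The relationship between mean, median, and the sign of skewness can be checked numerically. The sketch below uses one common definition, the third standardized moment (population form); the function name `skewness` and both samples are assumptions made up for illustration.

```python
import statistics


def skewness(values):
    """Third standardized moment: m3 / m2**1.5 (population form).

    Assumes the values are not all identical, otherwise m2 is zero.
    """
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical samples: one extreme high value, then its mirror image.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
left_skewed = [-v for v in right_skewed]

for name, sample in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(
        name,
        "mean =", statistics.fmean(sample),
        "median =", statistics.median(sample),
        "skewness sign =", "positive" if skewness(sample) > 0 else "negative",
    )
```

In the first sample the single value 20 pulls the mean (4.75) above the median (3), and the skewness is positive; the mirrored sample shows the opposite pattern.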
Kurtosis:
Measures how heavy the tails of a distribution are, i.e. how prone it is to producing outliers.
Mesokurtic (Blue - Normal Distribution):
Moderate peak and tails.
Example: Normal distribution.
Leptokurtic (Red - Heavy-Tailed Distribution):
High peak and long tails, meaning more extreme values.
Example: Financial market returns.
Platykurtic (Green - Light-Tailed Distribution):
Flatter peak and shorter tails, meaning fewer extreme values.
Example: Uniform distribution.
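The three cases above can be compared with excess kurtosis, commonly defined as the fourth standardized moment minus 3, so that a normal distribution scores about 0. The sketch below is illustrative only: the function name `excess_kurtosis`, the sample size, and the seed are assumptions, and a Laplace sample is used as a stand-in for a heavy-tailed distribution rather than real financial returns.

```python
import random


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

normal_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]      # mesokurtic
uniform_sample = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # platykurtic
# The difference of two exponential draws follows a Laplace distribution,
# used here as an assumed heavy-tailed (leptokurtic) example.
laplace_sample = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]

for name, sample in [
    ("normal", normal_sample),
    ("uniform", uniform_sample),
    ("laplace", laplace_sample),
]:
    print(f"{name:8s} excess kurtosis = {excess_kurtosis(sample):+.2f}")
```

In theory the excess kurtosis is 0 for the normal distribution, -1.2 for the uniform, and 3 for the Laplace; the sample estimates should land near those values.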
CONCLUSION