Data Basics for ML

The document provides an overview of data basics for machine learning, covering types of data, datasets, features, and the importance of data preprocessing. It details techniques for handling missing data, normalization, scaling, and encoding categorical variables, followed by methods for exploratory data analysis (EDA) including various visualization techniques. The conclusion emphasizes the significance of understanding data types, preprocessing, and EDA for deriving insights.

Uploaded by

amritaarajput

DATA BASICS FOR MACHINE LEARNING
TYPES OF DATA

Unstructured Data
No predefined format or structure.
Harder to store and analyze compared to structured data.
Examples:
Text data (emails, social media posts, customer reviews).
Multimedia files (images, videos, audio).
Sensor data, logs, and web pages.
E.g. "The delivery was late, and the product was damaged. Very disappointed!" (Customer review).
DATASETS AND FEATURES
Dataset
A dataset is a collection of data points organized in a structured format. It can be:
Tabular (structured datasets): Excel files, CSVs, SQL databases.
Non-tabular (unstructured datasets): Image collections, text files, video archives.

Features (Attributes/Variables)
Features are individual measurable properties or characteristics of a dataset.
Numerical Features: Age, income, temperature.
Categorical Features: Gender, product category, country.
Textual Features: Customer reviews, chat messages.
Date/Time Features: Transaction timestamps, event logs.
EXAMPLE OF A DATASET WITH FEATURES
Customer ID = Identifier (not a feature).
Age, Gender, Country = Features.
Purchase Amount = Target variable (if predicting spending).
DATA PREPROCESSING
Before analyzing data, preprocessing is essential to ensure it is clean and ready for modeling.

1. Handling Missing Data


Missing data can negatively impact model performance.
Techniques to Handle Missing Data:
Remove missing values (if minimal).
Impute missing values:
Numerical: Mean, median, or mode (the median is less sensitive to outliers than the mean).
Categorical: Most frequent category or "Unknown".
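The imputation options above can be sketched in a few lines of plain Python. This is only an illustration: the columns `ages` and `countries` and the helper `impute` are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import median, mode

# Hypothetical toy columns; None marks a missing value.
ages = [25, None, 40, 31, None, 58]
countries = ["USA", "UK", None, "USA", "Canada", None]


def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]


# Numerical column: fill with the median of the observed values.
observed_ages = [v for v in ages if v is not None]
ages_filled = impute(ages, median(observed_ages))

# Categorical column: fill with the most frequent observed category.
observed_countries = [v for v in countries if v is not None]
countries_filled = impute(countries, mode(observed_countries))

print(ages_filled)
print(countries_filled)
```

Passing the string "Unknown" as `fill` for the categorical column would give the other option listed above.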

2. Normalization
Many machine learning algorithms, especially distance-based and gradient-based ones, perform better when numerical features are on a consistent scale.
Normalization (Min-Max Scaling)

Rescales values between 0 and 1.


Formula: X′ = (X − Xmin) / (Xmax − Xmin), where Xmin and Xmax are the smallest and largest values of the feature.
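A minimal sketch of the min-max formula, assuming a hypothetical `incomes` column. The handling of a constant column (all values mapped to 0.0) is an assumption made here to avoid dividing by zero, not something the slides specify.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero, so map to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [20000, 35000, 50000, 80000]  # hypothetical feature values
scaled = min_max_scale(incomes)
print(scaled)
```

The smallest value always maps to 0 and the largest to 1, so a single extreme value compresses all the others toward one end of the range.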

3. Scaling (Z-score Scaling)


Standardizes data to have a mean of 0 and variance of 1.
Formula: X′ = (X − μ) / σ, where μ is the mean of the feature and σ is its standard deviation.
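A minimal sketch of z-score scaling on a hypothetical `temps` column. It assumes σ is the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead is an equally common choice and changes the results slightly.

```python
from statistics import fmean, pstdev


def z_score_scale(values):
    """Standardize values with (x - mean) / standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # Constant column: every value equals the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


temps = [10, 20, 30, 40, 50]  # hypothetical feature values
standardized = z_score_scale(temps)
print(standardized)
```

After scaling, the values have mean 0 and variance 1, but unlike min-max scaling they are not confined to a fixed range.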
4. Encoding Categorical Variables
One-Hot Encoding: Converts each category of a categorical variable into its own binary (0/1) column.

Label Encoding: Assigns numerical labels to categories.

Ordinal Encoding: Used when categories have a meaningful order.

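Example of one-hot encoding: if Country had the values USA, Canada, and UK, it would become three binary columns, one per value, and each row would have a 1 in the column for its country and 0 in the other two.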
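The country example can be sketched in plain Python. The helper `one_hot_encode`, the `country_` column prefix, and the sample rows are hypothetical choices for illustration; libraries that provide one-hot encoding may name and order the columns differently.

```python
def one_hot_encode(values, prefix):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]
    return categories, rows


countries = ["USA", "Canada", "UK", "USA"]  # hypothetical column
categories, encoded = one_hot_encode(countries, prefix="country")

for original, row in zip(countries, encoded):
    print(original, row)
```

Each row contains exactly one 1, so no artificial ordering is introduced between the categories.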
EXPLORATORY DATA ANALYSIS
EDA helps in understanding data patterns and distributions using visualization and summarizing techniques.

Visualization Techniques:

Histograms & Boxplots: Understanding data distribution.


Scatterplots & Correlation Heatmaps: Finding relationships between variables.
Bar Charts & Pie Charts: Analyzing categorical data.
VISUALIZATION TECHNIQUES
1. Histograms
Show the distribution of a single variable.
Use Case: Shows how values are distributed within a dataset.
Example: Age distribution in a customer dataset. The KDE (Kernel Density Estimate) curve helps visualize the probability distribution of ages.
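What a histogram computes can be sketched without a plotting library: values are grouped into equal-width bins and counted. The `ages` list, the bin width of 10, and the starting point of 20 are hypothetical choices, and this sketch prints a text histogram only; it does not draw a KDE curve.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall into each bin of the given width."""
    counts = {}
    for v in values:
        # Left edge of the bin that contains v.
        left = start + ((v - start) // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


ages = [22, 25, 27, 31, 34, 35, 38, 41, 45, 52, 58, 63]  # hypothetical data
counts = histogram_counts(ages, bin_width=10, start=20)

# Print one row of '#' marks per bin (labels assume integer ages).
for left, n in counts.items():
    print(f"{left}-{left + 9}: {'#' * n}")
```

The choice of bin width changes the shape the reader sees, so it is worth trying more than one value.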
2. Boxplots
Use Case: Identifies the data spread: outliers, median, and quartiles.
Example: Distribution of customer purchase amounts. It highlights the median, interquartile range (IQR), and potential outliers in the dataset.
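The quantities a boxplot displays can be computed directly. The sketch below uses hypothetical purchase amounts and assumes the common 1.5 × IQR rule for flagging potential outliers, which the slides do not state explicitly; quartile values also depend on the interpolation method chosen.

```python
from statistics import quantiles

purchases = [12, 15, 18, 20, 22, 25, 27, 30, 33, 120]  # hypothetical data

# Quartiles: q2 is the median. "inclusive" treats the data as the full population.
q1, q2, q3 = quantiles(purchases, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in purchases if v < lower_fence or v > upper_fence]

print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr)
print("potential outliers:", outliers)
```

Here only the single very large purchase falls outside the fences, which is what the isolated point beyond the whisker of a boxplot would show.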
3. Scatter Plot (Correlation Analysis)
Shows relationships between numerical variables.
Use Case: Checks relationships between two numerical variables.
Example: Relationship between advertising spend and sales revenue.
4. Correlation Heatmap
Use Case: Visualizes relationships between multiple numerical variables.
Example: Understanding correlations in a dataset (e.g., sales, advertising, and customer engagement).

Sales Revenue has a moderate positive correlation with both Advertising Spend (0.59) and Customer Engagement (0.62).
Advertising Spend and Customer Engagement have a weak correlation (0.12), meaning there is little linear relationship between them.
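The numbers shown in a correlation heatmap are pairwise Pearson correlation coefficients. A minimal sketch of how such a matrix could be computed follows; the three columns are made-up values, so the output will not reproduce the 0.59, 0.62, and 0.12 quoted above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical columns, invented for illustration only.
data = {
    "sales": [10, 12, 15, 19, 24],
    "advertising": [1, 2, 3, 4, 5],
    "engagement": [5, 3, 6, 4, 7],
}

names = list(data)
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

The matrix is symmetric with 1.0 on the diagonal, which is why heatmaps are often drawn with only one triangle filled in.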
5. Pairplot (Multivariable Relationships)
Use Case: Shows scatter plots between pairs of variables for multiple features at once.
Example: Comparing sales revenue, ad spend, and customer visits.

Diagonal plots display histograms for each individual variable.
There seems to be a positive relationship between sales revenue and customer visits, as well as between sales revenue and advertising spend.
6. Bar Charts
Useful for comparing discrete categories (categorical data).

The x-axis represents different categories (A to E).
The y-axis represents the values associated with each category.
The height of each bar shows the value for that category.
7. Pie Charts
These types of charts are useful for showing proportions within a whole.

Percentage distribution of different categories:
Each slice represents a category with a percentage value.
The size of each slice is proportional to the data values.
VISUALIZATION TECHNIQUES
Summary
1. Histograms → Show the distribution of a single variable.
2. Boxplots → Identify outliers and data spread.
3. Scatter Plots → Show relationships between two numerical variables.
4. Heatmaps → Show correlation strength between variables.
5. Pairplots → Show multiple scatter plots at once for feature relationships.
6. Bar Charts → Show distribution of categorical variables.
7. Pie Charts → Show distribution of categorical variables as part of a whole.
EXPLORATORY DATA ANALYSIS
Summarizing Data:
Descriptive Statistics:
Mean, median, mode and standard deviation.
Help understand distributions.

Mean (Red Dashed Line): The average value of the data.
Median (Green Dashed Line): The middle value when data is sorted.
Mode (Purple Dashed Line): The most frequently occurring value.
Standard Deviation (Orange Dotted Lines): Indicates how spread out the data is (±1σ from the mean).
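The four statistics above are available in Python's standard `statistics` module. The sketch below uses a hypothetical `values` list and assumes the sample standard deviation (`stdev`); `pstdev` would give the population version.

```python
from statistics import mean, median, mode, stdev

values = [2, 3, 3, 4, 5, 5, 5, 6, 7, 10]  # hypothetical data

summary = {
    "mean": mean(values),      # average value
    "median": median(values),  # middle value of the sorted data
    "mode": mode(values),      # most frequently occurring value
    "stdev": stdev(values),    # sample standard deviation
}

# The ±1σ band around the mean, as marked by the dotted lines in the figure.
band = (summary["mean"] - summary["stdev"], summary["mean"] + summary["stdev"])

print(summary)
print("mean ± 1 standard deviation:", band)
```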
SUMMARIZING DATA
Skewness
Measures asymmetry of data.
Positive: Right-skewed.
Negative: Left-skewed.

Right-skewed distribution: The tail extends towards higher values.
Mean (Red Dashed Line) is greater than the Median (Green Dashed Line), indicating positive skewness.
The majority of the data is concentrated on the left, while some extreme values pull the mean to the right.
Left-skewed distribution: The tail extends towards lower values.
Mean (Red Dashed Line) is less than the Median (Green Dashed Line), indicating negative skewness.
Most values are concentrated on the right, but some extreme low values pull the mean to the left.
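Skewness can also be computed as a number. The sketch below assumes the moment-based definition (third central moment divided by the 1.5 power of the second central moment), which is one of several conventions in use, and the two samples are hypothetical. The mean-versus-median comparison is a useful rule of thumb that holds for these samples but is not guaranteed for every distribution.

```python
from statistics import fmean, median


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 using population moments."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one extreme high value
left_skewed = [-v for v in right_skewed]    # mirror image: one extreme low value

for name, sample in [("right", right_skewed), ("left", left_skewed)]:
    print(name, "skewness:", round(skewness(sample), 2),
          "mean:", fmean(sample), "median:", median(sample))
```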
Kurtosis:
Measures how heavy the tails of a distribution are, that is, how prone it is to extreme values (outliers).
Mesokurtic (Blue - Normal Distribution): Moderate peak and tails. Example: Normal distribution.
Leptokurtic (Red - Heavy-Tailed Distribution): High peak and long tails, meaning more extreme values. Example: Financial market returns.
Platykurtic (Green - Light-Tailed Distribution): Flatter peak and shorter tails, meaning fewer extreme values. Example: Uniform distribution.
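The three categories can be told apart numerically with excess kurtosis, which subtracts 3 so that a normal distribution scores about 0, heavy-tailed data scores above 0, and light-tailed data scores below 0. The sketch assumes the population-moment definition, and both samples are hypothetical.

```python
from statistics import fmean


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


flat = list(range(1, 11))       # evenly spread values, uniform-like
heavy = [0] * 18 + [-10, 10]    # mostly central values plus two extremes

print("uniform-like:", round(excess_kurtosis(flat), 2))   # negative: platykurtic
print("heavy-tailed:", round(excess_kurtosis(heavy), 2))  # positive: leptokurtic
```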
CONCLUSION

Understanding Data – Types of data, datasets, and features.
Data Preprocessing – Cleaning, scaling, and encoding data.
Exploratory Data Analysis – Visualization and summary statistics for insights.
