Data Basics for ML

The document provides an overview of data basics for machine learning, covering types of data, datasets, features, and the importance of data preprocessing. It details techniques for handling missing data, normalization, scaling, and encoding categorical variables, followed by methods for exploratory data analysis (EDA) including various visualization techniques. The conclusion emphasizes the significance of understanding data types, preprocessing, and EDA for deriving insights.

Uploaded by

amritaarajput


DATA BASICS FOR MACHINE LEARNING
TYPES OF DATA

Unstructured Data
No predefined format or structure.
Harder to store and analyze compared to structured data.
Examples:
Text data (emails, social media posts, customer reviews).
Multimedia files (images, videos, audio).
Sensor data, logs, and web pages.
E.g. "The delivery was late, and the product was damaged. Very disappointed!" (customer review).
DATASETS AND FEATURES
Dataset
A dataset is a collection of data points organized in a structured format. It can be:
Tabular (structured datasets): Excel files, CSVs, SQL databases.
Non-tabular (unstructured datasets): Image collections, text files, video archives.

Features (Attributes/Variables)
Features are individual measurable properties or characteristics of the data points in a dataset.
Numerical Features: Age, income, temperature.
Categorical Features: Gender, product category, country.
Textual Features: Customer reviews, chat messages.
Date/Time Features: Transaction timestamps, event logs.
EXAMPLE OF A DATASET WITH FEATURES
Customer ID = Identifier (not a feature).
Age, Gender, Country = Features.
Purchase Amount = Target variable (if predicting spending).
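
As a rough illustration of the roles listed above, the sketch below separates an identifier, the features, and a target in plain Python. The rows, the column names (`customer_id`, `purchase_amount`, and so on), and the values are all made up for the example; they are not taken from the original slide.

```python
# Hypothetical rows; every value here is invented for illustration.
rows = [
    {"customer_id": 101, "age": 34, "gender": "F", "country": "USA", "purchase_amount": 120.50},
    {"customer_id": 102, "age": 45, "gender": "M", "country": "Canada", "purchase_amount": 80.00},
    {"customer_id": 103, "age": 29, "gender": "F", "country": "UK", "purchase_amount": 230.75},
]

IDENTIFIER = "customer_id"   # labels a row, carries no predictive signal
TARGET = "purchase_amount"   # the value we would try to predict

# Features are every column except the identifier and the target.
feature_names = [name for name in rows[0] if name not in (IDENTIFIER, TARGET)]

X = [[row[name] for name in feature_names] for row in rows]  # feature matrix
y = [row[TARGET] for row in rows]                             # target vector
```

Keeping the identifier out of `X` avoids a model latching onto an arbitrary ID as if it were a meaningful input.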
DATA PREPROCESSING
Before analyzing data, preprocessing is essential to ensure it is clean and ready for modeling.

1. Handling Missing Data

Missing data can bias results, and many algorithms cannot handle missing values at all.
Techniques to Handle Missing Data:
Remove rows or columns with missing values (if only a small share of the data is affected).
Impute missing values:
Numerical: Mean, median, mode.
Categorical: Most frequent category or "Unknown".
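
The techniques above can be sketched with the standard library alone. This is a minimal illustration, assuming missing entries are represented as `None`; the `impute` helper and the sample values are hypothetical.

```python
from statistics import mean, median, mode

# Hypothetical columns; None marks a missing entry.
ages = [25, None, 40, 31, None, 31]
countries = ["USA", None, "UK", "USA", "Canada", None]

def impute(values, fill):
    """Return a copy of values with every None replaced by fill."""
    return [fill if v is None else v for v in values]

# Option 1: remove missing values (reasonable when few are missing).
observed_ages = [a for a in ages if a is not None]
observed_countries = [c for c in countries if c is not None]

# Option 2: impute numerical values with the mean or the median.
ages_mean = impute(ages, mean(observed_ages))
ages_median = impute(ages, median(observed_ages))

# Option 3: impute categorical values with the most frequent
# category, or with an explicit "Unknown" label.
countries_mode = impute(countries, mode(observed_countries))
countries_unknown = impute(countries, "Unknown")
```

The median is often preferred over the mean when the column contains outliers, since a few extreme values can shift the mean noticeably.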

2. Normalization
Many machine learning algorithms, especially distance-based and gradient-based ones, perform better when numerical features are on a consistent scale.
Normalization (Min-Max Scaling)
Rescales values to the range 0 to 1.
Formula: X′ = (X − X_min) / (X_max − X_min)
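
A minimal sketch of the min-max formula in plain Python follows. The function name `min_max_scale` and the income values are assumptions made for the example, and the handling of a constant column (returning zeros) is one possible convention, not something the slide specifies.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero,
        # so we fall back to zeros (an assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20_000, 35_000, 50_000, 80_000]  # hypothetical values
scaled = min_max_scale(incomes)
print(scaled)
```

Because the result depends on the minimum and maximum, a single extreme value can compress all other values into a narrow part of the range.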

3. Scaling (Z-score Scaling)

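Standardizes data to have a mean of 0 and a variance (and therefore a standard deviation) of 1.
Formula: X′ = (X − μ) / σ, where μ is the feature's mean and σ is its standard deviation.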
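
The z-score formula can be sketched the same way. This example assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead is also common and gives slightly different values. The function name and the temperatures are hypothetical.

```python
from statistics import mean, pstdev

def z_score_scale(values):
    """Standardize values with (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        # Constant column: no spread to standardize.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

temps = [10.0, 20.0, 30.0, 40.0, 50.0]  # hypothetical values
z = z_score_scale(temps)

# After scaling, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding).
print(round(mean(z), 6), round(pstdev(z), 6))
```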
4. Encoding Categorical Variables
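One-Hot Encoding: Converts a categorical variable into binary columns, one per category.
Label Encoding: Assigns numerical labels to categories.
Ordinal Encoding: Used when categories have a meaningful order.

Example of one-hot encoding: if Country had the values USA, Canada, and UK, it would become three binary columns (one per country), with a 1 in the column matching each row's country and 0 in the others.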
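
The three encodings can be compared side by side in a short sketch. The column names (`country_USA` and so on), the alphabetical label order, and the `size_order` feature are assumptions made for illustration; libraries such as pandas or scikit-learn offer their own equivalents.

```python
countries = ["USA", "Canada", "UK", "USA"]
categories = sorted(set(countries))  # ['Canada', 'UK', 'USA']

# One-hot encoding: one binary column per category.
one_hot = [
    {f"country_{c}": int(value == c) for c in categories}
    for value in countries
]

# Label encoding: an arbitrary integer per category
# (here assigned in alphabetical order).
label_map = {c: i for i, c in enumerate(categories)}
labels = [label_map[value] for value in countries]

# Ordinal encoding: integers follow a meaningful order that we supply.
size_order = ["small", "medium", "large"]  # hypothetical ordered feature
sizes = ["medium", "small", "large"]
ordinal = [size_order.index(s) for s in sizes]
```

Label encoding on an unordered variable such as country introduces an artificial ordering (UK is not "greater than" Canada), which is why one-hot encoding is usually the safer choice for such features.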
EXPLORATORY DATA ANALYSIS
EDA helps in understanding data patterns and distributions using visualization and
summarizing techniques.

Visualization Techniques:

Histograms & Boxplots: Understanding data distribution.

Scatterplots & Correlation Heatmaps: Finding relationships between variables.
Bar Charts & Pie Charts: Analyzing categorical data.
VISUALIZATION TECHNIQUES
1. Histograms
Use Case: Shows how the values of a single variable are distributed.
Example: Age distribution in a customer dataset. A KDE (Kernel Density Estimate) curve overlaid on the histogram helps visualize the estimated probability density of ages.
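
What a histogram computes can be shown without a plotting library: values are grouped into bins and counted. The sketch below assumes a fixed bin width of 10 years and made-up ages, and prints a simple text histogram instead of drawing one.

```python
from collections import Counter

ages = [22, 25, 27, 31, 34, 35, 38, 41, 45, 52, 58, 63]  # hypothetical
bin_width = 10  # assumed bin width

# Map each age to the lower edge of its bin, e.g. 27 -> 20, 41 -> 40.
counts = Counter((age // bin_width) * bin_width for age in ages)

# Print one row per bin, with one '#' per observation.
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * counts[lower]}")
```

The choice of bin width changes the picture: wide bins hide detail, while narrow bins make the shape noisy.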
2. Boxplots
Use Case: Shows the spread of the data: median, quartiles, and outliers.
Example: Distribution of customer purchase amounts. The plot highlights the median, interquartile range (IQR), and potential outliers in the dataset.
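
The numbers behind a boxplot can be computed directly. The sketch below assumes the common convention of flagging points more than 1.5 × IQR beyond the quartiles as outliers; the slide does not state which rule its plot uses, and the purchase amounts are invented.

```python
from statistics import quantiles

purchases = [12, 15, 18, 20, 22, 25, 27, 30, 33, 120]  # hypothetical

# Quartiles: q1 (25%), q2 (median), q3 (75%).
q1, q2, q3 = quantiles(purchases, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: whiskers extend to 1.5 * IQR past the box,
# and anything beyond these fences is drawn as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [p for p in purchases if p < lower_fence or p > upper_fence]

print(f"median={q2}, IQR={iqr}, outliers={outliers}")
```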
3. Scatter Plot (Correlation Analysis)
Use Case: Checks the relationship between two numerical variables.
Example: Relationship between advertising spend and sales revenue.
4. Correlation Heatmap
Use Case: Visualizes relationships between multiple numerical variables.
Example: Understanding correlations in a dataset (e.g., sales, advertising, and customer engagement).

Sales Revenue has a moderate positive correlation with both Advertising Spend (0.59) and Customer Engagement (0.62).
Advertising Spend and Customer Engagement have a weak correlation (0.12), meaning there is little linear relationship between them. Correlation alone does not show whether one variable influences the other.
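
The values in a correlation heatmap are typically Pearson correlation coefficients, one for each pair of variables. The sketch below computes such a matrix by hand, assuming Pearson correlation is the measure in use. The data is invented and does not reproduce the coefficients quoted above.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical columns, one list per variable.
data = {
    "sales": [10, 12, 15, 19, 24],
    "advertising": [1, 2, 3, 4, 5],
    "engagement": [5, 3, 6, 2, 4],
}

# Correlation matrix as nested dicts: matrix[a][b] is the
# correlation between column a and column b.
names = list(data)
matrix = {a: {b: pearson(data[a], data[b]) for b in names} for a in names}

for a in names:
    print(a.ljust(12), [round(matrix[a][b], 2) for b in names])
```

The matrix is symmetric with 1.0 on the diagonal, which is why heatmaps often display only one triangle.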
5. Pairplot (Multivariable Relationships)
Use Case: Shows scatter plots between pairs of variables for multiple features at once.
Example: Comparing sales revenue, ad spend, and customer visits.

Diagonal plots display histograms for each individual variable.
There seems to be a positive relationship between sales revenue and customer visits, as well as between sales revenue and advertising spend.
6. Bar Charts
Useful for comparing discrete categories (categorical data).

The x-axis represents different categories (A to E).
The y-axis represents the values associated with each category.
The height of each bar shows the value for that category.
7. Pie Charts
These types of charts are useful for showing proportions within a whole.

Percentage distribution of different categories:
Each slice represents a category with a percentage value.
The size of each slice is proportional to the data values.
VISUALIZATION TECHNIQUES
Summary
1. Histograms → Show the distribution of a single variable.
2. Boxplots → Identify outliers and data spread.
3. Scatter Plots → Show relationships between two numerical variables.
4. Heatmaps → Show correlation strength between variables.
5. Pairplots → Show multiple scatter plots at once for feature relationships.
6. Bar Charts → Show distribution of categorical variables.
7. Pie Charts → Show distribution of categorical variables as part of a whole.
EXPLORATORY DATA ANALYSIS
Summarizing Data:
Descriptive Statistics:
Mean, median, mode and standard deviation.
Help understand distributions.

Mean (Red Dashed Line): The average value of the data.
Median (Green Dashed Line): The middle value when data is sorted.
Mode (Purple Dashed Line): The most frequently occurring value.
Standard Deviation (Orange Dotted Lines): Indicates how spread out the data is (±1σ from the mean).
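
These four statistics are available in Python's standard `statistics` module. The sketch below uses made-up values and assumes the population standard deviation (`pstdev`); the sample version (`stdev`) divides by n − 1 instead of n.

```python
from statistics import mean, median, mode, pstdev

values = [2, 3, 3, 3, 4, 5, 6, 8, 11]  # hypothetical data

mu = mean(values)       # average value
med = median(values)    # middle value of the sorted data
mo = mode(values)       # most frequent value
sigma = pstdev(values)  # population standard deviation (assumed choice)

# The +/- 1 sigma band around the mean.
band = (mu - sigma, mu + sigma)

print(f"mean={mu}, median={med}, mode={mo}, std={sigma:.2f}")
print(f"mean +/- 1 std: {band[0]:.2f} to {band[1]:.2f}")
```

In this example the mode, median, and mean are all different (3, 4, and 5), which already hints that the data is not symmetric.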
SUMMARIZING DATA
Skewness
Measures asymmetry of data.
Positive: Right-skewed.
Negative: Left-skewed.

Right-skewed distribution: The tail extends towards higher values.
Mean (Red Dashed Line) is greater than the Median (Green Dashed Line), indicating positive skewness.
The majority of the data is concentrated on the left, while some extreme values pull the mean to the right.
Left-skewed distribution: The tail extends towards lower values.
Mean (Red Dashed Line) is less than the Median (Green Dashed Line), indicating negative skewness.
Most values are concentrated on the right, but some extreme low values pull the mean to the left.
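
Skewness can also be computed as a number. The sketch below assumes the moment-based definition (third central moment divided by the standard deviation cubed); other definitions and bias corrections exist. The data is invented, and the left-skewed sample is simply the mirror image of the right-skewed one.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (assumed definition)."""
    mu = mean(values)
    n = len(values)
    m2 = sum((v - mu) ** 2 for v in values) / n  # variance
    m3 = sum((v - mu) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # one extreme high value
left_skewed = [-v for v in right_skewed]     # mirrored: extreme low value

for name, sample in [("right", right_skewed), ("left", left_skewed)]:
    print(name, f"skew={skewness(sample):.2f}",
          f"mean={mean(sample)}", f"median={median(sample)}")
```

The mean-versus-median comparison is a useful rule of thumb rather than a guarantee; for unusual distributions the two can disagree with the sign of the skewness.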
Kurtosis:
Measures how heavy the tails of a distribution are (how prone it is to outliers).
Mesokurtic (Blue - Normal Distribution): Moderate peak and tails. Example: Normal distribution.
Leptokurtic (Red - Heavy-Tailed Distribution): High peak and long tails, meaning more extreme values. Example: Financial market returns.
Platykurtic (Green - Light-Tailed Distribution): Flatter peak and shorter tails, meaning fewer extreme values. Example: Uniform distribution.
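
The three categories can be told apart numerically with excess kurtosis, which is 0 for a normal distribution, positive for heavy tails, and negative for light tails. The sketch below assumes the moment-based definition (fourth central moment over the squared variance, minus 3). Both samples are constructed for illustration only.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (assumed definition)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n  # variance
    m4 = sum((v - mu) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Evenly spread values approximate a uniform distribution (platykurtic).
flat = list(range(1, 101))

# Mostly identical values with two extremes (leptokurtic).
heavy = [0] * 98 + [-10, 10]

print(f"flat:  {excess_kurtosis(flat):.2f}")   # negative: light tails
print(f"heavy: {excess_kurtosis(heavy):.2f}")  # positive: heavy tails
```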
CONCLUSION

Understanding Data – Types of data, datasets, and features.
Data Preprocessing – Cleaning, scaling, and encoding data.
Exploratory Data Analysis – Visualization and summary statistics for insights.
