Introduction to
Data Processing I
Prof. Denio Duarte
duarte@uffs.edu.br
● Machine Learning
○ Build a model that describes the input data (dataset)
○ The model can be called a program or a hypothesis
Introduction
● Traditional programming
○ Input (data) + Program → Computer → Output
Introduction
● Machine learning
○ Input (data) + Output → Computer → Program
Introduction
● Comments
○ Data is the raw material for machine learning
algorithms
○ Algorithms build a model that describes the input data
○ The data quality affects the model quality
Source: https://www.r-bloggers.com/2019/08/new-course-learn-advanced-data-cleaning-in-r/
Introduction
Source: 7wData
Dataset
● Store the examples from the domain to be modeled
● Definitions
○ X={(x(1), y(1)), …, (x(m), y(m))}
■ m is the number of examples
■ x(i) is a tuple that represents the i-th example
● x(i)=(x1, x2, …, xn), n is the number of attributes (features)
of a given example (tuple)
■ y(i) is the label of example i
■ X is called the input and y is the output
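This notation maps directly onto arrays; a minimal sketch in Python with toy values (the numbers are illustrative, not a real dataset):

import numpy as np

# X holds m=3 examples, each a tuple of n=2 features (x1, x2)
X = np.array([[6.0, 4],
              [9.0, 6],
              [4.0, 1]])
# y holds one label per example: y(i) is the label of x(i)
y = np.array(['Pass', 'Pass', 'Exam'])

m, n = X.shape
print(m, n)  # 3 2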
Dataset
● Supervised
○ y is not empty, i.e., every example is associated with a label
● Unsupervised
○ y is empty, i.e., examples are not associated with any label
Dataset – example (supervised)
(X = the features; y = the label)

       Student    Grade 1 (x1)   Grade 2 (x2)   Study Hours (x3)   Result (y)
x(1)   Angelina   6.0            7.0            4                  Pass
x(2)   Meryl      9.0            8.4            6                  Pass
x(3)   Tom        4.0 = x1(3)    3.4 = x2(3)    1 = x3(3)          Exam
x(4)   Arnold     5.0            4.4            2                  Pass
x(5)   Brad       5.0            4.0            1                  Fail
x(6)   Sandra     3.4            2.0            0                  Fail
Dataset – example (unsupervised)
(X = the features; y is empty)

       Student    Grade 1 (x1)   Grade 2 (x2)   Study Hours (x3)
x(1)   Angelina   6.0            7.0            4
x(2)   Meryl      9.0            8.4            6
x(3)   Tom        4.0 = x1(3)    3.4 = x2(3)    1 = x3(3)
x(4)   Arnold     5.0            4.4            2
x(5)   Brad       5.0            4.0            1
x(6)   Sandra     3.4            2.0            0
Supervised Algorithms
● Rely on the labels to build the model
● Generalize the dataset based on the label values
○ Regression
○ Classification
Supervised Algorithms
● If the y domain is continuous (y ∈ ℝ), the problem is a regression problem

Student    Grade 1   Grade 2   Study Hours   Result (y)
Angelina   6.0       7.0       4             7.2
Meryl      9.0       8.4       6             8.9
Tom        4.0       3.4       1             6.3
Arnold     5.0       4.4       2             7.0
Brad       5.0       4.0       1             4.9
Sandra     3.4       2.0       0             2.2

Note: every regression problem can be transformed into a classification problem.
Supervised Algorithms
● If the y domain is discrete (classes), the problem is a classification problem, with y ∈ {Pass, Exam, Fail}

Student    Grade 1   Grade 2   Study Hours   Result (y)
Angelina   6.0       7.0       4             Pass
Meryl      9.0       8.4       6             Pass
Tom        4.0       3.4       1             Exam
Arnold     5.0       4.4       2             Pass
Brad       5.0       4.0       1             Fail
Sandra     3.4       2.0       0             Fail

Transformation from the regression labels: >= 7 → Pass, < 5 → Fail, otherwise → Exam (a sketch follows).
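A minimal sketch of this transformation in Python, applying the threshold rule above to the continuous labels from the regression slide:

def to_class(score):
    # >= 7 -> Pass, < 5 -> Fail, otherwise -> Exam
    if score >= 7:
        return 'Pass'
    if score < 5:
        return 'Fail'
    return 'Exam'

scores = [7.2, 8.9, 6.3, 7.0, 4.9, 2.2]
print([to_class(s) for s in scores])
# ['Pass', 'Pass', 'Exam', 'Pass', 'Fail', 'Fail']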
Overall
Regression Intuition
● Given the wind speed (x1) and the number of people in a room (x2), how much energy is necessary to cool the room (y)?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               5
50                42              25
45                31              22
60                35              18
Regression Intuition
● Let’s model the problem mathematically
○ Each feature is multiplied by a given weight, and we add a bias, also known as the intercept
○ θ0 + θ1x1 + θ2x2
○ What are the best values for the θ's?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               5
50                42              25
45                31              22
60                35              18
Regression Intuition
● Let’s model the problem mathematically
○ θ0 = 0.5, θ1 = 0.2, and θ2 = 0.3
○ ŷ(1) = 0.5 + 0.2×100 + 0.3×2 = 21.1
■ Not so close to the real value 5 (21.1 − 5 = 16.1)
■ Residual error: (1/4) × Σ|energy − ŷ| = 6.55
■ Which are the best θ's? (see the sketch below)

x1 (wind speed)   x2 (# people)   y (energy)   ŷ (y_hat)
100               2               5            21.1
50                42              25           23.1
45                31              22           18.8
60                35              18           23.0
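A minimal sketch of this computation in Python (numpy), reproducing the numbers above; finding better θ's automatically is exactly what a regression algorithm does:

import numpy as np

X = np.array([[100, 2], [50, 42], [45, 31], [60, 35]])  # x1, x2
y = np.array([5, 25, 22, 18])                           # energy

theta0 = 0.5                                            # bias (intercept)
theta = np.array([0.2, 0.3])                            # weights θ1, θ2
y_hat = theta0 + X @ theta                              # [21.1, 23.1, 18.8, 23.0]

residual = np.mean(np.abs(y - y_hat))                   # (1/4) × Σ|y − ŷ| = 6.55
print(y_hat, residual)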
Classification Intuition
● Given the wind speed (x1) and the number of people in a room (x2), what level of energy is necessary to cool the room (y)?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
Classification Intuition
● Approach: build a set of rules to map each class
○ if attr1 > n then class1
  else if attr2 < 5 then class2
  else class3

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
Classification Intuition
● Approach: build a set of rules to map each class
○ if x1 >= 100 then Low
  else if x2 > 40 then High
  else if x1 > 50 then Medium
  else High
○ Is there a better set of rules? (see the sketch below)

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
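A minimal sketch of this rule set in Python, checked against the four examples above (the >= in the first rule is what makes the first row fit):

def energy_level(x1, x2):
    # hand-built rules from the slide
    if x1 >= 100:
        return 'Low'
    if x2 > 40:
        return 'High'
    if x1 > 50:
        return 'Medium'
    return 'High'

rows = [(100, 2), (50, 42), (45, 31), (60, 35)]
print([energy_level(x1, x2) for x1, x2 in rows])
# ['Low', 'High', 'High', 'Medium']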
Be Aware
● The model must not over-specialize to the input data
○ Overfitting
● The model must not over-generalize the input data
○ Underfitting
Source: https://abracd.org/overfitting-e-underfitting-em-machine-learning/
Assess the Model
● How do we know whether a built model is good?
○ Classification
■ Accuracy, precision, recall, F-score, ...
○ Regression
■ R² score, Mean Squared Error (MSE), Mean Absolute Error (MAE), ...
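A minimal sketch of one metric from each family with scikit-learn (the toy vectors are illustrative):

from sklearn.metrics import accuracy_score, mean_squared_error

# classification: fraction of correctly predicted labels
print(accuracy_score(['Pass', 'Exam', 'Fail'], ['Pass', 'Fail', 'Fail']))  # 0.666...

# regression: average of the squared residuals
print(mean_squared_error([5, 25, 22, 18], [21.1, 23.1, 18.8, 23.0]))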
Dataset I
● We are interested in data
○ Features (attributes/variables) represent the properties of a given example
○ Features (attributes/variables) belong to a domain
■ Qualitative
■ Quantitative
import seaborn as sb
data = sb.load_dataset('tips')  # restaurant tips dataset shipped with seaborn
data.head()  # shows the first five rows
Dataset I
● The domain is associated with a type
○ In tips: total_bill (float), tip (float), sex (string), smoker (string), day (string), time (string), size (int)
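The types can be inspected directly with the pandas dtypes attribute; note that seaborn may load the qualitative columns as the pandas category dtype rather than plain strings:

print(data.dtypes)
# total_bill and tip are float64, size is int64,
# sex, smoker, day, and time are qualitative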
Dataset I
● Most machine learning algorithms need features as numbers
○ Generally, non-numeric features are qualitative
Dataset I
● If a non-numeric attribute is qualitative, we can encode it
○ sex, smoker, day, and time are qualitative (or discrete)
■ We can encode them numerically
● no = 0
● yes = 1
● female = 0
● male = 1
● ...
Dataset I
● Option 1:
○ Use the LabelEncoder class from sklearn.preprocessing
from sklearn import preprocessing as pp
laben = pp.LabelEncoder()
laben.fit(data['sex'])
print(laben.classes_)
# ['Female' 'Male'] – classes_ is sorted alphabetically
laben.fit(data['day'])
print(laben.classes_)
# ['Fri' 'Sat' 'Sun' 'Thur']
Dataset I
● Option 1:
○ Use the LabelEncoder class from sklearn.preprocessing
from sklearn import preprocessing as pp
laben = pp.LabelEncoder()
laben.fit(data['sex'])
data['sex'] = laben.transform(data['sex'])  # replace strings by 0/1 codes
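The two steps can also be collapsed into a single call with fit_transform, which fits the encoder and returns the encoded column:

data['sex'] = pp.LabelEncoder().fit_transform(data['sex'])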
Dataset I
● Option 2:
○ If data is a dataframe (pandas) – and it is
● Verify the unique values of the attribute

data['sex'].unique()
# ['Male', 'Female']
# build a dictionary mapping each value to a code
mapping = {'sex': {'Male': 0, 'Female': 1},
           'smoker': {'No': 0, 'Yes': 1}}
# encode sex and smoker; inplace=True applies the change to data itself
data.replace(mapping, inplace=True)
Exercise
import numpy as np
import seaborn as sb
import pandas as pd
data=sb.load_dataset('tips')
print(data.columns)
y=data['tip'] # the amount of the tip is the label
X=data.drop(['tip'],axis=1) # the rest compose X
# Prepare the dataset to have only numeric attributes
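One possible solution sketch (the 0/1 codes are an arbitrary choice, and pd.get_dummies is developed on the next slides):

# work on a copy so X stays intact
X_num = X.copy()
# binary qualitative features: map each value to 0/1
X_num['sex'] = X_num['sex'].map({'Male': 0, 'Female': 1})
X_num['smoker'] = X_num['smoker'].map({'No': 0, 'Yes': 1})
# multi-valued qualitative features: one column per value
X_num = pd.get_dummies(X_num, columns=['day', 'time'])
print(X_num.dtypes)  # every column is now numeric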
Keep Pushing
X['day'].unique()
## ['Fri','Thur', 'Sat', 'Sun']
# Instead of encoding a feature, we can create
# new features based on the values of the original one
# first we create a new df with the values of the feature
days=pd.get_dummies(X['day'])
# days is composed of four columns Thur Fri Sat and Sun
# now we can replace the column day by days
X = pd.concat([X,days],axis=1) # add new columns
X.drop(['day'],inplace=True,axis=1) #delete day
# this approach (one-hot encoding) can help the estimator
# learn better models
Keep Pushing
# We can delete one of the columns that represents a day
# in this case, if the other columns are 0, it means that
# the week day is the removed one, i.e., Thur
X['day'].unique()
## ['Fri','Thur', 'Sat', 'Sun']
days=pd.get_dummies(X['day'],drop_first=True)
# days, now, is composed of Fri Sat and Sun
# now we can replace the column day by days
X = pd.concat([X,days],axis=1) # add new columns
X.drop(['day'],inplace=True,axis=1) #delete day
It is your turn (again)
● Transform the values of time into new features (as done for day)
● The label of the tips dataset indicates that we have a regression problem
○ Build a new dataset from tips
○ In this new dataset, you are going to transform the values of tip (the label) into discrete ones (classes): small, average, and big
● Save it with DataFrame.to_csv(file_name, index=False)
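A minimal sketch of the label transformation with pandas; the bin edges below are an assumption for illustration, not given in the slides:

import seaborn as sb
import pandas as pd

data = sb.load_dataset('tips')
# hypothetical thresholds: tip < 2 is small, 2 to 4 average, > 4 big
data['tip'] = pd.cut(data['tip'], bins=[0, 2, 4, float('inf')],
                     labels=['small', 'average', 'big'])
data.to_csv('tips_classification.csv', index=False)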