Introduction to
Data Processing I
Prof. Denio Duarte
duarte@uffs.edu.br
● Machine Learning
○ Build a model that describes the input data (dataset)
○ The model can be called a program or a hypothesis
Introduction
● Traditional programming
○ Input (data) + Program → Computer → Output
Introduction
● Machine learning
○ Input (data) + Output → Computer → Program
Introduction
● Comments
○ Data is the raw material for machine learning
algorithms
○ Algorithms build a model that describes the input data
○ The data quality affects the model quality
Source: https://www.r-bloggers.com/2019/08/new-course-learn-advanced-data-cleaning-in-r/
Introduction
Source: 7wData
Dataset
● Store the examples from the domain to be modeled
● Definitions
○ X={(x(1), y(1)), …, (x(m), y(m))}
■ m is the number of examples
■ x(i) is a tuple that represents the i-th example
● x(i)=(x1, x2, …, xn), n is the number of attributes (features)
of a given example (tuple)
■ y(i) is the label of example i
■ X is called the input and y is the output
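This notation maps directly onto arrays; a minimal sketch in Python with toy values (the numbers are illustrative, not a real dataset):

import numpy as np

# X holds m=3 examples, each a tuple of n=2 features (x1, x2)
X = np.array([[6.0, 4],
              [9.0, 6],
              [4.0, 1]])
# y holds one label per example: y(i) is the label of x(i)
y = np.array(['Pass', 'Pass', 'Exam'])

m, n = X.shape
print(m, n)  # 3 2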
Dataset
● Supervised
○ y is not empty, i.e., every example is associated with a label
● Unsupervised
○ y is empty, i.e., examples are not associated with any label
Dataset – example (supervised)
(X = the features; y = the label)

       Student    Grade 1 (x1)   Grade 2 (x2)   Study Hours (x3)   Result (y)
x(1)   Angelina   6.0            7.0            4                  Pass
x(2)   Meryl      9.0            8.4            6                  Pass
x(3)   Tom        4.0 = x1(3)    3.4 = x2(3)    1 = x3(3)          Exam
x(4)   Arnold     5.0            4.4            2                  Pass
x(5)   Brad       5.0            4.0            1                  Fail
x(6)   Sandra     3.4            2.0            0                  Fail
Dataset – example (unsupervised)
(X = the features; y is empty)

       Student    Grade 1 (x1)   Grade 2 (x2)   Study Hours (x3)
x(1)   Angelina   6.0            7.0            4
x(2)   Meryl      9.0            8.4            6
x(3)   Tom        4.0 = x1(3)    3.4 = x2(3)    1 = x3(3)
x(4)   Arnold     5.0            4.4            2
x(5)   Brad       5.0            4.0            1
x(6)   Sandra     3.4            2.0            0
Supervised Algorithms
● Rely on the labels to build the model
● Generalize the dataset based on the label values
○ Regression
○ Classification
Supervised Algorithms
● If the y domain is continuous (y ∈ ℝ), the problem is a regression problem

Student    Grade 1   Grade 2   Study Hours   Result (y)
Angelina   6.0       7.0       4             7.2
Meryl      9.0       8.4       6             8.9
Tom        4.0       3.4       1             6.3
Arnold     5.0       4.4       2             7.0
Brad       5.0       4.0       1             4.9
Sandra     3.4       2.0       0             2.2

Note: every regression problem can be transformed into a classification problem.
Supervised Algorithms
● If the y domain is discrete (classes), the problem is a classification problem, with y ∈ {Pass, Exam, Fail}

Student    Grade 1   Grade 2   Study Hours   Result (y)
Angelina   6.0       7.0       4             Pass
Meryl      9.0       8.4       6             Pass
Tom        4.0       3.4       1             Exam
Arnold     5.0       4.4       2             Pass
Brad       5.0       4.0       1             Fail
Sandra     3.4       2.0       0             Fail

Transformation from the regression labels: >= 7 → Pass, < 5 → Fail, otherwise → Exam (a sketch follows).
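A minimal sketch of this transformation in Python, applying the threshold rule above to the continuous labels from the regression slide:

def to_class(score):
    # >= 7 -> Pass, < 5 -> Fail, otherwise -> Exam
    if score >= 7:
        return 'Pass'
    if score < 5:
        return 'Fail'
    return 'Exam'

scores = [7.2, 8.9, 6.3, 7.0, 4.9, 2.2]
print([to_class(s) for s in scores])
# ['Pass', 'Pass', 'Exam', 'Pass', 'Fail', 'Fail']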
Overall
Regression Intuition
● Given the wind speed (x1) and the number of people in a room (x2), how much energy is necessary to cool the room (y)?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               5
50                42              25
45                31              22
60                35              18
Regression Intuition
● Let’s model the problem mathematically
○ Each feature is multiplied by a given weight, and we add a bias, also known as the intercept
○ θ0 + θ1x1 + θ2x2
○ What are the best values for the θ's?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               5
50                42              25
45                31              22
60                35              18
Regression Intuition
● Let’s model the problem mathematically
○ θ0 = 0.5, θ1 = 0.2, and θ2 = 0.3
○ ŷ(1) = 0.5 + 0.2×100 + 0.3×2 = 21.1
■ Not so close to the real value 5 (21.1 − 5 = 16.1)
■ Residual error: (1/4) × Σ|energy − ŷ| = 6.55
■ Which are the best θ's? (see the sketch below)

x1 (wind speed)   x2 (# people)   y (energy)   ŷ (y_hat)
100               2               5            21.1
50                42              25           23.1
45                31              22           18.8
60                35              18           23.0
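A minimal sketch of this computation in Python (numpy), reproducing the numbers above; finding better θ's automatically is exactly what a regression algorithm does:

import numpy as np

X = np.array([[100, 2], [50, 42], [45, 31], [60, 35]])  # x1, x2
y = np.array([5, 25, 22, 18])                           # energy

theta0 = 0.5                                            # bias (intercept)
theta = np.array([0.2, 0.3])                            # weights θ1, θ2
y_hat = theta0 + X @ theta                              # [21.1, 23.1, 18.8, 23.0]

residual = np.mean(np.abs(y - y_hat))                   # (1/4) × Σ|y − ŷ| = 6.55
print(y_hat, residual)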
Classification Intuition
● Given the wind speed (x1) and the number of people in a room (x2), what level of energy is necessary to cool the room (y)?

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
Classification Intuition
● Approach: build a set of rules to map each class
○ if attr1 > n then class1
  else if attr2 < 5 then class2
  else class3

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
Classification Intuition
● Approach: build a set of rules to map each class
○ if x1 >= 100 then Low
  else if x2 > 40 then High
  else if x1 > 50 then Medium
  else High
○ Is there a better set of rules? (see the sketch below)

x1 (wind speed)   x2 (# people)   y (energy)
100               2               Low
50                42              High
45                31              High
60                35              Medium
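A minimal sketch of this rule set in Python, checked against the four examples above (the >= in the first rule is what makes the first row fit):

def energy_level(x1, x2):
    # hand-built rules from the slide
    if x1 >= 100:
        return 'Low'
    if x2 > 40:
        return 'High'
    if x1 > 50:
        return 'Medium'
    return 'High'

rows = [(100, 2), (50, 42), (45, 31), (60, 35)]
print([energy_level(x1, x2) for x1, x2 in rows])
# ['Low', 'High', 'High', 'Medium']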
Be Aware
● The model must not over-specialize to the input data
○ Overfitting
● The model must not over-generalize the input data
○ Underfitting
Source: https://abracd.org/overfitting-e-underfitting-em-machine-learning/
Assess the Model
● How do we know whether a built model is good?
○ Classification
■ Accuracy, precision, recall, F-score, ...
○ Regression
■ R² score, Mean Squared Error (MSE), Mean Absolute Error (MAE), ...
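A minimal sketch of one metric from each family with scikit-learn (the toy vectors are illustrative):

from sklearn.metrics import accuracy_score, mean_squared_error

# classification: fraction of correctly predicted labels
print(accuracy_score(['Pass', 'Exam', 'Fail'], ['Pass', 'Fail', 'Fail']))  # 0.666...

# regression: average of the squared residuals
print(mean_squared_error([5, 25, 22, 18], [21.1, 23.1, 18.8, 23.0]))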
Dataset I
● We are interested in data
○ Features (attributes/variables) represent the properties of a given example
○ Features (attributes/variables) belong to a domain
■ Qualitative
■ Quantitative
import seaborn as sb
data = sb.load_dataset('tips')  # restaurant tips dataset shipped with seaborn
data.head()  # shows the first five rows
Dataset I
● The domain is associated with a type
○ In tips: total_bill (float), tip (float), sex (string), smoker (string), day (string), time (string), size (int)
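The types can be inspected directly with the pandas dtypes attribute; note that seaborn may load the qualitative columns as the pandas category dtype rather than plain strings:

print(data.dtypes)
# total_bill and tip are float64, size is int64,
# sex, smoker, day, and time are qualitative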
Dataset I
● Most machine learning algorithms need features as numbers
○ Generally, non-numeric features are qualitative
Dataset I
● If a non-numeric attribute is qualitative, we can encode it
○ sex, smoker, day, and time are qualitative (or discrete)
■ We can encode them numerically
● no = 0
● yes = 1
● female = 0
● male = 1
● ...
Dataset I
● Option 1:
○ Use the LabelEncoder class from sklearn.preprocessing
from sklearn import preprocessing as pp
laben = pp.LabelEncoder()
laben.fit(data['sex'])
print(laben.classes_)
# ['Female' 'Male'] – classes_ is sorted alphabetically
laben.fit(data['day'])
print(laben.classes_)
# ['Fri' 'Sat' 'Sun' 'Thur']
Dataset I
● Option 1:
○ Use the LabelEncoder class from sklearn.preprocessing
from sklearn import preprocessing as pp
laben = pp.LabelEncoder()
laben.fit(data['sex'])
data['sex'] = laben.transform(data['sex'])  # replace strings by 0/1 codes
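The two steps can also be collapsed into a single call with fit_transform, which fits the encoder and returns the encoded column:

data['sex'] = pp.LabelEncoder().fit_transform(data['sex'])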
Dataset I
● Option 2:
○ If data is a dataframe (pandas) – and it is
● Verify the unique values of the attribute

data['sex'].unique()
# ['Male', 'Female']
# build a dictionary mapping each value to a code
mapping = {'sex': {'Male': 0, 'Female': 1},
           'smoker': {'No': 0, 'Yes': 1}}
# encode sex and smoker; inplace=True applies the change to data itself
data.replace(mapping, inplace=True)
Exercise
import numpy as np
import seaborn as sb
import pandas as pd
data=sb.load_dataset('tips')
print(data.columns)
y=data['tip'] # the amount of the tip is the label
X=data.drop(['tip'],axis=1) # the rest compose X
# Prepare the dataset to have only numeric attributes
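One possible solution sketch (the 0/1 codes are an arbitrary choice, and pd.get_dummies is developed on the next slides):

# work on a copy so X stays intact
X_num = X.copy()
# binary qualitative features: map each value to 0/1
X_num['sex'] = X_num['sex'].map({'Male': 0, 'Female': 1})
X_num['smoker'] = X_num['smoker'].map({'No': 0, 'Yes': 1})
# multi-valued qualitative features: one column per value
X_num = pd.get_dummies(X_num, columns=['day', 'time'])
print(X_num.dtypes)  # every column is now numeric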
Keep Pushing
X['day'].unique()
## ['Fri','Thur', 'Sat', 'Sun']
# Instead of encoding a feature, we can create
# new features based on the values of the original one
# first we create a new df with the values of the feature
days=pd.get_dummies(X['day'])
# days is composed of four columns Thur Fri Sat and Sun
# now we can replace the column day by days
X = pd.concat([X,days],axis=1) # add new columns
X.drop(['day'],inplace=True,axis=1) #delete day
# this approach (one-hot encoding) can help the estimator
# learn better models
Keep Pushing
# We can delete one of the columns that represents a day
# in this case, if the other columns are 0, it means that
# the week day is the removed one, i.e., Thur
X['day'].unique()
## ['Fri','Thur', 'Sat', 'Sun']
days=pd.get_dummies(X['day'],drop_first=True)
# days, now, is composed of Fri Sat and Sun
# now we can replace the column day by days
X = pd.concat([X,days],axis=1) # add new columns
X.drop(['day'],inplace=True,axis=1) #delete day
It is your turn (again)
● Transform the values of time into new features (as done for day)
● The label of the tips dataset indicates that we have a regression problem
○ Build a new dataset from tips
○ In this new dataset, you are going to transform the values of tip (the label) into discrete ones (classes): small, average, and big
● Save it with DataFrame.to_csv(file_name, index=False)
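A minimal sketch of the label transformation with pandas; the bin edges below are an assumption for illustration, not given in the slides:

import seaborn as sb
import pandas as pd

data = sb.load_dataset('tips')
# hypothetical thresholds: tip < 2 is small, 2 to 4 average, > 4 big
data['tip'] = pd.cut(data['tip'], bins=[0, 2, 4, float('inf')],
                     labels=['small', 'average', 'big'])
data.to_csv('tips_classification.csv', index=False)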