Machine Learning
Technical requirements
The following requirements should be considered as they will help you better
understand the concepts, use them in your projects, and practice with the provided
code:
• Python library requirements:
1. sklearn
2. numpy
3. pandas
4. matplotlib
Data Collection
This phase involves the systematic gathering of datasets that will serve as the raw
material for model development. The quality and diversity of the data collected
directly impact the robustness and generalizability of the machine learning model.
Fortunately, there are thousands of open datasets to choose from, ranging across
all sorts of domains. Here are a few places you can look to get data:
• Popular open data repositories:
— UC Irvine Machine Learning Repository
— Kaggle datasets
— Amazon’s AWS datasets
There are 7 instances in the dataset. You notice that the age attribute has
only 6 non-null values, meaning that 1 district is missing this feature; you can
fix it by data imputation, as we already learned.
The gender and group attributes are of type object. You can find out what
categories exist and how many districts belong to each category by using the
value_counts() method, as sketched below.
Let's make the output more readable.
Let's look at the other fields. The describe() method shows a summary of the
numerical attributes.
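As a minimal sketch, here is an illustrative DataFrame mirroring the description above (the column names and values are invented for demonstration; substitute your own data):

import pandas as pd

# Hypothetical toy data: 7 instances, 1 missing age value
df = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "M", "F"],
    "group":  ["A", "A", "B", "B", "A", "C", "C"],
    "age":    [23, 31, 45, None, 52, 37, 29],
})

df.info()                            # age shows only 6 non-null values
print(df["gender"].value_counts())   # categories and counts for gender
print(df["group"].value_counts())    # categories and counts for group
print(df.describe())                 # summary of the numerical attributes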
– Visualizations:
Another quick way to get a feel of the type of data you are dealing with is to
plot a histogram for each numerical attribute. A histogram shows the number
of instances (on the vertical axis) that have a given value range (on the
horizontal axis). You can either plot them one attribute at a time, or you can
call the hist() method on the whole dataset, and it will plot a histogram for
each numerical attribute (see Figure 1.5).
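A short sketch, assuming the illustrative df from the previous snippet (any pandas DataFrame with numerical columns works the same way):

import matplotlib.pyplot as plt

# One call plots a histogram for every numerical attribute
df.hist(bins=10, figsize=(8, 4))
plt.show()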
Hopefully you now have a better understanding of the kind of data you are dealing
with.
– Look for correlations (for only numerical features):
Since the dataset is not too large, you can easily compute the standard
correlation coefficient (also called Pearson’s r) between every pair of
attributes using the corr() method:
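For example (df is assumed to be your DataFrame; the column name "age" is illustrative, and numeric_only=True is needed on recent pandas versions to skip the categorical columns):

# Pearson correlation between every pair of numerical attributes
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix)

# How strongly each numerical attribute correlates with "age", sorted
print(corr_matrix["age"].sort_values(ascending=False))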
Another way to check for correlation between attributes is to use Pandas'
scatter_matrix function, which plots every numerical attribute against every other
numerical attribute. Since there are only 2 numerical attributes here, you would get
2² = 4 plots (Figure 1.6):
This scatter matrix plots every numerical attribute against every other numerical
attribute, plus a histogram of each numerical attribute’s values on the main
diagonal (top left to bottom right).
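A minimal sketch, again assuming a DataFrame df with numerical columns:

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Every numerical attribute plotted against every other one,
# with each attribute's histogram on the main diagonal
scatter_matrix(df.select_dtypes(include="number"), figsize=(8, 8))
plt.show()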
But this toy example isn't like a real dataset, so let's look at another scatter matrix
computed from a real dataset (Figure 1.7).
Figure 1.7 – Scatter matrix for a real dataset.
The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that
there is a strong positive correlation.
When the coefficient is close to –1, it means that there is a strong negative
correlation. Finally, coefficients close to zero mean that there is no linear
correlation. Figure 1.8 shows various plots along with the correlation coefficient
between their horizontal and vertical axes.
You decide to go for option 3 since it is the least destructive, but instead of the
preceding code, you will use the handy Scikit-Learn class SimpleImputer. The benefit
is that it will store the median value of each feature: this will make it possible to impute
missing values not only on the training set, but also on the validation set, the test set,
and any new data fed to the model.
Missing values can also be replaced with the mean value (strategy="mean"), with
the most frequent value (strategy="most_frequent"), or with a constant value
(strategy="constant", fill_value=…).
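A minimal sketch of this workflow (the DataFrame df and the "age" column are illustrative placeholders):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

# fit() computes and stores the median of each supplied feature
imputer.fit(df[["age"]])
print(imputer.statistics_)                      # the stored median(s)

# transform() fills missing values with those stored medians, so the same
# imputer can later be applied to the validation set, test set, or new data
df["age"] = imputer.transform(df[["age"]]).ravel()

# Other strategies: SimpleImputer(strategy="mean"),
# SimpleImputer(strategy="most_frequent"),
# SimpleImputer(strategy="constant", fill_value=0)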
➔ The difference between fit(), transform(), and fit_transform() (illustrated in the sketch after this list):
— fit() learns the parameters from the data: this method is used to compute the
necessary parameters (e.g., mean, standard deviation, minimum and maximum
values) from the training data.
— transform() applies the learned transformation to the data: this method is used
to apply the transformations (e.g., scaling, normalization) to the data using the
parameters calculated in the fit() method.
— fit_transform() combines fit() and transform() into a single step.
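To make the distinction concrete, here is a small sketch using StandardScaler (the arrays are invented for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler()
scaler.fit(X_train)                           # learns the mean and std of X_train
X_train_scaled = scaler.transform(X_train)    # applies the learned scaling
X_test_scaled = scaler.transform(X_test)      # reuses the training parameters

# fit_transform() does both steps at once (on the training data only)
X_train_scaled = scaler.fit_transform(X_train)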
In the preceding figure, the plots were generated using the values of features in
the diabetes dataset of the scikit-learn package, which was loaded via
sklearn.datasets.load_diabetes().
Preprocessing
— Data transformation (for categorical features)
When we train a machine learning model, the model needs to use numerical
values to calculate the loss function in each iteration of the optimization
process. Hence, we need to transform categorical variables into numerical
alternatives. There are multiple feature encoding techniques:
— OneHotEncoder
— OrdinalEncoder
— LabelEncoder
— TargetEncoder
This is an imaginary dataset.
Figure 1.3 – Manual calculations for one-hot, target, and label encoding
using a simple example dataset with four features and seven data points.
The difference between ordinal and nominal transformation is the meaning behind
the order of categories in ordinal variables. For example, if we are encoding grades
of students, A, B, C, and D could be transformed into 1, 2, 3, and 4, or 4, 3, 2, and 1,
but transforming them into 1, 3, 4, and 2 will not be acceptable as it is changing the
meaning behind the order of the grades.
First, we will use label encoding to encode the categorical features in the defined
DataFrame:
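A sketch with an invented DataFrame (LabelEncoder works on one 1-D column at a time, so it is applied column by column; note that it is primarily intended for target labels):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["F", "M", "F", "M"],
                   "group":  ["A", "B", "C", "A"]})

le = LabelEncoder()
df["gender_label"] = le.fit_transform(df["gender"])   # expects a Series (1-D)
df["group_label"] = le.fit_transform(df["group"])
print(df)
print(le.classes_)    # categories seen for the last fitted column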
Then, in the same way, we will use ordinal encoding to encode the categorical
features. Note that fit_transform() expects a DataFrame: df["gender"] is a Series,
while df[["gender"]] is a DataFrame:
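A corresponding sketch with OrdinalEncoder, using the same invented df:

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
# df["gender"] is a Series (1-D); df[["gender"]] is a DataFrame (2-D),
# which is what OrdinalEncoder's fit_transform() expects
df["gender_ordinal"] = oe.fit_transform(df[["gender"]]).ravel()
print(oe.categories_)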
Then, we will try to perform one-hot encoding for categorical feature
transformation:
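A sketch with OneHotEncoder on the same invented df (the sparse_output argument requires a recent scikit-learn; older versions use sparse=False instead):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)      # dense array instead of a sparse matrix
gender_onehot = ohe.fit_transform(df[["gender"]])
print(ohe.get_feature_names_out())            # e.g. ['gender_F', 'gender_M']
print(gender_onehot)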
Example of standardization:

The dataset and the standardization formula are

$$\text{Dataset} = \begin{bmatrix} 1 & -1 & 2 \\ 2 & 0 & 0 \\ 0 & 1 & -1 \end{bmatrix}, \qquad z = \frac{x - \mu}{\sigma}, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Col 1: $\sigma^2 = \frac{1}{3}\left((1-1)^2 + (2-1)^2 + (0-1)^2\right) = \frac{2}{3}$, so $\sigma = \sqrt{\tfrac{2}{3}} = 0.816$

Col 2: $\sigma = 0.816$, Col 3: $\sigma = 1.24$

Applying the definition to every entry (the column means are $\mu_1 = 1$, $\mu_2 = 0$, $\mu_3 = \tfrac{1}{3}$):

$$\begin{bmatrix} \frac{1-1}{0.816} & \frac{-1-0}{0.816} & \frac{2-1/3}{1.24} \\[4pt] \frac{2-1}{0.816} & \frac{0-0}{0.816} & \frac{0-1/3}{1.24} \\[4pt] \frac{0-1}{0.816} & \frac{1-0}{0.816} & \frac{-1-1/3}{1.24} \end{bmatrix} \Rightarrow \begin{bmatrix} 0 & -1.22 & 1.33 \\ 1.22 & 0 & -0.26 \\ -1.22 & 1.22 & -1.06 \end{bmatrix}$$
To know the mean and standard deviation for this data set:
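A sketch reproducing the hand calculation with StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.mean_)     # per-column means: [1.  0.  0.333...]
print(scaler.scale_)    # per-column standard deviations: [0.816  0.816  1.247]
print(X_scaled)         # matches the standardized matrix computed by hand above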
You need to make sure the scaling and normalizations you implement don't cause
ties in your feature values, meaning that data points don't lose their differences in the
features that underwent transformation.
Custom transformer
Although Scikit-Learn provides many useful transformers, you will need to write
your own for tasks such as custom cleanup operations or combining specific
attributes. You will want your transformer to work seamlessly with Scikit-Learn
functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not
inheritance), all you need is to create a class and implement three methods: fit()
(returning self), transform(), and fit_transform(). You can get the last one for free by
simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as
a base class (and avoid *args and **kwargs in your constructor), you will get two extra
methods (get_params() and set_params()) that will be useful for automatic
hyperparameter tuning. For example, here's a custom transformer that acts much
like the StandardScaler:
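A sketch of such a transformer, written as a simplified stand-in for StandardScaler (not a full, production-grade implementation):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):     # no *args or **kwargs
        self.with_mean = with_mean

    def fit(self, X, y=None):               # y is accepted but unused
        X = check_array(X)                  # checks that X is an array of finite floats
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]
        return self                         # always return self!

    def transform(self, X):
        check_is_fitted(self)               # verifies fit() has been called
        X = check_array(X)
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_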
Shorthand:
If you don't want to name the transformers, you can use the make_pipeline()
function instead; it takes transformers as positional arguments and creates a
Pipeline, naming each step after its transformer's class.
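For example (the two steps are illustrative; the automatically chosen names are the lowercased class names):

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Steps are automatically named "simpleimputer" and "standardscaler"
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
print(num_pipeline.steps)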
2- Access
The steps are stored as a list, and a sub-pipeline can be extracted using slicing:
pipe[:1] => returns a sub-pipeline containing only the first step.
pipe[-1:] => returns a sub-pipeline containing only the last step.
You can also access a single step by name or position:
pipe["key"] => returns the estimator for that step.
Example:
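For instance, with the num_pipeline sketched above (the step names follow make_pipeline's convention):

pipe = num_pipeline

print(pipe[:1])                  # sub-pipeline containing only the first step (the imputer)
print(pipe[-1:])                 # sub-pipeline containing only the last step (the scaler)
print(pipe["simpleimputer"])     # a single step's estimator, accessed by name
print(pipe[0])                   # the same estimator, accessed by position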
As you can see, there are many data transformation steps that need to be
executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class
to help with such sequences of transformations. Here is a small pipeline for the
numerical attributes, which will first impute then scale the input features:
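A sketch of such a pipeline with explicitly named steps (the step names "impute" and "standardize" are arbitrary choices):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),     # fill missing values with the median
    ("standardize", StandardScaler()),                # then standardize the features
])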
Since listing all the column names is not very convenient, Scikit-Learn provides a
make_column_selector() function that returns a selector function you can use to
automatically select all the features of a given type, such as numerical or categorical.
You can pass this selector function to ColumnTransformer instead of column
names or indices. Moreover, if you don't care about naming the transformers, you
can use make_column_transformer(), which chooses the names for you, just like
make_pipeline() does. For example, the following code creates the same
ColumnTransformer as earlier, except the transformers are automatically named
"pipeline-1" and "pipeline-2" instead of "num" and "cat":
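A sketch of both variants (num_pipeline and cat_pipeline stand in for the numerical and categorical pipelines built earlier; the dtype-based selectors are one reasonable choice):

import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

# Explicitly named transformers, with columns selected by dtype
preprocessing = ColumnTransformer([
    ("num", num_pipeline, make_column_selector(dtype_include=np.number)),
    ("cat", cat_pipeline, make_column_selector(dtype_include=object)),
])

# The same ColumnTransformer with automatically generated names
# ("pipeline-1" and "pipeline-2")
preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)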
We now have a preprocessing pipeline that takes the entire training dataset, applies
each transformer to the appropriate columns, and then concatenates the transformed
columns horizontally. This returns a NumPy array, but you can get the column names
using preprocessing.get_feature_names_out().
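For example (df and the preprocessing ColumnTransformer are assumed from the sketches above):

X_prepared = preprocessing.fit_transform(df)
print(type(X_prepared))                          # a NumPy array (sparse if mostly zeros)
print(preprocessing.get_feature_names_out())     # the column names after transformation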
Your project is going well and you’re almost ready to train some models! You now
want to create a single pipeline that will perform all the transformations you’ve
experimented with up to now. Let’s recap what the pipeline will do and why:
• Missing values in numerical features will be imputed by replacing them with
the median.
• Missing values in categorical features will be replaced by the most frequent
category.
• The categorical features will be one-hot-encoded, as most machine learning
algorithms only accept numerical inputs.
• All numerical features will be standardized, as most machine learning
algorithms prefer when all features have roughly the same scale.
• Feature engineering: a few ratio features will be computed and added.
Hopefully these will correlate better with the target and thereby help the ML models.
• Features with a long tail will be replaced by their logarithm, as most models prefer
features with roughly uniform or Gaussian distributions (see the sketch after this list).
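As a rough sketch of the last two bullets, FunctionTransformer can wrap simple feature-engineering functions such as a column ratio or a log transform (the column indices and the feature_names_out argument are illustrative and assume a recent scikit-learn):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Ratio of two columns (here, column 0 divided by column 1 of a NumPy array)
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])

# Log transform for long-tailed features
log_transformer = FunctionTransformer(np.log, feature_names_out="one-to-one")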
Then train your models against the smaller training set and evaluate them against
the validation set.
from sklearn.model_selection import cross_val_score
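A small sketch of how cross_val_score is typically used (the model and the tiny dataset are placeholders for your own):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Tiny illustrative data; in practice pass your prepared training features and labels
X_train = np.arange(20, dtype=float).reshape(-1, 1)
y_train = 3 * X_train.ravel() + np.random.randn(20)

model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())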