
Machine Learning: The Machine Learning Life Cycle

Technical requirements
The following requirements should be considered as they will help you better
understand the concepts, use them in your projects, and practice with the provided
code:
• Python library requirements:
1. sklearn
2. numpy
3. pandas
4. matplotlib
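If you use pip, these libraries can be installed with a command like the following (the scikit-learn package provides the sklearn module; exact package names may vary with your environment):

python -m pip install scikit-learn numpy pandas matplotlib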

Before we start modeling


You need to know your objectives. You need to know what problems you want to solve, and then define smaller subproblems that are solvable with machine learning.
Knowing the objective is important because it will determine how you frame the problem, which algorithms you will select, which performance measure you will use to evaluate your model, and how much effort you will spend tweaking it.
Once you have identified your subproblems, you can find out how you can use
machine learning for each and go through a machine learning life cycle for the
defined subproblems. Each of the subproblems may need specific data processing
and machine learning modeling, and some of them could be easier to solve
compared to the rest.

Data Collection
This phase involves the systematic gathering of datasets that will serve as the raw
material for model development. The quality and diversity of the data collected
directly impact the robustness and generalizability of the machine learning model.
Fortunately, there are thousands of open datasets to choose from, ranging across
all sorts of domains. Here are a few places you can look to get data:
• Popular open data repositories:
— UC Irvine Machine Learning Repository
— Kaggle datasets
— Amazon’s AWS datasets

• Meta portals (they list open data repositories):


— http://dataportals.org/
— http://opendatamonitor.eu/
— http://quandl.com/

• Other pages listing many popular open data repositories:


— Wikipedia’s list of Machine Learning datasets
— Quora.com question
— Datasets subreddit

Exploratory Data Analysis (EDA)


Now the focus turns to understanding the underlying patterns and characteristics of the collected data, starting with a quick look at the data structure.
Visualizations, summary statistics, and correlation analyses offer a
comprehensive view of the data, guiding practitioners toward informed choices
in feature engineering, model selection, and other critical aspects.
– Summary statistics:
The examples below use an imaginary dataset.
 The info() method is useful to get a quick description of the data, particularly the total number of rows, each attribute's type, and the number of non-null values.

There are 7 instances in the dataset. You will notice that the age attribute has only 6 non-null values, meaning that 1 instance is missing this feature; you can fix it with data imputation, as we have already learned.
The gender and group attributes have the type object. You can find out what categories exist and how many instances belong to each category by using the value_counts() method, and you can make the output more readable:
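A minimal sketch of these steps, assuming a hypothetical 7-row DataFrame (age, gender, group, and a binary label column) standing in for the imaginary dataset; showing proportions is one possible reading of "more readable":

import numpy as np
import pandas as pd

# Hypothetical stand-in for the imaginary dataset: 7 rows, one missing age value
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 35, 22],
    "gender": ["F", "M", "F", "M", "M", "F", "M"],
    "group": ["H1", "H2", "H1", "H3", "H2", "H3", "H1"],
    "label": [0, 1, 0, 1, 1, 0, 0],
})

df.info()                                          # row count, dtypes, non-null counts
print(df["gender"].value_counts())                 # how many instances per category
print(df["group"].value_counts())
print(df["group"].value_counts(normalize=True))    # proportions instead of raw counts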

Let's look at the other fields. The describe() method shows a summary of the numerical attributes:
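Continuing with the hypothetical df from the previous sketch:

print(df.describe())    # count, mean, std, min, quartiles, and max of the numerical columns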

– Visualizations:
Another quick way to get a feel of the type of data you are dealing with is to
plot a histogram for each numerical attribute. A histogram shows the number
of instances (on the vertical axis) that have a given value range (on the
horizontal axis). You can either plot this one attribute at a time, or you can
call the hist() method on the whole dataset, and it will plot a histogram for
each numerical attribute (see Figure 1.5).
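For example, using the hypothetical df from the earlier sketch:

import matplotlib.pyplot as plt

df.hist(bins=10, figsize=(8, 6))    # one histogram per numerical column
plt.show()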

Figure 1.5. A histogram for each numerical attribute

Hopefully you now have a better understanding of the kind of data you are dealing
with.
– Look for correlations (numerical features only):
Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method:
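Using the hypothetical df from the earlier sketch (age and label are its two numerical columns):

corr_matrix = df[["age", "label"]].corr()    # Pearson correlation between the numerical attributes
print(corr_matrix)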
Another way to check for correlation between attributes is to use Pandas' scatter_matrix function, which plots every numerical attribute against every other numerical attribute. Since there are 2 numerical attributes here, you would get 2² = 4 plots (Figure 1.6):
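For example:

import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# df: the hypothetical DataFrame from the earlier sketch
scatter_matrix(df[["age", "label"]], figsize=(8, 8))
plt.show()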

Figure 1.6. Scatter matrix

This scatter matrix plots every numerical attribute against every other numerical
attribute, plus a histogram of each numerical attribute’s values on the main
diagonal (top left to bottom right).
Since this isn't a very realistic dataset, let's also look at a scatter matrix from a real dataset (Figure 1.7).
Figure 1.7 – Scatter matrix for a real dataset.
The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that
there is a strong positive correlation.
When the coefficient is close to –1, it means that there is a strong negative
correlation. Finally, coefficients close to zero mean that there is no linear correlation. Figure 1.8 shows various plots along with the correlation coefficient between their horizontal and vertical axes.

Figure 1.8. Standard correlation coefficient of various datasets (source: Wikipedia; public domain image).
Data Cleaning and Preprocessing
Raw data is often messy and unstructured. Data cleaning involves addressing
issues such as missing values, outliers, and inconsistencies that could
compromise the accuracy and reliability of the machine learning model.
Preprocessing takes this a step further by standardizing formats, scaling values,
and encoding categorical variables, creating a consistent and well-organized
dataset.

• Data Cleaning: Address issues such as missing values, outliers, and inconsistencies in the data.
• Data Preprocessing: Standardize formats, scale values, and encode categorical variables for consistency.

Data cleaning (for numerical features)


Most machine learning algorithms cannot work with missing features, so you have three options to fix this:
— Get rid of the corresponding rows (instances).
— Get rid of the whole attribute.
— Feature imputation: fill in the missing values.
You can accomplish these easily using Pandas DataFrame's dropna(), drop(), and fillna() methods:
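A minimal sketch of the three options, using the hypothetical df from the EDA sketch (age is the column with a missing value):

# Option 1: get rid of the rows with missing values
df_option1 = df.dropna(subset=["age"])

# Option 2: get rid of the whole attribute
df_option2 = df.drop("age", axis=1)

# Option 3: fill in the missing values with, say, the median
median = df["age"].median()
df_option3 = df.fillna({"age": median})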

You decide to go for option 3 since it is the least destructive, but instead of the preceding code, you will use a handy Scikit-Learn class: SimpleImputer. The benefit is that it will store the median value of each feature: this makes it possible to impute missing values not only on the training set, but also on the validation set, the test set, and any new data fed to the model.
Missing values can also be replaced with the mean value (strategy="mean"), with the most frequent value (strategy="most_frequent"), or with a constant value (strategy="constant", fill_value=…).
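A minimal sketch with SimpleImputer, again using the hypothetical df (only numerical columns are passed to the imputer):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
df_num = df[["age"]]                     # numerical attributes only
imputer.fit(df_num)                      # learns and stores the median of each feature
print(imputer.statistics_)               # the stored medians
X_imputed = imputer.transform(df_num)    # can also be applied to validation/test/new data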

There are also more powerful imputers available in the sklearn.impute package (for numerical features only):

• KNNImputer provides imputation for filling in missing values using the k-Nearest Neighbors approach.
• IterativeImputer trains a regression model per feature to predict the missing values based on all other available features.
• MissingIndicator: this transformation is useful in conjunction with imputation, since preserving the information about which values had been missing can be informative.
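A small sketch of these imputers on a toy array (note that IterativeImputer is still experimental and must be enabled explicitly):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: needed before importing IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer, MissingIndicator

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

print(KNNImputer(n_neighbors=2).fit_transform(X))
print(IterativeImputer(random_state=42).fit_transform(X))
print(MissingIndicator().fit_transform(X))   # boolean mask of the originally missing entries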

➔ The difference between fit(), transform(), and fit_transform():
— fit() learns the parameters from the data: this method computes the necessary parameters (e.g., mean, standard deviation, minimum and maximum values) from the training data.
— transform() applies the learned transformation to the data: this method applies the transformations (e.g., scaling, normalization) using the parameters calculated in the fit() method.
— fit_transform() combines fit() and transform() into a single step.
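For example, with StandardScaler (a toy sketch):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [0.0]])
X_new = np.array([[3.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # learns the mean and standard deviation
X_train_scaled = scaler.transform(X_train)   # applies the learned parameters
X_new_scaled = scaler.transform(X_new)       # same parameters reused on new data

X_train_scaled_2 = StandardScaler().fit_transform(X_train)   # fit() + transform() in one step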

— Outlier removal (for numerical features)


Numerical variables in our datasets could have values that are far away from the rest of the data. They could be real values that are simply dissimilar to the rest of the data points. You can visually detect them using a boxplot, for example with matplotlib.pyplot.boxplot (Figure 1.2).

Figure 1.2 – Outliers in histograms and boxplots

In the preceding figure, the plots were generated using the values of features in
the diabetes dataset of the scikit-learn package, which was loaded via
sklearn.datasets.load_diabetes().
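A minimal sketch of such a boxplot for the diabetes dataset mentioned above:

import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
plt.boxplot(diabetes.data)   # one box per feature; points beyond the whiskers are potential outliers
plt.xlabel("feature index")
plt.show()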

Preprocessing
— Data transformation (for categorical features)
When we train a machine learning model, the model needs to use numerical
values to calculate the loss function in each iteration of the optimization
process. Hence, we need to transform categorical variables into numerical
alternatives. There are multiple feature encoding techniques:

— OneHotEncoder
— OrdinalEncoder
— LabelEncoder
— TargetEncoder
This is an imaginary dataset.
Figure 1.3 – Manual calculations for one-hot, target, and label encoding using a simple example dataset with four features and seven data points.

Each of these techniques has its benefits and caveats. For example:
— One-hot encoder: increases the number of features, and with it the chance of overfitting.
— Label encoder: assigns integer values to each category, which do not necessarily have a meaning. For example, considering Male as 1 and Female as 0 is arbitrary and doesn't have any real meaning.
— Target encoder: an alternative approach that considers the probabilities of each category with respect to the target.

The difference between ordinal and nominal transformation is the meaning behind
the order of categories in ordinal variables. For example, if we are encoding grades
of students, A, B, C, and D could be transformed into 1, 2, 3, and 4, or 4, 3, 2, and 1,
but transforming them into 1, 3, 4, and 2 will not be acceptable as it is changing the
meaning behind the order of the grades.
First, we will use label encoding to encode the categorical features in the defined
DataFrame:
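A minimal sketch, using the hypothetical df with gender and group columns from the EDA sketch:

from sklearn.preprocessing import LabelEncoder

gender_le = LabelEncoder()
df["gender_label"] = gender_le.fit_transform(df["gender"])    # e.g., F -> 0, M -> 1
df["group_label"] = LabelEncoder().fit_transform(df["group"])
print(gender_le.classes_)    # the categories behind the assigned integers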

Then, in the same way, we will use ordinal encoding to encode the categorical features. Note that fit_transform() expects a DataFrame: df["gender"] is a Series, while df[["gender"]] is a DataFrame:
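For example:

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
# note the double brackets: df[["gender"]] is a DataFrame, as fit_transform() expects 2-D input
df["gender_ordinal"] = ordinal_encoder.fit_transform(df[["gender"]]).ravel()
print(ordinal_encoder.categories_)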
Then, we will try to perform one-hot encoding for categorical feature
transformation:
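A minimal sketch (sparse_output=False requires scikit-learn 1.2 or newer; older versions use sparse=False):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder(sparse_output=False)
encoded = one_hot.fit_transform(df[["gender", "group"]])
encoded_df = pd.DataFrame(encoded,
                          columns=one_hot.get_feature_names_out(),
                          index=df.index)
print(encoded_df.columns)   # gender_F, gender_M, group_H1, group_H2, group_H3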

How do you name the columns correctly?

The encoded columns follow the order of the remaining categories: F, M, H1, H2, H3. You can retrieve these names directly with the encoder's get_feature_names_out() method, as in the snippet above.

Now, we will implement target encoding in Python as the third encoding approach, after installing the category_encoders library, as follows:
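A minimal sketch, assuming the hypothetical binary label column plays the role of the target:

import category_encoders as ce   # pip install category_encoders

target_encoder = ce.TargetEncoder(cols=["gender", "group"])
df_target_encoded = target_encoder.fit_transform(df[["gender", "group"]], df["label"])
print(df_target_encoded.head())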

— Data Scaling (for numerical features)


The values of features, either originally numerical or after transformation,
could have different ranges. Many machine learning models perform better, or
at least their optimization processes converge faster, if their feature values get
scaled and normalized properly.
For example, if you have a feature ranging from 0.001 to 0.05 and another one
from 1,000 to 5,000, bringing both of them to a reasonable range such as [0, 1]
or [-1, 1] could help improve the speed of convergence or the performance of
your model.
For example, the values of a variable after using the StandardScaler class of scikit-learn will be centered around zero with a standard deviation of one (Table 1.1).

Scikit-learn class | Mathematical definition | Value limits
sklearn.preprocessing.StandardScaler() | z = (x − μ) / σ (μ: mean, σ: standard deviation) | No limit; >99% of the data between −3 and 3
sklearn.preprocessing.MinMaxScaler() (normalization) | x_scaled = (x − x_min) / (x_max − x_min) | [0, 1]
sklearn.preprocessing.MaxAbsScaler() | x_scaled = x / abs(x_max) | [−1, 1]
sklearn.preprocessing.RobustScaler() | z = (x − Q2) / IQR (Q2: median, IQR: interquartile range) | No limit; the majority of the data between −3 and 3
Table 1.1 – Example of Python classes for scaling and normalizing feature values

Example of standardization:

Dataset (3 rows × 3 columns):
[ 1  −1   2 ]
[ 2   0   0 ]
[ 0   1  −1 ]

z = (x − μ) / σ, computed per column.

Means:
μ = [(1 + 2 + 0)/3, (−1 + 0 + 1)/3, (2 + 0 − 1)/3] = [1, 0, 0.33]

Variances: σ² = (1/N) Σ (xᵢ − μ)²
Col 1: σ² = ((1 − 1)² + (2 − 1)² + (0 − 1)²) / 3 = 2/3, so σ = √(2/3) = 0.816
Col 2: σ = 0.816
Col 3: σ = 1.24

Applying the definition to every value (for example, (1 − 1)/0.816, (−1 − 0)/0.816, and (2 − 0.33)/1.24 for the first row) gives:
[  0     −1.22   1.33 ]
[  1.22   0     −0.26 ]
[ −1.22   1.22  −1.06 ]

Let's write the same computation as code and check the mean and standard deviation for this dataset:
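A minimal sketch reproducing the manual calculation above:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, -1.0,  2.0],
                 [2.0,  0.0,  0.0],
                 [0.0,  1.0, -1.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print(scaler.mean_)    # [1.    0.    0.333]
print(scaler.scale_)   # [0.816 0.816 1.247]  (per-column standard deviation)
print(scaled)          # matches the matrix computed by hand above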

Note: The RobustScaler class is less likely to be affected by outliers.


Figure 1.4 – Graph of the difference between normalization and standardization (panels: actual data, normalization, standardization).

You need to make sure the scaling and normalization you implement don't cause ties in your feature values, meaning that data points don't lose their differences in the features that underwent transformation.

Feature Engineering and Selection:


Feature engineering takes center stage as a transformative process that elevates
raw data into meaningful predictors. Simultaneously, feature selection refines this
pool of variables, identifying the most relevant ones to enhance model efficiency
and effectiveness.
Feature Engineering involves creating new features or transforming existing ones
to better capture patterns and relationships within the data. This creative process
requires domain expertise and a deep understanding of the problem at hand,
ensuring that the engineered features contribute meaningfully to the predictive
power of the model.
On the other hand, Feature Selection focuses on identifying the subset of features
that most significantly impact the model’s performance. This dual approach seeks
to strike a delicate balance, optimizing the feature set for predictive accuracy while
minimizing computational complexity.
The goal of feature selection is to reduce the number of features, or the
dimensionality of your data, and keep features that are information rich. For
example, if we have 20,000 features and 500 data points, there is a high chance
that most of the original 20,000 features are not informative when used to build a
supervised learning model. The following list explains some simple techniques for
feature selection:
• Keeping features with a high variance or median absolute deviation (MAD) across the data points
• Keeping features with the highest number of unique values across the data points
• Keeping representative features from groups of highly correlated features
These processes can be conducted using all the data points, or using just the training data in order to avoid potential information leakage between the training and test data.
Feature Engineering:
Create new features or transform existing ones to better capture patterns and
relationships.

How can we do feature engineering?

By multiplying, dividing, subtracting, adding, or applying any other operation to existing features to get a new feature (a new column). For example:
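A hypothetical sketch with made-up column names, just to illustrate the idea:

import pandas as pd

houses = pd.DataFrame({"rooms": [6, 8, 4], "bedrooms": [3, 4, 1]})
houses["bedrooms_per_room"] = houses["bedrooms"] / houses["rooms"]   # division creates a ratio feature
houses["other_rooms"] = houses["rooms"] - houses["bedrooms"]         # subtraction creates another feature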

But the better way is to implement this as a custom transformer (covered later in this chapter):

Understanding the scikit-learn estimator API


In the previous sections, we used the SimpleImputer class from scikit-learn to impute missing values in our dataset. The SimpleImputer class belongs to the so-called transformer classes in scikit-learn, which are used for data transformation. The two
essential methods of those estimators are fit and transform. The fit method is used
to learn the parameters from the training data, and the transform method uses
those parameters to transform the data. Any data array that is to be transformed
needs to have the same number of features as the data array that was used to fit
the model.
Scikit-Learn’s API is remarkably well designed. The main design principles are:
• Consistency. All objects share a consistent and simple interface:
– Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is an estimator). The estimation itself is performed by the fit() method. Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer's strategy), and it must be set as an instance variable (generally via a constructor parameter). Transformers and predictors, described next, are themselves estimators with additional capabilities.
– Transformers. Some estimators (such as an imputer) can also transform a
dataset; these are called transformers. Once again, the API is quite simple:
the transformation is performed by the transform () method with the dataset
to transform as a parameter. It returns the transformed dataset. This
transformation generally relies on the learned parameters, as is the case for
an imputer. All transformers also have a convenience method called
fit_transform() that is equivalent to calling fit() and then transform() (but
sometimes fit_transform() is optimized and runs much faster).
– Predictors. Finally, some estimators can make predictions given a dataset;
they are called predictors. For example, the LinearRegression model in the
previous chapter was a predictor: it predicted life satisfaction given a
country’s GDP per capita. A predictor has a predict () method that takes a
dataset of new instances and returns a dataset of corresponding predictions.
It also has a score () method that measures the quality of the predictions
given a test set (and the corresponding labels in the case of supervised
learning algorithms).
• Inspection. All the estimator’s hyperparameters are accessible directly via public
instance variables (e.g., imputer.strategy), and all the estimator’s learned
parameters are also accessible via public instance variables with an underscore
suffix (e.g., imputer.statistics_).
• Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy
sparse matrices, instead of homemade classes. Hyperparameters are just regular
Python strings or numbers.
• Composition. Existing building blocks are reused as much as possible. For
example, it is easy to create a Pipeline estimator from an arbitrary sequence of
transformers followed by a final estimator, as we will see.
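A minimal sketch of these conventions, using SimpleImputer as an estimator/transformer and LinearRegression as a predictor:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

imputer = SimpleImputer(strategy="median")     # an estimator that is also a transformer
X_clean = imputer.fit_transform(X)
print(imputer.strategy)       # hyperparameter: plain public instance variable
print(imputer.statistics_)    # learned parameter: trailing underscore

model = LinearRegression()    # a predictor
model.fit(X_clean, y)
print(model.predict([[3.0]]))      # predictions for new instances
print(model.score(X_clean, y))     # quality of the predictions on the given data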
The following figure illustrates how a transformer, fitted on the training data, is used
to transform a training dataset as well as a new test dataset:

The following figure illustrates how an estimator works:

Custom transformer
Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need is to create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kwargs in your constructor), you will get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning. For example, here's a custom transformer that acts much like the StandardScaler:
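A sketch of such a transformer (the class and attribute names are illustrative, not part of scikit-learn):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):     # no *args or **kwargs
        self.with_mean = with_mean

    def fit(self, X, y=None):                # y is required even though we don't use it
        X = check_array(X)                   # checks that X is an array with finite float values
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]     # every estimator stores this in fit()
        return self                          # always return self!

    def transform(self, X):
        check_is_fitted(self)                # looks for the learned attributes (trailing _)
        X = check_array(X)
        assert self.n_features_in_ == X.shape[1]
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_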

The feature engineering from earlier can be implemented as a custom transformer in exactly the same way:
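A hypothetical sketch of the earlier ratio feature wrapped in a custom transformer:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Appends the ratio of the first two columns as a new feature."""
    def fit(self, X, y=None):
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        ratio = X[:, [0]] / X[:, [1]]
        return np.c_[X, ratio]    # original columns plus the new ratio column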
Here are a few things to note:
• The sklearn.utils.validation package contains several functions we can use to validate the inputs.
• Scikit-learn pipelines require the fit() method to have two arguments, X and y, which is why we need the y=None argument even though we don't use y.
• The fit() method must return self.
• All scikit-learn estimators set n_features_in_ in the fit() method, and they ensure that the data passed to transform() or predict() has this number of features.
Transformation pipeline to build a composite transformer
1- Build a pipeline
Build it using a list of (key, value) pairs, i.e., (name, estimator) tuples. For example:
Shorthand:
If you don't want to name the transformers, you can use the make_pipeline() function instead; it takes transformers as positional arguments and creates a pipeline, naming the steps after the transformers' classes.
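For example, this builds the same pipeline as above, with steps named "simpleimputer" and "standardscaler":

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())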

2- Access
The estimators are stored as a list, and a sub-pipeline can be extracted using slicing:
pipe[:1] => returns a sub-pipeline containing only the first step.
pipe[-1] => returns the last estimator itself (pipe[-1:] would return a sub-pipeline containing only the last step).
You can also access a step by name or position:
pipe["key"] => returns the estimator registered under that name.
For example:
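Using the num_pipeline created with make_pipeline() above (its steps are named "simpleimputer" and "standardscaler"):

print(num_pipeline[:1])                # sub-pipeline containing only the first step
print(num_pipeline[-1])                # the StandardScaler instance itself
print(num_pipeline["simpleimputer"])   # access a step by its name
print(num_pipeline.steps)              # the full list of (name, estimator) pairs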
As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations; the num_pipeline above is exactly such a small pipeline for the numerical attributes, which first imputes and then scales the input features.

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like.
 A pipeline requires all steps to be transformers, except the last one, which can be anything:
• A clustering model.
• A transformer: the whole pipeline then behaves like a transformer.
• A predictor: the pipeline will expose a predict() method; given data X, it uses all steps except the last to transform the data, then passes the transformed data to the final predictor's predict() method.

Feature Union composite feature spaces


A feature union combines several transformer objects into a new transformer that concatenates their outputs. During fitting, each of these transformers is fit to the data independently, and the transformers are applied in parallel. Feature union serves the same purposes as a pipeline: convenience and joint parameter estimation and validation.
Feature unions and pipelines can be combined to create complex models.
Note: A feature union has no way of checking whether two transformers might produce identical features. It only makes sense to use one when the feature sets are disjoint.
Build
Build it using a list of (key, value) pairs. For example:
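A small sketch combining two transformers in parallel (the choice of transformers is just illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

union = FeatureUnion([
    ("scaled", StandardScaler()),
    ("pca", PCA(n_components=1)),
])
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]])
print(union.fit_transform(X).shape)   # (3, 3): 2 scaled columns + 1 PCA component, concatenated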

There is also a shorthand, make_union(), analogous to make_pipeline():
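For example:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_union
from sklearn.preprocessing import StandardScaler

union = make_union(StandardScaler(), PCA(n_components=1))   # same union, names chosen automatically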


So far, we have handled the categorical columns and the numerical columns separately. It would be more convenient to have a single transformer capable of handling all columns, applying the appropriate transformations to each column. For this, you can use a ColumnTransformer. Its constructor requires a list of triplets (3-tuples), each containing a name (which must be unique and not contain double underscores), a transformer, and a list of names of the columns that the transformer should be applied to.
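A minimal sketch, assuming the column names of the imaginary dataset used earlier (age, gender, group):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs = ["age"]
cat_attribs = ["gender", "group"]

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])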
The result:

Since listing all the column names is not very convenient, scikit-learn provides a make_column_selector() function that returns a selector function you can use to automatically select all the features of a given type, such as numerical or categorical. You can pass this selector function to ColumnTransformer instead of column names or indices. Moreover, if you don't care about naming the transformers, you can use make_column_transformer(), which chooses the names for you, just like make_pipeline() does. For example, the following code creates the same ColumnTransformer as earlier, except the transformers are automatically named "pipeline-1" and "pipeline-2" instead of "num" and "cat":
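A sketch, reusing num_pipeline and cat_pipeline from the previous snippet:

import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)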
We now have a preprocessing pipeline that takes the entire training dataset and applies each transformer to the appropriate columns, then concatenates the transformed columns horizontally. This returns a NumPy array, but you can get the column names using preprocessing.get_feature_names_out().

And wrap the data in a nice DataFrame:
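For example, using the hypothetical df and the preprocessing transformer from the snippets above:

import pandas as pd

X_prepared = preprocessing.fit_transform(df)
if hasattr(X_prepared, "toarray"):        # ColumnTransformer may return a sparse matrix
    X_prepared = X_prepared.toarray()
df_prepared = pd.DataFrame(X_prepared,
                           columns=preprocessing.get_feature_names_out(),
                           index=df.index)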

Your project is going well and you’re almost ready to train some models! You now
want to create a single pipeline that will perform all the transformations you’ve
experimented with up to now. Let’s recap what the pipeline will do and why:
• Missing values in numerical features will be imputed by replacing them with
the median.
• Missing values in categorical features will be replaced by the most frequent
category.
• The categorical features will be one-hot-encoded, as most machine learning
algorithms only accept numerical inputs.
• All numerical features will be standardized, as most machine learning
algorithms prefer when all features have roughly the same scale.
• Feature engineering: a few ratio features will be computed and added.
Hopefully these will better correlate and thereby help the ML models.
• Features with a long tail will be replaced by their logarithm, as most models prefer features with roughly uniform or Gaussian distributions.
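A minimal sketch of such a pipeline; the column names and the long-tailed "income" column are assumptions made for illustration, and the ratio features could be added with a custom transformer like the one sketched earlier:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))
log_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                             FunctionTransformer(np.log1p),   # replace a long tail by its logarithm
                             StandardScaler())

preprocessing = ColumnTransformer([
    ("num", num_pipeline, ["age"]),               # assumed numerical column
    ("cat", cat_pipeline, ["gender", "group"]),   # assumed categorical columns
    ("log", log_pipeline, ["income"]),            # hypothetical long-tailed column
])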

Model Selection, training, evaluation and tuning


At last!!! You framed the problem, you got the data and explored it, you sampled a
training set and a test set, and you wrote a preprocessing pipeline to automatically
clean up and prepare your data for machine learning algorithms. You are now ready
to select and train a machine learning model.
Train and evaluate on the training set
Choose the model that you want to train on the data; say that you chose linear regression. First, check its performance: suppose you chose the RMSE as your performance measure, so you want to measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error() function.
If the error is high, it can mean that the features don't provide enough information to make good predictions, or that the model isn't powerful enough.
One of the main ways to fix underfitting is to select a more powerful model. You could decide to try a decision tree regressor, evaluate it on the training set, and, if it still gives a high error, try another model.
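A sketch of this step; X_train, y_train, and the preprocessing transformer are assumed to come from the earlier stages:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(X_train, y_train)
predictions = lin_reg.predict(X_train)
lin_rmse = np.sqrt(mean_squared_error(y_train, predictions))   # RMSE on the whole training set
print(lin_rmse)

# if the error is high (underfitting), try a more powerful model
tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(X_train, y_train)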

Better evaluation by cross-validation


Split the training set into a smaller training set and a validation set. Here we split the training set into 5 nonoverlapping subsets called folds; we then train and evaluate the model 5 times, picking a different fold (the 1st, 2nd, 3rd, 4th, and 5th in turn) for evaluation each time and using the other 4 folds for training. The result is an array containing the 5 evaluation scores.

Then train your models against the smaller training set and evaluate them against the validation set, using cross_val_score from sklearn.model_selection:
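A sketch of 5-fold cross-validation for the decision tree model from the previous step (tree_reg, X_train, and y_train are assumed from the earlier sketch):

from sklearn.model_selection import cross_val_score

tree_rmses = -cross_val_score(tree_reg, X_train, y_train,
                              scoring="neg_root_mean_squared_error", cv=5)
print(tree_rmses.mean(), tree_rmses.std())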

Why is there a negative sign before the function?

It cancels out the negative sign implied by scoring="neg_root_mean_squared_error".
But why is there a "neg" in this scoring parameter in the first place?
Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is the opposite of the RMSE. It returns negative values, so you need to switch the sign of the output to get the RMSE scores.
cv (cross-validation) => the number of folds, i.e., the number of nonoverlapping subsets.
Fine-tune your model
Let's assume that you have a shortlist of promising models. You now need to fine-tune them. Let's look at a few ways you can do that.
