Machine Learning
Technical requirements
The following requirements should be considered as they will help you better
understand the concepts, use them in your projects, and practice with the provided
code:
• Python library requirements:
1. sklearn
2. numpy
3. pandas
4. matplotlib
Data Collection
This phase involves the systematic gathering of datasets that will serve as the raw
material for model development. The quality and diversity of the data collected
directly impact the robustness and generalizability of the machine learning model.
Fortunately, there are thousands of open datasets to choose from, ranging across
all sorts of domains. Here are a few places you can look to get data:
• Popular open data repositories:
— UC Irvine Machine Learning Repository
— Kaggle datasets
— Amazon’s AWS datasets
There are 7 instances in the dataset. You notice that the age attribute has
only 6 non-null values, meaning that 1 district is missing this feature; you can
fix it by data imputation, as we already learned.
The gender and group attributes are of type object. You can find out what
categories exist and how many districts belong to each category by using the
value_counts() method, as sketched below.
Let's make the output more readable.
Let's look at the other fields. The describe() method shows a summary of the
numerical attributes.
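As a minimal sketch, here is an illustrative DataFrame mirroring the description above (the column names and values are invented for demonstration; substitute your own data):

import pandas as pd

# Hypothetical toy data: 7 instances, 1 missing age value
df = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "M", "F"],
    "group":  ["A", "A", "B", "B", "A", "C", "C"],
    "age":    [23, 31, 45, None, 52, 37, 29],
})

df.info()                            # age shows only 6 non-null values
print(df["gender"].value_counts())   # categories and counts for gender
print(df["group"].value_counts())    # categories and counts for group
print(df.describe())                 # summary of the numerical attributes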
– Visualizations:
Another quick way to get a feel of the type of data you are dealing with is to
plot a histogram for each numerical attribute. A histogram shows the number
of instances (on the vertical axis) that have a given value range (on the
horizontal axis). You can either plot them one attribute at a time, or you can
call the hist() method on the whole dataset, and it will plot a histogram for
each numerical attribute (see Figure 1.5).
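A short sketch, assuming the illustrative df from the previous snippet (any pandas DataFrame with numerical columns works the same way):

import matplotlib.pyplot as plt

# One call plots a histogram for every numerical attribute
df.hist(bins=10, figsize=(8, 4))
plt.show()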
Hopefully you now have a better understanding of the kind of data you are dealing
with.
– Look for correlations (for only numerical features):
Since the dataset is not too large, you can easily compute the standard
correlation coefficient (also called Pearson’s r) between every pair of
attributes using the corr() method:
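For example (df is assumed to be your DataFrame; the column name "age" is illustrative, and numeric_only=True is needed on recent pandas versions to skip the categorical columns):

# Pearson correlation between every pair of numerical attributes
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix)

# How strongly each numerical attribute correlates with "age", sorted
print(corr_matrix["age"].sort_values(ascending=False))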
Another way to check for correlation between attributes is to use Pandas'
scatter_matrix function, which plots every numerical attribute against every other
numerical attribute. Since there are only 2 numerical attributes here, you would get
2² = 4 plots (Figure 1.6):
This scatter matrix plots every numerical attribute against every other numerical
attribute, plus a histogram of each numerical attribute’s values on the main
diagonal (top left to bottom right).
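A minimal sketch, again assuming a DataFrame df with numerical columns:

from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

# Every numerical attribute plotted against every other one,
# with each attribute's histogram on the main diagonal
scatter_matrix(df.select_dtypes(include="number"), figsize=(8, 8))
plt.show()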
But this toy example isn't like a real dataset, so let's look at another scatter matrix
computed from a real dataset (Figure 1.7).
Figure 1.7 – Scatter matrix for a real dataset.
The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that
there is a strong positive correlation.
When the coefficient is close to –1, it means that there is a strong negative
correlation. Finally, coefficients close to zero mean that there is no linear
correlation. Figure 1.8 shows various plots along with the correlation coefficient
between their horizontal and vertical axes.
You decide to go for option 3 since it is the least destructive, but instead of the
preceding code, you will use the handy Scikit-Learn class SimpleImputer. The benefit
is that it will store the median value of each feature: this will make it possible to impute
missing values not only on the training set, but also on the validation set, the test set,
and any new data fed to the model.
Missing values can also be replaced with the mean value (strategy="mean"), with
the most frequent value (strategy="most_frequent"), or with a constant value
(strategy="constant", fill_value=…).
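A minimal sketch of this workflow (the DataFrame df and the "age" column are illustrative placeholders):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

# fit() computes and stores the median of each supplied feature
imputer.fit(df[["age"]])
print(imputer.statistics_)                      # the stored median(s)

# transform() fills missing values with those stored medians, so the same
# imputer can later be applied to the validation set, test set, or new data
df["age"] = imputer.transform(df[["age"]]).ravel()

# Other strategies: SimpleImputer(strategy="mean"),
# SimpleImputer(strategy="most_frequent"),
# SimpleImputer(strategy="constant", fill_value=0)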
➔ The difference between fit(), transform(), and fit_transform() (illustrated in the sketch after this list):
— fit() learns the parameters from the data: this method is used to compute the
necessary parameters (e.g., mean, standard deviation, minimum and maximum
values) from the training data.
— transform() applies the learned transformation to the data: this method is used
to apply the transformations (e.g., scaling, normalization) to the data using the
parameters calculated in the fit() method.
— fit_transform() combines fit() and transform() into a single step.
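To make the distinction concrete, here is a small sketch using StandardScaler (the arrays are invented for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 25.0]])

scaler = StandardScaler()
scaler.fit(X_train)                           # learns the mean and std of X_train
X_train_scaled = scaler.transform(X_train)    # applies the learned scaling
X_test_scaled = scaler.transform(X_test)      # reuses the training parameters

# fit_transform() does both steps at once (on the training data only)
X_train_scaled = scaler.fit_transform(X_train)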
In the preceding figure, the plots were generated using the values of features in
the diabetes dataset of the scikit-learn package, which was loaded via
sklearn.datasets.load_diabetes().
Preprocessing
— Data transformation (for categorical features)
When we train a machine learning model, the model needs to use numerical
values to calculate the loss function in each iteration of the optimization
process. Hence, we need to transform categorical variables into numerical
alternatives. There are multiple feature encoding techniques:
— OneHotEncoder
— OrdinalEncoder
— LabelEncoder
— TargetEncoder
This is an imaginary dataset.
Figure 1.3 – Manual calculations for one-hot, target, and label encoding
using a simple example dataset with four features and seven data points.
The difference between ordinal and nominal transformation is the meaning behind
the order of categories in ordinal variables. For example, if we are encoding grades
of students, A, B, C, and D could be transformed into 1, 2, 3, and 4, or 4, 3, 2, and 1,
but transforming them into 1, 3, 4, and 2 will not be acceptable as it is changing the
meaning behind the order of the grades.
First, we will use label encoding to encode the categorical features in the defined
DataFrame:
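A sketch with an invented DataFrame (LabelEncoder works on one 1-D column at a time, so it is applied column by column; note that it is primarily intended for target labels):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"gender": ["F", "M", "F", "M"],
                   "group":  ["A", "B", "C", "A"]})

le = LabelEncoder()
df["gender_label"] = le.fit_transform(df["gender"])   # expects a Series (1-D)
df["group_label"] = le.fit_transform(df["group"])
print(df)
print(le.classes_)    # categories seen for the last fitted column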
Then, in the same way, we will use ordinal encoding to encode the categorical
features. Note that fit_transform() expects a DataFrame: df["gender"] is a Series,
while df[["gender"]] is a DataFrame:
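A corresponding sketch with OrdinalEncoder, using the same invented df:

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
# df["gender"] is a Series (1-D); df[["gender"]] is a DataFrame (2-D),
# which is what OrdinalEncoder's fit_transform() expects
df["gender_ordinal"] = oe.fit_transform(df[["gender"]]).ravel()
print(oe.categories_)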
Then, we will try to perform one-hot encoding for categorical feature
transformation:
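A sketch with OneHotEncoder on the same invented df (the sparse_output argument requires a recent scikit-learn; older versions use sparse=False instead):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)      # dense array instead of a sparse matrix
gender_onehot = ohe.fit_transform(df[["gender"]])
print(ohe.get_feature_names_out())            # e.g. ['gender_F', 'gender_M']
print(gender_onehot)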
Example of standardization:

The dataset and the standardization formula are

$$\text{Dataset} = \begin{bmatrix} 1 & -1 & 2 \\ 2 & 0 & 0 \\ 0 & 1 & -1 \end{bmatrix}, \qquad z = \frac{x - \mu}{\sigma}, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Col 1: $\sigma^2 = \frac{1}{3}\left((1-1)^2 + (2-1)^2 + (0-1)^2\right) = \frac{2}{3}$, so $\sigma = \sqrt{\tfrac{2}{3}} = 0.816$

Col 2: $\sigma = 0.816$, Col 3: $\sigma = 1.24$

Applying the definition to every entry (the column means are $\mu_1 = 1$, $\mu_2 = 0$, $\mu_3 = \tfrac{1}{3}$):

$$\begin{bmatrix} \frac{1-1}{0.816} & \frac{-1-0}{0.816} & \frac{2-1/3}{1.24} \\[4pt] \frac{2-1}{0.816} & \frac{0-0}{0.816} & \frac{0-1/3}{1.24} \\[4pt] \frac{0-1}{0.816} & \frac{1-0}{0.816} & \frac{-1-1/3}{1.24} \end{bmatrix} \Rightarrow \begin{bmatrix} 0 & -1.22 & 1.33 \\ 1.22 & 0 & -0.26 \\ -1.22 & 1.22 & -1.06 \end{bmatrix}$$
To know the mean and standard deviation for this data set:
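A sketch reproducing the hand calculation with StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.mean_)     # per-column means: [1.  0.  0.333...]
print(scaler.scale_)    # per-column standard deviations: [0.816  0.816  1.247]
print(X_scaled)         # matches the standardized matrix computed by hand above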
You need to make sure the scaling and normalizations you implement don't cause
ties in your feature values, meaning that data points don't lose their differences in the
features that underwent transformation.
Custom transformer
Although Scikit-Learn provides many useful transformers, you will need to write
your own for tasks such as custom cleanup operations or combining specific
attributes. You will want your transformer to work seamlessly with Scikit-Learn
functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not
inheritance), all you need is to create a class and implement three methods: fit()
(returning self), transform(), and fit_transform(). You can get the last one for free by
simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as
a base class (and avoid *args and **kwargs in your constructor), you will get two extra
methods (get_params() and set_params()) that will be useful for automatic
hyperparameter tuning. For example, here's a custom transformer that acts much
like the StandardScaler:
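A sketch of such a transformer, written as a simplified stand-in for StandardScaler (not a full, production-grade implementation):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):     # no *args or **kwargs
        self.with_mean = with_mean

    def fit(self, X, y=None):               # y is accepted but unused
        X = check_array(X)                  # checks that X is an array of finite floats
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]
        return self                         # always return self!

    def transform(self, X):
        check_is_fitted(self)               # verifies fit() has been called
        X = check_array(X)
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_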
Shorthand:
If you don't want to name the transformers, you can use the make_pipeline()
function instead; it takes transformers as positional arguments and creates a
Pipeline, naming each step after its transformer's class.
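For example (the two steps are illustrative; the automatically chosen names are the lowercased class names):

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Steps are automatically named "simpleimputer" and "standardscaler"
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
print(num_pipeline.steps)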
2- Access
The steps are stored as a list, and a sub-pipeline can be extracted using slicing:
pipe[:1] => returns a sub-pipeline containing only the first step.
pipe[-1:] => returns a sub-pipeline containing only the last step.
You can also access a single step by name or position:
pipe["key"] => returns the estimator for that step.
Example:
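For instance, with the num_pipeline sketched above (the step names follow make_pipeline's convention):

pipe = num_pipeline

print(pipe[:1])                  # sub-pipeline containing only the first step (the imputer)
print(pipe[-1:])                 # sub-pipeline containing only the last step (the scaler)
print(pipe["simpleimputer"])     # a single step's estimator, accessed by name
print(pipe[0])                   # the same estimator, accessed by position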
As you can see, there are many data transformation steps that need to be
executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class
to help with such sequences of transformations. Here is a small pipeline for the
numerical attributes, which will first impute then scale the input features:
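A sketch of such a pipeline with explicitly named steps (the step names "impute" and "standardize" are arbitrary choices):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),     # fill missing values with the median
    ("standardize", StandardScaler()),                # then standardize the features
])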
Since listing all the column names is not very convenient, Scikit-Learn provides a
make_column_selector() function that returns a selector function you can use to
automatically select all the features of a given type, such as numerical or categorical.
You can pass this selector function to ColumnTransformer instead of column
names or indices. Moreover, if you don't care about naming the transformers, you
can use make_column_transformer(), which chooses the names for you, just like
make_pipeline() does. For example, the following code creates the same
ColumnTransformer as earlier, except the transformers are automatically named
"pipeline-1" and "pipeline-2" instead of "num" and "cat":
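A sketch of both variants (num_pipeline and cat_pipeline stand in for the numerical and categorical pipelines built earlier; the dtype-based selectors are one reasonable choice):

import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

# Explicitly named transformers, with columns selected by dtype
preprocessing = ColumnTransformer([
    ("num", num_pipeline, make_column_selector(dtype_include=np.number)),
    ("cat", cat_pipeline, make_column_selector(dtype_include=object)),
])

# The same ColumnTransformer with automatically generated names
# ("pipeline-1" and "pipeline-2")
preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    (cat_pipeline, make_column_selector(dtype_include=object)),
)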
We now have a preprocessing pipeline that takes the entire training dataset, applies
each transformer to the appropriate columns, and then concatenates the transformed
columns horizontally. This returns a NumPy array, but you can get the column names
using preprocessing.get_feature_names_out().
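For example (df and the preprocessing ColumnTransformer are assumed from the sketches above):

X_prepared = preprocessing.fit_transform(df)
print(type(X_prepared))                          # a NumPy array (sparse if mostly zeros)
print(preprocessing.get_feature_names_out())     # the column names after transformation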
Your project is going well and you’re almost ready to train some models! You now
want to create a single pipeline that will perform all the transformations you’ve
experimented with up to now. Let’s recap what the pipeline will do and why:
• Missing values in numerical features will be imputed by replacing them with
the median.
• Missing values in categorical features will be replaced by the most frequent
category.
• The categorical features will be one-hot-encoded, as most machine learning
algorithms only accept numerical inputs.
• All numerical features will be standardized, as most machine learning
algorithms prefer when all features have roughly the same scale.
• Feature engineering: a few ratio features will be computed and added.
Hopefully these will correlate better with the target and thereby help the ML models.
• Features with a long tail will be replaced by their logarithm, as most models prefer
features with roughly uniform or Gaussian distributions (see the sketch after this list).
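As a rough sketch of the last two bullets, FunctionTransformer can wrap simple feature-engineering functions such as a column ratio or a log transform (the column indices and the feature_names_out argument are illustrative and assume a recent scikit-learn):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Ratio of two columns (here, column 0 divided by column 1 of a NumPy array)
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])

# Log transform for long-tailed features
log_transformer = FunctionTransformer(np.log, feature_names_out="one-to-one")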
Then train your models against the smaller training set and evaluate them against
the validation set.
from sklearn.model_selection import cross_val_score
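A small sketch of how cross_val_score is typically used (the model and the tiny dataset are placeholders for your own):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# Tiny illustrative data; in practice pass your prepared training features and labels
X_train = np.arange(20, dtype=float).reshape(-1, 1)
y_train = 3 * X_train.ravel() + np.random.randn(20)

model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)   # 5-fold cross-validation
print(scores.mean(), scores.std())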