

End-to-End Machine Learning Project Guide

The document outlines the steps involved in an end-to-end machine learning project, specifically focusing on building a model to predict housing prices using California census data. It covers data acquisition, preparation, model selection, and performance evaluation, emphasizing the importance of understanding the problem and the assumptions made. Additionally, it discusses techniques for data cleaning, handling categorical attributes, and feature scaling to ensure effective model training.


Machine Learning

Siraj

Sunday, May 18, 2025 Department of Computer Science, BUITEMS 1


Lecture Content
• End-to-End ML Project



End-to-End ML Project
• Look at the big picture.
• Get the data.
• Discover and visualize the data to gain insights.
• Prepare the data for Machine Learning algorithms.
• Select a model and train it.
• Fine-tune your model.
• Present your solution.
• Launch, monitor, and maintain your system.



Look at the big picture
• Welcome to the Machine Learning Housing Corporation!
• Your first task is to use California census data to build a model of
housing prices in the state.
• This data includes metrics such as the population, median
income, and median housing price for each block group in
California.
• Block groups are the smallest geographical unit for which the US
Census Bureau publishes sample data (a block group typically
has a population of 600 to 3,000 people).
• We will call them “districts” for short. Your model should learn
from this data and be able to predict the median housing price in
any district, given all the other metrics.
Frame the Problem

• Building a model is probably not the end goal.


• How does the company expect to use and benefit from this
model?
• Knowing the objective is important because it will determine
how you frame the problem, which algorithms you will
select, which performance measure you will use to evaluate
your model, and how much effort you will spend tweaking it.
Check the Assumptions
• It is good practice to list and verify the assumptions that
have been made so far.
• For example, the district prices that your system
outputs are going to be fed into a downstream Machine
Learning system, and you assume that these prices are
going to be used as such.
• But what if the downstream system converts the prices
into categories (e.g., “cheap,” “medium,” or
“expensive”) and then uses those categories instead of
the prices themselves?



Get the Data
• Time to get your hands dirty.
• Download the data:
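The slide's download snippet is not reproduced here; below is a sketch of a fetch helper. The URL and local paths are assumptions (they follow the layout of the Hands-On ML companion repository), not something the slides specify:

```python
import os
import tarfile
import urllib.request

# Assumed dataset location -- the slides do not give a URL; this follows
# the layout of the Hands-On ML companion repository.
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract housing.csv into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
```

Calling fetch_housing_data() leaves housing.csv under datasets/housing/, ready for loading.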



Visualize the Data


Get the Data
• Alternatively, you can download the CSV file, upload it to
Jupyter, and load it with pandas:
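For example, with pd.read_csv (a two-row inline sample stands in for the uploaded file here, so the snippet is self-contained):

```python
import io
import pandas as pd

# With a real upload you would call pd.read_csv("housing.csv");
# a two-row inline sample stands in for the file here.
sample = io.StringIO(
    "longitude,latitude,median_house_value\n"
    "-122.23,37.88,452600.0\n"
    "-122.22,37.86,358500.0\n"
)
housing = pd.read_csv(sample)
print(housing.shape)  # (2, 3)
```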



Take a Quick Look at the Data Structure
• Each row represents one district.
• There are 10 attributes: longitude, latitude,
housing_median_age, total_rooms, total_bedrooms,
population, households, median_income,
median_house_value, and ocean_proximity.
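A quick look typically uses head() and info(); sketched here on a tiny three-column stand-in (the real dataset has 20,640 rows and all 10 attributes):

```python
import pandas as pd

# Tiny stand-in for the housing DataFrame (the real data has 20,640 rows)
housing = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "median_income": [8.3252, 8.3014],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
})
print(housing.head())  # first rows of the table
housing.info()         # dtype and non-null count for each column
```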





Looking at the data
• There are 20,640 instances in the dataset, which means
that it is fairly small by Machine Learning standards, but
it’s perfect to get started.
• Notice that the total_bedrooms attribute has only
20,433 non-null values, meaning that 207 districts are
missing this feature. We will need to take care of this
later.



ocean_proximity Attribute
• All attributes are numerical, except the ocean_proximity
field.
• Its type is object, so it could hold any kind of Python
object.
• But since you loaded this data from a CSV file, you know
that it must be a text attribute.
• Looking at the top five rows, the values in the
ocean_proximity column are repetitive, which means it is a
categorical attribute.
• You can find out what categories exist and how many
districts belong to each category by using the
value_counts() method.
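For example (a short hand-built Series stands in for the real column):

```python
import pandas as pd

ocean_proximity = pd.Series(
    ["<1H OCEAN", "INLAND", "<1H OCEAN", "NEAR BAY", "INLAND", "<1H OCEAN"]
)
# Counts how many rows fall into each category, most frequent first
print(ocean_proximity.value_counts())
```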



Histogram of Numerical Data
Create Test Set
• Creating a test set is theoretically simple: pick some
instances randomly, typically 20% of the dataset (or
less if your dataset is very large), and set them aside.



You can then use this function like this:
• 80% goes to the training set

• 20% goes to the test set
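The function itself is not shown in this text version; a minimal sketch of such a split_train_test(), with the 80/20 usage:

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_ratio):
    # Shuffle the row indices, then slice off test_ratio of them
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

housing = pd.DataFrame({"median_income": range(100)})
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), len(test_set))  # 80 20
```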



Problemo!!!
• If you run the program again, it will generate a different
test set!
• Over time, you (or your Machine Learning algorithms)
will get to see the whole dataset, which is what you
want to avoid.
• One solution is to save the test set on the first run and
then load it in subsequent runs.
• Another option is to set the random number generator’s
seed (e.g., with np.random.seed(42)) before calling
np.random.permutation() so that it always generates
the same shuffled indices
Solution:
• But both these solutions will break the next time you
fetch an updated dataset
• To have a stable train/test split even after updating the
dataset, a common solution is to use each instance’s
identifier to decide whether or not it should go in the
test set (assuming instances have a unique and
immutable identifier).
• Compute a hash of each instance’s identifier and put
that instance in the test set if the hash is lower than or
equal to 20% of the maximum hash value.



• This ensures that the test set will remain consistent
across multiple runs, even if you refresh the dataset.

• Adding Identifier:



The Easy Way
• Scikit-Learn provides a few functions to split datasets
into multiple subsets in various ways.
• The simplest function is train_test_split(), which does
pretty much the same thing as the function
split_train_test(), with a couple of additional features.
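For example:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

housing = pd.DataFrame({"median_income": range(100)})
# random_state fixes the shuffle seed, so the split is reproducible
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))  # 80 20
```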



Stratified Sampling
• Stratified sampling is a method of obtaining a
representative sample from a population that
researchers have divided into relatively similar
subpopulations (strata).
• The population is divided into homogeneous subgroups
called strata, and the right number of instances are
sampled from each stratum to guarantee that the test
set is representative of the overall population.



median_income (attribute)
• Suppose median income is a very important attribute to
predict median housing prices.
• Ensure that the test set is representative of the various
categories of incomes in the whole dataset.
• Since the median income is a continuous numerical
attribute, you first need to create an income category
attribute.
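For example, with pd.cut (the bin edges below are illustrative choices, not prescribed by the slides):

```python
import numpy as np
import pandas as pd

housing = pd.DataFrame({"median_income": [0.5, 2.0, 3.5, 5.0, 9.0]})
# Bin the continuous income into 5 ordered categories
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])
print(housing["income_cat"].tolist())  # [1, 2, 3, 4, 5]
```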



Scikit-Learn’s StratifiedShuffleSplit class
• Stratified sampling based on the income category:
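A sketch of stratified splitting, on a small synthetic frame with a 60/40 category mix:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# 100 rows with income categories in a 60/40 mix
housing = pd.DataFrame({"income_cat": [1] * 60 + [2] * 40,
                        "median_income": range(100)})
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]

# The 60/40 mix is preserved in the 20-row test set: 12 vs. 8
print(strat_test_set["income_cat"].value_counts().to_dict())
```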



Select a Performance Measure
• A typical performance measure for regression problems
is the Root Mean Square Error (RMSE).
• It gives an idea of how much error the system typically
makes in its predictions, with a higher weight for large
errors.
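RMSE is the square root of the mean of the squared errors; a minimal sketch:

```python
import numpy as np

def rmse(labels, predictions):
    # Square root of the mean squared error; large errors dominate
    errors = np.asarray(labels) - np.asarray(predictions)
    return np.sqrt(np.mean(errors ** 2))

print(rmse([200000, 300000], [210000, 290000]))  # 10000.0
```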



Prepare the Data for Machine Learning Algorithms
• It’s time to prepare the data for your Machine Learning
algorithms.
• Steps involved in preparing data are:
• Data Cleaning
• Handling Text and Categorical Attributes
• Custom Transformers
• Feature Scaling
• Transformation Pipelines



Data Cleaning
• Most Machine Learning algorithms cannot work with
missing features
• total_bedrooms attribute has some missing values
• You have three options:
• Get rid of the corresponding districts.
• Get rid of the whole attribute.
• Set the values to some value (zero, the mean, the median,
etc.)
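The three options map directly onto pandas operations (a three-row stand-in frame is used here):

```python
import pandas as pd

housing = pd.DataFrame({"total_bedrooms": [100.0, None, 300.0],
                        "population": [500, 600, 700]})

# Option 1: drop the districts with missing values
option1 = housing.dropna(subset=["total_bedrooms"])
# Option 2: drop the whole attribute
option2 = housing.drop("total_bedrooms", axis=1)
# Option 3: fill the holes with the column median
median = housing["total_bedrooms"].median()  # 200.0
option3 = housing.fillna({"total_bedrooms": median})
```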



Doing It the Scikit-Learn Way:
• Scikit-Learn provides a handy class to take care of
missing values: Imputer
• First, you need to create an Imputer instance, specifying
that you want to replace each attribute’s missing values
with the median of that attribute:
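In recent scikit-learn versions this class is called SimpleImputer (the slides use the older Imputer name); a sketch on a tiny numerical frame:

```python
import pandas as pd
from sklearn.impute import SimpleImputer  # named Imputer in old scikit-learn

housing_num = pd.DataFrame({"total_bedrooms": [100.0, None, 300.0],
                            "population": [500.0, 600.0, 700.0]})
imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)       # computes each column's median
print(imputer.statistics_)     # [200. 600.]
X = imputer.transform(housing_num)  # NumPy array, NaNs replaced by medians
```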



• Since the median can only be computed on numerical
attributes, create a copy of the data without the text
attribute ocean_proximity.
• Now you can fit the imputer instance to the training
data using the fit() method.



• The imputer has simply computed the median of each
attribute and stored the result in its statistics_ instance
variable.
• Only the total_bedrooms attribute had missing values,
but we cannot be sure that there won’t be any missing
values in new data after the system goes live, so it is
safer to apply the imputer to all the numerical attributes



• Now you can use this “trained” imputer to transform the
training set by replacing missing values with the learned
medians.
• The result is a plain NumPy array containing the
transformed features. If you want to put it back into a
Pandas DataFrame, it’s simple:
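This step can be sketched as follows (the array and column names are stand-ins for the imputer's output):

```python
import numpy as np
import pandas as pd

# X stands in for the array returned by the imputer's transform()
X = np.array([[100.0, 500.0], [200.0, 600.0]])
housing_tr = pd.DataFrame(X, columns=["total_bedrooms", "population"])
print(housing_tr.shape)  # (2, 2)
```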



Handling Text and Categorical Attributes
• Earlier we left out the categorical attribute
ocean_proximity because it is a text attribute so we
cannot compute its median.
• Most Machine Learning algorithms prefer to work with
numbers anyway, so let’s convert these text labels to
numbers.
• Scikit-Learn provides a transformer for this task called
LabelEncoder:
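For example (current scikit-learn recommends OrdinalEncoder for input features, but LabelEncoder still works as the slide describes):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
housing_cat = ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
print(housing_cat_encoded)  # [0 1 2 1]
print(encoder.classes_)     # ['<1H OCEAN' 'INLAND' 'NEAR BAY']
```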



• This is better: now we can use this numerical data in
any ML algorithm. You can look at the mapping that this
encoder has learned using the classes_ attribute
• (“<1H OCEAN” is mapped to 0, “INLAND” is mapped to
1, etc.):



OneHotEncoder
• Scikit-Learn provides a OneHotEncoder encoder to
convert integer categorical values into one-hot vectors.
• Let’s encode the categories as one-hot vectors.
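In recent scikit-learn versions, OneHotEncoder accepts the text categories directly, so the intermediate integer step is optional; note the input must be 2D:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Input must be 2D: one column of categories
housing_cat = np.array([["<1H OCEAN"], ["INLAND"], ["NEAR BAY"], ["INLAND"]])
encoder = OneHotEncoder()  # returns a SciPy sparse matrix by default
housing_cat_1hot = encoder.fit_transform(housing_cat)
print(housing_cat_1hot.toarray())  # dense 4x3 array, one 1 per row
```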



• Notice that the output is a SciPy sparse matrix, instead of
a NumPy array.
• This is very useful when you have categorical attributes
with thousands of categories.
• After one-hot encoding we get a matrix with thousands of
columns, and the matrix is full of zeros except for a
single 1 per row.
• Using up tons of memory mostly to store zeros would be
very wasteful, so instead a sparse matrix only stores the
location of the nonzero elements.



• You can use it mostly like a normal 2D array, but if you
really want to convert it to a (dense) NumPy array, just
call the toarray() method:



Alternative Way
• We can apply both transformations (from text
categories to integer categories, then from integer
categories to one-hot vectors) in one shot using the
LabelBinarizer class:

• Note that this returns a dense NumPy array by default.
You can get a sparse matrix instead by passing
sparse_output=True to the LabelBinarizer constructor.
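For example:

```python
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()  # pass sparse_output=True for a sparse matrix
housing_cat_1hot = encoder.fit_transform(
    ["<1H OCEAN", "INLAND", "NEAR BAY", "INLAND"])
print(housing_cat_1hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]
```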



Custom Transformers
• Although Scikit-Learn provides many useful transformers,
you will need to write your own for tasks such as custom
cleanup operations or combining specific attributes.
• You will want your transformer to work seamlessly with
Scikit-Learn functionalities (such as pipelines), and since
Scikit-Learn relies on duck typing (not inheritance), all you
need is to create a class and implement three methods:
• fit() (returning self)
• transform()
• fit_transform()
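A minimal sketch of such a transformer (the combined attribute and the column positions are illustrative; they depend on your array's layout):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Illustrative column positions; they depend on your array's layout
rooms_ix, households_ix = 0, 1

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    """Appends a rooms-per-household column. TransformerMixin supplies
    fit_transform(); BaseEstimator supplies get_params()/set_params()."""
    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        return np.c_[X, rooms_per_household]

X = np.array([[6.0, 2.0], [9.0, 3.0]])
print(CombinedAttributesAdder().fit_transform(X))
# [[6. 2. 3.]
#  [9. 3. 3.]]
```

Inheriting from TransformerMixin means fit_transform() comes for free once fit() and transform() are defined.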



Feature Scaling
• One of the most important transformations you need to apply to
your data is feature scaling.
• With few exceptions, Machine Learning algorithms don’t perform
well when the input numerical attributes have very different scales.
• This is the case for the housing data:
• the total number of rooms ranges from about 6 to 39,320, while the
median incomes only range from 0 to 15. Note that scaling the target
values is generally not required.
• There are two common ways to get all attributes to have the same
scale: min-max scaling and standardization.



Min-Max Scaling
• Min-max scaling (many people call this normalization) is
quite simple: values are shifted and rescaled so that
they end up ranging from 0 to 1.
• We do this by subtracting the min value and dividing by
the max minus the min.
• Scikit-Learn provides a transformer called MinMaxScaler
for this.
• It has a feature_range hyperparameter that lets you
change the range if you don’t want 0–1 for some reason.
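For example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [3.0], [5.0]])
scaler = MinMaxScaler()  # feature_range=(0, 1) by default
print(scaler.fit_transform(X).ravel())  # [0.  0.5 1. ]
```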



Standardization
• Standardization is quite different: first it subtracts the mean value (so
standardized values always have a zero mean), and then it divides by
the standard deviation so that the resulting distribution has unit variance.
• Unlike min-max scaling, standardization does not bound values to a
specific range, which may be a problem for some algorithms (e.g.,
neural networks often expect an input value ranging from 0 to 1).
• However, standardization is much less affected by outliers.
• For example, suppose a district had a median income equal to 100 (by
mistake).
• Min-max scaling would then crush all the other values from 0–15 down
to 0–0.15, whereas standardization would not be much affected.



StandardScaler
• Scikit-Learn provides a transformer called
StandardScaler for standardization
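For example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [3.0], [5.0]])
scaled = StandardScaler().fit_transform(X)
print(scaled.ravel())  # zero mean, unit variance
```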



Transformation Pipelines
• As you can see, there are many data transformation
steps that need to be executed in the right order.
• Fortunately, Scikit-Learn provides the Pipeline class to
help with such sequences of transformations.
• Here is a small pipeline for the numerical attributes:
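A sketch of such a numerical pipeline (SimpleImputer is the current name of the imputer class):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # fill missing values
    ("std_scaler", StandardScaler()),                # then standardize
])

X = np.array([[1.0], [np.nan], [5.0]])
housing_num_tr = num_pipeline.fit_transform(X)       # runs each step in order
```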



• The Pipeline constructor takes a list of name/estimator pairs
defining a sequence of steps.
• All but the last estimator must be transformers (i.e., they
must have a fit_transform() method). The names can be
anything you like.
• When you call the pipeline’s fit() method, it calls
fit_transform() sequentially on all transformers, passing the
output of each call as the parameter to the next call, until it
reaches the final estimator, for which it just calls the fit()
method.



• And you can run the whole pipeline simply:



• Each subpipeline starts with a selector transformer:
• It simply transforms the data by selecting the desired
attributes (numerical or categorical), dropping the rest,
and converting the resulting DataFrame to a NumPy
array.
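A sketch of such a selector (Scikit-Learn does not ship this class; newer versions offer ColumnTransformer for the same purpose):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Keeps the named columns and returns them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values

housing = pd.DataFrame({"median_income": [1.0, 2.0],
                        "ocean_proximity": ["INLAND", "NEAR BAY"]})
num_only = DataFrameSelector(["median_income"]).fit_transform(housing)
print(num_only)  # a (2, 1) NumPy array
```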



Try It Out!