Framing a
Machine Learning Problem
Facilitators:
Rahman, Brian, Eva, Andrew, George,
Mark, Peter, Confred
Today's Agenda
Defining an ML problem and proposing a
solution
Identifying good ML problems
Deciding on ML
Formulating a problem as an ML problem
ML Bootcamp Sept 16 - Oct 7, 2023
Defining an ML problem and proposing a
solution
Defining an ML problem
ML: the process of training software (a model)
to make predictions by learning from data
Branches of ML
Supervised learning
Unsupervised / self-supervised learning
Reinforcement learning
Kinds of ML problems
Supervised and unsupervised ML problems fall under
multiple categories
ML Problem Type | Description | Example
Classification | Predict label for previously unseen example | Identify image of dog from that of cat, bicycle from motor bike
Regression | Predict numerical values | Predicting price of houses
Clustering | Group similar examples | Most relevant documents (unsupervised)
Association rule learning | Infer likely association patterns in data | If you buy a bed, you are likely to buy a mattress too (unsupervised)
Structured output | Create complex output | Image recognition bounding boxes
Ranking | Identify position on a scale or status | Search result ranking in a search engine
Check Your Understanding
https://developers.google.com/machine-learning/problem-framing/cases#check-your-understanding
The ML Mindset
"Machine Learning changes the way
you think about a problem. The focus
shifts from a mathematical science to a
natural science, running experiments
and using statistics, not logic, to
analyse its results." - Peter Norvig,
Google Research Director
Experimental Design
Scientific method
It is helpful to think of the ML process as an
experiment where we run test after test after test
to converge on a workable model
Like an experiment, the process can be exciting,
challenging, and ultimately worthwhile
Step | Example
1. Set the research goal | I want to predict how heavy traffic will be on a given day.
2. Make a hypothesis | I think the weather forecast is an informative signal for traffic prediction!
3. Collect the required data | Collect historical traffic data and weather data on each day
4. Test your hypothesis | Train a model using this data to predict traffic.
5. Analyze the results you get | Is this model better than existing systems for traffic prediction?
6. Draw a conclusion | I should (not) use this model to make traffic predictions, because of X, Y, and Z.
7. Refine your hypothesis and repeat | Time of year could be a helpful signal for traffic prediction?
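The experiment steps for the traffic example can be sketched in a few lines of Python. This is only an illustration: the traffic counts, the rainy/dry feature and the "average per weather type" model are all invented here, and the "existing system" is assumed to be one that always predicts the overall average.

```python
from statistics import mean

# Step 3: hypothetical historical data as (is_rainy, cars_per_hour). Values are invented.
train = [(True, 950), (True, 1010), (False, 700), (False, 760), (True, 980), (False, 720)]
test = [(True, 1000), (False, 740), (True, 960), (False, 690)]

def mean_absolute_error(predictions, actuals):
    """Average size of the prediction error, in cars per hour."""
    return mean(abs(p - a) for p, a in zip(predictions, actuals))

# Assumed existing system: always predict the overall average traffic.
overall_avg = mean(cars for _, cars in train)
baseline_preds = [overall_avg for _ in test]

# Steps 2 and 4: hypothesis that weather is informative, so predict the
# average traffic seen for that kind of weather in the training data.
rainy_avg = mean(cars for rainy, cars in train if rainy)
dry_avg = mean(cars for rainy, cars in train if not rainy)
model_preds = [rainy_avg if rainy else dry_avg for rainy, _ in test]

# Step 5: compare both on days the model has not seen.
actuals = [cars for _, cars in test]
baseline_mae = mean_absolute_error(baseline_preds, actuals)
model_mae = mean_absolute_error(model_preds, actuals)

# Step 6: keep the weather signal only if it beats the existing system.
use_model = model_mae < baseline_mae
print(f"baseline MAE={baseline_mae:.1f}, model MAE={model_mae:.1f}, use model: {use_model}")
```

Step 7 would repeat the same loop with another candidate signal, such as time of year.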
Identifying good problems for ML
Characteristics of a good ML problem
Clear use case
* Start with the problem, not the solution. Make sure you aren't treating ML as a
hammer for your problems
Focus on problems that would be difficult to solve with traditional
programming, e.g.,
Smart Reply – automated email reply, saves user time
Google Photos – find a specific photo by keyword search without
manual tagging
* ML solves problems by finding patterns in data and adapting to them
Ask yourself the following questions:
What is the problem being faced?
Would it be a good problem for ML?
Identifying good problems for ML
Characteristics of a good ML problem...
Know the problem before focusing on the data
* Be prepared to have your assumptions challenged
Once you have a clear understanding of the problem, list potential
solutions to test in order to find the best model
Expect to try out a few solutions before you
land on a good working model
EDA helps you understand your data, but you can't yet
claim that patterns you find generalize until you check
those patterns against previously unseen data
Failure to check could lead you in the wrong direction or reinforce
stereotypes or bias
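One simple way to check a pattern against previously unseen data is to hold some examples back before doing any EDA. The sketch below assumes a tiny invented data set of (bedroom count, is expensive) pairs and a hypothetical rule "3 or more bedrooms means expensive"; the helper names are made up for illustration.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the examples and hold back a fraction as unseen data."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def rule_accuracy(rule, examples):
    """Fraction of examples where the rule's prediction matches the label."""
    return sum(rule(x) == label for x, label in examples) / len(examples)

# Hypothetical data: (bedroom_count, is_expensive). Values are invented.
data = [(1, False), (2, False), (2, False), (3, True), (4, True),
        (3, False), (5, True), (1, False), (4, True), (2, True)]
train, test = train_test_split(data)

def pattern(bedrooms):
    """Pattern spotted during EDA on the training portion only."""
    return bedrooms >= 3

# Only the held-out accuracy says anything about generalization.
print("train accuracy:", rule_accuracy(pattern, train))
print("held-out accuracy:", rule_accuracy(pattern, test))
```

With only a handful of held-out examples the estimate is very noisy; in practice the held-out set should be much larger.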
Identifying good problems for ML
Characteristics of a good ML problem...
Data, data, more data
* ML requires a lot of relevant data
Data collected specifically for your task is most useful
In practice, secondary data is used in the majority of applications
How much is a lot? It depends on the ML problem,
but more data generally improves your model (e.g., robustness) and
its predictive power. A good rule of thumb is to have at least
thousands of examples for basic linear models, and hundreds of
thousands for neural networks. If you have less data, consider a
non-ML solution first and/or transfer learning methods
Identifying good problems for ML
Characteristics of a good ML problem...
Predictive Power
* Your features should contain predictive power
Ensure your data set contains relevant features that
correlate with the phenomenon being investigated
e.g., is bedroom count a good predictor of house prices?
Don't try out features arbitrarily without a hypothesis
Your goal is to build a model that generalizes well to
previously unseen samples and this is possible only
if you use the right features
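A quick first check of whether a feature carries signal is its correlation with the target. The sketch below computes the Pearson correlation by hand for the bedroom-count question; the prices and the deliberately irrelevant "house number" feature are invented for illustration. Correlation only captures linear relationships, so a low value does not prove a feature is useless.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical houses. Prices are in thousands and invented.
bedrooms = [1, 2, 2, 3, 3, 4, 5]
prices = [110, 150, 160, 210, 200, 260, 330]
house_number = [7, 42, 3, 18, 25, 11, 30]  # a feature with no plausible link to price

print(f"bedrooms vs price:     {pearson(bedrooms, prices):.2f}")
print(f"house number vs price: {pearson(house_number, prices):.2f}")
```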
Identifying good problems for ML
Characteristics of a good ML problem...
Predictions vs. Decisions
* Aim to make decisions, not just predictions
Your product should take action on the output of the ML model
ML is better at making decisions than at deriving insight from
data (for the latter, use statistical approaches)
Ensure predictions allow you to take a useful action, e.g.,
a model that predicts likelihood of clicking certain videos
could allow a system to pre-fetch the videos most likely to
be clicked
Examples of prediction / decision pairs
Prediction | Decision
What video the learner wants to watch next | Show those videos in the recommendation bar
Probability someone will click on a search result | If P(click) > 0.12, prefetch the web page
What fraction of a video ad the user will watch | If a small fraction, don't show the user the ad
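In code, a prediction/decision pair is usually just a model output passed through a decision rule. The sketch below uses the 0.12 click threshold from the pairs above; the function names and the 0.25 cut-off for "a small fraction" of an ad are assumptions made for illustration.

```python
PREFETCH_THRESHOLD = 0.12  # from the example: prefetch if P(click) > 0.12

def should_prefetch(p_click: float) -> bool:
    """Decision: prefetch the web page when the predicted click probability is high enough."""
    return p_click > PREFETCH_THRESHOLD

def should_show_ad(predicted_watch_fraction: float, min_fraction: float = 0.25) -> bool:
    """Decision: skip the ad when the user is predicted to watch only a small fraction.

    min_fraction is a hypothetical cut-off, not a value from the slides.
    """
    return predicted_watch_fraction >= min_fraction

# Model outputs would come from a trained model; these values are made up.
for p in (0.05, 0.12, 0.30):
    print(f"P(click)={p:.2f} -> prefetch: {should_prefetch(p)}")
```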
Hard ML problems
Clustering
What does each cluster mean in an unsupervised learning problem?
E.g., if your model indicates that the user is in the blue cluster,
you'll have to determine what the blue cluster represents
Semi-supervised learning may help
Hard ML problems...
Anomaly detection
How do you decide what constitutes an anomaly
in order to get labeled data?
Hard ML problems...
Causation
ML can identify correlations – mutual
relationships or connections between two or
more things. Determining causation (one event
or factor causing another) is harder. It is easy to
see that something happened, but much harder
to understand why it happened
You can't determine causation from only
observational data – you need to run
experiments
Hard ML problems...
No data
If you have no data to train a model, then ML
cannot help you. Without data, use a simple,
heuristic, rule-based system
Some new products with no training data start
with a heuristic rule system, and obtain training
data only after users interact with it
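A heuristic-first product can be sketched as a rule that needs no training data plus a log of what users do with its output. Everything below is hypothetical: the "newest videos first" rule, the catalog and the field names are invented to show the idea, not a recommended design.

```python
interaction_log = []  # grows into labeled training data over time

def heuristic_recommend(videos, top_k=2):
    """Rule of thumb that needs no training data: show the newest videos first."""
    newest_first = sorted(videos, key=lambda v: v["uploaded_day"], reverse=True)
    return [v["id"] for v in newest_first[:top_k]]

def record_interaction(video_id, clicked):
    """Store what was shown and whether the user clicked; this becomes a label later."""
    interaction_log.append({"video_id": video_id, "clicked": clicked})

# Hypothetical catalog; uploaded_day counts days since launch.
catalog = [
    {"id": "cats", "uploaded_day": 3},
    {"id": "snakes", "uploaded_day": 9},
    {"id": "dogs", "uploaded_day": 7},
]

shown = heuristic_recommend(catalog)
record_interaction(shown[0], clicked=True)
record_interaction(shown[1], clicked=False)
print(shown, interaction_log)
```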
Deciding to use ML
Set yourself up for success by thinking about these
things before trying to frame a problem for ML
Start clearly and simply: what would you like the ML model to
do for you?
e.g. I want the ML model to predict the price of a house
What is your ideal outcome?
e.g. tourism recommendations: my ideal outcome is to suggest
tourism destinations that tourists find attractive and worth their
time and money
Success and failure metrics
Make them quantifiable and measurable. What output would you like the ML model
to produce (based on the type of ML problem)?
Formulate problem as an ML problem
Suggested approach for framing an ML problem
1) Articulate your problem
2) Start simple
3) Identify your data sources
4) Design your data for the model
5) Determine where data will come from
6) Determine easily obtained inputs
7) Ability to Learn
8) Think about potential Bias
Articulate your problem
Is it a classification, regression, clustering, or
anomaly detection problem?
Articulate your problem
Write down a succinct problem statement
e.g. Our problem is best framed as 3-class, single-
label classification, which predicts whether a video
will be in one of three classes—{very popular,
somewhat popular, not popular}—28 days after
being uploaded
Start simple
Simplify the problem further if possible, e.g.,
We will predict whether an uploaded video is likely
to become popular or not (binary classification)
We will predict an uploaded video’s popularity in
terms of the number of views it will receive within a
28 day window (regression)
Start by using the simplest model (baseline) possible for
your ML problem
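The simplest possible baselines use no inputs at all: always predict the most common class (classification) or the average target (regression). The sketch below shows both for the video example; the labels and view counts are invented, and the function names are hypothetical. Any real model should beat these before it is worth keeping.

```python
from collections import Counter
from statistics import mean

def majority_class_baseline(train_labels):
    """Classification baseline: always predict the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda example: most_common_label

def mean_baseline(train_targets):
    """Regression baseline: always predict the average training target."""
    average = mean(train_targets)
    return lambda example: average

# Hypothetical training data for the video popularity problem.
labels = ["not popular", "not popular", "popular", "not popular"]
views_28d = [100, 200, 800, 100]

predict_label = majority_class_baseline(labels)
predict_views = mean_baseline(views_28d)

new_video = {"title": "Any video"}  # both baselines ignore the inputs
print(predict_label(new_video), predict_views(new_video))
```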
Identify your data sources
Provide answers to the following questions about your
labels:
How much labeled data do you have?
What is the source of your label?
Is your label closely connected to the decision you will be
making?
Example
Our data set consists of 100,000 examples about past
uploaded videos with popularity data and video descriptions.
Design your Data for the Model
Identify the data that your ML system should
use to make predictions (input -> output),
Title | Channel | Upload time | Uploader's recent videos | Output (label)
My silly cat | Alice | 2018-03-21 08:00 | Another cat video, yet another cat | Very popular
A snake video | Bob | 2018-04-03 12:00 | None | Not popular
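Writing the row layout down as a type makes the input -> output split explicit. The sketch below encodes the two example rows as a Python dataclass; the class and field names are assumptions, and the third, unlabeled row ("Dog tricks" by "Carol") is invented to show what the model sees at prediction time.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class VideoExample:
    """One row: the inputs known at upload time plus the label, if known."""
    title: str
    channel: str
    upload_time: datetime
    uploader_recent_videos: List[str]
    label: Optional[str] = None  # unknown until popularity can be measured

rows = [
    VideoExample("My silly cat", "Alice", datetime(2018, 3, 21, 8, 0),
                 ["Another cat video", "yet another cat"], "Very popular"),
    VideoExample("A snake video", "Bob", datetime(2018, 4, 3, 12, 0),
                 [], "Not popular"),
]

# A new upload has inputs but no label yet; the label is what the model predicts.
new_upload = VideoExample("Dog tricks", "Carol", datetime(2018, 5, 1, 9, 30), [])
print(new_upload)
```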
Determine Where Data Comes From
Assess how much work it will take to develop a data
pipeline to construct each column for a row. When does
the example output become available for training
purposes?
Example
We applied the labels {very popular, somewhat popular, not
popular} to each video that fell within a determined range of
views and "thumbs ups" and determined keyword descriptions
for each video. Hand-generating descriptions is not sustainable,
so we are considering adding a keyword description to the
upload form.
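Applying labels by range can be automated once the view count is available, which also makes clear when the label exists: only after the 28-day window has passed. The sketch below is an assumption-laden illustration: the thresholds are invented, and it uses views only, although the example also mentions "thumbs ups".

```python
from typing import Optional

def popularity_label(views_28d: Optional[int],
                     very_popular_min: int = 100_000,
                     somewhat_popular_min: int = 10_000) -> Optional[str]:
    """Map a 28-day view count to one of the three classes.

    Returns None while the 28-day window has not finished, because the
    label is not yet available for training. Thresholds are hypothetical.
    """
    if views_28d is None:
        return None
    if views_28d >= very_popular_min:
        return "very popular"
    if views_28d >= somewhat_popular_min:
        return "somewhat popular"
    return "not popular"

for views in (250_000, 40_000, 900, None):
    print(views, "->", popularity_label(views))
```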
Determine Easily Obtained Inputs
Pick 1-3 inputs that are easy to obtain and that
you believe would produce a reasonable, initial
outcome
Consider the engineering cost to develop a data
pipeline to prepare the inputs, and the expected
benefit of having each input in the model
Ability to Learn
Will the ML model be able to learn? List aspects
of your problem that might cause difficulty
learning. For example:
The data set doesn't contain enough positive labels.
The training data doesn't contain enough examples.
The labels are too noisy.
The system memorizes the training data, but has
difficulty generalizing to new cases.
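The first two difficulties can be checked before any training by counting examples and positive labels. The sketch below is a minimal check with invented data; the function name and both thresholds are illustrative assumptions, not standards. Label noise and memorization need other checks, such as label audits and held-out evaluation.

```python
from collections import Counter

def learnability_warnings(labels, positive_label,
                          min_examples=1000, min_positive_fraction=0.05):
    """Return a list of warnings about data set size and label balance.

    The default thresholds are illustrative assumptions.
    """
    warnings = []
    total = len(labels)
    if total < min_examples:
        warnings.append(f"only {total} examples (want at least {min_examples})")
    positives = Counter(labels)[positive_label]
    fraction = positives / total if total else 0.0
    if fraction < min_positive_fraction:
        warnings.append(f"only {fraction:.1%} positive labels")
    return warnings

# Hypothetical label column: 20 popular videos out of 2000.
labels = ["popular"] * 20 + ["not popular"] * 1980
print(learnability_warnings(labels, positive_label="popular"))
```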
Think About Potential Bias
Many datasets are biased in some way. These
biases may adversely affect training and the
predictions made, e.g.,
A biased data source may not translate across
multiple contexts
The training sets may not be representative of the
ultimate users of the models and may therefore
provide them with a negative experience
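One rough check for representativeness is to compare how often each group appears in the training data with how often it appears among the intended users. The sketch below assumes a single invented grouping attribute (urban vs rural) and hypothetical helper names; real bias audits look at many attributes and at model outcomes, not only at counts.

```python
from collections import Counter

def group_shares(groups):
    """Share of each group in a list of group labels."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}

def representation_gaps(train_groups, user_groups):
    """Training share minus user share per group; negative means under-represented."""
    train = group_shares(train_groups)
    users = group_shares(user_groups)
    return {g: train.get(g, 0.0) - users.get(g, 0.0) for g in set(train) | set(users)}

# Hypothetical data: training examples are mostly urban, users are evenly split.
train_groups = ["urban"] * 9 + ["rural"] * 1
user_groups = ["urban"] * 5 + ["rural"] * 5

gaps = representation_gaps(train_groups, user_groups)
print(gaps)
```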
Conclusion
It is important to frame your problem properly
for ML
Not all problems require or need to be solved
using ML
Quiz
Complete the quiz at this link
https://elearning.umu.ac.ug/mod/quiz/attempt.php?attempt=15240&cmid=17874