Exercises:
Introduction to Machine
Learning
Version 2023-08
Licence
This manual is © 2023, Simon Andrews, Laura Biggins.
This manual is distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0
licence. This means that you are free:
• to copy, distribute, display, and perform the work
• to make derivative works
Under the following conditions:
• Attribution. You must give the original author credit.
• Non-Commercial. You may not use this work for commercial purposes.
• Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work
only under a licence identical to this one.
Please note that:
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.
Full details of this licence can be found at
http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode
Exercise 1: Running machine learning models
In this exercise we have given you a filtered subset of the data in GSE1133, which is a microarray study
measuring gene expression across a panel of around 90 different tissues.
The aim of the model is to try to predict which genes are involved in development. This is defined based
on the “Developmental Process” Gene Ontology category (GO:0032502).
[Figure: a snapshot of the first part of the data]
The stats for the dataset are:
• There are 1241 measured genes
• 522 of the genes are development genes, 719 are not
• There are 92 variables (tissues) we can use for prediction. All of them are quantitative so are
compatible with all model types
• The variable to predict is Categorical (Development or Not Development) so we can only use
models with a Categorical output.
• Although there is a gene column, we aren't going to use it for prediction: it's a categorical value which is different for every gene, so there's no way it can be predictive.
Running Models
To let you try out some of these models you can go to:
https://www.bioinformatics.babraham.ac.uk/shiny/machinelearning/
Here we have built a simple interface which lets you run a variety of different model types on this data.
Just select the model you want to run from the drop-down box and press the "Run Model" button.
After the model has run you will see some information about the model on the left which summarises the
parameters which were used to run it – you should be able to match these to the theory we talked about
before.
On the right you will see a summary of some predictions made by the model. We have run two sets of
data through the model.
1. We re-ran the data used to train the model back through it to see how well it is able to predict
data it has seen before.
2. We set aside a portion of the original data before training the model and then ran this through
the model after it was trained to see how well it works against data it hasn’t seen before.
Results
This table shows a summary of the predictions the model made and how they matched against the known
correct values in the data. It’s important to validate a model against data where you know the answer,
before using it to make predictions on data where the answer isn’t known.
In this table you can see the total number of correct (TRUE) and incorrect (FALSE) predictions the model
made.
Questions:
Run the different models, look at their output and the summary of the predictions they make, then answer the questions below.
1. Do all of the models perform similarly well, or are some better than others?
2. Do the models perform similarly well on the data they have seen before and the data they haven’t
seen before?
3. Do the more complex models perform better than the simpler ones?
4. If you run each model a couple of times, do the results change? If they only change for some of
the models why is this?
5. If you hadn’t run a model, but had simply assigned the most frequent category (Not
Development) to every prediction, how many correct answers would you expect to have seen in
the test data of 249 samples? Do any of the models do substantially better than this?
Changing Model Parameters
We have a second interface which lets you rerun the random forest model whilst changing the
parameters used to construct it:
https://www.bioinformatics.babraham.ac.uk/shiny/optimising_model/
In this version you can change the total number of trees constructed, the number of randomly selected predictors considered at each branch point, and the minimum number of measures which must appear in a node at the bottom of a tree, which stops the trees becoming too complex. (A sketch of how these map onto code is given at the end of this section.)
Try running the model a few times and seeing what effect changing these parameters has on the results.
What settings would you have to use to mimic a conventional decision tree?
What do you think the effect of changing the different parameters would be? Do you see this in the
results?
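Later, in Exercise 3, you will build a random forest in R yourself. For reference, the three options in this interface correspond to standard random forest arguments in tidymodels; below is a minimal sketch of how they might be set in code. The values shown are arbitrary examples chosen for illustration, not recommended settings.

library(tidymodels)

# Sketch only - the values are arbitrary examples, not recommended settings
rand_forest(
  trees = 500,   # total number of trees constructed
  mtry  = 10,    # number of randomly selected predictors tried at each branch point
  min_n = 5      # minimum number of measures allowed in a node at the bottom of a tree
) %>%
  set_engine("ranger") %>%
  set_mode("classification")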
Exercise 2: Evaluating Models
Now that you have learned about the ways in which models can be tested, look back at the results you found in Exercise 1 and examine the additional metrics which were supplied alongside the raw results.
1. Are the models actually identifying developmental genes at a rate which is significantly higher
than you’d get by guessing?
2. What is the balance in the models between sensitivity (the ability to say that a developmental gene is a developmental gene) and specificity (the ability to identify non-developmental genes)? (There is a reminder of the standard definitions after these questions.)
3. Are there differences in the sensitivity / specificity trade-off between the models?
4. Which models appear to be most strongly overfitted to the training data (do well on training, and
poorly on testing)?
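As promised above, here is a reminder of the standard definitions of sensitivity and specificity. These are the general definitions rather than anything specific to the course interface, and the sketch below uses made-up counts and our own helper function names purely for illustration.

# Standard definitions, illustrated with made-up example counts (not real results)
sens_from_counts <- function(TP, FN) TP / (TP + FN)  # true positives / all real developmental genes
spec_from_counts <- function(TN, FP) TN / (TN + FP)  # true negatives / all real non-developmental genes

sens_from_counts(TP = 150, FN = 50)   # 0.75
spec_from_counts(TN = 180, FP = 20)   # 0.9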
Exercise 3: Building your first model
In this exercise you will build your first tidymodels model. You are going to use a dataset comprising all of the canonical proteins in the mouse genome. For each of these you have some basic information about the gene, transcript and protein, plus the compositional breakdown of the protein into its component amino acids.
The aim of your model is to predict which of these proteins contains one or more transmembrane
segments, such that the protein is normally found embedded within a membrane.
To do this you are going to build and train a random forest model. The steps in the modelling procedure
will be:
1. Load the R packages we’re going to need for this analysis
2. Load in the original data
3. Prepare the data for modelling
a. Convert the variable to predict to be a factor
b. Remove the gene_id column
c. Shuffle the rows
d. Remove proteins with missing data
4. Split the data into a training and testing subset
5. Build the model
6. Train the model using the training data
7. Predict the transmembrane proteins from the testing data
8. Check how good the predictions are
Below we will talk you through how to construct a script in RStudio to perform all of these steps. In an
actual modelling experiment we would include more evaluation of the data before starting on the
modelling, so this is a somewhat truncated version of the full procedure you’d use.
To get started you need to open a new R script, save it, then set the location of the data you’re going to
use.
Setting up your environment
Inside RStudio select
File > New File > R Script
Once the script has opened go to File > Save As and save it into the MachineLearningData folder in a
file called model.R
In the RStudio menu select Session > Set Working Directory > To Source File Location
Loading the R packages we need
We will be using two packages in this script: the tidyverse package, which will do the general data manipulation for us, and the tidymodels package, which will do the modelling.
We can load these with
library(tidyverse)
library(tidymodels)
tidymodels_prefer()
The last line here simply says that we should always use the tidymodels version of a function, even if another package provides a function with the same name.
Loading the input data
To load the data from the TSV file it's saved in, we need to run
read_delim("transmembrane_data.txt") -> data
You can then click on the data in the Environment tab (top right) and have a look at what the data looks
like.
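If you prefer to inspect it in the console instead, the glimpse() function from the tidyverse prints a compact summary of every column and its type. This is optional and not required for the exercise.

# Optional: a quick console summary of the columns and their types
data %>%
  glimpse()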
Preparing the data for modelling
Turning transmembrane into a factor
If a column is going to be used as the value to predict then it must have the data type "factor", which is used specifically to represent data that can hold one of a defined set of values. Our transmembrane predictions are currently just in a text column so we need to change that.
data %>%
mutate(
transmembrane = factor(transmembrane)
) -> data
After you’ve run this, hold your mouse over the transmembrane column header when looking at the data.
It should now say that it is a factor.
Removing the gene_id column
In our data the gene_id column just holds the name of the gene. This isn't useful in the model and will just slow things down or encourage overfitting, so we need to remove it.
data %>%
select(-gene_id) -> data
You should now see that the gene_id column has gone, and that the transmembrane column is now the
first one.
Shuffling the rows
For some types of model there may be information contained in the order the rows appear (for example
if all of the transmembrane proteins were next to each other). To prevent this information from having
any effect we can just shuffle all of the rows.
data %>%
sample_frac() -> data
This won't change the structure of your data, but where the original data put all proteins from the same chromosome together you should now see that the rows are all mixed up.
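One thing to note (this is our aside rather than part of the original instructions): the shuffle, the later train/test split and the model fitting all involve randomness, so your exact numbers will differ slightly from run to run. If you want a fully reproducible script you can set a seed near the top of it, for example:

# Optional: fix the random number generator so that shuffling, splitting and
# fitting give the same results each time the script is run
set.seed(42)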
Removing missing values
We will remove any rows in which any of the columns have missing values.
data %>%
na.omit() -> data
After running this you should see that the number of rows in the data goes down from 19,701 to 18,352.
Because we are going to run a random forest model this is all of the preparation we need to do. Later we may try other model types which would need the data to be transformed into a more quantitatively well-behaved form, but tree-based models really don't care.
Splitting the data
Before we construct the model we must set aside some testing data so that we aren't using the same data to test the model as we used to train it.
data %>%
initial_split(prop=0.8, strata=transmembrane) -> split_data
This will split off 80% of our data to be used for training and 20% for testing. The strata option keeps the proportion of transmembrane and soluble proteins roughly the same in both subsets.
We can see the data in the two subsets by running:
training(split_data)
..or..
testing(split_data)
You should see about 14,600 rows in the training data and about 3,600 in the testing.
Building the model
Now all the data is prepared we can go on and build a model. We’re going to build a random forest model
using the ranger engine. We also need to tell it that it’s going to make a classification prediction.
rand_forest(trees=100) %>%
set_engine("ranger") %>%
set_mode("classification") -> forest_model
To see the model you can run
forest_model %>% translate()
Note that a lot of the options in the model fit template are set to “missing_arg()” which means that
they are values we will need to supply later in the process.
Training the model
We now need to train the model. We are going to give it the training data from our split data, and we’re
going to tell it that it should try to predict the transmembrane values using all of the rest of the columns.
forest_model %>%
fit(transmembrane ~ ., data=training(split_data)) -> forest_fit
Once the model is fit we can see it by running
forest_fit
We should see all of the variables for the model in place, and see some of the details of the data and the
fit (number of variables and cases etc).
Testing the model
To test the model we need to use it to make predictions about data where we know the answer, which is
what our testing data is for. We are going to use the predict function to make predictions on this data.
To make a prediction we need to pass in a new dataset containing the same variables as the training data.
forest_fit %>%
predict(testing(split_data))
This will give us something like:
# A tibble: 3,671 × 1
.pred_class
<fct>
1 Soluble
2 Soluble
3 Transmembrane
4 Soluble
5 Soluble
6 Soluble
7 Soluble
8 Soluble
9 Soluble
10 Soluble
The problem with this is that it only outputs the predictions; we don't see the rest of the data, including the column which says what the answer should have been, so we need to join those predictions back onto the testing data.
forest_fit %>%
predict(testing(split_data)) %>%
bind_cols(testing(split_data)) -> prediction_results
You can now click on the prediction_results in the environment window to see the predictions (in the .pred_class column) alongside the known correct answers (in the transmembrane column).
Evaluating the predictions
From the set of predictions we can now see how well the model actually did by comparing the predictions
to the known true values.
We can start by simply counting the number of times we see different combinations of predictions and
true values in the data.
prediction_results %>%
group_by(transmembrane, .pred_class) %>%
count()
From this you can see how many times a correct and an incorrect prediction was made, and the breakdown of the mistakes which were made.
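If you prefer, the conf_mat() function from yardstick (loaded as part of tidymodels) lays out the same counts as a conventional confusion matrix. This is just an alternative view of the same information, not something the exercise requires.

# Optional alternative: the same counts laid out as a confusion matrix
prediction_results %>%
  conf_mat(transmembrane, .pred_class)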
We can also get more specific values for sensitivity and specificity
prediction_results %>%
sens(transmembrane, .pred_class)
..and..
prediction_results %>%
spec(transmembrane, .pred_class)
Finally we can get an overall accuracy value, and we can also get Cohen's kappa value to say whether we're actually performing better than chance on the data.
prediction_results %>%
metrics(transmembrane, .pred_class)
What is your evaluation of how well the model has performed? Feel free to try playing with the setup
parameters for the model to see if you can improve on the initial performance. Remember though that
there is a random component, so just because a model works better once doesn’t mean that those
settings will always be better.
Exercise 4: Using Recipes and Workflows
We’re going to build another model from the same transmembrane data as before, but this time we’re
constructing a neural net.
Because neural networks have more constraints on the data which goes into the model we’re going to
have to do more pre-processing, and we’re going to have to apply this to both the training and testing
data (and we’d have to do it to any unknown proteins in future), so we’re going to automate this with a
recipe and we’re going to integrate this into a workflow to run it.
For the first part of the model where we:
1. Loaded the required packages
2. Loaded the data
3. Prepared the data
4. Split the data into training and testing
We can follow the same steps as before, or we can use the same split_data variable as for the
random forest model.
Building a Recipe
Firstly we’re going to build a recipe which will combine the formula for prediction and the training data.
Once we have it we can then add steps to it to complete the pre-processing.
recipe(
transmembrane ~ . ,
data=training(split_data)
) -> neural_recipe
We can then view the recipe with
neural_recipe
Now we have a recipe we can add processing steps to it. The steps will be:
1. Log transform the gene_length and transcript_length columns
2. Z-Score normalise all of the numeric columns
3. Turn all of the text columns into dummy number columns
neural_recipe %>%
step_log(gene_length, transcript_length) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) -> neural_recipe
Look at the recipe again to see the new steps have been added.
Building the model
We can now create the neural network model. We’re going to use a single hidden layer with 10 nodes
in it. You could play around with the settings made here once you had the basic model in place.
mlp(
epochs = 1000,
hidden_units = 10,
penalty = 0.01,
learn_rate = 0.01
) %>%
set_engine("brulee", validation = 0) %>%
set_mode("classification") -> nnet_model
The arguments here are as follows:
• epochs = how many rounds of refinement (back propagation) the model goes through
• hidden_units = how many nodes we want in the hidden layer.
• penalty = a value which penalises complexity in the model to try to prevent overfitting
• learn_rate = how much the estimates are moved to try to optimise the model
Again, these values could be modified after generating an initial model, but these will give us something
to work from.
We can see the model with
nnet_model %>% translate()
Building a workflow
A workflow will combine the recipe and the model together and will allow us to run everything at once.
workflow() %>%
add_recipe(neural_recipe) %>%
add_model(nnet_model) -> neural_workflow
We can view the workflow with
neural_workflow
Training the model via the workflow
To train the model we run the fit function and pass in our training data. This will preprocess the data
then feed it to the model.
fit(neural_workflow, data=training(split_data)) -> neural_fit
This will take a couple of minutes to complete. Once complete we can see the fitted model with
neural_fit
You should see that a load more parameters have now been set because the model and the pre-processing have been finalised.
Evaluating the Model
We can now use the model to make predictions on our testing data to see how well it is performing. As before, the predict function only returns the predictions, so we need to bind the results to the testing data itself so we can see the predictions alongside the known correct values.
predict(neural_fit, new_data=testing(split_data)) %>%
bind_cols(testing(split_data)) %>%
select(.pred_class, transmembrane) -> neural_predictions
You can look at the contents of the neural_predictions variable to get an idea of how well it did.
Now we can calculate some of the standard metrics from this. We can make up a simple confusion table.
neural_predictions %>%
group_by(.pred_class,transmembrane) %>%
count()
..or if we want to be fancier..
neural_predictions %>%
group_by(.pred_class,transmembrane) %>%
count() %>%
pivot_wider(
names_from=.pred_class,
values_from=n,
names_prefix = "predicted_"
) %>%
rename(true_transmembrane=transmembrane)
We can also calculate the specific metrics
neural_predictions %>%
metrics(transmembrane, .pred_class)
neural_predictions %>%
sens(transmembrane, .pred_class)
neural_predictions %>%
spec(transmembrane, .pred_class)
Additional Exercise: Tuning models
For a final more challenging exercise we are going to try to rerun the transmembrane data but this time
using a k-nearest neighbour model. As well as changing the model type we will also try to optimise the
number of nearest neighbours to use.
The preparation of the data will be the same as before initially, but then we will hit some changes.
For the model you are going to use a knn model, and let the number of neighbours be a tuneable parameter.
nearest_neighbor(neighbors = tune(), weight_func = "triangular") %>%
set_mode("classification") %>%
set_engine("kknn") -> model
For the data you need to build a 10 fold cross validation split of the full dataset, rather than a single 80%
split.
vfold_cv(
data,
v=10
) -> vdata
You can then build a workflow from the model using the same formula as before; a minimal sketch of this step is shown below.
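This sketch assumes your prepared data frame is called data and your model object is called model, as in the chunks above. The later chunks pipe from an object called workflow, so we save it under that name here; any other name would work equally well.

# Combine the tuneable knn model with the same prediction formula as before
workflow() %>%
  add_model(model) %>%
  add_formula(transmembrane ~ .) -> workflow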
Once you have the workflow you can look at the tuneable parameters.
workflow %>%
extract_parameter_set_dials()
…and from these we want to change the neighbors parameter to run from 1 to 50
workflow %>%
extract_parameter_set_dials() %>%
update(
neighbors = neighbors(c(1,50))
) -> tune_parameters
We’re then going to run the workflow generating a regular grid of 20 samples over the 1-50 range. We
are going to measure both the sensitivity and specificity of the model.
workflow %>%
tune_grid(
vdata,
grid = grid_regular(tune_parameters, levels=20),
metrics = metric_set(sens,spec)
) -> tune_results
Finally we can plot out the tuned results to see which value for k we think is best.
autoplot(tune_results)
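If you want to take this one step further (optional, and not part of the exercise as written), the tune package can also pull out the best performing number of neighbours and build a finalised workflow from it. This is a sketch assuming the object names used above; it selects on sensitivity alone, and you might prefer a different metric.

# Pick the number of neighbours with the best sensitivity across the folds
select_best(tune_results, metric = "sens") -> best_k

# Plug that value back into the workflow to give a finalised, ready-to-fit workflow
workflow %>%
  finalize_workflow(best_k) -> final_workflow

The final_workflow could then be fitted and evaluated on a training/testing split in the same way as the earlier models.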