Exercises:
Introduction to Machine
Learning
Version 2023-08
Licence
This manual is © 2023, Simon Andrews, Laura Biggins.
This manual is distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0
licence. This means that you are free:
• to copy, distribute, display, and perform the work
• to make derivative works
Under the following conditions:
• Attribution. You must give the original author credit.
• Non-Commercial. You may not use this work for commercial purposes.
• Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work
only under a licence identical to this one.
Please note that:
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this license impairs or restricts the author's moral rights.
Full details of this licence can be found at
http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode
Exercise 1: Running machine learning models
In this exercise we have given you a filtered subset of the data in GSE1133, which is a microarray study
measuring gene expression across a panel of around 90 different tissues.
The aim of the model is to try to predict which genes are involved in development. This is defined based
on the “Developmental Process” Gene Ontology category (GO:0032502).
[Figure: a snapshot of the first part of the data]
The stats for the dataset are:
• There are 1241 measured genes
• 522 of the genes are development genes, 719 are not
• There are 92 variables (tissues) we can use for prediction. All of them are quantitative so are
compatible with all model types
• The variable to predict is Categorical (Development or Not Development) so we can only use
models with a Categorical output.
• Although there is a gene column, we aren't going to use it for prediction: it's a categorical value which is different for every gene, so there's no way it can be predictive.
Running Models
To let you try out some of these models you can go to:
https://www.bioinformatics.babraham.ac.uk/shiny/machinelearning/
Here we have built a simple interface which lets you run a variety of different model types on this data.
Just select the model you want to run from the drop-down box and press the "Run Model" button.
After the model has run you will see some information about the model on the left which summarises the
parameters which were used to run it – you should be able to match these to the theory we talked about
before.
On the right you will see a summary of some predictions made by the model. We have run two sets of
data through the model.
1. We re-ran the data used to train the model back through it to see how well it is able to predict
data it has seen before.
2. We set aside a portion of the original data before training the model and then ran this through
the model after it was trained to see how well it works against data it hasn’t seen before.
Results
This table shows a summary of the predictions the model made and how they matched against the known
correct values in the data. It’s important to validate a model against data where you know the answer,
before using it to make predictions on data where the answer isn’t known.
In this table you can see the total number of correct (TRUE) and incorrect (FALSE) predictions the model
made.
Questions:
Run the different models, look at their output and the summary of the predictions they make, then answer the questions below.
1. Do all of the models perform similarly well, or are some better than others?
2. Do the models perform similarly well on the data they have seen before and the data they haven’t
seen before?
3. Do the more complex models perform better than the simpler ones?
4. If you run each model a couple of times, do the results change? If they only change for some of
the models why is this?
5. If you hadn’t run a model, but had simply assigned the most frequent category (Not
Development) to every prediction, how many correct answers would you expect to have seen in
the test data of 249 samples? Do any of the models do substantially better than this?
Changing Model Parameters
We have a second interface which lets you rerun the random forest model whilst changing the
parameters used to construct it:
https://www.bioinformatics.babraham.ac.uk/shiny/optimising_model/
In this version you can change the total number of trees constructed, the number of randomly selected predictors considered at each branch point, and the minimum number of measures which must appear in a node at the bottom of a tree, which stops the trees becoming too complex. (A sketch of how these map onto code is given at the end of this section.)
Try running the model a few times and seeing what effect changing these parameters has on the results.
What settings would you have to use to mimic a conventional decision tree?
What do you think the effect of changing the different parameters would be? Do you see this in the
results?
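Later, in Exercise 3, you will build a random forest in R yourself. For reference, the three options in this interface correspond to standard random forest arguments in tidymodels; below is a minimal sketch of how they might be set in code. The values shown are arbitrary examples chosen for illustration, not recommended settings.

library(tidymodels)

# Sketch only - the values are arbitrary examples, not recommended settings
rand_forest(
  trees = 500,   # total number of trees constructed
  mtry  = 10,    # number of randomly selected predictors tried at each branch point
  min_n = 5      # minimum number of measures allowed in a node at the bottom of a tree
) %>%
  set_engine("ranger") %>%
  set_mode("classification")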
Exercise 2: Evaluating Models
Now that you have learned about the ways in which models can be tested, look back at the results you found in Exercise 1 and examine the additional metrics which were supplied alongside the raw results.
1. Are the models actually identifying developmental genes at a rate which is significantly higher
than you’d get by guessing?
2. What is the balance in the models between sensitivity (the ability to say that a developmental gene is a developmental gene) and specificity (the ability to identify non-developmental genes)? (There is a reminder of the standard definitions after these questions.)
3. Are there differences in the sensitivity / specificity trade-off between the models?
4. Which models appear to be most strongly overfitted to the training data (do well on training, and
poorly on testing)?
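As promised above, here is a reminder of the standard definitions of sensitivity and specificity. These are the general definitions rather than anything specific to the course interface, and the sketch below uses made-up counts and our own helper function names purely for illustration.

# Standard definitions, illustrated with made-up example counts (not real results)
sens_from_counts <- function(TP, FN) TP / (TP + FN)  # true positives / all real developmental genes
spec_from_counts <- function(TN, FP) TN / (TN + FP)  # true negatives / all real non-developmental genes

sens_from_counts(TP = 150, FN = 50)   # 0.75
spec_from_counts(TN = 180, FP = 20)   # 0.9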
Exercise 3: Building your first model
In this exercise you will build your first tidymodels model. You are going to use a dataset comprising all of the canonical proteins in the mouse genome. For each of these you have some basic information about the gene, transcript and protein, plus the compositional breakdown of the protein into its component amino acids.
The aim of your model is to predict which of these proteins contains one or more transmembrane
segments, such that the protein is normally found embedded within a membrane.
To do this you are going to build and train a random forest model. The steps in the modelling procedure
will be:
1. Load the R packages we’re going to need for this analysis
2. Load in the original data
3. Prepare the data for modelling
a. Convert the variable to predict to be a factor
b. Remove the gene_id column
c. Shuffle the rows
d. Remove proteins with missing data
4. Split the data into a training and testing subset
5. Build the model
6. Train the model using the training data
7. Predict the transmembrane proteins from the testing data
8. Check how good the predictions are
Below we will talk you through how to construct a script in RStudio to perform all of these steps. In an
actual modelling experiment we would include more evaluation of the data before starting on the
modelling, so this is a somewhat truncated version of the full procedure you’d use.
To get started you need to open a new R script, save it, then set the location of the data you’re going to
use.
Setting up your environment
Inside RStudio select
File > New File > R Script
Once the script has opened go to File > Save As and save it into the MachineLearningData folder in a
file called model.R
In the RStudio menu select Session > Set Working Directory > To Source File Location
Loading the R packages we need
We will be using two packages in this script: the tidyverse package, which will do the general data manipulation for us, and the tidymodels package, which will do the modelling.
We can load these with
library(tidyverse)
library(tidymodels)
tidymodels_prefer()
The last line here simply says that we should always use the tidymodels version of a function, even if another package provides a function with the same name.
Loading the input data
To load the data from the TSV file it's saved in, we need to run
read_delim("transmembrane_data.txt") -> data
You can then click on the data in the Environment tab (top right) and have a look at what the data looks
like.
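If you prefer to inspect it in the console instead, the glimpse() function from the tidyverse prints a compact summary of every column and its type. This is optional and not required for the exercise.

# Optional: a quick console summary of the columns and their types
data %>%
  glimpse()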
Preparing the data for modelling
Turning transmembrane into a factor
If a column is going to be used as the value to predict then it must have the data type "factor", which is used specifically to represent data that can hold one of a defined set of values. Our transmembrane predictions are currently just in a text column so we need to change that.
data %>%
mutate(
transmembrane = factor(transmembrane)
) -> data
After you’ve run this, hold your mouse over the transmembrane column header when looking at the data.
It should now say that it is a factor.
Removing the gene_id column
In our data the gene_id column just holds the name of the gene. This isn't useful in the model and will just slow things down or encourage overfitting, so we need to remove it.
data %>%
select(-gene_id) -> data
You should now see that the gene_id column has gone, and that the transmembrane column is now the
first one.
Shuffling the rows
For some types of model there may be information contained in the order the rows appear (for example
if all of the transmembrane proteins were next to each other). To prevent this information from having
any effect we can just shuffle all of the rows.
data %>%
sample_frac() -> data
This won't change the structure of your data, but where the original data put all proteins from the same chromosome together you should now see that the rows are all mixed up.
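One thing to note (this is our aside rather than part of the original instructions): the shuffle, the later train/test split and the model fitting all involve randomness, so your exact numbers will differ slightly from run to run. If you want a fully reproducible script you can set a seed near the top of it, for example:

# Optional: fix the random number generator so that shuffling, splitting and
# fitting give the same results each time the script is run
set.seed(42)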
Removing missing values
We will remove any rows in which any of the columns have missing values.
data %>%
na.omit() -> data
After running this you should see that the number of rows in the data goes down from 19,701 to 18,352.
Because we are going to run a random forest model this is all of the preparation we need to do. Later we may try other model types which would need the data to be transformed into a more quantitatively well-behaved form, but tree-based models really don't care.
Splitting the data
Before we construct the model we must set aside some testing data so that we aren't using the same data to test the model as we used to train it.
data %>%
initial_split(prop=0.8, strata=transmembrane) -> split_data
This will split off 80% of our data to be used for training and 20% for testing. The strata option keeps the proportion of transmembrane and soluble proteins roughly the same in both subsets.
We can see the data in the two subsets by running:
training(split_data)
..or..
testing(split_data)
You should see about 14,600 rows in the training data and about 3,600 in the testing.
Building the model
Now all the data is prepared we can go on and build a model. We’re going to build a random forest model
using the ranger engine. We also need to tell it that it’s going to make a classification prediction.
rand_forest(trees=100) %>%
set_engine("ranger") %>%
set_mode("classification") -> forest_model
To see the model you can run
forest_model %>% translate()
Note that a lot of the options in the model fit template are set to “missing_arg()” which means that
they are values we will need to supply later in the process.
Training the model
We now need to train the model. We are going to give it the training data from our split data, and we’re
going to tell it that it should try to predict the transmembrane values using all of the rest of the columns.
forest_model %>%
fit(transmembrane ~ ., data=training(split_data)) -> forest_fit
Once the model is fit we can see it by running
forest_fit
We should see all of the variables for the model in place, and see some of the details of the data and the
fit (number of variables and cases etc).
Testing the model
To test the model we need to use it to make predictions about data where we know the answer, which is
what our testing data is for. We are going to use the predict function to make predictions on this data.
To make a prediction we need to pass in a new dataset containing the same variables as the training data.
forest_fit %>%
predict(testing(split_data))
This will give us something like:
# A tibble: 3,671 × 1
.pred_class
<fct>
1 Soluble
2 Soluble
3 Transmembrane
4 Soluble
5 Soluble
6 Soluble
7 Soluble
8 Soluble
9 Soluble
10 Soluble
The problem with this is that it only outputs the predictions; we don't see the rest of the data, including the column which says what the answer should have been, so we need to join those predictions back onto the testing data.
forest_fit %>%
predict(testing(split_data)) %>%
bind_cols(testing(split_data)) -> prediction_results
You can now click on the prediction_results in the environment window to see the predictions (in the .pred_class column) alongside the known correct answers (in the transmembrane column).
Evaluating the predictions
From the set of predictions we can now see how well the model actually did by comparing the predictions
to the known true values.
We can start by simply counting the number of times we see different combinations of predictions and
true values in the data.
prediction_results %>%
group_by(transmembrane, .pred_class) %>%
count()
From this you can see how many times a correct and an incorrect prediction was made, and the breakdown of the mistakes which were made.
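If you prefer, the conf_mat() function from yardstick (loaded as part of tidymodels) lays out the same counts as a conventional confusion matrix. This is just an alternative view of the same information, not something the exercise requires.

# Optional alternative: the same counts laid out as a confusion matrix
prediction_results %>%
  conf_mat(transmembrane, .pred_class)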
We can also get more specific values for sensitivity and specificity
prediction_results %>%
sens(transmembrane, .pred_class)
..and..
prediction_results %>%
spec(transmembrane, .pred_class)
Finally we can get an overall accuracy value, and we can also get Cohen's kappa value to say whether we're actually performing better than chance on the data.
prediction_results %>%
metrics(transmembrane, .pred_class)
What is your evaluation of how well the model has performed? Feel free to try playing with the setup
parameters for the model to see if you can improve on the initial performance. Remember though that
there is a random component, so just because a model works better once doesn’t mean that those
settings will always be better.
Exercise 4: Using Recipes and Workflows
We’re going to build another model from the same transmembrane data as before, but this time we’re
constructing a neural net.
Because neural networks have more constraints on the data which goes into the model we’re going to
have to do more pre-processing, and we’re going to have to apply this to both the training and testing
data (and we’d have to do it to any unknown proteins in future), so we’re going to automate this with a
recipe and we’re going to integrate this into a workflow to run it.
For the first part of the model where we:
1. Loaded the required packages
2. Loaded the data
3. Prepared the data
4. Split the data into training and testing
We can follow the same steps as before, or we can use the same split_data variable as for the
random forest model.
Building a Recipe
Firstly we’re going to build a recipe which will combine the formula for prediction and the training data.
Once we have it we can then add steps to it to complete the pre-processing.
recipe(
transmembrane ~ . ,
data=training(split_data)
) -> neural_recipe
We can then view the recipe with
neural_recipe
Now we have a recipe we can add processing steps to it. The steps will be:
1. Log transform the gene_length and transcript_length columns
2. Z-Score normalise all of the numeric columns
3. Turn all of the text columns into dummy number columns
neural_recipe %>%
step_log(gene_length, transcript_length) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors()) -> neural_recipe
Look at the recipe again to see the new steps have been added.
Building the model
We can now create the neural network model. We’re going to use a single hidden layer with 10 nodes
in it. You could play around with the settings made here once you had the basic model in place.
mlp(
epochs = 1000,
hidden_units = 10,
penalty = 0.01,
learn_rate = 0.01
) %>%
set_engine("brulee", validation = 0) %>%
set_mode("classification") -> nnet_model
The arguments here are as follows:
• epochs = how many rounds of refinement (back propagation) the model goes through
• hidden_units = how many nodes we want in the hidden layer.
• penalty = a value which penalises complexity in the model to try to prevent overfitting
• learn_rate = how much the estimates are moved to try to optimise the model
Again, these values could be modified after generating an initial model, but these will give us something
to work from.
We can see the model with
nnet_model %>% translate()
Building a workflow
A workflow will combine the recipe and the model together and will allow us to run everything at once.
workflow() %>%
add_recipe(neural_recipe) %>%
add_model(nnet_model) -> neural_workflow
We can view the workflow with
neural_workflow
Training the model via the workflow
To train the model we run the fit function and pass in our training data. This will preprocess the data
then feed it to the model.
fit(neural_workflow, data=training(split_data)) -> neural_fit
This will take a couple of minutes to complete. Once complete we can see the fitted model with
neural_fit
You should see that a load more parameters have now been set because the model and the pre-processing have been finalised.
Evaluating the Model
We can now use the model to make predictions on our testing data to see how well it is performing. As before, the predict function only returns the predictions, so we need to bind the results to the testing data itself so we can see the predictions alongside the known correct values.
predict(neural_fit, new_data=testing(split_data)) %>%
bind_cols(testing(split_data)) %>%
select(.pred_class, transmembrane) -> neural_predictions
You can look at the contents of the neural_predictions variable to get an idea of how well it did.
Now we can calculate some of the standard metrics from this. We can make up a simple confusion table.
neural_predictions %>%
group_by(.pred_class,transmembrane) %>%
count()
..or if we want to be fancier..
neural_predictions %>%
group_by(.pred_class,transmembrane) %>%
count() %>%
pivot_wider(
names_from=.pred_class,
values_from=n,
names_prefix = "predicted_"
) %>%
rename(true_transmembrane=transmembrane)
We can also calculate the specific metrics
neural_predictions %>%
metrics(transmembrane, .pred_class)
neural_predictions %>%
sens(transmembrane, .pred_class)
neural_predictions %>%
spec(transmembrane, .pred_class)
Additional Exercise: Tuning models
For a final more challenging exercise we are going to try to rerun the transmembrane data but this time
using a k-nearest neighbour model. As well as changing the model type we will also try to optimise the
number of nearest neighbours to use.
The preparation of the data will be the same as before initially, but then we will hit some changes.
For the model you are going to use a knn model, and let the number of neighbours be a tuneable parameter.
nearest_neighbor(neighbors = tune(), weight_func = "triangular") %>%
set_mode("classification") %>%
set_engine("kknn") -> model
For the data you need to build a 10 fold cross validation split of the full dataset, rather than a single 80%
split.
vfold_cv(
data,
v=10
) -> vdata
You can then build a workflow from the model using the same formula as before; a minimal sketch of this step is shown below.
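This sketch assumes your prepared data frame is called data and your model object is called model, as in the chunks above. The later chunks pipe from an object called workflow, so we save it under that name here; any other name would work equally well.

# Combine the tuneable knn model with the same prediction formula as before
workflow() %>%
  add_model(model) %>%
  add_formula(transmembrane ~ .) -> workflow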
Once you have the workflow you can look at the tuneable parameters.
workflow %>%
extract_parameter_set_dials()
…and from these we want to change the neighbors parameter to run from 1 to 50
workflow %>%
extract_parameter_set_dials() %>%
update(
neighbors = neighbors(c(1,50))
) -> tune_parameters
We’re then going to run the workflow generating a regular grid of 20 samples over the 1-50 range. We
are going to measure both the sensitivity and specificity of the model.
workflow %>%
tune_grid(
vdata,
grid = grid_regular(tune_parameters, levels=20),
metrics = metric_set(sens,spec)
) -> tune_results
Finally we can plot out the tuned results to see which value for k we think is best.
autoplot(tune_results)
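If you want to take this one step further (optional, and not part of the exercise as written), the tune package can also pull out the best performing number of neighbours and build a finalised workflow from it. This is a sketch assuming the object names used above; it selects on sensitivity alone, and you might prefer a different metric.

# Pick the number of neighbours with the best sensitivity across the folds
select_best(tune_results, metric = "sens") -> best_k

# Plug that value back into the workflow to give a finalised, ready-to-fit workflow
workflow %>%
  finalize_workflow(best_k) -> final_workflow

The final_workflow could then be fitted and evaluated on a training/testing split in the same way as the earlier models.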