
Exercise Guide

for
Certified Tester
AI Testing (CT-AI) Course
Version 2.0

International Software Testing Qualifications Board

© STA Consulting Inc. 2022


Contents
0 Introduction .................................................................................................................................... 5
1 Exercise 1 – ML Course Recommender........................................................................................... 6
1.1 Building our first classifier..................................................................................................... 11
1.2 Seeing the Model .................................................................................................................. 14
2 Exercise 2 – Identify a Suitable ML Approach ............................................................................... 17
2.1 Speech Recognition............................................................................................................... 17
2.2 Chatbot ................................................................................................................................. 17
2.3 Website Optimization ........................................................................................................... 17
2.4 Drug Response Modelling ..................................................................................................... 18
2.5 Customer Segmentation ....................................................................................................... 18
2.6 Credit Scoring ........................................................................................................................ 18
2.7 Predictive Maintenance ........................................................................................................ 18
2.8 Fraud Detection .................................................................................................................... 19
2.9 ML Approaches and Example Systems ................................................................................. 19
2.9.1 Supervised Classification ............................................................................................... 19
2.9.2 Supervised Regression .................................................................................................. 20
2.9.3 Unsupervised Clustering ............................................................................................... 20
2.9.4 Unsupervised Association ............................................................................................. 20
2.9.5 Reinforcement Learning................................................................................................ 20
2.10 Solutions – Exercise 2 - Identify a Suitable ML Approach ..................................................... 21
3 Exercise 3 – Data Preparation for ML ........................................................................................... 22
3.1 Filters and Feature Engineering ............................................................................................ 22
3.2 Data Preparation with a Larger Dataset ............................................................................... 25
3.3 Underfitting ........................................................................................................................... 27
3.4 Outliers and Extreme Values................................................................................................. 28
3.5 Overfitting ............................................................................................................................. 30
4 Exercise 4 – Creating Independent Training and Test Datasets ................................................... 36
5 Exercise 5 – Evaluation and Tuning and Testing ........................................................................... 42
5.1 Comparing Algorithms .......................................................................................................... 42
5.2 Setting Parameters................................................................................................................ 45
6 Exercise 6 – ML Functional Performance Metrics......................................................................... 50
6.1 Polio Detection...................................................................................................................... 50
6.2 Spam Detection ..................................................................................................................... 50

© STA Consulting Inc. 2021 2 CT-AI Exercise Guide


6.3 Solution – Polio Detection .................................................................................................... 50
6.4 Solution – Spam Detection ................................................................................................... 51
7 Exercise 7 – Confusion Matrix in Weka......................................................................................... 52
7.1 Solution - ML Functional Performance Metrics .................................................................... 52
8 Exercise 8 – Selecting ML Functional Performance Metrics ......................................................... 54
8.1 ML Systems - Selecting ML Functional Performance Metrics............................................... 54
8.2 Solutions - Selecting ML Functional Performance Metrics ................................................... 54
9 Exercise 9 – Build a Perceptron in Excel........................................................................................ 59
10 Exercise 10 – Selecting Objectives and Acceptance Criteria..................................................... 70
10.1 Example Systems and Characteristics ................................................................................... 70
10.2 Solution – Student Grading ................................................................................................... 70
10.3 Solution – Medical Image Analysis........................................................................................ 72
10.4 Solution – Financial Advisor .................................................................................................. 74
11 Exercise 11 – Explainability using ExpliClas............................................................................... 77
12 Exercise 12 – Selecting a Test Approach for an ML System ...................................................... 80
12.1 Example Situations ................................................................................................................ 80
12.2 Solutions – Example Situations and Initiating Scenarios ...................................................... 80
13 Exercise 13 – Pairwise Testing – Self-Driving Car...................................................................... 83
14 Exercise 14 – Metamorphic Relations....................................................................................... 91
14.1 Speech Recognition............................................................................................................... 91
14.2 Online Search ........................................................................................................................ 91
14.3 Solution - Speech Recognition .............................................................................................. 92
14.4 Solution - Online Search........................................................................................................ 93
15 Exercise 15 – Metamorphic Testing with Teachable Machine ................................................. 94
16 Exercise 16 – Exploratory Testing - TensorFlow Playground .................................................. 101
17 Exercise 17 – Selecting Test Techniques ................................................................................. 104
17.1 Mini-Scenarios - Selecting Test Techniques ........................................................................ 104
17.2 Solutions - Selecting Test Techniques ................................................................................. 105
18 Exercise 18 – Bug Prediction ................................................................................................... 110
18.1 Solution - Bug Prediction .................................................................................................... 111
19 Exercise 19 – Discussion - Which test activities are least likely to be replaced by AI? ........... 112
20 Appendix - Pre-Exercise Preparation ...................................................................................... 117
20.1 Introduction ........................................................................................................................ 117
20.2 Web Access ......................................................................................................................... 117



20.3 Downloading and Installing Weka ...................................................................................... 117
20.4 Install the Weka Training Data ............................................................................................ 117
20.5 MS Excel .............................................................................................................................. 117
20.6 Webcam .............................................................................................................................. 118
20.7 Pairwiser.............................................................................................................................. 118
21 Appendix - Weka Filters .......................................................................................................... 123



0 Introduction
This Exercise Guide describes both the exercises and solutions associated with the K3/K4
learning objectives (LOs) and the hands-on exercises (HOs) required for the CT-AI course.
The 19 exercises are presented in the order they are encountered in the course.
There are two appendices:
• The first appendix provides guidance on the necessary pre-course
preparation to support these exercises (e.g. application downloads and
copying of data files)
• The second appendix provides a list of Weka data filters that are used for the
exercises that use the Waikato Environment for Knowledge Analysis (Weka).
Weka is used to support six of the hands-on machine learning (ML) exercises on the course.
It contains a collection of visualization tools and algorithms for data analysis and predictive
modelling, together with graphical user interfaces for easy access to these functions. Weka
provides us with:
• The ability to build and test ‘industrial-strength’ ML models, without the need
for any programming skills.
• A comprehensive collection of data pre-processing and modelling techniques.
• Ease of use due to its graphical user interfaces.
• Portability, since it is fully implemented in the Java programming language
and thus runs on almost any modern computing platform.
• Free availability under the GNU General Public License.
Some of the remaining exercises use other applications, including:
• MS Excel
• ExpliClas
• Inductive’s Pairwiser Tool
• Google’s Teachable Machine
• TensorFlow Playground
And there are also some exercises that simply require the students to use their brains!



1 Exercise 1 – ML Course Recommender
With Weka installed (see the appendix in section 20 for guidance on downloading and
installing Weka), open it, and you will be presented with the GUI Chooser.

The GUI Chooser consists of five ‘Applications’ buttons—one for each of the five major
Weka applications—and four menus on the top. The buttons can be used to start the
following applications:
• Explorer An environment for exploring data with Weka (most of this guide deals
with this application in more detail).
• Experimenter An environment for performing experiments and conducting
statistical tests between learning schemes (we will also have a go at using this).
• KnowledgeFlow This environment supports essentially the same functions as the
Explorer but with a drag-and-drop interface. One advantage is that it supports
incremental learning.
• Workbench The workbench is an environment that combines all of the GUI
interfaces into a single interface. It can be useful when you need to frequently
jump between different interfaces.
• SimpleCLI Provides a simple command-line interface that allows direct execution of
Weka commands for operating systems that do not provide their own command
line interface.



On this course, we will be mainly interested in using the Explorer and Experimenter
applications. But, feel free to explore the others in your spare time.
From the GUI Chooser, select the Explorer application, and then ensure the Preprocess
window has opened (it is the default).

Use the top-left ‘Open file…’ button to open the Weka data file: course.arff
This data file includes 15 instances (an instance is a set of connected attribute values), each
made up of 5 attribute values. You can consider an instance to be an example of what
happened in the past. We will use this data about what happened in the past to build a
classifier that predicts what should happen in the future. Or, put another way, whether the
classifier would recommend someone should take the CT-AI course (or not) based on their
experience in testing and AI, whether they have management support, and whether they
have passed the ISTQB Foundation course. Obviously, when more instances (examples) are
available for training, we would expect the resulting classifier to be more accurate.



On the left in the ‘Attributes’ window, you can see the 5 attributes (test_experience,
AI_experience, management_support, foundation_passed and TAI_course) labelled 1 to 5.
Once we have built a classifier (a model), the model will take the values of the first four
attributes and use them to predict the value of the fifth attribute (whether or not we
should take the CT-AI course).
The content of the course.arff file is shown below. The .arff file is a plain text file and can be
edited using Notepad or any other text editor.
@relation course_choice

@attribute test_experience {low_test_experience,medium_test_experience,high_test_experience}
@attribute AI_experience {no_AI_experience,some_AI_experience,lots_of_AI_experience}
@attribute management_support {management_support,no_management_support}
@attribute foundation_passed {foundation,no_foundation}
@attribute TAI_course {TAI_course,no_TAI_course}

@data
low_test_experience,no_AI_experience,no_management_support,foundation,TAI_course
low_test_experience,some_AI_experience,management_support,no_foundation,no_TAI_course
low_test_experience,some_AI_experience,no_management_support,no_foundation,no_TAI_course
low_test_experience,lots_of_AI_experience,management_support,no_foundation,no_TAI_course
low_test_experience,lots_of_AI_experience,management_support,foundation,no_TAI_course
medium_test_experience,no_AI_experience,management_support,foundation,TAI_course
medium_test_experience,no_AI_experience,no_management_support,foundation,TAI_course
medium_test_experience,some_AI_experience,no_management_support,foundation,TAI_course
medium_test_experience,lots_of_AI_experience,no_management_support,no_foundation,no_TAI_course
medium_test_experience,lots_of_AI_experience,management_support,no_foundation,no_TAI_course
high_test_experience,no_AI_experience,no_management_support,foundation,TAI_course
high_test_experience,some_AI_experience,management_support,no_foundation,no_TAI_course
high_test_experience,some_AI_experience,no_management_support,foundation,TAI_course
high_test_experience,some_AI_experience,no_management_support,no_foundation,no_TAI_course
high_test_experience,lots_of_AI_experience,no_management_support,foundation,no_TAI_course

The 15 instances are listed under @data, with each instance showing values for the
attributes listed above (in the same order). Therefore, the first entry tells us that for this
example situation the candidate had low test experience, no AI experience, no management
support, did already have the Foundation certificate, and did take the TAI course.
To see or edit a dataset in Weka (as shown below), simply select the ‘Edit…’ button near the
top-right of the Preprocess window.

If we close the ‘Viewer’ and go back to looking at ‘Weka Explorer’, we can see that the
window in the bottom right shows the values for the test_experience attribute as a bar
chart (test_experience is shown as it is the first attribute). When we hover over each of the



three red and blue bars, then we can see which of the three attribute values (low-, medium-
or high-test experience) each bar represents. By chance, for this attribute the three
attribute values are equally represented (five each).

The red and blue colours on the bars show us the proportion of these values that contribute
to the decision about whether the CT-AI course is recommended or not. So, we can see that
for the five instances (examples) where test experience is low, only one (shaded blue)
resulted in the TAI course being taken.
We can see the same information for any of the attributes by selecting (left-clicking) the
attribute of interest in the ‘Attributes’ panel. Alternatively, we can click on the ‘Visualize All’
button (just above the red and blue bars) and see the information for all five attributes at
the same time. Note that the TAI_course attribute (the attribute that tells us the result –
i.e. whether we take the course, or not) shows that there are 6 examples that recommend
taking the course (in blue) and 9 examples that recommend not to (in red).



While you have the information for all five attributes open, imagine that you have to
manually build a model that recommends whether to take the course or not. One strategy
you might employ could be to look for attribute values that clearly favour one of the two
result options. For example, you can see that the seven instances of not having passed
Foundation only ever results in a recommendation not to take the course. You could
interpret this to tell you that a check for this value should result in a ‘do not take the course’
recommendation. Remember, however, that we are only looking at 15 examples/instances
and that there are 36 possible combinations (3x3x2x2) for our four input attributes, and
some of the combinations we do not have examples for may come up with a different
result. Also remember that these are only examples, and that new examples we add to our
dataset may not always agree with existing examples (examples gathered from different
candidates with the same input attribute values may well differ in their result attribute).
Models built from more examples will normally be more reliable than models built from
fewer examples.
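The manual inspection described above can be sketched in a few lines of Python (a hand-written illustration, not part of Weka; only the foundation_passed and TAI_course columns of the 15 instances are used):

```python
from collections import Counter

# (foundation_passed, TAI_course) pairs taken from the 15 instances above
pairs = [
    ("foundation", "TAI_course"), ("no_foundation", "no_TAI_course"),
    ("no_foundation", "no_TAI_course"), ("no_foundation", "no_TAI_course"),
    ("foundation", "no_TAI_course"), ("foundation", "TAI_course"),
    ("foundation", "TAI_course"), ("foundation", "TAI_course"),
    ("no_foundation", "no_TAI_course"), ("no_foundation", "no_TAI_course"),
    ("foundation", "TAI_course"), ("no_foundation", "no_TAI_course"),
    ("foundation", "TAI_course"), ("no_foundation", "no_TAI_course"),
    ("foundation", "no_TAI_course"),
]

# Count how each attribute value splits between the two outcomes
split = {}
for value, result in pairs:
    split.setdefault(value, Counter())[result] += 1

print(split["no_foundation"])  # Counter({'no_TAI_course': 7})
print(split["foundation"])     # Counter({'TAI_course': 6, 'no_TAI_course': 2})
```

The output confirms the observation in the text: all seven no_foundation instances recommend against taking the course, while the foundation instances split 6 to 2 in favour.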

1.1 Building our first classifier


We will get Weka to use an algorithm to build a classifier for us. Select the ‘Classify’ tab
near the top left of the ‘Weka Explorer’ window. At the top of the window, you can see that
the default classifier is ZeroR. This is the most basic algorithm for building a model, but it is
a good place to start, so leave it unchanged.



Now, left-click on the ‘Start’ button (half-way down the left-hand side). Very quickly
(because the algorithm is so simple) the results of both building and evaluating the model
will appear in the ‘Classifier output’ panel.



At this point, we are interested in the value for ‘Correctly Classified Instances’, which should
show 9 (60%). This gives an estimate of the accuracy of the ZeroR model Weka just built.
As a tester, you may be wondering how Weka measured the accuracy of the model it built
without any test data. We will cover this a bit later, but, for now, just accept that Weka did
some clever calculations on the accuracy using the 15 examples we provided.
So, how does the ZeroR algorithm work? Any ideas?
Well, ZeroR simply looks at the target variable (in our case, TAI_course) and selects which
recommendation occurs more often in the training dataset and then predicts that result
every time without even considering the input attributes. Remember that for our dataset of
15 examples there are 9 examples of where the candidate did not take the course, so that
means the ZeroR algorithm will generate a model that will always predict the result of ‘do
not take the course’. As we have used the same data to evaluate the model as we used to
train it, we know that 9 out of 15 times the model will guess correctly (and 9/15 = 0.6 =
60%). If you look near the top of the ‘Classifier output’ window, then below the ‘Attributes’
and ‘Test mode’ you can see the single ‘rule’ used by the ZeroR model we generated:
“ZeroR predicts class value: no_TAI_course”
So, ZeroR is a rule-based algorithm that always predicts the most frequent target class in the
training data. You can see an example (in pseudo-code) of the ZeroR algorithm for a classifier
with two outcomes (TRUE and FALSE).



The algorithm simply counts the number of instances and if there are more TRUE outcomes
than FALSE outcomes it creates a model that always outputs TRUE, otherwise it creates a
model that always outputs FALSE (if they are equal it arbitrarily chooses TRUE as it has to
choose one of the two options to output).
The ML algorithm generates an ML model. For the ZeroR algorithm applied to a classifier
with two outcomes, only two models are possible (one where the majority of training
examples were TRUE and one where the majority were FALSE).

If more TRUE predictions in training set:


program ZeroR_Model
begin
output (“TRUE”)
end ZeroR_Model

If more FALSE predictions in training set:


program ZeroR_Model
begin
output (“FALSE”)
end ZeroR_Model

It is important to remember that the model code above is NOT written by a person but is
generated by the algorithm. This is the point of ‘machine learning’ – the machine (algorithm
on a computer) builds a model by learning from the training data.
At this point you may be questioning the value of an algorithm that builds such simple
models (and, perhaps, the validity of using the same data to both train and evaluate it, but
we will cover that later). However, ZeroR can be useful as a benchmark to compare its
results with those from more sophisticated algorithms.
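The ZeroR behaviour described above can also be sketched directly in Python (an illustrative hand-written version, not Weka's actual implementation):

```python
from collections import Counter

def zero_r_train(labels):
    """Build a ZeroR 'model': a function that always predicts the most
    common label in the training data, ignoring all input attributes
    (ties are broken arbitrarily in this sketch)."""
    majority_label, _ = Counter(labels).most_common(1)[0]
    return lambda _instance=None: majority_label

# The TAI_course column of the 15-instance course.arff dataset:
labels = ["TAI_course"] * 6 + ["no_TAI_course"] * 9
model = zero_r_train(labels)

print(model())  # no_TAI_course - the majority class
accuracy = sum(model() == y for y in labels) / len(labels)
print(f"{accuracy:.0%}")  # 60% - evaluating on the training data itself
```

Note that, exactly as described in the text, the returned model never looks at its input: it outputs the majority class every time, and evaluating it on its own training data yields 9/15 = 60%.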
Actually, building models from such a small dataset is quite challenging, but we can build a
deep neural net that will outperform the ZeroR algorithm (with this dataset). To do this,
left-click on the ‘Choose’ button in the ‘Classifier’ panel towards the top-left of the screen.
From the presented list, select ‘functions’ and then ‘MultilayerPerceptron’. As before, left-
click on the ‘Start’ button and check the results of building and evaluating a multi-layer
perceptron (a type of feedforward neural network). This time the accuracy should show as
about 93.3%.

1.2 Seeing the Model


If you left-click on the ‘MultilayerPerceptron’ text in the ‘Classifier’ panel (where we chose
the classifier – see the yellow arrow in the following screenshot), then you are presented
with a ‘weka.gui.GenericObjectEditor’ that allows you to see and alter the algorithm



parameters (these are often known as algorithm hyperparameters). The first parameter
shows the default of GUI set to ‘False’. Change this to ‘True’, click ‘OK’ (at the bottom of the
editor) and then click on ‘Start’ again.

This will bring up a graphical interface allowing you to see what the neural network looks
like. You may need to stretch the window horizontally to more easily see all the text. As
you can see, there are 8 inputs, 5 hidden neurons, and 2 outputs (so, not a very complex
neural network).
By default, the number of hidden neurons is set to the average of the number of input and
output neurons ((8+2)/2=5). You can add new neurons to layers, add new layers of neurons,
and change the learning rate and momentum parameters (but leave these as they are for now).



We can use this graphical interface to control the algorithm by clicking the ‘Start’ and
‘Accept’ buttons below the model.
As you are using the GUI, you will need to click the ‘Start’ and ‘Accept’ buttons 11 times each.
The first ten times will build a model using nine-tenths of the data and test it with the
remaining one-tenth (ten times so that every piece of data is used to test the models). The
ten sets of test results are then averaged to create the performance metrics. The eleventh
time will create the final model using all of the data. This is known as 10-fold cross-
validation, which is the default. You can see that it is already selected in the ‘Test options’
panel. If you clicked correctly, the results should be the same as without the GUI.
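The 10-fold cross-validation procedure described above can be outlined as follows (a simplified sketch assuming a generic train/evaluate interface, not Weka's actual API):

```python
import random

def ten_fold_cross_validation(data, train, evaluate, folds=10):
    """Split the shuffled data into 10 equal folds, train on 9 and test
    on the held-out fold, rotating so every instance is tested exactly
    once, then average the 10 scores. (Instances beyond a multiple of
    10 are ignored in this simplified sketch.)"""
    data = data[:]                       # copy so the caller's list is untouched
    random.shuffle(data)
    fold_size = len(data) // folds
    scores = []
    for i in range(folds):
        test = data[i * fold_size:(i + 1) * fold_size]
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        model = train(training)          # one 'Start'/'Accept' click
        scores.append(evaluate(model, test))
    final_model = train(data)            # the 11th run: all of the data
    return sum(scores) / len(scores), final_model

# Illustration with a ZeroR-style majority-class learner:
labels = ["no_TAI_course"] * 12 + ["TAI_course"] * 8
train = lambda d: max(set(d), key=d.count)                    # majority label
evaluate = lambda m, test: sum(x == m for x in test) / len(test)
avg_accuracy, final_model = ten_fold_cross_validation(labels, train, evaluate)
print(final_model)  # no_TAI_course - majority class of the full dataset
```

This mirrors the eleven clicks in the GUI: ten train/test rounds whose results are averaged into the reported performance metrics, plus a final training run on the whole dataset to produce the model that is actually kept.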



2 Exercise 2 – Identify a Suitable ML Approach
For each of the following project scenarios, identify an appropriate ML approach (from
classification, regression, clustering, association, or reinforcement learning). Suggested
solutions are provided in section 2.10.

2.1 Speech Recognition


You have been tasked with implementing a speech recognition system. The system is to be
used as the basis of various applications that include voice user interfaces such as voice
dialling (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), appliance
control, search key words (e.g. finding a podcast where particular words were spoken),
simple data entry (e.g. entering a credit card number), preparation of structured documents
(e.g. a radiology report), determining speaker characteristics, speech-to-text processing (e.g.
word processors or emails), and aircraft (e.g. direct voice input). Individuals are used to
train the system by reading text or isolated vocabulary into the system. The system analyses
the person's specific voice and uses it to fine-tune the recognition of that person's speech,
resulting in increased accuracy.

2.2 Chatbot
You have been tasked with implementing a chatbot that attempts to solve a specific
problem for a user. These chatbots can help people with tasks such as to book a ticket or
find a reservation and are often referred to as goal-oriented chatbots. You have been
provided with a natural language understanding (NLU) component and a natural language
generator (NLG) component, but you don’t have access to a set of potential chatbot
conversations, and instead have decided to train your chatbot by creating trial-and-error
conversations between two prototype chatbots. You will determine the success of the
resultant chatbot by measuring conversation attributes such as coherence, informativity,
and ease of answering in the chatbot dialogues.

2.3 Website Optimization


You have been tasked with implementing a website optimization system that provides the
owner of a website with guidance on which pages on the website should be directly linked
by hyperlinks. Ideally, the system will be automated and implement the linking of pages
dynamically, based on the customer interaction. The input to the system will be the web
access log, which records the pages visited by users and in which order. For instance,
analysis of the log may show that if a user has visited pages X and Y, then there is a 75%
probability that they will visit page Z in the same visit. This information can be used to
decide to create a direct link to page Z from pages X and Y so that the user can "click-
through" to page Z directly. This kind of information is particularly valuable for a web server
supporting an e-commerce site, allowing the owner to link different product pages
dynamically, based on customer interactions.
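The 75% figure in the scenario is a conditional probability that could be computed directly from such an access log; below is a toy sketch with invented page-visit sessions (the names X, Y, and Z mirror the example above):

```python
def click_through_probability(sessions, antecedent, consequent):
    """P(consequent visited | all antecedent pages visited): the fraction
    of sessions containing every antecedent page that also contain the
    consequent page."""
    containing = [s for s in sessions if antecedent <= s]
    if not containing:
        return 0.0
    return sum(consequent in s for s in containing) / len(containing)

# Hypothetical web-access log: each set is one user's visited pages
sessions = [
    {"X", "Y", "Z"}, {"X", "Y", "Z"}, {"X", "Y", "Z"},
    {"X", "Y"},               # visited X and Y but not Z
    {"X"}, {"Y", "Z"},        # sessions not matching the antecedent
]
print(click_through_probability(sessions, {"X", "Y"}, "Z"))  # 0.75
```

Of the four sessions that visited both X and Y, three went on to visit Z, giving the 75% probability quoted in the scenario.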



2.4 Drug Response Modelling
You have been tasked with implementing a drug response modelling system to assist with
the precision treatment of cancer. The main goal of precision medicine is to provide
therapies that not only increase the survival chances of patients but also improve their
quality of life by reducing unwanted side effects. This is achieved by matching patients with
appropriate therapies or therapeutic combinations. Pharmacogenomic data is used as input
to the system and includes both patient genetic data (data, such as cell lines, patient-
derived xenografts, etc.) , patient responses to treatments, and various drug treatments
that can be used in isolation, but, more often, in combination. The system will measure and
visualize the synergy achieved from various drug combinations and provide several drug
synergy scores, such as Bliss independence, Loewe additivity, highest single agent (HSA),
and zero interaction potency (ZIP) for a patient.

2.5 Customer Segmentation


You have been tasked with implementing a customer segmentation system that will divide
the customers of your organization’s e-commerce website into several groups based on
the customer ID, address, gender, age, income, expense score, credit card score, survey
responses, number of items purchased and their spending score (based on spending
behaviour). The aim is to group the customers that share commonalities to better
understand the groups and so make strategic decisions regarding product growth and
marketing. It is likely this will result in targeted marketing to these groups with promotional
plans. Various groupings should be provided based on each of demographic, behavioural,
geographic, and psychological customer features. The intent is to optimize the
organization’s future interactions with customers in terms of product design, promotions,
and marketing.

2.6 Credit Scoring


You have been tasked with implementing a credit scoring system for a green energy start-up
based in Central America. The start-up provides water pumps, sensors and activators,
powered by either solar power or wind turbines, to small farmers. These small farmers are
usually neglected by the traditional banking sector and have few financing options, so the
start-up provides the water pumps, sensors, activators, solar generators and wind
turbines on credit. The credit scoring system is used to determine the likelihood that the
farmers will pay back the costs of the provided goods. The system will use a combination of
data sources provided by the kit sold to the farmers, such as soil sensor data and irrigation
pump usage data, along with historical loan default data, to predict whether the farmer is
expected to default or not on their loan.

2.7 Predictive Maintenance


You have been tasked with implementing a predictive maintenance system that determines
the condition of equipment and predicts when maintenance should be performed. The
system should lead to major cost savings, higher predictability, and the increased availability
of the systems. The predictive maintenance system will determine when machine
conditions have reached a state requiring repair or even replacement, so that maintenance
can be performed exactly when and how it is most effective. The system will be fed data from the
production floor sensors, production line computers, historical maintenance, usage, and
performance data. Since the operational lifespan of production machines is usually several
years, historical data should reach back far enough to properly reflect the machines’
deterioration processes. The predictive maintenance system will provide two separate
outputs. First, it will predict remaining useful lifetime (RUL) based on looking at the average
predicted time to failure for all the potential failure modes. Second, it will predict likelihood
of failure (LOF), which is whether a failure should be expected in a specified future period.
This prediction enables the maintenance crew to watch for symptoms and plan
maintenance schedules.

2.8 Fraud Detection


You have been tasked with implementing a fraud detection system that can identify
artworks that are falsely attributed to a famous artist. Determining a painting’s authenticity
can be extremely challenging. Typically, art experts reach decisions after thorough
consideration of many different types of evidence. This includes technical analysis of the
pigments and other materials used in the artwork, and of the method of their preparation.
The artist’s creative process, as seen in the underlayers of the painting (observed through
X-ray and infrared imaging), and the visual appearance and style of the work are compared
against those of the artist’s other works. Correspondence from the artist’s
lifetime and documents tracing the painting’s history of ownership also provide clues as to
an artwork’s provenance.

2.9 ML Approaches and Example Systems


The following are further example systems categorised under the most likely ML approach:

2.9.1 Supervised Classification


• Hand-writing recognition
• Car make identification from webcam video footage
• Spam email filtering
• Individual financial risk detection
• Sports event winner prediction
• Computer network attack type identification
• Machinery health determination from sensor inputs
• Art movement and style recognition from mobile camera


2.9.2 Supervised Regression
• House price prediction
• Click-through rate prediction
• Electricity load forecasting
• Algorithmic financial trading
• Stock market price predictions
• Financial forecasting

2.9.3 Unsupervised Clustering


• Grouping similar documents
• Flagging outliers in a dataset
• Object recognition/detection
• Image segmentation
• Gene sequence analysis
• Market research
• Separating photos of different people
• Identifying new astronomical bodies
• Production line quality control
• Categorization of artworks

2.9.4 Unsupervised Association


• Shopping basket analysis
• System components pattern identification

2.9.5 Reinforcement Learning


• Optimized marketing
• Chess playing
• Playing go (e.g. AlphaGo Zero)
• Robotic hands
• Elevator scheduling
• Self-driving
o trajectory optimization
o motion planning
o dynamic pathing
o controller optimization
o parking using parking policies
o lane changing
o overtaking
• Financial trading

2.10 Solutions – Exercise 2 - Identify a Suitable ML Approach


• Speech Recognition - Supervised Classification
• Chatbot - Reinforcement learning. (Note that chatbots can also use
supervised learning that maps user dialogue to responses, but this scenario
explicitly stated that you do not have access to ideal responses.)
• Website Optimization - Unsupervised Association
• Drug Response Modelling - Supervised Regression
• Customer Segmentation - Unsupervised Clustering
• Credit Scoring - Supervised Classification
• Predictive Maintenance
o RUL - Supervised Regression
o LOF - Supervised Classification
• Fraud Detection - Supervised Classification


3 Exercise 3 – Data Preparation for ML
Re-open the ‘Preprocess’ window in Weka by clicking on the tab at the top left.
The course datafile you used previously should still be loaded in Weka. However, in the
event that it is no longer loaded, use the top-left ‘Open file…’ button to re-open the Weka
data file: course.arff.

3.1 Filters and Feature Engineering


A major use of the ‘Preprocess’ window in Weka is to explore, and, where appropriate, filter
the data before it is used by the algorithm to generate a model.
Weka provides many different filters, but we will start by doing something quite basic – that
is, removing one of the attributes from the dataset (e.g. as part of ‘Feature Engineering’).
To do this click on the ‘Choose’ button in the Filter panel near the top of the ‘Preprocess’
window.
We want to remove one of the attributes used to determine the predicted class, and so we
choose weka → filters → unsupervised → attribute → Remove.

Left-click on the name of the selected filter in the Filter panel to open a GenericObjectEditor,
which shows a short description of the filter and allows us to modify the filter properties.
We want to remove the ‘test_experience’ attribute, which we can see is numbered as ‘1’ in
the ‘Attributes’ panel.


Set ‘attributeIndices’ to 1 and select OK. The filter should now say ‘Remove -R 1’. To the
right of this is an ‘Apply’ button (you may only be able to see the first couple of letters).

Left-click this button and you can see that ‘test_experience’ has now disappeared from the
list of attributes (there are now only four attributes).


Note that this filter, and the other Weka filters we will use, are listed for reference, with
short descriptions, in the ‘Filter Appendix’ in section 19.
Let’s now generate a new Multi-layered Perceptron:

1. Select the ‘Classify’ tab near the top left of the ‘Weka Explorer’ window.

2. Check that the Multilayered Perceptron classifier is selected (select it, if necessary –
it is listed under functions in the classifiers).

3. Left-click on the ‘Multilayered Perceptron’ in the ‘Classifier’ panel to open the editor
that allows the algorithm parameters to be changed.

4. Set GUI to False and click ‘OK’ in the editor.

5. Click ‘Start’ to generate a new model.


Notice that by removing one of the four attributes used for prediction the accuracy has now
increased to 100%. This result may not be intuitive, but sometimes removing attributes (also
known as features, hence the term ‘feature engineering’) can improve the accuracy of the
generated models.
Return to the ‘Preprocess’ window and left-click on the ‘Undo’ button near the top and the
removed attribute will re-appear. Actually, if all we want to do is to remove an attribute
from the dataset, we could simply select the attribute in the ‘Attributes’ panel (left-click the
box to see a tick in it) and then left-click on the big ‘Remove’ button beneath the list of
attributes.

3.2 Data Preparation with a Larger Dataset


Load the glass dataset using the ‘Open File’ button in the ‘Preprocess’ tab, as you did before
for the course dataset. The glass dataset is a larger dataset (214 instances) that is used to
build models to predict the type of glass (e.g. tableware or car headlamps) from the
chemical make-up of the glass. Such a model could be used where glass fragments have
been found at the scene of a crime and we wanted to know where they came from.
Use Visualize All to get a feel for the complexity of the dataset. Note that some glass is
manufactured by floating it on top of molten metal (hence the ‘build_win_float’ value for
building windows created using this molten-metal ‘float’ process).


Now, generate models and calculate and compare their accuracy using the ZeroR,
MultilayerPerceptron and J48 algorithms (in the ‘Classify’ window). These three different
algorithms can be found under the rules, functions and trees types of classifiers in the Weka
classifier chooser.
When you use the J48 algorithm, right-click on the relevant entry in the ‘Result list’ panel (in
the bottom left of the Classifier window).


Select ‘Visualize tree’ (two-thirds of the way down the menu that appears). This will provide
a graphical representation of the generated model (unsurprisingly, it’s a tree).

There is no similar opportunity to visualize the ZeroR model (it’s too simple), and you have
already seen a visual representation of the multilayer perceptron. You can always see the
underlying model information for any algorithm in the ‘Classifier output’ panel, ahead of the
accuracy, and other performance measures.

3.3 Underfitting
Underfitting is generally caused when we don’t provide sufficient, useful data for the
training. Often it can be difficult to know which attributes contribute to the accuracy of the
algorithm, and which don’t.
Try removing different attributes from the glass dataset to see the effect this has on the
accuracy of the model produced by the J48 algorithm. Remember that:
• each time you remove attributes you can go back and ‘Undo’ the removal
(otherwise you may have to re-load the glass dataset)
• the easiest way to select a few attributes is to first select ‘All’, then click on
those you want to use (so that the ‘tick’ disappears), and then click on
‘Remove’
• when you remove attributes, you always need to leave the final ‘Type’
attribute in the dataset, as that is the predicted class
Can you find any sets of three or more attributes that achieve an accuracy of less than 50%?
Alternatively, can you find sets of two or three attributes that improve on the accuracy of
the model produced using all 9 attributes?
These are the accuracies achieved by the J48 algorithm with some example subsets of the
attributes:
• All 9 attributes → 66.8%
• Si, Fe → 35.5%
• Na, Si, Fe → 48.6%
• RI, Mg → 68.7%
• RI, Mg, Al, K, Ba → 71%
• RI, Na, Mg, Ca, Ba → 73.8%
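Subsets like these are usually found by trial and error, but with only 9 candidate attributes an exhaustive search is also feasible: there are just 2^9 - 1 = 511 non-empty subsets to try. A sketch of the enumeration (actually evaluating each subset, e.g. by re-running J48 on it, is left to Weka):

```python
from itertools import combinations

attributes = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]

# Enumerate every non-empty subset of the nine predictive attributes.
subsets = [subset
           for size in range(1, len(attributes) + 1)
           for subset in combinations(attributes, size)]

print(len(subsets))  # 511 candidate subsets to evaluate
```

For datasets with many more attributes this brute-force approach quickly becomes impractical, which is why heuristic feature-selection methods exist.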

3.4 Outliers and Extreme Values


When a feature is provided numerically, the ‘InterquartileRange’ filter
(wekafiltersunsupervisedattributeInterquartileRange) can be used to identify
outliers and extreme values. Statistical interquartile ranges are used to define what
constitutes an outlier or an extreme value.
Apply the ‘InterquartileRange’ filter to the full glass dataset. You will see that two new
attributes are added to the dataset.


View the dataset using the ‘Edit…’ button near the top-right of the ‘Preprocess’ window (the
new attributes are on the right).

You will see that some of the instances are now flagged as containing one or more attribute
values that are either outliers and/or extreme values. By default, in Weka outliers are
outside three times the interquartile range, but within six times the interquartile range,
whereas extreme values are even further out from the median value. Note that an instance
can be flagged as having both an outlier and an extreme value as there are 9 input
attributes, each of which can cause the whole instance to be flagged.
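The thresholds described above can be mirrored in a short sketch (the multipliers 3 and 6 match the Weka defaults just described; note that Weka’s quartile estimates may differ slightly from Python’s):

```python
import statistics

def iqr_flags(values, x, outlier_mult=3.0, extreme_mult=6.0):
    """Flag x as an outlier or extreme value relative to `values`,
    using interquartile-range multipliers as described above."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    is_extreme = x < q1 - extreme_mult * iqr or x > q3 + extreme_mult * iqr
    is_outlier = not is_extreme and (
        x < q1 - outlier_mult * iqr or x > q3 + outlier_mult * iqr)
    return is_outlier, is_extreme

data = list(range(1, 101))   # Q1 = 25.25, Q3 = 75.75, IQR = 50.5
print(iqr_flags(data, 50))   # (False, False) - well inside the range
print(iqr_flags(data, 300))  # (True, False)  - outlier
print(iqr_flags(data, 400))  # (False, True)  - extreme value
```

In the glass dataset this check is applied per attribute, and any flagged attribute value causes the whole instance to be flagged.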
To remove the instances with outliers and extreme values, use the ‘RemoveWithValues’
filter twice; once to remove outliers and once to remove extreme values.
For instance, to remove outliers, first choose the ‘RemoveWithValues’ filter
(wekafiltersunsupervisedinstanceRemoveWithValues). Next, click on the filter
name and, in the editor, set the ‘attributeIndex’ parameter to point at the relevant attribute
(‘11’ for ‘Outlier’) and set the ‘nominalIndices’ parameter to ‘last’ to remove those instances
where the ‘Outlier’ attribute is set to ‘yes’ (‘no’ appears as the ‘first’ value as there will be
more instances set to ‘no’).


Then apply the filter, and you see that the number of instances changes from the original
214 to 198. Repeat to remove those instances flagged as containing extreme values
(remembering that ‘ExtremeValue’ is attribute 12).
Once the outliers and extreme values are removed, you should be left with 163 instances of
the original 214.
Before using this new ‘cleaned’ dataset for training, remove the ‘Outlier’ and
‘ExtremeValue’ attributes (simply select them and click on the ‘Remove’ button) – as we
don’t want to use these attributes for training.
Use the J48 classifier to see if the accuracy has improved. The original accuracy using all
attributes of 66.8% should improve to 69.9%, despite there now being far fewer instances
(163 instead of 214).

3.5 Overfitting
We will now look at a different rule-based algorithm from ZeroR, the OneR algorithm. To
understand how it works we will first apply it to the course dataset. The OneR algorithm
creates rules based on the single (non-target) attribute that it determines is most likely to
give us the correct prediction (where several attributes provide equal levels of accuracy, one
is chosen at random).
Load the course dataset and use ‘Visualize All’ to see the attributes.


Any idea which is the most ‘predictive’ attribute?
• Let’s start with the ‘foundation_passed’ attribute. If we create a rule for the
‘no_foundation’ value (on the right, in red to show that the course is not
taken), then the recommendation should be not to take the course and this
would be correct for all seven instances. However, when we create a second
rule for when the attribute is set to ‘foundation’, then for two of the eight
instances it will be wrong when it recommends taking the course. This
means that rules based solely on this attribute will be wrong in only two out
of fifteen cases (86.7% correct).
• We get a similar result if we choose the ‘AI_experience’ attribute. A rule for
when the attribute is set to ‘no_AI_experience’ (recommending taking the
course) would be correct for all four instances. A second rule for when the
attribute is set to ‘lots_of_AI_experience’ would be correct for all five
instances (recommending not to take the course). The third rule for when
the attribute is set to ‘some_AI_experience’ can, at best, be right for four of
the six instances. Thus, overall, if we created a set of rules based on the
‘AI_experience’ attribute, the rules would be correct for 13 of the 15
instances (86.7%, again).
• Using the ‘management_support’ attribute provides a set of rules that are
right ten out of fifteen times (66.7%).
• Using the ‘test_experience’ attribute also provides a set of rules that are right
ten out of fifteen times (66.7%).


The OneR algorithm performs this processing for each attribute and then selects the
attribute that has the best chance of guessing right. In this example, we have two equally
usable attributes, ‘foundation_passed’ and ‘AI_experience’. Let’s check to see which
attribute the Weka algorithm chooses by running the OneR algorithm.
Before creating the model, set the testing to ‘Use training set’ in the ‘Test options’ window
(as we want to see the results based on all training examples, the same as we just used).
Use ‘Start’ to create and test the model and when you look in the ‘Classifier output’
window, you will see that it has selected the following rules:
• ‘no_AI_experience’ → TAI_course
• ‘some_AI_experience’ → no_TAI_course
• ‘lots_of_AI_experience’ → no_TAI_course
And you get (as you would expect) an accuracy of 86.6667% (13/15*100).
The OneR algorithm also works with numeric attributes, such as those used in the glass
dataset. Can you guess from visualizing the attributes, which one the algorithm will select?
Apply the OneR algorithm to the glass dataset. Don’t worry if you didn’t guess aluminium –
that’s why we use tools for machine learning. In the ‘Classifier output’ panel, you can also see that
it generates 8 rules based on the level of aluminium content – and achieves an accuracy of
63% using the training set and 57.9% using cross validation.


The reason for the 8 rules lies with the ‘minBucketSize’ parameter, which is set to 6 by
default. This parameter limits the number of rules associated with a single attribute, based
on the minimum number of instances (examples) that must fit each rule. We will now remove
this constraint by setting the ‘minBucketSize’ parameter to zero, and so allow the OneR
algorithm to create as many rules as it wants (but, of course, still all associated with a single
attribute).
Left-click on the OneR name in the Classifier panel to open the GenericObjectEditor
and set the minBucketSize to zero. Click OK to accept the change.


Now re-run the OneR classifier.
The resultant model now uses the refractive index (RI) and generates over 100 rules (quite
impressive given there are only 214 examples).
These can be seen towards the top of the ‘Classifier output’ window.


However, the accuracy has now dropped to 47.7% using cross validation.
This 100+ rule algorithm is an example of overfitting. Every different piece of information
on refractive index in an example has been used to create a separate rule. This model will
work brilliantly if used on the examples it was trained from, but if we were to use it in the
‘real world’, it would not generalize at all well (hence the 47.7% accuracy when using cross
validation, but an accuracy of 93% if we test it using the training set).


4 Exercise 4 – Creating Independent Training and Test Datasets
In this exercise, we will create separate training and test datasets from the glass dataset. In
this way, we will train using one set of examples, and test using a completely separate set.
In the ‘Preprocess’ window, open the glass dataset of 214 instances. We will use a filter to
first randomize the instances in the dataset and then split the dataset into two datasets.
We randomize first in case the data was not collected and stored randomly in the dataset
(data is often stored in the same order it is collected and this can often be non-random, or it
is sorted by expected result).
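The randomize-then-split procedure we are about to perform with two Weka filters can be sketched as follows (illustrative only; Weka does this for us, and its seeded shuffling will produce a different ordering than Python’s):

```python
import random

def randomize_and_split(instances, test_fraction=0.2, seed=1):
    """Shuffle with a fixed seed (so the split is repeatable), then
    hold out test_fraction of the instances as the test dataset."""
    rng = random.Random(seed)
    shuffled = list(instances)   # work on a copy; keep the original order
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

train, test = randomize_and_split(range(214))    # 214 glass instances
print(len(train), len(test))                     # 171 43
```

The counts match what we will see in Weka: 43 instances (20%) for testing and 171 (80%) for training.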
First, select and apply the ‘Randomize’ filter
(wekafiltersunsupervisedinstanceRandomize). Note that the number of instances
does not change, but they are now in a random order.
Then, select the RemovePercentage filter (weka → filters → unsupervised → instance →
RemovePercentage). First, we will select 20% of the dataset to be the testing dataset. Left-
click on the RemovePercentage filter name and we are presented with the
GenericObjectEditor that allows us to change the filter parameters. To select 20% we must
set the percentage parameter for the filter to be 80 to remove 80% of the dataset (leave the
other parameters at their default values) and select ‘OK’ and then left-click on the ‘Apply’
button.

You should be left with a dataset of 43 instances. Use the ‘Save…’ button at the top-right of
the window and save this test dataset with a suitable filename (with ‘test’ in it).


Next, ‘Undo’ the filter (using the ‘Undo’ button at the top of the window) and you should be
back to 214 instances. Open the ‘RemovePercentage’ filter properties by left-clicking on the
filter name and change the ‘invertSelection’ parameter to ‘True’, select ‘OK’ and then left-
click on the ‘Apply’ button.

The ‘invertSelection’ parameter will ensure that the other 80% of the original glass dataset
is selected - you should be left with a dataset of 171 instances. Use the ‘Save…’ button at
the top-right of the window and save this training dataset with a suitable filename (with
‘train’ in it).
We will now use the two datasets to train and test a new model using the J48 tree classifier.
Open the ‘Classify’ window and, in the ‘Test options’ panel, select the ‘Supplied test set’
option and select your saved test set. Make sure that the target ‘Class’ is set to ‘(Nom)
Type’ (this is the default, so you should not have to do anything).


Choose the J48 (tree) algorithm and run it using the ‘Start’ button. Make a note of the
accuracy, which should be 69.8%.
Actually, Weka provides an easier way of doing this. Instead of creating two separate
datasets, we can use the original glass dataset and, before we run the J48 algorithm, we can
select the ‘Percentage split’ option in the ‘Test options’ panel – and set the split to be 80%.
Try this with the original glass dataset and you will see that you now get a different value for
the accuracy – 60.5%. Make sure you were using the original dataset to get this value,
because if you use the randomized set of the same 214 instances you will get 76.7%.
Finally, we will use the (often preferred) approach of applying cross validation with 10 folds
to evaluate the accuracy of the J48 algorithm on the glass dataset. The resultant accuracy
should be 66.8%. This value should be more accurate than our previous measures.
As described earlier, when using 10-fold cross-validation, Weka breaks the dataset into 10
distinct parts. (By default Weka uses stratified cross-validation, which means that when it
splits the dataset into 10 parts, it tries to create 10 parts, each with a similar cross-section of
attribute values.) It then creates ten J48 trees, each from a different 90% of the dataset,
and after each tree is created, it uses the remaining 10% of the dataset to evaluate its
accuracy. Having done this ten times, Weka calculates the overall accuracy as the average
of the ten accuracy measures and then it creates a final model using the full dataset (all ten
parts).
We can actually see the result of the 10-fold cross-validation in Weka. Select the ‘More
options…’ button in the ‘Test options’ panel and then tick the ‘Output models for training
splits’ option (the second option) and select ‘OK’. Now press ‘Start’ again. If you now look
in the ‘Classifier output’ panel, you can see that it shows the 11 separate (and slightly
different) J48 pruned trees that were created (the eleventh tree is the final model based on
all instances in the dataset). It may be best to switch off this option now, to keep the output
more manageable.


If we ask Weka to use cross-validation or use a percentage split approach, Weka will always
finally generate a model based on all the data we give it, even if beforehand it evaluates the
accuracy using a subset (or subsets) of the dataset. However, for many ML algorithms the
order of the instances in the dataset will affect the output model produced by the
algorithms, so don’t expect the same model to be produced even if the same dataset is used
by the same algorithm and with the same parameters set. In fact, you can see this quite
clearly by looking at the ‘Classifier output’ for each of your results and noticing that the
structure of the tree produced by the J48 algorithm differs (e.g. between using the
randomized and non-randomized datasets).
As we have seen, the evaluation results are also affected by the approach we use and which
instances are used in the training dataset and which instances are used in the test dataset.
When we use either a cross-validation or a percentage split approach, Weka randomly
selects which instances are put into the folds (for cross validation) or into the training and
test datasets (for percentage split). The reason we can include actual accuracy values in this
Exercise Guide (even with this randomization) is that Weka supports repeatability by using a
pseudo-random approach using seeds. So, if we don’t change the seed value, the results
will stay exactly the same.
As a tester (or data scientist), it would be nice to have an idea of how sensitive the
algorithm we are testing/developing is to this random splitting of the datasets (into folds or
training and test datasets). We can actually see this quite quickly and easily. Run the J48
algorithm using the default 10-fold cross-validation and note the accuracy (66.8%). Now,
click on the ‘More options…’ button in the ‘Test options’ panel. Near the bottom, you will
see ‘Random seed for XVal / % Split’ is set to 1.


Change this to 2, select OK and then run the algorithm again, making a note of the accuracy
(73.4%). Continue until you have tried 10 different seeds. My calculator comes up with a
mean of 67.6% and a standard deviation of about 2.4% for the 10 test runs, which is going to
be more statistically valid than the result from a single test run. We can also get statistically
valid results more easily if we use the Weka Experimenter, which we will try in the next
exercise.
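The summary statistics themselves are trivial to compute; here is a sketch using made-up accuracies for the ten seeds (placeholders to show the calculation, not the values you will actually observe in Weka):

```python
import statistics

# Placeholder accuracies (%) for seeds 1..10 - illustrative only.
accuracies = [66.8, 73.4, 65.0, 68.2, 70.1, 64.9, 67.3, 69.0, 66.1, 65.2]

mean = statistics.mean(accuracies)
stdev = statistics.stdev(accuracies)   # sample standard deviation
print(f"mean = {mean:.1f}%, stdev = {stdev:.1f}%")
```

Reporting a mean and standard deviation over several seeds gives a far better picture of the algorithm’s sensitivity than any single run.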


5 Exercise 5 – Evaluation and Tuning and Testing
We can consider evaluation and tuning at two different levels. First, we can evaluate and
compare the results from different algorithms. This can suggest which algorithm is most
appropriate for our problem and dataset. Second, once we have selected an algorithm, we
need to select a near-optimal set of parameters for the algorithm. Of course, it is not
always quite so straightforward, and we may need to iterate between different algorithms
with different parameters.

5.1 Comparing Algorithms


Weka supports the comparison of algorithms directly in the Experimenter tool, which can be
chosen from the Weka GUI Chooser, which is where you started working with Weka. We
will compare the accuracy of three algorithms on the glass dataset.

Select the Weka ‘Experimenter’ and it will open in the ‘Setup’ tab. First, click on the ‘New’
button at the top right of the window. Then add the glass dataset to the ‘Datasets’ panel
(bottom-left) using the ‘Add new…’ button. Next, add the OneR, J48 and
MultilayerPerceptron algorithms to the ‘Algorithms’ panel (bottom-right), again using the
‘Add new…’ button (the different algorithms are selected using the ‘Choose’ button at the
top of the ‘GenericObjectEditor’ – keep all the default settings).


Now, select the ‘Run’ tab at the top of the window, and then click on ‘Start’. Be patient,
quite a lot is going on – it is running ten repetitions of 10-fold cross validation, so it is
creating and evaluating 100 models per algorithm. You can watch progress in the ‘Status’
panel at the bottom of the window. You should find it finishes with ‘0 errors’.
Now, select the ‘Analyse’ tab at the top of the window, and then click on the ‘Experiment’
button at the top right of the window. It will show the available results (a list of the three
algorithms) in the ‘Test output’ panel. Next, select the ‘Perform test’ button in the ‘Actions’
panel – and the experiment results will be shown.


The Experimenter can perform comparisons on many different measures, but as you can see
in the ‘Test output’ panel, it defaults to ‘Percent_correct’ (accuracy). You can see the
average score for accuracy for each of the algorithms with the glass dataset, and because
we put the OneR algorithm first, it is comparing the other two algorithms against it. The ‘v’
next to the results for J48 and the multi-layer perceptron indicates that these two algorithms
generated models that were statistically significantly better than OneR at the 5% significance
level.
Given this result, we now need to know whether the J48 algorithm outperforms the multi-
layer perceptron by a statistically significant amount. Go back to the Setup window and
move the J48 algorithm to be first in the ‘Algorithms’ panel by selecting it in the ‘Algorithms’
panel and using the ‘Up’ button. Run the experiment again (in the ‘Run’ tab) and then, in
the ‘Analyse’ window, use the ‘Experiment’ and ‘Perform test’ buttons to get the new
results. From these, we can see from the ‘*’ that the OneR algorithm is statistically
significantly worse than the J48 algorithm, but the lack of either a ‘*’ or a ‘v’ against the
multilayer perceptron means that this set of tests cannot tell us that J48 is statistically
significantly better or worse than the multi-layer perceptron algorithm.


These results are all specific to the glass dataset. We can also experiment with multiple
datasets. Go back to the ‘Setup’ tab and add the course and iris datasets to the ‘Datasets’
panel, run the experiment (be patient!) and check the results. It appears, given these
datasets, there is not much to choose between models generated by the J48 and the multi-
layer perceptron algorithms.
So far, we have only looked at the accuracy (Percent_correct) of the models we have asked
Weka to build and test for us. In the ‘Configure test’ panel (on the left), if you select the
‘Comparison field’, then you will see that there are many other factors that can be
compared. Select ‘Elapsed_Time_training’ and then click on the ‘Perform test’ button. Now
you can see that if the time taken to train models is important, then the multi-layer
perceptron algorithm takes significantly longer to train than the other algorithms. For these
datasets the time is not too high, but if we were working with true ‘big data’ then the time
required to train a multilayer perceptron (deep neural net) could become far more
significant.

5.2 Setting Parameters


So far, we have used the default parameters provided by Weka when we select an
algorithm. In practice, we need to tune these parameters. One way to do this is by running
lots of experiments with different parameter values. Weka provides an alternative
approach, which is a bit quicker.
We will try tuning the J48 algorithm, as it appears to be quite accurate and it is far faster
than the multi-layer perceptron algorithm. Go back to the Weka Explorer and re-load the
glass dataset. Move to the ‘Classify’ tab and, in the ‘Classifier’ panel, choose the
‘CVParameterSelection’ algorithm (it is listed as a meta classifier). Left-click on the
algorithm name in the ‘Choose’ window and you are presented with the generic object
editor. In the ‘classifier’ field, select the J48 algorithm.

Weka sets default values for J48 parameters for C (the confidence factor, set to 0.25) and M
(number of instances per leaf, set to 2). To ask Weka to try other values we must set some
options in the ‘CVParameters’ field. Click on this field and a ‘GenericArrayEditor’ opens. In
the input field (to the left of the ‘Add’ button), enter C 0.1 0.4 10.0 and click ‘Add’. This will
ask Weka to try 10 values for the confidence factor from 0.1 to 0.4. Now add another set of
options for the number of instances by entering M 1.0 4.0 4.0 and clicking ‘Add’. This set of
options will ask Weka to try four values for the minimum number of instances per leaf, from
1 to 4. Close the editor (use the cross in the top right)
and click ‘OK’ at the bottom of the generic object editor.
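A quick way to see which candidate values an option string such as C 0.1 0.4 10.0 generates is to enumerate them yourself. The sketch below is our own illustration, not Weka code, and it assumes the candidates are evenly spaced from the lower to the upper bound:

```python
def cv_parameter_values(lo, hi, steps):
    """Candidate values for a CVParameterSelection option string such as
    'C 0.1 0.4 10.0': 'steps' evenly spaced values from lo to hi
    (our assumed interpretation of the option format)."""
    step = (hi - lo) / (steps - 1)
    return [round(lo + i * step, 10) for i in range(int(steps))]

print(cv_parameter_values(0.1, 0.4, 10))
```

Note that the confidence factor recommended later in this exercise, 0.266666666667, is the sixth of these ten candidates, which supports this evenly spaced interpretation.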
Now click ‘Start’ (on the left) and wait for Weka to tell you the optimal parameter values for
C and M in the J48 algorithm for the glass dataset. It may take some time – you can tell that
Weka is working by watching the Weka bird in the bottom right of the window (if it is
moving, Weka is still working; if it is sitting, Weka has finished).
If we check the ‘Classifier output’ panel (near the top after the list of attributes), we can see
that it recommends ‘Classifier Options’ of 1 for M and 0.266666666667 for C.



Now, use the Experimenter with the three datasets and two versions of the J48 algorithm
(one with default values and one with the suggested optimal parameter values for C and M).
Note that C is ‘confidenceFactor’ and M is ‘minNumObjects’.



Compare the accuracy for the J48 algorithm with the new parameters with the results from
the default parameters.



You will see that by tuning these two parameters the accuracy of the algorithm has slightly
improved for the glass dataset, but slightly decreased for the course and iris datasets.

This is not surprising as we only asked the ‘CVParameterSelection’ algorithm to optimise the
J48 parameters for the glass dataset.



6 Exercise 6 – ML Functional Performance Metrics
6.1 Polio Detection
Given the following confusion matrix for a Polio Detection system, derive the accuracy,
precision, recall and F1-Score. The solution is shown in section 6.3.

6.2 Spam Detection


Given the following confusion matrix for a Spam Detection system, derive the accuracy,
precision, recall and F1-Score. The solution is shown in section 6.4.

6.3 Solution – Polio Detection



6.4 Solution – Spam Detection



7 Exercise 7 – Confusion Matrix in Weka
Rerun the J48 algorithm on the course dataset and have a look at the ‘Classifier output’
panel. At the bottom is the confusion matrix.
According to the syllabus:

                               Actual
                       Positive              Negative

 Predicted  Positive   True Positive (TP)    False Positive (FP)
            Negative   False Negative (FN)   True Negative (TN)

• Accuracy = (TP + TN) / (TP + TN + FP + FN) * 100%
• Precision = TP / (TP + FP) * 100%
• Recall = TP / (TP + FN) * 100%
• F1-score = 2 * (Precision * Recall) / (Precision + Recall)
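These four formulas translate directly into a small Python sketch. The counts used here are invented purely for illustration, not the matrices from the exercises above:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (all in %) and F1-score
    derived from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100
    precision = tp / (tp + fp) * 100
    recall = tp / (tp + fn) * 100
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Invented counts for illustration (not the exercise matrices):
acc, p, r, f1 = metrics(tp=40, fp=10, fn=5, tn=45)
print(acc, p, r, f1)
```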
Given the above, and knowing that the confusion matrix may be presented differently in
different books/tools, work out which of the entries in the Weka confusion matrix are TP,
TN, FP and FN.
Also work out where Weka presents the values for accuracy, precision, recall and F1-Score.
The solution is shown in section 7.1.

7.1 Solution - ML Functional Performance Metrics


The confusion matrix provided by Weka is of the form:
  a  b   <-- classified as
 TP FN |  a = yes
 FP TN |  b = no

In the ‘Classifier output’ panel in Weka, where it says: ‘===Detailed Accuracy By Class ===’,
only consider the top row of results, and you can read off the precision, recall and F1-Score.
Accuracy, as we have already seen, is given as a percentage to the right of ‘Correctly
Classified Instances’.



8 Exercise 8 – Selecting ML Functional Performance Metrics
8.1 ML Systems - Selecting ML Functional Performance Metrics
Suggest which of precision or recall should be higher for the following ML systems:
• Malignant tumour identifier (to suggest further investigation)
• YouTube recommender (to identify interesting clips)
• Rocket launch weather indicator (to advise whether to launch tomorrow)
• High-value antique detector (to suggest an antique is further investigated)
• Pregnancy tester (to indicate if someone is pregnant)
• Fraudulent transaction identifier (to identify transactions to investigate)
• Deep-sea treasure finder (to indicate where to start a deep-sea project)
• Legal evidence finder (to suggest significant evidence for a trial)
• Old Master Painting identifier (to identify potential works of art)
• Zombie detector (to determine whether to allow access to the human camp)
• Courtroom guilt determination (to decide if the defendant is guilty)
• Business contract defect detector (to identify mistakes in contracts)
• Umbrella-day indicator (to suggest that an umbrella should be carried)
• Spam email detector (to identify email that is spam)

8.2 Solutions - Selecting ML Functional Performance Metrics


The following list shows which of precision or recall should be higher for the following
systems, along with a brief explanation:
• Malignant tumour identifier (to suggest further investigation) – Recall - we don’t
want to miss any potential bad tumours
If we require 100% recall then there will be no malignant tumours missed that
could have been identified, because there would be no false negatives, where
a false negative corresponds to a malignant tumour that was not identified.
High precision would correspond to few or no times that a malignant tumour
was falsely predicted. Although predicting a malignant tumour that does not
exist will cause temporary anxiety, it is more important to ensure no malignant
tumours go unidentified, and so are not able to be treated.

• YouTube recommender (to identify interesting clips) – Precision - we don’t want to


recommend clips the viewer doesn’t like



High precision would mean that few or no videos were recommended that
did not match the viewer’s tastes. This should reinforce the viewer’s
confidence in the recommender and lead to more future use. High recall
would mean that few videos that the viewer would like are not
recommended. From the viewer’s perspective lower recall might mean they
miss an interesting video, but they would be unaware that they had missed it,
and there are also so many videos available that some interesting videos will
always be missed, so lower recall should not be a problem.

• Rocket launch weather indicator (to advise whether to launch tomorrow) –


Precision – we don’t want to prepare for a launch that cannot happen due to the
weather
High precision would mean that the system rarely predicted that a launch
could happen that was then called off due to bad weather. We want this to
happen rarely as the costs of preparing for a launch that does not take place
are high. In contrast, high recall would mean that it would be rare to suggest
a launch was not possible when the weather turned out to be fine. Although
this is not ideal as the launch is delayed, it only creates an unnecessary delay
compared to spending time and money filling and emptying the rocket of
fuel, with the associated risks, if we do not have high precision.

• High-value antique detector (to suggest an antique is further investigated) – Recall


– we don’t want to miss out on valuable antiques
High recall would mean that there would not be many times that the
detector would miss a high-value antique. High precision would mean that
there were few times that the detector would suggest an antique was high
value, when, in fact, it was not high value. Although high precision would be
good, it is assumed that the overhead of checking that a flagged antique is
really high value is less important than the chance of missing a high value
antique, and so high recall is preferred. This is probably correct in a market
where occasionally antiques at low prices are found to be worth many times
their initial cost.

• Pregnancy tester (to indicate if someone is pregnant) – Precision – we don’t want


to tell women they are pregnant if they are not.
High precision for the tester means that the woman will rarely be told they
are pregnant when they are not. Being told they were pregnant when not
pregnant could be challenging for someone who does not want to be
pregnant, and also for someone who wants to be pregnant but then finds
they are not. High recall would mean that the tester rarely told women that
they were not pregnant when they were pregnant. Again, such a mistake
could be problematic. The main problem would be when women made



important decisions based on the test such as lifestyle choices (e.g. drinking
alcohol), however the mistaken test result would normally only delay the
news that a woman was pregnant. Obviously the choice here is very
subjective, and selecting a value of high precision should probably be
accompanied with a related warning on the test, explaining that the test was
better at predicting that someone was pregnant than predicting that they
were not pregnant.

• Fraudulent transaction identifier (to identify transactions to investigate) – Recall


– we don’t want to miss out on any frauds
High recall would mean that very few transactions that were fraudulent were
labelled as being OK. High precision would mean that very few transactions
that were OK were labelled as being fraudulent. As the identifier is being
used to flag transactions to investigate, then high recall will mean that very
few fraudulent transactions are missed, while high precision will mean that
very few valid transactions are investigated as potentially fraudulent. If we
assume that the vast majority of transactions are not fraudulent then high
recall would be preferred to high precision unless there was an extremely
high cost to investigating.

• Deep-sea treasure finder (to indicate where to start a deep-sea project) – Precision
– we don’t want to suggest there is treasure if there isn’t
High precision would mean that there would be few times when the finder
suggested there was treasure and none was actually there. High recall would
mean that the treasure finder rarely indicated that there was no treasure
when there actually was treasure present. Given the high costs of running a
deep-sea project to extricate the treasure, then it is probably more important
that we have high precision so that we do not waste resources on projects
that find no treasure.

• Legal evidence finder (to suggest significant evidence for a trial) – Recall – we don’t
want to miss what may be evidence that our opponents do find
High recall would mean that the evidence finder rarely missed evidence when
it was there. High precision would mean that the finder would rarely suggest
there was evidence when it was not relevant. For a trial situation, it is more
important that all the relevant evidence is identified, otherwise we risk losing
the case if the opposing counsel has that evidence; thus, we need high recall
in this scenario.

• Old Master Painting identifier (to identify potential works of art) – Recall – we don’t
want to miss out on an Old Master
See the high-value antique detector. The same rationale applies.



• Zombie detector (to determine whether to allow access to the human camp) –
Recall – we cannot allow any zombies into the camp
In this situation we want to be sure that anyone we allow into the human
camp is definitely not a zombie, because otherwise a zombie could easily turn
everyone they met in the camp into a zombie. 100% recall would mean that
there would be no times that the detector falsely reported that a zombie was
not a zombie. 100% precision would mean that the detector would never say
that someone was a zombie when they were not a zombie. High precision
would be great from the perspective of a non-zombie trying to get into the
safe haven of the camp, but high recall will keep all the humans in the camp
safe, even if the occasional non-zombie gets turned away.

• Courtroom guilt determination (to decide if the defendant is guilty) – Precision –


we don’t want to decide someone is guilty if there is a chance they may not be
A high precision guilt predictor would rarely say that someone was guilty if
they were innocent. A high recall guilt predictor would rarely say that
someone was innocent if they were not innocent. As long as we assume that
people on trial are innocent until proved guilty, then we should have a high
precision guilt determination system. Many would argue that 100% precision
should be a requirement of such a system.

• Business contract defect detector (to identify mistakes in contracts) – Recall – we


want to highlight any potential defects
High recall would mean that there would be few cases where the detector
said there was not a defect in the contract, but there actually was a defect.
High precision would mean that there would be few cases when the detector
wrongly identified the presence of a defect that was not actually there. So
that nearly all defects are identified, we would normally opt for a defect
detector with high recall rather than with high precision, as we can probably
live with a detector that identifies some defects that are not actually there
when we check.

• Umbrella-day indicator (to suggest that an umbrella should be carried in case of


rain) – Recall – we want an umbrella just in case it may rain
High recall for the umbrella day indicator would mean that there would be
few days when the indicator said we did not need an umbrella and it then
rained. High precision would mean that most of the time when the indicator
told us we needed the umbrella we would actually need it. Normally we
would opt for high recall for such a system as we can always carry a furled
umbrella if it does not rain, but we cannot ‘magic’ one up if we aren’t
carrying one and it starts to rain.



• Spam email detector (to identify email that is spam) – Precision - we don’t want
real emails being put in the spam folder
With high precision the spam detector will flag few emails as spam that are
not actually spam. High recall would mean that the spam detector would
rarely flag an email as being OK (not spam) when it should have labelled it as
spam. For spam detectors we normally don’t want the detector to label
‘good’ emails as spam as that means there is a danger that we miss these
emails when they are diverted to our spam folder; thus we normally want
high precision. High recall would be great, but we can live with manually
deleting a few spam emails that get into our inbox due to a lower value for
recall.



9 Exercise 9 – Build a Perceptron in Excel
The function for a simple perceptron works with inputs (xi), weights (wi) and a bias (b).

Perceptrons can work with many inputs, however in this exercise we will limit ourselves to
two inputs so that we can more easily show what is happening. In which case we can use x
and y to represent our two inputs (rather than x1 and x2). The following shows a
classification problem suitable for a simple perceptron:

This is suitable as the two classes are linearly separable – that is, they can be divided by a
straight line (if the classes are not linearly separable we need a more complex solution, such
as a deep neural net, which can be thought of as a form of multi-layer perceptron). The
equation for any straight line (which could be used to define a linear boundary) is of the
form:
ax + by + c = 0
The perceptron function (for two inputs) is:
result = 1 if (wx*x) + (wy*y) > -b or 0 otherwise



Thus, a perceptron function for two inputs is simply defining the boundary and giving one
result for one side of the boundary and the other result for the other side of the boundary
(or on it).
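The two-input perceptron function described above can be sketched directly in Python (a minimal illustration; the names wx, wy and b mirror the weights and bias in the function above):

```python
def perceptron(x, y, wx, wy, b):
    # Outputs 1 on one side of the boundary wx*x + wy*y + b = 0, otherwise 0
    return 1 if wx * x + wy * y + b > 0 else 0

# With hand-picked weights (wx=1, wy=1, b=-1.5) the boundary x + y = 1.5
# separates (1,1) from the other three points, implementing AND:
for x in (0, 1):
    for y in (0, 1):
        print(x, y, perceptron(x, y, 1, 1, -1.5))
```

The weights here were chosen by hand as one of the many straight lines that separate the two classes; the training procedure later in this exercise finds such values automatically.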
Let’s consider a very simple piece of logic – the AND logical operator. We can represent it as
a truth table or as a graph:

AND
x y result
0 0 0
0 1 0
1 0 0
1 1 1
If you look at the graph, then you can see the two classes (the green tick represents one
class and the three red crosses represent the other class) are linearly separable as many
different straight lines can be drawn between them. This tells us that we can use a
perceptron to implement the logic of an AND operator.

It is not immediately clear what the values for wx, wy and b should be. Actually, if we did
some trial and error it would not take long to come up with a working solution (have a go).
However, an alternative is to train a perceptron using Excel (not many people’s first choice
as an ML development framework, but simple and ubiquitous). We will need training data,
and for that we will use the four training examples in the truth table.



We will use Excel to learn the AND function by starting with random values for the two
weightings and the bias, and then applying the first example from the training data to the
perceptron function. We will compare the result provided by the function with the desired
result (from the truth table) and use these two values to determine the error. If the results
differ then we will update the weightings and bias and try again (and again) until we find the
function works correctly for all four training examples.
We will need to provide a learning coefficient, which will be used to determine by how
much we change the weightings and bias if we find that they generate a wrong result. If we
choose a small learning coefficient it will not change the position of the line much and may
require us to apply the training set many times (each use of the complete training set of 4
examples is known as an epoch), whereas if we choose a large learning coefficient then
there is a good chance that when we move the position of the line it goes too far the other
way.
The graph shows an initial guess (labelled 1) and subsequent further guesses that gradually
converge on a solution (the green line) as a result of using a small learning coefficient.

The next graph shows the same initial guess (labelled 1), with subsequent guesses labelled
in order until line 4 produces a working solution. Here the line jumps around far more due
to the use of a high learning coefficient (and this figure shows a ‘lucky’ sequence that
manages to hit on a solution after only a few attempts).



Create a new workbook in Excel and save it as ‘Perceptron’.
Add the training data.

As we saw in the truth table, we have four examples. When we train an ML model, each
pass through the complete training dataset is known as an epoch. We will add more
epochs as needed later.
We now want to add the activation function for the perceptron node. We have already
seen that the perceptron activation function looks like this:
activation = 1 if wx*x + wy*y > -b (or 0 otherwise)
If we rewrite it to move bias to the same side as the weighted variables, we get:
if value > 0 output 1, else output 0, where value = wx*x + wy*y + b
Add the following to the spreadsheet:



The weightings (WX and WY) and the bias have been chosen at random, so just enter these
values. However, the ‘Value’ needs to be calculated as per the formula above, as shown
below:

The value for ‘Activation’ can now be calculated. If the calculated value is greater than zero,
then the activation value is 1, otherwise it is 0.

As you can see, we can calculate the activation value for the first training data example
(X=0, Y=0). However, we can also see that our randomly selected weights and bias are not
quite right (they have not immediately solved the problem) because the activation value
does not match the expected result (the activation value we get (1) is different from the
training data desired result (AND RESULT), which is zero).
The spreadsheet also needs to be able to spot this, so add another column (Error) to detect
if the activation result and desired result differ:

We want the spreadsheet to learn from its mistakes and change the values of the weights
and bias until the activation function can correctly calculate activation values that match the
desired result for every one of the four training examples in an epoch. As there was an
error (and we know the direction as this error is negative), we can modify the weightings
and bias based on the error and a learning coefficient.



So, let’s add a variable and value for the learning coefficient. We will start with a value of
0.1 (we will look at what happens with other values later). Note that we are putting this in
column L, so leave column K empty, as we will use it later.

We will now add the ‘learning’ to the spreadsheet, by adding functions that modify the
weights and bias if the error is non-zero (i.e. the activation function disagrees with the
desired result). Let’s start with WX. Add the following function to calculate a new value for
WX based on the current value.

You see that the learning function takes the previous value for WX and updates it based on
the learning coefficient, the error and the previous value of X. Note that WX does not
change in this case as the previous value of X was zero.
We now need to do the same for WY, which also does not change, due to its previous value
being zero.

It is similar for the bias, but we have no input variable to consider, and so you can see that
the bias actually changes due to the error. Note that the amount of change is highly
dependent on the selected learning coefficient (0.1).



Having updated WX, WY and the bias, we can now calculate the new value, activation and
error. The formulae for this are the same as for the previous training example, so we can
copy and paste the formulae for these from row 3 into row 4.

As expected with a new training example (and updated bias) the value is changed, but the
new activation matches the desired result (in column D) and so the error is zero.
The formulae we used for updating the weights and bias take account of the error value,
and so we can simply copy and paste them (along with those for value, activation and error)
into the next row for the third training example:

We can see that there was another error, and so we would expect a change for the fourth
training example. To get this, we again copy and paste from the previous row:

We have now run one epoch of training data. There were errors (as shown in column J), and
so the weights and bias have been changed, thus we are not yet sure if the current weights
and bias will correctly define a boundary that will work for all four training examples (at
present we are only sure that the current values will work for the fourth example).
What we are looking for as we proceed, is running all four training examples (a complete
epoch) and there being no errors for any of them. This will mean that the weights and bias
do not change (due to the error being zero) and we will know that these weights and bias
define a boundary that works for all four examples.
When all the error values in an epoch are zero, we can say that we have converged on a
solution. Note that the solution that we have converged on is dependent on many factors,
including the initial values we chose for the weights and bias, the learning coefficient we
selected, and even the order in which the training examples are presented.
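The whole procedure the spreadsheet performs (repeat epochs, applying the update rule "new weight = old weight + learning coefficient × error × input" until an epoch produces no errors) can be sketched in a few lines of Python. The starting weights and bias below are arbitrary choices, as in the spreadsheet:

```python
# The four AND training examples: (x, y, desired result)
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]

def train(wx, wy, b, lr=0.1, max_epochs=100):
    """Repeat epochs until a full epoch produces no errors (convergence)."""
    for epoch in range(max_epochs):
        converged = True
        for x, y, desired in data:
            activation = 1 if wx * x + wy * y + b > 0 else 0
            error = desired - activation   # negative if we output 1 when we wanted 0
            if error != 0:
                converged = False
                wx += lr * error * x       # the spreadsheet's update rule
                wy += lr * error * y
                b += lr * error
        if converged:
            return wx, wy, b, epoch
    return wx, wy, b, max_epochs

wx, wy, b, epochs = train(wx=0.3, wy=0.2, b=0.1)
print(wx, wy, b, epochs)
```

As the text notes, the solution reached depends on the starting values, the learning coefficient and the order of the training examples; changing any of these in the call to train will generally converge on a different boundary.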



As we want to know when we have converged on the solution, we will add a check for this
into the spreadsheet in column K. Adding the following into column K will tell us if all four
errors in an epoch are zero:

We now have to do another epoch’s worth of training to see if we converge on a solution.


Luckily, this is mostly copy and paste from epoch 1, but we cannot simply copy and paste
the whole of rows 3 to 6 because row 3 was not calculated but was our starting point.
First, we have to set the new epoch number (either manually to 2, or as a function that
increments the previous epoch number (=A3+1)).
Next, we need to copy and paste the training data:

We cannot simply copy the functions from all four rows of the first epoch as we started the
training with randomly selected values for the weights and bias in row 3, so we need to copy
the final row (row 6) of the activation function and the error and copy that into the next
four rows:



Immediately you can see that we have performed the second epoch’s training. We can also
see that there were three errors and that the weights and bias have been changed, where
appropriate.
Before moving onto a third epoch of training, we will copy and paste the check for
convergence into the second epoch:

From now on, to add another epoch of training data, it is simply a single copy and paste of
the four rows of an epoch:

Using copy and paste, perform more training until the weights and bias converge on a
solution:



Now we have a solution, we can simply plug in the final weights and bias to a perceptron
and it will perform the logical AND function:

One way of checking graphically if this works is to plug the values into the equation of a line:
ax + by + c = 0.
We get the following equation: 0.2x + 0.1y – 0.2 = 0 (or y = 2 – 2x).



If we plot this line on a graph, we get:

The blue boundary line is ‘just’ OK as the cross on the line is considered to belong with the
other two crosses (it would need to be to the right of the line to be considered a 1 by the
perceptron).
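As well as plotting the boundary, the converged values can be checked numerically by plugging each truth-table input into the perceptron function. The weights and bias below are the ones this walkthrough converged on (wx=0.2, wy=0.1, b=-0.2); your own run may reach different values:

```python
wx, wy, b = 0.2, 0.1, -0.2   # the converged values from this walkthrough

for x, y, expected in [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]:
    activation = 1 if wx * x + wy * y + b > 0 else 0
    assert activation == expected
# The point (1, 0) lies exactly on the line 0.2x + 0.1y - 0.2 = 0; since the
# activation requires value > 0, a point on the line is classified as 0,
# which is why the text describes the boundary as 'just' OK.
print("all four AND cases correct")
```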
The solution that we arrive at is dependent on many factors, including the initial values we
chose for the weights and bias, and the learning coefficient we selected. Create a copy of
the worksheet and try changing these factors to see what effect it has on how quickly it
converges on a solution.
If you load the solution spreadsheet, it will also automatically draw the graph, so that you
can visibly check if the solution works.



10 Exercise 10 – Selecting Objectives and Acceptance Criteria
You have been tasked with determining the relative importance of different quality
characteristics for an AI-based system. The importance you assign to each characteristic will
be used to define acceptance criteria for the system (higher importance  more important
acceptance criterion). Solutions for the three systems are provided in sections 10.2
(Student Grading), 10.3 (Medical Image Analysis) and 10.4 (Financial Advisor).

10.1 Example Systems and Characteristics


Select one of the following AI-based systems:
• Student Grading
• Medical Image Analysis
• Financial Advisor
Consider the following characteristics:
• Adaptability
• Autonomy
• Evolution / Degradation
• Flexibility
• Fairness / Bias
• Functional Performance
• Transparency / Explainable
• Complexity
• Non-Determinism/Probabilistic
• Ethics
Decide the level of importance for each characteristic on a three-point scale of low, medium
and high. Include the rationale for your choice and any assumptions you have made about
the system.

10.2 Solution – Student Grading


High Importance
– Autonomy – how often is the professor expected to check that the system is working
OK?
This is high importance as the main user of the system, a professor, does not want to
be constantly interacting with the system once it is set up and working. The



professor wants the system to grade students’ work without being required to check
that each automatically awarded grade is correct.
– Evolution / Degradation – if the system is self-learning, is it still giving accurate results?
If the system evolves and this leads to a degradation in the accuracy of the grading
produced by the system, this could be a big problem for the professor and the
students (depending on who notices).
– Fairness / Bias – did the system learn some implicit bias on how it grades (e.g. prefers all
essays to be written in third person) or is there explicit bias (e.g. it checks quality of
language although the learning objectives are technical)?
The system could be deliberately trained to favour particular styles of coursework,
and as long as these were not considered unfair this should not be a problem.
However, there is the chance that the system picks up bias in the training data,
which could, presumably, take the form of example essays and grades. Identifying
the potential for bias in training data and checking that it has not occurred are
important in ensuring a fair system, and an unfair grading system would be
unacceptable.
– Transparency / Explainability – presumably, it would be useful if the system could
explain its grading?
The characteristic of transparency (knowing the data and algorithm used to create
the system) may not be so important, but explainability (understanding how the
system decides on a given grade) is likely to be important in various usage scenarios,
particularly those where the students are aware that their coursework is being
graded automatically. If the system is required to explain its grades, then this
characteristic will be highly important.
– Ethics – can the system take account of students with disabilities, such as dyslexia?
One of the main issues with ethics is ensuring fairness and there are ethical issues
associated with students being graded by a ML model rather than their professor.
Limitations in the student grading system should be clearly communicated to and
taken into consideration by the professors using the system. Students should be
made aware that the system is being used, so that they can challenge grades that
they feel are unfair.
Medium Importance
– Adaptability – more important if the grading system is going to be used in a new
academic area (e.g. built for history, now used for physics or languages). However, it is
unlikely to be used outside of student grading – but perhaps we could use it to rate the
quality of reports within an organization.
Adaptability is concerned with the ease of modifying the system to work in new
situations. For this system the requirement for adaptability depends largely on the



original expectations for its use and how this might change. If it was ‘sold’ as a
system for grading essays, but it was expected that it could be re-trained for
evaluating non-academic reports, then the ability to change it to grade/evaluate
against new features could be useful. The importance largely depends on the
specific situation; hence it has been assigned to medium here.
– Functional Performance – accuracy and other metrics may be useful to check, but it will
not need to be too accurate given the qualitative nature of grading.
The importance of this characteristic depends on your assumptions about the
grading system. If it is grading essays, then the qualitative nature of essay grading
will mean that accuracy will be difficult to measure. On the other hand, if it is
grading based on more objective criteria, then accuracy will be important, however,
the need for using AI in such objective grading with clear marking schemes is
unclear.
Low Importance
– Flexibility – not necessary as we have control over its environment.
Flexibility is concerned with being able to use the system outside its originally
expected operating scenario. For this system we would not expect it to be able to
change itself to work in new environments, largely because the learning is so specific
to the material being graded.
– Complexity – should not be that relevant as it won’t be too complex.
The functionality performed by the system is clearly understood and it is automating
a well-understood human task, so the complexity should not be too high.
– Non-Deterministic / Probabilistic – not so relevant due to the qualitative nature of
grading.
The level of non-determinism built into this system is not expected to be very high.
And the earlier point about the qualitative (subjective) nature of essay grading means
that any slight variations in results are unlikely to be noticeable.

10.3 Solution – Medical Image Analysis


High Importance
– Autonomy – we probably have to check the results – but the importance depends on
what the results are expected to be used for (e.g. suggesting a scan to get more
information or suggesting surgery).
Autonomy is concerned with the ability to leave the system working for sustained
periods without human intervention. As image analysis itself has no direct risk to the
patient, it should be able to be done with little or no human intervention. The level
of autonomy depends on the scope of the system. If the system includes the

imaging equipment, then there may be a requirement to re-calibrate the system that
provides the images on a regular basis.
– Evolution / Degradation – if the system is self-learning, is it still giving accurate results?
Any degradation in accuracy due to changes over time will need to be identified.
Loss of accuracy is more likely to be due to hardware equipment degradation, and
could be addressed by the system, while any self-learning would need thorough
built-in testing.
– Fairness / Bias – has the system learnt on a dataset that is biased in some way (e.g. sex,
gender, race)?
Medical systems are well-known for their susceptibility to sample bias. For instance,
systems that detect melanoma being provided with training data from
predominantly white patients, so making the systems poor at identifying melanoma
in non-white skinned patients. In the medical field, there are now checklists
available to identify such bias.
– Functional Performance – we will not want any false negatives – so this is important.
Depending on what the results from the system are used for, it may be very
important that the system does not miss any medical problems. For instance, if the
system is providing early warning of cancer, we do not want it to miss a possible
positive result that would allow further investigation and early treatment. In such a
situation any false negatives may be unacceptable, but it will depend on what the
image analysis system is being used for.
Medium Importance
– Adaptability – can we move it from lung tumours to brain tumours?
The ability to modify the system to perform different types of image analysis may be
required, in which case the ease with which this can be achieved may be important.
However, if we need to re-train a system with new images, then measuring the
success of the update could be expensive. We may also need to consider the ease of
changing the system to work with upgraded hardware that provides the images.
But, again, the testing of the change would be expensive.
– Transparency / Explainability – not of highest importance in medicine as patients trust
doctors and much medicine is experience-based.
From the patient’s perspective, there is unlikely to be a demand for explanations as
this image analysis is likely to be one of several inputs to a doctor who then presents
the results to the patient. However, the doctors may well want the system to
provide an explanation for why it flagged up a given image as ‘interesting’. This
depends on what the system is used for and whether the flagged results are
self-evident or the analysed images are difficult to interpret.

Low Importance
– Flexibility – environment is unlikely to change (i.e. it will still be working with an X-Ray
machine, CAT scanner or MRI scanner)
The need for the system to modify itself in response to changes in its environment is
likely to be low. If changes are made to the hardware it works with, then this is likely
to require human updates and extensive testing and so is likely to require system
adaptability rather than flexibility.
– Complexity – not so complex that a human cannot check results
This depends on the sophistication of the image analysis. If the system is replacing
humans to improve the speed and consistency of the analysis then the complexity is
likely to be relatively low as a doctor would be able to check the analysis visually. A
more sophisticated medical analysis system that incorporated multiple input
features (rather than just the image) would be more complex.
– Non-Deterministic / Probabilistic – not so relevant due to the qualitative nature of
analysis.
The fact that the analysis results are not definitive should not be a problem as image
analysis (whether by ML or human) is understood by most to be probabilistic in
nature.
– Ethics – not so important as it shouldn’t disadvantage any minorities provided it is made
available to them.
The main ethics issue with this system is likely to be the sample bias (covered earlier).
This type of system is understood as providing benefits to patients and is unlikely to
create unfairness.

10.4 Solution – Financial Advisor


High Importance
– Fairness / Bias – need to check that it is not biased towards a group (e.g. those who have
been employed all their lives).
Bias is an obvious problem with a financial advice system. Care must be taken to
ensure that minority groups are not disadvantaged by the system, which could occur
if the system is trained using data that does not fully represent all user groups.
– Evolution / Degradation – this may be possible if the advisor accesses data from sites
that are not regularly updated.
A financial advice system can be severely handicapped if the advice it provides is not
fully aligned with the latest regulations and market conditions. This will require the
system to ensure that the information it uses is up to date, and also that new
communication channels are considered.

– Transparency / Explainability – it needs to be able to provide a rationale for its advice.
A system providing financial advice needs to be able to justify its recommendations,
as users will not, for instance, invest money in a new scheme without an explanation
of why it is a good idea.
– Complexity – it may be basing decisions on big data, which may mean that it is quite
complex.
The complexity of recommendations based on the analysis of big data means that
these recommendations cannot easily be replicated by humans, especially in useful
timescales. For instance, there is no point making an investment in a changing
market after that market has changed too much.
– Ethics – bias is definitely possible against disadvantaged groups (e.g. those who have
been out of work due to illness or childbirth).
There are several different ways in which ethics apply to financial systems. For
instance, the systems may provide advice to invest in unethical schemes if the
system objectives are not carefully defined to include ethical considerations
alongside the objective of maximising profit.
Medium Importance
– Flexibility – We could envisage the system identifying new sources of information on the
internet and using them.
If the system can change itself, such as changing the sources of information it uses to
make recommendations, then this would need to be incorporated with care as it
would open up the system to the risk of being vulnerable to attack through this
route. For this reason, this form of flexibility seems unlikely. However, swapping
between information feeds from pre-validated data sources is a more likely
possibility.
– Adaptability – it seems that using it outside of financial advice is unlikely, but updating
information sources may be a possibility.
Modifying the system to change its functionality would require a lot of
knowledge and would be unlikely. However, making changes to where the
system gets its information is a possible modification that could be performed
without much technical skill.
– Autonomy – it depends – is this an app bought by an individual, who will use it for advice
or is it to support a professional financial advisor, who will allow it to make investments?
The name suggests that this system is providing advice that may be acted upon, or
not. If the system is limited to providing advice, then the need for autonomy is low.
However, if the system is allowed to interact with the market to make transactions,
then the level of autonomy becomes far more important.

Low Importance
– Functional Performance – advice is normally qualitative, so unless the system advises on
investing specific amounts (e.g. just below a tax threshold) this is probably not so
important.
The importance of this characteristic is very dependent on the form of advice being
provided by the system. For most systems providing generic advice, then this will
not be so important. However, if the system is focused on providing specific advice,
such as an amount to invest, then this will be more important.
– Non-Deterministic / Probabilistic – not so relevant due to the qualitative nature of
analysis.
As with the previous characteristic, this is dependent on the form of advice being
provided.

11 Exercise 11 – Explainability using ExpliClas
1. Go to https://demos.citius.usc.es/ExpliClas (the ExpliClas starting page).

2. Select the BEGINNER option.

3. Find the ‘BEER’ dataset and click on ‘Select’.

4. Go to the top of the page and click on ‘Next’.

5. On the left, you can explore the data in the BEER dataset. There are 40 instances
and for each instance, you can see the values for each of the attributes. Towards the
bottom, on the left, you can see the 8 possible output classes. Choose an instance
(or leave it with the default first instance).

6. Once you are familiar with the dataset, go to the right, and click in the white space to
close the dataset explorer and show six classifiers, with their accuracy (labelled here
as precision).

7. Click on ‘NEXT’ and it will give you the opportunity to select which algorithms to
compare, and whether you want local or global explanations. The default is that all
are selected, so leave it with the default.

8. Click on ‘FINISH’ and ExpliClas will explain why each of the classifiers came up with
their result for the instance you selected. Below it there is a global explanation that
applies to the model in general.

9. Go ‘BACK’ twice to select another instance, and then repeat as before to see
explanations.

10. Now try the EXPERT option. Click on the three dots at the bottom right of the screen.

Select the ‘Logout’ option.

And then select the EXPERT option.

11. Select the BEER dataset again.

12. Now choose an instance on the left (Click on ‘DATA’), and click on the ‘CLASSIFY’ button,
below.

13. You will now be presented with a graphical view of the J48 model, with the path
through the tree highlighted in green.

From here, you can select different instances, different algorithms, look at local or global
explanations, etc.

12 Exercise 12 – Selecting a Test Approach for an ML System
This is an exercise on identifying potential scenarios that fit the selection of given test
approaches for an ML system.
For each of the following situations, write a brief scenario that could result in the situation
occurring. The suggested solutions are in section 12.2.

12.1 Example Situations


• Model testing gives excellent results, but the model’s operational
performance is poor
• The training dataset for the blood pressure advice system for people over 40
is biased towards women
• Model performance is worse than expected when performing evaluation
• Black-box adversarial testing is necessary
• The test results show the model is underfitting although the training dataset
should have been just large enough

12.2 Solutions – Example Situations and Initiating Scenarios


Example - Model testing gives excellent results but the model’s operational performance is
poor
For this example situation there are two alternative scenarios that could easily cause
this situation to occur:
• Scenario - Testing dataset was not completely independent of the training
dataset
In this scenario, the lack of independence means that some or all of the
testing examples were also used for training. This means that the model was
built based on some or all of the test examples, and so would be expected to
perform well with these examples that it had, in effect, already seen. Once
deployed, the operational model would come across new, unseen examples
and so would be expected to perform less well with these new examples.
• Scenario - Training and testing datasets were not fully representative of the
actual operational data.
In this scenario, even if the training and testing had been performed perfectly
well, the resultant model would fail to perform so well with operational data
as this data is not the same as the model was trained on, and so the model
will struggle to work well with data that does not match the data it was
trained with.
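The independence problem in the first scenario can be detected mechanically by checking for examples shared between the two datasets. A minimal sketch (pure Python; the row data and function name are illustrative):

```python
# Sketch: detecting train/test leakage by finding shared examples.
# Each dataset row is represented as a tuple of feature values plus label.
def leaked_examples(train_rows, test_rows):
    """Return the test rows that also appear in the training data."""
    train_set = set(train_rows)
    return [row for row in test_rows if row in train_set]

train = [(5.1, 3.5, "setosa"), (6.2, 2.9, "versicolor"), (5.9, 3.0, "virginica")]
test = [(6.2, 2.9, "versicolor"), (4.8, 3.1, "setosa")]

# The shared "versicolor" row indicates the test set is not independent.
print(leaked_examples(train, test))
```

In practice the comparison would be done on a unique example identifier or a hash of each row, since floating-point features may not compare exactly.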

Example - The training dataset for the blood pressure advice system for people over 40 is
biased towards women
There could be several reasons why a training dataset is biased towards a particular
group. Medical ML systems seem to be particularly prone to this type of problem,
which is a form of sample bias, probably because of the way medical data is
collected, which is often from specific patient groups. Ideally the training data needs
to be representative of the population for which the system will be used. In the
medical field, there are now checklists available to identify such bias. The following
is one scenario that would result in sample bias for the blood pressure advice
system:
• Scenario - The training dataset was provided by gathering data from
datasets that were collected from patients undergoing menopause treatment.
Data from this patient group would largely fit into the right age group, but
the menopause applies exclusively to women, and so would mean that no
data from men was included.

Example - Model performance is worse than expected when performing evaluation


Sub-optimal model performance could be caused by several scenarios. There are
two basic options. The problem could be caused by the data, but, alternatively, it
could be caused by the algorithm. A third option is that both could be at fault. If we
assume that the training data is adequate and representative of the expected
operational use, then it could be that the testing dataset has been developed
independently (which is good practice), but that this independent dataset is not
representative, and so does not match the created model. If the problem is caused
by the algorithm, this is normally caused by a lack of understanding on the part of
the person or team who selected and implemented the algorithm. The following
describes one such scenario where a sub-optimal algorithm is used, so resulting in a
poorly performing model. Of course, the ‘right’ algorithm may have been selected,
but then it was not implemented well, such as by using a poor selection of algorithm
hyperparameters.
• Scenario - The training algorithm was chosen based on the data scientists’
previous positive experiences with it rather than it being a good match to the
objectives and data

Example - The test results show the model is underfitting although the training dataset
should have been just large enough
Underfitting can be caused by problems with the training data not containing
relevant features that are needed by the algorithm to build the model, or by the
wrong algorithm being used, or both of these. In this example situation, it says the
training dataset was large enough, so that suggests that either the training dataset

was misused or there was a problem with the algorithm (e.g. a linear algorithm was
selected for a non-linear problem). The following two scenarios reflect this:
• Scenario - Insufficient training data was used because the training dataset
was split in half to create separate training and validation datasets and cross
validation was not used
or
• Scenario - The wrong training algorithm was selected for the objective and
data
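The cross-validation point in the first scenario can be illustrated with a minimal, pure-Python sketch of k-fold index splitting: every example is used for training in k−1 folds and for validation in exactly one fold, so no half of the training data is sacrificed. The function name and fold handling are illustrative; real projects would use a library implementation:

```python
# Sketch of k-fold cross-validation index splits (pure Python).
def k_fold_indices(n_examples, k):
    """Yield (train_indices, validation_indices) pairs for k folds."""
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_examples) if i not in set(validation)]
        yield train, validation
        start += size

# With 6 examples and 3 folds, each example is validated exactly once
# while still contributing to training in the other two folds.
for train, val in k_fold_indices(6, 3):
    print(train, val)
```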

13 Exercise 13 – Pairwise Testing – Self-Driving Car
This exercise provides practice in using pairwise testing to test an AI-based system – in this
case a self-driving car.
Go to the Inductive website at: https://inductive.no/
Login to your account and then open the free online Pairwiser test tool.

Select ‘new’ and enter Self-Driving Car as the new Test Plan.
We will now define the parameters. We will consider the Operational Design Domain
(ODD). This is defined as a set of conditions that includes environmental factors, road
infrastructural elements, agents around vehicles, and various driving scenarios for which a
system or a feature of the system is designed. According to ISO 21448, the following factors
are part of the ODD (note that the ‘ego’ vehicle is the one we are testing):
• climate
• time of the day
• road shape
• road feature (e.g. tunnel, gate)
• condition of the road
• lighting (e.g. glare)
• condition of the ego vehicle (e.g. sensor covered by dust)
• operation of the ego vehicle (e.g. a vehicle is stopping)
• surrounding vehicles (e.g. a vehicle to the left of the ego vehicle)
• road participants (e.g. pedestrians)
• surrounding objects off roadway (e.g. a traffic sign)
• objects on the roadway (e.g. lane markings).

Each of these factors can have multiple values. For example, for ‘time of the day,’ its values
can be early morning, day-time, evening, and night-time. Based on ISO 21448, the total
number of combinations for all the ODD factors is 169,554,739,200.
Note that in the ODD we have not considered properties of agents (e.g. gender of a
pedestrian), vehicles (e.g. speed of the vehicle), and environmental attributes (e.g. amount
of snowfall). If we also did that, and we wanted to cover all combinations, we would need
even more than the 169 billion tests required to cover the ODD.
In practice, we would perform a hazard analysis to determine which ODD factors were
relevant to our situation. We will limit ourselves to considering just the road condition, the
weather, and the time of the day.
Select the ‘Define Parameters’ tab at the top of the Pairwiser window and you will be
presented with a new window where you can add parameters (ODD factors) and constraints
(between the parameters).

We will assume the following possible parameter values:


road condition - dry, wet with puddles, wet with no puddles
weather - rainy, snowy, cloudy, clear
time of the day - day, dusk/dawn, night
Note that well-drained roads would normally be wet with no puddles, while poorly drained
roads would accumulate puddles.
For this simple scenario we would need to generate 3×4×3 = 36 tests if we wanted to test all
possible combinations.
Use the ‘⊕ parameter’ button to add parameters and corresponding values (‘⊕’):

Once happy that you have entered the parameters and values correctly, save the data by
selecting the ‘ save’ button.
We will now generate test cases that will achieve coverage of all pairwise combinations (all
combinations of weather and road condition, weather and time of day, and road condition
and time of day).
Select the ‘Generate Tests’ tab at the top of the Pairwiser window and you will be presented
with a new window. Once there, select the ‘ generate tests’ button to generate a set of
tests that satisfy pairwise coverage. Pairwise (2-wise in the tool) is the default, so you
should not need to select this.

You will find that Pairwiser generates 13 tests (actually 13 combinations of test inputs as
there are no expected results). If you decided to do this manually (and you were efficient),
then you could cover all the parameter-value pairs with just 12 test cases. The algorithm for

calculating pairwise is not simple, and the Pairwiser tool sacrifices minimising the test set for
speed.
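To make the trade-off concrete, a naïve greedy pairwise generator can be sketched in a few lines of Python. This is not the Pairwiser algorithm, just a simple demonstration: it enumerates every parameter-value pair that must be covered, then repeatedly picks the full combination covering the most still-uncovered pairs:

```python
from itertools import combinations, product

# The three ODD parameters used in the exercise.
params = {
    "road condition": ["dry", "wet with puddles", "wet with no puddles"],
    "weather": ["rainy", "snowy", "cloudy", "clear"],
    "time of the day": ["day", "dusk/dawn", "night"],
}

def pairwise_tests(params):
    names = list(params)
    # All parameter-value pairs that must be covered at least once.
    uncovered = {((a, va), (b, vb))
                 for a, b in combinations(names, 2)
                 for va, vb in product(params[a], params[b])}
    # Greedy selection over all full combinations (fine for small inputs;
    # real tools use much smarter, faster constructions).
    all_tests = [dict(zip(names, vals)) for vals in product(*params.values())]
    tests = []
    while uncovered:
        best = max(all_tests, key=lambda t: sum(
            ((a, t[a]), (b, t[b])) in uncovered
            for a, b in combinations(names, 2)))
        tests.append(best)
        for a, b in combinations(names, 2):
            uncovered.discard(((a, best[a]), (b, best[b])))
    return tests

# Covers all 33 parameter-value pairs; typically close to the 12-test
# minimum, and always far fewer than the 36 exhaustive combinations.
print(len(pairwise_tests(params)))
```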
Actually, if you change the order of the parameters (you can do this back in the
‘Define Parameters’ window using the ‘Move up’ and ‘Move down’ buttons), you will
find that subsequently generated test sets can sometimes contain up to 15 tests for
this same set of parameters, simply because they are in a different order.
In practice, simply saying rainy is often too vague – we may want to specify how rainy. We
can do this by adding a fourth parameter for the amount of rain in cm. For this exercise we
will add rain levels from 0 cm to 6 cm in 1 cm increments.

Save the parameters and generate a new set of tests.


You will see that we now get 31 tests (all combinations would be 252 tests). However, if we
take a closer look at the generated tests, then we can see that we now have some weird
tests. For instance, there are tests where the road condition is ‘dry’, but there are also
several centimetres of rain. Thus, we can see that there is a dependency between the road
condition and the rain parameters. Also, we have tests where the weather is not rainy and
yet we still have several centimetres of rain – another dependency. Luckily, we can remove
these inconsistencies by setting some constraints in the ‘Define Parameters’ window.
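The two dependencies described above can be expressed as a simple predicate over a generated test. A hedged sketch (the parameter names follow the exercise; the function itself is illustrative, not part of Pairwiser):

```python
# Constraint check mirroring the dependencies above: a dry road implies
# no rain, and rain (> 0 cm) only occurs in rainy weather.
def is_consistent(test):
    if test["road condition"] == "dry" and test["rain (cm)"] > 0:
        return False
    if test["weather"] != "rainy" and test["rain (cm)"] > 0:
        return False
    return True

print(is_consistent({"road condition": "dry", "weather": "clear",
                     "rain (cm)": 0}))   # True
print(is_consistent({"road condition": "dry", "weather": "rainy",
                     "rain (cm)": 3}))   # False - dry road with 3 cm of rain
```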

Save the constraints and then generate a new set of tests.
We should now have 30 (more sensible) tests.

Select the ‘Analysis of Tests’ tab at the top of the Pairwiser window, and then select the
‘analyze’ button.

The graph shows us how with each additional test the coverage increases until we achieve
100% pairwise coverage.
We will now force Pairwiser to include a test with a particular combination of parameter
values. This is a useful facility, as it allows us to ensure that potentially high-risk situations
are always covered by a test case.
Imagine that we feel that driving in snowy conditions at night with a dry road (presumably
icy if it is not wet and it is snowing) is high risk and want to force Pairwiser to include that as
a test case.
Select the ‘Required Tests’ tab at the top of the Pairwiser window and then add a required
test (you can leave rain empty as it already knows from the constraints that if the road
condition is dry then rain must be 0 cm).

Save the required test and then generate a new set of tests, but be sure to select the
‘include required tests’ option in the ‘Generate Tests’ window before generating the tests.
Your recently entered (high risk) required test should be listed at #1. And there are now 29
tests needed to achieve 100% pairwise coverage.
Finally, try generating tests for ‘1-wise coverage’ (not very efficient – it should only need 7
tests) and ‘3-wise coverage’ (lots of tests).

14 Exercise 14 – Metamorphic Relations
For the metamorphic testing exercise, we will initially consider two separate theoretical
problems based on typical AI-based systems (speech recognition and online search), and
then move onto practicing metamorphic testing by building an image recognition system
and then testing it with metamorphic test cases.

14.1 Speech Recognition


You have been tasked with testing a Speech Recognition system using metamorphic testing.
Our source test input is a sound file and the source test result from the ML system is the
text spoken in the speech file.
Identify several metamorphic relations for the system that could be used to generate
follow-up test cases.
The solution can be found in section 14.3.

14.2 Online Search


You have been tasked with testing an AI-Based online hotel search system using
metamorphic testing. Our source test input is a set of search terms (e.g. required dates,
location, number of people, WiFi availability, etc.) and the source test result from the
system is a list of hotels that meet the search criteria.
Identify several metamorphic relations for the system that could be used to generate
follow-up test cases.
The solution can be found in section 14.4.

14.3 Solution - Speech Recognition

On top of the three MRs shown on the slide, note that we can create more test cases with
volume at different levels, played at different speeds, and with changed background noise,
etc.
We can also create many follow-up test cases where we combine these changes (e.g. sound
at 90% and speed at 120%).

14.4 Solution - Online Search

On top of the three MRs shown on the slide, note that we can create more test cases with 2
or more search terms added or removed.
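One of these relations – adding a search term must narrow the results – can be sketched as a subset check. The hotel data and `search` function below are illustrative stand-ins for the real system:

```python
# Sketch of a subset metamorphic relation for hotel search.
hotels = [
    {"name": "Grand", "wifi": True, "pool": True},
    {"name": "Plaza", "wifi": True, "pool": False},
    {"name": "Lodge", "wifi": False, "pool": False},
]

def search(criteria):
    # Stub: a real system would query a live hotel database.
    return {h["name"] for h in hotels
            if all(h[k] == v for k, v in criteria.items())}

source_results = search({"wifi": True})
follow_up_results = search({"wifi": True, "pool": True})

# MR: narrowing the search must return a subset of the source results.
assert follow_up_results <= source_results
print(sorted(follow_up_results))  # ['Grand']
```

Removing a search term gives the mirror-image relation: the follow-up results must be a superset of the source results.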

15 Exercise 15 – Metamorphic Testing with Teachable Machine
We will now build an ML system that can classify shapes using the Google Teachable
Machine.
Go to https://teachablemachine.withgoogle.com/train and you will be presented with three
large buttons, each of which will allow you to start a new project (image, audio or pose).
Choose the ‘Image Project’ by clicking on it. You will be presented with two options for your
‘New Image Project’: ‘Standard image model’ and ‘Embedded image model’ – choose the
‘Standard image model’.

We will define three classes (triangle, quadrilateral and background), the examples for
which we will input through our computer webcam.
Rename ‘Class 1’ to be ‘Triangle’ by clicking on the pencil icon. Rename ‘Class 2’ to be
‘Quadrilateral’ and use the ‘Add a class’ button to add a new class and name it ‘Background’.

Before we can add the examples as training data, we need to create them. Find some paper
and draw about 10 separate triangles (try and vary the triangle types).
Now add your triangles as training data. Click on ‘Webcam’ in the ‘Triangle’ class (you may
need to allow access to the camera).

If you hold down the ‘Hold to Record’ button it will take lots of pictures – so be careful (you
can always delete images you are not happy with by pointing at the image and clicking on
the bin icon that appears). TIP: Try not to get your fingers in the picture – we don’t want the
model to learn based on visibility of your fingers!

You can see that several different types of triangles are recorded above (and no fingers!).
Now do the same for the ‘Quadrilateral’ class.
Follow this by recording images for the ‘Background’ class.
Now, we will train the ML model by clicking on the ‘Train Model’ button (it should take
about 10 seconds).

You can see that the model can clearly see (it is 100% sure) that it is looking at the
background (and not a triangle or quadrilateral).
Now test the model in the Preview by presenting triangles and quadrilaterals to the
webcam.

Here we can see that it can also correctly identify a quadrilateral. It should also work with
triangles.
If the model successfully identifies these three, we have a set of source test cases that we
can use as the basis for metamorphic testing.
Identify as many metamorphic relations (MRs) as you can – and then use these MRs to
create follow-up test cases (pictures of triangles and quadrilaterals based on your existing
pictures of them).
See below for a source test case for an equilateral triangle (surrounded by blue lines at the
top-middle) and eight follow-up test cases, with the metamorphic relation given at the top of
each (e.g. rotate, fill, size).

Note that we can create more test cases with brightness, contrast, background, etc. at
different levels. Of course, we can also do combinations of these MRs to create new MRs,
as well.
And, if we had time, we would also create similar MRs for quadrilaterals and for the
background.
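For illustration, follow-up test-case generation for two of these MRs (rotate and mirror) can be sketched in pure Python, treating an image as a 2-D grid of pixel values. In the exercise itself the transformations are applied to webcam photos instead, but the principle is the same:

```python
# Illustrative follow-up test-case generation for image MRs.
def rotate90(image):
    """Rotate a 2-D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def mirror(image):
    """Mirror a 2-D pixel grid left-to-right."""
    return [row[::-1] for row in image]

source = [[0, 1],
          [1, 1]]   # a tiny toy "shape" pattern

# Each MR turns the source test case into a follow-up test case whose
# expected classification is the same as the source's.
follow_ups = {"rotate": rotate90(source), "mirror": mirror(source)}
for relation, image in follow_ups.items():
    print(relation, image)
```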

16 Exercise 16 – Exploratory Testing - TensorFlow Playground
TensorFlow Playground is an interactive in-browser visualization of neural networks from
Google. It can simulate small neural networks in real time, in your browser, and lets you see
the results.
TensorFlow Playground can be found at https://playground.tensorflow.org/.
In this exercise in exploratory testing, you will have the opportunity to create neural
networks, test these networks, and at the same time test the playground. At the end of the
exploratory testing session, you should report back on what you have found – any defects,
any areas for improvement in quality, and any changes in functionality you feel would
improve the TensorFlow Playground.
You will see that it is possible to create a simple perceptron (as we did with Excel), and also
deep neural nets.
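The perceptron mentioned above can be sketched in a few lines of pure Python. This mirrors the zero-hidden-layer model in the Playground; the toy data (an AND-style, linearly separable set) and the hyperparameter values are illustrative:

```python
# A minimal perceptron: two inputs, a bias, and the classic update rule.
def train_perceptron(examples, epochs=20, learning_rate=0.1):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), label in examples:
            prediction = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = label - prediction
            w[0] += learning_rate * error * x1
            w[1] += learning_rate * error * x2
            b += learning_rate * error
    return w, b

# Linearly separable toy data: class 1 only when both inputs are 1.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

print([predict(x1, x2) for (x1, x2), _ in data])  # [0, 0, 0, 1]
```

As in the Playground, an overly high learning rate or an unlucky starting state can stop such a model from converging, which is exactly the behaviour worth probing during exploratory testing.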
Here is a screenshot of a simple perceptron (zero hidden layers) created using the
TensorFlow Playground:

Here are two screenshots showing an example of a simple neural net where we can see that
the use of a high learning rate has the potential to cause the neural net to fail (depending on
the starting point and data).

See one hidden layer above and note the test/training loss of 0.000.

The same one hidden layer neural network (all parameters the same), but the result is
completely different (due to a slightly different starting state and the high learning rate). In
this case the neural net classifies inputs the opposite of what is wanted (note the yellow
dots are in the blue space, and vice versa).
The following more complex neural net is an attempt to classify the spiral dataset.

Note that better solutions are possible, but the test loss here is only 0.005.

17 Exercise 17 – Selecting Test Techniques
For each of the following mini-scenarios (listed in section 17.1), identify one or more test
techniques you believe would be a suitable choice for this scenario, and note any
assumptions you made.
Be prepared to explain your choice of test technique(s) and your assumptions.
Suggested techniques for each mini-scenario are provided in section 17.2.

17.1 Mini-Scenarios - Selecting Test Techniques


1. A system with the same core functionality is available, developed by a third-party

2. Does the new version of a web-based sales system make more sales?

3. Was the latest version of a spam filtering system attacked during its training?

4. We only have a few trusted test cases, and we have inexperienced testers who are
familiar with the application

5. Our software is used to control self-driving trucks, but we are aware that our
overseas rivals do not want us to succeed

6. There is a worry that the train control system may not handle conflicting inputs from
several sensors satisfactorily

7. Our confidence in the training data quality is low, but we do have testers
experienced in the boat insurance business

8. We have developed a new deep neural net to support medical procedures that
provides excellent functional performance but want to improve our confidence in it

9. We have several teams producing ML models, but they don’t all perform the same
verification tasks to ensure quality

10. We are new to ML, and would like to know that our testing is aligned with the
testing of experienced model developers

11. We need something that guides our testers when they are asked to take a quick look
at the data used in the ML workflow

12. We have a tester who has seen several AI projects that have had problems with bias
and ethics – and we are worried about making the same mistakes ourselves

13. We have a test automation framework, but checking the test results from our AI
bioinformatics systems is very expensive with conventional tests

14. We are testing a self-learning system and we want to be sure that the system’s
updates are sensible



15. We are worried that updates to the system introduced defects in the unchanged
core functionality

16. We want to check that the new innovative ML model that has been developed is
broadly working as we would expect

17. We want to check that the replacement AI-based system provides the same basic
functions provided by the previous conventional system

18. We are testing an automated plant-feeding system that considers multiple factors,
such as weather features, water levels, plant type, growth stage, etc.

19. We believe that the public dataset we used for training may have been attacked by
someone adding random data examples

20. We have received a warning that the third-party supplier of our training data may
not be following the agreed data security practices

21. Our classifier provides similar functionality to a classifier which has reported
problems with inputs close to the classification boundary

17.2 Solutions - Selecting Test Techniques


These are the mini-scenarios, along with the suggested test technique(s):
1. A system with the same core functionality is available, developed by a third-party
• Back-to-back testing can be used to test the core functionality using this
third-party system as a test oracle.
• Existing test cases or even randomly-generated test inputs and automated
testing can be used.
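The idea can be sketched as follows (both median implementations and the seeded defect are invented for illustration): the third-party implementation acts as the test oracle, and randomly generated inputs drive the comparison.

```python
import random

def reference_median(xs):
    """Trusted third-party implementation, used as the test oracle."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def new_median(xs):
    """System under test (here deliberately buggy for even-length inputs)."""
    s = sorted(xs)
    return s[len(s) // 2]

random.seed(0)
failures = []
for _ in range(100):               # random inputs; the oracle judges each result
    xs = [random.randint(0, 99) for _ in range(random.randint(1, 10))]
    if new_median(xs) != reference_median(xs):
        failures.append(xs)
print(f"{len(failures)} of 100 back-to-back tests disagreed")
```

Any disagreement is a candidate defect in one of the two systems, to be analysed further; no manually written expected results are needed.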

2. Does the new version of a web-based sales system make more sales?
• A/B Testing can be used to see if the new version achieves better sales by
comparing the results from the two versions using statistical analysis and
splitting site visitors so that some visit the old version and some visit the new
version
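A minimal sketch of the statistical analysis behind such an A/B comparison, using a one-sided two-proportion z-test (the visitor and conversion counts below are invented for illustration):

```python
from math import sqrt, erf

def ab_test(conv_a, n_a, conv_b, n_b):
    """One-sided two-proportion z-test: is B's conversion rate higher than A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))   # P(Z > z) for standard normal
    return z, p_value

# invented traffic split: 5,000 visitors each; A converts 400, B converts 480
z, p = ab_test(400, 5000, 480, 5000)
print(f"z = {z:.2f}, one-sided p-value = {p:.4f}")
```

A small p-value (e.g. below 0.05) suggests the new version genuinely makes more sales, rather than the difference being random variation.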

3. Was the latest version of a spam filtering system attacked during its training?
• A/B Testing can be used to see if the new version provides results that are
statistically significantly different from those of the current system for the
same set of emails.
• If there is a difference it may be due to a data poisoning attack using the
training data.



4. We only have a few trusted test cases, and we have inexperienced testers who are
familiar with the application
• Metamorphic testing may be appropriate as we can generate many follow-up
tests from a few trusted source test cases.
• Inexperienced testers who are familiar with the application should be able to
generate metamorphic relations to do this, even with only a small amount of
training.
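A minimal sketch of the idea, using sin(π − x) = sin(x) as the metamorphic relation (the system under test here is simply math.sin, standing in for a real application with no easy oracle):

```python
import math
import random

def predict(x):
    """Stand-in for the system under test (here simply math.sin)."""
    return math.sin(x)

# Metamorphic relation: sin(pi - x) == sin(x). No oracle for individual
# outputs is needed; only the relation between test pairs is checked.
random.seed(1)
for _ in range(50):
    x = random.uniform(-10.0, 10.0)     # source test case
    follow_up = math.pi - x             # follow-up test case
    assert math.isclose(predict(x), predict(follow_up), abs_tol=1e-9)
print("all 50 metamorphic checks passed")
```

Each trusted source test case generates a follow-up test for free, which is why only a few trusted test cases are needed to start with.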

5. Our software is used to control self-driving trucks, but we are aware that our
overseas rivals do not want us to succeed
• Adversarial testing may be appropriate in situations where attacks against
mission-critical systems may occur.

6. There is a worry that the train control system may not handle conflicting inputs from
several sensors satisfactorily
• Pairwise testing can be used to ensure all pairs of values from different
sensors can be tested in a reasonable timescale when all combinations would
be infeasible.
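The saving can be illustrated with a simple greedy all-pairs generator (the sensor factors and values are invented; real projects would typically use a dedicated tool such as Pairwiser, used elsewhere in this guide):

```python
from itertools import combinations, product

def pairwise_suite(params):
    """Greedy all-pairs test suite for a dict of {factor: [values]}."""
    names = list(params)
    # every (factor, value) pair combination that must be covered
    uncovered = {((a, va), (b, vb))
                 for a, b in combinations(names, 2)
                 for va in params[a] for vb in params[b]}
    suite = []
    while uncovered:
        best, best_gain = None, -1
        for candidate in product(*params.values()):
            row = dict(zip(names, candidate))
            gain = sum(1 for (a, va), (b, vb) in uncovered
                       if row[a] == va and row[b] == vb)
            if gain > best_gain:            # keep the most useful candidate
                best, best_gain = row, gain
        suite.append(best)
        uncovered = {((a, va), (b, vb)) for (a, va), (b, vb) in uncovered
                     if best[a] != va or best[b] != vb}
    return suite

sensors = {"speed": ["low", "high"], "door": ["open", "closed"],
           "signal": ["red", "amber", "green"]}
suite = pairwise_suite(sensors)
print(len(suite), "tests instead of", 2 * 2 * 3)
```

Every pair of values from different sensors appears in at least one test, while the suite stays well below the full cartesian product; the gap grows dramatically as more factors are added.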

7. Our confidence in the training data quality is low, but we do have testers
experienced in the boat insurance business
• Experience-based testing, using techniques such as EDA (exploratory data
analysis) may be able to identify whether we have reason to be worried
about the data quality.

8. We have developed a new deep neural net to support medical procedures that
provides excellent functional performance but want to improve our confidence in it
• Using a neural network coverage criterion to drive the generation of
additional test cases to achieve a higher level of coverage should increase our
confidence in the neural net.



9. We have several teams producing ML models, but they don’t all perform the same
verification tasks to ensure quality
• Using a checklist, such as the Google “ML Test Checklist” would help to
ensure all ML models had gone through the same testing steps.

10. We are new to ML, and would like to know that our testing is aligned with the
testing of experienced model developers
• Using a checklist, such as the Google “ML Test Checklist” would help inform a
new entrant to ML on what is a ‘good’ set of testing activities to perform
when building an ML model.

11. We need something that guides our testers when they are asked to take a quick look
at the data used in the ML workflow
• The use of exploratory testing is suitable when time is short and we can
define a data tour that focuses the exploratory testing sessions on areas
specifically related to the data.

12. We have a tester who has seen several AI projects that have had problems with bias
and ethics – and we are worried about making the same mistakes ourselves
• Error guessing using the experience of this tester may be useful in avoiding
unfair bias and ethical problems in our systems.

13. We have a test automation framework, but checking the test results from our AI
bioinformatics systems is very expensive with conventional tests
• Incorporating metamorphic testing into the existing test automation
framework should be possible and this will allow many follow-up tests to be
generated.

14. We are testing a self-learning system and we want to be sure that the system’s
updates are sensible
• A/B testing could be implemented automatically by the self-learning system.
This would involve any changes made by the system causing automated A/B
testing to be performed. This A/B testing would need to check that core
system behaviour was not made worse by the change by comparing the new
and current versions.



15. We are worried that updates to the system introduced defects in the unchanged
core functionality
• Back-to-Back testing can be used (with the updated and previous version) to
identify defects introduced to the core functionality (assuming it is supposed
to be unchanged).
• Note that A/B testing is not appropriate here as we are not comparing
measurable performance statistically but are identifying defects.

16. We want to check that the new innovative ML model that has been developed is
broadly working as we would expect
• Back-to-Back testing can be used (with the innovative model and a simple ML
model that is easy to understand being compared).
• This will give us a good idea of whether the new model is working along the
right lines (results would not be expected to be exactly the same, but similar).
• Note that A/B testing is not appropriate here as we are not comparing
measurable performance statistically but are identifying differences in
individual test results.

17. We want to check that the replacement AI-based system provides the same basic
functions provided by the previous conventional system
• Back-to-Back testing can be used (with the replacement AI-based system and
the previous conventional system) using regression tests focused on the basic
functions that are supposed to be the same.

18. We are testing an automated plant-feeding system that considers multiple factors,
such as weather features, water levels, plant type, growth stage, etc.
• Pairwise testing may be appropriate as testing all combinations of values
for the factors is not feasible due to combinatorial explosion.

19. We believe that the public dataset we used for training may have been attacked by
someone adding random data examples
• This sounds like a potential data poisoning attack.
• Exploratory data analysis (EDA) may be an appropriate response to identify if
there are now noticeable problems with the dataset, such as outliers.
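A minimal sketch of one such EDA check, flagging outliers with the interquartile-range rule (similar in spirit to Weka's InterquartileRange filter, though the exact factors may differ; the feature values below are invented):

```python
def iqr_outliers(values, factor=1.5):
    """Return values outside [Q1 - factor*IQR, Q3 + factor*IQR] (Tukey's rule)."""
    s = sorted(values)

    def quantile(q):                     # linear interpolation between ranks
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quantile(0.25), quantile(0.75)
    spread = factor * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

# invented feature column where two poisoned instances stand out numerically
feature = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8, 5.3, 97.0, -40.0]
print(iqr_outliers(feature))
```

Instances flagged this way are candidates for review: they may be legitimate extreme values, data-entry errors, or deliberately injected poisoned examples.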



20. We have received a warning that the third-party supplier of our training data may
not be following the agreed data security practices
• A review of the processes used by the third-party supplier may be required to
determine that the probability of a data poisoning attack on the training data
is as low as required.

21. Our classifier provides similar functionality to a classifier which has reported
problems with inputs close to the classification boundary
• Inputs close to the boundary may correspond to small perturbations that are
adversarial examples.
• Due to transferability of adversarial examples, this suggests that we should
perform adversarial testing of our classifier.



18 Exercise 18 – Bug Prediction
You will build a bug prediction system for use on a new project that will classify new
software components as either likely to include a defect, or not likely to include a defect,
based on a number of software measures. The idea is that by knowing if a component is
likely to contain defects, then the amount of testing effort can be more efficiently used.
Use the skills you have learnt on the course to build the best bug prediction system you can.
Do not forget what you learnt about data preparation: just because the data comes from
NASA does not mean it is perfect.
The relevant (20) software measures are:
1. loc McCabe's line count of code
2. v(g) McCabe "cyclomatic complexity"
3. ev(g) McCabe "essential complexity"
4. iv(g) McCabe "design complexity"
5. n Halstead total operators + operands
6. v Halstead "volume"
7. l Halstead "program length"
8. d Halstead "difficulty"
9. i Halstead "intelligence"
10. e Halstead "effort"
11. b Halstead
12. t Halstead's time estimator
13. lOCode Halstead's line count
14. lOComment Halstead's count of lines of comments
15. lOBlank Halstead's count of blank lines
16. uniq_Op unique operators
17. uniq_Opnd unique operands
18. total_Op total operators
19. total_Opnd total operands
20. branchCount branch count of the flow graph
You have been provided with three labelled datasets from projects similar to the new
project which the bug prediction system will be used on (NASA bug data 1, NASA bug data 2,
and NASA bug data 3). Not all of the instances in the datasets are complete (they may have



some missing or invalid values), not all of the instances are unique (there are duplicates),
and some of the instances may be mislabelled.
The three datasets are in .arff format.

18.1 Solution - Bug Prediction


The following are a set of high-level steps that will create a workable bug prediction system:

1. Combine the three provided datasets into a single dataset. Remember that you only
need the preliminary information ahead of the data once in a single .arff dataset file.

2. Remove any duplicates from the combined dataset.

3. Replace any missing values in the dataset.

4. Remove invalid instances (e.g. where lines of code is non-integer).

5. Randomize the combined dataset.

6. Split the combined dataset into a training dataset (90%) and a test dataset (10%).

7. Remove any outliers from the training dataset (e.g. using the InterquartileRange
filter).

8. Fix the imbalance in the training dataset (between the ‘defects = true’ and ‘defects =
false’ classes).

a. Use SMOTE to oversample the minority ‘defects = true’ class until the true
and false classes for defects are about equal in size.

b. Alternatively, try undersampling the majority class (‘defects = false’).

9. Randomize again so that all oversampled true values are not together.

10. Create a classifier using the Random Forest with 300 trees and use the test dataset
as the ‘Supplied test set’.
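The data-preparation steps above can be sketched in pure Python on a tiny invented dataset (the real exercise uses Weka and the NASA .arff files; the attribute names and values here are illustrative only, and the minority class is oversampled by simple duplication rather than SMOTE):

```python
import random
import statistics

# tiny invented stand-in for the combined NASA dataset
data = [{"loc": 10, "v(g)": 2, "defects": False},
        {"loc": 10, "v(g)": 2, "defects": False},     # duplicate instance
        {"loc": 250, "v(g)": None, "defects": True},  # missing value
        {"loc": 40, "v(g)": 5, "defects": False},
        {"loc": 300, "v(g)": 31, "defects": True},
        {"loc": 25, "v(g)": 3, "defects": False}]

# step 2: remove duplicates (keep the first occurrence)
seen, unique = set(), []
for row in data:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique.append(row)

# step 3: replace missing values with the attribute mean
for attr in ("loc", "v(g)"):
    mean = statistics.mean(r[attr] for r in unique if r[attr] is not None)
    for r in unique:
        if r[attr] is None:
            r[attr] = mean

# steps 5-6: randomize, then split 90% training / 10% test
random.seed(42)
random.shuffle(unique)
cut = int(len(unique) * 0.9)
train, test = unique[:cut], unique[cut:]

# step 8: rebalance training data by oversampling the minority class
minority = [r for r in train if r["defects"]]
majority = [r for r in train if not r["defects"]]
while minority and len(minority) < len(majority):
    minority.append(random.choice(minority))
train = majority + minority
random.shuffle(train)       # step 9: randomize again after oversampling
print(len(train), "training instances,", len(test), "test instance(s)")
```

Note that the outlier removal and classifier-building steps are omitted here; in Weka they correspond to the InterquartileRange filter and the Random Forest classifier respectively.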



19 Exercise 19 – Discussion - Which test activities are least likely to
be replaced by AI?
The following (potentially controversial) article covers the question (and surrounding areas)
and can be used as a basis for the discussion or as a starting point for students:

In many industry sectors the use of AI in place of humans is causing concern.


Some software testers are also concerned, and we have seen in this training course that AI
can potentially be used in several areas of software testing, including:
• Specification Review
• Specification-Based Script Generation
• Test Strategy Assistant
• Usage Profiling
• Defect Management
• Crowd Testing
• Web App Spidering
• Automated Exploratory Testing
• Analysis of Defect Reports
• Bug Prediction
• Test Case Generation
• Regression Test Optimization
• User Interface Testing
And, as AI technology advances, the list of areas will expand. As with test automation,
arguments based on consistency and speed (and how human testers can lack these) will
support more use of AI-based systems. After all, they are computer-based and do not get
tired or make simple mistakes as human testers do. However, the tasks assigned to
AI-based systems currently tend to be repetitive and fairly simplistic.
A persuasive argument is made that AI-based software testing typically follows the test
automation route. That is, those activities that are most easily automated are also those
activities that are more likely to be addressed by AI. This seems logical if you look at what’s
involved in the majority of automated testing.
Automated testing typically requires test cases to already be available for execution. These
can then be incorporated into a regression test suite that can be usefully executed time and
again. Over the years, test automation frameworks have moved from capture/playback with



their fragile test scripts, to data-driven testing that still required fragile scripts, but could
manage data changes more easily, to keyword-driven testing that incorporates data and
keywords to reduce the fragility of the scripts to small changes in the system.
AI-based technology can be applied to this form of testing very effectively. As we have seen
in this course, we can now use visual testing backed up by AI-based image recognition to
remove the brittleness inherent in dealing with changes to the user interface.
And, for regression testing, we can use AI-based technology to monitor changes to the
system and identify the associated tests that need to be updated.
However, we still need to create these tests in the first place. AI can monitor testers
performing exploratory testing and create test cases based on what they do, but can the AI
create tests that are effective at finding defects? Probably not.
First of all, any testing should be risk-based, but the complexities of identifying risk areas
and determining the level of risk associated with each of them are beyond current AI
capabilities. Granted, AI can look at a system and generate tests that will provide a
reasonable level of coverage (note, though, that current AI is not normally capable of
achieving full decision coverage, typically reaching closer to 50%). AI can also use
knowledge of the user
interface and attempt to cover all visible elements. By also using knowledge of the allowed
inputs, further coverage can be achieved.
However, achieving these forms of coverage is not a substitute for human-performed risk-
based testing. For instance, the AI will not know which areas of the system are more
important to the users (and the client) and so should be tested in more depth. You could
argue that the AI could monitor system use and know which areas were the most ‘popular’.
However, this requires the system to be deployed and used by the actual user base. To get
to that stage, we want to already have reasonable confidence, achieved through testing.
And, even if that hurdle can be overcome, just knowing which parts of a system are most
popular does not make those parts the most important. We might use the monitoring
functions on a nuclear reactor every minute, but we definitely want equal, if not higher,
confidence in the subsystem that shuts it down in an emergency.
If using risk-based testing to identify what to test is problematic for AI, there may be a more
fundamental difficulty. And that is, as has been covered in this course in some depth, the
test oracle problem. The AI may determine that part of a system needs to be tested, and
can generate the test inputs that will cover that part, but how does it decide if a test has
passed or failed? Where we have complete specifications, we do not even need AI – simpler
test automation technology can use formal specifications to determine if a test passes or
fails. But, in practice, we never have these. Sometimes, very rarely, we may have formal
specifications for the functionality and time behaviour requirements, but complete
specifications do not stop there. What about usability, robustness and security, to name but
three other quality characteristics?
In practice, we never get complete specifications. When we do, systems can be built
automatically. Then, our testing effort will move to ensuring that the specifications are truly



complete and that the delivered system plays nicely with other systems that may not all be
as perfect as it is. However, until that happens (if ever), humans will still be required to use
their judgment about what constitutes a system pass and fail.
You may have noted that we have been talking predominantly about system testing so far.
This is because unit/component testing is concerned with far smaller and less complex test
items. The testing is also more constrained as, typically, fewer quality characteristics are
tested at this level (normally just functionality). For unit/component testing, it is far more
likely that AI-based testing could be used to perform the majority of the required activities.
For a moment, let’s return specifically to the point about AI-based testing following the test
automation route. One point worth considering by those worried about losing their testing
job to AI, is to consider what happened to all those manual testing jobs that were going to
be automated away by test tools. Over the years we have seen a constant refrain
(especially from those who sell testing tools) that test automation is the way forward and
that manual testing will be consigned to the past. However, we are still waiting for it to
overtake manual testing as the main means of executing tests. In 2018, most organizations
surveyed stated that they were automating some tests (72%). But, this means that 28%
were not automating any tests (so these organizations were doing 100% manual testing).
And, of those organizations that did some automated testing, most of them (76%) were
automating less than half of their testing [REF1].
Another way of looking at the use of AI-based systems for testing is in terms of the
information used by the systems. An AI-based system is limited by the data we provide it or
allow it to access. As such, we can consider an AI-based system to be a closed system with
defined boundaries. Even a smart city system will normally only be provided with the data it
needs to perform its required functions, such as traffic data (e.g. networks, flows), utilities
data (e.g. water, power), health data, etc. It is unlikely that such a system will be provided
with data on local politics, global warming, and trending social media topics. Similarly, an
AI-based testing system will only be provided with the data considered to be relevant to its
required functionality. While this will allow it to perform a number of useful functions, it
will naturally be limited in how it interprets some test results where a wider knowledge-
base is required.
As humans, we are not part of a closed system (unless in a strict prison regime) and are
normally encouraged to take a wider perspective, which allows us to understand and
imagine what others would feel in a given situation. This allows human testers (when
working at the system level) to better interpret test results and decide reasonable pass/fail
criteria for complex situations where stakeholders are themselves influenced by external
factors unavailable to a computer system.
Similarly, an AI-based testing system, working in its limited closed system context, may find
it difficult to identify those edge cases that are outside the specified boundary for the
system (imagine a flight envelope for a drone). Whereas a human tester given such a



situation may be inclined to see what happens when one or two of the boundaries are
exceeded.
Two of the main areas that are discussed about the future of AI-based systems are unfair
bias and ethics. Any tester faced with testing an AI-based system needs to be aware of
these factors (which is why they have been covered on this course). However, they can be
extremely complex. What may be considered by one person to be a fair system, may be
considered by another to be unfair. And, similarly, a system considered to be aligned with
people’s ethical views in one year, may be misaligned a very short time later.
It is very difficult for many people to keep up with these changing and different views,
especially when systems are used (and bought) by disparate groups and in different regions
with different cultures. Although by using AI-based testing systems we will remove some
human bias from the testing (test data may still be biased), we cannot expect these systems
to understand the nuances of what is considered unacceptable by some users, especially
when many humans struggle to know this (and when it can change on what seems like a
yearly basis). Just re-training such systems to avoid problems of historic bias caused by
using out-of-date training data might, in itself, create a new category of software testing
job.
The following skill areas are some of the least likely to be taken over by an AI-based system:
• Emotional intelligence and empathy
• Negotiation and conflict resolution
• Creative, intuitive and strategic thinking
• Problem solving and planning
Advances in AI are actually more likely to create more opportunities for testers as new AI-
based technologies will not only require new ways of testing to be invented but will also
create more software to be tested.
In conclusion, in the short to medium term it seems unlikely that AI will adequately address
the following areas:
• Issues with interpreting test results
• Identifying the most important areas to test using risk-based testing, with its
constituent parts of risk identification, analysis and mitigation
• Writing the original test cases that are later used for regression testing
• Performing exploratory testing, especially in the areas of deciding what is
most important to test, interpreting test results, and deciding not only if
those tests have passed, but also what they tell us about where to test next
• Identifying unfair bias and ethical issues in a consistent manner, especially
where these are influenced by external factors



• Acting in a culturally sensitive manner
[REF1] https://www.infoworld.com/article/3286529/test-automation-comes-of-age.html



20 Appendix - Pre-Exercise Preparation
20.1 Introduction
This appendix describes the set-up activities required to ensure students' PCs are in a state
ready for them to carry out the various exercises associated with the CT-AI course.
Ideally this preparation should be carried out ahead of the course, so that it does not cause
unnecessary delays during the course (e.g. to download and install applications or set up
accounts with tool vendors).
This preparation may be performed by the students if they are using their own PCs or may
be performed by the training provider if they are setting up their own machines to support
the course.

20.2 Web Access


Ensure access to the Web is available for downloads and accessing websites.

20.3 Downloading and Installing Weka


The latest stable version of Weka should be downloaded.
Links to downloads for different platforms (Windows, Mac OS and Linux) are available at:
https://waikato.github.io/weka-wiki/downloading_weka/
The Weka Manual (over 300 pages) is included when you download Weka. If you have a
problem using Weka, or want more details on using it, you can use this manual.
Obviously, there is also a lot of information about Weka online; the first place to look
should be the Weka Wiki at https://waikato.github.io/weka-wiki/

20.4 Install the Weka Training Data


Load the provided files that accompany this guide into a suitable folder on your PC:
• course.arff
• glass.arff
• iris.arff
• NASA bug data 1.arff
• NASA bug data 2.arff
• NASA bug data 3.arff
Make a note of the folder location.

20.5 MS Excel
Ensure MS Excel is available.
Load the Perceptron spreadsheet onto your PC and make a note of its location.



20.6 Webcam
Check that your PC webcam is working.

20.7 Pairwiser
The people at Inductive have agreed that we can use their free online pairwise testing tool,
Pairwiser, for the pairwise testing exercise (exercise 13).
Go to the Inductive website at https://inductive.no/
Click on the Pairwiser tool link, as shown.

We want to use the free online version, so follow that link:



We don’t need to buy it (as it is $0), so simply click on the ‘Add to cart’ button.

When presented with the cart, click on the ‘Proceed to checkout’ button.



Fill in the provided form.
Keep a record of your password.
Tick to say you have read the ‘TERMS AND CONDITIONS’.
Then click on ‘Place order’.



This will take you to an order confirmation:



And you will now have a free online Pairwiser account.



21 Appendix - Weka Filters
This is an alphabetical list of the Weka filters (and their categorization in Weka) used in this
guide (or that may be useful in the exercises):

• InterquartileRange (unsupervised attribute)


Used to indicate outliers and extreme values (when a feature is provided
numerically) by adding two corresponding attributes to the dataset. Interquartile
ranges are used to define what constitutes an outlier or an extreme value. By
default, outliers are shown separately from extreme values, but they can be listed
together.
Use the ‘detection per attribute’ parameter if you want to see which attributes are
causing an instance to be flagged as an outlier/extreme value.

• Obfuscate (unsupervised attribute)


Replaces the attribute names with meaningless variable names.

• Randomize (unsupervised instance)


Shuffles the order of instances in a dataset.

• Remove (unsupervised attribute)


Removes a range of attributes from the dataset.

• RemoveDuplicates (unsupervised instance)


Removes all duplicates from a dataset.

• RemoveMisclassified (unsupervised instance)


Removes instances that are incorrectly classified, such as outliers.
Requires a classifier to be selected as a parameter.

• RemovePercentage (unsupervised instance)


Removes a specified percentage of the dataset.
Use the ‘percentage’ parameter to set the percentage to select.



Set the ‘invertSelection’ to ‘True’ to select the opposite set of instances than when
set to ‘False’.

• RemoveWithValues (unsupervised instance)


Removes instances where attributes are set to a specified value. Use the
‘attributeIndex’ parameter to point at the relevant attribute and use the
‘nominalIndices’ parameter to set the relevant value.

• ReplaceMissingValues (unsupervised attribute)


Replaces all missing values with the mode (for nominal attributes) or mean (for
numeric attributes).

• Resample (supervised instance)


Creates a random subsample of the dataset (used for undersampling).
biasToUniformClass = 1.0 (uniform distribution)

• SMOTE (supervised instance)


The Synthetic Minority Oversampling Technique creates additional synthetic
instances for a minority class (used for oversampling).
Use the ‘percentage’ parameter to specify how many more instances to create of the
minority class.
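A simplified sketch of the interpolation idea behind SMOTE (not Weka's implementation; the minority-class points are invented): each synthetic instance lies on the line segment between a minority instance and one of its nearest minority neighbours.

```python
import random

def smote_like(minority, n_new, k=2):
    """Create n_new synthetic instances by interpolating between a minority
    instance and one of its k nearest minority neighbours (simplified SMOTE)."""
    synthetic = []
    for _ in range(n_new):
        base = random.choice(minority)
        # k nearest neighbours of base within the minority class
        neighbours = sorted((p for p in minority if p != base),
                            key=lambda p: sum((a - b) ** 2
                                              for a, b in zip(base, p)))[:k]
        neighbour = random.choice(neighbours)
        gap = random.random()            # position along the line segment
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(base, neighbour)))
    return synthetic

# invented minority-class instances (two numeric features each)
minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
random.seed(7)                           # deterministic for the example
new_points = smote_like(minority, n_new=4)
print(new_points)
```

Because the synthetic points interpolate between real minority instances, they enlarge the minority class without merely duplicating existing rows.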

• SpreadSubsample (supervised instance)


Creates a random subsample of the dataset (used for undersampling).
DistributionSpread = 1 (uniform distribution)

