Exercise Guide v2.0
for
Certified Tester
AI Testing (CT-AI) Course
Version 2.0
The GUI Chooser consists of five ‘Applications’ buttons—one for each of the five major
Weka applications—and four menus on the top. The buttons can be used to start the
following applications:
• Explorer An environment for exploring data with Weka (most of this guide deals
with this application in more detail).
• Experimenter An environment for performing experiments and conducting
statistical tests between learning schemes (we will also have a go at using this).
• KnowledgeFlow This environment supports essentially the same functions as the
Explorer but with a drag-and-drop interface. One advantage is that it supports
incremental learning.
• Workbench The workbench is an environment that combines all of the GUI
interfaces into a single interface. It can be useful when you need to frequently
jump between different interfaces.
• SimpleCLI Provides a simple command-line interface that allows direct execution of
Weka commands for operating systems that do not provide their own command
line interface.
Use the top-left ‘Open file…’ button to open the Weka data file: course.arff
This data file includes 15 instances (an instance is a set of connected attribute values), each
made up of values for five attributes. You can consider an instance to be an example of what
happened in the past. We will use this data about what happened in the past to build a
classifier that predicts what should happen in the future. Or, put another way, whether the
classifier would recommend someone should take the CT-AI course (or not) based on their
experience in testing and AI, whether they have management support, and whether they
have passed the ISTQB Foundation course. Obviously, when there are more instances
(examples) available to use for training, then we would expect the classifier we build based
on this more extensive training data to be more accurate.
@data
low_test_experience,no_AI_experience,no_management_support,foundation,TAI_course
low_test_experience,some_AI_experience,management_support,no_foundation,no_TAI_course
low_test_experience,lots_of_AI_experience,management_support,no_foundation,no_TAI_course
low_test_experience,lots_of_AI_experience,management_support,foundation,no_TAI_course
medium_test_experience,no_AI_experience,management_support,foundation,TAI_course
medium_test_experience,no_AI_experience,no_management_support,foundation,TAI_course
medium_test_experience,some_AI_experience,no_management_support,foundation,TAI_course
medium_test_experience,lots_of_AI_experience,no_management_support,no_foundation,no_TAI_course
medium_test_experience,lots_of_AI_experience,management_support,no_foundation,no_TAI_course
high_test_experience,no_AI_experience,no_management_support,foundation,TAI_course
high_test_experience,some_AI_experience,management_support,no_foundation,no_TAI_course
high_test_experience,some_AI_experience,no_management_support,foundation,TAI_course
high_test_experience,some_AI_experience,no_management_support,no_foundation,no_TAI_course
high_test_experience,lots_of_AI_experience,no_management_support,foundation,no_TAI_course
The 15 instances are listed under @data, with each instance showing values for the
attributes listed above (in the same order). Therefore, the first entry tells us that for this
example situation the candidate had low test experience, no AI experience, no management
support, did already have the Foundation certificate, and did take the TAI course.
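For reference, the preliminary part of an .arff file (everything ahead of @data) declares the relation name and the attributes with their allowed values. For course.arff it will look something like the sketch below – the relation name and some attribute names here are assumptions based on the descriptions in this guide, while the nominal values are taken from the data itself:

@relation course
@attribute test_experience {low_test_experience, medium_test_experience, high_test_experience}
@attribute AI_experience {no_AI_experience, some_AI_experience, lots_of_AI_experience}
@attribute management_support {no_management_support, management_support}
@attribute foundation {no_foundation, foundation}
@attribute TAI_course {TAI_course, no_TAI_course}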
To see or edit a dataset in Weka (as shown below), simply select the ‘Edit…’ button near the
top-right of the Preprocess window.
If we close the ‘Viewer’ and go back to looking at ‘Weka Explorer’, we can see that the
window in the bottom right shows the values for the test_experience attribute as a bar
chart (test_experience is shown as it is the first attribute). When we hover over each of the bars, we can see the attribute value that the bar represents.
The red and blue colours on the bars show us the proportion of these values that contribute
to the decision about whether the CT-AI course is recommended or not. So, we can see that
for the five instances (examples) where test experience is low, only one (shaded blue)
resulted in the TAI course being taken.
We can see the same information for any of the attributes by selecting (left-clicking) the
attribute of interest in the ‘Attributes’ panel. Alternatively, we can click on the ‘Visualize All’
button (just above the red and blue bars) and see the information for all five attributes at
the same time. Note that the TAI_course attribute (the attribute that tells us the result –
i.e. whether we take the course, or not) shows that there are 6 examples that recommend
taking the course (in blue) and 9 examples that recommend not to (in red).
It is important to remember that the model code above is NOT written by a person but is
generated by the algorithm. This is the point of ‘machine learning’ – the machine (algorithm
on a computer) builds a model by learning from the training data.
At this point you may be questioning the value of an algorithm that builds such simple
models (and, perhaps, the validity of using the same data to both train and evaluate it, but
we will cover that later). However, ZeroR can be useful as a benchmark to compare its
results with those from more sophisticated algorithms.
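As a concrete example, ZeroR simply predicts the most frequently occurring class value and ignores all the other attributes; for the course dataset that is no_TAI_course (9 of the 15 instances), so ZeroR is correct about 60% of the time when evaluated on the training data.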
Actually, building models from such a small dataset is quite challenging, but we can build a
deep neural net that will outperform the ZeroR algorithm (with this dataset). To do this,
left-click on the ‘Choose’ button in the ‘Classifier’ panel towards the top-left of the screen.
From the presented list, select ‘functions’ and then ‘MultilayerPerceptron’. As before, left-
click on the ‘Start’ button and check the results of building and evaluating a multi-layer
perceptron (otherwise known as a deep neural net). This time the accuracy should be
showing as about 93.3%.
This will bring up a graphical interface allowing you to see what the neural network looks
like. You may need to stretch the window horizontally to more easily see all the text. As
you can see, there are 8 inputs, 5 neurons, and two outputs (so, not a very complex neural
network).
By default, the number of hidden-layer neurons is set to the average of the number of input
and output neurons ((8+2)/2=5). You can add new neurons to layers, add new layers of
neurons, and change the learning rate and momentum parameters (but leave these as they
are for now).
2.2 Chatbot
You have been tasked with implementing a chatbot that attempts to solve a specific
problem for a user. These chatbots can help people with tasks such as booking a ticket or
finding a reservation, and are often referred to as goal-oriented chatbots. You have been
provided with a natural language understanding (NLU) component and a natural language
generator (NLG) component, but you don’t have access to a set of potential chatbot
conversations, and instead have decided to train your chatbot by creating trial-and-error
conversations between two prototype chatbots. You will determine the success of the
resultant chatbot by measuring conversation attributes such as coherence, informativity,
and ease of answering in the chatbot dialogues.
Left-click on the name of the selected filter in the Filter panel to see a short description of
the filter and to modify the filter properties in a GenericObjectEditor.
We want to remove the ‘test_experience’ attribute, which we can see is numbered as ‘1’ in
the ‘Attributes’ panel.
Left-click this button and you can see that ‘test_experience’ has now disappeared from the
list of attributes (there are now only four attributes).
1. Select the ‘Classify’ tab near the top left of the ‘Weka Explorer’ window.
2. Check that the MultilayerPerceptron classifier is selected (select it, if necessary –
it is listed under functions in the classifiers).
3. Left-click on 'MultilayerPerceptron' in the 'Classifier' panel to open the editor
that allows the algorithm parameters to be changed.
There is no similar opportunity to visualize the ZeroR model (it’s too simple), and you have
already seen a visual representation of the multilayer perceptron. You can always see the
underlying model information for any algorithm in the ‘Classifier output’ panel, ahead of the
accuracy, and other performance measures.
3.3 Underfitting
Underfitting is generally caused when we don’t provide sufficient, useful data for the
training. Often it can be difficult to know which attributes contribute to the accuracy of the
algorithm, and which don’t.
Try removing different attributes from the glass dataset to see the effect this has on the
accuracy of the model produced by the J48 algorithm. Remember that:
• each time you remove attributes you can go back and ‘Undo’ the removal
(otherwise you may have to re-load the glass dataset)
• the easiest way to select a few attributes is to first select ‘All’, then click on
those you want to use (so that the ‘tick’ disappears), and then click on
‘Remove’
You will see that some of the instances are now flagged as containing one or more attribute
values that are either outliers and/or extreme values. By default, in Weka outliers are
outside three times the interquartile range, but within six times the interquartile range,
whereas extreme values are even further out from the median value. Note that an instance
can be flagged as having both an outlier and an extreme value as there are 9 input
attributes, each of which can cause the whole instance to be flagged.
To remove the instances with outliers and extreme values, use the ‘RemoveWithValues’
filter twice; once to remove outliers and once to remove extreme values.
For instance, to remove outliers, first choose the 'RemoveWithValues' filter
(weka.filters.unsupervised.instance.RemoveWithValues). Next, click on the filter
name and, in the editor, set the ‘attributeIndex’ parameter to point at the relevant attribute
(‘11’ for ‘Outlier’) and set the ‘nominalIndices’ parameter to ‘last’ to remove those instances
where the ‘Outlier’ attribute is set to ‘yes’ (‘no’ appears as the ‘first’ value as there will be
more instances set to ‘no’).
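To remove the extreme values, repeat the process with the 'attributeIndex' parameter pointing at the 'ExtremeValue' attribute (this should be '12', as it is added immediately after the 'Outlier' attribute) and 'nominalIndices' again set to 'last'.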
3.5 Overfitting
We will now look at a different rule-based algorithm from ZeroR, the OneR algorithm. To
understand how it works we will first apply it to the course dataset. The OneR algorithm
creates rules based on the single (non-target) attribute that it determines is most likely to
give us the correct prediction (where several attributes provide equal levels of accuracy, one
is chosen at random).
Load the course dataset and use ‘Visualize All’ to see the attributes.
You should be left with a dataset of 43 instances. Use the 'Save…' button at the top-right of
the window and save this test dataset with a suitable filename (with 'test' in it).
The 'invertSelection' parameter will ensure that the remaining 80% of the original glass
dataset will be selected - you should be left with a dataset of 171 instances. Use the 'Save…'
button at the top-right of the window and save this training dataset with a suitable filename
(with 'train' in it).
We will now use the two datasets to train and test a new model using the J48 tree classifier.
Open the ‘Classify’ window and, in the ‘Test options’ panel, select the ‘Supplied test set’
option and select your saved test set. Make sure that the target ‘Class’ is set to ‘(Nom)
Type’ (this is the default, so you should not have to do anything).
Select the Weka ‘Experimenter’ and it will open in the ‘Setup’ tab. First, click on the ‘New’
button at the top right of the window. Then add the glass dataset to the ‘Datasets’ panel
(bottom-left) using the ‘Add new…’ button. Next, add the OneR, J48 and
MultilayerPerceptron algorithms to the ‘Algorithms’ panel (bottom-right), again using the
‘Add new…’ button (the different algorithms are selected using the ‘Choose’ button at the
top of the ‘GenericObjectEditor’ – keep all the default settings).
Weka sets default values for J48 parameters for C (the confidence factor, set to 0.25) and M
(number of instances per leaf, set to 2). To ask Weka to try other values we must set some
options in the ‘CVParameters’ field. Click on this field and a ‘GenericArrayEditor’ opens. In
the input field (to the left of the ‘Add’ button), enter C 0.1 0.4 10.0 and click ‘Add’. This will
ask Weka to try 10 values for the confidence factor from 0.1 to 0.4. Now add another set of
options for the number of instances by adding M 1.0 4.0 4.0 and click ‘Add’. This set of
options will ask Weka to try 1 to 4 instances. Close the editor (use the cross in the top right)
and click ‘OK’ at the bottom of the generic object editor.
Now click ‘Start’ (on the left) and wait for Weka to tell you the optimal parameter values for
C and M in the J48 algorithm for the glass dataset. It may take some time – you can tell that
Weka is working by watching the Weka bird in the bottom right of the window (if it's
moving, then Weka is working; if it's sitting, then Weka has finished).
If we check the ‘Classifier output’ panel (near the top after the list of attributes), we can see
that it recommends ‘Classifier Options’ of 1 for M and 0.266666666667 for C.
This is not surprising as we only asked the ‘CVParameterSelection’ algorithm to optimise the
J48 parameters for the glass dataset.
In Weka's confusion matrix, each row corresponds to the actual class (Positive = yes, Negative = no) and each column to the predicted class:

  a  b   <-- classified as
 TP FN |  a = yes (Actual Positive)
 FP TN |  b = no (Actual Negative)
In the ‘Classifier output’ panel in Weka, where it says: ‘===Detailed Accuracy By Class ===’,
only consider the top row of results, and you can read off the precision, recall and F1-Score.
Accuracy, as we have already seen, is given as a percentage to the right of ‘Correctly
Classified Instances’.
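As a reminder, these measures are calculated from the confusion matrix counts as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × Precision × Recall / (Precision + Recall)
Accuracy = (TP + TN) / (TP + TN + FP + FN)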
• Deep-sea treasure finder (to indicate where to start a deep-sea project) – Precision
– we don't want to suggest there is treasure if there isn't
High precision would mean that there would be few times when the finder
suggested there was treasure and none was actually there. High recall would
mean that the treasure finder rarely indicated that there was no treasure
when there actually was treasure present. Given the high costs of running a
deep-sea project to extricate the treasure, then it is probably more important
that we have high precision so that we do not waste resources on projects
that find no treasure.
• Legal evidence finder (to suggest significant evidence for a trial) – Recall – we don’t
want to miss what may be evidence that our opponents do find
High recall would mean that the evidence finder rarely missed evidence when
it was there. High precision would mean that the finder would rarely suggest
there was evidence when it was not relevant. For a trial situation, it is more
important that all the relevant evidence is identified, otherwise we risk losing
the case if the opposing counsel has that evidence; thus, we need high recall
in this scenario.
• Old Master Painting identifier (to identify potential works of art) – Recall – we don’t
want to miss out on an Old Master
See the high-value antique detector. The same rationale applies.
Perceptrons can work with many inputs; however, in this exercise we will limit ourselves to
two inputs so that we can more easily show what is happening. In this case we can use x
and y to represent our two inputs (rather than x1 and x2). The following shows a
classification problem suitable for a simple perceptron:
This is suitable as the two classes are linearly separable – that is, they can be divided by a
straight line (if the classes are not linearly separable we need a more complex solution, such
as a deep neural net, which can be thought of as a form of multi-layer perceptron). The
equation for any straight line (which could be used to define a linear boundary) is of the
form:
ax + by + c = 0
The perceptron function (for two inputs) is:
result = 1 if (wx*x) + (wy*y) > -b or 0 otherwise
AND
x y result
0 0 0
0 1 0
1 0 0
1 1 1
If you look at the graph, then you can see the two classes (the green tick represents one
class and the three red crosses represent the other class) are linearly separable as many
different straight lines can be drawn between them. This tells us that we can use a
perceptron to implement the logic of an AND operator.
It is not immediately clear what the values for wx, wy and b should be. Actually, if we did
some trial and error it would not take long to come up with a working solution (have a go).
However, an alternative is to train a perceptron using Excel (not many people’s first choice
as an ML development framework, but simple and ubiquitous). We will need training data,
and for that we will use the four training examples in the truth table.
The next graph shows the same initial guess (labelled 1), and subsequent guesses are
labelled in order until line 4 produces a working solution. Here the line jumps around far
more due to the use of a high learning coefficient (and this figure shows a ‘lucky’ sequence
that manages to hit on a solution after only a few attempts).
As we saw in the truth table, we have four examples. When we train an ML model, each
pass through the whole training dataset is known as an epoch. We will add more epochs as
needed later.
We now want to add the activation function for the perceptron node. We have already
seen that the perceptron activation function looks like this:
activation = 1 if wx*x + wy*y > -b (or 0 otherwise)
If we rewrite it to move bias to the same side as the weighted variables, we get:
if value > 0 output 1, else output 0, where value = wx*x + wy*y + b
Add the following to the spreadsheet:
The value for ‘Activation’ can now be calculated. If the calculated value is greater than zero,
then the activation value is 1, otherwise it is 0.
As you can see, we can calculate the activation value for the first training data example
(X=0, Y=0). However, we can also see that our randomly selected weights and bias are not
quite right (they have not immediately solved the problem), because the activation value
does not match the expected result: the activation value we get (1) differs from the desired
result in the training data (the AND result), which is zero.
The spreadsheet also needs to be able to spot this, so add another column (Error) to detect
if the activation result and desired result differ:
We want the spreadsheet to learn from its mistakes and change the values of the weights
and bias until the activation function can correctly calculate activation values that match the
desired result for every one of the four training examples in an epoch. As there was an
error (and we know the direction as this error is negative), we can modify the weightings
and bias based on the error and a learning coefficient.
We will now add the ‘learning’ to the spreadsheet, by adding functions that modify the
weights and bias if the error is non-zero (i.e. the activation function disagrees with the
desired result). Let’s start with WX. Add the following function to calculate a new value for
WX based on the current value.
You see that the learning function takes the previous value for WX and updates it based on
the learning coefficient, the error and the previous value of X. Note that WX does not
change in this case as the previous value of X was zero.
We now need to do the same for WY, which also does not change due to its previous value
being zero.
It is similar for the bias, but we have no input variable to consider, and so you can see that
the bias actually changes due to the error. Note that the amount of change is highly
dependent on the selected learning coefficient (0.1).
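Putting these together, the updates that the spreadsheet applies are the standard perceptron learning rule, where error = desired result − activation and the learning coefficient is 0.1:

new WX = WX + (learning coefficient × error × X)
new WY = WY + (learning coefficient × error × Y)
new bias = bias + (learning coefficient × error)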
As expected with a new training example (and an updated bias), the calculated value
changes, but the new activation matches the desired result (in column D) and so the error is zero.
The formulae we used for updating the weights and bias take account of the error value,
and so we can simply cut and paste them (along with those for value, activation and error)
into the next row for the third training example:
We can see that there was another error, and so we would expect a change for the fourth
training example. To get this, we again cut and paste from the previous row:
We have now run one epoch of training data. There were errors (as shown in column J), and
so the weights and bias have been changed, thus we are not yet sure if the current weights
and bias will correctly define a boundary that will work for all four training examples (at
present we are only sure that the current values will work for the fourth example).
What we are looking for as we proceed is a complete epoch (all four training examples run)
with no errors for any of them. This will mean that the weights and bias
do not change (due to the error being zero) and we will know that these weights and bias
define a boundary that works for all four examples.
When all the error values in an epoch are zero, we can say that we have converged on a
solution. Note that the solution that we have converged on is dependent on many factors,
including the initial values we chose for the weights and bias, the learning coefficient we
selected, and even the order in which the training examples are presented.
We cannot simply copy the functions from all four rows of the first epoch, as we started the
training with randomly selected values for the weights and bias in row 3. Instead, we need
to copy the formulae from the final row (row 6) into the next four rows:
From now on, to add another epoch of training data, it is simply a single copy and paste of
the four rows of an epoch:
Using copy and paste, perform more training until the weights and bias converge on a
solution:
One way of checking graphically if this works is to plug the values into the equation of a line:
ax + by + c = 0.
We get the following equation: 0.2x + 0.1y – 0.2 = 0 (or y = 2 – 2x).
The blue boundary line is ‘just’ OK as the cross on the line is considered to belong with the
other two crosses (it would need to be to the right of the line to be considered a 1 by the
perceptron).
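As a quick check, substituting the four training points into 0.2x + 0.1y − 0.2 gives −0.2 for (0,0), −0.1 for (0,1) and 0 for (1,0) – none greater than zero, so the perceptron outputs 0 – while (1,1) gives 0.1, which is greater than zero, so the output is 1. This matches the AND truth table.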
The solution that we arrive at is dependent on many factors, including the initial values we
chose for the weights and bias, and the learning coefficient we selected. Create a copy of
the worksheet and try changing these factors to see what effect it has on how quickly it
converges on a solution.
If you load the solution spreadsheet, it will also automatically draw the graph, so that you
can visibly check if the solution works.
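If you would like to check the same procedure outside Excel, the short Python sketch below implements the identical learning loop. It is only a minimal sketch: the starting weights, bias and variable names are our own choices, not part of the exercise, and only the learning coefficient (0.1) and the AND training data come from the spreadsheet.

# Minimal perceptron trained on the AND truth table, mirroring the spreadsheet:
#   activation = 1 if wx*x + wy*y + b > 0, otherwise 0
#   error      = desired result - activation
#   updates    : wx += lr*error*x, wy += lr*error*y, b += lr*error

training_data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]  # (x, y, AND result)
wx, wy, b = 0.3, -0.2, 0.1   # arbitrary starting guesses (the spreadsheet uses random values)
lr = 0.1                     # learning coefficient

for epoch in range(1, 101):  # keep running epochs until one passes with no errors
    errors = 0
    for x, y, desired in training_data:
        activation = 1 if wx * x + wy * y + b > 0 else 0
        error = desired - activation
        if error != 0:
            errors += 1
            wx += lr * error * x
            wy += lr * error * y
            b += lr * error
    if errors == 0:  # a complete epoch with no errors means we have converged
        print(f"Converged in epoch {epoch}: wx={wx:.1f}, wy={wy:.1f}, b={b:.1f}")
        break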
6. Once you are familiar with the dataset, go to the right, and click in the white space to
close the dataset explorer and show six classifiers, with their accuracy (labelled here
as precision).
7. Click on ‘NEXT’ and it will give you the opportunity to select which algorithms to
compare, and whether you want local or global explanations. The default is that all
are selected, so leave it with the default.
8. Click on ‘FINISH’ and ExpliClas will explain why each of the classifiers came up with
their result for the instance you selected. Below it there is a global explanation that
applies to the model in general.
9. Go ‘BACK’ twice to select another instance, and then repeat as before to see
explanations.
10. Now try the EXPERT option. Click on the three dots at the bottom right of the screen.
12. Now choose an instance on the left (Click on ‘DATA’), and click on the ‘CLASSIFY’ button,
below.
13. You will now be presented with a graphical view of the J48 model, with the path
through the tree highlighted in green.
From here, you can select different instances, different algorithms, look at local or global
explanations, etc.
Example - The test results show the model is underfitting although the training dataset
should have been just large enough
Underfitting can be caused by problems with the training data not containing
relevant features that are needed by the algorithm to build the model, or by the
wrong algorithm being used, or both of these. In this example situation, it says the
training dataset was large enough, so that suggests that either the training dataset does not
contain the relevant features needed to build the model, or the wrong algorithm is being
used (or both).
Select ‘new’ and enter Self-Driving Car as the new Test Plan.
We will now define the parameters. We will consider the Operational Design Domain
(ODD). This is defined as a set of conditions that includes environmental factors, road
infrastructural elements, agents around vehicles, and various driving scenarios for which a
system or a feature of the system is designed. According to ISO 21448, the following factors
are part of the ODD (note that the ‘ego’ vehicle is the one we are testing):
• climate
• time of the day
• road shape
• road feature (e.g. tunnel, gate)
• condition of the road
• lighting (e.g. glare)
• condition of the ego vehicle (e.g. sensor covered by dust)
• operation of the ego vehicle (e.g. a vehicle is stopping)
• surrounding vehicles (e.g. a vehicle to the left of the ego vehicle)
• road participants (e.g. pedestrians)
• surrounding objects off roadway (e.g. a traffic sign)
• objects on the roadway (e.g. lane markings).
You will find that Pairwiser generates 13 tests (actually 13 combinations of test inputs as
there are no expected results). If you decided to do this manually (and you were efficient),
then you could cover all the parameter-value pairs with just 12 test cases. The algorithm for
generating pairwise tests is not guaranteed to find the minimum possible number of test cases.
On top of the three MRs shown on the slide, note that we can create more test cases with
volume at different levels, played at different speeds, and with changed background noise,
etc.
We can also create many follow-up test cases where we combine these changes (e.g. sound
at 90% and speed at 120%).
On top of the three MRs shown on the slide, note that we can create more test cases with 2
or more search terms added or removed.
We will define three classes (triangle, quadrilateral and background), the examples for
which we will input through our computer webcam.
Rename ‘Class 1’ to be ‘Triangle’ by clicking on the pencil icon. Rename ‘Class 2’ to be
‘Quadrilateral’ and use the ‘Add a class’ button to add a new class and name it ‘Background’.
Here are two screenshots showing an example of a simple neural net where we can see that
the use of a high learning rate has the potential to cause the neural net to fail (depending on
the starting point and data).
The same one hidden layer neural network (all parameters the same), but the result is
completely different (due to a slightly different starting state and the high learning rate). In
this case the neural net classifies inputs the opposite of what is wanted (note the yellow
dots are in the blue space, and vice versa).
The following, more complex neural net, is an attempt to classify the spiral dataset.
2. Does the new version of a web-based sales system make more sales?
3. Was the latest version of a spam filtering system attacked during its training?
4. We only have a few trusted test cases, and we have inexperienced testers who are
familiar with the application
5. Our software is used to control self-driving trucks, but we are aware that our
overseas rivals do not want us to succeed
6. There is a worry that the train control system may not handle conflicting inputs from
several sensors satisfactorily
7. Our confidence in the training data quality is low, but we do have testers
experienced in the boat insurance business
8. We have developed a new deep neural net to support medical procedures that
provides excellent functional performance but want to improve our confidence in it
9. We have several teams producing ML models, but they don’t all perform the same
verification tasks to ensure quality
10. We are new to ML, and would like to know that our testing is aligned with the
experienced model developers
11. We need something that guides our testers when they are asked to take a quick look
at the data used in the ML workflow
12. We have a tester who has seen several AI projects that have had problems with bias
and ethics – and we are worried about making the same mistakes ourselves
13. We have a test automation framework, but checking the test results from our AI
bioinformatics systems is very expensive with conventional tests
14. We are testing a self-learning system and we want to be sure that the system’s
updates are sensible
16. We want to check that the new innovative ML model that has been developed is
broadly working as we would expect
17. We want to check that the replacement AI-based system provides the same basic
functions provided by the previous conventional system
18. We are testing an automated plant-feeding system that considers multiple factors,
such as weather features, water levels, plant type, growth stage, etc.
19. We believe that the public dataset we used for training may have been attacked by
someone adding random data examples
20. We have received a warning that the third-party supplier of our training data may
not be following the agreed data security practices
21. Our classifier provides similar functionality to a classifier which has reported
problems with inputs close to the classification boundary
2. Does the new version of a web-based sales system make more sales?
• A/B Testing can be used to see if the new version achieves better sales by
comparing the results from the two versions using statistical analysis and
splitting site visitors so that some visit the old version and some visit the new
version
3. Was the latest version of a spam filtering system attacked during its training?
• A/B Testing can be used to see if the new version provides results that are
statistically significantly different from those of the current system for the
same set of emails.
• If there is a difference it may be due to a data poisoning attack using the
training data.
5. Our software is used to control self-driving trucks, but we are aware that our
overseas rivals do not want us to succeed
• Adversarial testing may be appropriate in situations where attacks against
mission-critical systems may occur.
6. There is a worry that the train control system may not handle conflicting inputs from
several sensors satisfactorily
• Pairwise testing can be used to ensure all pairs of values from different
sensors can be tested in a reasonable timescale when all combinations would
be infeasible.
7. Our confidence in the training data quality is low, but we do have testers
experienced in the boat insurance business
• Experience-based testing, using techniques such as EDA (exploratory data
analysis) may be able to identify whether we have reason to be worried
about the data quality.
8. We have developed a new deep neural net to support medical procedures that
provides excellent functional performance but want to improve our confidence in it
• Using a neural network coverage criterion to drive the generation of
additional test cases to achieve a higher level of coverage should increase our
confidence in the neural net.
10. We are new to ML, and would like to know that our testing is aligned with the
testing of experienced model developers
• Using a checklist, such as the Google “ML Test Checklist” would help inform a
new entrant to ML on what is a ‘good’ set of testing activities to perform
when building an ML model.
11. We need something that guides our testers when they are asked to take a quick look
at the data used in the ML workflow
• The use of exploratory testing is suitable when time is short and we can
define a data tour that focuses the exploratory testing sessions on areas
specifically related to the data.
12. We have a tester who has seen several AI projects that have had problems with bias
and ethics – and we are worried about making the same mistakes ourselves
• Error guessing using the experience of this tester may be useful in avoiding
unfair bias and ethical problems in our systems.
13. We have a test automation framework, but checking the test results from our AI
bioinformatics systems is very expensive with conventional tests
• Incorporating metamorphic testing into the existing test automation
framework should be possible and this will allow many follow-up tests to be
generated.
14. We are testing a self-learning system and we want to be sure that the system’s
updates are sensible
• A/B testing could be implemented automatically by the self-learning system.
This would involve any changes made by the system causing automated A/B
testing to be performed. This A/B testing would need to check that core
system behaviour was not made worse by the change by comparing the new
and current versions.
16. We want to check that the new innovative ML model that has been developed is
broadly working as we would expect
• Back-to-Back testing can be used (with the innovative model and a simple ML
model that is easy to understand being compared).
• This will give us a good idea of whether the new model is working along the
right lines (results would not be expected to be exactly the same, but similar).
• Note that A/B testing is not appropriate here as we are not comparing
measurable performance statistically but are identifying differences in
individual test results.
17. We want to check that the replacement AI-based system provides the same basic
functions provided by the previous conventional system
• Back-to-Back testing can be used (with the replacement AI-based system and
the previous conventional system) using regression tests focused on the basic
functions that are supposed to be the same.
18. We are testing an automated plant-feeding system that considers multiple factors,
such as weather features, water levels, plant type, growth stage, etc.
• Pairwise testing may be appropriate as testing all combinations of values for the
factors is not feasible due to a combinatorial explosion problem.
19. We believe that the public dataset we used for training may have been attacked by
someone adding random data examples
• This sounds like a potential data poisoning attack.
• Exploratory data analysis (EDA) may be an appropriate response to identify if
there are now noticeable problems with the dataset, such as outliers.
21. Our classifier provides similar functionality to a classifier which has reported
problems with inputs close to the classification boundary
• Inputs close to the boundary may correspond to small perturbations that are
adversarial examples.
• Due to transferability of adversarial examples, this suggests that we should
perform adversarial testing of our classifier.
1. Combine the three provided datasets into a single dataset. Remember that you only
need the preliminary information ahead of the data once in a single .arff dataset file.
6. Split the combined dataset into a training dataset (90%) and a test dataset (10%).
7. Remove any outliers from the training dataset (e.g. using the InterquartileRange
filter).
8. Fix the imbalance in the training dataset (between the ‘defects = true’ and ‘defects =
false’ classes).
a. Use SMOTE to oversample the minority 'defects = true' class until the true
and false classes for defects are about equal in size (see the note on SMOTE
after this list).
9. Randomize again so that all oversampled true values are not together.
10. Create a classifier using the Random Forest with 300 trees and use the test dataset
as the ‘Supplied test set’.
20.5 MS Excel
Ensure MS Excel is available.
Load the Perceptron spreadsheet onto your PC and make a note of its location.
20.7 Pairwiser
The people at Inductive have agreed that we can use their free online pairwise testing tool,
Pairwiser, for the pairwise testing exercise (exercise 13).
Go to the Inductive website at https://inductive.no/
Click on the Pairwiser tool link, as shown.
When presented with the cart, click on the ‘Proceed to checkout’ button.