Lexi2: Lexicase Selection with Lexicographic Parsimony Pressure

Allan de Lima, University of Limerick, Limerick, Ireland, Allan.DeLima@ul.ie
Samuel Carvalho, Limerick Institute of Technology, Limerick, Ireland, samuel.carvalho@lit.ie
Douglas Mota Dias*, Rio de Janeiro State University, Rio de Janeiro, Brazil, douglas.dias@uerj.br, douglas.motadias@ul.ie
Enrique Naredo, University of Limerick, Limerick, Ireland, Enrique.Naredo@ul.ie
Joseph P. Sullivan, Limerick Institute of Technology, Limerick, Ireland, joe.sullivan@lit.ie
Conor Ryan, University of Limerick, Limerick, Ireland, Conor.Ryan@ul.ie

* Also with University of Limerick.
ABSTRACT
Bloat, a well-known phenomenon in Evolutionary Computation, often slows down evolution and complicates the task of interpreting the results. We propose Lexi2, a new selection and bloat-control method which extends the popular lexicase selection method by including a tie-breaking step that considers attributes related to the size of the individuals. This new step applies lexicographic parsimony pressure during the selection process and reduces the number of random choices performed by lexicase selection (which happen when more than a single individual correctly solves the selected training cases).
Furthermore, we propose a new Grammatical Evolution-specific, low-cost diversity metric based on the remainders of the modulo operations performed during grammar mapping, which we then use with Lexi2.
We address four distinct problems, and the results show that Lexi2 significantly reduces the length, the number of nodes and the depth for all problems, maintains a high level of diversity in three of them, and significantly improves the fitness score in two of them. In no case does it adversely impact the fitness.
CCS CONCEPTS
• Computing methodologies → Genetic programming; Supervised learning; Discrete space search.

KEYWORDS
lexicase selection, lexicographic parsimony pressure, grammatical evolution

ACM Reference Format:
Allan de Lima, Samuel Carvalho, Douglas Mota Dias, Enrique Naredo, Joseph P. Sullivan, and Conor Ryan. 2022. Lexi2: Lexicase Selection with Lexicographic Parsimony Pressure. In Genetic and Evolutionary Computation Conference (GECCO '22), July 9–13, 2022, Boston, MA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3512290.3528803

1 INTRODUCTION
Evolutionary Algorithms (EAs), such as Genetic Programming (GP) [12] and Grammatical Evolution (GE) [21], are a group of algorithms inspired by Darwin's theory of evolution by natural selection, in which we evolve a population of solutions following the principle of survival of the fittest and applying genetic operators such as crossover and mutation.

Since methods like GP and GE evolve solutions using a variable-length representation, a possible and undesirable effect is sharp growth of these solutions [18]. Populations experiencing such growth rarely enjoy a corresponding improvement in fitness, and growth without a relevant increase in fitness is known as bloat [20]. There are some forms of GP, such as Cartesian Genetic Programming, in which bloat is not a problem [25], but for many others it is a naturally occurring issue. As a consequence of this phenomenon, the evolutionary process can slow down, because bigger solutions usually take more time to evaluate. Moreover, interpretability becomes more difficult, and even generalisation ability can be limited, since more complex solutions are more likely to overfit the training set.

Lexicographic parsimony pressure is a method introduced to control the bloat of GP trees; it consists of choosing the smallest option when two or more solutions present identical fitness. It was tested on different problem domains, maintaining similar fitness scores while reducing tree sizes significantly [14].

Lexicase selection is mostly used as a parent selection method, and it has been shown to give significantly better results than other selection methods, such as roulette and tournament selection, albeit at a cost in speed. It selects each individual by filtering a pool of candidates which starts as the entire population: the training cases are examined one by one in random order, and at each step the pool is filtered according to performance on the current case [24]. In an ideal situation, lexicase selection checks training cases until a single candidate remains in the pool. However, if more than one individual remains after filtering on all training cases, the selection is made randomly among the remaining individuals.

In this paper, we propose Lexi2, a new selection method, which inserts lexicographic parsimony pressure as a step of lexicase selection. The concept of including non-error metrics inside lexicase
selection was investigated once before, in a work that combines modularity metrics with error values to guide the evolution of modular solutions [23]. With our proposal, we aim to reduce the bloat of GE individuals and simultaneously perform fewer random choices in lexicase when selecting parents. Since individuals in GE can be represented as trees, a straightforward metric for an individual's size is its number of nodes: the fewer nodes an individual has, the smaller it is. Therefore, the number of nodes is our first choice of tie-breaking criterion in this work. Alternative metrics linked to individuals' size, such as tree depth and the number of used codons in the genotype, are also investigated.
2 BACKGROUND
GE is an EA method used to build programs [21][17][22]. We can represent a GE individual through its genotype, a variable-length sequence of codons (groups of eight bits), each encoding an integer value. This genotype can be mapped to a phenotype, which is usually a representation more understandable to the user [5].
A grammar is a set of rules, each represented by a non-terminal and a set of production rules consisting of terminals, items that can appear in the final program, and non-terminals, intermediate structures used by the production rules. The modulo operator, which returns the remainder of a division, is used to select production rules during the mapping process. A GE individual is considered invalid when the mapping process is not able to eliminate all non-terminals [16][15].
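To make the mapping concrete, the sketch below implements a standard GE genotype-to-phenotype mapping with the modulo rule. It is a minimal illustration, not the paper's code: the grammar encoding, the function name and the simple no-wrapping invalidity rule are our own assumptions.

# A minimal sketch of GE mapping, assuming a grammar stored as a dict from
# non-terminals to lists of productions (each production is a list of symbols).
GRAMMAR = {
    "<e>": [["and(", "<e>", ",", "<e>", ")"],
            ["or(", "<e>", ",", "<e>", ")"],
            ["not(", "<e>", ")"],
            ["x[0]"], ["x[1]"], ["x[2]"], ["x[3]"], ["x[4]"]],
}

def map_genotype(genotype, start="<e>", max_expansions=500):
    """Return (phenotype, used_codons), or (None, used_codons) when the
    mapping cannot eliminate all non-terminals (an invalid individual)."""
    symbols, used = [start], 0
    for _ in range(max_expansions):
        # Find the leftmost non-terminal; if none is left, mapping is done.
        idx = next((i for i, s in enumerate(symbols) if s in GRAMMAR), None)
        if idx is None:
            return "".join(symbols), used
        if used >= len(genotype):       # ran out of codons (no wrapping here)
            return None, used
        productions = GRAMMAR[symbols[idx]]
        choice = genotype[used] % len(productions)   # the modulo rule
        used += 1
        symbols[idx:idx + 1] = productions[choice]
    return None, used                   # too many expansions: invalid

For example, map_genotype([2, 7, 1]) first picks production 2 % 8 = 2 (not), then expands the inner <e> with 7 % 8 = 7 (x[4]), yielding "not(x[4])" with two used codons.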
Lexicase selection is a method for selecting parents throughout the evolutionary process which considers the fitness on each training case individually, in random order, instead of an aggregated fitness value over the training cases. The original algorithm [24] starts with the entire population in a pool of candidates to be selected as parents, and the list of training cases is shuffled into a random order. The first training case is checked, and only the candidates with the best fitness value on that case remain in the pool. Subsequent training cases are checked until just a single candidate remains in the pool; when this happens, that individual is selected as a parent. If more than one candidate is still in the pool after all training cases have been checked, the parent is chosen randomly among the remaining candidates.
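The procedure can be stated compactly in code. The following is a sketch, not the authors' implementation; it assumes errors maps each (hashable) individual to its vector of per-case errors, with lower being better.

import random

def lexicase_select(population, errors):
    """Select one parent with lexicase selection [24]."""
    pool = list(population)
    cases = list(range(len(errors[pool[0]])))
    random.shuffle(cases)                   # new random case order per selection
    for case in cases:
        best = min(errors[ind][case] for ind in pool)
        pool = [ind for ind in pool if errors[ind][case] == best]
        if len(pool) == 1:                  # fully resolved by the cases
            return pool[0]
    return random.choice(pool)              # unresolved tie: random choice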
This method was initially applied to modal problems [24], in which solutions must execute different actions on particular training cases. Later, its application was expanded to uncompromising problems [10], in which a solution is not allowed to trade low performance on one training case for high performance on other training cases. The method was subsequently tested successfully in other contexts, such as program synthesis problems [8] and learning classifier systems [1]. A reason which may partially explain its success is the ability of lexicase selection to maintain higher levels of population diversity than methods that use aggregated fitness values, while still providing enough selection pressure to exploit good solutions [10][7].
3 LEXI2
The lexicase method uses the lexicographic concept because the fitness on each training case is considered "lexicographically" [24]: the fitness on one training case has priority over the fitness on any case picked later, which only matters for ordering individuals that have the same fitness on the earlier cases. This analogy with the lexicographic concept inspired the name "lexicase". In this work, we propose extending this aspect to attributes other than fitness; since we are expanding the analogy, this inspired the name Lexi2 (read "lexi squared").

Listing 1 shows the Lexi2 algorithm for selecting a parent, which is similar to the original lexicase algorithm [24] with the addition of a tie-breaking step.

Listing 1: Algorithm for Lexi2 selection
(1) Initialise:
  (a) Put the entire population in a pool of candidates
  (b) Put all training cases in a list of cases in random order
(2) Loop:
  (a) Replace candidates with the individuals currently in candidates which presented the best fitness for the first training case in cases
  (b) If a single individual remains in candidates, return this individual
  (c) Else if a single training case remains in cases, try to break the tie between the remaining individuals in candidates:
    (i) Replace candidates with the individuals currently in candidates which presented the best value on a pre-defined tie-breaking criterion
    (ii) If a single individual remains in candidates, return this individual
    (iii) Otherwise, return a random individual from the remaining individuals in candidates
  (d) Else eliminate the first training case in cases and re-run the Loop

We reach the tie-breaking step when more than one candidate remains in the pool after checking all training cases. At that point, we filter the remaining candidates by a pre-defined tie-breaking criterion, expecting to pick the best candidate under this criterion; if more than one candidate has the best value, we choose randomly among them. We can also expand this step to use more than one criterion: if a tie remains after the first attempt to break it, we try again by checking a different pre-defined criterion on the remaining individuals, and so on, using as many criteria as we want, until a single individual remains in the pool. Two criteria are used in this work, and we choose their ordering randomly in each filtering attempt; this is necessary to avoid introducing a hierarchy among the criteria and to expand the search space. In addition, since we propose Lexi2 to reduce bloat, we use as tie-breaking criteria only attributes related to the size of a GE individual: the number of nodes and the depth of the tree representation of the phenotype, and the number of codons used to map the phenotype.
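As an illustration of how the tie-breaking step slots into lexicase, here is a sketch in the spirit of Listing 1. It is not the paper's code: the data layout (per-case error vectors and a per-individual dict of size attributes) is our own assumption.

import random

def lexi2_select(population, errors, size_attrs):
    """Lexi2 selection: lexicase filtering, then size-based tie-breaking.
    size_attrs[ind] is a dict such as {"nodes": 17, "depth": 5} (illustrative)."""
    pool = list(population)
    cases = list(range(len(errors[pool[0]])))
    random.shuffle(cases)
    for case in cases:                      # ordinary lexicase filtering
        best = min(errors[ind][case] for ind in pool)
        pool = [ind for ind in pool if errors[ind][case] == best]
        if len(pool) == 1:
            return pool[0]
    # Tie-breaking step: apply each size criterion in random order,
    # keeping only the smallest candidates (lexicographic parsimony pressure).
    criteria = list(size_attrs[pool[0]].keys())
    random.shuffle(criteria)
    for crit in criteria:
        smallest = min(size_attrs[ind][crit] for ind in pool)
        pool = [ind for ind in pool if size_attrs[ind][crit] == smallest]
        if len(pool) == 1:
            return pool[0]
    return random.choice(pool)              # still tied: random choice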
In the original lexicase approach, if ties occur at the end of the selection process, the individuals remaining in the pool have the same vector of fitness cases. Later works [19][9] use an optional pre-selection step before entering the lexicase loop: among individuals with completely equal vectors of fitness cases, a single one is chosen randomly to keep in the pool of candidates. This results in an initial pool of individuals which are unique with respect to fitness cases, and it allows the process to escape the loop earlier, since it avoids the situation in which many cases remain in the filtering process while the remaining candidates all have the same fitness-case values. This approach significantly reduces the computational cost of lexicase selection but has no practical effect on the results, since it does not change the probability of any individual being selected. Therefore, despite the higher computational cost, we decided not to use a pre-selection step in this work: we consider it easier to understand what is happening inside the lexicase loop in its original form [24], and to compare it with the Lexi2 loop, as we show in Figure 5. However, for further applications we plan and recommend using a pre-selection step with Lexi2 as well; in this case, when taking individuals with completely equal vectors of fitness cases, we would choose the smallest one instead of choosing at random.
4 REMAINDERS DIVERSITY
Although the relationship between performance and diversity is
not simple, the preservation of diversity in populations is usually
considered desirable to help avoid premature convergence [3].
Due to the way in which the GE mapping process operates, individuals with different genotypes can produce similar or even
identical phenotypes. This means that phenotypic diversity measurements probably give a more accurate view of the population
than genotypic diversity measurements.
In this work, we use two different phenotypic diversity measurements. The first, named fitness diversity, is defined as the number of different fitness values identified in the population divided by the number of possible values [11]. In classification problems, if we define the fitness score as the number of correctly predicted outputs, the number of possible fitness values will usually be the number of samples plus one, since there is the possibility that no outputs are correctly predicted.
The second, structural diversity, is defined as the number of different program structures found in the population divided by the population size [11]. One could take the phenotype of an individual as its structure, but this would not be a trivial measure to implement in GE. However, since the mapping process uses the modulo operator, we can take the sequence of remainders produced from an individual's genotype during the mapping process as a representation of its phenotype. Thus, specifically for GE, we can redefine structural diversity as the number of different sequences of remainders found in the population divided by the population size. We name this measure remainders diversity; it is entirely equivalent to structural diversity, but much cheaper to calculate.
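Both metrics reduce to counting distinct values. Below is a sketch under our own data-layout assumptions (fitness and remainders as per-individual mappings); it is an illustration, not the paper's implementation.

def fitness_diversity(population, fitness, n_possible_values):
    """Distinct fitness values present in the population, divided by the
    number of possible values [11]."""
    return len({fitness[ind] for ind in population}) / n_possible_values

def remainders_diversity(population, remainders):
    """Distinct sequences of mapping remainders, divided by the population
    size; remainders[ind] is the list of modulo remainders recorded while
    mapping individual ind."""
    distinct = {tuple(remainders[ind]) for ind in population}
    return len(distinct) / len(population)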
5 EXPERIMENTAL SETUP
We address the key problems from a recent work [1] which studied the effects of lexicase parent selection on classification problems. Firstly, we use the 11-bit Multiplexer and the 5-bit Parity, two Boolean problems traditionally used as benchmarks in evolutionary algorithms. Secondly, we pick the Car Evaluation problem, a strongly unbalanced dataset with four classes and six categorical features, to which we applied one-hot encoding to provide 21 binary features. Finally, we also address the LED problem, a noisy dataset with ten classes, each referring to a digit from 0 to 9, and 7 binary features, each with a probability of 0.1 of being in error [2].

Listing 2: Grammars

(a) 11-bit Multiplexer
<e> ::= and(<e>,<e>) | or(<e>,<e>) | not(<e>)
      | if(<e>,<e>,<e>) | x[0] | x[1] | x[2]
      | x[3] | x[4] | x[5] | x[6] | x[7]
      | x[8] | x[9] | x[10]

(b) 5-bit Parity
<e> ::= and(<e>,<e>) | or(<e>,<e>) | not(<e>)
      | x[0] | x[1] | x[2] | x[3] | x[4]

(c) Car Evaluation
<e> ::= o1 = <log_op>; o0 = <log_op>
<log_op> ::= and(<log_op>,<log_op>)
           | or(<log_op>,<log_op>)
           | not(<log_op>)
           | <boolean_feature>
<boolean_feature> ::= x[0] | x[1] | x[2] | x[3]
                    | x[4] | x[5] | x[6] | x[7]
                    | x[8] | x[9] | x[10] | x[11]
                    | x[12] | x[13] | x[14]
                    | x[15] | x[16] | x[17]
                    | x[18] | x[19] | x[20]

(d) LED
<e> ::= and(<e>,<e>) | or(<e>,<e>) | not(<e>)
      | if(<e>,<o>,<e>) | x[0] | x[1] | x[2]
      | x[3] | x[4] | x[5] | x[6]
<o> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

In our grammars (Listing 2), we define the function sets using the same approach as Koza [12] for the Boolean problems. This means using the operators AND, OR and NOT for both problems, and the IF function for the 11-bit Multiplexer; this function is the Common Lisp function which takes three arguments and executes the IF-THEN-ELSE operation. Since the remaining problems also have only binary inputs, we employ the same function sets for them. However, our idea is to analyse the performance of Lexi2 not only on various problems, but also with distinct grammars, so we try different structures. In the LED problem, we expect a reasonable individual to have integer outcomes, but its grammar also allows individuals with Boolean outcomes; when this happens, such individuals can correctly predict only the classes 0 or 1, resulting in poor performance. We thus let GE identify this issue, and Lexi2 reduce the size of the solutions. For the Car Evaluation problem, the grammar provides for individuals with two binary outcomes, which we use to predict the four classes of this problem.
We use the whole dataset as the training set for our Boolean problems, since we aim to find individuals with a perfect score. On the other hand, for the Car Evaluation and LED problems, we split the samples into training (70%) and test (30%) sets, with a different split in each run in order to build a better statistical analysis.
Table 1 shows the hyperparameters used in all experiments. The choice of these parameters was based on a small set of initial runs and, for the Boolean problems, the population size was set to a value sufficient to find individuals with a perfect score in some
runs. In the table, the population size values refer to the multiplexer,
parity, Car Evaluation and LED problems, respectively.
Table 1: Experimental hyperparameters

Parameter                Value
Number of runs           30
Number of generations    200
Population size          500 / 2000 / 1000 / 1000
Elitism ratio            0.01
Mutation method          Codon-based integer flip [6]
Mutation probability     0.01
Crossover method         Variable one-point [6]
Crossover probability    0.8
Initialisation method    PI Grow [4]
We define the mean absolute error (MAE) as the fitness function. Since we evolve classifiers for all problems, this measure represents the rate of wrongly predicted outputs; therefore, we want to minimise the fitness score, and we give invalid individuals the worst possible fitness value. Regarding the training cases, each one has only two possible outcomes: 1 if it is correctly predicted, and 0 otherwise.
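In code, this fitness is just the fraction of wrongly predicted cases. A sketch, with 1.0 as our assumed worst value for an invalid individual:

def mae_fitness(predictions, targets):
    """Rate of wrongly predicted outputs (to be minimised). An invalid
    individual, represented here by predictions=None, gets the worst value."""
    if predictions is None:
        return 1.0
    case_outcomes = [int(p == t) for p, t in zip(predictions, targets)]
    return sum(1 - o for o in case_outcomes) / len(case_outcomes)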
We run experiments with Lexi2 using, in turn, the number of nodes, the number of used codons and the depth as a tie-breaker, and also every combination of two of them. Furthermore, we run experiments with lexicase selection in order to compare the results.
6 RESULTS AND DISCUSSION
We start this section by graphically summarising the results of lexicase selection and Lexi2 with six distinct combinations of tie-breaking criteria, with all graphs showing the average over 30 runs; at the end, we present a statistical analysis to support our results.
Figure 1: Average fitness of the best individual across generations, for lexicase and the six Lexi2 set-ups; (a) 11-bit Multiplexer, (b) 5-bit Parity
Table 2: Number of successful runs (out of 30) in the 11-bit Multiplexer and 5-bit Parity problems

Approach                         11-bit Multiplexer   5-bit Parity
Lexicase selection               24                   8
Lexi2 (nodes)                    29                   9
Lexi2 (depth)                    29                   9
Lexi2 (used codons)              30                   10
Lexi2 (nodes and depth)          30                   7
Lexi2 (used codons and depth)    29                   7
Lexi2 (used codons and nodes)    29                   12
6.1 Fitness
Figure 1 shows the fitness of the best individual across generations for the Boolean problems, while Table 2 shows the number of successful runs, i.e., runs that found a solution satisfying all test cases of a Boolean problem. In the multiplexer problem, we can see that all approaches using Lexi2 converge to a perfect score faster than lexicase selection; accordingly, all Lexi2 approaches achieved 29 or 30 successful runs, while lexicase achieved 24. On the other hand, in the 5-bit Parity problem Lexi2 shows no clear superiority, but the approach using the number of nodes and the number of used codons as tie-breaking criteria converges to the smallest value, achieving 12 successful runs, against eight successful runs for lexicase.
For the Car Evaluation and LED problems, unlike the Boolean datasets, the idea is to find solutions with the ability to generalise beyond the training set. Therefore, the fitness result that matters is measured on a test set, and we do this only for the best individual of each run. Figure 2 shows these results as box plots. In the Car Evaluation problem, the results for four set-ups of Lexi2 present a slightly smaller average than the results using lexicase selection.
The best one is the approach using the number of used codons as
the tie-breaking criterion. In contrast, four set-ups of Lexi2 present
clearly better results for the LED problem, especially the approach
using the number of nodes and the depth as tie-breaking criteria.
6.2 Bloat
Figure 3 shows our most important results, since a key aim of this work is to reduce the bloat in GE individuals. The graphs present the average number of nodes in the population across the generations.
Figure 2: Box plots with the test fitness score achieved by the best individual of each run; (a) Car Evaluation, (b) LED
Figure 3: Average number of nodes across generations; (a) 11-bit Multiplexer, (b) 5-bit Parity, (c) Car Evaluation, (d) LED
Lexi2 is able to reduce bloat in every problem, no matter which tie-breaking criteria are used. The results are especially good for the multiplexer problem, but even for the Car Evaluation problem, which presented the least impressive results, we can clearly see a decrease when using Lexi2 in comparison with lexicase selection.
6.3 Invalid individuals
Although reducing the occurrence of invalid individuals during a run was not a specific aim of this work, we noted that our experiments did precisely this, most probably as a consequence of the reduction in the individuals' size, since smaller individuals are less likely to become invalid solutions. Figure 4 shows the decimal logarithm of ν across generations, where ν is the average number of invalid individuals in each generation, except when this average is zero. We chose this presentation because there is a very sharp peak of invalid individuals in the second generation.
There are no invalid individuals in the first generation, since we use Position Independent Grow [4] as the initialisation method, in which complete individuals are generated. Another aspect of each initial individual is that the genome is 50% longer than the number of used codons [16]. This tail is generated randomly, so after the parents are selected in the first generation, crossover and mutation operations create a huge number of invalid individuals, and the peak of each graph occurs right at the second generation because of the randomness of the tails. Over time, the number of invalid individuals decreases sharply, and we can observe that the convergence value is smaller when using Lexi2 than when using lexicase for all problems.

6.4 Selection process analysis
Figure 5 shows the average number of individuals being selected at each step of lexicase and Lexi2. This is an interesting analysis for this work because it shows in detail what is happening inside the selection process over the generations. First, note that the sum over the steps for each method is equal to the total number of selected individuals, i.e., the population size minus the elitism size. Lexicase has only two steps: selection by error, which means that only one individual remains in the pool of candidates after all training cases have been checked, and random selection, which means that the previous step ended in a tie. The number of individuals selected by lexicase is thus the sum of these two quantities.
Lexi2, on the other hand, has one or more tie-breaking steps between those two, so the number of individuals selected is the sum of three or more steps. For instance, in Figure 5 we have a step related to selection by error, two tie-breaking steps, and random selection, for a total of four steps for Lexi2. The figure shows the results of the approach using the number of nodes and the depth as tie-breakers, but the results are similar for the other approaches. Moreover, in these graphs we show the number of individuals selected by tiebreaker 1 and by tiebreaker 2, rather than by the number of nodes or the depth, because the choice of which criterion is considered first is made randomly.
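This bookkeeping is straightforward to reproduce: have the selection routine report which step resolved it and tally the steps each generation. A sketch, assuming a selection function instrumented (hypothetically, in the style of our earlier Lexi2 sketch) to return a (parent, step) pair:

from collections import Counter

def step_profile(n_to_select, instrumented_select):
    """Tally how many of n_to_select selections were resolved at each step
    ('error', 'tiebreaker 1', 'tiebreaker 2' or 'random'). The counts sum
    to n_to_select, i.e. the population size minus the elitism size."""
    counts = Counter()
    for _ in range(n_to_select):
        _, step = instrumented_select()
        counts[step] += 1
    return counts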
Figure 4: Decimal logarithm of the average number of invalid individuals (except when equal to zero) across generations; (a) 11-bit Multiplexer, (b) 5-bit Parity, (c) Car Evaluation, (d) LED

Figure 5: Average number of individuals being selected in each step of lexicase and Lexi2 when using the number of nodes and the depth as tie-breakers; (a) 11-bit Multiplexer, (b) 5-bit Parity, (c) Car Evaluation, (d) LED
The graphs for the multiplexer and parity problems are similar, but while the multiplexer results have clearly converged, the parity results are still changing, since it is a harder problem. Despite that, in both graphs and for both approaches, the number of selections by error starts almost equal to the total number of selections and then converges to zero. The Lexi2 approach converges slightly faster, as also happens in the training-fitness graph (Figure 1). In these problems, the number of individuals selected by error is related to the fitness score, because once one individual achieves a perfect score, it will be selected as a parent every time in the next generation. The evolutionary process then performs crossover and mutation using just this single parent, generating a new population in which we expect to find distinct individuals with a perfect score. With lexicase, the selection process selects randomly among these individuals from the next generation until the end; with Lexi2, it tries to break the tie before selecting randomly, and thus evolution advances towards better solutions. In short, there are no more selections by error once the population contains more than one individual with a perfect score; random selection then becomes predominant, with some individuals being selected in the tie-breaking steps when using Lexi2.
In contrast, in the Car Evaluation and LED problems, selection by error is predominant throughout the generations. This is especially evident for the Car Evaluation problem, in which the number of selections by error converges to approximately 90% of the total. This means that the number of random selections is small even when using lexicase, and it explains why the differences between the results using lexicase and Lexi2 are small
in Figure 2, and even in Figure 3. However, although selections by error are dominant across generations for the LED problem as well, the number of random choices is approximately 40% of the total when using lexicase. Lexi2 reduces this percentage to less than 20%, but this does not mean that the tie-breakers make all the remaining choices: surprisingly, Lexi2 significantly increases the number of individuals selected by error in this problem. This surely bears some relation to the improvement in test fitness shown in Figure 2. We know that smaller individuals are less likely to overfit, which explains the decrease in the error score for the LED problem, but it does not justify the reduction in error for the multiplexer and parity problems, where we need to fit the training set perfectly. We hypothesise that smaller individuals are closer to the ideal size of a perfect individual, so that with Lexi2 the evolutionary process traverses, in a sense, a search space of more limited dimensionality and is therefore able to find better solutions. Moreover, for the LED problem, which has a hard dataset with a great deal of noise, we could say that this more limited exploration of the search space neglects the noise of the dataset to some extent.
6.5 Diversity
An important aspect of lexicase selection is that it is an excellent
method for maintaining a high level of diversity in a population [8].
Thus, one of the expectations of this work is to maintain at least
the same level of diversity when using Lexi2 . Figure 6 shows the
fitness diversity, and we can see that lexicase selection is better at
maintaining diversity in the multiplexer problem, but that Lexi2 is
Figure 6: Average fitness diversity; (a) 11-bit Multiplexer, (b) 5-bit Parity, (c) Car Evaluation, (d) LED

Figure 7: Average remainders diversity; (a) 11-bit Multiplexer, (b) 5-bit Parity, (c) Car Evaluation, (d) LED
Figure 6 shows the fitness diversity: lexicase selection is better at maintaining diversity in the multiplexer problem, but Lexi2 is slightly better for the other three problems. We can make almost identical observations in Figure 7, which shows the remainders diversity. In both the parity and LED problems the results are so similar that it is not possible to separate lexicase and Lexi2. Note that the conclusions drawn for diversity are the same regardless of which diversity measurement is used, which gives us confidence in our proposed remainders diversity, a measure that is easy to understand and inexpensive to run.
6.6 Statistical Analysis
Given the various metrics and tie-break strategies used in the Lexi2 experiments presented in this paper, a statistical analysis was performed to investigate the significance of the findings. The expected outcome was that Lexi2 would consistently improve size-related metrics (such as the length, number of nodes and depth of the individuals) without damaging overall fitness when compared to the original lexicase method. Tables 3, 4, 5 and 6 present the results of these analyses. Pairwise comparisons between each Lexi2 tie-break strategy and lexicase selection were conducted, and the tables report the end result for each metric, i.e., at the final generation of each run on a given problem. Student's t-test with a p-value threshold of 0.05 for rejection of the null hypothesis (no difference between the metrics) was used for the comparisons, after the Shapiro-Wilk test had indicated the normality of the data under analysis. In addition, even though only standalone pairwise comparisons were performed, a stricter analysis using the Bonferroni correction with a factor of 6 within each metric was also conducted, with similar results. The corrected p-value threshold was set to 0.00833, and the metrics that lost statistical significance in this stricter scenario are marked with an asterisk in the tables; in the typeset tables, statistically significant differences are underlined, and an italic font marks cases where Lexi2 did not outperform lexicase.
In the multiplexer problem, the best fitness improved in all scenarios, five of them statistically significant before the Bonferroni correction. Meanwhile, all the size-related metrics presented a drastic and significant improvement in all scenarios, with many-fold decreases.
For the parity problem, while the best fitness presents better averages in five of the cases, none of these differences is statistically significant. However, once again, all the size-related metrics show a statistically significant improvement, with a single loss of significance after correction.
For the Car Evaluation problem, test fitness did not change significantly compared to lexicase. Meanwhile, the size metrics presented smaller averages in all cases except two, though not all with statistical significance, especially after the Bonferroni correction.
In the LED problem, test fitness improved in all scenarios where Lexi2 was used, with a statistically significant improvement in four of the six strategies (three with Bonferroni). All three size-related metrics presented a significant reduction, with the resulting individuals notably smaller than those obtained with lexicase selection.
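The per-metric, per-strategy procedure can be sketched with SciPy (recent versions, where the test results expose a pvalue attribute); the array names are our own, each holding the final-generation value from each of the 30 runs:

from scipy import stats

def compare(lexicase_vals, lexi2_vals, n_comparisons=6, alpha=0.05):
    """Shapiro-Wilk normality check, then Student's t-test; returns the
    p-value, the normality flag, and significance flags before and after
    the Bonferroni correction (alpha / n_comparisons = 0.00833)."""
    normal = (stats.shapiro(lexicase_vals).pvalue > alpha
              and stats.shapiro(lexi2_vals).pvalue > alpha)
    p = stats.ttest_ind(lexicase_vals, lexi2_vals).pvalue
    return p, normal, p < alpha, p < alpha / n_comparisons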
Table 3: 11-bit Multiplexer: mean metrics for each approach and p-values between lexicase selection and each strategy of Lexi2

Approach                        Fitness avg (p)         Length avg (p)         Nodes avg (p)          Depth avg (p)
Lexicase                        8.85E-03                2.16E+03               4.54E+02               2.23E+01
Lexi2, nodes                    2.60E-04 (2.21E-02*)    3.47E+02 (2.16E-06)    5.49E+01 (1.06E-13)    7.03E+00 (4.13E-17)
Lexi2, depth                    1.63E-05 (1.85E-02*)    5.37E+02 (6.48E-05)    8.91E+01 (2.67E-12)    8.42E+00 (2.60E-15)
Lexi2, used codons              0.00E+00 (1.83E-02*)    3.05E+02 (1.05E-06)    4.68E+01 (2.20E-14)    6.71E+00 (5.26E-18)
Lexi2, nodes and depth          0.00E+00 (1.83E-02*)    3.15E+02 (1.31E-06)    4.52E+01 (1.85E-14)    6.56E+00 (2.30E-18)
Lexi2, used codons and depth    2.08E-03 (1.12E-01)     3.58E+02 (2.18E-06)    5.31E+01 (4.95E-14)    7.17E+00 (2.63E-17)
Lexi2, used codons and nodes    0.00E+00 (1.83E-02*)    2.89E+02 (8.73E-07)    4.53E+01 (1.86E-14)    6.74E+00 (4.26E-18)
Table 4: 5-bit Parity: mean metrics for each approach and p-values between lexicase selection and each strategy of Lexi2

Approach                        Fitness avg (p)         Length avg (p)         Nodes avg (p)          Depth avg (p)
Lexicase                        4.79E-02                4.10E+03               5.39E+02               2.77E+01
Lexi2, nodes                    4.06E-02 (4.50E-01)     2.73E+03 (7.34E-04)    3.45E+02 (6.85E-08)    2.28E+01 (3.83E-07)
Lexi2, depth                    3.96E-02 (3.98E-01)     2.92E+03 (2.02E-03)    4.49E+02 (3.05E-02*)   2.37E+01 (4.49E-05)
Lexi2, used codons              3.96E-02 (3.92E-01)     2.38E+03 (4.07E-05)    3.50E+02 (1.36E-06)    2.33E+01 (3.74E-05)
Lexi2, nodes and depth          4.79E-02 (1.00E+00)     2.44E+03 (3.16E-05)    3.84E+02 (1.31E-05)    2.28E+01 (2.41E-08)
Lexi2, used codons and depth    4.48E-02 (7.41E-01)     2.65E+03 (2.71E-04)    3.74E+02 (1.04E-05)    2.28E+01 (1.41E-07)
Lexi2, used codons and nodes    3.96E-02 (4.42E-01)     2.51E+03 (5.89E-05)    3.38E+02 (2.89E-08)    2.26E+01 (2.51E-08)
Table 5: Car Evaluation: mean metrics for each approach and p-values between lexicase selection and each strategy of Lexi2

Approach                        Fitness avg (p)         Length avg (p)         Nodes avg (p)          Depth avg (p)
Lexicase                        1.02E-01                2.28E+03               8.50E+02               4.14E+01
Lexi2, nodes                    1.01E-01 (8.70E-01)     2.38E+03 (7.27E-01)    7.44E+02 (2.98E-02*)   3.90E+01 (4.62E-03)
Lexi2, depth                    1.01E-01 (8.20E-01)     2.05E+03 (2.27E-01)    8.00E+02 (3.32E-01)    3.79E+01 (1.92E-04)
Lexi2, used codons              9.85E-02 (4.95E-01)     2.17E+03 (6.49E-01)    7.16E+02 (5.11E-03)    3.88E+01 (1.21E-02*)
Lexi2, nodes and depth          1.05E-01 (5.42E-01)     2.11E+03 (3.87E-01)    8.22E+02 (6.76E-01)    3.97E+01 (2.50E-02*)
Lexi2, used codons and depth    1.04E-01 (6.49E-01)     2.22E+03 (7.87E-01)    7.69E+02 (9.90E-02)    3.87E+01 (3.30E-03)
Lexi2, used codons and nodes    1.01E-01 (8.99E-01)     2.31E+03 (9.13E-01)    7.58E+02 (4.17E-02*)   3.91E+01 (7.01E-03)
Table 6: LED: mean metrics for each approach and p-values between lexicase selection and each strategy of Lexi2

Approach                        Fitness avg (p)         Length avg (p)         Nodes avg (p)          Depth avg (p)
Lexicase                        3.49E-01                3.48E+03               3.43E+02               2.64E+01
Lexi2, nodes                    3.12E-01 (3.45E-03)     1.07E+03 (2.77E-10)    2.45E+02 (9.77E-08)    2.14E+01 (2.99E-13)
Lexi2, depth                    3.19E-01 (2.02E-02*)    1.11E+03 (4.68E-10)    2.45E+02 (2.51E-09)    2.14E+01 (3.11E-14)
Lexi2, used codons              3.22E-01 (5.38E-02)     1.19E+03 (2.51E-09)    2.51E+02 (4.90E-08)    2.20E+01 (1.12E-10)
Lexi2, nodes and depth          3.07E-01 (1.54E-03)     1.04E+03 (1.59E-10)    2.60E+02 (3.44E-06)    2.15E+01 (1.19E-10)
Lexi2, used codons and depth    3.13E-01 (6.25E-03)     1.08E+03 (1.06E-09)    2.49E+02 (7.99E-09)    2.14E+01 (8.03E-14)
Lexi2, used codons and nodes    3.25E-01 (7.25E-02)     9.52E+02 (5.42E-11)    2.35E+02 (5.79E-09)    2.07E+01 (6.84E-14)
7 CONCLUSION
In this paper, we improved the lexicase selection technique by applying parsimony pressure inside the selection process, which performs fewer random choices when selecting parents. Our first target was to reduce bloat, and this happened significantly for all problems addressed, considering the length of the genome, the number of nodes and the depth, in comparison with traditional lexicase. We achieved this while maintaining at least a similar fitness score, and for three of the problems examined, Lexi2 produced better results. We believe that smaller individuals present better generalisation ability due to their simplicity.
The statistical analysis of the experiments presented in this paper has validated the rationale behind the development of the Lexi2 method: it consistently reduces the size of the generated individuals while maintaining the same overall fitness results obtained with lexicase.
Lexi2 is also able to maintain a high level of diversity in the population in three of the problems addressed, and this was measured with two distinct diversity measurements, one of which, remainders diversity, is a new GE-specific way to measure diversity. In future work, we plan to compare this new measure with a more informative phenotypic diversity measure, such as behavioural diversity.
Although this work is related to the bloat issue, we believe the technique can address any other attribute as a tie-breaker. Thus, as another future work, we plan to apply Lexi2 to attributes that are not related to the size of the individuals, such as parsimonious power consumption when evolving low-power digital circuits. Moreover, although we developed Lexi2 using GE, it is not a GE-specific method and can easily be applied to many forms of GP.
Another future work is to apply Lexi2 to regression problems. First, we need another way to specify what constitutes a tie, since the fitness space is not discrete. Following the same approach as epsilon-lexicase selection [13], where individuals within a certain threshold of the target are considered to be successful, we can choose the smallest one among these individuals.
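To illustrate the epsilon-based notion of a tie on a single case, here is a sketch using the median absolute deviation (MAD) of the pool's errors as the threshold, one choice used with epsilon-lexicase [13]; the data layout is our own assumption.

import statistics

def epsilon_tie_pool(pool, case_errors, case):
    """Keep the individuals whose error on this case is within epsilon of
    the best error, treating them as tied; epsilon is the median absolute
    deviation of the pool's errors on the case."""
    errs = [case_errors[ind][case] for ind in pool]
    med = statistics.median(errs)
    epsilon = statistics.median(abs(e - med) for e in errs)
    best = min(errs)
    return [ind for ind in pool if case_errors[ind][case] <= best + epsilon]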
8 ACKNOWLEDGMENTS
This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant number
16/IA/4605. The third author is also financed by the Coordenação
de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES),
Finance Code 001, and the Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro (FAPERJ). Open Access funding provided by
the IRel Consortium.
REFERENCES
[1] Sneha Aenugu and Lee Spector. 2019. Lexicase Selection in Learning Classifier Systems. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '19). ACM, New York, NY, USA, 356–364. https://doi.org/10.1145/3321707.3321828
[2] Leo Breiman. 1984. Classification and Regression Trees. CRC Press, Boca Raton, Florida, 47–49.
[3] Edmund Burke, Steven Gustafson, and Graham Kendall. 2002. A Survey and Analysis of Diversity Measures in Genetic Programming. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation (GECCO '02). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 716–723.
[4] David Fagan, Michael Fenton, and Michael O'Neill. 2016. Exploring Position Independent Initialisation in Grammatical Evolution. In 2016 IEEE Congress on Evolutionary Computation (CEC). IEEE, Vancouver, Canada, 5060–5067. https://doi.org/10.1109/CEC.2016.7748331
[5] David Fagan, Michael O'Neill, Edgar Galvan-Lopez, Anthony Brabazon, and Sean McGarraghy. 2010. An Analysis of Genotype-Phenotype Maps in Grammatical Evolution. In Proceedings of the 13th European Conference on Genetic Programming, EuroGP 2010 (LNCS, Vol. 6021). Springer, Istanbul, 62–73. https://doi.org/10.1007/978-3-642-12148-7_6
[6] Michael Fenton, James McDermott, David Fagan, Stefan Forstenlechner, Erik Hemberg, and Michael O'Neill. 2017. PonyGE2: Grammatical Evolution in Python. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO '17). ACM, Berlin, Germany, 1194–1201. https://doi.org/10.1145/3067695.3082469
[7] Thomas Helmuth, Nicholas Freitag McPhee, and Lee Spector. 2016. Effects of Lexicase and Tournament Selection on Diversity Recovery and Maintenance. In Proceedings of the 2016 Genetic and Evolutionary Computation Conference Companion (GECCO '16 Companion). ACM, New York, NY, USA, 983–990. https://doi.org/10.1145/2908961.2931657
[8] Thomas Helmuth, Nicholas Freitag McPhee, and Lee Spector. 2016. Lexicase Selection for Program Synthesis: A Diversity Analysis. Springer International Publishing, Cham, 151–167. https://doi.org/10.1007/978-3-319-34223-8_9
[9] Thomas Helmuth, Edward Pantridge, and Lee Spector. 2020. On the Importance of Specialists for Lexicase Selection. Genetic Programming and Evolvable Machines 21, 3 (2020), 349–373. https://doi.org/10.1007/s10710-020-09377-2
[10] Thomas Helmuth, Lee Spector, and James Matheson. 2014. Solving Uncompromising Problems with Lexicase Selection. IEEE Transactions on Evolutionary Computation 19 (2014). https://doi.org/10.1109/TEVC.2014.2362729
[11] David Jackson. 2010. Promoting Phenotypic Diversity in Genetic Programming. In Parallel Problem Solving from Nature, PPSN XI (LNCS, Vol. 6239). Springer, Krakow, Poland, 472–481. https://doi.org/10.1007/978-3-642-15871-1_48
[12] John R. Koza. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA.
[13] William La Cava, Lee Spector, and Kourosh Danai. 2016. Epsilon-Lexicase Selection for Regression. In Proceedings of the 2016 Annual Conference on Genetic and Evolutionary Computation (GECCO '16). ACM, Denver, USA, 741–748. https://doi.org/10.1145/2908812.2908898
[14] Sean Luke and Liviu Panait. 2002. Lexicographic Parsimony Pressure. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation (GECCO '02). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 829–836.
[15] Miguel Nicolau. 2017. Understanding Grammatical Evolution: Initialisation. Genetic Programming and Evolvable Machines 18 (2017), 467–. https://doi.org/10.1007/s10710-017-9309-9
[16] Miguel Nicolau, Michael O'Neill, and Anthony Brabazon. 2012. Termination in Grammatical Evolution: Grammar Design, Wrapping, and Tails. In 2012 IEEE Congress on Evolutionary Computation (CEC). IEEE, Brisbane, Australia, 1–8. https://doi.org/10.1109/CEC.2012.6256563
[17] Michael O'Neill and Conor Ryan. 2003. Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language. Springer, Boston, MA. https://doi.org/10.1007/978-1-4615-0447-4
[18] Michael O'Neill, Conor Ryan, Maarten Keijzer, and Mike Cattolico. 2003. Crossover in Grammatical Evolution. Genetic Programming and Evolvable Machines 4, 1 (2003), 67–93.
[19] Edward R. Pantridge, Thomas Helmuth, Nicholas Freitag McPhee, and Lee Spector. 2018. Specialization and Elitism in Lexicase and Tournament Selection. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO '18). ACM, Kyoto, Japan, 1914–1917.
[20] Riccardo Poli, William B. Langdon, and Nicholas Freitag McPhee. 2008. A Field Guide to Genetic Programming. Published via http://lulu.com and freely available at http://www.gp-field-guide.org.uk. (With contributions by J. R. Koza.)
[21] Conor Ryan, J. J. Collins, and Michael O'Neill. 1998. Grammatical Evolution: Evolving Programs for an Arbitrary Language. In Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, 83–96. https://doi.org/10.1007/BFb0055930
[22] Conor Ryan, Michael O'Neill, and J. J. Collins (Eds.). 2018. Handbook of Grammatical Evolution. Springer, Boston, MA. https://doi.org/10.1007/978-3-319-78717-6
[23] Anil Kumar Saini and Lee Spector. 2019. Using Modularity Metrics as Design Features to Guide Evolution in Genetic Programming. Springer, 165–180.
[24] Lee Spector. 2012. Assessment of Problem Modality by Differential Performance of Lexicase Selection in Genetic Programming: A Preliminary Report. In Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation (GECCO '12). ACM, New York, NY, USA, 401–408. https://doi.org/10.1145/2330784.2330846
[25] Andrew Turner and Julian Miller. 2014. Cartesian Genetic Programming: Why No Bloat?. In 17th European Conference on Genetic Programming (LNCS, Vol. 8599). Springer, Granada, Spain, 222–233. https://doi.org/10.1007/978-3-662-44303-3_19