arXiv:2106.06085v1 [cs.NE] 10 Jun 2021
Problem-solving benefits of down-sampled lexicase selection
Thomas Helmuth
Hamilton College, Clinton, NY 13323
thelmuth@hamilton.edu
Lee Spector
Amherst College, Amherst, MA 01002
Hampshire College, Amherst, MA 01002
University of Massachusetts, Amherst, MA 01003
lspector@amherst.edu
June 14, 2021
Abstract
In genetic programming, an evolutionary method for producing computer programs that solve specified computational problems, parent selection is ordinarily based on aggregate measures of performance across an entire training set. Lexicase selection, by contrast, selects on the basis of performance on random sequences of training cases; this has been shown to enhance problem-solving power in many circumstances. Lexicase selection can also be seen as better reflecting biological evolution, by modeling sequences of challenges that organisms face over their lifetimes. Recent work has demonstrated that the advantages of lexicase selection can be amplified by down-sampling, meaning that only a random subsample of the training cases is used each generation. This can be seen as modeling the fact that individual organisms encounter only subsets of the possible environments, and that environments change over time. Here we provide the most extensive benchmarking of down-sampled lexicase selection to date, showing that its benefits hold up to increased scrutiny. The reasons that down-sampling helps, however, are not yet fully understood. Hypotheses include that down-sampling allows more generations to be processed within the same budget of program evaluations; that the variation of training data across generations acts as a changing environment, encouraging adaptation; and that it reduces overfitting, leading to more general solutions. We systematically evaluate these hypotheses, finding evidence against all three, and instead conclude that down-sampled lexicase selection's main benefit stems from the fact that it allows the evolutionary process to examine more individuals within the same computational budget, even though each individual is examined less completely.
Keywords: genetic programming, parent selection, lexicase selection, down-sampled lexicase selection, program synthesis
1 Introduction
Genetic programming is an evolutionary method for producing computer programs
that solve specified computational problems (Koza, 1992). When used as a supervised
learning technique, genetic programming defines a problem’s specifications by a set
of training cases. It then judges the ability of evolved programs to solve the problem
by running each program on each training case, and measuring the distance between
the program’s output and the desired output. Genetic programming uses these error
values during parent selection to determine which individuals in the population it
selects to reproduce, and how many children they will produce.
The interaction between a program and the training cases is analogous to the interaction between a biological organism and the challenges presented by its environment.
Organisms that are better equipped to handle these challenges have better reproductive success, and in genetic programming the programs that produce outputs closer to
the desired outputs should produce more children.
Many parent selection methods have been developed for genetic programming, and
they vary in the ways that they model the interactions that biological organisms have
with their environments. In most, the performance of a program on all of the training
cases is aggregated into a single value, referred to as a fitness measure or total error,
and the probability that a program will produce offspring is partially or entirely determined by this aggregate value. Even multi-objective optimization methods, which
select on the basis of multiple objectives, generally nonetheless aggregate performance
across training cases into one objective (Deb et al., 2002; Kotanchek et al., 2006, 2008;
Schmidt & Lipson, 2010a). Similarly, recently developed quality diversity algorithms (Cully & Demiris, 2018; Cully, 2019) such as MAP-Elites (Mouret & Clune, 2015; Vassiliades et al., 2018) use aggregate fitness as part of the basis for selection.
The aggregation of performance is akin to exposing all organisms to all challenges
that they could possibly face, and allowing those that perform best on average to
produce more children. In biology, by contrast, each organism may face different
challenges, and it will produce offspring if it survives the challenges that it happens
to face before it has the opportunity to reproduce.
The lexicase parent selection method differs from most other parent selection methods in that it avoids the aggregation of performance on different training cases into
a single value (Spector, 2012; Helmuth et al., 2015). Instead, it filters individuals by
performance on training cases that are presented in different random orders for each
parent selection event, with the result that different parents will be selected on the basis of good performance on different sequences of training cases. Additionally, children
in the next generation will face different randomly shuffled cases than their parents
did. For these reasons, lexicase selection can be thought of as more faithfully modeling
interactions between biological organisms and their environments.
Hernandez et al. (2019) recently proposed two methods for subsampling the training set each generation when using lexicase selection, which were further studied by
Ferguson et al. (2019). Down-sampled lexicase selection uses a different random subsample of cases each generation. Cohort lexicase selection groups individuals into cohorts, and exposes each cohort to a different random subsample of the training cases.
Both methods effectively change the environment from generation to generation by
exposing individuals to different training cases. Crucially, both methods reduce the
amount of computational effort required to evaluate each individual, since they run
each program only on a subsample of the training cases. These computational savings
can be recouped by evaluating more individuals throughout evolution. Results from
Hernandez et al. (2019) and Ferguson et al. (2019) indicate that both of these methods
improve problem-solving performance compared to standard lexicase selection.
In this paper we concentrate on down-sampled lexicase selection, as it is simpler
in concept and implementation, and both Hernandez et al. (2019) and Ferguson et al.
(2019) found its benefits to be comparable to cohort lexicase selection. We first conduct a more expansive benchmarking of down-sampled lexicase selection than has been
conducted previously, using more benchmark problems and subsample sizes. These results confirm earlier findings that down-sampled lexicase selection produces substantial
improvements over lexicase selection, and that it is robust to a range of subsample
sizes.
We then turn to developing a better understanding of why down-sampled lexicase selection performs so well. One hypothesis put forward by Ferguson et al. (2019)
is that down-sampled lexicase selection’s success hinges on it enabling deeper evolutionary searches for more generations given the same computational effort. We
compare this hypothesis to the hypothesis that simply evaluating more individuals in
the search space is more important than deeper evolution specifically. We conduct experiments using increased maximum generations and increased population sizes (with
non-increased generations), and find that they perform commensurately, indicating
that deeper evolutionary lineages are not crucial to down-sampled lexicase selection’s
success.
We then examine the idea that by randomly down-sampling, we change the environment encountered by individuals each generation. In biology, many theorists
believe that changing environments play an important role in evolutionary adaptation and speciation (Levins, 1968). We hypothesize that changing the training cases
on which down-sampled lexicase selection evaluates individuals each generation contributes to the evolvability of the system, resulting in improved performance. We test
this hypothesis with an experiment that mimics down-sampled lexicase selection, except that it uses different training cases in every selection, meaning that every training
case gains exposure each generation. The results of this experiment provide evidence
against our hypothesis that “changing environments” are important for down-sampled
lexicase selection.
One area where down-sampling (without lexicase selection) has proven useful is in
avoiding overfitting and improving generalization, both in GP and in machine learning
more generally. We explore the hypothesis that down-sampled lexicase selection’s
improved performance is driven by better generalization, and find that it does not
hold up to the results of our experiments.
This article extends a preliminary report that was presented at the 2020 Artificial
Life conference (Helmuth & Spector, 2020). Aside from general improvements to the
clarity and completeness of the presentation in the conference paper, this article covers
experiments involving more subsampling levels and more benchmark problems, with
both of these extensions producing significant new results. One key area we explore is the use of extremely small subsampled sets of training cases, which results in surprisingly
good performance with some notable drawbacks.
Our presentation below continues as follows: We first discuss lexicase selection and subsampling of training cases in more detail. Once we have covered these fundamental algorithms, we describe our experimental methods and present our benchmark results. We then address each of the above-described hypotheses in turn, and conclude with our interpretation of the results and suggestions for future work.

Algorithm 1: Lexicase Selection (to select a parent)
Inputs: candidates, the entire population;
        cases, a list of training cases
Shuffle cases into a random order
loop
    Set first to be the first case in cases
    Set best to be the best performance of any individual in candidates
        on the first training case
    Set candidates to be the subset of candidates that have exactly best
        performance on first
    if |candidates| = 1 then
        Return the only individual in candidates
    end if
    if |cases| = 1 then
        Return a randomly selected individual from candidates
    end if
    Remove the first case from cases
end loop
2 Related Work
Unlike many evolutionary computation parent selection methods, lexicase selection
does not aggregate the performance of an individual into a single fitness value (Helmuth et al.,
2015). Instead, it considers each training case separately, never conflating the results
on different cases. We give pseudocode for the lexicase selection algorithm in Algorithm 1. After randomly shuffling the training cases, lexicase selection goes through
them one by one, removing any individuals that do not give the best performance
on each case until either a single individual or a single case remains. Lexicase selection has produced better performance than other parent selection methods in a variety of evolutionary computation systems and problem domains (Helmuth et al., 2015;
Helmuth & Spector, 2015; La Cava et al., 2018; Orzechowski et al., 2018; Forstenlechner et al.,
2017; Liskowski et al., 2015; Oksanen & Hu, 2017; Moore & Stanton, 2017, 2018, 2019,
2020; Aenugu & Spector, 2019; Metevier et al., 2019).
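The filtering loop of Algorithm 1 is compact enough to sketch directly. The following is a minimal Python version, assuming each individual's errors on the training cases have been precomputed and that lower error is better; the function and variable names are ours, for illustration only.

```python
import random

def lexicase_select(population, errors, num_cases):
    """Select one parent via lexicase selection (Algorithm 1).

    errors[i][c] is individual i's error on training case c; lower is better.
    """
    candidates = list(range(len(population)))
    cases = list(range(num_cases))
    random.shuffle(cases)  # a new random case order for each selection event
    while True:
        first = cases[0]
        best = min(errors[i][first] for i in candidates)
        # Keep only the candidates with the best performance on this case.
        candidates = [i for i in candidates if errors[i][first] == best]
        if len(candidates) == 1:
            return population[candidates[0]]
        if len(cases) == 1:
            return population[random.choice(candidates)]
        cases.pop(0)
```

Note that because the case order is reshuffled for every selection event, different parents can be selected on the basis of different case sequences within a single generation.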
Hernandez et al. (2019) introduced down-sampled lexicase selection, a variant of
lexicase selection that was developed further by Ferguson et al. (2019). Down-sampled
lexicase selection aims to reduce the number of program executions used to evaluate
each individual by only running each program on a random subsample of the overall
set of training cases, which are resampled each generation. This method reduces the
per-individual computational effort, which can either be saved for decreased runtimes,
or can be allocated in other ways, such as increases in population size or maximum
number of generations. In order to compare with methods that do not subsample the
training cases, we take the latter approach, always comparing methods equitably by
limiting their total program executions per GP run.
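The budget trade-off can be made concrete with a small sketch (our own illustrative code, not from Hernandez et al. (2019)): each generation draws a fresh random subsample, and the evaluations saved per generation are spent on proportionally more generations.

```python
import random

def downsample(training_cases, level):
    """Draw this generation's random subsample of the training cases."""
    k = max(1, round(level * len(training_cases)))
    return random.sample(training_cases, k)

def scaled_generations(base_generations, level):
    """Maximum generations that keep the total program-execution
    budget equal to that of a full-training-set run."""
    return round(base_generations / level)
```

With the settings used here, a subsampling level of 0.25 turns 300 generations into 1,200, and a level of 0.02 turns them into 15,000.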
Others have used subsampling of training data in GP to reduce computation per individual or to improve generalization. To our knowledge, the only other
work that has combined subsampling with lexicase selection besides Hernandez et al.
(2019) and Ferguson et al. (2019) is in evolutionary robotics, where subsampling is
necessary for improving runtimes because of slow simulation speeds, though this research did not include comparisons with non-subsampled methods (Moore & Stanton,
2017, 2018). Outside of lexicase selection, subsampling has been used largely to reduce the computational load of evaluating each individual, especially when considering large datasets (Hmida et al., 2016; Martinez et al., 2017; Curry & Heywood, 2004;
Gathercole & Ross, 1994; Zhang & Joung, 1999). Others have proposed subsampling
as a technique to reduce overfitting and improve generalization (Goncalves & Silva,
2013; Martinez et al., 2017; Schmidt & Lipson, 2006, 2008, 2010b). Additionally, subsampling data is common in machine learning for similar reasons (often referred to as
mini-batches), as in stochastic gradient descent for improving generalization (Kleinberg et al.,
2018).
The work we present here, along with that of Hernandez et al. (2019), Ferguson et al.
(2019) and Moore & Stanton (2017), is novel in its application of subsampling when
using lexicase selection, as well as applying subsampling to an already relatively small
set of training data. To the latter point, many previous applications of subsampling
aim to subsample a large set of example data (thousands or millions of cases) to a manageable size, say hundreds of cases. In our case, we start with a set of about 100-200
cases, and subsample to a set of 50 or fewer. When using a small set of n training cases,
lexicase selection can select parents with at most n! different error vectors, since this
is the number of different shufflings of cases. When n is as small as 4 or 5, this limits
selection to a small portion of the population, and often even less in practice. Lexicase
selection typically requires 8 to 10 cases minimum to produce performance benefits,
though others have successfully used it with as few as 4 cases (Moore & Stanton,
2017). With this in mind, it is not self-evident whether or not lexicase selection can
maintain empirical benefits such as increased population diversity and problem-solving
performance with so few cases.
3 Experimental Methods
To explore the effects of down-sampled lexicase selection, we use benchmark problems from the domain of automatic program synthesis, which previous studies of
down-sampled lexicase selection have used (Hernandez et al., 2019; Ferguson et al.,
2019). In particular, we use problems from the “General Program Synthesis Benchmark Suite” (Helmuth & Spector, 2015), which require solution programs to manipulate a variety of data types and control flow structures. These problems originate from
introductory computer science textbooks, allowing us to test the ability of evolution
to perform the same types of programming we expect humans to perform. We use a
core set of 12 problems with a range of difficulties and requirements for many of our
experiments, and expand that set to 26 problems (all of the problems from the suite
that have been solved by at least one program synthesis system) for one experiment.
We additionally compare down-sampled lexicase selection to standard lexicase selection on the 25 problems of PSB2, the second iteration of general program synthesis
Table 1: Full training set size and program execution limit for each problem.

Problems                                                Training Set Size    Executions
Number IO                                                              25     7,500,000
Sum Of Squares                                                         50    15,000,000
Compare String Lengths, Digits, Double Letters,
  Even Squares, For Loop Index, Median, Mirror Image,
  Replace Space With Newline, Smallest, Small Or
  Large, String Lengths Backwards, Syllables                          100    30,000,000
Last Index of Zero, Vectors Summed, X-Word Lines                      150    45,000,000
Count Odds, Grade, Negative to Zero, Pig Latin,
  Scrabble Score, String Differences, Super Anagrams                  200    60,000,000
Vector Average                                                        250    75,000,000
Checksum                                                              300    90,000,000
benchmark problems (Helmuth & Kelly, 2021).[1]
As in Helmuth & Spector (2015), we define each problem’s specifications as a set
of input/output examples, so that GP has no knowledge of the underlying problems
besides these examples.[2] For each problem we use a small set of training cases to
evaluate each individual: between 25 and 300 cases per run (see Table 1) and 200
cases for every problem in PSB2. We use a larger set of unseen test cases, which are
used to determine whether an evolved program that passes all of the training cases
generalizes to unseen data. Before testing a potential solution for generalization, we
use an automatic simplification procedure that has been shown to improve generalization (Helmuth et al., 2017); finding a simplified program that passes all of the unseen
test cases is considered a successful GP run. We test the significance of differences in
numbers of successes between sets of runs using a chi-square test with a 0.05 significance level, using Holm’s correction for multiple comparisons whenever there are more
than two methods run on a single problem in one experiment.
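For two methods with success and failure counts out of a fixed number of runs, this test reduces to a 2x2 contingency table. The sketch below is our own (the names `chi2_2x2` and `holm` are ours, and it omits the continuity correction that some chi-square implementations apply); it computes the p-value for one comparison, using the fact that a chi-square variate with one degree of freedom is the square of a standard normal, and applies Holm's step-down correction across several comparisons.

```python
from math import erfc, sqrt

def chi2_2x2(s1, f1, s2, f2):
    """Chi-square test of independence on a 2x2 table of
    (successes, failures) for two methods; one degree of freedom."""
    a, b, c, d = s1, f1, s2, f2
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square, df = 1

def holm(pvalues, alpha=0.05):
    """Holm's step-down correction: which hypotheses are rejected."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    reject = [False] * len(pvalues)
    for rank, i in enumerate(order):
        if pvalues[i] > alpha / (len(pvalues) - rank):
            break  # once one test fails, all larger p-values fail too
        reject[i] = True
    return reject
```

For example, 91 versus 61 successes out of 100 runs is significant at the 0.05 level, while 99 versus 100 is not.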
When a run using down-sampled lexicase selection finds a program that passes all
of the subsampled training cases, we do not immediately terminate the run. Instead,
we run the program on the full training set (using it as a validation set), and terminate
the run if the program passes all of those cases. If it does not, we continue to the next
generation, as the individual (or its children) may not pass some of the cases in the
newly subsampled set of cases. As we will detail in Section 4.2, with an extremely
low subsampling level that leaves the subsampled training set with 1 or 2 cases, it is
easier for GP to generate individuals that perfectly pass those cases without passing
[1] More information can be found at the benchmark suite's website: https://cs.hamilton.edu/~thelmuth/PSB2/PSB2.html.
[2] Datasets for these problems can be found at https://git.io/fjPeh.
Table 2: PushGP system parameters.

Parameter                                           Value
population size                                      1000
max generations for runs using full training set      300
genetic operator                                     UMAD
UMAD addition rate                                   0.09
the full training set; with enough of these individuals, the process of verifying that
they pass the full training set may dominate the running time of evolution. Note that
if only a single individual passes all cases in the subsampled training set but evolution
continues, it will receive every single parent selection in that generation. These hyperselection events (Helmuth et al., 2016) may have strong effects on population diversity,
a potential avenue for future study.
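The termination logic just described can be sketched as follows, where `run_program(program, case)` is a hypothetical predicate (our own naming) that returns whether a program passes a single training case:

```python
def check_for_solution(population, subsample, full_training_set, run_program):
    """Validate any individual that passes the whole subsample against
    the full training set before halting the run."""
    for program in population:
        if all(run_program(program, case) for case in subsample):
            if all(run_program(program, case) for case in full_training_set):
                return program  # potential solution: terminate the run
            # Otherwise keep evolving; a new subsample is drawn next generation.
    return None
```

With very small subsamples, the first condition becomes easy to satisfy, so the inner full-training-set check can come to dominate running time, as discussed in Section 4.2.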
We evolve programs with the PushGP genetic programming system, which uses
programs represented in the Push programming language (Spector et al., 2005; Spector & Robinson,
2002). Push was designed with genetic programming in mind, in particular to enable
autoconstruction, in which evolving programs not only need to try to solve a problem,
but are also run to produce their children (Spector & Robinson, 2002; Spector et al.,
2016). Push programs utilize a handful of typed stacks, from which instructions pop
their arguments and to which instructions push their results. Push programs can be
any hierarchically-nested list of instructions and literals, the latter of which the interpreter pushes onto the relevant stack. We use Clojush, the Clojure implementation of PushGP, for our experiments.[3]
We present the PushGP system parameters used in our experiments in Table 2.
Our only genetic operator, uniform mutation with additions and deletions (UMAD),
adds random genes before each gene in a parent’s genome at the UMAD addition
rate, and then deletes random genes at a rate to remain size-neutral on average. We
use UMAD to produce 100% of the children, instead of also using a crossover operator, since thus far it has produced the best results of any operator tested on these
problems (Helmuth et al., 2018).
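A minimal sketch of UMAD follows (our own code; the gene and instruction representations are placeholders). Insertions happen before each gene at the addition rate; the size-neutral deletion rate then follows from requiring the expected genome length to be unchanged, giving add_rate / (1 + add_rate), about 0.083 for the 0.09 addition rate in Table 2.

```python
import random

def umad(genome, instruction_set, add_rate=0.09):
    """Uniform mutation with additions and deletions (UMAD).

    A random gene is inserted before each parent gene with probability
    add_rate; genes are then deleted at rate add_rate / (1 + add_rate),
    which keeps genome size constant in expectation.
    """
    child = []
    for gene in genome:
        if random.random() < add_rate:
            child.append(random.choice(instruction_set))
        child.append(gene)
    delete_rate = add_rate / (1 + add_rate)
    return [g for g in child if random.random() >= delete_rate]
```

Size neutrality holds because a genome of length n grows to n(1 + r) in expectation after additions, and n(1 + r)(1 - r/(1 + r)) = n after deletions.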
Each problem in the benchmark suite prescribes a number of training cases to
use (Helmuth & Spector, 2015). In our default configuration, we run every individual
on every training case, meaning the total number of program executions allowed in
one GP run is the number of training cases multiplied by the population size and
generations. Since our down-sampled lexicase selection experiments use fewer cases to
evaluate each individual, we limit our GP runs by a program execution limit, as given
in Table 1, to ensure that each method receives equal training time.
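The execution limits in Table 1 are exactly this product of training cases, population size, and maximum generations:

```python
# Total program executions for a full-training-set run:
#   training cases * population size * max generations
population_size = 1000
max_generations = 300
budget = {cases: cases * population_size * max_generations
          for cases in (25, 50, 100, 150, 200, 250, 300)}
# e.g. 100 cases give 100 * 1000 * 300 = 30,000,000 executions
```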
4 Benchmarking Down-sampled Lexicase Selection
In the work introducing down-sampled lexicase selection, experiments benchmarked
down-sampled lexicase selection with subsampling levels of 0.05, 0.1, 0.25, and 0.5 on
five program synthesis problems (Hernandez et al., 2019; Ferguson et al., 2019). We
[3] https://github.com/lspector/Clojush
Table 3: Number of successes out of 100 GP runs of down-sampled lexicase selection with proportional increases in maximum generations per run across seven different subsampling levels, as well as 1.0, which is equivalent to standard lexicase selection. The mean rank calculates the average rank of each method among all methods across the problems, excluding Mirror Image and Smallest, easy problems where results differ only in random changes in solution generalization. For 6 sets of runs at subsampling levels 0.01 and 0.02, we were not able to finish all 100 runs, as described in Section 4.2; the number of finished runs is given after the /.

                                 Subsampling Level
Problem              0.01    0.02   0.05    0.1   0.175   0.25    0.5    1.0
CSL                   0/4   48/97     38     25      60     51     40     32
Double Letters       5/42      85     87     72      55     50     29     19
LIOZ                   94      90     72     68      61     65     63     62
Mirror Image          100     100    100     99      99     99    100    100
Negative To Zero       96      84     84     86      86     82     78     80
RSWN                  100     100     99     96      97    100     93     87
Scrabble Score       1/98       7     18     19      24     31     28     13
Smallest              100     100    100     99     100     98    100    100
SLB                   100     100     99     96      96     95     94     94
Syllables           11/60      47     48     61      68     64     54     38
Vector Average        100     100    100     98      99     97     95     88
X-Word Lines        25/60      96     98     95      94     91     86     61
Mean                 61.0    79.8   78.6   76.2    78.3   76.9   71.7   64.5
Mean Rank             4.8     3.2    3.4    4.2     3.7    4.0    5.8    7.1
expand on those benchmarks by testing 3 additional subsampling levels, 0.01, 0.02,
and 0.175, with the first two explicitly trying to gauge how low the subsampling rate
can get before having deleterious effects. Our experiments increase the number of
benchmark problems to 12, and additionally test the subsampling level of 0.25 on 39
other program synthesis benchmark problems to broaden our assessment. As described
above, our experiments use PushGP, showing that the benefits of down-sampling generalize beyond the linear GP system used in the initial experiments (Hernandez et al.,
2019; Ferguson et al., 2019).
4.1 Subsampling Levels
Table 3 presents the success rates for down-sampled lexicase selection using seven
different subsampling levels across twelve representative benchmark problems, along
with the mean number of successes. The 1.0 column performs no down-sampling,
and therefore represents standard lexicase selection. For these runs, we proportionally
increase the maximum number of generations that evolution can run to keep a constant
number of program executions; for example, while standard lexicase selection runs for
at most 300 generations, the runs with a subsampling level of 0.02 run for at most 1/0.02 = 50 times as many, at 15,000 generations. For each problem, we calculate the
rank of each subsampling level, and average those to calculate the mean rank, where
lower values are better. Six sets of runs (five at subsampling level of 0.01 and one at
level 0.02) were not able to complete in a reasonable amount of time, as discussed in
Section 4.2.
The subsampling level of 0.02 performed the best on average, propelled by its significantly better results on the difficult Double Letters and Last Index of Zero problems.
However, every subsampling level performed well, and all considerably better than
standard lexicase (i.e. subsampling level of 1.0). The level of 0.5 performed worst of
the subsampling levels, likely because it only runs for twice as many generations as
standard lexicase selection, whereas the other subsampling levels run longer.
It is surprising that the subsampling level of 0.02 performed best, as it only uses
2 training cases per generation for seven of the problems, significantly limiting the
information contained in the errors on which lexicase selection bases selection. In
fact, with only 2 training cases, lexicase selection can only select individuals with 2
different error vectors corresponding to the 2 possible orderings of the cases! Even so,
this extreme constraint on selection introduced by down-sampling seems to be largely
outweighed by increasing the maximum number of generations manyfold.
Even though a subsampling level of 0.02 performed best, subsampling levels 0.02,
0.05, 0.1, 0.175, and 0.25 performed nearly identically, showing that down-sampled
lexicase selection is robust to a wide variety of subsampling levels across an order of magnitude.
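The selection constraint imposed by tiny subsamples is easy to see in a toy example (our own illustration, with hypothetical names): with n cases there are only n! shuffles, so at most n! distinct error vectors can ever win a selection event, regardless of population size.

```python
from itertools import permutations

def winners_over_all_shuffles(errors, num_cases):
    """Indices of individuals that at least one case ordering can select.

    errors[i][c] is individual i's error on case c; lower is better.
    """
    winners = set()
    for order in permutations(range(num_cases)):
        pool = list(range(len(errors)))
        for case in order:
            best = min(errors[i][case] for i in pool)
            pool = [i for i in pool if errors[i][case] == best]
        winners.update(pool)  # survivors of all cases remain selectable
    return winners

# Four individuals, two cases: only the two case "specialists"
# (indices 0 and 1) can ever be selected, whatever the shuffle.
errors = [[0, 3], [3, 0], [1, 1], [2, 2]]
```

Here the individual with errors [1, 1], despite its better average, is never selectable, since some specialist beats it on whichever case comes first.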
4.2 Lower bounds of subsampling level
Since down-sampled lexicase selection performs well at quite low levels of subsampling,
are there any drawbacks? Is there a lower bound to the benefits of subsampling?
First, we will examine the results on our lowest subsampling level, 0.01. We see
that it performed excellently on 7 out of 12 problems, including four where it operated
on a single training case (Mirror Image, Replace Space with Newline, String Lengths
Backwards, and Smallest) and three others with only 2 or 3 cases. These results
include producing the absolute best results on two problems, Last Index of Zero and
Negative to Zero. However, it gave polarized performance, producing the worst results
on the remaining 5 problems. For two of these problems (Scrabble Score and Syllables),
there is a clear trend toward worse performance with the lowest subsampling levels,
but for the other three, down-sampled lexicase selection performs well even at the
0.02 subsampling level. We interpret these findings to suggest that, at least for some
problems, 1 to 3 cases is not sufficient information to drive evolution toward solutions,
likely resulting in either catastrophic lack of diversity, thrashing of the population
between trying to solve different cases, or other detriments.
Beyond problem-solving performance considerations, using extremely low subsampling levels results in other unwanted behaviors of the GP system. Typically in GP, we consider program executions to be the time-limiting factor, and therefore tune our experiments to use the same number of program executions
regardless of down-sampling. However, as we proportionally increase the number of
maximum generations to make up for fewer program executions per generation, the
remaining components of the GP system (such as genetic operators and data logging)
take up a larger proportion of the running time in practice. Additionally, if we run
evolution for many generations (for example 100 times as many with subsampling level
of 0.01), we will require that many times more hard drive space to log data from runs.
Table 4: Number of successful runs comparing lexicase selection to down-sampled lexicase selection with a subsampling level of 0.25 on 26 benchmark problems. Underlined values indicate significant improvement of down-sampled lexicase over lexicase using a chi-squared test. Lexicase was never significantly better than down-sampled lexicase. "problems solved" counts the number of problems each method solved at least once.

Problem                 Down-sampled   Lexicase
Checksum                          18          1
CSL                               51         32
Count Odds                        11          8
Digits                            28         19
Double Letters                    50         19
Even Squares                       2          0
For Loop Index                     5          2
Grade                              2          0
Last Index of Zero                65         62
Median                            69         55
Mirror Image                      99        100
Negative To Zero                  82         80
Number IO                         99         98
Pig Latin                          0          0
RSWN                             100         87
Scrabble Score                    31         13
Small Or Large                    22          7
Smallest                          98        100
String Differences                 1          0
SLB                               95         94
Sum of Squares                    25         21
Super Anagrams                     4          4
Syllables                         64         38
Vector Average                    97         88
Vectors Summed                    21         11
X-Word Lines                      91         61
problems solved                   25         22
Similar issues exist with reserving sufficient RAM when increasing the population size
instead of the maximum generations.
In Table 3, six of our sets of runs at low subsampling levels were not able to finish
all 100 runs in a reasonable amount of time, and were cut off before finishing. Some
of the extreme length of these runs is likely attributable to the effects discussed in
the previous paragraph. However, a subtler and potentially more harmful effect is at
play as well. As described in Section 3, when GP finds a program that passes all of
the subsampled training cases, we must test it on the remaining training cases before
calling it a potential solution and halting evolution; if it does not pass all training
cases, evolution continues. With extremely small subsampled sets, it becomes easier
for evolution to find (many) individuals that pass all of the subsampled data, requiring
us to fully evaluate those individuals, which often do not pass the full training
set. This problem is compounded for problems that have Boolean outputs (such as
Compare String Lengths), since even if the entire population chooses between True and
False randomly, if there is only 1 case in the subsampled set, half of the population
will answer that case correctly and need to be evaluated on every training case every
generation, negating the benefits of quick evaluation per generation. This certainly
impacted the low number of finished runs of the Compare String Lengths problem at
the 0.01 subsampling level, and likely contributed to unfinished runs on other problems
at that level.
With these drawbacks in mind, we see subsampling levels between 0.05 and 0.25
producing good compromises between problem solving performance and real running
times. In the following section, we benchmark down-sampled lexicase selection using a
subsampling level of 0.25, though we expect the results would look similar at a variety
of subsampling levels.
4.3 Expanding benchmarking of down-sampled lexicase selection to more problems
After extensively testing a variety of subsampling levels on 12 benchmark problems,
we now examine down-sampled lexicase selection's performance on a larger set of benchmark problems. We only
had the computational resources to test one subsampling level on this larger set of
problems, and chose 0.25. While the subsampling level of 0.25 did not produce the
best results in Table 3, it performed almost as well as any level, and was less computationally demanding than much lower subsampling levels for the reasons discussed in
Section 4.2.
Table 4 compares standard lexicase selection (i.e. the column 1.0 in Table 3) to
down-sampled lexicase selection with a subsampling level of 0.25 on 26 benchmark
problems from Helmuth & Spector (2015), including the 12 from Table 3. Downsampled lexicase selection produced significantly more successful runs than lexicase
selection on 9 out of the 26 problems. It additionally found solutions to 3 of the
problems that lexicase selection never solved, and had fewer successes on only two of
the problems, neither of which were significantly different.
Table 5 continues the comparison from Table 4 on 25 new problems from PSB2 (Helmuth & Kelly,
2021). These problems were designed to be a step more difficult than those from
Helmuth & Spector (2015), and show lower success rates for both standard lexicase
selection and down-sampled lexicase selection. However, down-sampled lexicase selection continues to clearly outperform standard lexicase selection, solving 4 problems that standard lexicase never solved, and performing significantly better on 8 of
Table 5: Number of successful runs comparing lexicase selection to down-sampled lexicase selection with a subsampling level of 0.25 on the 25 new benchmark problems of PSB2. Values marked with * indicate significant improvement of down-sampled lexicase over lexicase using a chi-squared test. Lexicase was never significantly better than down-sampled lexicase. "Problems solved" counts the number of problems each method solved at least once.

Problem                 Down-sampled   Lexicase
Basement                2              1
Bouncing Balls          3              0
Bowling                 0              0
Camel Case              4              1
Coin Sums               39*            2
Cut Vector              0              0
Dice Game               1              0
Find Pair               20*            4
Fizz Buzz               74*            25
Fuel Cost               67*            50
GCD                     20*            8
Indices of Substring    4              0
Leaders                 0              0
Luhn                    0              0
Mastermind              0              0
Middle Character        79*            57
Paired Digits           17             8
Shopping List           0              0
Snow Day                7              4
Solve Boolean           5              5
Spin Words              0              0
Square Digits           2              0
Substitution Cipher     86*            61
Twitter                 52*            31
Vector Distance         0              0

Problems solved         17             13
Table 6: Number of successful runs comparing down-sampled lexicase at a 0.1 subsampling level (DS 0.1) to lexicase selection with a static set of 10 random training cases, which do not change during evolution. Successes marked with * are significantly better using a chi-squared test.

Problem                      DS 0.1   Static
Compare String Lengths       25*      0
Double Letters               72*      4
Last Index of Zero           68*      7
Mirror Image                 99*      13
Negative To Zero             86*      31
Replace Space with Newline   96*      57
Scrabble Score               19       13
Smallest                     99*      40
String Lengths Backwards     96*      35
Syllables                    61*      9
Vector Average               98*      71
X-Word Lines                 95*      35
the problems. In fact, down-sampled lexicase never produced fewer solutions than standard lexicase on any of the 25 problems. This expanded benchmarking confirms previous findings that down-sampled lexicase selection produces large improvements in performance compared to standard lexicase selection.
4.4 Comparison with static subsample of cases
One question raised by Ferguson et al. (2019) is whether down-sampled lexicase selection's method of randomly replacing the subsampled training cases each generation is beneficial, or whether a static subsample of training cases would be just as good. To examine this question, we performed a set of runs that uses lexicase selection with a static, randomly subsampled set of 10 training cases that do not change during evolution; these runs use an increased maximum number of generations, as down-sampled lexicase selection does. Since each problem uses a different number of training cases (100 or 200 for most benchmark problems), this is not exactly equal to any one subsampling level, but is often equal to a subsampling level of 0.1 or 0.05. We compare down-sampled lexicase selection with a subsampling level of 0.1 to lexicase selection using a static set of 10 cases in Table 6. Down-sampled lexicase performed significantly better on 11 of the 12 problems tested. This gives strong evidence for the importance of randomly changing the subsample each generation, matching the conclusion of Ferguson et al. (2019).
5 Hypotheses for Down-sampled Lexicase Selection's Performance
All of our results point to the considerable benefits of down-sampled lexicase selection compared to standard lexicase selection. Additional evidence comes from a recent benchmarking of parent selection techniques for program synthesis, which found
down-sampled lexicase selection to perform best out of a field of 21 parent selection
techniques (Helmuth & Abdelhady, 2020). We therefore turn to the question of what
makes down-sampled lexicase selection better than other parent selection methods. In
this section, we present three distinct hypotheses examining the origins of the benefits bestowed by down-sampled lexicase selection, and conduct experiments to provide
evidence for or against these hypotheses.
5.1 Hypothesis: Depth of Search
It seems clear that a primary (and possibly the only) benefit of down-sampled lexicase selection is that it allows GP to consider more individuals (i.e., points in the search space) within the same budget of program executions. Ferguson et al. (2019) argue in particular that "deeper evolutionary searches" (i.e., a larger maximum number of generations, leading to longer evolutionary lineages) are responsible for the improvements in performance; we call this the generations hypothesis. We present a competing hypothesis, the search space hypothesis: that down-sampled lexicase selection's better performance is simply due to evaluating a larger number of individuals, and is not related to the depth of the search.
To test these hypotheses, we devised an experiment in which we use down-sampled lexicase selection, but instead of increasing the maximum number of generations per run, we increase the population size while maintaining a fixed number of program executions. For example, with a subsampling level of 0.25, we increase the population size by 4 times, from 1000 to 4000. This experiment has GP evaluate the same number of points in the search space as when using increased maximum generations, but does not allow for longer evolutionary lineages than standard lexicase selection, as each run is limited to 300 generations. We test three representative subsampling levels with increased population size, and compare them to the equivalent subsampling levels with increased maximum generations, using the same data as Table 3.
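The bookkeeping behind this design can be checked directly. Assuming 100 training cases (typical for these benchmarks), the three configurations below spend the same total budget of program executions:

```python
def program_executions(pop_size, cases_per_gen, max_generations):
    """Total evaluation budget: every generation, each of pop_size programs
    is run on each of that generation's training cases."""
    return pop_size * cases_per_gen * max_generations

# Standard lexicase: population 1000, all 100 cases, 300 generations.
baseline = program_executions(1000, 100, 300)

# Down-sampling at level 0.25 uses 25 cases per generation; the freed 4x
# budget can buy either more generations or a larger population.
more_generations = program_executions(1000, 25, 1200)   # 300 / 0.25 generations
larger_population = program_executions(4000, 25, 300)   # 4 x 1000 individuals

assert baseline == more_generations == larger_population
```

Both arms therefore evaluate the same number of points in the search space; only the shape of the search (long and narrow versus short and wide) differs.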
We present results using down-sampled lexicase selection with increased population sizes in Table 7, comparing results at the same subsampling level between increased generations and increased population sizes. Out of the 36 comparisons, 2 sets of runs were significantly better with increased population, and 4 were significantly worse. The mean success rates across problems are comparable to those with increased generations. We additionally present the average ranking of the 6 down-sampled lexicase selection methods (3 that increase population size and 3 that increase maximum generations) across 10 of the problems, excluding the easy problems Mirror Image and Smallest, for which differences reflect only minor variation in generalization rate. The average ranks are all quite close to the overall average rank of 3.5, with increased population having a slightly better average rank across the three subsampling levels, 3.3 versus 3.8.
We take these results as evidence against the generations hypothesis, in that increasing population size while fixing the maximum number of generations produces
very similar performance to increasing generations. These results give credence to the
Table 7: Number of successes out of 100 GP runs of down-sampled lexicase selection at three different subsampling levels. This compares increasing population size to increasing maximum generations, with the latter being identical to the data in Table 3. Results marked with * are significantly better than the corresponding results at the same subsampling level using a chi-squared test. The "Mean Rank" row gives the average rank of each of the six treatments, so that ranks vary from 1 to 6.

                               Population            Generations
Problem                        0.05   0.1    0.25    0.05   0.1    0.25
Compare String Lengths         48     32     42      38     25     51
Double Letters                 53     42     35      87*    72*    50*
Last Index of Zero             76     72     77      72     68     65
Mirror Image                   100    100    100     100    99     99
Negative To Zero               86     86     91      84     86     82
Replace Space With Newline     99     100    95      99     96     100
Scrabble Score                 18     50*    64*     18     19     31
Smallest                       99     100    100     100    99     98
String Lengths Backwards       100    100    98      99     96     95
Syllables                      24     55     76      48*    61     64
Vector Average                 100    93     99      100    98     97
X-Word Lines                   94     96     84      98     95     91

Mean                           74.7   77.2   80.1    78.6   76.2   76.9
Mean Rank                      3.2    3.4    3.2     3.3    4.0    4.0
search space hypothesis, that we only need down-sampled lexicase selection to increase the number of individuals evaluated during evolution, whether that increase comes from a larger population size or from more generations. While these conclusions reflect the general results, there are some interesting problem-specific trends to note in Table 7. Increasing generations produced significantly better results on the Double Letters problem at all three subsampling levels, and the inverse was true on Scrabble Score for two of the three subsampling levels. Keeping this in mind, we recommend spending the bonus program evaluations allowed by down-sampling on increasing either the maximum generations or the population size, as both lead to similarly good performance; the choice between the two may come down to other factors within the GP system or to a particular problem.
5.2 Hypothesis: Changing Environment
One interesting aspect of down-sampled lexicase selection is that it changes the set of
subsampled training cases every generation. If we think of the set of training cases as
the challenges encountered by each individual, this corresponds to an environment that
changes over time, requiring the evolving population to adapt to new circumstances
(i.e. cases). In contrast, with a fixed set of training cases, lexicase selection provides a
static environment, though one in which individuals encounter challenges in a different order for each selection. Changing environments often have interesting effects on
Table 8: Number of successes out of 100 GP runs of down-sampled lexicase and truncated lexicase selection, both at the 0.1 level, and both over 3000 generations. Results marked with * are significantly better using a chi-squared test.

Problem          Down-sampled   Truncated
Double Letters   72             69
Scrabble Score   19             90*
Vector Average   98             100
evolutionary dynamics (Levins, 1968), and empirical studies of evolving populations
of Saccharomyces cerevisiae yeast (Boyer et al., 2021), logic functions (Kashtan et al.,
2007), and digital organisms (Nahum et al., 2017; Canino-Koning et al., 2019) have
demonstrated that the speed and effectiveness of adaptive evolution can be affected,
and in some cases enhanced, by environmental variation. This led us to ask whether
environmental variation might be responsible for the benefits of down-sampled lexicase selection. Here we explore the hypothesis that down-sampled lexicase selection
changes the evolutionary dynamics in a positive way beyond increasing the number of
individuals that are evaluated.
To test this hypothesis, we designed an experiment that uses a static set of training cases, as with lexicase selection, but has each selection use only a subsample of those cases, as with down-sampled lexicase selection. In particular, we use truncated lexicase selection, which evaluates every individual on every training case each generation, but cuts off each lexicase selection after using a fixed number of cases (Spector et al., 2017). In our experiment, we compare down-sampled lexicase selection at the 0.1 subsampling level with truncated lexicase selection also using only 10% of the cases for each selection. The main difference between the two is that, across all selections, truncated lexicase selection uses every training case each generation, whereas down-sampled lexicase selection uses the same subsample for every selection.[4]
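The contrast can be sketched as follows; this is an illustrative Python version of truncated lexicase selection, not the actual PushGP implementation. Each selection draws its own random case ordering but stops after a fixed number of cases:

```python
import random

def truncated_lexicase_select(population, errors, num_cases, depth):
    """One truncated lexicase selection: filter with at most `depth` cases
    from a fresh random ordering, then break any remaining tie randomly.

    errors[i][j] is individual i's error on case j (lower is better). Unlike
    down-sampling, errors must cover every training case, and different
    selections in the same generation see different case orderings.
    """
    candidates = list(range(len(population)))
    ordering = random.sample(range(num_cases), num_cases)
    for case in ordering[:depth]:
        best = min(errors[i][case] for i in candidates)
        candidates = [i for i in candidates if errors[i][case] == best]
        if len(candidates) == 1:
            break
    return population[random.choice(candidates)]
```

With 100 training cases and a depth of 10, each selection uses at most 10% of the cases, matching the 0.1 subsampling level it is compared against, while the population is still evaluated on all 100 cases.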
In our experiment, we run both down-sampled lexicase and truncated lexicase selection for 3000 generations. As truncated lexicase selection requires every individual to be evaluated on every training case each generation, this is not a fair comparison in terms of total program executions, but it is not meant to be. If the "changing environments" hypothesis holds, then down-sampled lexicase selection should produce better results than truncated lexicase selection, since its environment changes each generation while truncated lexicase selection's does not. We chose three problems for which down-sampled lexicase selection performed much better than standard lexicase selection over 300 generations, ensuring that there is room for truncated lexicase selection to perform worse than down-sampled lexicase selection.
Table 8 presents the number of successful runs of down-sampled lexicase selection and truncated lexicase selection with a maximum of 3000 generations. Over these three problems, truncated lexicase selection performed significantly better than down-sampled lexicase selection on the Scrabble Score problem, and very similarly on the other two problems. So, not only was down-sampled lexicase selection not better, it was a bit worse. This gives some evidence against the hypothesis that the "changing
[4] Ferguson et al. (2019) also conduct an experiment comparing truncated lexicase selection to down-sampled lexicase selection, but to address a different question; we see no contradiction between their results and the ones we present here.
environment” of down-sampled lexicase selection contributes to its success, though we
admit that there may be other beneficial evolutionary dynamics at play not captured by
this experiment. We also want to emphasize that this experiment does not suggest that
truncated lexicase selection should be preferred over down-sampled lexicase selection,
or even standard lexicase selection for that matter; truncated lexicase selection used
10 times as many program executions in these runs as down-sampled lexicase selection,
meaning they are not being compared on a level playing field.
5.3 Hypothesis: Better Generalization
As discussed in the Related Work section above, down-sampling has been used (without lexicase selection) in both GP and machine learning more broadly as a method
to combat overfitting and increase the generalization of solutions. There is plenty of
room for improvement in generalization on some of our benchmark problems, with 6
problems having generalization rates below 0.7 when using lexicase selection. Does
down-sampling improve generalization when using lexicase selection?
All of our successful run counts above include only generalizing solutions that pass a large set of random, unseen test cases. To calculate the generalization rate for each set of runs, we take the proportion of programs that pass the training set that also pass the test set. For the extended set of 26 benchmark problems presented in Table 4, we present the generalization rate for each problem in Table 9. Even though there are some minor differences in generalization between lexicase and down-sampled lexicase selection, none of them are significantly different using a chi-squared test. Problems that appear to have a large gap between the two, such as For Loop Index and Super Anagrams, do not have enough solutions to show significance.
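Concretely, the generalization rate can be computed as follows (the counts here are hypothetical, for illustration only):

```python
def generalization_rate(train_solutions, generalizing_solutions):
    """Proportion of programs that pass the full training set which also
    pass the large set of unseen test cases. Undefined when no program
    passes the training set."""
    if train_solutions == 0:
        return None
    return generalizing_solutions / train_solutions

# e.g., if 80 runs produce a program passing the training set and 56 of
# those programs also pass the test set, the generalization rate is 0.70.
assert generalization_rate(80, 56) == 0.70
```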
At this point we have no evidence to suggest that down-sampling improves lexicase selection’s generalization rate. In fact, down-sampled lexicase selection displays
poor generalization on many of the same problems that lexicase selection does. Thus
we cannot attribute the improved performance of down-sampled lexicase selection to
avoiding overfitting and improving generalization.
6 Conclusions
In this paper we have shed more light on the performance and mechanisms of down-sampled lexicase selection. We conducted the most extensive benchmarking of down-sampled lexicase selection to date, finding that it performs well across a large range of benchmark problems and subsampling levels. We describe some of the drawbacks of using very low subsampling levels, despite their ability to produce competitive problem-solving performance. We find that it is important to randomly change the subsample of training cases every generation, as a subsampling method that uses a static set of cases throughout evolution performed much worse than down-sampled lexicase selection.
We then considered the hypothesis that down-sampled lexicase selection performs
well because of its ability to search for more generations, leading to deeper evolutionary lineages. Our experiment that makes use of down-sampled lexicase selection’s
extra program executions to increase the population size rather than extending evolutionary time provides evidence against this hypothesis, since approximately the same
benefit is obtained with larger populations as with more generations. We also examine
Table 9: Comparing generalization rates of lexicase selection and down-sampled lexicase selection with a subsampling level of 0.25. These generalization rates are for the success rates in Table 4. None of the differences in generalization were significant.

Problem                      Down-sampled
Checksum                     1.00
Compare String Lengths       0.61
Count Odds                   1.00
Digits                       0.60
Double Letters               0.98
Even Squares                 1.00
For Loop Index               1.00
Grade                        1.00
Last Index of Zero           0.66
Median                       0.69
Mirror Image                 0.99
Negative To Zero             0.83
Number IO                    0.99
Pig Latin                    1.00
Replace Space with Newline   1.00
Scrabble Score               0.42
Small Or Large               0.98
Smallest                     1.00
String Differences           1.00
String Lengths Backwards     1.00
Sum of Squares               1.00
Super Anagrams               0.96
Syllables                    1.00
Vector Average               0.95
Vectors Summed               0.98
X-Word Lines                 1.00

Lexicase column (same problem order, omitting five problems for which lexicase produced no solutions): 0.49, 1.00, 0.66, 0.95, 0.67, 0.67, 0.57, 1.00, 0.84, 0.98, 1.00, 0.93, 0.32, 1.00, 1.00, 1.00, 0.80, 0.97, 1.00, 0.92, 1.00.
the hypothesis that down-sampled lexicase selection's changing of training cases every generation acts like an environment changing over evolutionary time, contributing to its success. Our experiment using truncated lexicase selection provides evidence against this hypothesis, though other environmental effects could be at play. A third experiment showed that down-sampled lexicase selection does not produce better generalization rates of solution programs compared to lexicase selection, despite this being a benefit of down-sampling in other machine learning systems. These experiments lead us to believe that the primary cause of down-sampled lexicase selection's success is that it allows evolution to consider more programs within the same budget of program executions.
This work and that of Ferguson et al. (2019) and Hernandez et al. (2019) use problems from the same general program synthesis benchmark suite. We would certainly
like to see similar experiments performed in other problem domains, where training
set subsampling has been used previously, but not to our knowledge in conjunction
with lexicase selection.
This research points to the importance of maximizing the number of points in the search space (individuals) that genetic programming considers throughout evolution. In this paper we push down-sampled lexicase selection's ability to increase the number of individuals considered to the extreme, finding that at the 0.01 and 0.02 subsampling levels, problem-solving performance remains surprisingly good, while real running time suffers. We would be interested to see what effects such low subsampling levels have on population dynamics such as diversity, considering that they allow lexicase to select only a tiny fraction of the individuals in the population. Other methods that increase the number of individuals considered by genetic programming without sacrificing information about individuals' performances (or even ones that do sacrifice some information, as in down-sampled lexicase selection) could provide additional benefits. Exploring this avenue illuminated by down-sampled lexicase selection may yield other techniques that, possibly in combination with down-sampled lexicase selection, could continue to drive the field forward.
7 Acknowledgements
We thank Emily Dolson, Amr Abdelhady, and the Hampshire College Computational
Intelligence Lab for discussions that improved this work. This material is based upon
work supported by the National Science Foundation under Grant No. 1617087. Any
opinions, findings, and conclusions or recommendations expressed in this publication
are those of the authors and do not necessarily reflect the views of the National Science
Foundation.
References
Aenugu, S., & Spector, L. (2019). Lexicase selection in learning classifier systems. In Proceedings of the Genetic and Evolutionary Computation Conference, (pp. 356–364).
Boyer, S., Hérissant, L., & Sherlock, G. (2021). Adaptation is influenced by the complexity
of environmental change during evolution in a dynamic environment. PLoS Genet, 17 , 1.
URL https://doi.org/10.1371/journal.pgen.1009314
Canino-Koning, R., Wiser, M. J., & Ofria, C. (2019). Fluctuating environments select for
short-term phenotypic variation leading to long-term exploration. PLoS Comput Biol , 15 ,
4.
URL https://doi.org/10.1371/journal.pcbi.1006445
Cully, A. (2019). Autonomous skill discovery with Quality-Diversity and Unsupervised Descriptors. In GECCO ’19: Proceedings of the Genetic and Evolutionary Computation
Conference Companion. Prague, Czech Republic: ACM.
Cully, A., & Demiris, Y. (2018). Quality and Diversity Optimization: A Unifying Modular
Framework. IEEE Transactions on Evolutionary Computation, 22 (2), 245–259.
Curry, R., & Heywood, M. I. (2004). Towards efficient training on large datasets for genetic programming. In 17th Conference of the Canadian Society for Computational Studies of Intelligence, vol. 3060 of LNAI, (pp. 161–174). London, Ontario, Canada: Springer-Verlag.
URL http://users.cs.dal.ca/~mheywood/X-files/Publications/robert-CaAI04.pdf
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6 (2), 182–197.
Ferguson, A. J., Hernandez, J. G., Junghans, D., Lalejini, A., Dolson, E., & Ofria, C. (2019).
Characterizing the effects of random subsampling and dilution on lexicase selection. In
W. Banzhaf, E. Goodman, L. Sheneman, L. Trujillo, & B. Worzel (Eds.) Genetic Programming Theory and Practice XVII , (pp. 1–23). East Lansing, MI, USA: Springer.
Forstenlechner, S., Fagan, D., Nicolau, M., & O’Neill, M. (2017). A grammar design pattern
for arbitrary program synthesis problems in genetic programming. In EuroGP 2017: Proceedings of the 20th European Conference on Genetic Programming, vol. 10196 of LNCS ,
(pp. 262–277). Amsterdam: Springer Verlag.
Gathercole, C., & Ross, P. (1994). Dynamic training subset selection for supervised learning
in genetic programming. In Parallel Problem Solving from Nature III , vol. 866 of LNCS ,
(pp. 312–321). Jerusalem: Springer-Verlag.
URL http://citeseer.ist.psu.edu/gathercole94dynamic.html
Goncalves, I., & Silva, S. (2013). Balancing learning and overfitting in genetic programming
with interleaved sampling of training data. In Proceedings of the 16th European Conference
on Genetic Programming, EuroGP 2013 , vol. 7831 of LNCS , (pp. 73–84). Vienna, Austria:
Springer Verlag.
Helmuth, T., & Abdelhady, A. (2020). Benchmarking parent selection for program synthesis by genetic programming. In GECCO '20: Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion. ACM.
Helmuth, T., & Kelly, P. (2021). PSB2: The second program synthesis benchmark suite.
In 2021 Genetic and Evolutionary Computation Conference, GECCO ’21. Lille, France:
ACM.
Helmuth, T., McPhee, N. F., Pantridge, E., & Spector, L. (2017). Improving generalization of
evolved programs through automatic simplification. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’17, (pp. 937–944). Berlin, Germany: ACM.
URL http://doi.acm.org/10.1145/3071178.3071330
Helmuth, T., McPhee, N. F., & Spector, L. (2016). The impact of hyperselection on lexicase
selection. In T. Friedrich (Ed.) GECCO ’16: Proceedings of the 2016 Annual Conference
on Genetic and Evolutionary Computation, (pp. 717–724). Denver, USA: ACM.
Helmuth, T., McPhee, N. F., & Spector, L. (2018). Program synthesis using uniform mutation
by addition and deletion. In Proceedings of the Genetic and Evolutionary Computation
Conference, GECCO ’18, (pp. 1127–1134). Kyoto, Japan: ACM.
URL http://doi.acm.org/10.1145/3205455.3205603
Helmuth, T., & Spector, L. (2015). General program synthesis benchmark suite. In GECCO
’15: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, (pp. 1039–1046). Madrid, Spain: ACM.
URL http://doi.acm.org/10.1145/2739480.2754769
Helmuth, T., & Spector, L. (2020). Explaining and exploiting the advantages of down-sampled
lexicase selection. In Artificial Life Conference Proceedings, (pp. 341–349). MIT Press.
URL https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_00334
Helmuth, T., Spector, L., & Matheson, J. (2015). Solving uncompromising problems with
lexicase selection. IEEE Transactions on Evolutionary Computation, 19 (5), 630–643.
URL http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6920034
Hernandez, J. G., Lalejini, A., Dolson, E., & Ofria, C. (2019). Random subsampling improves
performance in lexicase selection. In GECCO ’19: Proceedings of the Genetic and Evolutionary Computation Conference Companion, (pp. 2028–2031). Prague, Czech Republic:
ACM.
Hmida, H., Ben Hamida, S., Borgi, A., & Rukoz, M. (2016). Sampling methods in genetic
programming learners from large datasets: A comparative study. In INNS Conference on
Big Data, vol. 529 of Advances in Intelligent Systems and Computing, (pp. 50–60).
Kashtan, N., Noor, E., & Alon, U. (2007). Varying environments can speed up evolution.
Proceedings of the National Academy of Sciences, 104 (34), 13711–13716.
URL https://www.pnas.org/content/104/34/13711
Kleinberg, R., Li, Y., & Yuan, Y. (2018). An alternative view: When does SGD escape local minima? In Proceedings of the 35th International Conference on Machine Learning, ICML 2018.
Kotanchek, M., Smits, G., & Vladislavleva, E. (2006). Pursuing the Pareto paradigm: Tournaments, algorithm variations & ordinal optimization. In R. L. Riolo, T. Soule, & B. Worzel (Eds.) Genetic Programming Theory and Practice IV, vol. 5 of Genetic and Evolutionary Computation, (pp. 167–185). Ann Arbor: Springer.
Kotanchek, M., Smits, G., & Vladislavleva, E. (2008). Exploiting trustable models via Pareto GP for targeted data collection. In R. L. Riolo, T. Soule, & B. Worzel (Eds.) Genetic Programming Theory and Practice VI, Genetic and Evolutionary Computation, chap. 10, (pp. 145–163). Ann Arbor: Springer.
Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of
Natural Selection. Cambridge, MA, USA: MIT Press.
URL http://mitpress.mit.edu/books/genetic-programming
La Cava, W., Helmuth, T., Spector, L., & Moore, J. H. (2018). A probabilistic and multiobjective analysis of lexicase selection and epsilon-lexicase selection. Evolutionary Computation.
Levins, R. (1968). Evolution in Changing Environments: Some Theoretical Explorations.
Monographs in Population Biology. Princeton University Press.
URL https://books.google.com/books?id=ZSVJ8pA1RFIC
Liskowski, P., Krawiec, K., Helmuth, T., & Spector, L. (2015). Comparison of semantic-aware
selection methods in genetic programming. In GECCO 2015 Semantic Methods in Genetic
Programming (SMGP’15) Workshop, (pp. 1301–1307). Madrid, Spain: ACM.
URL http://doi.acm.org/10.1145/2739482.2768505
Martinez, Y., Naredo, E., Trujillo, L., Legrand, P., & Lopez, U. (2017). A comparison
of fitness-case sampling methods for genetic programming. Journal of Experimental &
Theoretical Artificial Intelligence, 29 (6), 1203–1224.
Metevier, B., Saini, A. K., & Spector, L. (2019). Lexicase selection beyond genetic programming. In Genetic Programming Theory and Practice XVI , (pp. 123–136). Cham: Springer
International Publishing.
URL https://doi.org/10.1007/978-3-030-04735-1_7
Moore, J. M., & Stanton, A. (2017). Lexicase selection outperforms previous strategies for
incremental evolution of virtual creature controllers. Proceedings of the European Conference on Artificial Life, (pp. 290–297).
URL https://www.mitpressjournals.org/doi/abs/10.1162/ecal_a_0050_14
Moore, J. M., & Stanton, A. (2018). Tiebreaks and diversity: Isolating effects in lexicase
selection. The 2018 Conference on Artificial Life, (pp. 590–597).
URL https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_00109
Moore, J. M., & Stanton, A. (2019). The limits of lexicase selection in an evolutionary robotics
task. In The 2019 Conference on Artificial Life, (pp. 551–558). MIT Press.
Moore, J. M., & Stanton, A. (2020). When specialists transition to generalists: Evolutionary
pressure in lexicase selection. In Artificial Life Conference Proceedings, (pp. 719–726).
MIT Press.
Mouret, J.-B., & Clune, J. (2015). Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909.
Nahum, J. R., West, J., Althouse, B. M., Zaman, L., Ofria, C., & Kerr, B. (2017). Improved adaptation in exogenously and endogenously changing environments. In ECAL 2017: The Fourteenth European Conference on Artificial Life, (pp. 306–313).
URL https://doi.org/10.1162/isal_a_052
Oksanen, K., & Hu, T. (2017). Lexicase selection promotes effective search and behavioural
diversity of solutions in linear genetic programming. In J. A. Lozano (Ed.) 2017 IEEE
Congress on Evolutionary Computation (CEC), (pp. 169–176). Donostia, San Sebastian,
Spain: IEEE.
Orzechowski, P., La Cava, W., & Moore, J. H. (2018). Where are we now? A large benchmark study of recent symbolic regression methods. In Proceedings of the 2018 Genetic and Evolutionary Computation Conference, GECCO '18.
URL http://arxiv.org/abs/1804.09331
Schmidt, M., & Lipson, H. (2010a). Age-fitness pareto optimization. In R. Riolo, T. McConaghy, & E. Vladislavleva (Eds.) Genetic Programming Theory and Practice VIII ,
vol. 8 of Genetic and Evolutionary Computation, chap. 8, (pp. 129–146). Ann Arbor, USA:
Springer.
URL http://www.springer.com/computer/ai/book/978-1-4419-7746-5
Schmidt, M. D., & Lipson, H. (2006). Co-evolving fitness predictors for accelerating and
reducing evaluations. In R. L. Riolo, T. Soule, & B. Worzel (Eds.) Genetic Programming
Theory and Practice IV , vol. 5 of Genetic and Evolutionary Computation, (pp. 113–130).
Ann Arbor: Springer.
Schmidt, M. D., & Lipson, H. (2008). Coevolution of fitness predictors. IEEE Transactions
on Evolutionary Computation, 12 (6), 736–749.
Schmidt, M. D., & Lipson, H. (2010b). Predicting solution rank to improve performance.
In GECCO ’10: Proceedings of the 12th annual conference on Genetic and evolutionary
computation, (pp. 949–956). Portland, Oregon, USA: ACM.
Spector, L. (2012). Assessment of problem modality by differential performance of lexicase
selection in genetic programming: A preliminary report. In K. McClymont, & E. Keedwell
(Eds.) 1st workshop on Understanding Problems (GECCO-UP), (pp. 401–408). Philadelphia, Pennsylvania, USA: ACM.
URL http://hampshire.edu/lspector/pubs/wk09p4-spector.pdf
Spector, L., Klein, J., & Keijzer, M. (2005). The Push3 execution stack and the evolution of control. In GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, vol. 2, (pp. 1689–1696). Washington DC, USA: ACM Press.
URL http://www.cs.bham.ac.uk/~wbl/biblio/gecco2005/docs/p1689.pdf
Spector, L., La Cava, W., Shanabrook, S., Helmuth, T., & Pantridge, E. (2017). Relaxations
of lexicase parent selection. In Genetic Programming Theory and Practice XV , Genetic
and Evolutionary Computation, (pp. 105–120). University of Michigan in Ann Arbor, USA:
Springer.
URL https://link.springer.com/chapter/10.1007/978-3-319-90512-9_7
Spector, L., McPhee, N. F., Helmuth, T., Casale, M. M., & Oks, J. (2016). Evolution evolves
with autoconstruction. In GECCO ’16 Companion: Proceedings of the Companion Publication of the 2016 Annual Conference on Genetic and Evolutionary Computation, (pp.
1349–1356). Denver, Colorado, USA: ACM.
Spector, L., & Robinson, A. (2002). Genetic programming and autoconstructive evolution
with the push programming language. Genetic Programming and Evolvable Machines,
3 (1), 7–40.
URL http://hampshire.edu/lspector/pubs/push-gpem-final.pdf
Vassiliades, V., Chatzilygeroudis, K., & Mouret, J. B. (2018). Using Centroidal Voronoi
Tessellations to Scale Up the Multidimensional Archive of Phenotypic Elites Algorithm.
IEEE Transactions on Evolutionary Computation, 22 (4), 623–630.
Zhang, B.-T., & Joung, J.-G. (1999). Genetic programming with incremental data inheritance.
In Proceedings of the Genetic and Evolutionary Computation Conference, vol. 2, (pp. 1217–
1224). Orlando, Florida, USA: Morgan Kaufmann.
URL http://gpbib.cs.ucl.ac.uk/gecco1999/GP-460.pdf