arXiv:2106.06085v1 [cs.NE] 10 Jun 2021

Problem-solving benefits of down-sampled lexicase selection

Thomas Helmuth
Hamilton College, Clinton, NY 13323
thelmuth@hamilton.edu

Lee Spector
Amherst College, Amherst, MA 01002
Hampshire College, Amherst, MA 01002
University of Massachusetts, Amherst, MA 01003
lspector@amherst.edu

June 14, 2021

Abstract

In genetic programming, an evolutionary method for producing computer programs that solve specified computational problems, parent selection is ordinarily based on aggregate measures of performance across an entire training set. Lexicase selection, by contrast, selects on the basis of performance on random sequences of training cases; this has been shown to enhance problem-solving power in many circumstances. Lexicase selection can also be seen as better reflecting biological evolution, by modeling sequences of challenges that organisms face over their lifetimes. Recent work has demonstrated that the advantages of lexicase selection can be amplified by down-sampling, meaning that only a random subsample of the training cases is used each generation. This can be seen as modeling the fact that individual organisms encounter only subsets of the possible environments, and that environments change over time. Here we provide the most extensive benchmarking of down-sampled lexicase selection to date, showing that its benefits hold up to increased scrutiny. The reasons that down-sampling helps, however, are not yet fully understood. Hypotheses include that down-sampling allows for more generations to be processed with the same budget of program evaluations; that the variation of training data across generations acts as a changing environment, encouraging adaptation; or that it reduces overfitting, leading to more general solutions.
We systematically evaluate these hypotheses, finding evidence against all three, and instead draw the conclusion that down-sampled lexicase selection's main benefit stems from the fact that it allows the evolutionary process to examine more individuals within the same computational budget, even though each individual is examined less completely.

Keywords— genetic programming, parent selection, lexicase selection, down-sampled lexicase selection, program synthesis

1 Introduction

Genetic programming is an evolutionary method for producing computer programs that solve specified computational problems (Koza, 1992). When used as a supervised learning technique, genetic programming defines a problem's specifications by a set of training cases. It then judges the ability of evolved programs to solve the problem by running each program on each training case, and measuring the distance between the program's output and the desired output. Genetic programming uses these error values during parent selection to determine which individuals in the population it selects to reproduce, and how many children they will produce.

The interaction between a program and the training cases is analogous to the interaction between a biological organism and the challenges presented by its environment. Organisms that are better equipped to handle these challenges have better reproductive success, and in genetic programming the programs that produce outputs closer to the desired outputs should produce more children.

Many parent selection methods have been developed for genetic programming, and they vary in the ways that they model the interactions that biological organisms have with their environments. In most, the performance of a program on all of the training cases is aggregated into a single value, referred to as a fitness measure or total error, and the probability that a program will produce offspring is partially or entirely determined by this aggregate value.
Even multi-objective optimization methods, which select on the basis of multiple objectives, nonetheless generally aggregate performance across training cases into one objective (Deb et al., 2002; Kotanchek et al., 2006, 2008; Schmidt & Lipson, 2010a). Similarly, recently developed quality diversity algorithms (Cully & Demiris, 2018; Cully, 2019) such as MAP-Elites (Mouret & Clune, 2015; Vassiliades et al., 2018) use aggregate fitness as part of the basis for selection.

The aggregation of performance is akin to exposing all organisms to all challenges that they could possibly face, and allowing those that perform best on average to produce more children. In biology, by contrast, each organism may face different challenges, and it will produce offspring if it survives the challenges that it happens to face before it has the opportunity to reproduce.

The lexicase parent selection method differs from most other parent selection methods in that it avoids the aggregation of performance on different training cases into a single value (Spector, 2012; Helmuth et al., 2015). Instead, it filters individuals by performance on training cases that are presented in different random orders for each parent selection event, with the result that different parents will be selected on the basis of good performance on different sequences of training cases. Additionally, children in the next generation will face different randomly shuffled cases than their parents did. For these reasons, lexicase selection can be thought of as more faithfully modeling interactions between biological organisms and their environments.

Hernandez et al. (2019) recently proposed two methods for subsampling the training set each generation when using lexicase selection, which were further studied by Ferguson et al. (2019). Down-sampled lexicase selection uses a different random subsample of cases each generation.
Cohort lexicase selection groups individuals into cohorts, and exposes each cohort to a different random subsample of the training cases. Both methods effectively change the environment from generation to generation by exposing individuals to different training cases. Crucially, both methods reduce the amount of computational effort required to evaluate each individual, since they run each program only on a subsample of the training cases. These computational savings can be recouped by evaluating more individuals throughout evolution. Results from Hernandez et al. (2019) and Ferguson et al. (2019) indicate that both of these methods improve problem-solving performance compared to standard lexicase selection.

In this paper we concentrate on down-sampled lexicase selection, as it is simpler in concept and implementation, and both Hernandez et al. (2019) and Ferguson et al. (2019) found its benefits to be comparable to those of cohort lexicase selection. We first conduct a more expansive benchmarking of down-sampled lexicase selection than has been conducted previously, using more benchmark problems and subsample sizes. These results confirm earlier findings that down-sampled lexicase selection produces substantial improvements over lexicase selection, and that it is robust to a range of subsample sizes.

We then turn to developing a better understanding of why down-sampled lexicase selection performs so well. One hypothesis put forward by Ferguson et al. (2019) is that down-sampled lexicase selection's success hinges on it enabling deeper evolutionary searches, running for more generations given the same computational effort. We compare this hypothesis to the hypothesis that simply evaluating more individuals in the search space is more important than deeper evolution specifically.
We conduct experiments using increased maximum generations and increased population sizes (with non-increased generations), and find that they perform commensurately, indicating that deeper evolutionary lineages are not crucial to down-sampled lexicase selection's success.

We then examine the idea that by randomly down-sampling, we change the environment encountered by individuals each generation. In biology, many theorists believe that changing environments play an important role in evolutionary adaptation and speciation (Levins, 1968). We hypothesize that changing the training cases on which down-sampled lexicase selection evaluates individuals each generation contributes to the evolvability of the system, resulting in improved performance. We test this hypothesis with an experiment that mimics down-sampled lexicase selection, except that it uses different training cases in every selection, meaning that every training case gains exposure each generation. The results of this experiment provide evidence against our hypothesis that "changing environments" are important for down-sampled lexicase selection.

One area where down-sampling (without lexicase selection) has proven useful is in avoiding overfitting and improving generalization, both in GP and in machine learning more generally. We explore the hypothesis that down-sampled lexicase selection's improved performance is driven by better generalization, and find that it does not hold up to the results of our experiments.

This article extends a preliminary report that was presented at the 2020 Artificial Life conference (Helmuth & Spector, 2020). Aside from general improvements to the clarity and completeness of the presentation in the conference paper, this article covers experiments involving more subsampling levels and more benchmark problems, with both of these extensions producing significant new results.
One key area we explore is the use of extremely small subsampled sets of training cases, which results in surprisingly good performance, though with some notable drawbacks.

Our presentation below continues as follows: We first discuss lexicase selection and subsampling of training cases in more detail. Once we have covered these fundamental algorithms, we describe our experimental methods and present our benchmark results. We then address each of the above-described hypotheses in turn, and conclude with our interpretation of the results and suggestions for future work.

2 Related Work

Unlike many evolutionary computation parent selection methods, lexicase selection does not aggregate the performance of an individual into a single fitness value (Helmuth et al., 2015). Instead, it considers each training case separately, never conflating the results on different cases. We give pseudocode for the lexicase selection algorithm in Algorithm 1.

Algorithm 1: Lexicase Selection (to select a parent)
    Inputs: candidates, the entire population;
            cases, a list of training cases
    Shuffle cases into a random order
    loop
        Set first to be the first case in cases
        Set best to be the best performance of any individual in candidates
            on the first training case
        Set candidates to be the subset of candidates that have exactly
            best performance on first
        if |candidates| = 1 then
            Return the only individual in candidates
        end if
        if |cases| = 1 then
            Return a randomly selected individual from candidates
        end if
        Remove the first case from cases
    end loop

After randomly shuffling the training cases, lexicase selection goes through them one by one, removing any individuals that do not give the best performance on each case, until either a single individual or a single case remains.
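For concreteness, the procedure in Algorithm 1 can be sketched in Python. This is an illustrative sketch, not the paper's Clojush implementation; the representation of individuals as keys into an `errors` mapping (individual to per-case error, lower being better) is an assumption made for the example.

```python
import random

def lexicase_select(population, errors, num_cases, rng=random):
    """Select one parent, following Algorithm 1. errors[ind][c] is the
    error of individual ind on training case c (lower is better)."""
    cases = list(range(num_cases))
    rng.shuffle(cases)                      # random order of training cases
    candidates = list(population)
    for case in cases:
        best = min(errors[ind][case] for ind in candidates)
        candidates = [ind for ind in candidates if errors[ind][case] == best]
        if len(candidates) == 1:            # a single survivor is the parent
            return candidates[0]
    return rng.choice(candidates)           # tied on every case: pick randomly
```

Note that exhausting the case list and then choosing randomly among the survivors is equivalent to Algorithm 1's final-case rule; in PushGP the individuals would be programs with cached error vectors rather than opaque keys.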
Lexicase selection has produced better performance than other parent selection methods in a variety of evolutionary computation systems and problem domains (Helmuth et al., 2015; Helmuth & Spector, 2015; La Cava et al., 2018; Orzechowski et al., 2018; Forstenlechner et al., 2017; Liskowski et al., 2015; Oksanen & Hu, 2017; Moore & Stanton, 2017, 2018, 2019, 2020; Aenugu & Spector, 2019; Metevier et al., 2019).

Hernandez et al. (2019) introduced down-sampled lexicase selection, a variant of lexicase selection that was developed further by Ferguson et al. (2019). Down-sampled lexicase selection aims to reduce the number of program executions used to evaluate each individual by running each program on only a random subsample of the overall set of training cases, which is resampled each generation. This method reduces the per-individual computational effort, which can either be saved for decreased runtimes, or can be allocated in other ways, such as increases in population size or maximum number of generations. In order to compare with methods that do not subsample the training cases, we take the latter approach, always comparing methods equitably by limiting their total program executions per GP run.

Others have used subsampling of training data in GP, for reducing computation per individual or for improving generalization. To our knowledge, the only other work that has combined subsampling with lexicase selection, besides Hernandez et al. (2019) and Ferguson et al. (2019), is in evolutionary robotics, where subsampling is necessary for improving runtimes because of slow simulation speeds, though this research did not include comparisons with non-subsampled methods (Moore & Stanton, 2017, 2018).
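The per-generation resampling at the heart of down-sampled lexicase selection is simple to express. The sketch below is illustrative; the function name and the rounding rule are our assumptions, not taken from the implementations cited above.

```python
import random

def downsample(all_cases, level, rng=random):
    """Draw this generation's random subsample of training cases.
    Every individual is evaluated (and selected) only on these cases,
    and a fresh subsample is drawn at the start of the next generation."""
    k = max(1, round(level * len(all_cases)))   # always keep at least one case
    return rng.sample(all_cases, k)

# E.g. a 100-case training set at subsampling level 0.25:
cases = list(range(100))
sub = downsample(cases, 0.25, random.Random(0))   # 25 of the 100 cases
```

Because each program now runs on only `level` times as many cases per generation, the saved executions can be spent on more generations (the approach taken in this paper) or a larger population.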
Outside of lexicase selection, subsampling has been used largely to reduce the computational load of evaluating each individual, especially when considering large datasets (Hmida et al., 2016; Martinez et al., 2017; Curry & Heywood, 2004; Gathercole & Ross, 1994; Zhang & Joung, 1999). Others have proposed subsampling as a technique to reduce overfitting and improve generalization (Goncalves & Silva, 2013; Martinez et al., 2017; Schmidt & Lipson, 2006, 2008, 2010b). Additionally, subsampling data is common in machine learning for similar reasons (often in the form of mini-batches), as in stochastic gradient descent for improving generalization (Kleinberg et al., 2018).

The work we present here, along with that of Hernandez et al. (2019), Ferguson et al. (2019), and Moore & Stanton (2017), is novel both in its application of subsampling when using lexicase selection and in its application of subsampling to an already relatively small set of training data. To the latter point, many previous applications of subsampling aim to reduce a large set of example data (thousands or millions of cases) to a manageable size, say hundreds of cases. In our case, we start with a set of about 100-200 cases, and subsample to a set of 50 or fewer.

When using a small set of n training cases, lexicase selection can select parents with at most n! different error vectors, since this is the number of different shufflings of the cases. When n is as small as 4 or 5, this limits selection to a small portion of the population, and often an even smaller portion in practice. Lexicase selection typically requires a minimum of 8 to 10 cases to produce performance benefits, though others have successfully used it with as few as 4 cases (Moore & Stanton, 2017). With this in mind, it is not self-evident whether lexicase selection can maintain empirical benefits such as increased population diversity and problem-solving performance with so few cases.
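The n! bound can be illustrated directly by enumerating every ordering of the cases and recording which error vectors can survive the lexicase filter. This is a toy sketch with assumed error vectors, not code from any of the systems discussed:

```python
from itertools import permutations

def winners_under_all_orderings(error_vectors, num_cases):
    """Run the lexicase filter under every possible case ordering and
    collect the error vectors that can survive to be selected."""
    selected = set()
    for order in permutations(range(num_cases)):
        candidates = list(error_vectors)
        for case in order:
            best = min(v[case] for v in candidates)
            candidates = [v for v in candidates if v[case] == best]
        selected.update(candidates)        # any remaining vector can win
    return selected

# With n = 2 cases there are only 2! = 2 orderings, so no more than two
# distinct error vectors are ever selectable, regardless of population size.
vectors = [(0, 3), (3, 0), (1, 1), (2, 2)]
assert winners_under_all_orderings(vectors, 2) == {(0, 3), (3, 0)}
```

Note that only the two extreme specialists are selectable here; the balanced individuals (1, 1) and (2, 2) can never be chosen, which is one way very small subsamples can constrain selection.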
3 Experimental Methods

To explore the effects of down-sampled lexicase selection, we use benchmark problems from the domain of automatic program synthesis, which previous studies of down-sampled lexicase selection have used (Hernandez et al., 2019; Ferguson et al., 2019). In particular, we use problems from the "General Program Synthesis Benchmark Suite" (Helmuth & Spector, 2015), which require solution programs to manipulate a variety of data types and control flow structures. These problems originate from introductory computer science textbooks, allowing us to test the ability of evolution to perform the same types of programming we expect humans to perform. We use a core set of 12 problems with a range of difficulties and requirements for many of our experiments, and expand that set to 26 problems (all of the problems from the suite that have been solved by at least one program synthesis system) for one experiment. We additionally compare down-sampled lexicase selection to standard lexicase selection on the 25 problems of PSB2, the second iteration of general program synthesis benchmark problems (Helmuth & Kelly, 2021).1

Table 1: Full training set size and program execution limit for each problem.

Problems                                                Training Set Size   Executions
Number IO                                                              25    7,500,000
Sum Of Squares                                                         50   15,000,000
Compare String Lengths, Digits, Double Letters,
  Even Squares, For Loop Index, Median, Mirror Image,
  Replace Space With Newline, Smallest, Small Or Large,
  String Lengths Backwards, Syllables                                 100   30,000,000
Last Index of Zero, Vectors Summed, X-Word Lines                      150   45,000,000
Count Odds, Grade, Negative to Zero, Pig Latin,
  Scrabble Score, String Differences, Super Anagrams                  200   60,000,000
Vector Average                                                        250   75,000,000
Checksum                                                              300   90,000,000

As in Helmuth & Spector (2015), we define each problem's specifications as a set of input/output examples, so that GP has no knowledge of the underlying problems besides these examples.2 For each problem we use a small set of training cases to evaluate each individual: between 25 and 300 cases per run (see Table 1), and 200 cases for every problem in PSB2. We use a larger set of unseen test cases to determine whether an evolved program that passes all of the training cases generalizes to unseen data. Before testing a potential solution for generalization, we use an automatic simplification procedure that has been shown to improve generalization (Helmuth et al., 2017); finding a simplified program that passes all of the unseen test cases is considered a successful GP run. We test the significance of differences in numbers of successes between sets of runs using a chi-square test with a 0.05 significance level, using Holm's correction for multiple comparisons whenever there are more than two methods run on a single problem in one experiment.

When a run using down-sampled lexicase selection finds a program that passes all of the subsampled training cases, we do not immediately terminate the run. Instead, we run the program on the full training set (using it as a validation set), and terminate the run if the program passes all of those cases.
If it does not, we continue to the next generation, as the individual (or its children) may not pass some of the cases in the newly subsampled set of cases. As we will detail in Section 4.2, with an extremely low subsampling level that leaves the subsampled training set with 1 or 2 cases, it is easier for GP to generate individuals that perfectly pass those cases without passing the full training set; with enough of these individuals, the process of verifying that they pass the full training set may dominate the running time of evolution. Note that if only a single individual passes all cases in the subsampled training set but evolution continues, it will receive every single parent selection in that generation. These hyperselection events (Helmuth et al., 2016) may have strong effects on population diversity, a potential avenue for future study.

1 More information can be found at the benchmark suite's website: https://cs.hamilton.edu/~thelmuth/PSB2/PSB2.html
2 Datasets for these problems can be found at https://git.io/fjPeh

Table 2: PushGP system parameters.

Parameter                                           Value
population size                                     1000
max generations for runs using full training set    300
genetic operator                                    UMAD
UMAD addition rate                                  0.09

We evolve programs with the PushGP genetic programming system, which uses programs represented in the Push programming language (Spector et al., 2005; Spector & Robinson, 2002). Push was designed with genetic programming in mind, in particular to enable autoconstruction, in which evolving programs not only need to try to solve a problem, but are also run to produce their children (Spector & Robinson, 2002; Spector et al., 2016). Push programs utilize a handful of typed stacks, from which instructions pop their arguments and to which instructions push their results. Push programs can be any hierarchically nested list of instructions and literals, the latter of which the interpreter pushes onto the relevant stack.
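A minimal sketch may help convey this evaluation model. The following toy interpreter supports only integer and boolean stacks and two instructions; the real Push language and its implementations are far richer, and the instruction names here are merely modeled on Push's conventions.

```python
def push_eval(program, stacks=None):
    """Toy Push-style evaluation: literals go onto the stack matching
    their type, instructions pop arguments and push results, and nested
    lists are processed recursively in order. As in Push, an instruction
    without enough arguments on its stacks acts as a no-op."""
    if stacks is None:
        stacks = {"integer": [], "boolean": []}
    for item in program:
        if isinstance(item, list):        # hierarchically nested sub-program
            push_eval(item, stacks)
        elif isinstance(item, bool):      # check bool first: bool is an int subtype
            stacks["boolean"].append(item)
        elif isinstance(item, int):
            stacks["integer"].append(item)
        elif item == "integer_add" and len(stacks["integer"]) >= 2:
            b, a = stacks["integer"].pop(), stacks["integer"].pop()
            stacks["integer"].append(a + b)
        elif item == "integer_gt" and len(stacks["integer"]) >= 2:
            b, a = stacks["integer"].pop(), stacks["integer"].pop()
            stacks["boolean"].append(a > b)
    return stacks

# (2 + 3) > 4 leaves True on the boolean stack:
result = push_eval([2, [3, "integer_add"], 4, "integer_gt"])
```

A program's output for a training case is then read off the stack of the appropriate type, which is how error vectors are obtained.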
We use Clojush, the Clojure implementation of PushGP, for our experiments.3 We present the PushGP system parameters used in our experiments in Table 2. Our only genetic operator, uniform mutation with additions and deletions (UMAD), adds random genes before each gene in a parent's genome at the UMAD addition rate, and then deletes random genes at a rate that keeps genome size neutral on average. We use UMAD to produce 100% of the children, instead of also using a crossover operator, since thus far it has produced the best results of any operator tested on these problems (Helmuth et al., 2018).

Each problem in the benchmark suite prescribes a number of training cases to use (Helmuth & Spector, 2015). In our default configuration, we run every individual on every training case, meaning the total number of program executions allowed in one GP run is the number of training cases multiplied by the population size and the number of generations. Since our down-sampled lexicase selection experiments use fewer cases to evaluate each individual, we limit our GP runs by a program execution limit, as given in Table 1, to ensure that each method receives equal training time.

3 https://github.com/lspector/Clojush

4 Benchmarking Down-sampled Lexicase Selection

In the work introducing down-sampled lexicase selection, experiments benchmarked down-sampled lexicase selection with subsampling levels of 0.05, 0.1, 0.25, and 0.5 on five program synthesis problems (Hernandez et al., 2019; Ferguson et al., 2019). We expand on those benchmarks by testing 3 additional subsampling levels, 0.01, 0.02, and 0.175, with the first two explicitly trying to gauge how low the subsampling rate can go before having deleterious effects. Our experiments increase the number of benchmark problems to 12, and additionally test the subsampling level of 0.25 on 39 other program synthesis benchmark problems to broaden our assessment. As described above, our experiments use PushGP, showing that the benefits of down-sampling generalize beyond the linear GP system used in the initial experiments (Hernandez et al., 2019; Ferguson et al., 2019).

4.1 Subsampling Levels

Table 3 presents the success rates for down-sampled lexicase selection using seven different subsampling levels across twelve representative benchmark problems, along with the mean number of successes. The last column, 1.0, performs no down-sampling, and therefore represents standard lexicase selection.

Table 3: Number of successes out of 100 GP runs of down-sampled lexicase selection with proportional increases in maximum generations per run across seven different subsampling levels, as well as 1.0, which is equivalent to standard lexicase selection. The mean rank calculates the average rank of each method among all methods across the problems, excluding Mirror Image and Smallest, easy problems where results differ only in random changes in solution generalization. For 6 sets of runs at subsampling levels 0.01 and 0.02, we were not able to finish all 100 runs, as described in Section 4.2; the number of finished runs is given after the /.

                              Subsampling Level
Problem               0.01   0.02   0.05   0.1   0.175   0.25   0.5    1.0
CSL                    0/4  48/97     38    25      60     51    40     32
Double Letters        5/42     85     87    72      55     50    29     19
LIOZ                    94     90     72    68      61     65    63     62
Mirror Image           100    100    100    99      99     99   100    100
Negative To Zero        96     84     84    86      86     82    78     80
RSWN                   100    100     99    96      97    100    93     87
Scrabble Score        1/98      7     18    19      24     31    28     13
Smallest               100    100    100    99     100     98   100    100
SLB                    100    100     99    96      96     95    94     94
Syllables            11/60     47     48    61      68     64    54     38
Vector Average         100    100    100    98      99     97    95     88
X-Word Lines         25/60     96     98    95      94     91    86     61
Mean                  61.0   79.8   78.6   76.2    78.3   76.9  71.7   64.5
Mean Rank              4.8    3.2    3.4    4.2     3.7    4.0   5.8    7.1
For these runs, we proportionally increase the maximum number of generations that evolution can run in order to keep a constant number of program executions; for example, while standard lexicase selection runs for at most 300 generations, the runs with a subsampling level of 0.02 run for at most 1/0.02 = 50 times as many, at 15000 generations. For each problem, we calculate the rank of each subsampling level, and average those to calculate the mean rank, where lower values are better. Six sets of runs (five at a subsampling level of 0.01 and one at a level of 0.02) were not able to complete in a reasonable amount of time, as discussed in Section 4.2.

The subsampling level of 0.02 performed the best on average, propelled by its significantly better results on the difficult Double Letters and Last Index of Zero problems. However, every subsampling level performed well, and all performed considerably better than standard lexicase selection (i.e. a subsampling level of 1.0). The level of 0.5 performed worst of the subsampling levels, likely because it only runs for twice as many generations as standard lexicase selection, where the other subsampling levels run longer.

It is surprising that the subsampling level of 0.02 performed best, as it only uses 2 training cases per generation for seven of the problems, significantly limiting the information contained in the errors on which lexicase selection bases selection. In fact, with only 2 training cases, lexicase selection can only select individuals with 2 different error vectors, corresponding to the 2 possible orderings of the cases! Even so, this extreme constraint on selection introduced by down-sampling seems to be largely outweighed by increasing the maximum number of generations manyfold. Even though a subsampling level of 0.02 performed best, subsampling levels 0.02, 0.05, 0.1, 0.175, and 0.25 performed nearly identically, showing that down-sampled lexicase selection is robust to a wide variety of subsampling levels across an order of magnitude.
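The arithmetic behind this proportional increase can be made explicit. The sketch below uses the parameter values reported in Tables 1 and 2 (population 1000, 300 generations, 100 training cases for many problems); the helper names are ours, for illustration only.

```python
def max_generations(subsample_level, full_gens=300):
    # Each generation now costs subsample_level times as many program
    # executions, so the generation limit scales by 1 / subsample_level.
    return round(full_gens / subsample_level)

def program_executions(pop_size, gens, cases_per_gen):
    return pop_size * gens * cases_per_gen

# A 100-case problem with population size 1000 (Tables 1 and 2):
full = program_executions(1000, 300, 100)           # the 30,000,000 of Table 1
down = program_executions(1000, max_generations(0.02),
                          round(0.02 * 100))        # 2 cases per generation
assert max_generations(0.02) == 15000
assert full == down == 30_000_000
```

Holding total program executions fixed in this way is what makes the comparisons between subsampling levels equitable.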
4.2 Lower bounds of subsampling level

Since down-sampled lexicase selection performs well at quite low levels of subsampling, are there any drawbacks? Is there a lower bound to the benefits of subsampling? First, we will examine the results at our lowest subsampling level, 0.01. We see that it performed excellently on 7 out of 12 problems, including four where it operated on a single training case (Mirror Image, Replace Space with Newline, String Lengths Backwards, and Smallest) and three others with only 2 or 3 cases. These results include producing the absolute best results on two problems, Last Index of Zero and Negative to Zero. However, it gave polarized performance, producing the worst results on the remaining 5 problems. For two of these problems (Scrabble Score and Syllables), there is a clear trend toward worse performance at the lowest subsampling levels, but for the other three, down-sampled lexicase selection performs well even at the 0.02 subsampling level. We interpret these findings to suggest that, at least for some problems, 1 to 3 cases do not provide sufficient information to drive evolution toward solutions, likely resulting in a catastrophic lack of diversity, thrashing of the population between trying to solve different cases, or other detriments.

Beyond these problem-solving performance considerations, using extremely low subsampling levels results in other unwanted behaviors of the GP system. Typically in GP, we consider program executions to be the time-limiting factor of running GP, and therefore tune our experiments to use the same number of program executions regardless of down-sampling. However, as we proportionally increase the maximum number of generations to make up for fewer program executions per generation, the remaining components of the GP system (such as genetic operators and data logging) take up a larger proportion of the running time in practice.
Additionally, if we run evolution for many generations (for example, 100 times as many with a subsampling level of 0.01), we will require that many times more hard drive space to log data from runs. Similar issues exist with reserving sufficient RAM when increasing the population size instead of the maximum generations.

Table 4: Number of successful runs comparing lexicase selection to down-sampled lexicase selection with a subsampling level of 0.25 on 26 benchmark problems. Underlined values indicate significant improvement of down-sampled lexicase over lexicase using a chi-squared test. Lexicase was never significantly better than down-sampled lexicase. "Problems solved" counts the number of problems each method solved at least once.

Problem               Down-sampled   Lexicase
Checksum                        18          1
CSL                             51         32
Count Odds                      11          8
Digits                          28         19
Double Letters                  50         19
Even Squares                     2          0
For Loop Index                   5          2
Grade                            2          0
Last Index of Zero              65         62
Median                          69         55
Mirror Image                    99        100
Negative To Zero                82         80
Number IO                       99         98
Pig Latin                        0          0
RSWN                           100         87
Scrabble Score                  31         13
Small Or Large                  22          7
Smallest                        98        100
String Differences               1          0
SLB                             95         94
Sum of Squares                  25         21
Super Anagrams                   4          4
Syllables                       64         38
Vector Average                  97         88
Vectors Summed                  21         11
X-Word Lines                    91         61
problems solved                 25         22

In Table 3, six of our sets of runs at low subsampling levels were not able to finish all 100 runs in a reasonable amount of time, and were cut off before finishing. Some of the extreme length of these runs is likely attributable to the effects discussed in the previous paragraph. However, a subtler and potentially more harmful effect is at play as well. As described in Section 3, when GP finds a program that passes all of the subsampled training cases, we must test it on the remaining training cases before calling it a potential solution and halting evolution; if it does not pass all training cases, evolution continues.
With extremely small subsampled sets, it becomes easier for evolution to find (many) individuals that pass all of the subsampled data, requiring us to fully evaluate those individuals, which many times do not pass the full training set. This problem is compounded for problems that have Boolean outputs (such as Compare String Lengths), since even if the entire population chooses between True and False randomly, if there is only 1 case in the subsampled set, half of the population will answer that case correctly and need to be evaluated on every training case every generation, negating the benefits of quick evaluation per generation. This certainly impacted the low number of finished runs of the Compare String Lengths problem at the 0.01 subsampling level, and likely contributed to unfinished runs on other problems at that level. With these drawbacks in mind, we see subsampling levels between 0.05 and 0.25 producing good compromises between problem solving performance and real running times. In the following section, we benchmark down-sampled lexicase selection using a subsampling level of 0.25, though we expect the results would look similar at a variety of subsampling levels. 4.3 Expanding benchmarking of down-sampled lexicase selection to more problems After extensively testing a variety of subsampling levels on 12 benchmark problems, we want to exhibit its performance on a larger set of benchmark problems. We only had the computational resources to test one subsampling level on this larger set of problems, and chose 0.25. While the subsampling level of 0.25 did not produce the best results in Table 3, it performed almost as well as any level, and was less computationally demanding than much lower subsampling levels for the reasons discussed in Section 4.2. Table 4 compares standard lexicase selection (i.e. 
the column 1.0 in Table 3) to down-sampled lexicase selection with a subsampling level of 0.25 on 26 benchmark problems from Helmuth & Spector (2015), including the 12 from Table 3. Down-sampled lexicase selection produced significantly more successful runs than lexicase selection on 9 out of the 26 problems. It additionally found solutions to 3 of the problems that lexicase selection never solved, and had fewer successes on only two of the problems, neither of which were significantly different. Table 5 continues the comparison from Table 4 on 25 new problems from PSB2 (Helmuth & Kelly, 2021). These problems were designed to be a step more difficult than those from Helmuth & Spector (2015), and show lower success rates for both standard lexicase selection and down-sampled lexicase selection. However, down-sampled lexicase selection continues to clearly outperform standard lexicase selection, solving 4 problems that standard lexicase never solved, and performing significantly better on 8 of the problems. In fact, down-sampled lexicase never produced fewer solutions than standard lexicase on any of the 25 problems. This expanded benchmarking confirms previous findings that down-sampled lexicase selection creates great improvements in performance compared to lexicase selection.

Table 5: Number of successful runs comparing lexicase selection to down-sampled lexicase selection with a subsampling level of 0.25 on the 25 new benchmark problems of PSB2. Underlined values indicate significant improvement of down-sampled lexicase over lexicase using a chi-squared test. Lexicase was never significantly better than down-sampled lexicase. “problems solved” counts the number of problems each method solved at least once.

Problem                 Down-sampled   Lexicase
Basement                      2            1
Bouncing Balls                3            0
Bowling                       0            0
Camel Case                    4            1
Coin Sums                    39            2
Cut Vector                    0            0
Dice Game                     1            0
Find Pair                    20            4
Fizz Buzz                    74           25
Fuel Cost                    67           50
GCD                          20            8
Indices of Substring          4            0
Leaders                       0            0
Luhn                          0            0
Mastermind                    0            0
Middle Character             79           57
Paired Digits                17            8
Shopping List                 0            0
Snow Day                      7            4
Solve Boolean                 5            5
Spin Words                    0            0
Square Digits                 2            0
Substitution Cipher          86           61
Twitter                      52           31
Vector Distance               0            0
problems solved              17           13

4.4 Comparison with static subsample of cases

One question raised by Ferguson et al. (2019) is whether down-sampled lexicase selection’s method of randomly replacing the subsampled training cases each generation is beneficial, or if a static subsample of training cases would be just as good. To examine this question, we performed a set of runs that uses lexicase selection with a static, randomly subsampled set of 10 training cases that do not change during evolution; these runs use an increased number of maximum generations, as with down-sampled lexicase selection.

Table 6: Number of successful runs comparing down-sampled lexicase at a 0.1 subsampling level (DS 0.1) to using lexicase selection with a static set of 10 random training cases, which do not change during evolution. Underlined successes are significantly better using a chi-squared test.

Problem                       DS 0.1   Static
Compare String Lengths          25        0
Double Letters                  72        4
Last Index of Zero              68        7
Mirror Image                    99       13
Negative To Zero                86       31
Replace Space with Newline      96       57
Scrabble Score                  19       13
Smallest                        99       40
String Lengths Backwards        96       35
Syllables                       61        9
Vector Average                  98       71
X-Word Lines                    95       35
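The two conditions differ only in whether the subsample is drawn once or redrawn every generation. This difference can be sketched as a schedule of training-case indices (a minimal Python illustration with invented names, not the implementation used in our runs):

```python
import random

def make_case_schedule(n_cases, k, generations, static=False):
    """Yield the k training-case indices used in each generation.

    static=True fixes one random subsample for the whole run (the control
    condition); static=False redraws the subsample each generation, as in
    down-sampled lexicase selection.
    """
    fixed = random.sample(range(n_cases), k)
    for _ in range(generations):
        yield fixed if static else random.sample(range(n_cases), k)
```

Under the static condition, every generation sees the same 10 cases; under down-sampling, the population must keep performing well as the cases rotate.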
Since each problem uses a different number of training cases (100 or 200 for most benchmark problems), this is not equal in number to any one subsampling level, but is often equal to a subsampling level of 0.1 or 0.05. We compare down-sampled lexicase selection with a subsampling level of 0.1 to lexicase selection using a static set of 10 cases in Table 6. Down-sampled lexicase performed significantly better on 11 of the 12 problems tested. This gives strong evidence for the importance of randomly changing the subsample each generation, matching the conclusion of Ferguson et al. (2019).

5 Hypotheses for Down-sampled Lexicase Selection’s Performance

All of our results point to the considerable benefits of down-sampled lexicase selection compared to standard lexicase selection. Additional evidence comes from a recent benchmarking of parent selection techniques for program synthesis, which found down-sampled lexicase selection to perform best out of a field of 21 parent selection techniques (Helmuth & Abdelhady, 2020). We therefore turn to the question of what makes down-sampled lexicase selection better than other parent selection methods. In this section, we present three distinct hypotheses examining the origins of the benefits bestowed by down-sampled lexicase selection, and conduct experiments to provide evidence for or against these hypotheses.

5.1 Hypothesis: Depth of Search

It seems clear that a primary (and possibly the only) benefit of down-sampled lexicase selection is that it allows GP to consider more individuals (i.e. points in the search space) within the same budget of program executions. Ferguson et al. (2019) argue in particular that “deeper evolutionary searches”, i.e. having a larger maximum number of generations and thus longer lineages of evolution, are responsible for the improvements in performance; we call this the generations hypothesis.
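The significance comparisons throughout this section are chi-squared tests on success counts out of 100 runs. For a 2×2 table of successes and failures, the statistic can be computed directly; the snippet below is a self-contained illustration (without a continuity correction, which the tests in this paper may or may not apply):

```python
def chi2_2x2(s1, f1, s2, f2):
    """Chi-squared statistic for a 2x2 contingency table of
    successes/failures under two treatments (no continuity correction)."""
    n = s1 + f1 + s2 + f2
    num = n * (s1 * f2 - f1 * s2) ** 2
    den = (s1 + f1) * (s2 + f2) * (s1 + s2) * (f1 + f2)
    return num / den

# Down-sampled lexicase (72/100) vs. the static subsample (4/100)
# on Double Letters, from Table 6:
stat = chi2_2x2(72, 28, 4, 96)
print(round(stat, 2))  # 98.13, far above the 3.84 cutoff for p = 0.05 at 1 d.f.
```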
We present a competing hypothesis, the search space hypothesis: down-sampled lexicase selection’s better performance is simply due to evaluating a larger number of individuals, and is not related to the depth of the search. To test these hypotheses, we devised an experiment in which we use down-sampled lexicase selection, but instead of increasing the maximum number of generations per run, we increase the population size while maintaining a fixed number of program executions. For example, with a subsampling level of 0.25, we increase the population size by a factor of 4, from 1000 to 4000. This experiment has GP evaluate the same number of points in the search space as using an increased maximum generations, but does not allow for longer evolutionary lineages than standard lexicase selection, as each run is limited to 300 generations. We test three representative subsampling levels for increased population size, and compare them to the equivalent subsampling levels with increased maximum generations, using the same data as Table 3. We present results using down-sampled lexicase selection with increased population sizes in Table 7. We compare results at the same subsampling level between increased generations and increased population sizes. Out of the 36 comparisons, 2 sets of runs were significantly better with increased population, and 4 were significantly worse. The mean success rates across problems are comparable to those with increased generations. We additionally present the average ranking of the 6 down-sampled lexicase selection methods (3 that increase population size and 3 that increase maximum generations) across 10 of the problems, excluding the easy problems Mirror Image and Smallest, for which the results reflect only minor differences in generalization rate. The average ranks are all quite close to the overall average rank of 3.5, with increased population having a slightly better average rank across the three subsampling levels, 3.3 versus 3.8.
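The bookkeeping behind this experimental design is simple: down-sampling at level d cuts the per-generation evaluation cost by a factor of 1/d, and that factor can be spent either on generations or on population size. A small hypothetical helper (invented for illustration) makes the two equivalent settings explicit:

```python
def equivalent_settings(pop, gens, level):
    """Two ways to spend the evaluations freed by down-sampling at `level`,
    keeping the total program-execution budget of a full-sampling run."""
    factor = round(1 / level)
    return {"more_generations": (pop, gens * factor),
            "bigger_population": (pop * factor, gens)}

print(equivalent_settings(1000, 300, 0.25))
# {'more_generations': (1000, 1200), 'bigger_population': (4000, 300)}
```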
We take these results as evidence against the generations hypothesis, in that increasing population size while fixing the maximum number of generations produces very similar performance to increasing generations. These results give credence to the search space hypothesis: we only need down-sampled lexicase selection to increase the number of individuals evaluated during evolution, whether that increase comes from a larger population size or from more generations. While these conclusions reflect the general results, there are some interesting problem-specific trends to note in Table 7. Increasing generations produced significantly better results on the Last Index of Zero problem at all three subsampling levels, and the inverse was true on Scrabble Score for two of the three subsampling levels.

Table 7: Number of successes out of 100 GP runs of down-sampled lexicase selection at three different subsampling levels. This compares increasing population size to increasing maximum generations, with the latter being identical to the data in Table 3. Underlined results are significantly better than the corresponding results at the same subsampling level using a chi-squared test. The “Mean Rank” gives the average rank of each of the six treatments, so that ranks vary from 1 to 6.

                                  Population           Generations
Problem                        0.05   0.1   0.25    0.05   0.1   0.25
Compare String Lengths           48    32    42       38    25    51
Double Letters                   53    42    35       87    72    50
Last Index of Zero               76    72    77       72    68    65
Mirror Image                    100   100   100      100    99    99
Negative To Zero                 86    86    91       84    86    82
Replace Space With Newline       99   100    95       99    96   100
Scrabble Score                   18    50    64       18    19    31
Smallest                         99   100   100      100    99    98
String Lengths Backwards        100   100    98       99    96    95
Syllables                        24    55    76       48    61    64
Vector Average                  100    93    99      100    98    97
X-Word Lines                     94    96    84       98    95    91
Mean                           74.7  77.2  80.1     78.6  76.2  76.9
Mean Rank                       3.2   3.4   3.2      3.3   4.0   4.0
Keeping this in mind, we recommend spending the bonus program evaluations allowed by down-sampling on either increased maximum generations or increased population size, as both lead to similarly good performance; the choice between the two may come down to other factors within the GP system or to a particular problem.

5.2 Hypothesis: Changing Environment

One interesting aspect of down-sampled lexicase selection is that it changes the set of subsampled training cases every generation. If we think of the set of training cases as the challenges encountered by each individual, this corresponds to an environment that changes over time, requiring the evolving population to adapt to new circumstances (i.e. cases). In contrast, with a fixed set of training cases, lexicase selection provides a static environment, though one in which individuals encounter challenges in a different order for each selection. Changing environments often have interesting effects on evolutionary dynamics (Levins, 1968), and empirical studies of evolving populations of Saccharomyces cerevisiae yeast (Boyer et al., 2021), logic functions (Kashtan et al., 2007), and digital organisms (Nahum et al., 2017; Canino-Koning et al., 2019) have demonstrated that the speed and effectiveness of adaptive evolution can be affected, and in some cases enhanced, by environmental variation. This led us to ask whether environmental variation might be responsible for the benefits of down-sampled lexicase selection. Here we explore the hypothesis that down-sampled lexicase selection changes the evolutionary dynamics in a positive way beyond increasing the number of individuals that are evaluated.

Table 8: Number of successes out of 100 GP runs of down-sampled lexicase and truncated lexicase selections, both at the 0.1 level, and both over 3000 generations. Underlined results are significantly better using a chi-squared test.

Problem           Down-sampled   Truncated
Double Letters         72            69
Scrabble Score         19            90
Vector Average         98           100
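Concretely, the “environment” here is the per-generation subsample. The following Python sketch of down-sampled lexicase selection (a simplified illustration, not our PushGP implementation) makes the resampling explicit: each generation draws one subsample, which every selection in that generation then uses in its own random order:

```python
import random

def downsample(n_cases, level):
    """Draw the random subsample of training cases used for one generation."""
    k = max(1, round(n_cases * level))
    return random.sample(range(n_cases), k)

def lexicase_select(population, errors, case_indices):
    """Select one parent; errors[i][c] is individual i's error on case c."""
    candidates = list(range(len(population)))
    cases = list(case_indices)
    random.shuffle(cases)  # each selection sees the cases in a fresh order
    for c in cases:
        best = min(errors[i][c] for i in candidates)
        candidates = [i for i in candidates if errors[i][c] == best]
        if len(candidates) == 1:
            break
    return population[random.choice(candidates)]
```

Each generation, `downsample` is called once and its result is passed to every `lexicase_select` call; standard lexicase selection corresponds to passing all case indices instead.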
To test this hypothesis, we designed an experiment that uses a static set of training cases, as in lexicase selection, but has each selection use only a subsample of those cases, as in down-sampled lexicase selection. In particular, we use truncated lexicase selection, which evaluates every individual on every training case each generation, but cuts off each lexicase selection after using a fixed number of cases (Spector et al., 2017). In our experiment, we compare down-sampled lexicase selection at the 0.1 subsampling level with truncated lexicase selection also using only 10% of the cases for each selection. The main difference between the two is that, across all selections, truncated lexicase selection uses every training case each generation, whereas down-sampled lexicase selection uses the same subsample for every selection.4 In our experiment, we run both down-sampled lexicase and truncated lexicase selections for 3000 generations. As truncated lexicase selection requires every individual to be evaluated on every training case each generation, this is not a fair comparison in terms of total program executions, but it is not meant to be. If the “changing environments” hypothesis holds, then down-sampled lexicase selection should produce better results than truncated lexicase selection, since its environment changes each generation while truncated lexicase selection’s does not. We chose three problems for which down-sampled lexicase selection performed much better than standard lexicase selection over 300 generations, ensuring that there is room for truncated lexicase selection to perform worse than down-sampled lexicase selection. Table 8 presents the number of successful runs of down-sampled lexicase selection and truncated lexicase selection with a maximum of 3000 generations. Over these three problems, truncated lexicase selection performed significantly better than down-sampled lexicase selection on the Scrabble Score problem, and very similarly on the other two problems.
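For concreteness, truncated lexicase selection can be sketched in the same style (a simplified Python illustration of the idea from Spector et al. (2017), not the implementation used in these runs): every individual carries errors on the full training set, and each selection considers at most a fixed number of cases from a fresh shuffle:

```python
import random

def truncated_lexicase_select(population, errors, n_cases, depth):
    """Like lexicase selection, but each selection considers at most
    `depth` cases from a fresh shuffle of the full training set."""
    candidates = list(range(len(population)))
    cases = random.sample(range(n_cases), n_cases)  # fresh order per selection
    for c in cases[:depth]:  # truncate: stop after `depth` cases
        best = min(errors[i][c] for i in candidates)
        candidates = [i for i in candidates if errors[i][c] == best]
        if len(candidates) == 1:
            break
    return population[random.choice(candidates)]
```

Across the many selections in one generation, the fresh shuffles collectively touch every training case, whereas down-sampled lexicase selection restricts the whole generation to one subsample; this is the distinction the experiment isolates.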
So, not only was down-sampled lexicase selection not better, it was a bit worse. This gives some evidence against the hypothesis that the “changing environment” of down-sampled lexicase selection contributes to its success, though we admit that there may be other beneficial evolutionary dynamics at play not captured by this experiment. We also want to emphasize that this experiment does not suggest that truncated lexicase selection should be preferred over down-sampled lexicase selection, or even standard lexicase selection for that matter; truncated lexicase selection used 10 times as many program executions in these runs as down-sampled lexicase selection, meaning they are not being compared on a level playing field.

4 Ferguson et al. (2019) also conduct an experiment comparing truncated lexicase selection to down-sampled lexicase selection, but to address a different question; we see no contradiction between their results and the ones we present here.

5.3 Hypothesis: Better Generalization

As discussed in the Related Work section above, down-sampling has been used (without lexicase selection) in both GP and machine learning more broadly as a method to combat overfitting and increase the generalization of solutions. There is plenty of room for improvement in generalization on some of our benchmark problems, with 6 problems having generalization rates below 0.7 when using lexicase selection. Does down-sampling improve generalization when using lexicase selection? All of our successful run counts above include only generalizing solutions that pass a large set of random, unseen test cases. To calculate the generalization rate for each set of runs, we take the proportion of programs that pass the training set that also pass the test set. For the extended set of 26 benchmark problems presented in Table 4, we present the generalization rate for each problem in Table 9.
Even though there are some minor differences in generalization between lexicase and down-sampled lexicase selections, none of the differences are significant under a chi-squared test. Problems that appear to have a large gap between the two, such as For Loop Index and Super Anagrams, do not have enough solutions to show significance. At this point we have no evidence to suggest that down-sampling improves lexicase selection’s generalization rate. In fact, down-sampled lexicase selection displays poor generalization on many of the same problems that lexicase selection does. Thus we cannot attribute the improved performance of down-sampled lexicase selection to avoiding overfitting and improving generalization.

6 Conclusions

In this paper we have shed more light on the performance and mechanisms of down-sampled lexicase selection. We conducted more extensive benchmarking of down-sampled lexicase selection than has been conducted before, finding that it performs well across a large range of benchmark problems and subsampling levels. We describe some of the drawbacks of using very low subsampling levels, despite their ability to produce competitive problem-solving performance. We find that it is important to randomly re-draw the subsampled training cases each generation from the larger training set, as a subsampling method that uses a static set of cases throughout evolution performed much worse than down-sampled lexicase selection. We then considered the hypothesis that down-sampled lexicase selection performs well because of its ability to search for more generations, leading to deeper evolutionary lineages. Our experiment that makes use of down-sampled lexicase selection’s extra program executions to increase the population size rather than extending evolutionary time provides evidence against this hypothesis, since approximately the same benefit is obtained with larger populations as with more generations.
Table 9: Comparing generalization rates of lexicase selection and down-sampled lexicase selection with a subsampling level of 0.25. These generalization rates are for the success rates in Table 4. None of the differences in generalization were significant.

Problem               Down-sampled
Checksum                  1.00
CSL                       0.61
Count Odds                1.00
Digits                    0.60
Double Letters            0.98
Even Squares              1.00
For Loop Index            1.00
Grade                     1.00
Last Index of Zero        0.66
Median                    0.69
Mirror Image              0.99
Negative To Zero          0.83
Number IO                 0.99
Pig Latin                 1.00
RSWN                      1.00
Scrabble Score            0.42
Small Or Large            0.98
Smallest                  1.00
String Differences        1.00
SLB                       1.00
Sum of Squares            1.00
Super Anagrams            0.96
Syllables                 1.00
Vector Average            0.95
Vectors Summed            0.98
X-Word Lines              1.00
Lexicase column, in row order (five problems have no lexicase entry): 0.49, 1.00, 0.66, 0.95, 0.67, 0.67, 0.57, 1.00, 0.84, 0.98, 1.00, 0.93, 0.32, 1.00, 1.00, 1.00, 0.80, 0.97, 1.00, 0.92, 1.00.

We also examine the hypothesis that down-sampled lexicase selection’s changing of training cases every generation acts like an environment changing over evolutionary time, contributing to its success. Our experiment using truncated lexicase selection provides evidence against this hypothesis, though other environmental effects could be at play. A third experiment showed that down-sampled lexicase selection does not produce better generalization rates of solution programs compared to lexicase selection, despite this being a benefit of down-sampling in other machine learning systems. These experiments lead us to believe that the primary cause of down-sampled lexicase selection’s success is that it allows evolution to consider more programs throughout evolution. This work and that of Ferguson et al. (2019) and Hernandez et al. (2019) use problems from the same general program synthesis benchmark suite. We would certainly like to see similar experiments performed in other problem domains, where training set subsampling has been used previously, but not to our knowledge in conjunction with lexicase selection.
This research points to the importance of maximizing the number of points in the search space—individuals—that genetic programming considers throughout evolution. In this paper we push the abilities of down-sampled lexicase selection to increase the number of individuals considered to the extreme, and find that at the 0.01 and 0.02 subsampling levels, problem-solving performance remains surprisingly good, while actual running time suffers. We would be interested to see what effects such low subsampling levels have on population dynamics such as diversity, considering they allow lexicase to select only a tiny fraction of the individuals in the population. Other methods that increase the number of individuals considered by genetic programming without sacrificing information about individuals’ performances (or even ones that do sacrifice some information, as down-sampled lexicase selection does) could provide additional benefits. Exploring the avenue illuminated by down-sampled lexicase selection may yield other techniques that, possibly in combination with down-sampled lexicase selection, could continue to drive the field forward.

7 Acknowledgements

We thank Emily Dolson, Amr Abdelhady, and the Hampshire College Computational Intelligence Lab for discussions that improved this work. This material is based upon work supported by the National Science Foundation under Grant No. 1617087. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

Aenugu, S., & Spector, L. (2019). Lexicase selection in learning classifier systems. In Proceedings of the Genetic and Evolutionary Computation Conference, (pp. 356–364).

Boyer, S., Hérissant, L., & Sherlock, G. (2021). Adaptation is influenced by the complexity of environmental change during evolution in a dynamic environment. PLoS Genet, 17 (1).
URL https://doi.org/10.1371/journal.pgen.1009314

Canino-Koning, R., Wiser, M. J., & Ofria, C. (2019). Fluctuating environments select for short-term phenotypic variation leading to long-term exploration. PLoS Comput Biol, 15 (4).
URL https://doi.org/10.1371/journal.pcbi.1006445

Cully, A. (2019). Autonomous skill discovery with Quality-Diversity and Unsupervised Descriptors. In GECCO ’19: Proceedings of the Genetic and Evolutionary Computation Conference Companion. Prague, Czech Republic: ACM.

Cully, A., & Demiris, Y. (2018). Quality and Diversity Optimization: A Unifying Modular Framework. IEEE Transactions on Evolutionary Computation, 22 (2), 245–259.

Curry, R., & Heywood, M. I. (2004). Towards efficient training on large datasets for genetic programming. In 17th Conference of the Canadian Society for Computational Studies of Intelligence, vol. 3060 of LNAI, (pp. 161–174). London, Ontario, Canada: Springer-Verlag.
URL http://users.cs.dal.ca/~mheywood/X-files/Publications/robert-CaAI04.pdf

Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6 (2), 182–197.

Ferguson, A. J., Hernandez, J. G., Junghans, D., Lalejini, A., Dolson, E., & Ofria, C. (2019). Characterizing the effects of random subsampling and dilution on lexicase selection. In W. Banzhaf, E. Goodman, L. Sheneman, L. Trujillo, & B. Worzel (Eds.) Genetic Programming Theory and Practice XVII, (pp. 1–23). East Lansing, MI, USA: Springer.

Forstenlechner, S., Fagan, D., Nicolau, M., & O’Neill, M. (2017). A grammar design pattern for arbitrary program synthesis problems in genetic programming. In EuroGP 2017: Proceedings of the 20th European Conference on Genetic Programming, vol. 10196 of LNCS, (pp. 262–277). Amsterdam: Springer Verlag.

Gathercole, C., & Ross, P. (1994). Dynamic training subset selection for supervised learning in genetic programming.
In Parallel Problem Solving from Nature III, vol. 866 of LNCS, (pp. 312–321). Jerusalem: Springer-Verlag.
URL http://citeseer.ist.psu.edu/gathercole94dynamic.html

Goncalves, I., & Silva, S. (2013). Balancing learning and overfitting in genetic programming with interleaved sampling of training data. In Proceedings of the 16th European Conference on Genetic Programming, EuroGP 2013, vol. 7831 of LNCS, (pp. 73–84). Vienna, Austria: Springer Verlag.

Helmuth, T., & Abdelhady, A. (2020). Benchmarking parent selection for program synthesis by genetic programming. In GECCO ’20: Proceedings of the 2020 Annual Conference on Genetic and Evolutionary Computation Companion. ACM.

Helmuth, T., & Kelly, P. (2021). PSB2: The second program synthesis benchmark suite. In 2021 Genetic and Evolutionary Computation Conference, GECCO ’21. Lille, France: ACM.

Helmuth, T., McPhee, N. F., Pantridge, E., & Spector, L. (2017). Improving generalization of evolved programs through automatic simplification. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’17, (pp. 937–944). Berlin, Germany: ACM.
URL http://doi.acm.org/10.1145/3071178.3071330

Helmuth, T., McPhee, N. F., & Spector, L. (2016). The impact of hyperselection on lexicase selection. In T. Friedrich (Ed.) GECCO ’16: Proceedings of the 2016 Annual Conference on Genetic and Evolutionary Computation, (pp. 717–724). Denver, USA: ACM.

Helmuth, T., McPhee, N. F., & Spector, L. (2018). Program synthesis using uniform mutation by addition and deletion. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18, (pp. 1127–1134). Kyoto, Japan: ACM.
URL http://doi.acm.org/10.1145/3205455.3205603

Helmuth, T., & Spector, L. (2015). General program synthesis benchmark suite. In GECCO ’15: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, (pp. 1039–1046). Madrid, Spain: ACM.
URL http://doi.acm.org/10.1145/2739480.2754769

Helmuth, T., & Spector, L. (2020).
Explaining and exploiting the advantages of down-sampled lexicase selection. In Artificial Life Conference Proceedings, (pp. 341–349). MIT Press.
URL https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_00334

Helmuth, T., Spector, L., & Matheson, J. (2015). Solving uncompromising problems with lexicase selection. IEEE Transactions on Evolutionary Computation, 19 (5), 630–643.
URL http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6920034

Hernandez, J. G., Lalejini, A., Dolson, E., & Ofria, C. (2019). Random subsampling improves performance in lexicase selection. In GECCO ’19: Proceedings of the Genetic and Evolutionary Computation Conference Companion, (pp. 2028–2031). Prague, Czech Republic: ACM.

Hmida, H., Ben Hamida, S., Borgi, A., & Rukoz, M. (2016). Sampling methods in genetic programming learners from large datasets: A comparative study. In INNS Conference on Big Data, vol. 529 of Advances in Intelligent Systems and Computing, (pp. 50–60).

Kashtan, N., Noor, E., & Alon, U. (2007). Varying environments can speed up evolution. Proceedings of the National Academy of Sciences, 104 (34), 13711–13716.
URL https://www.pnas.org/content/104/34/13711

Kleinberg, R., Li, Y., & Yuan, Y. (2018). An alternative view: When does SGD escape local minima?

Kotanchek, M., Smits, G., & Vladislavleva, E. (2006). Pursuing the pareto paradigm tournaments, algorithm variations & ordinal optimization. In R. L. Riolo, T. Soule, & B. Worzel (Eds.) Genetic Programming Theory and Practice IV, vol. 5 of Genetic and Evolutionary Computation, (pp. 167–185). Ann Arbor: Springer.

Kotanchek, M., Smits, G., & Vladislavleva, E. (2008). Exploiting trustable models via pareto gp for targeted data collection. In R. L. Riolo, T. Soule, & B. Worzel (Eds.) Genetic Programming Theory and Practice VI, Genetic and Evolutionary Computation, chap. 10, (pp. 145–163). Ann Arbor: Springer.

Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means of Natural Selection.
Cambridge, MA, USA: MIT Press.
URL http://mitpress.mit.edu/books/genetic-programming

La Cava, W., Helmuth, T., Spector, L., & Moore, J. H. (2018). A probabilistic and multiobjective analysis of lexicase selection and epsilon-lexicase selection. Evolutionary Computation.

Levins, R. (1968). Evolution in Changing Environments: Some Theoretical Explorations. Monographs in Population Biology. Princeton University Press.
URL https://books.google.com/books?id=ZSVJ8pA1RFIC

Liskowski, P., Krawiec, K., Helmuth, T., & Spector, L. (2015). Comparison of semantic-aware selection methods in genetic programming. In GECCO 2015 Semantic Methods in Genetic Programming (SMGP’15) Workshop, (pp. 1301–1307). Madrid, Spain: ACM.
URL http://doi.acm.org/10.1145/2739482.2768505

Martinez, Y., Naredo, E., Trujillo, L., Legrand, P., & Lopez, U. (2017). A comparison of fitness-case sampling methods for genetic programming. Journal of Experimental & Theoretical Artificial Intelligence, 29 (6), 1203–1224.

Metevier, B., Saini, A. K., & Spector, L. (2019). Lexicase selection beyond genetic programming. In Genetic Programming Theory and Practice XVI, (pp. 123–136). Cham: Springer International Publishing.
URL https://doi.org/10.1007/978-3-030-04735-1_7

Moore, J. M., & Stanton, A. (2017). Lexicase selection outperforms previous strategies for incremental evolution of virtual creature controllers. Proceedings of the European Conference on Artificial Life, (pp. 290–297).
URL https://www.mitpressjournals.org/doi/abs/10.1162/ecal_a_0050_14

Moore, J. M., & Stanton, A. (2018). Tiebreaks and diversity: Isolating effects in lexicase selection. The 2018 Conference on Artificial Life, (pp. 590–597).
URL https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_00109

Moore, J. M., & Stanton, A. (2019). The limits of lexicase selection in an evolutionary robotics task. In The 2019 Conference on Artificial Life, (pp. 551–558). MIT Press.

Moore, J. M., & Stanton, A. (2020).
When specialists transition to generalists: Evolutionary pressure in lexicase selection. In Artificial Life Conference Proceedings, (pp. 719–726). MIT Press.

Mouret, J.-B., & Clune, J. (2015). Illuminating search spaces by mapping elites.

Nahum, J. R., West, J., Althouse, B. M., Zaman, L., Ofria, C., & Kerr, B. (2017). Improved adaptation in exogenously and endogenously changing environments. In ECAL 2017: The Fourteenth European Conference on Artificial Life, (pp. 306–313).
URL https://doi.org/10.1162/isal_a_052

Oksanen, K., & Hu, T. (2017). Lexicase selection promotes effective search and behavioural diversity of solutions in linear genetic programming. In J. A. Lozano (Ed.) 2017 IEEE Congress on Evolutionary Computation (CEC), (pp. 169–176). Donostia, San Sebastian, Spain: IEEE.

Orzechowski, P., La Cava, W., & Moore, J. H. (2018). Where are we now? A large benchmark study of recent symbolic regression methods. In Proceedings of the 2018 Genetic and Evolutionary Computation Conference, GECCO ’18. arXiv:1804.09331.
URL http://arxiv.org/abs/1804.09331

Schmidt, M., & Lipson, H. (2010a). Age-fitness pareto optimization. In R. Riolo, T. McConaghy, & E. Vladislavleva (Eds.) Genetic Programming Theory and Practice VIII, vol. 8 of Genetic and Evolutionary Computation, chap. 8, (pp. 129–146). Ann Arbor, USA: Springer.
URL http://www.springer.com/computer/ai/book/978-1-4419-7746-5

Schmidt, M. D., & Lipson, H. (2006). Co-evolving fitness predictors for accelerating and reducing evaluations. In R. L. Riolo, T. Soule, & B. Worzel (Eds.) Genetic Programming Theory and Practice IV, vol. 5 of Genetic and Evolutionary Computation, (pp. 113–130). Ann Arbor: Springer.

Schmidt, M. D., & Lipson, H. (2008). Coevolution of fitness predictors. IEEE Transactions on Evolutionary Computation, 12 (6), 736–749.

Schmidt, M. D., & Lipson, H. (2010b). Predicting solution rank to improve performance.
In GECCO ’10: Proceedings of the 12th annual conference on Genetic and evolutionary computation, (pp. 949–956). Portland, Oregon, USA: ACM.

Spector, L. (2012). Assessment of problem modality by differential performance of lexicase selection in genetic programming: A preliminary report. In K. McClymont, & E. Keedwell (Eds.) 1st workshop on Understanding Problems (GECCO-UP), (pp. 401–408). Philadelphia, Pennsylvania, USA: ACM.
URL http://hampshire.edu/lspector/pubs/wk09p4-spector.pdf

Spector, L., Klein, J., & Keijzer, M. (2005). The Push3 execution stack and the evolution of control. In GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, vol. 2, (pp. 1689–1696). Washington DC, USA: ACM Press.
URL http://www.cs.bham.ac.uk/~wbl/biblio/gecco2005/docs/p1689.pdf

Spector, L., La Cava, W., Shanabrook, S., Helmuth, T., & Pantridge, E. (2017). Relaxations of lexicase parent selection. In Genetic Programming Theory and Practice XV, Genetic and Evolutionary Computation, (pp. 105–120). University of Michigan in Ann Arbor, USA: Springer.
URL https://link.springer.com/chapter/10.1007/978-3-319-90512-9_7

Spector, L., McPhee, N. F., Helmuth, T., Casale, M. M., & Oks, J. (2016). Evolution evolves with autoconstruction. In GECCO ’16 Companion: Proceedings of the Companion Publication of the 2016 Annual Conference on Genetic and Evolutionary Computation, (pp. 1349–1356). Denver, Colorado, USA: ACM.

Spector, L., & Robinson, A. (2002). Genetic programming and autoconstructive evolution with the push programming language. Genetic Programming and Evolvable Machines, 3 (1), 7–40.
URL http://hampshire.edu/lspector/pubs/push-gpem-final.pdf

Vassiliades, V., Chatzilygeroudis, K., & Mouret, J. B. (2018). Using Centroidal Voronoi Tessellations to Scale Up the Multidimensional Archive of Phenotypic Elites Algorithm. IEEE Transactions on Evolutionary Computation, 22 (4), 623–630.

Zhang, B.-T., & Joung, J.-G. (1999).
Genetic programming with incremental data inheritance. In Proceedings of the Genetic and Evolutionary Computation Conference, vol. 2, (pp. 1217–1224). Orlando, Florida, USA: Morgan Kaufmann.
URL http://gpbib.cs.ucl.ac.uk/gecco1999/GP-460.pdf