John Koza's 24-page paper entitled

"A Response to the ML-95 Paper Entitled 'Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's ' "


Version 1 distributed at the Twelfth International Machine Learning Conference (ML-95) held in Tahoe City, California on July 9-12, 1995. Version 2 distributed at the Sixth International Conference on Genetic Algorithms (ICGA-95) to be held in Pittsburgh on July 15-19, 1995. Version 3 deposited at WWW site on May 26, 1996.


Section 10 of this response is entitled "A Peer Review of the Peer Reviewing Process of the International Machine Learning Conference"


A Response to the ML-95 Paper Entitled "Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's"

John R. Koza
Computer Science Department
Stanford University
Stanford, California 94305
Koza@CS.Stanford.Edu
415-941-0336
http://www-cs-faculty.stanford.edu/~koza/


Abstract

This paper comments on the paper by Kevin Lang entitled "Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's" that appears in the proceedings of the 1995 Twelfth International Machine Learning Conference.

1. Introduction

In a paper entitled "Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's" at the 1995 International Machine Learning Conference, Kevin Lang raises the question of whether the existence of the population of individual entities (a key characteristic of genetic algorithms) actually performs any useful role in searching complicated search spaces. In particular, the paper focuses on searches in the space of computer programs (i.e., compositions of functions and terminals) searched by the technique of genetic programming (GP). The paper says,

"Genetic programming has attracted a large following ... probably because it comes with an appealing story about how the power of evolution is being harnessed by the technique. A crucial part of this story is the idea that the pool of highly fit individuals selected from earlier generations constitutes a valuable genetic resource that facilitates the creation of even more fit individuals in the future. This paper describes an experiment that casts doubt not only on the relative efficiency of genetic programming, but on the validity of the story about the value of the high fitness gene pool."

As an alternative to the population-based search employed in the genetic algorithm, the paper proposes a population-free hill climbing algorithm that goes from a single point in the search space of the problem to a single new point. The proposed hill climbing algorithm does not use (and has no use for) a population.

Specifically, the proposed population-free search algorithm starts with a single random individual and thereafter remembers the single best individual ever seen. On each time step, one candidate new individual is created and compared with the current best individual. If the new candidate is at least as good as the current best individual, the candidate becomes the new best individual.

The candidate is created differently on alternate time steps. On the even-numbered steps (starting with time step 0), the new candidate is a randomly created entirely new individual. On the odd-numbered steps, the new candidate is the offspring produced by "crossing" the current best individual with a randomly created new individual. In this "crossing" operation, a randomly chosen subtree from a specially created new random individual is inserted at a randomly chosen point of the current best individual (thereby replacing the subtree currently located at that point).
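For concreteness, the following short sketch restates the alternating procedure just described. It is illustrative only; the tree representation and the helper functions random_program, fitness, and replace_random_subtree are hypothetical stand-ins, not code from Lang's paper.

def hill_climb(random_program, fitness, replace_random_subtree, num_steps):
    # random_program() -> a freshly created random program tree
    # fitness(tree) -> numeric score (higher is better)
    # replace_random_subtree(a, b) -> copy of tree a with a randomly chosen
    #     subtree replaced by a randomly chosen subtree of tree b
    best = random_program()                 # time step 0: a random individual
    best_fit = fitness(best)
    for step in range(1, num_steps):
        if step % 2 == 1:
            # odd-numbered step: "cross" the champion with a fresh random tree
            candidate = replace_random_subtree(best, random_program())
        else:
            # even-numbered step: an entirely new random individual
            candidate = random_program()
        candidate_fit = fitness(candidate)
        if candidate_fit >= best_fit:       # a tie also dethrones the champion
            best, best_fit = candidate, candidate_fit
    return best, best_fit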

The paper explains,

"The main difference is that our algorithm generates new hypotheses by crossing the current champion with totally random individuals, while GP generates them by crossing pairs of individuals drawn from a high-fitness pool."


Using a testbed of 100 runs on each of the 80 distinct 3-argument Boolean functions, the paper concludes,

"We found hill climbing to be more efficient than GP, with the advantage increasing to a factor of 40:1 on the hardest problem instances."


The paper's interpretation is that

"Evidently, the high-fitness pool is worse than [a] random source of components for building improved individuals."


The 1995 International Conference on Machine Learning accepted and published this paper (Lang 1995).

2. Suitability of 3-Argument Boolean Functions as a Testbed

When a scientific claim must be established by experimental evidence (as opposed to proof), the persuasive quality of the experimental testbed is necessarily a threshold issue. One would think that most researchers in automated learning would be very surprised to see the family of 3-argument Boolean functions proffered as a suitable testbed for reaching generalizable conclusions about machine learning.

The paper attempts to justify its choice of 3-argument Boolean functions as follows:

"The testbed for the experiment is the Boolean circuit synthesis task described in chapter 9 of Genetic Programming. This task was selected by the book's author on which to battle an earlier batch of critics who suspected that genetic programming was nothing more than a disguised version of random-generation-and-test algorithm ... " (Emphasis added)

However, chapter 9 of Genetic Programming: On the Programming of Computers by Means of Natural Selection (Koza 1992) says precisely the opposite. The 3-argument Boolean functions were not "selected by the book's author" as a suitable testbed for disproving the claim that genetic search is no better than blind random search - much less for basing generalizations about machine learning algorithms. In fact, in the trivial world of 3-argument Boolean functions, blind random search actually outperforms genetic programming in a good many instances. Far from endorsing 3-argument Boolean functions as a suitable testbed, chapter 9 specifically dismisses them (page 206) as

"... the obvious exception of certain designated trivial problems (some of which [namely, the 3-argument Boolean functions] appear later in this chapter ... ) ... " (Emphasis added).

A glance at chapter 9 shows that the category of experimental evidence that Genetic Programming does embrace for the purpose of comparison with random-generation-and-test is the "generation 0" performance of genetic programming (i.e., blind random search). The book provides a mass of experimental evidence about "generation 0" performance in the form of 20 fitness curves, 43 performance curves, and 44 probability of success curves. This collection of curves (each based on up to several hundred runs) covers a wide variety of different problems, including many benchmark problems from robotics, control, system identification, optimization, and game playing. Only three of the 81 or so different problems in Genetic Programming are Boolean problems (namely, the 6-argument multiplexer, the 11-argument multiplexer, and the even-5-parity function). (The contents of chapter 9 are intentionally not even treated as "problems" by the index of the book.)

The experimental evidence that chapter 9 embraces to show that genetic programming is not just blind random search is not the trivial family of 80 3-argument Boolean functions, but, instead,

"[t]he fact that we do not solve a problem on generation 0, but do solve the problem before generation 50 on a high percentage of the runs ... "

Boolean functions of any arity are, by themselves, of questionable value as a basis for generalizations about the performance of learning algorithms. Boolean functions are unusual in that both their domain and range are discrete and their entire behavior can be represented by a finite truth table. Boolean functions of arity 3 are particularly inadequate as a testbed since they exhibit a mere 9 discrete levels of fitness (i.e., from 0 to 8 correct answers over the 8 possible combinations of the 3 Boolean arguments). Indeed, 3-argument Boolean functions are about as useful as looking at the first three integers to conclude that all integers are prime numbers.

The inadequacies of Boolean functions in general and 3-argument Boolean functions in particular are so obvious and so well known that it is difficult to discern the reason why an experimental paper that did no more than mechanically tabulate the results of 100 runs of 80 3-argument Boolean functions could possibly have been accepted for publication by the International Conference on Machine Learning.

In any event, the burden is on the paper to justify its surprising choice of the manifestly trivial 3-argument Boolean functions as a suitable testbed. The attempt to invoke support for this surprising choice from chapter 9 of Genetic Programming is just plain wrong. Other than this mischaracterization of chapter 9, the paper provides no other justification to support its surprising choice on this threshold issue. The inadequacy of this testbed and its effect on the validity of the paper's conclusions will become apparent momentarily.

3. Failure of Attribution

When a published paper bombastically proclaims in its title that it is going to "beat" a particular published result and over-personalizes the issue by including the name of the author of the previously published result in its title, one can reasonably expect that the paper will demonstrate at least minimal familiarity with the subject matter it promises to "beat."

Lang refers (section 1) to "our version of the hill climbing algorithm" in which "[t]he main difference is that our algorithm generates new hypotheses by crossing the current champion with totally random individuals, while GP generates them by crossing pairs of individuals drawn from a high-fitness pool" (Emphasis added).

However, anyone familiar with even the basic elements of genetic programming knows that what Lang calls "our" approach of "crossing the current champion with totally random individuals" is nothing more than the familiar GP operation called mutation. This ordinary GP operation is described in detail in chapter 6 on page 106 of Genetic Programming (and is properly credited in chapter 4 on page 65 to several previous researchers from the mid-1980s - the most extensive and detailed of which is Hicklin's 1986 work). As clearly stated on page 106 of Genetic Programming,

"The mutation operation ... removes whatever is currently at the selected point and whatever is below the selected point and inserts a randomly generated subtree at that point."


Lang's "crossing" is not sexual recombination or crossover at all. Nor is it new. It is ordinary GP mutation. Once this unacknowledged equivalence is recognized, the way that paper's proposed hill climbing algorithm goes about creating each successive new point in its search can be restated in the following simpler way: The new candidate is either a randomly created entirely new individual (i.e., identical to the way individuals are created in "generation 0" of a GP run) or the offspring produced by a random GP mutation of the current best individual. In other words, we are about to imbibe of the very old wine of "mutate and save the best."

4. What Really Happens in a Run of the Proposed Hill Climbing Algorithm

We will use the even-3-parity function (one of the 80 members of the family of 3-argument Boolean functions) to illustrate the fallacy behind the paper's bombastic proclamation that it "beats" genetic programming.

For convenience, we will refer to the already-published histogram showing how well 1,000,000 randomly created programs perform the even-3-parity task, found in table 26.1 of Genetic Programming II (Koza 1994a) and shown here as figure 1. Out of 1,000,000 randomly generated programs, 67.5% have fitness 4 (i.e., produce the correct answer for half of the 8 combinations of the 3 Boolean inputs), 15.6% each have fitness 3 or 5, 0.5% each have fitness 2 or 6, only 0.33% each have fitness 1 or 7, and none (out of 1,000,000) have fitness 8 (the best) or 0 (the worst). Note that the vertical axis of this figure has a logarithmic scale running between 10^0 and 10^6 (thereby highlighting the otherwise imperceptibly small number of programs whose scores stray very far from the mean value of 4).


Figure 1 Histogram for the even-3-parity problem.
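For reference, the fitness measure behind this histogram simply counts how many of the 2^3 = 8 possible input combinations a candidate program answers the same way as the even-3-parity function. A minimal sketch (here the candidate is assumed, purely for illustration, to be supplied as a Python callable rather than as a GP program tree):

from itertools import product

def even_parity(*bits):
    # True when an even number of the Boolean inputs are true.
    return sum(bits) % 2 == 0

def fitness_even_3_parity(program):
    # Number of the 8 three-bit input cases (0 through 8) on which the
    # candidate program agrees with the even-3-parity function.
    return sum(program(a, b, c) == even_parity(a, b, c)
               for a, b, c in product([False, True], repeat=3))

# A program that always answers True scores 4 out of 8 - the modal fitness
# value in the histogram of figure 1.
print(fitness_even_3_parity(lambda a, b, c: True))   # -> 4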

Because two thirds of the random population have fitness 4, a run of the proposed population-free hill climbing algorithm will typically start at time step 0 with a randomly created individual with a fitness of 4. On step 1 (and every succeeding odd-numbered step), a random GP mutation of the current best individual will be executed. On step 2 (and every succeeding even-numbered step), a randomly created entirely new individual will be generated. This entirely new random individual (and every succeeding entirely new random individual generated on these even-numbered steps) will have a 2/3 probability of having fitness 4 and a 1/6 probability of having fitness 5 (as shown in the histogram).

After alternating these odd and even time steps a few times, the current best individual will probably work its way up to a fitness value of 5 and later to 6.

Once the current best individual reaches a fitness level of 6, the even-numbered steps start to become irrelevant since there is only a 0.33% (1 in 300) chance that a randomly created entirely new individual will have fitness 7. The point is that the even-numbered steps always draw from the same well (i.e., the known distribution shown in figure 1).

And, when the fitness of the current best individual reaches the level of 7 (most likely due to an odd-numbered step), the chance that an even-numbered step will create an entirely new individual with fitness 8 is considerably less than one in a million. These odds are essentially "never" in relation to the total of 1,250 steps involved in the runs in the paper.
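To make "essentially never" concrete: even taking the per-draw probability generously to be 10^-6 (the histogram of figure 1 found no fitness-8 individual among 1,000,000 random programs, so the true probability is presumably smaller), and with roughly 625 even-numbered steps in a 1,250-step run, the chance that any even-numbered step ever produces a fitness-8 individual is only about

1 - (1 - 10^-6)^625 ≈ 625 × 10^-6 ≈ 0.06%.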

In other words, as soon as the fitness of the current best individual moves just a few standard deviations away from the mean value for randomly created new individuals, the even-numbered steps in the proposed hill climbing algorithm become totally ineffective and irrelevant in advancing the progress of the search. In fact, the even-numbered steps matter only when the current best individual in this hill climbing algorithm is very close to the mean (e.g., at the very beginning of the run).

The irrelevance of the even-numbered steps would have been obvious to the paper's author had he not made the threshold error (endorsed by the 1995 International Machine Learning Conference) of employing the very trivial 3-argument Boolean functions as his chosen testbed and had he, instead, studied the ever-so-slightly less trivial 4-argument or 5-argument Boolean functions.

For example, figure 2 shows the histogram of fitness values for 1,000,000 randomly generated programs for the even-4-parity problem. 100% of the 1,000,000 randomly created individuals for the even-4-parity problem have fitness between 4 and 12. None of these 1,000,000 randomly created individuals even come close to the best level of fitness of 16 because a 16 is many, many standard deviations away from the mean of 8. Note again that the vertical axis of this figure has a logarithmic scale.


Figure 2 Histogram for the even-4-parity problem.

Similarly, figure 3 shows the histogram of fitness values for 1,000,000 randomly generated programs for the even-5-parity problem. 100% of the 1,000,000 randomly created individuals for the even-5-parity task have fitness between 12 and 20. Again, a perfect score of 32 is many, many standard deviations away from the mean value of 16. Note again that the vertical axis of this figure has a logarithmic scale.


Figure 3 Histogram for the even-5-parity problem.

Thus, once the proposed population-free hill climbing algorithm moves even slightly above the mean, there is essentially no chance that the even-numbered steps will produce a new individual that will dethrone the current best individual. The entire burden of advancing the progress of the search falls, almost immediately, upon the odd-numbered steps (i.e., the ordinary GP operation of mutation).

Having eliminated the even-numbered steps from the picture, the question then becomes whether the odd-numbered steps can successfully search a complicated search space. That is, the question is whether the ordinary GP mutation operation is capable of successfully searching the tiny world of 3-argument Boolean functions.

We need not speculate on the answer to this question because the answer is already in the paper. Lang apparently first tried an approach consisting only of the odd-numbered steps (i.e., the ordinary GP mutation operation) and apparently discovered that this approach didn't work even in the trivial world of 3-argument Boolean functions. The paper says,

"Fully half of our candidates circuits were random, drawn from the same distribution that yielded poor performance for [random-generation-and-test]."

Tellingly, the paper continues,

"While this seems like a waste of resources, we needed to have some mechanism for escaping from the local optima that would have resulted from keeping only one good circuit around at a time. (Emphasis added)."


In other words, the paper concedes that the odd-numbered steps alone (i.e., the ordinary GP mutation operation) are unable to search a non-linear space because of the well-known problem of lingering for long stretches of time at various suboptimal levels of fitness.

Given that the paper concedes the inability of the odd-numbered steps to work and given the provable inability of the even-numbered steps to play any role in the search (once the search breaks out of the immediate neighborhood of the mean value of the initial random distribution), it is clear that neither part of the proposed algorithm works. The proposed population-free hill climbing algorithm is a pasting together of an approach that the paper concedes doesn't work with an ineffective and irrelevant approach that provably can't work.

Presumably, Lang came to the realization that he "needed to have some mechanism" from his own preliminary experimentation on the family of 3-argument Boolean functions. Then, because of the threshold error of conducting these experiments in the trivial testbed of 3-argument Boolean functions (a glaringly obvious error accepted uncritically by the International Machine Learning Conference), all the search activity of consequence was concentrated in the highly confined interval between fitness levels of 6 and 8. In this highly misleading and trivial world, the author of the paper managed to convince himself that his even-numbered steps (i.e., the resort to totally random "generation 0" individuals) actually contributed something to the progress of the search.

It is noteworthy that the proposed population-free algorithm dethrones the current best individual whenever the fitness of a newly created individual is better than or equal to that of the current best individual. In the highly confined interval spanning fitness levels 6, 7, and 8 for the 3-argument Boolean functions, this "fine print" of the hill climbing algorithm played a role in further misleading the author of the paper. The author convinced himself (and the 1995 International Machine Learning Conference uncritically accepted his conclusion) that he "beat" genetic programming because of the peculiarities of his own unjustified choice of the trivial, highly confined, unsuitable testbed of 3-argument Boolean functions.

5. Failure to Cite Contrary Evidence

When an accepted and published paper proclaims in its title that it is going to "beat" previously published results from a specific book, the reader is reasonably entitled to assume that the author of the paper (and the scholarly institution accepting and publishing the paper) has accorded at least minimal attention to relevant evidence contained in the specific book that is being "beaten."

However, the paper inexplicably pays no heed at all to the experiment (on page 607 in chapter 25 of Genetic Programming) that contains unwelcome evidence contrary to its position. That experiment considers the ability of the GP mutation operation alone (i.e., the odd-numbered steps of the proposed hill climbing algorithm) to successfully search spaces of computer programs for the 6-argument Boolean multiplexer problem.

Figure 4 (identical to figure 25.15 from Genetic Programming) shows, by generation, the probability of success for 200 runs of the Boolean 6-multiplexer problem with a population of 1,000 for runs employing the normal GP operations (i.e., with ordinary two-parent GP crossover) and for 200 other runs employing only the mutation operation. The probability of success after 50 generations reaches 100% for this problem for runs employing the normal GP operations. However, for runs employing the mutation operation, the probability of success after 50 generations is only 8%.


Figure 4. Contrary evidence from Genetic Programming (1992) that was inexplicably ignored in the paper proclaiming that it "beats genetic programming."


This experiment constitutes evidence that sits right in the very same book that the paper bombastically proclaims it is going to "beat" and that the paper (and the scholarly institution accepting and publishing the paper) had no right to ignore. While 6-argument Boolean functions are, in my opinion, also very simple-minded (in relation to the mass of non-Boolean problems in Genetic Programming and Genetic Programming II), when one travels in the world of the trivial, the evidential value of the 6-argument Boolean functions surely far exceeds that of the 3-argument domain.

In addition, the extended run of the 6-multiplexer (on page 614 in section 25.13 of Genetic Programming) provides additional contrary evidence indicating the improving nature of the subtrees available for crossover in later generations of a run of genetic programming.

6. Misunderstanding of How the Genetic Algorithm Works

A misstatement contained in section 2 of Lang's paper as to how the genetic algorithm works directly relates to the author's misunderstanding of what he was doing. The misstatement relates to the important issue of why the author perceived that he "needed to have some mechanism" for escaping from the local optima.

Specifically, the misstatement is that, "...successive generations ... are created by selecting the best trees from preceding generations." (Emphasis added). This is precisely how the genetic algorithm does not work.

In the genetic algorithm, the selection is probabilistic, so that any tree may be selected. Known-bad trees are often selected. Even the worst tree can potentially be selected. The probability of selection increases with increased observed fitness, so that a better tree definitely has a greater probability of being chosen than a less fit tree. However, while the genetic algorithm is somewhat greedy, it is not entirely greedy. The genetic algorithm intentionally allocates some new trials in its search to known-inferior individuals.
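A minimal sketch of this kind of probabilistic, fitness-proportionate ("roulette wheel") selection, assuming strictly positive fitness values; this is illustrative and is not claimed to be the exact selection scheme of any particular GP system:

import random

def select_proportionate(population, fitnesses):
    # Every individual, including the worst, has a nonzero chance of being
    # picked, but better individuals are picked with proportionally higher
    # probability.
    total = sum(fitnesses)
    spin = random.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= spin:
            return individual
    return population[-1]   # guard against floating-point round-off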

A consideration of simulated annealing further clarifies the importance of not exclusively doing hill climbing in searching a non-trivial space. Simulated annealing is a population-free search algorithm that moves, on each generation (time step), from a single point in the search space of the problem to a new single point. Simulated annealing generates a candidate new point from the search space of the problem. However, unlike simple hill climbing (but like the genetic algorithm), simulated annealing occasionally (probabilistically) discards a current good point in exchange for a known-inferior new point. Indeed, without this ability to occasionally go to known-inferior points, simulated annealing degenerates to simple hill climbing. Then, like simple hill climbing searches, the search would necessarily often become trapped on locally-optimal (but globally-non-optimal) points of a non-trivial search space.
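The acceptance rule that gives simulated annealing this property is the Metropolis criterion, sketched below for a maximization problem. This is a generic textbook formulation offered for illustration, not code from any of the cited works.

import math
import random

def metropolis_accept(current_fitness, candidate_fitness, temperature):
    # Always accept an improvement; accept a known-inferior candidate with
    # probability exp(delta / temperature).  As the temperature approaches
    # zero, this rule degenerates to the pure hill climbing rule discussed
    # above, which accepts only non-worsening moves.
    delta = candidate_fitness - current_fitness
    if delta >= 0:
        return True
    return random.random() < math.exp(delta / temperature)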

The intentional allocation of some trials to known-inferior individuals (rather than exclusively to known-better individuals) is one of the features that distinguishes the genetic algorithm (and simulated annealing) from simple hill climbing.

Moreover, for both simulated annealing and the genetic algorithm, this occasional preference for known-inferior points is not exercised haphazardly, but, instead, in a principled (exponentially increasing) way that is closely related to the solution of the multi-armed bandit problem, which turns out to be near-optimal under certain reasonable definitions and assumptions (Holland 1975; Kirkpatrick, Gelatt, and Vecchi 1983; Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller 1953).

None of this is new. The deficiencies of hill climbing are well known because in any non-linear search space (which is to say, any non-trivial search space), hill climbing will become trapped on local optima.

7. What Constitutes "Little Evidence"?

It is interesting to note the vastly different weight that the author of the paper proclaiming to "beat" genetic programming accords to his own 100 runs of 80 3-argument Boolean functions versus the weight he gives to the mass of evidence contained in Genetic Programming. The paper tells us,

"The book Genetic Programming demonstrates the method's generality by exhibiting solutions to a variety of problems. However, it provides little evidence for believing that GP is better than the many existing algorithms for general learning and optimization, much less than the specialized algorithms that exists for many of the various problems."


Skipping past the puzzling question as to the identity of these unnamed "existing algorithms for general learning" that may be capable of creating computer programs to solve problems, let's look at the composition of this "little evidence" that the paper dismisses in preference to its own 100 runs of the family of 80 3-argument Boolean functions.

First, the book Genetic Programming provides evidence that genetic programming can evolve satisfactory solutions for 81 or so problems from a variety of fields, including many benchmark problems from robotics, control, system identification, optimization, and game playing.

Is that "little evidence" in relation to a mechanical tabulation of 100 runs of 80 3-argument Boolean functions?

Second, Genetic Programming II (Koza 1994a) describes how to evolve multi-part programs consisting of a main program and one or more reusable, parameterized, hierarchically-called subprograms (called "automatically defined functions"). Genetic Programming II contains evidence of the effectiveness of automatically defined functions from a couple of dozen distinct areas, including one where the genetically evolved program was superior to that written by several different human investigators.

Third, Genetic Programming II also provides evidence that genetic programming requires less computational effort (i.e., fewer fitness evaluations to yield a solution with a satisfactorily high probability) with automatically defined functions than without them (provided the difficulty of the problem is above a certain relatively low break-even point). Genetic Programming II also provides evidence that genetic programming usually yields solutions with smaller average overall size with automatically defined functions than without them (provided, again, that the problem is not too simple). That is, both learning efficiency and parsimony appear to be properties of genetic programming with automatically defined functions.

Fourth, Genetic Programming II also provides evidence that genetic programming with automatically defined functions is scalable. For several problems for which a progression of scaled-up versions was studied, the computational effort increases as a function of problem size at a slower rate with automatically defined functions than without them. In addition, the average size of solutions similarly increases as a function of problem size at a slower rate with automatically defined functions than without them. This observed scalability results from the profitable reuse of hierarchically-callable, parameterized subprograms within the overall program.

Fifth, six new architecture-altering operations have recently been developed (Koza 1994b, 1995) for genetic programming that are patterned after the naturally-occurring chromosomal operations of gene duplication and gene deletion. When these new operations are included in a run of genetic programming, genetic programming can dynamically change, during the run, the architecture of a multi-part program consisting of a main program and a set of hierarchically-called subprograms. These on-the-fly architectural changes occur while genetic programming is concurrently evolving the work-performing steps of the main program and the hierarchically-called subprograms. The new operations can be interpreted as an automated way to change the representation of a problem while solving the problem. Equivalently, these operations can be viewed as an automated way to decompose a problem into a non-pre-specified number of subproblems of non-pre-specified dimensionality; solve the subproblems; and assemble the solutions of the subproblems into a solution of the overall problem. These operations can also be interpreted as providing an automated way to specialize and generalize.

Sixth, over 250 articles using genetic programming have now been published by other authors, including one collection of two dozen papers on advances in genetic programming (Kinnear 1994, Koza 1994a).

All of the above is dismissed as "little evidence" while the paper (and, more importantly, the 1995 International Machine Learning Conference) has promoted a tabulation of 100 runs of 80 3-argument Boolean functions into the only accepted and published paper relating to genetic programming to be presented to the 1995 Machine Learning Conference.

8. Ineffectiveness of the Proposed Hill Climbing Algorithm When the Number of Arguments is Increased Above 3

The question arises as to whether the proposed population-free hill climbing algorithm continues to "beat" genetic programming when the number of Boolean arguments is expanded from 3 to the domain of 4- and 5-argument Boolean functions.

8.1. 4-Argument Boolean Functions

The paper claims that its algorithm "beat" genetic programming by the largest margin on the most challenging of the 3-argument Boolean functions (including the even-3-parity function), given the representation, fitness measure, and operations involved in these discussions.

Thus, we made two series of 12 runs of the Boolean even-4-parity problem. One series used ordinary GP and the other used the proposed population-free hill climbing algorithm.

A maximum time limit of processing 4,800,000 individuals was used for both series of runs here (and for both series of runs for the even-5-parity in the following section of this paper).

All 12 runs using ordinary GP produced solutions before generation 20 using a population size of 32,000. The average number of individuals that were processed to yield a solution using ordinary GP was 528,500 (i.e., 16.5 generations).

The proposed hill climbing algorithm solved the problem in 9 of the 12 runs within the 4,800,000 limit (32,000 times 150). The 9 solutions came on very early time steps (the latest being at step 58,879). Similarly, the best value of fitness in the 3 failed runs came on very early time steps (70, 2,329, and 3,803). No subsequent improvement occurred during the last 99.93% of the duration of these 3 failed runs. There is an early burst of activity followed by a long freeze.

Table 1 shows the outcome of the 12 runs of the proposed population-free hill climbing algorithm, the time step on which the best value of fitness was attained, and the length of the run. As can be seen, all 12 runs became stabilized very early - either because a solution was found very early (for 9 of the 12 runs) or because the run droned on futilely until step 4,800,000 (for 3 of the 12 runs).

If we make the overly generous assumption that lightning will strike on all 3 failed runs of the proposed hill climbing algorithm on time step 4,800,001 (after having been moribund for 99.93% or more of the first 4,800,000 steps), the average number of individuals that would need to be processed to yield a solution would be 1,218,640 with the proposed hill climbing algorithm. That is 2.3 times the 528,500 required for ordinary GP, so the proposed hill climbing algorithm does not "beat" genetic programming for this problem at the 4-argument level. That is, the ranking between the proposed hill climbing algorithm and genetic programming has already reversed.

Table 1 Summary of hill climbing runs for even-4-parity problem.


Run Outcome of run Best fitness attained on step Length of run
1 Solved 14,285 14,285
2 Solved 16,347 16,347
3 Solved 16,897 16,897
4 Solved 17,437 17,437
5 Solved 20,261 20,261
6 Solved 21,859 21,859
7 Solved 22,991 22,991
8 Solved 34,723 34,723
9 Solved 58,879 58,879
10 Failed 70 4,800,000
11 Failed 2,329 4,800,000
12 Failed 3,803 4,800,000
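As a check on the 1,218,640 figure quoted above: the step counts of the 9 solved runs in Table 1 sum to 223,679, and each of the 3 failed runs is generously charged 4,800,001 processed individuals, so the average is

(223,679 + 3 × 4,800,001) / 12 = 14,623,682 / 12 ≈ 1,218,640.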

8.2. Even-5-Parity Problem

We then made two series of 12 runs of the Boolean even-5-parity problem.

Again, all 12 runs using ordinary GP produced solutions prior to generation 71 using a population size of 64,000. The average number of individuals that were processed to yield a solution using ordinary GP was 2,913,583 (i.e., 45.5 generations).

The gap in performance between the proposed hill climbing algorithm and genetic programming widens for this problem at the 5-argument level. Only 3 of the 12 runs using the proposed hill climbing algorithm solved the problem within the limit of 4,800,000 (75 times 64,000) processed individuals. Again, these 100%-correct solutions emerged on very early steps (37,028, 59,643, and 60,226). Similarly, the best value of fitness ever attained during the 9 failed runs emerged on very early steps (between time steps 0 and 16,747). No subsequent improvement occurred during the last 99.7% of the duration of these 9 failed runs.

We believe that there is little chance of soon dislodging these seemingly-unending runs from the plateau of suboptimal values of fitness on which they are lingering. We also believe that there is little chance of soon making any additional progress had these runs been continued for many millions of additional steps. Of course, given enough time, Lang's random mutation of program trees will, eventually, solve any problem whatsoever. All runs of Lang's algorithm will eventually become successful at some time in the distant future.

Table 2 shows the outcome of the 12 runs of the proposed population-free hill climbing algorithm, the time step on which the best value of fitness was attained, and the length of the run. As can be seen, all 12 runs stabilized very early - either because a solution was found very early (for 3 of the 12 runs) or because the run droned on futilely until step 4,800,000 (for 9 of the 12 runs). Four of the 9 failed runs never broke out above a fitness of 20 (out of 32).

Table 2 Summary of hill climbing runs for even-5-parity problem.


Run Outcome of run Best fitness attained on step Length of run
1 Solved 37,028 37,028
2 Solved 59,643 59,643
3 Solved 60,226 60,226
4 Failed 0 4,800,000
5 Failed 39 4,800,000
6 Failed 187 4,800,000
7 Failed 284 4,800,000
8 Failed 4,009 4,800,000
9 Failed 5,985 4,800,000
10 Failed 8,631 4,800,000
11 Failed 14,279 4,800,000
12 Failed 16,747 4,800,000

Practical considerations of computer time motivated the decision to terminate all runs of both the 4- and 5-parity problems for both genetic programming and the hill climbing algorithm after processing 4,800,000 (75 times 64,000) individuals. Each run of the hill climbing algorithm consumed about 20 hours on our 90 MHz Pentium PC-type computer.

If we again make the overly generous assumption that lightning will strike on all 9 failed runs of the proposed hill climbing algorithm on time step 4,800,001, the average number of individuals that would need to be processed to yield a solution would be 3,763,076 with the proposed hill climbing algorithm. That is 1.3 times the 2,913,583 for ordinary GP.

Of course, the overly generous assumption that all 9 failed runs will suddenly succeed at step 4,800,001 substantially reduces the lead of genetic programming over the hill climbing algorithm for the even-5-parity problem (as compared to the even-4-parity problem). At the time of termination of the 9 failed runs, the reigning champions had been sitting on their thrones for at least 99.7% of the 4,800,000 steps of the runs.

As explained in detail in chapter 8 of Genetic Programming, the use of arithmetic averages is an oversimplification when there are failed runs in a probabilistic algorithm. However, because of the slowness of the proposed hill climbing algorithm, construction of a performance curve was not an option for us. In any case, the collapse of the hill climbing algorithm is obvious as the arity is scaled up from 3 to 4 to 5.

Based on these experiments, we totally concur with the author's own concession that he

" ... needed to have some mechanism for escaping from the local optima ... "

Unfortunately, his proposed hill climbing algorithm contains no such mechanism.

The reversal of the ranking of genetic programming and the proposed hill climbing algorithm between the even-3-parity problem and the even-4-parity problem, together with the widening of the lead of genetic programming over the hill climbing algorithm between the even-4-parity problem (3 failed runs out of 12 for the hill climbing algorithm) and the even-5-parity problem (9 failed runs out of 12 for the hill climbing algorithm), strongly suggests that the proposed hill climbing algorithm simply does not work on anything but the most trivial, unrepresentative, and misleading of testbeds.

The proposed algorithm doesn't work because it is based on hill climbing (instead of the less greedy, but mathematically near-optimal allocation of trials inherent in the genetic algorithm and simulated annealing) and because the pool of individuals in the population does, in fact, provide useful subtrees that are beneficially exploited by the true two-parent sexual recombination operation (crossover).

9. Conclusions About the ML-95 Hill Climbing Paper that "Beats" Genetic Programming

The paper by Kevin Lang entitled "Hill Climbing Beats Genetic Search on a Boolean Circuit Synthesis Problem of Koza's" that appears in the proceedings of the 1995 International Machine Learning Conference rests on a manifestly unsuitable testbed of 3-argument Boolean functions, mischaracterizes chapter 9 of Genetic Programming, fails to acknowledge that its "crossing" operation is nothing more than ordinary GP mutation, ignores contrary evidence sitting in the very book it proclaims to "beat," and reaches a conclusion that reverses as soon as the Boolean problems are scaled up from 3 arguments to 4 and 5 arguments.

10. A Peer Review of the Peer Reviewing Process of the International Machine Learning Conference

The acceptance and publication of an experimental paper that mechanically tabulated the results of 100 runs of the family of 80 3-argument Boolean functions by the 1995 International Conference on Machine Learning raises questions about the ML-95 reviewing process.

Such a mechanical tabulation (even if its conclusions were correct) would hardly qualify for a passing grade in a reputable university course. The paper becomes further removed from the realm of the appropriate by its bombastic title proclaiming that it "beats" genetic programming, its over-personalized inclusion of another author's name in its title, and its sociological introduction characterizing GP's "large following" as gullibly embracing some kind of "appealing" Pied Piper "story" of evolution.

One does not need to be a rocket scientist to know that 3-argument Boolean functions are about as useful for purposes of generalization as looking at the first three integers to conclude that all integers are prime numbers. Nor does one have to be a rocket scientist to know about the deficiencies of hill climbing as a search technique for non-linear search spaces.

Acceptance of the Lang paper by ML-95 therefore cannot be viewed as a matter of bad judgment, but rather as a suspension of judgment.

Why were good judgment and normal scientific standards suspended in favor of this paper that proclaims that it "beats" genetic programming?

This question does not arise in a vacuum, as if this particular GP-bashing paper is an isolated event in the history of the International Machine Learning Conference. Concurrent with the acceptance of the Lang paper "beating" genetic programming, 100% of the genetic programming papers that were submitted to the 1995 Machine Learning Conference were rejected. And, all but one of the papers on genetic programming submitted to the Machine Learning Conference over the past five years have been rejected.

(The exact number of submitted 1995 GP papers is not fully known to us at this time; however, it is at least 7. Based on the co-chair's statement that there were 15-20 submitted GA-GP papers and other information that we will be verifying in the next few weeks as we attend other conferences, the number is almost certainly higher than 7).

Focusing on the 7 currently known rejected 1995 GP papers, is there anyone who is prepared to step forward and straight-facedly assert that 0% of the papers in this entire category of papers had any merit (while the same conference was accepting and publishing 68 other papers on machine learning)?

And, since Lang's paper provides a standard as to what is acceptable at the ML-95 conference, is there anyone with any scientific integrity who is prepared to step forward and straight-facedly assert that 100% of the papers in the GP category were inferior to Lang's accepted GP-bashing paper?

And, looking back over previous years, is it reasonable to say that almost 100% of the many GP papers submitted to ML in previous years had no merit?

Surely, the problem with the GP papers is not relevance, since it is hard to think of anything more clearly embraced by Art Samuel's term "machine learning" than a method for creating computer programs to solve a broad range of problems. Thus, ML's long history of rejections must be based on the proposition that there is a persistent lack of scientific merit among the GP papers.

Surely all processes that aspire to any degree of integrity and credibility are susceptible to some accountability. If the peer review process is appropriate for assessing the quality of submitted papers, then it seems altogether fitting and proper that a peer review process should be used to assess the quality of ML's peer review process in connection with the question of why essentially 100% of GP papers are consistently rejected by the Machine Learning Conference.

Of course, in undertaking a peer review of ML's peer review process, mere differences of opinion can be resolved in favor of ML and quibbling over minor matters is to be avoided. However, after generously granting an a priori presumption in favor of ML, it is nonetheless the case that this presumption remains susceptible to reversal upon presentation of evidence of a course of conduct that is intellectually dishonest and preposterous.

We, therefore, now present a review of the currently available ML-95 reviews (as supplied by the authors of the 7 currently known rejected 1995 GP papers).

10.1. Paper 302 - GP is Only "Slightly Better" than Human Performance

Paper 302 uses genetic programming to evolve a motif for families of proteins. The use of automatically defined functions in conjunction with genetic programming overcomes the deficiency that existing motif languages used by molecular biologists do not provide for reusability of common subexpressions. Paper 302 compares a genetically evolved motif to the motif determined by the committee of human molecular biologists that meets in Geneva to identify and archive regularities in protein sequence data. In paper 302, GP evolved a feature that the committee of human experts missed. The genetically evolved motif (which takes the missed feature into account) does slightly better than the committee of human experts on the human committee's own data. Paper 302 suggests a reasonable biological interpretation for the feature missed by the human committee.

The 1995 International Machine Learning Conference complains,

"The problem attacked in the paper has already been solved (GP does give a very slightly better solution, but the original solution was already almost perfect)"
(Emphasis added)

Yes, indeed, the problem has already been solved. But the reviewer fails to mention that this existing solution was accomplished by humans! The "original solution" that GP only "slightly" exceeded was the performance of an international committee of human experts on molecular biology.

One would think that the purpose of the field of machine learning was to develop automated means to rival human performance on non-trivial problems.

Just how many published papers at the ML-95 conference deliver anything near human-level performance, much less performance that is even "slightly better" than human-level?

Since Lang's paper was accepted and paper 302 was rejected, one can conclude that a mechanical tabulation of 100 runs of 80 3-argument Boolean functions has more scientific merit, in the eyes of the ML-95 conference, than exceeding human-level performance on a problem ordinarily handled by an international committee of human experts.

10.2. Paper 228 - Absence of Cross-Paradigm Comparisons

ML-95 complains about paper 228 because

" ... there [were no] ... comparisons with other search methods."


The above is just one of several instances of precisely the same criticism, in almost the same words (coming, no doubt, from precisely the same reviewer), daisy-chained through ML-95's reviews of the 7 rejected GP papers.

Just how many published papers at the ML conference make cross-paradigm comparisons?

After leafing through the proceedings of the most recent (1994) ML conference, I could not find a single cross-paradigm comparison in any papers employing decision trees, inductive logic programming, reinforcement learning, or any other machine learning paradigm.

I also briefly leafed through six randomly chosen issues of the Machine Learning journal, including the first issue in 1986. I did not find any cross-paradigm comparisons in those particular issues (although there are several that I recall seeing in other issues in the past).

In the first issue of the Machine Learning journal, the founding editor held up the four papers in that issue as models that authors should consider following for machine learning papers (Langley 1986). None of these four excellent papers (which included, among others, Quinlan's often-cited and well-regarded 1986 paper on decision trees) had any cross-paradigm comparisons.

Of course, the idea of doing cross-paradigm comparisons is itself a highly debatable proposition.

For one thing, doing a cross-paradigm comparison is a major undertaking for a researcher. Such comparisons require writing, debugging, and running computer programs for each of the other methodologies involved (or, alternatively, coming up with an all-encompassing analytic theory definitively modeling the comparative performance of the different technologies). About two years ago, I reviewed a government grant proposal that carefully laid out the costs and difficulties of doing such comparisons on a rather limited set of such paradigms and problems. A mid-six-figure budget spanning a couple of years was proposed for the project.

Second, doing a cross-paradigm comparison even-handedly is itself extremely difficult. The possibility of unintentional bias presents the real risk that the very considerable effort involved will not prove to be at all persuasive to anyone.

Third, and more importantly, preventing publication of a new result merely because it has not been (and perhaps cannot be) also produced by all other techniques of machine learning would mean that only the most inconsequential and incremental results would ever be published in the field. Cross-paradigm comparisons can, by definition, only be done on problems that can be solved by several different paradigms. Given the inability of many paradigms to solve any breadth of problems, such cross-paradigm comparisons are necessarily limited to trivial problem domains. The challenge for machine learning at the present time is how to get non-trivial results on non-trivial problems - not the relative speed by which trivial and useless results can be simultaneously obtained by various different paradigms.



But, in any event, the issue here is not whether more authors should, or should not, do cross-paradigm comparisons. The point is that there is a long-standing, well-known, prevailing practice at ML conferences (and throughout the ML community) generally to not do cross-paradigm comparisons. (In fact, given the difficulty, the very few papers that do cross-paradigm comparisons generally do only that).

Why are papers on genetic programming being subjected to a unique and onerous higher standard than the long-standing, well-known, prevailing standard applied to papers employing decision trees, inductive logic programming, reinforcement learning, and all other paradigms?

Certainly, researchers in decision trees can get their papers published without also solving the very same problem using inductive logic programming, reinforcement learning, or genetic programming. So why should ML-95 demand that researchers in genetic programming also work with decision trees, inductive logic programming, and reinforcement learning?

Do all of this reviewer's own papers contain such cross-paradigm comparisons?

Indeed, do any of his papers contain such a comparison?

If doing such cross-paradigm comparisons is not the prevailing practice of ML and if doing them is not the reviewer's own practice, then why is the reviewer criticizing this paper (and other GP papers) for not doing such comparisons?

10.3. Paper 302 - Author Failed to Say Why GP was Used and Also Failed to Do Cross-Paradigm Comparisons

The (no doubt same) ML-95 reviewer further complains about paper 302,

" ... no comparisons are made with other search methods (e.g., why use GP instead of some other method?)."
(Emphasis added)

I searched in vain for a paper in the recent ML proceedings employing decision trees, inductive logic programming, reinforcement learning, or any other paradigm where the author gives any kind of organized and detailed recital as to why he used his chosen paradigm "instead of some other method."

On the whole, the fact is that researchers use the paradigms that they use in their papers because that is what they do.

Apparently this reviewer thinks that a researcher wishing to use genetic programming owes some special explanation or apology (not demanded from users of other paradigms) for using GP. Perhaps the reviewer views GP as some forbidden technique that may be used only as a last resort after one has first obligingly tried other (presumably more legitimate) techniques.

Again, why are the papers on genetic programming being singled out, criticized, and penalized for an omission common to almost every other paper ever published by the ML conference?

10.4. Paper 221 - Author Dared to Say "Why Use GP"

Curiously, the author of one of the rejected GP papers ventured one sentence to say "why use GP instead of some other method?". ML-95 then bristled,

"There are unsubstantiated claims, such as 'Genetic programming...is particularly suited to problems in which some underlying regularity or structure must be discovered'"
(Emphasis added)

It is hard to imagine a more innocuous or inoffensive one-sentence statement as to "why use GP" (specifically, GP with automatically defined functions).

In any event, paper 221's author was simply giving the reviewer what the reviewer complained that paper 302's author did not.

Moreover, the reviewer's accusation that this offending sentence is "unsubstantiated" is itself wrong. There is a considerable literature on GP's usefulness and applicability to regularity discovery, and it is cited right in the paper.

If it were the prevailing practice of ML to require each paper to contain an organized and detailed recital of the specific reasons for choosing a particular paradigm (and it is not) and if each author's reasons also had to be "substantiated" in the paper (as the reviewer is demanding of author 221), there would be no room for the paper!

It is totally unreasonable to demand inclusion of this kind of introductory material and then to take offense when such necessarily terse material is "unsubstantiated."

The point here is that the same reviewer criticizes and penalizes GP authors when they fail to fess up as to why they use GP and then criticizes and penalizes them when they try to explain why they use GP.

10.5. Paper 055 - Intra-Paradigm Comparisons are Not Enough - There Must Also be Cross-Paradigm Comparisons

The stated purpose of paper 055 was to compare the author's proposed enhancement to GP (adding recursion) with existing forms of GP (without recursion). Thus, the author made intra-paradigm comparisons among the different ways of doing GP in his paper.

However, this apparently was not enough, and paper 055 was dismissed because

"... comparisons were done with other forms of GP but not with other search methods."
(Emphasis added)

Skipping past the obvious and embarrassing question of whether any other such other search method exists that could even handle the problem presented in this paper, did ML-95 even-handedly criticize and penalize authors of papers on decision trees, inductive logic programming, reinforcement learning, or any other paradigm for not making cross-paradigm comparisons?

The essence of institutionalized bigotry is the imposition of an unreasonable standard against the group at which improper discrimination is aimed. Evidently, none of the other categories of accepted papers at ML conferences have been subjected to any demand, criticism, or penalty concerning cross-paradigm comparisons. And, there is certainly no conceivable reason why cross-paradigm comparisons are differentially more desirable or more appropriate for GP papers than for any other category of papers.

So why are papers 228, 302, 055 and others on genetic programming singled out for criticism and penalty because they failed to do that which almost every other paper ever published by the 1994 ML conference (and the Machine Learning journal) also fails to do?

Of course, the reviewer is factually correct when he says that papers 228, 302, and 055 (and others) make no cross-paradigm comparisons. But does this nominally true statement really have anything to do with the reviewer's decision-making process in rejecting these papers?

If not, why are these idle words in his review?

Given his privileged and insulated position as a peer reviewer, why would the reviewer not feel entirely comfortable in stating the actual reasons for his negative decisions in his written reviews?

Can it be because he knows that the actual reasons for his decision were illegitimate and unethical and that the words in his reviews are merely hypocritical window-dressing concealing the reviewer's hidden agenda?

10.5.1 Paper 222 - Did A Cross-Paradigm Comparison, but the Reviewer Didn't Seem to be Interested

Tellingly, paper 222 did do a cross-paradigm comparison involving three paradigms: genetic programming, the ID3 decision tree algorithm, and a nearest neighbor algorithm.

If the absence of cross-paradigm comparisons is so important to the reviewer, why didn't the ML-95 review of paper 222 make a positive comment about the presence of this cross-paradigm comparison in paper 222?

Certainly, cross-paradigm comparisons are infrequent enough in published machine learning papers.

Could it be that ML-95 doesn't really care that much about cross-paradigm comparisons after all?

10.6. Paper 055 - Found a "General Solution" to Just One Problem

ML-95 criticizes paper 055 (which demonstrated how to do recursion in GP),

"The strengths and limitations of GP are not discussed, except for the single fact that GP succeeded in evolving a general solution to this problem." (Emphasis added).

What "strengths" beyond solving one "general problem" must a short paper at the ML conference possess?

How many other papers at the ML conference had more "strengths" than that of finding "a general solution" to one problem?

Indeed, how many papers at the ML conference, when examined, yielded a "general solution" to even one problem?

The important point in the above quotation is that the reviewer does acknowledge, on the face of his review, that paper 055 did, in fact, yield a "general solution" to what any open-minded person in machine learning would acknowledge to be an important problem (adding recursion into a machine learning paradigm).

Why isn't that enough?

Far less seems to be more than enough when ML is looking at papers outside the GP category.

10.7. Paper 302 - Too Many Papers Already Published about GP

The same reviewer who complained that the genetically evolved solution to the molecular biology problem did only "slightly better" than human performance now tells us,

"GP is by now not a new method, and there have been a large number of papers (and two large books) detailing GP's success on various problems, so it is not clear what this one additional example adds to our knowledge about GP."


What does the number of previously published papers have to do with the process of judging any newly submitted paper?

Can anyone conceive of any possible circumstances under which the ML-95 conference would tell the author of a submitted paper on, say, inductive logic programming, that

"Inductive logic programming is by now not a new method and there have been a large number of papers (and XXX large books) ... "


Or, can anyone conceive of any possible circumstances under which the ML-95 conference would say something like that to the author of a submitted paper on, say, decision trees?

Is this a Freudian slip indicating that the reviewer has mentally established a global cap on GP papers and that GP has already been too pushy?

In any event, if there has already been an intolerable excess of GP papers published, it's safe to say that it hasn't been at the ML conference or in the ML community.

10.8. Paper 302 - Information Not in the Right Place

ML-95 complains,

" ... not enough information about the particular GP algorithm is given ... . Missing is information about the probabilities of the various functions and terminals in the initial population ... Also, some particulars of the method (e.g., percent of the population copied to the next generation, probability of crossing over at various points in the trees) are not given but one of the first author's books is cited. It would be helpful to give these parameters in the paper.
(Emphasis added).


Why would this be helpful in a short paper?

To whom?

Few, if any, readers of a paper are actually interested in replicating experiments. And, if they are, the reviewer acknowledges that the complete information required for replication is readily available (in the almost 10,000 copies "of the first author's books" that are currently in circulation).

It takes several pages to meticulously catalog all of the marginally different alternative ways of translating the metaphor of the genetic algorithm into an actual computer program. That amount of space is clearly not available in a short conference paper.

The only issue here is the reviewer's strong opinion that this voluminous information should be inserted into the confines of the small number of pages allotted to ML papers.

A glance at the recent ML proceedings indicates that many accepted ML papers do not meet the minimal requirement of stating all the important parameters of their work. Why then is this reviewer singling out this GP paper (and other GP papers) when all the important parameters are in this paper (and all the unimportant parameters are readily available to anyone who is interested)?
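
To make concrete the kind of "parameters" at issue in this complaint, the following sketch (in Python) lays out a control-parameter tableau of the sort that the GP literature routinely tabulates. The parameter names and values shown are generic, commonly cited defaults offered purely for illustration; they are assumptions of this sketch, not the actual settings of paper 302 or of any other paper discussed here.

    # Illustrative sketch only. These are generic, commonly cited default values
    # for a genetic programming run, NOT the settings used in paper 302 or in any
    # other paper discussed here.

    GP_CONTROL_PARAMETERS = {
        "population_size": 4000,               # number of program trees per generation
        "max_generations": 51,                 # generation 0 plus 50 evolved generations
        "crossover_fraction": 0.90,            # fraction of each new generation created by crossover
        "reproduction_fraction": 0.10,         # fraction copied unchanged into the next generation
        "max_initial_tree_depth": 6,           # depth limit for randomly created initial programs
        "max_depth_after_crossover": 17,       # depth limit on offspring produced by crossover
        "internal_crossover_point_probability": 0.90,  # chance of choosing a function (internal) node
        "selection_method": "fitness_proportionate",
    }

    def print_tableau(parameters: dict) -> None:
        """Print the parameters in the tabular style used in the GP literature."""
        for name, value in parameters.items():
            print(f"{name:.<45} {value}")

    if __name__ == "__main__":
        print_tableau(GP_CONTROL_PARAMETERS)

Even a compact tableau like this one consumes space that a short conference paper can ill afford, which is precisely the trade-off being argued about here.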

10.9. Paper 281 - Memory and Mental Models

ML-95 complains about paper 281 saying

"It would also help to have some discussion of the strengths and weaknesses of its technique."


Similarly, ML-95 complains about paper 302 saying

"The strengths and limitations of GP are not discussed ... "


Similarly, ML-95 complains about paper 055 saying

"The strengths and limitations of GP are not discussed ... "

Note that these reviewer comments are specifically aimed at the "limitations of GP" and the "weaknesses of the technique" - not merely at the weaknesses of the results of the specific work described in the paper.

Again, a search through the most recent ML proceedings does not uncover a single article on inductive logic programming (or decision trees or reinforcement learning or any other paradigm) that contains any kind of detailed recital of the strengths and weaknesses of the technique used in the paper. Where do we find Quinlan, Muggleton, or any of the other many practitioners of decision trees and inductive logic programming dutifully reciting the weaknesses of their techniques?

It might be nice if every paper at ML began with a disciplined and detailed recitation of the strengths and weaknesses of the technique being used in the paper, followed by a careful linkage between these itemized strengths and weaknesses and the decision as to "why use" technique X.

One argument against this is that most readers are not impressed by such recitals from avid practitioners of a particular paradigm and that such recitals are best done by others. (Recall how the reviewer himself bristled when paper 221 contained just one sentence about an asserted strength of GP).

But regardless of whether or not this practice would be "nice," the important point is that such space-consuming recitals are not part of the prevailing practice of ML.

And, there certainly is no special reason why such recitals should be differentially required for GP while not being required for any other paradigm.

The problem with the ML-95 reviews is that there is a blizzard of words in which the reviewer selectively complains about such matters concerning the GP papers, while there is total silence as to the real reasons behind ML-95's decision to exclude all GP papers. The ML-95 reviews are full of words that the reviewer and the authors both know have nothing to do with the decisions. The obvious lack of connection between the real decision-making criteria and the reviewer's written words heightens interest in the question of why 100% of the GP papers were rejected by ML while Lang's GP-bashing paper was accepted.

10.10. Paper 055 - Insufficient Demonstrated Scalability

Almost all the papers at ML conferences turn a blind eye to the critical issue of scalability. In fact, there is apparently a gentleman's agreement not to mention the "S-word" in the ML community's nanoworld of 3-argument Boolean functions and sick soybeans.

Paper 055 did utter the forbidden S-word and demonstrated scaling over a series of comparative experiments. ML-95 admits,

"The problem is of enormous significance, since proper handling of recursion in GAs would potentially allow significant advance in all types of first-order learning with GAs." (Emphasis added)


The reviewer then calls paper 055 "original." (Indeed, it is).

Then, in dismissing paper 055, ML-95 said,

" ... this ... kind of recursion did improve the performance of GP dramatically, and it would be very interesting to see it scaled up to problems whose interest and difficulty is clearer." (Emphasis added)

Thus, after doing what the reviewer acknowledges is indeed "original" work, after grappling with a problem of "enormous significance," after producing some evidence on the important question of scalability, and after actually doing comparative experiments, ML-95 dismissed paper 055 because it didn't also solve unspecified additional problems of "interest."

"Come back when you've cured both cancer and heart disease!"

One wonders how ML-95 came to view Lang's mechanical tabulation of 100 runs of 80 3-argument Boolean functions as "original" or "of enormous significance" when it manifestly would not qualify for a passing grade in a reputable university course.

How could ML-95 possibly have regarded the bogus conclusions derived from the trivial tabulation in Lang's paper as superior to paper 055?
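
As an arithmetic aside on just how small that problem domain is: there are only 2^(2^3) = 256 possible 3-argument Boolean functions in total, and the figure of 80 is presumably the number of such functions that remain distinct after permuting the three inputs - a count that can be verified by brute force in a few lines. The sketch below (in Python) is offered only to make that arithmetic concrete; the assumption that Lang's 80 functions are exactly these permutation-equivalence classes is mine, not a claim taken from his paper.

    from itertools import permutations, product

    # There are 2**(2**3) = 256 Boolean functions of three arguments. This brute-force
    # count shows that exactly 80 of them remain distinct when two functions are treated
    # as the same whenever one can be obtained from the other by permuting the inputs.
    # (That this is the source of Lang's figure of 80 is an assumption of this sketch.)

    INPUTS = list(product((0, 1), repeat=3))      # the 8 possible input triples

    def canonical(truth_table):
        """Lexicographically smallest truth table over all permutations of the inputs."""
        variants = []
        for perm in permutations(range(3)):
            permuted = tuple(
                truth_table[INPUTS.index(tuple(x[i] for i in perm))] for x in INPUTS
            )
            variants.append(permuted)
        return min(variants)

    distinct = {canonical(t) for t in product((0, 1), repeat=8)}
    print(len(distinct))    # prints 80

However the figure of 80 is arrived at, the entire space of target behaviors fits comfortably on a single page.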

10.11. Paper 255 - Details in the Wrong Place

ML-95 again complains,

"Too many details are missing ... and the reader is referred to a [technical report] for details." (Emphasis added)


Note, again, the issue is not whether the details were readily available, but whether the author should have put them into his highly space-limited paper.

One thing is noteworthy here: the apparently common ML-95 reviewer is to be congratulated for his ecologically-minded recycling of words as he goes from paper to paper.

10.12. Papers 221 and 228 - Inconsequential Missing Details


For paper 221, ML-95 complains,

"Some details of the GA are missing, e.g., at each generation, how many individuals are created through copying, and how many through crossover? Or, what was the probability of using each of the functions and terminals in creating the initial population?" (Emphasis added)


For paper 228, ML-95 complains,

"... [T]he author leaves out some details of the GP method used -- e.g., how was the initial population formed? How many individuals were copied directly each generation and how many were formed via crossover?"



Although the authors of papers 221 and 228 did cover all the important parameters in their short papers, they are indeed guilty of omitting some of their minor parameters.

The key question here is whether ML-95 even-handedly demanded, criticized, and penalized the non-GP papers for similar (and far worse) omissions. If the published papers in previous ML proceedings are a guide, the answer is clearly "no." Is the answer "yes" for 100% of the accepted and published ML-95 papers?

10.13. Paper 281 - Failure by the ML-95 Reviewer to Even Bother Reading the Paper(s)

The ML-95 reviewer asserts,

"The basic capability was demonstrated in the IEEE paper and the author does not note any significant differences in the results presented in this paper compared to the results in the IEEE paper."


The ML-95 reviewer clearly didn't bother to read paper 281 (much less the 1994 paper by the author of paper 281).

He's just plain wrong when he says paper 281 failed to "note ... significant differences" between the 1995 work and the previous work. The paper did recite these differences and they are substantial.

The accusation that the ML-95 reviewer didn't bother to read paper 281 can be easily experimentally verified. As it happens, a paper by the same author (substantially similar to paper 281) was submitted and subsequently accepted by IJCAI-95. That is, paper 281, which was summarily rejected by ML-95, was accepted by IJCAI-95 (a particularly competitive conference) in its machine learning area. (Advance copies are available for readers not wishing to wait until August 22, 1995 to conduct this experiment for themselves). The reader can easily compare the two papers and form his own opinion as to whether the ML-95 reviewer really read ML paper 281 or whether the ML-95 reviewer is merely taking a cheap shot based on an incorrect guess about the contents of paper(s) that he didn't read.

As a foretaste of the results of this experiment, let me merely mention that the author has informed me that one of the independent IJCAI reviewers (whose identity is unknown to us) specifically commented in his IJCAI review about the significant differences between the 1994 and 1995 papers.

We are confident that if the reader conducts his own experiment, he will end up agreeing with the author of paper 281, with me, and with the independent IJCAI reviewer that the ML-95 reviewer didn't read the paper(s).

The fact is that the (apparently common) ML-95 reviewer probably didn't read most of the other GP papers on which he passed judgment.

Could it be that the reason the ML-95 reviewer didn't read the papers is that he really didn't need to read the papers because his negativity had nothing to do with the contents of the papers, but derived from a manuscript-independent dislike of GP?


AUTHOR'S NOTE: The IJCAI-95 proceedings have, of course, been publicly available for some time now. The complete reference is as follows:

Andre, David. 1995. The automatic programming of agents that learn mental models and create simple plans of action. Proceedings of the 14th International Joint Conference on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann. Pages 741-747.


The complete reference to the 1994 paper by the author of paper 281 is as follows:

Andre, David. 1994. Evolution of map making: Learning, planning, and memory using genetic programming. Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE Press. Volume I. Pages 250-255.

The reader is invited to compare the two papers and form his own opinion as to whether the ML-95 reviewer really read ML paper 281.

10.13.1 Paper 281 - Reviewer Inappropriately Cites Obscure Technical Report on Classifier Systems

The title of paper 281 is

"The artificial creation of agents that learn and use mental models to evolve simple plans" (Emphasis added).


There is, of course, an enormous machine learning literature (measured in tens of thousands of published papers) on agents, planning, and mental models.

Puzzlingly, the reviewer of paper 281 further complains,

"The only related work I'm aware of is a dissertation that showed how to evolvng learning classifier systems". TGCA report No. 91003. Department of Engineering Mechanics, The University of Alabama, Tuscaloosa, Alabama."

Surely the reviewer is aware of more than one paper in existence on the subject of agents, planning, and mental models.

In any event, this puzzling comment is an additional indicator that the reviewer did not, in fact, read paper 281 because paper 281 has nothing remotely to do with classifier systems.

Classifier systems are a minuscule sub-area of the field of agents, planning, and mental models.

And, how could paper 281 possibly be considered deficient in neglecting to cite an obscure technical report outside the subject matter of paper 281?

Indeed, standard practice is not to cite a technical report if a more widely published source is available (since technical reports are typically published in very small numbers, are not to be found in most libraries, and are usually familiar only to specialists who monitor the latest and most detailed new developments in a particular subfield).

To say

"The only related work I'm aware of"

in connection with paper 281, pointing to one particular difficult-to-find University of Alabama technical report on classifier systems, shows again that the ML-95 reviewer has no idea what is in paper 281. This is another throw-away comment that attempts to manufacture flaws in paper 281.

More important, what this comment does show is that this apparently common ML-95 reviewer is very conversant with the extremely specialized subfield of classifier systems - a subfield in which considerably fewer than 100 papers have ever been published (those few having been written by a relatively small group of researchers).

10.14. Paper 228 - Reviewer's Feigned Inability to Visualize any Interesting Applications for Mental Models

Paper 228 applied GP to the automatic construction of mental models and illustrated it with a tree search task. ML-95 complained that

"No case is made that the problem being solved contains features that allow for generalization to more interesting problems." (Emphasis added).


No one complained (or should have complained) when Quinlan's often-cited and well-regarded 1986 article on decision trees presented the infinitely more trivial ("Saturday morning weather") problem to illustrate his then-new technique. No one in computer science had any difficulty visualizing numerous applications for Quinlan's "Saturday morning weather" problem. Similarly, no one complains about the numerous papers in previous ML conference proceedings that use a modest illustrative problem (usually far more modest than the tree search task in paper 228) to illustrate whatever new approach they are presenting.

Assuming the hostile common reviewer even bothered to read paper 228 (and one cannot be sure, given his indiscriminate recycling of more or less identical complaints across all the GP papers), he would have seen what anyone in computer science would readily recognize to be a very general procedure for the automatic construction of mental models.

Anyone familiar with machine learning or artificial intelligence should have had no difficulty visualizing a cornucopia of applications for something as fundamental as mental models (just as any reader of Quinlan's 1986 paper would have had no difficulty appreciating the wide applicability of his "Saturday morning weather" illustration).

It is nominally true that

"No case is made ..."

in paper 228. However, no case was made because no case was needed. Any open-minded reader not feigning an inability to visualize "interesting" applications would have immediately grasped the importance of the automatic construction of mental models.

However, if ML-95 felt that paper 228 fell short because it didn't make a specific detailed "case" for the wider applicability of its illustrative example, how is it that ML-95 came to accept Lang's mechanical tabulation of 100 runs of 80 3-argument Boolean functions? Does Lang's accepted paper

"contain ... features that allow for generalization to more interesting problems"? (Emphasis added)

ML-95 could not possibly have applied this test (or any test of scientific competency) to Lang's GP-bashing paper.

10.14.1 Paper 302 - Another Unfounded Complaint

In section 10.3, I mentioned that the (no doubt common) ML-95 reviewer complained about paper 302,

" ... no comparisons are made with other search methods (e.g., why use GP instead of some other method?)." (Emphasis added)

In preparing version 2 of this paper, I happened to notice how utterly untrue the statement by the ML-95 reviewer was. In fact, paper 302 included several statements that directly spoke to the question of why genetic programming should be chosen for the particular problem addressed in the paper. One of four such linkages is

"When this problem is rephrased as a search for an unknown-sized task-performing computer program (i.e., a composition of primitive functions and terminals), genetic programming becomes a candidate for approaching this problem. Moreover, if it is also desired to capture regularities in the problem environment (e.g., the repeated use of a subexpression such as [LIVM]), then genetic programming with automatically defined functions becomes relevant." (Emphasis added).

10.15. Paper 255 - Manufactured, Reasonably Small Datasets with Relatively Small Number of Classes

Paper 255 attempted to tackle a problem of classifying complicated visual and acoustic signals. ML-95 criticized it because

"[t]he experiments used manufactured, reasonably small datasets with relatively small number of signal classes."


The reviewer - so anxious to fire off shoot-from-the-hip criticisms of GP papers - is apparently unaware that both of the datasets that he is criticizing come from published sources in the signal processing field (including Sebastian Thrun's set) and the ML community.

Be that as it may, did ML-95 similarly criticize Lang's GP-bashing paper with its 100 runs of 80 3-argument Boolean functions because it used

"... manufactured, reasonably small datasets with relatively small number of ... classes"?


What dataset could possibly have been smaller and less significant than the family of 80 3-argument Boolean functions?

Even if the authors of paper 255 were guilty of the alleged sin of using an oversimplified dataset (and they are not), there are degrees of sin. How serious was paper 255's oversimplification in comparison to the oversimplification in Lang's accepted and published GP-bashing paper?

Even the ML-95 reviewer acknowledges

"the chosen problem domain is certainly challenging"

for paper 255 (something that surely cannot be said of Lang's problem domain).

And, paper 255 faced pressing practical problems of computer time arising from the massive number of pixels and time slices (something that surely cannot be said of Lang's nanoscale problem domain).

In any event, the author of paper 255 did not oversimplify to anywhere near the degree that Lang did in his accepted and published GP-bashing ML-95 paper.

And, most important, there is no reason to believe that the author of paper 255 produced any bogus conclusions.

Why was paper 255 criticized and penalized for a far less consequential instance of the alleged sin of oversimplification that was breezily ignored by ML-95 in accepting and publishing Lang's GP-bashing paper?

By the way, the reviewer also complained,

"There are no references to the related work in the area of signal processing. The authors should also reference the work by Usama Fayyad et al on SKYCAT..."
(Emphasis added)

But Fayyad was cited in paper 255!

This is yet more evidence that the ML-95 reviewer did not read the paper.

Clearly it is the reviewer's bias that caused him to reject the paper - not the actual contents of the paper - and certainly not the empty words contained in his review.

10.15.1 Paper 302 - The ML-95 Reviewer Protesteth Too Much

Most of the conference-paper reviews that I have received over the years are extremely short. Most of the reviews of the rejected GP papers at ML are also short.

The review of paper 302 (in which "GP does give a very slightly better solution" than human performance) from the common ML reviewer begins with the following extraordinary preamble:

"My opinion is that for a paper to be significant for the ML community, it has to do at least one of the following things:

"(1) attack a problem that has not previously been solved, and show why the problem being attacked is interesting and/or important

"(2) give some new insight as to why a particular ML method works or does not work, or as to why some particular class of problems is difficult or easy for ML

"(3) demonstrate that some new, previously untested method has the potential to solve interesting and important problems

"(4) model some phenomenon in real-world learning in an interesting and enlightening way

"This list could no doubt be added to, ... "



Why is this lengthy philosophizing about review criteria in the review of paper 302?

Did this ML-95 reviewer apply these criteria to the Lang article?

10.16. Independent Evidence of the Quality of the Rejected Papers


Paper 281, which was summarily rejected by ML-95, was accepted by IJCAI-95 (a particularly competitive conference) in its machine learning area.

Subsequent to its rejection by ML-95, paper 302 was submitted and accepted as a chapter in a refereed and edited book on machine learning being published overseas. Papers 228 and 055 were accepted as regular papers by other conferences in the machine learning area. Paper 222 was accepted as a poster paper at another conference. The work from paper 255 will appear as a chapter in a refereed and edited book.

When favorable decisions come in from the independent peer reviewers of IJCAI, other conferences, and other publications for 86% of the papers involved here, it certainly suggests that these GP papers are not as worthless as ML-95 would have the world believe.

This independent reviewing is, in effect, a peer review of the ML peer review process.

While decisions made by other conferences and publications are, of course, not conclusive, these independent conclusions certainly carry some additional evidential value that can be added to the scales of judgment. This is especially so in light of the extreme disparity between ML's 0% acceptance rate and the 86% acceptance rate from these other independent reviewers.

10.17. GP Papers at ML over the Past Five Years

In the past five years, none of this writer's submitted papers were accepted by the ML conference. In fact, during this same five-year period, all but one paper on genetic programming was rejected by the ML conference.

Over 250 GP papers (by authors other than myself) have been accepted and published elsewhere and over 50 of my papers have been accepted and published elsewhere (in almost every case by publications considered to be of at least equal quality to ML).

The fact is that the collective scientific judgment of a very large sampling of other independent reviewers - over an extended period of time - is very much at variance with the extreme nature of the outcome at ML.

We are not talking about some minor percentage difference here.

What is the reason for this extreme disparity?

This writer's most recent experience with a previous ML conference may shed additional light on the mystery of this extreme disparity, over a period of years, between the outcome at ML and the outcome from the collective scientific judgment of a mass of other researchers. I received two "independent" reviews from ML that each contained several very odd common words and phrases. These unusual words and phrases were, in turn, embedded in almost identical sentences. The sentences conveyed identical thoughts. However, the order of the sentences was symmetrically and mechanically toggled between the two "independent" peer reviews. Anyone with even a little experience at spotting plagiarism would instantly recognize that the two "independent" reviewers were not independent at all but, instead, had colluded in the writing of their reviews (most likely with the second reviewer using the first review as his template and mechanically switching slightly modified sentences around as to order).

Thus, ML's essentially manuscript-independent rejection of all the 1995 GP papers is nothing new, but, instead, part of a persistent pattern of abuse of the peer review process at ML.

The inescapable conclusion is that the only way paper 302 (or any of the other 1995 massacred GP papers) would have been accepted by ML-95 would have been if it had carried a title such as "Yet Another GP Failure - GP Performs Only Slightly Better than Human Experts."

10.18. Need for Follow-Up by ML

The goal now should be to ascertain the truth of the matter raised herein.

In ascertaining the truth of what happened to the rejected 1995 GP papers, there is an obvious need for the acquisition of additional information currently contained in ML-95's records.

10.18.1 The Two ML-95 Reviewer Forms for the Lang Paper

What does the written record show about how the ML conference came to accept the Lang paper?

Since the Lang paper was accepted at ML-95, its two reviews must have been quite favorable.

How, for example, did the two ML-95 reviewers of Lang's paper answer the following six questions (taken from the official ML reviewer form)? How did the two reviewers answer the question:

"To what extent does it evaluate the strengths and limitations of its result(s)? Dimensions for evaluation include generality, ... "


Did the reviewers actually say "yes, this paper has strengths and, yes, this paper has generality"?

Since the Lang paper was totally silent when it came to introspectively identifying or itemizing the limitations of its results, did the reviewers criticize the Lang paper for this omission?

What did the ML-95 reviewers say in answer to the question,

"What lesson(s) does the paper offer the scientific community?"

Did the reviewers actually say that Lang showed that hill climbing "beat" GP?

Did the reviewers really say that the Lang paper "had strengths," "was general," and "offered lessons"?

Presumably the reviewers must have said just that!!

If they didn't, did ML-95 accept and publish a paper that the reviewers said did not have "strengths," did not have "generality," and did not offer any "lessons"?

In answer to the question,

"How important is the work reported?"


did the reviewers actually say that a tabulation of the behavior of the family of 3-argument Boolean functions was "important"?

In answer to the question,

"How difficult is the problem it attacks?"

did the reviewers really say there was some level of difficulty involved in Lang's 100 runs of the family of 80 3-argument Boolean functions?

How did the two reviewers answer the question:

"Is the paper technically sound?"

Did the reviewers really say that Lang's paper was technically sound?

In answer to the question,

"To what extent does the paper discuss relevant research?",

did the reviewers fail to notice the equivalence with mutation and fail to note the lack of attribution in the Lang paper?

Most important of all: Were any of the many criticisms that the ML-95 reviewers repeatedly leveled at the rejected GP papers (cited in earlier sections) even-handedly raised with respect to Lang's paper on these two review forms?

Was a "single standard" applied or was a "double standard" applied by ML-95?
It is the job of the ML program committee to maintain overall consistency in the standards of the reviewing process as it adopts or rejects reviewer recommendations.

The necessarily favorable comments on the two review forms for Lang's accepted and now-published paper (and the general absence of any significant negative comments) are unlikely to demonstrate a fair or even-handed application of a single consistent standard at ML-95.

10.18.2 Role of the Apparent Common Reviewer on Lang's Paper

Did the apparent common reviewer of the rejected GP papers also review the Lang paper?

The answer is probably "yes" because the daisy-chaining of repetitive statements across the rejected GP papers suggests that all the GP papers were submitted to one common reviewer (with apparently different people providing the second review) and the Lang paper had "genetic programming" in its title.

If the answer is "yes," this particular review form takes on unusual significance (without any reference to the reviewer's name).

The question is then whether the common reviewer even-handedly complained on his review form for the Lang paper about all of the same matters that he repeatedly raised about the rejected GP papers or whether he developed sudden amnesia and extolled the Lang paper's "beating" of genetic programming as a towering intellectual achievement providing great insights for the machine learning community.

Since the Lang paper was favorably reviewed and accepted, it is unlikely that the common reviewer made any negative comments (or at least not many). Thus, this reviewing form is very important evidence that will almost certainly further support the already very strong case of unethical discrimination and scientific misconduct by this reviewer.

10.18.3 A Possibility

There is a very good possibility that the ML conference and the ML-95 co-chairs were themselves also victims in these events.

The fate of all the GP papers at ML was probably determined, as a practical matter, at the time of appointment of the reviewer, not by the quality or content of the submitted papers. There is a very good possibility that highly relevant information that would have almost certainly deflected ML from appointing the reviewer in the first place was intentionally withheld from the ML co-chairs by the would-be reviewer himself (perhaps in conjunction with someone who may have promoted the appointment of this person as the reviewer of the GP papers). This information is of such a clear nature that the reviewer himself (and the second person, if any, who may have pushed for his appointment) knew with certainty, at the time they withheld (or colored) the information, that there would be no way the ML co-chairs would appoint the person involved to review GP papers if the information had been fully and properly disclosed to ML. Indeed, if the co-chairs had even an inkling of this information, they would certainly have sought to acquire additional information from sources independent of the reviewer himself (and the second person, if any, who may have pushed for his appointment).

There is a good possibility that the above paragraph will come as a surprise to the ML-95 co-chairs. If, upon reading the above paragraph, the co-chairs say "ah ha," then the co-chairs will know that material information was indeed withheld from them and that they, too, have been victimized by the same people who victimized the GP authors.

10.19. Conclusion

This paper raises legitimate questions about scientific misconduct and calls upon ML for a candid explanation of how these events came to pass.


In responding to the above, there will, no doubt, be those who will advocate further victimizing the victims (and the messenger) by attacking them for making waves by raising these legitimate questions.

There will also, no doubt, be the see-no-evil whitewashers who will advocate denying that there is any problem at all and will suggest manufacturing some strained and implausible interpretation of events to explain away the clear evidence in this matter.

I hope a clear focus is maintained on who exactly are the victims and who is the perpetrator. David Goodstein (1995), vice provost at Caltech, covered this point in his recent editorial in Biotechnology when he said,

"Editors of scientific journals and program officers ... have the most to gain from peer review, and they steadfastly refuse to believe that anything might be wrong with the system. Their jobs are made easier because they never have to take responsibility for decisions. ...
" ... referees are able, with relative impunity, to delay or deny funding or publication to their rivals. When misconduct of this kind occurs, it is the referee who is guilty ..." (Emphasis added).

If there is any temptation for ML to resort to denial of the existence of the problem, avoidance of dealing with the problem, or blame-shifting to the victims (or messenger) as the way to deal with the legitimate issues raised by the massacre of the GP papers, it should be noted that there is one issue of scientific integrity for ML that cannot be easily sidestepped if ML aspires to any degree of scientific credibility, namely the issue of the presence in the scientific record of the demonstrably erroneous Lang paper.

We submit that the best course of action for ML is to ascertain the truth of this entire matter and then forthrightly take action against the perpetrator that will send an unmistakable deterrent message in support of scientific integrity - rather than to make itself part of the problem by denying the existence of the problem, avoiding the problem, trying to shift the blame to the victims, or trying to defend the indefensible. The evidence supports taking the following actions:

Acknowledgements

David Andre did the programming of the runs of the even-4-parity and even-5-parity problems for genetic programming and the hill climbing algorithm.

References

Goodstein, David. 1995. Ethics and peer review. Biotechnology 13:618. June 13, 1995.

Holland, John H. 1975. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.

Kinnear, Kenneth E. Jr. (editor). 1994. Advances in Genetic Programming. Cambridge, MA: The MIT Press.

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220:671-680.

Koza, John R. 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: The MIT Press.

Koza, John R. 1994a. Genetic Programming II: Automatic Discovery of Reusable Programs. Cambridge, MA: The MIT Press.

Koza, John R. 1994b. Architecture-Altering Operations for Evolving the Architecture of a Multi-part Program in Genetic Programming. Stanford University Computer Science Department technical report stan-cs-tr-94-1528. October 21, 1994.

Koza, John R. 1995. Gene duplication to enable genetic programming to concurrently evolve both the architecture and work-performing steps of a computer program. Proceedings of the 14th International Joint Conference on Artificial Intelligence. San Francisco, CA: Morgan Kaufmann.

Lang, Kevin. 1995. Hill climbing beats genetic search on a Boolean circuit synthesis problem of Koza's. Proceedings of the Twelfth International Conference on Machine Learning. San Francisco, CA: Morgan Kaufmann. Pages 340-343.

Langley, Pat. 1986. On machine learning. Machine Learning 1(1):5-10.

Metropolis, N. A., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. 1953. Equation of state calculations by fast computing machines. Journal of Chemical Physics. 21: 1087-1092.

Quinlan, J. R. 1986. Induction of decision trees. Machine Learning. 1 (1): 81-106.

