Hide Comments
Hide Comments

Getting Started - Genetic Programming

Comments (0)

Using the RiverSoftAVG Genetic Algorithms & Programming Component Library (GACL) for genetic programming is very easy, and will get you started in no time.  Of course, the devil is in the details. Genetic programming requires careful crafting of the building block instructions and the fitness evaluation to properly guide the genetic programming components.

 

The following steps will help you in using the genetic algorithms components.  The basic steps for perform genetic programming are:

 

Drop a TRSGeneticProgramming component (or TRSBinaryGeneticProgramming component) on your form
Define the Instructions (functions, constants, and variables) for your problem
Define a fitness function (OnEvaluateFitness event) to properly execute and assess the fitness of each individual
Tune the genetic programming component to your problem by setting properties like InitialPopulation, InitializationMethod, InitializationDepth, FitnessMethod, and others
Evolve your solution
Optional but recommended, control Bloat (uncontrolled growth of your GP trees) by selecting and tuning a BloatStrategy

 

hmtoggle_plus1Define the Instructions (functions, constants, and variables)

The building blocks of each genetic programming problem are called Instructions.  They are unique to the problem at hand.  Instructions have a number of characteristics:

 

Instructions are a shared set of functions, constants and variables (i.e., every individual has access to the complete set)
Functions define their Arity, which means the number of sub-instructions they can accept.
Functions may have an arity of 0 or more.
Functions have a OnExecute event defined to control what they do
Constants and Variables always have an arity of 0.
Constants, Variables, and Functions of Arity = 0 are called Terminals.
Constants return the value of their constant, which may be different for each individual and for each instance of the constant in the individual if the value has been mutated.
Variables return the value of the variable as it is passed in when executing a genetic program

 

Choosing your functions and terminals is extremely important and is the key predictor of success.  Your choice of instructions should be:

Sufficient
Small
Provide closure

 

It is important that the set of functions and terminals be complete enough to represent a solution to the given problem. This is called sufficiency.  If a necessary function or terminal is missing, the genetic programming components may not find a solution or will only find suboptimal solutions.  However, if the set of instructions are too large (in the number of functions, the number of terminals, or the arity of the functions), the search space will be exponentially large and may slow or prevent the finding of a viable solution.   Unlike genetic algorithms where candidate solutions are constrained to a fixed size, genetic programs can, and will, grow too large. The size of a search space depends on the depth of the GP tree needed to find a solution, the size at each level (which depends on the arity of the functions), and the size of the instruction set to choose from.  The table below shows the size of the search space for the Santa Fe Ant Trail problem with 3 functions (if Food Ahead (Arity=2), Progn-2 (2), Progn-3 (3)) and 3 terminals (Move, Turn Left, and Turn Right)

 

Level

Population Size

1

288 (2.8 x 10^2)

2

2.48 x 10^7

3

1.53 x 10^22

4

3.56 x 10^66

5

4.53 x 10^199

 

If there was just some way to reduce the Arity of Progn-3 to 2, the table would become

 

Level

Population Size

1

108 (1.08 x 10^2)

2

3.7 x 19^4

3

4.1 x 10^9

4

5.04 x 10^19

5

7.63 x 10^39

 

Still massive, but much smaller and with a much more likely chance of finding a solution (BTW, you can find out the size of your search space for a given depth by calling the TRSGeneticProgramming.SearchSpaceSize method).  The final rule for your choice of instruction is that all functions in the instructions should be able to accept any other function or terminal as inputs.  This is called the closure property.  For example, with symbolic regression, the division function is a protected division.  If the divisor is 0, which in normal math is disallowed and would raise an exception, the protected division function returns 0 or 1 (depending on the problem).

 

hmtoggle_plus1Define a fitness function

Defining your fitness function is another highly important part of defining a genetic programming problem.  The fitness function returns a floating point value that specifies the correctness of the individual solution (chromosome).  The fitness function needs to be able to allow the genetic programming component to decide which solution is better than another.  The genetic programming component will seek to either maximize the solution (e.g., keep evolving for individuals whose fitness are greater than other individuals in the population) or to minimize the solution (e.g., find the individuals whose fitness are less than other individuals).  Note you can specify which direction to evolve towards with the FitnessMethod property.

 

In genetic programming, often it is important to use a weighted fitness to guide your genetic programs towards a solution.  For example, some anti-bloating methods (to control GP size growth), weight the fitness method to make larger programs less desirable. It is VERY important that the fitness function return the true fitness of the individual; this value is used by the genetic programming components to decide if a solution has been found.  If you want use the fmWeightedMinimize or fmWeightedMaximize fitness methods, use the OnEvaluateWeightedFitness method to return a weighted fitness value.  It should not recalculate the fitness value as this will waste precious CPU cycles, but rather weight the fitness that has already been assigned to the individual through the OnEvaluateFitness event.

 

hmtoggle_plus1Tune the Genetic Programming Component

Tuning the genetic programming component is important to the success of finding a solution.  Many properties are domain-independent and will work in many different situations.  Others can literally "change the course of evolution" (forgive me the pun :-) ) and keep the component from converging towards a solution.

 

Some of the most important properties deal with initializing the population before starting evolution.  These properties are:

 

Name

Description

InitialPopulation

Specifies the initial number of individuals to make in the Population before attempting to solve your solution. Use this property to quickly and easily define the population size. The InitialPopulation property specifies the minimum initial population. Koza, in his original book, used an initial population of 500.  In "Genetic Programming II", he was suggesting in the thousands.

 

Your population size has a substantial bearing on whether the genetic programming component can find a solution, too small a population may be unable to evolve towards a solution, too large a population wastes processing power.

InitializationDepth

Specifies the initial maximum depth of an individual's genetic program tree.  For the problems in his book, Koza selected an InitializationDepth of 6.

 

The InitializationDepth property is very complementary to the InitialPopulation and InitializationMethod. Too small of an initial depth hinders finding a solution.  A wide range of shapes and size of initial trees seems to help for evolving a solution.  Too large an initial depth and the genetic components can evolve away from simple solutions to more complex ones that waste a lot of processing power.

InitializationMethod

Defines the initialization methods for initializing a genetic program tree, i.e., how the initial trees are formed.  There are two basic initialization methods: full and grow.

 

The Full method creates trees that all equal the InitializationDepth (think full, bushy trees).  There are not necessarily all the same size (number of nodes) as they depends on the arity of the instructions, but the same initial depth.  The Full method creates trees by selecting only functions with arity greater than 0 until the initialization depth is reached; after which only terminals are selected.

 

The Grow method creates a wider variety of shapes and sizes.  In the Grow method, nodes for the trees are selected from the full set of primitives (functions, constants, and variables) until the initialization depth is reached.  Since an early node could select a terminal (constant, variable, or function of arity = 0), that node's subtree stops growing before the initialization depth.

 

The literature suggests that the shape of trees can impact which types of problems a genetic programming component will solve.  Koza actually suggests using a method called ramped half and half, which involves using both the Full and Grow method and varying the desired depth of the trees.  The depth limit is ramped from 2 to the full initialization depth limit, i.e., first individual has a depth limit of 2, second individual has a depth limit of 3, etc up to the full initialization depth limit. The depth limits are evenly divided amongst the population, e.g., if there are 100 individuals and a full initialization depth limit of 10, approximately 11 individuals will have a depth limit of 2, 11 will have a depth limit of 3, etc.

InitialDuplicatesRetries

Specifies how many times we attempt to create a unique individual per individual on initialization. If all of these retry attempts fail, the initialization depth is incremented to try and make a unique individual at the new depth.

 

Having duplicate individuals in the initial population is a waste of processing power as it limits the pool of truly available individuals (though having duplicates in later generations is a feature not a bug as fitter individuals will tend to be selected and reproduces more as we evolve our solution).  It is highly recommended to avoid duplicates in the initialization phase.

 

The properties that control the evolution of the next generation are also extremely important.  Here are the most important properties during evolution:

 

Name

Description

CrossoverProbability

Controls the likelihood that crossover (also known as recombination) will occur when 2 parents are selected to create a new child.  When crossover occurs, the 2 parents genes or DNA are combined to make the child.  The crossover probability is checked for each new child (e.g., if there are 100 children created each generation, the CrossoverProbability will be tested 100 times). If crossover does not occur, the child is a duplicate of the first parent (but may be mutated or inverted).

 

For genetic programming, the literature suggests a value of 0.9 (90% chance crossover occurs).

InversionProbability

Controls the likelihood that inversion will occur in a child's DNA or chromosome, e.g., whether a portion of the child's chromosome will be flipped. The probability is between 0 (no chance) and 1 (always). When 2 parents reproduce and create a new child, their chromosomes are combined using Crossover and then the child's chromosome may be mutated and/or inverted.

 

In genetic programming, inversion is actually a point mutation (i.e., a node in an individual's GP tree may be flipped and become a different instruction of the same arity and type).  You can either 0 out the probability here and handle it all in the mutation phase or leave it in.

MutationProbability

Controls the likelihood that one mutation will occur in a child's chromosome, i.e., genetic program tree. Genetic programming has a whole suite of mutations (subtree, point, constant, etc) that can occur; which mutation occurs depends on their MutationWeights.

MutationWeights

Defines the weights or probabilities that particular mutation methods will be chosen over the other genetic programming mutation methods. When mutation is selected to occur (see MutationProbability), the genetic programming component chooses a mutation method using the weights you supply. The different mutation methods Subtree - Select a node and grow a new subtree

Replacement - Select a node of same type and arity and grow a subtree

Constant - Replace a constant terminal node with another constant

Shrink - Replaces a subtree of the child with a terminal. Reduces complexity of the child.

Hoist - Replaces the entire child program tree with a subtree from the child (the subtree becomes the root node). Reduces complexity of the child.

Point - Replaces a node of child with a node of the same type and arity.  See InversionProbability.

SelectionMethod

Specifies how the genetic component selects individuals from the current generation to be used as parents of the next generation. The genetic component provides some built-in methods and allows you to write your own by using the  OnSelection event.  The build-in selection methods are:

Random - Select random parents

Roulette - Select parents based on "roulette-wheel" selection, where the better "fit" parents are more likely to be chosen (and re-chosen), e.g., the probability of selection is proportional to the fitness of the parent.  Also known as fitness proportionate selection

Elitist - Select parents based on "elitist" selection, where the top nth percentile parents are chosen (and re-chosen). Elitist is a heavy-weight selection algorithm (in our implementation at least) because we must figure out the top nth percentile population first

Tournament - Select "best" parents from a tournament field (size of tournament field is determined by the TournamentField property), e.g., select the best parent from randomly selecting individuals TournamentField times and then choosing the "winner" of the tournament. Note: for speed purposes, this selection method does not ensure that the same individual cannot be picked twice

Stochastic Tournament - Select "best" (in this case using roulette wheel) parents from a tournament field (size of tournament field is determined by the TournamentField property), e.g., select the best parent by selecting better fit individuals proportionally TournamentField times and then choosing the "winner" of the tournament. Note: for speed purposes, this selection method does not ensure that the same individual cannot be picked twice

 

Note that Koza used fitness proportionate or roulette selection.

 

hmtoggle_plus1Evolving your solution

To start evolving towards a solution, call the Evolve method.  The Evolve method starts "breeding" an answer to your goal.  Each generation, the Evolve method selects parents and then Reproduces children for the new generation by using the genetic operations: first crossover, then mutation, and finally inversion.  You can specify the number of generations to evolve or use the FitnessCutoff, GenerationLimit, and DiversityLimit properties to automatically stop the genetic process.  Successive calls to the Evolve method continues the evolutionary process from where the Evolve method stopped (use Initialize to reset the process).

 

An important thing to remember about genetic components: you are trying to breed a solution that is "good enough", not perfect.  It is a difficult thing to wrap your mind around, but it is often impossible to find a perfect solution.  Rather, you should be looking for a solution that meets your requirements.  The demo programs can be misleading as we know the perfect solution for the problems but are using these problems to show you how to use the components.  Genetic programming is really like regular programming in that you want to create a program that does the job.  You try to reduce bugs but you know that there will also be bugs in your programs (unless it is a very trivial problem).  Genetic programming is similar.  Do not fall into the trap of trying to find a perfect solution.

 

To monitor the progress of your population, it is recommended that you track at the least the MaxFitness (if FitnessMethod is maximize) or MinFitness, and the AvgProgramSize property.  The demo applications add a point to the line charts for fitness and AvgProgramSize after every generation (the OnEvolve event).  Plotting fitness is very useful for seeing if your population is still evolving towards a solution or has stalled.  Plotting AvgProgramSize helps you see if the population is evolving towards runaway bloat.  Controlling bloat is an important aspect in genetic programming and is discussed in the next section.  It is often better to retry from the beginning instead of trying to keep just breeding more and more generations.  Genetic programming can often get stuck in a local minima/maxima and be unable to get out of it to converge towards a solution.

 

Another useful gauge of progress is monitoring the diversity of your population.  If the diversity falls too low, the population is getting too homogenous (i.e., they look the same) and it is doubtful the evolution will be able to find a solution.

 

The FitnessCutoff property aborts the evolutionary process when any child meets or exceeds (if FitnessMethod is fmMaximize) or is less than (if FitnessMethod is fmMinimize) the cutoff value.  The UseFitnessCutoff property must be true to use the fitness cutoff.

 

The DiversityLimit property aborts the evolutionary process when the diversity of the children falls below the DiversityLimit, e.g., all the children are too alike to go any further.  The UseDiversityLimit property must be true to use the diversity limit.

 

The GenerationLimit property aborts the evolutionary process after the number of generations have been evolved.  The UseGenerationLimit property must be true to use the generation limit.

 

hmtoggle_plus1Controlling Bloat

Bloat is the uncontrolled growth of your GP trees during evolution.  For some reason, at some point in the evolution cycle, Genetic programming trees tend to experience explosive growth in the size of the trees.  This growth can easily overwhelm the computing ability of your computer (either memory or processing cycles).  The RiverSoftAVG Genetic Algorithms & Programming Component Library (GACL) implements many different strategies for controlling bloat.  To control bloat, there are two important properties, BloatStrategy and BloatLimit, and a few ancillary properties:

 

Name

Description

BloatStrategy

The BloatStrategy property specifies the anti-bloating strategy to attempt to control excessive tree growth. Unfortunately, the bloat strategies are often problem dependent so you may have to tinker with bloat strategies to see which works for your problem.  Please refer to the TRSGPBloatStrategy topic for more information.

BloatLimit

Specifies the bloat limit. The meaning of this property depends on the  BloatStrategy. When the bloat strategy specifies limiting the depth, the BloatLimit specifies the maximum depth. Similarly, when the bloat strategy specifies limiting the tree size, the BloatLimit specifies the maximum size.

ParsimonyCoefficient

Specifies the parsimony coefficient to use when trying to control bloat using bsSizeParsimonyPressure and bsDepthParsimonyPressure bloat strategies.

 

When generating the WeightedFitness of an individual, the Fitness is reduced by the size or depth its genetic program multiplied by the ParsimonyCoefficient. This has the effect for evolving for smaller genetic programs over larger genetic programs.

 

bsCovariantSizeParsimonyPressure and bsCovariantDepthParsimonyPressure also use the ParsimonyCoefficient when evaluating the WeightedFitness. However, these two methods recalculate the ParsimonyCoefficient themselves every generation using the formula CoVariance(ProgramData, FitnessData) / Variance(ProgramData).

TarpeianProbability

Specifies the probability that an individual's Fitness will be set to UnfitFitness (a special unfit value) when its genetic program size or septh exceeds the average size or depth. The Tarpeian anti-bloat strategy has the advantage that the individual is judged before evaluating fitness which can significantly reduce computation.

 

This property is used when the BloatStrategy is bsTarpeianDepth or bsTarpeianSize.

 

You can also try to implement your own anti-bloat strategy by hooking into the OnReproduce event.  This event occurs when two parent individuals have reproduced to create a new child. The OnReproduction event occurs after the child has been created and its DNA has possibly been combined, mutated, and inverted. Use the OnReproduction event to perform some action after reproduction has occurred.

Comments (0)