13. Symbolic Regression with Genetic Programming

  • The purpose of this topic is to perform symbolic regression with genetic programming

  • The genetic programming system being used is DEAP (Distributed Evolutionary Algorithms in Python)

  • Much of the heavy lifting required to implement a genetic programming algorithm is managed by DEAP

13.1. Problem — Regression Analysis

  • Regression is finding the relationships between dependent and independent variables given some observations

  • A common and simple form of regression is linear regression

    • Find a linear relationship between some dependent and independent variables

../../_images/regression_linear_data.png

Observations of some phenomenon plotted in two dimensions. The dependent variable is plotted along the y-axis and the independent variable is along the x-axis. From a quick glance, this data clearly has some linear relationship.

../../_images/regression_linear_data_linear_model.png

Linear model plotted on top of the observations. The resulting model is \(\hat{y} = 0.999885x + 0.997937\) and has an \(R^{2}\) value of \(0.9999885\).

  • The linearly regressed model for the above observations is \(\hat{y} = 0.999885x + 0.997937\)

    • Note that \(\hat{y}\) denotes that it is not \(y\), but only a prediction/estimation of \(y\)

  • This particular model has an \(R^{2}\) of \(0.9999885\)

    • This is a measure of goodness

    • The closer to 1.0, the better the fit

13.1.1. Linear Regression on Nonlinear Relationships

  • With linear regression, trouble arises when the observed data has nonlinear relationships

  • It is still possible to find a high-quality model, but it would require several assumptions and a lot of guesswork

../../_images/regression_nonlinear_data.png

Observed data with clear nonlinear relationships. The data appears to be parabolic or hyperbolic, but may be neither depending on which segment of the function was observed. It is not possible to effectively fit a straight line to describe the relationship between the dependent and independent variable.

  • Nevertheless, looking for a linear model for such nonlinear data is doomed to fail

../../_images/regression_nonlinear_data_linear_model.png

Linear model generated with linear regression on the nonlinear data. The resulting model is \(\hat{y} = -3.070395x + 916.550675\), which has an \(R^{2} = 0.010924\). Although this is the best possible straight line that fits this data, it is clear that it is not effectively fitting the data due to the limitation of the modelling strategy.

13.1.2. Symbolic Regression

  • An alternative strategy is something called symbolic regression

  • It is a form of regression analysis that requires fewer assumptions and works with nonlinear data

  • By using symbolic regression, the underlying nonlinear relationships may be found

../../_images/regression_nonlinear_data_nonlinear_model.png

Nonlinear model found with symbolic regression. The model is \(\hat{y} = 1.100511x^{2} - 4.439578\) and has a mean squared error of \(0.162087\).

  • It is clear that \(\hat{y} = 1.100511x^{2} - 4.439578\) effectively describes the relationships in the data

  • The mean squared error is \(0.162087\)

    • This is not the same as \(R^{2}\)

    • With mean squared error, a value closer to 0.0 is better

Note

In reality, symbolic regression produced add(add(protected_divide(mul(x, x), 9.949161), mul(x, x)), -4.439578), which was simplified to \(\hat{y} = 1.100511x^{2} - 4.439578\).

13.2. DEAP

13.3. Data

  • The data being used will be tabular

    • Rows of observations

    • Columns of variables

  • Below is an example of how the data would look

Arbitrary Example Data for Regression Analysis

ARG1 (\(X_1\))

ARG2 (\(X_2\))

ARG3 (\(y\))

-3.16833493e+01

3.63253949e+01

-5.75550374e+02

-3.59771849e+01

1.63633781e+01

-2.94238014e+02

4.35235243e+01

-1.54189209e+01

-3.35451435e+02

4.19705508e+01

1.25491733e+01

2.63416737e+02

  • For this example, the data is provided in CSV files

  • The data is formatted such that the last column is always the dependant variable

    • Thus, all proceeding columns are independent variables

  • This data was generated by some phenomenon

  • The goal is to find a model that effectively describes the relationships between the data

  • Loading this data can be done as follows

 97    data_file = open(os.path.join(RELATIVE_RESOURCES, DATA_FILE))
 98    all_data = [list(map(float, x.split(","))) for x in data_file]
 99    independent_variables = [observation[:-1] for observation in all_data]
100    dependent_variable = [observation[-1] for observation in all_data]
  • The above code loads the data specified in some constants

  • The independent variables will be stored in a two dimensional list

    • Rows are observations

    • Columns are different independent variables

  • The dependent variable is stored in a one dimensional list

    • Each value corresponds to the independent variables at the same index

  • Following the example data in the above table, the values would be stored as follows

independent_variables = [[-3.16833493e+01, 3.63253949e+01],
                        [-3.59771849e+01, 1.63633781e+01],
                        [4.35235243e+01, -1.54189209e+01],
                        [4.19705508e+01, 1.25491733e+01],
                        ... ]

dependant_variable = [-5.75550374e+02, -2.94238014e+02, -3.35451435e+02, 2.63416737e+02, ...]

13.4. Evaluation

  • A common mechanism for evaluating genetic programming performing symbolic regression is mean squared error

    • \(\frac{1}{n}\sum_{i}^{n}(y_{i} - \hat{y_{i}})^{2}\)

  • A mean squared error of zero is perfect

  • Any metric could be used, but mean squared error is very popular

  • One of the reasons it is used is because it punishes bigger errors more

    • If the difference between the observed and estimated value is 1, the squared error is 1

    • If the difference is 2, the error is 4

    • If it’s 3, the error is 9

    • In other words, it’s much worse to be very wrong than it is to be a little wrong

48def mean_squared_error(
49    compiled_individual, independent_variables: list[list[float]], dependent_variable: list[float]
50) -> float:
51    """
52    Calculate the mean squared error for a given function on the observed data. The independent variables are a list of
53    lists where the list of the outer list is the number of observations and the length of the inside list is the number
54    of independent variables. The length of the dependent variables is the number of observations.
55
56    :param compiled_individual: A callable function that the dependent variables are give to as arguments
57    :param independent_variables: Values given to the function to predict
58    :param dependent_variable: Expected result of the function
59    :return: Mean squared error of the function on the observed data
60    """
61    running_error_squared_sum = 0
62    for xs, y in zip(independent_variables, dependent_variable):
63        y_hat = compiled_individual(*xs)
64        running_error_squared_sum += (y - y_hat) ** 2
65    return running_error_squared_sum / len(dependent_variable)

13.5. Language

  • With symbolic regression, genetic programming will be searching the space of mathematical expressions

  • This includes both the operators and operands

  • The language is the set of operators and operands that the genetic programming system can use in the search

  • The trouble is, it’s not always clear which operators and operands should be included

  • Thus, the design of the language may need to be tuned for evolution like other hyperparameters

  • Fortunately, with symbolic regression, there is a common set to start with

    • Typical arithmatic operators

    • Maybe some trigonometry functions

    • Maybe logarithm and exponentials or even natural log and Euler’s number

  • However, there is a potential problem with divide as the evolutionary search may try to divide by zero

  • For this reason, it is common to see a protected divide used instead

    • If attempting to divide by zero, return infinity

33def protected_divide(dividend: float, divisor: float) -> float:
34    """
35    The basic arithmetic operator of division. If the division basic results in a ZeroDivisionError, this function
36    returns an arbitrarily large number (999,999,999).
37
38    :param dividend: Value to divide
39    :param divisor: Value to divide by
40    :return: The quotient; the result of the division (dividend/divisor)
41    """
42    if divisor == 0:
43        return math.inf
44    else:
45        return dividend / divisor
  • Setting up the language with DEAP is then done like below

104    primitive_set = gp.PrimitiveSet("MAIN", len(independent_variables[0]))
105    primitive_set.addPrimitive(operator.add, 2)
106    primitive_set.addPrimitive(operator.sub, 2)
107    primitive_set.addPrimitive(operator.mul, 2)
108    primitive_set.addPrimitive(protected_divide, 2)
109    primitive_set.addPrimitive(operator.neg, 1)
110    primitive_set.addEphemeralConstant("rand_int", partial(randint, -10, 10))
  • The first line, creating the PrimitiveSet object, is where the language operators and operands will be added

    • It also specifies the number of independent variables the function will have as the second argument

    • As discussed above, the independent variables are stored in some list called independent_variables

  • Each operator is added to the primitive set object along with the number of operands that operator requires

    • The operator is the first argument, and is passed as a function

    • Most of the added operators in this example are taken from the operator library

    • The protected divide, described above, is also added

    • The neg operator is the negation (multiply by -1) and only requires one operand

  • Finally, the constant values are added with addEphemeralConstant

    • In this example, integers between -10 and 10 are eligible

Warning

There is no guarantee or suggestion that this is a good language, but it is at least a reasonable place to start.

13.6. DEAP Setup

  • Below is an example of a setup for a DEAP algorithm

  • It effectively defines what each part of the algorithm will be

114    creator.create("FitnessMinimum", base.Fitness, weights=(-1.0,))
115    creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMinimum)
116
117    toolbox = base.Toolbox()
118    toolbox.register("generate_tree", gp.genHalfAndHalf, pset=primitive_set, min_=1, max_=4)
119    toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.generate_tree)
120    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
121    toolbox.register("compile", gp.compile, pset=primitive_set)
122    toolbox.register(
123        "evaluate",
124        mean_squared_error_fitness,
125        toolbox=toolbox,
126        independent_variables=independent_variables,
127        dependent_variable=dependent_variable,
128    )
129    toolbox.register("select", tools.selTournament, tournsize=TOURNAMENT_SIZE)
130    toolbox.register("mate", gp.cxOnePoint)
131    toolbox.register("generate_mutation_subtree", gp.genFull, min_=0, max_=4)
132    toolbox.register("mutate", gp.mutUniform, expr=toolbox.generate_mutation_subtree, pset=primitive_set)
  • This code

    • Sets the problem to be a minimization problem

    • Says how the individuals (chromosomes) will be encoded

    • Defines how the individuals and population will be generated

    • It defines the evaluation metric

      • Note that mean_squared_error_fitness is not the function described above

      • More on this below

    • It also sets the selection strategy and genetic operators

  • DEAP requires that the fitness function return a tuple of results

  • However, the above mean_squared_error function returns only a single value

  • Thus, another function, mean_squared_error_fitness, is created that calls mean_squared_error

  • This function also

    • Checks if exceptions should be thrown

    • Compiles the trees into something that can be functionally evaluated

    • Wraps the mean squared error in a tuple to be returned

68def mean_squared_error_fitness(
69    individual, toolbox, independent_variables: list[list[float]], dependent_variable: list[float]
70) -> tuple[float]:
71    """
72    Calculate the mean squared error for a given function on the observed data. The independent variables are a list of
73    lists where the list of the outer list is the number of observations and the length of the inside list is the number
74    of independent variables. The length of the dependent variables is the number of observations.
75
76    :param individual: A deap representation of a function that the dependent variables are give to as arguments
77    :param toolbox: A deap toolbox
78    :param independent_variables: Values given to the function to predict
79    :param dependent_variable: Expected result of the function
80    :return: A tuple containing the mean squared error of the function on the observed data
81    :raises ValueError: If the number of observations s 0 or if there are a different number of observations
82    """
83    if len(independent_variables) == 0 or len(dependent_variable) == 0:
84        raise ValueError("Number of dependent and independent variables must be greater than 0")
85    if len(independent_variables) != len(dependent_variable):
86        raise ValueError(
87            f"Uneven number of observations of independent and dependent variables: "
88            f"{len(independent_variables)}, {len(dependent_variable)}"
89        )
90    callable_function = toolbox.compile(expr=individual)
91    individual_mean_squared_error = mean_squared_error(callable_function, independent_variables, dependent_variable)
92    return (individual_mean_squared_error,)

13.6.1. Bloat Control

  • As mentioned in the previous topic, genetic programming suffers from bloat

    • The trees tend to grow larger with no improvement in fitness

    • This negatively impacts the algorithm’s performance

  • DEAP provides simple ways to control bloat

136    toolbox.decorate("mate", gp.staticLimit(key=operator.attrgetter("height"), max_value=MAX_HEIGHT))
137    toolbox.decorate("mutate", gp.staticLimit(key=operator.attrgetter("height"), max_value=MAX_HEIGHT))
138    toolbox.decorate("mate", gp.staticLimit(key=len, max_value=MAX_NODES))
139    toolbox.decorate("mutate", gp.staticLimit(key=len, max_value=MAX_NODES))
  • The above sets limits on how deep and how many nodes a tree may have after a genetic operator is applied

  • These values may need to be adjusted to improve performance

    • If they are allowed to get too big, the search may take too long

    • If they are not allowed to get big enough, it may not be possible to model the data

13.6.2. Bookkeeping

  • DEAP also provides tools for keeping track of the results throughout evolution

143    hall_of_fame = tools.HallOfFame(1)
144    stats_fit = tools.Statistics(lambda individual: individual.fitness.values)
145    stats_size = tools.Statistics(len)
146    stats_height = tools.Statistics(operator.attrgetter("height"))
147    mstats = tools.MultiStatistics(fitness=stats_fit, size=stats_size, height=stats_height)
148    mstats.register("avg", numpy.mean)
149    mstats.register("std", numpy.std)
150    mstats.register("min", numpy.min)
151    mstats.register("max", numpy.max)
  • The above code creates a HallOfFame, which will keep track of the top candidate solution for each generation

  • It also keeps track of some summary statistics on some population metrics

    • Population average, standard deviation, minimum, and maximum for fitness, tree height, and tree size

    • These values will be recorded for each generation

13.7. Running and Results

  • After all the setup, running the algorithm is done as follows

155    population = toolbox.population(n=POPULATION_SIZE)
156    population, log = algorithms.eaSimple(
157        population,
158        toolbox,
159        CROSSOVER_RATE,
160        MUTATION_RATE,
161        GENERATIONS,
162        stats=mstats,
163        halloffame=hall_of_fame,
164        verbose=True,
165    )
  • With the above settings, output like the following will be displayed

                                          fitness                                                      height                                             size
            -------------------------------------------------------------------     -------------------------------------------     -----------------------------------------------
gen nevals  avg             gen     max             min     nevals  std             avg     gen     max     min     nevals  std     avg     gen     max     min     nevals  std
0   10      3.07232e+08     0       2.98774e+09     472.225 10      8.93845e+08     2.3     0       4       1       10      1.1     10.2    0       30      2       10      9.67264
1   10      1.80136e+06     1       1.66745e+07     0.00906234      10      4.97235e+06     2.9     1       5       1       10      1.04403 9.7     1       26      2       10      7.0859
2   3       2331.57         2       13844.5         0.00906234      3       4034.1          2.8     2       3       2       3       0.4     6.7     2       12      4       3       2.75862
3   7       128527          3       1.26907e+06     0.00906234      7       380187          2.9     3       5       1       7       1.13578 6.9     3       14      2       7       3.53412

(rows 4 -- 96 omitted for brevity)

97  8       1.00691         97      1.00691         1.00691         8       0               0       97      0       0       8       0               1       97      1       1       8       0
98  8       86.9783         98      860.721         1.00691         8       257.914         0.1     98      1       0       8       0.3             1.2     98      3       1       8       0.6
99  7       1.00691         99      1.00691         1.00691         7       0               0       99      0       0       7       0               1       99      1       1       7       0
100 7       128571          100     1.2857e+06      1.00691         7       385711          0.2     100     2       0       7       0.6             1.3     100     4       1       7       0.9
  • Viewing the results requires converting the tree to a string

    • It is possible to have DEAP generate and display images of the expression trees

    • Unfortunately, it can be difficult to get working

  • Below is an example of displaying the top result after evolution is complete

169    print(str(hall_of_fame[0]))
170    print(hall_of_fame[0].fitness.values)
  • Below is an example result for a run of the GP

neg(neg(add(1, ARG0)))
(0.009062344161081039,)
  • Given that ARG0 is the only independent variable, the above can be simplified to \(y = x + 1\)

  • The mean squared error of the model when applied to the data is roughly 0.00906

13.7.1. Funny Results

  • Given the stochastic nature of the search, the final result will differ every time

  • And since genetic programming’s only goal is to fit the data, and it does not attempt to simplify the results, some silly looking functions may be generated

    • Below are additional examples of generated solutions that all simplify to \(y = x + 1\)

sub(ARG0, protected_divide(ARG0, neg(ARG0)))
add(mul(protected_divide(ARG0, ARG0), protected_divide(ARG0, ARG0)), ARG0)
add(ARG0, protected_divide(ARG0, mul(7, protected_divide(ARG0, 7))))
add(-8, add(ARG0, 9))
add(mul(protected_divide(ARG0, ARG0), protected_divide(ARG0, ARG0)), mul(protected_divide(ARG0, 1), protected_divide(ARG0, ARG0)))

13.8. For Next Class