Ramblings of a Researcher

Return to Floating point hell

(sitting at SKA again)

No choice but to return to floating points again as once in every several thousand trees, the fitness score stored has expanded beyond the set 4 floating point places, resulting in something like n.123456e (scientific notation).

I have searched on-line, dug deep into Stackoverflow, and asked my colleagues. No one has a means by which I can define a variable as a fixed number of floating points (to the right of the decimal) that stays that way. Ugh!

I therefore introduced a second round function in the Tournament method, to force the numbers back into spec. I have not since experienced the same error. This is not the ideal solution as the round function introduces its own issues, but for the sake of GP, it does the trick.

I then approached the issue of plotting 2D and 3D data, but hit a major mental block. Like an old engine with leaky coolant, my brain overheated and I had a bad day. Sorry Arun, I was really grumpy.

By Kai Staats|2017-11-24T23:52:37-04:00September 8th, 2015|Ramblings of a Researcher|Comments Off

Flowers for Algebra

(sitting at SKA)

It is my intent to complete the classification test suite today, using the benchmark Iris dataset.

The Iris dataset offers 4 features (columns in a .csv) for each of 3 unique plant species, as labelled in the right-most (5th) column. As 2 of these species are not discernible from each other with a linear function, there are 2 ways to approach this problem:

Compare only 2 species at a time (in a round-robin) such that we work with A/B, A/C, and finally B/C; or
Seek a non-linear function with a kernel which supports more than 2 classes at a time

As Karoo GP supports only 2 classes at this time, I am going to work with the first option, and come back to the second when I have time to build a new kernel.

If it is our intent to plot the results, we can engage only 3 dimensions of data at one (without forcing higher dimensions into lower orders). However, the Iris dataset is compose of 4 features (a, b, c and d). As whatever holds true in lower dimensions will be retained in higher dimensions, it is safe to test a lower order of the featureset as long as the features we selected are decisive in the identification of the flower species. As such that we remove column d from all 3 comparisons.

This resulted in what appears to be a fully functional classification by Karoo GP, with both linear and non-linear functions that produce 100% (50/50) correct classifications.

By Kai Staats|2017-11-24T23:52:45-04:00September 7th, 2015|Ramblings of a Researcher|Comments Off

Kepler’s Law resolved with GP!

(Kepler’s 3rd Law of Planetary Motion table by the Physics Classroom)

(continued from Premature convergence)

Working from AIMS and my apartment these past two days, I was able to resolve a persistent floating point issue by employing a round function before the fitness evaluation.

I also fixed the minimisation function with the discovery of 2 copy/pasted lines of code I had apparently failed to come back to. It appears this has not been working for some time, as I have been focused on other aspects of the code.

Finally! Just like that! Karoo GP now resolves Kepler’s 3rd Law of Planetary Motion! YES!!!

I ran it with the default Depth 3 and minimum node count of 3 and again, it came up with t/t = 1. So I ran it again with Depth 5 [2^(d+1)-1 = 63 possible nodes] and minimum node count of 9.

While it struggled for the first 5-6 generations, converging on what appeared to be 1 again, some mutation gave it the correct answer and in just 3 generations the correct trees dominated! Coooool!

{1: t**2/r**3,
2: t**2/r**3,
3: t**2/r**3,
…
88: t**2/r**3,
92: t**2/r**3,
94: t**2/r**3}

This proves 2 of the 3 desired functions: regression maximisation (match) and regression minimisation. Only classification remains to prove Karoo GP road worthy.

By Kai Staats|2017-11-24T23:52:52-04:00September 6th, 2015|Ramblings of a Researcher|Comments Off

Smashing bugs

The past 3 days have been a fun dive back into my code. I discovered my code no longer worked with cos and sin, the matching fitness function (result = solution) failed with floats, and my minimalisation function was selecting the final Tree, not the best. Ugh!

The float issue all programmers deal with, namely, forcing all variables which are compared against each other into the same number of positions to the right of the decimal point. FIXED!

The cos and sin were related to the float issue. FIXED!

The minimalisation function was my own damned fault, as in my Tournament Selection I had copy/pasted a body of code with intent to them rework it from maximising to minimising, but got distracted, forgot (3-4 weeks ago?) and only tonight realised what was happening. FIXED!

Here are my first results from the working minimalisation function. What remains is the automated selection of the best of the best, not just a list of the leaders in the final generation. Easily done.

Desired result:
a*b + c where 1*10 + 0.05 = 10.05

Trial 1:
a*b + c – c**3/b where 1*10 + 0.05 – (0.05^3/10) = 9.64

Trial 2:
a*b + c + c/b where 1*10 + 0.05 + 0.05/10 = 10.055
a*b where 1*10 = 10.00
a*b + c where 1*10 + 0.05 = 10.05 is CORRECT
a*b + c + c/a where 1*10 + 0.05 + (0.05/1) = 10.10

Trial 3:
a*b + c where 1*10 + 0.05 = 10.05 is CORRECT
a*b + 1/b where 1*10 + 1/10 = 10.10

The above test works perfectly. Now, I have only to test the Iris classification set and I will have all 3 fitness functions fully tested and working.

I am roughly 5 weeks behind sched, but believe I can catch-up as I my code base is solid and designed for the rapid introduction of new fitness functions. In theory, this goes fairly smooth after I prove the base works (why do I keep saying this?!)

Cool!

By Kai Staats|2017-11-24T23:53:01-04:00September 6th, 2015|Ramblings of a Researcher|Comments Off

Floating points

In a moment of needing a break from working on the User Guide I played with Karoo GP for 30 minutes and came back to testing cos and sin functions. They broke the Absolute Value function where Classification and Matching yet worked.

A day into the battle, a deeper issue has unfolded which keeps Karoo GP from working with any floats (although this was tested a long time ago, but with a very limited and controlled number of decimal places).

Now, I have learned even when [result = solution], it fails to match.

algo_sym a + b/c
result 0.44404973357
solution 0.4440497336

algo_sym a + b/c
result 0.690166666667
solution 0.6901666667

algo_sym a + b/c
result 2.70666517168
solution 2.7066651717

ARGH!!! Rounding errors!!!

I need to introduce careful control of the number of decimal places invoked for both ‘result’ and ‘solution’. For as much headache as this has caused, I am pleased I took that break and played around with my code again, else I would have run into this when I came back to KAT7 data.

What’s more, this might help Karoo GP solve the Kepler planet problem (which still fails at, miserably).

By Kai Staats|2017-11-24T23:53:08-04:00September 5th, 2015|Ramblings of a Researcher|Comments Off

A simple equation

(sitting in the bookstore in Kalk Bay)

This is my first time working on Karoo GP since my daughter Lindah’s arrival to South Africa nearly two weeks ago. She departed just yesterday. Out time together was incredible. We both learned so much. I am so sad to see her go :(

My intent is to complete the User Guide by the close of the weekend. I was able to complete all in-line documentation for the main script. I am now in the process of completing the in-line documentation for the base_class.

In so doing, I derived the following simple but quite useful equation to determine the maximum number of nodes (assuming a Full tree) in any given GP tree:

nodes = 2^(depth + 1) – 1

For example:

Depth 1 = 3 nodes (1 functions, 2 terminals)
Depth 2 = 7 nodes (3 functions, 4 terminals)
Depth 3 = 15 nodes (7 functions, 8 terminals)
…
Depth 10 = 2047 nodes (1023 functions, 2024 terminals) *

* that is one huge-ass (scientific lingo) polynomial!

By Kai Staats|2017-11-24T23:53:17-04:00September 4th, 2015|Ramblings of a Researcher|Comments Off

Multi-core GP!

August 15-20

I was able to introduce multi-core functionality using ‘pprocess multicore library‘ (Thanks for the pointer Arun!). This effort required roughly four days research and coding, as it was my very first go. In the end, it is relatively simple, depending upon the code in which it is introduced. I included a user-defined core quantity, and the ability to modify the number of cores engaged during runtime.

The key to multi-core functionality is getting your head wrapped around protected memory spaces. That is, each core assigned by pprocess (or any multi-core library) will reproduce a fully functional copy (instance) of the section of your code which you are spawning on each core.

Each of these instances can read from any global variable, no problem. However, they cannot write back to global variables as all of them would try to write back to the same variable at the same time, known as a race condition. Bad things would happen.

It is imperative to keep in mind that what happens in each instance will not affect what happens in the other instances, on the other cores. Once spawned, they are all independent even though they work with the same variable names and are processed by copies of the same code.

If the returned values are to affect something, that something must be outside of the multi-core environment, post-collection. This may require some rearranging of your code.

For me, it meant that instead of a fitness = fitness + 1 inside the pprocess pipeline, I returned the single value of fitness for each instance on each core, and then conducted the sum function when I collect the results from pprocess.

Below I offer an example of the original for loop and the pprocess which replaces it, for multi-core support. I have left the single core for loop in place as a user invoked bypass of pprocess when the overhead of multi-core processing reduces performance (common on non-CPU intensive runs or on a limited number of cores).

In my code, each GP tree must evaluate all rows in the given data. I therefore employ a for loop to iterate through each row in previously loaded .csv file (not shown in this example). The Python method (function) fitness_eval() is what conducts the evaluations.

As fitness_eval itself calls another two Python methods (3 methods in total), the key was to rewrite my code such that any global variables for which values are updated (written to) were made local instead, passed from method to method directly. In the end, this makes for better code, forcing me to rethink the way I had designed this and others sections.

if cores == 1: # employ only one CPU core and bypass ‘pprocess’
   for row in range(0, data_rows): # increment through all rows in the data
      fitness = fitness + fitness_eval(row) # evaluate tree fitness

else: # employ multiple CPU cores using ‘pprocess’
   results = pp.Map(limit = self.cores)
   parallel_function = results.manage(pp.MakeParallel(fitness_eval))
   for row in range(0, data_rows): # increment through all rows in data
      parallel_function(row) # evaluate tree fitness

   fitness = sum(results[:]) # ‘pprocess’ returns the fitness scores in a single dump

In this example, the pprocess method parallel_function() is a wrapper for my original method fitness_eval(), such that I pass the incrementing variable row through parallel_function(row) which in turn hands if off to multiple copies of fitness_eval(). Cool, huh?! :)

One benefit of pprocess, as compared to the Python mulitiprocessing in Python library, is that pprocess can pass more than one variable through the called methods without using a pickle (a package used to send variables to and from methods while in mulit-core memory spaces). So, you can treat your methods just as you would on a single core, sending and returning variables using method(var_1, var_2, var_n) and subsequent return var_1, var_2, var_n.

But keep in mind what I shared above, about how those variables can only return an isolated value per instance, as changes to each of those variables on each core will not affect the other instances. Make sense?

Many thanks to the Stackoverflow community and Paul Boddie, co-author of pprocess.

By Kai Staats|2017-11-24T23:53:24-04:00August 20th, 2015|Ramblings of a Researcher|Comments Off

Premature convergence, part 2

(continued from GP update 20150813)

I see 4 ways to deal with the premature convergence:

a) Take a pill.

b) If any given tree falls below the user-defined number of nodes (node count, not depth count), that tree is forced to mutate over and over again ’till it is at or above the prescribed node count. This feels convoluted, as this is not how it happens in the biological world.

c) Nudge the fitness function (higher or lower, depending upon max or min function) such that a tree whose node count is below the user-defined number is less likely to be selected in a Tournament.

d) Simply block any tree whose node count is lower than the user-defined number from entering a tournament. As all four of my mutation types are channelled through tournament selection, this is an easy, 2 line solution.

Is it real-world? Is it any different than applying a maximum depth?

Hmmmm …

(resolved in Kepler’s Law resolved by GP!)

By Kai Staats|2017-11-24T23:53:31-04:00August 14th, 2015|Ramblings of a Researcher|Comments Off

GP update 2015 08/13

(email to my fellow researchers)

Subject: premature convergence

No, this email is not about some “teenage” problem :)

Per my conversation with Emmanuel today, the classic Kepler’s 3rd Law of Planetary Motion (called the “Harmonies Law”) is the square of the period divided by the cube of the mean radius from the center of the Sun to the center of the planet.

My new “minimisation’ function is working beautifully! However, maybe it works a little too well. Karoo GP is quickly converging on p/p (period divided by period) which of course equals 1, where 6 of the 9 planets are defined as 1.0 in the dataset I am using (the other 3 planets are 0.99 or 0.98).

So, I am going to introduce a new user defined “Min Nodes” which will set the minimum number of nodes (elements) in the GP tree (equation). I feel like this is cheating, but Emmanuel confirmed that most GP problems require some tweaking of the code.

There are 2 ways a large tree can very quickly get smaller: Grow Mutation or Cross-Over mutation (Reproduction, Point Mutation, and Full Mutation do not alter Tree size). I can intercept either of these and force them to evolve again and again until the new tree is above the min boundary; or simply invoke an artificial fitness boost in the right direction.

What would Darwin say? What would Gandhi do?

It seems to invoke evolutionary pressure is the one more like the “real-world”, no? To force something to evolve again and again until it satisfies a certain criteria is a bit convoluted. But any more so than defining a maximum depth?

We’ll see how I feel in the morning :)

kai

(continued in Premature convergence)

By Kai Staats|2017-11-24T23:53:39-04:00August 13th, 2015|Ramblings of a Researcher|Comments Off

Minimise, Maximise

(sitting at SKA)

Struggling to get back into my code.

Preparing to test Kepler’s 3rd Law of Planetary Motion. This is a minimisation problem, meaning the best overall fitness will be the smallest number. I will employ the Absolute Value Difference fitness function I recently developed. Simple in theory, but always a few hidden challenges in implementation.

(later)

Realised I need to re-think all my fitness functions, and embed both minimisation and maximisation into the tournament and ‘fitness gym’. Will move to define each fitness kernel as ‘min’ or ‘max’ at the opening.

By Kai Staats|2017-11-24T23:53:46-04:00August 12th, 2015|Ramblings of a Researcher|Comments Off