Frustration in the easy things

(sitting at SAAO)

Per my work at SKA yesterday, I learned to use matplotlib to produce 3D plots of my functions in combination with a scatterplot of the Iris features. I was hoping to automate solving for any given variable using SymPy, but have not found a means to that end. For now, I will manually rearrange each algebraic expression. Not ideal, but I need to move ahead. I have spent too much time on this already.
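A minimal sketch of that kind of plot, a surface for a candidate function overlaid with a 3D scatter (all variable names and data here are illustrative stand-ins, not Karoo GP's actual code or the real Iris values):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, safe without a display
import matplotlib.pyplot as plt

# illustrative surface: a candidate function z = f(a, b) over a grid
a, b = np.meshgrid(np.linspace(0, 8, 30), np.linspace(0, 5, 30))
z = 0.5 * a + 0.2 * b  # stand-in for a GP-evolved function

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(a, b, z, alpha=0.3)

# illustrative scatter standing in for 3 of the Iris features
pts = np.random.default_rng(0).random((30, 3)) * [8, 5, 5]
ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2])
ax.set_xlabel("feature a")
ax.set_ylabel("feature b")
ax.set_zlabel("f(a, b)")
fig.savefig("iris_3d.png")
```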

I recognise that plotting is core to any modern research. I feel far behind, but know I will come up to speed quickly. Between the older gnuplot, matplotlib, and SymPy's plot functions, there are myriad approaches (too many, in fact).

As I have many times experienced over the past year, every day I am humbled by the challenge of learning something new, and at the same time rewarded by the same. Each day feels incremental, but when I look back from my very first line of Python a little over a year ago to now, over 2500 lines of object-oriented, multi-core code with a home-built NumPy array management system, yeah, I've learned a ton.

September 9th, 2015 | Ramblings of a Researcher

My brain barrier

I pushed too hard, tried to do too many things at once. I felt the barrier coming, but ignored it, trying to make one more breakthrough in my code.

Last night and today I am learning how to plot multi-dimensional data. In practice, this is very simple. But this has been a real struggle for me.

I promised myself that in this Master's, I would not take any shortcuts, that what I learn would be fully integrated into my understanding. Due to my lack of a mathematical science foundation, some of the core principles of data reduction, statistical application, and multi-dimensional visualisation are totally new to me.

What's more, my anxiety literally heats my brain to a point of dysfunction; the back of my neck feels like someone is pouring hot syrup down my spine (since 2011). When I feel stupid, when I can't figure something out, and when the pressure is on … I spiral.

I feel as though I lost a full day (which I can’t afford to do).

On the walk to the train station I asked Arun a number of questions to clarify my understanding. The physical act of walking, gesturing with my arms and hands helped tremendously. Arun is always patient, and an excellent explainer. A few key concepts became clear and it fell into place for me.

September 8th, 2015 | Ramblings of a Researcher

Return to Floating point hell

(sitting at SKA again)

No choice but to return to floating points again, as once in every several thousand trees, the stored fitness score has expanded beyond the set 4 decimal places, resulting in something like n.123456e (scientific notation).

I have searched on-line, dug deep into Stackoverflow, and asked my colleagues. No one has a means by which I can define a variable as holding a fixed number of decimal places that stays that way. Ugh!

I therefore introduced a second round function in the Tournament method, to force the numbers back into spec. I have not since experienced the same error. This is not the ideal solution as the round function introduces its own issues, but for the sake of GP, it does the trick.
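The guard amounts to a second round() applied to the score; a minimal sketch (names are illustrative, not Karoo GP's actual code):

```python
PRECISION = 4  # decimal places the fitness scores are expected to hold

def clamp_fitness(fitness):
    """Force a drifted fitness score back to the prescribed precision."""
    return round(fitness, PRECISION)

# accumulated float error can push a score past 4 decimal places ...
drifted = 0.1 + 0.2 + 0.0001
# ... a second round() inside tournament selection pulls it back into spec
assert clamp_fitness(drifted) == 0.3001
```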

I then approached the issue of plotting 2D and 3D data, but hit a major mental block. Like an old engine with leaky coolant, my brain overheated and I had a bad day. Sorry Arun, I was really grumpy.

September 8th, 2015 | Ramblings of a Researcher

Flowers for Algebra

(sitting at SKA)

It is my intent to complete the classification test suite today, using the benchmark Iris dataset.

The Iris dataset offers 4 features (columns in a .csv) for each of 3 unique plant species, as labelled in the right-most (5th) column. As 2 of these species are not linearly separable from each other, there are 2 ways to approach this problem:

  1. Compare only 2 species at a time (in a round-robin) such that we work with A/B, A/C, and finally B/C; or
  2. Seek a non-linear function with a kernel which supports more than 2 classes at a time
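The round-robin in option 1 can be sketched with a couple of lines of standard Python (the species names are from the Iris dataset; the rest is illustrative):

```python
from itertools import combinations

species = ["setosa", "versicolor", "virginica"]  # the 3 Iris classes
pairs = list(combinations(species, 2))  # every 2-class comparison: A/B, A/C, B/C
print(pairs)
# [('setosa', 'versicolor'), ('setosa', 'virginica'), ('versicolor', 'virginica')]
```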

(Karoo GP, by Kai Staats)

As Karoo GP supports only 2 classes at this time, I am going to work with the first option, and come back to the second when I have time to build a new kernel.

If it is our intent to plot the results, we can engage only 3 dimensions of data at a time (without forcing higher dimensions into lower orders). However, the Iris dataset is composed of 4 features (a, b, c and d). As whatever holds true in lower dimensions is retained in higher dimensions, it is safe to test a lower order of the featureset, as long as the features we select are decisive in the identification of the flower species. As such, we remove column d from all 3 comparisons.
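Dropping that 4th feature can be sketched as a one-line slice (the column order a, b, c, d, label and the sample values are my assumption):

```python
# each row: [a, b, c, d, label]; drop column d (index 3) before training
rows = [
    [5.1, 3.5, 1.4, 0.2, "setosa"],
    [7.0, 3.2, 4.7, 1.4, "versicolor"],
]
reduced = [row[:3] + row[4:] for row in rows]  # keep a, b, c and the label
print(reduced[0])  # [5.1, 3.5, 1.4, 'setosa']
```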

This resulted in what appears to be a fully functional classification by Karoo GP, with both linear and non-linear functions that produce 100% (50/50) correct classifications.

September 7th, 2015 | Ramblings of a Researcher

Kepler’s Law resolved with GP!

(Kepler's Law resolved with Karoo GP, by Kai Staats)

(Kepler’s 3rd Law of Planetary Motion table by the Physics Classroom)

(continued from Premature convergence)

Working from AIMS and my apartment these past two days, I was able to resolve a persistent floating point issue by employing a round function before the fitness evaluation.

I also fixed the minimisation function with the discovery of 2 copy/pasted lines of code I had apparently failed to come back to. It appears this has not been working for some time, as I have been focused on other aspects of the code.

Finally! Just like that! Karoo GP now resolves Kepler’s 3rd Law of Planetary Motion! YES!!!

I ran it with the default Depth 3 and minimum node count of 3 and again, it came up with t/t = 1. So I ran it again with Depth 5 [2^(d+1)-1 = 63 possible nodes] and minimum node count of 9.

While it struggled for the first 5-6 generations, converging on what appeared to be 1 again, some mutation gave it the correct answer and in just 3 generations the correct trees dominated! Coooool!

{1: t**2/r**3,
 2: t**2/r**3,
 3: t**2/r**3,
 ...
 88: t**2/r**3,
 92: t**2/r**3,
 94: t**2/r**3}

This proves 2 of the 3 desired functions: regression maximisation (match) and regression minimisation. Only classification remains to prove Karoo GP roadworthy.

September 6th, 2015 | Ramblings of a Researcher

Smashing bugs

The past 3 days have been a fun dive back into my code. I discovered my code no longer worked with cos and sin, the matching fitness function (result = solution) failed with floats, and my minimisation function was selecting the final Tree, not the best. Ugh!

The float issue is one all programmers deal with, namely, forcing all variables which are compared against each other to the same number of decimal places. FIXED!

The cos and sin failures were related to the float issue. FIXED!

The minimisation function was my own damned fault, as in my Tournament Selection I had copy/pasted a body of code with intent to then rework it from maximising to minimising, but got distracted, forgot (3-4 weeks ago?) and only tonight realised what was happening. FIXED!

Here are my first results from the working minimisation function. What remains is the automated selection of the best of the best, not just a list of the leaders in the final generation. Easily done.
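That selection can be sketched in one line with Python's built-in min (the structure of the leader list and the fitness values are my assumption, not Karoo GP's internals):

```python
# hypothetical leaders of the final generation: (tree_id, expression, fitness)
leaders = [
    (12, "a*b + c - c**3/b", 0.41),
    (31, "a*b + c", 0.0),  # an exact match scores 0 error
    (47, "a*b + c + c/b", 0.005),
]
# for a minimisation function, the best-of-the-best has the lowest fitness
best = min(leaders, key=lambda tree: tree[2])
print(best)  # (31, 'a*b + c', 0.0)
```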

Desired result:
a*b + c where 1*10 + 0.05 = 10.05

Trial 1:
a*b + c - c**3/b where 1*10 + 0.05 - (0.05^3/10) = 9.64

Trial 2:
a*b + c + c/b where 1*10 + 0.05 + 0.05/10 = 10.055
a*b where 1*10 = 10.00
a*b + c where 1*10 + 0.05 = 10.05 is CORRECT
a*b + c + c/a where 1*10 + 0.05 + (0.05/1) = 10.10

Trial 3:
a*b + c where 1*10 + 0.05 = 10.05 is CORRECT
a*b + 1/b where 1*10 + 1/10 = 10.10

The above test works perfectly. Now, I have only to test the Iris classification set and I will have all 3 fitness functions fully tested and working.

I am roughly 5 weeks behind schedule, but believe I can catch up, as my code base is solid and designed for the rapid introduction of new fitness functions. In theory, this goes fairly smoothly after I prove the base works (why do I keep saying this?!)

Cool!

September 6th, 2015 | Ramblings of a Researcher

Floating points

In a moment of needing a break from working on the User Guide, I played with Karoo GP for 30 minutes and came back to testing cos and sin functions. They broke the Absolute Value fitness function, while Classification and Matching still worked.

A day into the battle, a deeper issue has unfolded which keeps Karoo GP from working with any floats (this was tested long ago, but with a very limited and controlled number of decimal places).

Now I have learned that even when result = solution, it fails to match.

algo_sym a + b/c
result 0.44404973357
solution 0.4440497336

algo_sym a + b/c
result 0.690166666667
solution 0.6901666667

algo_sym a + b/c
result 2.70666517168
solution 2.7066651717

ARGH!!! Rounding errors!!!

I need to introduce careful control of the number of decimal places invoked for both ‘result’ and ‘solution’. For as much headache as this has caused, I am pleased I took that break and played around with my code again, else I would have run into this when I came back to KAT7 data.
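The fix amounts to rounding both sides to the same number of decimal places before testing equality; a sketch using the first mismatch above:

```python
PRECISION = 10  # decimal places applied to both 'result' and 'solution'

result = 0.44404973357   # value computed from the evolved expression
solution = 0.4440497336  # value stored in the dataset

assert result != solution  # the raw comparison fails on the trailing digits
assert round(result, PRECISION) == round(solution, PRECISION)  # now they match
```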

What's more, this might help Karoo GP solve the Kepler planet problem (at which it still fails, miserably).

September 5th, 2015 | Ramblings of a Researcher

A simple equation

(Depth 10 GP tree, by Kai Staats)

(sitting in the bookstore in Kalk Bay)

This is my first time working on Karoo GP since my daughter Lindah's arrival to South Africa nearly two weeks ago. She departed just yesterday. Our time together was incredible. We both learned so much. I am so sad to see her go :(

My intent is to complete the User Guide by the close of the weekend. I was able to complete all in-line documentation for the main script. I am now in the process of completing the in-line documentation for the base_class.

In so doing, I derived the following simple but quite useful equation to determine the maximum number of nodes (assuming a Full tree) in any given GP tree:

nodes = 2^(depth + 1) - 1

For example:

Depth 1 = 3 nodes (1 functions, 2 terminals)
Depth 2 = 7 nodes (3 functions, 4 terminals)
Depth 3 = 15 nodes (7 functions, 8 terminals)

Depth 10 = 2047 nodes (1023 functions, 1024 terminals) *

* that is one huge-ass (scientific lingo) polynomial!
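The arithmetic above, sketched (the function name is mine, for illustration):

```python
def full_tree_counts(depth):
    """Node counts for a Full GP tree of the given depth."""
    nodes = 2 ** (depth + 1) - 1  # total nodes
    functions = 2 ** depth - 1    # internal (operator) nodes
    terminals = 2 ** depth        # leaf (variable/constant) nodes
    return nodes, functions, terminals

for d in (1, 2, 3, 10):
    print(d, full_tree_counts(d))
# 1 (3, 1, 2)
# 2 (7, 3, 4)
# 3 (15, 7, 8)
# 10 (2047, 1023, 1024)
```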

September 4th, 2015 | Ramblings of a Researcher

Multi-core GP!

August 15-20

I was able to introduce multi-core functionality using the pprocess multi-core library (thanks for the pointer, Arun!). This effort required roughly four days of research and coding, as it was my very first go. In the end, it is relatively simple, depending upon the code into which it is introduced. I included a user-defined core quantity, and the ability to modify the number of cores engaged during runtime.

The key to multi-core functionality is getting your head wrapped around protected memory spaces. That is, each core assigned by pprocess (or any multi-core library) will reproduce a fully functional copy (instance) of the section of your code which you are spawning on each core.

Each of these instances can read from any global variable, no problem. However, they cannot write back to global variables as all of them would try to write back to the same variable at the same time, known as a race condition. Bad things would happen.

It is imperative to keep in mind that what happens in each instance will not affect what happens in the other instances, on the other cores. Once spawned, they are all independent even though they work with the same variable names and are processed by copies of the same code.

If the returned values are to affect something, that something must be outside of the multi-core environment, post-collection. This may require some rearranging of your code.

For me, it meant that instead of a fitness = fitness + 1 inside the pprocess pipeline, I returned the single value of fitness for each instance on each core, and then conducted the sum function when I collect the results from pprocess.

Below I offer an example of the original for loop and the pprocess code which replaces it, for multi-core support. I have left the single-core for loop in place as a user-invoked bypass of pprocess, for when the overhead of multi-core processing reduces performance (common on non-CPU-intensive runs or on a limited number of cores).

In my code, each GP tree must evaluate all rows in the given data. I therefore employ a for loop to iterate through each row in the previously loaded .csv file (not shown in this example). The Python method (function) fitness_eval() is what conducts the evaluations.

As fitness_eval itself calls another two Python methods (3 methods in total), the key was to rewrite my code such that any global variables whose values are updated (written to) were made local instead, passed from method to method directly. In the end, this makes for better code, forcing me to rethink the way I had designed this and other sections.

if self.cores == 1: # employ only one CPU core and bypass 'pprocess'
   for row in range(0, data_rows): # increment through all rows in the data
      fitness = fitness + fitness_eval(row) # evaluate tree fitness

else: # employ multiple CPU cores via 'pprocess' (imported as pp)
   results = pp.Map(limit = self.cores) # cap the number of cores engaged
   parallel_function = results.manage(pp.MakeParallel(fitness_eval))
   for row in range(0, data_rows): # increment through all rows in the data
      parallel_function(row) # evaluate tree fitness on an available core

   fitness = sum(results[:]) # 'pprocess' returns the fitness scores in a single dump

In this example, the pprocess method parallel_function() is a wrapper for my original method fitness_eval(), such that I pass the incrementing variable row through parallel_function(row), which in turn hands it off to multiple copies of fitness_eval(). Cool, huh?! :)

One benefit of pprocess, as compared to the multiprocessing library in Python, is that pprocess can pass more than one variable through the called methods without explicit pickling (serialisation used to send variables to and from methods across multi-core memory spaces). So, you can treat your methods just as you would on a single core, sending and returning variables using method(var_1, var_2, var_n) and a subsequent return var_1, var_2, var_n.

But keep in mind what I shared above, about how those variables can only return an isolated value per instance, as changes to each of those variables on each core will not affect the other instances. Make sense?
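For comparison, the same collect-then-sum pattern in the standard multiprocessing library might look like this (fitness_eval here is a toy stand-in, not Karoo GP's actual method):

```python
from multiprocessing import Pool

def fitness_eval(row):
    """Toy stand-in for the per-row evaluation: score 1 per 'matching' row."""
    return 1 if row % 2 == 0 else 0

data_rows = 10
with Pool(processes=2) as pool:
    scores = pool.map(fitness_eval, range(data_rows))  # one isolated result per row
fitness = sum(scores)  # the accumulation happens outside the worker processes
print(fitness)  # 5
```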

Many thanks to the Stackoverflow community and Paul Boddie, co-author of pprocess.

August 20th, 2015 | Ramblings of a Researcher

Premature convergence, part 2

(continued from GP update 20150813)

I see 4 ways to deal with the premature convergence:

(Karoo GP, premature convergence, by Kai Staats)

a) Take a pill.

b) If any given tree falls below the user-defined number of nodes (node count, not depth count), that tree is forced to mutate over and over again until it is at or above the prescribed node count. This feels convoluted, as this is not how it happens in the biological world.

c) Nudge the fitness function (higher or lower, depending upon max or min function) such that a tree whose node count is below the user-defined number is less likely to be selected in a Tournament.

d) Simply block any tree whose node count is lower than the user-defined number from entering a tournament. As all four of my mutation types are channelled through tournament selection, this is an easy, 2 line solution.

Is it real-world? Is it any different than applying a maximum depth?
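Option (d) really can amount to a 2-line gate; a sketch (the tree representation and threshold are my assumptions, not Karoo GP's internals):

```python
MIN_NODES = 9  # hypothetical user-defined minimum node count

def eligible(population):
    """Keep only trees large enough to enter a tournament (option d)."""
    return [tree for tree in population if tree["node_count"] >= MIN_NODES]

population = [{"id": 1, "node_count": 3}, {"id": 2, "node_count": 15}]
print([t["id"] for t in eligible(population)])  # [2]
```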

Hmmmm …

(resolved in Kepler's Law resolved with GP!)

August 14th, 2015 | Ramblings of a Researcher