Farooq's

16 Aug 2025

WakeGP 16th August 2025 Devlog

After all this time, I'm beginning to think that it's too hard for Genetic Programming to do feature selection, feature discovery, constant discovery, and solution discovery all at the same time. I believe most of my experiments were a waste. I'm also beginning to doubt whether my feature extraction code, which I had just stolen from another project, works correctly.

So what have I done? I've used PyTorch's code for feature extraction (MelSpectrogram). It outputs a CSV file where each line has 280 floating-point numbers. Now I'm experimenting with different ways of doing feature selection: I select 24 features out of the 280 and see how well they perform.
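For reference, here is roughly what that extraction step can look like. This is a minimal sketch, not my actual script: it assumes torchaudio's MelSpectrogram transform, 16 kHz mono audio, and made-up n_mels/hop parameters; the file names are placeholders, and the truncation to 280 values only mimics the shape of the real output.

```python
import csv
import torchaudio

# Made-up parameters; the real settings behind the 280 numbers
# per line may well differ.
transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,
    n_mels=28,
)

def extract_row(wav_path: str) -> list[float]:
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    mel = transform(waveform)   # shape: (channels, n_mels, frames)
    mel = mel.mean(dim=0)       # collapse channels
    # Keep the first 280 values; how the real pipeline arrives at
    # exactly 280 per clip is an assumption here.
    return mel.flatten().tolist()[:280]

with open("features.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for path in ["sample1.wav", "sample2.wav"]:  # placeholder file list
        writer.writerow(extract_row(path))
```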

I have added a -r option to WakeGP, with which you can specify how many runs you want to do with the config you provided. So you can do wakegp -c config.toml -r 16 to do 16 runs with the same config. This is good because loading the dataset, selecting features, and loading and parsing the configuration file all happen only once. I have also added -t, which tells WakeGP to do all experiments concurrently. It simply uses rayon's par_iter. The advantage is that it utilizes almost all 24 cores of my CPU. If you don't provide it, the code is sequential at some points, like selection. With -t, all 16 runs happen simultaneously, which is faster. However, all the output logs will be mixed together. But that's no problem, because I can do something like:

wakegp -c config.toml -r 16 -t | grep 'Gen 600 ' | awk '{ print $7 }'

And then I have the fitness values of all of them, after which I can do a t-test against another config to see which one performs better.
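That comparison step is simple enough; here is a minimal sketch, assuming SciPy's Welch t-test and placeholder file names (one fitness value per line, as produced by the pipeline above):

```python
from pathlib import Path
from scipy import stats

# Placeholder files, each holding the 16 fitness values for one config.
a = [float(x) for x in Path("config_a_fitness.txt").read_text().split()]
b = [float(x) for x in Path("config_b_fitness.txt").read_text().split()]

# Welch's t-test (no equal-variance assumption) between the two configs.
t, p = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t:.3f}, p = {p:.4f}")
```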

For feature selection, I have written a small Python script to play with this. The script does runs, does t-tests, and then moves forward. I'm testing Simulated Annealing and a simple greedy algorithm. I first started with a small population (around 16) and a low number of generations, like 32. There seemed to be progress. But when you actually do runs with a realistic number of generations like 400, you realize that the "better" set of features you found is actually worse!

So I tried the same population size but with a bigger number of generations (96). And now I have progress! I'm currently using just a simple hill climbing algorithm. And yeah, I know it has several disadvantages. But for now, I want to figure out how many generations, and how many runs per feature set, are required.
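For illustration, here is a minimal sketch of such a hill climbing loop. It is not my actual script: the --features flag is made up (WakeGP receives the feature subset somehow, but not necessarily this way), the field index matches the awk '{ print $7 }' pipeline above, and it assumes higher fitness is better.

```python
import random
import statistics
import subprocess

N_FEATURES = 280
SUBSET_SIZE = 24

def evaluate(features: list[int]) -> list[float]:
    """Run 16 concurrent WakeGP runs for a feature subset and collect
    the final fitness values. The --features flag is hypothetical."""
    out = subprocess.run(
        ["wakegp", "-c", "config.toml", "-r", "16", "-t",
         "--features", ",".join(map(str, features))],
        capture_output=True, text=True, check=True,
    ).stdout
    # Same extraction as the grep/awk pipeline: 7th field of
    # the 'Gen 600' lines.
    return [float(line.split()[6])
            for line in out.splitlines() if "Gen 600 " in line]

random.seed(0)
current = random.sample(range(N_FEATURES), SUBSET_SIZE)
best_mean = statistics.mean(evaluate(current))

for step in range(100):
    # Neighbor move: swap one selected feature for an unselected one.
    candidate = current.copy()
    candidate[random.randrange(SUBSET_SIZE)] = random.choice(
        [f for f in range(N_FEATURES) if f not in current])
    mean = statistics.mean(evaluate(candidate))
    if mean > best_mean:  # assuming higher fitness is better
        current, best_mean = candidate, mean
        print(f"step {step}: improved to {best_mean:.4f}")
```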
