Download MEPX software from here
Data are loaded from csv or txt files. Values must be separated by a blank space, a tab or a semicolon (;).
The last value on each line is the target (expected output). Test data may omit the output, in which case it has one column fewer than the training data.
Currently a problem can have only 1 output. Files containing multiple outputs must be split accordingly (for instance, the Building problem from PROBEN1, which has 3 outputs: energy, hot water and cold water).
For classification problems, the last column may contain only the values 0 or 1 (for binary classification) or the values 0, 1, ..., (num_classes - 1) (for multi-class classification).
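To make the file layout concrete, here is a minimal Python sketch of a reader for such files. It is not MEPX's own loader; the function name parse_rows and the has_target flag are made up for this illustration.

```python
import re

def parse_rows(text, has_target=True):
    """Split text into numeric rows; separators may be spaces, tabs or ';'.

    Hypothetical helper: when has_target is True, the last value on each
    line is treated as the target (expected output).
    """
    inputs, targets = [], []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[ \t;]+", line.strip()) if f]
        if not fields:
            continue  # skip blank lines
        values = [float(f) for f in fields]
        if has_target:
            inputs.append(values[:-1])
            targets.append(values[-1])
        else:
            inputs.append(values)
    return inputs, targets

# Training data: two inputs plus the target on each line, mixed separators.
X, y = parse_rows("1 2;3\n4\t5 9\n")

# Test data without the output column (one column fewer).
X_test, y_test = parse_rows("7 8\n", has_target=False)
```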
Training data is compulsory. The others (validation and test) are optional.
You can also load alphanumerical values and then convert them to numerical values. Several specialised buttons are provided for that.
The user can also scale numerical values to a given interval.
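The manual does not say which scaling formula is used. The sketch below assumes plain linear (min-max) scaling of one column to the interval [low, high]; the function name scale_to_interval is hypothetical.

```python
def scale_to_interval(values, low, high):
    """Linearly map values so that min(values) -> low and max(values) -> high.

    Assumption: min-max scaling. A constant column is mapped to low.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [float(low) for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

scaled = scale_to_interval([2, 4, 6], 0, 1)
```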
Fitness (or the error) is computed as follows:
For symbolic regression problems, the fitness is either Mean Absolute Error (sum of absolute errors divided by the number of examples) or Mean Squared Error (sum of squared errors divided by the number of examples).
For classification problems, the fitness is Mean Classification Error which is the number of incorrectly classified examples divided by the number of examples and multiplied by 100 (this is actually the percentage of incorrectly classified data).
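The three error measures described above can be written down directly. This is only an illustration of the formulas; the function names are chosen here and are not part of MEPX.

```python
def mean_absolute_error(predicted, target):
    """Sum of absolute errors divided by the number of examples."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def mean_squared_error(predicted, target):
    """Sum of squared errors divided by the number of examples."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def mean_classification_error(predicted, target):
    """Percentage of incorrectly classified examples."""
    wrong = sum(1 for p, t in zip(predicted, target) if p != t)
    return 100.0 * wrong / len(target)

mae = mean_absolute_error([1, 2, 3], [1, 4, 0])            # (0 + 2 + 3) / 3
mse = mean_squared_error([1, 2, 3], [1, 4, 0])             # (0 + 4 + 9) / 3
mce = mean_classification_error([0, 1, 1, 0], [0, 1, 0, 0])  # 1 of 4 wrong
```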
A problem with 2 classes can be solved by selecting either binary classification or multi-class classification.
Binary classification means that a threshold is involved. Values less than or equal to the threshold are classified as belonging to class 0 and the others are classified as belonging to class 1.
For multi-class classification, the outputs are assigned to groups of genes, and the gene encoding the expression with the first maximal value provides the class for that data (see more details here: google groups post).
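The exact assignment of genes to classes is described in the linked post, not here. The sketch below therefore makes an assumption: genes are split into contiguous groups of equal size, one group per class, with any leftover genes counted in the last class. Only the "first maximal value" rule is taken from the text above.

```python
def predict_class(gene_values, num_classes):
    """Return the class of the first gene holding the maximal value.

    Assumed layout (hypothetical): gene i belongs to class i // group_size.
    """
    group_size = len(gene_values) // num_classes
    if group_size == 0:
        raise ValueError("need at least one gene per class")
    best_gene = gene_values.index(max(gene_values))  # index() returns the first maximum
    return min(best_gene // group_size, num_classes - 1)

# Six genes, three classes: genes 0-1 -> class 0, 2-3 -> class 1, 4-5 -> class 2.
# The maximal value 1.5 appears at genes 1 and 3; the first one (gene 1) wins.
predicted = predict_class([0.2, 1.5, 0.7, 1.5, 0.1, 0.3], 3)
```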
In the case of binary classification, the threshold is computed automatically (because of that, binary classification is slower than multi-class classification).
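How MEPX searches for the threshold is not documented here. One simple way to do it, shown only as an assumption, is to try every distinct output value as a candidate threshold and keep the one with the fewest misclassified examples. This also hints at why the extra search costs time.

```python
def best_threshold(outputs, labels):
    """Try each distinct output as a threshold; return (threshold, errors).

    Rule from the manual: value <= threshold -> class 0, otherwise class 1.
    The exhaustive search itself is an assumption of this sketch.
    """
    best_t, best_errors = None, len(labels) + 1
    for t in sorted(set(outputs)):
        errors = sum(
            1 for o, y in zip(outputs, labels) if (0 if o <= t else 1) != y
        )
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

threshold, errors = best_threshold([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1])
```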
If Use validation set is checked, then at each generation the best individual is run against the validation set, and the best such individual (among those tested against the validation set) is the output of the program (and will be applied to the test data).
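The selection rule can be sketched as follows. The individuals and the validation_error function are placeholders invented for the example; only the rule "keep the generation-best individual with the lowest validation error" comes from the text.

```python
def select_by_validation(best_per_generation, validation_error):
    """Among the best individuals of each generation, keep the one with
    the lowest error on the validation set."""
    chosen, chosen_error = None, float("inf")
    for individual in best_per_generation:
        error = validation_error(individual)
        if error < chosen_error:
            chosen, chosen_error = individual, error
    return chosen, chosen_error

# Hypothetical validation errors of the best individual of three generations.
errors_on_validation = {"gen0": 12.0, "gen1": 7.5, "gen2": 9.0}
chosen, chosen_error = select_by_validation(
    ["gen0", "gen1", "gen2"], errors_on_validation.get
)
```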
It is possible to run the optimization on a smaller subset of the training data. In that case you have to set the Random subset size to a value smaller than the size of the training set.
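Drawing such a subset could look like the sketch below. Whether MEPX redraws the subset at every generation is not stated here, so the example only shows a single draw; random_subset is a hypothetical name.

```python
import random

def random_subset(training_rows, subset_size, rng):
    """Pick subset_size distinct rows at random; use everything if the
    requested size is not smaller than the training set."""
    if subset_size >= len(training_rows):
        return list(training_rows)
    return rng.sample(training_rows, subset_size)

rng = random.Random(0)  # fixed seed so the draw is repeatable
subset = random_subset(list(range(10)), 3, rng)
```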
Operators (or functions)
Classic operators +, -, *, ... nothing new here.
Note that trigonometric operators work with radians.
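If your data store angles in degrees, convert them before loading. A short Python illustration of the difference (Python's math module also works in radians):

```python
import math

angle_in_degrees = 90.0

# Treating 90 as radians gives the sine of 90 radians, not of a right angle.
sine_of_raw_value = math.sin(angle_in_degrees)

# Converting to radians first gives the expected result for 90 degrees.
sine_of_converted = math.sin(math.radians(angle_in_degrees))
```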
MEPX uses a steady-state model with multiple subpopulations. Steady-state means that, inside one subpopulation, the worst individuals are replaced with newer ones (if the newer ones are better).
The user may specify the number of subpopulations. Each subpopulation runs independently of the others and, after each generation, they exchange a few individuals.
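The two ideas, steady-state replacement and exchange between subpopulations, can be sketched as below. This is not MEPX's implementation: the ring-shaped exchange in which the best individual of each subpopulation overwrites the worst of the next one is an assumption, and individuals are plain numbers whose error is their absolute value.

```python
def worst_index(subpop, fitness):
    """Index of the individual with the largest error."""
    return max(range(len(subpop)), key=lambda i: fitness(subpop[i]))

def steady_state_step(subpop, child, fitness):
    """Replace the worst individual only if the new one is better."""
    w = worst_index(subpop, fitness)
    if fitness(child) < fitness(subpop[w]):
        subpop[w] = child

def migrate(subpops, fitness):
    """Assumed scheme: the best of each subpopulation replaces the worst
    of the next subpopulation (ring topology)."""
    bests = [min(sp, key=fitness) for sp in subpops]
    for i, best in enumerate(bests):
        target = subpops[(i + 1) % len(subpops)]
        target[worst_index(target, fitness)] = best

subpop = [5.0, 3.0]
steady_state_step(subpop, 2.0, abs)   # 2.0 is better than 5.0, so it replaces it
steady_state_step(subpop, 8.0, abs)   # 8.0 is worse than everything, so it is dropped

subpops = [[5.0, 3.0], [9.0, 1.0]]
migrate(subpops, abs)
```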
Genetic operators (crossover and mutation) are classic ... nothing new here.
It is possible to specify how often the variables, operators and constants should appear in a chromosome. This is done probabilistically. If you want more operators to appear, please increase the operators probability. More operators means more complex expressions.
The sum of the operators probability, variables probability and constants probability must be 1.
In order to enable constants, one must define a probability greater than 0 for constants. You cannot edit that probability directly, but constants_probability + operators_probability + variables_probability = 1. So if you set the probabilities for operators and variables such that their sum is less than 1, you will get a value greater than 0 for constants.
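The relation between the three probabilities, and how they could drive the choice of what a gene contains, can be sketched like this. The function names are hypothetical and the sampling step is only an illustration of "done probabilistically".

```python
import random

def gene_kind_probabilities(operators_probability, variables_probability):
    """Derive the constants probability from the other two so the sum is 1."""
    total = operators_probability + variables_probability
    if total < 0.0 or total > 1.0 + 1e-9:
        raise ValueError("operators + variables probability must be in [0, 1]")
    return {
        "operator": operators_probability,
        "variable": variables_probability,
        "constant": max(0.0, 1.0 - total),
    }

def sample_gene_kind(probabilities, rng):
    """Pick what a gene holds according to the given probabilities."""
    kinds = list(probabilities)
    weights = [probabilities[k] for k in kinds]
    return rng.choices(kinds, weights=weights)[0]

probabilities = gene_kind_probabilities(0.5, 0.4)  # constants get about 0.1
kind = sample_gene_kind(probabilities, random.Random(1))
```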
Constants can be user defined or generated by the program (over a given interval). Generated constants can be kept fixed for the whole evolution or they can also evolve. Mutation of a constant is done by adding a random value from the interval [-max_delta, +max_delta].
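A minimal sketch of such a mutation, assuming the random value is drawn uniformly from the interval (the distribution is not stated in this manual):

```python
import random

def mutate_constant(value, max_delta, rng):
    """Add a random value from [-max_delta, +max_delta] to the constant.

    Assumption: the value is drawn uniformly.
    """
    return value + rng.uniform(-max_delta, max_delta)

rng = random.Random(42)
mutated = mutate_constant(1.0, 0.5, rng)  # stays within [0.5, 1.5]
```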
Usually multiple runs must be performed in order to compute statistics. It is also possible to specify the initial seed of the first run (each subsequent run starts from the previous seed + 1).
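The seed sequence and a typical summary over several runs can be illustrated as follows. The error values are made up for the example and the helper names are hypothetical.

```python
import statistics

def run_seeds(initial_seed, num_runs):
    """Seed of run k is initial_seed + k (previous seed + 1 each time)."""
    return [initial_seed + run for run in range(num_runs)]

def summarise(errors):
    """Best, mean and standard deviation of the final errors of all runs.
    Needs at least two runs for the standard deviation."""
    return {
        "best": min(errors),
        "mean": statistics.mean(errors),
        "stdev": statistics.stdev(errors),
    }

seeds = run_seeds(10, 3)
summary = summarise([1.0, 2.0, 3.0])  # made-up final errors of three runs
```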
Num threads - runs the subpopulations on multiple CPU cores. This can increase the speed of the analysis significantly. If you have a quad-core processor with hyper-threading, you may set the number of threads to 8. For best results, make sure that the number of subpopulations is a multiple of the number of threads.
The following results are displayed:
Reporting problems, bugs, comments
If you have problems with this program please save the project (by pressing the Save Project button from the main toolbar) and send it to email@example.com