MEPX user manual

Download MEPX software from here

Quick start

  • Select the Data panel.
  • Select the Training data panel.
  • Press the Load training data button and choose a csv or txt file. Values must be separated by a space, a tab or a semicolon (;).
  • Select the Parameters panel and modify the parameters if needed. For instance, you can change the code length, the number of subpopulations, the (sub)population size, the number of generations etc. Also specify the problem type (regression, binary classification or multi-class classification).
  • Press the Start button on the main toolbar.
  • Read the results from the Results panel.
  • You can also save the entire project (data, parameters, results) by pressing the Save project button on the main toolbar.

Data

Data are loaded from csv or txt files. Values must be separated by a space, a tab or a semicolon (;).

The last value on each line is the target (expected output). Test data may omit the output, in which case it has one column fewer than the training data.
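The file layout described above can be illustrated with a short Python sketch. This is not MEPX code: the function name parse_rows and the has_target flag are hypothetical, and treating a run of consecutive separators as a single separator is an assumption.

```python
import re

def parse_rows(text, has_target=True):
    """Split each non-empty line on spaces, tabs or semicolons.

    Returns (inputs, targets). When has_target is False (for example,
    test data without an output column) targets stays empty.
    """
    inputs, targets = [], []
    for line in text.splitlines():
        # Assumption: consecutive separators count as one.
        fields = [f for f in re.split(r"[ \t;]+", line.strip()) if f]
        if not fields:
            continue  # skip blank lines
        values = [float(f) for f in fields]
        if has_target:
            inputs.append(values[:-1])   # all columns except the last
            targets.append(values[-1])   # last column is the target
        else:
            inputs.append(values)
    return inputs, targets

# Three rows, each using a different separator.
training_text = "1 2 3\n4;5;9\n0.5\t1.5\t2.0\n"
inputs, targets = parse_rows(training_text)
```

Here inputs holds the first two columns of every row and targets holds the last one.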

Currently a problem can have only 1 output. Files containing multiple outputs must be split into one file per output (for instance, the Building problem from PROBEN1 has 3 outputs: energy, hot water and cold water).

For classification problems, the last column may contain only the values 0 or 1 (for binary classification) or the values 0, 1, ..., (num_classes - 1) (for multi-class classification).

Training data is compulsory. The others (validation and test) are optional.

You can also load alphanumerical values and then convert them to numerical values. There are specialised buttons for that:

  • Replace values - replaces some values with others (for instance, alphanumerical values with numerical ones). Find and replace works with regular expressions too.
  • To numeric - automatically converts alphanumerical values to integer values. The first alphanumerical value is converted to 0, the second distinct one to 1, and so on.
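The To numeric conversion can be pictured with the following Python sketch. It only illustrates the rule stated above (first distinct value becomes 0, the next distinct value becomes 1, and so on); the function name to_numeric is hypothetical and this is not the MEPX implementation.

```python
def to_numeric(values):
    """Map each distinct value to an integer in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused integer
        encoded.append(codes[value])
    return encoded, codes

column = ["red", "green", "red", "blue", "green"]
encoded, mapping = to_numeric(column)
```

In this example "red" is mapped to 0, "green" to 1 and "blue" to 2.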

The user can also scale numerical values to a given interval.
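The manual does not say which scaling formula is used. A common choice is linear (min-max) scaling, sketched below in Python under that assumption; the function name scale_to_interval is hypothetical.

```python
def scale_to_interval(values, new_min, new_max):
    """Linearly map values so that min(values) -> new_min and max(values) -> new_max.

    Assumption: min-max scaling. A constant column is mapped to new_min.
    """
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        return [float(new_min) for _ in values]
    factor = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * factor for v in values]

scaled = scale_to_interval([0, 5, 10], 0.0, 1.0)
```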

Parameters

Fitness function

Fitness (or the error) is computed as follows:

For symbolic regression problems, the fitness is either the Mean Absolute Error (the sum of absolute errors divided by the number of examples) or the Mean Squared Error (the sum of squared errors divided by the number of examples).

For classification problems, the fitness is the Mean Classification Error, which is the number of incorrectly classified examples divided by the number of examples and multiplied by 100 (that is, the percentage of incorrectly classified data).
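The three error measures can be written down directly from their definitions. The Python sketch below is an illustration only; the function names are hypothetical and are not part of MEPX.

```python
def mean_absolute_error(targets, outputs):
    """Sum of absolute errors divided by the number of examples."""
    return sum(abs(t - o) for t, o in zip(targets, outputs)) / len(targets)

def mean_squared_error(targets, outputs):
    """Sum of squared errors divided by the number of examples."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def mean_classification_error(targets, predicted):
    """Percentage of incorrectly classified examples."""
    wrong = sum(1 for t, p in zip(targets, predicted) if t != p)
    return 100.0 * wrong / len(targets)

mae = mean_absolute_error([1, 2, 3], [1, 3, 5])              # (0 + 1 + 2) / 3
mse = mean_squared_error([1, 2, 3], [1, 3, 5])               # (0 + 1 + 4) / 3
mce = mean_classification_error([0, 1, 1, 0], [0, 1, 0, 0])  # 1 wrong out of 4
```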

Problem type

Can be:

  • regression,
  • binary classification (with 2 classes),
  • multi-class classification (with 2 or more classes).

A problem with 2 classes can be solved by selecting either binary classification or multi-class classification.

Binary classification means that there is a threshold involved. Values less than or equal to the threshold are classified as belonging to class 0 and the others are classified as belonging to class 1.

For multi-class classification, the outputs are assigned to groups of genes, and the gene whose expression has the first maximal value provides the class for that data item (see more details here: google groups post).
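The "first maximal value" rule can be sketched in Python as follows. How genes are actually assigned to classes is not described in this manual, so the gene_to_class list below is a hypothetical layout used only for illustration, as is the function name predict_class.

```python
def predict_class(gene_values, gene_to_class):
    """Return the class of the first gene that holds the maximal value."""
    best_index = 0
    for i in range(1, len(gene_values)):
        # Strict comparison: on ties, the earlier gene wins.
        if gene_values[i] > gene_values[best_index]:
            best_index = i
    return gene_to_class[best_index]

# Hypothetical chromosome with 6 genes and 3 classes (2 genes per class).
gene_to_class = [0, 0, 1, 1, 2, 2]
gene_values = [0.2, 1.5, 1.5, -0.3, 0.9, 1.1]
predicted = predict_class(gene_values, gene_to_class)
```

Genes 1 and 2 share the maximal value 1.5; gene 1 comes first, so its class is returned.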

In the case of binary classification, the threshold is computed automatically (because of that, binary classification is slower than multi-class classification).
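The manual does not describe how the threshold is found. One simple possibility, shown here purely as an assumption, is to try every distinct output value as a candidate threshold and keep the one that misclassifies the fewest training examples. All function names are hypothetical.

```python
def classify_binary(output, threshold):
    """Class 0 if output <= threshold, otherwise class 1."""
    return 0 if output <= threshold else 1

def count_errors(outputs, targets, threshold):
    """Number of examples misclassified with the given threshold."""
    return sum(1 for o, t in zip(outputs, targets)
               if classify_binary(o, threshold) != t)

def best_threshold(outputs, targets):
    """Brute-force search over the distinct output values (an assumption)."""
    best = None
    best_errors = len(outputs) + 1
    for candidate in sorted(set(outputs)):
        errors = count_errors(outputs, targets, candidate)
        if errors < best_errors:
            best, best_errors = candidate, errors
    return best, best_errors

outputs = [0.1, 0.2, 0.7, 0.9]   # values computed by an evolved expression
targets = [0, 0, 1, 1]           # expected classes
threshold, errors = best_threshold(outputs, targets)
```

A search of this kind has to be repeated for every evaluated expression, which is consistent with binary classification being the slower option.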

If Use validation set is checked, then at each generation the best individual is run against the validation set. The individual with the best result on the validation set is the output of the program and is the one applied to the test data.
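The selection rule can be sketched in Python as below. The sketch assumes that the best individual of each generation is already available and that lower error is better; the function name and the sample data are hypothetical.

```python
def select_by_validation(best_of_each_generation, validation_error):
    """Return the per-generation best individual with the lowest validation error."""
    chosen = None
    chosen_error = float("inf")
    for individual in best_of_each_generation:
        error = validation_error(individual)
        if error < chosen_error:
            chosen, chosen_error = individual, error
    return chosen, chosen_error

# Hypothetical validation errors for the best individual of three generations.
validation_errors = {"gen0_best": 12.0, "gen1_best": 7.5, "gen2_best": 9.0}
chosen, chosen_error = select_by_validation(list(validation_errors),
                                            validation_errors.get)
```

Note that the chosen individual need not come from the last generation.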

It is possible to run the optimization on a smaller set of training data. In that case, set the Random subset size to a value smaller than the size of the training set.

Operators (or functions)

Classic operators +, -, *, ... nothing new here.

Note that trigonometric operators work with radians.
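If your data stores angles in degrees, convert them to radians before loading them. The Python sketch below only illustrates the difference; Python's math.sin also expects radians.

```python
import math

angle_degrees = 90.0
angle_radians = math.radians(angle_degrees)  # 90 degrees = pi / 2 radians

sin_of_radians = math.sin(angle_radians)     # sine of a right angle
sin_of_raw_value = math.sin(angle_degrees)   # 90 interpreted as radians
```

The two results differ, which is why the unit of the input columns matters.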

The algorithm

MEPX uses a steady-state model with multiple subpopulations. Steady-state means that, inside one subpopulation, the worst individuals are replaced with newer ones (if the newer ones are better).
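A single steady-state replacement step can be sketched in Python as follows. This is a simplified illustration, not MEPX code: individuals are plain numbers, the error function is hypothetical, and lower error is assumed to be better.

```python
def steady_state_insert(population, error, offspring):
    """Replace the worst individual with the offspring if the offspring is better.

    Returns True when a replacement took place.
    """
    worst_index = max(range(len(population)),
                      key=lambda i: error(population[i]))
    if error(offspring) < error(population[worst_index]):
        population[worst_index] = offspring
        return True
    return False

population = [3, -7, 1]                                   # toy individuals
replaced = steady_state_insert(population, abs, 2)        # 2 beats -7
not_replaced = steady_state_insert(population, abs, 10)   # 10 is worse than all
```

The population size never changes; only the worst member is overwritten.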

The user may specify the number of subpopulations. Each subpopulation runs independently of the others and, after each generation, they exchange a few individuals.

Genetic operators (crossover and mutation) are classic ... nothing new here.

It is possible to specify how often the variables, operators and constants should appear in a chromosome. This is done probabilistically. If you want more operators to appear, please increase the operators probability. More operators means more complex expressions.

The sum of the operators probability, the variables probability and the constants probability must be 1.

Constants

To enable constants, the constants probability must be greater than 0. You cannot edit that probability directly, but constants_probability + operators_probability + variables_probability = 1. So if you set the operators and variables probabilities such that their sum is less than 1, the constants probability will be greater than 0.
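Because the three probabilities sum to 1, the constants probability is whatever remains after the other two are set. A small Python sketch of that arithmetic follows; the function name and the validation it performs are illustrative, not taken from MEPX.

```python
def constants_probability(operators_probability, variables_probability):
    """Return 1 - operators_probability - variables_probability."""
    total = operators_probability + variables_probability
    if operators_probability < 0 or variables_probability < 0 or total > 1:
        raise ValueError("probabilities must be >= 0 and sum to at most 1")
    return 1.0 - total

with_constants = constants_probability(0.5, 0.4)     # leaves about 0.1 for constants
without_constants = constants_probability(0.5, 0.5)  # leaves 0: constants disabled
```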

Constants can be user defined or generated by the program (over a given interval). Generated constants can be kept fixed for the whole evolution or they can evolve too. A constant is mutated by adding a random value from the interval [-max_delta, +max_delta].
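The mutation of a constant can be sketched in Python as below. The manual does not state the distribution of the random value; a uniform distribution is assumed here, and the function name mutate_constant is hypothetical.

```python
import random

def mutate_constant(value, max_delta, rng):
    """Add a random offset drawn from [-max_delta, +max_delta].

    Assumption: the offset is uniformly distributed.
    """
    return value + rng.uniform(-max_delta, max_delta)

rng = random.Random(0)  # fixed seed so the example is repeatable
mutated = [mutate_constant(1.0, 0.5, rng) for _ in range(100)]
```

Every mutated value stays within max_delta of the original constant.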

Runs

Multiple runs are usually needed in order to compute statistics. It is also possible to specify the initial seed of the first run (each subsequent run starts from the previous seed + 1).
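The seed rule means that run number i (counting from 0) uses initial seed + i. A minimal Python sketch, with a hypothetical function name:

```python
def run_seeds(initial_seed, num_runs):
    """Seeds used by consecutive runs: each run uses the previous seed + 1."""
    return [initial_seed + i for i in range(num_runs)]

seeds = run_seeds(10, 3)
```

Knowing the initial seed is therefore enough to identify the seed of any individual run.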

Num threads - runs the subpopulations on multiple CPU cores. This can increase the speed of the analysis significantly. If you have a quad-core processor with hyper-threading, you may set the number of threads to 8. For best results, make sure that the number of subpopulations is a multiple of the number of threads.

Results

The following results are displayed:

  • the error for the entire training, validation and test sets.
  • the obtained value for each data item in the training, validation and test sets (also called Model or Output).
  • the evolution of fitness for the best individual in the population and for the population average.
  • the C source code of the best solution. This code can be simplified in order to show only the instructions that generate the output (remember that not all genes of a chromosome participate in the solution - these genes are called introns). Note that there is no simplification in the case of multi-class classification.
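The idea behind simplification can be sketched in Python: starting from the gene that provides the output, follow the references to other genes and keep only the genes that are reached. The chromosome representation below (strings for variables, tuples of an operator and argument indices for operator genes) is a hypothetical one chosen for illustration, not the MEPX data structure.

```python
def used_genes(chromosome, output_index):
    """Return the sorted indices of genes that contribute to the output gene."""
    used = set()
    stack = [output_index]
    while stack:
        index = stack.pop()
        if index in used:
            continue
        used.add(index)
        gene = chromosome[index]
        if isinstance(gene, tuple):   # operator gene: (operator, arg, arg)
            stack.extend(gene[1:])    # visit the genes it refers to
    return sorted(used)

chromosome = [
    "x0",          # gene 0: variable
    "x1",          # gene 1: variable
    ("+", 0, 1),   # gene 2: x0 + x1
    ("*", 1, 1),   # gene 3: x1 * x1, not referenced by the output
    ("-", 2, 0),   # gene 4: (x0 + x1) - x0, the output gene
]
active = used_genes(chromosome, 4)
introns = [i for i in range(len(chromosome)) if i not in active]
```

Only the active genes would need to appear in the simplified code; gene 3 is an intron in this example.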

Reporting problems, bugs, comments

If you have problems with this program please save the project (by pressing the Save Project button from the main toolbar) and send it to mihai.oltean@gmail.com