CIFAR-10 Genetic Algorithm Hyperparameter Selection using TensorFlow

In machine learning, one of the key challenges is selecting the correct hyperparameter values for your problem. This includes things like model complexity, learning rate, batch size, etc. This has largely been done using intuition, which is decidedly not scientific.  At GTC 2017, there were two talks on using genetic algorithms to select the best hyperparameter values for neural networks:

S7230 – Joseph Schneible (Technica Corporation) Using Genetic Algorithms to Optimize Recurrent Neural Networks.

S7435 – Steven Young, (Oak Ridge National Laboratory) Adapting DL to New Data: An Evolutionary Algorithm for Optimizing Deep Networks.

I decided to try this out using one of the common image classification benchmarks, CIFAR-10. This is a standard classification problem, the goal being to place each image into one of 10 classes. I started with the tutorial implementation available in the TensorFlow repository.

I wrote a small library to do genetic optimization internally at Acceleware.  It supports the following types of traits:

  • Boolean
  • Integers within a specified range
  • Uniformly distributed floating point values within a specified range
  • Geometrically distributed floating point values
  • Geometrically distributed integers
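The library itself is internal to Acceleware, so as a rough sketch of what these trait types might look like (the function names and the log-uniform interpretation of "geometrically distributed" are my assumptions, not the actual API):

```python
import math
import random

def sample_bool(rng):
    # Boolean trait: fair coin flip
    return rng.random() < 0.5

def sample_int(rng, lo, hi):
    # Integer trait within [lo, hi], inclusive
    return rng.randint(lo, hi)

def sample_uniform_float(rng, lo, hi):
    # Uniformly distributed float within [lo, hi]
    return rng.uniform(lo, hi)

def sample_geometric_float(rng, lo, hi):
    # Geometrically distributed float: uniform in log-space,
    # so each multiplicative step is equally likely
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def sample_geometric_int(rng, lo, hi):
    # Geometrically distributed integer: round the log-uniform draw
    return int(round(sample_geometric_float(rng, lo, hi)))

rng = random.Random(42)
print(sample_geometric_float(rng, 1e-4, 1e-1))  # e.g. a learning rate
print(sample_geometric_int(rng, 32, 102))       # e.g. a filter count
```

The geometric traits matter for quantities like layer widths and learning rates, where going from 32 to 40 is as significant a change as going from 64 to 81.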

The next task was to determine the ranges of values for each of the hyperparameters. The network as implemented in the sample code (cifar10.py inference()) looks like this:

[Figure: network architecture from the sample code (cifar10.py inference())]

I decided to vary the following parameters:

Parameter    Description        Values
conv1_size   Size of filters    3, 4, 5, 6
conv1_count  Number of filters  32, 40, 51, 64, 81, 102
conv2_size   Size of filters    3, 4, 5, 6
conv2_count  Number of filters  32, 40, 51, 64, 81, 102
pool_size    Pooling size       3, 4
pool_stride  Pooling stride     1, 2, 3, 4
local3_size  Number of neurons  256, 287, 323, 362, 406, 456
local4_size  Number of neurons  128, 147, 169, 194, 223
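The candidate values above are not arbitrary: the filter and neuron counts appear to be geometrically spaced (my observation, not stated in the original), which is exactly what the geometrically distributed integer traits produce. A quick check of the spacing:

```python
# Filter counts from the table: six geometric steps at ratio 2**(1/3)
ratio = 2 ** (1 / 3)
counts = [round(32 * ratio ** k) for k in range(6)]
print(counts)   # [32, 40, 51, 64, 81, 102]

# local3 sizes use a finer ratio of 2**(1/6), starting from 256
neurons = [round(256 * 2 ** (k / 6)) for k in range(6)]
print(neurons)  # [256, 287, 323, 362, 406, 456]
```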

 

In total there are 138240 combinations for all of the parameters in the table above.  At an average of one hour per combination, on 4 GPUs, it would take almost 4 years to run all possible combinations!  We can do better, using genetic algorithms.
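The combination count follows directly from the table: 4 × 6 × 4 × 6 × 2 × 4 × 6 × 5 = 138,240. A quick sanity check of the count and the runtime estimate:

```python
# Number of candidate values for each hyperparameter, in table order:
# conv1_size, conv1_count, conv2_size, conv2_count,
# pool_size, pool_stride, local3_size, local4_size
cardinalities = [4, 6, 4, 6, 2, 4, 6, 5]
total = 1
for n in cardinalities:
    total *= n
print(total)  # 138240 combinations

# At one GPU-hour per combination, spread across 4 GPUs:
hours = total / 4
years = hours / (24 * 365)
print(round(years, 2))  # just under 4 years
```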

I modified inference() to take these parameters in a config dictionary:

  with tf.variable_scope('conv1') as scope:
    kernel = _variable_with_weight_decay('weights',
                                         shape=[config['conv1_size'],
                                                config['conv1_size'],
                                                3,
                                                config['conv1_count']],
                                         stddev=5e-2,
                                         wd=0.0)
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases',
                              [config['conv1_count']],
                              tf.constant_initializer(0.0))
    pre_activation = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(pre_activation, name=scope.name)
    _activation_summary(conv1)

 

I then created a population as follows.  The first generation consisted of 20 random entities.  The parents of each subsequent generation consisted of:

  • The top 10 members of the previous generation
  • The top 10 entities of all time
  • One new random entity

Fitness was determined using the classification accuracy for the validation set.  The parents were randomly paired off to breed with each other until 20 offspring were created for the next generation.  When breeding, each trait had a 5% chance of mutating.
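The selection and breeding scheme above can be sketched roughly as follows. This is a simplified model, not the internal library: the per-trait crossover rule and the hall-of-fame bookkeeping are my assumptions, and the expensive fitness evaluation (training the network and measuring validation accuracy) is abstracted into a `fitness` callable.

```python
import random

# Trait value sets from the hyperparameter table
TRAITS = {
    'conv1_size':  [3, 4, 5, 6],
    'conv1_count': [32, 40, 51, 64, 81, 102],
    'conv2_size':  [3, 4, 5, 6],
    'conv2_count': [32, 40, 51, 64, 81, 102],
    'pool_size':   [3, 4],
    'pool_stride': [1, 2, 3, 4],
    'local3_size': [256, 287, 323, 362, 406, 456],
    'local4_size': [128, 147, 169, 194, 223],
}

def random_entity(rng):
    return {name: rng.choice(vals) for name, vals in TRAITS.items()}

def breed(a, b, rng, mutation_rate=0.05):
    # Each trait comes from one parent at random;
    # with 5% probability it mutates to a fresh random value instead.
    child = {}
    for name, vals in TRAITS.items():
        if rng.random() < mutation_rate:
            child[name] = rng.choice(vals)                # mutation
        else:
            child[name] = rng.choice((a[name], b[name]))  # crossover
    return child

def next_generation(prev_gen, hall_of_fame, fitness, rng, size=20):
    # Parents: top 10 of the previous generation, top 10 of all time,
    # plus one new random entity.
    top_prev = sorted(prev_gen, key=fitness, reverse=True)[:10]
    parents = top_prev + hall_of_fame[:10] + [random_entity(rng)]
    children = []
    while len(children) < size:
        a, b = rng.sample(parents, 2)  # random pairing
        children.append(breed(a, b, rng))
    return children

rng = random.Random(0)
population = [random_entity(rng) for _ in range(20)]
print(population[0])
```

In the real run, `fitness` trains a network with the entity's config and returns its validation accuracy, which is what makes each generation cost hours of GPU time rather than microseconds.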

Using two NVIDIA Tesla P100 and two NVIDIA Tesla K20c GPUs, I ran this algorithm through 8 generations, and the top entities were as follows, with the original sample code values for comparison:

 

conv1_size  conv1_count  conv2_size  conv2_count  pool_size  pool_stride  local3_size  local4_size  accuracy
5           64           5           102          3          1            406          169          88.01%
5           102          6           102          3          1            256          128          87.96%
5           64           5           102          3          1            456          169          87.76%
5           64           6           102          3          1            406          169          87.71%
3           64           5           102          3          1            456          169          87.53%

Sample Code - Baseline:
5           64           5           64           3          2            384          192          86.31%

 

As you can see, we have achieved an improvement over the baseline with minimal effort.  Notably, the top-performing entities all settled on topologies that differ from the baseline, most visibly a wider second convolution layer (102 filters rather than 64) and a pooling stride of 1 rather than 2.

 

What can we do to further improve?  We could modify the algorithm beyond simple hyperparameter selection and allow it to actually choose different network topologies.  We could also try using Bayesian optimization as an alternative to genetic algorithms, although that can be harder to parallelize.  We could also try penalizing the fitness function with the size of the network in order to build a model which is not only accurate but fast and memory-efficient as well.
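The size-penalty idea can be sketched in a few lines. This is a hypothetical penalized fitness, not something from the experiment above; the penalty weight `alpha` would need tuning for the accuracy/size trade-off you actually want.

```python
def penalized_fitness(accuracy, n_params, alpha=1e-8):
    # Hypothetical penalty: subtract a cost proportional to the
    # parameter count, so a larger network must beat a smaller one
    # by a clear accuracy margin to win.  alpha = 1e-8 charges
    # 0.01 accuracy per million parameters.
    return accuracy - alpha * n_params

# Example: a slightly less accurate network at half the size
big = penalized_fitness(0.8801, 2_000_000)
small = penalized_fitness(0.8776, 1_000_000)
print(small > big)  # True: with this alpha, the smaller network wins
```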