Random Search and Appropriate Search-Space Scaling

Grid search isn’t always the best approach for finding good hyperparameters.

In the example of Deep Learning and Adam Optimization, there are several different hyperparameters to consider. Some, like the alpha constant, need tuning. Others, like epsilon, are basically left at their default values and have a negligible effect on the model.

from IPython.display import Image

Image('images/feature_grid.PNG')


By grid searching over any feature space that includes epsilon, we’re putting five times the computational load on our system for a negligible performance gain at best.

Instead, Andrew Ng suggests doing a random search over the feature space to arrive at your coefficients. If we did half the number of searches in a random fashion (below), we’d still get the benefit of tuning alpha, without over-searching the epsilon space.

Image('images/feature_grid_random.png')

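To make the contrast concrete, here’s a minimal sketch (not from the course itself) comparing the two strategies on a hypothetical objective that depends on alpha but is flat in epsilon. The evaluate() function and the ranges are made up purely for illustration; the point is only that the grid keeps re-evaluating the same few alphas, while the random search tries a new alpha on every trial.

import numpy as np

# Hypothetical validation score: sensitive to alpha, essentially flat in epsilon
def evaluate(alpha, epsilon):
    return -(alpha - 0.01) ** 2

# Grid search: 5 alphas x 5 epsilons = 25 trials, but only 5 distinct alpha values
alpha_grid = np.linspace(0.001, 1, 5)
eps_grid = np.linspace(1e-8, 1e-4, 5)
grid_trials = [(a, e, evaluate(a, e)) for a in alpha_grid for e in eps_grid]

# Random search: roughly half as many trials, each one a brand-new alpha
random_alphas = np.random.uniform(0.001, 1, 12)
random_epsilons = np.random.uniform(1e-8, 1e-4, 12)
random_trials = [(a, e, evaluate(a, e)) for a, e in zip(random_alphas, random_epsilons)]

len(grid_trials), len(random_trials)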

But this presents a new, interesting problem.

Whereas something like Number of Layers or Number of Hidden Units might make sense to sample on a linear scale, not all coefficients behave this way.

For instance, adjusting our alpha constant between 0.05 and 0.06 likely has a bigger performance impact than adjusting it between 0.5 and 0.6, as its best value is often a small number.

And so by sampling randomly on a linear scale between 0.0001 and 1, we spend as much compute investigating the upper-range values as the small, lower-range ones where the real optimization occurs.

Image('images/uniform_alpha.PNG')

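To put a rough number on that waste, here’s a quick sketch (an addition for illustration, not from the original notes). With a uniform draw between 0.0001 and 1, about 90% of samples land in the top decade, between 0.1 and 1, leaving only about 10% for the entire region below 0.1 where the learning rate usually lives.

import numpy as np

samples = np.random.uniform(0.0001, 1, size=10_000)

# fraction of draws spent on the top decade (0.1 to 1);
# analytically this is (1 - 0.1) / (1 - 0.0001), roughly 0.90
(samples > 0.1).mean()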

Scaling

Thus, determining the correct scaling mechanism for your hyperparameters is crucial. You want to scale based on the class of the coefficient, which may include the following:

Linear

This one’s easy. We just want to do a uniform search between two values.

import numpy as np

min_val, max_val = 0, 1

# one draw from a uniform distribution over [min_val, max_val)
np.random.uniform(min_val, max_val)
0.9358677626967968

Discrete

For whole-numbered values between two values, we’ll use randint(). Note that the upper bound is exclusive, so randint(0, 10) returns integers from 0 through 9.

np.random.randint(0, 10)
2
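For example, the Number of Layers and Number of Hidden Units hyperparameters mentioned earlier fit naturally on this scale. The ranges below are hypothetical, chosen only for illustration.

# hypothetical ranges, purely for illustration
n_layers = np.random.randint(2, 6)       # 2, 3, 4, or 5 layers
n_hidden = np.random.randint(64, 513)    # 64 through 512 hidden units

n_layers, n_hidden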

Log Scale

For coefficients like alpha above, where we want to select between a very small value and 1, it’s helpful to write the endpoints out as exponents. For instance:

$0.0001 = 10^{-4}$

similarly

$1 = 10^{0}$

So this is actually just the same exercise as the Linear Scale, but between some negative number and 0, with the result then used as an exponent!

min_exp, max_exp = -4, 0

# sample the exponent uniformly, then map it back onto the original scale
val = np.random.uniform(min_exp, max_exp)
10 ** val
0.00305918793992655
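As a quick sanity check (again, an added sketch rather than part of the original notes), sampling the exponent uniformly between -4 and 0 means each decade of alpha receives roughly a quarter of the draws:

alphas = 10 ** np.random.uniform(-4, 0, size=10_000)

# roughly 25% of samples should land in each decade
for lo, hi in [(1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1), (1e-1, 1)]:
    print(lo, hi, ((alphas >= lo) & (alphas < hi)).mean())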

Exponentially Weighted

This would likely be better named “Reverse Log Scale,” describing hyperparameters, such as the beta used for exponentially weighted averages, where your search space is most effective between, say, 0.9 and 0.999, on a Log-ish scale.

Following the same approach as above, we just want to do a uniform search over the correct range of values, plus a couple of extra steps: in this case, covering 0.9 to 0.999 involves establishing a log scale for 0.001 to 0.1 and subtracting the result from 1.

min_exp, max_exp = -3, -1

# sample 1 - beta on a log scale, then convert back to beta
val = np.random.uniform(min_exp, max_exp)
1 - (10 ** val)
0.9908195535776579
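Putting the pieces together, one way to wrap these rules into a single random-search draw is a small helper like the one below. The hyperparameter names and ranges are illustrative assumptions, not values from the course.

def sample_hyperparameters():
    # draw one random-search configuration, scaling each hyperparameter appropriately
    return {
        'n_layers': np.random.randint(2, 6),          # discrete
        'n_hidden': np.random.randint(64, 513),       # discrete
        'alpha': 10 ** np.random.uniform(-4, 0),      # log scale
        'beta': 1 - 10 ** np.random.uniform(-3, -1),  # reverse log scale
    }

# e.g. draw ten random-search trials
trials = [sample_hyperparameters() for _ in range(10)]
trials[0]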