Momentum, RMSprop, and Adam Optimization for Gradient Descent

Say we’re trying to optimize over an oblong cost function like the one below.




Traditionally, we know there's a large emphasis on the learning rate, alpha, which dictates the step size of our gradient descent.

Too large, and we wind up overshooting paths that would allow us to converge sooner (purple). Too small, and it takes forever to run (blue).



However, if you look at these lines, they learn at a reasonable pace along the X axis while oscillating back and forth along the Y.


By introducing the concept of “Momentum” we can continue to enjoy the horizontal descent rate, while deliberately dampening the oscillation along the Y axis.


We define v_dW and v_db terms to be “velocity” terms with the same shape as their gradient counterparts. And depending on our values for beta and alpha, we can tune how much momentum we want subsequent gradients to gather.

On iteration t:
    Compute dW, db for current mini-batch
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - alpha * v_dW
    b = b - alpha * v_db

If this form looks similar to Exponentially Weighted Moving Averages, that’s only because it is.

Note that because we’re weighting successive gradient steps against one another, the X movement will reinforce, whereas the Y movement will have a cancelling-out effect.
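That cancelling-out effect can be seen in a minimal NumPy sketch of the update above (variable names mirror the pseudocode; the toy gradients are made up for illustration):

```python
import numpy as np

def momentum_update(W, dW, v_dW, alpha, beta):
    """One momentum step: fold the new gradient into the running
    velocity, then move the parameters along the velocity."""
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - alpha * v_dW
    return W, v_dW

# Toy demo: the X component of the gradient points the same way every
# step, so its velocity reinforces; the Y component flips sign each
# step, so its contributions to the velocity largely cancel out.
W = np.array([5.0, 5.0])
v = np.zeros(2)
for t in range(10):
    dW = np.array([1.0, (-1.0) ** t])
    W, v = momentum_update(W, dW, v, alpha=0.1, beta=0.9)
```

After ten steps, W has moved substantially along X but barely along Y, even though both gradient components have magnitude 1 at every step.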


The Root Mean Squared propagation algorithm looks extremely similar to the momentum approach we outlined above.

On iteration t:
    Compute dW, db for current mini-batch
    s_dW = beta * s_dW + (1 - beta) * (dW ** 2)
    s_db = beta * s_db + (1 - beta) * (db ** 2)
    W = W - alpha * (dW / (sqrt(s_dW) + epsilon))
    b = b - alpha * (db / (sqrt(s_db) + epsilon))

The key difference here is that we element-wise square the last term in our calculations of s_dW and s_db. This matters a ton because on the update step we divide by the square roots of the s coefficients (adding a negligible epsilon so we don't run into divide-by-zero errors).

So intuitively:

  • If dW is large
    • We square it, giving us a larger number in the denominator
    • This makes the effective step size smaller, so we make a smaller update
    • This is what we want to happen on our Y axis
  • If db is small
    • Squaring it makes the number even smaller
    • Dividing by the square root of that smaller number makes for a much larger update coefficient, a big step
    • This is what we want to happen on the X axis

Whereas the momentum approach ramps up after the model gains, well, momentum, RMSprop allows us to begin converging more quickly (green line) by correcting for a large gradient value right out of the gate.
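That out-of-the-gate correction is easy to see in a minimal NumPy sketch of the update (names mirror the pseudocode; the toy gradient values are made up for illustration):

```python
import numpy as np

def rmsprop_update(W, dW, s_dW, alpha, beta=0.9, epsilon=1e-8):
    """One RMSprop step: track an exponentially weighted average of the
    squared gradient, then shrink the step wherever that average is large."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2
    W = W - alpha * dW / (np.sqrt(s_dW) + epsilon)
    return W, s_dW

# Toy demo: a huge gradient component and a tiny one end up taking
# nearly identical step sizes, because each is divided by the square
# root of its own running average.
W = np.array([5.0, 5.0])
s = np.zeros(2)
dW = np.array([100.0, 0.01])
W, s = rmsprop_update(W, dW, s, alpha=0.1)
```

Even on the very first iteration, the 100.0 gradient and the 0.01 gradient produce steps of nearly the same size, which is exactly the dampening behavior described above.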



Adam Optimization

The current de facto optimization algorithm, Adam (Adaptive Moment Estimation), combines both Momentum and RMSprop into a mouthful of an update step, borrowing the best features of each to give you smoother cost functions as well as higher accuracy.



  • We’ve got two separate beta coefficients, one for each optimization part.
  • We implement bias correction for each gradient (see the bottom of Exponentially Weighted Moving Averages for an explanation)
On iteration t:
    Compute dW, db for current mini-batch
    # Momentum
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    v_db = beta1 * v_db + (1 - beta1) * db
    v_dW_corrected = v_dW / (1 - beta1 ** t)
    v_db_corrected = v_db / (1 - beta1 ** t)
    # RMSprop
    s_dW = beta2 * s_dW + (1 - beta2) * (dW ** 2)
    s_db = beta2 * s_db + (1 - beta2) * (db ** 2)
    s_dW_corrected = s_dW / (1 - beta2 ** t)
    s_db_corrected = s_db / (1 - beta2 ** t)
    # Combine
    W = W - alpha * (v_dW_corrected / (sqrt(s_dW_corrected) + epsilon))
    b = b - alpha * (v_db_corrected / (sqrt(s_db_corrected) + epsilon))


alpha: the learning rate. Needs tuning.

beta1: momentum weight. Default to 0.9.

beta2: RMSprop weight. Default to 0.999.

epsilon: Divide by Zero failsafe. Default to 10 ** -8.
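Putting it all together, here is a minimal NumPy sketch of one Adam step for a single parameter (names and defaults mirror the pseudocode and hyperparameters above; the toy gradient is made up for illustration):

```python
import numpy as np

def adam_update(W, dW, v_dW, s_dW, t,
                alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step on parameter W; t is the 1-indexed iteration count."""
    # Momentum half: running average of the gradient
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    # RMSprop half: running average of the squared gradient
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2
    # Bias correction: early averages are skewed toward their zero init
    v_corrected = v_dW / (1 - beta1 ** t)
    s_corrected = s_dW / (1 - beta2 ** t)
    # Combine: momentum numerator, RMSprop denominator
    W = W - alpha * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return W, v_dW, s_dW

# Toy demo: on the very first step, bias correction makes
# v_corrected = dW and s_corrected = dW ** 2, so the step size is
# roughly alpha no matter how large the raw gradient is.
W = np.array([5.0])
v, s = np.zeros(1), np.zeros(1)
W, v, s = adam_update(W, np.array([100.0]), v, s, t=1, alpha=0.001)
```

Note how the bias-corrected first step is about alpha in magnitude even though the raw gradient is 100, combining RMSprop's out-of-the-gate scale correction with momentum's smoothing on later steps.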