# Momentum, RMSprop, and Adam Optimization for Gradient Descent

Say we’re trying to optimize over an oblong cost function like the one below.

```
from IPython.display import Image
Image('images/momentum_1.png')
```

Traditionally, we know that there’s a large emphasis on the learning rate, `alpha`, that dictates the step size of our gradient descent.

Too large, and we wind up overshooting paths that would allow us to converge sooner (purple). Too small, and it takes forever to run (blue).

`Image('images/momentum_2.png')`

However, if you look at these lines, they learn at a reasonable pace in the X plane while oscillating back and forth in the Y.

## Momentum

By introducing the concept of “Momentum,” we can continue to enjoy the horizontal descent rate while deliberately dampening the variance in the Y.

### Pseudocode

We define `v_dW` and `v_db` terms to be “velocity” terms with the same shape as their gradient counterparts. And depending on our values for `beta` and `alpha`, we can tune how much momentum we want subsequent gradients to gather.

```
On iteration t:
Compute dW, db for current mini-batch
v_dW = beta * v_dW + (1 - beta) * dW
v_db = beta * v_db + (1 - beta) * db
W = W - alpha * v_dW
b = b - alpha * v_db
```
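
As a rough Python sketch of the same update (a minimal, hand-rolled version; `momentum_step` and its parameter names are stand-ins, not any library’s API):

```
def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum.

    v_dW and v_db are the running velocity terms, initialized to
    zeros with the same shapes as dW and db.
    """
    # Exponentially weighted average of the incoming gradients
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Step along the smoothed direction instead of the raw gradient
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db
```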

If this form looks similar to Exponentially Weighted Moving Averages, that’s only because it is.

Note that because we’re weighting successive gradient steps against one another, the `X` movement will reinforce, whereas the `Y` movement will have a cancelling-out effect.
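
Here’s a quick, made-up numeric illustration: a steady gradient of `1.0` in `X` holds most of its magnitude under the running average, while an alternating `±1.0` gradient in `Y` collapses toward zero.

```
beta = 0.9
grads_x = [1.0, 1.0, 1.0, 1.0, 1.0]    # consistent direction: reinforces
grads_y = [1.0, -1.0, 1.0, -1.0, 1.0]  # oscillating direction: cancels

v_x = v_y = 0.0
for g_x, g_y in zip(grads_x, grads_y):
    v_x = beta * v_x + (1 - beta) * g_x
    v_y = beta * v_y + (1 - beta) * g_y

print(v_x)  # ~0.41: the X velocity keeps building
print(v_y)  # ~0.08: the Y oscillations mostly cancel out
```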

## RMSprop

The Root Mean Square propagation algorithm looks extremely similar to the momentum approach we outlined above.

```
On iteration t:
Compute dW, db for current mini-batch
s_dW = beta * s_dW + (1 - beta) * (dW ** 2)
s_db = beta * s_db + (1 - beta) * (db ** 2)
W = W - alpha * (dW / (sqrt(s_dW) + epsilon))
b = b - alpha * (db / (sqrt(s_db) + epsilon))
```
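
As before, a minimal Python sketch of that update (stand-in names, not a library API):

```
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db,
                 alpha=0.01, beta=0.9, epsilon=1e-8):
    """One RMSprop step. s_dW and s_db are running averages of the
    squared gradients, initialized to zeros."""
    # Weighted average of the element-wise *squared* gradients
    s_dW = beta * s_dW + (1 - beta) * dW ** 2
    s_db = beta * s_db + (1 - beta) * db ** 2
    # Divide each step by the recent gradient magnitude
    W = W - alpha * dW / (np.sqrt(s_dW) + epsilon)
    b = b - alpha * db / (np.sqrt(s_db) + epsilon)
    return W, b, s_dW, s_db
```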

The key difference here is that we element-wise square the last term in our calculations of `s_dW` and `s_db`. This matters a ton because on the update step, we’re *dividing by* the `s` terms (and adding a negligible `epsilon` constant so we don’t run into divide-by-zero errors).

So intuitively:

- If `dW` is large:
    - We square it, giving us a larger number in the denominator
    - This makes the whole learning rate coefficient *smaller*, so we make a smaller update
    - This is what we want to happen on our `Y` axis
- If `db` is small:
    - Squaring it makes the number even smaller
    - Dividing by the square root of a smaller number makes for a much larger update coefficient, a big step
    - This is what we want to happen on the `X` axis
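
To put rough numbers on that intuition, here’s an invented example: once the running averages have settled near the squared gradients, RMSprop damps the large-gradient step and boosts the small-gradient step toward the same effective size.

```
import numpy as np

alpha, epsilon = 0.01, 1e-8
large_grad, small_grad = 10.0, 0.1

# Plain gradient descent: the step scales directly with the gradient
print(alpha * large_grad)  # 0.1   (big, oscillation-prone step)
print(alpha * small_grad)  # 0.001 (tiny, slow step)

# RMSprop, assuming s has settled near the squared gradient
s_large, s_small = large_grad ** 2, small_grad ** 2
print(alpha * large_grad / (np.sqrt(s_large) + epsilon))  # ~0.01: damped
print(alpha * small_grad / (np.sqrt(s_small) + epsilon))  # ~0.01: boosted
```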

Whereas the momentum approach ramps up only after the model gains, well, momentum, RMSprop allows us to begin converging more quickly (green line) by correcting for a large gradient value right out of the gate.

`Image('images/momentum_3.png')`

## Adam Optimization

The current de facto optimization algorithm, Adam (Adaptive Moment Estimation), combines both Momentum and RMSprop into a mouthful of an update step, borrowing the best features of both to give you smoother cost functions as well as higher accuracy.

### Pseudocode

Note:

- We’ve got two separate `beta` coefficients, one for each optimization part.
- We implement bias correction for each gradient (see the bottom of Exponentially Weighted Moving Averages for an explanation).

```
On iteration t:
Compute dW, db for current mini-batch
# Momentum
v_dW = beta1 * v_dW + (1 - beta1) * dW
v_db = beta1 * v_db + (1 - beta1) * db
v_dW_corrected = v_dW / (1 - beta1 ** t)
v_db_corrected = v_db / (1 - beta1 ** t)
# RMSprop
s_dW = beta2 * s_dW + (1 - beta2) * (dW ** 2)
s_db = beta2 * s_db + (1 - beta2) * (db ** 2)
s_dW_corrected = s_dW / (1 - beta2 ** t)
s_db_corrected = s_db / (1 - beta2 ** t)
# Combine
W = W - alpha * (v_dW_corrected / (sqrt(s_dW_corrected) + epsilon))
b = b - alpha * (v_db_corrected / (sqrt(s_db_corrected) + epsilon))
```
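
And a minimal Python sketch of the full update (stand-in names again; `v` and `s` are assumed to start as dicts of zeros, and `t` is the 1-indexed iteration count):

```
import numpy as np

def adam_step(W, b, dW, db, v, s, t,
              alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step. v holds the momentum terms, s the RMSprop terms,
    both dicts of zeros at the start. t is the 1-indexed iteration."""
    # Momentum: weighted average of the gradients
    v['dW'] = beta1 * v['dW'] + (1 - beta1) * dW
    v['db'] = beta1 * v['db'] + (1 - beta1) * db
    # RMSprop: weighted average of the squared gradients
    s['dW'] = beta2 * s['dW'] + (1 - beta2) * dW ** 2
    s['db'] = beta2 * s['db'] + (1 - beta2) * db ** 2
    # Bias correction, which matters most while t is still small
    v_dW_c = v['dW'] / (1 - beta1 ** t)
    v_db_c = v['db'] / (1 - beta1 ** t)
    s_dW_c = s['dW'] / (1 - beta2 ** t)
    s_db_c = s['db'] / (1 - beta2 ** t)
    # Combined update: momentum direction, RMSprop scaling
    W = W - alpha * v_dW_c / (np.sqrt(s_dW_c) + epsilon)
    b = b - alpha * v_db_c / (np.sqrt(s_db_c) + epsilon)
    return W, b, v, s
```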

### Coefficients

- `alpha`: the learning rate. **Needs tuning.**
- `beta1`: momentum weight. Defaults to `0.9`.
- `beta2`: RMSprop weight. Defaults to `0.999`.
- `epsilon`: divide-by-zero failsafe. Defaults to `10 ** -8`.