Batch Normalization

Recall the effect of normalization on the cost function back when we considered Logistic Regression.

By rescaling our data to a fixed mean and standard deviation, normalization made the contours of our hypothetical cost function rounder and more even, which in turn made Gradient Descent much easier.


Batch Normalization essentially does the same thing, but for hidden layers of a Neural Network.

But why do we need to normalize in the hidden layer steps?


Say we’ve got a simple, 4-layer network like the one below.


Covering up earlier layers, it becomes immediately clear how we can extend all the benefits we see in Logistic Regression Normalization to this case.


More practically put, this makes later layers more robust to changes in earlier layers.

Andrew Ng provides a great example to demonstrate this:


If your input data for a given batch only consists of black cats, your model will overlearn the importance of the “is black” feature and fall apart when the next batch includes cats of various colors.

How to Implement

As with Logistic Regression, batch normalization recasts the numbers within a batch relative to a mean and standard deviation. Here, though, the mean and variance are computed only from the values in the batch itself, following the formulas:

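$z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^{2}+\epsilon}}$

A simple, vanilla normalization step, where $\mu$ and $\sigma^{2}$ are the mean and variance of $z$ across the batch and $\epsilon$ is a small constant that guards against division by zero.

$\tilde{Z}^{(i)} = \gamma \cdot z_{norm}^{(i)} + \beta$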

The normalized value is scaled by gamma and shifted by beta, two new learnable parameters (not to be confused with the beta hyperparameters in our Adam implementation). Like the weights before them, they get learned via Gradient Descent.

Therefore, z_tilde is calculated at the layer and batch level, and gets applied before the activation function in a given hidden layer, allowing hidden units to see a more consistent distribution of values and train more effectively.
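The two formulas above can be sketched in a few lines of NumPy. This is only an illustration under assumptions, not a reference implementation: the function name `batch_norm_forward`, the `(n_units, batch_size)` layout, and the default `eps` are all choices made for the example.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize pre-activations Z, assumed to have shape (n_units, batch_size)."""
    # Mean and variance of each hidden unit, taken across the batch only
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    # Vanilla normalization step
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    # Scale by gamma and shift by beta (both learned during training)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, mu, var

# Hypothetical batch: 4 hidden units, 32 examples, deliberately off-center
rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(4, 32))
gamma = np.ones((4, 1))   # a neutral starting scale
beta = np.zeros((4, 1))   # a neutral starting shift
Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)
```

With gamma at one and beta at zero, each row of `Z_tilde` has a mean of roughly zero and a standard deviation of roughly one. `Z_tilde` is what would then be passed to the activation function.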

At Run Time

After training, we likely won’t be running our model in batches, so the notion of “normalize the intermediate layers relative to the batch” doesn’t really make sense: a single example has no batch mean or variance to normalize against.

Instead, it’s recommended that we keep an Exponentially Weighted Moving Average of the mean and variance seen across training batches, similar to our approach with Adam optimization, and use those running estimates at run time.
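One way this might look in NumPy is sketched below. The helper names `update_running_stats` and `batch_norm_inference`, the `momentum` value of 0.9, and the starting values for the running statistics are assumptions made for the example, not something the course material prescribes.

```python
import numpy as np

def update_running_stats(running_mu, running_var, Z, momentum=0.9):
    """Fold one training batch's statistics into the moving averages."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return running_mu, running_var

def batch_norm_inference(z, running_mu, running_var, gamma, beta, eps=1e-8):
    """Normalize with the stored averages instead of batch statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

# Simulated training: 200 batches of 64 examples for 3 hidden units
rng = np.random.default_rng(0)
n_units = 3
running_mu = np.zeros((n_units, 1))
running_var = np.ones((n_units, 1))
for _ in range(200):
    Z = rng.normal(loc=2.0, scale=4.0, size=(n_units, 64))
    running_mu, running_var = update_running_stats(running_mu, running_var, Z)

# At run time, a single example is normalized against the running estimates
gamma = np.ones((n_units, 1))
beta = np.zeros((n_units, 1))
z_single = np.full((n_units, 1), 2.0)
out = batch_norm_inference(z_single, running_mu, running_var, gamma, beta)
```

Because the running estimates settle near the true mean and variance of the simulated data, a single example sitting at that mean comes out close to zero, even though there is no batch to compare it against.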