Back Propagation

Back Propagation is essentially a $2 way of saying “make an incremental change to your weights and biases, relative to our error.” Like Gradient Descent, the main goal is doing a bunch of Chain Rule magic™ to find all of our partial derivatives. We calculate our error with a simple (actual - expected), then march backwards through the Net, using some small learning rate to nudge each of the weight and bias matrices.
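In symbols, every one of those nudges is just the usual Gradient Descent update applied to each layer’s parameters, with $\alpha$ as the learning rate and $dW^{[l]}$, $db^{[l]}$ standing in for the partial derivatives of the Cost:

$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$

$b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$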

vs Logistic Regression Gradient Descent

Recall our vectorized implementation of Gradient Descent for Logistic Regression:

import numpy as np

# sigma is just the sigmoid; X, Y, w, b, m, and alpha are assumed to already exist
def sigma(z):
    return 1 / (1 + np.exp(-z))

# predict
Z = np.dot(w.T, X) + b
A = sigma(Z)

# gradient descent
dZ = A - Y

dw = (1/m) * np.dot(X, dZ.T)
db = (1/m) * np.sum(dZ)

# update
w = w - alpha * dw
b = b - alpha * db

The implementation for Back Propagation is very, very similar. The update step is basically the same, and the predict step is replaced by Forward Prop.
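As a quick sketch (assuming a two-layer Net, with g as whatever activation the hidden layer uses and sigma as the sigmoid on the output), Forward Prop produces the Z1, A1, Z2, A2 that the gradient calculations below lean on:

# forward prop: hypothesis, then activation, one layer at a time
Z1 = np.dot(W1, X) + b1
A1 = g(Z1)

Z2 = np.dot(W2, A1) + b2
A2 = sigma(Z2)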

The biggest difference lies in the gradient descent section, and even that should look pretty familiar.

# layer 2 (output layer) gradients
dZ2 = A2 - Y
dW2 = (1/m) * np.dot(dZ2, A1.T)
db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)

# layer 1 (hidden layer) gradients
dZ1 = np.multiply(np.dot(W2.T, dZ2), activation_fn_deriv(Z1))
dW1 = (1/m) * np.dot(dZ1, X.T)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)

Note, however, how the second chunk of calculations depends on the output of the chunk above it: dZ1 needs dZ2, so we really are calculating backwards through the Net.

Why Can’t I Hold All of These Derivatives?

The equation for dZ1 might throw you for a loop, but it makes sense once you trace the chain of calculations that carry Z1 all the way to the Loss Function. First, we start with our hypothesis function.

$Z^{[1]} = W^{[1]}X + b^{[1]}$

which, of course, gets piped into our activation function to become A1

$A^{[1]} = g(Z^{[1]})$

The nesting becomes pretty cumbersome by the second layer Z2

$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$

$Z^{[2]} = W^{[2]} g^{[1]}(Z^{[1]}) + b^{[2]}$

And so if we’re looking for the derivative of the overall Cost Function with respect to Z1, we’ve got our work cut out for us doing Chain Rule stuff (we’ve omitted A2 and the Cost itself for simplicity).
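Written out (a sketch, using the same dZ naming as the code rather than expanding every intermediate term), the Chain Rule lands us at something like

$dZ^{[1]} = W^{[2]T} dZ^{[2]} \ast g^{[1]\prime}(Z^{[1]})$

where $\ast$ is element-wise multiplication and $dZ^{[2]} = A^{[2]} - Y$ already folds in the derivatives of the Cost and the output activation.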

But looking at this you see that it just follows typical Chain Rule fashion: we’re multiplying the derivative of the outer function (the np.dot() portion) by the derivative of the inside. Also note, this is element-wise multiplication, not the dot product between the two matrices.
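If that distinction ever gets slippery, a throwaway numpy comparison (not part of the original snippet) makes it concrete:

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[10, 20], [30, 40]])

np.multiply(A, B)   # element-wise: [[10, 40], [90, 160]]
np.dot(A, B)        # matrix product: [[70, 100], [150, 220]]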

dW1 and db1 are comparatively trivial calculations, and our update step looks just like before.
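For completeness, here’s roughly what that update step looks like with two layers’ worth of parameters (a sketch, reusing the same alpha as before):

# update: one gradient descent step on every weight matrix and bias vector
W1 = W1 - alpha * dW1
b1 = b1 - alpha * db1
W2 = W2 - alpha * dW2
b2 = b2 - alpha * db2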