# Back Propagation

Back Propagation is essentially a \$2 way of saying "make an incremental change to your weights and biases, relative to your error." Like Gradient Descent, the main goal is doing a bunch of Chain Rule magic™ to find all of our partial derivatives. We calculate our error with a simple (actual - expected), then march backwards through the Net, using some small learning rate to make adjustments to each of the matrices.

### vs Logistic Regression Gradient Descent

Recall our vectorized implementation of Gradient Descent for Logistic Regression:

```python
# predict
Z = np.dot(w.T, X) + b
A = sigma(Z)
# gradient descent
dZ = A - Y
dw = (1/m) * np.dot(X, dZ.T)
db = (1/m) * np.sum(dZ)
# update
w = w - alpha * dw
b = b - alpha * db
```
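As a sketch of how that snippet runs end to end (the toy data, the `sigma` helper, and the iteration count here are all made-up stand-ins, not part of the original):

```python
import numpy as np

def sigma(Z):
    # sigmoid activation, squashes values into (0, 1)
    return 1.0 / (1.0 + np.exp(-Z))

# toy, linearly separable data: 2 features, m = 4 examples (made up)
X = np.array([[0.0, 1.0, 3.0, 4.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
m = X.shape[1]

w = np.zeros((2, 1))
b = 0.0
alpha = 0.1  # learning rate (arbitrary choice)

for _ in range(1000):
    # predict
    Z = np.dot(w.T, X) + b
    A = sigma(Z)
    # gradient descent
    dZ = A - Y
    dw = (1 / m) * np.dot(X, dZ.T)
    db = (1 / m) * np.sum(dZ)
    # update
    w = w - alpha * dw
    b = b - alpha * db

# threshold the final activations to get hard 0/1 predictions
predictions = (sigma(np.dot(w.T, X) + b) > 0.5).astype(int)
```

On this separable toy set, the learned boundary ends up between the two clusters of the first feature.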

The implementation for Back Propagation is very, very similar. The update step is basically the same, and the predict step is replaced by Forward Prop.

The biggest difference lies in the gradient descent section, and even that should look pretty familiar.

```python
dZ2 = A2 - Y
dW2 = (1/m) * np.dot(dZ2, A1.T)
db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
dZ1 = np.multiply(np.dot(W2.T, dZ2), activation_fn_deriv(Z1))
dW1 = (1/m) * np.dot(dZ1, X.T)
db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
```
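Putting Forward Prop, the block above, and the update step together, a minimal runnable sketch might look like this (the XOR data, the tanh hidden activation standing in for `activation_fn_deriv`, the layer sizes, and the hyperparameters are all assumptions for illustration):

```python
import numpy as np

np.random.seed(0)

def sigma(Z):
    # sigmoid output activation
    return 1.0 / (1.0 + np.exp(-Z))

# toy XOR data: 2 features, m = 4 examples (made up)
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
Y = np.array([[0, 1, 1, 0]])
m = X.shape[1]
n_h = 4      # hidden units (arbitrary choice)
alpha = 0.5  # learning rate (arbitrary choice)

W1 = np.random.randn(n_h, 2) * 0.5
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h) * 0.5
b2 = np.zeros((1, 1))

costs = []
for _ in range(5000):
    # forward prop
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)                 # assuming g^[1] = tanh
    Z2 = np.dot(W2, A1) + b2
    A2 = sigma(Z2)
    # cross-entropy cost, tracked so we can watch it fall
    costs.append(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))
    # back prop
    dZ2 = A2 - Y
    dW2 = (1 / m) * np.dot(dZ2, A1.T)
    db2 = (1 / m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.multiply(np.dot(W2.T, dZ2), 1 - A1 ** 2)  # tanh'(Z1) = 1 - tanh(Z1)^2
    dW1 = (1 / m) * np.dot(dZ1, X.T)
    db1 = (1 / m) * np.sum(dZ1, axis=1, keepdims=True)
    # update
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
```

Each gradient has the same shape as the parameter it updates, which is what lets the update step stay so simple.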

Note, however, how the second chunk of calculations depends on the output of the section above it: we're calculating backwards through the Net.

### Why Can’t I Hold All of These Derivatives?

The equation for `dZ1` might throw you for a loop, but it's easy when you consider the chain of calculations, using `Z1`, that get you to your Loss Function. First, we start with our hypothesis function.

$Z^{[1]} = W^{[1]}X + b^{[1]}$

which, of course, gets piped into our activation function to become `A1`:

$A^{[1]} = g(Z^{[1]})$

The nesting becomes pretty cumbersome by the second layer, `Z2`:

$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$

$Z^{[2]} = W^{[2]} g^{[1]}(Z^{[1]}) + b^{[2]}$
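Written out as a sketch (using $\mathcal{L}$ for the loss and $\ast$ for element-wise multiplication), the chain rule walks back through that nesting one factor at a time:

$\frac{\partial \mathcal{L}}{\partial Z^{[1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[2]}} \cdot \frac{\partial Z^{[2]}}{\partial A^{[1]}} \cdot \frac{\partial A^{[1]}}{\partial Z^{[1]}}$

Each factor on the right is something we already know how to compute from the forward pass.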

And so if we're looking for the derivative of the overall Cost Function with respect to `Z1`, we've got our work cut out for us doing Chain Rule stuff (we've omitted `A2` and `cost` for simplicity):

$dZ^{[1]} = W^{[2]T} dZ^{[2]} \ast g^{[1]\prime}(Z^{[1]})$

But looking at this, you see that it follows typical Chain Rule fashion: we're multiplying the derivative of the outer function (the `np.dot()` portion) by the derivative of the inside. **Also note**, this is element-wise multiplication, not the dot product between the two matrices.

`dW1` and `db1` are comparably trivial calculations, and our update step looks like normal.