Logistic Regression Gradient Descent

The Building Blocks

Recall our equation for the Cost Function of a Logistic Regression

$\mathcal{L}(\hat{y}, y) = -\big(y\log\hat{y} + (1-y)\log(1-\hat{y})\big)$

We use the weights, w, our inputs, x, and a bias term, b to get a vector z.

$z = w^{T} x + b$

And we want this vector to be between 0 and 1, so we pipe it through a sigmoid function, to get our predictions.

$\hat{y} = \sigma(z)$

We refer to the sigmoid function that runs over all of our values as the activation function, so for shorthand, we’ll say

$a = \hat{y}$

And thus

$\mathcal{L}(a, y) = -\big(y\log a + (1-y)\log(1 - a)\big)$

Gradient Descent

So if our aim is to minimize our overall cost, we need to lean on some calculus.

Idea here is that we’re going to take incremental steps across the inputs of cost function– the weights and bias term, taking x as given. Which means that we want work out the derivative of the cost function with respect to those terms.

Finding the Derivatives

Looking at the chain of execution to arrive at our cost function, we have:

Our z as an intermediate value, generated as a function of w, X, and b

$z = w^{T}x + b$

a, which is a function of z, applied with the our standard sigmoid function

$a = \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$

Finally, Loss is a function of y, or true values, and a (all of its dependencies)

$\mathcal{L}(a, y) = -\big(y\log a + (1-y)\log(1 - a)\big)$

We’re trying to Chain Rule our way backwards, so we need to figure out all of the partial derivatives that impact this loss function.

Key Derivatives to take as Given

Hand-wavy derivations, courtesy of the Logistic Regression Gradient Descent video during Week 2 of Neural Networks and Deep Learning

Sigmoid wrt z

$\frac{\delta a}{\delta z} = a (1 - a)$

Loss Function wrt a

$\frac{\delta \mathcal {L}}{\delta a} = -\frac{y}{a} + \frac{1-y}{1-a}$

Applying the Chain Rule to our Loss Function

$\frac{\delta \mathcal {L}}{\delta z} = \frac{\delta \mathcal {L}}{\delta a}\frac{\delta a}{\delta z}$

Substituting in

$\frac{\delta \mathcal {L}}{\delta z} = \big( -\frac{y}{a} + \frac{1-y}{1-a} \big) a(1-a)$

$\frac{\delta \mathcal {L}}{\delta z} = a-y$

Extrapolating to weights and bias

Assuming a simple formula for z of the form

$z = w_1 x_1 + w_2 x_2 + b$

We can apply the same Chain Rule logic as above

w_1

$\frac{\delta \mathcal{L}}{\delta w_1} = \frac{\delta \mathcal{L}}{\delta z} \frac{\delta z}{\delta w_1}$

Substituting again

$\frac{\delta \mathcal{L}}{\delta w_1} = (a - y) \frac{\delta z}{\delta w_1}$

$\frac{\delta \mathcal{L}}{\delta w_1} = (a - y) x_1$

w_2

Follows the exact same form

$\frac{\delta \mathcal{L}}{\delta w_2} = (a - y) x_2$

Bias

Derivative just goes to 1 and cancels out the term

$\frac{\delta \mathcal{L}}{\delta b} = (a - y)$

At Scale

So far we’ve been showing the cost of one training example. Now, when you consider all m training examples, your Cost Function looks like

$J(w, b) = \frac{1}{m} \sum^{m}_{i=1} \mathcal{L}\big( a^{i}, y \big)$

And if you want to take the derivative of this, with respect to whatever, the fraction and the summation terms are going to get kicked out to the front, because mathâ„¢. This makes the calculation very tidy!

The Descent Algorithm

We set our parameters to 0, by default

J, dw_1, dw_2, db = 0

And then this loop happens for each training iteration step

# one pass for each of the m training examples
for i in range(m):
    z = np.dot(w, x) + b
    a = sigma(z)
    J += cost(y, a)
    dz += a - y
    dw_1 += x[1]*dz
    dw_2 += x[1]*dz
    db += dz

# handle the leading fraction in the cost function
J = J / m
dw_1 = dw_1 / m
dw_2 = dw_2 / m

# adjust weights by learning rate
w_1 = w_1 - alpha * dw_1
w_2 = w_2 - alpha * dw_2
b = b - alpha * b

All of this looping is, of course, wildly inefficient. Which is why we vectorize.

Vectorized Implementation

J = b = 0
w = np.zeros(n_x, 1)

# Our iterations
Z = np.dot(w.T, X) + b
A = sigma(Z)

dZ = A - Y

dw = (1/m) * np.dot(X, dZ.T)
db = (1/m) * np.sum(dZ)

w = w - alpha * dw
b = b - alpha * db

Note: If we want to run 1000 iterations, we’d still have to wrap the third line down in a for loop