
Logistic Regression Gradient Descent

The Building Blocks

Recall our equation for the Loss Function of a Logistic Regression

L(\hat{y}, y) = -\left( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right)

We use the weights, w, our inputs, x, and a bias term, b, to get a value z.

z = w^T x + b

And we want this value to be between 0 and 1, so we pipe it through a sigmoid function to get our predictions.

\hat{y} = \sigma(z)

We refer to the sigmoid function that runs over all of our values as the activation function, so for shorthand, we’ll say

a = \hat{y}

And thus

L(a, y) = -\left( y \log a + (1 - y) \log(1 - a) \right)
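
To make the pieces concrete, here’s a minimal numpy sketch of the forward pass and loss for one training example; the weights, inputs, and label are made-up numbers, and sigma and cost are little helper functions for σ and L:

import numpy as np

def sigma(z):
    # sigmoid activation: squashes z into (0, 1)
    return 1 / (1 + np.exp(-z))

def cost(y, a):
    # L(a, y) = -(y * log(a) + (1 - y) * log(1 - a))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# made-up single example: two features, two weights, a bias, a true label
w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
b = 0.1
y = 1

z = np.dot(w, x) + b   # z = w^T x + b
a = sigma(z)           # a = y_hat = sigma(z)
print(a, cost(y, a))   # the prediction and its loss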

Gradient Descent

So if our aim is to minimize our overall cost, we need to lean on some calculus.

The idea here is that we’re going to take incremental steps across the inputs of the cost function, the weights and the bias term, taking x as given. That means we want to work out the derivative of the cost function with respect to those terms.

Finding the Derivatives

Looking at the chain of execution to arrive at our cost function, we have:

Our z as an intermediate value, generated as a function of w, x, and b

z = w^T x + b

a, which is a function of z, applied with our standard sigmoid function

a = \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}

Finally, Loss is a function of y, our true values, and a (and all of its dependencies)

L(a, y) = -\left( y \log a + (1 - y) \log(1 - a) \right)

We’re trying to Chain Rule our way backwards, so we need to figure out all of the partial derivatives that impact this loss function.

Key Derivatives to take as Given

Hand-wavy derivations, courtesy of the Logistic Regression Gradient Descent video during Week 2 of Neural Networks and Deep Learning

Sigmoid wrt z

\frac{\partial a}{\partial z} = a(1 - a)

Loss Function wrt a

\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}
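
If you don’t want to take these entirely on faith, a quick finite-difference sanity check (just a sketch, not from the course) shows both formulas line up with numerical derivatives:

import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

def cost(y, a):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, eps = 0.7, 1.0, 1e-6
a = sigma(z)

# da/dz: numerical derivative vs a * (1 - a)
da_dz_numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
print(da_dz_numeric, a * (1 - a))

# dL/da: numerical derivative vs -y/a + (1 - y)/(1 - a)
dL_da_numeric = (cost(y, a + eps) - cost(y, a - eps)) / (2 * eps)
print(dL_da_numeric, -y / a + (1 - y) / (1 - a))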

Applying the Chain Rule to our Loss Function

\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z}

Substituting in

\frac{\partial L}{\partial z} = \left( -\frac{y}{a} + \frac{1 - y}{1 - a} \right) a(1 - a)

\frac{\partial L}{\partial z} = a - y
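
Expanding that product shows why everything collapses so neatly:

\frac{\partial L}{\partial z} = -y(1 - a) + (1 - y)a = -y + ay + a - ay = a - y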

Extrapolating to weights and bias

Assuming a simple formula for z of the form

z = w_1 x_1 + w_2 x_2 + b

We can apply the same Chain Rule logic as above

w_1

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w_1}

Substituting again

\frac{\partial L}{\partial w_1} = (a - y) \frac{\partial z}{\partial w_1}

\frac{\partial L}{\partial w_1} = (a - y) x_1

w_2

Follows the exact same form

\frac{\partial L}{\partial w_2} = (a - y) x_2

Bias

The derivative of z with respect to b is just 1, so that factor drops out

\frac{\partial L}{\partial b} = (a - y)
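
Putting those three results together, the backward pass for a single example is only a few lines. Here’s a sketch reusing the made-up numbers from the earlier forward-pass snippet:

import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

# same made-up example as before: two features, two weights, a bias, a label
w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
b, y = 0.1, 1

a = sigma(np.dot(w, x) + b)   # forward pass

dz = a - y            # dL/dz   = a - y
dw_1 = dz * x[0]      # dL/dw_1 = (a - y) * x_1
dw_2 = dz * x[1]      # dL/dw_2 = (a - y) * x_2
db = dz               # dL/db   = a - y
print(dw_1, dw_2, db)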

At Scale

So far we’ve been showing the loss for a single training example. Now, when you consider all m training examples, your Cost Function looks like

J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)})

And if you want to take the derivative of this, with respect to whatever, the 1/m and the summation get kicked out to the front, because math™ (the derivative of a sum is the sum of the derivatives, and constant factors carry through). This makes the calculation very tidy!
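
Spelled out for w_1, that gives

\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial L(a^{(i)}, y^{(i)})}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} \left( a^{(i)} - y^{(i)} \right) x_1^{(i)}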

The Descent Algorithm

We initialize our cost and gradient accumulators to 0

J = dw_1 = dw_2 = db = 0

And then this loop happens for each training iteration step

# one pass over each of the m training examples
for i in range(m):
    z = np.dot(w, x[i]) + b      # x[i] holds the features for example i
    a = sigma(z)
    J += cost(y[i], a)
    dz = a - y[i]                # dL/dz for this example
    dw_1 += x[i][0] * dz
    dw_2 += x[i][1] * dz
    db += dz

# handle the leading 1/m in the cost function
J = J / m
dw_1 = dw_1 / m
dw_2 = dw_2 / m
db = db / m

# adjust weights and bias by the learning rate
w_1 = w_1 - alpha * dw_1
w_2 = w_2 - alpha * dw_2
b = b - alpha * db

All of this looping is, of course, wildly inefficient. Which is why we vectorize.

Vectorized Implementation

J = b = 0
w = np.zeros((n_x, 1))

# one training iteration: forward pass over all m examples at once
Z = np.dot(w.T, X) + b
A = sigma(Z)

dZ = A - Y

dw = (1/m) * np.dot(X, dZ.T)
db = (1/m) * np.sum(dZ)

w = w - alpha * dw
b = b - alpha * db

Note: If we want to run 1,000 iterations, we’d still have to wrap everything from Z = np.dot(w.T, X) + b on down in a for loop
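
For example, wrapping the vectorized pass in that outer loop might look like the sketch below; the toy data, learning rate, and iteration count are all made up, just to show the shape of the loop:

import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

# made-up toy data: n_x features, m examples
n_x, m = 2, 100
rng = np.random.default_rng(0)
X = rng.normal(size=(n_x, m))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # a toy label the model can learn

alpha, num_iterations = 0.1, 1000
w = np.zeros((n_x, 1))
b = 0.0

for _ in range(num_iterations):
    Z = np.dot(w.T, X) + b        # forward pass over all m examples
    A = sigma(Z)

    dZ = A - Y                    # backward pass
    dw = (1/m) * np.dot(X, dZ.T)
    db = (1/m) * np.sum(dZ)

    w = w - alpha * dw            # gradient descent update
    b = b - alpha * db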