Logistic Regression Gradient Descent
The Building Blocks
Recall our equation for the loss function of a Logistic Regression, i.e. the cost on a single training example:
$$L(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$$
We use the weights, $w$, our inputs, $x$, and a bias term, $b$, to get a vector $z$.
$$z = w^T x + b$$
And we want this vector to be between $0$ and $1$, so we pipe it through a sigmoid function to get our predictions.
$$\hat{y} = \sigma(z)$$
We refer to the sigmoid function that runs over all of our values as the activation function, so for shorthand, we’ll say
$$a = \hat{y}$$
And thus
$$L(a, y) = -\big(y \log a + (1 - y) \log(1 - a)\big)$$
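To make these building blocks concrete, here is a minimal sketch of the forward pass for a single example in plain Python. The function names (`sigmoid`, `predict`, `loss`) and the sample numbers are assumptions chosen for illustration, not something taken from the course material.

```python
import math

def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, b):
    """Compute a = sigma(w^T x + b) for one example (w and x are lists)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

def loss(a, y):
    """Loss for one example with label y in {0, 1}."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Hypothetical values, chosen only so the arithmetic is easy to follow
w, x, b = [0.5, -0.25], [2.0, 4.0], 0.0
a = predict(w, x, b)   # z = 1.0 - 1.0 + 0.0 = 0, so a = 0.5
print(a, loss(a, 1))   # the loss here is -log(0.5), roughly 0.693
```

With a prediction of exactly 0.5 the loss is the same for either label, which is a handy sanity check that the two terms of the formula are symmetric.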
Gradient Descent
So if our aim is to minimize our overall cost, we need to lean on some calculus.
The idea here is that we’re going to take incremental steps across the inputs of the cost function (the weights and bias term), taking $x$ as given. That means we want to work out the derivative of the cost function with respect to those terms.
Finding the Derivatives
Looking at the chain of execution to arrive at our cost function, we have:
Our $z$ as an intermediate value, generated as a function of $w$, $x$, and $b$
$$z = w^T x + b$$
$a$, which is a function of $z$, obtained by applying our standard sigmoid function
$$a = \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
Finally, the loss is a function of $y$, our true values, and $a$ (and, through $a$, all of its dependencies)
$$L(a, y) = -\big(y \log a + (1 - y) \log(1 - a)\big)$$
We’re trying to Chain Rule our way backwards, so we need to figure out all of the partial derivatives that impact this loss function.
Key Derivatives to take as Given
Hand-wavy derivations, courtesy of the Logistic Regression Gradient Descent video during Week 2 of Neural Networks and Deep Learning
Sigmoid wrt z
$$\frac{\partial a}{\partial z} = a(1 - a)$$
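For anyone who wants one step more than hand-waving, the sigmoid derivative can be worked out directly from the definition of the sigmoid. A short derivation, using only the symbols already defined here:

```latex
\frac{\partial a}{\partial z}
  = \frac{\partial}{\partial z}\left(1 + e^{-z}\right)^{-1}
  = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}}
  = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
  = a(1 - a),
\qquad \text{since } 1 - a = \frac{e^{-z}}{1 + e^{-z}}.
```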
Loss Function wrt a
$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}$$
Applying the Chain Rule to our Loss Function
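$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \frac{\partial a}{\partial z}$$
Substituting in
$$\frac{\partial L}{\partial z} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) a(1 - a)$$
$$\frac{\partial L}{\partial z} = a - y$$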
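The jump to $a - y$ is just algebra: distributing $a(1 - a)$ over the two fractions cancels each denominator. Spelled out as a worked step:

```latex
\left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) a(1 - a)
  = -y(1 - a) + (1 - y)\,a
  = -y + ay + a - ay
  = a - y
```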
Extrapolating to weights and bias
Assuming a simple formula for $z$ of the form
$$z = w_1 x_1 + w_2 x_2 + b$$
We can apply the same Chain Rule logic as above
w_1
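$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial w_1}$$
Substituting again
$$\frac{\partial L}{\partial w_1} = (a - y) \frac{\partial z}{\partial w_1}$$
$$\frac{\partial L}{\partial w_1} = (a - y)\, x_1$$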
w_2
Follows the exact same form
$$\frac{\partial L}{\partial w_2} = (a - y)\, x_2$$
Bias
Here $\frac{\partial z}{\partial b} = 1$, so that factor simply drops out of the chain rule
$$\frac{\partial L}{\partial b} = (a - y)$$
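One way to gain confidence in these three results is to compare them against finite differences. The sketch below does that for a single made-up example; the helper names (`analytic_grads`, `numeric_grads`) and the numbers are hypothetical, and the finite-difference check is a swapped-in sanity test rather than part of the derivation above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    """Single-example loss for z = w1*x1 + w2*x2 + b."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def analytic_grads(w1, w2, b, x1, x2, y):
    """Chain-rule results: (a - y)*x1, (a - y)*x2, (a - y)."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    dz = a - y
    return dz * x1, dz * x2, dz

def numeric_grads(w1, w2, b, x1, x2, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    dw1 = (loss(w1 + eps, w2, b, x1, x2, y) - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)
    dw2 = (loss(w1, w2 + eps, b, x1, x2, y) - loss(w1, w2 - eps, b, x1, x2, y)) / (2 * eps)
    db = (loss(w1, w2, b + eps, x1, x2, y) - loss(w1, w2, b - eps, x1, x2, y)) / (2 * eps)
    return dw1, dw2, db

# Hypothetical parameter values (w1, w2, b) and one example (x1, x2, y)
params = (0.3, -0.2, 0.1)
example = (1.5, 2.0, 1)
print(analytic_grads(*params, *example))
print(numeric_grads(*params, *example))
```

The two printed tuples should agree to several decimal places; a mismatch would point to a slip in the derivative or in the code.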
At Scale
So far we’ve been showing the loss on one training example. Now, when you consider all $m$ training examples, your Cost Function looks like
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)})$$
And if you want to take the derivative of this with respect to any parameter, the fraction and the summation get kicked out to the front, because differentiation is linear. Each gradient of $J$ is then just the average of the per-example gradients, which makes the calculation very tidy!
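Written out for the two-weight case used earlier, and assuming $x_1^{(i)}$ denotes the first feature of the $i$-th example (notation introduced here for illustration), the averaged gradients look like this:

```latex
\frac{\partial J}{\partial w_1}
  = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial w_1} L\big(a^{(i)}, y^{(i)}\big)
  = \frac{1}{m} \sum_{i=1}^{m} \big(a^{(i)} - y^{(i)}\big)\, x_1^{(i)}
\qquad
\frac{\partial J}{\partial b}
  = \frac{1}{m} \sum_{i=1}^{m} \big(a^{(i)} - y^{(i)}\big)
```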
The Descent Algorithm
We set our cost and gradient accumulators to $0$ by default
J = dw_1 = dw_2 = db = 0
And then this loop runs once per training iteration
# one pass over the m training examples
# x_1[i], x_2[i] and y[i] are the two features and the label of example i
for i in range(m):
    z = w_1 * x_1[i] + w_2 * x_2[i] + b
    a = sigma(z)
    J += cost(y[i], a)
    dz = a - y[i]          # per-example value, not an accumulator
    dw_1 += x_1[i] * dz
    dw_2 += x_2[i] * dz
    db += dz
# handle the leading fraction in the cost function
J = J / m
dw_1 = dw_1 / m
dw_2 = dw_2 / m
db = db / m
# adjust weights and bias by the learning rate
w_1 = w_1 - alpha * dw_1
w_2 = w_2 - alpha * dw_2
b = b - alpha * db
All of this looping is, of course, wildly inefficient. Which is why we vectorize.
Vectorized Implementation
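J = b = 0
w = np.zeros((n_x, 1))            # np.zeros takes the shape as a tuple
# Our iterations
Z = np.dot(w.T, X) + b            # X has shape (n_x, m), so Z is (1, m)
A = sigma(Z)
J = -(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
dZ = A - Y
dw = (1/m) * np.dot(X, dZ.T)      # shape (n_x, 1), same as w
db = (1/m) * np.sum(dZ)
w = w - alpha * dw
b = b - alpha * db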
Note: If we want to run $1000$ iterations, we’d still have to wrap everything from the `# Our iterations` comment down in a `for` loop.
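As a sketch of what that wrapping might look like, here is a small self-contained version. The `train` function, its default `alpha` and `num_iterations`, and the toy data are all assumptions made for illustration. `X` is assumed to have shape `(n_x, m)` and `Y` shape `(1, m)`.

```python
import numpy as np

def sigma(Z):
    """Element-wise sigmoid."""
    return 1.0 / (1.0 + np.exp(-Z))

def train(X, Y, alpha=0.1, num_iterations=1000):
    """Batch gradient descent for logistic regression; returns (w, b)."""
    n_x, m = X.shape
    w = np.zeros((n_x, 1))
    b = 0.0
    for _ in range(num_iterations):
        Z = np.dot(w.T, X) + b            # shape (1, m)
        A = sigma(Z)
        dZ = A - Y
        dw = (1 / m) * np.dot(X, dZ.T)    # shape (n_x, 1)
        db = (1 / m) * np.sum(dZ)
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Toy data, made up for illustration: the label is 1 when the
# first feature is larger than the second
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])

w, b = train(X, Y)
predictions = (sigma(np.dot(w.T, X) + b) > 0.5).astype(int)
print(predictions)
```

The only Python-level loop left is the one over iterations; the pass over the $m$ training examples is handled inside the NumPy calls.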