# Multi-Class Regression with SoftMax

*Note, these notes were taken in the context of Week 3 of Improving Deep Neural Networks*

When your prediction task extends beyond binary classification, you want to rely less on the sigmoid function and logistic regression. While you might see some success using them anyway and then doing some `numpy.max()` dancing over your results, a much cleaner approach is to use the *SoftMax* function.

### The Math

Essentially, softmax takes an arbitrary results vector, `Z`, and instead of applying our typical sigmoid function to it, does the following:

- Overwrites each value, `z_i`, with `t_i`, where

$t_i = e^{z_i}$

- Normalizes each value by the sum of all values in the vector (the activation function)

$a_i = \frac{t_i}{\sum_j t_j} = \frac{e^{z_i}}{\sum_j e^{z_j}}$

This has the convenient effect that all values in the vector `a` sum to 1, giving a rough “percent likelihood” assigned to each class.

- In terms of training, we can do Gradient Descent on this just fine, as the cost function is essentially the same as that for Logistic Regression, just generalized to multiple classes (see the loss below).
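
For concreteness, the usual cross-entropy cost on a single example, with one-hot label vector $y$ and softmax output $a$, is

$\mathcal{L}(a, y) = -\sum_{j} y_j \log a_j$

which collapses to the familiar logistic regression loss when there are only two classes.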

### A Simple Example

Say we’re trying to decide between 4 separate classes and wind up with a final vector that looks like this:

```
import numpy as np

# Raw output scores (logits) for the four classes
Z = np.array([5, 2, -1, 3])
Z
```

```
array([ 5,  2, -1,  3])
```

Determining the softmax likelihoods is easy enough, following the steps above:

```
# Step 1: exponentiate each raw score
T = np.exp(Z)
T
```

```
array([148.4131591 ,   7.3890561 ,   0.36787944,  20.08553692])
```

```
# Step 2: normalize by the total so the entries sum to 1
A = T / np.sum(T)
A
```

```
array([0.84203357, 0.04192238, 0.00208719, 0.11395685])
```
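
As a quick sanity check (my addition, not part of the steps above), the entries of `A` should sum to 1, making it a valid probability distribution:

```
# The softmax outputs form a probability distribution:
# every entry is positive and they sum to 1 (up to rounding).
assert np.isclose(A.sum(), 1.0)
```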

We can see that `Class_0` having a large value makes it likely, and conversely `Class_2` having a low value makes it unlikely, mirroring our Sigmoid Activation intuition.
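
Putting the pieces together, here is a minimal sketch of a reusable softmax function. The name `softmax` and the max-subtraction step are my own additions; subtracting `np.max(Z)` doesn’t change the result (softmax is shift-invariant) but keeps `np.exp` from overflowing on large scores:

```
def softmax(Z):
    # Shift scores so the largest is 0; this leaves the output
    # unchanged but prevents overflow in np.exp for large inputs.
    shifted = Z - np.max(Z)
    T = np.exp(shifted)
    return T / np.sum(T)

softmax(Z)
```

```
array([0.84203357, 0.04192238, 0.00208719, 0.11395685])
```

which matches (up to floating-point rounding) the `A` we computed by hand.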