# Logistic Regression Basics

## Stated with variables

Our goal is to find a model whose predictions closely match the actual values.

We have a vector of input features

$x \in \mathbb{R}^{n}$

a `0`-or-`1` target

$y$

and our predictions, which fall between `0` and `1`

$\hat{y}$

We’ll arrive at our predictions using our weights

$w \in \mathbb{R}^{n}$

and our bias

$b \in \mathbb{R}$

both of which we learn by training.

But we need to coerce our prediction values to lie between `0` and `1`, so we need a *sigmoid function:*

$\sigma(z) = \frac{1}{1+e^{-z}}$

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # squashes any real z into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))
```

As the plot below shows, the values tend to `0` for large negative inputs and `1` for large positive inputs. Furthermore, the curve crosses `x=0` at `y=0.5`.

```python
X = np.linspace(-10, 10, 100)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(X, sigmoid(X))
ax.axvline(0, color='k')
```


## Loss Function

So our prediction is the sigmoid of a weighted sum: the inputs $x_1, \dots, x_n$ dotted with the weights $w_1, \dots, w_n$, plus a bias term $b$.

$\hat{y} = \sigma(w^{T}x + b) \quad \text{where} \quad \sigma(z) = \frac{1}{1+e^{-z}}$
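As a sketch of that formula in NumPy (the names `predict`, `w`, `x`, and `b` here are illustrative, not fixed by the math):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(w, x, b):
    # y_hat = sigma(w^T x + b): dot the weights with the inputs,
    # add the bias, and squash the result into (0, 1)
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.5, -0.25])
x = np.array([2.0, 4.0])
b = 0.0
predict(w, x, b)  # w.x + b = 0, so sigma(0) = 0.5
```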

Traditionally, we might reach for a cost function like squared error: the difference between prediction and actual, squared.

$\mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y} - y)^{2}$

However, combined with the sigmoid this produces a non-convex, poorly behaved cost surface. Instead, we use the cross-entropy loss:

$\mathcal{L}(\hat{y}, y) = -\big(y\log\hat{y} + (1-y)\log(1-\hat{y})\big)$
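A minimal sketch of this loss for a single prediction (the function name `loss` is illustrative):

```python
import numpy as np

def loss(y_hat, y):
    # cross-entropy loss for one prediction y_hat against a 0/1 label y
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

loss(0.9, 1)  # confident and correct: small loss, -log(0.9) ≈ 0.105
loss(0.1, 1)  # confident and wrong: large loss, -log(0.1) ≈ 2.303
```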

## Intuition

Recall the shape of the `log` function:

- It tends toward negative infinity as its argument approaches `0`
- It is exactly `0` at `1`
- It grows (slowly) toward positive infinity

```python
fig, ax = plt.subplots(figsize=(10, 6))
X = np.linspace(0.001, 10, 100)
ax.plot(X, np.log(X))
ax.axhline(0, color='black')
```


So, looking back at this loss function and considering the behavior of `log`, consider what happens in the following scenarios.

### If `y=1`

Our loss function becomes

$\mathcal{L} = -\big( \log(\hat{y}) + 0\big)$

Therefore, if we predict `1`, then `log(1)` evaluates to `0`: no error.

Conversely, if we predict `0`, we incur essentially infinite error. We never want to be certain it’s a `0` when it’s actually a `1`.

```python
fig, ax = plt.subplots(figsize=(10, 6))
X = np.linspace(0.001, 1, 100)
ax.plot(X, -np.log(X))
```


### If `y=0`

Our loss function becomes

$\mathcal{L}(\hat{y}, y) = -\big(0 + (1)\log(1-\hat{y})\big)$

And looking at that last term, we see that as our prediction gets closer and closer to `1`, the error becomes infinite.

```python
fig, ax = plt.subplots(figsize=(10, 6))
X = np.linspace(0., .999, 100)
ax.plot(X, -np.log(1-X))
```


## Cost Function

If this intuition makes sense at the record level, then extending this *loss function* across all $m$ of our records gives our *cost function*, expressed as

$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{i}, y^{i})$

$J(w, b) = - \frac{1}{m} \sum_{i=1}^{m} \big(y^{i}\log\hat{y}^{i} + (1-y^{i})\log(1-\hat{y}^{i})\big)$
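Vectorized with NumPy, this cost over all $m$ records might look like the following sketch (the names `cost`, `X`, and `y` are assumed; `sigmoid` is as defined earlier):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(w, b, X, y):
    # X: (m, n) matrix of inputs; y: (m,) vector of 0/1 labels
    y_hat = sigmoid(X @ w + b)  # predictions for all m records at once
    # average cross-entropy loss across the m records
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# With w = 0 and b = 0, every prediction is 0.5, so the cost is -log(0.5) ≈ 0.693
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0.0, 1.0])
cost(np.zeros(2), 0.0, X, y)
```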