# F Statistic

## Overview

The F-Statistic of a Linear Regression seeks to answer “Does the introduction of these variables give us greater information gain when trying to explain variation in our target?”

I like the way that Ben Lambert explains it and will paraphrase him here.

First, you make two models: a *restricted model* that's just the intercept and an *unrestricted model* that includes the new `x_i` values

$R: y = \alpha $

$U: y = \alpha + \beta_1 x_1 + \beta_2 x_2$

Then our Null Hypothesis states that none of the coefficients in `U` matter, i.e. `B_1 = B_2 = 0` (but this can extend to arbitrarily-many Beta values). Conversely, the alternative hypothesis states that `B_i != 0` for at least one Beta.

And so we start by calculating the Sum of Squared Residuals (see notes on R-Squared for a refresher) for both the Restricted and Unrestricted models.

By definition, the SSR for the Restricted model will be higher: the addition of any X variables will account for *some* increase in predictive power, even if minuscule.

Armed with these two, **the F-Statistic is simply the ratio of the additional explained variance to the unexplained variance**, and is calculated as

$F = \frac{SSR_R - SSR_U}{SSR_U}$

Well, almost.

We also, critically, normalize the numerator and denominator based on `p`, the number of `x` features we're looking at, and `n`, the number of observations we have. This helps us account for degrees of freedom and looks like the following:

$F = \frac{(SSR_R - SSR_U)/p}{SSR_U/(n-p-1)}$
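
To make this concrete, here's a minimal sketch using `numpy` and simulated data (the data and coefficients are made up purely for illustration) that computes the statistic straight from the two SSR values:

```
import numpy as np

rng = np.random.default_rng(42)

# simulated data: n observations of p features that genuinely affect y
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 3.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# restricted model: intercept only, so the prediction is just the mean of y
ssr_r = np.sum((y - y.mean()) ** 2)

# unrestricted model: intercept plus both features, fit via least squares
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
ssr_u = np.sum((y - design @ beta) ** 2)

# F-statistic, with numerator and denominator normalized by their df
f_stat = ((ssr_r - ssr_u) / p) / (ssr_u / (n - p - 1))
print(f_stat)
```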

## Interpretation

Plugging this into your favorite statistical computing software will yield a statistic that can take on wildly-different values. Conceptually, let's imagine two extremes:

- The X values don't give us anything useful. This means that the numerator (the information gain of adding them to the model) is small, therefore the whole fraction is small (often around `1` or so).
- On the other hand, if there's a huge improvement, you might see F values in the hundreds, if not thousands.

Generally, the F-statistic follows an F-distribution, whose shape depends on the degrees of freedom for both the numerator and denominator and looks like the following.

```
from IPython.display import Image
Image('images/f_dists.PNG')
```
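
If the image isn't handy, a minimal sketch with `scipy.stats.f` and `matplotlib` (the degree-of-freedom pairs here are arbitrary choices) produces curves of the same family:

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f

x = np.linspace(0.01, 5, 500)

# a few (numerator df, denominator df) pairs to show how the shape changes
for dfn, dfd in [(1, 50), (3, 50), (10, 100)]:
    plt.plot(x, f.pdf(x, dfn, dfd), label=f'F({dfn}, {dfd})')

plt.legend()
plt.title('F distributions for various degrees of freedom')
plt.show()
```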

## Relationship to t-statistic

The F and t statistics feel conceptually adjacent. But whereas F examines the effect of *multiple* attributes on your model, the t simply looks at one.

From a notation standpoint, if you had a model with an intercept and one `x` and wanted to observe the F statistic when introducing another `x`, you'd have a difference of 1 degree of freedom in the numerator, and `N - 3` degrees of freedom in the denominator (`N` minus the two features, minus the standard `1` for the intercept), thus

$F_{1, N-3}$

As far as the output goes, this is functionally equivalent to finding the t-statistic for the same degrees of freedom (N-3) and squaring it.
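
We can verify this equivalence numerically. Here's a minimal sketch with `statsmodels` on simulated data (the data-generating process is made up for illustration):

```
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

# unrestricted: intercept + x1 + x2; restricted: drop x2
full = sm.OLS(y, sm.add_constant(X)).fit()
restricted = sm.OLS(y, sm.add_constant(X[:, [0]])).fit()

# F-test comparing the two models: F_{1, n-3}
f_val, p_val, df_diff = full.compare_f_test(restricted)

# ...matches the squared t-statistic on x2 in the full model
print(f_val, full.tvalues[2] ** 2)
```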

### An Example

If we whip up a quick Linear Regression using `statsmodels.api` and the `boston` dataset within `sklearn`, we get access to a very clean object that we can use to interrogate the F-statistic for the model as a whole

```
import statsmodels.api as sm
from sklearn.datasets import load_boston

# note: load_boston was removed in scikit-learn 1.2, so this
# snippet assumes an older version of the library
data = load_boston()

# design matrix: the 13 boston features plus an intercept column
X = data['data']
X = sm.add_constant(X)

# fit OLS and pull the F-statistic for the model as a whole
est = sm.OLS(data['target'], X)
est = est.fit()
est.fvalue
```

```
108.07666617432622
```

But we also can see, using `.summary()`, the t-statistic for each of the attributes of our model. This is the same as omitting the variable, calculating the F-statistic, then taking the square root (and changing the sign, where appropriate).

In other words, this is the same as the partial effect of adding this variable to the mix.

```
est.summary()
```

| Dep. Variable: | y | R-squared: | 0.741 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.734 |
| Method: | Least Squares | F-statistic: | 108.1 |
| Date: | Mon, 09 Sep 2019 | Prob (F-statistic): | 6.72e-135 |
| Time: | 16:44:05 | Log-Likelihood: | -1498.8 |
| No. Observations: | 506 | AIC: | 3026. |
| Df Residuals: | 492 | BIC: | 3085. |
| Df Model: | 13 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 36.4595 | 5.103 | 7.144 | 0.000 | 26.432 | 46.487 |
| x1 | -0.1080 | 0.033 | -3.287 | 0.001 | -0.173 | -0.043 |
| x2 | 0.0464 | 0.014 | 3.382 | 0.001 | 0.019 | 0.073 |
| x3 | 0.0206 | 0.061 | 0.334 | 0.738 | -0.100 | 0.141 |
| x4 | 2.6867 | 0.862 | 3.118 | 0.002 | 0.994 | 4.380 |
| x5 | -17.7666 | 3.820 | -4.651 | 0.000 | -25.272 | -10.262 |
| x6 | 3.8099 | 0.418 | 9.116 | 0.000 | 2.989 | 4.631 |
| x7 | 0.0007 | 0.013 | 0.052 | 0.958 | -0.025 | 0.027 |
| x8 | -1.4756 | 0.199 | -7.398 | 0.000 | -1.867 | -1.084 |
| x9 | 0.3060 | 0.066 | 4.613 | 0.000 | 0.176 | 0.436 |
| x10 | -0.0123 | 0.004 | -3.280 | 0.001 | -0.020 | -0.005 |
| x11 | -0.9527 | 0.131 | -7.283 | 0.000 | -1.210 | -0.696 |
| x12 | 0.0093 | 0.003 | 3.467 | 0.001 | 0.004 | 0.015 |
| x13 | -0.5248 | 0.051 | -10.347 | 0.000 | -0.624 | -0.425 |

| Omnibus: | 178.041 | Durbin-Watson: | 1.078 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 783.126 |
| Skew: | 1.521 | Prob(JB): | 8.84e-171 |
| Kurtosis: | 8.281 | Cond. No. | 1.51e+04 |

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The condition number is large, 1.51e+04. This might indicate that there are strong multicollinearity or other numerical problems.
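
We can sanity-check the square-root relationship on the fitted `est` object from above: `f_test` on a single-coefficient restriction yields an F value whose square root matches that coefficient's t-statistic up to sign. A minimal sketch (picking `x13`, the last feature, arbitrarily, and assuming the default `const`, `x1`, ..., `x13` names statsmodels assigns):

```
import numpy as np

# F-test for the single restriction "the coefficient on x13 is zero"
f_res = est.f_test('x13 = 0')

# the square root of that F value recovers the magnitude of the t-statistic
print(np.sqrt(f_res.fvalue), est.tvalues[-1])
```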

### A warning about the `.summary()` method and t-statistics

It’s not enough to consider all of the t-statistics for each coefficient.

Consider the case where we build a Linear Regression off of 100 different attributes and our null hypothesis is true: each of them is unrelated to the target.

Recall that the p-value is shorthand for "the probability of observing a statistic at least this extreme under the null hypothesis." We often reject the null for values less than `0.05`, but considering the joint probability across 100 attributes, it's *likely* that we incidentally see at least one of them coming in under this cutoff, despite having no real relationship to the target.

If our criterion for "is this model valid?" is throwing a big ol' OR statement across all of the p-values and hoping for a bite, we're likely to come to a false conclusion.
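
To see this in action, here's a minimal simulated sketch (sizes and seed chosen arbitrarily): we regress a noise target on 100 pure-noise features, so the null hypothesis is true for every coefficient.

```
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 500, 100

# pure noise: none of the features has any real relationship to the target
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(X)).fit()

# count how many individual t-tests come in "significant" by accident;
# with 100 tests at the 0.05 cutoff, we expect about 5 false positives
print((res.pvalues[1:] < 0.05).sum())

# whereas the overall F-test has no such inflation (its p-value is
# typically well above 0.05 here)
print(res.f_pvalue)
```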

Looking at the F statistic, on the other hand, accounts for the number of observations and the number of features, because of the normalization by `n` and `p`.