R Squared Measure of Fit

09 Sep 2019

Components

Suppose we’ve got a spattering of points that look like the following

from IPython.display import Image
Image('images/r2_points.PNG')

png

We’d say that the Total Sum of Squares represents the combined amount of variation in Y across all points. In this case, we measure and sum the distance of each point (maroon) from the average Y value (red).

We square each value when we sum so values above and below the average height don’t cancel out.

Image('images/r2_tss.PNG')

png

Then consider fitting a Linear Regression to these points and plotting the proposed form, $\hat{y}$, in blue

We’d then say that the Residual Sum of Squares represents the total distance from each point to the line of best fit– again, squared and summed.

Image('images/r2_rss.PNG')

png

$R^2$ Statistic

Simply put, the $R^2$ statistic of a Linear Regression measures the proportion of variance of Y explained by variance in X. In other words, “what percentage of all variation in the target can we model using our attributes?”

To arrive at this, we want to quantify the amount of variation not explained by X– which is simply the total residual error as a fraction of all variation

$\text{Unexplained error} = \frac{RSS}{TSS}$

If “the proprotion of all error” adds up to 100 percent, then finding the R-squared value simply means subtracting the above from 1.

$R^2 = 1 - \frac{RSS}{TSS}$

This gives us “the percentage of variation in Y that we can explain using X”

Extending

To solidify this intution, if we can perfectly model our data points with a straight line, as below, then we’d say that 100% of the variation in Y can be explained by variation in X– or that $R^2 = 1$

Image('images/r2_perfect.PNG')

png

Conversely, if there is NO relationship between X and Y, and our “line of best fit” winds up being the same as the constant average value for Y, then R-squared is 0, as RSS = TSS, given that the two lines are identical.

Image('images/r2_imperfect.PNG')

png

Adjusted R-Squared

Finally, it’s worth examining the effect of adding variables to your model, with respect to R-Squared.

First consider that R-Squared is a measure of explained variance. If you were to add a variable to your model that was outright useless in prediction, it would come out in the wash of your model training. Thus, adding more variables to your model will only ever improve your R-squared.

But this obviously isn’t a desirable trait of the score, otherwise we’d litter all of our model with a host of unimportant features.

The “Adjusted R-Squared Score” corrects for this by normalizing the calculation for feature count.

$AdjR^2 = 1 - \frac{\frac{RSS}{n - d - 1}}{\frac{TSS}{n-1}}$

Where d is the number of different features. To paraphrase ISL

Once all of the correct variables have been added, adding noise variables will lead to only a very small decrease in RSS. However, incrementing d in the denominator of the numerator will outweigh this effect. Thus adjusted R-squared “pays a price” for the inclusion of unnecessary variables in the model.