Root Mean Squared Error

Overview

One of the most common measures of model accuracy when predicting numeric values is the Root Mean Squared Error (RMSE).

Basically, for every predicted value, you:

  • Find the difference between your prediction and the actual result
  • Square each difference
  • Add the squared differences together
  • Divide by the number of observations
  • Take the square root of the result

Squaring makes every error positive, so this gives us a measure of how far off each prediction was, whether it over- or under-shot the actual value.

Additionally, we take the root (as opposed to stopping at the MSE) so that the error is expressed in the same units as the target, which makes it much easier to interpret.
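As a formula, with n observations:

RMSE = sqrt( sum((y_pred - y_true)^2) / n )

To make the arithmetic concrete, here is a tiny worked example with made-up numbers, tracing the same steps on just two predictions:

import numpy as np

y_true = np.array([1.0, 6.0])  # actual values
y_pred = np.array([2.0, 4.0])  # predictions

error = y_pred - y_true        # [ 1., -2.]
mse = np.mean(error ** 2)      # (1 + 4) / 2 = 2.5
np.sqrt(mse)                   # back in the target's units
1.5811388300841898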

Fitting a Model

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
# dummy dataset: the defaults give 100 samples with 100 features
X, y = make_regression()
X.shape, y.shape
((100, 100), (100,))
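Note that make_regression draws random data, so the exact numbers you see will differ from the output shown below on every run. If you want reproducible output, you can pass a seed (the random_state value here is arbitrary):

X, y = make_regression(random_state=0)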

We want to build a simple Linear Regression model with our dummy data.

But as we’ve discussed in other notebooks, we first need to split our data up into training and test sets, so let’s do that.

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y)

# by default, train_test_split holds out 25% of the data for testing
[arr.shape for arr in (train_X, test_X, train_y, test_y)]
[(75, 100), (25, 100), (75,), (25,)]

Now we can fit our model with our training data

model = LinearRegression()
model.fit(train_X, train_y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

And use that model to make predictions on our test data

model.predict(test_X)
array([ 157.19999884,  182.84477034,  198.61054437,  168.05031037,
        102.80607835, -161.14849689,  -45.37499645,   77.02471828,
        174.79940023,   70.73630468,   96.67254953,  -22.69534224,
       -251.23474593,  191.22108821, -302.14564522, -149.39232913,
        167.25523265,  212.15791823,  251.71364073,  -90.09065502,
        -16.90454986,  -21.64715521,   94.58179599, -292.43390204,
        131.28127778])

Scoring Accuracy

If we want to see how close we were, we compare our predictions against test_y and follow the same steps as above.

import numpy as np

predictions = model.predict(test_X)
error = predictions - test_y              # signed difference for each prediction
mse = np.sum(error * error) / len(error)  # square, sum, divide by n
rmse = np.sqrt(mse)                       # root brings us back to the target's units
rmse
59.635789214540765

Or we can just use the scikit-learn implementation

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(test_y, predictions)
rmse = np.sqrt(mse)
rmse
59.635789214540765
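Depending on your scikit-learn version, you may not need the manual square root at all: from 0.22 onward, mean_squared_error accepts squared=False to return the RMSE directly, and 1.4 adds a dedicated root_mean_squared_error function.

# scikit-learn >= 0.22
rmse = mean_squared_error(test_y, predictions, squared=False)

# scikit-learn >= 1.4
from sklearn.metrics import root_mean_squared_error
rmse = root_mean_squared_error(test_y, predictions)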