Root Mean Squared Error
Overview
One of the more standard measures of model accuracy when predicting numeric values is the Root Mean Squared Error (RMSE).
Basically, you:
- Find the difference between each prediction and the actual result
- Square each difference
- Add the squared differences together
- Divide by the number of observations
- Take the square root of that
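Squaring makes every error non-negative, so a prediction that is too high counts the same as one that is equally too low, and larger misses are penalized more heavily.
Additionally, we take the root (as opposed to stopping at the mean squared error, MSE) in order to express the error in the same units as the value being predicted.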
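To make the calculation concrete, here is a small worked example in plain Python: square the errors, average them, then take the root. The numbers are made up for illustration and have nothing to do with the dataset used below.

```python
import math

# Toy values chosen for illustration; they are not from the dataset used later.
actual = [10, 20, 30, 40]
predicted = [11, 19, 37, 33]

# 1. Difference between each prediction and the actual result
errors = [p - a for p, a in zip(predicted, actual)]   # [1, -1, 7, -7]
# 2. Square each difference
squared = [e ** 2 for e in errors]                    # [1, 1, 49, 49]
# 3. Add the squared differences together
total = sum(squared)                                  # 100
# 4. Divide by the number of observations (this is the MSE)
mse = total / len(squared)                            # 25.0
# 5. Take the square root (this is the RMSE)
rmse = math.sqrt(mse)                                 # 5.0

print(rmse)
```

The average miss here is 4, but the RMSE comes out at 5.0 because the two large errors dominate once they are squared.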
Fitting a Model
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
# dummy dataset
X, y = make_regression()
X.shape, y.shape
((100, 100), (100,))
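The shapes above come from the defaults of make_regression (100 samples, 100 features). Because no seed is passed, the data changes on every run, so the numbers further down will not reproduce exactly. A minimal sketch of a reproducible version, assuming an arbitrary seed of 0 (my choice, not the original notebook's):

```python
import numpy as np
from sklearn.datasets import make_regression

# Spell out the defaults and fix the seed so the data is identical on every run.
# random_state=0 is an arbitrary value picked for illustration.
X, y = make_regression(n_samples=100, n_features=100, random_state=0)

# Generating the data again with the same seed gives the same arrays.
X_again, y_again = make_regression(n_samples=100, n_features=100, random_state=0)
same_data = np.array_equal(X, X_again) and np.array_equal(y, y_again)

print(X.shape, y.shape, same_data)
```

train_test_split accepts a random_state argument as well, so the split can be pinned down the same way.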
We want to build a simple Linear Regression model with our dummy data.
But as we’ve discussed in other notebooks, we first need to split our data up into training and test sets, so let’s do that.
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y)
# reuse the arrays from the split above rather than splitting a second time
[arr.shape for arr in (train_X, test_X, train_y, test_y)]
[(75, 100), (25, 100), (75,), (25,)]
Now we can fit our model with our training data:
model = LinearRegression()
model.fit(train_X, train_y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
And use that model to make predictions on our test data:
model.predict(test_X)
array([ 157.19999884, 182.84477034, 198.61054437, 168.05031037,
102.80607835, -161.14849689, -45.37499645, 77.02471828,
174.79940023, 70.73630468, 96.67254953, -22.69534224,
-251.23474593, 191.22108821, -302.14564522, -149.39232913,
167.25523265, 212.15791823, 251.71364073, -90.09065502,
-16.90454986, -21.64715521, 94.58179599, -292.43390204,
131.28127778])
Scoring Accuracy
To see how close we were, we compare the predictions against test_y and follow the same steps as above.
import numpy as np
predictions = model.predict(test_X)
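# difference between each prediction and the actual value
error = predictions - test_y
# square, sum, and divide by the number of observations
mse = np.sum(error * error) / len(error)
# take the root to get back to the units of y
rmse = np.sqrt(mse)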
rmse
59.635789214540765
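If the calculation is needed in more than one place, it can be wrapped in a small helper. This is only a sketch: the function name rmse and the shape check are my own additions, and the inputs are toy values rather than the predictions above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-shaped sequences (hypothetical helper)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    # np.mean does the "add together and divide by n" step in one call
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Toy values, not the notebook's data
print(rmse([10, 20, 30, 40], [11, 19, 37, 33]))  # 5.0
```

With the arrays from this notebook, the call would be rmse(test_y, predictions).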
Or we can just use the sklearn implementation of MSE and take the root ourselves:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test_y, predictions)
rmse = np.sqrt(mse)
rmse
59.635789214540765
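Newer scikit-learn releases also ship a function that returns the RMSE directly. The sketch below assumes it is available from version 1.4 onward and falls back to the manual root on older installs. The inputs are toy values, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

try:
    # Assumption: present in scikit-learn 1.4 and later.
    from sklearn.metrics import root_mean_squared_error
except ImportError:
    # Fallback for older releases: take the root of the MSE ourselves.
    def root_mean_squared_error(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))

# Toy values chosen for illustration
y_true = [10, 20, 30, 40]
y_pred = [11, 19, 37, 33]

rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)
```

Either branch should give the same number as np.sqrt(mean_squared_error(...)), so which one to use is mostly a matter of which scikit-learn version is installed.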