The concept of Boosting a given model extends, intuitively, from two other Data Science concepts:
- Principal Component Analysis, which aims to find the axis that explains the most variation, re-orient the original data, then find a new axis that explains what the first couldn't, and so on
- Bootstrapping, which maximizes the use of a dataset by repeatedly sampling with replacement and aggregating the models fit to each sample
Essentially, we want to make a simple model that predicts on `y`. Then we subtract those predictions from `y` to get a bunch of residuals that we missed. Then we train a second model on these residuals, subtract what it learned, and rinse-repeat.
At each step, we tack the model we just trained to the end of a big ol’ list of models.
When we're done, new `X` values will be fed into each of these models, and our output will be the sum of their predictions, each weighted by a constant `lambda`.
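The subtract-and-refit loop can be sketched in a couple of rounds with a hand-rolled "decision stump" standing in for any simple model. The data and the `fit_stump` helper here are made up for illustration; the point is just that each round's model is trained on the previous round's residuals:

```python
import numpy as np

def fit_stump(X, y):
    """A one-split 'decision stump' (hypothetical helper): try each
    midpoint between sorted X values, keep the split with the lowest
    squared error, and predict the mean of y on each side."""
    best = None
    for s in (X[:-1] + X[1:]) / 2:  # assumes X is sorted
        left, right = y[X <= s].mean(), y[X > s].mean()
        sse = ((y - np.where(X <= s, left, right)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left, right)
    _, s, left, right = best
    return lambda x: np.where(x <= s, left, right)

# Toy 1-D data (made up)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 6.0, 10.0])

# Round 1: simple model fit directly to y
f1 = fit_stump(X, y)
r = y - f1(X)            # residuals the first model missed

# Round 2: second model fit to those residuals
f2 = fit_stump(X, r)
r = r - f2(X)            # residuals shrink again

# Prediction = sum of the models' outputs
y_hat = f1(X) + f2(X)
```

Each round the leftover residuals get smaller, which is the whole game.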
With Tree-Based Models
Model Boosting was introduced to me in the context of Decision Trees in Chapter 8 of ISL, but the idea extends neatly enough to other models.
The algorithm that they outline looks like the following:
Note: We start off making a simple Decision Tree to predict on `y`, then we predict on updated residuals every step thereafter, so we set `r = y` at the start.
```python
trees = []
r = y  # residuals are just the first y values

for b in range(1, B + 1):
    f_b = DecisionTree(terminal_nodes=d).fit(X, r)
    trees.append(f_b)
    r = r - lambda_ * f_b(x)

f_x = lambda_ * np.sum([tree(x) for tree in trees])
```
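A runnable version of the pseudocode above, assuming scikit-learn is available: `DecisionTreeRegressor(max_leaf_nodes=d)` plays the role of a `d`-terminal-node tree, and the data here is synthetic, just to have something to fit:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, B=500, d=4, lam=0.01):
    """Boosted regression trees: each tree is fit to the current
    residuals, and only lam of its prediction is subtracted off."""
    trees = []
    r = y.copy()                      # residuals start as y itself
    for _ in range(B):
        f_b = DecisionTreeRegressor(max_leaf_nodes=d).fit(X, r)
        trees.append(f_b)
        r = r - lam * f_b.predict(X)  # remove what this tree learned
    return trees

def predict(trees, X, lam=0.01):
    # final prediction is the shrunken sum over all trees
    return lam * np.sum([t.predict(X) for t in trees], axis=0)

# Synthetic data for demonstration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

trees = boost(X, y)
y_hat = predict(trees, X)
```

Note that `lam` has to match between fitting and prediction, since it's baked into how much each tree was allowed to contribute.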
The shrinkage parameter `lambda` only makes sense for values between 0 and 1, but in practice we'll use something closer to 0.001. Its whole job is to slow down the learning process from model to model.
Look at the residual update step above and consider small and large values of `lambda`:
- If small, then the residuals are only a little different than they were in the last step and we learn almost the same thing
- If large, then we’re basically sculpting with dynamite, haphazardly over-correcting at each step
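Both bullets fall out of a single residual-update step, `r_new = r - lam * f_x`. With made-up numbers where the model's guess overshoots the true residuals:

```python
import numpy as np

# Hypothetical residuals and a model output that overshoots them
r = np.array([2.0, -1.0, 0.5])
f_x = np.array([2.5, -1.5, 1.0])   # model's (imperfect) guess at r

small = r - 0.001 * f_x   # barely moves: next model sees ~the same r
large = r - 1.0 * f_x     # over-corrects: every residual flips sign
```

With `lambda = 0.001` the residuals are nearly unchanged, so the next model refines the same signal; with `lambda = 1.0` the imperfect guess is applied at full strength and the residuals swing past zero, which is the "sculpting with dynamite" failure mode.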
Number of Models
It's not hard to imagine that a very large Model Count, `B`, will lead to over-fitting. Infinitely-many models trained to reduce the prediction error of the last model will collapse onto perfect interpolation at the cost of flexibility.
This is typically a value that we arrive at via cross-validation.
Note: There is some tricky interplay between this and our `lambda` value: a small shrinkage parameter will require many aggregated models to arrive at the appropriate convergence.
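One way to run that cross-validation cheaply is to fit once with a generous `B` and score every intermediate ensemble size on held-out data. This sketch assumes scikit-learn, whose `GradientBoostingRegressor` calls the shrinkage parameter `learning_rate` and exposes per-size predictions via `staged_predict`; the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data for demonstration
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, 300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Fit with a deliberately large B (n_estimators) and small shrinkage
gbm = GradientBoostingRegressor(
    n_estimators=2000, learning_rate=0.01, max_depth=2
).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., B trees,
# so one fit gives us the validation error at every ensemble size
val_mse = [np.mean((y_val - p) ** 2) for p in gbm.staged_predict(X_val)]
best_B = int(np.argmin(val_mse)) + 1
```

With `learning_rate=0.01`, expect `best_B` to land well into the hundreds or beyond, which is the small-`lambda`-needs-big-`B` interplay in action.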