Boosted Models
The concept of Boosting a given model extends, intuitively, from two other Data Science concepts:
- Principal Component Analysis, which aims to find the axis that explains the most variation, re-orient your original data, then find a new axis that explains what the first couldn't, and so on
- Bootstrapping, which maximizes the use of a dataset by repeatedly sampling with replacement and aggregating the models fit to each sample
Essentially, we want to make a simple model that predicts on y. Then we subtract these predictions from y to get a bunch of residuals that we missed. Then we train a second model on those residuals, subtract what it learned, and rinse-repeat.
At each step, we tack the model we just trained onto the end of a big ol’ list of models.
When we’re done, new X values get fed into each of these models, and each model’s output is weighted by a constant lambda before being summed.
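Written out (this is the additive form ISL gives), the final boosted model is just the lambda-shrunken sum of the B individual models:

$$\hat{f}(x) = \sum_{b=1}^{B} \lambda \, \hat{f}^{b}(x)$$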
With Tree-Based Models
Model Boosting was introduced to me in the context of Decision Trees in Chapter 8 of ISL, but the idea extends neatly enough to other models.
The algorithm that they outline looks like the following:
Note: We start off making a simple Decision Tree to predict on y, then we predict on updated residuals every step thereafter, so we set r = y at the start.
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # concrete stand-in for the generic DecisionTree above

trees = []
r = y.copy()  # residuals start out as the original y values
for b in range(B):
    # fit a small tree (d terminal nodes) to the current residuals
    f_b = DecisionTreeRegressor(max_leaf_nodes=d).fit(X, r)
    trees.append(f_b)
    r = r - lambda_ * f_b.predict(X)  # only remove a shrunken piece of what we just learned

# prediction on new points x: the shrunken sum of every tree's prediction
f_x = lambda_ * np.sum([tree.predict(x) for tree in trees], axis=0)
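The loop above assumes X, y, B, d, and lambda_ are already in scope. A minimal setup to actually run it might look like this (the toy dataset and the hyperparameter values are my own picks, not from ISL):

from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=500, random_state=0)  # toy regression data
B, d, lambda_ = 1000, 4, 0.01  # number of trees, terminal nodes per tree, shrinkage
x = X[:5]  # a few "new" points to predict on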
Hyperparameters
Shrinkage
The shrinkage parameter only makes sense with values between 0 and 1, but in practice we’ll use something closer to 0.01 or 0.001. Its whole job is to slow down the learning process from model to model.
Look at the residual update step above and consider small and large values of lambda:
- If small, then the residuals are only a little different than they were in the last step and we learn almost the same thing
- If large, then we’re basically sculpting with dynamite, haphazardly over-correcting at each step
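As a rough illustration (this assumes scikit-learn and a toy dataset, with learning_rate standing in for lambda), you can watch how different shrinkage values change how far the training loss has dropped after the same fixed number of models:

from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

X_toy, y_toy = make_friedman1(n_samples=500, random_state=0)

for lr in (1.0, 0.1, 0.01):
    gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=lr,
                                    max_depth=2, random_state=0)
    gbm.fit(X_toy, y_toy)
    # train_score_ holds the training loss at each boosting stage
    print(lr, gbm.train_score_[-1])

The tiny learning rates will look "worse" after 100 models, and that's the point: they trade speed for stability, which is why they get paired with a larger B.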
Number of Models
It’s not hard to imagine that a very large model count, B, will lead to over-fitting. Infinitely-many models, each trained to reduce the prediction error of the last, will collapse onto perfect interpolation of the training data at the cost of generalization.
This is typically a value that we arrive at via cross-validation.
Note: There is some tricky interplay between this and our lambda value: a small shrinkage parameter will require a lot of aggregated models to converge appropriately.
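A minimal cross-validation sketch for picking B (assuming scikit-learn, its GradientBoostingRegressor as the boosted tree model, and a toy dataset) might look like:

from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_friedman1(n_samples=500, random_state=0)

# search over the number of boosted models, holding shrinkage fixed and small
search = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.01, max_depth=2, random_state=0),
    param_grid={"n_estimators": [100, 500, 1000, 2000]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_toy, y_toy)
print(search.best_params_)  # the B that cross-validation prefers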