# Boosted Models

The concept of Boosting a given model extends, intuitively, from two other Data Science concepts:

- Principal Component Analysis, which aims to find an axis that explains the most variation, re-orient your original data, then find a new axis that explains what the first couldn't, etc.
- Bootstrapping, which involves maximizing the use of a dataset by repeatedly sampling with replacement and aggregating the models fit to each sample

Essentially, we want to make a simple model that predicts on `y`. Then we subtract these predictions from `y` to get a bunch of residuals that we missed. Then we train a second model on these residuals. We subtract what we learned from *these* residuals and rinse-repeat.

At each step, we tack the model we just trained onto the end of a big ol’ list of models.

When we’re done, predictions on new `X` values will be fed into each of these models, and our output will be weighted by a constant `lambda` for each model.
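
In symbols (with $\hat{f}^b$ as the model trained at step $b$ and $\lambda$ as that constant), the final boosted prediction is just the shrunken sum over all `B` models:

$$
\hat{f}(x) = \sum_{b=1}^{B} \lambda \, \hat{f}^{\,b}(x)
$$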

## With Tree-Based Models

Model Boosting was introduced to me in the context of Decision Trees in Chapter 8 of ISL, but the idea extends neatly enough to other models.

The algorithm that they outline looks roughly like the following (here sketched with scikit-learn trees):

**Note**: We start off making a simple Decision Tree to predict on `y`, then we predict on updated residuals every step thereafter, so we set `r = y` at the start.

```
import numpy as np
from sklearn.tree import DecisionTreeRegressor

trees = []
r = y  # residuals start out as the original y values
for b in range(B):
    # fit a small tree with d terminal nodes to the current residuals
    f_b = DecisionTreeRegressor(max_leaf_nodes=d).fit(X, r)
    trees.append(f_b)
    # shrink the new tree's contribution and update the residuals
    r = r - lambda_ * f_b.predict(X)

# the boosted fit is the shrunken sum of every tree's predictions
f_x = lambda_ * np.sum([tree.predict(X) for tree in trees], axis=0)
```
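
To predict for brand-new observations (say, a held-out matrix `X_new`; the name is just for illustration), we run the same shrunken sum over the stored trees:

```
y_new_pred = lambda_ * np.sum([tree.predict(X_new) for tree in trees], axis=0)
```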

### Hyperparameters

#### Shrinkage

The *shrinkage parameter* only makes sense to have values between `0` and `1`, but in practice we’ll use something closer to `0.01` or `0.001`. Its whole job is to slow down the learning process from model to model.

Look at the residual update step above and consider `lambda` values small and large:

- If small, then the residuals are only a *little* different than they were in the last step and we learn *almost* the same thing
- If large, then we’re basically sculpting with dynamite, haphazardly over-correcting at each step
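
As a toy illustration with made-up numbers (a single residual of `10.0`, and a step model that happens to predict it exactly):

```
for lam in (0.001, 0.01, 1.0):
    updated = 10.0 - lam * 10.0  # the residual update from the loop above
    print(lam, updated)          # 0.001 -> 9.99, 0.01 -> 9.9, 1.0 -> 0.0
```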

#### Number of Models

It’s not hard to imagine that a very large *Model Count*, `B`, will lead to over-fitting. Infinitely many models, each trained to reduce the prediction error of the last, will collapse on perfect interpolation of the training data at the cost of generalization.

This is typically a value that we arrive at via cross-validation.

**Note**: There is some tricky interplay between this and our `lambda` value: a small shrinkage parameter will require a lot of aggregated models to arrive at the appropriate convergence.
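
A sketch of how that cross-validation might look, using scikit-learn's `GradientBoostingRegressor` (its `n_estimators` plays the role of `B` and `learning_rate` the role of `lambda`; `X` and `y` are assumed to already be defined):

```
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# scan a range of model counts at a fixed, small shrinkage value
search = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.01, max_depth=2),
    param_grid={"n_estimators": [100, 500, 1000, 5000]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)  # the B that cross-validation prefers at this lambda
```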