# Partial Least Squares

As mentioned in our notebook on Principal Component Analysis, the chief goal of a dimension reduction technique is to express the observations of our `p`-dimensional dataset, `X`, as a linear combination of `m`-dimensional vectors (`m < p`), `Z`, using a mapping optimized “to explain the most variation in our data.”

But whereas PCA is an unsupervised method that involves figuring out how to explain variation in `X`, the Partial Least Squares method introduces a *supervised* alternative and considers our target, `Y`, in the dimension reduction.

Or to quote ISL:

> Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.

## Intuition

Recall that the general idea of PCA is to:

- Find an axis that explains the most variation in `X`
- Re-orient our data relative to this new axis
- Repeat until we reach some desired “explained variation” threshold
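
To make that last bullet concrete, here's a minimal scikit-learn sketch on made-up data that keeps adding components until a (totally arbitrary) 90% explained-variation threshold is met; the data and the threshold are assumptions for illustration only.

```
# Minimal sketch of the PCA recipe above -- synthetic data, illustrative only
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 observations, p = 5 features

pca = PCA().fit(X)

# Keep adding axes until we cross a desired "explained variation" threshold
cumulative = np.cumsum(pca.explained_variance_ratio_)
m = int(np.argmax(cumulative >= 0.90)) + 1    # smallest m explaining >= 90%
print(m, cumulative.round(2))
```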

PLS follows a similar approach, but instead begins with what is essentially a Least Squares regression on `Y`.

After we normalize our data, the algorithm can be described as the following (borrowed from this Stanford lecture):

```
from IPython.display import Image

# Display the PLS algorithm summary from the Stanford lecture slides
Image('images/pls_alg.PNG')
```

Decrypting this a bit, we start by regressing `y` onto each predictor `X_j` separately to get our coefficients `theta_j1`.

We use these to transform the `X_j` into `y_hat = Z_1`, our prediction vector, by taking a linear combination. As with any regression, we expect to see a bunch of residual prediction errors between `y_hat` and `y`.
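
To make this concrete, here's a small `numpy` sketch of that first pass on toy, standardized data. The names are mine, and I regress `y` onto `Z_1` to get `y_hat` rather than using `Z_1` directly, which is how I read the lecture's step.

```
# Hypothetical numpy sketch of the first PLS direction (toy data)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=100)

# Normalize, as assumed before the algorithm starts
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

# theta_j1: coefficient from the simple regression of y onto each X_j
theta_1 = X.T @ y / (X ** 2).sum(axis=0)

# Z_1: the linear combination of the X_j weighted by those coefficients
Z_1 = X @ theta_1

# y_hat: least-squares fit of y on Z_1; what's left over is our residual error
y_hat = Z_1 * (Z_1 @ y) / (Z_1 @ Z_1)
resid_y = y - y_hat
```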

Flipping this, `X_j^(2)` will represent the “missing information” left over when we try to predict our original `X_j` values using our new mapping `Z_1`.
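
Picking up the variables from the sketch above, those residuals fall out of regressing each column of `X` on `Z_1` and keeping the leftovers (a step often called deflation):

```
# X^(2): what Z_1 *cannot* reproduce of each X_j (column-wise residuals)
X_2 = X - np.outer(Z_1, Z_1 @ X) / (Z_1 @ Z_1)
```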

At this point, we want to continue in the PCA fashion of “find the axis that explains *the next-most variance*.” If we use these “missing information residuals”, `X_j^(2)`, to try and predict `y`, we have a new set of coefficients `theta_j2` that combine with `X_j^(2)` to make our second mapping, `Z_2`.

We continue in this fashion, using the residuals of “missing information” to mine more axes.
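
Pulling those pieces together, a rough (and decidedly not production-grade) sketch of the loop might look like this; the function name and the fixed number of passes, `m`, are my own choices.

```
# Rough sketch of the repeated PLS step: each pass regresses y on the current
# residual columns, builds a new direction, then deflates X for the next pass
import numpy as np

def pls_scores(X, y, m):
    X_r = X.copy()
    Z = []
    for _ in range(m):
        theta = X_r.T @ y / (X_r ** 2).sum(axis=0)    # per-column regressions on y
        z = X_r @ theta                               # next direction, Z_k
        Z.append(z)
        X_r = X_r - np.outer(z, z @ X_r) / (z @ z)    # keep only the "missing information"
    return np.column_stack(Z)
```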

## Multivariate `Y`

One important feature of PLS worth mentioning is that it not only allows us to include `Y` in our dimension reduction scheme, it also neatly extends to a *multivariate* `Y`.

In a sense, you can conceptualize this as doing a sort of PCA on both `X` and `Y`, then searching for the latent structure of `X` that best explains the latent structure of `Y`.
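
One hedged way to see this in code is scikit-learn's `PLSCanonical` on made-up data with a shared latent driver; `T` and `U` below are the `X` and `Y` scores it learns, and each pair of columns should track each other closely.

```
# Sketch of multivariate-Y PLS with scikit-learn (synthetic data)
import numpy as np
from sklearn.cross_decomposition import PLSCanonical

rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 2))                       # shared hidden structure
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(200, 6))
Y = latent @ rng.normal(size=(2, 3)) + 0.3 * rng.normal(size=(200, 3))

pls = PLSCanonical(n_components=2).fit(X, Y)
T, U = pls.transform(X, Y)                               # X scores and Y scores

# Each column pair (T[:, k], U[:, k]) should track each other
for k in range(2):
    print(np.corrcoef(T[:, k], U[:, k])[0, 1])
```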

This video does a good job of highlighting the idea visually.

```
Image('images/pls_multivariate.PNG')
```

To put this another way, if we can find some representation `U` of `Y` that explains most of the variation in our target space, then `T`, our representation of `X`, will be optimized to maximize the covariance between `U` and `T`, as described in this video.

```
Image('images/pls_multivariate_cross.PNG')
```

Finally, the first 2 minutes of this video do an *exceptional job* of illustrating the incremental, simultaneous fitting of `T` and `U`, and should be watched in excess of 100 times, IMO.