Partial Least Squares

As mentioned in our notebook on Principal Component Analysis, the chief goal of a dimension reduction technique is to express the observations of our p-dimensional dataset, X, as linear combinations forming m-dimensional vectors, Z (m < p), using a mapping optimized "to explain the most variation in our data."

But whereas PCA is an unsupervised method that only tries to explain variation in X, Partial Least Squares is a supervised alternative that also considers our target, Y, in the dimension reduction.

Or to quote ISL:

Roughly speaking, the PLS approach attempts to find directions that help explain both the response and the predictors.


Recall that the general idea of PCA is to:

  • Find an axis that explains the most variation in X
  • Re-orient our data relative to this new axis
  • Repeat until we reach some desired “explained variation” threshold
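As a quick refresher, those three steps can be sketched in a few lines of numpy (the variable names here are my own, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))    # toy p = 5 dataset
Xc = X - X.mean(axis=0)          # center before finding axes

# The axes that explain the most variation are the right singular
# vectors of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

Z = Xc @ Vt.T                    # re-orient the data onto the new axes
explained = s**2 / np.sum(s**2)  # per-axis share of the variation
```

Keeping only the first m columns of Z whose cumulative `explained` passes our threshold gives the reduced representation.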

PLS follows a similar approach, but instead begins with what is essentially a Least Squares regression on Y.

After we normalize our data, the algorithm can be described as the following (borrowed from this Stanford lecture):

[Image: the PLS algorithm, step by step, from the Stanford lecture]

Decrypting this a bit, we start by regressing y on each predictor X_j to get our coefficients theta_j1.

We use these to transform the X_j into y_hat = Z_1, our prediction vector, by taking a linear combination. As with any regression, we expect to see a bunch of residual prediction errors between y_hat and y.

Flipping this, X_j^(2) will represent the "missing information" left over when we try to predict our original X_j values using our new mapping Z_1.

At this point, we want to continue in the PCA fashion of "find the axis that explains the next-most variance." If we use these "missing information" residuals, X_j^(2), to try to predict y, we get a new set of coefficients theta_j2 that combine with X_j^(2) to make our second mapping, Z_2.

We continue in this fashion, using the residuals of "missing information" to mine more axes.
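The loop just described can be sketched directly in numpy. This is an illustrative implementation of the idea (the function name `pls_directions` is my own, not a library call), assuming X has been standardized and y centered:

```python
import numpy as np

def pls_directions(X, y, m):
    """Build m PLS mappings Z_1..Z_m, as described above."""
    Xr = X.copy()                        # X^(1): working copy we deflate
    Z = np.empty((X.shape[0], m))
    for k in range(m):
        # theta_jk: simple least-squares coefficient of y on each X_j
        theta = Xr.T @ y / np.sum(Xr**2, axis=0)
        Z[:, k] = Xr @ theta             # Z_k: linear combination of the X_j
        # Deflate: keep only the "missing information" residuals X_j^(k+1)
        # left after predicting each X_j from Z_k
        proj = np.outer(Z[:, k], Z[:, k] @ Xr) / (Z[:, k] @ Z[:, k])
        Xr = Xr - proj
    return Z
```

Because each round works only with the residuals of the previous one, successive mappings Z_k come out mutually orthogonal, just as PCA's axes do.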

Multivariate Y

One important feature of PLS worth mentioning is that it not only includes Y in our dimension reduction scheme, it also neatly extends to multivariate Y.

In a sense, you can conceptualize this as doing a sort of PCA on both X and Y, then searching for the latent structure of X that best explains the latent structure of Y.

This video does a good job of highlighting the idea visually.



To put this another way: if we can find some representation U of Y that explains most of the variation in our target space, then T, our representation of X, will be optimized to maximize the correlation between U and T, as described in this video.



Finally, the first 2 minutes of this video do an exceptional job illustrating the incremental, simultaneous fitting of T and U and should be watched in excess of 100 times, IMO.