Often described as “the categorical analogue to PCA”, Correspondence Analysis is a dimension-reduction technique that describes the relationship and distribution between two categorical variables.
Reading papers on the topic proved to be needlessly dense and uninformative– my lightbulb moment on this topic came when I stumbled across Francois Husson’s fantastic tutorial series on YouTube. 100% worth the watch and is where I’ll pull many of my images from.
For starters, this analysis assumes that our data is prepared as a cross-tabulation of two categorical variables. In the illustration below, there are
L records, each with two categorical variables. This leads to a cross-tab matrix with all
V_1 variables as rows and
V_2 variables as columns.
from IPython.display import Image Image('images/ca_data.PNG')
More concretely, the dataset that Francois uses looks at the distribution of Nobel Prize wins by Category and Country
import pandas as pd df = pd.read_csv('nobel_data.csv', index_col='Country') df
Unpacking further, the explanation that finally stuck for me was deeply rooted in conditional probability. Here, the row, column, and total sums play a crucial role in computation. This allows us to start expressing conditional probabilities, predicated on overall counts (for this, he uses notation that I’ve never seen before, as I’ve highlighted below)
Ultimately, the core mechanic of Correspondence Analysis is an examination of how much our data deviates from an assumption of complete independence. This is a direct extension of the Chi Squared Test.
In the context of a cross-tab, if our variables all had independence, we’d assume that the values in our rows would be distributed consistently with the proportion of the totals, and similarly for the columns.
These two graphics do a fantastic job representing this notion. For example, we can see that
Italy has a
Literature proportion way off of the mean row value, and
Economics prizes are disproportionally won by people from the
Bringing it home, we can then plot all of the conditional probabilities into a point cloud.
G represents the center of gravity for the cloud.
G has a neater intuitive meaning. In the case that our data is all independent, all of the points will just sink to the center of gravity and it will be point, not a point cloud.
The fact that we instead represent all of the conditional probabilities with a cloud, however, means that there is some measure of deviation from the origin,
G. We call this the interia of the point cloud.
To restate: Intertia measures the deviation from independence.
Finally, the whole purpose of Correspondence Analysis is to get our data to this point and then find the orthogonal projections that explain the most inertia.
The two point-clouds are generated from the same dataset, so it stands to reason that there should be some way to translate from one to the other, yeah?
For this, Correspondence Analysis borrows a $2 word from astrophysics, called barycenter, which is basically the center of mass between arbitrarily-many objects. Per wikipedia, in the
n=2 case, where the red
+ is the barycenter:
As it relates to our usecase, the barycenter of a row on a given axis can be thought of as a weighted average of the column representation.
For a given derived axis
s, the columnar representation
G(j) is weighted by the sum of all conditional probabilities across those columns, then finally scaled by the
lambda value for that axis (more on this below).
This is a bit of a mouthful to take in, but this allows us two nice properties:
- The reverse is true– we can swap the
Fterms and this still works
- Because of this, when we the rows and columns in the new space defined by the various
saxes, the rows are closest to columns that it’s most associated and vice-versa.
Relationship to SVD
As mentioned above, the literature on Correspondence Analysis is (in my estimation) needlessly dense. For example, this pdf begins
CA is based on fairly straightforward, classical results in matrix theory
and makes a real hurry of throwing a lot of Greek at you. I might come back and revise this notebook, but for the time being, I think Francois’ explanation is all the know-how we’re going to need on this matter, save for “how do we actually find the axes?”
For this, first refer to my notes on SVD.
Then grabbing a small chunk of the equations they throw at you, I want to highlight two things:
We get the following for free:
nrepresent the crosstab matrix and total counts
- These are used to build a simple matrix of probabilities
crepresent the row and column proportions, which both add to
D_care the diagonal matrices of the row and column spaces
And so it looks like equations
A.8-10 are the nuts and bolts of representing our newfound axes
s. Moreover, it looks like we can get this if we can get the values in
A.6-7, which in turn need values
V, which are typical results of doing SVD, in
The trick to all of is answering “SVD on what?” For which, we can cleverly construct a matrix,
S, of standardized residuals. Once that sentence makes sense to you, you can close the pdf. Residuals should make you think of error, and error should mean difference between predicted and observed, and as mentioned “predicted” actually means “the assumption that everything just follows the population distribution,
Armed with that intuition,
S becomes the key used to unlock the rest. Of course, there’s a handy Python package to keep all of this math straight for us.
On Our Dataset
So running CA on the dataset above, Francois plots out the points from the cross-tab, relative to the new axes he found.
The first thing he points is distance to center of the scatter plot.
Broadly-speaking, this is a good proxy for “how dissimilar to ‘everything follows the same distribution’”.
UK is basically on the axis, and we can see below that the distribution of awards won looks VERY close to the sample distribution.
Italy, on the other hand, has won dramatically more of the green aand is thus far-flung from the origin.
Economics prizes seem disproportionally won by the
US (orange), so it’s pretty far off-origin. However, the
US does seem to win a respectable proportion of the prizes in each category.
A better example to look at is
Italy (teal). On average, they make up a tiny speck of overall awards, but because they’re a good 20% of the
Literature prize winners, their point is the furthest-right of the whole shebang.
Eigenvalues and Explained Inertia
So after we find our othogonal axes that best-span the point cloud, the eigenvalues that we find represent the explained Inertia of a given axis
Then, because our axes are orthogonal, we can add the eigenvalues together and divide by the total inertia to get a look at how well an axis captures the overall spread of our point-cloud
An eigenvalue equaling
1 means that it accounts for a perfect separation between two blocks in the data. As a toy example, Francois cooks up a small table of taste profiles. The rows are whether a tasted sample was sweet, sour, or bitter. The columns are how they were percieved by tasters.
The first CA-generated axis helps separate our data into two distinct groups, “Sweet” and “Maybe sour or bitter”, with perfect accuracy.
On the other hand, let’s look at two slightly-adjusted version of the dataset. The second axis should help us determine the difference between sour and bitter.
The data on the left is better-separated than the one on the right, and thus the
Axis 2 eigenvalue is higher and the points scattered are more spread apart.
Total Explained Inertia
This intuition in mind, when we revisit the Nobel dataset, two things stand out:
- The explained inertia for the first axis is far less than 1, so there’s no straight-forward cut in the data
- The total explained inertia is much less than
5(a heuristic value, I suppose), so the data isn’t as well-separated as we might have thought at first glance.
Correspondence Analysis is made pretty simple by the
Continuing with our original dataset, we’ll follow a workflow similar to what we’d do in
from prince import CA ca = CA(n_components=2) ca.fit(df)
CA(benzecri=False, check_input=True, copy=True, engine='auto', n_components=2, n_iter=10, random_state=None)
we’ve deconstructed the countries into two principal components and can see a good deal of separation.
%pylab inline display(ca.row_coordinates(df)) ca.row_coordinates(df).plot.scatter(0, 1);
Populating the interactive namespace from numpy and matplotlib
similarly, we can look at how the columns are mapped in this new space.
display(ca.column_coordinates(df)) ca.column_coordinates(df).plot.scatter(0, 1);
But even cooler, it’s super easy to plot the two together via the expressive
plot_coordinates() function. This looks like the images we’ve been looking at all along, just flipped across a couple axes.
ca.plot_coordinates(df, figsize=(8, 8));
Finally, we can easily inspect the various metrics that we want to pay attention to.