Like Correspondence Analysis, but with Multiple
An extension of our notebook on Correspondence Analysis, Multiple Correspondence Analysis allows us to extend this methodology beyond a cross-tab of two different variables into arbitrarily-many.
For instance, a pretty canonical dataset used to describe this method (see this paper) is a taste profiling of various wines from different experts.
Like before, we start out with a dataset with
I individuals as rows and
J columns as features. However in this case, there are multiple responses per feature. A few things to point out in particular:
- Two different experts have the same feature. Expert 2 and 3 both have a
woodycolumn. MCA treats these columns separately, but if there’s an underlying relationship between the two (as you might expect), the baricenters of the columns will wind up near one another
- There are also a couple features (
vanillin) that have more than a binary label
- We do a
.get_dummies()call on each feature, however we don’t want to use
drop_first=True, like we would when fitting a model using dummy variables
from IPython.display import Image Image('images/mca_data.PNG')
Also worth noting that the table above is an intermediate table between “tall table with a column for each variable (not separated)” and “cross-tab used to do SVD”.
Ultimately, we can do the similar “plot rows” or “plot columns” representation that we could do with CA, only this time with more levels.
A Bigger Example
In it, he explores a much, much larger dataset (
n=8403 >> 6) encoding various lifestyle characteristics for a given population.
He explores similar concepts as CA, including Inertia, weighting each record by the individual values, etc, etc.
But it gets particularly interesting when he fits to two axes and scatters each person’s representation in this 2D space, *colored by their value for one feature*.
This was still completely possible in CA, we just didn’t have enough data for this pattern to show so starkly. Instead, what we got was a look at the barycenters for the given columns, which he then plots overtop of the scatter in two colored triangles.
Indeed doing this for all of the barycenters, he finds that a pattern emerges.
Recall that all of his data was organized not as a single
0/1 column, but as two
0/1 columns, eg:
cinema 0 1 1
cinema_yes cinema_no 0 1 1 0 1 0
examining, he found that barycenters in black corresponded to the
_nos and similarly, the reds to
You might be thinking “Well what does that have to do with the axis?” and that’s where I’ll remind you that the C in MCA stands for “correspondence.” Or to put it another way, looking at this representation, we learn that people not doing one thing highly corresponds with them not doing another thing. Thus, we can think of the first axis as an apathy continuum
This is made even more obvious when he picks out four individuals at the fringe of this point cloud.
Similarly, he looks at points on the opposite corners to see that the second axis is loosely associated with how youthful the hobbies are
To show how
prince makes short work of
MCA, we’ll borrow the famous Titanic dataset from
seaborn, as it’s got a good amount of categorical variables.
%pylab inline import pandas as pd import prince import seaborn as sns
Populating the interactive namespace from numpy and matplotlib
We load it up and drop any null values for
age– we want to use this in a second.
data = sns.load_dataset('titanic').dropna(subset=['age']) data.head()
Playing around with various bins, we find a few cuts that gives us a good separation in our data.
data['age_bin'] = pd.cut(data['age'], [0, 18, 30, 45, 999]) data['age_bin'].value_counts().sort_index()
(0, 18] 139 (18, 30] 270 (30, 45] 202 (45, 999] 103 Name: age_bin, dtype: int64
And so we’ll use this created feature, as well as a few other categorical features, to see if there’s any underlying structure to our data.
df = data[['sex', 'embarked', 'class', 'age_bin']] mca = prince.MCA() mca.fit(df)
MCA(benzecri=False, check_input=True, copy=True, engine='auto', n_components=2, n_iter=10, random_state=None)
Like with CA, there’s a nice
.plot_coordinates() method that we can use to examine the structure of our loadings.
The first thing that stands out to me is the combination of green and blue dots for
Component 0– The continuum from low to high class passenger seems to correlate with whether or not that person was male.
Component 1, I was intrigued to see that the city that a person embarked in had such a spread (
S, in descending order). That is, until I saw that
age_bin did as well.
mca.plot_coordinates(df, row_points_alpha=.2, figsize=(10, 10), show_column_labels=True );
Indeed, looking at the distribution of ages by embark location from our original dataset, we can see that
Q has the lowest median age, then
S, which is consistent with our ordering in the
import seaborn as sns sns.boxplot(data=data, x='embarked', y='age');