Chi Squared Goodness of Fit

A Test for Comparing Discrete Distributions

I found Sal Khan’s explanation to be as straightforward as it was useful.

Essentially, the Chi-Squared test is used when you have:

  • k discrete classes
  • Some idea of their expected distribution
  • A number of observed values that fall into said classes

In the case of the video, we want to check if a shop owner’s approximation of visitors-by-day as a percentage of their week is accurate or not.

from IPython.display import Image
Image('../images/chi_sq_ex.png')

To solve a problem like this, we first multiply the total count of observed visitors in a given week (200) by each expected percentage above, giving the expected count for each class:

20 20 30 40 60 30

Then, similar to how we might standardize a single normal variable, for each class we find the squared distance from observed to expected, normalized by expected.

The sum of these terms is our test statistic, which approximately follows a Chi-Squared distribution. Its degrees of freedom equal the number of classes minus 1, so 5 here.
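As a rough sketch of that calculation, the snippet below computes the statistic by hand with NumPy. The observed counts are the ones used in the Python section of this post, the expected counts are the rescaled values listed above, and the variable names are only illustrative.

```python
import numpy as np

# Observed visitors per day (the same values used later in this post)
observed = np.array([30, 14, 34, 45, 57, 20])
# Expected visitors per day: 200 total, rescaled by the owner's percentages
expected = np.array([20, 20, 30, 40, 60, 30])

# Squared distance from observed to expected, normalized by expected
contributions = (observed - expected) ** 2 / expected

chi_sq_stat = contributions.sum()
dof = len(observed) - 1  # number of classes minus 1

print(f"{chi_sq_stat:.2f} with {dof} degrees of freedom")
# 11.44 with 5 degrees of freedom
```

Printing contributions on its own shows which days drive the statistic, which can be handy when a test comes out borderline.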

Image('../images/chi_sq_ex2.png')

Then, like any other test, we calculate the statistic and let a computer tell us how that translates into a p-value.
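One way to do that last step, sketched here with scipy.stats.chi2: the p-value is the upper-tail area of the Chi-Squared distribution beyond the statistic. The helper name chi_sq_p_value is hypothetical (it is not from the video), and the statistic of 11.44 is roughly what the shop example works out to.

```python
from scipy.stats import chi2


def chi_sq_p_value(statistic, n_classes):
    """Upper-tail probability of a Chi-Squared statistic (hypothetical helper)."""
    dof = n_classes - 1
    # sf is the survival function, 1 - cdf: the area to the right of the statistic
    return chi2.sf(statistic, df=dof)


# Six classes (days of the week the shop is open), so 5 degrees of freedom
p_value = chi_sq_p_value(11.44, n_classes=6)
print(f"{p_value:.3f}")
```

Using sf rather than 1 - cdf is a small design choice: it tends to be more accurate when the tail area is tiny.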

The Distribution Itself

Borrowing jbstatistics’ explanation:

  • If Z is standard normal, then Z-squared has a Chi-squared distribution, with one degree of freedom
  • If Z1, ..., Zk are independent standard normals, then Z1^2 + ... + Zk^2 has a Chi-squared distribution with k degrees of freedom
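A small simulation can make this concrete. The sketch below squares and sums k independent standard normal draws, then compares the empirical quantiles with those of scipy.stats.chi2. The seed, the sample size, and the choice of k = 4 are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
k = 4                           # number of independent standard normals
n_draws = 200_000

# Each row holds k standard normal draws; square them and sum across the row
z = rng.standard_normal(size=(n_draws, k))
sum_of_squares = (z ** 2).sum(axis=1)

# Compare simulated quantiles with the theoretical Chi-Squared(k) quantiles
probs = np.array([0.10, 0.50, 0.90])
simulated = np.quantile(sum_of_squares, probs)
theoretical = chi2.ppf(probs, df=k)

for p, s, t in zip(probs, simulated, theoretical):
    print(f"q{p:.2f}: simulated={s:.3f}, theoretical={t:.3f}")
```

The two columns should agree to within sampling noise, and changing k shifts both together.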

These distributions take on different shapes, depending on how many Z terms you mix in (and, by extension, the value of k).

Per Wikipedia:

Image('../images/chi_sq_dist.png')

Note: Each curve has:

  • Mean: degrees of freedom
  • Variance: twice the degrees of freedom
  • Mode: degrees of freedom - 2 (or 0 when there are fewer than 2 degrees of freedom)
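As a quick numerical check of where each curve is centered and where it peaks, the sketch below reads the mean from scipy.stats.chi2 and locates the mode on a grid. The grid and the values of k are arbitrary choices, all above 2 so that the peak sits away from zero.

```python
import numpy as np
from scipy.stats import chi2

# Grid on which to locate each curve's peak (step of roughly 0.001)
x = np.linspace(0.01, 40, 40_000)

for k in (3, 5, 10):
    mean = chi2.mean(k)
    # Approximate the mode as the grid point where the density is largest
    mode = x[np.argmax(chi2.pdf(x, df=k))]
    print(f"k={k}: mean={mean:.1f}, mode~{mode:.2f}")
```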

Chi-Squared Tests in Python

Rehashing Sal’s example above, we’ll start with an array of the expected proportions, which sum to 1.

import numpy as np

exp_prop = np.array([.10, .10, .15, .20, .30, .15])

Then we’ll create an array of our observed values.

y_obs = np.array([30, 14, 34, 45, 57, 20])

Then we’ll create an array, y_exp, that gives us the class count we’d expect to see, given the total count of observed people.

y_exp = y_obs.sum() * exp_prop

Finally, performing the test is as easy as invoking a clean scipy.stats function. Recent SciPy versions raise an error when the totals of f_obs and f_exp differ, which is why the proportions need to sum to 1.

from scipy.stats import chisquare

chisquare(f_obs=y_obs, f_exp=y_exp)

This returns a Power_divergenceResult with a statistic of about 11.44 and a p-value of about 0.043.
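If a fixed significance level is preferred over reading the p-value directly, the same decision can be framed with a critical value. A minimal sketch, assuming a 5% level (the alpha value and the variable names are assumptions for illustration):

```python
from scipy.stats import chi2

alpha = 0.05   # assumed significance level
n_classes = 6  # six classes in the example
dof = n_classes - 1

# ppf is the inverse cdf: the value with `alpha` of the area to its right
critical_value = chi2.ppf(1 - alpha, df=dof)
print(f"{critical_value:.2f}")  # 11.07
```

A statistic larger than this critical value corresponds to a p-value below alpha, and a smaller one to a p-value above it.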