Chi-square test for equality of distributions
(Chi-square test of independence)
{From the Institute of Phonetic Sciences (IFA):
http://www.fon.hum.uva.nl/}
Characteristics:
This is the most widely used test on nominal data. Although the observations
(i.e., the counts) follow a binomial or multinomial distribution, it is
impractical to calculate the levels of significance directly. A binomial
distribution can be approximated by a normal distribution if the expected number
of observations is large enough. This approximation is used to calculate the
"variance" of the observed distribution (the test parameter X^2). Under H0 this
"variance" approximately follows a Chi-square distribution.
H0:
All samples have the same frequency distribution.
Assumptions:
None really, except that the observations must be independent.
Scale:
Nominal
Procedure:
Calculate the expected number of observations, Eij, under H0:
Eij = Ni * Oj / N, in which Oj is the total number of observations
in category j (j from 1 to J, i.e., the column totals), Ni is the size
of sample i (i from 1 to I, i.e., the row totals), and N is the total
number of observations.
The test parameter is X^2 = Sum over all cells of ( Oij - Eij )^2 / Eij,
in which Oij is the observed count in cell (i, j). Under H0, X^2
approximately follows a Chi-square distribution with (J-1)*(I-1)
Degrees of Freedom.
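The textbook procedure above can be sketched in a few lines of Python. This is
only an illustration: the function names and the example table are made up for
this sketch, the table is assumed to be a list of rows (one row per sample, one
column per category), and all row and column totals are assumed to be positive.

```python
def expected_counts(observed):
    """Return Eij = Ni * Oj / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]        # Ni, the sample sizes
    col_totals = [sum(col) for col in zip(*observed)]  # Oj, the category totals
    n = sum(row_totals)                                # N, all observations
    return [[ni * oj / n for oj in col_totals] for ni in row_totals]


def chi_square(observed):
    """Return (X^2, Degrees of Freedom) without continuity correction."""
    expected = expected_counts(observed)
    x2 = sum((o - e) ** 2 / e
             for o_row, e_row in zip(observed, expected)
             for o, e in zip(o_row, e_row))
    dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (I-1)*(J-1)
    return x2, dof


# Hypothetical example: two samples, two categories.
table = [[10, 20],
         [20, 10]]
x2, dof = chi_square(table)
print(x2, dof)
```

In this example every expected count is 15, so each of the four cells
contributes (5)^2 / 15 to X^2.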
Although the above procedure is the one generally found in textbooks, it is not
the best one. It omits the continuity correction that is needed because a
discrete (multinomial) distribution is approximated with a continuous (X^2) one.
A better test parameter is:
X^2 = Sum over all cells ( |Oij - Eij| - 0.5 )^2 / Eij
(|a-b| indicates the absolute value of the difference). This is the
approach actually used to calculate the X^2 value in this example.
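A minimal sketch of the continuity-corrected test parameter, assuming the table
is a list of rows with positive row and column totals. The function name and the
example table are illustrative. The sketch applies the formula exactly as
written above; it does not limit the correction when |Oij - Eij| is smaller
than 0.5, which some implementations do.

```python
def chi_square_corrected(observed):
    """Return (X^2, Degrees of Freedom) with the 0.5 continuity correction."""
    row_totals = [sum(row) for row in observed]        # Ni
    col_totals = [sum(col) for col in zip(*observed)]  # Oj
    n = sum(row_totals)                                # N
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n      # Eij under H0
            x2 += (abs(o - e) - 0.5) ** 2 / e          # corrected cell term
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return x2, dof


# Hypothetical example: two samples, two categories.
table = [[10, 20],
         [20, 10]]
x2_corrected, dof = chi_square_corrected(table)
print(x2_corrected, dof)
```

For this table each cell has |Oij - Eij| = 5 and Eij = 15, so the corrected
value is smaller than the uncorrected one, as expected from the formula.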
Level of Significance:
Use a table to look up the level of significance associated with
X^2 and the Degrees of Freedom.
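When no table is at hand, the level of significance (the upper-tail probability
of the Chi-square distribution) can also be computed directly. The sketch below
uses only the Python standard library and assumes a whole number of Degrees of
Freedom of at least 1. The function name chi_square_sf is an illustrative
choice, not an existing library call.

```python
import math


def chi_square_sf(x2, dof):
    """Upper-tail probability P(Chi-square >= x2) for integer dof >= 1."""
    if x2 <= 0:
        return 1.0
    y = x2 / 2.0
    if dof % 2 == 0:
        a = 1.0
        q = math.exp(-y)                  # exact tail for 2 Degrees of Freedom
    else:
        a = 0.5
        q = math.erfc(math.sqrt(y))       # exact tail for 1 Degree of Freedom
    # Step up two Degrees of Freedom at a time using
    # Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1).
    while a < dof / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


# 3.841 is the familiar 5% critical value for 1 Degree of Freedom.
print(chi_square_sf(3.841, 1))
```

The result is compared with the chosen level of significance (for example
0.05); H0 is rejected if the probability is smaller.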
Approximation:
If the Degrees of Freedom > 30, the distribution of
z = {(X^2/DoF)^(1/3) - (1 - 2/(9*DoF))}/SQRT(2/(9*DoF))
can be approximated by a Standard Normal Distribution.
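The approximation above (commonly known as the Wilson-Hilferty transformation)
can be sketched as follows. The function names are illustrative, and the
example value 55.758 is the tabulated 5% critical value for 40 Degrees of
Freedom, used here only as a plausibility check.

```python
import math


def wilson_hilferty_z(x2, dof):
    """Convert X^2 with dof Degrees of Freedom to an approximate z score."""
    c = 2.0 / (9.0 * dof)
    return ((x2 / dof) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)


def normal_upper_tail(z):
    """Upper-tail probability of the Standard Normal Distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


z = wilson_hilferty_z(55.758, 40)
p = normal_upper_tail(z)
print(z, p)
```

For this input z should come out close to 1.645, the one-sided 5% point of the
Standard Normal Distribution.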
Remarks:
This approach is an approximation, even with the continuity correction. The
Chi-square distribution can only be used if all expected values, i.e., all
Eij, are larger than five. If this does not hold, combine the rarer
categories with larger ones.
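A small sketch of this check, assuming the table is a list of rows with positive
totals. The helper names, the threshold argument, and the example table are
illustrative. Which categories can sensibly be combined depends on the data, so
the merge step is left to the user.

```python
def expected_counts(observed):
    """Return Eij = Ni * Oj / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[ni * oj / n for oj in col_totals] for ni in row_totals]


def small_expected_cells(observed, threshold=5.0):
    """List (i, j, Eij) for every cell whose Eij is not larger than threshold."""
    return [(i, j, e)
            for i, row in enumerate(expected_counts(observed))
            for j, e in enumerate(row)
            if e <= threshold]


def merge_columns(observed, j, k):
    """Return a new table in which category k is added into category j."""
    target = j if j < k else j - 1          # index of j once k is removed
    merged = []
    for row in observed:
        new_row = [v for idx, v in enumerate(row) if idx != k]
        new_row[target] += row[k]
        merged.append(new_row)
    return merged


# Hypothetical example: the third category is rare in both samples.
table = [[20, 15, 2],
         [25, 18, 3]]
flagged = small_expected_cells(table)
combined = merge_columns(table, 1, 2)       # fold category 2 into category 1
print(flagged)
print(combined, small_expected_cells(combined))
```

After the merge the table has one category fewer, so the Degrees of Freedom
must be recomputed from the new table.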