Chapter 6 — Chi-Square Test and ANOVA
The previous chapter introduced parametric tests for means, proportions, and variances. This chapter covers two specific tests that recur in research: the chi-square ($\chi^2$) test, which extends hypothesis testing to categorical data and to comparisons of frequencies; and analysis of variance (ANOVA), which extends hypothesis testing of means to comparisons of three or more groups simultaneously. Both are workhorses of applied research; most quantitative theses use at least one of them. The chapter is heavy on worked numerical examples.
6.1 Chi-square as a test for comparing variance
The chi-square distribution appears in several roles in statistics. As an inferential tool for a single sample's variance, it provides a test that complements the $F$-test (which compares two variances).
Testing a single variance against a hypothesised value
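Question. Does the population variance $\sigma^2$ equal a specified value $\sigma_0^2$?
Hypotheses.
- $H_0$: $\sigma^2 = \sigma_0^2$.
- $H_1$: $\sigma^2 \neq \sigma_0^2$.
Test statistic.
$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$
where $n$ is the sample size and $s^2$ is the sample variance. Under $H_0$, this statistic follows a chi-square distribution with $n-1$ degrees of freedom.
The chi-square distribution is right-skewed and lives only on the non-negative axis. Critical values come from a table. For a two-tailed test at significance level $\alpha$, the critical values are $\chi^2_{\alpha/2,\,n-1}$ (upper) and $\chi^2_{1-\alpha/2,\,n-1}$ (lower), where the subscript gives the upper-tail area.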
Worked example. A SCADA operator at NEA claims that the variance of measured voltage fluctuations on a 132 kV substation is 4 kV². A sample of 25 measurements during one morning peak gives a sample variance of 6.5 kV². Is the actual variance different from the claimed value?
- $H_0$: $\sigma^2 = 4$.
- $H_1$: $\sigma^2 \neq 4$.
- $\alpha = 0.05$, df = 24.
Compute:
$$\chi^2 = \frac{(25-1)(6.5)}{4} = \frac{156}{4} = 39.0$$
Critical values from the table for df = 24:
- Upper ($\chi^2_{0.025}$): 39.36.
- Lower ($\chi^2_{0.975}$): 12.40.
Decision: $\chi^2 = 39.0$ is just below the upper critical value of 39.36. Fail to reject $H_0$ at $\alpha = 0.05$.
Conclusion. The evidence does not quite reach significance at $\alpha = 0.05$. The sample variance (6.5) is higher than the claimed value (4), but a result this extreme could plausibly arise by sampling chance from a population with the claimed variance.
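The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch using only the numbers given above; the variable names are illustrative, and the critical values are copied from the table lookup in the text rather than computed.

```python
# Chi-square test for a single variance, using the worked-example numbers.
n = 25             # sample size
s2 = 6.5           # sample variance (kV^2)
sigma0_sq = 4.0    # claimed population variance under H0 (kV^2)

df = n - 1
chi2_stat = df * s2 / sigma0_sq    # (n - 1) * s^2 / sigma_0^2

# Two-tailed critical values for df = 24 at alpha = 0.05,
# copied from the chi-square table values quoted in the text.
lower_crit, upper_crit = 12.40, 39.36

reject_h0 = chi2_stat < lower_crit or chi2_stat > upper_crit
print(f"chi2 = {chi2_stat:.2f}, df = {df}, reject H0: {reject_h0}")
# prints: chi2 = 39.00, df = 24, reject H0: False
```

The statistic falls inside the interval between the two critical values, which is the "fail to reject" region of a two-tailed test.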
A note on test sensitivity
The chi-square test for variance is very sensitive to departures from normality, far more than the $t$-test is for means. If the underlying data are even mildly non-normal, the test's results can be misleading. For practical use, the test should be combined with a normality check (Shapiro-Wilk test, Q-Q plot) before relying on its conclusion.
6.2 Chi-square as a non-parametric test
The chi-square test's more common use is non-parametric: it does not assume any particular distribution of the underlying data. It has three standard uses, the last two of which are mathematically the same test:
Goodness-of-fit test
Question. Does the observed frequency distribution match a hypothesised distribution?
Hypotheses.
- $H_0$: the observed distribution matches the expected.
- $H_1$: the observed distribution differs from the expected.
Test statistic.
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ is the observed frequency in category $i$, $E_i$ is the expected frequency under $H_0$, and $k$ is the number of categories. Degrees of freedom: $k - 1$ (minus additional df for any parameters estimated from the data).
Test of independence
Question. Are two categorical variables independent in the population?
Hypotheses.
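- $H_0$: the variables are independent.
- $H_1$: the variables are associated.
Data is presented in a contingency table with rows for one variable and columns for the other. The test statistic is the same formula as above, but applied to the cells of the contingency table.
Expected frequencies under independence:
$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}$$
where $N$ is the grand total of the table.
Degrees of freedom for an $r \times c$ table: $(r-1)(c-1)$.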
Test of homogeneity
Question. Are two or more populations identical in their distribution across categories?
Mathematically identical to the test of independence. The framing is different — "are these subpopulations all drawn from the same distribution?" rather than "are these variables associated?"
6.3 Conditions for the application of the chi-square test
The chi-square test rests on several conditions. Violations make the test unreliable.
Independence of observations. Each observation must contribute to exactly one cell. Repeated measurements on the same subject across cells violate this.
Random sampling. The data should come from a random or otherwise appropriate sample of the population.
Adequate expected frequencies. A standard rule of thumb (Cochran's rule): all expected frequencies should be at least 1, and no more than 20% of expected frequencies should be less than 5. For 2×2 tables, all expected frequencies should be at least 5.
When expected frequencies are too small, remedies include:
- Combining sparse categories into broader ones.
- Using Fisher's exact test instead of chi-square for 2×2 tables.
- Increasing the sample size.
Mutually exclusive categories. Each observation belongs to exactly one cell.
Sufficient sample size. Small samples (below 30 or so) can produce unstable chi-square estimates even when expected frequencies appear adequate.
Use of frequencies, not percentages or rates. The chi-square formula applies to counts, not derived quantities.
For research papers and theses, all of these conditions should be checked and documented. A table that violates them should either be remedied (combining categories) or analysed with an alternative test (Fisher's exact, Monte Carlo simulation).
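The expected-frequency rule is easy to automate. The sketch below is one possible way to encode the rule of thumb exactly as stated above; the function name `meets_expected_frequency_rule` is hypothetical, and the second table is made-up data included only to show a failing case.

```python
def meets_expected_frequency_rule(expected):
    """Check the rule of thumb from the text on a table of expected counts.

    2x2 tables: every expected count must be at least 5.
    Larger tables: every expected count at least 1, and no more than
    20% of the cells with an expected count below 5.
    """
    cells = [e for row in expected for e in row]
    is_2x2 = len(expected) == 2 and all(len(row) == 2 for row in expected)
    if is_2x2:
        return min(cells) >= 5
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return min(cells) >= 1 and share_below_5 <= 0.20


# Expected counts for a 3x2 table with row totals 100, 120, 80,
# column totals 175, 125 and N = 300 (the age-group example in Section 6.4).
age_table = [[58.33, 41.67], [70.0, 50.0], [46.67, 33.33]]

# Hypothetical sparse 2x2 table, invented for illustration only.
sparse_table = [[12.0, 8.0], [3.0, 2.0]]

print(meets_expected_frequency_rule(age_table))     # True
print(meets_expected_frequency_rule(sparse_table))  # False
```

When the check fails, the remedies listed above apply: combine categories, switch to Fisher's exact test for a 2×2 table, or collect more data.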
6.4 Steps in applying the chi-square test
The procedure has a standard shape regardless of whether it is goodness-of-fit, independence, or homogeneity:
- State the hypotheses. $H_0$ and $H_1$ in words and in terms of the cell probabilities.
- Choose significance level. $\alpha$, conventionally 0.05.
- Collect or tabulate the observed frequencies.
- Compute the expected frequencies under $H_0$.
- Check the conditions. Especially the expected-frequency rule.
- Compute the chi-square test statistic.
- Determine degrees of freedom.
- Compare to the critical value (or compute the $p$-value).
- Make a decision and interpret.
Worked example — goodness-of-fit
A research project tests whether eSewa users are equally distributed across four geographic regions of Nepal. Of 200 randomly sampled active users, the regional distribution is:
| Region | Observed (O) | Expected (E) under uniform |
|---|---|---|
| Eastern | 35 | 50 |
| Central | 80 | 50 |
| Western | 55 | 50 |
| Far-Western | 30 | 50 |
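- $H_0$: users are equally distributed across regions (each region has 25% probability).
- $H_1$: users are not equally distributed.
- $\alpha = 0.05$.
Compute the chi-square statistic:
$$\chi^2 = \frac{(35-50)^2}{50} + \frac{(80-50)^2}{50} + \frac{(55-50)^2}{50} + \frac{(30-50)^2}{50} = \frac{225 + 900 + 25 + 400}{50} = 31.0$$
df = $4 - 1 = 3$.
Critical value at $\alpha = 0.05$ with df = 3: $\chi^2_{0.05,\,3} = 7.815$.
Decision: $31.0 > 7.815$. Strongly reject $H_0$.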
Conclusion. Users are not uniformly distributed across regions. The Central region is heavily over-represented (80 vs the 50 expected), while Eastern and Far-Western are under-represented. This is statistically very significant.
In context, the result is unsurprising — the Central region (which includes Kathmandu Valley) is more urbanised and has higher digital-payment penetration. The chi-square test confirms what the data clearly shows.
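The goodness-of-fit calculation above can be reproduced in a few lines. This is a sketch in plain Python using the observed counts from the table; the critical value is the tabulated 7.815 quoted in the text, not something the code computes.

```python
# Goodness-of-fit test against a uniform distribution over four regions.
observed = {"Eastern": 35, "Central": 80, "Western": 55, "Far-Western": 30}

n_total = sum(observed.values())
expected = n_total / len(observed)    # 200 / 4 = 50 per region under H0

# Per-region contribution (O - E)^2 / E shows which regions drive the result.
contributions = {
    region: (count - expected) ** 2 / expected
    for region, count in observed.items()
}
chi2_stat = sum(contributions.values())
df = len(observed) - 1

critical = 7.815    # chi-square table value for df = 3, alpha = 0.05
print(contributions)
print(f"chi2 = {chi2_stat:.1f}, df = {df}, reject H0: {chi2_stat > critical}")
# prints: chi2 = 31.0, df = 3, reject H0: True
```

The per-region contributions make the source of the result visible: the Central region alone contributes 18.0 of the total 31.0. In practice a library routine such as `scipy.stats.chisquare` would also return a $p$-value.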
Worked example — test of independence
A study examines whether mobile-banking adoption depends on age group among Nepali bank customers. A random sample of 300 customers gives:
| Age group | Uses mobile banking | Does not use | Row total |
|---|---|---|---|
| Age 18-30 | 80 | 20 | 100 |
| Age 31-50 | 70 | 50 | 120 |
| Age 51+ | 25 | 55 | 80 |
| Column total | 175 | 125 | 300 |
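- $H_0$: mobile-banking adoption is independent of age group.
- $H_1$: mobile-banking adoption is associated with age group.
- $\alpha = 0.05$.
Compute expected frequencies under independence (row total × column total / 300):
- Age 18-30: $E_{11} = \frac{100 \times 175}{300} = 58.33$, $E_{12} = \frac{100 \times 125}{300} = 41.67$.
- Age 31-50: $E_{21} = \frac{120 \times 175}{300} = 70.00$, $E_{22} = \frac{120 \times 125}{300} = 50.00$.
- Age 51+: $E_{31} = \frac{80 \times 175}{300} = 46.67$, $E_{32} = \frac{80 \times 125}{300} = 33.33$.
Check conditions: all expected frequencies are above 5. OK.
Compute the chi-square statistic:
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Cell-by-cell:
- Age 18-30: $\frac{(80-58.33)^2}{58.33} = 8.05$, $\frac{(20-41.67)^2}{41.67} = 11.27$.
- Age 31-50: $\frac{(70-70)^2}{70} = 0$, $\frac{(50-50)^2}{50} = 0$.
- Age 51+: $\frac{(25-46.67)^2}{46.67} = 10.06$, $\frac{(55-33.33)^2}{33.33} = 14.08$.
Sum: $\chi^2 = 8.05 + 11.27 + 0 + 0 + 10.06 + 14.08 = 43.46$.
df = $(3-1)(2-1) = 2$.
Critical value at $\alpha = 0.05$ with df = 2: $\chi^2_{0.05,\,2} = 5.991$.
Decision: $43.46 > 5.991$. Strongly reject $H_0$.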
Conclusion. Mobile-banking adoption is strongly associated with age group. Young customers (18-30) are much more likely to use mobile banking than expected under independence; older customers (51+) are much less likely. The 31-50 group is exactly at expectation: its observed counts (70 and 50) equal the counts expected under independence.
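The test of independence follows the same pattern in code: build the expected table from the margins, then sum the cell contributions. The sketch below uses only the observed counts from the contingency table; the variable names are illustrative.

```python
# Chi-square test of independence for the age-group x adoption table.
observed = [
    [80, 20],    # age 18-30: uses, does not use
    [70, 50],    # age 31-50
    [25, 55],    # age 51+
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n_total = sum(row_totals)

# Expected count in each cell: row total * column total / grand total.
expected = [[r * c / n_total for c in col_totals] for r in row_totals]

chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

for row in expected:
    print([round(e, 2) for e in row])
print(f"chi2 = {chi2_stat:.2f}, df = {df}")
```

A library routine such as `scipy.stats.chi2_contingency` performs the same computation and additionally returns the $p$-value and the expected table.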
Effect-size measures for chi-square
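The chi-square statistic itself depends on the sample size: a small effect with many observations can produce a large $\chi^2$. Effect-size measures normalise this.
Phi coefficient ($\phi$) for 2×2 tables:
$$\phi = \sqrt{\frac{\chi^2}{N}}$$
Cramér's V for larger tables:
$$V = \sqrt{\frac{\chi^2}{N \cdot \min(r-1,\, c-1)}}$$
where $N$ is the total sample size and $r$ and $c$ are the numbers of rows and columns.
For our age-vs-mobile-banking example: $V = \sqrt{\frac{43.46}{300 \times 1}} \approx 0.38$.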
Interpretation of Cramér's V (Cohen's benchmarks, which apply when $\min(r-1, c-1) = 1$, as in this example):
- 0.10 — small effect.
- 0.30 — medium effect.
- 0.50 — large effect.
The age-adoption association is between medium and large.
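As a quick check of the effect-size calculation, the sketch below computes Cramér's V from a chi-square statistic, the sample size, and the table dimensions. The function name `cramers_v` is illustrative; the input 43.46 is the chi-square statistic obtained by applying the formula of Section 6.2 to the 3×2 age-group table.

```python
import math


def cramers_v(chi2_stat, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi2 / (N * min(r - 1, c - 1)))."""
    return math.sqrt(chi2_stat / (n * min(n_rows - 1, n_cols - 1)))


# Age-group example: 3 rows, 2 columns, N = 300.
v = cramers_v(43.46, n=300, n_rows=3, n_cols=2)
print(f"Cramér's V = {v:.2f}")    # prints: Cramér's V = 0.38
```

For a 2×2 table the same function returns the phi coefficient, since $\min(r-1, c-1) = 1$.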
6.5 Analysis of variance (ANOVA) and the ANOVA technique
Why ANOVA
The two-sample $t$-test compares two means. When comparing three or more means, running multiple pairwise $t$-tests inflates the false-positive rate (the multiple-testing problem from Chapter 5). With four groups there are six pairwise comparisons; treating the tests as independent, the chance of at least one false significance at $\alpha = 0.05$ rises to roughly $1 - 0.95^6 \approx 0.26$.
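The inflation is easy to tabulate. The sketch below assumes, as a simplification, that the pairwise tests are independent; tests that share groups are not, so the figures are approximations. The function name is illustrative.

```python
from math import comb


def family_wise_error(n_groups, alpha=0.05):
    """Approximate chance of at least one false positive across all
    pairwise comparisons, treating the tests as independent."""
    n_comparisons = comb(n_groups, 2)
    return n_comparisons, 1 - (1 - alpha) ** n_comparisons


for k in (3, 4, 5, 6):
    m, fwer = family_wise_error(k)
    print(f"{k} groups: {m} comparisons, family-wise error about {fwer:.2f}")
# 4 groups: 6 comparisons, family-wise error about 0.26
```

With six groups there are already fifteen comparisons, and the approximate family-wise error exceeds one half.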
Analysis of variance (ANOVA) solves this by testing all the means at once.
Analysis of variance is the statistical method for testing whether the means of three or more groups differ from each other, by comparing the variability between group means to the variability within groups, using an F-statistic under the null hypothesis of equal means.
The intuition
ANOVA partitions the total variance in the data into two parts:
Between-group variance. How much the group means differ from the overall mean. Large between-group variance means the groups are pulling apart.
Within-group variance. How much the individual observations differ from their own group means. This is the "noise" against which the between-group signal is measured.
The ratio:
$$F = \frac{\text{between-group variance}}{\text{within-group variance}}$$
If the groups have the same mean ($H_0$ true), the between-group variance is just sampling noise and should be roughly equal to the within-group variance, so $F$ is near 1. If the groups have different means, the between-group variance is inflated by the real differences, so $F$ is much larger than 1.
One-way ANOVA
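For $k$ groups with $n_j$ observations in group $j$, where $x_{ij}$ is observation $i$ in group $j$:
- $\bar{x}_j$ = mean of group $j$.
- $\bar{x}$ = overall mean (grand mean).
- $N = \sum_{j=1}^{k} n_j$ = total number of observations.
Sum of squares between groups (SSB).
$$SSB = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2$$
Sum of squares within groups (SSW), also called sum of squares error (SSE).
$$SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$$
Total sum of squares (SST).
$$SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2 = SSB + SSW$$
Degrees of freedom.
- Between: $k - 1$.
- Within: $N - k$.
- Total: $N - 1$.
Mean squares.
$$MSB = \frac{SSB}{k-1}, \qquad MSW = \frac{SSW}{N-k}$$
F-statistic.
$$F = \frac{MSB}{MSW}$$
Under $H_0$ (all group means equal), $F$ follows an $F$ distribution with $k-1$ and $N-k$ degrees of freedom.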
Assumptions of ANOVA
ANOVA assumes:
- Independence. Observations within and across groups are independent.
- Normality. The values within each group are approximately normally distributed.
- Homogeneity of variance. All groups have the same population variance (often tested with Levene's test).
- Continuous dependent variable. The variable being averaged is at the interval or ratio level.
Violations of normality and equal variance are tolerable when sample sizes are large and roughly equal. Severe violations may require transformations of the data or non-parametric alternatives (the Kruskal-Wallis test).
6.6 Setting up the ANOVA table
The standard presentation of ANOVA results is the ANOVA table:
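| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Between groups | SSB | $k-1$ | $MSB = SSB/(k-1)$ | $F = MSB/MSW$ |
| Within groups | SSW | $N-k$ | $MSW = SSW/(N-k)$ | |
| Total | SST | $N-1$ | | |

Here $k$ is the number of groups and $N$ is the total number of observations.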
Worked example — one-way ANOVA
Three machine-learning algorithms for intrusion detection are evaluated. Each is run on 5 independent test sets (different random splits of the same data). F1-scores are recorded:
| Algorithm A | Algorithm B | Algorithm C |
|---|---|---|
| 0.82 | 0.78 | 0.86 |
| 0.85 | 0.80 | 0.88 |
| 0.83 | 0.76 | 0.85 |
| 0.84 | 0.79 | 0.87 |
| 0.81 | 0.77 | 0.89 |
Test whether the three algorithms have the same mean F1-score.
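- $H_0$: $\mu_A = \mu_B = \mu_C$.
- $H_1$: at least one mean differs.
- $\alpha = 0.05$.
Compute group means:
- $\bar{x}_A = 4.15/5 = 0.830$.
- $\bar{x}_B = 3.90/5 = 0.780$.
- $\bar{x}_C = 4.35/5 = 0.870$.
Grand mean: $\bar{x} = 12.40/15 \approx 0.8267$.
Between-group sum of squares:
$$SSB = 5\left[(0.830 - 0.8267)^2 + (0.780 - 0.8267)^2 + (0.870 - 0.8267)^2\right] = 5 \times 0.0040667 \approx 0.02033$$
Within-group sum of squares:
For group A (mean 0.830):
- Deviations: $-0.01,\ 0.02,\ 0.00,\ 0.01,\ -0.02$. Sum of squares: 0.0010
For group B (mean 0.780):
- Deviations: $0.00,\ 0.02,\ -0.02,\ 0.01,\ -0.01$. Sum of squares: 0.0010
For group C (mean 0.870):
- Deviations: $-0.01,\ 0.01,\ -0.02,\ 0.00,\ 0.02$. Sum of squares: 0.0010
$SSW = 0.0010 + 0.0010 + 0.0010 = 0.0030$.
Degrees of freedom: $df_B = 3 - 1 = 2$, $df_W = 15 - 3 = 12$.
Mean squares:
$$MSB = \frac{0.02033}{2} = 0.01017, \qquad MSW = \frac{0.0030}{12} = 0.00025$$
F-statistic:
$$F = \frac{0.01017}{0.00025} \approx 40.67$$
ANOVA table:
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between groups | 0.02033 | 2 | 0.01017 | 40.67 |
| Within groups | 0.0030 | 12 | 0.00025 | |
| Total | 0.02333 | 14 | | |
Critical value at $\alpha = 0.05$ with df = (2, 12): $F_{0.05;\,2,\,12} = 3.89$.
$40.67 > 3.89$. Strongly reject $H_0$.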
Conclusion. The three algorithms have significantly different mean F1-scores. Algorithm C has the highest (0.870), followed by A (0.830) and B (0.780).
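The sums of squares and the F-statistic for this example can be recomputed directly from the raw scores. The following is a minimal sketch in plain Python; the variable names are illustrative, and the result is compared with the tabulated critical value quoted above rather than turned into a $p$-value.

```python
# One-way ANOVA from first principles for the three algorithms.
groups = {
    "A": [0.82, 0.85, 0.83, 0.84, 0.81],
    "B": [0.78, 0.80, 0.76, 0.79, 0.77],
    "C": [0.86, 0.88, 0.85, 0.87, 0.89],
}


def mean(values):
    return sum(values) / len(values)


all_values = [x for scores in groups.values() for x in scores]
grand_mean = mean(all_values)

# Between groups: group size times squared distance of group mean from grand mean.
ssb = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
# Within groups: squared distance of each score from its own group mean.
ssw = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ssb / df_between) / (ssw / df_within)

print(f"SSB = {ssb:.5f}, SSW = {ssw:.5f}")
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

A library routine such as `scipy.stats.f_oneway` returns the same F-statistic together with its $p$-value.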
Post-hoc tests
ANOVA tells us at least one group differs but not which one. Post-hoc tests identify the specific differences.
Common post-hoc tests:
- Tukey's HSD (Honestly Significant Difference). Compares all pairs while controlling the family-wise error rate. Standard default.
- Bonferroni-adjusted pairwise t-tests. Conservative; divides $\alpha$ by the number of comparisons.
- Scheffé's test. Most conservative; useful when comparing complex contrasts.
- Dunnett's test. Compares each treatment group to a single control.
- Fisher's LSD. Liberal; appropriate only when the omnibus F-test is significant.
For the example above, post-hoc tests would confirm that all three pairs (A vs B, A vs C, B vs C) differ significantly.
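A hand version of Tukey's HSD for equal group sizes shows where that claim comes from. The sketch below uses the group means (0.830, 0.780, 0.870), the group size (5), and the within-group mean square (0.00025 on 12 df) from the example. The critical value $q = 3.77$ is an assumption taken from a standard studentized-range table (3 groups, 12 error df, $\alpha = 0.05$); it is not computed here.

```python
import math
from itertools import combinations

# Quantities from the one-way ANOVA example.
means = {"A": 0.830, "B": 0.780, "C": 0.870}
n_per_group = 5
msw = 0.00025    # within-group mean square, 12 degrees of freedom

# Studentized range critical value for 3 groups, 12 error df, alpha = 0.05.
# Assumed from a standard table, not derived in this sketch.
q_crit = 3.77

# Tukey's honestly significant difference for equal group sizes.
hsd = q_crit * math.sqrt(msw / n_per_group)

significant = {
    (g1, g2): abs(means[g1] - means[g2]) > hsd
    for g1, g2 in combinations(means, 2)
}

print(f"HSD = {hsd:.4f}")
for pair, flag in significant.items():
    print(pair, "differs" if flag else "does not differ")
```

Every pairwise difference in means (0.05, 0.04, and 0.09) exceeds the threshold of about 0.027, so all three pairs differ. Recent versions of SciPy provide `scipy.stats.tukey_hsd` for the same purpose, with adjusted $p$-values.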
6.7 Coding method
For computation by hand or by simple calculator, the coding method simplifies the arithmetic by transforming the data to smaller numbers before computation. The final results are then untransformed.
The two standard codings:
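Subtracting a constant. If all values are large (e.g., NEA peak demand in MW, in the thousands), subtract a constant from each. The mean shifts by that constant, while sums of squared deviations are unaffected whatever constant is chosen; a constant close to the data simply keeps the arithmetic small.
For data $x_1, \dots, x_n$, let $u_i = x_i - c$. Then:
- $\bar{x} = \bar{u} + c$.
- $s_x^2 = s_u^2$.
The variance, standard deviation, and sums of squares are identical under translation.
Scaling by a constant. If all values are inconveniently small or scaled differently, multiply each by a constant. Means scale by the same factor; variances scale by the factor squared.
For $u_i = h x_i$:
- $\bar{u} = h\bar{x}$, so $\bar{x} = \bar{u}/h$.
- $s_u^2 = h^2 s_x^2$, so $s_x^2 = s_u^2/h^2$.
Combined coding. $u_i = h(x_i - c)$, which gives $\bar{x} = c + \bar{u}/h$ and $s_x^2 = s_u^2/h^2$.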
Coding in ANOVA
For ANOVA with large or awkward numbers, code the data first. Compute SS in coded units. Sum of squares (involving squared deviations from a mean) is invariant to translation; only the scaling matters. If we scaled by $h$, multiply the SS in coded units by $1/h^2$ to get SS in original units.
The $F$-statistic is invariant to both translation and scaling: being a ratio of mean squares, the $h^2$ in numerator and denominator cancels.
In practice, with statistical software running on standard machines, coding is rarely necessary. The technique appears in older textbooks for hand computation; it remains useful when teaching ANOVA arithmetic in a classroom.
Example
For the F1-score data in Section 6.6, every value is around 0.8. Code by $u = 100(x - 0.80)$. The coded data:
| Algorithm A | Algorithm B | Algorithm C |
|---|---|---|
| 2 | -2 | 6 |
| 5 | 0 | 8 |
| 3 | -4 | 5 |
| 4 | -1 | 7 |
| 1 | -3 | 9 |
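Means: $\bar{u}_A = 3$, $\bar{u}_B = -2$, $\bar{u}_C = 7$. Grand mean: $\bar{u} = 40/15 \approx 2.667$.
The F-statistic from the coded data is the same as from the original. Sums of squares scale by $100^2 = 10^4$ (since $SSB_u = 203.33$ and $SSW_u = 30$, dividing by $10^4$ to "decode" gives $SSB \approx 0.02033$ and $SSW = 0.0030$). But ratios, which is what F is, are unchanged: $F = \frac{203.33/2}{30/12} \approx 40.67$.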
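The invariance can be demonstrated numerically by running the same ANOVA arithmetic on the original and the coded scores. This is a sketch; the helper `one_way_anova` is an illustrative name, and the coding $u = 100(x - 0.80)$ is the one used in this example.

```python
def one_way_anova(groups):
    """Return (SSB, SSW, F) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return ssb, ssw, (ssb / df_between) / (ssw / df_within)


original = [
    [0.82, 0.85, 0.83, 0.84, 0.81],    # Algorithm A
    [0.78, 0.80, 0.76, 0.79, 0.77],    # Algorithm B
    [0.86, 0.88, 0.85, 0.87, 0.89],    # Algorithm C
]

# Code each score as u = 100 * (x - 0.80); round() removes floating-point dust.
coded = [[round(100 * (x - 0.80)) for x in g] for g in original]

ssb_x, ssw_x, f_x = one_way_anova(original)
ssb_u, ssw_u, f_u = one_way_anova(coded)

print(coded)
print(f"SS ratio (coded / original): {ssb_u / ssb_x:.1f}")
print(f"F original = {f_x:.2f}, F coded = {f_u:.2f}")
```

The coded values reproduce the table above, the sums of squares differ by the factor $100^2$, and the two F-statistics agree.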
6.8 Two-way ANOVA
One-way ANOVA tests the effect of one factor. Two-way ANOVA tests the effects of two factors simultaneously and their interaction.
Two-way ANOVA setup
Suppose two factors:
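- Factor A with $a$ levels.
- Factor B with $b$ levels.
Each combination of levels is a cell, with $n$ observations per cell. Total observations: $N = abn$.
The model:
$$x_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$
where:
- $\mu$ is the grand mean.
- $\alpha_i$ is the main effect of Factor A's level $i$.
- $\beta_j$ is the main effect of Factor B's level $j$.
- $(\alpha\beta)_{ij}$ is the interaction effect.
- $\varepsilon_{ijk}$ is the residual.
Three null hypotheses to test:
- $H_{0A}$: All $\alpha_i = 0$ (no main effect of Factor A).
- $H_{0B}$: All $\beta_j = 0$ (no main effect of Factor B).
- $H_{0AB}$: All $(\alpha\beta)_{ij} = 0$ (no interaction).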
Two-way ANOVA table
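The sums of squares decompose:
$$SST = SSA + SSB + SSAB + SSE$$
Standard ANOVA table, for $a$ levels of Factor A, $b$ levels of Factor B, and $n$ observations per cell:
| Source | SS | df | MS | F |
|---|---|---|---|---|
| Factor A | SSA | $a-1$ | $MSA = SSA/(a-1)$ | $MSA/MSE$ |
| Factor B | SSB | $b-1$ | $MSB = SSB/(b-1)$ | $MSB/MSE$ |
| Interaction AB | SSAB | $(a-1)(b-1)$ | $MSAB = SSAB/[(a-1)(b-1)]$ | $MSAB/MSE$ |
| Error | SSE | $ab(n-1)$ | $MSE = SSE/[ab(n-1)]$ | |
| Total | SST | $abn-1$ | | |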
Each F-statistic is compared to the appropriate critical value.
Main effects vs interactions
Main effect. Average effect of one factor across the levels of the other. "Algorithm matters" (regardless of which dataset).
Interaction. The effect of one factor depends on the level of the other. "Algorithm A is best on dataset X but Algorithm B is best on dataset Y" — the algorithm effect interacts with the dataset effect.
A significant interaction modifies how the main effects are interpreted. If the interaction is significant, the main effects must be discussed in light of which level of the other factor is considered.
Worked example — two-way ANOVA
A study evaluates two ML algorithms (Factor A: algorithm with levels Random Forest and XGBoost) on two types of fraud datasets (Factor B: dataset type with levels Type-1 = credit card fraud, Type-2 = mobile-wallet fraud). Three F1-scores per combination:
| Algorithm | Type-1 (B1) | Type-2 (B2) |
|---|---|---|
| Random Forest (A1) | 0.82, 0.84, 0.83 | 0.75, 0.77, 0.76 |
| XGBoost (A2) | 0.85, 0.87, 0.86 | 0.88, 0.90, 0.89 |
Cell means:
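- $\bar{x}_{11} = 0.83$, $\bar{x}_{12} = 0.76$.
- $\bar{x}_{21} = 0.86$, $\bar{x}_{22} = 0.89$.
Marginal means:
- $\bar{x}_{1\cdot} = 0.795$ (Random Forest overall).
- $\bar{x}_{2\cdot} = 0.875$ (XGBoost overall).
- $\bar{x}_{\cdot 1} = 0.845$ (Type-1 overall).
- $\bar{x}_{\cdot 2} = 0.825$ (Type-2 overall).
Grand mean: $\bar{x} = 0.835$.
With $a = 2$, $b = 2$, $n = 3$ per cell:
SSA (algorithm):
$$SSA = bn \sum_i (\bar{x}_{i\cdot} - \bar{x})^2 = 6\left[(0.795 - 0.835)^2 + (0.875 - 0.835)^2\right] = 6 \times 0.0032 = 0.0192$$
SSB (dataset):
$$SSB = an \sum_j (\bar{x}_{\cdot j} - \bar{x})^2 = 6\left[(0.845 - 0.835)^2 + (0.825 - 0.835)^2\right] = 6 \times 0.0002 = 0.0012$$
SSAB (interaction):
$$SSAB = n \sum_i \sum_j (\bar{x}_{ij} - \bar{x}_{i\cdot} - \bar{x}_{\cdot j} + \bar{x})^2$$
For each cell:
- (A1, B1): $0.83 - 0.795 - 0.845 + 0.835 = 0.025$, squared: 0.000625.
- (A1, B2): $0.76 - 0.795 - 0.825 + 0.835 = -0.025$, squared: 0.000625.
- (A2, B1): $0.86 - 0.875 - 0.845 + 0.835 = -0.025$, squared: 0.000625.
- (A2, B2): $0.89 - 0.875 - 0.825 + 0.835 = 0.025$, squared: 0.000625.
Sum: 0.002500. Multiply by $n = 3$: $SSAB = 0.0075$.
SSE (error, within cells):
Compute deviations of individual observations from their cell mean.
- (A1, B1), cell mean 0.83: deviations = -0.01, 0.01, 0.00. Squared: 0.0001 + 0.0001 + 0 = 0.0002.
- (A1, B2), cell mean 0.76: deviations = -0.01, 0.01, 0.00. Squared: 0.0002.
- (A2, B1), cell mean 0.86: deviations = -0.01, 0.01, 0.00. Squared: 0.0002.
- (A2, B2), cell mean 0.89: deviations = -0.01, 0.01, 0.00. Squared: 0.0002.
$SSE = 4 \times 0.0002 = 0.0008$.
Total: $SST = 0.0192 + 0.0012 + 0.0075 + 0.0008 = 0.0287$.
ANOVA table:
| Source | SS | df | MS | F |
|---|---|---|---|---|
| A (algorithm) | 0.0192 | 1 | 0.0192 | 192 |
| B (dataset) | 0.0012 | 1 | 0.0012 | 12 |
| Interaction AB | 0.0075 | 1 | 0.0075 | 75 |
| Error | 0.0008 | 8 | 0.0001 | |
| Total | 0.0287 | 11 | | |
(MSE = 0.0008/8 = 0.0001.)
Critical value at $\alpha = 0.05$ with df = (1, 8): $F_{0.05;\,1,\,8} = 5.32$.
Decisions.
- $F_A = 192 > 5.32$. Reject $H_{0A}$. Algorithm has a significant main effect.
- $F_B = 12 > 5.32$. Reject $H_{0B}$. Dataset has a significant main effect.
- $F_{AB} = 75 > 5.32$. Reject $H_{0AB}$. The interaction is significant.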
Conclusion. All three effects are significant. XGBoost outperforms Random Forest on average (0.875 vs 0.795), and Type-1 fraud is slightly easier than Type-2 on average (0.845 vs 0.825). But the significant interaction means these main effects must be interpreted carefully: Random Forest performs much better on Type-1 (0.83) than on Type-2 (0.76), while XGBoost performs slightly better on Type-2 (0.89) than on Type-1 (0.86). XGBoost is ahead on both dataset types, but the size of its advantage depends on the dataset type, and which dataset type is "easier" depends on the algorithm.
The interaction is the substantively most interesting finding: XGBoost is the better choice on both datasets, and especially on mobile-wallet fraud, where its advantage over Random Forest is largest (0.13, against 0.03 on credit card fraud).
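The full two-way decomposition can be recomputed from the twelve raw scores. The sketch below is plain Python for a balanced design (equal observations per cell); the short labels "RF" and "XGB" and the variable names are illustrative.

```python
# Two-way ANOVA (balanced design) from first principles.
cells = {
    ("RF", "Type-1"): [0.82, 0.84, 0.83],
    ("RF", "Type-2"): [0.75, 0.77, 0.76],
    ("XGB", "Type-1"): [0.85, 0.87, 0.86],
    ("XGB", "Type-2"): [0.88, 0.90, 0.89],
}


def mean(values):
    return sum(values) / len(values)


a_levels = sorted({a for a, _ in cells})
b_levels = sorted({b for _, b in cells})
n = len(next(iter(cells.values())))    # observations per cell

grand = mean([x for obs in cells.values() for x in obs])
cell_mean = {key: mean(obs) for key, obs in cells.items()}
a_mean = {a: mean([x for (ai, _), obs in cells.items() if ai == a for x in obs])
          for a in a_levels}
b_mean = {b: mean([x for (_, bj), obs in cells.items() if bj == b for x in obs])
          for b in b_levels}

ssa = len(b_levels) * n * sum((a_mean[a] - grand) ** 2 for a in a_levels)
ssb = len(a_levels) * n * sum((b_mean[b] - grand) ** 2 for b in b_levels)
ssab = n * sum((cell_mean[a, b] - a_mean[a] - b_mean[b] + grand) ** 2
               for a, b in cells)
sse = sum((x - cell_mean[key]) ** 2 for key, obs in cells.items() for x in obs)

df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
df_ab = df_a * df_b
df_e = len(a_levels) * len(b_levels) * (n - 1)
mse = sse / df_e

f_a = (ssa / df_a) / mse
f_b = (ssb / df_b) / mse
f_ab = (ssab / df_ab) / mse

print(f"SSA = {ssa:.4f}, SSB = {ssb:.4f}, SSAB = {ssab:.4f}, SSE = {sse:.4f}")
print(f"F_A = {f_a:.0f}, F_B = {f_b:.0f}, F_AB = {f_ab:.0f}")
```

A useful sanity check on any such calculation is that the four components add up to the total sum of squares computed directly from the grand mean.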
Visualising interactions
The standard visualisation is an interaction plot — line plot with the dependent variable on the y-axis, one factor on the x-axis, and one line per level of the other factor. Parallel lines indicate no interaction. Crossing or diverging lines indicate interaction.
For the example above, plotting cell means with algorithm on the x-axis and one line per dataset type would show the lines crossing: at Random Forest the Type-1 line (0.83) lies above the Type-2 line (0.76), while at XGBoost the Type-2 line (0.89) lies above the Type-1 line (0.86).
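Without drawing the plot, the same reading can be done numerically from the four cell means. The sketch below is an illustrative stand-in for the interaction plot, not a replacement for it: it compares the slopes of the two dataset lines and checks whether their vertical order flips between the two algorithms.

```python
# Cell means from the two-way example (algorithm, dataset type).
cell_means = {
    ("Random Forest", "Type-1"): 0.83,
    ("Random Forest", "Type-2"): 0.76,
    ("XGBoost", "Type-1"): 0.86,
    ("XGBoost", "Type-2"): 0.89,
}

# Slope of each dataset line when moving from Random Forest to XGBoost.
slope_type1 = cell_means["XGBoost", "Type-1"] - cell_means["Random Forest", "Type-1"]
slope_type2 = cell_means["XGBoost", "Type-2"] - cell_means["Random Forest", "Type-2"]

# Parallel lines would give equal slopes, that is, a zero contrast.
interaction_contrast = slope_type2 - slope_type1

# Vertical gap between the two lines at each algorithm; a sign change means a crossing.
gap_at_rf = cell_means["Random Forest", "Type-1"] - cell_means["Random Forest", "Type-2"]
gap_at_xgb = cell_means["XGBoost", "Type-1"] - cell_means["XGBoost", "Type-2"]
lines_cross = gap_at_rf * gap_at_xgb < 0

print(f"slopes: Type-1 {slope_type1:.2f}, Type-2 {slope_type2:.2f}")
print(f"interaction contrast: {interaction_contrast:.2f}, lines cross: {lines_cross}")
```

A contrast of zero corresponds to parallel lines; whether a non-zero contrast is statistically significant is what the interaction F-test decides.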
Beyond two-way ANOVA
Higher-order ANOVAs (three-way, four-way) follow the same principles. Each factor adds main effects, all interactions among the factors, and higher-order interactions. Interpretation becomes harder as the number of factors grows.
Mixed-effects models generalise ANOVA further, allowing some factors to be treated as random rather than fixed — useful when the levels of a factor are a sample from a larger set.
MANOVA (multivariate analysis of variance) tests for differences in means across multiple dependent variables at once.
In modern research, these methods are typically implemented in statistical software (R, SPSS, SAS, Stata, Python with statsmodels) rather than by hand. The hand-computation worked examples in this chapter exist for pedagogical clarity — to show that the F-statistic is not magic, just a ratio of variance components, computable from first principles.
The next chapter turns from analysis to communication — the reporting of results, the publication process, the management of the research project, and the standards of scientific dissemination.