
Chapter 6 — Chi-Square Test and ANOVA

The previous chapter introduced parametric tests for means, proportions, and variances. This chapter covers two specific tests that recur in research: the chi-square ($\chi^2$) test, which extends hypothesis testing to categorical data and to comparisons of frequencies; and analysis of variance (ANOVA), which extends hypothesis testing of means to comparisons of three or more groups simultaneously. Both are workhorses of applied research; most quantitative theses use at least one of them. The chapter is heavy on worked numerical examples.

6.1 Chi-square as a test for comparing variance

The chi-square distribution appears in several roles in statistics. As an inferential tool for a single sample's variance, it provides a test that complements the $F$-test (which compares two variances).

Testing a single variance against a hypothesised value

Question. Does the population variance equal a specified value $\sigma_0^2$?

Hypotheses.

  • $H_0$: $\sigma^2 = \sigma_0^2$.
  • $H_1$: $\sigma^2 \neq \sigma_0^2$.

Test statistic.

$$\chi^2 = \frac{(n-1) s^2}{\sigma_0^2}$$

where $n$ is the sample size and $s^2$ is the sample variance. Under $H_0$, this statistic follows a chi-square distribution with $n - 1$ degrees of freedom.

The chi-square distribution is right-skewed and lives only on the non-negative axis. Critical values come from a $\chi^2$ table. For a two-tailed test at significance level $\alpha$, the critical values are $\chi^2_{\alpha/2, n-1}$ (upper) and $\chi^2_{1-\alpha/2, n-1}$ (lower), where the first subscript is the area to the right of the critical value.

Worked example. A SCADA operator at NEA claims that the variance of measured voltage fluctuations on a 132 kV substation is 4 kV². A sample of 25 measurements during one morning peak gives a sample variance of 6.5 kV². Is the actual variance different from the claimed value?

  • $H_0$: $\sigma^2 = 4$.
  • $H_1$: $\sigma^2 \neq 4$.
  • $\alpha = 0.05$, df = 24.

Compute:

$$\chi^2 = \frac{24 \cdot 6.5}{4} = \frac{156}{4} = 39.0$$

Critical values from the $\chi^2$ table for df = 24:

  • Upper ($\alpha/2 = 0.025$): $\chi^2_{0.025, 24} \approx 39.36$.
  • Lower ($1 - \alpha/2 = 0.975$): $\chi^2_{0.975, 24} \approx 12.40$.

Decision: $\chi^2 = 39.0$ is just below the upper critical value of 39.36. Fail to reject $H_0$ at $\alpha = 0.05$.

Conclusion. The evidence does not quite reach significance at $\alpha = 0.05$. The sample variance (6.5) is higher than the claimed value (4), but a result this extreme could plausibly arise by sampling chance from a population with the claimed variance.
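The arithmetic of this example can be reproduced in a few lines. The sketch below is illustrative only: the function name is an assumption introduced here, and the critical values are simply the table values quoted above rather than computed ones.

```python
def chi_square_variance_statistic(n, sample_var, sigma0_sq):
    """Return (n - 1) * s^2 / sigma_0^2 for a single-sample variance test."""
    if n < 2:
        raise ValueError("need at least two observations")
    if sigma0_sq <= 0:
        raise ValueError("hypothesised variance must be positive")
    return (n - 1) * sample_var / sigma0_sq


# Values from the worked example: 25 measurements, s^2 = 6.5, claimed variance 4.
n, s2, sigma0_sq = 25, 6.5, 4.0
stat = chi_square_variance_statistic(n, s2, sigma0_sq)

# Two-tailed critical values for df = 24 at alpha = 0.05, copied from the table.
lower_crit, upper_crit = 12.40, 39.36
reject = stat < lower_crit or stat > upper_crit

print(f"chi-square = {stat:.2f}, reject H0: {reject}")
```

The statistic lands inside the acceptance region by a narrow margin, which is why the conclusion is worded cautiously.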

A note on test sensitivity

The chi-square test for variance is very sensitive to departures from normality, far more than the $t$-test is for means. If the underlying data is even mildly non-normal, the test's results can be misleading. For practical use, the test should be combined with a normality check (Shapiro-Wilk test, Q-Q plot) before relying on its conclusion.

6.2 Chi-square as a non-parametric test

The chi-square test's more common use is non-parametric: it does not assume any particular distribution of the underlying data. The three standard uses:

Goodness-of-fit test

Question. Does the observed frequency distribution match a hypothesised distribution?

Hypotheses.

  • $H_0$: the observed distribution matches the expected.
  • $H_1$: the observed distribution differs from the expected.

Test statistic.

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the observed frequency in category $i$, $E_i$ is the expected frequency under $H_0$, and $k$ is the number of categories. Degrees of freedom: $df = k - 1$ (minus one additional df for each parameter estimated from the data).

Test of independence

Question. Are two categorical variables independent in the population?

Hypotheses.

  • $H_0$: the variables are independent.
  • $H_1$: the variables are associated.

Data is presented in a contingency table with rows for one variable and columns for the other. The test statistic is the same formula as above, but applied to the cells of the contingency table.

Expected frequencies under independence:

$$E_{ij} = \frac{(\text{row total}_i) \cdot (\text{column total}_j)}{\text{grand total}}$$

Degrees of freedom for an $r \times c$ table: $df = (r-1)(c-1)$.

Test of homogeneity

Question. Are two or more populations identical in their distribution across categories?

Mathematically identical to the test of independence. The framing is different — "are these subpopulations all drawn from the same distribution?" rather than "are these variables associated?"

6.3 Conditions for the application of the chi-square test

The chi-square test rests on several conditions. Violations make the test unreliable.

Independence of observations. Each observation must contribute to exactly one cell. Repeated measurements on the same subject across cells violate this.

Random sampling. The data should come from a random or otherwise appropriate sample of the population.

Adequate expected frequencies. A standard rule of thumb (Cochran's rule): all expected frequencies should be at least 1, and no more than 20% of expected frequencies should be less than 5. For 2×2 tables, all expected frequencies should be at least 5.

When expected frequencies are too small, remedies include:

  • Combining sparse categories into broader ones.
  • Using Fisher's exact test instead of chi-square for 2×2 tables.
  • Increasing the sample size.

Mutually exclusive categories. Each observation belongs to exactly one cell.

Sufficient sample size. Small samples (below 30 or so) can produce unstable chi-square estimates even when expected frequencies appear adequate.

Use of frequencies, not percentages or rates. The chi-square formula applies to counts, not derived quantities.

For research papers and theses, all of these conditions should be checked and documented. A table that violates them should either be remedied (combining categories) or analysed with an alternative test (Fisher's exact, Monte Carlo simulation).
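The expected-frequency rule of thumb is easy to check mechanically before running the test. The helper below is a minimal sketch: the function name and the example counts are hypothetical, and it encodes only the thresholds stated above (every cell at least 1, at most 20% of cells below 5, and every cell at least 5 for a 2×2 table).

```python
def cochran_rule_ok(expected, two_by_two=False):
    """Check the expected-frequency rule of thumb on a flat list of expected counts."""
    cells = [float(e) for e in expected]
    if not cells:
        raise ValueError("no expected frequencies supplied")
    if two_by_two:
        # Stricter variant for 2x2 tables: every expected count at least 5.
        return min(cells) >= 5
    share_below_five = sum(1 for e in cells if e < 5) / len(cells)
    return min(cells) >= 1 and share_below_five <= 0.20


# Hypothetical expected counts, for illustration only.
ok_table = [12.0, 8.5, 6.0, 4.2, 9.3]       # 1 of 5 cells below 5 (20%)
sparse_table = [12.0, 3.5, 6.0, 4.2, 9.3]   # 2 of 5 cells below 5 (40%)

print(cochran_rule_ok(ok_table))      # rule satisfied
print(cochran_rule_ok(sparse_table))  # rule violated
```

A failed check points to one of the remedies listed above: merge sparse categories, switch to an exact test, or collect more data.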

6.4 Steps in applying the chi-square test

The procedure has a standard shape regardless of whether it is goodness-of-fit, independence, or homogeneity:

  1. State the hypotheses. $H_0$ and $H_1$ in words and in terms of the cell probabilities.
  2. Choose significance level. $\alpha$, conventionally 0.05.
  3. Collect or tabulate the observed frequencies.
  4. Compute the expected frequencies under $H_0$.
  5. Check the conditions. Especially the expected-frequency rule.
  6. Compute the chi-square test statistic.
  7. Determine degrees of freedom.
  8. Compare to the critical value (or compute the $p$-value).
  9. Make a decision and interpret.

Worked example — goodness-of-fit

A research project tests whether eSewa users are equally distributed across four geographic regions of Nepal. Of 200 randomly sampled active users, the regional distribution is:

| Region | Observed (O) | Expected (E) under uniform |
|---|---|---|
| Eastern | 35 | 50 |
| Central | 80 | 50 |
| Western | 55 | 50 |
| Far-Western | 30 | 50 |

  • $H_0$: users are equally distributed across regions (each region has 25% probability).
  • $H_1$: users are not equally distributed.
  • $\alpha = 0.05$.

Compute the chi-square statistic:

$$\chi^2 = \frac{(35-50)^2}{50} + \frac{(80-50)^2}{50} + \frac{(55-50)^2}{50} + \frac{(30-50)^2}{50}$$

$$= \frac{225}{50} + \frac{900}{50} + \frac{25}{50} + \frac{400}{50} = 4.5 + 18.0 + 0.5 + 8.0 = 31.0$$

df = $k - 1 = 4 - 1 = 3$.

Critical value at $\alpha = 0.05$ with df = 3: $\chi^2_{0.05, 3} \approx 7.815$.

Decision: $\chi^2 = 31.0 \gg 7.815$. Strongly reject $H_0$.

Conclusion. Users are not uniformly distributed across regions. The Central region is heavily over-represented (80 vs the 50 expected), while Eastern and Far-Western are under-represented. This is statistically very significant.

In context, the result is unsurprising — the Central region (which includes Kathmandu Valley) is more urbanised and has higher digital-payment penetration. The chi-square test confirms what the data clearly shows.
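The goodness-of-fit calculation can be sketched in plain Python. The function name is an assumption made for illustration, and the critical value is the table value quoted above, not something the code derives.

```python
def chi_square_gof(observed, expected):
    """Sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed regional counts from the worked example.
observed = {"Eastern": 35, "Central": 80, "Western": 55, "Far-Western": 30}

n = sum(observed.values())                      # 200 users in total
expected = [n / len(observed)] * len(observed)  # uniform: 50 per region

stat = chi_square_gof(list(observed.values()), expected)
df = len(observed) - 1
critical = 7.815  # chi-square table, alpha = 0.05, df = 3

print(f"chi-square = {stat:.1f}, df = {df}, reject H0: {stat > critical}")
```

Replacing the uniform `expected` list with any other hypothesised distribution (scaled to the same total) reuses the same function unchanged.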

Worked example — test of independence

A study examines whether mobile-banking adoption depends on age group among Nepali bank customers. A random sample of 300 customers gives:

| Age group | Uses mobile banking | Does not use | Row total |
|---|---|---|---|
| Age 18-30 | 80 | 20 | 100 |
| Age 31-50 | 70 | 50 | 120 |
| Age 51+ | 25 | 55 | 80 |
| Column total | 175 | 125 | 300 |

  • $H_0$: mobile-banking adoption is independent of age group.
  • $H_1$: mobile-banking adoption is associated with age group.
  • $\alpha = 0.05$.

Compute expected frequencies under independence:

$$E_{ij} = \frac{(\text{row total}_i)(\text{column total}_j)}{300}$$

  • $E_{11} = (100)(175)/300 = 58.33$
  • $E_{12} = (100)(125)/300 = 41.67$
  • $E_{21} = (120)(175)/300 = 70.00$
  • $E_{22} = (120)(125)/300 = 50.00$
  • $E_{31} = (80)(175)/300 = 46.67$
  • $E_{32} = (80)(125)/300 = 33.33$

Check conditions: all expected frequencies are above 5. OK.

Compute the chi-square statistic:

$$\chi^2 = \frac{(80-58.33)^2}{58.33} + \frac{(20-41.67)^2}{41.67} + \frac{(70-70)^2}{70} + \frac{(50-50)^2}{50} + \frac{(25-46.67)^2}{46.67} + \frac{(55-33.33)^2}{33.33}$$

Cell-by-cell:

  • $(80 - 58.33)^2 / 58.33 = 469.59 / 58.33 = 8.05$
  • $(20 - 41.67)^2 / 41.67 = 469.59 / 41.67 = 11.27$
  • $(70 - 70)^2 / 70 = 0$
  • $(50 - 50)^2 / 50 = 0$
  • $(25 - 46.67)^2 / 46.67 = 469.59 / 46.67 = 10.06$
  • $(55 - 33.33)^2 / 33.33 = 469.59 / 33.33 = 14.09$

Sum: $\chi^2 = 8.05 + 11.27 + 0 + 0 + 10.06 + 14.09 = 43.47$.

df = $(3-1)(2-1) = 2$.

Critical value at $\alpha = 0.05$ with df = 2: $\chi^2_{0.05, 2} \approx 5.991$.

Decision: $\chi^2 = 43.47 \gg 5.991$. Strongly reject $H_0$.

Conclusion. Mobile-banking adoption is strongly associated with age group. Young customers (18-30) are much more likely to use mobile banking than expected under independence; older customers (51+) are much less likely. The 31-50 group is approximately at expectation.
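The same test can be sketched in code, computing the expected counts directly from the row and column totals. The function names are assumptions made for illustration. Because the code carries the expected frequencies unrounded, it gives a statistic of about 43.46; the hand calculation's 43.47 reflects rounding the expected counts to two decimals.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    exp = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, exp)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df, exp


# Rows: age 18-30, 31-50, 51+. Columns: uses mobile banking, does not use.
table = [[80, 20], [70, 50], [25, 55]]
stat, df, exp = chi_square_independence(table)

critical = 5.991  # chi-square table, alpha = 0.05, df = 2
print(f"chi-square = {stat:.2f}, df = {df}, reject H0: {stat > critical}")
```

Returning the expected counts alongside the statistic makes it straightforward to run the expected-frequency check before trusting the result.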

Effect-size measures for chi-square

The chi-square statistic itself depends on the sample size: a small effect with many observations can produce a large $\chi^2$. Effect-size measures normalise this.

Phi coefficient ($\phi$) for 2×2 tables:

$$\phi = \sqrt{\frac{\chi^2}{n}}$$

Cramér's V for larger tables:

$$V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}}$$

For our age-vs-mobile-banking example: $V = \sqrt{43.47 / (300 \cdot 1)} = \sqrt{0.145} \approx 0.381$.

Interpretation of Cramér's V (Cohen):

  • 0.10 — small effect.
  • 0.30 — medium effect.
  • 0.50 — large effect.

The age-adoption association is between medium and large.
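As a small sketch, Cramér's V can be computed from the chi-square statistic, the sample size, and the table dimensions. The function name is an assumption; for a 2×2 table the same formula reduces to the phi coefficient.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi2 / (n * min(r - 1, c - 1)))."""
    k = min(n_rows - 1, n_cols - 1)
    if k < 1:
        raise ValueError("table needs at least two rows and two columns")
    return math.sqrt(chi2 / (n * k))


# Age-vs-mobile-banking example: chi-square 43.47, n = 300, 3 x 2 table.
v = cramers_v(43.47, 300, 3, 2)
print(f"Cramér's V = {v:.3f}")
```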

6.5 Analysis of variance (ANOVA) and the ANOVA technique

Why ANOVA

The two-sample $t$-test compares two means. When comparing three or more means, running multiple pairwise $t$-tests inflates the false-positive rate (the multiple-testing problem from Chapter 5). With four groups, there are six pairwise comparisons; the chance of at least one false significance at $\alpha = 0.05$ rises to roughly $1 - (1 - 0.05)^6 \approx 0.26$.
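To see how quickly the inflation grows, the sketch below tabulates the number of pairwise comparisons and the approximate family-wise error rate for several group counts. It assumes the comparisons are independent, which pairwise tests sharing groups are not, so the figures are rough guides only; the function name is introduced here for illustration.

```python
from math import comb


def familywise_error_rate(k_groups, alpha=0.05):
    """Return (number of pairwise comparisons, 1 - (1 - alpha)^m)."""
    m = comb(k_groups, 2)
    return m, 1 - (1 - alpha) ** m


for k in (3, 4, 5, 6):
    m, fwer = familywise_error_rate(k)
    print(f"{k} groups: {m} comparisons, family-wise error rate ~ {fwer:.2f}")
```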

Analysis of variance (ANOVA) solves this by testing all the means at once.

Analysis of variance is the statistical method for testing whether the means of three or more groups differ from each other, by comparing the variability between group means to the variability within groups, using an F-statistic under the null hypothesis of equal means.

The intuition

ANOVA partitions the total variance in the data into two parts:

Between-group variance. How much the group means differ from the overall mean. Large between-group variance means the groups are pulling apart.

Within-group variance. How much the individual observations differ from their own group means. This is the "noise" against which the between-group signal is measured.

The ratio:

$$F = \frac{\text{Between-group variance}}{\text{Within-group variance}}$$

If the groups have the same mean ($H_0$), the between-group variance is just sampling noise and should be roughly equal to the within-group variance, so $F$ is near 1. If the groups have different means, the between-group variance is inflated by the real differences, so $F$ is much larger than 1.

One-way ANOVA

For $k$ groups with $n_i$ observations in group $i$:

  • $\bar{x}_i$ = mean of group $i$.
  • $\bar{x}$ = overall mean (grand mean).
  • $N$ = total number of observations.

Sum of squares between groups (SSB).

$$SSB = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$$

Sum of squares within groups (SSW), also called sum of squares error (SSE).

$$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

Total sum of squares (SST).

$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = SSB + SSW$$
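The identity $SST = SSB + SSW$ is stated without proof. A short derivation, using only the definitions above, runs as follows: split each deviation from the grand mean into a within-group part and a between-group part, then expand the square.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \bigl[(x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x})\bigr]^2 \\
    &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
     + \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2
     + 2 \sum_{i=1}^{k} (\bar{x}_i - \bar{x}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) \\
    &= SSW + SSB + 0
\end{aligned}
```

The cross term vanishes because deviations from a group's own mean sum to zero within that group: $\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) = 0$ for every $i$.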

Degrees of freedom.

  • Between: $df_B = k - 1$.
  • Within: $df_W = N - k$.
  • Total: $df_T = N - 1$.

Mean squares.

$$MSB = \frac{SSB}{df_B}, \quad MSW = \frac{SSW}{df_W}$$

F-statistic.

$$F = \frac{MSB}{MSW}$$

Under $H_0$ (all group means equal), $F$ follows an $F$ distribution with $df_B$ and $df_W$ degrees of freedom.

Assumptions of ANOVA

ANOVA assumes:

  1. Independence. Observations within and across groups are independent.
  2. Normality. The values within each group are approximately normally distributed.
  3. Homogeneity of variance. All groups have the same population variance (often tested with Levene's test).
  4. Continuous dependent variable. The variable being averaged is at the interval or ratio level.

Violations of normality and equal variance are tolerable when sample sizes are large and roughly equal. Severe violations may require transformations of the data or non-parametric alternatives (the Kruskal-Wallis test).

6.6 Setting up the ANOVA table

The standard presentation of ANOVA results is the ANOVA table:

| Source | Sum of Squares | df | Mean Square | F |
|---|---|---|---|---|
| Between groups | SSB | $k - 1$ | $MSB = SSB / (k-1)$ | $F = MSB / MSW$ |
| Within groups | SSW | $N - k$ | $MSW = SSW / (N-k)$ | |
| Total | SST | $N - 1$ | | |

Worked example — one-way ANOVA

Three machine-learning algorithms for intrusion detection are evaluated. Each is run on 5 independent test sets (different random splits of the same data). F1-scores are recorded:

| Algorithm A | Algorithm B | Algorithm C |
|---|---|---|
| 0.82 | 0.78 | 0.86 |
| 0.85 | 0.80 | 0.88 |
| 0.83 | 0.76 | 0.85 |
| 0.84 | 0.79 | 0.87 |
| 0.81 | 0.77 | 0.89 |

Test whether the three algorithms have the same mean F1-score.

  • $H_0$: $\mu_A = \mu_B = \mu_C$.
  • $H_1$: at least one mean differs.
  • $\alpha = 0.05$.

Compute group means:

  • $\bar{x}_A = (0.82 + 0.85 + 0.83 + 0.84 + 0.81)/5 = 4.15/5 = 0.830$.
  • $\bar{x}_B = (0.78 + 0.80 + 0.76 + 0.79 + 0.77)/5 = 3.90/5 = 0.780$.
  • $\bar{x}_C = (0.86 + 0.88 + 0.85 + 0.87 + 0.89)/5 = 4.35/5 = 0.870$.

Grand mean: $\bar{x} = (4.15 + 3.90 + 4.35)/15 = 12.40/15 \approx 0.8267$.

Between-group sum of squares:

Carrying the grand mean as $12.40/15$ without intermediate rounding:

$$
\begin{aligned}
SSB &= 5(0.830 - 0.8267)^2 + 5(0.780 - 0.8267)^2 + 5(0.870 - 0.8267)^2 \\
    &= 5(0.003333)^2 + 5(-0.046667)^2 + 5(0.043333)^2 \\
    &= 5(0.0000111) + 5(0.0021778) + 5(0.0018778) \\
    &= 0.0000556 + 0.0108889 + 0.0093889 \approx 0.02033
\end{aligned}
$$

Within-group sum of squares:

For group A (mean 0.830):

  • $(0.82 - 0.830)^2 = 0.0001$
  • $(0.85 - 0.830)^2 = 0.0004$
  • $(0.83 - 0.830)^2 = 0.0000$
  • $(0.84 - 0.830)^2 = 0.0001$
  • $(0.81 - 0.830)^2 = 0.0004$
  • Sum: 0.0010

For group B (mean 0.780):

  • $(0.78 - 0.780)^2 = 0.0000$
  • $(0.80 - 0.780)^2 = 0.0004$
  • $(0.76 - 0.780)^2 = 0.0004$
  • $(0.79 - 0.780)^2 = 0.0001$
  • $(0.77 - 0.780)^2 = 0.0001$
  • Sum: 0.0010

For group C (mean 0.870):

  • $(0.86 - 0.870)^2 = 0.0001$
  • $(0.88 - 0.870)^2 = 0.0001$
  • $(0.85 - 0.870)^2 = 0.0004$
  • $(0.87 - 0.870)^2 = 0.0000$
  • $(0.89 - 0.870)^2 = 0.0004$
  • Sum: 0.0010

$SSW = 0.0010 + 0.0010 + 0.0010 = 0.0030$.

Degrees of freedom: $df_B = 3 - 1 = 2$, $df_W = 15 - 3 = 12$.

Mean squares:

$$MSB = 0.020333 / 2 \approx 0.010167, \quad MSW = 0.0030 / 12 = 0.00025$$

F-statistic:

$$F = 0.010167 / 0.00025 \approx 40.67$$

ANOVA table:

| Source | SS | df | MS | F |
|---|---|---|---|---|
| Between groups | 0.02033 | 2 | 0.01017 | 40.67 |
| Within groups | 0.0030 | 12 | 0.00025 | |
| Total | 0.02333 | 14 | | |

Critical value at $\alpha = 0.05$ with df = (2, 12): $F_{0.05, 2, 12} \approx 3.89$.

$F = 40.67 \gg 3.89$. Strongly reject $H_0$.

Conclusion. The three algorithms have significantly different mean F1-scores. Algorithm C has the highest (0.870), followed by A (0.830) and B (0.780).
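The whole calculation can be checked with a short function that follows the one-way ANOVA formulas step by step. This is a sketch in plain Python; the function name and the dictionary keys are assumptions made for illustration. Because it carries full precision, its last digit may differ slightly from hand arithmetic done with rounded intermediates.

```python
def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_b, df_w = k - 1, n_total - k
    msb, msw = ssb / df_b, ssw / df_w
    return {"SSB": ssb, "SSW": ssw, "df_B": df_b, "df_W": df_w,
            "MSB": msb, "MSW": msw, "F": msb / msw}


# F1-scores from the worked example.
scores = [
    [0.82, 0.85, 0.83, 0.84, 0.81],  # Algorithm A
    [0.78, 0.80, 0.76, 0.79, 0.77],  # Algorithm B
    [0.86, 0.88, 0.85, 0.87, 0.89],  # Algorithm C
]
result = one_way_anova(scores)

critical = 3.89  # F table, alpha = 0.05, df = (2, 12)
print(f"SSB = {result['SSB']:.5f}, SSW = {result['SSW']:.4f}, F = {result['F']:.2f}")
print("reject H0:", result["F"] > critical)
```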

Post-hoc tests

ANOVA tells us at least one group differs but not which one. Post-hoc tests identify the specific differences.

Common post-hoc tests:

  • Tukey's HSD (Honestly Significant Difference). Compares all pairs while controlling the family-wise error rate. Standard default.
  • Bonferroni-adjusted pairwise t-tests. Conservative; divides $\alpha$ by the number of comparisons.
  • Scheffé's test. Most conservative; useful when comparing complex contrasts.
  • Dunnett's test. Compares each treatment group to a single control.
  • Fisher's LSD. Liberal; appropriate only when the omnibus F-test is significant.

For the example above, post-hoc tests would confirm that all three pairs (A vs B, A vs C, B vs C) differ significantly.

6.7 Coding method

For computation by hand or by simple calculator, the coding method simplifies the arithmetic by transforming the data to smaller numbers before computation. The final results are then untransformed.

The two standard codings:

Subtracting a constant. If all values are large (e.g., NEA peak demand in MW, in the thousands), subtract a constant $c$ from each. Means shift by $c$; sums of squared deviations are unaffected, whatever constant is chosen.

For data $x_i$, let $u_i = x_i - c$. Then:

  • $\bar{u} = \bar{x} - c$.
  • $\sum (u_i - \bar{u})^2 = \sum (x_i - \bar{x})^2$.

The variance, standard deviation, and sums of squared deviations are identical under translation.

Scaling by a constant. If all values are inconveniently small or scaled differently, divide each by a constant $d$ (equivalently, multiply by $1/d$). Means scale by $1/d$; variances scale by $1/d^2$.

For $u_i = x_i / d$:

  • $\bar{u} = \bar{x} / d$.
  • $s_u^2 = s_x^2 / d^2$.

Combined coding. $u_i = (x_i - c) / d$.

Coding in ANOVA

For ANOVA with large or awkward numbers, code the data first. Compute SS in coded units. Sum of squares (involving squared deviations from a mean) is invariant to translation; only the scaling matters. If we scaled by $d$, multiply the SS in coded units by $d^2$ to get SS in original units.

The $F$-statistic is invariant to both translation and scaling: being a ratio of mean squares, the $d^2$ in numerator and denominator cancels.

In practice, with statistical software running on standard machines, coding is rarely necessary. The technique appears in older textbooks for hand computation; it remains useful when teaching ANOVA arithmetic in a classroom.

Example

For the F1-score data in Section 6.6, every value is around 0.8. Code by $u = (x - 0.8) \times 100$. The coded data:

| Algorithm A | Algorithm B | Algorithm C |
|---|---|---|
| 2 | -2 | 6 |
| 5 | 0 | 8 |
| 3 | -4 | 5 |
| 4 | -1 | 7 |
| 1 | -3 | 9 |

Means: $\bar{u}_A = 3$, $\bar{u}_B = -2$, $\bar{u}_C = 7$. Grand mean: $\bar{u} = (15 - 10 + 35)/15 = 8/3 \approx 2.667$.

The F-statistic from the coded data is the same as from the original. In the notation above the coding is $u = (x - c)/d$ with $c = 0.8$ and $d = 0.01$, so sums of squares in coded units are $1/d^2 = 10000$ times those in original units; multiplying a coded sum of squares by $d^2 = 0.0001$ decodes it. Ratios of mean squares, which is what $F$ is, are unchanged.
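The invariance can be verified numerically. The sketch below codes the F1-scores with $u = (x - 0.8) \times 100$, computes the sums of squares and $F$ for both versions, and compares them; the helper name is an assumption, and `round` is used only to remove floating-point noise from the coded integers.

```python
def anova_sums(groups):
    """Return (SSB, SSW, F) for a one-way layout."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    f_stat = (ssb / (len(groups) - 1)) / (ssw / (n_total - len(groups)))
    return ssb, ssw, f_stat


raw = [
    [0.82, 0.85, 0.83, 0.84, 0.81],
    [0.78, 0.80, 0.76, 0.79, 0.77],
    [0.86, 0.88, 0.85, 0.87, 0.89],
]
# Coding: subtract c = 0.8, then divide by d = 0.01 (i.e. multiply by 100).
coded = [[round((x - 0.8) * 100) for x in g] for g in raw]

ssb_raw, ssw_raw, f_raw = anova_sums(raw)
ssb_coded, ssw_coded, f_coded = anova_sums(coded)

d_squared = 0.01 ** 2
print("coded data:", coded)
print(f"SSW coded = {ssw_coded:.1f}, decoded = {ssw_coded * d_squared:.4f}")
print(f"F raw = {f_raw:.3f}, F coded = {f_coded:.3f}")
```

The coded sums of squares are whole-number friendly, which is the point of the technique for hand computation, while the two $F$ values agree up to floating-point error.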

6.8 Two-way ANOVA

One-way ANOVA tests the effect of one factor. Two-way ANOVA tests the effects of two factors simultaneously and their interaction.

Two-way ANOVA setup

Suppose two factors:

  • Factor A with $a$ levels.
  • Factor B with $b$ levels.

Each combination of levels is a cell, with $n$ observations per cell. Total observations: $N = a \cdot b \cdot n$.

The model:

$$x_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}$$

where:

  • $\mu$ is the grand mean.
  • $\alpha_i$ is the main effect of Factor A's level $i$.
  • $\beta_j$ is the main effect of Factor B's level $j$.
  • $(\alpha\beta)_{ij}$ is the interaction effect.
  • $\epsilon_{ijk}$ is the residual.

Three null hypotheses to test:

  • $H_0^A$: All $\alpha_i = 0$ (no main effect of Factor A).
  • $H_0^B$: All $\beta_j = 0$ (no main effect of Factor B).
  • $H_0^{AB}$: All $(\alpha\beta)_{ij} = 0$ (no interaction).

Two-way ANOVA table

The sums of squares decompose:

$$SST = SSA + SSB + SSAB + SSE$$

Standard ANOVA table:

| Source | SS | df | MS | F |
|---|---|---|---|---|
| Factor A | SSA | $a - 1$ | $MSA = SSA/(a-1)$ | $F_A = MSA/MSE$ |
| Factor B | SSB | $b - 1$ | $MSB = SSB/(b-1)$ | $F_B = MSB/MSE$ |
| Interaction AB | SSAB | $(a-1)(b-1)$ | $MSAB = SSAB / [(a-1)(b-1)]$ | $F_{AB} = MSAB/MSE$ |
| Error | SSE | $ab(n-1)$ | $MSE = SSE/[ab(n-1)]$ | |
| Total | SST | $N - 1$ | | |

Each F-statistic is compared to the appropriate $F$ critical value.

Main effects vs interactions

Main effect. Average effect of one factor across the levels of the other. "Algorithm matters" (regardless of which dataset).

Interaction. The effect of one factor depends on the level of the other. "Algorithm A is best on dataset X but Algorithm B is best on dataset Y" — the algorithm effect interacts with the dataset effect.

A significant interaction modifies how the main effects are interpreted. If the interaction is significant, the main effects must be discussed in light of which level of the other factor is considered.

Worked example — two-way ANOVA

A study evaluates two ML algorithms (Factor A: algorithm with levels Random Forest and XGBoost) on two types of fraud datasets (Factor B: dataset type with levels Type-1 = credit card fraud, Type-2 = mobile-wallet fraud). Three F1-scores per combination:

| | Type-1 (B1) | Type-2 (B2) |
|---|---|---|
| Random Forest (A1) | 0.82, 0.84, 0.83 | 0.75, 0.77, 0.76 |
| XGBoost (A2) | 0.85, 0.87, 0.86 | 0.88, 0.90, 0.89 |

Cell means:

  • $\bar{x}_{A1B1} = 0.83$, $\bar{x}_{A1B2} = 0.76$.
  • $\bar{x}_{A2B1} = 0.86$, $\bar{x}_{A2B2} = 0.89$.

Marginal means:

  • $\bar{x}_{A1} = (0.83 + 0.76)/2 = 0.795$ (Random Forest overall).
  • $\bar{x}_{A2} = (0.86 + 0.89)/2 = 0.875$ (XGBoost overall).
  • $\bar{x}_{B1} = (0.83 + 0.86)/2 = 0.845$ (Type-1 overall).
  • $\bar{x}_{B2} = (0.76 + 0.89)/2 = 0.825$ (Type-2 overall).

Grand mean: $\bar{x} = (0.83 + 0.76 + 0.86 + 0.89)/4 = 0.835$.

With $a = 2$, $b = 2$, $n = 3$ per cell:

SSA (algorithm):

$$SSA = nb \sum_{i} (\bar{x}_{A_i} - \bar{x})^2 = 3 \cdot 2 \cdot [(0.795 - 0.835)^2 + (0.875 - 0.835)^2]$$

$$= 6 \cdot [0.0016 + 0.0016] = 6 \cdot 0.0032 = 0.0192$$

SSB (dataset):

$$SSB = na \sum_{j} (\bar{x}_{B_j} - \bar{x})^2 = 3 \cdot 2 \cdot [(0.845 - 0.835)^2 + (0.825 - 0.835)^2]$$

$$= 6 \cdot [0.0001 + 0.0001] = 6 \cdot 0.0002 = 0.0012$$

SSAB (interaction):

$$SSAB = n \sum_{i,j} (\bar{x}_{ij} - \bar{x}_{A_i} - \bar{x}_{B_j} + \bar{x})^2$$

For each cell:

  • (A1, B1): $0.83 - 0.795 - 0.845 + 0.835 = 0.025$, squared: 0.000625.
  • (A1, B2): $0.76 - 0.795 - 0.825 + 0.835 = -0.025$, squared: 0.000625.
  • (A2, B1): $0.86 - 0.875 - 0.845 + 0.835 = -0.025$, squared: 0.000625.
  • (A2, B2): $0.89 - 0.875 - 0.825 + 0.835 = 0.025$, squared: 0.000625.

Sum: 0.0025. Multiply by $n = 3$: $SSAB = 3 \cdot 0.0025 = 0.0075$.

SSE (error, within cells):

Compute deviations of individual observations from their cell mean.

  • For (A1, B1), cell mean 0.83: deviations = -0.01, 0.01, 0.00. Squared: 0.0001 + 0.0001 + 0 = 0.0002.
  • For (A1, B2), cell mean 0.76: deviations = -0.01, 0.01, 0.00. Squared: 0.0002.
  • For (A2, B1), cell mean 0.86: deviations = -0.01, 0.01, 0.00. Squared: 0.0002.
  • For (A2, B2), cell mean 0.89: deviations = -0.01, 0.01, 0.00. Squared: 0.0002.

$SSE = 0.0002 \cdot 4 = 0.0008$.

ANOVA table:

| Source | SS | df | MS | F |
|---|---|---|---|---|
| A (algorithm) | 0.0192 | 1 | 0.0192 | 192 |
| B (dataset) | 0.0012 | 1 | 0.0012 | 12 |
| Interaction AB | 0.0075 | 1 | 0.0075 | 75 |
| Error | 0.0008 | 8 | 0.0001 | |
| Total | 0.0287 | 11 | | |

(MSE = 0.0008/8 = 0.0001.)

Critical value at $\alpha = 0.05$ with df = (1, 8): $F_{0.05, 1, 8} \approx 5.32$.

Decisions.

  • $F_A = 192 \gg 5.32$. Reject $H_0^A$. Algorithm has a significant main effect.
  • $F_B = 12 > 5.32$. Reject $H_0^B$. Dataset has a significant main effect.
  • $F_{AB} = 75 > 5.32$. Reject $H_0^{AB}$. The interaction is significant.

Conclusion. All three effects are significant. XGBoost outperforms Random Forest on average; Type-1 fraud is easier than Type-2 on average. But the significant interaction means these main effects must be interpreted carefully: Random Forest performs much better on Type-1 (0.83) than on Type-2 (0.76), while XGBoost performs slightly better on Type-2 (0.89) than on Type-1 (0.86). XGBoost leads on both dataset types, but the size of its lead depends on the dataset type: 0.03 on Type-1 against 0.13 on Type-2.

The interaction is the substantively most interesting finding: XGBoost is the right choice especially when the dataset is mobile-wallet fraud, where the algorithm's advantage over Random Forest is largest.
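The two-way decomposition can be checked directly from the twelve raw scores. The sketch below assumes a balanced design (the same number of replicates in every cell); the function and key names are illustrative, not taken from any library. The grand mean it uses is the average of all twelve observations, 0.835.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def two_way_anova(cells):
    """cells[i][j] holds the replicates for level i of A and level j of B (balanced)."""
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    cell_means = [[mean(c) for c in row] for row in cells]
    a_means = [mean(row) for row in cell_means]          # marginal means of A
    b_means = [mean(col) for col in zip(*cell_means)]    # marginal means of B
    grand = mean(a_means)

    ssa = n * b * sum((m - grand) ** 2 for m in a_means)
    ssb = n * a * sum((m - grand) ** 2 for m in b_means)
    ssab = n * sum(
        (cell_means[i][j] - a_means[i] - b_means[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    sse = sum(
        (x - cell_means[i][j]) ** 2
        for i in range(a) for j in range(b) for x in cells[i][j]
    )
    mse = sse / (a * b * (n - 1))
    return {
        "grand_mean": grand, "SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse,
        "F_A": (ssa / (a - 1)) / mse,
        "F_B": (ssb / (b - 1)) / mse,
        "F_AB": (ssab / ((a - 1) * (b - 1))) / mse,
    }


# Rows: Random Forest, XGBoost. Columns: Type-1, Type-2.
f1_scores = [
    [[0.82, 0.84, 0.83], [0.75, 0.77, 0.76]],
    [[0.85, 0.87, 0.86], [0.88, 0.90, 0.89]],
]
res = two_way_anova(f1_scores)
for key in ("SSA", "SSB", "SSAB", "SSE"):
    print(f"{key} = {res[key]:.4f}")
print(f"F_A = {res['F_A']:.1f}, F_B = {res['F_B']:.1f}, F_AB = {res['F_AB']:.1f}")
```

All three F-statistics sit far above the critical value of 5.32, so the decisions do not hinge on small rounding differences.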

Visualising interactions

The standard visualisation is an interaction plot — line plot with the dependent variable on the y-axis, one factor on the x-axis, and one line per level of the other factor. Parallel lines indicate no interaction. Crossing or diverging lines indicate interaction.

For the example above, plotting cell means with algorithm on the x-axis and one line per dataset would show the lines crossing: the Type-1 line sits above the Type-2 line at Random Forest (0.83 vs 0.76) and below it at XGBoost (0.86 vs 0.89).

Beyond two-way ANOVA

Higher-order ANOVAs (three-way, four-way) follow the same principles. Each factor adds main effects, all interactions among the factors, and higher-order interactions. Interpretation becomes harder as the number of factors grows.

Mixed-effects models generalise ANOVA further, allowing some factors to be treated as random rather than fixed — useful when the levels of a factor are a sample from a larger set.

MANOVA (multivariate analysis of variance) tests for differences in means across multiple dependent variables at once.

In modern research, these methods are typically implemented in statistical software (R, SPSS, SAS, Stata, Python with statsmodels) rather than by hand. The hand-computation worked examples in this chapter exist for pedagogical clarity — to show that the F-statistic is not magic, just a ratio of variance components, computable from first principles.

The next chapter turns from analysis to communication — the reporting of results, the publication process, the management of the research project, and the standards of scientific dissemination.
