Chapter 5 — Testing of Hypotheses
Most quantitative research ends with a question of the form "is this difference real, or could it have arisen by chance?" Hypothesis testing is the formal procedure that answers such questions. The framework was developed in the early twentieth century by R.A. Fisher, Jerzy Neyman, and Egon Pearson; refined in countless statistics textbooks since; and remains the standard of practice in engineering, biomedical, social science, and policy research. This chapter covers the framework, the standard tests, and the typical traps. The chapter is heavily numerical, with worked examples that demonstrate the standard tests on data shaped like real research problems in Nepal.
5.1 Definition of hypothesis
Hypothesis
A hypothesis is a tentative, testable, and falsifiable statement that proposes a relationship between variables or a value for a population parameter, formulated before data collection so that the analysis can either support or contradict it.
A hypothesis is a claim that data can refute. "More awareness training reduces phishing-click rates among Nepali bank employees" is a hypothesis. "Cybersecurity is important" is not — it is a value statement, not a refutable claim about the world.
The Chapter 1 definition is the substantive one. The statistical apparatus in this chapter operationalises it.
Null and alternative hypotheses
Hypothesis testing pits two competing statements against each other.
The null hypothesis ($H_0$) is the statement that there is no effect, no difference, or no relationship in the population. It is the default position that the researcher must produce evidence against.
The alternative hypothesis ($H_1$ or $H_a$) is the statement that there is an effect, a difference, or a relationship. It is the position the researcher must accumulate evidence for.
The framework is asymmetric. The null hypothesis is "innocent until proven guilty": the default that holds unless the data shows otherwise. The alternative hypothesis is "what we are trying to demonstrate." If the data does not provide enough evidence against $H_0$, the researcher "fails to reject $H_0$", which is not the same as "accepts $H_0$ as true."
Example. A researcher tests whether a new fraud-detection model has a different F1-score from the existing model.
- $H_0$: F1-score of the new model = F1-score of the existing model.
- $H_1$: F1-score of the new model ≠ F1-score of the existing model.
The data either leads to rejecting $H_0$ (the new model is significantly different) or fails to reject it (no evidence of a difference).
Simple and composite hypotheses
Simple hypothesis. Specifies a single value for the parameter. "$H_0$: mean response time = 250 ms."
Composite hypothesis. Specifies a set of values. "$H_1$: mean response time > 250 ms": every value above 250 satisfies the hypothesis.
One-tailed vs two-tailed hypotheses
Two-tailed test. $H_1$ is non-directional. "$H_1$: mean ≠ 250 ms." The rejection region is in both tails of the distribution.
One-tailed test. $H_1$ is directional. "$H_1$: mean > 250 ms" (right-tailed) or "$H_1$: mean < 250 ms" (left-tailed). The rejection region is in one tail only.
One-tailed tests have more power to detect an effect in the specified direction but cannot detect an effect in the opposite direction. The choice between one-tailed and two-tailed must be made before looking at the data, on the basis of the research question.
For most research, two-tailed tests are the default. One-tailed tests are used when there is a strong prior reason to expect the effect in a specific direction and the opposite direction would be of no interest.
5.2 Basic concepts in hypothesis testing
Several concepts run through every test.
Type I and Type II errors
The test has four possible outcomes, depending on the true state and the test's decision:
| Decision | $H_0$ is true | $H_0$ is false |
|---|---|---|
| Reject $H_0$ | Type I error (false positive) | Correct decision |
| Fail to reject $H_0$ | Correct decision | Type II error (false negative) |
A Type I error is rejecting the null hypothesis when it is actually true — concluding there is an effect when there is none, also called a false positive.
A Type II error is failing to reject the null hypothesis when it is actually false — failing to detect a real effect, also called a false negative.
The probability of a Type I error is denoted $\alpha$ (alpha). The probability of a Type II error is denoted $\beta$ (beta).
Significance level
The significance level ($\alpha$) is the maximum acceptable probability of making a Type I error, chosen by the researcher before the test, with $\alpha = 0.05$ as the standard convention.
A significance level of 0.05 means: "I am willing to falsely declare an effect (Type I error) at most 5% of the time when in fact there is no effect."
Common choices:
- $\alpha = 0.05$: the standard in most fields.
- $\alpha = 0.01$: used when false positives are particularly costly.
- $\alpha = 0.10$: used in exploratory analysis or with small samples.
The choice of $\alpha$ is a value judgement about how much false-positive risk is tolerable. It is not derivable from the data.
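The meaning of the significance level can be checked by simulation. The sketch below is an illustration only: it assumes a normal population, a known standard deviation, and a two-tailed z-test, and the function name and parameter values are hypothetical. It repeatedly samples from a population in which the null hypothesis really is true and counts how often the test rejects anyway.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(alpha=0.05, n=30, trials=20000, seed=1):
    """Estimate how often a two-tailed z-test rejects a null hypothesis that is true."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    mu0, sigma = 0.0, 1.0  # assumed population; H0 (mean = mu0) holds by construction
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1  # every rejection here is a Type I error
    return rejections / trials


rate = type_i_error_rate()
print(f"Empirical Type I error rate: {rate:.3f}")
```

The empirical rate should land close to the chosen significance level, since that is what the level controls: the long-run share of false positives when there is no effect.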
Power and Type II error rate
The power of a test is the probability of correctly rejecting the null hypothesis when it is false, equal to $1 - \beta$, conventionally set at or above 0.80 in research planning.
Power depends on:
- The true effect size: larger effects are easier to detect.
- The sample size: larger samples have more power.
- The significance level: a stricter $\alpha$ (smaller value) reduces power.
- The variance of the data: less variance gives more power.
A power analysis at the design stage (Chapter 2) ensures that a planned study has enough power to detect the effect of interest. A study with low power can produce a non-significant result for either of two very different reasons — the effect is genuinely absent, or the effect is real but the study was too small to detect it.
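For the simplest case, a two-tailed one-sample z-test with a known standard deviation, power can be computed directly from the normal distribution. The sketch below assumes that setting. The function names and the example figures (a 10 ms shift against a 40 ms standard deviation) are hypothetical and serve only to show how the four factors listed above interact.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample z-test when the true mean is shifted by `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect * sqrt(n) / sigma  # true shift expressed in standard-error units
    # Probability that the test statistic falls in either tail of the rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


def smallest_n_for_power(effect, sigma, target=0.80, alpha=0.05):
    """Smallest sample size whose power reaches the target (simple linear search)."""
    n = 2
    while z_test_power(effect, sigma, n, alpha) < target:
        n += 1
    return n


for n in (30, 100, 300):
    print(n, round(z_test_power(effect=10, sigma=40, n=n), 3))
print("n needed for 80% power:", smallest_n_for_power(effect=10, sigma=40))
```

When the effect is zero the function returns the significance level itself, which is a useful sanity check: with no real effect, the only rejections are Type I errors.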
Test statistic
A test statistic is a numerical summary computed from the sample data that has a known distribution under the null hypothesis, used to decide whether the data are consistent with or constitute evidence against it.
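Each test has its own test statistic: $z$ for the $z$-test, $t$ for the $t$-test, $\chi^2$ for the chi-square test, $F$ for the $F$-test. The test statistic measures, in standardised units, how far the observed data is from what $H_0$ would predict.
For the $z$ and $t$ statistics, a value far from zero is evidence against $H_0$ and a value close to zero is consistent with $H_0$. The $\chi^2$ and $F$ statistics are never negative, so for them "far" means far from the value expected under $H_0$.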
p-value
The p-value is the probability, under the assumption that the null hypothesis is true, of observing a test statistic at least as extreme as the one actually observed.
The $p$-value translates the test statistic into a probability scale that is directly comparable to $\alpha$.
Decision rule.
- If $p \le \alpha$, reject $H_0$. The observed data is unlikely under $H_0$.
- If $p > \alpha$, fail to reject $H_0$. The data is consistent with $H_0$.
A $p$-value of 0.03 means: "If $H_0$ were true, we would see a test statistic this extreme only 3% of the time." If our chosen $\alpha$ is 0.05, this is unusual enough to reject $H_0$.
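As a minimal sketch of the decision rule, the code below converts a z-statistic into a p-value and compares it with the significance level. It assumes the test statistic follows the standard normal distribution under the null hypothesis. The function names and the example statistic of 2.17 are illustrative choices, not part of any particular study.

```python
from statistics import NormalDist


def p_value_from_z(z, tail="two"):
    """p-value of a z-statistic for a two-tailed, right-tailed, or left-tailed test."""
    std_normal = NormalDist()
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "left":
        return std_normal.cdf(z)
    raise ValueError("tail must be 'two', 'right', or 'left'")


def decide(p, alpha=0.05):
    """Apply the decision rule: reject when the p-value does not exceed alpha."""
    return "reject H0" if p <= alpha else "fail to reject H0"


p = p_value_from_z(2.17)
print(f"p = {p:.3f}: {decide(p)}")
```

A two-tailed z-statistic of 2.17 gives a p-value of about 0.03, the situation described in the text: rejected at a significance level of 0.05, but not at 0.01.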
Common misinterpretations.
The $p$-value is not the probability that $H_0$ is true. That would be a Bayesian posterior, which requires a prior probability for $H_0$. The $p$-value is a frequentist quantity: the probability of the data given $H_0$, not the probability of $H_0$ given the data.
The $p$-value is not the probability of making an error. The error probabilities are $\alpha$ and $\beta$.
A small $p$-value does not mean the effect is large or important. It means only that the data is hard to reconcile with an effect of exactly zero. The effect could be tiny but statistically significant if the sample is huge.
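The last point can be made concrete with a few lines of arithmetic. The sketch below holds a small observed difference fixed and varies only the sample size. The numbers (a 0.5 ms difference against a 40 ms standard deviation) are hypothetical, and a two-tailed z-test with known standard deviation is assumed.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_p(diff, sigma, n):
    """Two-tailed p-value for an observed mean difference `diff` with known sigma."""
    z = diff / (sigma / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


diff, sigma = 0.5, 40.0  # hypothetical: a 0.5 ms shift, standard deviation 40 ms
print(f"standardised effect size: {diff / sigma:.4f}")  # identical for every n
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}  p = {two_tailed_p(diff, sigma, n):.4f}")
```

The effect size is the same in every row; only the p-value changes. This is why a significant result should always be reported alongside the size of the effect.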
Critical value and rejection region
The critical value is the threshold beyond which the test statistic is considered too extreme to be consistent with $H_0$. The rejection region is the set of test-statistic values that lead to rejection of $H_0$.
For a two-tailed test with $\alpha = 0.05$ using the standard normal distribution: critical values are $\pm 1.96$. Reject $H_0$ if $|z| > 1.96$. The rejection region is $z < -1.96$ or $z > 1.96$.
For a one-tailed test with $\alpha = 0.05$ (right-tailed): critical value is 1.645. Reject $H_0$ if $z > 1.645$.
The $p$-value and critical-value approaches give the same decision. They are alternative views of the same procedure.
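The equivalence of the two approaches can be checked directly. The sketch below assumes a two-tailed z-test and applies both rules to a handful of arbitrary test-statistic values. The function names are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def reject_by_critical_value(z, alpha=0.05):
    """Reject when the statistic lies at or beyond the two-tailed critical value."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return abs(z) >= z_crit


def reject_by_p_value(z, alpha=0.05):
    """Reject when the two-tailed p-value does not exceed alpha."""
    p = 2 * (1 - STD_NORMAL.cdf(abs(z)))
    return p <= alpha


test_statistics = [-2.5, -1.0, 0.3, 1.9, 2.2]
decisions = [(z, reject_by_critical_value(z), reject_by_p_value(z)) for z in test_statistics]
for z, by_critical, by_p in decisions:
    print(f"z = {z:+.1f}  critical-value rule: {by_critical}  p-value rule: {by_p}")
```

Both columns agree for every value, because the critical value is simply the test statistic whose p-value equals the significance level.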
Sampling distribution
A sampling distribution is the probability distribution of a statistic (such as the sample mean) computed from all possible samples of a given size drawn from the population, the foundation on which hypothesis tests rest.
For a sample mean $\bar{x}$ computed from a sample of size $n$ drawn from a population with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of $\bar{x}$ has mean $\mu$ and standard deviation $\sigma / \sqrt{n}$, known as the standard error.
By the Central Limit Theorem, the sampling distribution of $\bar{x}$ approaches a normal distribution as $n$ grows, regardless of the population's distribution (provided the population has finite variance). This is why the $z$-test and $t$-test work for a wide range of populations.
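A short simulation illustrates both claims. The sketch below draws repeated samples from a strongly right-skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen here purely as an assumption for illustration) and compares the spread of the sample means with the theoretical standard error.

```python
import random
from math import sqrt
from statistics import mean, stdev


def sample_means(n, draws=5000, seed=7):
    """Means of `draws` independent samples of size n from an exponential population."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    # Exponential with rate 1: population mean 1, standard deviation 1, right-skewed.
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(draws)]


for n in (5, 30, 100):
    means = sample_means(n)
    print(
        f"n = {n:>3}  mean of sample means = {mean(means):.3f}  "
        f"sd of sample means = {stdev(means):.3f}  sigma/sqrt(n) = {1 / sqrt(n):.3f}"
    )
```

The mean of the sample means stays near the population mean, and their standard deviation tracks the standard error closely, even though the underlying population is far from normal.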
5.3 Procedure for hypothesis testing
The general procedure has six steps. Every parametric test follows the same shape.
1. State the hypotheses. Specify $H_0$ and $H_1$. Decide one-tailed or two-tailed.
2. Choose the significance level. $\alpha = 0.05$ is the default.
3. Identify the appropriate test. Based on the type of data, the parameter being tested, the sample size, and the assumptions that can be reasonably made.
4. Compute the test statistic. From the sample data, using the test's formula.
5. Compare to the critical value or compute the $p$-value. Use the known distribution of the test statistic under $H_0$.
6. Make a decision and interpret. Reject or fail to reject $H_0$. State the substantive conclusion in plain language.
The steps are mechanical once the test is chosen. The judgement is in steps 1-3. The interpretation is in step 6.
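The six steps can be traced in a few lines of code. The sketch below reuses the response-time hypothesis from Section 5.1 with invented summary figures (a sample of 64 requests, sample mean 262 ms, population standard deviation assumed known at 40 ms). All of these numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: state the hypotheses. H0: mean = 250 ms; H1: mean != 250 ms (two-tailed).
mu0 = 250.0

# Step 2: choose the significance level.
alpha = 0.05

# Step 3: identify the test. Large sample, sigma assumed known: one-sample z-test.
n, sample_mean, sigma = 64, 262.0, 40.0  # hypothetical summary figures

# Step 4: compute the test statistic.
z = (sample_mean - mu0) / (sigma / sqrt(n))

# Step 5: compute the p-value from the standard normal distribution.
p = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 6: decide and interpret.
decision = "reject H0" if p <= alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.4f}: {decision}")
```

Here the standard error is 40 / 8 = 5 ms, so the 12 ms gap gives z = 2.4 and a p-value of roughly 0.016. In plain language: a sample mean this far from 250 ms would be unusual if the true mean were 250 ms, so the data supports a different mean response time.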
5.4 Important parametric tests
The four standard parametric tests cover most situations.
z-test
The z-test is a parametric test that uses the standard normal distribution as the sampling distribution of the test statistic, appropriate when the population standard deviation $\sigma$ is known or when the sample size is large enough (typically $n \ge 30$) that the sample standard deviation $s$ can be used in its place.
For testing whether a sample mean $\bar{x}$ differs from a hypothesised population mean $\mu_0$:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
When $\sigma$ is unknown and $n$ is large, replace $\sigma$ with the sample standard deviation $s$:
$$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
When to use:
- Sample size is large ($n \ge 30$).
- Or the population is known to be normal and $\sigma$ is known.
- Testing means or proportions.
Critical values for common $\alpha$:
| $\alpha$ | Two-tailed | One-tailed |
|---|---|---|
| 0.10 | $\pm 1.645$ | $1.282$ |
| 0.05 | $\pm 1.96$ | $1.645$ |
| 0.01 | $\pm 2.576$ | $2.326$ |
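Critical values of the standard normal distribution do not have to be memorised or looked up. As a small sketch, they can be computed from the inverse of the normal cumulative distribution function, here using `NormalDist.inv_cdf` from Python's standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

rows = []
for alpha in (0.10, 0.05, 0.01):
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # alpha split across both tails
    one_tailed = std_normal.inv_cdf(1 - alpha)      # all of alpha in the right tail
    rows.append((alpha, round(two_tailed, 3), round(one_tailed, 3)))

for alpha, two_tailed, one_tailed in rows:
    print(f"alpha = {alpha:.2f}  two-tailed: +/-{two_tailed}  one-tailed: {one_tailed}")
```

For a left-tailed test the one-tailed value is used with a negative sign, since the standard normal distribution is symmetric about zero.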
t-test
The t-test is a parametric test that uses Student's t distribution as the sampling distribution of the test statistic, appropriate when the population standard deviation is unknown and must be estimated from the sample, particularly suited to small samples ($n < 30$) from approximately normal populations.
The $t$-statistic looks similar to the $z$-statistic:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
The difference is the reference distribution. The $t$-distribution has heavier tails than the normal, which accounts for the extra uncertainty when $\sigma$ is estimated from a small sample.
The $t$-distribution depends on its degrees of freedom (df). For a one-sample $t$-test, df = $n - 1$. As df grows, the $t$-distribution approaches the normal distribution. For df > 30, the two are very close, which is why the $z$-test is acceptable for large samples even when $\sigma$ is unknown.
Critical values from the $t$-table depend on both $\alpha$ and df. A $t$-table or statistical software lookup is needed.
For df = 9 (a sample of 10):
- Two-tailed $\alpha = 0.05$: $t = \pm 2.262$.
- Two-tailed $\alpha = 0.01$: $t = \pm 3.250$.
For df = 29 (a sample of 30):
- Two-tailed $\alpha = 0.05$: $t = \pm 2.045$.
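A one-sample t-test on a small sample can be sketched as follows. The ten response times are invented for illustration. Python's standard library has no t-distribution, so the critical value 2.262 (two-tailed, significance level 0.05, df = 9) is taken from a standard t-table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return the t-statistic and degrees of freedom for a one-sample t-test."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)  # stdev uses the n - 1 denominator
    return (mean(sample) - mu0) / standard_error, n - 1


# Hypothetical response times in ms; H0: mean = 250 ms, H1: mean != 250 ms.
times = [248, 262, 255, 271, 259, 244, 266, 258, 263, 254]
T_CRIT_DF9_ALPHA05 = 2.262  # two-tailed critical value from a t-table (df = 9)

t_stat, df = one_sample_t(times, mu0=250)
reject = abs(t_stat) > T_CRIT_DF9_ALPHA05
print(f"t = {t_stat:.3f}, df = {df}, reject H0 at alpha = 0.05: {reject}")
```

The sample mean is 258 ms and the statistic comes out at about 3.11, beyond 2.262, so the null hypothesis is rejected at the 0.05 level. With a statistics package installed, the critical value or an exact p-value could be computed instead of read from a table.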
Chi-square test for variance
The $\chi^2$ test has several uses. As a test for variance:
$$\chi^2 = \frac{(n - 1)\,s^2}{\sigma_0^2}$$
It is used to test whether a sample variance $s^2$ differs from a hypothesised population variance $\sigma_0^2$. The test statistic follows a chi-square distribution with $n - 1$ degrees of freedom.
The chi-square distribution is right-skewed and depends on df. The critical values come from a chi-square table.
The chi-square test as a non-parametric test (for goodness-of-fit, independence, and homogeneity) is covered in Chapter 6.
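The variance test described in this section can be sketched in the same way. The data and the hypothesised standard deviation of 5 ms are invented for illustration, and the two critical values (2.700 and 19.023, for df = 9 and a two-tailed significance level of 0.05) are taken from a standard chi-square table because the standard library does not provide the distribution.

```python
from statistics import variance


def chi_square_variance_stat(sample, sigma0_squared):
    """Chi-square statistic (n - 1) * s^2 / sigma0^2 and its degrees of freedom."""
    n = len(sample)
    s_squared = variance(sample)  # sample variance with the n - 1 denominator
    return (n - 1) * s_squared / sigma0_squared, n - 1


# Hypothetical response times in ms; H0: variance = 25 (standard deviation 5 ms).
times = [248, 262, 255, 271, 259, 244, 266, 258, 263, 254]
CHI2_LOWER_DF9, CHI2_UPPER_DF9 = 2.700, 19.023  # table values, alpha = 0.05 two-tailed

chi2_stat, df = chi_square_variance_stat(times, sigma0_squared=25.0)
reject = chi2_stat < CHI2_LOWER_DF9 or chi2_stat > CHI2_UPPER_DF9
print(f"chi-square = {chi2_stat:.2f}, df = {df}, reject H0 at alpha = 0.05: {reject}")
```

Because the chi-square distribution is right-skewed, the two critical values are not symmetric about any centre. A two-tailed test therefore needs a separate lower and upper bound.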
F-test
The F-test is a parametric test that uses the F distribution as the sampling distribution of the test statistic, used to compare two variances and as the foundational test in analysis of variance (ANOVA).
For testing whether two population variances are equal, the test statistic is the ratio of sample variances:
$$F = \frac{s_1^2}{s_2^2}$$