# Testing the Difference of Two Population Means

### From MM*Stat International


The unknown parameter to be tested now is the difference of two expectations in two distinguishable populations, $\mu_1 - \mu_2$. Our parameter tests will be based on individual samples arising from these two populations; we will thus be dealing with two-sample tests.
There are many different ways of constructing tests for the difference in two population expectations. Our tests will be suited to the following assumptions:

- There are two populations. The random variable observed in the first, $X_1$, has expectation $\mu_1$ and variance $\sigma_1^2$; the parameters of the random variable observed in the second population, $X_2$, are $\mu_2$ and $\sigma_2^2$. We test for the difference in their expected values, because we have to regard $\mu_1$ and $\mu_2$ as unknown.
- The sizes of the two populations, $N_1$ and $N_2$, are sufficiently large to base the test procedures on simple random samples drawn without replacement. The sample sizes are denoted by $n_1$ and $n_2$, respectively.
- The two samples are independent. This means they are drawn independently of each other so as to not convey any cross-sample information.
- Either the random variates $X_1$ and $X_2$ are normally distributed ($X_1 \sim N(\mu_1; \sigma_1^2)$ and $X_2 \sim N(\mu_2; \sigma_2^2)$), or their distributions can be approximated sufficiently accurately by a normal distribution via the central limit theorem. For this to be feasible, the sample sizes $n_1$ and $n_2$ have to be sufficiently large.

There is a hypothesis about the difference, expressed in terms of a hypothetical value $\omega_0$: $\mu_1 - \mu_2 = \omega_0$. A special case of particular practical interest is that of hypothetical equality of the two population means, i.e. $\omega_0 = 0$. The test will be conducted at a significance level of $\alpha$.

### Hypotheses

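Depending on the application at hand, a two- or one-sided test will be carried out, where $\omega_0$ denotes the hypothetical difference:

1) Two-sided test: $H_0: \mu_1 - \mu_2 = \omega_0$ versus $H_1: \mu_1 - \mu_2 \neq \omega_0$
2) Right-sided test: $H_0: \mu_1 - \mu_2 \leq \omega_0$ versus $H_1: \mu_1 - \mu_2 > \omega_0$
3) Left-sided test: $H_0: \mu_1 - \mu_2 \geq \omega_0$ versus $H_1: \mu_1 - \mu_2 < \omega_0$

The choice of the appropriate test should be guided by the considerations laid out in the section on one-sample tests of $\mu$.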

### Test statistic and its distribution; decision regions

We have already shown that the estimator of the difference of two expectations, $D = \bar{X}_1 - \bar{X}_2$, where $\bar{X}_1$ and $\bar{X}_2$ are the sample means, has a normal distribution with expectation $E(D) = \mu_1 - \mu_2 = \omega$. Independence of the sample variables implies that the variance of the difference of the sample means is the sum of the variances of the sample means:

$$\sigma_D^2 = \operatorname{Var}(\bar{X}_1 - \bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$

Assume that $\omega_0$ is the true distance between the population expectations: $\mu_1 - \mu_2 = \omega_0$. Then $D$ follows a normal distribution with expectation $\omega_0$ and variance $\sigma_D^2$.

In constructing an appropriate test statistic, we have to make the same distinction concerning our knowledge about the standard deviations $\sigma_1$ and $\sigma_2$ as in the one-sample case. Let's start with the simplifying (and unrealistic) assumption that, for some miraculous reason, we know the standard deviations in both populations, $\sigma_1$ and $\sigma_2$.

If we know $\sigma_1$ and $\sigma_2$, the distribution of $D$ is fully specified as above, and we can standardize $D$ to ensure the applicability of numerical tables for the standard normal distribution:

$$V = \frac{D - \omega_0}{\sigma_D} = \frac{\bar{X}_1 - \bar{X}_2 - \omega_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

Under $H_0$, $V$ has (at least approximately) a *standard normal distribution*, and the table of numerical values of the cumulative standard normal distribution can be used to determine *critical values*. These normal quantiles translate into the following decision regions for tests at a significance level $\alpha$, where $v$ is the realized value of $V$ and $z_p$ is the $p$-quantile of the standard normal distribution:

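Test | Rejection region for $H_0$ | Non-rejection region for $H_0$ |
---|---|---|
Two-sided | $\{v : v < -z_{1-\alpha/2} \text{ or } v > z_{1-\alpha/2}\}$ | $\{v : -z_{1-\alpha/2} \leq v \leq z_{1-\alpha/2}\}$ |
Right-sided | $\{v : v > z_{1-\alpha}\}$ | $\{v : v \leq z_{1-\alpha}\}$ |
Left-sided | $\{v : v < -z_{1-\alpha}\}$ | $\{v : v \geq -z_{1-\alpha}\}$ |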
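
The known-variance test can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `two_sample_z_test`, its parameter names and the input numbers are assumptions made for this example and do not come from the text.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean1, mean2, sigma1, sigma2, n1, n2,
                      omega0=0.0, alpha=0.05, alternative="two-sided"):
    """Return (v, reject) for a test on mu1 - mu2 with known sigmas."""
    # Standard deviation of the difference of the two sample means
    sigma_d = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    # Standardized test statistic
    v = (mean1 - mean2 - omega0) / sigma_d
    std_normal = NormalDist()
    if alternative == "two-sided":
        reject = abs(v) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "right":
        reject = v > std_normal.inv_cdf(1 - alpha)
    elif alternative == "left":
        reject = v < -std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'right' or 'left'")
    return v, reject


# Hypothetical inputs, chosen only for illustration
v, reject = two_sample_z_test(52.0, 50.0, sigma1=4.0, sigma2=3.0, n1=40, n2=50)
print(round(v, 3), reject)
```

The `alternative` argument selects one of the three rows of the decision table; the default corresponds to the two-sided test.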

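If the standard deviations are unknown, we have to estimate the quantities $\sigma_1^2$ and $\sigma_2^2$ using their sample counterparts:

$$S_1^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2, \qquad S_2^2 = \frac{1}{n_2 - 1} \sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2$$

Assuming *homogeneity in variances*, i.e. the random variable under consideration has the same dispersion in both populations, $\sigma_1^2 = \sigma_2^2$, the estimation function $S^2$ of the joint variance is a weighted arithmetic average of the two variance estimators $S_1^2$ and $S_2^2$:

$$S^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$$

Thus, we can write the estimator $S_D$ of the standard deviation of $\bar{X}_1 - \bar{X}_2$ as

$$S_D = \sqrt{\frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \cdot \frac{n_1 + n_2}{n_1 n_2}}$$

The test statistic is then calculated as

$$V = \frac{\bar{X}_1 - \bar{X}_2 - \omega_0}{S_D}$$

where $\omega_0$ is the hypothetical difference. Under $H_0$, $V$ has a $t$-distribution with $n_1 + n_2 - 2$ degrees of freedom.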
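
As a minimal sketch of the equal-variance case, the pooled statistic can be computed from summary statistics as follows. The function name `pooled_t_statistic` and the numbers are hypothetical; the critical value still has to be taken from a $t$-table for the returned degrees of freedom.

```python
from math import sqrt


def pooled_t_statistic(mean1, mean2, var1, var2, n1, n2, omega0=0.0):
    """Return (v, df) under the assumption of equal population variances."""
    df = n1 + n2 - 2
    # Pooled variance: sample variances weighted by (n_i - 1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    # Estimated standard deviation of the difference of the sample means
    s_d = sqrt(pooled_var * (n1 + n2) / (n1 * n2))
    return (mean1 - mean2 - omega0) / s_d, df


# Hypothetical summary statistics, chosen only for illustration
v, df = pooled_t_statistic(10.0, 8.0, var1=4.0, var2=6.0, n1=10, n2=12)
print(round(v, 3), df)
```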
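Under the assumption of *heterogeneous variances*, $\sigma_1^2 \neq \sigma_2^2$, the standard deviation of $\bar{X}_1 - \bar{X}_2$ can only be estimated approximately as

$$S_D = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$$

Welch has suggested basing the test on this approximation and using

$$V = \frac{\bar{X}_1 - \bar{X}_2 - \omega_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

as the test statistic, where $\omega_0$ is the hypothetical difference.

Under the null hypothesis, the distribution of $V$ can be approximated by a $t$-distribution with $f$ degrees of freedom calculated as follows:

$$f = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^2}{\dfrac{1}{n_1 - 1} \left( \dfrac{s_1^2}{n_1} \right)^2 + \dfrac{1}{n_2 - 1} \left( \dfrac{s_2^2}{n_2} \right)^2}$$

In both cases (homogeneous and heterogeneous variances) critical values can be taken from the $t$-distribution table. The following table shows the derived decision regions for the three test situations (for significance level $\alpha$), where $f$ denotes the applicable degrees of freedom and $t_{p;f}$ the $p$-quantile of the $t$-distribution with $f$ degrees of freedom.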

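Test | Rejection region for $H_0$ | Non-rejection region for $H_0$ |
---|---|---|
Two-sided | $\{v : v < -t_{1-\alpha/2;f} \text{ or } v > t_{1-\alpha/2;f}\}$ | $\{v : -t_{1-\alpha/2;f} \leq v \leq t_{1-\alpha/2;f}\}$ |
Right-sided | $\{v : v > t_{1-\alpha;f}\}$ | $\{v : v \leq t_{1-\alpha;f}\}$ |
Left-sided | $\{v : v < -t_{1-\alpha;f}\}$ | $\{v : v \geq -t_{1-\alpha;f}\}$ |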
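
For the unequal-variance case, a short sketch of Welch's statistic together with the approximate degrees of freedom may help. The function name `welch_statistic` and the input values are assumptions for this illustration only.

```python
from math import sqrt


def welch_statistic(mean1, mean2, var1, var2, n1, n2, omega0=0.0):
    """Return (v, f): Welch's statistic and its approximate degrees of freedom."""
    # Estimated variances of the two sample means
    a, b = var1 / n1, var2 / n2
    v = (mean1 - mean2 - omega0) / sqrt(a + b)
    # Approximate degrees of freedom; generally not an integer
    f = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))
    return v, f


# Hypothetical summary statistics, chosen only for illustration
v, f = welch_statistic(10.0, 8.0, var1=4.0, var2=6.0, n1=10, n2=12)
print(round(v, 3), round(f, 2))
```

The resulting `f` always lies between `min(n1, n2) - 1` and `n1 + n2 - 2`, so it never exceeds the degrees of freedom of the pooled test.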

Note that the $t$-distribution quantiles in the above table can be approximated by standard normal quantiles if both sample sizes, $n_1$ *and* $n_2$, are big enough to justify the application of the central limit theorem. The resulting decision regions are then similar to those in the case of known variances.

### Sampling and computing the test statistic

On the basis of an observed sample, the two sample means $\bar{x}_1$ and $\bar{x}_2$ and, if needed, the empirical standard deviations $s_1$ and $s_2$ can be computed. Plugging these values into the test statistic formula gives the realized test statistic value $v$.
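
As a small illustration of this step, the sample quantities can be obtained with Python's `statistics` module. The observations below are invented for the example and are not data from the text.

```python
from statistics import mean, variance

# Hypothetical observations, chosen only for illustration
sample1 = [4.1, 5.3, 3.8, 6.0, 4.9]
sample2 = [3.2, 4.4, 3.9, 2.8]

n1, n2 = len(sample1), len(sample2)
x_bar1, x_bar2 = mean(sample1), mean(sample2)
# statistics.variance divides by n - 1, matching the usual variance estimator
s1_sq, s2_sq = variance(sample1), variance(sample2)

print(n1, n2, round(x_bar1, 3), round(x_bar2, 3), round(s1_sq, 3), round(s2_sq, 3))
```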

### Test decision and interpretation

Test decision and interpretation are carried out analogously to the one-sample mean test.

Consider a population of 3,100 supermarket branches with both cheese and meat counters. Define $X_1$: ‘Queuing duration in minutes at cheese counter’ and $X_2$: ‘Queuing duration in minutes at meat counter’.
Assume that $X_1$ and $X_2$ have normal distributions with unknown expectations $\mu_1$ and $\mu_2$ and unknown, but equal variances (*variance homogeneity*).
We want to test at a significance level $\alpha$, on the basis of two simple random samples of sizes $n_1$ and $n_2$, whether the average time customers have to queue at either counter before being served is equal, i.e. whether the difference of the true parameters equals zero: $H_0: \mu_1 - \mu_2 = 0$ versus $H_1: \mu_1 - \mu_2 \neq 0$.
In this interactive example you can conduct this test as often as you like. Each repetition is based on freshly simulated random samples of $X_1$ and $X_2$ and carried out using your specified test parameters. You can:

- repeatedly observe test decisions on the basis of unchanged significance level $\alpha$ and sample sizes $n_1$ and $n_2$;
- alter $\alpha$ for constant $n_1$ and $n_2$;
- vary the sample sizes $n_1$ and $n_2$, holding the significance level $\alpha$ constant; or
- vary $\alpha$, $n_1$ and $n_2$ simultaneously.
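
The repeated experiment can be imitated offline with a short simulation. This is only a sketch: the function name `simulate_rejection_rate`, the population parameters and the number of repetitions are assumptions, and a normal quantile is used as a large-sample stand-in for the $t$ quantile because Python's standard library offers no $t$-distribution.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def simulate_rejection_rate(mu1, mu2, sigma, n1, n2, alpha, reps, seed=0):
    """Share of simulated two-sided tests of H0: mu1 - mu2 = 0 that reject H0."""
    rng = random.Random(seed)
    # Normal quantile as an approximation of the t quantile for large samples
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        x1 = [rng.gauss(mu1, sigma) for _ in range(n1)]
        x2 = [rng.gauss(mu2, sigma) for _ in range(n2)]
        pooled = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
        v = (mean(x1) - mean(x2)) / sqrt(pooled * (n1 + n2) / (n1 * n2))
        if abs(v) > crit:
            rejections += 1
    return rejections / reps


# Hypothetical settings: equal means (H0 true) and unequal means (H0 false)
rate_h0 = simulate_rejection_rate(5.0, 5.0, sigma=2.0, n1=50, n2=50, alpha=0.05, reps=2000)
rate_h1 = simulate_rejection_rate(6.5, 5.0, sigma=2.0, n1=50, n2=50, alpha=0.05, reps=2000)
print(rate_h0, rate_h1)
```

When the null hypothesis is true, the share of rejections should settle near the chosen $\alpha$; when the means differ, it estimates the power of the test for that particular difference.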

Mr. Schmidt and Mr. Maier, two senior bank managers, enjoy lunch hours that are long enough to start arguing about the average age of their colleagues.

**1st dispute** Mr. Schmidt claims that the average age of female employees differs from that of the male employees, an opinion Mr. Maier cannot and, more importantly, doesn’t want to share.

**2nd dispute** Mr. Schmidt even believes he knows the direction of the deviation: female workers are older on average, it appears to him. Being opposed to Schmidt’s first claim, Maier cannot but disagree with his second.

**3rd dispute** The above is not enough confrontation to override the boredom that has spread after numerous discussions about the fair value of the Euro and the best national football team coach. Mr. Schmidt cannot help himself and switches to attack: ‘On average, the women in our bank are years older than the men!’ Mr. Maier is more than happy to disagree, even though he suddenly concedes that the average male colleague might be younger than the average female. But he cannot rule out the possibility that these subjective impressions could be subject to a focus bias arising from a more critical examination of their female colleagues (Maier and Schmidt are both married).

To settle their disputes and hence make space for other future discussions, Maier and Schmidt decide to carry out a statistical investigation. They are both surprised that they can agree on the following settings:
The statistical test will be based on the difference of two population means, $\mu_1 - \mu_2$, and carried out at a fixed significance level $\alpha$.
Random variable $X_1$ captures the age of a female banker, $X_2$ the age of a male banker. Expectations $\mu_1$, $\mu_2$ and variances $\sigma_1^2$, $\sigma_2^2$ are unknown. *Homogeneity of variances* cannot be assumed, Maier and Schmidt agree. Furthermore, there is no prior knowledge about the shape of the distribution of $X_1$ and $X_2$. Consequently, sample sizes $n_1$ and $n_2$ will have to be sufficiently large to justify the application of the central limit theorem. Maier and Schmidt know that there are approximately as many female as male workers in the bank, and they thus choose equal sample sizes: $n_1 = n_2$. They ask human resources for support in their ground-breaking investigations. Of course, personnel could simply provide them with the exact data, but they agree to draw two samples of this size at random, without replacing the sampled entity after each draw. They ensure that the two samples from the male and female population can be regarded as independent. Sample averages and variances are computed for both samples.

### Test statistic and its distribution; decision regions

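As $\sigma_1^2$ and $\sigma_2^2$ are unknown and Maier&Schmidt have to assume *heterogeneity of variances*, they employ the test statistic

$$V = \frac{\bar{X}_1 - \bar{X}_2 - \omega_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

where $\bar{X}_1, \bar{X}_2$ are the sample means, $S_1^2, S_2^2$ are estimators of the population variances $\sigma_1^2$ and $\sigma_2^2$, and $\omega_0$ is the hypothetical difference.
As both sample sizes $n_1$ and $n_2$ are sufficiently large, the central limit theorem can be applied, and the distribution of $V$ can, under $H_0$, be approximated by the standard normal distribution (bell curve). Maier&Schmidt will thus apply an asymptotic or approximate test for $\mu_1 - \mu_2$.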

## 1st dispute

### Hypothesis

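Mr. Schmidt’s first claim is general in that he doesn’t specify direction or size of the proposed average age differential. Thus, a two-sided test with $\omega_0 = 0$ has to be specified: $H_0: \mu_1 - \mu_2 = 0$ versus $H_1: \mu_1 - \mu_2 \neq 0$ or, equivalently, $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$.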

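### Determining the decision regions for $H_0$

The upper critical value $z_{1-\alpha/2}$, satisfying $P(V \leq z_{1-\alpha/2}) = 1 - \alpha/2$, can be looked up in the normal distribution table as the $100(1 - \alpha/2)$ per cent quantile. From the symmetry of the normal distribution around zero, the lower critical value is $-z_{1-\alpha/2}$, such that $P(-z_{1-\alpha/2} \leq V \leq z_{1-\alpha/2}) = 1 - \alpha$. We thus have the following decision regions:

- Approximate non-rejection region for $H_0$: $\{v : -z_{1-\alpha/2} \leq v \leq z_{1-\alpha/2}\}$.
- Approximate rejection region for $H_0$: $\{v : v < -z_{1-\alpha/2} \text{ or } v > z_{1-\alpha/2}\}$.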
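
Instead of a printed table, the two critical values can be obtained from Python's standard library. The significance level of 0.05 below is a hypothetical choice for illustration and is not taken from the text; `in_rejection_region` is a helper name introduced here.

```python
from statistics import NormalDist

alpha = 0.05  # hypothetical significance level, chosen only for illustration

# Upper critical value: the (1 - alpha/2) quantile of the standard normal distribution
upper = NormalDist().inv_cdf(1 - alpha / 2)
# Lower critical value follows from symmetry around zero
lower = -upper
print(round(lower, 2), round(upper, 2))  # -1.96 1.96


def in_rejection_region(v):
    """True if the realized test statistic falls into the two-sided rejection region."""
    return v < lower or v > upper
```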

### Sampling and computing the test statistic

Personnel submits the following data computed from the two samples: Female bank clerks: Male bank clerks: Using , Maier&Schmidt derive a test statistic value of .

### Test decision and interpretation

The test statistic value $v$ falls into the non-rejection region for $H_0$, and consequently the null hypothesis is not rejected. Based on two independent random samples of equal sizes $n_1 = n_2$, Maier&Schmidt couldn’t prove statistically the existence of a significant difference in the population averages of female and male bank clerks’ ages, $\mu_1$ and $\mu_2$.
Having not rejected the null hypothesis, Maier&Schmidt may have made a wrong decision. This is the case if in reality the two population means *do* differ. The probability of the occurrence of a type II error ($\beta$) can only be computed for ’hypothetical’ true parameter values, i.e. the parameter region of the alternative hypothesis is narrowed to a single parameter point.
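
To make the last point concrete, the following sketch computes an approximate $\beta$ for one hypothetical true difference in the two-sided test. It treats the standard deviations as known (a large-sample simplification); the function name `type_ii_error` and all numbers are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(true_diff, sd1, sd2, n1, n2, alpha=0.05, omega0=0.0):
    """Approximate beta of the two-sided test for a single true difference."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    # Mean of the test statistic when the true difference is true_diff
    shift = (true_diff - omega0) / sqrt(sd1**2 / n1 + sd2**2 / n2)
    # Probability that the statistic still lands in the non-rejection region
    return std_normal.cdf(z - shift) - std_normal.cdf(-z - shift)


# Hypothetical scenario: true difference of 2 years, both standard deviations 5
beta = type_ii_error(2.0, sd1=5.0, sd2=5.0, n1=50, n2=50)
print(round(beta, 3))
```

Each different value of `true_diff` yields a different $\beta$, which is why no single type II error probability can be stated for the composite alternative.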

## 2nd dispute

### Hypothesis

Mr. Schmidt believes he has since come up with some substantial new arguments in favour of his proposition and insists on putting it as the alternative hypothesis in a further test to be conducted. If the null hypothesis is rejected and thus his hypothesis supported, he can quantify the maximum type I error probability to be $\alpha$ and thus has scientific backing for maintaining his position. The resulting test is a right-sided one, still without quantification of the suggested positive difference ($\omega_0 = 0$): $H_0: \mu_1 - \mu_2 \leq 0$ versus $H_1: \mu_1 - \mu_2 > 0$ or, equivalently, $H_0: \mu_1 \leq \mu_2$ versus $H_1: \mu_1 > \mu_2$.

### Determining the decision regions for $H_0$

The critical value $z_{1-\alpha}$ satisfying $P(V \leq z_{1-\alpha}) = 1 - \alpha$ can be found in the normal distribution table. The decision regions are then:

- Approximate non-rejection region for $H_0$: $\{v : v \leq z_{1-\alpha}\}$.
- Approximate rejection region for $H_0$: $\{v : v > z_{1-\alpha}\}$.

### Sampling and computing the test statistic

Human resources supplies Mr. Maier and Mr. Schmidt with the following sample characteristics: Female bank clerks: Male bank clerks: Using , Maier&Schmidt compute the test statistic value as .

### Test decision and interpretation

As the test statistic value $v$ falls into the rejection region for $H_0$, the null hypothesis is rejected. Maier&Schmidt could show on the basis of two independent random samples of equal sizes $n_1 = n_2$ that the difference $\mu_1 - \mu_2$ is significantly greater than zero at the significance level $\alpha$. Thus, Schmidt has reason to maintain his claim that the average female bank clerk is older than the average male. The probability of having made a wrong conclusion in a repeated test context, i.e. the type I error probability, is bounded by the significance level $\alpha$.

Compared to the two-sided test, the rejection region for $H_0$ doesn’t consist of two segments, but is located entirely in the right tail of the distribution. As the area under the normal curve corresponding to this region has to equal the ‘entire’ quantity $\alpha$, the critical value is smaller than that for the two-sided version. For this reason the null hypothesis is more likely to be rejected for the same significance level $\alpha$ and sample sizes $n_1$ and $n_2$ in the one-sided test than in the two-sided test for equal deviations of the test statistic from the hypothetical boundary parameter value in the same direction.
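
The relation between one-sided and two-sided critical values can be checked numerically. The significance levels below are illustrative choices, not values from the text, and `critical_values` is a helper name introduced for this sketch.

```python
from statistics import NormalDist


def critical_values(alpha):
    """Return (one-sided, two-sided) upper critical values of the standard normal."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha), std_normal.inv_cdf(1 - alpha / 2)


for alpha in (0.10, 0.05, 0.01):  # illustrative levels only
    one_sided, two_sided = critical_values(alpha)
    print(f"alpha={alpha}: one-sided {one_sided:.3f}, two-sided {two_sided:.3f}")
```

For every level the one-sided value is the smaller of the two, so a test statistic lying between them leads to rejection in the right-sided test but not in the two-sided test.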

## 3rd dispute

### Hypothesis

In his third claim, Mr. Schmidt has gone one step further in that he has quantified the average age of his female colleagues to be at least a stated number of years higher than the average age of his male coworkers. Translated into our test formalization, this number is the hypothetical difference $\omega_0$. Maier agrees to adopt the same test structure as in the second dispute, leaving Schmidt’s claim as alternative hypothesis. The resulting right-sided test is: $H_0: \mu_1 - \mu_2 \leq \omega_0$ versus $H_1: \mu_1 - \mu_2 > \omega_0$.

### Determining the decision regions for $H_0$

The critical value $z_{1-\alpha}$ for $P(V \leq z_{1-\alpha}) = 1 - \alpha$ is looked up in the normal distribution table. The resulting approximate decision regions are the same as in the second dispute:

- Approximate non-rejection region for $H_0$: $\{v : v \leq z_{1-\alpha}\}$.
- Approximate rejection region for $H_0$: $\{v : v > z_{1-\alpha}\}$.

### Sampling and computing the test statistic

Human resources submit the following statistics: Female bank clerks: Male bank clerks: This time Maier&Schmidt compute the test statistic value using , yielding .

### Test decision and interpretation

The test statistic value $v$ belongs to the non-rejection region for $H_0$, and the null hypothesis is thus not rejected. On the basis of two independent random samples of equal sizes $n_1 = n_2$, Maier&Schmidt couldn’t verify statistically that the difference $\mu_1 - \mu_2$ is significantly greater than the hypothetical difference $\omega_0$. Schmidt hence couldn’t prove statistically at a significance level of $\alpha$ that the average female bank clerk is more than $\omega_0$ years older than the average male bank worker. The test delivers an objective decision basis for a proposed difference of exactly $\omega_0$; nothing can be said about any other positive difference smaller than $\omega_0$ (nor about true differences greater than $\omega_0$, owing to the possibility of a type II error). Thus, if the average female banker is older than the average male banker in the population, Mr. Schmidt has either overstated the difference or is a victim of a type II error, the probability $\beta$ of which can only be computed for specific values of the true population parameter differential.

Student Sabine visits two farms to buy fresh eggs. The farms are populated by two different breeds of hens, one on each. Sabine randomly picks $n_1$ eggs from the first and $n_2$ eggs from the second farm. Back home, she has the impression that the eggs produced by the hens on the first farm are heavier than those from the second. To verify this suspicion, she conducts a statistical test at a significance level of $\alpha$. Sabine compares two (weight) averages by testing for the difference of two means.

### Hypothesis

As Sabine has reason to believe that the average weight of one egg variety is greater than that of the other, a one-sided test is indicated. She wants to prove statistically that the first farm produces heavier eggs and consequently puts her conjecture as alternative hypothesis $H_1$, hoping that her sample will reject the null hypothesis $H_0$, which states the negation of the statement she wants to verify positively. But Sabine has no idea how great the average weight difference could be and thus sets the hypothetical difference that has to be exceeded to prove her right to zero: $\omega_0 = 0$. She can formalize her test as $H_0: \mu_1 - \mu_2 \leq 0$ versus $H_1: \mu_1 - \mu_2 > 0$ or, equivalently, $H_0: \mu_1 \leq \mu_2$ versus $H_1: \mu_1 > \mu_2$.

### Test statistic and its distribution; decision regions

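Sabine has picked the eggs at random; in particular, she hasn’t tried to get hold of the biggest ones on either farm. Naturally, she sampled without replacement, but we must also assume that the population of daily produced eggs on both farms is sufficiently large to justify the assumption of a simple random sample. Clearly, Sabine has drawn the samples independently, for she sampled on two unrelated farms.

Sabine assumes that the random variables $X_1$: ‘egg weight of first breed’ and $X_2$: ‘egg weight of second breed’ are normally distributed: $X_1 \sim N(\mu_1; \sigma_1^2)$ and $X_2 \sim N(\mu_2; \sigma_2^2)$. Expectations $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$ are unknown. To simplify matters, Sabine assumes that the population variances are homogeneous: $\sigma_1^2 = \sigma_2^2$. This assumption implies that a differential in the expectation doesn’t induce a differential in the variances, which is a rather adventurous assumption. Nevertheless, acknowledging the above assumptions (and the possibility of their violation), Sabine can base her test on the test statistic

$$V = \frac{\bar{X}_1 - \bar{X}_2 - \omega_0}{\sqrt{\dfrac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} \cdot \dfrac{n_1 + n_2}{n_1 n_2}}}$$

Here, $n_1$ and $n_2$ are the sample sizes, $\bar{X}_1$ and $\bar{X}_2$ are the sample means, $S_1^2$ and $S_2^2$ are the estimators of $\sigma_1^2$ and $\sigma_2^2$, and $\omega_0$ is the hypothetical difference. Under $H_0$, $V$ has a $t$-distribution with $n_1 + n_2 - 2$ degrees of freedom. In the corresponding $t$-table we find the quantile $t_{1-\alpha; n_1+n_2-2}$ to be the critical value satisfying $P(V \leq t_{1-\alpha; n_1+n_2-2}) = 1 - \alpha$ and hence have the following decision regions:

- Non-rejection region for $H_0$: $\{v : v \leq t_{1-\alpha; n_1+n_2-2}\}$.
- Rejection region for $H_0$: $\{v : v > t_{1-\alpha; n_1+n_2-2}\}$.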

### Sampling and computing the test statistic

Sabine weighs the eggs and computes the sample-specific arithmetic averages and variances: 1st breed: 2nd breed: Using she calculates a test statistic value of .

### Test decision and interpretation

The test statistic realization falls into the rejection region for $H_0$. Thus, on the basis of two independent random samples of sizes $n_1$ and $n_2$ and at a significance level of $\alpha$, Sabine could show statistically that the difference $\mu_1 - \mu_2$ of the population averages of the eggs’ weights is significantly positive. As the type I error probability cannot exceed $\alpha$, Sabine has scientific backing for her claim that the eggs from breed 1 hens are heavier than those from the second farm, on average!