Confidence Interval for the Difference of Two Means
From MM*Stat International
There are various ways to construct a confidence interval for the difference of two means depending on the assumptions one makes. Our assumptions are as follows:
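- In the two populations the random variables $X_1$ and $X_2$ are normally distributed with parameters $\mu_1$, $\sigma_1^2$ and $\mu_2$, $\sigma_2^2$, i.e. $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$.
- From each population a random sample is drawn (with replacement). The sample sizes are denoted by $n_1$ and $n_2$ respectively.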
- The random samples are independent of each other.
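When constructing confidence intervals for the difference of two means one is often interested in seeing whether the value 0 is covered by the interval. If 0 is not an element of the interval, then the two populations are different at least with respect to their means. Since $X_1$ and $X_2$ are normally distributed, the sample means $\bar{X}_1$ and $\bar{X}_2$ are also normal (see chapter ). Moreover we have:
$$E(\bar{X}_1) = \mu_1, \quad Var(\bar{X}_1) = \frac{\sigma_1^2}{n_1}, \qquad E(\bar{X}_2) = \mu_2, \quad Var(\bar{X}_2) = \frac{\sigma_2^2}{n_2}.$$
In summary, $\bar{X}_1 \sim N(\mu_1, \sigma_1^2/n_1)$ and $\bar{X}_2 \sim N(\mu_2, \sigma_2^2/n_2)$. Since linear combinations of independent normals are normal, the difference of the two sample means $D = \bar{X}_1 - \bar{X}_2$ is also normally distributed, with expectation $E(D) = \mu_1 - \mu_2$ and variance
$$\sigma_D^2 = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$
The standardized random variable
$$Z = \frac{D - (\mu_1 - \mu_2)}{\sigma_D}$$
is therefore standard normal, $Z \sim N(0, 1)$. We distinguish between two cases: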
- the variances of the two populations $\sigma_1^2$ and $\sigma_2^2$ are known
- the variances of the two populations $\sigma_1^2$ and $\sigma_2^2$ are unknown
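1. Case: the variances $\sigma_1^2$ and $\sigma_2^2$ of the two populations are known. If both variances are known, the confidence interval for the difference $\mu_1 - \mu_2$ at confidence level $1 - \alpha$ is
$$\left[ D - z_{1-\alpha/2}\,\sigma_D ;\; D + z_{1-\alpha/2}\,\sigma_D \right],$$
i.e.,
$$P\left( D - z_{1-\alpha/2}\,\sigma_D \le \mu_1 - \mu_2 \le D + z_{1-\alpha/2}\,\sigma_D \right) = 1 - \alpha,$$
where $D = \bar{X}_1 - \bar{X}_2$ and $\sigma_D = \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}$.
- By construction these confidence intervals assign equal probability mass to the tails: $P(\mu_1 - \mu_2 < D - z_{1-\alpha/2}\,\sigma_D) = P(\mu_1 - \mu_2 > D + z_{1-\alpha/2}\,\sigma_D) = \alpha/2$.
- The confidence interval is symmetric around the estimated difference $D = \bar{X}_1 - \bar{X}_2$.
- The length of the interval, $2\, z_{1-\alpha/2}\,\sigma_D$, is constant given $n_1$ and $n_2$, the variances $\sigma_1^2$ and $\sigma_2^2$ and the confidence level $1 - \alpha$.
- If we cannot assume the populations to be normally distributed, but the two sample sizes are large ($n_1 > 30$ and $n_2 > 30$), the central limit theorem may be used to justify the same confidence interval procedure. In this case, the confidence level is approximately $1 - \alpha$.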
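The known-variance interval above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function name `ci_diff_means_known_var` and the input values are hypothetical, and the quantile $z_{1-\alpha/2}$ is taken from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def ci_diff_means_known_var(xbar1, xbar2, var1, var2, n1, n2, level=0.95):
    """Two-sided CI for mu1 - mu2 when both population variances are known."""
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # quantile z_{1-alpha/2}
    d = xbar1 - xbar2                             # point estimate D
    sigma_d = sqrt(var1 / n1 + var2 / n2)         # standard deviation of D
    return d - z * sigma_d, d + z * sigma_d


# Hypothetical inputs, chosen only to exercise the function.
lower, upper = ci_diff_means_known_var(10.0, 9.0, 4.0, 9.0, 40, 50)
print(f"[{lower:.3f}, {upper:.3f}]")
```

Because $\sigma_D$ is fixed once the variances and sample sizes are given, repeated samples change only the centre of the interval, not its length.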
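2. Case: The variances $\sigma_1^2$ and $\sigma_2^2$ of the two populations are unknown. In this case $\sigma_1^2$ and $\sigma_2^2$ are estimated using the unbiased and consistent estimators
$$S_1^2 = \frac{1}{n_1 - 1} \sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2, \qquad S_2^2 = \frac{1}{n_2 - 1} \sum_{i=1}^{n_2} (X_{2i} - \bar{X}_2)^2.$$
If we can assume variance homogeneity, i.e. $\sigma_1^2 = \sigma_2^2 = \sigma^2$, one may produce an estimate $S^2$ for the joint variance $\sigma^2$. This is the weighted arithmetic mean of the two sample variances:
$$S^2 = \frac{(n_1 - 1)\, S_1^2 + (n_2 - 1)\, S_2^2}{n_1 + n_2 - 2}.$$
$S^2$ is also called a pooled variance. The estimator for $\sigma_D^2$ is hence:
$$S_D^2 = S^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) = S^2\, \frac{n_1 + n_2}{n_1 n_2}.$$
The standard deviation $S_D$, the square root of $S_D^2$, is used to standardize. The resulting random variable
$$T = \frac{D - (\mu_1 - \mu_2)}{S_D}$$
is t-distributed with $n_1 + n_2 - 2$ degrees of freedom. We may now construct a confidence interval for the difference $\mu_1 - \mu_2$ at confidence level $1 - \alpha$:
$$\left[ D - t_{n_1+n_2-2;\,1-\alpha/2}\, S_D ;\; D + t_{n_1+n_2-2;\,1-\alpha/2}\, S_D \right].$$
If one has variance heterogeneity, i.e. $\sigma_1^2 \neq \sigma_2^2$, we use the estimator of $\sigma_D^2$ given by:
$$S_D^2 = \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}.$$
If the two sample sizes are sufficiently large ($n_1 > 30$ and $n_2 > 30$), then we may use the standard normal quantile $z_{1-\alpha/2}$ for the confidence interval at approximate level $1 - \alpha$, i.e.,
$$\left[ D - z_{1-\alpha/2}\, S_D ;\; D + z_{1-\alpha/2}\, S_D \right].$$
- By construction these confidence intervals assign equal probability mass to the tails.
- The confidence interval is symmetric around the point estimate $D = \bar{X}_1 - \bar{X}_2$.
- The length of the confidence interval is random since it depends on $S_1$ and $S_2$.
- The confidence interval also depends on the sample sizes $n_1$ and $n_2$ and on the confidence level $1 - \alpha$.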
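The two unknown-variance procedures can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names `pooled_ci` and `large_sample_ci` are hypothetical, and the t quantile is passed in by the caller (for example from a printed t table) because the Python standard library does not provide the t distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pooled_ci(x1, x2, t_quantile):
    """CI for mu1 - mu2 under variance homogeneity.

    t_quantile is t_{n1+n2-2; 1-alpha/2}, supplied by the caller.
    """
    n1, n2 = len(x1), len(x2)
    # statistics.variance divides by n - 1, matching the estimators S_1^2, S_2^2.
    s_sq = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    s_d = sqrt(s_sq * (1 / n1 + 1 / n2))
    d = mean(x1) - mean(x2)
    return d - t_quantile * s_d, d + t_quantile * s_d


def large_sample_ci(x1, x2, level=0.95):
    """Approximate CI for mu1 - mu2 under variance heterogeneity, large samples."""
    n1, n2 = len(x1), len(x2)
    s_d = sqrt(variance(x1) / n1 + variance(x2) / n2)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    d = mean(x1) - mean(x2)
    return d - z * s_d, d + z * s_d
```

Both functions return an interval centred on $D$; only the estimate of $\sigma_D$ and the quantile differ.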
We are given a population of cars produced by Speed, Eco, Space and Run. We measure the following variables:
- $X_1$ = gas consumption per 100 km for cars produced by Speed
- $X_2$ = gas consumption per 100 km for cars produced by Eco
- $X_3$ = gas consumption per 100 km for cars produced by Space
- $X_4$ = gas consumption per 100 km for cars produced by Run
Means and variances are unknown. We would like to assess differences in mean gas consumption between pairs of companies. For a given random sample from two companies, find a point and interval estimate for the unknown difference of the means of the two selected companies. Assume that variances are heterogeneous and that the populations are normally distributed. You will be able to examine the effect of the confidence level and sample size on the length of the confidence interval. We recommend that only one of these features be altered at a time. Please select
- the pair of companies to be analyzed
- the sample sizes $n_1$ and $n_2$
- the confidence level $1 - \alpha$ (as a decimal number, e.g. 0.95)
Results: This interactive example will provide:
- a confidence interval for the selected confidence level
If you repeatedly draw data on the same pair of companies, but select different confidence levels/sample sizes, previous results will continue to be displayed for comparison purposes.
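The effect that the interactive example is meant to show can be approximated offline. The sketch below is not the original applet: it holds the sample standard deviations fixed at hypothetical values (`s1 = 0.8`, `s2 = 0.5`, chosen purely for illustration) and uses the large-sample interval with heterogeneous variances, so only the sample size or the confidence level changes at a time.

```python
from math import sqrt
from statistics import NormalDist


def interval_length(s1, s2, n1, n2, level):
    """Length 2 * z_{1-alpha/2} * S_D of the large-sample interval."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return 2 * z * sqrt(s1**2 / n1 + s2**2 / n2)


s1, s2 = 0.8, 0.5  # hypothetical sample standard deviations (l/100 km)

# Vary the sample size at a fixed confidence level.
for n in (40, 80, 160):
    print(f"n1 = n2 = {n:3d}, level 0.95: length {interval_length(s1, s2, n, n, 0.95):.3f}")

# Vary the confidence level at a fixed sample size.
for level in (0.90, 0.95, 0.99):
    print(f"n1 = n2 =  40, level {level:.2f}: length {interval_length(s1, s2, 40, 40, level):.3f}")
```

In a real sample the standard deviations change from draw to draw, so the length is random; fixing them here isolates the two effects: quadrupling both sample sizes halves the length, and raising the confidence level lengthens the interval.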
Company X wants to analyze its share performance on two stock exchanges using the spot price, which is observed daily at 12.00 p.m. The company is particularly interested in the difference of mean spot prices. We will construct both a point estimate and a confidence interval at level $1 - \alpha = 0.95$. The random variables are:
- $X_1$: the spot price on the first stock exchange
- $X_2$: the spot price on the second stock exchange
The means $\mu_1$, $\mu_2$ and variances $\sigma_1^2$, $\sigma_2^2$ are unknown. We assume that
- prices are independent at the two stock exchanges
- the variances are equal (variance homogeneity)
We draw a random sample from each population. The sample sizes are $n_1$ and $n_2$. Since company X has been traded on the two stock exchanges for a long time, both populations are large. Hence we can assume that we are sampling with replacement. Moreover we assume independence of the two samples. In demonstrating the construction of confidence intervals for the difference $\mu_1 - \mu_2$, consider the following two cases:
- $X_1$ and $X_2$ are normally distributed
- the distributions of $X_1$ and $X_2$ are unknown
1. Case: We have $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$. The standardized random variable
$$T = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_D}$$
is t-distributed with $n_1 + n_2 - 2$ degrees of freedom. Under these assumptions
$$\left[ (\bar{X}_1 - \bar{X}_2) - t_{n_1+n_2-2;\,1-\alpha/2}\, S_D ;\; (\bar{X}_1 - \bar{X}_2) + t_{n_1+n_2-2;\,1-\alpha/2}\, S_D \right]$$
is a confidence interval for the difference of the two spot price means at confidence level $1 - \alpha = 0.95$. For $n_1 = n_2 = 10$, we find $t_{18;\,0.975} = 2.101$. For 10 randomly selected days, we record spot prices on each of the two exchanges (columns 2 and 3). Columns 4 and 5 below contain squared deviations from the estimated means, which are used to calculate the individual variances.
Day $i$ | $x_{1i}$ | $x_{2i}$ | $(x_{1i} - \bar{x}_1)^2$ | $(x_{2i} - \bar{x}_2)^2$ |
1 | 18.50 | 18.45 | 0.0841 | 0.1296 |
2 | 19.00 | 18.90 | 0.0441 | 0.0081 |
3 | 18.70 | 18.80 | 0.0081 | 0.0001 |
4 | 19.30 | 19.50 | 0.2601 | 0.4761 |
5 | 17.10 | 17.30 | 2.8561 | 2.2801 |
6 | 18.30 | 18.10 | 0.2401 | 0.5041 |
7 | 18.60 | 18.80 | 0.0361 | 0.0001 |
8 | 19.00 | 18.85 | 0.0441 | 0.0016 |
9 | 19.40 | 19.50 | 0.3721 | 0.4761 |
10 | 20.00 | 19.90 | 1.4641 | 1.1881 |
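We obtain:
$$\bar{x}_1 = \frac{187.9}{10} = 18.79, \qquad \bar{x}_2 = \frac{188.1}{10} = 18.81,$$
$$s_1^2 = \frac{5.4090}{9} \approx 0.6010, \qquad s_2^2 = \frac{5.0640}{9} \approx 0.5627.$$
Since we have assumed homogeneity of variances, the point estimate for the joint or pooled variance is given by the weighted arithmetic mean of the sample variances:
$$s^2 = \frac{9 \cdot 0.6010 + 9 \cdot 0.5627}{18} \approx 0.5818.$$
The variance of the difference of the sample means is $s_D^2 = 0.5818 \cdot \frac{10 + 10}{10 \cdot 10} \approx 0.1164$ and the standard deviation is $s_D \approx 0.3411$. A confidence interval for the difference $\mu_1 - \mu_2$ is given by:
$$\left[ -0.02 - 2.101 \cdot 0.3411 ;\; -0.02 + 2.101 \cdot 0.3411 \right] \approx \left[ -0.7367 ;\; 0.6967 \right].$$
The estimated difference of $-0.02$ is small relative to the levels of the individual spot prices. The confidence interval includes the value 0. Hence there does not appear to be an appreciable difference between the two mean spot prices $\mu_1$ and $\mu_2$. In a later chapter we will see how this implies that there is no statistically significant difference between the two prices.

2. Case: We will now drop the assumption of normality of $X_1$ and $X_2$. We will require larger sample sizes in order that we may rely upon the central limit theorem as an approximation to the distributions of $\bar{X}_1$ and $\bar{X}_2$ (and their difference $\bar{X}_1 - \bar{X}_2$). We will draw samples of size $n_1 = 50$, $n_2 = 50$. The standardized random variable
$$Z = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{S_D}$$
is approximately normally distributed. Under the above assumptions
$$\left[ (\bar{X}_1 - \bar{X}_2) - z_{1-\alpha/2}\, S_D ;\; (\bar{X}_1 - \bar{X}_2) + z_{1-\alpha/2}\, S_D \right]$$
is an approximate confidence interval for the difference $\mu_1 - \mu_2$ at confidence level 0.95, where $z_{0.975} = 1.96$. Using our two samples of 50 observations each we obtain the sample means $\bar{x}_1$, $\bar{x}_2$ and the sample variances $s_1^2$, $s_2^2$. Since we assumed homogeneous variances, we estimate $\sigma^2$ using
$$s^2 = \frac{49\, s_1^2 + 49\, s_2^2}{98} \quad \text{and} \quad s_D^2 = s^2 \cdot \frac{50 + 50}{50 \cdot 50} = \frac{s^2}{25}.$$
The standard deviation is $s_D = s/5$. The confidence interval is given by:
$$\left[ (\bar{x}_1 - \bar{x}_2) - 1.96\, s_D ;\; (\bar{x}_1 - \bar{x}_2) + 1.96\, s_D \right].$$
The interpretation follows as in case 1 above. Comparing the two approaches, one may conclude: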
- In case 1 we had more information about the population than in case 2.
- The difference of the two sample means and the joint variances are approximately of the same size in both cases.
- The variance and standard deviation of the difference are much smaller in case 2 due to the larger sample size.
- The length of the confidence interval in case 2 is much smaller than in case 1.
- The confidence interval in case 2 is approximate because of the absence of exact knowledge of the underlying distributions.
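The case 1 calculation can be checked directly from the ten recorded days. The sketch below uses only the data in the table and the tabulated quantile $t_{18;\,0.975} = 2.101$ quoted in the text; the variable names are illustrative choices.

```python
from math import sqrt
from statistics import mean, variance

# Spot prices from the table (columns 2 and 3).
x1 = [18.50, 19.00, 18.70, 19.30, 17.10, 18.30, 18.60, 19.00, 19.40, 20.00]
x2 = [18.45, 18.90, 18.80, 19.50, 17.30, 18.10, 18.80, 18.85, 19.50, 19.90]
n1, n2 = len(x1), len(x2)
t_quantile = 2.101  # t_{18; 0.975}, as quoted in the text

s1_sq, s2_sq = variance(x1), variance(x2)  # sample variances (divisor n - 1)
s_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)  # pooled variance
s_d = sqrt(s_sq * (1 / n1 + 1 / n2))       # estimated std. deviation of D
d = mean(x1) - mean(x2)                    # point estimate of mu1 - mu2

lower, upper = d - t_quantile * s_d, d + t_quantile * s_d
print(f"d = {d:.2f}, s^2 = {s_sq:.4f}, s_D = {s_d:.4f}")
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```

The case 2 interval cannot be recomputed in the same way because the two samples of 50 observations are not listed.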
An automobile club wants to compare the (highway) gas consumption of two similar types of cars produced by companies A and B. To assist the club, we will construct a confidence interval for the difference of the two means at a confidence level $1 - \alpha$. We make the following assumptions:
- The random variables $X_1$ = gas consumption per 100 km of A type cars and $X_2$ = gas consumption per 100 km of B type cars are normally distributed with unknown means $\mu_1$ and $\mu_2$ and unknown variances $\sigma_1^2$ and $\sigma_2^2$. We do not assume that the variances are equal.
- We assume $n_1 > 30$ and $n_2 > 30$.
- The populations are large, so we will perform sampling with replacement.
- We will assume all observations are independent.
The confidence interval for the difference $\mu_1 - \mu_2$ can be constructed using
$$\left[ (\bar{X}_1 - \bar{X}_2) - z_{1-\alpha/2}\, S_D ;\; (\bar{X}_1 - \bar{X}_2) + z_{1-\alpha/2}\, S_D \right]$$
with approximate confidence level $1 - \alpha$, where $S_D = \sqrt{S_1^2/n_1 + S_2^2/n_2}$. The automobile club tests 36 cars of company A and 40 cars of company B. The following quantities are calculated (in liters per 100 kilometers, l/100 km):
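$\bar{x}_1$ = 9.2 l/100 km | $s_1$ = 0.6 l/100 km |
$\bar{x}_2$ = 8.4 l/100 km | $s_2$ = 0.4 l/100 km |
With $s_D = \sqrt{0.6^2/36 + 0.4^2/40} = \sqrt{0.014} \approx 0.1183$, the confidence interval is:
$$\left[ 0.8 - z_{1-\alpha/2} \cdot 0.1183 ;\; 0.8 + z_{1-\alpha/2} \cdot 0.1183 \right].$$
This interval does not cover 0. We will see later that this implies a statistically significant difference in mean gas consumption between the two populations.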
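A numerical version of this interval can be sketched as follows. Two assumptions are made here that the text does not confirm: the values 0.6 and 0.4 are read as sample standard deviations (their unit is l/100 km, not a squared unit), and the confidence level is taken to be 0.95 purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from the text; 0.6 and 0.4 are ASSUMED to be standard deviations.
n1, xbar1, s1 = 36, 9.2, 0.6  # company A
n2, xbar2, s2 = 40, 8.4, 0.4  # company B
level = 0.95                  # ASSUMED confidence level, not stated in the text

z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # z_{1-alpha/2}
s_d = sqrt(s1**2 / n1 + s2**2 / n2)            # heterogeneous-variance estimate of sigma_D
d = xbar1 - xbar2                              # point estimate of mu1 - mu2

lower, upper = d - z * s_d, d + z * s_d
print(f"d = {d:.1f}, s_D = {s_d:.4f}")
print(f"approx. {level:.0%} CI: [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions the lower bound stays well above 0, in line with the conclusion in the text.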