Sampling Distribution of the Mean
From MM*Stat International
The distribution of a statistic (which is itself a function of the sample) is called a sampling distribution. Statistics are used for estimating unknown population characteristics or parameters and for testing hypotheses. These tasks involve probability statements which can only be made if the sampling distributions of the statistics are known (or can be approximated). For the most important statistics, we now present in each case the sampling distribution together with its expected value and variance.
Distribution of the sample mean
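Consider sampling from a population with distribution function $F(x)$, expected value $E(X) = \mu$ and variance $Var(X) = \sigma^2$. One of the most important statistics is the sample mean. The sample mean (or sample average) of a sample $X_1, \dots, X_n$ of size $n$ is given by:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} X_i$$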
Expected value, variance and standard deviation of the sample mean
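Expected value, variance and standard deviation of the sample mean are given by:
$$E(\bar{x}) = \mu$$
- for a simple random sample (with replacement): $$Var(\bar{x}) = \sigma^2(\bar{x}) = \frac{\sigma^2}{n}, \qquad \sigma(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$
- for a random sample without replacement from a population of size $N$: $$Var(\bar{x}) = \sigma^2(\bar{x}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}, \qquad \sigma(\bar{x}) = \frac{\sigma}{\sqrt{n}}\cdot\sqrt{\frac{N-n}{N-1}}$$ The factor $\frac{N-n}{N-1}$ is called the finite sample correction.
If the population variance $\sigma^2$ is unknown it has to be estimated by the statistic $$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{x})^2.$$ In the above formulas $\sigma^2$ is replaced by $S^2$, which leads to an estimator of the variance of the sample mean given by:
- for a simple random sample: $$\hat{\sigma}^2(\bar{x}) = \frac{S^2}{n}$$
- for a random sample without replacement: $$\hat{\sigma}^2(\bar{x}) = \frac{S^2}{n}\cdot\frac{N-n}{N-1}$$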
These results for the expectation and variance of the sample mean hold regardless of the specific form of its sampling distribution.
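The moment formulas of this section can be collected in a small helper. The following Python sketch is illustrative only: the function name `sample_mean_moments` and the convention that `N=None` means sampling with replacement are assumptions made here, not part of the original text.

```python
import math


def sample_mean_moments(mu, sigma2, n, N=None):
    """Return (expected value, variance, standard deviation) of the sample mean.

    mu, sigma2 : population mean and variance
    n          : sample size
    N          : population size; None (assumed convention) means sampling
                 with replacement, otherwise the finite sample correction
                 (N - n) / (N - 1) is applied.
    """
    if n < 1:
        raise ValueError("sample size n must be at least 1")
    variance = sigma2 / n
    if N is not None:
        if N < 2 or n > N:
            raise ValueError("need N >= 2 and n <= N when sampling without replacement")
        variance *= (N - n) / (N - 1)
    return mu, variance, math.sqrt(variance)


# With replacement, then without replacement from a population of 5000.
print(sample_mean_moments(27.30, 34.81, 10))
print(sample_mean_moments(27.30, 34.81, 10, N=5000))
```

The expected value is returned unchanged in both cases, which mirrors the statement that only the variance depends on the sampling scheme.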
Distribution of the sample mean
The sampling distribution of the sample mean is determined by the distribution of the variable $X$ in the population. In each case below we assume a random sample with replacement.
It is assumed that $X$ is normally distributed with expected value $\mu$ and variance $\sigma^2$, that is, $X \sim N(\mu, \sigma^2)$:
- The population variance $\sigma^2$ is known. In this case $\bar{x}$ has the following normal distribution: $$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$ and the standardized random variable $$Z = \sqrt{n}\,\frac{\bar{x}-\mu}{\sigma}$$ has the standard normal distribution $N(0, 1)$.
- The population variance $\sigma^2$ is unknown. In this case, it may be estimated by $S^2$. The transformed random variable $$T = \sqrt{n}\,\frac{\bar{x}-\mu}{S}$$ has a tabulated distribution whose parameter, the 'degrees of freedom', equals $n-1$. This distribution is called the t-distribution and it is usually denoted by $t_{n-1}$.
As $n$ increases, the t-distribution converges to a standard normal. Indeed the latter provides a good approximation when $n$ is large (a common rule of thumb is $n > 30$).
The distribution of $X$ is arbitrary (non-normal or unknown). This is the most relevant case for applications in business and economics, since the distribution of many interesting variables may not be well approximated by the normal or its specific form is simply unknown.
Consider $n$ i.i.d. random variables $X_1, \dots, X_n$ with unknown distribution. The random variables have expectation $E(X_i) = \mu$ and variance $Var(X_i) = \sigma^2$. According to the central limit theorem the following propositions hold:
If $\sigma^2$ is known, then the random variable $$Z = \sqrt{n}\,\frac{\bar{x}-\mu}{\sigma}$$ is approximately standard normal for sufficiently large $n$.
If $\sigma^2$ is unknown, then the random variable $$T = \sqrt{n}\,\frac{\bar{x}-\mu}{S}$$ is also approximately standard normal for sufficiently large $n$.
As a rule of thumb, the normal distribution can be used for $n > 30$. If $X$ is normally distributed with known $\mu$ and $\sigma^2$, so that $\bar{x}$ also follows the normal distribution, then the calculation of probabilities may be done as in Chapter VI. These calculations hold approximately if $X$ is arbitrarily distributed and $n$ is sufficiently large. More generally, if the distribution of $X$ is not normal, but is known, then it is in principle possible to calculate the sampling distribution of $\bar{x}$ and the probabilities that $\bar{x}$ falls in a given interval (though the results may be quite complicated).
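The central limit theorem statement can be checked informally by simulation. The sketch below is an illustration under assumptions not taken from the text: the population is taken to be exponential with rate 1 (so $\mu = 1$ and $\sigma = 1$), the sample size is $n = 50$, and the helper name `standardized_means` is hypothetical. It estimates how often the standardized sample mean falls within $\pm 1.96$, which would be about 95% under an exact standard normal.

```python
import math
import random


def standardized_means(n, reps, rng):
    """Simulate `reps` standardized sample means from an Exp(1) population."""
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exp(1)
    values = []
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        values.append(math.sqrt(n) * (xbar - mu) / sigma)
    return values


rng = random.Random(0)  # fixed seed so that runs are reproducible
z = standardized_means(n=50, reps=5000, rng=rng)
share = sum(abs(v) <= 1.96 for v in z) / len(z)
print(f"share of |Z| <= 1.96: {share:.3f}")
```

Although the exponential distribution is strongly skewed, the share should already be close to 0.95 for a sample size of this order.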
Weak law of large numbers
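Suppose $X_1, \dots, X_n$ are $n$ independent and identically distributed random variables with expectation $E(X_i) = \mu$ and variance $Var(X_i) = \sigma^2$. Then, for each $\epsilon > 0$ it holds that: $$\lim_{n\to\infty} P(|\bar{x}_n - \mu| < \epsilon) = 1.$$
This can be shown as follows: According to Chebyshev's inequality it holds that $$P(|\bar{x}_n - \mu| < \epsilon) \geq 1 - \frac{\sigma^2(\bar{x}_n)}{\epsilon^2}.$$ After inserting $\sigma^2(\bar{x}_n) = \sigma^2/n$: $$P(|\bar{x}_n - \mu| < \epsilon) \geq 1 - \frac{\sigma^2}{n\,\epsilon^2}.$$ If $n$ approaches infinity the second term on the right hand side goes to zero.
Implication of this law: With increasing $n$, the probability that the sample mean $\bar{x}$ will deviate from its expectation $\mu$ by less than $\epsilon > 0$ converges to one. If the sample size is large enough the sample mean will take on values within a pre-specified interval $[\mu - \epsilon; \mu + \epsilon]$ with high probability, regardless of the distribution of $X$.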
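To see how the Chebyshev bound behind the weak law behaves, one can evaluate the lower bound $1 - \sigma^2/(n\epsilon^2)$ for growing $n$. The following Python sketch is a minimal illustration; the function name `chebyshev_lower_bound` and the values $\sigma^2 = 4$ and $\epsilon = 0.5$ are hypothetical choices, not taken from the text.

```python
def chebyshev_lower_bound(sigma2, n, eps):
    """Lower bound for P(|sample mean - mu| < eps) from Chebyshev's inequality.

    The bound 1 - sigma2 / (n * eps**2) can be negative for small n,
    in which case it carries no information and is clipped to 0.
    """
    return max(0.0, 1.0 - sigma2 / (n * eps ** 2))


# Illustrative values (assumptions): population variance 4, tolerance 0.5.
for n in (10, 100, 1000, 10000):
    print(n, chebyshev_lower_bound(sigma2=4.0, n=n, eps=0.5))
```

The bound is typically very conservative, but it rises towards one as $n$ grows, which is all the proof of the weak law requires.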
Enhanced example for sampling distributions
This example is devoted to formally explaining the sampling distribution of the sample mean, its expectation and variance. To this end, certain assumptions must be made about the population. In particular, it is assumed that the mean hourly gross earnings of all 5000 workers of a company equal $27.30, with a standard deviation of $5.90 and a variance of 34.81 (in squared dollars).
Problem 1:
Suppose that the variable $X$ = “Gross hourly earnings of a (randomly selected) worker in this company” is normally distributed. That is, $X \sim N(27.3, 34.81)$. From the population of all workers of this company, a random sample (with replacement) of $n$ workers is selected. The sample mean $\bar{x}$ gives the average gross hourly earnings of the $n$ workers in the sample. Calculate the expected value, variance, standard deviation and find the specific form of the distribution of $\bar{x}$ for the following sample sizes: a) $n = 10$, b) $n = 50$, c) $n = 200$.
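- Regardless of $n$, the expected value of $\bar{x}$ is $E(\bar{x}) = \mu = \$27.30$.
- The variance of the sample mean is equal to $Var(\bar{x}) = \sigma^{2}(\bar{x}) = \sigma^{2}/n$.
Thus,
a) for $n = 10$: $\sigma^{2}(\bar{x})=5.9^{2}/10=34.81/10=3.481$ and $\sigma(\bar{x})=\$1.8657$.
b) for $n = 50$: $\sigma^{2}(\bar{x})=5.9^{2}/50=34.81/50=0.6962$ and $\sigma(\bar{x})=\$0.8344$.
c) for $n = 200$: $\sigma^{2}(\bar{x})=5.9^{2}/200=34.81/200=0.1741$ and $\sigma(\bar{x})=\$0.4172$.
Obviously, the standard deviation of $\bar{x}$ is smaller than the standard deviation of $X$ in the population. Moreover, the standard deviation of $\bar{x}$ decreases from 1.8657 to 0.8344 and to 0.4172, as the sample size is increased from 10 to 50 and eventually to 200. Increasing the sample size by a factor of five cuts the standard deviation by slightly more than half (by the factor $1/\sqrt{5} \approx 0.45$). Increasing the sample size twentyfold reduces the standard deviation by more than 3/4.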
- Since $X$ is assumed to be normally distributed it follows that the sample mean $\bar{x}$ is also normally distributed under random sampling with replacement, regardless of the sample size.
Thus:
a) $\bar{x} \sim N(27.3, 3.481)$ for random samples of size $n = 10$. The red curve in the graph corresponds to the distribution of $X$ in the population while the blue curve depicts the distribution of the sample mean $\bar{x}$.
b) $\bar{x} \sim N(27.3, 0.6962)$ for random samples of size $n = 50$.
c) $\bar{x} \sim N(27.3, 0.1741)$ for random samples of size $n = 200$.
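The figures of Problem 1 can be reproduced with a few lines of Python. This is only a sketch using the standard library: the helper name `sampling_distribution` is hypothetical, and the extra quantity printed at the end (the probability that the sample mean lies within $1 of $\mu$) is an illustration added here, not part of the original exercise.

```python
import math
from statistics import NormalDist

MU, SIGMA = 27.30, 5.90  # population mean and standard deviation from the example


def sampling_distribution(n):
    """Normal distribution of the sample mean for samples of size n (with replacement)."""
    return NormalDist(mu=MU, sigma=SIGMA / math.sqrt(n))


for n in (10, 50, 200):
    dist = sampling_distribution(n)
    # Illustrative extra: P(mu - 1 <= sample mean <= mu + 1)
    within_one_dollar = dist.cdf(MU + 1) - dist.cdf(MU - 1)
    print(f"n={n:3d}  Var={dist.variance:.4f}  sd={dist.stdev:.4f}  "
          f"P(|mean - mu| <= 1)={within_one_dollar:.3f}")
```

The printed standard deviations shrink with $\sqrt{n}$, so the probability mass of the sample mean concentrates ever more tightly around $27.30.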
Problem 2:
Suppose that the variable $X$ = “gross hourly earnings of a (randomly selected) worker of this company” is normally distributed. Hence, $X \sim N(27.3, 34.81)$. A sample of size $n$ is randomly drawn without replacement from the $N = 5000$ workers. The sample mean $\bar{x}$ gives the average gross hourly earnings of the $n$ workers in the sample. Calculate the expected value, variance, and standard deviation of $\bar{x}$ for different sample sizes $n$, among them $n = 10$ and $n = 1000$.
- All random samples without replacement, regardless of $n$, have the same expected value as in the first problem: $E(\bar{x}) = \mu = \$27.30$.
- In the case of sampling without replacement, the variance of the sample mean is reduced by a 'finite sample correction factor'. Specifically, the variance of the sample mean is given by $$Var(\bar{x}) = \sigma^{2}(\bar{x}) = \frac{\sigma^{2}}{n}\cdot\frac{N-n}{N-1}.$$ However, the finite sample correction can be neglected if $n$ is sufficiently small relative to $N$, for example if $n/N \leq 0.05$.
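Thus,
for $n = 10$ the ratio is $n/N = 10/5000 = 0.002$, so the correction can be neglected and $Var(\bar{x})=\sigma^{2}(\bar{x})=5.9^{2}/10=34.81/10=3.481$ with $\sigma(\bar{x}) = \$1.8657$. In comparison, the finite sample correction yields $\sigma^{2}(\bar{x}) = 3.481\cdot\frac{5000-10}{5000-1} = 3.4747$ and $\sigma(\bar{x}) = \$1.8641$, which demonstrates the negligibility of the correction.
For the second sample size the ratio $n/N$ is again small, so that $Var(\bar{x})=\sigma^{2}(\bar{x})=\sigma^{2}/n$. This leads to the same result as in Problem 1, which is very similar to the finite sample corrected result.
For $n = 1000$ the ratio is $n/N = 0.2 > 0.05$, so the correction has to be applied:
$$\begin{aligned} Var(\bar{x})=\sigma^{2}(\bar{x}) & =\frac{\sigma^{2}}{n}\cdot\frac{N-n}{N-1}\\ & =\frac{5.9^{2}}{1000}\cdot\frac{5000-1000}{5000-1}=0.0279 \\ \sigma(\bar{x}) & =\$0.1669. \end{aligned}$$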
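The effect of the finite sample correction in Problem 2 can be made concrete with a short computation. The Python sketch below is illustrative; the function names are hypothetical, and only the sample sizes $n = 10$ and $n = 1000$ that appear in the calculation above are used, together with $N = 5000$ and $\sigma^2 = 34.81$.

```python
import math

SIGMA2 = 5.9 ** 2  # population variance, 34.81
N = 5000           # population size


def var_mean_with_replacement(n):
    """Variance of the sample mean ignoring the finite sample correction."""
    return SIGMA2 / n


def var_mean_without_replacement(n):
    """Variance of the sample mean with the correction (N - n) / (N - 1)."""
    return SIGMA2 / n * (N - n) / (N - 1)


for n in (10, 1000):
    plain = var_mean_with_replacement(n)
    corrected = var_mean_without_replacement(n)
    print(f"n={n:4d}  n/N={n / N:.3f}  "
          f"uncorrected sd={math.sqrt(plain):.4f}  corrected sd={math.sqrt(corrected):.4f}")
```

For $n = 10$ the two standard deviations differ only in the third decimal place, whereas for $n = 1000$ the correction lowers the standard deviation noticeably, in line with the $n/N \leq 0.05$ rule of thumb.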
Problem 3:
Suppose, more realistically, that the distribution of $X$ = “gross hourly earnings of a (randomly selected) worker from this company” is unknown. Hence, all that is known is $\mu = \$27.30$ and $\sigma = \$5.90$. A sample of size $n$ is randomly drawn. The sample mean $\bar{x}$ gives the average gross hourly earnings of the $n$ workers in the sample. Calculate the expected value, variance, standard deviation and find the specific form of the distribution of $\bar{x}$ for the following sample sizes: a) $n = 10$, b) $n = 50$, c) $n = 200$.
- How the expected value $E(\bar{x})$ is calculated does not depend on the distribution of $X$ in the population. Hence, there are no new aspects in the present situation and the results are identical to the previous two problems: $E(\bar{x}) = \mu = \$27.30$.
- How the variance of $\bar{x}$ is calculated does not depend on the distribution of $X$ in the population, but it does depend on the type and size of the random sample. In the statement of Problem 3 the sampling scheme has not been specified. However, for all three sample sizes $n/N < 0.05$ and, hence, even if the sample is drawn without replacement the formula $\sigma^{2}(\bar{x}) = \sigma^{2}/n$ can be used as an approximation.
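for $n = 10$ | $\sigma^{2}(\bar{x}) = 3.481$ | $\sigma(\bar{x}) = \$1.8657$ |
for $n = 50$ | $\sigma^{2}(\bar{x}) = 0.6962$ | $\sigma(\bar{x}) = \$0.8344$ |
for $n = 200$ | $\sigma^{2}(\bar{x}) = 0.1741$ | $\sigma(\bar{x}) = \$0.4172$ |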
- Since the distribution of $X$ in the population is unknown, no exact statement can be made about the distribution of $\bar{x}$.
However, the central limit theorem implies that the standardized random variable $$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$$ is approximately standard normal if the sample size $n$ is sufficiently large and, in random sampling without replacement, the size $N$ of the population is sufficiently large as well. This is satisfied for the cases b) and c).
Example of sampling distribution
$N = 7$ students take part in an exam for a graduate course and obtain the following scores: Table 1:
Student | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|
Score | 10 | 11 | 11 | 12 | 12 | 12 | 16 |
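The variable $X$ = “score of an exam” has the following population frequency distribution: Table 2:
Score $x$ | Absolute frequency $h(x)$ | Relative frequency $f(x)$ | Cumulative relative frequency $F(x)$ |
---|---|---|---|
10 | 1 | 1/7 | 1/7 |
11 | 2 | 2/7 | 3/7 |
12 | 3 | 3/7 | 6/7 |
16 | 1 | 1/7 | 7/7 |
with population parameters $\mu = 12$ and $\sigma^{2} = 22/7 \approx 3.1429$.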
Random sampling with replacement
$n = 2$ exams are sampled with replacement from the population. Table 3 contains all possible samples of size $n = 2$ with replacement, paying attention to the order of the draws (rows: first exam drawn, columns: second exam drawn): Table 3:
1st exam \ 2nd exam | 10 | 11 | 11 | 12 | 12 | 12 | 16 |
---|---|---|---|---|---|---|---|
10 | 10;10 | 10;11 | 10;11 | 10;12 | 10;12 | 10;12 | 10;16 |
11 | 11;10 | 11;11 | 11;11 | 11;12 | 11;12 | 11;12 | 11;16 |
11 | 11;10 | 11;11 | 11;11 | 11;12 | 11;12 | 11;12 | 11;16 |
12 | 12;10 | 12;11 | 12;11 | 12;12 | 12;12 | 12;12 | 12;16 |
12 | 12;10 | 12;11 | 12;11 | 12;12 | 12;12 | 12;12 | 12;16 |
12 | 12;10 | 12;11 | 12;11 | 12;12 | 12;12 | 12;12 | 12;16 |
16 | 16;10 | 16;11 | 16;11 | 16;12 | 16;12 | 16;12 | 16;16 |
For each possible sample, the sample mean can be calculated and is recorded in Table 4. Table 4:
1st exam \ 2nd exam | 10 | 11 | 11 | 12 | 12 | 12 | 16 |
---|---|---|---|---|---|---|---|
10 | 10 | 10.5 | 10.5 | 11 | 11 | 11 | 13 |
11 | 10.5 | 11 | 11 | 11.5 | 11.5 | 11.5 | 13.5 |
11 | 10.5 | 11 | 11 | 11.5 | 11.5 | 11.5 | 13.5 |
12 | 11 | 11.5 | 11.5 | 12 | 12 | 12 | 14 |
12 | 11 | 11.5 | 11.5 | 12 | 12 | 12 | 14 |
12 | 11 | 11.5 | 11.5 | 12 | 12 | 12 | 14 |
16 | 13 | 13.5 | 13.5 | 14 | 14 | 14 | 16 |
$\bar{x}$ therefore can take on various values with certain probabilities. From Table 4 the distribution of $\bar{x}$ can be determined as given in the first two columns of Table 5. Table 5:
$\bar{x}$ | $P(\bar{x})$ | $\bar{x} - E(\bar{x})$ | $[\bar{x} - E(\bar{x})]^{2}$ | $[\bar{x} - E(\bar{x})]^{2}\cdot P(\bar{x})$ |
---|---|---|---|---|
10 | 1 / 49 | - 2 | 4 | 4 / 49 |
10.5 | 4 / 49 | - 1.5 | 2.25 | 9 / 49 |
11 | 10 / 49 | - 1 | 1 | 10 / 49 |
11.5 | 12 / 49 | - 0.5 | 0.25 | 3 / 49 |
12 | 9 / 49 | 0 | 0 | 0 |
13 | 2 / 49 | 1 | 1 | 2 / 49 |
13.5 | 4 / 49 | 1.5 | 2.25 | 9 / 49 |
14 | 6 / 49 | 2 | 4 | 24 / 49 |
16 | 1 / 49 | 4 | 16 | 16 / 49 |
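The mean of this distribution, i.e. the expected value of $\bar{x}$, is given by $$E(\bar{x}) = \sum_{j} \bar{x}_j\, P(\bar{x}_j) = \frac{588}{49} = 12,$$ which is equal to the expected value of the variable $X$ in the population: $E(X) = \mu = 12$. Using the intermediate results in columns three to five of Table 5 allows one to calculate the variance of $\bar{x}$: $$Var(\bar{x}) = \sigma^{2}(\bar{x}) = \sum_{j} [\bar{x}_j - E(\bar{x})]^{2}\, P(\bar{x}_j) = \frac{77}{49} = 1.5714.$$ This result is in agreement with the formula for $\sigma^{2}(\bar{x})$ given above: $$\sigma^{2}(\bar{x}) = \frac{\sigma^{2}}{n} = \frac{3.1429}{2} = 1.5714.$$ It is easy to see that the variance of $\bar{x}$ is indeed smaller than the variance of $X$ in the population.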
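The enumeration behind Tables 3 to 5 can be replicated programmatically. The following Python sketch, using exact fractions from the standard library, lists all 49 ordered samples of size 2 drawn with replacement and derives the distribution, expected value and variance of the sample mean. Variable names are illustrative choices.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

scores = [10, 11, 11, 12, 12, 12, 16]  # population of exam scores (Table 1)

# All ordered samples of size 2 with replacement: 7 * 7 = 49 samples.
samples = list(product(scores, repeat=2))

# Count how often each sample mean occurs and turn counts into probabilities.
counts = Counter(Fraction(a + b, 2) for a, b in samples)
distribution = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

expected = sum(m * p for m, p in distribution.items())
variance = sum((m - expected) ** 2 * p for m, p in distribution.items())

for m, p in distribution.items():
    print(float(m), p)
print("E =", expected, " Var =", variance)
```

Working with `Fraction` avoids rounding, so the variance comes out exactly as $77/49 = 11/7$, which equals $\sigma^2/n$ with $\sigma^2 = 22/7$ and $n = 2$.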
Random sampling without replacement
From the population, $n = 2$ exams are randomly drawn without replacement. Table 6 displays all possible samples of size $n = 2$ from sampling without replacement, paying attention to the order of the draws. The diagonal cells are empty because the same exam cannot be drawn twice. Table 6:
1st exam \ 2nd exam | 10 | 11 | 11 | 12 | 12 | 12 | 16 |
---|---|---|---|---|---|---|---|
10 | | 10;11 | 10;11 | 10;12 | 10;12 | 10;12 | 10;16 |
11 | 11;10 | | 11;11 | 11;12 | 11;12 | 11;12 | 11;16 |
11 | 11;10 | 11;11 | | 11;12 | 11;12 | 11;12 | 11;16 |
12 | 12;10 | 12;11 | 12;11 | | 12;12 | 12;12 | 12;16 |
12 | 12;10 | 12;11 | 12;11 | 12;12 | | 12;12 | 12;16 |
12 | 12;10 | 12;11 | 12;11 | 12;12 | 12;12 | | 12;16 |
16 | 16;10 | 16;11 | 16;11 | 16;12 | 16;12 | 16;12 | |
For each possible sample, the sample mean is calculated and reported in Table 7: Table 7:
1st exam \ 2nd exam | 10 | 11 | 11 | 12 | 12 | 12 | 16 |
---|---|---|---|---|---|---|---|
10 | | 10.5 | 10.5 | 11 | 11 | 11 | 13 |
11 | 10.5 | | 11 | 11.5 | 11.5 | 11.5 | 13.5 |
11 | 10.5 | 11 | | 11.5 | 11.5 | 11.5 | 13.5 |
12 | 11 | 11.5 | 11.5 | | 12 | 12 | 14 |
12 | 11 | 11.5 | 11.5 | 12 | | 12 | 14 |
12 | 11 | 11.5 | 11.5 | 12 | 12 | | 14 |
16 | 13 | 13.5 | 13.5 | 14 | 14 | 14 | |
The first two columns of Table 8 contain the probability distribution of the sample mean $\bar{x}$. Table 8:
$\bar{x}$ | $P(\bar{x})$ | $\bar{x} - E(\bar{x})$ | $[\bar{x} - E(\bar{x})]^{2}$ | $[\bar{x} - E(\bar{x})]^{2}\cdot P(\bar{x})$ |
---|---|---|---|---|
10.5 | 4 / 42 | - 1.5 | 2.25 | 9 / 42 |
11 | 8 / 42 | - 1 | 1 | 8 / 42 |
11.5 | 12 / 42 | - 0.5 | 0.25 | 3 / 42 |
12 | 6 / 42 | 0 | 0 | 0 |
13 | 2 / 42 | 1 | 1 | 2 / 42 |
13.5 | 4 / 42 | 1.5 | 2.25 | 9 / 42 |
14 | 6 / 42 | 2 | 4 | 24 / 42 |
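The expected value is $$E(\bar{x}) = \frac{504}{42} = 12$$ and is equal to the expected value of $X$ in the population. The variance is equal to $$Var(\bar{x}) = \sigma^{2}(\bar{x}) = \frac{55}{42} = 1.3095,$$ which is in agreement with the formula for calculating $\sigma^{2}(\bar{x})$ given earlier: $$\sigma^{2}(\bar{x}) = \frac{\sigma^{2}}{n}\cdot\frac{N-n}{N-1} = \frac{3.1429}{2}\cdot\frac{7-2}{7-1} = 1.3095.$$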
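The case without replacement can be checked in the same spirit. The Python sketch below enumerates the 42 ordered samples of size 2 without replacement and compares the enumerated variance of the sample mean with the formula including the finite sample correction. Variable names are illustrative; `itertools.permutations` treats the seven exams as distinct even when their scores coincide, which matches the layout of Table 6.

```python
from fractions import Fraction
from itertools import permutations

scores = [10, 11, 11, 12, 12, 12, 16]  # population of exam scores (Table 1)
N, n = len(scores), 2

# Population parameters as exact fractions.
mu = Fraction(sum(scores), N)
sigma2 = sum((x - mu) ** 2 for x in scores) / N

# All ordered samples of size 2 without replacement: 7 * 6 = 42 samples.
samples = list(permutations(scores, n))
means = [Fraction(sum(s), n) for s in samples]

expected = sum(means) / len(means)
variance = sum((m - expected) ** 2 for m in means) / len(means)

# Formula with the finite sample correction (N - n) / (N - 1).
formula = sigma2 / n * Fraction(N - n, N - 1)

print("E =", expected, " Var =", variance, " formula =", formula)
```

Both routes give the same exact value, $55/42$, so the enumeration confirms the corrected variance formula for this small population.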
More Information
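Consider a population with distribution function $F(x)$, expected value $E(X) = \mu$ and variance $Var(X) = \sigma^{2}$. The random variables $X_i$, $i = 1, \dots, n$, all have the same distribution function $F(x_i) = F(x)$, expectation $E(X_i) = \mu$ and variance $Var(X_i) = \sigma^{2}$.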
Expectation of the sample mean
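Using the rules for the expectation of a linear combination of random variables it is easy to calculate that $$E(\bar{x}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\cdot n\,\mu = \mu$$ with $E(X_i) = \mu$. This result holds under random sampling with or without replacement and is valid for any positive sample size $n$.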
Variance of the sample mean
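In general, the variance of the sample mean is $$Var(\bar{x}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^{2}}\left[\sum_{i=1}^{n} Var(X_i) + \sum_{i=1}^{n}\sum_{j \neq i} Cov(X_i, X_j)\right].$$
(1) Random sampling with replacement: For each $i$, $Var(X_i) = \sigma^{2}$. Furthermore, under random sampling with replacement the random variables are independent and therefore have $Cov(X_i, X_j) = 0$ for $i \neq j$. The variance of the sample mean thus simplifies to $$Var(\bar{x}) = \sigma^{2}(\bar{x}) = \frac{1}{n^{2}}\cdot n\,\sigma^{2} = \frac{\sigma^{2}}{n}.$$ Note that the variance of $\bar{x}$ is equal to the variance of the population variable $X$ divided by $n$. This implies that $\sigma^{2}(\bar{x})$ is smaller than $\sigma^{2}$ (for $n > 1$) and that $\sigma^{2}(\bar{x})$ is decreasing with increasing $n$. In other words, for large $n$ the distribution of $\bar{x}$ is tightly concentrated around its expected value $\mu$.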
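(2) Random sampling without replacement: The derivation of $\sigma^{2}(\bar{x})$ in the case of random sampling without replacement is similar but more complicated because of the dependency of the random variables. Regarding the finite sample correction $\frac{N-n}{N-1}$, for large populations the following approximation is quite accurate, $$\frac{N-n}{N-1} \approx \frac{N-n}{N},$$ and the approximate correction $1 - \frac{n}{N}$ can be used. In sampling without replacement $n$ cannot exceed $N$. For fixed $n$, the finite sample correction approaches 1 with increasing $N$: $$\lim_{N\to\infty}\frac{N-n}{N-1} = 1.$$ In applications, the correction can be ignored if $n$ is small relative to $N$. Rule of thumb: $n/N \leq 0.05$. However, this will only give an approximation to $\sigma^{2}(\bar{x})$.
On the distribution of $\bar{x}$
Suppose that $X$ follows a normal distribution in the population with expectation $\mu$ and variance $\sigma^{2}$: $X \sim N(\mu, \sigma^{2})$. In this case, the random variables $X_i$ are all normally distributed: $X_i \sim N(\mu, \sigma^{2})$ for each $i$. The sum of $n$ independent and identically normally distributed random variables also follows a normal distribution: $$\sum_{i=1}^{n} X_i \sim N(n\mu, n\sigma^{2}).$$ The statistic $\bar{x}$ differs from this sum only by the constant factor $1/n$ and, hence, is also normally distributed: $\bar{x} \sim N(\mu, \sigma^{2}/n)$. Since only the standard normal distribution is tabulated the following standardized version of $\bar{x}$ is considered: $$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} = \sqrt{n}\,\frac{\bar{x}-\mu}{\sigma},$$ which follows the standard normal distribution: $Z \sim N(0, 1)$. Evidently, using the standardized variable hinges on knowing the population variance $\sigma^{2}$.
If the population variance $\sigma^{2}$ is unknown: The unknown variance $\sigma^{2}$ is estimated by $$S^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{x})^{2}.$$ Dividing both sides by $\sigma^{2}$ gives $$\frac{S^{2}}{\sigma^{2}} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i - \bar{x}}{\sigma}\right)^{2}.$$ To simplify, set $$Y = \frac{(n-1)\,S^{2}}{\sigma^{2}} = \sum_{i=1}^{n}\left(\frac{X_i - \bar{x}}{\sigma}\right)^{2}.$$ In random sampling with replacement, the $X_i$ are independent and $Y$ can therefore be expressed as a sum of $n-1$ squared independent standard normal random variables. It follows that $Y$ is chi-square distributed with $n-1$ degrees of freedom. Using the standardized random variable $Z$ to construct the ratio gives rise to the random variable $$T = \frac{Z}{\sqrt{Y/(n-1)}},$$ which follows the t-distribution with $n-1$ degrees of freedom. (Recall from Chapter 6 that a $t$ random variable is the ratio of a standard normal to the square root of an independent chi-square divided by its degrees of freedom.) Inserting the expressions for $Z$ and $Y$ and rearranging terms yields: $$T = \frac{\sqrt{n}\,(\bar{x}-\mu)/\sigma}{\sqrt{S^{2}/\sigma^{2}}} = \sqrt{n}\,\frac{\bar{x}-\mu}{S}.$$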
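Probability statements about $\bar{x}$:
If the sampling distribution of $\bar{x}$, including all its parameters, is known, then probability statements about $\bar{x}$ can be made in the usual way. Suppose one wants to find a symmetric interval around the true mean $\mu$ which will contain $\bar{x}$ with probability $1-\alpha$. That is, we need to find $c$ such that $$P(\mu - c \leq \bar{x} \leq \mu + c) = 1-\alpha.$$ It will be convenient to use the standardized random variable $$Z = \frac{\bar{x}-\mu}{\sigma(\bar{x})},$$ the distribution of which we will assume to be symmetric. Thus, the deviation $c$ from $\mu$ is a multiple of $\sigma(\bar{x})$: $c = z_{1-\alpha/2}\cdot\sigma(\bar{x})$. Inserting leads to the interval $$\left[\mu - z_{1-\alpha/2}\,\sigma(\bar{x});\ \mu + z_{1-\alpha/2}\,\sigma(\bar{x})\right]$$ with probability $$P\left(\mu - z_{1-\alpha/2}\,\sigma(\bar{x}) \leq \bar{x} \leq \mu + z_{1-\alpha/2}\,\sigma(\bar{x})\right) = 1-\alpha.$$ If $\bar{x}$ is normally distributed then the central interval of variation with pre-specified probability $1-\alpha$ is determined by reading $z_{1-\alpha/2}$ from the standard normal table. The probability $1-\alpha$ is approximately valid if $X$ has an arbitrary distribution and the sample size $n$ is sufficiently large.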
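The central interval of variation can be computed directly instead of reading the quantile from a table. The Python sketch below assumes a normally distributed sample mean; the function name `central_interval` is hypothetical, and the numbers in the usage lines ($\mu = 27.30$, $\sigma = 5.90$, $n = 50$, probability 0.95) are borrowed from the earnings example purely for illustration.

```python
import math
from statistics import NormalDist


def central_interval(mu, sd_mean, prob):
    """Symmetric interval around mu that contains the sample mean with probability `prob`.

    sd_mean is the standard deviation of the sample mean; the interval is
    mu -/+ z * sd_mean, where z is the (1 + prob) / 2 quantile of N(0, 1).
    """
    if not 0 < prob < 1:
        raise ValueError("prob must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf((1 + prob) / 2)
    return mu - z * sd_mean, mu + z * sd_mean


# Illustrative usage with the earnings figures and n = 50.
lower, upper = central_interval(27.30, 5.90 / math.sqrt(50), 0.95)
print(f"[{lower:.4f}; {upper:.4f}]")
```

If the population is not normal, the same interval holds only approximately and relies on the sample size being large enough for the central limit theorem to apply.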