# Testing Normal Means

### From MM*Stat International

In many applications one is interested in the mean of the population distribution of a particular attribute (random variable). Statistical estimation theory tells us how best to estimate the expectation for a given distribution shape, but it doesn’t by itself help us assess the uncertainty of the estimated average: an average computed from a small sample is a single number, and so is one based on a much larger sample. Intuition (and the law of large numbers) leads us to believe that the latter estimate is ‘probably’ more representative than the former, in that on average the sample mean (e.g. the arithmetic mean) of a large sample is closer to the population mean than that of a small sample. That is, sample means computed from large samples are statistically more reliable.

A method of quantifying the average closeness to the population parameter is to compute the standard error of the statistic under consideration (here: the mean), i.e. the square root of the estimated average squared deviation of the estimator from the population parameter. The actual sample mean for a given sample, in conjunction with its standard error, specifies an interval (the sample mean plus/minus one or more standard errors) into which a sample mean isn’t ‘unlikely’ to fall, given that the theoretical mean equals the one estimated from the observed sample.

Now suppose a scientist proposes a value for the theoretical mean derived from some theory or prior data analysis. If the hypothetical value turns out to be close to the sample mean and, in particular, within a certain range around the sample mean like the one specified by the standard error, he is more likely to propose it to be the true population mean than if he had initially proposed a more distant value. But how can the distance of the sample mean from the hypothetical population mean be assessed in probabilistic terms suitable for decision making based on the error concept? In other words: how can we construct a statistical test for the mean of a random variable?

Our goal is to test for a specific value $\mu_0$ of the expectation $\mu$ of a population distribution. Our data are a randomly drawn sample of size $n$, theoretically represented by the sample variables $X_1, \ldots, X_n$, and we want to base the test decision on a significance level of $\alpha$.

### Hypotheses

We can construct one- and two-sided tests.
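1) Two-sided test: $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$
2) Right-sided test: $H_0: \mu \leq \mu_0$ versus $H_1: \mu > \mu_0$
3) Left-sided test: $H_0: \mu \geq \mu_0$ versus $H_1: \mu < \mu_0$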
In a one-sided statistical hypothesis testing problem the scientific conjecture to be validated is usually stated as the alternative hypothesis $H_1$ rather than the null hypothesis $H_0$. That is, the researcher tries to statistically verify that the negation of the conjecture does *not* hold at a certain significance level $\alpha$. This is due to the ‘nature’ of the significance level we have mentioned earlier: rejecting the null hypothesis at a given significance level only means that the probability of rejecting it when it is in fact true is no greater than $\alpha$. Yet this $\alpha$ is chosen small (most commonly 0.05 or 0.01), as one tries to control the type I error in order to be ‘reasonably certain’ that an ‘unwanted’ proposition is *not* true. This makes sense if one thinks of some critical applications that rely on this approach. In testing a new drug for harmful side effects, for example, one wants to have a rationale for rejecting their systematic occurrence. In doing so one accepts the converse claim that side effects are ‘negligible’. Underlying this approach is the (unknown) relationship between $\alpha$ and $\beta$, the probabilities of a type I and a type II error: whereas we can control the former, the latter is a function of not only the former but also other test conditions such as the underlying distribution.
For these reasons it is common to speak of *not rejecting* a hypothesis instead of *accepting* it.

### Test statistic, its distribution and derived decision regions

We need a quantity to condense the information in the random sample that is required to make a probabilistic statement about the unknown distribution characteristic (in the present case the population mean). For parametric tests, this is an estimator of the parameter. We have already shown that the arithmetic mean

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

is a statistically ‘reasonable’ point estimator of the unknown population mean, i.e. the unknown expectation $E(X) = \mu$; in particular, it is unbiased and consistent. The variance and standard deviation of $\bar{X}$ computed from a simple random sample (i.e. independent and identically distributed, i.i.d., sample variables) are given by

$$\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}, \qquad \sigma(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

We will construct our test statistic around the sample mean $\bar{X}$. In order to derive the (rejection/non-rejection) regions corresponding to a given significance level, we need to make an assumption concerning the distribution of the sample mean. Either

- the random variable $X$ under investigation is normally distributed, implying normal distribution of $\bar{X}$; *or*
- $n$ is sufficiently large to justify the application of the central limit theorem: if the sample variables $X_i$ are i.i.d. with finite mean and variance, $\bar{X}$ is approximately normally distributed regardless of the underlying (continuous or discrete, symmetric or skewed) distribution. In this case, our test will in turn be an approximate one, i.e. it has additional imprecision.

We thus postulate: $\bar{X}$ is (at least approximately) normally distributed with expectation $E(\bar{X}) = \mu$ and variance $\operatorname{Var}(\bar{X}) = \sigma^2/n$.
Thus, the distribution of the estimator of the population mean depends on exactly the unknown parameter $\mu$ we are seeking to test. The only way to overcome this circular reference is to assign a numerical value to $\mu$. The least arbitrary value to take is the boundary value in the null hypothesis, i.e. the value that separates the parameter ranges for $H_0$ and $H_1$: $\mu_0$. This approach does in fact make sense if you recall the principle of rejecting the null hypothesis in order to not reject the alternative hypothesis: basing the decision on a postulated distribution of our test statistic with parameter $\mu_0$ enables us to test this particular $\mu_0$ by removing the uncertainty about $\mu$. Note that in the two-sided test $\mu_0$ makes up the entire parameter space of the null hypothesis. In one-sided tests, it is the boundary value.
Let’s put our assumption into practice and set the expectation of $\bar{X}$, i.e. $E(\bar{X}) = \mu$, to $\mu_0$. Given that the null hypothesis is true (or, for a one-sided test, that $\mu$ equals the boundary value of the null hypothesis), we can write: $\bar{X}$ is (at least approximately) normally distributed with expectation $\mu_0$ and variance $\sigma^2/n$, or, using common notation for normal distributions (expectation; variance):

$$\bar{X} \overset{H_0}{\sim} N\left(\mu_0;\ \frac{\sigma^2}{n}\right)$$

So far, we have focused on the location parameter $\mu$. But what about the second central moment that specifies a particular normal distribution, the variance $\sigma^2$ (respectively standard deviation $\sigma$) of the random variable? As you will see, it is critical to the construction of a decision rule to distinguish between situations in which we can regard $\sigma$ as known and those where we can’t.
Given a known $\sigma$, the distribution of $\bar{X}$ is completely specified. As we cannot analytically integrate the normal density function to get a closed-form normal distribution function, we rely on tables of numerical solutions for $N(0; 1)$. We thus standardize $\bar{X}$ and take

$$V = \frac{\bar{X} - \mu_0}{\sigma} \sqrt{n}$$

as our test statistic.
Given $H_0$ is true, $V$ has (approximately) a standard normal distribution:

$$V \overset{H_0}{\sim} N(0; 1)$$

The critical value corresponding to the relevant significance level $\alpha$ can thus be taken from a standard normal distribution table.
We can now write down the decision regions for the three types of test for significance level $\alpha$, given that the boundary expectation from $H_0$, i.e. $\mu_0$, is the true population mean. In what follows, $z_p$ denotes the $p$-quantile of the standard normal distribution.
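1) Two-sided test
The probability of $V$ falling into the rejection region for $H_0$ must equal the given significance level $\alpha$:

$$P(V < c_l \mid \mu_0) + P(V > c_u \mid \mu_0) = \frac{\alpha}{2} + \frac{\alpha}{2} = \alpha$$

For $P(V \leq c_u) = 1 - \alpha/2$ we can retrieve the upper critical value from the cumulative standard normal distribution table $N(0; 1)$: $c_u = z_{1-\alpha/2}$. Symmetry of the normal (bell) curve implies $c_l = -z_{1-\alpha/2}$.
The *rejection region* for $H_0$ is thus given by

$$\{v \mid v < -z_{1-\alpha/2} \ \text{or} \ v > z_{1-\alpha/2}\}$$

The *non-rejection region* for $H_0$ is then

$$\{v \mid -z_{1-\alpha/2} \leq v \leq z_{1-\alpha/2}\}$$

The probability of $V$ assuming a value from the non-rejection region for $H_0$ is

$$P(-z_{1-\alpha/2} \leq V \leq z_{1-\alpha/2} \mid \mu_0) = 1 - \alpha$$

2) Right-sided test
Deviations of the standardized test statistic $V$ from $E(V) = 0$ to the ‘right side’ (i.e. positive $v$) tend to falsify $H_0$. The rejection region will thus be a range of positive test statistic realizations (i.e. it has a positive critical value). The probability of observing a realization of $V$ within this region must equal the given significance level $\alpha$:

$$P(V > c \mid \mu_0) = \alpha$$

For $P(V \leq c) = 1 - \alpha$ we find the critical value in the table for the cumulative standard normal distribution $N(0; 1)$: $c = z_{1-\alpha}$.
The rejection region for $H_0$ is given by

$$\{v \mid v > z_{1-\alpha}\}$$

and the non-rejection region for $H_0$ is

$$\{v \mid v \leq z_{1-\alpha}\}$$

The probability of $V$ assuming a value within the non-rejection region for $H_0$ is

$$P(V \leq z_{1-\alpha} \mid \mu_0) = 1 - \alpha$$

3) Left-sided test
Sample means smaller than $\mu_0$ imply negative realizations of the test statistic $V$, that is, deviations of $V$ from $E(V) = 0$ to the left side on the real line. In this case, the rejection region for $H_0$ therefore consists of negative outcomes. Consequently, the critical value will be negative.
Once again, we require the probability of observing a realization of $V$ within the rejection region to equal $\alpha$:

$$P(V < -c \mid \mu_0) = \alpha$$

Using the symmetry property of the normal distribution, we can translate $P(V < -c) = \alpha$ into $P(V \leq c) = 1 - \alpha$. Thus, the absolute value of the critical value, $|-c| = c$, is the quantile of the standard normal distribution for probability $1 - \alpha$, i.e. $c = z_{1-\alpha}$, and the critical value itself is $-z_{1-\alpha}$.
The rejection region for $H_0$ is given by

$$\{v \mid v < -z_{1-\alpha}\}$$

and the non-rejection region for $H_0$ is

$$\{v \mid v \geq -z_{1-\alpha}\}$$

The probability of $V$ taking on a value within the non-rejection region for $H_0$ is

$$P(V \geq -z_{1-\alpha} \mid \mu_0) = 1 - \alpha$$
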
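
The decision regions for known standard deviation can be sketched in a few lines of Python. This is only an illustration, not part of the original material: the function names and the numbers in the usage lines are hypothetical, and the quantiles come from the standard library class `statistics.NormalDist` instead of a printed table.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution N(0; 1)


def z_statistic(x_bar, mu0, sigma, n):
    """Standardized sample mean v = (x_bar - mu0) / sigma * sqrt(n)."""
    return (x_bar - mu0) / sigma * n ** 0.5


def critical_values(alpha, kind):
    """Return (lower, upper) critical values; None marks a side without one."""
    if kind == "two-sided":
        z = STD_NORMAL.inv_cdf(1 - alpha / 2)
        return -z, z
    z = STD_NORMAL.inv_cdf(1 - alpha)
    if kind == "right-sided":
        return None, z
    if kind == "left-sided":
        return -z, None
    raise ValueError(f"unknown test type: {kind}")


def rejects_h0(v, alpha, kind):
    """True if the realized statistic v lies in the rejection region for H0."""
    lower, upper = critical_values(alpha, kind)
    below = lower is not None and v < lower
    above = upper is not None and v > upper
    return below or above


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the functions.
    v = z_statistic(x_bar=103.0, mu0=100.0, sigma=10.0, n=25)
    for kind in ("two-sided", "right-sided", "left-sided"):
        print(kind, critical_values(0.05, kind), rejects_h0(v, 0.05, kind))
```

The three test types share one statistic and differ only in which tail (or tails) of $N(0; 1)$ forms the rejection region, which is why a single helper returning a pair of bounds is enough here.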
If we don’t have any a priori knowledge about the standard deviation $\sigma$ of the random variable under investigation, we need to plug an estimator of it into the test statistic.
An unbiased estimator of the population variance is

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

Replacing $\sigma$ by the square root of $S^2$ yields our new test statistic:

$$V = \frac{\bar{X} - \mu_0}{S} \sqrt{n}$$

If the null hypothesis is true, $V$ has (at least approximately) a $t$-distribution with $n - 1$ degrees of freedom.
For a given significance level $\alpha$ and $n - 1$ degrees of freedom, the critical values can be read from the $t$-distribution table.
If we denote the quantile of the $t$-distribution with $n - 1$ degrees of freedom for probability $p$ by $t_{p;n-1}$, and assume $\mu_0$ is the true population mean, we have the following decision regions for the test situations under consideration.
1) Two-sided test
Rejection region for $H_0$: $\{v \mid v < -t_{1-\alpha/2;n-1} \ \text{or} \ v > t_{1-\alpha/2;n-1}\}$, where $v$ is a realization of the random variable $V$ computed from an observed sample of size $n$.
Non-rejection region for $H_0$: $\{v \mid -t_{1-\alpha/2;n-1} \leq v \leq t_{1-\alpha/2;n-1}\}$
2) Right-sided test
Rejection region for $H_0$: $\{v \mid v > t_{1-\alpha;n-1}\}$
Non-rejection region for $H_0$: $\{v \mid v \leq t_{1-\alpha;n-1}\}$
3) Left-sided test
Rejection region for $H_0$: $\{v \mid v < -t_{1-\alpha;n-1}\}$
Non-rejection region for $H_0$: $\{v \mid v \geq -t_{1-\alpha;n-1}\}$
Note: If the sample size is sufficiently large (a common rule of thumb is $n > 30$), the $t$-distribution can be adequately approximated by the standard normal distribution. That is, $V$ is approximately $N(0; 1)$ distributed. Critical values can then be read from the normal table, and the decision regions equal those derived for known population standard deviation $\sigma$. Hence, for large $n$ we can estimate $\sigma$ by $S$ and abstract from the estimation error (that will occur with probability one, even if the estimator hits the correct parameter on average, i.e. is unbiased).
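
To see how the critical values of the $t$-distribution approach those of the standard normal distribution, the sketch below computes $t$ quantiles without a table. It is an illustration under stated assumptions: the helper names are hypothetical, the distribution function is obtained by Simpson's rule applied to the $t$ density, and the quantile is found by bisection on a search interval that is wide enough for the usual significance levels but not for extreme probabilities with very few degrees of freedom.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) via Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    if x == 0:
        return 0.5
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(p, df):
    """Quantile t_{p; df}, found by bisection on the distribution function."""
    if p < 0.5:
        return -t_quantile(1 - p, df)
    lo, hi = 0.0, 200.0  # assumed search interval
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    z = NormalDist().inv_cdf(0.95)
    for df in (5, 10, 30, 100, 1000):
        print(f"df={df:5d}  t_0.95={t_quantile(0.95, df):.4f}  z_0.95={z:.4f}")
```

The printed $t$ quantiles exceed the normal quantile for every finite number of degrees of freedom and move towards it as the degrees of freedom grow, which is the content of the note above.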

### Calculating the test statistic from an observed sample

When we have obtained a random sample $x_1, \ldots, x_n$, we can compute the empirical counterparts of the theoretical test statistics we have based our test procedures on. On the theoretical level, we have expressed them in terms of (theoretical) sample variables, i.e. $X_1, \ldots, X_n$, that is, we have denoted them by capital letters: $\bar{X}$, $S$ and $V$. Actual values calculated from a sample of size $n$, $x_1, \ldots, x_n$, are denoted by $\bar{x}$, $s$ and $v$ and differ from their theoretical counterparts only in that now the variables stand for real numbers rather than a range of theoretically permissible values. Hence, the respective empirical formulae for sample mean and sample standard deviation are

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \text{and} \qquad s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Accordingly, the two realized test statistics for testing normal means for known and unknown variance respectively are

$$v = \frac{\bar{x} - \mu_0}{\sigma} \sqrt{n} \qquad \text{and} \qquad v = \frac{\bar{x} - \mu_0}{s} \sqrt{n}$$

You may have recognized that we have already applied this notation when specifying the decision regions.
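
As a small illustration of these empirical formulae, the following sketch computes the sample mean, the sample standard deviation and the realized test statistic from a list of observations. The function name and the data are hypothetical; `statistics.stdev` uses the divisor $n - 1$, matching the unbiased variance estimator.

```python
import math
from statistics import mean, stdev


def realized_statistics(sample, mu0, sigma=None):
    """Return (x_bar, s, v) for an observed sample.

    If sigma is given, v is the statistic for known standard deviation;
    otherwise the sample standard deviation s is plugged in.
    """
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # square root of the sample variance with divisor n - 1
    scale = sigma if sigma is not None else s
    v = (x_bar - mu0) / scale * math.sqrt(n)
    return x_bar, s, v


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical observations
    print(realized_statistics(data, mu0=2.0, sigma=2.0))  # sigma known
    print(realized_statistics(data, mu0=2.0))             # sigma estimated by s
```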

### Test decision and interpretation

If the test statistic $v$ falls into the rejection region, the null hypothesis is rejected on the basis of a random sample of size $n$ and a given significance level $\alpha$: ‘$H_1$’. Statistically, we have concluded that the true expectation $\mu$ does not equal the hypothetical $\mu_0$.
If the *true* parameter *does* belong to the range postulated in the null hypothesis ($H_0$), we have made a type I error: ‘$H_1$’ | $H_0$. In fact, in choosing a particular significance level, we are really deciding about the probability of making exactly this error, since the decision regions are constructed such that the probability of making a type I error equals the significance level: $P(\text{‘}H_1\text{’} \mid H_0) = \alpha$.
If, on the other hand, $v$ falls into the non-rejection region, the particular sample leads to a non-rejection of the null hypothesis for the given significance level: ‘$H_0$’. Thus, we are not able to show statistically that the true parameter $\mu$ deviates from the hypothetical one ($\mu_0$). Chances are non-trivial, though, that we are making a type II error, i.e. that the alternative hypothesis correctly describes reality: ‘$H_0$’ | $H_1$. As already pointed out, the probability $\beta$ of making a type II error is, in general, unknown and has to be computed for individual alternative parameter values $\mu$.

### Power

How can we assess the ‘goodness’ of a test? We have seen that in setting up a test procedure we are controlling the probability of making a type I error (by assigning a value to the significance level $\alpha$). The probability $\beta$ of making a type II error is then determined by the true (and unknown) parameter. The smaller $\beta$ is for a given true parameter $\mu$, the more reliable the test is, in that it more frequently rejects the null hypothesis when the alternative hypothesis is really true. Hence, given a specific significance level, we want $\beta$ to be as small as possible for true parameter ranges outside that specified in the null hypothesis, or, equivalently, we want to maximize the probability of making the correct decision ‘$H_1$’ | $H_1$, that is, maximize the quantity $1 - \beta$ for any given true $\mu$ outside the null hypothesis region, i.e. inside that of the alternative hypothesis.
This notion of ‘goodness’ of a test is conceptualized with the so-called *power*, a function assigning probabilities of rejecting $H_0$ to true parameter values $\mu$ within the parameter space, for given $\alpha$ and hypothetical parameter $\mu_0$. These probabilities represent the theoretical averages of making a right decision in rejecting $H_0$ over all possible samples (given $\alpha$ and $\mu_0$). They can thus be computed without utilizing actual samples; in fact, the power is computed because we can obtain only a limited sample and aim to quantify the expected ‘accuracy’ of the individual test procedure.
Technically, the power $P(\mu)$ yields the probability of rejecting $H_0$ given a (hypothetically) true parameter value $\mu$:

$$P(\mu) = P(V \in \text{rejection region for } H_0 \mid \mu) = P(\text{‘}H_1\text{’} \mid \mu)$$

1) Two-sided test
In a two-sided test, the null hypothesis is true if and only if $\mu = \mu_0$. Rejecting $H_0$ given that it is true means we have made a type I error:

$$P(\mu_0) = P(\text{‘}H_1\text{’} \mid H_0) = \alpha$$

For all other possible parameter values, rejecting $H_0$ is a right decision:

$$P(\mu) = P(\text{‘}H_1\text{’} \mid H_1) = 1 - \beta, \qquad \mu \neq \mu_0$$

We thus have

$$P(\mu) = \begin{cases} \alpha & \text{if } \mu = \mu_0 \\ 1 - \beta & \text{if } \mu \neq \mu_0 \end{cases}$$

Using our normality assumption about the underlying probability distribution, we can analytically calculate the power for the case of a two-sided test, where $\Phi$ denotes the standard normal distribution function:

$$P(\mu) = 1 - \left[ \Phi\left(z_{1-\alpha/2} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right) - \Phi\left(-z_{1-\alpha/2} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right) \right]$$

The probability of a type II error can be calculated from the power:

$$\beta(\mu) = 1 - P(\mu), \qquad \mu \neq \mu_0$$

Properties of the power for a two-sided test:

- For $\mu = \mu_0$, the power assumes its minimum, $\alpha$.
- The power is symmetrical around the hypothetical parameter value $\mu_0$.
- The power increases with growing distance of the true parameter $\mu$ from the hypothetical $\mu_0$ and converges to one as $\mu$ tends to $+\infty$ or $-\infty$ respectively.
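
These three properties can be checked numerically. The sketch below evaluates the power of the two-sided test for known standard deviation, assuming an exactly normal sample mean; the function name and all parameter values are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist


def power_two_sided(mu, mu0, sigma, n, alpha):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha / 2)
    shift = (mu - mu0) / (sigma / n ** 0.5)  # expectation of V under the true mu
    non_rejection = std_normal.cdf(z - shift) - std_normal.cdf(-z - shift)
    return 1 - non_rejection


if __name__ == "__main__":
    # Hypothetical setting: mu0 = 0, sigma = 1, n = 16, alpha = 0.05.
    for mu in (-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0):
        print(f"mu={mu:+.2f}  power={power_two_sided(mu, 0.0, 1.0, 16, 0.05):.4f}")
```

The values printed for $\mu$ and $-\mu$ coincide, the smallest value occurs at $\mu = \mu_0$, and the power approaches one as the true mean moves away from $\mu_0$ in either direction.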

The above characteristics are illustrated in the following power curve diagram.

In the above diagram, two alternative true parameter values $\mu_1$ and $\mu_2$ are depicted. If $\mu_1$ is the true parameter, the distance $\mu_1 - \mu_0$ is comparatively large. Consequently, the probability $1 - \beta$ of making a right decision in not rejecting the alternative hypothesis (conversely, correctly rejecting the null) is relatively high and the probability of making a type II error, $\beta$, small. The distance of the ‘hypothetically true’ parameter value $\mu_2$ from the hypothetical parameter value $\mu_0$, $\mu_2 - \mu_0$, is relatively small. Hence, the probability of making a right decision in rejecting the null hypothesis, $1 - \beta$, is smaller than in the first example, and the probability of making a type II error, $\beta$, greater. This is intuitively plausible: relatively small deviations are less easily discovered by the test.
2) Right-sided test
In a right-sided test, the null hypothesis is true if the true parameter $\mu$ is less than or equal to the hypothetical boundary value $\mu_0$, i.e. if $\mu \leq \mu_0$. If this is the case, the maximum probability of rejecting the null hypothesis, and hence making a type I error, equals the significance level $\alpha$:

$$P(\text{‘}H_1\text{’} \mid H_0) \leq \alpha$$

If the alternative hypothesis, i.e. $\mu > \mu_0$, is true, rejecting the null hypothesis and hence making a right decision occurs with probability

$$P(\text{‘}H_1\text{’} \mid H_1) = 1 - \beta$$

Combining these formulae for the two disjoint subsets of the parameter space gives the power:

$$P(\mu) = \begin{cases} P(\text{‘}H_1\text{’} \mid H_0) \leq \alpha & \text{if } \mu \leq \mu_0 \\ P(\text{‘}H_1\text{’} \mid H_1) = 1 - \beta & \text{if } \mu > \mu_0 \end{cases}$$

We can explicitly calculate the power for our right-sided test problem for all possible true parameter values $\mu$, where $\Phi$ denotes the standard normal distribution function:

$$P(\mu) = 1 - \Phi\left(z_{1-\alpha} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right)$$

The following diagram displays the typical shape of the power for a right-sided test problem.

For all values within the parameter set of the alternative hypothesis, the power increases monotonically to one. The greater the distance $\mu - \mu_0$, the higher the probability $1 - \beta$ of making a right decision in not rejecting the alternative hypothesis, and hence the smaller the probability $\beta$ of making a type II error. At the point $\mu = \mu_0$ the power is $\alpha$, the given significance level. For all other values associated with the null hypothesis, i.e. $\mu < \mu_0$, the power is less than $\alpha$. That’s what we assumed when we constructed the test: we want $\alpha$ to be the *maximum* probability of rejecting the null hypothesis for a true null hypothesis. As you can see from the graph, this probability decreases with rising absolute distance $|\mu - \mu_0|$.
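
The shape of the right-sided power curve can be tabulated with a short sketch. As before, this assumes a normally distributed sample mean with known standard deviation; the function name and the parameter values are hypothetical.

```python
from statistics import NormalDist


def power_right_sided(mu, mu0, sigma, n, alpha):
    """Probability of rejecting H0: mu <= mu0 when the true mean is mu."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha)
    shift = (mu - mu0) / (sigma / n ** 0.5)
    return 1 - std_normal.cdf(z - shift)


if __name__ == "__main__":
    # Hypothetical setting: mu0 = 0, sigma = 1, n = 16, alpha = 0.05.
    for mu in (-0.5, -0.25, 0.0, 0.25, 0.5, 1.0):
        region = "H0" if mu <= 0.0 else "H1"
        print(f"mu={mu:+.2f} ({region})  power={power_right_sided(mu, 0.0, 1.0, 16, 0.05):.4f}")
```

Inside the null hypothesis region the printed values stay at or below $\alpha$, reaching $\alpha$ exactly at the boundary $\mu_0$; inside the alternative region they rise monotonically towards one.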
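3) Left-sided test
In a left-sided test, the null hypothesis is true if the true parameter is greater than or equal to the hypothetical boundary value, that is, if $\mu \geq \mu_0$. In this case, rejecting the null hypothesis, and hence making a type I error, will happen with probability of no more than $\alpha$:

$$P(\text{‘}H_1\text{’} \mid H_0) \leq \alpha$$

If the alternative hypothesis is true, i.e. $\mu < \mu_0$, the researcher makes a right decision in rejecting the null hypothesis, the probability being:

$$P(\text{‘}H_1\text{’} \mid H_1) = 1 - \beta$$

For the entire parameter space we thus have:

$$P(\mu) = \begin{cases} P(\text{‘}H_1\text{’} \mid H_0) \leq \alpha & \text{if } \mu \geq \mu_0 \\ P(\text{‘}H_1\text{’} \mid H_1) = 1 - \beta & \text{if } \mu < \mu_0 \end{cases}$$

For our normally distributed population we can calculate the probability of rejecting $H_0$ as a function of the true parameter value $\mu$ (the power) explicitly, where $\Phi$ denotes the standard normal distribution function:

$$P(\mu) = \Phi\left(-z_{1-\alpha} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right)$$
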
A typical graph of the power for a left-sided test is depicted in the following diagram.

The graph is interpreted similarly to the right-sided test case.

Suppose we consider a right-sided test of $H_0: \mu \leq \mu_0$ against $H_1: \mu > \mu_0$, where the standard deviation $\sigma$ in the population is known. In this interactive example you can study the impact of the significance level $\alpha$ and the sample size $n$ on the size of the type II error. You can specify

- the sample size $n$,
- the significance level $\alpha$,
- and a true $\mu$ that will give rise to a type II error (that is, a $\mu$ inside the region of the alternative hypothesis).

After you have made your choices you will be presented with a display containing

- the distribution of the sample mean under $H_0$ (red bell curve),
- the distribution of the sample mean under $H_1$ using your chosen $\mu$ (blue bell curve),
- the critical value for rejecting the null hypothesis (black vertical line),
- the probability $\alpha$ of making a type I error (red area under the red bell curve),
- and the probability $\beta$ of making a type II error (blue area under the blue bell curve).
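
The quantities shown in such a display can also be computed directly. The sketch below is not the interactive example itself: the function name and all numbers are hypothetical, and the sample mean is assumed to be exactly normal with known $\sigma$. It returns the critical value on the scale of the sample mean (the vertical line) and the type II error probability (the area under the $H_1$ curve to the left of that line).

```python
from statistics import NormalDist


def type_two_error(mu, mu0, sigma, n, alpha):
    """Return (critical sample mean, beta) for the right-sided test of H0: mu <= mu0."""
    se = sigma / n ** 0.5  # standard deviation of the sample mean
    critical_mean = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    # beta: probability that the sample mean stays below the critical value
    # although the true mean is mu
    beta = NormalDist(mu, se).cdf(critical_mean)
    return critical_mean, beta


if __name__ == "__main__":
    # Hypothetical runs: vary one parameter at a time, as recommended in the text.
    for n, alpha in ((16, 0.05), (64, 0.05), (16, 0.10)):
        c, beta = type_two_error(mu=0.5, mu0=0.0, sigma=1.0, n=n, alpha=alpha)
        print(f"n={n:3d} alpha={alpha:.2f}  critical mean={c:.4f}  beta={beta:.4f}")
```

Comparing the runs shows the two levers discussed here: a larger sample narrows both bell curves and reduces $\beta$, while a larger $\alpha$ moves the critical value towards $\mu_0$ and reduces $\beta$ at the cost of a higher type I error probability.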

By varying $n$, $\alpha$ and $\mu$, you can explore the impact of these test parameters on the type II error probability. To isolate the impacts we recommend changing the value of only one parameter in successive trials. To facilitate easy comparison you will be shown a display for the current run (lower display) and the previous run (upper display).

Assume that the random variable $X$ in a population of overdraft facilities has a normal distribution with unknown expectation $\mu$ and known standard deviation $\sigma$. Based on a simple random sample, the hypothesis that $\mu$ equals the hypothetical value $\mu_0$ has to be tested at a significance level of $\alpha$: $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$. You can carry out this test as often as you like; for every new run a new sample is drawn from the population. It is up to you to control the significance level $\alpha$ and the sample size $n$. You can vary them as you like and isolate their effects by holding either of these test parameters constant. In particular, you can

- hold both the significance level $\alpha$ *and* the sample size $n$ constant to observe different test decisions based on different samples;
- vary the significance level $\alpha$ for a fixed sample size $n$;
- change the sample size $n$ and leave $\alpha$ fixed at your chosen level; or
- vary both the significance level $\alpha$ and the sample size $n$.
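
Repeating the test on fresh samples can be imitated with a small simulation. The following sketch is an assumption-laden stand-in for the interactive example: the function name, the seed and the population values are hypothetical, and samples are drawn with `random.Random.gauss` from a normal population with known $\sigma$.

```python
import random
from statistics import NormalDist


def rejection_rate(mu, mu0, sigma, n, alpha, runs, seed=0):
    """Share of simulated samples for which the two-sided test rejects H0: mu = mu0."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    z = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(runs):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        v = (x_bar - mu0) / sigma * n ** 0.5
        if abs(v) > z:
            rejections += 1
    return rejections / runs


if __name__ == "__main__":
    # Hypothetical population: mu0 = 100, sigma = 15.
    print("H0 true :", rejection_rate(100.0, 100.0, 15.0, n=20, alpha=0.05, runs=4000))
    print("H0 false:", rejection_rate(110.0, 100.0, 15.0, n=20, alpha=0.05, runs=4000))
```

When the null hypothesis is true, the share of rejections should settle near the chosen $\alpha$; when it is false, the share estimates the power at the assumed true mean.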

We will now illustrate how information about the population can influence the choice of the test statistic, the decision regions and, depending on the sample at hand, the test decision.

A car tire producer alters the mix of raw material entering the production process in an attempt to increase the average life of the output. After the first new tires have been sold, competitors claim that the average life of the new tires doesn’t exceed that of the old ones, which is known to be $\mu_0$ km. The random variable under investigation is the actual life of the population of new tires, measured in km, denoted by $X$, and the producers’ claim is that its expectation $E(X) = \mu$ is higher than the historical one of the old types, $\mu_0$. Management wishes to scientifically test this claim and commissions a statistical investigation hoping to verify that the average life has in fact increased, i.e. that $\mu > \mu_0$. But they also want to minimize the risk of making a wrong decision so as not to be exposed to competitors’ (justified) counter-arguments.

## Hypothesis

Since deviations in one direction are the subject of the dispute, a one-sided test will be conducted. We state the producers’ claim as the alternative hypothesis with the hope that the sample rejects the null hypothesis, yielding a right-sided test: $H_0: \mu \leq \mu_0$ versus $H_1: \mu > \mu_0$, where $\mu_0$ is the known average life of the old tires.

Does this operationalisation support the producers’ intention? We can answer this question by analyzing the possible errors. Rejecting $H_0$ gives rise to the possibility of a type I error. Not rejecting the null hypothesis exposes the decision-maker to a type II error. The producers’ emphasis is on keeping the type I error small, as its implications are more severe than those of the type II error: with the production process going ahead and thus the available sample of tires gradually increasing, an actual average life below the claimed one would sooner or later be revealed. The maximum probability of the type I error, $P(\text{‘}H_1\text{’} \mid H_0)$, is given by the significance level $\alpha$, a parameter the producer can control. Thus, the test is in line with the producers’ requirements.

The probability of making a type II error, $\beta = P(\text{‘}H_0\text{’} \mid H_1)$, is unknown, as the true average life of the new process’s output is unknown. The probability of not verifying an increase in the average life of the tires that has actually taken place can be substantial. That’s the price the producer has to pay for choosing the conservative approach of stating the claim as the alternative hypothesis and actively controlling the significance level, thus keeping the crucial type I error small. This trade-off makes sense, as the perceived long-term reliability of the producer is more important than short-term sales gains.

## 1st alternative

### Significance level and sample size

The test will be conducted at a significance level of $\alpha = 0.05$. A sample of size $n$ is taken from the output. As the population is reasonably large (a couple of thousand tires have already been produced), the sample can be regarded as a simple random sample.

### Test statistic and its distribution; decision regions

Sample-based investigations into the tires’ properties carried out prior to the implementation of changes in the production process indicate that the fluctuations in the life of the tires can be described ‘reasonably’ well by a normal distribution with standard deviation $\sigma$. Assuming this variability is still valid in the new production regime, we have for the distribution of the sample mean under the null hypothesis:

$$\bar{X} \overset{H_0}{\sim} N\left(\mu_0;\ \frac{\sigma^2}{n}\right)$$

Under $H_0$, the test statistic

$$V = \frac{\bar{X} - \mu_0}{\sigma} \sqrt{n}$$

follows the standard normal distribution: $V \overset{H_0}{\sim} N(0; 1)$.

The critical value $c$ that satisfies $P(V \leq c) = 1 - \alpha = 0.95$ can be found from the cumulative standard normal distribution table as the 95 % quantile: $c = z_{0.95} = 1.645$. The resulting decision regions are:

Non-rejection region for $H_0$: $\{v \mid v \leq 1.645\}$.
Rejection region for $H_0$: $\{v \mid v > 1.645\}$.

### Sampling and computing the test statistic

Suppose the average life of the $n$ randomly selected tires is $\bar{x}$. Then the realized test statistic value is

$$v = \frac{\bar{x} - \mu_0}{\sigma} \sqrt{n}$$

### Test decision and interpretation

As $v$ is an element of the rejection region for $H_0$, the null hypothesis is rejected. Based on a sample of size $n$ and a significance level of $\alpha = 0.05$, we have shown statistically that the new tires can be used significantly longer than the old ones, that is, that the true expectation $\mu$ of the tires’ life is greater than the hypothetical value $\mu_0$. The test has resulted in a non-rejection of the alternative hypothesis $H_1$.

The producer makes a type I error (‘$H_1$’ | $H_0$) if the null hypothesis correctly describes reality ($\mu \leq \mu_0$). But the probability of an occurrence of this error has intentionally been kept small with the significance level $\alpha = 0.05$.

If the alternative hypothesis is true, a right decision has been made: ‘$H_1$’ | $H_1$. The probability of this situation can only be computed for specific true population parameters. For an assumed true value $\mu$, the power is

$$P(\mu) = 1 - \Phi\left(1.645 - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right)$$

where $\Phi$ denotes the standard normal distribution function. The greater the increase in average life, the higher the power of the test, i.e. the probability $1 - \beta$; a larger assumed true mean therefore yields a larger value of $P(\mu)$.

## 2nd alternative

The significance level and sample size remain constant, and we continue to assume a normal distribution of the new tires’ lives. But we drop the restrictive assumption of a constant standard deviation. We now allow for it to have changed with the introduction of the new production process.

### Test statistic and its distribution; decision regions

Since we now have to estimate the unknown standard deviation $\sigma$ with its empirical counterpart, the square root of the sample variance, $S$, we must employ the $t$-statistic

$$V = \frac{\bar{X} - \mu_0}{S} \sqrt{n}$$

which, under $H_0$, has a $t$-distribution with $n - 1$ degrees of freedom. We can look up the critical value $c$ satisfying $P(V \leq c) = 1 - \alpha = 0.95$ as the upper 5 % quantile of the $t$-distribution with $n - 1$ degrees of freedom in a $t$-distribution table: $c = t_{0.95;n-1}$. Thus, our decision regions are:

Non-rejection region for $H_0$: $\{v \mid v \leq t_{0.95;n-1}\}$.
Rejection region for $H_0$: $\{v \mid v > t_{0.95;n-1}\}$.

You will notice that the size of the non-rejection region has increased, since $t_{0.95;n-1} > z_{0.95}$. This is due to the added uncertainty about the unknown dispersion parameter $\sigma$. Consequently, there must be a larger allowance for variability in the test statistic for the same $\alpha$ and sample size than in the normal test for known standard deviation.

### Sampling and computing the test statistic

Along with the sample mean $\bar{x}$, the sample standard deviation $s$ has to be computed from the observed sample. The realized test statistic value is then

$$v = \frac{\bar{x} - \mu_0}{s} \sqrt{n}$$

### Test decision and interpretation

As $v$ falls into the rejection region, the null hypothesis is rejected. Based on a sample of size $n$ and a significance level of $\alpha = 0.05$, we were again able to statistically show that the true (and unknown) expectation of the new tires’ lives has increased from its former (i.e. hypothetical) level of $\mu_0$.
Of course, we still don’t know the true parameter $\mu$, and if it happens to be less than (or equal to) $\mu_0$, we have made a type I error, for we have rejected a *true* null hypothesis: ‘$H_1$’ | $H_0$. In choosing a significance level of 5 per cent we have restricted the probability of this error to a maximum of 5 per cent (the actual value depending on the true parameter $\mu$).
If the true parameter *does* lie within the region specified by the alternative hypothesis, we have made a right decision in rejecting the null hypothesis: ‘$H_1$’ | $H_1$. The probability of this event, $1 - \beta$, can be (approximately) computed for alternative true population means $\mu$ if we assume the sample standard deviation to be the true one in the population, i.e. $\sigma = s$.

## 3rd alternative

Suppose we now drop the assumption of normality, which is a situation more relevant to practical applications. In order to conduct an approximate test about $\mu$, we require the sample size $n$ to be sufficiently large (a common rule of thumb is $n > 30$). If the sample size is smaller than that, we cannot justify the application of the central limit theorem, as the approximation wouldn’t be good enough. The managers decide to pick a larger sample of tires, incurring further sampling costs as the price of employing a more suitable and therefore more reliable statistical procedure. Further, suppose that a new significance level $\alpha$ is chosen.

### Test statistic and its distribution; decision regions

As in the 2nd alternative, the $t$-statistic $V = \frac{\bar{X} - \mu_0}{S} \sqrt{n}$ has to be used. Having chosen a sufficiently large number of independent observations, we can justify employing the central limit theorem and approximate the distribution of this standardized statistic by a standard normal distribution:

$$V \overset{as}{\sim} N(0; 1)$$

In the above statement, ‘as’ stands for ‘asymptotically’: $V$ is asymptotically standard normal, that is, the standard normal distribution is the limit it converges to as $n$ tends to infinity. For finite samples, the standard normal distribution serves as an approximation. The critical value $c$ satisfying $P(V \leq c) = 1 - \alpha$ is then (approximately) the upper $100\alpha$ per cent quantile of the standard normal distribution, $c = z_{1-\alpha}$, and we have the following decision regions:

Non-rejection region for $H_0$: $\{v \mid v \leq z_{1-\alpha}\}$.
Rejection region for $H_0$: $\{v \mid v > z_{1-\alpha}\}$.

### Sampling and computing the test statistic

As in alternative 2, we have to compute both the sample mean $\bar{x}$ and the sample standard deviation $s$ as estimates of their population counterparts $\mu$ and $\sigma$, now from the new, larger sample of size $n$. The realized test statistic value is then:

$$v = \frac{\bar{x} - \mu_0}{s} \sqrt{n}$$

### Test decision and interpretation

As $v$ lies within the rejection region, the null hypothesis is rejected. On the basis of a particular sample of size $n$ and a significance level of $\alpha$ we were able to statistically verify that the true population mean $\mu$ of the new tires’ lives is greater than the tires’ expected life before the implementation of the new process, $\mu_0$.
If the null hypothesis is in fact true, we have made a type I error. Fortunately, the probability of rejecting a true null hypothesis has been chosen not to exceed $\alpha$ for any true population mean within the parameter space specified in $H_0$.
Given the small (maximum) type I error probability of $\alpha$, it is much more likely that we are right in rejecting the null hypothesis: ‘$H_1$’ | $H_1$. But the associated probability, $1 - \beta$, can only be computed for specific true parameter values. As in alternative 2, we have to assume a known $\sigma$ in order to calculate this quantity, by setting $\sigma = s$.
A company is packing wheat flour. The machine has been set up to fill 1000 grams (g) into each bag. Of course, the probability of any bag containing exactly 1 kg is zero (as weight is a continuous variable), and even if we take into account the limited precision of measurement, we will still expect some fluctuation around the desired (theoretical) content of 1 kg in actual output. But without prior knowledge we can’t even be sure that the *average* weight of output is actually 1 kg. Fortunately, we have means of testing this statistically.

Denote by $X$ the actual net weight per bag. We are interested in the expectation of this random variable, i.e. the average net bag weight, $E(X) = \mu$. Is it sufficiently close to $\mu_0 = 1000$ g, the ideal quantity we want the machine to fill into each bag? As the machine has to be readjusted from time to time to produce output statistically close enough to the required weight, the producer regularly takes samples to assess the current precision of the packing process. If the mean of any of these samples differs statistically significantly from the hypothetical value $\mu_0$, the machine has to be readjusted.

## Hypothesis

Management is interested in deviations of the actual from the desired weight of $\mu_0 = 1000$ g in both directions. Filling in too much isn’t cost-effective and putting in too little may trigger investigations from consumer organizations, with all the negative publicity that comes with it. Thus, a two-sided test is indicated: $H_0: \mu = \mu_0$ versus $H_1: \mu \neq \mu_0$, where $\mu_0 = 1000$ g.

## Sample size and significance level

The statistician decides to test at a significance level of $\alpha$ and asks a technician to extract a sample of $n$ bags. As the population, that is, the overall production, is large compared to the sample size, the statistician can regard the sample as a simple random sample.

## Test statistic and its distribution; decision regions

The estimator of the unknown population mean is the sample mean . Experience has shown that the actual weight can be approximated sufficiently closely by a normal distributions with standard deviation . The estimator is then normally distributed with standard deviation . Under , i.e. given, the true population parameter equals the hypothetical (desired) one, , is thus normally distributed with parameters and : The test statistic is the standardization of the sample mean, and follows the standard normal distribution: We can look up the upper critical value in the cumulative standard normal distribution table as to satisfy . Using symmetry of the normal curve, . We thus have: the non-rejection region for H: and the rejection region for H: .

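Rejection region | Non-rejection region | Rejection region
---|---|---
$v < -z_{1-\alpha/2}$ | $-z_{1-\alpha/2} \le v \le z_{1-\alpha/2}$ | $v > z_{1-\alpha/2}$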
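
The table lookup of the critical values can also be done numerically. The following is a minimal sketch using Python's standard library; the function names and the level $\alpha = 0.05$ are illustrative assumptions, not values taken from the example.

```python
from statistics import NormalDist

def two_sided_critical_values(alpha):
    """Return the lower and upper critical values of a two-sided test
    based on a standard normal test statistic."""
    upper = NormalDist().inv_cdf(1 - alpha / 2)  # quantile z_{1-alpha/2}
    return -upper, upper                         # symmetry of the normal curve

def decision_region(v, alpha):
    """Name the region in which a realized test statistic v falls."""
    lower, upper = two_sided_critical_values(alpha)
    return "non-rejection" if lower <= v <= upper else "rejection"

# alpha = 0.05 is an assumed example level, not taken from the text
lower, upper = two_sided_critical_values(0.05)
print(round(lower, 2), round(upper, 2))  # -1.96 1.96
print(decision_region(0.5, 0.05))        # non-rejection
print(decision_region(-2.5, 0.05))       # rejection
```

A larger $\alpha$ moves both critical values towards zero and thereby widens the rejection region.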

## Drawing the sample and calculating the test statistic

$n$ bags are selected randomly and their net content is weighed. The arithmetic mean of these measurements is $\bar{x}$. The realized test statistic value is thus

$$v = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.$$

## Test decision and interpretation

As $v$ lies within the non-rejection region for $H_0$, the null hypothesis is not rejected. Based on a sample of size $n$, the true parameter value $\mu$ couldn’t be shown to differ statistically significantly from the hypothetical mean value $\mu_0$, i.e. we couldn’t verify that the packing process is not precise.
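
The steps from the sample mean to the test decision can be sketched in a few lines of Python. All numbers except $\mu_0 = 1000$ g are hypothetical: $\sigma = 10$ g, $n = 25$, $\alpha = 0.05$ and $\bar{x} = 996.4$ g are assumed purely for illustration, and the function name is not part of any library.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_test(x_bar, mu0, sigma, n, alpha):
    """Two-sided test for a normal mean with known standard deviation.

    Returns the realized test statistic, the upper critical value and
    the decision (True means that H0 is rejected)."""
    v = (x_bar - mu0) / (sigma / sqrt(n))    # standardized sample mean
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper critical value
    return v, z, abs(v) > z

# Hypothetical inputs (only mu0 = 1000 g comes from the example)
v, z, reject = two_sided_z_test(x_bar=996.4, mu0=1000.0, sigma=10.0, n=25, alpha=0.05)
print(round(v, 2), round(z, 2), reject)  # -1.8 1.96 False
```

With these assumed inputs the realized statistic falls inside the non-rejection region, so the null hypothesis is not rejected.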

## Power

Not having rejected the null hypothesis, we inevitably run the risk of having made a type II error, i.e. the alternative hypothesis is true but we have not rejected the null hypothesis. We should therefore assess the reliability of our decision in terms of type II error probabilities for parameter values different from that stated in the null hypothesis, i.e. $\mu \neq \mu_0$. They are given by $\beta(\mu) = 1 - P(\mu)$, where $P(\mu)$ denotes the power of the test.

Suppose a value $\mu$ only slightly different from $\mu_0$ is the true average weight, and the alternative hypothesis therefore a true statement. As the power assigns probabilities of right decisions to alternative true parameter values, $P(\mu)$ is the probability of making a right decision (correctly rejecting the null hypothesis):

$$P(\mu) = P(V \in \text{rejection region for } H_0 \mid \mu).$$

Plugging $\mu$, $\mu_0$, $\sigma$, $n$ and $z_{1-\alpha/2}$ into the formula for the power,

$$P(\mu) = 1 - \left[\Phi\left(z_{1-\alpha/2} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right) - \Phi\left(-z_{1-\alpha/2} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right)\right],$$

where $\Phi$ denotes the standard normal distribution function, gives $P(\mu) \approx 0.17$. The probability of making a type II error if this $\mu$ is the true population mean is therefore $\beta(\mu) = 1 - P(\mu) \approx 0.83$. Thus, if this is the true average weight, 83 % of all samples of size $n$ would not convert that fact into a correct test decision (rejection of the null) for the given significance level $\alpha$. Since $\mu - \mu_0$ is only a relatively small difference in statistical terms, the probability of a type II error is large.

If, on the other hand, a value $\mu$ far from $\mu_0$ is the true average weight, $P(\mu)$ again returns the probability of making a right decision in rejecting the null hypothesis: $P(\mu) \approx 0.9998$, and we can calculate $\beta(\mu) = 1 - P(\mu) \approx 0.0002$. In this case, only 0.02 % of all samples will result in a non-rejection of the null hypothesis and hence a wrong decision. The probability of a type II error is small, because the difference $\mu - \mu_0$ is large in statistical terms. The following table lists values of $P(\mu)$ and $\beta(\mu)$ for selected true population averages $\mu$, given the above $\mu_0$, $\sigma$, $n$ and $\alpha$.
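
The power and the type II error probability of the two-sided test can be evaluated numerically. The sketch below rests on assumptions: $\sigma = 10$ g, $n = 25$, $\alpha = 0.05$ and the true means 1002 g and 989 g are not stated in the text above. They are chosen because they yield type II error probabilities matching the quoted 83 % and 0.02 %.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(mu, mu0, sigma, n, alpha):
    """Probability of rejecting H0: mu = mu0 when the true mean is mu
    (two-sided test, known standard deviation sigma)."""
    phi = NormalDist().cdf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    shift = (mu - mu0) / (sigma / sqrt(n))  # standardized true deviation
    return 1 - (phi(z - shift) - phi(-z - shift))

# Assumed settings (only mu0 = 1000 g is given in the example)
settings = dict(mu0=1000.0, sigma=10.0, n=25, alpha=0.05)

beta_near = 1 - power_two_sided(1002.0, **settings)  # small deviation
beta_far = 1 - power_two_sided(989.0, **settings)    # large deviation
print(round(beta_near, 2))  # 0.83
print(round(beta_far, 4))   # 0.0002
```

At $\mu = \mu_0$ the same function returns the significance level $\alpha$, the probability of a type I error.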

True hypothesis | $\mu$ | $P(\mu)$ | $\beta(\mu) = 1 - P(\mu)$
---|---|---|---

The following diagram shows the graph of the power curve.

We can alter the shape of the power curve for a given (fixed) significance level $\alpha$ in our favour by increasing the sample size $n$. We will illustrate the effect of a change in the sample size for the two ‘hypothetically’ true parameter values considered above. The other test parameters remain constant: $\mu_0$, $\sigma$ and $\alpha$.

The next diagram displays the power of the two-sided test for these alternative sample sizes.

When there is reason to believe that the machine produces output with small deviations from the desired weight, an increase of the sample size is advisable to statistically ‘discover’ these deviations reliably and minimize the type II error risk, provided the extra sampling costs incurred are outweighed by the information gain.
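
The effect of the sample size on the type II error probability can be tabulated with a short script. This is a sketch under assumed values: $\sigma = 10$ g, $\alpha = 0.05$, the true means 1002 g and 989 g, and the sample sizes 25, 100 and 400 are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu, n, mu0=1000.0, sigma=10.0, alpha=0.05):
    """Probability of not rejecting H0: mu = mu0 in the two-sided test
    when the true mean is mu (sigma and alpha are assumed defaults)."""
    phi = NormalDist().cdf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    shift = (mu - mu0) / (sigma / sqrt(n))
    return phi(z - shift) - phi(-z - shift)

# Hypothetical sample sizes and true means
for n in (25, 100, 400):
    print(n, round(type_ii_error(1002.0, n), 4), round(type_ii_error(989.0, n), 4))
```

For both true means the printed error probabilities fall as $n$ grows, and the small deviation benefits most from the larger samples.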

## Formulating the Hypotheses

Let us illustrate the problem of choosing an appropriate null (and hence alternative) hypothesis with a real-world example.
Consider a company manufacturing car tires. Alterations in the production process are undertaken in order to increase the tires’ lives. Yet competitors will not hesitate to claim that the average life of the tires hasn’t increased from the initial, pre-restructuring value of $\mu_0$ kilometers (km). The producer’s management wants to justify the investment in the new production process and the subsequent advertising campaign (i.e. save their necks) and commissions a scientific, i.e. statistical, investigation.
That’s our part. The variable of interest is the life of an individual tire measured in km, denoted by, say, $X$. It is a random variable, because its fluctuations in magnitude depend on many unknown and known factors that cannot practically be taken into account (such as speed, weight of the individual car, driving patterns, weather conditions, and even slight variations in the production process). Before the ‘improvements’ in the production process, the average life of the particular type of car tire was $\mu_0$ km; in theoretical terms, the expectation was $E(X) = \mu_0$. The mean value under the new production process is unknown and is, in fact, the quantity we want to compare in statistical terms with $\mu_0$: the producer pays the statistician(s) to show objectively whether $\mu > \mu_0$. Note that we denote the true expectation under the new regime by $\mu$, as this is the parameter we are interested in and thus want to test. The ‘old’ mean $\mu_0$ ‘merely’ serves as a benchmark, and the actual output it represents (the old tires) doesn’t receive further attention (and in particular neither do its fluctuations around the mean).
The statement that management hopes the statistician will ‘prove’ scientifically, $\mu > \mu_0$, looks very much like a readily testable hypothesis. But as we have emphasized earlier, there is a crucial difference between formalized statements of scientific interest and the means of testing them by stating a null hypothesis suitable for making a reliable decision, that is, a decision that is backed by acceptable type I and type II error probabilities.
So which hypothesis shall we test? It should be clear that the problem at hand is a one-sided one; only deviations of the new expected life from the historical expected life in one direction are of interest. In deciding whether to test the hypothesis $\mu > \mu_0$ as it is already formalized, using a left-sided test procedure, or to test the negation, $\mu \le \mu_0$, on a right-sided basis, we have to focus on the actual aim of the investigation: the tire producer intends to verify the claim of $\mu$ being greater than $\mu_0$, whilst at the same time controlling the risk of making a wrong decision (type I error) at a level that allows him to regard the (hopefully positive, i.e. a rejection of the null) test decision as statistically proven. This would be the case if the reverse claim of the new tires being no more durable can be rejected at an acceptable (i.e. small) significance level, for a sample leading to rejection would be unlikely if the null hypothesis, $H_0: \mu \le \mu_0$, were true, which supports the alternative hypothesis, $H_1: \mu > \mu_0$. But that’s exactly the result the managers want to see. Let’s therefore state the negation of the statement to be tested as the null hypothesis (and hope it will be rejected at the given significance level):

$$H_0: \mu \le \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0,$$

with $\mu_0$ the average life before the change in the production process.
If the sample of new tires’ usable lives leads to a rejection of the null hypothesis (a decision in favour of $H_1$), a type I error will be made if the null hypothesis is true. If the null hypothesis is *not* rejected on the basis of a particular sample of size $n$, the conjecture stated in the alternative hypothesis may still be true, in which case the researcher has (unknowingly) made a type II error.
Comparing the implications of type I and type II errors for this example shows that the former’s impact on the manufacturer’s fortunes is the crucial one, for

- the competitors can carry out (more or less) similar investigations using a left-sided test, leading to the PR nightmare associated with a possible contradiction of the producers’ test result,
- future investigation into tires subsequently produced would reveal the actual properties of the tires as the sample size inevitably increases with the amount sold, triggering even more embarrassing questions concerning the integrity and reliability of the manufacturer.

For these reasons, the tire manufacturer is best advised to keep the probability of a type I error, $\alpha$, small by choosing a small significance level.
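
A right-sided test of $H_0: \mu \le \mu_0$ against $H_1: \mu > \mu_0$ can be sketched as follows, assuming in addition that tire life is normally distributed with a known standard deviation. Every number below ($\mu_0 = 40000$ km, $\sigma = 1500$ km, $n = 36$, $\bar{x} = 40700$ km, $\alpha = 0.01$) is hypothetical and serves only to show the mechanics.

```python
from math import sqrt
from statistics import NormalDist

def right_sided_z_test(x_bar, mu0, sigma, n, alpha):
    """Test H0: mu <= mu0 against H1: mu > mu0 with known sigma.

    Returns the realized test statistic, the critical value and the
    decision (True means that H0 is rejected)."""
    v = (x_bar - mu0) / (sigma / sqrt(n))
    z = NormalDist().inv_cdf(1 - alpha)  # all of alpha lies in the upper tail
    return v, z, v > z

# Hypothetical tire data, not taken from the text
v, z, reject = right_sided_z_test(x_bar=40700.0, mu0=40000.0, sigma=1500.0, n=36, alpha=0.01)
print(round(v, 2), round(z, 3), reject)  # 2.8 2.326 True
```

Only large positive values of the test statistic count against the null hypothesis here; a sample mean below $\mu_0$ can never lead to a rejection.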

## Decision regions

When testing with either one- or two-sided tests, the size of the non-rejection and rejection regions on the $v$ (standardized test statistic) axis depends only on:

- the given (chosen) level of significance $\alpha$: ceteris paribus, increasing $\alpha$ will increase the size of the rejection region for $H_0$, and will reduce the size of the non-rejection region (and vice versa).

Alternatively, when testing with either one- or two-sided tests, the size of the non-rejection and rejection regions on the $\bar{x}$ axis (the scale of our original random variable) depends on:

- the given (chosen) level of significance $\alpha$: ceteris paribus, increasing $\alpha$ will increase the size of the rejection region for $H_0$, and will reduce the size of the non-rejection region (and vice versa);
- the sample size $n$: ceteris paribus, the larger the sample size, the greater the size of the rejection region for $H_0$, and the smaller the size of the non-rejection region (and vice versa); and
- the dispersion of the variable in the population and therefore in the sample: ceteris paribus, an increased variability $\sigma$ or $s$ leads to a decrease in the size of the rejection region for $H_0$, and increases the size of the non-rejection region (and vice versa).

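That is, the critical values on the standardized test statistic axis are independent of the size of $n$ or $\sigma$ (alternatively, $s$). The same cannot be said for the “equivalent” critical values on the original $\bar{x}$ axis, where sample size and dispersion affect the magnitude of “acceptable” deviations from the null.
If the population variance $\sigma^2$ is known, the critical values and therefore the non-rejection/rejection regions for $H_0$ can easily be calculated for the sample mean $\bar{x}$. We will do this for a two-sided test.
We have derived the test statistic as a standardization of the estimator $\bar{X}$:

$$V = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}},$$

and, in terms of realizations $x_i$ of the sample variables $X_i$:

$$v = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.$$

In a two-sided test the non-rejection region for $H_0$ consists of all realizations $v$ of $V$ greater than or equal to $-z_{1-\alpha/2}$ *and* less than or equal to $z_{1-\alpha/2}$:

$$\{v \mid -z_{1-\alpha/2} \le v \le z_{1-\alpha/2}\}.$$

Thus, the critical values $-z_{1-\alpha/2}$ and $z_{1-\alpha/2}$ are possible realizations of the test statistic $V$. They are subject to the same standardization carried out to convert $\bar{X}$ into $V$ in order to express it in units comparable with standard normal quantiles:

$$-z_{1-\alpha/2} = \frac{\bar{x}_l - \mu_0}{\sigma/\sqrt{n}}, \qquad z_{1-\alpha/2} = \frac{\bar{x}_u - \mu_0}{\sigma/\sqrt{n}}.$$

As $-z_{1-\alpha/2}$ is the lower critical value with respect to $v$, we have similarly denoted the lower critical value for $\bar{x}$ by $\bar{x}_l$ (the same applies to the upper bound of the non-rejection region, denoted by the subindex $u$).
We can isolate the upper and lower bound of the non-rejection region for $H_0$ in terms of the units of the sample mean:

$$\bar{x}_l = \mu_0 - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}, \qquad \bar{x}_u = \mu_0 + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}.$$

The resulting non-rejection region for $H_0$ in terms of $\bar{x}$ is

$$\{\bar{x} \mid \bar{x}_l \le \bar{x} \le \bar{x}_u\},$$

and the associated rejection region is given by the complement

$$\{\bar{x} \mid \bar{x} < \bar{x}_l \ \text{or} \ \bar{x} > \bar{x}_u\}.$$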
Similar transformations can be imposed on the estimators for one-sided tests.
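
Translating the critical values back to the scale of the sample mean can be sketched as follows. The function name and the numbers ($\mu_0 = 1000$, $\sigma = 10$, $\alpha = 0.05$, $n = 25$ and $n = 100$) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def sample_mean_bounds(mu0, sigma, n, alpha):
    """Bounds of the two-sided non-rejection region for H0: mu = mu0,
    expressed in the units of the sample mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(n)  # critical value times standard error
    return mu0 - half_width, mu0 + half_width

# Assumed example values
low_25, high_25 = sample_mean_bounds(mu0=1000.0, sigma=10.0, n=25, alpha=0.05)
low_100, high_100 = sample_mean_bounds(mu0=1000.0, sigma=10.0, n=100, alpha=0.05)
print(round(low_25, 2), round(high_25, 2))    # 996.08 1003.92
print(round(low_100, 2), round(high_100, 2))  # 998.04 1001.96
```

Quadrupling the sample size halves the width of the non-rejection region on the $\bar{x}$ axis, while the critical values on the standardized axis stay at $\pm 1.96$.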

## Power curve

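We will derive the power curve for a two-sided population mean test. The power is calculated as

$$P(\mu) = P(V \in \text{rejection region for } H_0 \mid \mu) = 1 - P\left(-z_{1-\alpha/2} \le V \le z_{1-\alpha/2} \mid \mu\right).$$

Assuming $\mu$ to be the true population mean, we have

$$P(\mu) = 1 - P\left(-z_{1-\alpha/2} \le \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \le z_{1-\alpha/2} \,\Big|\, \mu\right).$$

Adding $\mu - \mu = 0$ to the numerator of the middle term yields

$$P(\mu) = 1 - P\left(-z_{1-\alpha/2} \le \frac{\bar{X} - \mu + \mu - \mu_0}{\sigma/\sqrt{n}} \le z_{1-\alpha/2} \,\Big|\, \mu\right) = 1 - P\left(-z_{1-\alpha/2} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right),$$

and, since $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ is standard normal when $\mu$ is the true mean,

$$P(\mu) = 1 - \left[\Phi\left(z_{1-\alpha/2} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right) - \Phi\left(-z_{1-\alpha/2} - \frac{\mu - \mu_0}{\sigma/\sqrt{n}}\right)\right].$$

The power for the one-sided tests can be derived in a similar fashion. From a decision-theoretical point of view it is desirable that the probability of correctly rejecting the null hypothesis increases quickly with a growing distance between the true parameter $\mu$ and the hypothetical value $\mu_0$, that is, we want the graph of the power curve to be as steep as possible in that range of the true parameter value. For a given estimator and test statistic, there are two possible ways of improving the ‘shape’ of the power curve.

1) Increasing the sample size

For any $\mu \neq \mu_0$, the above formula for the power of a two-sided test for the mean is clearly positively related to the size of the sample $n$. In general, ceteris paribus, the graph of the power curve becomes steeper with growing $n$: for any true parameter value within the $H_1$ region (i.e. $\mu \neq \mu_0$ for the two-sided, $\mu > \mu_0$ for the right-sided and $\mu < \mu_0$ for the left-sided test), the probability of rejecting the null hypothesis, and hence making a right decision, increases with growing $n$. That’s mirrored by a decreasing probability of making a type II error. Thus, the probability of correctly discriminating between the true and the hypothetical parameter value grows with increasing sample size. Given a fixed significance level $\alpha$, the probability of a type II error can be improved (reduced) by ‘simply’ enlarging the sample. The following diagram displays the graphs of power curves based on four distinct sample sizes.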
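
The one-sided analogues of the power formula follow from the same standardization argument, with the whole of $\alpha$ placed in one tail. The sketch below states them as code and evaluates them for growing $n$; the function names and all numeric inputs are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_right_sided(mu, mu0, sigma, n, alpha):
    """Power of the test of H0: mu <= mu0 against H1: mu > mu0."""
    z = NormalDist().inv_cdf(1 - alpha)
    shift = (mu - mu0) / (sigma / sqrt(n))
    return 1 - NormalDist().cdf(z - shift)  # P(V > z | mu)

def power_left_sided(mu, mu0, sigma, n, alpha):
    """Power of the test of H0: mu >= mu0 against H1: mu < mu0."""
    z = NormalDist().inv_cdf(1 - alpha)
    shift = (mu - mu0) / (sigma / sqrt(n))
    return NormalDist().cdf(-z - shift)     # P(V < -z | mu)

# Assumed values: mu0 = 1000, sigma = 10, alpha = 0.05, true mean 1002
for n in (25, 100, 400):
    print(n, round(power_right_sided(1002.0, 1000.0, 10.0, n, 0.05), 3))
```

Both functions return $\alpha$ at $\mu = \mu_0$, and for a true mean inside the respective $H_1$ region the printed power rises with $n$.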

2) Varying the significance level
Ceteris paribus, allowing for a higher probability of making a type I error, i.e. increasing the significance level $\alpha$, will shift the graph of the power curve upwards. This means that a higher $\alpha$ leads to an increase in the probability of rejecting the null hypothesis for *all* possible true parameter values $\mu$. If the true parameter value lies within the $H_1$ region ($\mu \neq \mu_0$ for the two-sided, $\mu > \mu_0$ for the right-sided and $\mu < \mu_0$ for the left-sided test), rejecting the null is a right decision: the probability of correctly rejecting the null hypothesis has increased, and the probability of making a type II error has decreased. **But** the probability of rejecting the null hypothesis has also increased for true parameter values within the $H_0$ region, increasing the probability of making a type I error. Hence, we encounter a trade-off between the probabilities of making a type I and a type II error, a problem that cannot be overcome mechanically, but has to be tackled within some sort of preference-based decision-theoretical approach.
In the diagram below, the power curve of a two-sided test with fixed sample size $n$ is depicted for two alternative significance levels. The red and the blue graph each represent $P(\mu)$ for one of the two levels.
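
The trade-off between the two error probabilities can be made concrete by evaluating the two-sided power at two significance levels. This is a sketch under assumed values: $\mu_0 = 1000$, $\sigma = 10$, $n = 25$, the true mean 1002 and the levels 0.05 and 0.10 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(mu, alpha, mu0=1000.0, sigma=10.0, n=25):
    """Two-sided power at true mean mu (mu0, sigma and n are assumed defaults)."""
    phi = NormalDist().cdf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    shift = (mu - mu0) / (sigma / sqrt(n))
    return 1 - (phi(z - shift) - phi(-z - shift))

for alpha in (0.05, 0.10):
    type_i = power_two_sided(1000.0, alpha)       # rejection probability under H0
    type_ii = 1 - power_two_sided(1002.0, alpha)  # non-rejection probability under H1
    print(alpha, round(type_i, 2), round(type_ii, 2))
```

Raising $\alpha$ lowers the type II error probability at the assumed true mean, but the rejection probability at $\mu = \mu_0$ rises by exactly the same step as $\alpha$.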