# Key Concepts


Statistical tests are tools for the analysis of hypotheses about the characteristics of unknown probability distributions or relationships between random variables. If the probability distribution is specified up to a finite set of parameters, testing for the fully specified probability density amounts to testing whether the parameters take on specific values. As the mathematical specification of a class of probability distributions involves writing down a function that contains parameters whose values aren’t known a priori, tests based on postulated parameters that determine the characteristics of a probability distribution are dubbed ‘parametric’ tests. Statistical estimation procedures can be used to obtain estimates of the specific parameter(s) of interest. Statistical test theory provides a means of quantifying the significance of such estimates. Closely related to the choice of the parameter value(s) is the choice of the class of probability distributions. Such a fully specified distribution has to describe reality as accurately and reliably as possible. In practice, the choice of a functional class such as the Normal (or Gaussian) distribution and estimating and testing parameters is an iterative process. Empirical researchers will have to consider various models (alternative distributions) at the explorative stage of the investigation into the nature of the phenomena of interest. However, very often certain probability models are chosen a priori for their tractability rather than on theoretical grounds. When the postulated class of distribution functions is theory-driven in that it is the result of logical deduction from accepted premises, testing for the significance of parameters forms an important part of the verification of scientific theory. Much of empirical research is, however, data-driven in that there is no a priori distribution function. The objective of a parametric statistical hypothesis test procedure can be summarized as follows. 
Given a certain population with parametric distribution function ${\displaystyle F\left(x\right)}$ (with parameters such as expected value ${\displaystyle \mu }$ and variance ${\displaystyle \sigma ^{2}}$ in the class of normal distributions, or the proportion ${\displaystyle \pi }$ in a repeated Bernoulli experiment), some ‘guess’ (hypothesis) about the true parameter value(s) has to be tested on the basis of an observed sample of finite size. Clearly, this would not be necessary if one could observe the random variable under consideration for all members of the population of interest (which is not even theoretically possible for continuous random variables, as a continuous distribution comprises infinitely many possible outcomes). In general, (sub-)samples cannot convey all the information necessary to precisely describe the underlying distribution (even if they are representative in terms of some suitable concept), and because they are the result of a (random) sampling process, their implied parameter values (as determined by statistical estimation procedures conducted with the sample data) are themselves random variables. Often these estimates will only equal the correct population parameter value on average. Fortunately, statistical tests provide an appropriate yardstick that allows us to quantify and assess whether the difference between the sample-specific (i.e. statistically estimated) and hypothesized parameter values is statistically significant. In short, we evaluate whether our hypothesized parameter value is close enough to the estimated parameter value for the sampling process to have caused the difference, or whether the two numbers (or vectors) cannot be reconciled even after having allowed for sampling noise (i.e. whether the noise created by observing only a finite sample of elements can account for the difference).
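The sampling-induced randomness of an estimator is easy to visualize by simulation. The following sketch assumes a hypothetical normal population with ${\displaystyle \mu =10}$ and ${\displaystyle \sigma =2}$ (all numbers are purely illustrative) and shows that the sample mean scatters randomly around the true parameter value:

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical normal population with known parameters (illustrative values).
mu, sigma, n = 10.0, 2.0, 50

# Draw many independent samples of size n and estimate mu from each one.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(1000)
]

# The estimates are themselves random variables: they scatter around the
# true mu with a spread of roughly sigma / sqrt(n).
print(round(statistics.fmean(sample_means), 3))
print(round(statistics.stdev(sample_means), 3))
```

On average the estimator hits ${\displaystyle \mu }$, but any single sample mean deviates from it; this is exactly the sampling noise that a statistical test has to account for.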
In order to put the above verification problem on an objective decision-theoretical basis, statistical tests have been devised to tackle the problems that might otherwise lead us to rely on subjective assessments. Questions that we will need to address include:

• What is the ’correct’ formulation of the actual hypothesis in mathematical terms?
• How is the data to be condensed? (i.e. which statistics or estimators are to be used)
• How is the difference of the condensed collected data from the structure implied by the hypothesis to be quantified? (i.e. what expression will we use for our test statistic)
• How is the quantified difference to be evaluated in decision-theoretical terms? When is the difference statistically significant? (i.e. what is the distribution of our test statistic and what is acceptable sampling noise)

To provide an objective rationale for the verification of hypotheses (given certain assumptions about the functional class of the distribution, etc.), statistical tests must satisfactorily address all of the above issues. We can get a grasp of the key concepts and terms of statistical tests by considering an example of a parametric test. Let ${\displaystyle \theta }$ be a parameter of the distribution function of the random variable ${\displaystyle X}$. Its true value is unknown, but we can specify the parameter space, which is the set of possible values it can assume.

## Formulating the Hypothesis

The hypothesis states a relation between the true parameter ${\displaystyle \theta }$ and the hypothetical value ${\displaystyle \theta _{0}}$. Usually a pair of connected hypotheses is formulated, the null hypothesis ${\displaystyle {\text{H}}_{0}}$ and the alternative hypothesis ${\displaystyle {\text{H}}_{1}}$. The null hypothesis is the statistical statement to be tested; thus it has to be formulated in such a way that statistical tests can be performed upon it. Sometimes the underlying scientific hypothesis can be tested directly, but in general the scientific statement has to be translated into a statistically tractable null hypothesis. In many cases the null will be the converse of the conjecture to be tested. This is due to certain properties of parametric statistical tests, which we will be dealing with later on. The asserted relation between the true parameter ${\displaystyle \theta }$ and the hypothetical value ${\displaystyle \theta _{0}}$ is stated so that the combined feasible parameter values of both null and alternative hypothesis capture the entire parameter space. Clearly the alternative hypothesis can be thought of as a converse of the null hypothesis. Here are the possible variants:

| | Null hypothesis | Alternative hypothesis |
|---|---|---|
| a) Two-sided test | ${\displaystyle {\text{H}}_{0}:\theta =\theta _{0}}$ | ${\displaystyle {\text{H}}_{1}:\theta \neq \theta _{0}}$ |
| b) One-sided tests: | | |
| Right-sided test | ${\displaystyle {\text{H}}_{0}:\theta \leq \theta _{0}}$ | ${\displaystyle {\text{H}}_{1}:\theta >\theta _{0}}$ |
| Left-sided test | ${\displaystyle {\text{H}}_{0}:\theta \geq \theta _{0}}$ | ${\displaystyle {\text{H}}_{1}:\theta <\theta _{0}}$ |

The two-sided hypothesis in a) is a so-called simple hypothesis, because the parameter set of the null hypothesis contains exactly one value. As the alternative hypothesis highlights, deviations from the hypothetical value ${\displaystyle \theta _{0}}$ in both directions are relevant to the validity of the hypothesis. That’s why it is referred to as two-sided. The hypotheses of the one-sided tests under b) belong to the class of composite hypotheses. ‘Composite’ refers to the parameter set of the null hypothesis being composed of more than one value. Consequently, not rejecting the null hypothesis wouldn’t completely specify the distribution function, as there is a set of (in the above cases infinitely many) parameter values that have not been rejected. The hypotheses are one-sided, because a deviation from the hypothetical parameter value in only one direction can negate the null hypothesis—depending on that direction, these tests are further characterized as left- or right-sided. Clearly, the scientific problem to be formulated in statistical terms determines which test will be applied. Note some important principles of hypothesis formulation:

• Statistical test procedures ‘test’ (i.e. reject or do not reject) the null hypothesis.
• Null and alternative hypotheses are disjoint, that is, their respective parameter sets have no value in common.
• Parameter sets encompassing exactly one value will always belong to the null hypothesis.

## Test Statistic

In order to follow the above procedure, we need a quantity to base our decision rule on. We need a suitable estimator in order to extract the information required to properly compare the hypothetical with the sample-specific parameter value(s). If an estimator is used as a verification quantity within a statistical test procedure, we call it a test statistic, or simply a statistic. We will denote the statistic by ${\displaystyle V=V\left(X_{1},\ldots ,X_{n}\right)}$. The statistic ${\displaystyle V}$ is a function of the sample variables ${\displaystyle X_{1},\ldots ,X_{n}}$ and hence itself a random variable with some distribution ${\displaystyle F_{V}\left(v\right)}$. In order to conduct a statistical test, the distribution of ${\displaystyle V}$ for a valid null hypothesis has to be known (at least approximately). Thus, we consider ${\displaystyle F_{V}}$ conditional on (given) the null hypothesis: ${\displaystyle F_{V}=F_{V}\left(v|{\text{H}}_{0}\right)}$. In the case of a parametric test this means that the distribution of the test statistic depends on the (unknown) parameter ${\displaystyle \theta }$: ${\displaystyle F\left(v\,|\,\theta \right)}$. In order to determine this distribution, the parameter ${\displaystyle \theta }$ has to be specified numerically. But the only a priori information about ${\displaystyle \theta }$ at hand is the hypothetical boundary value ${\displaystyle \theta _{0}}$. Thus we will now (at least for the time being) assume that ${\displaystyle \theta _{0}}$ is the true parameter value prevailing in the population, i.e. ${\displaystyle \theta =\theta _{0}}$. In a two-sided test, this assumption accurately reflects the null hypothesis. In a one-sided test, the boundary value ${\displaystyle \theta _{0}}$ must belong to the null hypothesis—one reason why ‘equality’, i.e. ${\displaystyle \theta =\theta _{0}}$, always belongs to the parameter space of the null hypothesis.
For all three possible test scenarios we are thus assuming that the test statistic ${\displaystyle V}$ has a distribution with parameter ${\displaystyle \theta _{0}}$ under the null hypothesis. Observing the random variable under consideration on ${\displaystyle n}$ statistical observations yields a sample ${\displaystyle x_{1},\ldots ,x_{n}}$. Plugging these realizations into the test statistic gives a realization of the test statistic: ${\displaystyle v=v\left(x_{1},\ldots ,x_{n}\right)}$.
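As a concrete illustration, the familiar ${\displaystyle z}$-statistic for the mean of a normal population with known standard deviation can serve as ${\displaystyle V}$. The sample values, ${\displaystyle \theta _{0}}$ and ${\displaystyle \sigma }$ below are purely illustrative assumptions:

```python
import math
import statistics

# Illustrative sample of n = 10 observations x_1, ..., x_n.
x = [10.4, 9.8, 10.9, 10.1, 9.6, 10.7, 10.3, 9.9, 10.5, 10.2]
n = len(x)

theta0 = 10.0  # hypothetical parameter value under H0 (assumed)
sigma = 0.5    # population standard deviation, treated as known (assumed)

# Realization v of the test statistic V = (mean(X) - theta0) / (sigma / sqrt(n)),
# which follows a standard normal distribution if theta = theta0 holds.
v = (statistics.fmean(x) - theta0) / (sigma / math.sqrt(n))
print(round(v, 3))
```

Whether this realization ${\displaystyle v}$ is ‘close enough’ to zero is precisely what the decision regions of the following section determine.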

## Decision regions and significance level

Being a random variable, the test statistic can take on one of several possible values. If the test statistic for a given sample is sufficiently close to the hypothetical parameter value, the difference may be considered ‘random’. In this case the null hypothesis won’t be rejected. Yet this doesn’t mean that the null hypothesis is correct (or has been ‘accepted’) and hence that ${\displaystyle \theta _{0}}$ is the true parameter value. The only permissible statement is that, given the particular sample, it cannot be ruled out with a certain degree of confidence that the underlying population follows a distribution specified by the parameter value ${\displaystyle \theta _{0}}$. Large deviations of the test statistic from the hypothetical parameter value make the null hypothesis appear implausible. In this situation the sample may ‘as well’ have been generated by a population distributed according to parameter values suggested in the alternative hypothesis. We can then assume that a parameter value other than ${\displaystyle \theta _{0}}$ specifies the true population distribution. Still, that doesn’t mean ${\displaystyle \theta _{0}}$ is wrong with certainty. We can only say that it is very unlikely that a population following the thus specified probability distribution has generated the sample we have observed. Following these considerations, the set of possible test statistic realizations is partitioned into two disjoint regions, reflecting whether the observed sample can be reconciled with the null hypothesis for a given level of ‘plausibility’ (non-rejection region) or not (rejection region).

## Non-rejection region of null hypothesis

The non-rejection region for H${\displaystyle _{0}}$ is the set of possible outcomes of the test statistic leading to a decision in favour of H${\displaystyle _{0}}$, i.e. to the non-rejection of H${\displaystyle _{0}}$.

## Rejection region of null hypothesis

The rejection region (or critical region) for H${\displaystyle _{0}}$ encompasses all possible outcomes of the test statistic that lead to a rejection of H${\displaystyle _{0}}$. Rejection and non-rejection regions for H${\displaystyle _{0}}$ form a disjoint and exhaustive decomposition of all possible outcomes of the test statistic. If the outcomes are real-valued, there are boundary values termed ‘critical values’ that partition the real line into rejection and non-rejection regions. The critical values themselves belong to the non-rejection region. In order to obtain a usable decision rule, these critical values have to be computed. This is accomplished using probability theory. The probability that any sample induces the test to reject H${\displaystyle _{0}}$ when the null hypothesis is actually true (i.e. when the true parameter value falls into the region stated in the null hypothesis) must not be greater than the significance level ${\displaystyle \alpha }$: ${\displaystyle P\left(V{\text{ is element of rejection region for H}}_{0}\,|\,\theta _{0}\right)\leq \alpha .}$ Accordingly, the probability of ${\displaystyle V}$ assuming a value in the non-rejection region, when ${\displaystyle V}$ is computed from a sample drawn from a population with parameter ${\displaystyle \theta _{0}}$, is at least ${\displaystyle \left(1-\alpha \right)}$: ${\displaystyle P\left(V{\text{ is element of non-rejection region associated with H}}_{0}\,|\,\theta _{0}\right)\geq 1-\alpha .}$ Given the probability ${\displaystyle \alpha }$, critical values can be derived from the test statistic’s conditional probability distribution ${\displaystyle {\text{F}}\left(v|{\text{H}}_{0}\right)}$. This is why the distribution of the test statistic given that ${\displaystyle {\text{H}}_{0}}$ is true must be known (at least approximately).
As the probability ${\displaystyle \alpha }$ determines whether any given sample deviates significantly from the value implied by the hypothesized parameter set, it is termed the level of significance. For heuristic reasons (mainly historical in nature), the significance level is chosen to be small, so that the null hypothesis is only rejected if the sample is very unlikely to stem from the hypothesized distribution—usually either ${\displaystyle 0.01}$, ${\displaystyle 0.05}$ or ${\displaystyle 0.10}$. We will now derive decision regions for the three test scenarios introduced earlier, for a given significance level ${\displaystyle \alpha }$ and assuming the validity of H${\displaystyle _{0}}$. For convenience’s sake, in what follows we assume ${\displaystyle V}$ to be normally distributed.

Two-sided test: ${\displaystyle {\text{H}}_{0}:\theta =\theta _{0}\quad {\text{ versus }}\quad {\text{H}}_{1}:\theta \neq \theta _{0}}$

Rejection region for ${\displaystyle {\text{H}}_{0}}$: In a two-sided test, the rejection region is composed of two sets (areas), as deviations of the sample statistic from the hypothesized parameter value ${\displaystyle \theta _{0}}$ in both directions matter. The non-rejection region is separated from these two rejection regions by two critical values ${\displaystyle c_{l}}$ and ${\displaystyle c_{u}}$ (it resides between the two portions of the rejection region, which helps explain why two-sided tests are also often referred to as two-tailed tests: the two rejection regions reside in the tails of the probability distribution of ${\displaystyle V}$). The rejection region consists of all realizations ${\displaystyle v}$ of the test statistic ${\displaystyle V}$ smaller than the lower critical value ${\displaystyle c_{l}}$ or greater than the upper critical value ${\displaystyle c_{u}}$: ${\displaystyle \left\{v\,|\,v<c_{l}{\text{ or }}v>c_{u}\right\}.}$ The combined probability of sampling a value from the rejection region, given H${\displaystyle _{0}}$ (i.e. ${\displaystyle \theta =\theta _{0}}$) is true, equals the given significance level ${\displaystyle \alpha }$: ${\displaystyle {\text{P}}\left(V<c_{l}{\text{ or }}V>c_{u}\,|\,\theta _{0}\right)=\alpha /2+\alpha /2=\alpha .}$

Non-rejection region for ${\displaystyle {\text{H}}_{0}}$: The non-rejection region for ${\displaystyle {\text{H}}_{0}}$ encompasses all possible values ${\displaystyle v}$ of the test statistic ${\displaystyle V}$ smaller than (or equal to) the upper critical value ${\displaystyle c_{u}}$ and greater than (or equal to) the lower critical value ${\displaystyle c_{l}}$: ${\displaystyle \left\{v\,|\,c_{l}\leq v\leq c_{u}\right\}.}$ The probability of encountering a test statistic realization within the non-rejection region, given ${\displaystyle \theta _{0}}$ is true, is ${\displaystyle \left(1-\alpha \right)}$: ${\displaystyle {\text{P}}\left\{c_{l}\leq V\leq c_{u}\,|\,\theta _{0}\right\}=1-\alpha .}$
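For a standard normally distributed test statistic, the critical values of the two-sided test can be computed directly from the inverse distribution function. A minimal sketch, with illustrative values for ${\displaystyle \alpha }$ and the observed realization ${\displaystyle v}$:

```python
from statistics import NormalDist

alpha = 0.05      # chosen significance level (illustrative)
Z = NormalDist()  # standard normal: assumed distribution of V under H0

# Two-sided test: alpha/2 probability mass in each tail.
c_l = Z.inv_cdf(alpha / 2)        # lower critical value
c_u = Z.inv_cdf(1 - alpha / 2)    # upper critical value

# The rejection region {v < c_l or v > c_u} carries probability alpha under H0.
p_reject = Z.cdf(c_l) + (1 - Z.cdf(c_u))
print(round(c_l, 3), round(c_u, 3), round(p_reject, 3))

# Decision rule for an observed realization v of V (illustrative value).
v = 1.2
print("reject H0" if v < c_l or v > c_u else "do not reject H0")
```

For ${\displaystyle \alpha =0.05}$ the familiar critical values ${\displaystyle \pm 1.96}$ result, and the realization ${\displaystyle v=1.2}$ falls into the non-rejection region.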

By design, there is exactly one critical region associated with one-sided tests: deviations of the test statistic from the hypothetical parameter value are ‘significant’ in only one direction. The critical value splitting non-rejection and rejection region is denoted by ${\displaystyle c}$.

Left-sided test: ${\displaystyle {\text{H}}_{0}:\theta \geq \theta _{0}\quad {\text{ versus }}\quad {\text{H}}_{1}:\theta <\theta _{0}}$

Rejection region for ${\displaystyle {\text{H}}_{0}}$: The critical or rejection region for ${\displaystyle {\text{H}}_{0}}$ consists of realizations ${\displaystyle v}$ of the test statistic ${\displaystyle V}$ smaller than ${\displaystyle c}$: ${\displaystyle \left\{v\,|\,v<c\right\}.}$ The probability that the test statistic assumes a value from the rejection region, given ${\displaystyle {\text{H}}_{0}}$ is true, is less than or equal to the significance level ${\displaystyle \alpha }$: ${\displaystyle {\text{P}}\left\{V<c\,|\,\theta _{0}\right\}\leq \alpha .}$

Non-rejection region for ${\displaystyle {\text{H}}_{0}}$: The non-rejection region for ${\displaystyle {\text{H}}_{0}}$ encompasses all realizations ${\displaystyle v}$ of the test statistic ${\displaystyle V}$ greater than or equal to ${\displaystyle c}$: ${\displaystyle \left\{v\,|\,v\geq c\right\}.}$ The probability of the test statistic assuming a value within the non-rejection region, given ${\displaystyle {\text{H}}_{0}}$ is true, is at least ${\displaystyle \left(1-\alpha \right)}$: ${\displaystyle {\text{P}}\left\{V\geq c\,|\,\theta _{0}\right\}\geq 1-\alpha .}$

*(Diagram: rejection region for ${\displaystyle {\text{H}}_{0}}$ to the left of the critical value, non-rejection region to the right.)*

Right-sided test: ${\displaystyle {\text{H}}_{0}:\theta \leq \theta _{0}\quad {\text{ versus }}\quad {\text{H}}_{1}:\theta >\theta _{0}}$

Rejection region for H${\displaystyle _{0}}$: The rejection region for H${\displaystyle _{0}}$ consists of all realizations ${\displaystyle v}$ of the test statistic ${\displaystyle V}$ greater than ${\displaystyle c}$: ${\displaystyle \left\{v\,|\,v>c\right\}.}$ The probability of ${\displaystyle v}$ falling into the rejection region, given ${\displaystyle {\text{H}}_{0}}$ is true, is less than or equal to the given (chosen) significance level ${\displaystyle \alpha }$: ${\displaystyle {\text{P}}\left\{V>c\,|\,\theta _{0}\right\}\leq \alpha .}$

Non-rejection region for H${\displaystyle _{0}}$: The non-rejection region for H${\displaystyle _{0}}$ is the set of test statistic values ${\displaystyle v}$ less than or equal to ${\displaystyle c}$: ${\displaystyle \left\{v\,|\,v\leq c\right\}.}$ The probability of ${\displaystyle v}$ assuming a value from the non-rejection region, given ${\displaystyle {\text{H}}_{0}}$ is true, is greater than or equal to ${\displaystyle \left(1-\alpha \right)}$: ${\displaystyle {\text{P}}\left\{V\leq c\,|\,\theta _{0}\right\}\geq 1-\alpha .}$
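The one-sided critical values follow the same logic, with the entire probability mass ${\displaystyle \alpha }$ placed in a single tail. Again assuming a standard normal ${\displaystyle V}$ and purely illustrative numbers:

```python
from statistics import NormalDist

alpha = 0.05
Z = NormalDist()  # standard normal: assumed distribution of V at theta0

# Left-sided test: rejection region {v | v < c} with P(V < c | theta0) = alpha.
c_left = Z.inv_cdf(alpha)
# Right-sided test: rejection region {v | v > c} with P(V > c | theta0) = alpha.
c_right = Z.inv_cdf(1 - alpha)
print(round(c_left, 3), round(c_right, 3))

v = 1.8  # illustrative realization of the test statistic
print("left-sided:", "reject H0" if v < c_left else "do not reject H0")
print("right-sided:", "reject H0" if v > c_right else "do not reject H0")
```

Note that the same realization ${\displaystyle v}$ leads to different decisions depending on which one-sided hypothesis pair is being tested.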

*(Diagram: non-rejection region for ${\displaystyle {\text{H}}_{0}}$ to the left of the critical value, rejection region to the right.)*

As statistical tests are based on finite samples from the (theoretically infinitely large) population, wrong decisions concerning the parameter values specifying the underlying distribution cannot be ruled out. Depending on the actual value of the test statistic ${\displaystyle v}$, the null hypothesis will either be not-rejected or rejected. We will symbolize this as follows:

• ‘${\displaystyle {\text{H}}_{0}}$’: Test does not-reject the null hypothesis.
• ‘${\displaystyle {\text{H}}_{1}}$’: Test rejects the null hypothesis.

Irrespective of the decision made on the basis of a particular sample, there are two possible ‘true’ states of the world, only one of which can be true at any point in time:

• ${\displaystyle {\text{H}}_{0}}$: The null hypothesis is ‘really’ true.
• ${\displaystyle {\text{H}}_{1}}$: The null hypothesis is wrong, i.e. the alternative hypothesis is true.

Joining the categorizations of the sample-induced test decision and the true situation yields a 2-by-2 table of possible combinations:

| Sample-based decision | ${\displaystyle {\text{H}}_{0}}$ is true | ${\displaystyle {\text{H}}_{1}}$ is true |
|---|---|---|
| ‘${\displaystyle {\text{H}}_{0}}$’ (test does not-reject H${\displaystyle _{0}}$) | Right decision ${\displaystyle '{\text{H}}_{0}^{'}\vert {\text{H}}_{0}}$: ${\displaystyle P\left('{\text{H}}_{0}^{'}\vert {\text{H}}_{0}\right)=1-\alpha }$ | Type II error ${\displaystyle '{\text{H}}_{0}^{'}\vert {\text{H}}_{1}}$: ${\displaystyle P\left('{\text{H}}_{0}^{'}\vert {\text{H}}_{1}\right)=\beta }$ |
| ‘${\displaystyle {\text{H}}_{1}}$’ (test rejects H${\displaystyle _{0}}$) | Type I error ${\displaystyle '{\text{H}}_{1}^{'}\vert {\text{H}}_{0}}$: ${\displaystyle P\left('{\text{H}}_{1}^{'}\vert {\text{H}}_{0}\right)=\alpha }$ | Right decision ${\displaystyle '{\text{H}}_{1}^{'}\vert {\text{H}}_{1}}$: ${\displaystyle P\left('{\text{H}}_{1}^{'}\vert {\text{H}}_{1}\right)=1-\beta }$ |

This table merits further clarification. Let us first examine the nature of the right and wrong decisions to be made given the null hypothesis ${\displaystyle {\text{H}}_{0}}$ is true ‘in reality’. Suppose a test statistic computed using an observed sample deviates substantially from the proposed boundary parameter value ${\displaystyle \theta _{0}}$. It is in fact the purpose of statistical tests to rationally assess such deviations in terms of significance, i.e. to evaluate whether the deviation is substantial in statistical terms. But for the moment assume that the deviation is substantial in that the test statistic realization ${\displaystyle v}$ falls into the rejection region. Following the decision rule created for the test, the null hypothesis will be rejected. Yet our decision doesn’t affect the true data generating process, and consequently we may have made an error, which we expect to make with probability ${\displaystyle \alpha }$ (when our null hypothesis is true). This error is dubbed type I error or ${\displaystyle \alpha }$-error, and its (probabilistic) magnitude is what we control when we set up the test procedure. By fixing (choosing) ${\displaystyle \alpha }$ we set the probability ${\displaystyle P\left('{\text{H}}_{1}^{'}|{\text{H}}_{0}\right)=P({\text{Test rejects null given the null is true}})=\alpha }$ as a parameter—the significance level. Even though we can vary the significance level ${\displaystyle \alpha }$, we cannot completely prevent the occurrence of a type I error (which will occur with probability ${\displaystyle \alpha }$). Setting ${\displaystyle \alpha }$ to zero amounts to never rejecting the null hypothesis, and consequently to never rejecting it when the null hypothesis describes reality correctly.
The probability of making the right decision, given the null hypothesis is true, is computed as ${\displaystyle P\left('{\text{H}}_{0}^{'}|{\text{H}}_{0}\right)=P({\text{Test does not-reject the null given the null is true}})=1-\alpha ,}$ which equals one if we set ${\displaystyle \alpha }$ to zero. As tempting as setting ${\displaystyle \alpha }$ to zero sounds, there is a downside, which emerges when the alternative, rather than the null, hypothesis is true. What are the right and wrong decisions that can be made when the alternative hypothesis states the true parameter range? If the test statistic computed from an observed sample indicates a relatively small deviation from the parameter value ${\displaystyle \theta _{0}}$ proposed in the null hypothesis, the decision rule will induce us to not-reject the null hypothesis ${\displaystyle {\text{H}}_{0}}$. Since we are presently postulating ${\displaystyle {\text{H}}_{1}}$ to be true, we know that this is an error. This outcome ${\displaystyle '{\text{H}}_{0}^{'}|{\text{H}}_{1}}$ (non-rejection of a false null) is commonly known as the type II error or ${\displaystyle \beta }$-error. As in the case of the ${\displaystyle \alpha }$-error, we cannot rule out the ${\displaystyle \beta }$-error either: even though it is ‘unlikely’ that a sample drawn from a population that does not belong to the null hypothesis gives a test statistic value ‘close’ to the null hypothesis value, it is still possible—and this will happen with probability ${\displaystyle P\left('{\text{H}}_{0}^{'}|{\text{H}}_{1}\right)=\beta \left(\theta _{1}\right),}$ given the alternative hypothesis correctly describes reality. Note that ${\displaystyle \beta }$ depends on the true parameter value ${\displaystyle \theta _{1}}$. As this has not been disclosed to us (and never will be), we cannot compute this probability. There is, of course, also the possibility of the decision rule inducing us to make a right decision, i.e.
reject ${\displaystyle {\text{H}}_{0}}$ when the alternative hypothesis is true: ${\displaystyle '{\text{H}}_{1}^{'}|{\text{H}}_{1}}$. The probability of this happening, conditional on the alternative hypothesis being true, is ${\displaystyle P\left('{\text{H}}_{1}^{'}|{\text{H}}_{1}\right)=1-\beta \left(\theta _{1}\right).}$ The probability ${\displaystyle \beta \left(\theta _{1}\right)}$ of making a type II error depends on the given significance level ${\displaystyle \alpha }$. Decreasing ${\displaystyle \alpha }$ for a constant sample size ${\displaystyle n}$ will result in an increased probability of the ${\displaystyle \beta }$-error, and vice versa. This ‘error trade-off’ cannot be overcome for a given sample size, that is, it is not possible to reduce ${\displaystyle \alpha }$ whilst also reducing ${\displaystyle \beta }$. This dilemma is depicted in the two diagrams below.
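The trade-off can also be checked numerically. A sketch for a right-sided ${\displaystyle z}$-test of a mean with known standard deviation (all parameter values below, including the true mean ${\displaystyle \theta _{1}}$, are illustrative assumptions): shrinking ${\displaystyle \alpha }$ pushes the critical value outward and thereby inflates ${\displaystyle \beta \left(\theta _{1}\right)}$.

```python
from statistics import NormalDist
import math

Z = NormalDist()  # standard normal distribution of V at the boundary theta0

# Right-sided z-test of a mean; all numbers are illustrative assumptions.
theta0, theta1, sigma, n = 10.0, 10.3, 1.0, 25
delta = (theta1 - theta0) / (sigma / math.sqrt(n))  # shift of V when theta1 is true

betas = {}
for alpha in (0.10, 0.05, 0.01):
    c = Z.inv_cdf(1 - alpha)         # smaller alpha -> larger critical value ...
    betas[alpha] = Z.cdf(c - delta)  # ... -> larger type II error probability
    print(alpha, round(betas[alpha], 3))
```

The printed ${\displaystyle \beta }$ grows monotonically as ${\displaystyle \alpha }$ shrinks; only a larger sample size can reduce both error probabilities at once.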

*(Diagram: non-rejection region for ${\displaystyle {\text{H}}_{0}}$ to the left of the critical value, rejection region to the right.)*

*(Diagram: non-rejection region for ${\displaystyle {\text{H}}_{0}}$ to the left of the critical value, rejection region to the right.)*

As already mentioned, the probability of making a type II error also depends on the true value of the parameter to be tested. Given a fixed sample size ${\displaystyle n}$ and significance level ${\displaystyle \alpha }$, the distance between ${\displaystyle \theta _{1}}$ and ${\displaystyle \theta _{0}}$ is inversely related to ${\displaystyle \beta \left(\theta _{1}\right)}$: the greater the distance, the smaller the probability of making a type II error when the alternative hypothesis is true. The following two diagrams show this for our normally distributed test statistic ${\displaystyle V}$.

*(Diagram: non-rejection region for ${\displaystyle {\text{H}}_{0}}$ to the left of the critical value, rejection region to the right.)*

*(Diagram: non-rejection region for ${\displaystyle {\text{H}}_{0}}$ to the left of the critical value, rejection region to the right.)*

Statistical inference is a means of inferring probability distributions (or their characteristics, e.g. their parameters) from samples of limited size, the limitation arising for practical or economic reasons. As these subsets of the population don’t convey the complete information about the distribution of the variable under consideration, making errors is inevitable. All we try to achieve is to quantify and control these errors in the sense that, in a repeated sampling context, they occur with a certain probability. As already pointed out: rejecting a hypothesis doesn’t prove it wrong—the probability of the hypothesis actually being right (i.e. of making a type I error) merely doesn’t exceed a small threshold that is set by the researcher. Not-rejecting the null hypothesis exposes the researcher to the risk of making a type II error, which occurs with a probability that cannot be quantified statistically. As we have seen, depending on the true parameter, the corresponding probability ${\displaystyle \beta }$ can be ‘significantly’ greater than the controlled ${\displaystyle \alpha }$-probability. For this reason, the scientific conjecture to be tested statistically is usually chosen as the null, rather than the alternative, hypothesis, so that the probability of rejecting it in error (a type I error) can be controlled. The possibility of a ‘reject H${\displaystyle _{0}}$’ decision being wrong can then be quantified to be no more than ${\displaystyle \alpha }$. The same logic applies if the decision object is of high ethical or moral importance, e.g. human health when it comes to testing a new drug, or the presumption of innocence until guilt is proven in the case of suspected criminals.

## Power of a test

The probability of rejecting the null hypothesis as a function of all possible parameter values (that is, those ${\displaystyle \theta }$ of the null and alternative hypothesis) is called the power of a test, denoted by ${\displaystyle P\left(\theta \right)}$: ${\displaystyle P\left(\theta \right)=P\left(V{\text{ is element of the rejection region for H}}_{0}\,|\,\theta \right)=P\left('{\text{H}}_{1}^{'}|\theta \right).}$ If the true parameter ${\displaystyle \theta }$ is an element of the subset of the parameter space stated in the alternative hypothesis, a right decision has been made: ${\displaystyle \left('{\text{H}}_{1}^{'}|{\text{H}}_{1}\right)}$. Hence, for all true parameter values ${\displaystyle \theta }$ that agree with the alternative hypothesis, the power measures the probability of correctly rejecting the null hypothesis (equivalently, of correctly deciding in favour of the alternative hypothesis): ${\displaystyle P\left(\theta \right)=P\left('{\text{H}}_{1}^{'}|{\text{H}}_{1}\right)=1-\beta \left(\theta \right)\,;\quad \forall \theta \in \theta _{1},}$ where ${\displaystyle \theta _{1}}$ is the subset of the parameter space (the parameter space being the set of all values that ${\displaystyle \theta }$ can assume) specified by the alternative hypothesis. If the true parameter lies in ${\displaystyle \theta _{0}}$, the set of values under the null hypothesis, the power returns the probability of making a wrong decision, i.e. the probability of the situation ${\displaystyle \left('{\text{H}}_{1}^{'}|{\text{H}}_{0}\right)}$. This is a familiar quantity, namely the probability of making a type I (or ${\displaystyle \alpha }$) error: ${\displaystyle P\left(\theta \right)=P\left('{\text{H}}_{1}^{'}|{\text{H}}_{0}\right)\leq \alpha \,;\quad \forall \theta \in \theta _{0},}$ where ${\displaystyle \theta _{0}}$ is the subset of the parameter space specified by the null hypothesis.
The power measures the reliability of a test procedure in correctly rejecting a false null hypothesis.
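For a right-sided ${\displaystyle z}$-test of a mean with known standard deviation (illustrative parameter values throughout), the power function can be evaluated in closed form:

```python
from statistics import NormalDist
import math

Z = NormalDist()

# Right-sided z-test of a mean with known sigma (illustrative assumptions):
# H0: theta <= theta0  versus  H1: theta > theta0.
theta0, sigma, n, alpha = 10.0, 1.0, 25, 0.05
c = Z.inv_cdf(1 - alpha)  # critical value of the standard normal test statistic

def power(theta):
    """P('H1' | theta): probability that the test rejects H0 when theta is true."""
    delta = (theta - theta0) / (sigma / math.sqrt(n))
    return 1 - Z.cdf(c - delta)

# At and below theta0 the power stays at or below alpha; on the alternative
# side it rises towards 1 the further theta moves away from theta0.
for theta in (9.8, 10.0, 10.2, 10.5, 11.0):
    print(theta, round(power(theta), 3))
```

At the boundary value ${\displaystyle \theta _{0}}$ the power equals the significance level ${\displaystyle \alpha }$, as the definitions above require.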

## OC-curve

The operating characteristic (OC-curve) is equal to ${\displaystyle 1-P\left(\theta \right)}$; it gives the probability of not rejecting the null hypothesis as a function of all possible ${\displaystyle \theta }$: ${\displaystyle 1-P\left(\theta \right)=P\left(V{\text{ is element of the non-rejection region for H}}_{0}|\theta \right)=P\left('{\text{H}}_{0}^{'}|\theta \right).}$ If the true parameter ${\displaystyle \theta }$ is a member of the subset of the parameter space associated with the alternative hypothesis, the operating characteristic gives the probability of making the wrong decision ${\displaystyle \left('{\text{H}}_{0}^{'}|{\text{H}}_{1}\right)}$, that is, the probability of making a type II error: ${\displaystyle 1-P\left(\theta \right)=P\left('{\text{H}}_{0}^{'}|{\text{H}}_{1}\right)=\beta \left(\theta \right)\,;\quad \forall \theta \in \theta _{1},}$ where ${\displaystyle \theta _{1}}$ is the subset of parameters specified by the alternative hypothesis. If, on the other hand, the true parameter is in the subset of values specified by the null hypothesis, the operating characteristic measures the probability of the situation ${\displaystyle \left('{\text{H}}_{0}^{'}|{\text{H}}_{0}\right)}$, i.e. of making the right decision in not rejecting the null hypothesis: ${\displaystyle 1-P\left(\theta \right)=P\left('{\text{H}}_{0}^{'}|{\text{H}}_{0}\right)\geq 1-\alpha \,;\quad \forall \theta \in \theta _{0},}$ where ${\displaystyle \theta _{0}}$ is the parameter set of the null hypothesis. The shape of the operating characteristic curve (and similarly the power curve) depends on the:

• test statistic and its distribution, which must be determined not only for the boundary parameter value delineated by the null hypothesis ${\displaystyle \theta _{0}}$, but for all admissible parameter values;
• given significance level ${\displaystyle \alpha }$; and
• sample size ${\displaystyle n}$.
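To make these three determinants concrete, the power and OC curves of the right-sided z-test of a mean can be sketched with Python's standard library. This is a minimal sketch under illustrative assumptions: a known population standard deviation `sigma`, and the default values for `theta0`, `n` and `alpha` are invented for the example.

```python
from math import sqrt
from statistics import NormalDist

def power(theta, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    """P(reject H0 | theta) for the right-sided z-test H0: theta <= theta0,
    based on the mean of n observations with known standard deviation sigma."""
    c = NormalDist().inv_cdf(1 - alpha)          # critical value of the standard normal
    shift = (theta - theta0) * sqrt(n) / sigma   # shift of the test statistic's mean under theta
    return 1 - NormalDist().cdf(c - shift)

def oc(theta, **kwargs):
    """Operating characteristic: P(do not reject H0 | theta) = 1 - power."""
    return 1 - power(theta, **kwargs)
```

At the boundary value `theta = theta0` the power equals the significance level `alpha`, and it increases with the distance of `theta` from `theta0`, with the sample size `n`, and with `alpha`, mirroring the three determinants listed above.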

## A Decision-Theoretical View on Statistical Hypothesis Testing

In the absence of a consistent rationale for the conduct of empirical research (and the economic trade-offs involved in deciding what proportion of resources to allocate to different competing lines of research/thought), the scientific community has more or less explicitly agreed that certain significance levels (most notably ${\displaystyle 0.05}$ and ${\displaystyle 0.01}$) are adequate. Of course, the use of these varies with the degree to which the measured variables or impacts can be accurately quantified. If making errors can be handled within a cost-benefit decision-making approach, an approximate collective preference order can be assumed that strikes a balance between the long-term scientific success of a society, economic success and short-term costs. As it is impossible to predict the future value of undertaking a particular scientific effort, the economics of science as an allocation tool itself has to deal with uncertainty in the level of knowledge generated in each feasible research environment. For these reasons, significance levels chosen for empirical research not closely linked to a specific application will always be conventions based on some human perception of how frequent ‘infrequent’ should be. But even at the more applied level, significance levels aren’t usually the result of a systematic analysis of the relevant preference system and the experimental conditions. Consider some crucial problems of public choice. In deciding how many ambulances to fund for a particular area, a community actively caps the number of patients that can be catered for at the same time. If you wanted to test whether three ambulances are sufficient, i.e. that no more than three citizens become critically ill at any one time, where would you fix the significance level?
Setting it to zero would imply buying as many ambulances and employing as many staff as there are citizens, since one cannot rule out the possibility of an epidemic in which all citizens become ill at the same time. Clearly, this is not feasible in any society. No matter which significance level the decision-maker chooses, she will always have to accept the possibility of (rather) unlikely events causing unfortunate outcomes for society (i.e. deaths in the community in the case of the choice of how many ambulances). As noted above, the choice of a suitable significance level is more or less arbitrary, because at least one important component of the specification of the decision problem cannot be observed or formalized: at the general level of fundamental research, the future benefits are unknown, or they cannot be compared to today’s resource spending as their pecuniary value cannot be determined. At the more applied level, research into health or other issues related to the well-being of humans cannot be rationalized because of the intangibility of the ‘commodities’ involved (i.e. health). But there are certain applications that can be reduced to cost-benefit analysis. Carrying out sample-based quality control in a manufacturing company, for example, requires the inspector to accurately quantify the impact of given choices on the proportion of defective output. She can estimate the expected number of returned items and the resulting currency value of related lost sales etc., as market prices (values) already exist for such items. The preference order applied could, for example, be the appetite of shareholders for the particular risk-return profile implied by the choice of alternative work practices.
Let’s assume you want to carry out a right-sided statistical test about a parameter ${\displaystyle \theta }$: ${\displaystyle {\text{H}}_{0}:\theta \leq \theta _{0}}$ and ${\displaystyle {\text{H}}_{1}:\theta >\theta _{0}.}$ For simplicity, we also assume that the test statistic ${\displaystyle V}$ follows a standard normal distribution (that is, a normal distribution with mean ${\displaystyle 0}$ and variance ${\displaystyle 1}$). The rejection region for H${\displaystyle _{0}}$ is the set of all test statistic realizations ${\displaystyle v}$ greater than the critical value ${\displaystyle c}$: ${\displaystyle \left\{v\,|\,v>c\right\}}$. The probability of the test statistic assuming a value within the rejection region equals the given (chosen) significance level, ${\displaystyle \alpha =P\left(V>c|\theta _{0}\right)}$, and is given by the green area in the diagram below.
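As a minimal sketch of this setup (standard-normal test statistic, as assumed above), the critical value and the test decision can be computed with Python's standard library:

```python
from statistics import NormalDist

def right_sided_test(v, alpha=0.05):
    """Reject H0 iff the realized test statistic v falls into the
    rejection region {v | v > c}, where c satisfies P(V > c | theta0) = alpha."""
    c = NormalDist().inv_cdf(1 - alpha)  # critical value of the standard normal
    return v > c, c
```

For `alpha = 0.05` the critical value is `c ≈ 1.645`, so a realized value such as `v = 2.1` leads to rejection of H${\displaystyle _{0}}$, while `v = 1.0` does not.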

Non-rejection region ${\displaystyle {\text{H}}_{0}}$ ${\displaystyle {\vert }}$ Rejection region ${\displaystyle {\text{H}}_{0}}$

The test decision is made by comparing the realized test statistic value with the critical value: if the realized test statistic value ${\displaystyle v}$, computed from a particular sample of size ${\displaystyle n}$, is greater than the critical value ${\displaystyle c}$, then the null hypothesis is rejected. The critical value splits the distribution of all possible test statistic values into two sets with probabilities ${\displaystyle \alpha }$ and ${\displaystyle 1-\alpha }$. Popular statistical software packages (e.g. SAS, SPSS, Statistica, Systat, XploRe) not only compute the test statistic value ${\displaystyle v}$, but additionally return a so-called ${\displaystyle p}$-value. This is the theoretical probability that ${\displaystyle V}$ assumes a value greater than that computed from the given sample: ${\displaystyle P\left(V>v|\theta _{0}\right)}$. The ${\displaystyle p}$-value is sometimes called the significance or one-tailed P, and we will denote it by ${\displaystyle p=P\left(V>v|\theta _{0}\right)}$. The crucial assumption underlying its computation is that the distribution of ${\displaystyle V}$ is the one that follows from assuming that ${\displaystyle \theta _{0}}$ is the true parameter value. In the next diagram, ${\displaystyle p}$ is depicted by the blue area.

As the ${\displaystyle p}$-value represents the minimum significance level for not rejecting the null hypothesis, the user doesn’t need to look up the critical value corresponding to the given significance level in a table. She merely needs to compare ${\displaystyle \alpha }$ with the ${\displaystyle p}$-value as follows: If the parameter estimate is ‘substantially’ larger than the hypothetical parameter value ${\displaystyle \theta _{0}}$, the ${\displaystyle p}$-value will be relatively small. Recall that the null hypothesis is one-sided, comprising values less than or equal to ${\displaystyle \theta _{0}}$; consequently, estimates that are greater than ${\displaystyle \theta _{0}}$ are less easily reconciled with the null hypothesis than those within the postulated parameter range. The ‘farther’ away the estimate lies from the null hypothesis, the less probable it is to have been generated by sampling from a population distribution with ${\displaystyle \theta }$ less than or equal to ${\displaystyle \theta _{0}}$. The ${\displaystyle p}$-value is the probability of observing a test statistic value at least as large as ${\displaystyle v}$ given the true parameter ${\displaystyle \theta _{0}}$. In our example, this becomes decreasingly likely with a rising parameter estimate, and a sufficiently large parameter estimate will induce us to infer that ${\displaystyle \theta _{0}}$ and ${\displaystyle \theta }$ differ significantly. Given a parameter estimate, the ${\displaystyle p}$-value tells us how likely the observed distance to ${\displaystyle \theta _{0}}$ is to occur. When this probability is small, the risk of being wrong in rejecting the null hypothesis is small: we conclude that the null hypothesis is false rather than that it is true and a highly unlikely outcome (under the null) has occurred.
Let’s translate these considerations into a decision rule: A ${\displaystyle p}$-value smaller than ${\displaystyle \alpha }$ reflects the test statistic value ${\displaystyle v}$ falling into the rejection region for H${\displaystyle _{0}}$ at the given significance level ${\displaystyle \alpha }$. Thus, the null hypothesis is rejected. This is true for both left- and right-sided tests, as we did not specify how ${\displaystyle p}$ was computed. In our example, it’s ${\displaystyle p=P\left(V>v|\theta _{0}\right)}$, but for a left-sided test it would be ${\displaystyle p=P\left(V<v|\theta _{0}\right)}$. The following diagram shows the right-sided test case.
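The decision rule for both one-sided variants can be sketched as follows (again assuming a standard-normal test statistic ${\displaystyle V}$, as in the running example):

```python
from statistics import NormalDist

def p_value(v, side="right"):
    # Right-sided: p = P(V > v | theta0); left-sided: p = P(V < v | theta0)
    cdf = NormalDist().cdf(v)
    return 1 - cdf if side == "right" else cdf

def reject_h0(v, alpha=0.05, side="right"):
    # Decision rule: reject H0 exactly when p < alpha
    return p_value(v, side) < alpha
```

A realized value `v = 2.1` gives `p ≈ 0.018` for the right-sided test, so H${\displaystyle _{0}}$ is rejected at ${\displaystyle \alpha =0.05}$; by symmetry, `v = -2.1` leads to rejection in the left-sided test.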

Non-rejection region ${\displaystyle {\text{H}}_{0}}$ ${\displaystyle {\vert }}$ Rejection region ${\displaystyle {\text{H}}_{0}}$

If the parameter estimate value is close to the hypothetical parameter value ${\displaystyle \theta _{0}}$, then the validity of the null hypothesis appears ‘relatively’ plausible. The probability of the test statistic assuming a value greater than ${\displaystyle v}$ is then relatively high. In other words, ${\displaystyle \theta _{0}}$ and the estimate are close enough to interpret their distance as the result of the noise created by the sampling process. Consequently, ${\displaystyle {\text{H}}_{0}}$ won’t be rejected. Hence the following decision rule: For ${\displaystyle p>\alpha }$ the test statistic realization ${\displaystyle v}$ is an element of the non-rejection region for H${\displaystyle _{0}}$, and the null hypothesis isn’t rejected. Once again this rule holds for all one- and two-sided tests, with ${\displaystyle p}$ suitably computed.

Non-rejection region ${\displaystyle {\text{H}}_{0}}$ ${\displaystyle {\vert }}$ Rejection region ${\displaystyle {\text{H}}_{0}}$

Statistical tests are procedures for the analysis of assumptions about unknown probability distributions or their characteristics. The ‘behavior’ of random variables in a population is inferred from a sample of limited size, constrained by practical or economic considerations. This inductive character makes them an important pillar of inferential statistics, the second branch being statistical estimation procedures. We will now illustrate the theory introduced in this chapter with some practical examples. A large software company is subject to a court trial, with media coverage bringing it to the forefront of public debate. The management wants to assess the impact of the legal action on revenues. Average monthly sales before the lawsuit are known, serving as the hypothetical value to be tested. The average of a randomly selected sample of monthly revenues from the time after the trial first hit the news is calculated. The directors are particularly interested in whether the revenues have fallen, and ask their in-house statistician to test the hypothesis that the average monthly revenue has fallen since the beginning of the lawsuit. Hence, the monthly revenue is treated as a random variable, and the test is based on its mean. An environmental organization claims that the proportion of citizens opposed to nuclear power is ${\displaystyle 60\%}$. The operators of the nuclear power plants dismiss this figure as overstated and commission a statistical analysis based on a random sample. The variable ‘attitude to nuclear energy’ is measured by only two outcomes, e.g. ‘supportive’ and ‘opposed’. Hence, the statistician tests the mean of the population distribution of a dichotomous variable: Can the hypothetical value of ${\displaystyle 0.6}$ be reconciled with the sample?
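The nuclear-power example can be put into numbers. The operators’ claim that ${\displaystyle 60\%}$ is overstated corresponds to a left-sided test of H${\displaystyle _{0}}$: ${\displaystyle p\geq 0.6}$ against H${\displaystyle _{1}}$: ${\displaystyle p<0.6}$; the sample figures below (1000 respondents, 550 opposed) are invented purely for illustration:

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical survey: 550 of n = 1000 respondents opposed to nuclear power
n, opposed, p0 = 1000, 550, 0.6
p_hat = opposed / n

# Test statistic, approximately standard normal under H0 for large n
v = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Left-sided test: p = P(V < v | p0); reject H0: p >= 0.6 when p_val < alpha
p_val = NormalDist().cdf(v)
reject = p_val < 0.05
```

With these invented figures the test statistic is strongly negative and the ${\displaystyle p}$-value falls well below ${\displaystyle 0.05}$, so the hypothetical proportion of ${\displaystyle 0.6}$ would be rejected.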
In both examples an unknown parameter of the probability distribution in the population is tested. The test procedures employed are known as parametric tests. Furthermore, as they are based on one single sample, they are called one-sample tests. Two producers of mobile phones launch separate advertising campaigns claiming to build the phones with the longest stand-by time. Clearly, one of them must be wrong, given that stand-by time is measured sufficiently precisely that the average stand-by times don’t coincide, and that stand-by time varies across individual phones, i.e. is a random variable. The managers of a consumer organization are concerned and want to assess whether the cellular phones manufactured by the two companies differ significantly with respect to stand-by time. The statistical investigation has to be based on an average to account for the fluctuations of stand-by time across output. Samples are drawn independently from both producers’ output in order to compare the average duration as measured by the sample means. An inductive statement is sought as to whether or not the mean stand-by times in the overall outputs are (significantly) different. The test procedure applied is a parametric test, as one tests for the equality of the two means. This can only be done on the basis of two samples: this is an example of a two-sample test procedure. Someone claims that a specific die is what statisticians call a fair die: the probability of any outcome is equal. The hypothesis to be tested is that the outcomes of the die-rolling process have a discrete uniform distribution. This test doesn’t refer to a parameter of the underlying population distribution, i.e. it doesn’t take a particular distribution class as given. Consequently, it is classified as a nonparametric or distribution-free test.
This particular type belongs to the class of goodness-of-fit tests, as one wants to verify how well a given sample can be explained as having been generated by a particular, completely specified, theoretical probability distribution.
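The die example can be made concrete with Pearson's chi-square goodness-of-fit statistic. The roll counts below are invented for illustration, and ${\displaystyle 11.070}$ is the ${\displaystyle 95\%}$ quantile of the chi-square distribution with ${\displaystyle 5}$ degrees of freedom:

```python
def chi_square_statistic(observed, expected):
    # Pearson statistic: sum of (O - E)^2 / E over all categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 600 hypothetical rolls; a fair die predicts 100 occurrences per face
observed = [95, 110, 92, 108, 99, 96]
expected = [100] * 6

chi2 = chi_square_statistic(observed, expected)
reject = chi2 > 11.070  # critical value: alpha = 0.05, df = 6 - 1 = 5
```

Here `chi2 = 2.7`, well below the critical value, so the fairness hypothesis would not be rejected for this invented sample.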