# Distribution of the Sample Variance


Consider a population variable ${\displaystyle X}$ with ${\displaystyle E(X)=\mu }$ and ${\displaystyle Var(X)=\sigma ^{2}}$. From this population a random sample of size ${\displaystyle n}$ is drawn. The sample variance is based on the sum of squared deviations of the random variables ${\displaystyle X_{i},i=1,\dots ,n}$ from the mean. We have proposed two estimators for the variance, the ${\displaystyle MSD}$ and ${\displaystyle s^{2}}$. Since ${\displaystyle E(X)=\mu }$ is usually unknown and estimated by the sample mean ${\displaystyle {\bar {x}}}$, the sample variance is calculated as ${\displaystyle s^{2}={\frac {1}{n-1}}\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})^{2}\,.}$ Alternatively, the sample variance may also be calculated as ${\displaystyle MSD={\frac {1}{n}}\sum \limits _{i=1}^{n}(x_{i}-{\bar {x}})^{2}\,.}$ See the entry under "Information" for more on this version of the sample variance.

The derivation of the distribution of the sample variance ${\displaystyle s^{2}}$ will be given for the case of a normally distributed population, i.e. ${\displaystyle X\sim N(\mu ,\sigma ^{2})}$. Under this assumption, the random variables ${\displaystyle X_{i},i=1,\dots ,n}$ are independently and identically normally distributed with ${\displaystyle E(X_{i})=\mu }$ and ${\displaystyle Var(X_{i})=\sigma ^{2}}$: ${\displaystyle X_{i}\sim N(\mu ,\sigma ^{2}),\quad i=1,\dots ,n\,.}$ Moreover, the sample mean ${\displaystyle {\bar {X}}}$ is also normally distributed with ${\displaystyle E({\bar {X}})=\mu }$ and ${\displaystyle Var({\bar {X}})=\sigma ^{2}/n}$: ${\displaystyle {\bar {X}}\sim N(\mu ,\sigma ^{2}/n)\,.}$
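The two versions of the sample variance are easy to compare numerically. The following sketch (Python, assuming the NumPy library is available; the sample values are purely illustrative) computes both from the same data, where the only difference is the divisor:

```python
import numpy as np

# Illustrative sample: n = 25 draws from a normal population (mu = 10, sigma = 2)
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=25)

n = x.size
xbar = x.mean()
ssq = np.sum((x - xbar) ** 2)   # sum of squared deviations from the sample mean

s2 = ssq / (n - 1)   # sample variance s^2 (divisor n - 1)
msd = ssq / n        # MSD (divisor n)
```

NumPy exposes the same choice through the `ddof` argument of `np.var`: `ddof=1` gives ${\displaystyle s^{2}}$, while the default `ddof=0` gives the ${\displaystyle MSD}$.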

### Distribution of the sample variance ${\displaystyle s^{2}}$

Consider for the moment the random variable ${\displaystyle \sum \limits _{i=1}^{n}\left({\frac {X_{i}-\mu }{\sigma }}\right)^{2}.}$ It is the sum of squares of ${\displaystyle n}$ independent standard normal random variables, and hence has a chi-square distribution with ${\displaystyle n}$ degrees of freedom, i.e., ${\displaystyle \chi _{n}^{2}.}$ Now consider ${\displaystyle {\frac {(n-1)s^{2}}{\sigma ^{2}}}={\frac {1}{\sigma ^{2}}}\sum \limits _{i=1}^{n}(X_{i}-{\bar {X}})^{2}=\sum \limits _{i=1}^{n}\left({\frac {X_{i}-{\bar {X}}}{\sigma }}\right)^{2}}$ and note the similarity. Because ${\displaystyle {\bar {X}}}$ is now used as an estimator of ${\displaystyle \mu }$, one degree of freedom is lost, and it can be shown that ${\displaystyle (n-1)s^{2}/\sigma ^{2}}$ has a chi-square distribution with ${\displaystyle n-1}$ degrees of freedom. The distribution of ${\displaystyle s^{2}}$ is a simple rescaling of ${\displaystyle (n-1)s^{2}/\sigma ^{2}}$, so we may make probability statements about ${\displaystyle s^{2}}$. Using the properties of the chi-square distribution, the expected value and variance of ${\displaystyle s^{2}}$ are: ${\displaystyle E(s^{2})=\sigma ^{2},\qquad Var(s^{2})=2\sigma ^{4}/(n-1)\,.}$
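These results can be checked by simulation. The sketch below (Python with NumPy; the population parameters and sample size are arbitrary illustration values) draws many samples, computes ${\displaystyle s^{2}}$ for each, and compares the empirical moments with ${\displaystyle E(s^{2})=\sigma ^{2}}$ and ${\displaystyle Var(s^{2})=2\sigma ^{4}/(n-1)}$:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma2, n, reps = 5.0, 4.0, 10, 200_000   # arbitrary illustration values

# reps independent samples of size n from N(mu, sigma^2), one s^2 per sample
samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = samples.var(axis=1, ddof=1)

emp_mean = s2.mean()   # should be close to sigma^2 = 4
emp_var = s2.var()     # should be close to 2 * sigma2**2 / (n - 1) ~ 3.56

# (n-1) s^2 / sigma^2 should behave like chi-square with n - 1 = 9 df:
# mean ~ 9, variance ~ 18
y = (n - 1) * s2 / sigma2
```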

### Probability statements about ${\displaystyle s^{2}}$

For known variance ${\displaystyle \sigma ^{2}}$ and a normally distributed population, one can calculate the probability that the sample variance ${\displaystyle s^{2}}$ will take on values in a central interval with pre-specified probability ${\displaystyle 1-\alpha }$: ${\displaystyle P\left(v_{1}\leq {\frac {(n-1)s^{2}}{\sigma ^{2}}}\leq v_{2}\right)=1-\alpha \,.}$ Furthermore, if we want to put equal probability mass in the tails, we impose: ${\displaystyle P\left({\frac {(n-1)s^{2}}{\sigma ^{2}}}\leq v_{1}\right)={\frac {\alpha }{2}}\,;\qquad P\left({\frac {(n-1)s^{2}}{\sigma ^{2}}}>v_{2}\right)={\frac {\alpha }{2}}\,.}$ With ${\displaystyle n-1}$ degrees of freedom, the interval boundaries can be obtained from tables of the chi-square distribution: ${\displaystyle v_{1}=\chi _{{\frac {\alpha }{2}};n-1}^{2}\,;\quad v_{2}=\chi _{1-{\frac {\alpha }{2}};n-1}^{2}\,.}$ Thus, ${\displaystyle P\left(\chi _{{\frac {\alpha }{2}};n-1}^{2}\leq {\frac {(n-1)s^{2}}{\sigma ^{2}}}\leq \chi _{1-{\frac {\alpha }{2}};n-1}^{2}\right)=1-\alpha \,.}$ Rearranging yields the probability statement: ${\displaystyle P\left({\frac {\sigma ^{2}\chi _{{\frac {\alpha }{2}};n-1}^{2}}{n-1}}\leq s^{2}\leq {\frac {\sigma ^{2}\chi _{1-{\frac {\alpha }{2}};n-1}^{2}}{n-1}}\right)=1-\alpha \,.}$
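In place of tables, the boundaries ${\displaystyle v_{1}}$ and ${\displaystyle v_{2}}$ can come from the chi-square quantile function. A minimal sketch using SciPy (${\displaystyle \alpha }$ and ${\displaystyle n}$ are chosen arbitrarily for illustration):

```python
from scipy.stats import chi2

alpha, n = 0.10, 20   # illustration values
df = n - 1

v1 = chi2.ppf(alpha / 2, df)       # chi^2_{alpha/2; n-1}
v2 = chi2.ppf(1 - alpha / 2, df)   # chi^2_{1-alpha/2; n-1}

# By construction the central interval carries probability 1 - alpha
prob = chi2.cdf(v2, df) - chi2.cdf(v1, df)
```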

### Example for the distribution of the sample variance

To measure the variation in time needed for a certain task, the variance is often utilized. Let the time a worker needs to complete a certain task be the random variable ${\displaystyle X}$. Suppose ${\displaystyle X}$ is normally distributed with ${\displaystyle E(X)=\mu }$ and ${\displaystyle Var(X)=\sigma ^{2}}$. A random sample of size ${\displaystyle n}$ is drawn with replacement.  The random variables ${\displaystyle X_{i}}$ (${\displaystyle i=1,\dots ,n}$) are therefore independent and identically normally distributed.

Problem 1:

A random sample of size ${\displaystyle n=15}$ is drawn. What is the probability that the sample variance ${\displaystyle s^{2}}$ will take on values in the interval ${\displaystyle [0.5\cdot \sigma ^{2};1.5\cdot \sigma ^{2}]}$? That is, the probability to be calculated is ${\displaystyle P(0.5\sigma ^{2}\leq s^{2}\leq 1.5\sigma ^{2})}$. To solve the problem, each side is multiplied by ${\displaystyle (n-1)/\sigma ^{2}}$: {\displaystyle {\begin{aligned}P(0.5\sigma ^{2}\leq s^{2}\leq 1.5\sigma ^{2})&=P\left({\frac {n-1}{\sigma ^{2}}}0.5\sigma ^{2}\leq {\frac {n-1}{\sigma ^{2}}}s^{2}\leq {\frac {n-1}{\sigma ^{2}}}1.5\sigma ^{2}\right)\\&=P\left((n-1)\cdot 0.5\leq {\frac {n-1}{\sigma ^{2}}}s^{2}\leq (n-1)\cdot 1.5\right)\end{aligned}}} Since ${\displaystyle n-1=14}$, it follows that: ${\displaystyle P(0.5\sigma ^{2}\leq s^{2}\leq 1.5\sigma ^{2})=P\left(7\leq {\frac {n-1}{\sigma ^{2}}}s^{2}\leq 21\right)\,.}$ The probability that ${\displaystyle s^{2}}$ will take on values between ${\displaystyle 0.5\cdot \sigma ^{2}}$ and ${\displaystyle 1.5\cdot \sigma ^{2}}$ is identical to the probability that the transformed random variable ${\displaystyle (n-1)s^{2}/\sigma ^{2}}$ will take values between 7 and 21. The random variable ${\displaystyle (n-1)s^{2}/\sigma ^{2}}$ is chi-square distributed with ${\displaystyle n-1=14}$ degrees of freedom. The probability can be found by using a table of the chi-square distribution: {\displaystyle {\begin{aligned}P(0.5\sigma ^{2}\leq s^{2}\leq 1.5\sigma ^{2})&=P\left(7\leq {\frac {n-1}{\sigma ^{2}}}s^{2}\leq 21\right)\\&=P\left({\frac {n-1}{\sigma ^{2}}}s^{2}\leq 21\right)-P\left({\frac {n-1}{\sigma ^{2}}}s^{2}\leq 7\right)\\&=0.8984-0.0653=0.8331\end{aligned}}} The probability that ${\displaystyle s^{2}}$ will lie in the interval ${\displaystyle [0.5\cdot \sigma ^{2};1.5\cdot \sigma ^{2}]}$ is equal to 0.8331.
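The table lookup can be reproduced with the chi-square CDF; a short sketch assuming SciPy is available:

```python
from scipy.stats import chi2

n = 15
df = n - 1   # 14 degrees of freedom

# P(0.5 sigma^2 <= s^2 <= 1.5 sigma^2) = P(7 <= Y <= 21) with Y ~ chi^2_14
prob = chi2.cdf(21, df) - chi2.cdf(7, df)   # ~ 0.8984 - 0.0653 = 0.8331
```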
The following graph shows the density function of the chi-square distribution with 14 degrees of freedom, where the symbol ${\displaystyle Y}$ is a shorthand for ${\displaystyle (n-1)S^{2}/\sigma ^{2}}$.

Problem 2:

The goal is to determine a central interval of variation for the sample variance ${\displaystyle s^{2}}$ with pre-specified probability ${\displaystyle 1-\alpha =0.95}$. We assume the same population as in Problem 1 and use a random sample of size ${\displaystyle n=30}$. Since ${\displaystyle P\left(v_{1}\leq {\frac {(n-1)s^{2}}{\sigma ^{2}}}\leq v_{2}\right)=0.95}$ and we again put equal probability mass in the tails: ${\displaystyle P\left({\frac {(n-1)s^{2}}{\sigma ^{2}}}\leq v_{1}\right)=0.025\ ;\qquad P\left({\frac {(n-1)s^{2}}{\sigma ^{2}}}\leq v_{2}\right)=0.975\,.}$ Using tables for the chi-square distribution with ${\displaystyle 29}$ degrees of freedom we obtain ${\displaystyle v_{1}=16.05}$ and ${\displaystyle v_{2}=45.72}$. Thus, ${\displaystyle P\left(16.05\leq {\frac {(n-1)s^{2}}{\sigma ^{2}}}\leq 45.72\right)=0.95\,.}$ With probability 0.95, the transformed random variable ${\displaystyle (n-1)s^{2}/\sigma ^{2}}$ takes values in the interval ${\displaystyle [16.05;45.72]}$. Rearranging gives the interval for ${\displaystyle s^{2}}$: ${\displaystyle P\left({\frac {16.05\,\sigma ^{2}}{n-1}}\leq s^{2}\leq {\frac {45.72\,\sigma ^{2}}{n-1}}\right)=P(0.5534\,\sigma ^{2}\leq s^{2}\leq 1.5766\,\sigma ^{2})=0.95\,.}$ With probability 0.95, the sample variance ${\displaystyle s^{2}}$ takes values in the interval ${\displaystyle [0.5534\sigma ^{2};1.5766\sigma ^{2}]}$. The exact numerical boundaries of the interval can be determined only if the population variance ${\displaystyle \sigma ^{2}}$ of the variable ${\displaystyle X}$ is known.
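The quantiles and the resulting multipliers of ${\displaystyle \sigma ^{2}}$ can be reproduced with SciPy's chi-square quantile function:

```python
from scipy.stats import chi2

n, alpha = 30, 0.05
df = n - 1   # 29 degrees of freedom

v1 = chi2.ppf(alpha / 2, df)       # ~ 16.05
v2 = chi2.ppf(1 - alpha / 2, df)   # ~ 45.72

lower = v1 / df   # multiplier of sigma^2, ~ 0.5534
upper = v2 / df   # ~ 1.5766
```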

(1) ${\displaystyle \mu }$ is known

Consider the simplifying assumption that ${\displaystyle \mu }$ is known and define the estimator ${\displaystyle s^{\ast 2}}$ as follows: ${\displaystyle s^{\ast 2}={\frac {1}{n}}\sum \limits _{i=1}^{n}(X_{i}-\mu )^{2}\,.}$ Its expectation is {\displaystyle {\begin{aligned}E(s^{\ast 2})&=E\left[{\frac {1}{n}}\sum \limits _{i=1}^{n}(X_{i}-\mu )^{2}\right]={\frac {1}{n}}E\left[\sum \limits _{i=1}^{n}(X_{i}-\mu )^{2}\right]\\&={\frac {1}{n}}\sum \limits _{i=1}^{n}E[(X_{i}-\mu )^{2}]\ ={\frac {1}{n}}\sum \limits _{i=1}^{n}\sigma ^{2}={\frac {1}{n}}n\sigma ^{2}\\&=\sigma ^{2}\end{aligned}}} Note that the above argument does not assume a distribution for the ${\displaystyle X_{i}}$; it is only assumed that they are i.i.d. with common variance ${\displaystyle Var(X_{i})=E[(X_{i}-\mu )^{2}]=\sigma ^{2}}$.

To derive the variance of ${\displaystyle s^{\ast 2}}$, we now assume that the ${\displaystyle X_{i}}$ are i.i.d. ${\displaystyle N(\mu ,\sigma ^{2})}$. Recall that a chi-square random variable with ${\displaystyle n}$ degrees of freedom has mean ${\displaystyle n}$ and variance ${\displaystyle 2n}$. Since ${\displaystyle ns^{\ast 2}/\sigma ^{2}}$ has a chi-square distribution with ${\displaystyle n}$ degrees of freedom, it follows that: ${\displaystyle Var\left({\frac {ns^{\ast 2}}{\sigma ^{2}}}\right)={\frac {n^{2}}{\sigma ^{4}}}Var(s^{\ast 2})=2n}$ and therefore ${\displaystyle Var(s^{\ast 2})={\frac {2\sigma ^{4}}{n}}\,.}$ Note also that we can derive the mean of ${\displaystyle s^{\ast 2}}$ using ${\displaystyle E\left({\frac {ns^{\ast 2}}{\sigma ^{2}}}\right)=n}$, and therefore ${\displaystyle E(s^{\ast 2})=\sigma ^{2}\,.}$
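A short simulation (Python with NumPy; the parameter values are arbitrary) confirms ${\displaystyle E(s^{\ast 2})=\sigma ^{2}}$ and ${\displaystyle Var(s^{\ast 2})=2\sigma ^{4}/n}$:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2, n, reps = 3.0, 1.0, 8, 300_000   # illustration values

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# s*^2 uses the known mean mu and divides by n
s_star2 = np.mean((samples - mu) ** 2, axis=1)

emp_mean = s_star2.mean()   # should be close to sigma^2 = 1
emp_var = s_star2.var()     # should be close to 2 * sigma2**2 / n = 0.25
```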

(2) ${\displaystyle \mu }$ is unknown

Since ${\displaystyle \mu }$ is typically unknown, the usual estimator of the variance is given by ${\displaystyle s^{2}={\frac {1}{n-1}}\sum \limits _{i=1}^{n}(X_{i}-{\bar {x}})^{2}\,.}$ Recall that the variance of a random variable can be written as: {\displaystyle {\begin{aligned}Var(X)&=E[(X-E(X))^{2}]=E[X^{2}-2XE(X)+(E(X))^{2}]\\&=E(X^{2})-2E(X)E(X)+[E(X)]^{2}\\&=E(X^{2})-[E(X)]^{2}\end{aligned}}} This implies that ${\displaystyle E(X^{2})=Var(X)+[E(X)]^{2}\,.}$ Applying this result to the ${\displaystyle X_{i}}$ and to ${\displaystyle {\bar {x}}}$, we have: {\displaystyle {\begin{aligned}E(X_{i}^{2})&=Var(X_{i})+[E(X_{i})]^{2}=\sigma ^{2}+\mu ^{2}\\E({\bar {x}}^{2})&=Var({\bar {x}})+[E({\bar {x}})]^{2}={\frac {\sigma ^{2}}{n}}+\mu ^{2}\end{aligned}}} Furthermore, {\displaystyle {\begin{aligned}E\left[\sum \limits _{i=1}^{n}(X_{i}-{\bar {x}})^{2}\right]&=E\left[\sum \limits _{i=1}^{n}X_{i}^{2}-2{\bar {x}}\sum \limits _{i=1}^{n}X_{i}+n{\bar {x}}^{2}\right]=E\left[\sum \limits _{i=1}^{n}X_{i}^{2}-2n{\bar {x}}^{2}+n{\bar {x}}^{2}\right]\\&=E\left[\sum \limits _{i=1}^{n}X_{i}^{2}-n{\bar {x}}^{2}\right]=E\left[\sum \limits _{i=1}^{n}X_{i}^{2}\right]-E\left[n{\bar {x}}^{2}\right]\\&=\sum \limits _{i=1}^{n}E(X_{i}^{2})-nE({\bar {x}}^{2})=\sum \limits _{i=1}^{n}(\sigma ^{2}+\mu ^{2})-n({\frac {\sigma ^{2}}{n}}+\mu ^{2})\\&=n\sigma ^{2}+n\mu ^{2}-\sigma ^{2}-n\mu ^{2}\\&=(n-1)\sigma ^{2}\end{aligned}}} Therefore, the expectation of the sample variance ${\displaystyle s^{2}}$ is given by ${\displaystyle E(s^{2})=E\left[{\frac {1}{n-1}}\sum \limits _{i=1}^{n}(X_{i}-{\bar {x}})^{2}\right]={\frac {1}{n-1}}E\left[\sum \limits _{i=1}^{n}(X_{i}-{\bar {x}})^{2}\right]={\frac {1}{n-1}}(n-1)\sigma ^{2}=\sigma ^{2}\,.}$ Once again, this argument does not require the assumption of normality, only that the ${\displaystyle X_{i}}$ are i.i.d. with common variance ${\displaystyle \sigma ^{2}}$.
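Since the unbiasedness argument uses no normality, it can be illustrated with a deliberately non-normal population. The sketch below (NumPy; the choice of an exponential population and the sample size are arbitrary) shows that the empirical mean of ${\displaystyle s^{2}}$ still matches ${\displaystyle \sigma ^{2}}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 5, 400_000   # illustration values

# Exponential population with scale 1: sigma^2 = 1, clearly non-normal
samples = rng.exponential(scale=1.0, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)

emp_mean = s2.mean()   # should be close to sigma^2 = 1 even for n = 5
```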

To derive the variance of ${\displaystyle s^{2}}$, we again assume that the ${\displaystyle X_{i}}$ are i.i.d. ${\displaystyle N(\mu ,\sigma ^{2})}$. Since ${\displaystyle (n-1)s^{2}/\sigma ^{2}}$ has a chi-square distribution with ${\displaystyle n-1}$ degrees of freedom, it follows that ${\displaystyle Var\left({\frac {(n-1)s^{2}}{\sigma ^{2}}}\right)={\frac {(n-1)^{2}}{\sigma ^{4}}}Var(s^{2})=2(n-1)}$ and therefore ${\displaystyle Var(s^{2})={\frac {2\sigma ^{4}}{(n-1)}}\,.}$

(3) ${\displaystyle \mu }$ is unknown and the ${\displaystyle MSD}$ is used

In this case we use the ${\displaystyle MSD}$ to estimate the variance: ${\displaystyle MSD={\frac {1}{n}}\sum \limits _{i=1}^{n}(X_{i}-{\bar {x}})^{2}\,.}$ Note that ${\displaystyle MSD={\frac {n-1}{n}}s^{2}\,.}$ Hence ${\displaystyle E(MSD)={\frac {n-1}{n}}E\left[s^{2}\right]={\frac {n-1}{n}}\sigma ^{2}}$ and ${\displaystyle Var(MSD)=\left({\frac {n-1}{n}}\right)^{2}Var\left[s^{2}\right]=\left({\frac {n-1}{n}}\right)^{2}{\frac {2\sigma ^{4}}{(n-1)}}={\frac {n-1}{n^{2}}}2\sigma ^{4}\,.}$ Note that the expectation of the ${\displaystyle MSD}$ is not exactly equal to the population variance ${\displaystyle \sigma ^{2}}$, which is the reason that the sample variance ${\displaystyle s^{2}}$ is usually used in practical applications. Nevertheless, even for moderately sized samples, the two estimates will be similar.
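The exact identity ${\displaystyle MSD={\frac {n-1}{n}}s^{2}}$ and the resulting downward bias ${\displaystyle E(MSD)={\frac {n-1}{n}}\sigma ^{2}}$ can be checked numerically (NumPy sketch; parameters are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, reps = 1.0, 10, 200_000   # illustration values

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = samples.var(axis=1, ddof=1)    # divisor n - 1
msd = samples.var(axis=1, ddof=0)   # divisor n

# The identity MSD = (n-1)/n * s^2 holds exactly for every sample
identity_ok = np.allclose(msd, (n - 1) / n * s2)

emp_mean = msd.mean()   # should be close to (n-1)/n * sigma^2 = 0.9
```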