Hypergeometric Distribution


The Hypergeometric distribution is based on a random experiment with the following characteristics:

• total number of elements is N
• of the N elements, M elements have a given property and N-M elements do not have this property, i.e. only two events, $A$ and $\bar{A}$, are possible
• we randomly choose n elements out of the N

Because the n elements are chosen without replacement, the probability P(A) is not constant from draw to draw and the draws (events) are not independent in this sort of experiment. The random variable X, which counts the number of successes A in the n draws, has a Hypergeometric distribution with parameters N, M, and n, with probability function:

$f_{H}(x;N,M,n)=\left\{ \begin{array}{ll} \frac{\left( \begin{array}{c} M \\ x \end{array} \right) \cdot \left( \begin{array}{c} N-M \\ n-x \end{array} \right) }{\left( \begin{array}{c} N \\ n \end{array} \right) }\quad & \text{for}\ x=\max[0,n-(N-M)],\dots ,\min[n,M] \\ & \\ 0\quad & \text{otherwise} \end{array} \right.$

Shorthand notation is: $X \sim H(N,M,n)$.

The expected value and the variance of the Hypergeometric distribution H(N,M,n) are:

$E(X) = n \cdot \frac{M}{N}$

$Var(X) = n \cdot \frac{M}{N} \cdot \left( 1- \frac{M}{N} \right) \cdot \frac{ N-n}{N-1}$

A Hypergeometric distribution depends on the parameters N, M, and n. These parameters influence its shape, location, and variance. This interactive example allows you to change the values of these parameters and to obtain plots of the Hypergeometric distribution. We suggest that you change the value of only one parameter at a time, holding the others constant, which better illustrates the effect of each parameter on the shape of the Hypergeometric distribution. You can also compute probabilities for different values of x.
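The probability function, expected value, and variance given above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `hypergeom_pmf`, `hypergeom_mean`, and `hypergeom_var` and the parameter values N = 20, M = 8, n = 5 are assumptions made for this example only.

```python
from math import comb

def hypergeom_pmf(x, N, M, n):
    """P(X = x) for X ~ H(N, M, n); returns 0 outside the support."""
    x_min, x_max = max(0, n - (N - M)), min(n, M)
    if not x_min <= x <= x_max:
        return 0.0
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def hypergeom_mean(N, M, n):
    """E(X) = n * M / N."""
    return n * M / N

def hypergeom_var(N, M, n):
    """Var(X) = n * (M/N) * (1 - M/N) * (N - n) / (N - 1)."""
    p = M / N
    return n * p * (1 - p) * (N - n) / (N - 1)

# Illustrative parameter values (not taken from the text).
N, M, n = 20, 8, 5
probs = [hypergeom_pmf(x, N, M, n) for x in range(n + 1)]
total = sum(probs)
mean_from_pmf = sum(x * p for x, p in enumerate(probs))
var_from_pmf = sum((x - mean_from_pmf) ** 2 * p for x, p in enumerate(probs))

print("sum of probabilities:", total)
print("mean (pmf vs formula):", mean_from_pmf, hypergeom_mean(N, M, n))
print("variance (pmf vs formula):", var_from_pmf, hypergeom_var(N, M, n))
```

Computing the mean and variance directly from the probabilities and comparing them with the closed-form expressions is a simple consistency check on both.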

An insurance agent arrives in a town and sells 100 life insurance policies: 40 are term life policies and the remaining 60 are permanent life policies. He chooses five of these policies at random and without replacement. What is the probability that he chooses exactly two term life policies?

There are $N=100$ policies. The outcome of each draw (the type of the insurance policy) can take one of two values: the term life type (property A), with M = 40, and the permanent life type (complementary event), with N - M = 60. The random variable X is defined as "number of term life policies among five randomly chosen insurance policies". The random variable X is based on a random sampling experiment without replacement and so has a Hypergeometric distribution H(N;M;n) = H(100;40;5).

The smallest value of X is 0 = max[0, n - (N - M)], i.e. none of the five randomly chosen contracts is a term life policy. The largest possible value of X is min[n, M] = n = 5, since $n < M$. The set of possible values of X is therefore 0 $\leq$ x $\leq$ 5.

We need to compute the value of the probability function for x = 2, i.e. P(X = 2) = $f_{H}$(2;100;40;5):

$f_H(2;100,40,5) = {\frac{\left( \begin{array}{c} 40 \\ 2 \end{array} \right) \cdot \left( \begin{array}{c} 100 - 40 \\ 5 - 2 \end{array} \right)}{\left( \begin{array}{c} 100 \\ 5 \end{array} \right)}} = \frac{\frac{40!}{2! \cdot 38!} \cdot \frac{60!}{3! \cdot 57!}}{ \frac{100!}{5! \cdot 95!}} = 0.3545$

Suppose we increase the number of draws (randomly chosen contracts) to n = 10. The only thing that changes in the example is the range of the random variable X, which becomes 0 $\leq$ x $\leq$ 10. The random variable X now has the Hypergeometric distribution H(100;40;10). The probability that there are exactly 4 term life policies among the 10 randomly chosen policies, i.e. P(X = 4), is:

$f_H(4;100,40,10) = {\frac{\left( \begin{array}{c} 40 \\ 4 \end{array} \right) \cdot \left( \begin{array}{c} 100 - 40 \\ 10 - 4 \end{array} \right)}{\left( \begin{array}{c} 100 \\ 10 \end{array} \right)}} = 0.2643$
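The two probabilities in this example can be checked numerically. The sketch below uses Python's `math.comb` for the binomial coefficients; the helper name `f_H` simply mirrors the notation of the text and is an assumption of this example.

```python
from math import comb

def f_H(x, N, M, n):
    """Hypergeometric probability function for x inside the support."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# 100 policies, 40 of them term life policies.
p_two_of_five = f_H(2, 100, 40, 5)    # exactly 2 term policies in 5 draws
p_four_of_ten = f_H(4, 100, 40, 10)   # exactly 4 term policies in 10 draws

print(round(p_two_of_five, 4))   # 0.3545
print(round(p_four_of_ten, 4))   # 0.2643
```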

An insurance agent knows from experience that 70% of his clients renew their contracts. Suppose this agent has 20 clients. What is the probability that at least one half of four randomly chosen clients will renew their contract?

We have a total of N = 20 clients. Of these clients, M = 14 renew their policies (property A) and N - M = 6 clients do not. The experiment has only two possible outcomes. We choose n = 4 clients randomly. Clearly, it does not make sense to model this experiment as sampling with replacement. The random variable X is defined as "number of clients who renew their contract". X has the Hypergeometric distribution H(N;M;n) = H(20;14;4).

The smallest possible value of X is 0 = max[0, n - (N - M)], i.e. none of the 4 clients renews the contract. The largest possible value of X is min[n, M] = n = 4, since $n < M$. X can take the following values: 0 $\leq$ x $\leq$ 4. We need to find the probability P(X $\geq$ 2), which can be computed as P(X = 2) + P(X = 3) + P(X = 4).

$f_H(2;20,14,4) = {\frac{\left( \begin{array}{c} 14 \\ 2 \end{array} \right) \cdot \left( \begin{array}{c} 20 - 14 \\ 4 - 2 \end{array} \right)}{\left( \begin{array}{c} 20 \\ 4 \end{array} \right)}} = \frac{91 \cdot 15}{4845} = 0.2817$

$f_H(3;20,14,4) = {\frac{\left( \begin{array}{c} 14 \\ 3 \end{array} \right) \cdot \left( \begin{array}{c} 20 - 14 \\ 4 - 3 \end{array} \right)}{\left( \begin{array}{c} 20 \\ 4 \end{array} \right)}} = \frac{364 \cdot 6}{4845} = 0.4508$

$f_H(4;20,14,4) = {\frac{\left( \begin{array}{c} 14 \\ 4 \end{array} \right) \cdot \left( \begin{array}{c} 20 - 14 \\ 4 - 4 \end{array} \right)}{\left( \begin{array}{c} 20 \\ 4 \end{array} \right)}} = \frac{1001 \cdot 1}{4845} = 0.2066$

This implies that: P(X $\geq$ 2) = 0.2817 + 0.4508 + 0.2066 = 0.9391. The probability that at least one half of four randomly chosen clients (out of the 20 clients) renew their policy is 0.9391.

A student has to complete a test with ten questions. The student must answer 3 questions chosen at random from these ten. The student knows that 6 of the 10 questions are so difficult that no one has a chance to answer them.

• N = 10 questions
• M = 4 questions have property A, i.e. they can be answered
• n = 3 randomly chosen questions that the student must answer
• X = "number of questions with property A among the n randomly chosen questions"

Possible values of X are: max[0, n - (N - M)] $\leq$ x $\leq$ min(n, M), i.e. 0 $\leq$ x $\leq$ 3.

Motivation for using the Hypergeometric distribution:

• finite number of questions,
• drawing the same question again (sampling with replacement) makes no sense in this situation,
• hence, the draws are not independent,
• this implies that P(A) depends on the previously drawn questions.
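The dependence between draws listed above can be made concrete for the exam example (N = 10 questions, M = 4 answerable, n = 3 drawn). The following sketch, which uses exact fractions, tracks the probability of drawing an answerable question at each step, given that all earlier draws were answerable. The variable names are assumptions made for this illustration.

```python
from fractions import Fraction

N, M, n = 10, 4, 3   # exam example: 10 questions, 4 answerable, 3 drawn

step_probs = []
prob_all_good = Fraction(1)
for k in range(n):
    # After k answerable questions have been removed,
    # M - k answerable questions remain among N - k questions.
    p_k = Fraction(M - k, N - k)
    step_probs.append(p_k)
    prob_all_good *= p_k

print(step_probs)      # [Fraction(2, 5), Fraction(1, 3), Fraction(1, 4)]
print(prob_all_good)   # 1/30
```

The conditional probabilities 2/5, 1/3, 1/4 differ from draw to draw, which is exactly what "P(A) depends on the previously drawn questions" means.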

What is the probability that the student draws 3 "good" questions?

$f_{H}(3;10,4,3)={\frac{\left( \begin{array}{c} 4 \\ 3 \end{array} \right) \cdot \left( \begin{array}{c} 10-4 \\ 3-3 \end{array} \right) }{\left( \begin{array}{c} 10 \\ 3 \end{array} \right) }}=\frac{4\cdot 1}{120}=\frac{1}{30}$

What is the probability that the student chooses at least one question that he can answer? We use P(X $\geq$ 1) = 1 - P(X = 0), where

$P(X = 0) = f_H(0;10,4,3) = {\frac{\left( \begin{array}{c} 4 \\ 0 \end{array} \right) \cdot \left( \begin{array}{c} 10 - 4 \\ 3 - 0 \end{array} \right)}{\left( \begin{array}{c} 10 \\ 3 \end{array} \right)}} = \frac{1 \cdot 20}{120} = \frac{1}{6}$

It follows that: $P(X \geq 1) = 1 - 1/6 = 5/6$

Like the Binomial distribution, the Hypergeometric distribution is based on an experiment with only two possible outcomes. The Hypergeometric distribution differs from the Binomial distribution in that we draw without replacement, which means that the draws are not independent. This also implies that the number of objects available decreases with each draw, so that $n\leq N$. In addition, the number of remaining objects with property $A$ changes and this, in turn, changes the probability of drawing an object with property $A$.
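The "at least" probabilities in the examples above can be obtained either by summing the probability function over the relevant values or through the complement. The sketch below does both with exact fractions; the helper name `f_H` mirrors the notation of the text and is an assumption of this example.

```python
from fractions import Fraction
from math import comb

def f_H(x, N, M, n):
    """Exact hypergeometric probability for x inside the support."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

# Renewal example, H(20, 14, 4): P(X >= 2) as a direct sum.
p_renew = sum(f_H(x, 20, 14, 4) for x in range(2, 5))

# Exam example, H(10, 4, 3): P(X >= 1) via the complement 1 - P(X = 0).
p_exam = 1 - f_H(0, 10, 4, 3)

print(p_renew, round(float(p_renew), 4))   # 910/969 0.9391
print(p_exam)                              # 5/6
```

The complement is convenient when the event "at least one" would otherwise require summing many terms.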

• Each draw is conducted only once and without replacement, i.e. each object can be drawn only once in the $n$ draws (no repetition)

Assuming $n$ draws, we are interested in the total number of outcomes with the property $A$, i.e. in the random variable $X$ = {number of outcomes with the property $A$ drawn in the $n$ draws}
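The random variable $X$ can also be explored by simulation. The sketch below repeatedly draws $n$ objects without replacement using `random.Random.sample` and records how many of them have the property $A$. The function name `simulate_counts`, the seed, the number of trials, and the choice of the exam parameters (N = 10, M = 4, n = 3) are assumptions made for this illustration.

```python
import random

def simulate_counts(N, M, n, trials, seed=0):
    """Relative frequencies of X = 0..n over repeated draws without replacement."""
    rng = random.Random(seed)
    population = [1] * M + [0] * (N - M)   # 1 marks an object with property A
    counts = [0] * (n + 1)
    for _ in range(trials):
        x = sum(rng.sample(population, n))  # sample() never repeats an object
        counts[x] += 1
    return [c / trials for c in counts]

freqs = simulate_counts(10, 4, 3, trials=50_000)
for x, f in enumerate(freqs):
    print(x, round(f, 3))
```

With these parameters the relative frequencies should lie close to the exact probabilities, for instance about 1/6 for x = 0 and about 1/30 for x = 3, up to simulation noise.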

The order of the drawn objects does not play a role in the number of objects drawn with the property $A$. Using combinatorics, we can calculate the number of possible outcomes in which we draw $n$ out of $N$ objects without replacement:

$\left( \begin{array}{c} N \\ n \end{array} \right)$

• How many different ways are there to obtain $\{X=x\}$? We have $x\leq M$, i.e. we cannot draw more objects with the property $A$ than there are in total and, analogously, $n-x\leq N-M$. Since we draw without replacement, no object with the property $A$ can be drawn more than once (no repetition). The order in which the objects are drawn does not affect the outcome we observe. The number of ways of choosing $x$ objects with the property $A$ out of the $M$ such objects is:

$\left( \begin{array}{c} M \\ x \end{array} \right)$

Similarly, the number of ways of choosing the $n-x$ objects without the property $A$ out of the $N-M$ such objects is:

$\left( \begin{array}{c} N-M \\ n-x \end{array} \right)$

Each possible selection of $x$ objects with the property $A$ out of $M$ can be combined with each possible selection of $n-x$ objects without the property $A$ out of $N-M$ (this gives altogether $n$ drawn objects), and every such combination leads to the event $\{X=x\}$. The number of possibilities of obtaining the event $\{X=x\}$ is therefore

$\left( \begin{array}{c} M \\ x \end{array} \right) \cdot \left( \begin{array}{c} N-M \\ n-x \end{array} \right)$

The desired probability can be obtained using the classical (Laplace) definition of probability as the ratio

$P(X=x)=f(x)=\frac{\left( \begin{array}{c} M \\ x \end{array} \right) \cdot \left( \begin{array}{c} N-M \\ n-x \end{array} \right) }{\left( \begin{array}{c} N \\ n \end{array} \right) } \,.$
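The counting argument can be verified by brute force for a small population. The sketch below enumerates every unordered draw of $n$ objects out of $N$ and counts how many of them contain exactly $x$ objects with the property $A$, then compares each count with the product of "$M$ choose $x$" and "$N-M$ choose $n-x$". The parameter values N = 8, M = 3, n = 4 are assumptions chosen only to keep the enumeration small.

```python
from itertools import combinations
from math import comb

N, M, n = 8, 3, 4        # illustrative values, not taken from the text
objects = range(N)       # objects 0 .. M-1 carry the property A

counts = {}
total = 0
for draw in combinations(objects, n):        # every unordered draw, no repetition
    x = sum(1 for obj in draw if obj < M)    # objects with property A in this draw
    counts[x] = counts.get(x, 0) + 1
    total += 1

for x in sorted(counts):
    expected = comb(M, x) * comb(N - M, n - x)
    print(x, counts[x], expected, counts[x] / total)
```

Here `total` equals the number of ways of choosing 4 out of 8 objects, and dividing each count by it gives the Laplace probability of $\{X=x\}$.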

The largest possible value of $X$ is $n$ for $n\leq M$, and $M$ for $M<n$. This implies that: $x_{\max }=\min (n;\,M).$

The smallest possible value of $X$ satisfies $x\geq 0$ (always). If $n$ is greater than the number of elements without the property $A$, i.e. $n>N-M$, then at least $n-(N-M)$ of the drawn elements must have the property $A$, so that $x\geq n-(N-M)$. This implies that: $x_{\min }=\max [0;\,n-(N-M)].$

Let $M/N=p$. We then have the following:

$E(X)=n\cdot \frac{M}{N}=n\cdot p$

$Var(X)=n\cdot \frac{M}{N}\cdot \left( 1-\frac{M}{N}\right) \cdot \frac{N-n}{ N-1}=n\cdot p\cdot (1-p)\cdot \frac{N-n}{N-1}$

The distribution $H(N,M,n)$ has the same expected value as the corresponding Binomial distribution $B(n,M/N)$. However, its variance is smaller for $n>1$, because it is multiplied by the ratio $(N-n)/(N-1)<1$: when drawing without replacement, each object drawn is no longer available for later draws. The factor $(N-n)/(N-1)$ is called the finite population correction.

The probability function of the Hypergeometric distribution is illustrated in the following diagram. We chose the following parameters for this example: $N=100,\ M=20,\ n=10$ and $N=16,\ M=8,\ n=8$.
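As a numerical companion to these formulas, the sketch below computes the smallest and largest possible values of $X$, the expected value, and the variance with and without the factor $(N-n)/(N-1)$ for the two parameter sets mentioned above. The function name `summary` and the dictionary keys are assumptions made for this illustration.

```python
def summary(N, M, n):
    """Support limits, mean, and variances for H(N, M, n) and B(n, M/N)."""
    p = M / N
    correction = (N - n) / (N - 1)     # the factor (N - n) / (N - 1)
    binomial_var = n * p * (1 - p)     # variance of B(n, p)
    return {
        "x_min": max(0, n - (N - M)),
        "x_max": min(n, M),
        "mean": n * p,
        "binomial_var": binomial_var,
        "correction": correction,
        "hypergeom_var": binomial_var * correction,
    }

for params in [(100, 20, 10), (16, 8, 8)]:
    print(params, summary(*params))
```

For the small population $N=16$ the factor is 8/15, so the Hypergeometric variance is roughly half the Binomial one, whereas for $N=100$ the factor is 90/99 and the two variances are close.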