Marginal and Conditional Distributions


Marginal distribution

Suppose one is given a two-dimensional frequency distribution of the variables ${\displaystyle X}$ and ${\displaystyle Y}$. The marginal distribution of ${\displaystyle X}$ (respectively ${\displaystyle Y}$) is the one-dimensional distribution of variable ${\displaystyle X}$ (respectively ${\displaystyle Y}$), in which we do not consider what happens to variable ${\displaystyle Y}$ (respectively ${\displaystyle X}$). The marginal distribution is the result of "adding up" the joint frequencies over all realizations of the other variable. For example, for the marginal (absolute) distribution of ${\displaystyle X}$:

Variable ${\displaystyle X}$ ${\displaystyle y_{1}}$ ${\displaystyle y_{2}}$ ${\displaystyle y_{3}}$ Marginal distribution of ${\displaystyle X}$
${\displaystyle \cdots }$ ${\displaystyle \cdots }$ ${\displaystyle \cdots }$ ${\displaystyle \cdots }$ ${\displaystyle \cdots }$
${\displaystyle x_{i}}$ ${\displaystyle h(x_{i},y_{1})}$ ${\displaystyle h(x_{i},y_{2})}$ ${\displaystyle h(x_{i},y_{3})}$ ${\displaystyle h_{i\cdot }=h(x_{i},y_{1})+h(x_{i},y_{2})+h(x_{i},y_{3})}$
${\displaystyle \cdots }$ ${\displaystyle \cdots }$ ${\displaystyle \cdots }$ ${\displaystyle \cdots }$ ${\displaystyle \cdots }$
Marginal distribution of ${\displaystyle Y}$ ${\displaystyle \cdots }$ ${\displaystyle \cdots }$ ${\displaystyle \cdots }$

Marginal absolute distribution of variable ${\displaystyle X}$ with the values ${\displaystyle x_{i}}$:
${\displaystyle h_{i\cdot }=\sum _{j=1}^{r}h_{ij};\ \ \ i=1,\ldots ,m}$

Marginal absolute distribution of variable ${\displaystyle Y}$ with the values ${\displaystyle y_{j}}$:
${\displaystyle h_{\cdot j}=\sum _{i=1}^{m}h_{ij};\ \ \ j=1,\ldots ,r}$

Total number of observations equals ${\displaystyle n}$:
${\displaystyle h_{\cdot \cdot }=\sum _{i=1}^{m}\sum _{j=1}^{r}h_{ij}=\sum _{i=1}^{m}h_{i\cdot }=\sum _{j=1}^{r}h_{\cdot j}=n}$

The marginal relative distribution is defined similarly using the relative frequencies ${\displaystyle f_{ij}}$. (Note: It can be obtained simply by dividing all of the absolute frequencies in the marginal absolute distribution by ${\displaystyle n}$.)

Marginal relative distribution of variable ${\displaystyle X}$ with the values ${\displaystyle x_{i}}$:
${\displaystyle f_{i\cdot }=\sum _{j=1}^{r}f_{ij};\ \ \ i=1,\ldots ,m}$

Marginal relative distribution of variable ${\displaystyle Y}$ with the values ${\displaystyle y_{j}}$:
${\displaystyle f_{\cdot j}=\sum _{i=1}^{m}f_{ij};\ \ \ j=1,\ldots ,r}$

Total of all relative frequencies equals ${\displaystyle 1}$:
${\displaystyle f_{\cdot \cdot }=\sum _{i=1}^{m}\sum _{j=1}^{r}f_{ij}=\sum _{i=1}^{m}f_{i\cdot }=\sum _{j=1}^{r}f_{\cdot j}=1}$

Conditional distribution

Suppose one is given a two-dimensional frequency distribution of two variables ${\displaystyle X}$ and ${\displaystyle Y}$. The frequency distribution of ${\displaystyle X}$ given a particular value ${\displaystyle y_{j}}$ of ${\displaystyle Y}$ is called the conditional frequency distribution, or simply the conditional distribution, of ${\displaystyle X}$ given ${\displaystyle y_{j}}$. (The conditional distribution of ${\displaystyle Y}$ given ${\displaystyle x_{i}}$ is defined similarly.)

Conditional relative frequency distribution of ${\displaystyle X}$ for a given ${\displaystyle Y=y_{j}}$:
${\displaystyle f(x_{i}|Y=y_{j})=f(x_{i}|y_{j})={\frac {f_{ij}}{f_{\cdot j}}}={\frac {h_{ij}}{h_{\cdot j}}}}$

Conditional relative frequency distribution of ${\displaystyle Y}$ for a given ${\displaystyle X=x_{i}}$:
${\displaystyle f(y_{j}|X=x_{i})=f(y_{j}|x_{i})={\frac {f_{ij}}{f_{i\cdot }}}={\frac {h_{ij}}{h_{i\cdot }}}}$

Like marginal distributions, conditional distributions are one-dimensional distributions.

Example: The starting point is the following 5${\displaystyle \times }$3 contingency table of the variables
${\displaystyle X}$ - occupation
${\displaystyle Y}$ - athletic activity
which have been observed for ${\displaystyle n=1000}$ employed persons.

occupation ${\displaystyle X}$ rarely sometimes regularly MD ${\displaystyle X}$
worker 240 120 70 430
salaried 160 90 90 340
civil servant 30 30 30 90
farmer 37 7 6 50
self employed 40 32 18 90
MD ${\displaystyle Y}$ 507 279 214 1000
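The marginal distributions in the last column and last row of this table can be recomputed from the joint frequencies. The following is a minimal sketch in plain Python; the variable names (`h`, `h_row`, `h_col`, `f_row`, `f_col`) are chosen here for illustration and are not part of the original text.

```python
# Joint absolute frequencies h_ij from the occupation / athletic activity table.
# Rows: occupations (X); columns: rarely, sometimes, regularly (Y).
h = [
    [240, 120, 70],  # worker
    [160, 90, 90],   # salaried
    [30, 30, 30],    # civil servant
    [37, 7, 6],      # farmer
    [40, 32, 18],    # self employed
]

# Marginal absolute distribution of X: sum each row over j.
h_row = [sum(row) for row in h]
# Marginal absolute distribution of Y: sum each column over i.
h_col = [sum(col) for col in zip(*h)]
# Total number of observations n.
n = sum(h_row)

# Marginal relative distributions: divide the absolute marginals by n.
f_row = [x / n for x in h_row]
f_col = [x / n for x in h_col]

print(h_row)  # [430, 340, 90, 50, 90]
print(h_col)  # [507, 279, 214]
print(n)      # 1000
```

Both marginal relative distributions sum to 1 (up to floating-point rounding), in line with the identity for the total of all relative frequencies.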

Conditional distribution of the variable ${\displaystyle Y}$ (athletic activity) for a given ${\displaystyle x_{i}}$ (occupational group):

occupation ${\displaystyle X}$ rarely sometimes regularly total
worker 0.56 0.28 0.16 1.00
salaried 0.47 0.26 0.26 1.00
civil servant 0.33 0.33 0.33 1.00
farmer 0.74 0.14 0.12 1.00
self employed 0.44 0.36 0.20 1.00
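Each row of this table is the corresponding row of absolute frequencies divided by its row sum, that is ${\displaystyle h_{ij}/h_{i\cdot }}$. A small sketch of that calculation follows; the dictionary layout and the helper name `conditional_given_row` are illustrative assumptions, not something prescribed by the text.

```python
# Absolute frequencies per occupation: (rarely, sometimes, regularly).
h = {
    "worker": [240, 120, 70],
    "salaried": [160, 90, 90],
    "civil servant": [30, 30, 30],
    "farmer": [37, 7, 6],
    "self employed": [40, 32, 18],
}


def conditional_given_row(counts):
    """Return f(y_j | x_i) = h_ij / h_i. for one row of counts."""
    total = sum(counts)
    return [c / total for c in counts]


# Conditional distribution of Y (athletic activity) for every occupation.
cond = {occ: conditional_given_row(counts) for occ, counts in h.items()}

for occ, dist in cond.items():
    print(occ, [round(p, 2) for p in dist])
```

Because the entries are rounded to two decimals, a printed row may add up to slightly less or more than 1.00 (the salaried row gives 0.47 + 0.26 + 0.26), although the unrounded values sum to 1.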

For ${\displaystyle n=100}$ randomly selected persons it has been determined whether they smoke and whether they have had lung cancer. The variables are
${\displaystyle X}$ - smoker, with realizations ${\displaystyle x_{1}}$ = yes and ${\displaystyle x_{2}}$ = no
${\displaystyle Y}$ - lung cancer, with realizations ${\displaystyle y_{1}}$ = yes and ${\displaystyle y_{2}}$ = no
The two-dimensional frequency distribution is a 2${\displaystyle \times }$2 contingency table:

yes (${\displaystyle y_{1}}$) no (${\displaystyle y_{2}}$) MD ${\displaystyle X}$
smoking yes (${\displaystyle x_{1}}$) 10 15 25
smoking no (${\displaystyle x_{2}}$) 5 70 75
MD ${\displaystyle Y}$ 15 85 100

The conditional distributions of the variable ${\displaystyle X}$ (smoker) for a given ${\displaystyle y_{j}}$ (lung cancer) are shown in the following table:

yes ${\displaystyle (y_{1})}$ no ${\displaystyle (y_{2})}$
smoker yes 0.667 0.176
smoker no 0.333 0.824
1.000 1.000

Each element of the conditional distribution has been calculated as the ratio of the respective cell of the joint distribution and the corresponding element of the ${\displaystyle Y}$ marginal distribution. From the table we learn that 66.7% of all persons diagnosed with lung cancer are smokers. 82.4% of the persons not diagnosed with lung cancer are non-smokers. The conditional distribution of the variable ${\displaystyle Y}$ (lung cancer), for a given value ${\displaystyle x_{i}}$ (smoker/non-smoker) is constructed analogously:

yes ${\displaystyle (y_{1})}$ no ${\displaystyle (y_{2})}$
smoker yes ${\displaystyle (x_{1})}$ 0.400 0.600 1.000
smoker no ${\displaystyle (x_{2})}$ 0.067 0.933 1.000
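The two conditional tables of the smoking example differ only in the marginal used as the denominator: column sums for ${\displaystyle f(x_{i}|y_{j})}$ and row sums for ${\displaystyle f(y_{j}|x_{i})}$. The sketch below computes both from the 2${\displaystyle \times }$2 table; the names `x_given_y` and `y_given_x` are illustrative only.

```python
# Joint absolute frequencies of the smoking example.
# Rows: smoker yes (x_1), smoker no (x_2); columns: lung cancer yes (y_1), no (y_2).
h = [
    [10, 15],
    [5, 70],
]

row_tot = [sum(row) for row in h]        # marginal of X: [25, 75]
col_tot = [sum(col) for col in zip(*h)]  # marginal of Y: [15, 85]

# f(x_i | y_j) = h_ij / h_.j : divide each cell by its column sum.
x_given_y = [[h[i][j] / col_tot[j] for j in range(2)] for i in range(2)]

# f(y_j | x_i) = h_ij / h_i. : divide each cell by its row sum.
y_given_x = [[h[i][j] / row_tot[i] for j in range(2)] for i in range(2)]

print(round(x_given_y[0][0], 3))  # 0.667, share of smokers among lung cancer cases
print(round(y_given_x[0][0], 3))  # 0.4, share of lung cancer cases among smokers
```

The same cell (10 smokers with lung cancer) therefore yields 0.667 or 0.400, depending on which variable is conditioned on.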

Hence, 40% of all smokers but only 6.7% of all non-smokers have been diagnosed with lung cancer.

In a survey of 941 persons, respondents' age (grouped as 18-29, 30-39 and 40-49) and the highest level of education attained (university, high school, middle school, lower school) were recorded. The observed frequencies are shown in the following ${\displaystyle 3\times 4}$ contingency table:

university high school middle school lower school MD (age)
18–29 38 93 134 42 307
30–39 23 94 168 70 355
40–49 12 39 129 99 279
MD (education) 73 226 431 211 941

The conditional distributions of educational attainment, given age, are summarized in the following table:

university high school middle school lower school
18–29 0.124 0.303 0.436 0.137 1.000
30–39 0.065 0.265 0.473 0.197 1.000
40–49 0.043 0.140 0.462 0.355 1.000

Each element of the distribution has been calculated as the ratio of the respective cell of the joint distribution and the corresponding element of the marginal distribution of age. The table shows that among the 18-29 year-olds 12.4% have completed a university education, 30.3% graduated from high school and 43.6% finished middle school. In the group of 40-49 year-olds the fraction of persons with a university degree is only 4.3%. The conditional distribution of age, for a given level of educational attainment, is constructed analogously:

university high school middle school lower school
18–29 0.521 0.411 0.311 0.199
30–39 0.315 0.416 0.390 0.332
40–49 0.164 0.173 0.299 0.469
1.000 1.000 1.000 1.000

It can be seen that among those whose highest level of education is high school, 41.1% belong to the age group 18-29, 41.6% to the age group 30-39 and 17.3% to the age group 40-49.
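The definition of the conditional distribution states that it can be computed either from relative or from absolute frequencies, since ${\displaystyle f_{ij}/f_{\cdot j}=h_{ij}/h_{\cdot j}}$. The following sketch checks this numerically for the age and education table; it is an illustration under the assumption that the table above is entered correctly, and the variable names are hypothetical.

```python
import math

# Joint absolute frequencies: rows are age groups 18-29, 30-39, 40-49;
# columns are university, high school, middle school, lower school.
h = [
    [38, 93, 134, 42],
    [23, 94, 168, 70],
    [12, 39, 129, 99],
]

n = sum(sum(row) for row in h)                # 941 observations
f = [[c / n for c in row] for row in h]       # joint relative frequencies f_ij

h_col = [sum(col) for col in zip(*h)]         # absolute marginal of education
f_col = [sum(col) for col in zip(*f)]         # relative marginal of education

# Compare h_ij / h_.j with f_ij / f_.j for every cell.
agree = all(
    math.isclose(h[i][j] / h_col[j], f[i][j] / f_col[j])
    for i in range(3)
    for j in range(4)
)
print(agree)  # True

# Conditional distribution of age among university graduates.
age_given_university = [round(h[i][0] / h_col[0], 3) for i in range(3)]
print(age_given_university)  # [0.521, 0.315, 0.164]
```

The factor ${\displaystyle 1/n}$ cancels in the ratio, which is why both routes give the same conditional frequencies.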

In a survey of 107 students their major and gender were recorded. The responses were used to produce the following 9${\displaystyle \times }$2 contingency table:

female male MD (major)
social sc. 12 13 25
engineering 1 1 2
law 8 13 21
medicine 6 4 10
natural sc. 1 8 9
psychology 3 8 11
other 1 0 1
theology 7 2 9
business 5 14 19
MD (gender) 44 63 107

What are the shares of females and males in each major? The answer is given by the conditional distributions of gender, given the major. The frequencies of the conditional distribution are computed as the ratio of the corresponding cells of the joint distribution table and the marginal distribution (i.e. row sum in this case) of the respective major.

female male MD (major)
social sc. 0.480 0.520 1.000
engineering 0.500 0.500 1.000
law 0.381 0.619 1.000
medicine 0.600 0.400 1.000
natural sc. 0.111 0.889 1.000
psychology 0.273 0.727 1.000
other 1.000 0.000 1.000
theology 0.778 0.222 1.000
business 0.263 0.737 1.000
total 0.411 0.589 1.000

The results show that business is dominated by males who account for 73.7% of all students majoring in business. In theology, on the other hand, women are the majority comprising 77.8% of theology majors.
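The "total" row of the table is the marginal relative distribution of gender, so each major's conditional shares can be compared against it. The sketch below recomputes the shares and lists the majors whose male share exceeds the overall male share; it assumes that the first count column is female and the second male, as the conditional table indicates, and all names are illustrative.

```python
# Absolute frequencies per major: (female, male).
counts = {
    "social sc.": (12, 13),
    "engineering": (1, 1),
    "law": (8, 13),
    "medicine": (6, 4),
    "natural sc.": (1, 8),
    "psychology": (3, 8),
    "other": (1, 0),
    "theology": (7, 2),
    "business": (5, 14),
}

# Marginal distribution of gender (the "total" row).
total_female = sum(f for f, m in counts.values())
total_male = sum(m for f, m in counts.values())
n = total_female + total_male
marginal_male = total_male / n

# Conditional distribution of gender given the major: cell / row sum.
shares = {
    major: (f / (f + m), m / (f + m))
    for major, (f, m) in counts.items()
}

# Majors in which the male share is above the overall male share.
above_average = [
    major for major, (fs, ms) in shares.items() if ms > marginal_male
]

print(round(marginal_male, 3))  # 0.589
print(above_average)            # ['law', 'natural sc.', 'psychology', 'business']
```

Majors with very few students (engineering with 2, other with 1) produce extreme shares, so their conditional distributions should be read with caution.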