# Relationship between Nominal Variables (Contingency)


The starting point for the analysis of the relationship between two nominal variables $X$ and $Y$ is their joint frequency distribution, arranged in a contingency table that contains either the absolute frequencies $h_{ij}=h(x_{i},y_{j})$ or the relative frequencies $f_{ij}=f(x_{i},y_{j})=h(x_{i},y_{j})/n$, with $i=1,\dots ,m$ and $j=1,\dots ,r$.

As shown in the section "characteristics of two dimensional distributions", in the case of independence the relative frequency of the joint occurrence of the realizations $x_{i}$ and $y_{j}$ equals the product of the corresponding relative frequencies of the two marginal distributions:

$f_{ij}=f_{i\bullet }f_{\bullet j}$ and $h_{ij}={\frac {h_{i\bullet }h_{\bullet j}}{n}}=nf_{i\bullet }f_{\bullet j}$

We can now calculate an auxiliary quantity, the squared contingency, denoted by $\chi ^{2}$:

$\chi ^{2}=\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{r}{\frac {\left(h_{ij}-{\frac {1}{n}}h_{i\bullet }h_{\bullet j}\right)^{2}}{{\frac {1}{n}}h_{i\bullet }h_{\bullet j}}}=n\sum \limits _{i=1}^{m}\sum \limits _{j=1}^{r}{\frac {(f_{ij}-f_{i\bullet }f_{\bullet j})^{2}}{f_{i\bullet }f_{\bullet j}}}$

The numerator of each summand is the squared deviation of the observed absolute (relative) frequency from the absolute (relative) frequency expected under independence. Dividing by this expected frequency standardizes the deviation.

We use the squared contingency to calculate the contingency coefficient:

$C={\sqrt {\frac {\chi ^{2}}{n+\chi ^{2}}}}$

The contingency coefficient measures the strength of the relationship between nominal variables. It is bounded by

$0\leq C\leq {\sqrt {\frac {C^{\ast }-1}{C^{\ast }}}}$, where $C^{\ast }=\min(m,r)$.

If the contingency coefficient equals 0, the variables are statistically independent.
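The upper bound ${\sqrt {(C^{\ast }-1)/C^{\ast }}}$ can be motivated with a short sketch. It relies on a standard result that is assumed here rather than derived in this section: for an $m\times r$ table the squared contingency cannot exceed $n(C^{\ast }-1)$. The symbols $\chi _{\max }^{2}$ and $C_{\max }$ are introduced only for this illustration.

$\chi _{\max }^{2}=n(C^{\ast }-1)\quad \Rightarrow \quad C_{\max }={\sqrt {\frac {n(C^{\ast }-1)}{n+n(C^{\ast }-1)}}}={\sqrt {\frac {C^{\ast }-1}{C^{\ast }}}}$

For a $2\times 2$ table, for example, $C^{\ast }=2$ and the bound is ${\sqrt {1/2}}\approx 0.707$.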
The contingency coefficient never reaches 1, even when there is a perfect relationship between the two variables: the sample size $n$ is always larger than 0, so the denominator $n+\chi ^{2}$ is always larger than the numerator $\chi ^{2}$. To obtain a measure that reaches the value 1 in the case of a perfect relationship, we often use the corrected contingency coefficient, which is calculated as follows:

$C_{korr}=C\cdot {\sqrt {\frac {C^{\ast }}{C^{\ast }-1}}}\qquad 0\leq C_{korr}\leq 1$

Example:

We want to analyze if there is a relationship between smoking and lung cancer. We use the following contingency table:

| | lung cancer: yes ($y_{1}$) | lung cancer: no ($y_{2}$) | MD $X$ |
|---|---|---|---|
| smoker: yes ($x_{1}$) | 10 | 15 | 25 $(h_{1\bullet })$ |
| smoker: no ($x_{2}$) | 5 | 70 | 75 $(h_{2\bullet })$ |
| MD $Y$ | 15 $(h_{\bullet 1})$ | 85 $(h_{\bullet 2})$ | 100 $(n)$ |

Here MD denotes the marginal distribution. The squared contingency is

$\chi ^{2}={\frac {\left(10-{\frac {25\cdot 15}{100}}\right)^{2}}{\frac {25\cdot 15}{100}}}+{\frac {\left(15-{\frac {25\cdot 85}{100}}\right)^{2}}{\frac {25\cdot 85}{100}}}+{\frac {\left(5-{\frac {75\cdot 15}{100}}\right)^{2}}{\frac {75\cdot 15}{100}}}+{\frac {\left(70-{\frac {75\cdot 85}{100}}\right)^{2}}{\frac {75\cdot 85}{100}}}=16.34$

$C={\sqrt {\frac {16.34}{100+16.34}}}=0.375$

$C_{korr}=0.375\cdot {\sqrt {\frac {2}{2-1}}}=0.53$

The corrected contingency coefficient of 0.53 indicates a relationship between smoking and lung cancer.

You can now generate a two-dimensional frequency distribution using the variables from one of the following data sets:

Studying

For 107 students, the following variables were recorded: major, gender, age, number of semesters, citizenship, social situation (very good/good, satisfactory, bad), psychological situation (very unstable, unstable, stable, very stable) and assessment of their studies (very good/good, satisfactory, bad).

Information

941 persons were asked whether they subscribe to a magazine. The following variables were also recorded: employment status (employed, not employed), age (in the age groups 18-29, 30-39, 40-49) and education (lower school, middle school, high school, university).

Gas stations

700 gas stations were observed. To describe their location, city size ("small" if fewer than 100,000 residents, "big" if at least 100,000 residents) and the type of street (interstate/highway, county road, main street) were recorded. Furthermore, type of service (full service, self service) and sales (low, average, high) were observed.

First, you will be asked to select one of the available data sets. Then, for the data set chosen, all possible two-dimensional frequency distributions will be shown in the output window, together with the $\chi ^{2}$ statistic and the contingency coefficients.

The "department store" data set contains the following variables recorded for $n=165$ randomly selected customers:

| Variable | Possible realizations |
|---|---|
| $X$: gender | $1$ - male, $2$ - female |
| $Y$: method of payment | $1$ - cash, $2$ - ATM card, $3$ - credit card |
| $Z$: residence | $1$ - Berlin, $2$ - not in Berlin |

Below, the three two-dimensional frequency distributions that can be formed from the variables in this data set are shown. The contingency coefficient is calculated for each of them.

The two-dimensional frequency distribution for the variables gender and method of payment is a $2\times 3$ contingency table.

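Each cell shows the absolute frequency, with the relative frequency in parentheses.

| gender $(X)$ | cash $(y_{1})$ | ATM card $(y_{2})$ | credit card $(y_{3})$ | MD $X$ |
|---|---|---|---|---|
| male $(x_{1})$ | 31 (0.188) | 32 (0.194) | 23 (0.139) | 86 (0.521) |
| female $(x_{2})$ | 30 (0.182) | 29 (0.176) | 20 (0.121) | 79 (0.479) |
| MD $Y$ | 61 (0.370) | 61 (0.370) | 43 (0.260) | 165 (1.00) |

| Measure | Value |
|---|---|
| $\chi ^{2}$ statistic | 0.08 |
| contingency coefficient | 0.02 |
| corrected contingency coefficient | 0.03 |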

The corrected contingency coefficient of 0.03 shows that there is only a very weak relationship between gender and method of payment.

The two-dimensional frequency distribution for the variables gender and residence is a $2\times 2$ contingency table.

| gender $(X)$ | Berlin $(z_{1})$ | not in Berlin $(z_{2})$ | MD $X$ |
|---|---|---|---|
| male $(x_{1})$ | 50 (0.303) | 36 (0.218) | 86 (0.521) |
| female $(x_{2})$ | 37 (0.224) | 42 (0.255) | 79 (0.479) |
| MD $Z$ | 87 (0.527) | 78 (0.473) | 165 (1.00) |

| Measure | Value |
|---|---|
| $\chi ^{2}$ statistic | 2.11 |
| contingency coefficient | 0.11 |
| corrected contingency coefficient | 0.16 |

The corrected contingency coefficient of 0.16 shows that there is only a weak relationship between gender and residence.

The two-dimensional frequency distribution for the variables residence and method of payment is a $2\times 3$ contingency table.

| residence $(Z)$ | cash $(y_{1})$ | ATM card $(y_{2})$ | credit card $(y_{3})$ | MD $Z$ |
|---|---|---|---|---|
| Berlin $(z_{1})$ | 44 (0.267) | 22 (0.133) | 21 (0.127) | 87 (0.527) |
| not in Berlin $(z_{2})$ | 17 (0.103) | 39 (0.237) | 22 (0.133) | 78 (0.473) |
| MD $Y$ | 61 (0.370) | 61 (0.370) | 43 (0.260) | 165 (1.00) |

| Measure | Value |
|---|---|
| $\chi ^{2}$ statistic | 16.27 |
| contingency coefficient | 0.30 |
| corrected contingency coefficient | 0.42 |

The corrected contingency coefficient of 0.42, which is considerably larger than in the previous two cases, shows that there is a medium-strength relationship between residence and method of payment.
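The calculations in this section can be checked numerically. The following is a minimal Python sketch, not part of the original material: the function name `contingency_measures` and the dictionary names are illustrative choices, and the code assumes a table with at least two rows and two columns whose marginal totals are all positive. It computes the squared contingency, the contingency coefficient and the corrected contingency coefficient from a table of absolute frequencies.

```python
from math import sqrt


def contingency_measures(table):
    """Return (chi2, C, C_korr) for a table of absolute frequencies h_ij.

    Assumes at least two rows and two columns and positive marginal totals.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]          # h_i. (marginal of X)
    col_totals = [sum(col) for col in zip(*table)]    # h_.j (marginal of Y)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, h_ij in enumerate(row):
            # absolute frequency expected under independence
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (h_ij - expected) ** 2 / expected

    c = sqrt(chi2 / (n + chi2))                       # contingency coefficient
    c_star = min(len(table), len(table[0]))           # C* = min(m, r)
    c_korr = c * sqrt(c_star / (c_star - 1))          # corrected coefficient
    return chi2, c, c_korr


# Absolute frequencies taken from the tables in this section
tables = {
    "smoking x lung cancer": [[10, 15], [5, 70]],
    "gender x method of payment": [[31, 32, 23], [30, 29, 20]],
    "gender x residence": [[50, 36], [37, 42]],
    "residence x method of payment": [[44, 22, 21], [17, 39, 22]],
}

results = {name: contingency_measures(t) for name, t in tables.items()}
for name, (chi2, c, c_korr) in results.items():
    print(f"{name}: chi2={chi2:.2f}, C={c:.2f}, C_korr={c_korr:.2f}")
```

Rounded to two decimals, the results agree with the values reported for the four tables. The expected frequencies are recomputed from the marginal totals inside the loop, which keeps the sketch short at the cost of a few redundant multiplications.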