Relation between Discrete Variables (Rank Correlation)

From MM*Stat International



Spearman’s rank correlation coefficient

The starting point for measuring the relationship between two discrete (ordinal) variables X and Y are the ranks R(x_i), R(y_i), i=1,\dots,n, which are assigned to the observations x_{i} and y_{i} according to their position in the ordered sample. The ranks are defined so that R(x_{i}) equals 1 for the x_{i} with the largest observed value, 2 for the x_{i} with the second largest observed value, and so on. Spearman’s rank correlation coefficient is computed from the pairs of ranks as follows:

r_s = 1-\frac{6 \sum_{i=1}^{n} [R(x_i)-R(y_i)]^2}{n(n^2-1)} = 1-\frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)}, \quad d_i=R(x_i)-R(y_i)

Spearman’s rank correlation coefficient amounts to applying the Bravais-Pearson correlation coefficient to the ranks (rather than to the observations themselves). Provided that no ties occur, each series of ranks is a permutation of 1,\dots,n, so that:

\sum_{i=1}^{n} R(x_{i}) = \sum_{i=1}^{n} R(y_{i}) = \frac{n(n+1)}{2}

\sum_{i=1}^{n} R(x_i)^2 = \sum_{i=1}^{n} R(y_i)^2 = \frac{n(n+1)(2n+1)}{6}

\sum_{i=1}^{n} R(x_{i})R(y_{i}) = \frac{1}{2}\left[ \sum_{i=1}^{n} R(x_{i})^{2} + \sum_{i=1}^{n} R(y_{i})^{2} - \sum_{i=1}^{n} (R(x_{i})-R(y_{i}))^{2} \right]

The Bravais-Pearson correlation coefficient is calculated as:

r_{yx} = \frac{n\sum_{i=1}^{n} x_{i}y_{i} - \sum_{i=1}^{n} x_{i} \sum_{i=1}^{n} y_{i}}{\sqrt{\left[ n\sum_{i=1}^{n} x_{i}^{2} - \left( \sum_{i=1}^{n} x_{i}\right)^{2} \right] \left[ n\sum_{i=1}^{n} y_{i}^{2} - \left( \sum_{i=1}^{n} y_{i}\right)^{2} \right]}}

If we use the corresponding ranks R(x_{i}) and R(y_{i}) instead of the observations x_{i} and y_{i} themselves, we obtain Spearman’s rank correlation coefficient. The two bracketed factors under the square root are then equal, so the square root reduces to a single factor:

r_{yx} = \frac{n \sum_{i=1}^{n} R(x_i) R(y_i) - \sum_{i=1}^{n} R(x_i) \sum_{i=1}^{n} R(y_i)}{\sqrt{\left[ n \sum_{i=1}^{n} R(x_i)^2 - \left( \sum_{i=1}^{n} R(x_i) \right)^2 \right] \left[ n \sum_{i=1}^{n} R(y_i)^2 - \left( \sum_{i=1}^{n} R(y_i) \right)^2 \right]}}

= \frac{n\cdot \frac{1}{2}\cdot 2\cdot \frac{n(n+1)(2n+1)}{6} - n\cdot \frac{1}{2} \sum_{i=1}^{n} [R(x_i)-R(y_i)]^2 - \frac{n^2(n+1)^2}{4}}{n\cdot \frac{n(n+1)(2n+1)}{6} - \frac{n^2(n+1)^2}{4}}

= 1-\frac{6\sum_{i=1}^{n} [R(x_{i})-R(y_{i})]^{2}}{n(n+1)(n-1)} = r_{s}
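The equivalence between the rank-difference formula and the Bravais-Pearson coefficient applied to ranks can be checked numerically. The following is a minimal sketch in Python; the function names and the two rank lists are hypothetical and chosen only for illustration, and the rank lists are assumed to contain no ties.

```python
from math import sqrt


def spearman_from_ranks(rx, ry):
    """Spearman's r_s via 1 - 6*sum(d_i^2) / (n(n^2-1)); assumes no ties."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def pearson(x, y):
    """Bravais-Pearson correlation coefficient in the sum form used in the text."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


# Hypothetical ranks (not taken from the text), each a permutation of 1..5.
rx = [1, 2, 3, 4, 5]
ry = [2, 1, 4, 3, 5]

r_formula = spearman_from_ranks(rx, ry)  # rank-difference formula
r_pearson = pearson(rx, ry)              # Pearson applied to the ranks
print(r_formula, r_pearson)
```

Both values should agree up to floating-point rounding; with ties the shortcut formula no longer matches the Pearson form exactly.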

Properties of Spearman’s rank correlation coefficient:

  • Spearman’s rank correlation coefficient can only take on values between -1 and +1: -1 \le r_{s} \le 1.
  • The rank correlation coefficient takes on the value +1 if the ranks behave exactly the same way, i.e.: R(x_{i})=R(y_{i}) for all i.
  • Spearman’s rank correlation coefficient takes on the value -1, if the ranks are perfectly opposed to each other, i.e.: R(x_{i})=n+1-R(y_{i}) for all i.
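The two boundary cases can be illustrated with a short Python sketch. The sample size n = 6 and the helper name are assumptions made for the illustration only.

```python
def spearman_from_ranks(rx, ry):
    """Spearman's r_s via the rank-difference formula; assumes no ties."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


n = 6  # illustrative sample size
rx = list(range(1, n + 1))

# Identical ranks: R(x_i) = R(y_i) for all i.
same = spearman_from_ranks(rx, rx)

# Perfectly opposed ranks: R(x_i) = n + 1 - R(y_i) for all i.
opposed = spearman_from_ranks(rx, [n + 1 - r for r in rx])

print(same, opposed)  # prints: 1.0 -1.0
```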

example:

X: ranking of an athlete in downhill skiing
Y: ranking of an athlete in slalom

Is there a relationship between the rankings in the two disciplines?

athlete 1 2 3 4 5 6
downhill  R(x_{i}) 2 1 3 4 5 6
slalom  R(y_{i}) 2 3 1 5 4 6
{d_{i}}^{2} 0 4 4 1 1 0

With \sum_{i=1}^{n} d_i^2 = 10 and n = 6, the coefficient r_{s} = 1 - 6\cdot 10/(6\cdot 35) = 0.714 points to a strong positive relationship between the rankings in the two disciplines.

Kendall’s rank correlation coefficient

Kendall’s rank correlation coefficient is based on comparing the order relation of all possible pairs of observations on two variables. A pair of observations is concordant if both variables show the same order relation, i.e. the observation with the higher value of one variable also has the higher value of the other. A pair is discordant if the order relations differ, i.e. the observation with the higher value of one variable has the lower value of the other. In addition, there can be pairs of observations that are equal in one or both values; such pairs are called ties. The number of concordant pairs P and discordant pairs Q can be calculated as follows:

  • The pairs of ranks (R(x_i), R(y_i)) are sorted in increasing order of R(x_i).
  • We call p_{i} the number of ranks following R(y_{i}) in this sorted sequence that are larger than R(y_{i}).
  • We call q_i the number of ranks following R(y_i) in this sorted sequence that are smaller than R(y_i).

Using the numbers of concordant and discordant pairs, we can calculate Kendall’s rank correlation coefficient \tau (also written T):

\tau = \frac{P-Q}{P+Q}, with P=\sum_i p_i and Q=\sum_i q_i.

If there are no ties, the total number of pairs to be compared is n(n-1)/2 = P+Q. The correlation coefficient can only take on values between -1 and +1: -1 \le \tau \le 1. An alternative way of calculating Kendall’s rank correlation coefficient is given by:

\tau = 1-\frac{4Q}{n(n-1)} = \frac{4P}{n(n-1)}-1
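The sort-and-count procedure for P and Q can be sketched in Python as follows. The function names are hypothetical, the sketch assumes that no ties occur, and the data are the downhill and slalom ranks from the example in the Spearman section.

```python
def kendall_counts(rx, ry):
    """Return (P, Q) using the sort-and-count procedure; assumes no ties."""
    # Sort the pairs by the rank of x and keep the y ranks in that order.
    y_sorted = [y for _, y in sorted(zip(rx, ry))]
    P = Q = 0
    for i, y in enumerate(y_sorted):
        later = y_sorted[i + 1:]
        P += sum(1 for v in later if v > y)  # p_i: later ranks that are larger
        Q += sum(1 for v in later if v < y)  # q_i: later ranks that are smaller
    return P, Q


def kendall_tau(rx, ry):
    """Kendall's coefficient (P - Q) / (P + Q)."""
    P, Q = kendall_counts(rx, ry)
    return (P - Q) / (P + Q)


# Ranks from the downhill/slalom example.
downhill = [2, 1, 3, 4, 5, 6]
slalom = [2, 3, 1, 5, 4, 6]

n = len(downhill)
P, Q = kendall_counts(downhill, slalom)
tau = kendall_tau(downhill, slalom)
tau_alt = 1 - 4 * Q / (n * (n - 1))  # alternative formula, valid without ties
print(P, Q, tau, tau_alt)
```

The double loop is quadratic in n, which is unproblematic for samples of this size; it was chosen here because it mirrors the definition of p_i and q_i directly.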

example:

Ten employees have been ranked according to their managerial abilities (X) and their work ethic (Y). To assess the relationship between the two variables, we calculate both Spearman’s and Kendall’s rank correlation coefficients.

employee 1 2 3 4 5 6 7 8 9 10
R(X) 7 3 9 10 1 5 4 6 2 8
R(Y) 3 9 10 8 7 1 5 4 2 6
{d_i}^2 16 36 1 4 36 16 1 4 0 4

  • Spearman’s rank correlation coefficient

    r_s=1-\frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2-1)}

    r_{s}=1-6\cdot 118/(10\cdot 99)=0.2848

  • Kendall’s rank correlation coefficient

    employee 5 9 2 7 6 8 1 10 3 4
    R(X) 1 2 3 4 5 6 7 8 9 10
    R(Y) 7 2 9 5 1 4 3 6 10 8
    q 6 1 6 3 0 1 0 0 1 0
    p 3 7 1 3 5 3 3 2 0 0

    Q=18, P=27

    Q+P=n(n-1)/2=10 \cdot 9/2=45

    T=(27-18)/(27+18)=9/45=0.200
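The rows of the sorted table and both coefficients for the ten employees can be reproduced with a few lines of Python. This is a sketch under the assumption of no ties; the variable names are chosen for illustration.

```python
# Ranks of the ten employees as given in the first table.
rank_x = [7, 3, 9, 10, 1, 5, 4, 6, 2, 8]
rank_y = [3, 9, 10, 8, 7, 1, 5, 4, 2, 6]
n = len(rank_x)

# Spearman: squared rank differences and the rank-difference formula.
d2 = [(a - b) ** 2 for a, b in zip(rank_x, rank_y)]
r_s = 1 - 6 * sum(d2) / (n * (n ** 2 - 1))

# Kendall: sort by R(X), then count later R(Y) values that are smaller (q) or larger (p).
y_sorted = [y for _, y in sorted(zip(rank_x, rank_y))]
q = [sum(1 for v in y_sorted[i + 1:] if v < y) for i, y in enumerate(y_sorted)]
p = [sum(1 for v in y_sorted[i + 1:] if v > y) for i, y in enumerate(y_sorted)]
Q, P = sum(q), sum(p)
tau = (P - Q) / (P + Q)

print(round(r_s, 4), Q, P, round(tau, 3))  # prints: 0.2848 18 27 0.2
```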

This interactive example calculates Spearman’s and Kendall’s rank correlation coefficients for two series of ranks entered by the user. After starting the example, first specify the number of elements in each series, then enter the two series themselves. To test the example, the following data set can be entered when prompted:

student 1 2 3 4 5 6
grade in mathematics 1 4 5 1 3 2
grade in physics 2 5 3 2 2 3

For these series of ranks, the program delivers the following output:

[Figures: program output for the sample data]

The standings of 20 athletes in the 100 meter dash and the 200 meter dash are given in the following table:

athlete(i) 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20
100 meters 5 7 3 13 2 15 19 14 12 1 6 20 17 4 18 11 10 16 9 8
200 meters 3 9 1 10 7 5 13 14 17 4 11 16 18 12 20 2 15 19 6 8

In what follows, we determine the statistical relationship between the standings of the athletes in the two disciplines. Since the variables are ordinally scaled (discrete), we use Spearman’s and Kendall’s rank correlation coefficients. Calculating both coefficients gives the following results:

[Figure: program output with both coefficients for the athletes' standings]

Spearman’s coefficient is calculated as:

r_s = 1- \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}

The information necessary to apply the formula can be obtained from the table: d_i is the difference between the standings R(x_{i}) and R(y_{i}) of athlete i in the two disciplines, and n is the number of athletes (n = 20). With \sum_{i=1}^{n} d_i^2 = 450, the calculation yields r_s = 1 - 2700/7980 = 0.6617, which implies a positive relationship between the standings in the two disciplines: athletes doing well in the 100 meter dash also tend to do well in the 200 meter dash.

To calculate Kendall’s rank correlation coefficient, one needs to determine the concordant and discordant pairs of athletes. A pair of observations (= athletes) is called concordant if the same order relation applies to both variables, and discordant if the order relations do not agree. For instance, athletes 1 and 2 are concordant: athlete 1 has a better standing than athlete 2 in both the 100 meter dash and the 200 meter dash. Athletes 1 and 5, however, are discordant: athlete 1 is behind athlete 5 in the 100 meter dash but ahead of athlete 5 in the 200 meter dash. Overall, there are \frac{n\cdot (n-1)}{2}=190 different pairs in this example, 138 of which are concordant and 52 of which are discordant. Using these numbers, Kendall’s rank correlation coefficient can be calculated:

\tau = \frac{P-Q}{P+Q}, where P=\sum_i p_i and Q=\sum_i q_i.

Here, P is the number of concordant pairs and Q the number of discordant pairs. Kendall’s rank correlation coefficient turns out to be \tau = (138-52)/190 = 0.4526 in this example, which is evidence of a positive relationship between the standings.
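The figures quoted for the 20 athletes can be recomputed directly from the table. The sketch below is written in Python and compares every pair of athletes explicitly, mirroring the definition of concordant and discordant pairs; the variable and function names are assumptions made for the illustration, and the standings are assumed to contain no ties.

```python
from itertools import combinations

# Standings of athletes 1..20 as given in the table.
m100 = [5, 7, 3, 13, 2, 15, 19, 14, 12, 1, 6, 20, 17, 4, 18, 11, 10, 16, 9, 8]
m200 = [3, 9, 1, 10, 7, 5, 13, 14, 17, 4, 11, 16, 18, 12, 20, 2, 15, 19, 6, 8]
n = len(m100)

# Spearman: sum of squared differences between the two standings.
d2_sum = sum((a - b) ** 2 for a, b in zip(m100, m200))
r_s = 1 - 6 * d2_sum / (n * (n ** 2 - 1))


def is_concordant(i, j):
    """True if athletes i and j (0-based indices) are ordered the same way in both races."""
    return (m100[i] - m100[j]) * (m200[i] - m200[j]) > 0


# Kendall: count concordant pairs over all n(n-1)/2 pairs of athletes.
P = sum(1 for i, j in combinations(range(n), 2) if is_concordant(i, j))
Q = n * (n - 1) // 2 - P  # valid only because there are no ties
tau = (P - Q) / (P + Q)

print(round(r_s, 4), P, Q, round(tau, 4))  # prints: 0.6617 138 52 0.4526
```

Calling is_concordant(0, 1) returns True for athletes 1 and 2, while is_concordant(0, 4) returns False for athletes 1 and 5, in line with the two pairs discussed in the text.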