Relation between Continuous Variables (Correlation, Correlation Coefficients)


The common variation (covariation) of two continuous variables $X$ and $Y$ determines the strength of the relation between them. Variation is measured by the deviation of the realizations from their mean. In the first step, we center the observations:

$x_{k}^{\ast }=x_{k}-{\bar {x}},\qquad y_{k}^{\ast }=y_{k}-{\bar {y}},\qquad k=1,\ldots ,n$

The common variation of both variables is the sum of the products of the deviations of the observations from their means (see the calculation of the covariance):

$\sum _{k=1}^{n}x_{k}^{\ast }y_{k}^{\ast }=\sum _{k=1}^{n}(x_{k}-{\bar {x}})(y_{k}-{\bar {y}})$

The scale on which each variable is measured and the number of observations can have a large impact on the magnitude of the common variation. Suppose the mean of one variable is 8 and an observed value is 10, while the mean of the other variable is 1,008 and the observed value is 1,260. Although the deviation of the first value is 2 and that of the second is 252, the relative deviation from the mean is 25% in both cases. This fact is hidden if we simply calculate the common variation for this observation, $2\cdot 252=504$. Therefore, to make the deviations of the two variables comparable, we standardize them:

$(x_{k}-{\bar {x}})/s_{x}$ and $(y_{k}-{\bar {y}})/s_{y}$

The sum of products above then becomes:

$\sum _{k=1}^{n}{\frac {(x_{k}-{\bar {x}})}{s_{x}}}{\frac {(y_{k}-{\bar {y}})}{s_{y}}}$

We subsequently divide this sum of products by the number of observations in order to eliminate its influence.
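The effect of the measurement scale described above can be illustrated with a small sketch. The data are made up for illustration and do not come from the text; the helper names are hypothetical, and the standard deviation is assumed to use the divisor $n$, which matches dividing the sum of products by the number of observations.

```python
import math

def sum_of_products(x, y):
    """Sum of products of deviations from the means (the common variation)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

def std(v):
    """Standard deviation with divisor n (an assumption of this sketch)."""
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / len(v))

def standardized_mean_product(x, y):
    """Sum of products of standardized deviations, divided by n."""
    return sum_of_products(x, y) / (len(x) * std(x) * std(y))

# Made-up illustrative data (not taken from the text).
x = [1.0, 2.0, 4.0, 5.0]
y = [2.0, 3.0, 3.0, 6.0]
x_rescaled = [1000 * a for a in x]  # same quantity, expressed in a smaller unit

raw = sum_of_products(x, y)                    # depends on the unit of x
raw_rescaled = sum_of_products(x_rescaled, y)  # 1,000 times larger

print(raw, raw_rescaled)
print(standardized_mean_product(x, y), standardized_mean_product(x_rescaled, y))
```

Changing the unit of $X$ multiplies the raw sum of products by the same factor, while the standardized and averaged quantity stays the same up to floating-point rounding.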
The result is the Bravais-Pearson (sample) correlation coefficient, which measures the strength of the linear relation between the two continuous variables $X$ and $Y$:

$r_{yx}=r_{xy}={\frac {\sum \limits _{k=1}^{n}(x_{k}-{\bar {x}})(y_{k}-{\bar {y}})}{n\cdot s_{x}\cdot s_{y}}}={\frac {s_{xy}}{s_{x}\cdot s_{y}}}$

The final part of the above equation shows that the Bravais-Pearson correlation coefficient is equal to the variation common to both variables $X$ and $Y$ (the covariance $s_{xy}$), standardized by the product of the standard deviations of the two variables. The Bravais-Pearson correlation coefficient can also be written as follows:

$r_{yx}={\frac {\sum \limits _{k=1}^{n}(x_{k}-{\bar {x}})(y_{k}-{\bar {y}})}{\sqrt {\sum \limits _{k=1}^{n}(x_{k}-{\bar {x}})^{2}\sum \limits _{k=1}^{n}(y_{k}-{\bar {y}})^{2}}}}$

$r_{yx}={\frac {n\sum \limits _{k=1}^{n}x_{k}y_{k}-\sum \limits _{k=1}^{n}x_{k}\sum \limits _{k=1}^{n}y_{k}}{\sqrt {\left[n\sum \limits _{k=1}^{n}{x_{k}}^{2}-{\left(\sum \limits _{k=1}^{n}x_{k}\right)}^{2}\right]\left[n\sum \limits _{k=1}^{n}{y_{k}}^{2}-{\left(\sum \limits _{k=1}^{n}y_{k}\right)}^{2}\right]}}}$

Properties of the correlation coefficient:

• The correlation coefficient only takes on values between -1 and +1: $-1\leq r_{xy}\leq 1$

• The sign of the correlation coefficient tells us the direction of the linear relation:

• “+” corresponds to a positive correlation (proportional variation)

• “-” corresponds to a negative correlation (inverse proportional variation)

• If all observations lie exactly on a straight line (with non-zero slope), the absolute value of the correlation coefficient is equal to 1: $|r_{xy}|=1$.

The closer $|r_{xy}|$ is to 1, the more pronounced is the linear relation between the variables $X$ and $Y$; the closer it is to 0, the weaker the linear relation.

• If the variables $X$ and $Y$ are independent, then the correlation coefficient is equal to 0.

The converse does not hold: a correlation coefficient of 0 only means that there is no linear relation between the variables $X$ and $Y$ (the variables are uncorrelated). A pronounced non-linear relation between the two variables may still exist.

• The correlation coefficient is symmetric: $r_{xy}=r_{yx}$

Relation of correlation and the scatterplot of $X$ and $Y$ observations

Perfect correlation ($|r_{xy}|=1$)

Strong correlation ($|r_{xy}|>0.5$)

Weak correlation ($|r_{xy}|<0.5$)

No correlation ($r_{xy}=0$). A correlation of 0 corresponds "in general" to a roughly circular point cloud in the scatterplot.
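The properties listed above can be checked numerically. The following is a minimal sketch with made-up data; the function name `pearson_r` is a hypothetical helper that implements the deviation form of the coefficient. It shows that points on a rising or falling line give $|r_{xy}|=1$, and that a perfectly determined but non-linear relation (a parabola on values symmetric around zero) gives a correlation of 0.

```python
import math

def pearson_r(x, y):
    """Bravais-Pearson correlation coefficient in its deviation form."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sp_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return sp_xy / math.sqrt(ss_x * ss_y)

# Made-up illustrative data (not taken from the text).
x = [-3, -2, -1, 0, 1, 2, 3]
rising = [2 * a + 1 for a in x]       # exact line with positive slope
falling = [-0.5 * a + 4 for a in x]   # exact line with negative slope
parabola = [a * a for a in x]         # exact, but non-linear, relation

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_parabola = pearson_r(x, parabola)

print(r_rising, r_falling, r_parabola)
# Symmetry: swapping the roles of the two variables does not change r.
print(pearson_r(x, rising) - pearson_r(rising, x))
```

The parabola case illustrates why a correlation of 0 must not be read as independence: here $Y$ is a function of $X$, yet the coefficient vanishes.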

Example:

In $n=15$ enterprises, we observed the variables $Y$ - annual profit (in Mill. DM) and $X$ - annual rent for the computer facilities (in 1,000 DM). The observed values are listed in the following table and illustrated graphically in a scatterplot.

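| Company $k$ | Annual profit $y_{k}$ (Mill. DM) | Annual rent $x_{k}$ (1,000 DM) |
|---|---|---|
| 1 | 10 | 30 |
| 2 | 15 | 30 |
| 3 | 15 | 100 |
| 4 | 20 | 50 |
| 5 | 20 | 100 |
| 6 | 25 | 80 |
| 7 | 30 | 50 |
| 8 | 30 | 100 |
| 9 | 30 | 250 |
| 10 | 35 | 180 |
| 11 | 35 | 330 |
| 12 | 40 | 200 |
| 13 | 45 | 400 |
| 14 | 50 | 500 |
| 15 | 50 | 600 |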

From the observations, the following results can be obtained:

${\overline {y}}=30$ (Mill. DM), $\sum \limits _{k=1}^{15}(y_{k}-{\overline {y}})^{2}=2,250$

${\overline {x}}=200$ (1,000 DM), $\sum \limits _{k=1}^{15}(x_{k}-{\overline {x}})^{2}=457,000$

$\sum \limits _{k=1}^{15}(x_{k}-{\overline {x}})(y_{k}-{\overline {y}})=28,100$

$r_{xy}={\frac {28,100}{\sqrt {457,000\cdot 2,250}}}=0.8763$

The sample correlation coefficient in this example is 0.8763, which points to a strong positive linear relation.
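The intermediate results of this example can be reproduced with a short script. This is a sketch using only the 15 observations from the table above; the variable names are illustrative. It evaluates both the deviation form and the raw-sum form of the coefficient, which should agree.

```python
import math

# Observations from the table: annual profit y (Mill. DM), annual rent x (1,000 DM).
y = [10, 15, 15, 20, 20, 25, 30, 30, 30, 35, 35, 40, 45, 50, 50]
x = [30, 30, 100, 50, 100, 80, 50, 100, 250, 180, 330, 200, 400, 500, 600]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squared deviations and sum of products of deviations.
ss_x = sum((a - x_bar) ** 2 for a in x)
ss_y = sum((b - y_bar) ** 2 for b in y)
sp_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

# Deviation form of the Bravais-Pearson coefficient.
r_dev = sp_xy / math.sqrt(ss_x * ss_y)

# Raw-sum form: uses only sums of x, y, x*y, x^2 and y^2.
num = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
den = math.sqrt((n * sum(a * a for a in x) - sum(x) ** 2)
                * (n * sum(b * b for b in y) - sum(y) ** 2))
r_raw = num / den

print(x_bar, y_bar)        # 200.0 30.0
print(ss_x, ss_y, sp_xy)   # 457000.0 2250.0 28100.0
print(round(r_dev, 4))     # 0.8763
print(round(r_raw, 4))     # 0.8763
```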

In 1985, the following variables describing criminal activity were recorded for each of the 50 states of the U.S.A.:

• $X1$ - land area

• $X2$ - population

• $X3$ - murder

• $X4$ - rape

• $X5$ - robbery

• $X6$ - assault

• $X7$ - burglary

• $X8$ - larceny

• $X9$ - auto theft

• $X10$ - US states region number

• $X11$ - US states division number

Variables $X10$ and $X11$ can take on the following values:

$X10$ (region): 1 Northeast, 2 Midwest, 3 South, 4 West

$X11$ (division): 1 New England, 2 Mid Atlantic, 3 E N Central, 4 W N Central, 5 S Atlantic, 6 E S Central, 7 W S Central, 8 Mountain, 9 Pacific

For any two of these variables, a scatterplot can be drawn and the Bravais-Pearson correlation coefficient can be calculated. As an example, consider the relationship between the murder rate and the size of the population.

The sums of squared deviations (SSD) and the sum of the products of deviations are calculated as follows:

Sum of the products of deviations of "population" and "murder": $SSD(population\mid murder)=\sum (x_{k}-{\bar {x}})(y_{k}-{\bar {y}})=260,121.05$

Sum of squared deviations for "population": $SSD(population)=\sum (x_{k}-{\bar {x}})^{2}=1,259,033,421.62$

Sum of squared deviations for "murder": $SSD(murder)=\sum (y_{k}-{\bar {y}})^{2}=725.54$

The sample correlation coefficient is equal to

$r={\frac {260,121.05}{\sqrt {1,259,033,421.62\cdot 725.54}}}=0.27$

The sample correlation coefficient of 0.27 points to a weak positive linear relationship.
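The final step can be verified from the three reported sums alone. The sketch below is illustrative: `r_from_sums` and `describe` are hypothetical helper names, and `describe` applies the thresholds from the scatterplot overview (strong above 0.5 in absolute value, weak below). Treating a value of exactly 0.5 as weak is an assumption of this sketch, since the text does not classify that case.

```python
import math

def r_from_sums(sp_xy, ss_x, ss_y):
    """Correlation coefficient from the sum of products and the two SSDs."""
    return sp_xy / math.sqrt(ss_x * ss_y)

def describe(r):
    """Verbal label using the 0.5 threshold from the scatterplot overview."""
    a = abs(r)
    if a == 0:
        return "no correlation"
    if math.isclose(a, 1.0):
        strength = "perfect"
    elif a > 0.5:
        strength = "strong"
    else:
        strength = "weak"  # assumption: exactly 0.5 is labeled weak
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

# Sums reported for "population" (x) and "murder" (y).
r_crime = r_from_sums(260121.05, 1259033421.62, 725.54)

print(round(r_crime, 2))   # 0.27
print(describe(r_crime))   # weak positive
```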