# One-Dimensional Regression Analysis


## One-dimensional linear regression function

A simple linear regression function has the following form: ${\displaystyle E(y_{i}|x_{i})=b_{0}+b_{1}x_{i}\quad i=1,\ldots ,n}$ In this equation, ${\displaystyle x_{i}}$ denotes the observed (fixed) values of the explanatory variable ${\displaystyle X}$, and ${\displaystyle b_{0}}$ and ${\displaystyle b_{1}}$ are unknown regression parameters. The actual observed values ${\displaystyle y_{i}\;(i=1,\ldots ,n)}$ are obtained by adding a residual ${\displaystyle u_{i}}$ to ${\displaystyle E(y_{i}|x_{i})}$ (as shown in the figure): ${\displaystyle y_{i}=E(y_{i}|x_{i})+u_{i}=b_{0}+b_{1}x_{i}+u_{i}\quad i=1,\ldots ,n}$

### Regression parameters

The parameters of a simple linear regression function have the following meaning:

• ${\displaystyle b_{0}}$ - intercept term (constant)

It gives the intersection of the regression line with the y-axis; at that point the line takes the same value as the variable ${\displaystyle Y}$ when ${\displaystyle X=0}$.

• ${\displaystyle b_{1}}$ - linear slope coefficient (also a constant)

It characterizes the slope of the regression line: it tells us by how many units the random variable ${\displaystyle Y}$ will change if the value of the variable ${\displaystyle X}$ is increased by one unit.

### Estimation of regression parameters

To estimate the regression parameters, two important conditions have to be satisfied.

**1st condition.** The deviations of the estimated regression values ${\displaystyle {\widehat {y_{i}}}}$ from the observed values ${\displaystyle y_{i}}$ should be on average equal to zero; that is, ${\displaystyle \sum _{i=1}^{n}(y_{i}-{\widehat {y_{i}}})=\sum _{i=1}^{n}{\widehat {u_{i}}}=0}$ ${\displaystyle {\bar {\hat {u}}}={\frac {1}{n}}\sum _{i=1}^{n}{\widehat {u_{i}}}=0}$ However, this condition is satisfied by infinitely many regression lines, namely all those that go through the point of sample means ${\displaystyle ({\bar {x}},{\bar {y}})}$. Aside: notice that the above expressions imply ${\displaystyle y_{i}={\widehat {y_{i}}}+{\widehat {u_{i}}}}$, so for each observation ${\displaystyle i}$ we have decomposed the observed ${\displaystyle y_{i}}$ into two parts: (1) an estimated regression function ${\displaystyle {\widehat {y_{i}}}={\widehat {E(y_{i}|x_{i})}}}$ (i.e. an estimate of the conditional mean); and (2) an estimated residual ${\displaystyle {\widehat {u_{i}}}}$ (an estimate of the disturbance).

**2nd condition.** We search for the regression line for which the spread (variance) of the estimated residuals, ${\displaystyle {s^{2}}_{\hat {u}}={\frac {1}{n-2}}\sum _{i=1}^{n}{({\widehat {u_{i}}}-{\bar {\hat {u}}})}^{2},}$ is minimal among all possible regression lines. The first condition ${\displaystyle {\bar {\hat {u}}}=0}$ implies ${\displaystyle {s^{2}}_{\hat {u}}={\frac {1}{n-2}}\sum _{i=1}^{n}{({\widehat {u_{i}}}-0)}^{2}={\frac {1}{n-2}}\sum _{i=1}^{n}{\widehat {u_{i}}}^{2}={\frac {1}{n-2}}\sum _{i=1}^{n}{(y_{i}-{\widehat {y_{i}}})}^{2}.}$ The second condition is depicted in the following figure:

The squares drawn in the figure correspond to the squared residuals, and the total area of the squares should be minimized. Hence, the method used for this minimization is called the least squares method (LS). The least squares method minimizes the sum of squared deviations of the regression values from the observed values (the residual sum of squares, RSS): ${\displaystyle \sum _{i=1}^{n}{(y_{i}-{\widehat {y_{i}}})}^{2}\rightarrow \min ,\quad {\text{where}}\quad E(y_{i}|x_{i})=b_{0}+b_{1}x_{i}.}$ The minimized function has two unknown variables (${\displaystyle b_{0}}$ and ${\displaystyle b_{1}}$): ${\displaystyle S(b_{0},b_{1})=\sum _{i=1}^{n}{(y_{i}-b_{0}-b_{1}x_{i})}^{2}\rightarrow \min .}$ To find a minimum, the first partial derivatives have to be set equal to zero: ${\displaystyle {\frac {\partial S(b_{0},b_{1})}{\partial b_{0}}}=-2\sum _{i=1}^{n}(y_{i}-b_{0}-b_{1}x_{i})\doteq 0}$ ${\displaystyle {\frac {\partial S(b_{0},b_{1})}{\partial b_{1}}}=-2\sum _{i=1}^{n}(y_{i}-b_{0}-b_{1}x_{i})x_{i}\doteq 0}$ To verify that the solution is really a minimum, the second partial derivatives have to be evaluated: ${\displaystyle {\frac {{\partial }^{2}S(b_{0},b_{1})}{\partial {b_{0}}^{2}}}=2n>0}$ ${\displaystyle {\frac {{\partial }^{2}S(b_{0},b_{1})}{\partial {b_{1}}^{2}}}=2\sum _{i=1}^{n}{x_{i}}^{2}>0}$ Because both second derivatives are positive (and the determinant of the Hessian, ${\displaystyle 4\left(n\sum _{i=1}^{n}{x_{i}}^{2}-(\sum _{i=1}^{n}x_{i})^{2}\right)}$, is positive whenever the ${\displaystyle x_{i}}$ are not all equal), the extremum found is a minimum. Setting the first derivatives equal to zero leads to the so-called (least squares) normal equations, from which the estimated regression parameters ${\displaystyle {\widehat {b_{0}}}}$ and ${\displaystyle {\widehat {b_{1}}}}$ can be computed:
${\displaystyle n{\widehat {b_{0}}}+{\widehat {b_{1}}}\sum _{i=1}^{n}x_{i}=\sum _{i=1}^{n}y_{i}}$ ${\displaystyle {\widehat {b_{0}}}\sum _{i=1}^{n}x_{i}+{\widehat {b_{1}}}\sum _{i=1}^{n}{x_{i}}^{2}=\sum _{i=1}^{n}x_{i}y_{i}}$ The normal equations can be solved by means of linear algebra (Cramer’s rule): ${\displaystyle {\widehat {b_{0}}}={\frac {\left|{\begin{array}{ll}\sum y_{i}&\sum x_{i}\\\sum x_{i}y_{i}&\sum {x_{i}}^{2}\end{array}}\right|}{\left|{\begin{array}{ll}n&\sum x_{i}\\\sum x_{i}&\sum {x_{i}}^{2}\end{array}}\right|}}={\frac {\sum y_{i}\sum {x_{i}}^{2}-\sum x_{i}\sum x_{i}y_{i}}{n\sum {x_{i}}^{2}-\sum x_{i}\sum x_{i}}}}$ ${\displaystyle {\widehat {b_{1}}}={\frac {\left|{\begin{array}{ll}n&\sum y_{i}\\\sum x_{i}&\sum x_{i}y_{i}\end{array}}\right|}{\left|{\begin{array}{ll}n&\sum x_{i}\\\sum x_{i}&\sum {x_{i}}^{2}\end{array}}\right|}}={\frac {n\sum x_{i}y_{i}-\sum x_{i}\sum y_{i}}{n\sum {x_{i}}^{2}-\sum x_{i}\sum x_{i}}}}$ Dividing the original equations by ${\displaystyle n}$, we get simplified formulas suitable for the computation of the regression parameters: ${\displaystyle {\begin{aligned}{\widehat {b_{0}}}+{\widehat {b_{1}}}{\bar {x}}&=&{\bar {y}}\\{\widehat {b_{0}}}{\bar {x}}+{\widehat {b_{1}}}{\bar {x^{2}}}&=&{\overline {xy}}\end{aligned}}}$ For the estimated intercept ${\displaystyle {\widehat {b_{0}}}}$, we get: ${\displaystyle {\widehat {b_{0}}}={\bar {y}}-{\widehat {b_{1}}}{\bar {x}}}$ Substituting this into the second equation yields the estimated linear slope coefficient ${\displaystyle {\widehat {b_{1}}}}$: ${\displaystyle {\begin{aligned}({\bar {y}}-{\widehat {b_{1}}}{\bar {x}}){\bar {x}}+{\widehat {b_{1}}}{\bar {x^{2}}}&=&{\overline {xy}}\\{\widehat {b_{1}}}({\bar {x^{2}}}-{\bar {x}}^{2})&=&{\overline {xy}}-{\bar {x}}{\bar {y}}\\{{\widehat {b_{1}}}S_{X}}^{2}&=&S_{XY}\\{\widehat {b_{1}}}&=&{\frac {S_{XY}}{{S_{X}}^{2}}}\end{aligned}}}$ where ${\displaystyle S_{XY}={\overline {xy}}-{\bar {x}}{\bar {y}}}$ is the sample covariance and ${\displaystyle {S_{X}}^{2}={\bar {x^{2}}}-{\bar {x}}^{2}}$ is the sample variance of ${\displaystyle X}$. Properties:

• The sample variance of ${\displaystyle X}$ must be greater than zero: ${\displaystyle {S_{X}}^{2}>0}$

• The regression line passes through the point of sample means ${\displaystyle ({\bar {x}},{\bar {y}})}$: from the simplified normal equations you can see that if ${\displaystyle x_{i}={\bar {x}}}$, then ${\displaystyle {\widehat {y_{i}}}={\widehat {b_{0}}}+{\widehat {b_{1}}}x_{i}={\bar {y}}+{\widehat {b_{1}}}(x_{i}-{\bar {x}})={\bar {y}}}$

• Combining results from correlation and regression analysis, it is possible to obtain the estimated linear slope coefficient ${\displaystyle {\widehat {b_{1}}}}$ as follows:

${\displaystyle {\widehat {b_{1}}}={\frac {S_{xy}}{{S_{x}}^{2}}},\quad r_{xy}={\frac {S_{xy}}{S_{x}S_{y}}}}$ ${\displaystyle \Rightarrow {\widehat {b_{1}}}=r_{xy}{\frac {S_{y}}{S_{x}}}}$

The regression ${\displaystyle (y|x)}$ of ${\displaystyle y}$ on ${\displaystyle x}$ does not correspond to the regression ${\displaystyle (x|y)}$ of ${\displaystyle x}$ on ${\displaystyle y}$.

| Regression of ${\displaystyle y}$ on ${\displaystyle x}$ | Regression of ${\displaystyle x}$ on ${\displaystyle y}$ |
|---|---|
| ${\displaystyle {\widehat {b_{0}}}={\bar {y}}-{\widehat {b_{1}}}{\bar {x}}}$ | ${\displaystyle {\widehat {b_{0}}}^{\ast }={\bar {x}}-{\widehat {b_{1}}}^{\ast }{\bar {y}}}$ |
| ${\displaystyle {\widehat {b_{1}}}={\frac {S_{XY}}{{S_{X}}^{2}}}}$ | ${\displaystyle {\widehat {b_{1}}}^{\ast }={\frac {S_{XY}}{{S_{Y}}^{2}}}}$ |
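
These relations can be checked numerically. The sketch below (plain Python; it uses the production/working-time data from the example that follows) computes both slopes from the shared sample covariance and verifies that their product equals the squared correlation coefficient, and that ${\displaystyle {\widehat {b_{1}}}=r_{xy}S_{y}/S_{x}}$:

```python
# Sketch: the regression of y on x and the regression of x on y share the
# sample covariance S_xy but divide by different variances, so the two
# slopes differ, while their product equals the squared correlation.
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]        # production output
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]  # working time
n = len(x)

mx, my = sum(x) / n, sum(y) / n
s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
s_x2 = sum((a - mx) ** 2 for a in x) / n
s_y2 = sum((b - my) ** 2 for b in y) / n

b1 = s_xy / s_x2        # slope of the regression of y on x
b1_star = s_xy / s_y2   # slope of the regression of x on y
r = s_xy / (s_x2 ** 0.5 * s_y2 ** 0.5)

assert abs(b1 * b1_star - r ** 2) < 1e-12         # product of slopes = r^2
assert abs(b1 - r * (s_y2 / s_x2) ** 0.5) < 1e-9  # b1 = r * S_y / S_x
```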

Example: X – production output, Y – working time, n = 10 production cycles in a firm.

| ${\displaystyle i}$ | ${\displaystyle x_{i}}$ | ${\displaystyle y_{i}}$ | ${\displaystyle x_{i}y_{i}}$ | ${\displaystyle {x_{i}}^{2}}$ | ${\displaystyle {y_{i}}^{2}}$ | ${\displaystyle {\widehat {y_{i}}}}$ | ${\displaystyle {\hat {u_{i}}}}$ |
|---|---|---|---|---|---|---|---|
| 1 | 30 | 73 | 2,190 | 900 | 5,329 | 70 | 3 |
| 2 | 20 | 50 | 1,000 | 400 | 2,500 | 50 | 0 |
| 3 | 60 | 128 | 7,680 | 3,600 | 16,384 | 130 | -2 |
| 4 | 80 | 170 | 13,600 | 6,400 | 28,900 | 170 | 0 |
| 5 | 40 | 87 | 3,480 | 1,600 | 7,569 | 90 | -3 |
| 6 | 50 | 108 | 5,400 | 2,500 | 11,664 | 110 | -2 |
| 7 | 60 | 135 | 8,100 | 3,600 | 18,225 | 130 | 5 |
| 8 | 30 | 69 | 2,070 | 900 | 4,761 | 70 | -1 |
| 9 | 70 | 148 | 10,360 | 4,900 | 21,904 | 150 | -2 |
| 10 | 60 | 132 | 7,920 | 3,600 | 17,424 | 130 | 2 |
| ${\displaystyle \sum }$ | 500 | 1,100 | 61,800 | 28,400 | 134,660 | 1,100 | 0 |

Computation of auxiliary variables (sample mean, sample variance and sample standard deviation):

${\displaystyle {\bar {x}}}$ = ${\displaystyle 50}$ ${\displaystyle s_{x}^{2}=3,400/10=340}$ ${\displaystyle s_{x}}$ = ${\displaystyle 18.44}$ ${\displaystyle {\bar {y}}}$ = ${\displaystyle 110}$ ${\displaystyle s_{y}^{2}=13,660/10=1,366}$ ${\displaystyle s_{y}}$ = ${\displaystyle 36.96}$

The sample covariance and the sample correlation coefficient equal: ${\displaystyle s_{xy}=6,800/10=680\quad r_{xy}=680/(18.44\cdot 36.96)=0.9977}$ From these values, we can compute the estimated regression coefficients ${\displaystyle {\widehat {b_{0}}}}$ and ${\displaystyle {\widehat {b_{1}}}}$: ${\displaystyle {\widehat {b_{1}}}=680/340=2}$ ${\displaystyle {\widehat {b_{0}}}=110-2\cdot (50)=10}$ As a result, we obtain the following estimated regression line: ${\displaystyle {\widehat {y_{i}}}=10+2x_{i}}$
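
As a check, the estimates can also be reproduced directly from the column sums of the table, using the Cramer’s-rule formulas derived above (a minimal Python sketch; the variable names are ours):

```python
# Reproducing the estimated coefficients from the sums in the table above.
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]        # production output
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]  # working time
n = len(x)

sum_x, sum_y = sum(x), sum(y)              # 500 and 1,100
sum_xy = sum(a * b for a, b in zip(x, y))  # 61,800
sum_x2 = sum(a * a for a in x)             # 28,400

# Cramer's-rule solutions of the normal equations:
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x * sum_x)       # = 2
b0 = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x * sum_x)  # = 10
```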

## Quality (fit) of the regression line

Once the regression line is estimated, it is useful to know how well it approximates the observed data, that is, how good the representation of the data by means of the regression line is. A measure describing this quality of representation is the coefficient of determination (R-squared, ${\displaystyle R^{2}}$). Its computation is based on a decomposition of the variance of the dependent variable ${\displaystyle Y}$. The smaller the sum of squared estimated residuals, the better the fit of the regression line. Since least squares minimizes the variance of the estimated residuals, it also maximizes the R-squared by construction. ${\displaystyle \sum {(y_{i}-{\widehat {y_{i}}})}^{2}=\sum {{\hat {u}}_{i}}^{2}\rightarrow \min .}$ The sample variance of ${\displaystyle Y}$ is: ${\displaystyle {s_{y}}^{2}={\frac {\sum _{i=1}^{n}{(y_{i}-{\bar {y}})}^{2}}{n}}}$ The deviation of the observed values ${\displaystyle y_{i}}$ from the arithmetic mean ${\displaystyle {\bar {y}}}$ can be decomposed into two parts: the deviation of the observed values ${\displaystyle y_{i}}$ from the estimated regression values, and the deviation of the estimated regression values from the sample mean: ${\displaystyle y_{i}-{\bar {y}}=[(y_{i}-{{\widehat {y_{i}}})}+({\widehat {y_{i}}}-{\bar {y}})],\quad i=1,\cdots ,n}$ This decomposition is depicted in the following figure:

Analogously, the sum of the squared deviations can be decomposed: ${\displaystyle \sum _{i=1}^{n}{(y_{i}-{\bar {y}})}^{2}=\sum _{i=1}^{n}[{(y_{i}-{\widehat {y_{i}}})}+({\widehat {y_{i}}}-{\bar {y}})]^{2}}$ ${\displaystyle \sum _{i=1}^{n}{(y_{i}-{\bar {y}})}^{2}=\sum _{i=1}^{n}{(y_{i}-{\widehat {y_{i}}})}^{2}+\sum _{i=1}^{n}{({\widehat {y_{i}}}-{\bar {y}})}^{2}}$ The second equation follows from the first by noting that ${\displaystyle \sum _{i=1}^{n}{(y_{i}-{\widehat {y_{i}}})}({\widehat {y_{i}}}-{\bar {y}})=0}$; the reader is urged to prove this using the second least-squares first-order condition above along with the definition of ${\displaystyle {\widehat {y_{i}}}}$. Dividing both sides of the second equation by ${\displaystyle n}$, it follows: ${\displaystyle {\frac {\sum _{i}^{n}{(y_{i}-{\bar {y}})}^{2}}{n}}={\frac {\sum _{i=1}^{n}{(y_{i}-{\widehat {y_{i}}})}^{2}}{n}}+{\frac {\sum _{i=1}^{n}{({\widehat {y_{i}}}-{\bar {y}})}^{2}}{n}}}$ ${\displaystyle {\frac {\sum _{i}^{n}{(y_{i}-{\bar {y}})}^{2}}{n}}={\frac {\sum _{i=1}^{n}{\hat {u_{i}}}^{2}}{n}}+{\frac {\sum _{i=1}^{n}{({\widehat {y_{i}}}-{\bar {y}})}^{2}}{n}}}$ ${\displaystyle {S_{y}}^{2}={S_{\hat {u}}}^{2}+{S_{\hat {y}}}^{2}}$ The total sample variance of ${\displaystyle Y}$ can thus be decomposed into the sum of the sample variance of the estimated residuals (the unexplained part of the variance of ${\displaystyle Y}$) and the part of the variance of ${\displaystyle Y}$ that is explained by the regression function (the sample variance of the regression function). It holds:
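
For the production/working-time example above, this decomposition can be verified exactly, since the fitted line is ${\displaystyle {\widehat {y_{i}}}=10+2x_{i}}$ (a small Python check):

```python
# Variance decomposition SS_total = SS_residual + SS_regression,
# checked on the production/working-time example (fitted line: yhat = 10 + 2x).
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]

y_bar = sum(y) / len(y)                 # 110
y_hat = [10 + 2 * xi for xi in x]       # fitted values

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_regr = sum((yh - y_bar) ** 2 for yh in y_hat)

assert ss_total == ss_resid + ss_regr == 13660  # 60 + 13,600
```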

• The larger the portion of the sample variance of ${\displaystyle y}$ explained by the model (i.e. ${\displaystyle {S_{\hat {y}}}^{2}}$), the better the fit of the regression function.
• Conversely, the larger the residual variance ${\displaystyle {S_{\hat {u}}}^{2}}$ as a share of the sample variance of ${\displaystyle y}$ (i.e., the larger the outside influences left unexplained by the regression function), the worse the fit.

### The coefficient of determination

The coefficient of determination is defined as the ratio of the (sample) variance of ${\displaystyle Y}$ explained by the regression function to the total (sample) variance of ${\displaystyle Y}$. That is, it represents the proportion of the sample variance of ${\displaystyle y}$ "explained" by the estimated regression function. ${\displaystyle R_{yx}^{2}={\frac {\sum _{i=1}^{n}{({\widehat {y_{i}}}-{\bar {y}})}^{2}}{\sum _{i=1}^{n}{(y_{i}-{\bar {y}})}^{2}}}={\frac {{S_{\hat {y}}}^{2}}{{S_{y}}^{2}}}}$ An alternative way of computing the coefficient of determination is: ${\displaystyle R_{yx}^{2}={\frac {{[\sum _{i=1}^{n}(y_{i}-{\bar {y}})(x_{i}-{\bar {x}})]}^{2}}{\sum _{i=1}^{n}{(y_{i}-{\bar {y}})}^{2}\sum _{i=1}^{n}{(x_{i}-{\bar {x}})}^{2}}}={\frac {{S_{xy}}^{2}}{{S_{y}}^{2}{S_{x}}^{2}}}}$ ${\displaystyle R_{xy}^{2}={\frac {{(n\sum _{i=1}^{n}x_{i}y_{i}-\sum _{i=1}^{n}x_{i}\sum _{i=1}^{n}y_{i})}^{2}}{[n\sum _{i=1}^{n}{x_{i}}^{2}-{(\sum _{i=1}^{n}x_{i})}^{2}][n\sum _{i=1}^{n}{y_{i}}^{2}-{(\sum _{i=1}^{n}y_{i})}^{2}]}}}$ Characteristics:

• The coefficient of determination has the following domain: ${\displaystyle 0\leq R_{yx}^{2}\leq 1}$

The higher the coefficient of determination, the better the regression function explains the observed values. If all observed values lie on the regression line, the coefficient of determination equals ${\displaystyle 1}$: the total variance of ${\displaystyle Y}$ is explained by the variable ${\displaystyle X}$, i.e. ${\displaystyle Y}$ depends completely (and linearly) on ${\displaystyle X}$. If the coefficient of determination is zero, the total variance of ${\displaystyle Y}$ is identical with the unexplained (residual) variance; the variable ${\displaystyle X}$ then has no (linear) influence on ${\displaystyle Y}$.

• ${\displaystyle R_{xy}^{2}=R_{yx}^{2}}$ Symmetry (the fit of the regression of ${\displaystyle y}$ on ${\displaystyle x}$ is identical to the fit of the regression of ${\displaystyle x}$ on ${\displaystyle y}$)

• For a linear regression function, the coefficient of determination corresponds to the square of the correlation coefficient: ${\displaystyle R_{yx}^{2}=r_{yx}^{2}}$.

Example: For the above described dependence between the working time and the production output, the sample correlation coefficient and the coefficient of determination are: ${\displaystyle {r_{yx}}=0.9977}$ ${\displaystyle {R_{yx}}^{2}=0.9954}$

## One-dimensional nonlinear regression function

Example: n = 8 comparable towns. X – the number of public-transportation maps distributed for free among citizens of the town at the beginning of the analyzed time period. Y – the increase in the number of citizens using public transport during the analyzed time period.

| Town ${\displaystyle i}$ | Increase ${\displaystyle Y}$ (in 1,000s) | Public-transportation maps ${\displaystyle X}$ (in 1,000s) |
|---|---|---|
| 1 | 0.60 | 80 |
| 2 | 6.70 | 220 |
| 3 | 5.30 | 140 |
| 4 | 4.00 | 120 |
| 5 | 6.55 | 180 |
| 6 | 2.15 | 100 |
| 7 | 6.60 | 200 |
| 8 | 5.75 | 160 |

Linear regression ${\displaystyle {\widehat {y_{i}}}={\widehat {b_{0}}}+{\widehat {b_{1}}}x_{i}=-1.82+0.0435x_{i}}$ ${\displaystyle {R_{yx}}^{2}=0.875}$

As we see from the figures, the estimated residuals are not randomly dispersed around zero; instead they show a rather clear nonlinear pattern. Hence, it can be beneficial to use a nonlinear regression model instead of the linear one. Quadratic regression (second-order polynomial): ${\displaystyle {\widehat {y_{i}}}={\widehat {b_{0}}}+{\widehat {b_{1}}}x_{i}+{{\widehat {b_{2}}}x_{i}}^{2}=-10.03+0.1642x_{i}-0.0004{x_{i}}^{2}}$ ${\displaystyle {R_{yx}}^{2}=0.995}$
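
A quick way to compare the two fits is to estimate both polynomials by least squares, e.g. with NumPy (a sketch; `numpy.polyfit` fits a polynomial of the given degree by least squares):

```python
import numpy as np

# Town data from the table above.
maps = np.array([80, 220, 140, 120, 180, 100, 200, 160], dtype=float)
increase = np.array([0.60, 6.70, 5.30, 4.00, 6.55, 2.15, 6.60, 5.75])

def r_squared(y, y_hat):
    """Share of the variance of y explained by the fitted values."""
    ss_resid = np.sum((y - y_hat) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_resid / ss_total

linear = np.polyfit(maps, increase, deg=1)     # coefficients [b1, b0]
quadratic = np.polyfit(maps, increase, deg=2)  # coefficients [b2, b1, b0]

r2_linear = r_squared(increase, np.polyval(linear, maps))
r2_quadratic = r_squared(increase, np.polyval(quadratic, maps))
# r2_linear ≈ 0.875; the quadratic model fits clearly better.
```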

Using this interactive example, you can estimate a one-dimensional regression function for any two variables from two available data sets. The program generates a scatterplot and adds an estimated regression line to the plot. Afterwards, the estimated regression function, the sample correlation coefficient, and the coefficient of determination are computed.

## US - crime data

In 1985, information about various crimes in each of the 50 states of the USA was collected, including data on:

• ${\displaystyle X1}$ - land area
• ${\displaystyle X2}$ - population
• ${\displaystyle X3}$ - murder
• ${\displaystyle X4}$ - rape
• ${\displaystyle X5}$ - robbery
• ${\displaystyle X6}$ - assault
• ${\displaystyle X7}$ - burglary
• ${\displaystyle X8}$ - larceny
• ${\displaystyle X9}$ - auto-theft
• ${\displaystyle X10}$ - US states region number
• ${\displaystyle X11}$ - US states division number

Variables ${\displaystyle X10}$ and ${\displaystyle X11}$ have the following meaning:

| ${\displaystyle X10}$ (region) | ${\displaystyle X11}$ (division) |
|---|---|
| 1 Northeast | 1 New England |
| 2 Midwest | 2 Mid Atlantic |
| 3 South | 3 E N Central |
| 4 West | 4 W N Central |
|  | 5 S Atlantic |
|  | 6 E S Central |
|  | 7 W S Central |
|  | 8 Mountain |
|  | 9 Pacific |

## Car data

The following measures were collected for 74 different types of cars:

• ${\displaystyle X1}$ - price
• ${\displaystyle X2}$ - mpg (miles per gallon)
• ${\displaystyle X3}$ - headroom (in inches)
• ${\displaystyle X4}$ - rear seat clearance (distance from front seat back to the rear seat, in inches)
• ${\displaystyle X5}$ - trunk space (in cubic feet)
• ${\displaystyle X6}$ - weight (in pounds)
• ${\displaystyle X7}$ - length (in inches)
• ${\displaystyle X8}$ - turning diameter (clearance required to make a U-turn, in feet)
• ${\displaystyle X9}$ - displacement (in cubic inches)


The dependence of robbery (X5) on the population (X2) of a state can be depicted in a scatterplot. Every state is represented in the diagram by a single point (${\displaystyle X2,X5}$). Moreover, an estimated regression line is added in the picture (it is drawn in black).

The regression analysis provides the following results:

• The estimated regression intercept is ${\displaystyle 48.1134}$. In this case, it does not make sense to interpret this number; ${\displaystyle {\widehat {b_{0}}}}$ is a kind of correction parameter.

• The increase in the population of a state by one unit (that is, by 1,000 citizens) leads to the increase in the number of robberies by ${\displaystyle {\widehat {b_{1}}}=0.0112}$.

• The sample correlation coefficient is ${\displaystyle 0.62}$—this implies a (positive) dependence of the population and the number of robberies.

• To assess the fit of the estimated regression function, the coefficient of determination can be used. Its calculation is based on the decomposition of the sample variance of the dependent variable. For the calculation, we can use the total sample variance (SS-Total), the unexplained (residual) variance (SS-Residual), and the explained variance (SS-Regression). Using the formula

${\displaystyle R^{2}={\frac {\mathrm {SS{\text{-}}Regression} }{\mathrm {SS{\text{-}}Total} }}={\frac {\sum {({\widehat {y_{i}}}-{\bar {y}})}^{2}}{\sum {(y_{i}-{\bar {y}})}^{2}}}=1-{\frac {\mathrm {SS{\text{-}}Residual} }{\mathrm {SS{\text{-}}Total} }},}$

we get that the coefficient of determination equals ${\displaystyle 0.39}$. The regression line does not characterize the observed values very well; the explanatory power of the model is weak.
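
Given the estimated line, a prediction is just a plug-in evaluation. The sketch below uses the rounded coefficients reported above, so it reproduces the full-precision prediction of 231.66 robberies for a state with a population of 16,370 thousand only approximately:

```python
# Prediction from the fitted line y_hat = b0 + b1 * x, using the rounded
# coefficients reported above.
b0 = 48.1134  # estimated intercept
b1 = 0.0112   # estimated slope (robberies per 1,000 citizens)

population = 16370  # population in thousands
predicted_robberies = b0 + b1 * population
# With the rounded coefficients the result is about 231.5, close to the
# full-precision value 231.66 reported in the text.
```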

The observation ${\displaystyle x(37)}$ corresponds to a population of ${\displaystyle 16,370}$ thousand and ${\displaystyle 134.1}$ robberies. The estimated regression function for such a state predicts the number of robberies to be ${\displaystyle 231.66}$. Notice: the interactive example allows you to display (graphically) the pairwise dependence of other variables as well.

The dependence of turning diameter (X8) on the length (X7) of a car can be depicted in a scatterplot. Every car is represented in the diagram by a single point (${\displaystyle X7,X8}$). Moreover, an estimated regression line is added in the picture (it is drawn in black).

The regression analysis provides the following results:

• The estimated regression intercept is ${\displaystyle 7.1739}$. In this case, it may not make sense to interpret this number; ${\displaystyle {\widehat {b_{0}}}}$ is a kind of correction parameter.

• The increase in the length of a car by one unit (that is, by one inch in this case) leads to the increase in the turning diameter by ${\displaystyle {\widehat {b_{1}}}=0.1735}$ feet.

• The sample correlation coefficient is ${\displaystyle 0.90}$—this implies a strong (positive) dependence of the turning diameter and the length.

• To assess the fit of the estimated regression function, the coefficient of determination can be used. Its calculation is based on the decomposition of the variance of the dependent variable. For the calculation, we use the total sample variance (SS-Total), the unexplained (residual) variance (SS-Residual), and the explained variance (SS-Regression). Using the formula

${\displaystyle R^{2}={\frac {\mathrm {SS{\text{-}}Regression} }{\mathrm {SS{\text{-}}Total} }}={\frac {\sum {({\widehat {y_{i}}}-{\bar {y}})}^{2}}{\sum {(y_{i}-{\bar {y}})}^{2}}},}$

we get that the coefficient of determination equals ${\displaystyle 0.81}$. The regression line characterizes (explains) the observed values quite well.

The observation ${\displaystyle x(53)}$ corresponds to a car length of 192 inches and a turning diameter of 38 feet. The estimated regression function for a car of this length predicts the turning diameter to be ${\displaystyle 40.49}$ feet. Notice: the interactive example allows you to display (graphically) the pairwise dependence of other variables as well.

Now, we examine the monthly net income and the monthly living expenditures of 10 two-person households.

| Household | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Net income in DM ${\displaystyle x_{i}}$ | 3,500 | 5,000 | 4,300 | 6,100 | 1,000 | 4,800 | 2,900 | 2,400 | 5,600 | 4,100 |
| Expenditures in DM ${\displaystyle y_{i}}$ | 2,000 | 3,500 | 3,100 | 3,900 | 900 | 3,000 | 2,100 | 1,900 | 2,900 | 2,100 |

These observations are drawn in the following scatterplot. You can see that the net income of a household has a positive influence on the household’s expenditures and that this dependence can be approximated by a linear regression function.

We want to estimate a linear regression function describing expenditures of a household as a function of the household’s net income. To estimate the linear regression model, some auxiliary calculations are needed.

| ${\displaystyle HH}$ | ${\displaystyle x_{i}}$ | ${\displaystyle y_{i}}$ | ${\displaystyle x_{i}\cdot y_{i}}$ | ${\displaystyle {x_{i}}^{2}}$ | ${\displaystyle {y_{i}}^{2}}$ |
|---|---|---|---|---|---|
| 1 | 3,500 | 2,000 | 7,000,000 | 12,250,000 | 4,000,000 |
| 2 | 5,000 | 3,500 | 17,500,000 | 25,000,000 | 12,250,000 |
| 3 | 4,300 | 3,100 | 13,330,000 | 18,490,000 | 9,610,000 |
| 4 | 6,100 | 3,900 | 23,790,000 | 37,210,000 | 15,210,000 |
| 5 | 1,000 | 900 | 900,000 | 1,000,000 | 810,000 |
| 6 | 4,800 | 3,000 | 14,400,000 | 23,040,000 | 9,000,000 |
| 7 | 2,900 | 2,100 | 6,090,000 | 8,410,000 | 4,410,000 |
| 8 | 2,400 | 1,900 | 4,560,000 | 5,760,000 | 3,610,000 |
| 9 | 5,600 | 2,900 | 16,240,000 | 31,360,000 | 8,410,000 |
| 10 | 4,100 | 2,100 | 8,610,000 | 16,810,000 | 4,410,000 |
| Sum | 39,700 | 25,400 | 112,420,000 | 179,330,000 | 71,720,000 |

Using the derived formulas, the estimated regression parameters ${\displaystyle {\widehat {b_{0}}}}$ and ${\displaystyle {\widehat {b_{1}}}}$ are computed as follows: ${\displaystyle {\begin{aligned}{\widehat {b_{0}}}&=&{\frac {\sum y_{i}\sum {x_{i}}^{2}-\sum x_{i}\sum x_{i}y_{i}}{n\sum {x_{i}}^{2}-\sum x_{i}\sum x_{i}}}\\&=&{\frac {(25,400\cdot 179,330,000)-(39,700\cdot 112,420,000)}{(10\cdot 179,330,000)-(39,700\cdot 39,700)}}\\&=&423.13\\{\widehat {b_{1}}}&=&{\frac {n\sum x_{i}y_{i}-\sum x_{i}\sum y_{i}}{n\sum {x_{i}}^{2}-\sum x_{i}\sum x_{i}}}\\&=&{\frac {(10\cdot 112,420,000)-(39,700\cdot 25,400)}{(10\cdot 179,330,000)-(39,700\cdot 39,700)}}\\&=&0.5332\end{aligned}}}$ Thus, the estimated regression function is ${\displaystyle {\widehat {y_{i}}}=423.13+0.5332\cdot x_{i}}$  Expenditures = 423.13 + 0.5332 ${\displaystyle \cdot }$ Net income The estimated regression line can be drawn in the scatterplot:

The slope of the line corresponds to the marginal propensity to consume: an increase in net income by one Mark (1 DM) translates on average into a 0.53 DM increase in expenditures for the observed households. Once the sample standard deviations of ${\displaystyle x}$ and ${\displaystyle y}$ and their sample covariance are computed (here with divisor ${\displaystyle n-1}$), we can readily obtain the sample correlation coefficient: ${\displaystyle r_{xy}={\frac {S_{xy}}{S_{x}S_{y}}}={\frac {1,286,900}{1,553.5\cdot 894.68}}=0.926}$ This indicates a strong (positive) dependence between households’ net incomes and living expenditures. The quality of the fit of the regression function can be evaluated via the coefficient of determination, the ratio of the variance explained by the regression function to the total sample variance of expenditures ${\displaystyle Y}$: ${\displaystyle R^{2}={\frac {\sum {({\widehat {y_{i}}}-{\bar {y}})}^{2}}{\sum {(y_{i}-{\bar {y}})}^{2}}}={\frac {6,175,715.85}{7,204,000.00}}=0.857}$ The coefficient of determination shows that about 86% of the variation in households’ expenditures can be explained by a linear dependence on the households’ net incomes.
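
The whole household example can be reproduced in a few lines (a plain-Python sketch; the variable names are ours):

```python
# Reproducing the household example: slope, intercept, correlation and R^2.
income = [3500, 5000, 4300, 6100, 1000, 4800, 2900, 2400, 5600, 4100]  # x, DM
expend = [2000, 3500, 3100, 3900,  900, 3000, 2100, 1900, 2900, 2100]  # y, DM
n = len(income)

sum_x, sum_y = sum(income), sum(expend)
sum_xy = sum(a * b for a, b in zip(income, expend))
sum_x2 = sum(a * a for a in income)
sum_y2 = sum(b * b for b in expend)

b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # ≈ 0.5332
b0 = (sum_y - b1 * sum_x) / n                                  # ≈ 423.13

# Correlation coefficient; the variance divisor cancels in the ratio,
# so using n or n-1 gives the same r.
num = n * sum_xy - sum_x * sum_y
den = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
r = num / den                                                  # ≈ 0.926
r2 = r * r                                                     # ≈ 0.857
```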