One-Dimensional Regression Analysis
From MM*Stat International
One-dimensional linear regression function
A simple linear regression function has the following form:
$$\hat{y}_i = b_0 + b_1 x_i$$
In this equation, $x_i$ represents the observed values of the explanatory variable $X$ (treated as fixed), and $b_0$ and $b_1$ are unknown regression parameters. The actual observed values $y_i$ can be obtained by summing the residual $e_i$ and the regression value $\hat{y}_i$ (as you can see in the picture):
$$y_i = \hat{y}_i + e_i = b_0 + b_1 x_i + e_i$$
Regression parameters

The parameters of a simple linear regression function have the following meaning:

$b_0$ - intercept term (constant)

It describes the intersection of the corresponding regression line with the y-axis; at the point $x = 0$, the regression value $\hat{y}$ equals $b_0$.

$b_1$ - linear slope coefficient (also a constant)

It characterizes the slope of the corresponding regression line. It tells us by how many units the variable $Y$ will change, on average, if the value of the variable $X$ is increased by one unit.
Estimation of regression parameters

To estimate the regression parameters, two important conditions have to be satisfied.

1st condition

The deviations of the estimated regression values from the observed values should on average equal zero; that is
$$\sum_{i=1}^{n} (y_i - \hat{y}_i) = \sum_{i=1}^{n} e_i = 0$$
However, this condition is satisfied by infinitely many regression lines, namely all those that go through the point of sample means $(\bar{x}, \bar{y})$.

Aside: Notice that the above expressions imply that $y_i = \hat{y}_i + e_i$; therefore, for each observation we have decomposed the observed $y_i$ into two parts: (1) an estimated regression function $\hat{y}_i = b_0 + b_1 x_i$ (i.e. an estimate of the conditional mean), and (2) an estimated residual (disturbance) $e_i$.
2nd condition

We search for a regression line such that the spread (variance $s_e^2$) of the corresponding estimated residuals (called disturbances) is minimal in comparison with all other possible regression lines. The first condition implies $\bar{e} = 0$, so that
$$s_e^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2$$
The second condition is depicted in the following figure:
The squares drawn in the figure correspond to the squared residuals, and the total area of the squares should be minimized. Hence, the method used for this minimization is called the least squares method (LS). The least squares method minimizes the sum of squared deviations of the regression values from the observed values (residual sum of squares, RSS):
$$RSS(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \to \min$$
The minimized function has two unknown variables ($b_0$ and $b_1$). To find a minimum, the first partial derivatives have to be set equal to zero. To verify whether the solution really is a minimum, the second partial derivatives have to be evaluated. Because both second derivatives are positive, the extremum found is a minimum. The first derivatives (set equal to zero) lead to the so-called (least squares) normal equations:
$$\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2$$
from which the estimated regression parameters ($b_0$ and $b_1$) can be computed. The normal equations can be solved by means of linear algebra (Cramer's rule). Dividing the first normal equation by $n$, we get a simplified formula suitable for the computation of the regression parameters:
$$\bar{y} = b_0 + b_1 \bar{x}$$
For the estimated intercept $b_0$, we get:
$$b_0 = \bar{y} - b_1 \bar{x}$$
For the estimated linear slope coefficient $b_1$, we get:
$$b_1 = \frac{s_{xy}}{s_x^2} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\bar{y}}{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2}$$
Properties:
The sample variance of $X$ must be greater than zero: $s_x^2 > 0$.
From the simplified normal equation, you can see that if $b_1 = 0$, then $b_0 = \bar{y}$.
Combining results from correlation and regression analysis, it is possible to obtain the estimated linear slope coefficient as follows: $b_1 = r_{xy}\,\dfrac{s_y}{s_x}$.
The regression of $Y$ on $X$ does not correspond to the regression of $X$ on $Y$.
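The least squares formulas above can be sketched in a few lines of Python (a minimal sketch; the function name fit_ols is illustrative, not part of the original text):

```python
def fit_ols(x, y):
    """Return (b0, b1) minimizing RSS = sum((y_i - b0 - b1*x_i)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Simplified normal equations: b1 = s_xy / s_x^2, b0 = y_bar - b1*x_bar.
    s_xy = sum(a * b for a, b in zip(x, y)) / n - x_bar * y_bar
    s_xx = sum(a * a for a in x) / n - x_bar ** 2
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1
```

The fitted line necessarily passes through the point of sample means $(\bar{x}, \bar{y})$, in agreement with the first condition above.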
Example: X - production output, Y - working time, n = 10 production cycles in a firm.

i | $x_i$ | $y_i$ | $x_iy_i$ | $x_i^2$ | $y_i^2$ | $\hat{y}_i$ | $e_i$ |
1 | 30 | 73 | 2,190 | 900 | 5,329 | 70 | 3 |
2 | 20 | 50 | 1,000 | 400 | 2,500 | 50 | 0 |
3 | 60 | 128 | 7,680 | 3,600 | 16,384 | 130 | -2 |
4 | 80 | 170 | 13,600 | 6,400 | 28,900 | 170 | 0 |
5 | 40 | 87 | 3,480 | 1,600 | 7,569 | 90 | -3 |
6 | 50 | 108 | 5,400 | 2,500 | 11,664 | 110 | -2 |
7 | 60 | 135 | 8,100 | 3,600 | 18,225 | 130 | 5 |
8 | 30 | 69 | 2,070 | 900 | 4,761 | 70 | -1 |
9 | 70 | 148 | 10,360 | 4,900 | 21,904 | 150 | -2 |
10 | 60 | 132 | 7,920 | 3,600 | 17,424 | 130 | 2 |
Sum | 500 | 1,100 | 61,800 | 28,400 | 134,660 | 1,100 | 0 |
Computation of auxiliary quantities (sample means, sample variances and sample standard deviations):
$$\bar{x} = \frac{500}{10} = 50 \qquad \bar{y} = \frac{1100}{10} = 110$$
$$s_x^2 = \frac{28400}{10} - 50^2 = 340 \qquad s_x = \sqrt{340} \approx 18.44$$
$$s_y^2 = \frac{134660}{10} - 110^2 = 1366 \qquad s_y = \sqrt{1366} \approx 36.96$$
The sample covariance equals:
$$s_{xy} = \frac{61800}{10} - 50 \cdot 110 = 680$$
From these values, we can compute the estimated regression coefficients $b_1$ and $b_0$:
$$b_1 = \frac{s_{xy}}{s_x^2} = \frac{680}{340} = 2 \qquad b_0 = \bar{y} - b_1\bar{x} = 110 - 2 \cdot 50 = 10$$
As a result, we obtain the following estimated regression line:
$$\hat{y} = 10 + 2x$$
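These calculations can be reproduced with a short Python sketch (pure standard library; variable names are illustrative):

```python
# Worked example: X = production output, Y = working time (n = 10 cycles).
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n                        # 50, 110
s_xy = sum(a * b for a, b in zip(x, y)) / n - x_bar * y_bar  # 680
s_xx = sum(a * a for a in x) / n - x_bar ** 2                # 340

b1 = s_xy / s_xx         # 2.0
b0 = y_bar - b1 * x_bar  # 10.0, so the fitted line is y-hat = 10 + 2x
```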
Quality (fit) of the regression line
Once the regression line is estimated, it is useful to know how well the regression line approximates the observed data, that is, how good the representation of the data by means of the regression line is. A measure that describes the quality of this representation is called the coefficient of determination (or R-squared). Its computation is based on a decomposition of the variance of the dependent variable $Y$. The smaller the sum of squared estimated residuals, the better the quality (fit) of the regression line. Since least squares minimizes the variance of the estimated residuals, it also maximizes the R-squared by construction. The sample variance of $Y$ is:
$$s_y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$$
The deviation of the observed values from the arithmetic mean can be decomposed into two parts: the deviation of the observed values from the estimated regression values and the deviation of the estimated regression values from the sample mean:
$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$
This decomposition is depicted in the following figure:
Analogously, the sum of the squared deviations can be decomposed:
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$
We were able to derive this equation by noting that the cross-product term $2\sum_{i}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$ equals zero. The reader is urged to prove this using the least squares first-order conditions above together with the definition of $\hat{y}_i$. Dividing both sides of the equation by $n$, it follows:
$$s_y^2 = s_e^2 + s_{\hat{y}}^2$$
The total sample variance of $Y$ can be decomposed into the sum of the sample variance of the estimated residuals (the unexplained part of the variance of $Y$) and the part of the variance of $Y$ that is explained by the regression function (the sample variance of the regression function). It holds:
- The larger the portion of the sample variance of $Y$ explained by the model (i.e. the larger $s_{\hat{y}}^2 / s_y^2$), the better the fit of the regression function.
- On the other hand, the larger the residual variance as a percentage of the sample variance of $Y$, that is, the larger the outside influences unexplained by the regression function, the worse the regression function fits.
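The decomposition can be checked numerically on the production-output example (a pure-Python sketch; the fitted line $\hat{y} = 10 + 2x$ comes from that example's table):

```python
# Variance decomposition: SS-Total = SS-Residual + SS-Regression.
x = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
y_hat = [10 + 2 * xi for xi in x]   # fitted values from y-hat = 10 + 2x
y_bar = sum(y) / len(y)             # 110

ss_total = sum((yi - y_bar) ** 2 for yi in y)               # 13660
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # 60
ss_regr = sum((yh - y_bar) ** 2 for yh in y_hat)            # 13600
# The cross term vanishes, so ss_total == ss_resid + ss_regr.
```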
The coefficient of determination

The coefficient of determination is defined as the ratio of the (sample) variance explained by the regression function and the total (sample) variance of $Y$:
$$R^2 = \frac{s_{\hat{y}}^2}{s_y^2}$$
That is, it represents the proportion of the sample variance of $Y$ "explained" by the estimated regression function. An alternative way of computing the coefficient of determination is:
$$R^2 = 1 - \frac{s_e^2}{s_y^2}$$
Characteristics:
The coefficient of determination has the following domain: $0 \le R^2 \le 1$.
The higher the coefficient of determination, the better the regression function explains the observed values. If all observed values lie on the regression line, the coefficient of determination is equal to $1$: the total variance of $Y$ can be explained by the variable $X$, and $Y$ depends completely (and linearly) on $X$. If the coefficient of determination is zero, the total variance of $Y$ is identical with the unexplained variance (the residual variance); the variable $X$ then does not have any influence on $Y$.
Symmetry: the fit of the regression of $Y$ on $X$ is identical to the fit of the regression of $X$ on $Y$.
For a linear regression function, the coefficient of determination corresponds to the square of the correlation coefficient: $R^2 = r_{xy}^2$.
Example: For the above-described dependence between the working time and the production output, the sample correlation coefficient and the coefficient of determination are:
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{680}{18.44 \cdot 36.96} \approx 0.998 \qquad R^2 = r_{xy}^2 \approx 0.996$$
One-dimensional nonlinear regression function
Example: n = 8 comparable towns. X - the number of public-transportation maps distributed for free among the citizens of a town at the beginning of the analyzed time period. Y - the increase in the number of citizens using public transport during the analyzed time period.
Town | Increase (in 1,000s) | Public-transportation maps (in 1,000s) |
1 | 0.60 | 80 |
2 | 6.70 | 220 |
3 | 5.30 | 140 |
4 | 4.00 | 120 |
5 | 6.55 | 180 |
6 | 2.15 | 100 |
7 | 6.60 | 200 |
8 | 5.75 | 160 |
Linear regression
As we can see from the figures, the estimated residuals are not randomly dispersed around zero; instead, they show a rather clear nonlinear pattern. Hence, it can be beneficial to use a nonlinear regression model instead of the linear one.

Quadratic regression - second-order polynomial
$$\hat{y}_i = b_0 + b_1 x_i + b_2 x_i^2$$
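A quadratic fit to the town data above can be sketched with NumPy's polyfit (assuming NumPy is available; nothing here is specific to the original interactive program):

```python
import numpy as np

# Town data: X = maps distributed (in 1,000s), Y = increase in riders (in 1,000s).
maps_x = np.array([80, 220, 140, 120, 180, 100, 200, 160], dtype=float)
riders_y = np.array([0.60, 6.70, 5.30, 4.00, 6.55, 2.15, 6.60, 5.75])

# polyfit returns coefficients from the highest power down.
b2, b1, b0 = np.polyfit(maps_x, riders_y, deg=2)
y_hat = b0 + b1 * maps_x + b2 * maps_x ** 2

# Coefficient of determination of the quadratic fit.
r2 = 1 - ((riders_y - y_hat) ** 2).sum() / ((riders_y - riders_y.mean()) ** 2).sum()
# b2 < 0 reflects the concave pattern: each additional batch of maps
# yields a smaller increase in ridership.
```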
Using this interactive example, you can estimate a one-dimensional regression function for any two variables from the two available data sets. The program generates a scatterplot and adds an estimated regression line to the plot. Afterwards, the estimated regression function, the sample correlation coefficient, and the sample coefficient of determination are computed.
US - crime data
In 1985, information about various crimes in each of the 50 states of the USA was collected, including data on:
X1 | land area |
X2 | population |
X3 | murder |
X4 | rape |
X5 | robbery |
X6 | assault |
X7 | burglary |
X8 | larceny |
X9 | auto-theft |
X10 | US states region number |
X11 | US states division number |
Variables X10 and X11 have the following meaning:
X10 | Region | X11 | Division |
1 | Northeast | 1 | New England |
2 | Midwest | 2 | Mid Atlantic |
3 | South | 3 | E N Central |
4 | West | 4 | W N Central |
 | | 5 | S Atlantic |
 | | 6 | E S Central |
 | | 7 | W S Central |
 | | 8 | Mountain |
 | | 9 | Pacific |
Car data
The following measures were collected for 74 different types of cars:
X1 | price |
X2 | mpg (miles per gallon) |
X3 | headroom (in inches) |
X4 | rear seat clearance (distance from front seat back to the rear seat, in inches) |
X5 | trunk space (in cubic feet) |
X6 | weight (in pounds) |
X7 | length (in inches) |
X8 | turning diameter (clearance required to make a U-turn, in feet) |
X9 | displacement (in cubic inches) |
The dependence of robbery (X5) on the population (X2) of a state can be depicted in a scatterplot. Every state is represented in the diagram by a single point. Moreover, an estimated regression line is added to the picture (drawn in black).
The regression analysis provides the following results:
In this case, it does not make sense to interpret the estimated regression intercept $b_0$; it is a kind of correction parameter.
An increase in the population of a state by one unit (that is, by 1,000 citizens) leads, on average, to an increase in the number of robberies by the estimated slope coefficient $b_1$.
The sample correlation coefficient is positive; this implies a positive dependence between the population and the number of robberies.
To assess the fit of the estimated regression line, the coefficient of determination can be used. Its calculation is based on the decomposition of the sample variance of the dependent variable. For the calculation, we can use the total sample variance (SS-Total), the unexplained (residual) variance (SS-Residual), and the explained variance (SS-Regression). Using the formula
$$R^2 = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}$$
we obtain a low coefficient of determination. The regression line does not characterize the observed values very well; the explanatory power of the model is weak.
For a highlighted observation, the estimated regression function predicts the number of robberies from that state's population. Notice: The interactive example allows you to display (graphically) the pairwise dependence of other variables as well.
The dependence of the turning diameter (X8) on the length (X7) of a car can be depicted in a scatterplot. Every car is represented in the diagram by a single point. Moreover, an estimated regression line is added to the picture (drawn in black).
The regression analysis provides the following results:
In this case, it may not make sense to interpret the estimated regression intercept $b_0$; it is a kind of correction parameter.
An increase in the length of a car by one unit (that is, by one inch) leads, on average, to an increase in the turning diameter by $b_1$ feet.
The sample correlation coefficient implies a strong positive dependence between the turning diameter and the length.
To assess the fit of the estimated regression line, the coefficient of determination can be used. Its calculation is based on the decomposition of the sample variance of the dependent variable. For the calculation, we can use the total sample variance (SS-Total), the unexplained (residual) variance (SS-Residual), and the explained variance (SS-Regression). Using the formula
$$R^2 = \frac{SS_{Regression}}{SS_{Total}} = 1 - \frac{SS_{Residual}}{SS_{Total}}$$
we obtain a high coefficient of determination: the regression line characterizes (explains) the observed values quite well.
The highlighted observation corresponds to a car length of 192 inches and a turning diameter of 38 feet. For a car of this length, the estimated regression function predicts the turning diameter. Notice: The interactive example allows you to display (graphically) the pairwise dependence of other variables as well.

Now, we examine the monthly net income and the monthly living expenditures of 10 two-person households.
Household | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|---|
Net income in DM | 3,500 | 5,000 | 4,300 | 6,100 | 1,000 | 4,800 | 2,900 | 2,400 | 5,600 | 4,100 |
Expenditures in DM | 2,000 | 3,500 | 3,100 | 3,900 | 900 | 3,000 | 2,100 | 1,900 | 2,900 | 2,100 |
These observations are drawn in the following scatterplot. You can see that the net income of a household has a positive influence on the household's expenditures and that this dependence can be estimated by means of a linear regression function.
We want to estimate a linear regression function describing expenditures of a household as a function of the household’s net income. To estimate the linear regression model, some auxiliary calculations are needed.
i | Net income $x_i$ | Expenditures $y_i$ | $x_iy_i$ | $x_i^2$ | $y_i^2$ |
1 | 3,500 | 2,000 | 7,000,000 | 12,250,000 | 4,000,000 |
2 | 5,000 | 3,500 | 17,500,000 | 25,000,000 | 12,250,000 |
3 | 4,300 | 3,100 | 13,330,000 | 18,490,000 | 9,610,000 |
4 | 6,100 | 3,900 | 23,790,000 | 37,210,000 | 15,210,000 |
5 | 1,000 | 900 | 900,000 | 1,000,000 | 810,000 |
6 | 4,800 | 3,000 | 14,400,000 | 23,040,000 | 9,000,000 |
7 | 2,900 | 2,100 | 6,090,000 | 8,410,000 | 4,410,000 |
8 | 2,400 | 1,900 | 4,560,000 | 5,760,000 | 3,610,000 |
9 | 5,600 | 2,900 | 16,240,000 | 31,360,000 | 8,410,000 |
10 | 4,100 | 2,100 | 8,610,000 | 16,810,000 | 4,410,000 |
Sum | 39,700 | 25,400 | 112,420,000 | 179,330,000 | 71,720,000 |
Using the derived formulas, the estimated regression parameters $b_1$ and $b_0$ are computed as follows:
$$b_1 = \frac{\sum x_iy_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} = \frac{112{,}420{,}000 - 10 \cdot 3970 \cdot 2540}{179{,}330{,}000 - 10 \cdot 3970^2} = \frac{11{,}582{,}000}{21{,}721{,}000} \approx 0.5332$$
$$b_0 = \bar{y} - b_1\bar{x} = 2540 - 0.5332 \cdot 3970 \approx 423.13$$
Thus, the estimated regression function is
$$\widehat{\text{Expenditures}} = 423.13 + 0.5332 \cdot \text{Net income}$$
The estimated regression line can be drawn in the scatterplot:
The slope of the line corresponds to the marginal propensity to consume: an increase in net income by one Mark (1 DM) translates, on average, into a 0.53 DM increase in expenditures for the observed households. Once the sample standard deviations of $X$ and $Y$ and their sample covariance are computed, we can readily obtain the sample correlation coefficient:
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{1{,}158{,}200}{\sqrt{2{,}172{,}100}\cdot\sqrt{720{,}400}} \approx 0.926$$
It hints at a strong (positive) dependence between households' net incomes and living expenditures. The quality of the fit of the regression function can be evaluated via the coefficient of determination, the ratio of the variance explained by the regression function and the total sample variance of the expenditures $Y$:
$$R^2 = r_{xy}^2 \approx 0.86$$
The coefficient of determination shows that about 86% of the variation in the households' expenditures can be explained by a linear dependence on the households' net incomes.
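The household example can likewise be reproduced with a short Python sketch (standard library only; variable names are illustrative):

```python
# Household example: X = net income, Y = expenditures (both in DM).
income = [3500, 5000, 4300, 6100, 1000, 4800, 2900, 2400, 5600, 4100]
spend = [2000, 3500, 3100, 3900, 900, 3000, 2100, 1900, 2900, 2100]
n = len(income)

x_bar, y_bar = sum(income) / n, sum(spend) / n  # 3970, 2540
s_xy = sum(x * y for x, y in zip(income, spend)) / n - x_bar * y_bar
s_xx = sum(x * x for x in income) / n - x_bar ** 2
s_yy = sum(y * y for y in spend) / n - y_bar ** 2

b1 = s_xy / s_xx                   # ~0.5332, the marginal propensity to consume
b0 = y_bar - b1 * x_bar            # ~423.13
r = s_xy / (s_xx * s_yy) ** 0.5    # ~0.926
r2 = r ** 2                        # ~0.86
```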