Linear regression, or Multiple Linear regression when more than one predictor is used, determines the linear relationship between a response (Y/dependent) variable and one or more predictor (X/independent) variables. The least-squares method is used to minimize the vertical distance between the response and the fitted linear line.
The requirements of the test are:
Data in existing Excel worksheets can be used and should be arranged in a List dataset layout. The dataset must contain at least two continuous scale variables.
When entering new data we recommend using New Dataset to create a new k variables dataset ready for data entry.
To start the test:
Excel 97, 2000, 2002 & 2003:
Select any cell in the range containing the dataset to analyse, then click Analyse on the Analyse-it toolbar, click Regression then click Linear.
The report shows the number of observations analysed, and, if applicable, how many missing values were listwise deleted.
R2 and adjusted R2 statistics summarise the goodness of the linear fit. Both statistics range from 0 to 1, with higher values indicating a better fit, and a value of 1 indicating a perfect fit. Adjusted R2 is similar to R2 except it is adjusted for the number of predictors in the model so Adjusted R2 statistics from models with a different number of predictor variables can be compared. The standard error of the regression line is also shown.
An analysis of variance table is shown to test the hypothesis that the linear fit is a better fit than fitting to just the mean of the response. Total variation is the variance when a model is fit to just the mean of the response variable. Residual variation is the variance when the linear model is fit. Therefore, the model variation is the difference between the total and residual variation and is the amount of variation explained by the linear model. The F statistic is the ratio of the model and residual variance and represents the proportional increase in error of fitting a mean model versus the linear model. The p-value is the probability of rejecting the null hypothesis, that the linear fit is equal to the mean fit alone, when it is in fact true. A significant p-value implies that the linear fit is a better fit than the mean alone.
The regression coefficients table shows the linear fit coefficients and confidence intervals for each predictor variable and the intercept. The coefficients together combine to form the regression equation of the linear fit and can be used to predict the response from the predictors as follows:
y = a + bx1 + cx2 + dx3 ...
y = a + bx1 + cx2 + dx3 ...
where a is the intercept coefficient (the point where the line intersects the Y axis), and b, c, d (and so on...) are the coefficients for the x1, x2, x3 (1st, 2nd, 3rd an so on...) predictor variables.
IMPORTANT When using the equation to predict values for Y ensure the coefficients are used to at least 4 significant figures. The values are shown to 4 significant figures, but if necessary, the cells contain the coefficients to much higher precision.
A t-statistic and hypothesis test are shown for each regression coefficient. The p-value is the probability of rejecting the null hypothesis, that the predictor has no effect on the response, when it is in fact true. A significant p-value implies that the predictor contributes to the linear fit. In some cases it may be possible to remove predictors that have no effect on the response from the model in favour of a simpler model.
To add or remove predictors from the model:
Usually the linear fit will include a constant term where the fitted line intersects the Y axis at X = 0. The constant term can be removed from the model, if it is known that Y=0 when X=0, forcing the fitted regression through zero.
To remove the constant term from the model:
A scatter plot allows visual assessment of the relationship between the response and predictor variable.
The fit, simultaneous confidence interval for the fit, and prediction intervals can also be overlaid on the scatter plot.
To modify the scatter plot:
The scatter plot (see below) shows the fit, simultaneous confidence intervals and prediction intervals.
A residual plot allows visual assessment of the distance of each observation from the fitted line. The residuals should be randomly scattered in a constant width band about the zero line, if the prior assumption of constant variance is met. Runs of residuals above or below the zero line may indicate a non-linear relationship. If the residuals are standardized they should lie within roughly ±2 to 3 SDs of zero. Standardized residuals of +/- 4 or more SDs should be investigated as possible outliers.
A histogram of the residuals allows visual assessment of the assumption that the measurement errors in the response variable are normally distributed.
The residual plot can show raw or standardized residuals with an optional histogram.
The residual plot (see below) shows the residuals and a histogram with a normal distribution overlay.