You are viewing documentation for the old version 2.30 of Analyse-it. If you are using version 3.00 or later we recommend you go to the current multiple linear regression documentation.

Multiple linear regression, with scatter & residual plots

Linear regression, or multiple linear regression when more than one predictor is used, determines the linear relationship between a response (Y/dependent) variable and one or more predictor (X/independent) variables. The least-squares method is used to minimize the vertical distances between the observed responses and the fitted line.
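For a single predictor, the least-squares calculation can be sketched in a few lines of Python; the data values here are made up purely for illustration:

```python
# Ordinary least squares for a single predictor, y = a + b*x.
# The slope and intercept formulas follow from minimising the sum of
# squared vertical distances between y and the fitted line.
def ols_fit(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept
    return a, b

# Hypothetical data roughly following y = 2x:
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = ols_fit(x, y)
```

With more than one predictor the same principle applies, but the coefficients are found by solving the normal equations rather than by these closed-form formulas.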

The requirements of the test are:

  • A dependent response and at least one independent predictor variable, measured on a continuous scale.
  • Measurement error in the response variable must be normally distributed and have constant variance, with predictors free of measurement error.


Arranging the dataset

Data in existing Excel worksheets can be used and should be arranged in a List dataset layout. The dataset must contain at least two continuous scale variables. 

When entering new data we recommend using New Dataset to create a new k variables dataset ready for data entry.

Using the test

To start the test:

  1. Excel 2007:
    Select any cell in the range containing the dataset to analyse, then click Regression on the Analyse-it tab, then click Linear.
  2. Excel 97, 2000, 2002 & 2003:
    Select any cell in the range containing the dataset to analyse, then click Analyse on the Analyse-it toolbar, click Regression then click Linear.

  3. Tick the predictor variables in Variable X (independent).
  4. Click Variable Y (dependent) and select the dependent response variable.
  5. Enter the Confidence interval to calculate for the regression coefficients. The level should be entered as a percentage between 50 and 100, without the % sign.
  6. Click OK to run the test.

The report shows the number of observations analysed, and, if applicable, how many missing values were listwise deleted.

R2 and adjusted R2 statistics summarise the goodness of the linear fit. Both statistics range from 0 to 1, with higher values indicating a better fit and a value of 1 indicating a perfect fit. Adjusted R2 is similar to R2, except that it is adjusted for the number of predictors in the model, so adjusted R2 statistics from models with different numbers of predictor variables can be compared. The standard error of the regression line is also shown.
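As a sketch of how these statistics relate to the fit, R2 and adjusted R2 can be computed from the residual and total sums of squares for n observations and p predictors (the response and fitted values below are hypothetical):

```python
def r_squared(y, yhat, p):
    """Return (R2, adjusted R2) for fitted values yhat from a p-predictor model."""
    n = len(y)
    my = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # residual sum of squares
    ss_tot = sum((yi - my) ** 2 for yi in y)                 # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)            # penalise extra predictors
    return r2, adj_r2

# Made-up response and fitted values from a one-predictor model:
r2, adj_r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.1, 3.9], p=1)
```

Because of the (n - 1)/(n - p - 1) penalty, adjusted R2 only increases when an added predictor improves the fit more than chance alone would.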

Model comparison table

An analysis of variance table is shown to test the hypothesis that the linear fit is better than fitting to just the mean of the response. Total variation is the variance when a model is fit to just the mean of the response variable. Residual variation is the variance when the linear model is fit. The model variation is therefore the difference between the total and residual variation, and is the amount of variation explained by the linear model. The F statistic is the ratio of the model and residual variance and represents the proportional increase in error of fitting a mean model versus the linear model. The p-value is the probability of observing an F statistic at least as large as that calculated, if the null hypothesis, that the linear fit is no better than the mean fit alone, were true. A significant p-value implies that the linear fit is a better fit than the mean alone.
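The F statistic described above can be sketched from the same sums of squares; the data here are hypothetical, with p predictors and n observations:

```python
def anova_f(y, yhat, p):
    """Model-comparison F statistic: model mean square / residual mean square."""
    n = len(y)
    my = sum(y) / n
    ss_tot = sum((yi - my) ** 2 for yi in y)                 # mean-only model
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # linear model
    ss_model = ss_tot - ss_res                               # variation explained
    return (ss_model / p) / (ss_res / (n - p - 1))

# Same hypothetical one-predictor fit as before:
f = anova_f([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.1, 3.9], p=1)
```

The p-value then comes from comparing f against an F distribution with p and n - p - 1 degrees of freedom.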

Linear fit equation

The regression coefficients table shows the linear fit coefficients and confidence intervals for each predictor variable and the intercept. The coefficients combine to form the regression equation of the linear fit and can be used to predict the response from the predictors as follows:

y = a + bx1 + cx2 + dx3 ...

where a is the intercept coefficient (the point where the line intersects the Y axis), and b, c, d (and so on) are the coefficients for the x1, x2, x3 (1st, 2nd, 3rd, and so on) predictor variables.

IMPORTANT  When using the equation to predict values for Y, ensure the coefficients are used to at least 4 significant figures. The values are displayed to 4 significant figures, but the underlying cells contain the coefficients to much higher precision, which can be used if necessary.
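Applying the equation is then straightforward; the coefficients below are hypothetical, standing in for values read at full precision from the report's cells:

```python
def predict(intercept, coefs, xs):
    """Predict y = a + b*x1 + c*x2 + ... from fitted coefficients."""
    return intercept + sum(c * x for c, x in zip(coefs, xs))

# Hypothetical intercept and two predictor coefficients,
# predicting the response at x1 = 10, x2 = 20:
y_pred = predict(1.2345, [0.5678, -0.1234], [10.0, 20.0])
```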

A t-statistic and hypothesis test are shown for each regression coefficient. The p-value is the probability of observing a t-statistic at least as extreme as that calculated, if the null hypothesis, that the predictor has no effect on the response, were true. A significant p-value implies that the predictor contributes to the linear fit. In some cases it may be possible to remove predictors that have no effect on the response from the model in favour of a simpler model.
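For a single predictor, the slope's t-statistic can be sketched as the coefficient divided by its standard error; the data and coefficients here are made up for illustration:

```python
import math

def slope_t(x, y, a, b):
    """t-statistic for the slope b of a simple linear fit y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in resid) / (n - 2))  # residual standard deviation
    sxx = sum((xi - mx) ** 2 for xi in x)
    return b / (s / math.sqrt(sxx))                     # b / SE(b)

# Hypothetical data with fitted intercept 0.05 and slope 1.99:
t = slope_t([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1], a=0.05, b=1.99)
```

The p-value comes from comparing t against a t distribution with n - p - 1 degrees of freedom; with more than one predictor the standard errors instead come from the full covariance matrix of the coefficients.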

To add or remove predictors from the model:

  1. If the Linear regression dialog box is not visible click Edit on the Analyse-it tab/toolbar. 
  2. Tick/Untick variables in the Variable X (independent) selector to change the predictors.
  3. Click OK.

Usually the linear fit includes a constant term, the point where the fitted line intersects the Y axis at X = 0. If it is known that Y = 0 when X = 0, the constant term can be removed from the model, forcing the fitted regression through zero.

To remove the constant term from the model:

  1. If the Linear regression dialog box is not visible click Edit on the Analyse-it tab/toolbar. 
  2. Tick No intercept term in model.
  3. Click OK.

Scatter plot

A scatter plot allows visual assessment of the relationship between the response and predictor variable.

The fit, simultaneous confidence interval for the fit, and prediction intervals can also be overlaid on the scatter plot. 

To modify the scatter plot:

  1. If the Linear regression dialog box is not visible click Edit on the Analyse-it tab/toolbar. 
  2. Click Scatter plot and select with Fit to show the fit, with Fit + CI to show the fit with its confidence interval, with Fit + PI to show the fit with its prediction interval, or with Fit + CI + PI to show both the confidence and prediction intervals on the scatter plot.
  3. Click OK.

The scatter plot (see below) shows the fit, simultaneous confidence intervals and prediction intervals.


Examining the residuals

A residual plot allows visual assessment of the distance of each observation from the fitted line. If the prior assumption of constant variance is met, the residuals should be randomly scattered in a constant-width band about the zero line. Runs of residuals above or below the zero line may indicate a non-linear relationship. If the residuals are standardized they should lie within roughly ±2 to 3 SDs of zero. Standardized residuals of ±4 or more SDs should be investigated as possible outliers.
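A sketch of standardising the residuals and screening them against the ±4 SD rule described above (the data are made up, and the n - 2 divisor assumes a single predictor):

```python
def standardized_residuals(y, yhat):
    """Raw residuals divided by the residual standard error."""
    n = len(y)
    resid = [yi - fi for yi, fi in zip(y, yhat)]
    se = (sum(r * r for r in resid) / (n - 2)) ** 0.5  # residual standard error
    return [r / se for r in resid]

# Hypothetical response and fitted values:
z = standardized_residuals([1.0, 2.0, 3.0], [1.1, 2.0, 2.9])
possible_outliers = [i for i, zi in enumerate(z) if abs(zi) >= 4]
```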

A histogram of the residuals allows visual assessment of the assumption that the measurement errors in the response variable are normally distributed.

The residual plot can show raw or standardized residuals with an optional histogram.

To modify the residual plot:

  1. If the Linear regression dialog box is not visible click Edit on the Analyse-it tab/toolbar. 
  2. Click Residual plot and select Raw to plot the actual residuals (difference from the fitted line) or Standardized to plot the standardized residuals (raw residual divided by the standard error).
  3. Tick with Histogram of Residuals to show a histogram with normal overlay of the distribution of the residuals.
  4. Click OK.

The residual plot (see below) shows the residuals and a histogram with a normal distribution overlay.


References to further reading

  1. Applied Regression Analysis (3rd edition)
    Norman R. Draper, Harry Smith. ISBN 0-471-17082-8, 1998.
