# Multiple linear regression, with scatter & residual plots

Linear regression, or multiple linear regression when more than one predictor is used, determines the linear relationship between a response (Y/dependent) variable and one or more predictor (X/independent) variables. The least-squares method fits the line by minimizing the sum of the squared vertical distances between the observed responses and the fitted line.
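To illustrate the least-squares idea, here is a minimal Python sketch that solves the normal equations for a model with an intercept. It is not Analyse-it's implementation; the function name `fit_least_squares` and the data are made up for the example.

```python
def fit_least_squares(rows, y):
    """Fit y = a + b*x1 + c*x2 + ... by ordinary least squares.

    rows: one tuple of predictor values per observation.
    Returns the coefficients [a, b, c, ...], intercept first.
    """
    # Design matrix with a leading column of 1s for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])

    # Normal equations (X'X) beta = X'y, held as an augmented matrix.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [v - factor * w for v, w in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
response = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]
coefficients = fit_least_squares(predictors, response)
```

Because the example response has no noise, the fit should recover the generating coefficients up to floating-point error. For real data a numerically more robust solver (for example a QR-based routine) is usually preferable to the normal equations.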

The requirements of the test are:

- A dependent response and at least one independent predictor variable, measured on a continuous scale.
- Measurement error in the response variable must be normally distributed and have constant variance, with predictors free of measurement error.

## Arranging the dataset

Data in existing Excel worksheets can be used and should be arranged in a List dataset layout. The dataset must contain at least two continuous scale variables.

When entering new data we recommend using New Dataset to create a new **k variables** dataset ready for data entry.

## Using the test

To start the test:

- Excel 2007: Select any cell in the range containing the dataset to analyse, then click **Regression** on the **Analyse-it** tab, then click **Linear**.

  Excel 97, 2000, 2002 & 2003: Select any cell in the range containing the dataset to analyse, then click **Analyse** on the **Analyse-it** toolbar, click **Regression**, then click **Linear**.
- Tick the predictor variables in **Variable X (independent)**.
- Click **Variable Y (dependent)** and select the dependent response variable.
- Enter the **Confidence interval** to calculate for the regression coefficients. The level should be entered as a percentage between 50 and 100, without the % sign.
- Click **OK** to run the test.

The report shows the number of observations analysed, and, if applicable, how many missing values were listwise deleted.

$R^2$ and adjusted $R^2$ statistics summarise the goodness of the linear fit. $R^2$ ranges from 0 to 1, with higher values indicating a better fit and a value of 1 indicating a perfect fit. Adjusted $R^2$ is similar to $R^2$ except that it is adjusted for the number of predictors in the model, so adjusted $R^2$ statistics from models with different numbers of predictor variables can be compared. Adjusted $R^2$ never exceeds $R^2$ and can fall below 0 for a very poor fit. The standard error of the regression line is also shown.
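As a rough sketch of how these two statistics relate, the Python below computes them from observed and fitted values using the usual textbook formulas for a model with an intercept. The function name and the data are illustrative assumptions, not taken from Analyse-it.

```python
def r_squared(y, fitted, n_predictors):
    """Return (R^2, adjusted R^2) for a model that includes an intercept."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)               # mean-only model
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # linear model
    r2 = 1 - ss_resid / ss_total
    # Penalise for the number of predictors used.
    adjusted = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adjusted


# Made-up data: x = 1..5, with fitted values from the line y = 2.2 + 0.6x.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
r2, adj_r2 = r_squared(y, fitted, n_predictors=1)
```

For this data $R^2$ is about 0.6 and adjusted $R^2$ about 0.467; the gap widens as predictors are added without a matching improvement in fit.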

## Model comparison table

An analysis of variance table is shown to test the hypothesis that the linear fit is a better fit than fitting to just the mean of the response. Total variation is the variation when a model is fit to just the mean of the response variable. Residual variation is the variation that remains when the linear model is fit. The model variation is therefore the difference between the total and residual variation, and is the amount of variation explained by the linear model. The F statistic is the ratio of the model variance to the residual variance (each variation divided by its degrees of freedom) and measures how much worse the mean-only model fits than the linear model. The *p*-value is the probability of obtaining an F statistic at least as large as the one observed if the null hypothesis, that the linear fit is no better than the mean alone, is true. A significant *p*-value implies that the linear fit is a better fit than the mean alone.
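The decomposition described above can be sketched in a few lines of Python. This assumes a model with an intercept and uses made-up data; the function name `anova_table` is hypothetical, and the *p*-value is left out because it needs the F distribution, which the standard library does not provide.

```python
def anova_table(y, fitted, n_predictors):
    """Split total variation into model and residual parts and form F."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_model = ss_total - ss_resid          # variation explained by the fit
    df_model = n_predictors
    df_resid = n - n_predictors - 1
    ms_model = ss_model / df_model          # model variance (mean square)
    ms_resid = ss_resid / df_resid          # residual variance (mean square)
    return {
        "ss_model": ss_model, "ss_resid": ss_resid, "ss_total": ss_total,
        "df_model": df_model, "df_resid": df_resid,
        "F": ms_model / ms_resid,
    }


# Made-up data: x = 1..5, with fitted values from the line y = 2.2 + 0.6x.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
table = anova_table(y, fitted, n_predictors=1)
```

Here the total variation of 6.0 splits into about 3.6 explained by the model and 2.4 residual, giving an F statistic of about 4.5 on 1 and 3 degrees of freedom.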

## Linear fit equation

The regression coefficients table shows the linear fit coefficients and confidence intervals for each predictor variable and the intercept. The coefficients together combine to form the regression equation of the linear fit and can be used to predict the response from the predictors as follows:

$$y = a + bx_1 + cx_2 + dx_3 + \dots$$

where $a$ is the intercept coefficient (the point where the line intersects the Y axis), and $b$, $c$, $d$ (and so on) are the coefficients for the $x_1$, $x_2$, $x_3$ (1st, 2nd, 3rd and so on) predictor variables.

**IMPORTANT:** When using the equation to predict values for Y, use the coefficients to at least 4 significant figures. The report shows the values to 4 significant figures, but the cells hold the coefficients to much higher precision if it is needed.
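The effect of coefficient precision on a prediction can be seen in a small Python sketch. The coefficients and predictor values below are hypothetical and chosen only to make the difference visible; `round_sig` and `predict` are helper names invented for the example.

```python
from math import floor, log10


def round_sig(value, figures=4):
    """Round a number to the given count of significant figures."""
    if value == 0:
        return 0.0
    return round(value, figures - 1 - floor(log10(abs(value))))


def predict(coefficients, x_values):
    """Evaluate y = a + b*x1 + c*x2 + ... (intercept listed first)."""
    intercept, *slopes = coefficients
    return intercept + sum(b * x for b, x in zip(slopes, x_values))


# Hypothetical coefficients at full precision, and as displayed to 4 s.f.
full = [12.3456789, 0.98765432, -0.0123456789]
shown = [round_sig(c) for c in full]

x = (250.0, 4000.0)                 # hypothetical predictor values
y_full = predict(full, x)
y_shown = predict(shown, x)
difference = abs(y_full - y_shown)
```

With large predictor values even the fourth significant figure of a coefficient shifts the prediction slightly, which is why fewer figures than that are not recommended.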

A t-statistic and hypothesis test are shown for each regression coefficient. The *p*-value is the probability of obtaining a t-statistic at least as extreme as the one observed if the null hypothesis, that the predictor has no effect on the response, is true. A significant *p*-value implies that the predictor contributes to the linear fit. In some cases it may be possible to remove predictors that have no effect on the response from the model in favour of a simpler model.
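For the single-predictor case the t-statistic has a simple closed form, sketched below in Python with made-up data. This is only an illustration under that simplifying assumption: with several predictors the standard errors come from the full design matrix, and the *p*-value requires the t distribution with the residual degrees of freedom, which is not computed here.

```python
from math import sqrt


def slope_t_statistic(x, y):
    """Return (slope, standard error, t, residual df) for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2
                   for xi, yi in zip(x, y))
    s = sqrt(ss_resid / (n - 2))        # standard error of the regression
    se_slope = s / sqrt(sxx)            # standard error of the slope
    return slope, se_slope, slope / se_slope, n - 2


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, se_slope, t_stat, df = slope_t_statistic(x, y)
```

With one predictor the square of this t-statistic equals the F statistic of the overall fit, so the two tests agree.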

To add or remove predictors from the model:

- If the Linear regression dialog box is not visible, click **Edit** on the **Analyse-it** tab/toolbar.
- Tick/untick variables in the **Variable X (independent)** selector to change the predictors.
- Click **OK**.

Usually the linear fit will include a constant term where the fitted line intersects the Y axis at X = 0. The constant term can be removed from the model, if it is known that Y=0 when X=0, forcing the fitted regression through zero.

To remove the constant term from the model:

- If the Linear regression dialog box is not visible, click **Edit** on the **Analyse-it** tab/toolbar.
- Tick **No intercept term in model**.
- Click **OK**.
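What forcing the fit through zero means can be sketched for a single predictor, where the least-squares slope of a no-intercept model has a closed form. The data and the function name `fit_through_origin` are made up for illustration.

```python
def fit_through_origin(x, y):
    """Least-squares slope b for the no-intercept model y = b*x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Made-up data where Y is expected to be 0 when X is 0.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
slope = fit_through_origin(x, y)
```

The fitted line then has only one coefficient, and its prediction at X = 0 is exactly 0 by construction.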

## Scatter plot

A scatter plot allows visual assessment of the relationship between the response and predictor variable.

The fit, simultaneous confidence interval for the fit, and prediction intervals can also be overlaid on the scatter plot.

To modify the scatter plot:

- If the Linear regression dialog box is not visible, click **Edit** on the **Analyse-it** tab/toolbar.
- Click **Scatter plot** and select **with Fit** to show the fit, **with Fit + CI** to show the fit with confidence interval for the fit, **with Fit + PI** to show the fit with prediction interval, or **with Fit + CI + PI** to show the fit with confidence and prediction intervals on the scatter plot.
- Click **OK**.

The scatter plot (see below) shows the fit, simultaneous confidence intervals and prediction intervals.

## Examining the residuals

A residual plot allows visual assessment of the distance of each observation from the fitted line. If the prior assumption of constant variance is met, the residuals should be randomly scattered in a constant-width band about the zero line. Runs of residuals above or below the zero line may indicate a non-linear relationship. Standardized residuals should lie within roughly ±2 to ±3 SDs of zero; those of ±4 SDs or more should be investigated as possible outliers.
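The outlier check can be sketched in Python as follows. This assumes that standardizing means dividing each raw residual by the standard error of the regression; the exact definition Analyse-it uses may differ, and the function names and data are made up.

```python
from math import sqrt


def standardized_residuals(y, fitted, n_predictors):
    """Raw residuals divided by the residual standard error (assumed SE)."""
    n = len(y)
    raw = [yi - fi for yi, fi in zip(y, fitted)]
    se = sqrt(sum(r * r for r in raw) / (n - n_predictors - 1))
    return [r / se for r in raw]


def possible_outliers(standardized, threshold=4.0):
    """Indices of observations at or beyond +/- threshold SDs."""
    return [i for i, z in enumerate(standardized) if abs(z) >= threshold]


# Made-up data: x = 1..5, with fitted values from the line y = 2.2 + 0.6x.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.8, 3.4, 4.0, 4.6, 5.2]
z = standardized_residuals(y, fitted, n_predictors=1)
flagged = possible_outliers(z)
```

In this small example every standardized residual is well inside ±2, so nothing is flagged for investigation.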

A histogram of the residuals allows visual assessment of the assumption that the measurement errors in the response variable are normally distributed.

The residual plot can show raw or standardized residuals with an optional histogram.

To modify the residual plot:

- If the Linear regression dialog box is not visible, click **Edit** on the **Analyse-it** tab/toolbar.
- Click **Residual plot** and select **Raw** to plot the actual residual (difference from fitted line) or **Standardized** to show the residuals standardized (raw / SE).
- Tick **with Histogram of Residuals** to show a histogram with normal overlay of the distribution of the residuals.
- Click **OK**.

The residual plot (see below) shows the residuals and a histogram with a normal distribution overlay.

## References to further reading

- Norman R. Draper, Harry Smith. Applied Regression Analysis (3rd edition). 1998. ISBN 0-471-17082-8.


Version 2.30

Published 9-Jun-2009