Preparing data for analysis

Analyse-it takes your data for analysis directly from an Excel worksheet. It also provides commands to arrange and structure your data ready to analyze. In addition, you can use the built-in Excel commands to transform, sort, and filter data.

Datasets
A dataset is a range of contiguous cells on an Excel worksheet containing data to analyze.
Variables
A variable is an attribute of an object. The value of a variable can vary from one thing to another.
Measurement scales
Five different scales are used to classify measurements based on how much information each measurement conveys.
Missing values
Missing values occur when no data is recorded for an observation; you intended to make an observation but did not. Missing data are a common occurrence and can have a significant effect on the statistical analysis.
Case and frequency form data
Most data are available in case form with data for each sampling unit. Sometimes data are not available for each unit but are already summarized by counting the frequency of occurrences of each value, called frequency form data.

Datasets

A dataset is a range of contiguous cells on an Excel worksheet containing data to analyze.

When arranging data on an Excel worksheet you must follow a few simple rules so that Analyse-it works with your data:

Title to clearly describe the data. If you do not specify a title, the cell range of the dataset (such as A3:C13) is used to refer to the dataset.
A header row containing variable labels. Each variable name should be unique. Units of measurement can be included in the label by enclosing them in brackets after the name.
Rows containing the data for each case. The number of rows is only limited by Excel (currently over one million).
Columns containing the data for each variable.
Optional: Labels in the first column to provide a meaningful name/identifier for each case.

When you use an Analyse-it command, the extent of a dataset is determined by scanning outwards from the active cell to include all surrounding contiguous cells. The extent is known when a blank row or column surrounding the dataset, or the edge of the worksheet, is reached.

Most Excel commands (for example, Sort, Pivot Table) use the same technique to determine the range of cells to operate on. It avoids the need to select often-large ranges of cells using the mouse, which is laborious and error-prone. It also ensures that if you add, or remove, cases or variables, that subsequent analyses automatically reflect any changes to the dataset.

When you analyze data, any data in hidden rows on the worksheet are excluded. This feature lets you easily limit analysis to a subset of the cases in the dataset. You can hide rows manually or use a filter to hide them based on criteria.

You can locate a dataset anywhere on a worksheet, and you can keep multiple datasets on a single worksheet provided you separate them from each other by at least one blank row and column. However, we recommend that you use a separate worksheet for each dataset. Using separate worksheets allows you to name datasets using the Excel worksheet tabs, navigate between datasets using the worksheet tabs, and ensures that filtering a dataset does not affect other datasets on the same worksheet.

Variables

A variable is an attribute of an object. The value of a variable can vary from one thing to another.

There are two different types of variables:

Quantitative (or numeric) - data are numeric values of a quantity.
Qualitative (or categorical) - data differs only in kind.

Record quantitative data as the numeric value.

Record qualitative data as a numeric coding (for example, 0, 1, 2) or labels (for example, Low, Medium, High). Numeric codings are often preferred as they succinctly represent the values and allow quicker data entry. However, codings can be difficult to understand by anyone other than the person who assigned them. To make their meaning clear, you can also assign a label to each numeric coding.

Data are ordered based on numeric value, numeric coding, or if a numeric coding is not used into alphabetical label order. For some statistical tests, the sort order determines how observations are rank-ordered and can affect the results. Often alphabetic sort order does represent the correct ordering. For example Low, Medium, High is sorted alphabetically into High, Low, Medium which does not represent their correct, natural order. In such cases, you must explicitly set the order before analyzing the data.

Measurement scales

Five different scales are used to classify measurements based on how much information each measurement conveys.

The different levels of measurement involve different properties of the numbers or symbols that constitute the measurements and also an associated set of permissible transformations.

Measurement scale	Properties	Permissible transformation
Nominal	Two things are assigned the same symbol if they have the same value of the attribute. For example, Gender (Male, Female); Religion (coded as 0=None, 1=Christian, 2=Buddhist).	Permissible transformations are any one-to-one or many-to-one transformation, although a many-to-one transformation loses information.
Ordinal	Things are assigned numbers such that the order of the numbers reflects an order relation defined on the attribute. Two things x and y with attribute values a(x) and a(y) are assigned numbers m(x) and m(y) such that if m(x) > m(y), then a(x) > a(y). For example, Moh's scale for the hardness of minerals; academic performance grades (A, B, C, ...).	Permissible transformations are any monotone increasing transformation, although a transformation that is not strictly increasing loses information.
Interval	Things are assigned numbers such that differences between the numbers reflect differences of the attribute. If m(x) - m(y) > m(u) - m(v), then a(x) - a(y) > a(u) - a(v). For example, Temperature measured in degrees Fahrenheit or Celsius.	Permissible transformations are any affine transformation t(m) = c * m + d, where c and d are constants; another way of saying this is that the origin and unit of measurement are arbitrary.
Ratio	Things are assigned numbers such that differences and ratios between the numbers reflect differences and ratios of the attribute. For example, Temperature measured in degrees Kelvin scale; Length in centimeters.	Permissible transformations are any linear (similarity) transformation t(m) = c * m, where c is a constant; another way of saying this is that the unit of measurement is arbitrary.
Absolute	Things are assigned numbers such that all properties of the numbers reflect analogous properties of the attribute. For example, Number of children in a family, Frequency of occurrence.	The only permissible transformation is the identity transformation.

While the measurement scale cannot determine a single best statistical method appropriate for data analysis, it does define which statistical methods are inappropriate. Where possible the use of a variable is restricted when its measurement scale is not appropriate for the analysis. For example, a nominal variable cannot be used in a t-test.

When the measurement scale of a variable is unknown, the scale is inferred from its role in the analysis and the type of data in the variable. If the measurement scale cannot be inferred, you must set the measurement scale.

Missing values

Missing values occur when no data is recorded for an observation; you intended to make an observation but did not. Missing data are a common occurrence and can have a significant effect on the statistical analysis.

Missing values arise due to many reasons. For example, a subject dropped out of the study; a machine fault occurred that prevented a measurement been taken; a subject did not answer a question in a survey; or a researcher made a mistake recording an observation.

A missing value is indicated by an empty cell or a . (full stop) or * (asterisk) in the cell. These values always indicate a missing value, regardless of measurement scale of the variable. Similarly, cell values that are not valid for the measurement scale of the variable are treated as missing. For example, a cell containing text or a #N/A error value is treated as missing for a quantitative measurement scale variable, since neither are valid numeric values.

Some applications, especially older statistical software, use special values such as 99999 to indicate missing values. If you import such data, you should use the Excel Find & Replace command to replace the values with a recognized missing indicator.

In the presence of missing values you often still want to make statistical inferences. Some analyses support techniques such as deletion, imputation, and interpolation to allow the analysis to cope with missing values. However, it is still important before performing an analysis that you understand why the data is missing. When the reason for the missing data is not completely random, the study may be biased, and the statistics may be misleading.

Case and frequency form data

Most data are available in case form with data for each sampling unit. Sometimes data are not available for each unit but are already summarized by counting the frequency of occurrences of each value, called frequency form data.

Whenever possible you should record data in case form. Case form data can be reduced to frequency form, but frequency form data cannot be reconstructed into case form. You may want to use frequency form data when working with a large amount of data from a database, where you can save computing resources by letting the database server tabulate the data.

Case form dataset layout

Each column is a variable (Height, Eye color), each row is a separate case (Subject) with the values of the variables on that case.

Subject (optional)	Height	Eye color
1	175	Blue
2	180	Blue
3	160	Hazel
4	190	Green
5	180	Green
6	150	Brown
7	140	Blue
8	160	Brown
…	…	…

Frequency form dataset layout

Each column is a variable (Eye color) and a separate column for the number of cases (Frequency), each row is a combination of categories with the frequency count.

Eye color	Frequency
Brown	221
Blue	215
Hazel	93
Green	64

Note: Examples of dataset layouts are included in the Statistical Reference Guide.

Setting the measurement scale of a variable

Set the measurement scale of a variable so data analysis can use the variable correctly.

Note: If you have the analysis task pane open, you can quickly set the measurement scale of a variable. In the analysis task pane, click the drop-down menu button alongside the variable, and then click the measurement scale.

Activate the dataset worksheet.
On the Analyse-it ribbon tab, in the Dataset group, click Dataset.
The dataset task pane opens.
In the Variables list, select the variables:
- To select a single variable, click the variable.
- To select multiple variables, click the first variable then hold down the CTRL key and click each additional variable.
- To select a range of variables, click the first variable then hold down the SHIFT key and click the last variable in the range.
The cells of the selected variables highlight on the worksheet. You can also select cells on the worksheet to select the corresponding variables in the list.
Click Nominal, Ordinal, Interval, Ratio, or Absolute.
Click Apply.
The changes are saved and will be used in future analyses.

Setting the minimum, maximum and units of a variable

Set the minimum/maximum and units so that the axes of charts are scaled to your preferences.

Note: If you do not set a scale then the chart axis scale will be determined automatically by the data values. If you do specify the scale, but the data values are outside the minimum/maximum then axes are scaled to the next unit. If the unit is too small to show with a label on the axis it is shown as a minor tick mark.

Note: If you have the analysis task pane open you can quickly change the scaling for a variable: right-click over the variable, or click the drop-down menu button alongside the variable selector, and then click Scale.

Activate the dataset worksheet.
On the Analyse-it ribbon tab, in the Dataset group, click Dataset.
The dataset task pane opens.
In the Variable list, select the variables:
- To select a single variable, click the variable.
- To select multiple variables, click the first variable then hold down the CTRL key and click each additional variable.
- To select a range of variables, click the first variable then hold down the SHIFT key and click the last variable in the range.
The cells of the selected variables highlight on the worksheet. You can also select cells on the worksheet to select the corresponding variables in the list.
If the measurement scale is not already set, click Interval, Ratio, or Absolute.
In the MinimumMaximum and Units edit boxes, optionally, type the value for each. You can provide a minimum value only, a maximum value only, or a minimum, maximum and units. If you provide units then the range (the maximum - minimum) must be a factor of the units (the range must be divisible by the units).
Click OK.
Any future or updated analysis will now use the preferred scale where possible.

Ordering categorical data

Set the order of categories for data analysis and output in the analysis report.

Activate the dataset worksheet.
On the Analyse-it ribbon tab, in the Dataset group, click Dataset.
The dataset task pane opens.
In the Variables list, select the variable.
Click Nominal or Ordinal to set the measurement scale.
The Categories grid shows the numeric coding and labels used in the data.
Clear the Sort Categories check box to allow you to reorder the categories manually.
Arrange the categories into order:
- Using drag and drop, click and hold the mouse button down over a category until the drag insertion cursor appears, drag it to reposition it in the list and then release the mouse button.
- Using the arrows, click a category to select it and then use the up and down arrow buttons to reposition it in the list.
Click Apply.
The changes are saved and will be used in future analyses.

Setting the number format of a variable

Set the precision of the data so that the results of the analysis show to a suitable number of decimal places.

Note: If you do not set the number format (or you use the Excel General number format), the number of decimal places is determined by the data value with the most decimal places.

Activate the dataset worksheet.
On the Analyse-it ribbon tab, in the Dataset group, click Dataset.
The dataset task pane opens.
In the Variable list, select the variables:
- To select a single variable, click the variable.
- To select multiple variables, click the first variable then hold down the CTRL key and click each additional variable.
- To select a range of variables, click the first variable then hold down the SHIFT key and click the last variable in the range.
The cells of the selected variables highlight on the worksheet. You can also select cells on the worksheet to select the corresponding variables in the list.
Click Number format.
In the Format Cells window, click Number, set the number of Decimal places, and then click OK.
Click OK.
The data is re-formatted.

Labeling cases

Set a variable to use for labeling data points in plots for easy identification.

Ensure that the variable containing the labels is the first (leftmost) column in the dataset.

Activate the dataset worksheet.
On the Analyse-it ribbon tab, in the Dataset group, click Dataset.
The dataset task pane opens.
Select the Labels in first column check box.
Click Apply.
The changes are saved and will be used in future analyses.

Assigning labels to categories

Assign labels to a numerical coding to make an analysis report more readable.

Activate the dataset worksheet.
On the Analyse-it ribbon tab, in the Dataset group, click Dataset.
The dataset task pane opens.
In the Variables list, select the variable.
If the measurement scale is not set, click Nominal or Ordinal.
The Categories grid shows the numeric coding and labels used in the data.
In the Name column alongside each coding, type the label.

Note: Names must be unique.
Click Apply.
The changes are saved and will be used in future analyses.

Assigning colors/symbols to categories

Assign colors or symbols to highlight observations on plots.

Activate the dataset worksheet.
On the Analyse-it ribbon tab, in the Dataset group, click Dataset.
The dataset task pane opens.
In the Variables list, select the variable.
If the measurement scale is not set, click Nominal or Ordinal.
The Categories grid shows the codings used in the data.
Select the Assign colors/symbols check box.
The Categories grid shows the Symbol and Color columns.
Optional: Click a cell in the Color column and select a preset color.
Optional: Click a cell in the Symbol column and select a symbol.
Repeat steps 6 through 7 for each coding.

Note: Any categories without an assigned symbol use a square, and any without an assigned color use black. You can assign symbols and colors just to the categories you want to highlight. The others use the default symbol and color.
Click Apply.
The changes are saved and will be used in future analyses.

Transforming variables

Transform a variable to normalize, shift, scale or otherwise change the shape of the distribution so that it meets the assumptions of a statistical test.

Activate the dataset worksheet.
In an empty column adjoining the dataset, enter the transformation function.
For example =LOG(A1)
Double-click on the drag handle at the bottom right of the active cell to copy the formula down the entire column.

Transform functions

Useful Microsoft Excel functions for transforming data.

Transform	Excel function
Natural log	=LN(cell)
Inverse natural log	=EXP(cell)
Reciprocal	=1 / cell
Square root	=SQRT(cell)
Power	=cell^power
Log base 10	=LOG(cell) or =LOG10(cell)
Inverse log base 10	=10^cell
Log base 2	=LOG(cell, 2)
Inverse log base 2	=2^cell

Analyzing a subset of the data

Use an Excel filter to show only the rows that match a value or meet a condition. Rows that do not match the condition are hidden temporarily, and subsequent data analysis excludes them. This feature is extremely useful when you want to see how the analysis changes for a different subset of the data.

Note: Prior to version 5.10 the active filter is not saved with the analysis. When you recalculate an analysis the rows currently visible are analyzed, regardless of the filter you used when creating the analysis. From version 5.10 onwards, by default the filter criteria are saved with the analysis and re-applied when you recalculate or make other changes to the analysis. The drop-down arrow next to the Filter command on the ribbon of the active analysis lets you change between Use active filter which always uses the currently active filter when recalculating (pre version 5.10 behavior) and Save Filter & Re-apply (version 5.10 or later behavior) which saves the filter when the analysis is calculated and re-applies it on subsequent recalculation.

On the Analyse-it ribbon tab, click Filter to show filter icons on the active dataset.
If you are on an analysis report, the saved filter (if any) will be applied to the dataset. You can modify the filter and return to the analysis and click Recalculate to apply the new filter, or you can start a new analysis to reuse the filter criteria.
Click the drop-down button alongside a variable, and then either select the values to match, or apply other criteria.
The save filter feature in version 5.10 or later works with almost all filter conditions: filter by value, by condition, by color/fill/icon, above/below average, top/bottom items, and so on. There is one limitation: filtering to multiple dates. If you filter to more than 2 dates, add-ins such as Analyse-it cannot retrieve the list of values to filter to. This is a limitation of Microsoft Excel and unfortunately it hasn’t been remedied in the last few years so it is low on Microsoft’s priorities. If this is seriously affecting you, please get in touch with Microsoft and ask them to resolve this limitation for add-in developers.
Create or recalculate an analysis to see the effect of the filter.

Examples

To filter to Female cases only, click the drop-down button alongside Sex, and then select Female in the list of values to match against:

To filter to cases where the subject is older than 40, click the drop-down button alongside Age, and then click Number Filters > Greater Than, and then enter 40 as the cut-off point: