Kappa & Weighted Kappa inter-rater agreement

Qualitative uses Kappa to compare two qualitative methods, a test method against a reference/comparative method, to determine accuracy. It is often called the Kappa test for inter-rater agreement since its most common use is to compare the scores of two raters.

The requirements of the test are:

Two methods measured on a nominal or ordinal scale.
Observations must be classified using the same groups.

Arranging the dataset

Data in existing Excel worksheets can be used and should be arranged in a List dataset layout containing two nominal or ordinal scale variables. If only a summary of the number of subjects for each combination of groups is available (contingency table) then a 2-way table dataset containing counts can be used.
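To illustrate how the two layouts relate, the following Python sketch collapses list-layout data (one row per subject) into a 2-way table of counts. The data, the group labels and the function name cross_tabulate are hypothetical and are not part of Analyse-it.

```python
from collections import Counter

def cross_tabulate(reference, test, groups):
    """Count the observations for each (reference, test) combination of groups."""
    counts = Counter(zip(reference, test))
    # Rows follow the reference method, columns follow the test method.
    return [[counts[(r, c)] for c in groups] for r in groups]

# Hypothetical list-layout data: one entry per subject for each method.
reference = ["Pos", "Pos", "Neg", "Neg", "Pos", "Neg"]
test      = ["Pos", "Neg", "Neg", "Neg", "Pos", "Pos"]

table = cross_tabulate(reference, test, groups=["Pos", "Neg"])
print(table)
```

Both methods must be classified using the same groups, which is why a single groups list defines the rows and the columns.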

When entering new data we recommend using New Dataset to create a new 2 variables (categorical) dataset or R x C contingency table ready for data entry.

Using the test

To start the test:

Excel 2007:
Select any cell in the range containing the dataset to analyse, then click Comparison on the Analyse-it tab, then click Qualitative.

Excel 97, 2000, 2002 & 2003:
Select any cell in the range containing the dataset to analyse, then click Analyse on the Analyse-it toolbar, click Comparison then click Qualitative.

Click Reference/Comparative method and Test method and select the methods to compare.
Enter Confidence level to calculate for the Kappa statistic. The level should be entered as a percentage between 50 and 100, without the % sign.
Click OK to run the test.

The report shows the number of observations analysed, and, if applicable, how many missing cases were listwise deleted.
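Listwise deletion means that a subject is dropped whenever either method has no value for it. A minimal Python sketch, assuming missing values are represented as None (a hypothetical convention chosen for this example, as is the function name listwise_delete):

```python
def listwise_delete(reference, test):
    """Keep only the pairs where both methods have a value."""
    kept = [(r, t) for r, t in zip(reference, test)
            if r is not None and t is not None]
    n_deleted = len(reference) - len(kept)
    return kept, n_deleted

# Hypothetical data with two incomplete cases.
reference = ["Pos", None, "Neg", "Pos"]
test      = ["Pos", "Neg", None, "Neg"]

kept, n_deleted = listwise_delete(reference, test)
print(len(kept), "observations analysed,", n_deleted, "deleted")
```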

The number of observations cross-classified by the two factors is shown as a contingency table. The cells on the main diagonal (from top-left to bottom-right) show the number of observations in agreement. Those off the diagonal show disagreement.

The agreement is shown along with the agreement expected by chance alone.
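The two quantities can be sketched in Python as follows. Observed agreement is the proportion of observations on the main diagonal, and chance agreement is computed from the row and column totals. The table of counts and the function name agreement are hypothetical.

```python
def agreement(table):
    """Return (observed, expected-by-chance) agreement for a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Observed agreement: proportion of observations on the main diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the marginal proportions for each group.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return observed, expected

# Hypothetical counts: rows are the reference method, columns the test method.
table = [[20, 5],
         [10, 15]]
observed, expected = agreement(table)
print(observed, expected)
```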

The Kappa statistic measures the degree of agreement between the methods above that expected by chance alone. It has a maximum of 1 when agreement is perfect, 0 when agreement is no better than chance, and negative values when agreement is worse than chance. Other values can be roughly interpreted as:

Kappa statistic	Agreement
< 0.20	Poor
0.20 to < 0.40	Fair
0.40 to < 0.60	Moderate
0.60 to < 0.80	Good
0.80 to 1	Very good
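As a worked illustration, the Kappa statistic is the observed agreement in excess of chance, divided by the maximum possible excess: (observed - expected) / (1 - expected). The Python sketch below computes it for a hypothetical table of counts and looks up the rough interpretation from the table above. The function names cohen_kappa and interpret are our own for this example.

```python
def cohen_kappa(table):
    """Unweighted Kappa for a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = sum(table[i][i] for i in range(k)) / n
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (observed - expected) / (1 - expected)

def interpret(kappa):
    """Rough verbal interpretation, using the upper limits in the table above."""
    for upper, label in [(0.20, "Poor"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Good")]:
        if kappa < upper:
            return label
    return "Very good"

# Hypothetical counts: rows are the reference method, columns the test method.
table = [[22, 3],
         [8, 17]]
kappa = cohen_kappa(table)
label = interpret(kappa)
print(round(kappa, 2), label)
```

Here the observed agreement is 0.78 and the chance agreement is 0.50, so Kappa is 0.28 / 0.50 = 0.56.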

A confidence interval is shown, which is the range in which the true population kappa statistic is likely to lie at the given confidence level.

The hypothesis test is shown. The p-value is the probability of observing agreement at least as strong as that in the sample when the null hypothesis, that agreement between the methods is no better than chance, is in fact true. A significant p-value implies that the agreement between the methods is unlikely to be due to chance alone.

METHOD The p-value and confidence interval are calculated using the method of Fleiss (see [2]). Two standard errors are shown: SE0 is the standard error for testing the kappa statistic against the hypothesis that the kappa statistic equals 0, and SE is the standard error for testing the kappa statistic against any other hypothesised value and for calculating the confidence interval.
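The following Python sketch shows one way the two standard errors could be computed for the unweighted Kappa statistic. It uses the large-sample formulas commonly attributed to Fleiss, Cohen and Everitt. That these match the calculation in Analyse-it exactly, and that the p-value is two-sided, are assumptions. The function name kappa_inference and the table of counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def kappa_inference(table, confidence=95.0):
    """Return (kappa, SE0, SE, confidence interval, p-value) for a table of counts."""
    k = len(table)
    n = sum(sum(r) for r in table)
    p = [[cell / n for cell in r] for r in table]
    row = [sum(p[i]) for i in range(k)]                       # row proportions
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]  # column proportions
    po = sum(p[i][i] for i in range(k))
    pe = sum(row[i] * col[i] for i in range(k))
    kappa = (po - pe) / (1 - pe)

    # SE0: standard error under the null hypothesis that kappa equals 0.
    s0 = pe + pe ** 2 - sum(row[i] * col[i] * (row[i] + col[i]) for i in range(k))
    se0 = sqrt(s0) / ((1 - pe) * sqrt(n))

    # SE: standard error used for the confidence interval.
    a = sum(p[i][i] * (1 - (row[i] + col[i]) * (1 - kappa)) ** 2 for i in range(k))
    b = (1 - kappa) ** 2 * sum(p[i][j] * (col[i] + row[j]) ** 2
                               for i in range(k) for j in range(k) if i != j)
    c = (kappa - pe * (1 - kappa)) ** 2
    se = sqrt(a + b - c) / ((1 - pe) * sqrt(n))

    # Confidence level is a percentage, as entered in the dialog box.
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 200)
    ci = (kappa - z_crit * se, kappa + z_crit * se)

    # Two-sided p-value from the normal approximation (an assumption).
    z = kappa / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return kappa, se0, se, ci, p_value

# Hypothetical counts: rows are the reference method, columns the test method.
kappa, se0, se, ci, p_value = kappa_inference([[20, 5], [10, 15]])
print(round(kappa, 3), round(se0, 4), round(se, 4), ci, p_value)
```

For a 2 x 2 table the squared test statistic (kappa / SE0)² equals the Pearson chi-square statistic without continuity correction, which gives a quick check of the SE0 calculation.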

Applying weights to the disagreements

A weakness of the standard Kappa statistic is that it takes no account of the degree of disagreement: all disagreements are treated equally. When the methods are measured on an ordinal scale it may be preferable to give different weights to the disagreements depending on their magnitude.

To calculate Kappa with weights for the disagreements:

If the Kappa test dialog box is not visible, click Edit on the Analyse-it tab/toolbar.

Click Weights then select Linear to weight the disagreements based on the distance between the two ordinal groups (= 1 - |i - j| / (k - 1)), or Quadratic to weight them based on the squared distance between the groups (= 1 - (i - j)² / (k - 1)²), where i and j are the indices of the groups and k is the number of groups.

Click OK.

The Weighted Kappa statistic measures the degree of agreement, taking the weights into account, and can be interpreted in the same way as the Kappa statistic above.
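The two weighting schemes and the resulting statistic can be sketched in Python as follows. Each cell of the table contributes to the observed and chance agreement in proportion to its weight, so cells on the diagonal count fully and cells furthest from the diagonal count as zero. The function names weight_matrix and weighted_kappa and the table of counts are hypothetical.

```python
def weight_matrix(k, kind="linear"):
    """Agreement weights for k ordered groups, using the formulas given above."""
    if kind == "linear":
        return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    if kind == "quadratic":
        return [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    raise ValueError("kind must be 'linear' or 'quadratic'")

def weighted_kappa(table, weights):
    """Weighted Kappa for a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Weighted observed agreement and weighted agreement expected by chance.
    po = sum(weights[i][j] * table[i][j] / n for i in range(k) for j in range(k))
    pe = sum(weights[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    return (po - pe) / (1 - pe)

# Hypothetical counts for three ordered groups.
table = [[10, 2, 0],
         [3, 8, 2],
         [1, 2, 12]]
linear = weight_matrix(3, "linear")
kw = weighted_kappa(table, linear)
print(linear)
print(round(kw, 4))
```

With three groups the linear weights are 1 on the diagonal, 0.5 for adjacent groups and 0 for the two most distant groups. The quadratic scheme gives adjacent groups a weight of 0.75, so it penalises small disagreements less.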
