Interrater Reliability Estimation via Maximum Likelihood for Gwet's Chance Agreement Model
1. Introduction
Interrater reliability (IRR) statistics, also known as "kappa" ($\kappa$) statistics, are used to measure agreement between two raters or coders classifying items into mutually exclusive categories. $\kappa$ statistics are widely used in fields such as psychology and medicine to evaluate the reliability or consistency of expert judgments [1].
Simply calculating the percentage of cases where raters agree does not account for the possibility that some agreement occurs by chance.
$\kappa$ is designed to measure the degree of agreement between raters beyond what is expected by chance. Assume two raters independently classify $N$ cases into categories + and −, and denote by $N_a$ the number of cases on which they agree. Assume $N_c$ of these agreements occur by chance and the rest, $N_a - N_c$, are due to knowledge (not due to chance). The number of cases remaining after subtracting chance agreements is $N - N_c$. Thus the proportion of observed agreement in excess of chance agreement is:

$$\kappa = \frac{N_a - N_c}{N - N_c} = \frac{P_a - P_e}{1 - P_e},$$

where $P_a = N_a/N$ denotes the observed percent agreement and $P_e = N_c/N$ is the percent agreement due to chance. $P_a$ is observed, whereas $P_e$ must be estimated.
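As a small worked example of this definition, the following Python sketch (with hypothetical values for $P_a$ and $P_e$) computes $\kappa$ from an observed agreement rate and a chance agreement rate:

```python
def kappa(p_a: float, p_e: float) -> float:
    """Generic kappa: observed agreement p_a corrected for chance agreement p_e."""
    return (p_a - p_e) / (1.0 - p_e)

# Hypothetical example: raters agree on 90% of cases; 50% agreement expected by chance.
print(kappa(0.90, 0.50))  # -> 0.8
```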
Several approaches have been proposed to estimate the probability of chance agreement. The approach used most commonly in the past (Cohen's $\kappa_C$) has recently fallen under criticism [2] [3], leading to a new approach (Gwet's $\kappa_G$) that has gained popularity over the past several years [1] [4]-[7]. However, we show that the new approach is biased, and we demonstrate an unbiased approach to estimating $P_e$ based on maximum likelihood estimation.
2. Cohen’s Kappa and Its Limitations
Historically, the most commonly used $\kappa$ statistic has been Cohen's $\kappa_C$ [8] [9], which quantifies interrater reliability for two raters applying binary ratings. Other approaches are discussed at length in [10]-[12].
Cohen proposed calculating the probability of chance agreement $P_e$ based on an "always guess" model. Suppose two raters A and B independently assign $N$ items to two categories, + and −. Let the numbers of items assigned to each category be $N_{A+}$, $N_{A-}$, $N_{B+}$, $N_{B-}$, and the number of items on which the raters agree be $N_a$. Now consider what percentage of cases raters A and B would be expected to agree on if they assigned the same numbers of items to each category as they do in the observed data, but made the assignments at random ("guessing"). Under this model, A and B classify items as + with probabilities $P_{A+} = N_{A+}/N$ and $P_{B+} = N_{B+}/N$, and as − with probabilities $P_{A-} = N_{A-}/N$ and $P_{B-} = N_{B-}/N$. Any agreements under this model occur by chance, with probability

$$P_e^C = P_{A+} P_{B+} + P_{A-} P_{B-}.$$
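To make the "always guess" computation concrete, here is a minimal Python sketch of Cohen's $\kappa_C$ for two raters and binary '+'/'-' labels; the function name and example ratings are ours, purely for illustration:

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters and binary labels '+'/'-'."""
    n = len(ratings_a)
    p_a = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n  # observed agreement
    # Marginal rates of '+' for each rater.
    pa_plus = ratings_a.count('+') / n
    pb_plus = ratings_b.count('+') / n
    # Chance agreement under the 'always guess' model.
    p_e = pa_plus * pb_plus + (1 - pa_plus) * (1 - pb_plus)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical ratings illustrating the computation.
a = ['+', '+', '+', '-', '+', '-', '+', '+']
b = ['+', '+', '-', '-', '+', '-', '+', '+']
print(round(cohen_kappa(a, b), 3))
```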
Critiques of Cohen’s Model
Two main criticisms have been raised against Cohen's $\kappa_C$. First, Cohen's $\kappa_C$ produces "paradoxical" results under certain circumstances [2] [10] [11] [13]: high levels of observed agreement can accompany a low $\kappa_C$ value. This happens because Cohen's $P_e^C$ depends only on the rates of ratings in the data. Thus, if raters A and B score most cases as class +, it may be because they correctly recognize that most cases are +, yet Cohen's $\kappa_C$ cannot give credit for agreement due to expertise. This problem is most pronounced when the proportion of classes in the data deviates from 50% [12].

Second, some authors [12] [14] dispute the idea that $\kappa_C$ "takes into account" chance agreement. Truly doing this requires a realistic model of how chance affects rater decisions, and Cohen's "always guess" model is unrealistic as a description of how raters behave. For this reason $\kappa_C$ can be misleading in situations such as the diagnosis of rare diseases, where it tends to underestimate agreement on the rare category [15]. Cohen's $\kappa_C$ is thus considered an overly conservative measure of agreement [16].
3. Gwet’s Kappa: An Improved Model of Chance Agreement
Gwet proposed an alternative to Cohen's $\kappa_C$, which we call Gwet's $\kappa_G$ (also known as AC1, for Agreement Coefficient 1), that addresses the limitations discussed above [12]. Gwet's key contribution was a more realistic model of chance agreement, which we call the "occasional guessing" model. Because this model addresses the limitations of Cohen's $\kappa_C$, Gwet's $\kappa_G$ has been increasingly adopted in studies of IRR [1] [4]-[7]. However, as we show below, Gwet's $\kappa_G$ also has important limitations. Specifically, the formula Gwet proposed for estimating $P_e$ is biased.
3.1. The “Occasional Guessing” Model for Chance Agreement
Gwet suggested that a more realistic model for how chance agreement occurs is:
1) Cases are easy or hard. Raters always classify easy cases correctly, and for hard cases, they guess with equal probability. Thus, for hard cases, the probability of agreement is 1/2.
2) The fraction of hard cases is r.
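The following Python sketch simulates two raters under this occasional-guessing model; the function name and the parameter values (the guessing rate r, the number of cases, and the base rate of truly '+' easy cases, written q here) are illustrative assumptions, not values from the paper:

```python
import random

def simulate_occasional_guessing(n_cases, r, q, seed=0):
    """Simulate two raters under the occasional-guessing model.

    A fraction r of cases is 'hard': both raters guess '+'/'-' independently
    with probability 1/2 each. The remaining cases are 'easy': both raters
    report the true label, which is '+' with probability q.
    Returns two lists of ratings, one per rater.
    """
    rng = random.Random(seed)
    ratings_a, ratings_b = [], []
    for _ in range(n_cases):
        if rng.random() < r:                      # hard case: both raters guess
            ratings_a.append(rng.choice('+-'))
            ratings_b.append(rng.choice('+-'))
        else:                                     # easy case: both report the truth
            truth = '+' if rng.random() < q else '-'
            ratings_a.append(truth)
            ratings_b.append(truth)
    return ratings_a, ratings_b

# Hypothetical parameters: r = 0.4, q = 0.5, 10,000 cases.
a, b = simulate_occasional_guessing(10_000, r=0.4, q=0.5)
p_a = sum(x == y for x, y in zip(a, b)) / len(a)
print(p_a)   # should be close to 1 - r/2 = 0.8
```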
3.2. Theoretical Value of κ under the Occasional Guessing Model
Using this model, we can calculate the theoretical true value of $\kappa$, denoted $\kappa_T$. For any case evaluated by two raters, consider the following events: $A = \{\text{the raters agree}\}$ and $R = \{\text{the case is hard: the raters guess randomly}\}$. Then the probability of agreement due to chance (arising out of guessing) for any case is

$$P_e^T = P(A \cap R) = P(A \mid R)\,P(R) = \frac{r}{2}.$$

The overall probability of agreement is

$$P_a = P(A \mid R)\,P(R) + P(A \mid \bar{R})\,P(\bar{R}) = \frac{r}{2} + (1 - r) = 1 - \frac{r}{2}.$$

Thus, the expected proportion of beyond-chance agreement is

$$\kappa_T = \frac{P_a - P_e^T}{1 - P_e^T} = \frac{1 - r}{1 - r/2}.$$

We note that $r$ can also be expressed in terms of $\kappa_T$, as

$$r = \frac{1 - \kappa_T}{1 - \kappa_T/2}.$$
It is easy to check that $0 \le \kappa_T \le 1$ for $0 \le r \le 1$. Also, noting that $P_a = 1 - r/2$, we observe that for low and high values of $r$ we get $\kappa_T \to 1$ as $r \to 0$, and $\kappa_T \to 0$ as $r \to 1$.
Any estimate of $\kappa$ whose expected value deviates from the theoretical value $\kappa_T$ is said to be biased. We next consider Gwet's proposal for estimating $P_e$, and will show that it is biased in some important settings.
3.3. Gwet’s Formula for the Probability of Chance Agreement
Gwet proposed a formula for $P_e$ based on the following heuristic argument. Consider the random variable $X$ that equals 1 when a rater classifies a case as + and 0 otherwise. The variance of $X$ is $\pi(1-\pi)$, where $\pi$ is the average rate at which raters assign cases to the + category. The maximum possible variance for classification is reached when rating is done completely at random, with each category assigned with probability 1/2, in which case the variance is 1/4. Gwet suggested that a reasonable measure of the randomness with which raters choose the + category is the ratio of the observed choice variance to the maximal possible variance, i.e. $\pi(1-\pi)/(1/4)$, thus:

$$\hat{r}_G = 4\pi(1-\pi).$$

This leads to a chance agreement probability of

$$P_e^G = \frac{\hat{r}_G}{2} = 2\pi(1-\pi),$$

which can be substituted into the expression for $\kappa$ to give $\kappa_G = (P_a - P_e^G)/(1 - P_e^G)$.
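For comparison with the maximum likelihood estimator derived below, here is a minimal Python sketch of Gwet's binary $\kappa_G$ (AC1) following the formulas above; the function name and example ratings are hypothetical:

```python
def gwet_kappa(ratings_a, ratings_b):
    """Gwet's AC1 (binary kappa_G) for two raters and labels '+'/'-'."""
    n = len(ratings_a)
    p_a = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n   # observed agreement
    # Average rate at which the two raters assign '+'.
    pi = (ratings_a.count('+') + ratings_b.count('+')) / (2 * n)
    p_e = 2 * pi * (1 - pi)                                       # Gwet's chance agreement
    return (p_a - p_e) / (1 - p_e)

# Hypothetical ratings illustrating the computation.
a = ['+', '+', '-', '-', '+', '-', '+', '+']
b = ['+', '+', '-', '+', '+', '-', '+', '-']
print(round(gwet_kappa(a, b), 3))
```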
3.4. Gwet’s κ Is Biased
Gwet showed that, when considered from the point of view of the "occasional guessing" model of chance agreement, Cohen's $\kappa_C$ and several other well-known $\kappa$ and $\kappa$-like statistics for interrater agreement are biased, particularly at high levels of agreement [1] [12]. By contrast, Gwet's formula is accurate (nearly unbiased, i.e. $E[\kappa_G] \approx \kappa_T$) when agreement between raters is very high or very low, overcoming a key limitation of Cohen's $\kappa_C$ [1] [12]. This is easy to show. When agreement is high ($P_a \approx 1$), we have $\kappa_G \approx 1 \approx \kappa_T$ regardless of $P_e^G$. When agreement is low (both raters guessing all the time, $r = 1$), agreement occurs in approximately half the cases ($P_a \approx 1/2$), approximately half of the ratings are positive ($\pi \approx 1/2$), so $P_e^G \approx 1/2$ and $\kappa_G \approx 0 \approx \kappa_T$.
However, for intermediate levels of agreement, Gwet's formula is biased. We show this by expressing $\pi$ in terms of $r$, substituting into Gwet's formula for $P_e^G$, and comparing the result with the true value $P_e^T = r/2$. The proportion of + ratings is the sum of the proportion of + ratings on hard cases, $r/2$, and on easy cases, $(1-r)q$, where $q$ is the proportion of easy cases whose true rating is +. Thus $\pi = r/2 + (1-r)q$, and Gwet's formula gives $P_e^G = 2\pi(1-\pi)$. The deviation of Gwet's formula for $P_e$ from the true value $r/2$ is

$$P_e^G - P_e^T = 2\pi(1-\pi) - \frac{r}{2} = \frac{r}{2}(1-r) + 2(1-r)^2 q(1-q).$$

Note that this deviation is strictly positive for every $0 < r < 1$, whatever the value of $q$. Figure 1(A) & Figure 1(B) illustrate the bias and 95% confidence intervals for two raters scoring $N$ cases, over the entire range of possible true values $\kappa_T$ of the underlying IRR.
Figure 1. (A) True $\kappa_T$ vs. Gwet's $\kappa_G$. (B) Bias of Gwet's $\kappa_G$ relative to $\kappa_T$. (C) True $\kappa_T$ vs. the maximum likelihood estimate $\kappa_{ML}$. (D) Bias of $\kappa_{ML}$ relative to $\kappa_T$.
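The size of this bias can be checked numerically. The self-contained Python sketch below (with hypothetical parameter values and function names of our own) evaluates the large-sample limit of Gwet's $\kappa_G$ under the occasional-guessing model by plugging the expected values $P_a = 1 - r/2$ and $\pi = r/2 + (1-r)q$ into Gwet's formulas, and compares the result with $\kappa_T$:

```python
def kappa_true(r):
    """Theoretical kappa under the occasional-guessing model (binary case)."""
    return (1 - r) / (1 - r / 2)

def kappa_gwet_limit(r, q):
    """Large-sample limit of Gwet's kappa when the data follow the
    occasional-guessing model with guessing rate r and easy-case base rate q."""
    p_a = 1 - r / 2                      # overall agreement probability
    pi = r / 2 + (1 - r) * q             # expected proportion of '+' ratings
    p_e_gwet = 2 * pi * (1 - pi)         # Gwet's chance-agreement formula
    return (p_a - p_e_gwet) / (1 - p_e_gwet)

# Hypothetical parameter values spanning low, intermediate, and high guessing rates.
for r in (0.1, 0.5, 0.9):
    kt = kappa_true(r)
    kg = kappa_gwet_limit(r, q=0.5)
    print(f"r={r:.1f}  kappa_T={kt:.3f}  kappa_G(limit)={kg:.3f}  bias={kg - kt:+.3f}")
```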
4. Maximum Likelihood Estimation of $P_e$
Here we present a direct approach to estimating $P_e$ in Gwet's occasional guessing model. Unlike Gwet's $P_e^G$, the ML estimate $P_e^{ML}$ is not based on a heuristic approximation. Rather, we derive $P_e^{ML}$ by writing down the likelihood of the observed data under the occasional guessing model and then solving for the value of $r$ that maximizes that likelihood.
Let $a_1, \ldots, a_N$ represent the agreements and disagreements for the $N$ cases, where $a_i = 0$ indicates disagreement and $a_i = 1$ indicates agreement on case $i$. When event $R$ occurs (random guessing), we have $P(a_i = 1 \mid R) = P(a_i = 0 \mid R) = 1/2$. For easy cases, the raters are not guessing (i.e. $\bar{R}$ occurs), and we have $P(a_i = 1 \mid \bar{R}) = 1$ and $P(a_i = 0 \mid \bar{R}) = 0$. The probability that the raters guess is $P(R) = r$. The probabilities for $a_i$ conditional on $r$ are

$$P(a_i = 1 \mid r) = \frac{r}{2} + (1 - r) = 1 - \frac{r}{2}, \qquad P(a_i = 0 \mid r) = \frac{r}{2}.$$
The likelihood function for the data is $L(r) = \prod_{i=1}^{N} P(a_i \mid r)$, so the log-likelihood is $\ell(r) = \sum_{i=1}^{N} \log P(a_i \mid r)$. Splitting the sum into the $N_d$ terms in which the raters disagree ($a_i = 0$) and the $N_a$ terms in which they agree ($a_i = 1$), we get

$$\ell(r) = N_d \log\frac{r}{2} + N_a \log\!\left(1 - \frac{r}{2}\right).$$

Taking the derivative of $\ell$ with respect to $r$, setting it equal to zero, and solving, we get:

$$\hat{r}_{ML} = \frac{2 N_d}{N} = 2 P_d,$$

where $P_d = N_d/N$ is the observed proportion of disagreements. Note that $P_d = 1 - P_a$, the probability of disagreement.
This result makes sense: Given that the probability of agreement when raters guess is 1/2, the best estimate from the data of the number of times at least one rater was in fact guessing is twice the number of observed disagreements.
From the above calculation it follows that the estimated probability of agreement due to chance is

$$P_e^{ML} = \frac{\hat{r}_{ML}}{2} = \frac{N_d}{N} = P_d.$$
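A minimal Python sketch of the resulting estimator for binary ratings (the function name and example ratings are ours):

```python
def ml_kappa(ratings_a, ratings_b):
    """Maximum likelihood kappa under the occasional-guessing model (binary case).

    r_hat  = 2 * (observed disagreement rate)
    P_e_ML = r_hat / 2 = observed disagreement rate
    """
    n = len(ratings_a)
    n_d = sum(x != y for x, y in zip(ratings_a, ratings_b))   # number of disagreements
    p_d = n_d / n                                             # observed disagreement rate
    p_a = 1 - p_d                                             # observed agreement rate
    r_hat = 2 * p_d                                           # ML estimate of the guessing rate
    p_e_ml = r_hat / 2                                        # ML chance-agreement estimate
    return (p_a - p_e_ml) / (1 - p_e_ml)

# Hypothetical ratings illustrating the computation.
a = ['+', '+', '-', '-', '+', '-', '+', '+', '+', '-']
b = ['+', '+', '-', '+', '+', '-', '+', '+', '-', '-']
print(round(ml_kappa(a, b), 3))
```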
$P_e^{ML}$ Is Unbiased
We now show that the expected value of the ML estimator for $P_e$ is equal to the theoretical value, hence $P_e^{ML}$ is an unbiased estimator of $P_e^T$.

Recall that $P_e^{ML} = N_d/N$ is the probability of chance agreement used in calculating $\kappa_{ML}$, where $N_d$ is the number of disagreements observed between the two raters performing binary assignments. We can rewrite this as $P_e^{ML} = \frac{1}{N}\sum_{i=1}^{N}(1 - a_i)$, since $a_i = 0$ denotes disagreement and $a_i = 1$ agreement. Thus,

$$E\!\left[P_e^{ML}\right] = \frac{1}{N}\sum_{i=1}^{N} E[1 - a_i] = \frac{1}{N}\sum_{i=1}^{N} P(a_i = 0) = \frac{1}{N}\cdot N\cdot\frac{r}{2} = \frac{r}{2}.$$

Consequently,

$$E\!\left[P_e^{ML}\right] = \frac{r}{2} = P_e^T,$$

so $P_e^{ML}$ is unbiased under the occasional guessing model.
Figure 1(C) & Figure 1(D) illustrate the estimation of $\kappa_T$ by $\kappa_{ML}$ for $N$ cases scored by two raters, including bootstrap estimates of the 95% confidence intervals.
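The unbiasedness of $P_e^{ML}$ is also easy to confirm by simulation. The following self-contained Monte Carlo sketch (with hypothetical parameter values and a function name of our own) averages $P_e^{ML} = N_d/N$ over data sets generated from the occasional-guessing model and compares the average with $r/2$:

```python
import random

def mc_mean_pe_ml(r, n_cases=200, n_sims=2000, seed=0):
    """Monte Carlo mean of P_e_ML = N_d / N when data follow the
    occasional-guessing model with guessing rate r (binary categories)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        n_d = 0
        for _ in range(n_cases):
            # Hard cases (probability r): both raters guess, disagreeing half the time.
            # Easy cases: both raters give the true label, so they never disagree.
            if rng.random() < r and rng.choice('+-') != rng.choice('+-'):
                n_d += 1
        total += n_d / n_cases
    return total / n_sims

# Hypothetical parameters: the mean of P_e_ML should be close to r/2 = 0.2.
print(mc_mean_pe_ml(r=0.4))
```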
5. Variance of $\hat{r}_{ML}$
We now compute the variance of our estimate of $r$. The key step is computing the second moment of $\hat{r}_{ML}$. Since $\hat{r}_{ML} = \frac{2}{N}\sum_{i=1}^{N}(1 - a_i)$ and the $(1 - a_i)$ are independent Bernoulli variables with success probability $r/2$,

$$E\!\left[\hat{r}_{ML}^2\right] = \frac{4}{N^2}\left[N\cdot\frac{r}{2}\left(1 - \frac{r}{2}\right) + \left(N\cdot\frac{r}{2}\right)^2\right] = \frac{r(2 - r)}{N} + r^2.$$

Thus,

$$\operatorname{Var}\!\left(\hat{r}_{ML}\right) = E\!\left[\hat{r}_{ML}^2\right] - \left(E\!\left[\hat{r}_{ML}\right]\right)^2 = \frac{r(2 - r)}{N}.$$

Let $f(r) = \kappa_T(r) = \frac{1 - r}{1 - r/2}$. The maximum absolute value of the derivative of $f$ over $0 \le r \le 1$ is 2. Thus, for all $r$, $r'$ in $[0, 1]$:

$$\left| f(r) - f(r') \right| \le 2\,|r - r'|.$$

In other words, a confidence interval for $r$ translates into a confidence interval for $\kappa_T$ that is at most twice as wide. Confidence intervals can also be calculated numerically using bootstrapping, as shown in Figure 1.
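As an illustration of how these results might be used in practice, the sketch below computes $\hat{r}_{ML}$ and $\kappa_{ML}$, a plug-in standard error from $\operatorname{Var}(\hat{r}_{ML}) = r(2-r)/N$, and a percentile-bootstrap confidence interval for $\kappa_{ML}$. The data, the confidence level, and the helper names are hypothetical:

```python
import random

def kappa_from_r(r):
    """Kappa under the occasional-guessing model as a function of the guessing rate r."""
    return (1 - r) / (1 - r / 2)

def ml_estimates(agreements):
    """agreements: list of 0/1 indicators (1 = raters agree on that case)."""
    n = len(agreements)
    p_d = agreements.count(0) / n       # observed disagreement rate
    r_hat = 2 * p_d                     # ML estimate of the guessing rate
    return r_hat, kappa_from_r(r_hat)

def bootstrap_kappa_ci(agreements, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for kappa_ML."""
    rng = random.Random(seed)
    n = len(agreements)
    stats = sorted(ml_estimates([rng.choice(agreements) for _ in range(n)])[1]
                   for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical data: 100 cases with 12 disagreements.
agreements = [0] * 12 + [1] * 88
r_hat, kappa_hat = ml_estimates(agreements)
se_r = (r_hat * (2 - r_hat) / len(agreements)) ** 0.5   # plug-in SE of r_hat
print(f"r_hat={r_hat:.3f}  kappa_ML={kappa_hat:.3f}  SE(r_hat)={se_r:.3f}")
print("bootstrap 95% CI for kappa_ML:", bootstrap_kappa_ci(agreements))
```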
6. Multiple Categories
The preceding sections have dealt with the case of classifying into two categories. We can analogously derive $\kappa_T$ and $r$ in the case where there is instead an arbitrary number, $n$, of classes. To do this, we generalize the "occasional guessing" model so that, for hard cases, raters guess among all $n$ classes with equal probability. Under this model, the probability of agreement by guessing is

$$P_e^T = P(A \mid R)\,P(R) = \frac{r}{n},$$

and the overall probability of agreement is

$$P_a = \frac{r}{n} + (1 - r) = 1 - r\,\frac{n-1}{n}.$$

Now, to find the theoretical κ in terms of $r$,

$$\kappa_T = \frac{P_a - P_e^T}{1 - P_e^T} = \frac{1 - r}{1 - r/n}.$$
Next we derive the ML estimator of $r$. Let $a_1, \ldots, a_N$ represent the agreements and disagreements for the $N$ cases, where $a_i = 0$ indicates disagreement and $a_i = 1$ indicates agreement. When event $R$ occurs (random guessing), we have $P(a_i = 1 \mid R) = 1/n$ and $P(a_i = 0 \mid R) = 1 - 1/n$. When neither rater guesses (i.e. event $\bar{R}$ occurs), we have $P(a_i = 1 \mid \bar{R}) = 1$ and $P(a_i = 0 \mid \bar{R}) = 0$. The probability that the raters guess randomly is $P(R) = r$. The probabilities for $a_i$ conditional on $r$ are

$$P(a_i = 1 \mid r) = \frac{r}{n} + (1 - r) = 1 - r\,\frac{n-1}{n}, \qquad P(a_i = 0 \mid r) = r\,\frac{n-1}{n}.$$
Now, to find $\hat{r}_{ML}$, we maximize the likelihood function for the data, $L(r) = \prod_{i=1}^{N} P(a_i \mid r)$, or equivalently the log-likelihood $\ell(r) = \sum_{i=1}^{N} \log P(a_i \mid r)$. Splitting the sum into the $N_d$ terms with $a_i = 0$ and the $N_a$ terms with $a_i = 1$, we get

$$\ell(r) = N_d \log\!\left(r\,\frac{n-1}{n}\right) + N_a \log\!\left(1 - r\,\frac{n-1}{n}\right).$$

Taking the derivative with respect to $r$, setting it equal to zero, and solving, we get:

$$\hat{r}_{ML} = \frac{n}{n-1}\,\frac{N_d}{N} = \frac{n}{n-1}\,P_d,$$

where $P_d = N_d/N$.
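A minimal Python sketch of the multi-category estimator (the example labels and function name are ours):

```python
def ml_kappa_multiclass(ratings_a, ratings_b, n_classes):
    """Maximum likelihood kappa under the occasional-guessing model with
    n_classes equally likely guesses on hard cases.

    r_hat = (n / (n - 1)) * P_d, and P_e_ML = r_hat / n.
    """
    n_cases = len(ratings_a)
    p_d = sum(x != y for x, y in zip(ratings_a, ratings_b)) / n_cases
    r_hat = n_classes / (n_classes - 1) * p_d
    p_a = 1 - p_d
    p_e_ml = r_hat / n_classes
    return (p_a - p_e_ml) / (1 - p_e_ml)

# Hypothetical three-class example.
a = ['x', 'y', 'z', 'x', 'x', 'y', 'z', 'x', 'y', 'z']
b = ['x', 'y', 'z', 'x', 'y', 'y', 'z', 'x', 'y', 'x']
print(round(ml_kappa_multiclass(a, b, n_classes=3), 3))
```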
7. Conclusions
We have presented a maximum likelihood approach to estimating the chance agreement probability $P_e$ in Gwet's "occasional guessing" model of interrater agreement. Our estimator, $P_e^{ML}$, is derived directly from the likelihood function of the data under this model, rather than relying on heuristic approximations as in Gwet's $\kappa_G$.
We have shown that the maximum likelihood estimator $\hat{r}_{ML}$ for the probability of guessing $r$ is simply twice the observed disagreement rate between raters. Consequently, the chance agreement probability estimate $P_e^{ML}$ used in $\kappa_{ML}$ is the observed disagreement rate. We have also generalized this result to the case of raters scoring cases that can belong to an arbitrary number of classes.
A key advantage of $P_e^{ML}$ is that it is an unbiased estimator of the true value of $P_e$ predicted by the occasional guessing model. In contrast, we have demonstrated that Gwet's formula for $P_e$, while overcoming certain limitations of Cohen's $\kappa_C$, is itself biased for intermediate levels of agreement.
We have also provided the variance of the $\hat{r}_{ML}$ estimator, which can be used to construct confidence intervals. The variance depends on both the true value of $r$ and the sample size $N$, decreasing as $N$ increases, as expected for a consistent estimator.
In summary, the maximum likelihood estimator $P_e^{ML}$ provides a principled approach to estimating chance agreement in the occasional guessing model, addressing limitations of previous $\kappa$ statistics. As the use of interrater reliability measures continues to grow across fields, having an unbiased estimator is important for obtaining reliable inferences from data.
Author Contributions
Conceptualization, A.M.W., T.M.W. and M.B.W.; methodology, A.M.W., T.M.W. and M.B.W.; writing—original draft preparation, A.M.W., T.M.W. and M.B.W.; writing—review and editing, A.M.W., T.M.W. and M.B.W.; supervision, M.B.W. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The code that supports the findings of this study is available from the corresponding author upon request.
Abbreviations
The following abbreviations are used in this manuscript:
IRR | Interrater reliability
ML | Maximum likelihood
AC1 | Agreement Coefficient 1