Interrater Reliability Estimation via Maximum Likelihood for Gwet’s Chance Agreement Model

Abstract

Interrater reliability (IRR) statistics, like Cohen’s kappa, measure agreement between raters beyond what is expected by chance when classifying items into categories. While Cohen’s kappa has been widely used, it has several limitations, prompting development of Gwet’s agreement statistic, an alternative “kappa” statistic that models chance agreement via an “occasional guessing” model. However, we show that Gwet’s formula for estimating the proportion of agreement due to chance is itself biased for intermediate levels of agreement, despite overcoming limitations of Cohen’s kappa at high and low agreement levels. We derive a maximum likelihood estimator for the occasional guessing model that yields an unbiased estimator of the IRR, which we call the maximum likelihood kappa ($\kappa_{\mathrm{ML}}$). The key result is that the chance agreement probability under the occasional guessing model is simply equal to the observed rate of disagreement between raters. The $\kappa_{\mathrm{ML}}$ statistic provides a theoretically principled approach to quantifying IRR that addresses limitations of previous $\kappa$ coefficients. Given the widespread use of IRR measures, having an unbiased estimator is important for reliable inference across domains where rater judgments are analyzed.


Westover, A., Westover, T. and Westover, M. (2024) Interrater Reliability Estimation via Maximum Likelihood for Gwet’s Chance Agreement Model. Open Journal of Statistics, 14, 481-491. doi: 10.4236/ojs.2024.145021.

1. Introduction

Interrater reliability (IRR), or “kappa” ($\kappa$), statistics are used to measure agreement between two raters or coders classifying items into mutually exclusive categories. $\kappa$ statistics are widely used in fields such as psychology and medicine to evaluate the reliability or consistency of expert judgments [1].

Simply calculating the percentage of cases where raters agree does not account for the possibility that some agreement occurs by chance. $\kappa$ is designed to measure the degree of agreement between raters beyond what is expected by chance. Assume two raters independently classify $N$ cases into categories + and −, and denote by $N_a$ the number of cases on which they agree. Assume $N_c$ agreements occur by chance, and the remaining $N_k$ are due to knowledge (not due to chance), so that $N_a = N_c + N_k$. The number of cases remaining after subtracting chance agreements is $N - N_c$. Thus the proportion of the observed agreement $N_a$ in excess of chance agreement is:

$$\kappa = \frac{N_a - N_c}{N - N_c} = \frac{N_k}{N - N_c} = \frac{P_a - P_c}{1 - P_c},$$

where $P_a = N_a/N$ denotes the observed percent agreement, and $P_c = N_c/N$ is the percent agreement due to chance. $P_a$ is observed, whereas $P_c$ must be estimated.

Several approaches have been proposed to estimate the probability of chance agreement. The approach used most commonly in the past (Cohen’s $\kappa$) has recently come under criticism [2] [3], leading to a new approach (Gwet’s $\kappa$) that has gained popularity over the past several years [1] [4]-[7]. However, we show that the new approach is biased. We present an unbiased approach to estimating $\kappa$ based on maximum likelihood estimation.

2. Cohen’s Kappa and Its Limitations

Historically, the most commonly used κ statistic has been Cohen’s κ [8] [9], which quantifies interrater reliability for two raters applying binary ratings. Other approaches are discussed at length in [10]-[12].

Cohen proposed calculating the probability of chance agreement $P_c$ based on an ‘always guess’ model. Suppose two raters A and B independently assign $N$ items to two categories, + and −. Let the numbers of items each rater assigns to each category be $N_{A+}$, $N_{A-}$, $N_{B+}$, $N_{B-}$, and let the number of items on which they agree be $N_a$. Now consider what percentage of cases raters A and B would be expected to agree on if they assigned the same numbers of items to each category as they do in the observed data, but made the assignments at random (“guessing”). Under this model, A and B classify items as + with probabilities $p_{A+} = N_{A+}/N$, $p_{B+} = N_{B+}/N$, and as − with probabilities $p_{A-} = N_{A-}/N$, $p_{B-} = N_{B-}/N$. Any agreements under this model occur by chance, with probability

$$P_c = p_{A+}\, p_{B+} + p_{A-}\, p_{B-}.$$
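For concreteness, a minimal Python sketch of this computation is given below. It is not from the original paper; the function and variable names are ours, and it assumes binary ratings supplied as two equal-length vectors.

```python
import numpy as np

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters; works for any pair of category labels."""
    a = np.asarray(ratings_a)
    b = np.asarray(ratings_b)
    p_a = np.mean(a == b)                      # observed agreement P_a
    categories = np.union1d(a, b)
    # chance agreement under the 'always guess' model: sum over categories of p_{A,k} * p_{B,k}
    p_c = sum(np.mean(a == k) * np.mean(b == k) for k in categories)
    return (p_a - p_c) / (1 - p_c)

# Example: two raters, 10 cases
ra = ['+', '+', '-', '+', '+', '-', '+', '+', '+', '-']
rb = ['+', '+', '-', '+', '-', '-', '+', '+', '+', '+']
print(cohens_kappa(ra, rb))
```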

Critiques of Cohen’s Model

Two main criticisms have been raised against Cohen’s $\kappa$. First, Cohen’s $\kappa$ produces “paradoxical” results under certain circumstances [2] [10] [11] [13]: high levels of observed agreement can accompany a low $\kappa$ value. This happens because Cohen’s estimate of chance agreement depends only on the marginal rates at which each rater uses the categories. Thus, if raters A and B score most cases as class +, it may be because they correctly recognize that most cases are +, yet Cohen’s $\kappa$ gives no credit for agreement due to expertise. This problem is most pronounced when the class proportions in the data deviate from 50% [12].

Second, some authors [12] [14] dispute the idea that κ “takes into account” chance agreement. Truly doing this requires a realistic model of how chance affects rater decisions; Cohen’s ‘always guess’ model is unrealistic as a model of how raters behave. For this reason κ can be misleading in situations such as the diagnosis of rare diseases. In these scenarios, κ tends to underestimate agreement on the rare category [15]. κ is thus considered an overly conservative measure of agreement [16].

3. Gwet’s Kappa: An Improved Model of Chance Agreement

Gwet proposed an alternative to Cohen’s $\kappa$, which we call Gwet’s $\kappa$ (also known as AC1, the Agreement Coefficient 1), that addresses the limitations discussed above [12]. Gwet’s key contribution was a more realistic model of chance agreement, $P_c$, which we call the “occasional guessing” model. Because this model addresses the limitations of Cohen’s $\kappa$, Gwet’s $\kappa$ has been increasingly adopted in studies of IRR [1] [4]-[7]. However, as we show below, Gwet’s $\kappa$ also has an important limitation: the formula Gwet proposed for estimating $\kappa$ is biased.

3.1. The “Occasional Guessing” Model for Chance Agreement

Gwet suggested that a more realistic model for how chance agreement occurs is the following (a simulation sketch of this model is given after the list):

1) Cases are either easy or hard. Raters always classify easy cases correctly; for hard cases, they guess each of the two categories with equal probability. Thus, for hard cases, the probability that the two raters agree is 1/2.

2) The fraction of hard cases is r.
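The following simulation sketch (ours, not part of the original paper) generates ratings from the occasional guessing model; the parameters n_cases, r, and q denote the number of cases, the fraction of hard cases, and the fraction of easy cases whose true label is +, respectively.

```python
import numpy as np

def simulate_ratings(n_cases, r, q=0.5, seed=0):
    """Simulate two raters under the occasional guessing model.

    r: fraction of hard cases (both raters guess uniformly at random on these)
    q: fraction of easy cases whose true label is '+'
    Returns two rating vectors of '+'/'-' labels."""
    rng = np.random.default_rng(seed)
    hard = rng.random(n_cases) < r
    truth = np.where(rng.random(n_cases) < q, '+', '-')
    # easy cases: both raters report the truth; hard cases: independent coin flips
    rater_a = np.where(hard, rng.choice(['+', '-'], n_cases), truth)
    rater_b = np.where(hard, rng.choice(['+', '-'], n_cases), truth)
    return rater_a, rater_b

a, b = simulate_ratings(100_000, r=0.4)
print("observed agreement P_a:", np.mean(a == b))   # should be close to 1 - r/2 = 0.8
```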

3.2. Theoretical Value of κ under the Occasional Guessing Model

Using this model, we can calculate the theoretical true value of $\kappa$, denoted $\kappa^*$. For any case evaluated by two raters, consider the following events: $A$ = {the raters agree} and $R$ = {the case is hard, so the raters guess randomly}. Then the probability of agreement due to chance (i.e., arising from guessing) for any case is

$$P_c = P(A, R) = P(R)\,P(A \mid R) = r/2.$$

The overall probability of agreement is

$$P_a = P(A) = P(A, R) + P(A, \bar{R}) = P(R)\,P(A \mid R) + P(\bar{R})\,P(A \mid \bar{R}) = r/2 + (1 - r) = 1 - r/2.$$

Thus, the expected proportion of beyond-chance agreement is

$$\kappa^* = \frac{P_a - P_c}{1 - P_c} = \frac{1 - r}{1 - r/2}.$$

We note that $r$ can also be expressed in terms of $\kappa^*$, as

$$r = \frac{1 - \kappa^*}{1 - \kappa^*/2}.$$

It is easy to check that $0 \le \kappa^* \le P_a$. Also, viewing $\kappa^*$ as a function of $r$, $\kappa^* = \kappa^*(r)$, we have $\kappa^*(0) = 1$ and $\kappa^*(1) = 0$ at the extremes.
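A short numeric check (ours) of the two formulas above: kappa_star evaluates $\kappa^*(r)$ and r_from_kappa inverts it.

```python
def kappa_star(r):
    """Theoretical kappa under the occasional guessing model (binary case)."""
    return (1 - r) / (1 - r / 2)

def r_from_kappa(k):
    """Invert kappa* back to the fraction of hard cases r."""
    return (1 - k) / (1 - k / 2)

for r in [0.0, 0.25, 0.5, 0.75, 1.0]:
    k = kappa_star(r)
    print(f"r = {r:.2f}  kappa* = {k:.3f}  recovered r = {r_from_kappa(k):.2f}")
```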

Any estimate of $\kappa$ whose expected value deviates from the theoretical value $\kappa^*$ is said to be biased. We next consider Gwet’s proposal for estimating $\kappa$ and show that it is biased in some important settings.

3.3. Gwet’s Formula for the Probability of Chance Agreement

Gwet proposed a formula for $r = P(R)$ based on the following heuristic argument. Consider the random variable

$$X_+ = \begin{cases} 1 & \text{if a rater classifies a given case as } + \\ 0 & \text{otherwise.} \end{cases}$$

The variance of $X_+$ is $\mathrm{Var}(X_+) = \pi_+(1 - \pi_+)$, where $\pi_+$ is the average rate at which raters assign cases to the “+” category. The maximum possible variance for a classification is reached when rating is done completely at random, with each category assigned with probability 1/2, in which case the variance is $\mathrm{Var}_{\max} = \tfrac{1}{2}\left(1 - \tfrac{1}{2}\right) = \tfrac{1}{4}$. Gwet suggested that a reasonable measure of the randomness with which raters choose the “+” category is the ratio of the observed variance to the maximal possible variance, i.e. $P(R) \approx \mathrm{Var}(X_+)/\mathrm{Var}_{\max}$, thus:

$$r = P(R) = \frac{\pi_+(1 - \pi_+)}{\tfrac{1}{2}\left(1 - \tfrac{1}{2}\right)} = 4\pi_+(1 - \pi_+).$$

This leads to a chance agreement probability of

$$P_c = r/2 = 2\pi_+(1 - \pi_+),$$

which can be substituted into $\kappa = (P_a - P_c)/(1 - P_c)$.
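A minimal Python sketch of Gwet’s $\kappa$ for the binary, two-rater case follows; it is our own illustration, not the authors’ code, and estimates $\pi_+$ as the average rate of + ratings across the two raters.

```python
import numpy as np

def gwet_kappa_binary(ratings_a, ratings_b, pos='+'):
    """Gwet's kappa (AC1) for two raters and two categories.

    Chance agreement is estimated as P_c = 2 * pi_plus * (1 - pi_plus),
    where pi_plus is the average rate of '+' ratings across both raters."""
    a = np.asarray(ratings_a)
    b = np.asarray(ratings_b)
    p_a = np.mean(a == b)                                  # observed agreement
    pi_plus = (np.mean(a == pos) + np.mean(b == pos)) / 2  # average '+' rate
    p_c = 2 * pi_plus * (1 - pi_plus)                      # Gwet's chance agreement
    return (p_a - p_c) / (1 - p_c)

ra = ['+', '+', '-', '+', '+', '-', '+', '+', '+', '-']
rb = ['+', '+', '-', '+', '-', '-', '+', '+', '+', '+']
print(gwet_kappa_binary(ra, rb))
```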

3.4. Gwet’s κ Is Biased

Gwet showed that, when considered from the point of view of the ‘occasional guessing’ model of chance agreement, Cohen’s $\kappa$ and several other well-known $\kappa$ and $\kappa$-like statistics for interrater agreement are biased, particularly at high levels of agreement [1] [12]. By contrast, Gwet’s formula is accurate (nearly unbiased, i.e. $\kappa \approx \kappa^*$) when agreement between raters is high or low, overcoming a key limitation of Cohen’s $\kappa$ [1] [12]. This is easy to show. When agreement is high, $P_a \approx 1$, we have $\kappa \approx (1 - P_c)/(1 - P_c) = 1$, regardless of $P_c$. When agreement is low (both raters guessing all the time, $r = 1$), agreement occurs in approximately half the cases, $P_a \approx 1/2$; approximately half of the ratings are positive, $\pi_+ \approx 1/2$; hence $P_c = 2(1/2)(1 - 1/2) = 1/2$ and $\kappa = (1/2 - 1/2)/(1 - 1/2) = 0$.

However, for intermediate levels of agreement, Gwet’s formula is biased. We show this by expressing $\pi_+$ in terms of $r$, substituting into Gwet’s formula for $P_c$, and comparing the result with the true value $P_c = r/2$. The proportion of + ratings is the sum of the proportion of + ratings on hard cases, $r/2$, and on easy cases, $(1 - r)q$, where $q \in [0, 1]$ is the proportion of easy cases whose true rating is +. Thus $\pi_+ = r/2 + (1 - r)q$, and Gwet’s formula gives $P_c = 2\left(r/2 + (1 - r)q\right)\left(1 - r/2 - (1 - r)q\right)$. The deviation of Gwet’s formula for $P_c$ from the true value $r/2$ is
$$\Delta P_c = 2\left(r/2 + (1 - r)q\right)\left(1 - r/2 - (1 - r)q\right) - r/2 = r/2 - r^2/2.$$

Note that this bias does not depend on $q$. Figure 1(A) and Figure 1(B) illustrate the bias and 95% confidence intervals for two raters scoring $N = 100$ cases, with $q = 0.2$, over the entire range of possible true values $\kappa^*$ of the underlying IRR.

Figure 1. (A) True $\kappa = \kappa^*$ vs. Gwet’s $\kappa$. (B) Bias (Gwet’s $\kappa - \kappa^*$). (C) $\kappa^*$ vs. $\kappa_{\mathrm{ML}}$. (D) Bias ($\kappa_{\mathrm{ML}} - \kappa^*$).

4. Maximum Likelihood Estimation of $P(R)$

Here we present a direct approach to estimating $P(R) = r$ in Gwet’s occasional guessing model. Unlike Gwet’s $\kappa$, the maximum likelihood (ML) $\kappa$ is not based on a heuristic approximation. Rather, we derive $\kappa_{\mathrm{ML}}$ by writing down the likelihood of the observed data under the occasional guessing model and then solving for the $r$ that maximizes that likelihood.

Let $X = [X_1, X_2, \ldots, X_N]$ represent the agreements and disagreements for the $N$ cases, where $X_i = 0$ indicates disagreement and $X_i = 1$ indicates agreement. When event $R$ occurs (random guessing), we have $P(X_i = 0 \mid R) = P(X_i = 1 \mid R) = 1/2$. For easy cases, raters are not guessing (i.e. $\bar{R}$ occurs), and we have $P(X_i = 0 \mid \bar{R}) = 0$, $P(X_i = 1 \mid \bar{R}) = 1$. The probability that raters guess is $P(R) = r$. The probabilities for $X_i$ conditional on $r$ are

$$P(X_i = 0 \mid r) = P(R)\,P(X_i = 0 \mid R) + P(\bar{R})\,P(X_i = 0 \mid \bar{R}) = r/2,$$

$$P(X_i = 1 \mid r) = P(R)\,P(X_i = 1 \mid R) + P(\bar{R})\,P(X_i = 1 \mid \bar{R}) = 1 - r/2.$$

The likelihood function for the data is $P(X \mid r) = \prod_{i=1}^{N} P(X_i \mid r)$, so the log-likelihood is $L(X \mid r) = \sum_{i=1}^{N} \log P(X_i \mid r)$. Splitting the sum into the $N_d$ terms in which the raters disagree ($X_i = 0$) and the $N_a$ terms in which they agree ($X_i = 1$), we get

$$L(X \mid r) = N_d \log P(X = 0 \mid r) + N_a \log P(X = 1 \mid r) = N_d \log\frac{r}{2} + N_a \log\left(1 - r/2\right).$$

Taking the derivative of $L(X \mid r)$ with respect to $r$, setting it equal to zero, and solving, we get:

$$\frac{\partial}{\partial r} L(X \mid r) = \frac{N_d}{r_{\mathrm{ML}}} - \frac{1}{2}\,\frac{N_a}{1 - r_{\mathrm{ML}}/2} = 0,$$

$$r_{\mathrm{ML}} = \frac{2 N_d}{N},$$

where $N = N_d + N_a$. Note that $N_d/N = P_d$, the observed probability of disagreement.

This result makes sense: given that the probability of agreement when raters guess is 1/2, the best estimate from the data of the number of cases on which the raters were in fact guessing is twice the number of observed disagreements.

From the above calculation it follows that the estimated probability of agreement due to chance is

$$P_c = P(R)\,P(A \mid R) = r_{\mathrm{ML}}/2 = N_d/N.$$
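In code, the estimator is immediate. The sketch below (ours, not the authors’ implementation) computes $\kappa_{\mathrm{ML}}$ from two binary rating vectors and, as a sanity check, confirms on a grid that the closed form $r_{\mathrm{ML}} = 2N_d/N$ maximizes the log-likelihood.

```python
import numpy as np

def kappa_ml(ratings_a, ratings_b):
    """Maximum likelihood kappa under the occasional guessing model (two categories).

    P_c is simply the observed disagreement rate N_d / N."""
    a = np.asarray(ratings_a)
    b = np.asarray(ratings_b)
    p_a = np.mean(a == b)        # observed agreement
    p_c = 1 - p_a                # chance agreement = disagreement rate N_d / N
    return (p_a - p_c) / (1 - p_c)

# Sanity check: the closed form r_ML = 2 * N_d / N maximizes the log-likelihood.
n, n_d = 100, 15                                   # e.g., 15 disagreements out of 100 cases
r_grid = np.linspace(1e-6, 1, 10_000)
loglik = n_d * np.log(r_grid / 2) + (n - n_d) * np.log(1 - r_grid / 2)
print(r_grid[np.argmax(loglik)], 2 * n_d / n)      # both approximately 0.30
```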

$\kappa_{\mathrm{ML}}$ Is Unbiased

We now show that the expected value of the ML estimator for $\kappa$ is equal to the theoretical value; hence $\kappa_{\mathrm{ML}}$ is an unbiased estimator of $\kappa^*$.

Recall that $r_{\mathrm{ML}} = 2N_d/N$ is the estimate of the probability of guessing used in calculating $\kappa_{\mathrm{ML}}$, where $N_d$ is the number of disagreements observed between the two raters performing binary assignments. We can write $N_d = \sum_{i=1}^{N} \bar{X}_i$, where $\bar{X}_i = 1 - X_i$, since $X_i = 0$ denotes disagreement and $X_i = 1$ agreement. Thus,

$$E[r_{\mathrm{ML}}] = E[2N_d/N] = \frac{2}{N}\, E\!\left[\sum_{i=1}^{N} \bar{X}_i\right] = \frac{2}{N}\, N\,(r/2) = r.$$

Consequently,

$$E[\kappa_{\mathrm{ML}}] = \frac{1 - E[r_{\mathrm{ML}}]}{1 - E[r_{\mathrm{ML}}]/2} = \frac{1 - r}{1 - r/2} = \kappa^*.$$

Figure 1(C) and Figure 1(D) illustrate the estimation of $\kappa_{\mathrm{ML}}$ for $N = 100$ cases scored by two raters, including bootstrap estimates of the 95% confidence intervals.
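The following self-contained simulation sketch (ours, in the spirit of Figure 1 but not the code used to produce it) draws ratings from the occasional guessing model and compares the average Gwet and maximum likelihood estimates with the theoretical $\kappa^*$.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, r, q):
    """Two raters under the occasional guessing model (binary categories)."""
    hard = rng.random(n) < r
    truth = np.where(rng.random(n) < q, '+', '-')
    a = np.where(hard, rng.choice(['+', '-'], n), truth)
    b = np.where(hard, rng.choice(['+', '-'], n), truth)
    return a, b

def kappas(a, b):
    """Return (Gwet's kappa, kappa_ML) for two binary rating vectors."""
    p_a = np.mean(a == b)
    pi_plus = (np.mean(a == '+') + np.mean(b == '+')) / 2
    p_c_gwet = 2 * pi_plus * (1 - pi_plus)
    k_gwet = (p_a - p_c_gwet) / (1 - p_c_gwet)
    k_ml = (p_a - (1 - p_a)) / p_a            # P_c = observed disagreement rate
    return k_gwet, k_ml

n, q, n_sims = 100, 0.2, 2000
for r in [0.25, 0.5, 0.75]:
    kappa_star = (1 - r) / (1 - r / 2)
    est = np.array([kappas(*simulate(n, r, q)) for _ in range(n_sims)])
    print(f"kappa* = {kappa_star:.3f}  "
          f"mean Gwet = {est[:, 0].mean():.3f}  mean ML = {est[:, 1].mean():.3f}")
```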

5. Variance of $\kappa_{\mathrm{ML}}$

We now compute the variance of the estimator $r_{\mathrm{ML}}$. The key step is computing the second moment of $N_d$:

$$E[N_d^2] = E\!\left[\Big(\sum_i \bar{X}_i\Big)^2\right] = \sum_{i \neq j} P(\bar{X}_i \bar{X}_j) + \sum_i P(\bar{X}_i) = (N^2 - N)\,r^2/4 + N r/2.$$

Thus,

$$\mathrm{Var}[r_{\mathrm{ML}}] = \frac{4}{N^2}\left(E[N_d^2] - E[N_d]^2\right) = \frac{4}{N^2}\left((N^2 - N)\frac{r^2}{4} + \frac{Nr}{2} - \left(\frac{Nr}{2}\right)^2\right) = \frac{r(2 - r)}{N}.$$

Let $f(r) = \frac{1 - r}{1 - r/2}$. The maximum absolute value of the derivative of $f$ over $r \in [0, 1]$ is 2. Thus, for all $\epsilon > 0$ and $r \in [\epsilon, 1]$:

$$|f(r) - f(r - \epsilon)| \le 2\epsilon.$$

In other words, a confidence interval $r \in [r_0 - \delta, r_0 + \delta]$ for $r$ translates into a confidence interval for $\kappa_{\mathrm{ML}}$ of $[f(r_0) - 2\delta, f(r_0) + 2\delta]$. Confidence intervals can also be calculated numerically using bootstrapping, as shown in Figure 1.
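A minimal sketch (ours) of both interval constructions for binary ratings: an analytic interval that plugs $r_{\mathrm{ML}}$ into $\mathrm{Var}[r_{\mathrm{ML}}] = r(2 - r)/N$ and widens by the factor 2 from the bound above, and a case-resampling bootstrap interval.

```python
import numpy as np

def kappa_ml_with_ci(ratings_a, ratings_b, n_boot=2000, seed=0):
    """kappa_ML with an analytic and a bootstrap 95% confidence interval."""
    a, b = np.asarray(ratings_a), np.asarray(ratings_b)
    n = len(a)
    f = lambda r: (1 - r) / (1 - r / 2)
    r_ml = 2 * np.mean(a != b)                      # r_ML = 2 * N_d / N
    k_ml = f(r_ml)
    # analytic interval: r_ML +/- 1.96 * sqrt(r(2 - r)/N), widened by 2 because |f'| <= 2
    delta = 1.96 * np.sqrt(r_ml * (2 - r_ml) / n)   # plug-in estimate of the standard error
    ci_analytic = (k_ml - 2 * delta, k_ml + 2 * delta)
    # bootstrap interval: resample cases with replacement
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_k = f(2 * np.mean(a[idx] != b[idx], axis=1))
    ci_boot = tuple(np.percentile(boot_k, [2.5, 97.5]))
    return k_ml, ci_analytic, ci_boot

# example with arbitrary binary ratings
ra = np.array(['+', '-', '+', '+', '+', '-', '+', '+', '-', '+'] * 10)
rb = np.array(['+', '-', '+', '-', '+', '-', '+', '+', '+', '+'] * 10)
print(kappa_ml_with_ci(ra, rb))
```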

6. Multiple Categories

The preceding sections have dealt with classification into two categories. We can analogously derive $\kappa_{\mathrm{ML}}$ and $r$ when there is instead an arbitrary number, $n$, of categories. To do this, we generalize the “occasional guessing” model so that, for hard cases, raters guess among all $n$ categories with equal probability. Under this model, the probability of agreement by guessing is

$$P_c = P(A, R) = P(R)\,P(A \mid R) = r/n,$$

and the overall probability of agreement is

$$P_a = P(A, R) + P(A, \bar{R}) = P(R)\,P(A \mid R) + P(\bar{R})\,P(A \mid \bar{R}) = r/n + (1 - r)\cdot 1 = 1 + \frac{r}{n}(1 - n).$$

Now, to find the theoretical $\kappa^*$ in terms of $r$,

$$\kappa^* = \frac{P_a - P_c}{1 - P_c} = \frac{1 + \frac{r}{n}(1 - n) - r/n}{1 - r/n} = \frac{1 - r}{1 - r/n}.$$

Next we derive the ML estimator of $r$. Let $X = [X_1, X_2, \ldots, X_N]$ represent the agreements and disagreements for the $N$ cases, where $X_i = 0$ indicates disagreement and $X_i = 1$ indicates agreement. When event $R$ occurs (random guessing among the $n$ categories), we have $P(X_i = 1 \mid R, r) = 1/n$ and $P(X_i = 0 \mid R, r) = (n - 1)/n$. When neither rater guesses (i.e. event $\bar{R}$ occurs), we have $P(X_i = 0 \mid \bar{R}, r) = 0$, $P(X_i = 1 \mid \bar{R}, r) = 1$. The probability that raters guess randomly is $P(R) = r$. The probabilities for $X_i$ conditional on $r$ are

$$P(X_i = 0 \mid r) = 1 - P_a = \frac{r}{n}(n - 1),$$

$$P(X_i = 1 \mid r) = P_a = 1 + \frac{r}{n}(1 - n).$$

Now, to find $\kappa_{\mathrm{ML}}$, we maximize the likelihood function for the data, $P(X \mid r) = \prod_{i=1}^{N} P(X_i \mid r)$, or equivalently the log-likelihood $L(X \mid r) = \sum_{i=1}^{N} \log P(X_i \mid r)$. Splitting the sum into the $N_d$ terms with $X_i = 0$ and the $N_a$ terms with $X_i = 1$, we get

$$L(X \mid r) = N_d \log P(X = 0 \mid r) + N_a \log P(X = 1 \mid r) = N_d \log\!\left(\frac{r}{n}(n - 1)\right) + N_a \log\!\left(1 + \frac{r}{n}(1 - n)\right).$$

Taking the derivative with respect to r, setting it equal to zero, and solving, we get:

$$\frac{\partial}{\partial r} L(X \mid r) = N_d\,\frac{n}{r(n - 1)}\cdot\frac{n - 1}{n} + N_a\,\frac{1}{1 + \frac{r}{n}(1 - n)}\cdot\frac{1 - n}{n} = \frac{N_d}{r} + \frac{N_a}{\frac{n}{1 - n} + r} = 0,$$

$$r_{\mathrm{ML}} = \frac{N_d}{N}\cdot\frac{n}{n - 1},$$

where $N = N_d + N_a$.
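A sketch (ours) of the resulting general estimator: $r_{\mathrm{ML}} = (N_d/N)\,n/(n - 1)$, with chance agreement $P_c = r_{\mathrm{ML}}/n$ and $\kappa_{\mathrm{ML}} = (P_a - P_c)/(1 - P_c)$.

```python
import numpy as np

def kappa_ml_general(ratings_a, ratings_b, n_categories=None):
    """kappa_ML for two raters and an arbitrary number of categories.

    r_ML = (N_d / N) * n / (n - 1); chance agreement P_c = r_ML / n."""
    a = np.asarray(ratings_a)
    b = np.asarray(ratings_b)
    if n_categories is None:
        n_categories = len(np.union1d(a, b))   # infer n from the data if not given
    n = n_categories
    p_d = np.mean(a != b)                      # observed disagreement rate N_d / N
    r_ml = p_d * n / (n - 1)                   # ML estimate of the fraction of hard cases
    p_c = r_ml / n                             # chance agreement probability
    p_a = 1 - p_d
    return (p_a - p_c) / (1 - p_c)

# Example with three categories
ra = ['a', 'b', 'c', 'a', 'a', 'b', 'c', 'a', 'b', 'c']
rb = ['a', 'b', 'c', 'a', 'b', 'b', 'c', 'a', 'b', 'a']
print(kappa_ml_general(ra, rb))
```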

7. Conclusions

We have presented a maximum likelihood approach to estimating the chance agreement probability $P_c$ in Gwet’s “occasional guessing” model of interrater agreement. Our estimator, $\kappa_{\mathrm{ML}}$, is derived directly from the likelihood function of the data under this model, rather than relying on heuristic approximations as in Gwet’s $\kappa$.

We have shown that, for binary ratings, the maximum likelihood estimator $r_{\mathrm{ML}}$ of the probability of guessing $r$ is simply twice the observed disagreement rate between raters. Consequently, the chance agreement probability estimate $P_c$ used in $\kappa_{\mathrm{ML}}$ is the observed disagreement rate. We have also generalized this result to raters scoring cases drawn from an arbitrary number of categories.

A key advantage of $\kappa_{\mathrm{ML}}$ is that it is an unbiased estimator of the true value of $\kappa$ predicted by the occasional guessing model. In contrast, we have demonstrated that Gwet’s formula for $P_c$, while overcoming certain limitations of Cohen’s $\kappa$, is itself biased for intermediate levels of agreement.

We have also provided the variance of the $\kappa_{\mathrm{ML}}$ estimator, which can be used to construct confidence intervals. The variance depends on both the true value of $r$ and the sample size $N$, decreasing as $N$ increases, as expected for a consistent estimator.

In summary, $\kappa_{\mathrm{ML}}$ provides a principled approach to estimating chance agreement in the occasional guessing model, addressing limitations of previous $\kappa$ statistics. As the use of interrater reliability measures continues to grow across fields, having an unbiased estimator is important for obtaining reliable inferences from data.

Author Contributions

Conceptualization, A.M.W., T.M.W. and M.B.W.; methodology, A.M.W., T.M.W. and M.B.W.; writing—original draft preparation, A.M.W., T.M.W. and M.B.W.; writing—review and editing, A.M.W., T.M.W. and M.B.W.; supervision, M.B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code that supports the findings of this study is available from the corresponding author upon request.

Abbreviations

The following abbreviations are used in this manuscript:

IRR: Interrater reliability
ML: Maximum likelihood
AC1: Agreement coefficient 1

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1] Gwet, K.L. (2014) Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement among Raters. Advanced Analytics, LLC.
[2] Cicchetti, D.V. and Feinstein, A.R. (1990) High Agreement but Low Kappa: II. Resolving the Paradoxes. Journal of Clinical Epidemiology, 43, 551-558.
https://doi.org/10.1016/0895-4356(90)90159-m
[3] Feinstein, A.R. and Cicchetti, D.V. (1990) High Agreement but Low Kappa: I. The Problems of Two Paradoxes. Journal of Clinical Epidemiology, 43, 543-549.
https://doi.org/10.1016/0895-4356(90)90158-l
[4] Wongpakaran, N., Wongpakaran, T., Wedding, D. and Gwet, K.L. (2013) A Comparison of Cohen’s Kappa and Gwet’s AC1 When Calculating Inter-Rater Reliability Coefficients: A Study Conducted with Personality Disorder Samples. BMC Medical Research Methodology, 13, Article No. 61.
https://doi.org/10.1186/1471-2288-13-61
[5] Ohyama, T. (2020) Statistical Inference of Gwet’s AC1 Coefficient for Multiple Raters and Binary Outcomes. Communications in Statistics - Theory and Methods, 50, 3564-3572.
https://doi.org/10.1080/03610926.2019.1708397
[6] Jimenez, A.M. and Zepeda, S.J. (2020) A Comparison of Gwet’s AC1 and Kappa When Calculating Inter-Rater Reliability Coefficients in a Teacher Evaluation Context. Journal of Education Human Resources, 38, 290-300.
https://doi.org/10.3138/jehr-2019-0001
[7] Gaspard, N., Hirsch, L.J., LaRoche, S.M., Hahn, C.D. and Westover, M.B. (2014) Interrater Agreement for Critical Care EEG Terminology. Epilepsia, 55, 1366-1373.
https://doi.org/10.1111/epi.12653
[8] Cohen, J. (1960) A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20, 37-46.
https://doi.org/10.1177/001316446002000104
[9] Cohen, J. (1968) Weighted Kappa: Nominal Scale Agreement Provision for Scaled Disagreement or Partial Credit. Psychological Bulletin, 70, 213-220.
https://doi.org/10.1037/h0026256
[10] Gwet, K. (2002) Kappa Statistic Is Not Satisfactory for Assessing the Extent of Agreement between Raters. Statistical Methods for Inter-Rater Reliability Assessment, 1, 1-6.
[11] Gwet, K. (2002) Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity. Statistical Methods for Inter-Rater Reliability Assessment, 2, 1-9.
[12] Gwet, K.L. (2008) Computing Inter-Rater Reliability and Its Variance in the Presence of High Agreement. British Journal of Mathematical and Statistical Psychology, 61, 29-48.
https://doi.org/10.1348/000711006x126600
[13] Byrt, T., Bishop, J. and Carlin, J.B. (1993) Bias, Prevalence and Kappa. Journal of Clinical Epidemiology, 46, 423-429.
https://doi.org/10.1016/0895-4356(93)90018-v
[14] Uebersax, J.S. (1987) Diversity of Decision-Making Models and the Measurement of Interrater Agreement. Psychological Bulletin, 101, 140-146.
https://doi.org/10.1037//0033-2909.101.1.140
[15] Viera, A.J. and Garrett, J.M. (2005) Understanding Interobserver Agreement: The Kappa Statistic. Family Medicine, 37, 360-363.
[16] Strijbos, J., Martens, R.L., Prins, F.J. and Jochems, W.M.G. (2006) Content Analysis: What Are They Talking about? Computers & Education, 46, 29-48.
https://doi.org/10.1016/j.compedu.2005.04.002

Copyright © 2024 by authors and Scientific Research Publishing Inc.


This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.