Benchmark of Fleiss' kappa, Conger's kappa, and the Brennan-Prediger coefficient

The main benefit of quadagree over irrCAC for data in long form is its support for missing data and continuous data, though bootstrapping and transformations can come in handy as well. As it turns out, quadagree is also slightly faster than irrCAC for most practical data sizes. The largest difference is found for Conger’s kappa: the method used by irrCAC works for any weighting function and therefore has to do much more work for Conger’s kappa. For quadagree, Fleiss’ kappa and Conger’s kappa are equally fast.
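To give a feel for why a general weighting function can cost more, here is a minimal base-R sketch. It is not the code of either package, and its population-style formulas may not match their output digit for digit. It computes a Conger-style quadratically weighted kappa in two ways: once through category-by-category tables for every pair of raters, and once through column means and squared differences only. The toy data and the helper names `kappa_general` and `kappa_moments` are hypothetical.

```r
# Toy data (hypothetical): 50 items in rows, 4 raters in columns, categories 1..5.
set.seed(1)
categories <- 1:5
ratings <- matrix(sample(categories, 50 * 4, replace = TRUE), nrow = 50)

# Route 1: works for any disagreement weights, but needs a K x K table
# of joint proportions for every pair of raters.
kappa_general <- function(x, categories) {
  d <- outer(categories, categories, function(k, l) (k - l)^2)  # quadratic disagreement
  pairs <- combn(ncol(x), 2)
  d_obs <- d_exp <- numeric(ncol(pairs))
  for (j in seq_len(ncol(pairs))) {
    r <- factor(x[, pairs[1, j]], levels = categories)
    s <- factor(x[, pairs[2, j]], levels = categories)
    joint <- unclass(table(r, s)) / nrow(x)                      # joint proportions
    d_obs[j] <- sum(d * joint)                                   # observed disagreement
    d_exp[j] <- sum(d * outer(rowSums(joint), colSums(joint)))   # chance disagreement
  }
  1 - mean(d_obs) / mean(d_exp)
}

# Route 2: with quadratic weights the same quantity only needs first and
# second moments of each rater's column.
kappa_moments <- function(x) {
  pairs <- combn(ncol(x), 2)
  m1 <- colMeans(x)
  m2 <- colMeans(x^2)
  d_obs <- mean((x[, pairs[1, ]] - x[, pairs[2, ]])^2)
  d_exp <- mean(m2[pairs[1, ]] + m2[pairs[2, ]] - 2 * m1[pairs[1, ]] * m1[pairs[2, ]])
  1 - d_obs / d_exp
}

all.equal(kappa_general(ratings, categories), kappa_moments(ratings))
```

The two routes agree because the expected squared difference of two independent ratings equals the sum of their second moments minus twice the product of their means.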

The following benchmark is run on $n=50$ ratings from $R=4$ raters.

library("quadagree")
x <- dat.zapf2016
irr_conger <- \(x) irrCAC::conger.kappa.raw(x, weights = "quadratic")
irr_fleiss <- \(x) irrCAC::fleiss.kappa.raw(x, weights = "quadratic")
irr_bp <- \(x) irrCAC::bp.coeff.raw(x, weights = "quadratic")
microbenchmark::microbenchmark(
  irr_conger(x),
  congerci(x),
  irr_fleiss(x),
  fleissci(x),
  irr_bp(x),
  bpci(x),
  times = 1000
)

## Unit: microseconds
##           expr      min       lq      mean   median        uq       max neval
##  irr_conger(x) 1251.396 1297.387 1376.6267 1312.595 1336.6350 13004.414  1000
##    congerci(x)  370.131  401.379  437.6575  417.174  428.8305  3877.338  1000
##  irr_fleiss(x)  662.847  697.096  779.4348  707.851  721.8075  8714.877  1000
##    fleissci(x)  366.133  397.948  420.8645  411.658  423.5655  3446.813  1000
##      irr_bp(x)  649.272  682.263  741.6886  693.820  710.0455  4340.332  1000
##        bpci(x)  397.572  430.624  470.3892  445.702  457.1540  4023.059  1000

Let’s try $n=500$.

y <- rbind(x, x, x, x, x, x, x, x, x, x)
# 500 ratings
microbenchmark::microbenchmark(
  irr_conger(y),
  congerci(y),
  irr_fleiss(y),
  fleissci(y),
  irr_bp(y),
  bpci(y),
  times = 1000
)

## Unit: microseconds
##           expr      min        lq      mean    median        uq       max neval
##  irr_conger(y) 1953.326 2037.0375 2351.9054 2066.0815 2102.8500 58641.427  1000
##    congerci(y)  457.013  502.8190  537.7253  522.8710  543.8705  4083.753  1000
##  irr_fleiss(y)  788.131  830.8805  875.0251  847.4120  866.4730  4839.012  1000
##    fleissci(y)  458.465  498.5860  533.6132  518.4475  542.4325  3856.188  1000
##      irr_bp(y)  764.878  815.3120  868.8832  831.8830  851.6340  4550.654  1000
##        bpci(y)  543.104  589.3950  633.4511  608.9765  635.2210  5555.360  1000

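For $n=5000$, quadagree is roughly $8$ times faster than irrCAC for Conger’s kappa and about $1.5$ times faster for Fleiss’ kappa, while the two packages are roughly on par for the Brennan-Prediger coefficient (comparing medians).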

z <- rbind(y, y, y, y, y, y, y, y, y, y)
# 5000 ratings
microbenchmark::microbenchmark(
  irr_conger(z),
  congerci(z),
  irr_fleiss(z),
  fleissci(z),
  irr_bp(z),
  bpci(z)
)

## Unit: milliseconds
##           expr       min        lq      mean    median        uq       max neval
##  irr_conger(z) 16.605005 19.614978 20.086820 19.975275 20.263523 71.729186   100
##    congerci(z)  2.221176  2.299062  2.435888  2.349476  2.451561  5.915702   100
##  irr_fleiss(z)  3.474505  3.579702  4.054850  3.645415  3.733063 13.285418   100
##    fleissci(z)  2.202451  2.290601  2.478628  2.356649  2.484287  5.746637   100
##      irr_bp(z)  3.409935  3.506064  3.924030  3.557545  3.618785 10.126331   100
##        bpci(z)  3.191246  3.298451  3.612792  3.381982  3.452605  6.659040   100
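The speed-ups at this data size can be read off the medians in the table above. As a small sketch, the following copies those medians by hand (nothing is re-measured) and divides each irrCAC median by the corresponding quadagree median; the variable names are only illustrative.

```r
# Medians in milliseconds, copied by hand from the n = 5000 table above.
median_ms <- c(
  irr_conger = 19.975275, congerci = 2.349476,
  irr_fleiss = 3.645415,  fleissci = 2.356649,
  irr_bp     = 3.557545,  bpci     = 3.381982
)

# Speed-up of quadagree over irrCAC: irrCAC median divided by quadagree median.
speedup <- c(
  conger = median_ms[["irr_conger"]] / median_ms[["congerci"]],
  fleiss = median_ms[["irr_fleiss"]] / median_ms[["fleissci"]],
  bp     = median_ms[["irr_bp"]] / median_ms[["bpci"]]
)
round(speedup, 1)
```

Rounded to one decimal, the ratios come out at about 8.5 for Conger’s kappa, 1.5 for Fleiss’ kappa, and 1.1 for the Brennan-Prediger coefficient.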

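If we increase the number of raters (columns) instead, the differential becomes very large, this time in favour of irrCAC for Fleiss’ kappa and the Brennan-Prediger coefficient.

w <- cbind(y, y, y, y, y, y, y, y, y, y)
# 500 ratings and 40 raters (columns); the set of categories is unchanged.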
microbenchmark::microbenchmark(
  irr_conger(w),
  congerci(w),
  irr_fleiss(w),
  fleissci(w),
  irr_bp(w),
  bpci(w)
)

## Unit: milliseconds
##           expr       min        lq      mean    median        uq       max neval
##  irr_conger(w) 15.279891 18.056969 20.821602 18.474173 19.064069 72.911493   100
##    congerci(w) 20.324657 23.242999 26.181156 24.336600 26.389737 81.909648   100
##  irr_fleiss(w)  2.939937  3.060958  3.480633  3.210573  3.291263  7.502654   100
##    fleissci(w) 20.260156 23.249191 26.381585 24.020200 26.692999 78.435323   100
##      irr_bp(w)  2.888512  3.045229  3.755012  3.154713  3.230786 10.834052   100
##        bpci(w) 23.700528 26.757284 32.093062 28.027285 30.285835 85.202615   100
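Because `rbind()` and `cbind()` grow different dimensions of the data, it can help to check the shapes of the stacked inputs. The sketch below uses a stand-in matrix of random ratings (hypothetical, not `dat.zapf2016`) and assumes that rows are rated items and columns are raters; the `_demo` names are only illustrative.

```r
# Stand-in for the 50 x 4 data set: random ratings in categories 1..5.
set.seed(1)
x_demo <- matrix(sample(1:5, 50 * 4, replace = TRUE), nrow = 50, ncol = 4)

# Stack ten copies by rows (more items) or by columns (more raters).
y_demo <- do.call(rbind, rep(list(x_demo), 10))  # 500 x 4
z_demo <- do.call(rbind, rep(list(y_demo), 10))  # 5000 x 4
w_demo <- do.call(cbind, rep(list(y_demo), 10))  # 500 x 40

dim(y_demo)
dim(z_demo)
dim(w_demo)

# Stacking copies does not introduce new rating values.
setequal(unique(as.vector(w_demo)), unique(as.vector(x_demo)))
```

Under these assumptions, stacking by rows multiplies the number of rated items, while stacking by columns multiplies the number of raters and leaves the set of observed categories as it was.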

Benchmark for the non-aggregated functions

Jonas Moss

4/25/2023