Short benchmark of Fleiss' kappa and Brennan-Prediger

The most feature-complete R package for agreement coefficients is irrCAC. It implements Fleiss’ kappa and the Brennan-Prediger coefficient for both aggregated and long-form data. Its method of inference, based on $U$-statistics, looks different from ours, which is based on moments, but the two are equivalent in the case of aggregated data. Compare the confidence intervals below to be convinced of this.

library("quadagree")
irrCAC::bp.coeff.dist(dat.fleiss1971, weights = "quadratic")

##         coeff.name     coeff    stderr      conf.int     p.value        pa   pe
## 1 Brennan-Prediger 0.3338889 0.1036175 (0.122,0.546) 0.003134299 0.8334722 0.75

bpci_aggr(dat.fleiss1971)

## Call: bpci_aggr(x = dat.fleiss1971)
## 
## 95% confidence interval (n = 30).
##     0.025     0.975 
## 0.1219674 0.5458104 
## 
## Sample estimates.
##     kappa        sd 
## 0.3338889 0.5579971
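The agreement can also be checked by hand from the printed numbers alone. The sketch below assumes that the interval is a $t$-interval with $n-1$ degrees of freedom and that the standard error equals the reported sd divided by $\sqrt{n-1}$. This is an inference from the output above, not something taken from either package's documentation.

```r
# Values copied from the printed output of bpci_aggr(dat.fleiss1971).
kappa <- 0.3338889
sd_hat <- 0.5579971
n <- 30

# Assumption: standard error = sd / sqrt(n - 1), t-quantile with n - 1 df.
se <- sd_hat / sqrt(n - 1)
ci <- kappa + c(-1, 1) * qt(0.975, df = n - 1) * se

se  # compare with the stderr column reported by irrCAC
ci  # compare with the interval reported by both packages
```

Under these assumptions, the standard error and interval limits match the irrCAC output to the printed precision.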

For aggregated data, quadagree supports user-supplied values vectors, transforms, and studentized bootstrapping, which irrCAC does not. But is it faster? It turns out quadagree is roughly twice as fast for reasonable numbers of raters. This suggests there is a benefit in using the moment formulation (as done in quadagree) when calculating these coefficients, as quadagree is unlikely to be substantially more optimized than irrCAC.

We note that no effort has been made to bin the values. One of the benefits of the moment formulation of Fleiss’ kappa is its ability to handle continuous values, where binning will not help. For categorical data, however, binning can also be exploited in the moment formulation. Implementing it is likely to further increase the speed differential between irrCAC and quadagree, especially when there are few categories compared to the number of rows, but its utility is questionable.
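To make the idea of binning concrete, here is a minimal base R sketch. It is not how quadagree or irrCAC work internally. The toy data, the `patterns` matrix, and the placeholder `row_stat` function are all hypothetical. It collapses identical rows of an aggregated data matrix into distinct rows with frequencies, then shows that a row-wise average can be computed from the bins alone.

```r
# Toy aggregated data: rows are items, columns are categories, entries are
# the number of raters choosing each category (hypothetical values).
set.seed(1)
patterns <- rbind(c(6, 0, 0), c(3, 3, 0), c(2, 2, 2), c(0, 1, 5))
x <- patterns[sample(nrow(patterns), 300, replace = TRUE), ]

# Bin identical rows: keep each distinct row once, with its frequency.
key <- apply(x, 1, paste, collapse = ",")
first <- !duplicated(key)
bins <- x[first, , drop = FALSE]
freq <- as.vector(table(key)[key[first]])

# Any row-wise statistic can then be averaged over the bins instead of rows.
row_stat <- function(m) rowSums(m^2)  # placeholder statistic
full_mean <- mean(row_stat(x))
binned_mean <- sum(freq * row_stat(bins)) / sum(freq)
all.equal(full_mean, binned_mean)
```

With 300 rows but at most four distinct patterns, the statistic is evaluated on four rows instead of 300, which is where a speed gain would come from when categories are few relative to rows.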

Benchmarks

We will run four benchmarks of various sizes using the microbenchmark package. We start off with dat.fleiss1971, which contains $n=30$ rows.

x <- dat.fleiss1971
irr_bp <- \(x) irrCAC::bp.coeff.dist(x, weights = "quadratic")
irr_fleiss <- \(x) irrCAC::fleiss.kappa.dist(x, weights = "quadratic")
microbenchmark::microbenchmark(
  irr_bp(x),
  bpci_aggr(x),
  irr_fleiss(x),
  fleissci_aggr(x),
  times = 1000
)

## Unit: microseconds
##              expr     min       lq     mean   median       uq       max neval
##         irr_bp(x) 512.136 539.3210 590.4476 549.9970 567.5500  3339.994  1000
##      bpci_aggr(x) 236.372 253.2975 281.0334 262.2095 271.7070  4356.462  1000
##     irr_fleiss(x) 532.463 556.0775 600.0553 565.1850 580.8595 11972.417  1000
##  fleissci_aggr(x) 228.326 246.2600 265.4145 256.7995 265.4405  3026.619  1000

So quadagree is roughly twice as fast. Let’s see what happens when $n=300$.

x <- dat.fleiss1971
x <- rbind(x, x, x, x, x, x, x, x, x, x)
microbenchmark::microbenchmark(
  irr_bp(x),
  bpci_aggr(x),
  irr_fleiss(x),
  fleissci_aggr(x),
  times = 1000
)

## Unit: microseconds
##              expr     min       lq     mean  median       uq      max neval
##         irr_bp(x) 522.114 550.7680 590.0042 561.734 575.7545 3765.599  1000
##      bpci_aggr(x) 234.628 256.3135 283.5659 267.825 277.2875 3363.398  1000
##     irr_fleiss(x) 534.688 569.1725 610.6677 580.529 596.8040 3562.339  1000
##  fleissci_aggr(x) 229.218 251.1335 267.3655 262.285 271.9920 2946.239  1000

The run times are almost the same as they were for $n=30$, suggesting that both packages have substantial fixed overhead. Let’s check $n=3000$.

x <- rbind(x, x, x, x, x, x, x, x, x, x)
# x now has 3000 rows.
microbenchmark::microbenchmark(
  irr_bp(x),
  bpci_aggr(x),
  irr_fleiss(x),
  fleissci_aggr(x),
  times = 1000
)

## Unit: microseconds
##              expr      min       lq      mean    median        uq       max neval
##         irr_bp(x) 1155.928 1191.705 1300.3245 1204.1130 1222.3515  8321.642  1000
##      bpci_aggr(x)  458.166  485.511  542.7880  501.3005  512.8220  8476.342  1000
##     irr_fleiss(x) 1214.808 1244.323 1396.5112 1258.7090 1273.3215 55595.862  1000
##  fleissci_aggr(x)  483.543  510.538  559.5695  525.1600  535.3945  7557.928  1000

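It appears that bpci_aggr is pulling ahead of irrCAC::bp.coeff.dist: the ratio of median run times is now about 2.4 rather than 2.1. The Fleiss functions show a similar gain.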
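The three row counts benchmarked so far allow a rough split of run time into fixed overhead and per-row cost. The sketch below fits a straight line through the median times of the two Brennan-Prediger functions. The linear model is an assumption made for illustration, and three points give only a crude estimate.

```r
# Median run times in microseconds, copied from the three tables above.
n <- c(30, 300, 3000)
med_irr_bp <- c(549.9970, 561.734, 1204.1130)
med_bpci   <- c(262.2095, 267.825, 501.3005)

# Crude model: time = overhead + per_row * n, fitted by least squares.
fit_irr  <- coef(lm(med_irr_bp ~ n))
fit_quad <- coef(lm(med_bpci ~ n))

rbind(irrCAC = fit_irr, quadagree = fit_quad)
```

The intercepts estimate the fixed overhead per call and the slopes the cost of each additional row. Under this model, both are lower for bpci_aggr.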

Let’s finish off with a larger number of categories.

x <- cbind(x, x, x, x, x, x, x, x, x, x)
# x has 3000 rows and 50 categories.
microbenchmark::microbenchmark(
  irr_bp(x),
  bpci_aggr(x),
  irr_fleiss(x),
  fleissci_aggr(x),
  times = 1000
)

## Unit: milliseconds
##              expr      min       lq     mean   median       uq       max neval
##         irr_bp(x) 4.069577 4.225246 5.467688 4.295928 7.115412 69.519161  1000
##      bpci_aggr(x) 1.143644 1.194665 1.533498 1.260047 1.340272 62.989804  1000
##     irr_fleiss(x) 4.194409 4.379644 5.589256 4.459890 7.359672 61.883599  1000
##  fleissci_aggr(x) 1.166597 1.218840 1.446629 1.271258 1.356411  4.858178  1000

So quadagree is substantially faster, by a factor of about 3.4 to 3.5 in median run time, on data with many categories and many rated items.
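The speed-ups can be read off the tables above by dividing median run times. The following sketch collects the medians by hand, with the last benchmark converted from milliseconds to microseconds, and computes the ratio for each coefficient. The data frame and its column names are introduced here for illustration only.

```r
# Median run times in microseconds, copied from the four tables above
# (the last benchmark is converted from milliseconds).
medians <- data.frame(
  benchmark = c("n = 30", "n = 300", "n = 3000", "n = 3000, 50 categories"),
  irr_bp        = c(549.9970, 561.734, 1204.1130, 4295.928),
  bpci_aggr     = c(262.2095, 267.825,  501.3005, 1260.047),
  irr_fleiss    = c(565.1850, 580.529, 1258.7090, 4459.890),
  fleissci_aggr = c(256.7995, 262.285,  525.1600, 1271.258)
)

# Speed-up of quadagree over irrCAC for each coefficient.
speedup <- with(medians, data.frame(
  benchmark,
  bp     = round(irr_bp / bpci_aggr, 1),
  fleiss = round(irr_fleiss / fleissci_aggr, 1)
))
speedup
```

Medians are used rather than means because the maximum column shows occasional large outliers that inflate the means.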

Benchmark for the aggregated functions

Jonas Moss

4/25/2023
