Short benchmark of Fleiss' kappa and Brennan-Prediger
Benchmark for the aggregated functions
Jonas Moss
4/25/2023
benchmarks_aggr.rmd
The most feature-complete R package for agreement coefficients is irrCAC. It implements Fleiss’ kappa and the Brennan-Prediger coefficient for both aggregated and long-form data. Although its method of inference, based on U-statistics, looks different from ours (based on moments), the two are equivalent in the case of aggregated data. Compare the confidence intervals below to be convinced of this.
library("quadagree")
irrCAC::bp.coeff.dist(dat.fleiss1971, weights = "quadratic")
## coeff.name coeff stderr conf.int p.value pa pe
## 1 Brennan-Prediger 0.3338889 0.1036175 (0.122,0.546) 0.003134299 0.8334722 0.75
bpci_aggr(dat.fleiss1971)
## Call: bpci_aggr(x = dat.fleiss1971)
##
## 95% confidence interval (n = 30).
## 0.025 0.975
## 0.1219674 0.5458104
##
## Sample estimates.
## kappa sd
## 0.3338889 0.5579971
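The agreement between the two outputs can also be checked by hand from the printed numbers. The sketch below is only an observation about the output above, not a description of either package's internals. It assumes that the sd reported by bpci_aggr becomes a standard error after dividing by sqrt(n - 1), and that the interval is a t interval with n - 1 degrees of freedom.

```r
# Values copied from the bpci_aggr output above.
kappa_hat <- 0.3338889
sd_hat <- 0.5579971
n <- 30

# Assumption: standard error = sd / sqrt(n - 1).
# Compare with the stderr column printed by irrCAC (0.1036175).
se <- sd_hat / sqrt(n - 1)

# Assumption: 95% t interval with n - 1 degrees of freedom.
# Compare with the limits printed by bpci_aggr (0.1219674, 0.5458104).
ci <- kappa_hat + c(-1, 1) * qt(0.975, df = n - 1) * se

se
ci
```

If the assumptions hold, se and ci should agree with the two printed outputs up to rounding.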
For aggregated data, quadagree supports user-supplied values vectors, transforms, and studentized bootstrapping, which irrCAC does not. But is it faster? It turns out quadagree is roughly twice as fast for reasonable numbers of raters. This suggests that using the moment formulation (as done in quadagree) carries no speed penalty when calculating these coefficients, although part of the gap may reflect how well optimized each package is rather than the formulation itself.
We note that no effort has been made to bin the values. One of the
benefits of the moment formulation of Fleiss’ kappa is its ability to
handle continuous values, and binning will not help here. It is,
however, possible to take advantage of binning when dealing with
categorical data, also in the moment formulation. Implementing binning
is likely to further increase the speed differential between irrCAC and quadagree, especially when there are few categories compared to the number of rows, but its utility is questionable.
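To make the idea of binning concrete, here is a minimal sketch on simulated aggregated data. It does not show how quadagree or irrCAC work internally. The data, the helper agree(), and the choice of per-item statistic are all illustrative assumptions. The point is only that identical rows can be collapsed into bins with counts, and that a mean of a per-row statistic can then be computed from the bins alone.

```r
set.seed(1)

# Hypothetical aggregated data: 300 items, 6 raters, 3 categories.
# Each row holds the number of raters choosing each category.
raters <- 6
x <- t(rmultinom(300, size = raters, prob = c(0.5, 0.3, 0.2)))

# Bin identical rows: one key per row, then count the keys.
keys <- apply(x, 1, paste, collapse = "-")
counts <- table(keys)

# One representative row per bin, with its multiplicity.
bins <- unique(x)
bin_keys <- apply(bins, 1, paste, collapse = "-")
w <- as.vector(counts[bin_keys])

# Illustrative per-item statistic: proportion of agreeing rater pairs.
agree <- function(m) rowSums(m * (m - 1)) / (raters * (raters - 1))

# The mean over all rows equals the count-weighted mean over bins.
full <- mean(agree(x))
binned <- weighted.mean(agree(bins), w)
c(rows = nrow(x), bins = nrow(bins), full = full, binned = binned)
```

With 6 raters and 3 categories there are at most 28 distinct rows, so the binned computation touches far fewer rows than the full one. With continuous values, rows rarely repeat and the same trick gains nothing.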
Benchmarks
We will run benchmarks at three sample sizes, followed by one with more categories, using the microbenchmark package. We start off with dat.fleiss1971, which contains 30 rows.
x <- dat.fleiss1971
irr_bp <- \(x) irrCAC::bp.coeff.dist(x, weights = "quadratic")
irr_fleiss <- \(x) irrCAC::fleiss.kappa.dist(x, weights = "quadratic")
microbenchmark::microbenchmark(
  irr_bp(x),
  bpci_aggr(x),
  irr_fleiss(x),
  fleissci_aggr(x),
  times = 1000
)
## Unit: microseconds
## expr min lq mean median uq max neval
## irr_bp(x) 512.136 539.3210 590.4476 549.9970 567.5500 3339.994 1000
## bpci_aggr(x) 236.372 253.2975 281.0334 262.2095 271.7070 4356.462 1000
## irr_fleiss(x) 532.463 556.0775 600.0553 565.1850 580.8595 11972.417 1000
## fleissci_aggr(x) 228.326 246.2600 265.4145 256.7995 265.4405 3026.619 1000
So quadagree is roughly twice as fast. Let’s see what happens with 300 rows.
x <- dat.fleiss1971
x <- rbind(x, x, x, x, x, x, x, x, x, x)
microbenchmark::microbenchmark(
  irr_bp(x),
  bpci_aggr(x),
  irr_fleiss(x),
  fleissci_aggr(x),
  times = 1000
)
## Unit: microseconds
## expr min lq mean median uq max neval
## irr_bp(x) 522.114 550.7680 590.0042 561.734 575.7545 3765.599 1000
## bpci_aggr(x) 234.628 256.3135 283.5659 267.825 277.2875 3363.398 1000
## irr_fleiss(x) 534.688 569.1725 610.6677 580.529 596.8040 3562.339 1000
## fleissci_aggr(x) 229.218 251.1335 267.3655 262.285 271.9920 2946.239 1000
The run time is almost the same for all methods as it was for 30 rows, suggesting that both packages carry substantial fixed overhead. Let’s check 3000 rows.
x <- rbind(x, x, x, x, x, x, x, x, x, x)
# x now has 3000 rows.
microbenchmark::microbenchmark(
  irr_bp(x),
  bpci_aggr(x),
  irr_fleiss(x),
  fleissci_aggr(x),
  times = 1000
)
## Unit: microseconds
## expr min lq mean median uq max neval
## irr_bp(x) 1155.928 1191.705 1300.3245 1204.1130 1222.3515 8321.642 1000
## bpci_aggr(x) 458.166 485.511 542.7880 501.3005 512.8220 8476.342 1000
## irr_fleiss(x) 1214.808 1244.323 1396.5112 1258.7090 1273.3215 55595.862 1000
## fleissci_aggr(x) 483.543 510.538 559.5695 525.1600 535.3945 7557.928 1000
It appears that bpci_aggr is pulling ahead of irrCAC::bp.coeff.dist: the ratio of median run times grows from about 2.1 at 300 rows to about 2.4 at 3000 rows.
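The overhead remark can be made slightly more quantitative. The sketch below fits a straight line, time = overhead + slope * rows, to the three median run times printed above for the Brennan-Prediger functions. This is a rough illustration under a strong assumption (linear scaling, only three points, medians from a single machine), so the fitted numbers should be read as orders of magnitude at best.

```r
# Median run times in microseconds, copied from the three benchmarks above.
med <- data.frame(
  n = c(30, 300, 3000),
  irr_bp = c(549.997, 561.734, 1204.113),
  bpci_aggr = c(262.2095, 267.825, 501.3005)
)

# Assumed model: time = intercept (fixed overhead) + slope * rows.
fit_irr <- coef(lm(irr_bp ~ n, data = med))
fit_quad <- coef(lm(bpci_aggr ~ n, data = med))

# Rows: functions. Columns: fitted overhead and per-row cost.
rbind(irr_bp = fit_irr, bpci_aggr = fit_quad)
```

Under this crude model, both the intercept and the slope come out larger for irr_bp than for bpci_aggr, which is consistent with the gap widening as the number of rows grows.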
Let’s finish off with a larger number of categories.
x <- cbind(x, x, x, x, x, x, x, x, x, x)
# x now has 3000 rows and 50 categories.
microbenchmark::microbenchmark(
  irr_bp(x),
  bpci_aggr(x),
  irr_fleiss(x),
  fleissci_aggr(x),
  times = 1000
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## irr_bp(x) 4.069577 4.225246 5.467688 4.295928 7.115412 69.519161 1000
## bpci_aggr(x) 1.143644 1.194665 1.533498 1.260047 1.340272 62.989804 1000
## irr_fleiss(x) 4.194409 4.379644 5.589256 4.459890 7.359672 61.883599 1000
## fleissci_aggr(x) 1.166597 1.218840 1.446629 1.271258 1.356411 4.858178 1000
So quadagree is substantially faster on data with many categories and many rated items: at the median, it is roughly 3.4 to 3.5 times as fast as irrCAC in this benchmark.