Estimation and inference for the quadratically weighted Fleiss' kappa and Conger's (Cohen's) kappa

Confidence intervals for quadratic agreement coefficients with optional bootstrapping and transforms. Based on the formulas of Moss and van Oest (wip) and Moss (wip) along with standard asymptotic theory (Magnus, Neudecker, 2019) and the missing data theory of van Praag et al. (1985).

Usage

fleissci(
  x,
  type = c("adf", "elliptical", "normal", "unbiased"),
  transform = "none",
  conf_level = 0.95,
  alternative = c("two.sided", "greater", "less"),
  bootstrap = FALSE,
  n_reps = 1000
)

congerci(
  x,
  type = c("adf", "elliptical", "normal", "unbiased"),
  transform = "none",
  conf_level = 0.95,
  alternative = c("two.sided", "greater", "less"),
  bootstrap = FALSE,
  n_reps = 1000
)

cohenci(
  x,
  type = c("adf", "elliptical", "normal", "unbiased"),
  transform = "none",
  conf_level = 0.95,
  alternative = c("two.sided", "greater", "less"),
  bootstrap = FALSE,
  n_reps = 1000
)

bpci(
  x,
  values = NULL,
  kind = 1,
  type = c("adf", "elliptical", "normal", "unbiased"),
  transform = "none",
  conf_level = 0.95,
  alternative = c("two.sided", "greater", "less"),
  bootstrap = FALSE,
  n_reps = 1000
)

fleissci_aggr(
  x,
  values = seq_len(ncol(x)),
  transform = "none",
  conf_level = 0.95,
  alternative = c("two.sided", "greater", "less"),
  bootstrap = FALSE,
  n_reps = 1000
)

bpci_aggr(
  x,
  values = seq_len(ncol(x)),
  kind = 1,
  transform = "none",
  conf_level = 0.95,
  alternative = c("two.sided", "greater", "less"),
  bootstrap = FALSE,
  n_reps = 1000
)

fleiss(x)

bp(x, values = stats::na.omit(unique(c(x))), kind = 1)

conger(x)

cohen(x)

fleiss_aggr(x, values = seq_len(ncol(x)))

bp_aggr(x, values = seq_len(ncol(x)), kind = 1)

Arguments

x: Input data data can be converted to a matrix using as.matrix.
type: Type of confidence interval. Either adf, elliptical, or normal. Ignored in fleiss_aggrci.
transform: One of "none", "log", "fisher", and "arcsin. Defaults to "none".
conf_level: Confidence level. Defaults to 0.95.
alternative: A character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less".
bootstrap: If TRUE, performs a studentized bootstrap with n_reps repetitions. Defaults to FALSE.
n_reps: Number of bootstrap samples if bootstrap = TRUE. Ignored if bootstrap = FALSE. Defaults to 1000.
values: to attach to each column on the Fleiss form data. Defaults to 1:C, where C is the number of categories. Only used in fleiss_aggr and bp_aggr.
kind: The kind of Brennan-Prediger coefficient used, 1 for the classical kind and 2 for the kind introduced in Moss (2023). Only relevant for bp_aggr and bp.

Value

A vector of class quadagree containing the confidence end points. The arguments of the function call are included as attributes.

Details

There are two kinds of functions. The functions ending in aggr should be applied to data on aggregated form, where each row contains the number of selected ratings for each category. The data set dat.fleiss1971 provides an example. Missing data is not supported for the aggr functions. The other functions should be applied to data on long form, where each row contains the ratings of every rater. The data sets dat.zapf2016 and dat.klein2018 are examples. Missing data, and continuous data, is supported. See the usage vignette for more information.

For data on long form, the methods handle missing data using pairwise available information, i.e., the option use = "pairwise.complete.obs" in stats::cov() along with the asymptotic theory of van Praag et al. (1985). The bootstrap option uses the studentized bootstrap (Efron, B. 1987), which is second order correct. Both functions makes use of future.apply when bootstrapping.

The type variables defaults to adf, asymptotically distribution-free, which is consistent when the fourth moment is finite. The normal option assumes normality, and is not consistent for models with excess kurtosis unequal to 0. The elliptical option assumes an elliptical or pseudo-elliptical distribution of the data. The resulting confidence intervals are corrected variants of the normal theory intervals with a kurtosis correction (Yuan & Bentler 2002). The common kurtosis parameter is calculated using the unbiased sample kurtosis (Joanes, 1998).

Conger's (1980) kappa is a multi-rater generalization of Cohen's kappa. All functions in this package work for multiple raters, so functions starting with cohen or conger are aliases. The quadratically weighted Cohen's kappa is also known as Lin's concordance coefficient.

The only difference between Cohen's kappa and Fleiss' kappa lies on how they measure disagreement due to chance. Here Fleiss' marginalizes the rating distribution across raters, essentially assuming there is no difference in the rating distribution across raters, while Cohen's kappa does not. There is a large literature comparing Fleiss' kappa to Cohen's kappa, and there is no consensus on which to prefer.

The aggregated functions takes an argument values, which specifies what numerical value to attach to each category. The default value for values is 1...C, where C is the number of categories.

The Brennan-Prediger coefficients take an argument kind. If equal to 1, it returns the traditional Brennan-Prediger coefficient. If kind equals 2, it returns the new Brennan-Prediger coefficient of Moss (wip).

References

Efron, B. (1987). Better Bootstrap Confidence Intervals. Journal of the American Statistical Association, 82(397), 171-185. https://doi.org/10.2307/2289144

Van Praag, B. M. S., Dijkstra, T. K., & Van Velzen, J. (1985). Least-squares theory based on general distributional assumptions with an application to the incomplete observations problem. Psychometrika, 50(1), 25-36. https://doi.org/10.1007/BF02294145

Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician), 47(1), 183-189. https://doi.org/10.1111/1467-9884.00122

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220. https://doi.org/10.1037/h0026256

Fleiss, J. L. (1975). Measuring agreement between two judges on the presence or absence of a trait. Biometrics, 31(3), 651-659. https://www.ncbi.nlm.nih.gov/pubmed/1174623

Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88(2), 322-328. https://doi.org/10.1037/0033-2909.88.2.322

Lin, L. I. (1989). A concordance correlation coefficient to evaluate reproducibility. Biometrics, 45(1), 255-268. https://www.ncbi.nlm.nih.gov/pubmed/2720055

Moss, van Oest (work in progress). Inference for quadratically weighted multi-rater kappas with missing raters.

Moss (work in progress). On the Brennan-Prediger coefficients.

Magnus, J. R., & Neudecker, H. (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons. https://doi.org/10.1002/9781119541219

Examples

library("quadagree")
# Fleiss' kappa for data on long form
fleissci(dat.zapf2016)
#> Call: fleissci(x = dat.zapf2016)
#> 
#> 95% confidence interval (n = 50).
#>     0.025     0.975 
#> 0.8418042 0.9549730 
#> 
#> Sample estimates.
#>     kappa        sd 
#> 0.8983886 0.1971016 

# Brennan-Prediger for data on aggregated form
bpci_aggr(dat.fleiss1971)
#> Call: bpci_aggr(x = dat.fleiss1971)
#> 
#> 95% confidence interval (n = 30).
#>     0.025     0.975 
#> 0.1219674 0.5458104 
#> 
#> Sample estimates.
#>     kappa        sd 
#> 0.3338889 0.5579971 

# Conger's (Cohen's) kappa for data on long form with missing values
congerci(dat.klein2018)
#> Call: congerci(x = dat.klein2018)
#> 
#> 95% confidence interval (n = 10).
#>       0.025       0.975 
#> -0.05610579  0.60352456 
#> 
#> Sample estimates.
#>     kappa        sd 
#> 0.2737094 0.4373903