Kappa

Papers published on a yearly basis

Papers

Journal Article•10.11613/BM.2012.031•

Interrater reliability: the kappa statistic

Mary L. McHugh

15 Oct 2012-Biochemia Medica

TL;DR: Although kappa is one of the most commonly used statistics for testing interrater reliability, it has limitations; the article suggests the levels of both kappa and percent agreement that healthcare studies should demand.

Abstract: The kappa statistic is frequently used to test interrater reliability. The importance of rater reliability lies in the fact that it represents the extent to which the data collected in the study are correct representations of the variables measured. Measurement of the extent to which data collectors (raters) assign the same score to the same variable is called interrater reliability. While there have been a variety of methods to measure interrater reliability, traditionally it was measured as percent agreement, calculated as the number of agreement scores divided by the total number of scores. In 1960, Jacob Cohen critiqued use of percent agreement due to its inability to account for chance agreement. He introduced the Cohen's kappa, developed to account for the possibility that raters actually guess on at least some variables due to uncertainty. Like most correlation statistics, the kappa can range from -1 to +1. While the kappa is one of the most commonly used statistics to test interrater reliability, it has limitations. Judgments about what level of kappa should be acceptable for health research are questioned. Cohen's suggested interpretation may be too lenient for health related studies because it implies that a score as low as 0.41 might be acceptable. Kappa and percent agreement are compared, and levels for both kappa and percent agreement that should be demanded in healthcare studies are suggested.
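
The contrast between percent agreement and chance-corrected agreement can be sketched in a few lines. This is a minimal illustration, not code from the article: the function names and the ratings are hypothetical, and kappa is computed with the standard form (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each rater's marginal label frequencies.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of items on which the two raters gave the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: for each label, multiply the two raters'
    # marginal proportions, then sum over labels.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 10 items by two raters (illustration only).
rater_a = ["yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no"]
rater_b = ["yes", "yes", "yes", "yes", "yes", "no", "yes", "no", "no", "no"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

With these made-up ratings the raters agree on 8 of 10 items, yet kappa is only about 0.58, because roughly half of that agreement would be expected by chance alone.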

12,321 citations

Journal Article•10.1037/H0026256•

Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit.

Jacob Cohen¹•Institutions (1)

New York University¹

01 Oct 1968-Psychological Bulletin

TL;DR: The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joi...

Abstract: A previously described coefficient of agreement for nominal scales, kappa, treats all disagreements equally. A generalization to weighted kappa (Kw) is presented. The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joi...
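
A rough sketch of how disagreement weights enter the calculation, under stated assumptions: the table of counts, the function name, and the linear weights |i - j| are all hypothetical choices made for illustration (the abstract does not prescribe a weighting scheme). Weighted kappa is computed here as 1 minus the ratio of weighted observed disagreement to weighted chance-expected disagreement.

```python
def weighted_kappa(table, weights):
    """Weighted kappa from a k x k table of counts and disagreement weights.

    weights[i][j] is the penalty for rater A choosing category i while
    rater B chooses category j; the diagonal should be 0.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Weighted disagreement actually observed.
    observed = sum(
        weights[i][j] * table[i][j] for i in range(k) for j in range(k)
    ) / n
    # Weighted disagreement expected by chance from the marginals.
    expected = sum(
        weights[i][j] * row_totals[i] * col_totals[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if expected == 0:
        raise ValueError("weighted kappa is undefined: no expected disagreement")
    return 1 - observed / expected


# Hypothetical 3 x 3 table: rows are rater A, columns are rater B.
table = [
    [10, 2, 0],
    [3, 8, 2],
    [0, 1, 4],
]
k = len(table)

# All disagreements count equally: reduces to unweighted kappa.
unit_weights = [[0 if i == j else 1 for j in range(k)] for i in range(k)]
# Assumed linear weights: a two-category miss costs twice a one-category miss.
linear_weights = [[abs(i - j) for j in range(k)] for i in range(k)]

kappa_unweighted = weighted_kappa(table, unit_weights)
kappa_linear = weighted_kappa(table, linear_weights)
print(f"unweighted = {kappa_unweighted:.3f}, linear-weighted = {kappa_linear:.3f}")
```

In this made-up table every disagreement is between adjacent categories, so the linear-weighted value (about 0.66) is higher than the unweighted one (about 0.58): near misses are penalised less than they would be if all disagreements were treated equally.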

8,594 citations

Journal Article•

Understanding interobserver agreement: the kappa statistic.

Anthony J. Viera¹, Joanne M. Garrett•Institutions (1)

University of North Carolina at Chapel Hill¹

01 May 2005-Family Medicine

TL;DR: Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers. Studies that measure the agreement between two or more observers should therefore include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance.

Abstract: Items such as physical exam findings, radiographic interpretations, or other diagnostic tests often rely on some degree of subjective interpretation by observers. Studies that measure the agreement between two or more observers should include a statistic that takes into account the fact that observers will sometimes agree or disagree simply by chance. The kappa statistic (or kappa coefficient) is the most commonly used statistic for this purpose. A kappa of 1 indicates perfect agreement, whereas a kappa of 0 indicates agreement equivalent to chance. A limitation of kappa is that it is affected by the prevalence of the finding under observation. Methods to overcome this limitation have been described.
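
The prevalence effect mentioned above can be seen with two small 2 × 2 tables. This is an illustrative sketch with invented counts and a hypothetical helper name: both tables have the same raw agreement (80%), but in the second the finding is present in most cases, which raises chance agreement and pulls kappa down sharply.

```python
def kappa_from_table(table):
    """Cohen's kappa from a square table of counts (rows: rater A, cols: rater B)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(table[i][i] for i in range(k)) / n
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts, 100 cases each; order is [finding present, finding absent].
balanced = [
    [40, 10],
    [10, 40],
]
skewed = [
    [80, 10],
    [10, 0],
]

kappa_balanced = kappa_from_table(balanced)  # prevalence near 50%
kappa_skewed = kappa_from_table(skewed)      # prevalence near 90%
print(f"balanced: {kappa_balanced:.2f}, skewed: {kappa_skewed:.2f}")
```

Both tables show agreement on 80 of 100 cases, yet kappa is 0.60 for the balanced table and slightly negative (about -0.11) for the skewed one.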

7,641 citations

Journal Article•10.1111/J.1365-2664.2006.01214.X•

Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS)

Omri Allouche¹, Asaf Tsoar¹, Ronen Kadmon¹•Institutions (1)

Hebrew University of Jerusalem¹

01 Dec 2006-Journal of Applied Ecology

TL;DR: The authors provide a theoretical explanation for the observed dependence of kappa on prevalence and introduce an alternative measure of accuracy, the true skill statistic (TSS), which corrects for this dependence while keeping all the advantages of kappa.

Abstract: Summary
1. In recent years the use of species distribution models by ecologists and conservation managers has increased considerably, along with an awareness of the need to provide accuracy assessment for predictions of such models. The kappa statistic is the most widely used measure for the performance of models generating presence–absence predictions, but several studies have criticized it for being inherently dependent on prevalence, and argued that this dependency introduces statistical artefacts to estimates of predictive accuracy. This criticism has been supported recently by computer simulations showing that kappa responds to the prevalence of the modelled species in a unimodal fashion.
2. In this paper we provide a theoretical explanation for the observed dependence of kappa on prevalence, and introduce into ecology an alternative measure of accuracy, the true skill statistic (TSS), which corrects for this dependence while still keeping all the advantages of kappa. We also compare the responses of kappa and TSS to prevalence using empirical data, by modelling distribution patterns of 128 species of woody plant in Israel.
3. The theoretical analysis shows that kappa responds in a unimodal fashion to variation in prevalence and that the level of prevalence that maximizes kappa depends on the ratio between sensitivity (the proportion of correctly predicted presences) and specificity (the proportion of correctly predicted absences). In contrast, TSS is independent of prevalence.
4. When the two measures of accuracy were compared using empirical data, kappa showed a unimodal response to prevalence, in agreement with the theoretical analysis. TSS showed a decreasing linear response to prevalence, a result we interpret as reflecting true ecological phenomena rather than a statistical artefact. This interpretation is supported by the fact that a similar pattern was found for the area under the ROC curve, a measure known to be independent of prevalence.
5. Synthesis and applications. Our results provide theoretical and empirical evidence that kappa, one of the most widely used measures of model performance in ecology, has serious limitations that make it unsuitable for such applications. The alternative we suggest, TSS, compensates for the shortcomings of kappa while keeping all of its advantages. We therefore recommend the TSS as a simple and intuitive measure for the performance of species distribution models when predictions are expressed as presence–absence maps.
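
The difference between the two measures can be sketched numerically. The example below is an assumption-laden illustration rather than the authors' analysis: it takes TSS as sensitivity + specificity - 1, holds a hypothetical model's sensitivity and specificity fixed at 0.9, and builds the expected confusion matrix at two prevalence levels. The function names and all numbers are invented.

```python
def confusion_from_rates(sensitivity, specificity, prevalence, n=1000):
    """Expected confusion-matrix cells for fixed rates at a given prevalence."""
    presences = prevalence * n
    absences = n - presences
    tp = sensitivity * presences   # presences predicted as present
    fn = presences - tp            # presences predicted as absent
    tn = specificity * absences    # absences predicted as absent
    fp = absences - tn             # absences predicted as present
    return tp, fp, fn, tn


def true_skill_statistic(tp, fp, fn, tn):
    """TSS = sensitivity + specificity - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1


def kappa(tp, fp, fn, tn):
    """Cohen's kappa between predictions and observations."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Same hypothetical model quality, two prevalence levels.
common = confusion_from_rates(0.9, 0.9, prevalence=0.5)
rare = confusion_from_rates(0.9, 0.9, prevalence=0.1)

for name, cells in (("prevalence 0.5", common), ("prevalence 0.1", rare)):
    print(f"{name}: kappa = {kappa(*cells):.3f}, TSS = {true_skill_statistic(*cells):.3f}")
```

Under these assumptions TSS stays at 0.8 in both cases, while kappa falls from 0.8 to roughly 0.59 when the species becomes rare, even though sensitivity and specificity have not changed.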

4,801 citations

Journal Article•10.1093/PTJ/85.3.257•

The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements

Julius Sim¹, Chris Wright²•Institutions (2)

Keele University¹, Coventry University²

01 Mar 2005-Physical Therapy

TL;DR: The issue of statistical testing of kappa is considered, including the use of confidence intervals, and appropriate sample sizes for reliability studies using kappa are tabulated.

Abstract: Purpose. This article examines and illustrates the use and interpretation of the kappa statistic in musculoskeletal research. Summary of Key Points. The reliability of clinicians' ratings is an important consideration in areas such as diagnosis and the interpretation of examination findings. Often, these ratings lie on a nominal or an ordinal scale. For such data, the kappa coefficient is an appropriate measure of reliability. Kappa is defined, in both weighted and unweighted forms, and its use is illustrated with examples from musculoskeletal research. Factors that can influence the magnitude of kappa (prevalence, bias, and nonindependent ratings) are discussed, and ways of evaluating the magnitude of an obtained kappa are considered. The issue of statistical testing of kappa is considered, including the use of confidence intervals, and appropriate sample sizes for reliability studies using kappa are tabulated. Conclusions. The article concludes with recommendations for the use and interpretation of kappa.
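
As a rough illustration of attaching a confidence interval to kappa, the sketch below uses a simple large-sample approximation to the standard error, sqrt(p_o (1 - p_o) / (n (1 - p_e)^2)), with a normal-theory 95% interval. This is an assumed, simplified formula chosen for brevity and is not necessarily the method the article uses; the function name and counts are hypothetical.

```python
import math


def kappa_with_ci(table, z=1.96):
    """Cohen's kappa with an approximate confidence interval.

    Uses the simple large-sample standard error
    sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2)); an approximation only.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_o = sum(table[i][i] for i in range(k)) / n
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e)
    se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, se, (kappa - z * se, kappa + z * se)


# Hypothetical ratings of 100 patients by two clinicians.
table = [
    [40, 10],
    [10, 40],
]
kappa, se, (lower, upper) = kappa_with_ci(table)
print(f"kappa = {kappa:.2f}, SE = {se:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

For these invented counts kappa is 0.60 with an approximate standard error of 0.08, giving an interval of about 0.44 to 0.76. Because the standard error shrinks with the square root of n, narrowing the interval requires a considerably larger sample.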

4,170 citations

Year	Papers
2025	270
2024	458
2023	657
2022	1,123
2021	50
2020	46
