Grader Variability and the Importance of Reference Standards for Evaluating Machine Learning Models for Diabetic Retinopathy
Jonathan Krause,Varun Gulshan,Ehsan Rahimy,Peter Karth,Kasumi Widner,Greg S. Corrado,Lily Peng,Dale R. Webster +7 more
TL;DR: Adjudication reduces errors in DR grading. The authors used a small number of adjudicated consensus grades as a tuning dataset, together with higher-resolution images as input, to train an improved automated algorithm for DR grading.
About: This article is published in Ophthalmology. It was published on 12 Mar 2018 and is currently open access.
Figures

Table 4. Comparison of ophthalmologist grades versus adjudicated grades from retina specialists on the validation dataset. Confusion matrix for diabetic retinopathy and DME between the grade determined by majority decision of the ophthalmologists and the adjudicated consensus of retinal specialists. 
Table 5. Agreement between ophthalmologists’ grades with the adjudicated reference standard on the validation dataset. Sensitivity and specificity metrics are for moderate or worse DR and referable DME for each grader. Agreement between the adjudicated grade and the 5-point scale is also measured by the quadratic-weighted kappa. 
Table 3. Agreement between each retina specialist and the adjudicated reference standard on the validation dataset. Retina specialists correspond to those who contributed to the final adjudicated reference standard. Sensitivity and specificity metrics reported are for moderate or worse DR. Agreement between the preadjudication 5-point DR grade and the final adjudicated grade is also measured by the quadratic-weighted kappa. 
Table 2. Comparison of retinal specialist grades before and after adjudication on the validation dataset. Confusion matrix for diabetic retinopathy between the grade determined by majority decision and adjudicated consensus. 
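The table captions above report sensitivity and specificity for moderate or worse DR, and quadratic-weighted kappa on the 5-point DR scale. As a minimal sketch of how such metrics could be computed, the following assumes grades are encoded as integers 0 to 4 (none, mild, moderate, severe, proliferative), so "moderate or worse" means a grade of 2 or higher. The function names, the integer encoding, and the example grades are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(reference, graded, num_grades=5):
    """Cohen's kappa with quadratic disagreement weights on an ordinal scale.

    Grades are assumed to be integers in 0..num_grades-1 (hypothetical encoding).
    """
    if len(reference) != len(graded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    observed = Counter(zip(reference, graded))   # joint counts
    ref_hist = Counter(reference)                # marginal counts, reference
    grd_hist = Counter(graded)                   # marginal counts, grader
    max_dist = (num_grades - 1) ** 2
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * ref_hist[i] * grd_hist[j] / (n * n)
    if weighted_expected == 0.0:
        # Both raters used a single identical grade; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


def sensitivity_specificity(reference, graded, threshold=2):
    """Binarize at `threshold` (2 = moderate or worse under the assumed encoding)."""
    pairs = list(zip(reference, graded))
    tp = sum(1 for r, g in pairs if r >= threshold and g >= threshold)
    fn = sum(1 for r, g in pairs if r >= threshold and g < threshold)
    tn = sum(1 for r, g in pairs if r < threshold and g < threshold)
    fp = sum(1 for r, g in pairs if r < threshold and g >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference grade")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up example grades, for illustration only.
adjudicated = [0, 0, 1, 2, 2, 3, 4, 0, 1, 2]
grader = [0, 1, 1, 2, 1, 3, 4, 0, 0, 2]

kappa = quadratic_weighted_kappa(adjudicated, grader)
sensitivity, specificity = sensitivity_specificity(adjudicated, grader)
print(f"kappa={kappa:.3f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Quadratic weights penalize a disagreement by the square of its distance on the scale, so confusing none with proliferative costs far more than confusing none with mild.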
Fig. 1. Grader agreement based on the adjudicated consensus grade for referable diabetic retinopathy (DR) and diabetic macular edema (DME). Independent gradings by all 3 retinal specialists and all 3 ophthalmologists are included in this analysis.
Fig. 2. Image resolution input to model versus area under the curve (AUC) for mild and above DR. Left: Using majority decision of retinal specialists as the reference standard. Right: Using the adjudicated consensus grade of retinal specialists as a reference standard. Shaded areas represent a 95% confidence interval as measured via bootstrapping.
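The Fig. 2 caption states that the shaded areas are 95% confidence intervals measured via bootstrapping. The caption does not say which bootstrap variant was used, so the sketch below assumes a simple percentile bootstrap over images, with AUC computed from pairwise comparisons of positive and negative scores. The function names, the resample count, and the example data are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(labels, scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (an assumed variant, not the paper's stated method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]   # resample images with replacement
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # AUC is undefined for a single-class resample; draw again
        stats.append(auc(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Made-up labels (1 = mild or worse DR) and model scores, for illustration only.
example_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
example_scores = [0.05, 0.20, 0.35, 0.40, 0.60, 0.30, 0.55, 0.70, 0.85, 0.95]

point_estimate = auc(example_labels, example_scores)
ci_lower, ci_upper = bootstrap_auc_ci(example_labels, example_scores)
print(f"AUC={point_estimate:.3f}, 95% CI=({ci_lower:.3f}, {ci_upper:.3f})")
```

Repeating this at each input resolution, once per reference standard, would give the kind of curve and shaded band the caption describes.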
Citations
Use of multimodal dataset in AI for detecting glaucoma based on fundus photographs assessed with OCT: focus group study on high prevalence of myopia
Wee Shin Lim,Heng-Yen Ho,Heng-Chen Ho,Yan Wu Chen,Chih-Kuo Lee,Pao Ju Chen,Feipei Lai,Jyh-Shing Roger Jang,Mei-Lan Ko +8 more
TL;DR: This paper describes a decision support system for the automatic detection of glaucoma using fundus images, which can be applied for general screening, especially in areas with a high incidence of myopia.
Artificial Intelligence: the unstoppable revolution in ophthalmology.
TL;DR: A review of the state of the art of AI in the field of ophthalmology, focusing on the strengths and weaknesses of current systems, and defining the vision that will enable us to advance scientifically in this digital era is presented in this article.
Are current clinical studies on artificial intelligence-based medical devices comprehensive enough to support a full health technology assessment? A systematic review
TL;DR: In this article, the authors conducted a systematic literature review, based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses methodology, to extract articles published between 2016 and 2021 related to the assessment of AI-based MDs.
Correction: UK Biobank retinal imaging grading: methodology, baseline characteristics and findings for common ocular diseases
Alasdair Warwick,Katie Curran,Barbra Hamill,Kelsey Stuart,Anthony P Khawaja,Andrew J. Lotery,Michael Quinn,Savita Madhusudhan,Konstantinos Balaskas,Tunde Peto,Tabassum. Aslam,Denize Atan,Sushanta Kumar Barman,Jenny H. Barrett,Paul Bishop,Greg Black,T. W. Braithwaite,Roxana O. Carare,Usha Chakravarthy,May Chan,Sharon Chua,Alexander Day,Parul Desai,Bal Dhillon,Amanda Dick,A. Doney,Catherine A Egan,Susan P. Ennis,Paul J. Foster,Marcus Fruttiger,James Gallacher,David Garway-Heath,Jerome Gibson,JA Guggenheim,C Hammond,Alison J. Hardcastle,S.P. Harding,Ruth Hogg,Pirro G. Hysi,P A Keane,PT Khaw,Afshan Khawaja,Gerassimos Lascaratos,TJ Littlejohns,A. Lotery,PJ Luthert,Thomas MacGillivray,Sarah Mackie,Bernadette McGuinness,Ginny McKay,Marcy Peck Mckibbin,T Moore,Jonathan H. Morgan,Richard J. Oram,E. O'Sullivan,Christopher G. Owen,Euan N Paterson,Andreas Petzold,Nikolas Pontikos,Jugnoo S Rahi,Aleksandra Rudnicka,Nauman Sattar,J. Self,Panos Sergouniotis,Sobha Sivaprasad,Duncan S. Steel,Ira W. Stratton,Nicholas G. Strouthidis,Cathie Sudlow,Ronke Renee Lattimore Tapp,Emanuele Trucco,Aqsa Tufail,Anath Viswanathan,Veronique Vitart,MJ Weedon,Kareem Williams,Citra Williams,J C Woodside,Max Yates,Jennifer L.Y. Yip +79 more
TL;DR: This paper describes the grading methods and baseline characteristics for UK Biobank (UKBB) participants who underwent retinal imaging in 2009-2010, and characterises individuals with retinal features suggestive of age-related macular degeneration (AMD), glaucoma and retinopathy.
•Posted Content
Deep Learning vs. Human Graders for Classifying Severity Levels of Diabetic Retinopathy in a Real-World Nationwide Screening Program
Paisan Raumviboonsuk,Jonathan Krause,Peranut Chotcomwongse,Rory Sayres,Rajiv Raman,Kasumi Widner,Bilson J. L. Campana,Sonia Phene,Kornwipa Hemarat,Mongkol Tadarati,Sukhum Silpa-archa,Jirawut Limwattanayingyong,Chetan Rao,Oscar Kuruvilla,Jesse J. Jung,Jeffrey Tan,Surapong Orprayoon,Chawawat Kangwanwongpaisan,Ramase Sukulmalpaiboon,Chainarong Luengchaichawang,Jitumporn Fuangkaew,Pipat Kongsap,Lamyong Chualinpha,Sarawuth Saree,Srirat Kawinpanitan,Korntip Mitvongsa,Siriporn Lawanasakol,Chaiyasit Thepchatri,Lalita Wongpichedchai,Greg S. Corrado,Lily Peng,Dale R. Webster +31 more
TL;DR: Across different severity levels of DR for determining referable disease, deep learning significantly reduced the false negative rate at the cost of slightly higher false positive rates, suggesting that deep learning algorithms may serve as a valuable tool for DR screening.
[...]