The Messidor-2 dataset used for this comparison study consists of high-quality retinal images,[9] which are not necessarily representative of data from screening programs in general, and certainly not of the image quality seen in the non-eye care settings where screening algorithms have the potential to deliver their biggest impact.[24]
The ability to detect an ungradeable image is an important component when assessing the capabilities of a device for automated detection of diabetic retinopathy in the real world. Because of the relatively high quality of the images in Messidor-2, only a small number (4%) would have received an insufficient image quality output had the protocol been completed. Thus, while Messidor-2 is useful for measuring an algorithm's performance on high-quality exams, or for comparing it with other algorithms, as in the present study, it is not sufficient to establish an algorithm's performance in broader clinical use. In addition, Messidor-2 contains only a single image per eye, limiting the area of retina covered. Many screening programs,[9]
and algorithms such as the device, are designed to use two images per eye, one fovea centered and one disc centered, covering a larger area of retina. Using two or more images per eye, algorithms as well as human experts may detect additional cases of DR not visible on a single image,[8]
leading to different measured performance. Similarly, in the real world, reference standards often differ, depending on the characteristics of the clinicians reading the images, how many readers are involved, and how consensus is reached. For example, the ME reference standard was graded from the retinal images, which lack stereo, and no optical coherence tomography (OCT) was available. This implies that isolated retinal thickening cannot be appreciated,[7]
though human expert detection of ME from exudates alone, in single images, may be almost as sensitive as clinical stereo biomicroscopic assessment of retinal thickening.[35]
Thus, DR and ME prevalence and severity may be underestimated in this dataset, and a different reference standard could lead to differences in a device's measured algorithmic performance. Finally, we purposely chose set points for the device's rDR and vtDR outputs that could be expected to result in high sensitivity, so that the performance of the IDP could be compared with that of the device. This indeed resulted in a high sensitivity of 96.8% and a specificity of 87%. In the real world, algorithms such as the device can potentially operate at different set points that allow a more equal balance between sensitivity and specificity, in accordance with the prevalence of vtDR in the population as well as medical and public health objectives for screening.
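To make the set-point trade-off concrete, the following minimal Python sketch shows how moving a threshold on an algorithm's continuous output shifts the balance between sensitivity and specificity. This is not code from the study or the device: the scores, labels, and the sensitivity_specificity helper are hypothetical, purely illustrative stand-ins under the assumption of a scalar disease score in [0, 1].

    import numpy as np

    def sensitivity_specificity(scores, labels, threshold):
        """Sensitivity and specificity of a binary rule that flags an
        exam as disease-positive when its score meets or exceeds the
        threshold (the "set point"). Hypothetical helper, not the
        device's actual decision logic."""
        preds = scores >= threshold
        tp = np.sum(preds & (labels == 1))   # true positives
        fn = np.sum(~preds & (labels == 1))  # false negatives
        tn = np.sum(~preds & (labels == 0))  # true negatives
        fp = np.sum(preds & (labels == 0))   # false positives
        return tp / (tp + fn), tn / (tn + fp)

    # Simulated continuous outputs and reference labels
    # (1 = disease present, 0 = absent); purely illustrative data.
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=1000)
    scores = np.clip(0.5 * labels + rng.normal(0.25, 0.2, size=1000), 0, 1)

    # Lowering the set point raises sensitivity at the cost of specificity.
    for threshold in (0.3, 0.5, 0.7):
        sens, spec = sensitivity_specificity(scores, labels, threshold)
        print(f"set point {threshold:.1f}: sensitivity {sens:.2f}, specificity {spec:.2f}")

In this toy example, the lowest threshold catches nearly all simulated disease cases while flagging more healthy exams, mirroring the high-sensitivity set points chosen for the present comparison; a screening program could instead pick the threshold that best matches local disease prevalence and public health objectives.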