Message
Conventional white-light endoscopy has high interobserver variability for the diagnosis of gastric precancerous conditions. Here we present a deep-learning (DL) approach for the diagnosis of atrophic gastritis, developed and trained using real-world endoscopic images from the proximal stomach. The model achieved an accuracy of 93% (area under the curve (AUC) 0.98; F-score 0.93) in an independent data set, outperforming expert endoscopists. DL may overcome the limitations of conventional appraisal of white-light endoscopy and support human decision making. The algorithm is available free of charge via a web-based interface (https://www.ccb.uni-saarland.de/atrophy).
In more detail
Introduction
Chronic inflammation of the gastric mucosa induces a cascade of precancerous conditions (chronic atrophic gastritis, intestinal metaplasia) and lesions (dysplasia) that may result in the development of intestinal-type gastric cancer.1 Infection with Helicobacter pylori and autoimmune gastritis are the most relevant factors initiating these mechanisms. Conventional white-light endoscopy has moderate sensitivity and specificity, as well as a high interobserver variability, and is therefore not sufficient to reliably diagnose gastric atrophy or intestinal metaplasia.2 3 Thus, especially in Western countries, histology-based diagnosis of precancerous conditions using standardised biopsy protocols is favoured. Advanced endoscopic techniques (eg, virtual or conventional chromoendoscopy, magnification endoscopy, confocal laser endomicroscopy) are often hindered by technical availability and costs.
DL has demonstrated potential in medical imaging, including GI endoscopy.4 In this field, DL has been used to diagnose focal pathologies (in particular colorectal polyps and oesophageal adenocarcinoma), and only occasionally for diseases diffusely affecting the GI mucosa (eg, H. pylori-associated gastritis).4–7 Here, for the first time, we present a DL approach that overcomes the limitations of white-light endoscopy in diagnosing atrophic gastritis.
Patients and methods
For a first data set (data set DS1), we identified 200 real-world images from patients with and without histology-proven atrophic gastritis (100 each) who underwent routine oesophagogastroduodenoscopy between 2008 and 2018. Endoscopies were performed with various generations of Olympus scopes (GIF-Q160, GIF-Q160Z, GIF-1TQ160, GIF-Q165, GIF-H180, GIF-H190; Olympus Europe, Hamburg, Germany). Images were unaltered white-light images, anonymised and exported in Digital Imaging and Communications in Medicine (DICOM) format. Non-standardised images (eg, various scope positions, distances, angles and illumination; bile, food and mucus contaminations) were taken from the non-overinflated proximal stomach (gastric corpus and fundus). All images were cropped, resized and normalised to a fixed mean and SD.
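A minimal sketch of such preprocessing is given below; it is not the authors' pipeline, and the crop box, target size and target statistics are illustrative assumptions only.

```python
# Minimal preprocessing sketch (not the authors' code): crop, resize and
# normalise an exported endoscopy image to a fixed mean and SD.
import numpy as np
from PIL import Image

TARGET_SIZE = (224, 224)  # assumed input size of the pretrained CNN


def preprocess(path, crop_box=None, target_mean=0.5, target_std=0.25):
    img = Image.open(path).convert("RGB")
    if crop_box is not None:               # eg, remove black endoscope borders
        img = img.crop(crop_box)
    img = img.resize(TARGET_SIZE, Image.BILINEAR)
    x = np.asarray(img, dtype=np.float32) / 255.0
    # standardise per channel, then shift/scale to the requested mean and SD
    x = (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-8)
    return x * target_std + target_mean
```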
An independent second data set (data set DS2) of 70 images (30 with atrophy; 40 without) was used for independent testing and for evaluation by six endoscopists, three with fewer than 1500 and three with more than 1500 oesophagogastroduodenoscopies (EGDs) performed. Since the ratings of the two groups did not differ (p>0.05), they were combined. Table 1 summarises the patient characteristics. Patients included in the study had no evidence of persisting H. pylori infection. Histopathological assessment of H&E-stained slices was carried out by seven board-certified academic pathologists, with at least two pathologists evaluating each specimen (non-blinded) using the updated Sydney system.8
With traditional machine learning, handcrafted features are fed to a model for classification. With DL, these features are computed incrementally by the model itself without expert intervention; thus, there is no theoretical limit preventing it from learning any feature representation. Convolutional neural networks (CNNs) are the gold standard for image analysis: they exploit local structural relationships in the image and build progressively more complex, abstract representations from layer to layer. However, this requires a large amount of training data.
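As an illustration only, the following toy PyTorch network shows how stacked convolutional layers turn local image structure into increasingly abstract representations; the layer sizes and two-class head are assumptions, not the architecture used in this study.

```python
# Toy CNN for illustration; not the model used in the study.
import torch.nn as nn

toy_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # low level: edges, colour blobs
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # mid level: textures, vessel patterns
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),                   # high level: mucosal appearance
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 2),                                                         # atrophy vs no atrophy
)
```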
To overcome this data requirement, we used a fine-tuned, pretrained CNN; that is, we used pretrained weights to initialise the network, thus improving the stability and performance of our model (figure 1). The training data were artificially augmented by image rotation, mirroring and scaling. First, we assessed the best architecture among models pretrained on ImageNet.9 We performed 10-fold stratified cross-validation on DS1. For each round, data were split into training, tuning and testing sets (80%/10%/10%). The test set was classified using the best performing hyperparameter combination, as assessed on the tuning set (grid search with early stopping to select dropout, learning rate and momentum). Each image was used for testing exactly once. In a second stage, DS1 was used for training and tuning (90%/10% split), whereas DS2 was used for testing. The model architecture was chosen from the first-stage results, and hyperparameters were again assessed on the tuning set. Online supplementary file 1 provides an in-depth description of the methods.
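A hedged sketch of this fine-tuning stage is shown below: a VGG16 initialised with ImageNet weights, a replaced classifier head for the binary task, and simple augmentation by rotation, mirroring and scaling. The hyperparameter values are placeholders, not those selected by the authors' grid search.

```python
# Sketch of transfer learning with a pretrained VGG16 (torchvision >= 0.13 API);
# hyperparameters are placeholders, not the grid-search results of the study.
import torch
import torch.nn as nn
from torchvision import models, transforms

train_tf = transforms.Compose([
    transforms.RandomRotation(15),                         # rotation
    transforms.RandomHorizontalFlip(),                     # mirroring
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # scaling
    transforms.ToTensor(),
])

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 2)                   # atrophy vs no atrophy

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# ...training loop with early stopping on the tuning split would follow here
```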
Supplemental material
For all models and expert evaluations, accuracy, balanced accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and F-score were computed. In addition, NPV, PPV and accuracy were computed for prevalence rates between 20% and 50% in steps of 1%. Statistical differences between the expert evaluations and the DL approach were assessed with the Wilcoxon signed-rank test. Further, we computed receiver operating characteristic (ROC) curves and the AUC.
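The prevalence extrapolation follows directly from Bayes' rule: with a fixed sensitivity and specificity, PPV, NPV and accuracy can be recomputed for any assumed prevalence. The sketch below illustrates this for the 20%-50% range; the sensitivity and specificity values are placeholders, not the study's results.

```python
# Prevalence-adjusted PPV, NPV and accuracy from fixed sensitivity/specificity;
# sens and spec below are placeholder values, not those reported in the study.
import numpy as np


def prevalence_adjusted(sens, spec, prevalence):
    ppv = sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = spec * (1 - prevalence) / (spec * (1 - prevalence) + (1 - sens) * prevalence)
    acc = sens * prevalence + spec * (1 - prevalence)
    return ppv, npv, acc


for p in np.arange(0.20, 0.51, 0.01):
    ppv, npv, acc = prevalence_adjusted(sens=0.93, spec=0.88, prevalence=p)
    print(f"prevalence {p:.2f}: PPV {ppv:.2f}, NPV {npv:.2f}, accuracy {acc:.2f}")
```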
Results
The best performing pretrained DL model for diagnosis of atrophic gastritis, as assessed by cross-validation, was VGG16.10 The algorithm yielded results for all images. Table 2 summarises the results.
Accuracy, balanced accuracy and F-score were significantly lower for the endoscopists when compared with the DL-based approach (p=0.03). There was no significant difference between the endoscopy experts and the model for the remaining performance metrics. Online supplementary figure 1A,B shows the ROC curves.
Supplemental material
Discussion
We present a DL approach capable of surpassing expert assessment for endoscopic diagnosis of atrophic gastritis. Despite the low number of images available, our model achieved a diagnostic accuracy of 93%, which was significantly better than the combined results of endoscopists working at a tertiary referral centre.
Endoscopic surveillance is generally advised for patients with extensive atrophy or intestinal metaplasia, but not for precancerous conditions restricted to the antrum.2 Thus, we decided to focus on the proximal stomach (gastric corpus and fundus). Histopathology was used as the gold standard. This method is prone to sampling error, especially in initial or patchy disease. Thus, an apparent false-positive rate of 12.5% is acceptable, because false-negative results of histopathology (despite at least two biopsies from the proximal stomach) cannot be ruled out. Our algorithm cannot sharply discriminate simple atrophy from metaplastic atrophic gastritis, since most patients in both cohorts suffered from atrophic gastritis with intestinal metaplasia, which is the most reliable histological marker of atrophy.2
The strength of our approach is that we used real-world images for training, tuning and evaluation. Thus, our algorithm can work reliably under these conditions and does not depend on high-quality, ideal images. Nevertheless, the generalisability of these results should be interpreted cautiously, since the size of the training data set was limited. The prevalence of atrophic gastritis varies in different parts of the world,11 and affected patients are more likely to be present in endoscopy-based cohorts. Therefore, we extrapolated the performance metrics for the reported prevalence range of 20% to 50%.11 Although the algorithm performs adequately across these real-world prevalence rates, as shown in online supplementary figure 1C, further prospective evaluation in additional cohorts is essential before routine implementation.
To provide worldwide direct access for a broad group of users, we developed a web-based software tool where image files can be uploaded for analysis by the DL-based algorithm (available free of charge at https://www.ccb.uni-saarland.de/atrophy). Moreover, uploaded images from different settings may lead to more robust algorithms in the future, overcoming the limitations associated with one training data set.
In conclusion, DL can support human decision making in complex settings of GI endoscopy and is a promising tool for clinically relevant endoscopy applications.
Acknowledgments
The authors thank Thomas Adams, MD, Dr Bettina Friesenhahn-Ochs, MD, Dr Katharina Grotemeyer, MD, Dr Oliver Linn, MD, Dr Matthias Reichert, MD, and Simone Zimmermann, MD, for blinded evaluation of endoscopy images, and the team of Professor Dr Rainer Bohle, MD, for histopathological evaluation.
Footnotes
Contributors PG: programming of the deep-learning algorithm, image analysis, statistics, manuscript preparation. AK: revision and editing of the manuscript, supervision of the artificial intelligence part, idea for the study. TF: programming of the web-based software tool. FL: revision and editing of the manuscript, supervision of the clinical part, idea for the study. MC: manuscript preparation, patient identification and coordination of image evaluation by endoscopists.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Patient consent for publication Not required.
Ethics approval The study was approved by the ethics committee of Ärztekammer des Saarlandes (Saarbrücken, Germany; #36/19).
Provenance and peer review Not commissioned; externally peer reviewed.