Assessing Generalizability of an AI-based Visual Test for Cervical Cancer Screening
Syed Rakin Ahmed, Didem Egemen, Brian Befano, Ana Cecilia Rodriguez, Jose Jeronimo, Kanan Desai, Carolina Teran, Karla Alfaro, Joel Fokom-Domgue, Kittipat Charoenkwan, Chemtai Mungo, Rebecca Luckett, Rakiya Saidu, Taina Raiol, Ana Ribeiro, Julia C. Gage, Silvia de Sanjose, Jayashree Kalpathy-Cramer, Mark Schiffman
Abstract
A number of challenges hinder artificial intelligence (AI) models from effective clinical translation. Foremost among these challenges is the lack of generalizability, defined as the ability of a model to perform well on datasets that have different characteristics from the training data. We recently developed an AI pipeline on digital images of the cervix, utilizing a multi-heterogeneous dataset of 9,462 women (17,013 images) and a multi-stage model selection and optimization approach, to generate a diagnostic classifier able to classify images of the cervix into “normal”, “indeterminate” and “precancer/cancer” (denoted as “precancer+”) categories. In this work, we investigate the performance of this multiclass classifier on external data not utilized in training or internal validation, to assess the generalizability of the classifier when moving to new settings.
Introduction
Artificial intelligence (AI) and deep learning (DL) approaches have become seemingly ubiquitous in recent years across several clinical domains, with optimized models reporting near-clinician-level performance [1–4]. However, translation of AI models from bench to bedside remains sparse. To be clinically translatable, AI/DL models should be robust, computationally efficient, and low-cost, and should blend well with existing clinical workflows, so that the inputs and outputs of the model and the task it performs are those most relevant to the clinician for a given use case. This is often not the case with existing models, which are frequently hindered by several key methodological flaws in their design [5] that undermine their validity and hinder clinical translation.
Materials & Methods
In this paper, we utilized a model that we developed in a prior study, following a multi-stage model selection and optimization process utilizing a multi-heterogeneous dataset, henceforth referred to as “SEED” [20]. The primary discernible axes of heterogeneity in this prior work included image capture device and geography. In the current study, we conducted a thorough external validation of our model by running the model on images collected from a new, external dataset, henceforth termed “EXT”.
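As an illustration of what evaluating along separate axes of heterogeneity can look like, the following is a minimal sketch, not the authors' pipeline (which is available in the repository listed under Data Availability). It assumes each image record carries hypothetical fields `device`, `geography`, `label`, and `pred`; the function name `stratified_accuracy`, the group names, and the toy records are all invented for illustration and are not study data.

```python
from collections import defaultdict

# The three output categories named in the abstract.
CLASSES = ("normal", "indeterminate", "precancer+")


def stratified_accuracy(records, axis):
    """Return {group: fraction of correct predictions} for one heterogeneity axis.

    `records` is an iterable of dicts with (hypothetical) keys
    "device", "geography", "label" and "pred"; `axis` selects the grouping key.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[axis]
        total[group] += 1
        correct[group] += int(rec["pred"] == rec["label"])
    return {group: correct[group] / total[group] for group in total}


# Toy records, invented purely for illustration (NOT study data).
toy_records = [
    {"device": "device_A", "geography": "site_1", "label": "normal", "pred": "normal"},
    {"device": "device_A", "geography": "site_2", "label": "precancer+", "pred": "precancer+"},
    {"device": "device_B", "geography": "site_2", "label": "normal", "pred": "indeterminate"},
    {"device": "device_B", "geography": "site_2", "label": "precancer+", "pred": "precancer+"},
]

by_device = stratified_accuracy(toy_records, "device")
by_geography = stratified_accuracy(toy_records, "geography")
```

Reporting the same metric once per device and once per geography makes it possible to see which axis is associated with a larger performance gap. Plain accuracy is used here only for brevity; any per-group classification metric could be substituted.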
Results
Our results highlight two critical findings in terms of model generalizability, which, we believe, hold relevance even outside of cervical imaging, as noted below:
- Device-level heterogeneity impacts model performance more than geography-level heterogeneity. Our model performs well out of the box (no retraining) on external datasets where the axis of heterogeneity is geography rather than device, i.e., on images from a different geography but captured with a device that is represented in the training dataset. However, the repeatability of our model is unaffected by data heterogeneity and is strong throughout.
- Incremental retraining that adds images from a new device to the training dataset progressively improves classification performance and class discrimination on images from that device, which was previously not represented in the training dataset, up to a point of saturation.
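The notion of a "point of saturation" in the second finding can be made concrete with a small sketch. This is an assumption-laden illustration rather than the procedure used in the study: it supposes that, after each retraining round, a single scalar performance metric has been recorded alongside the number of new-device images added, and it flags the first round after which the gain falls below a chosen tolerance. The function name `saturation_point`, the `min_gain` threshold, and the toy curve are all hypothetical.

```python
def saturation_point(curve, min_gain=0.01):
    """Return the image count after which adding more data stops helping.

    `curve` is a list of (n_new_device_images, metric) pairs, sorted by the
    number of new-device images added to the training set. The function
    returns the image count of the first point whose next increment improves
    the metric by less than `min_gain`, or None if no such plateau is seen.
    """
    for (n_prev, m_prev), (_, m_next) in zip(curve, curve[1:]):
        if m_next - m_prev < min_gain:
            return n_prev
    return None


# Toy learning curve, invented purely for illustration (NOT study results).
toy_curve = [(0, 0.60), (100, 0.72), (200, 0.80), (400, 0.805), (800, 0.806)]

plateau_at = saturation_point(toy_curve, min_gain=0.01)
```

In this toy curve the gain drops below the tolerance after 200 added images, so `plateau_at` is 200. The choice of metric and of `min_gain` is a judgment call that depends on the clinical setting.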
Discussion
The use of AI models as possible biomarkers continues to be hindered by key factors that affect their clinical translation. To be effective, any biomarker needs to: 1. generate reproducible test results; 2. acknowledge uncertainty, particularly when the underlying predictive task has pre-existing uncertainty (e.g., ASCUS in the Bethesda system); and 3. acknowledge the need for, or the lack of, generalizability to data heterogeneities. In this work, we address each of these properties in turn by first investigating the key axes of heterogeneity present in the underlying data, and subsequently demonstrating that the key design innovations of our multiclass AVE model are optimized for improved repeatability and classification performance and can translate well into new settings in order to facilitate clinical decision-making.
Citation: Ahmed SR, Egemen D, Befano B, Rodriguez AC, Jeronimo J, Desai K, et al. (2024) Assessing generalizability of an AI-based visual test for cervical cancer screening. PLOS Digit Health 3(10): e0000364. https://doi.org/10.1371/journal.pdig.0000364
Editor: J. Mark Ansermino, University of British Columbia, CANADA
Received: September 8, 2023; Accepted: July 16, 2024; Published: October 2, 2024
Copyright: This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: The materials used to train and generate results can be found at the following repository: https://github.com/QTIM-lab/cervix_generalizability.
Funding: This work was supported by the National Cancer Institute (NCI) of the National Institutes of Health (NIH). All NCI-affiliated staff are supported by the NCI Intramural Research Program including supplemental funding from the Cancer Cures Moonshot Initiative. Additionally, BB was supported by NCI/NIH under Grant T32CA09168. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.