Abstract
SUMMARY: Clinical adoption of an artificial intelligence–enabled imaging tool requires critical appraisal of its life cycle from development to implementation by using a systematic, standardized, and objective approach that can verify both its technical and clinical efficacy. As part of this concerted effort, the ASFNR/ASNR Artificial Intelligence Workshop Technology Working Group is proposing a hierarchical evaluation system based on the quality, type, and amount of scientific evidence that the artificial intelligence–enabled tool can demonstrate for each component of its life cycle. The current proposal is modeled after the levels of evidence in medicine, with the uppermost level of the hierarchy showing the strongest evidence for potential impact on patient care and health care outcomes. The intended goal of establishing an evidence-based evaluation system is to encourage transparency, foster an understanding of how artificial intelligence tools are created and how they make decisions, and promote reporting of the relevant data on the efficacy of the artificial intelligence tools that are developed. The proposed system is an essential step toward a more formalized, clinically validated, and regulated framework for the safe and effective deployment of artificial intelligence imaging applications in clinical practice.
ABBREVIATIONS:
- AI = artificial intelligence
- HIPAA = Health Insurance Portability and Accountability Act
As artificial intelligence (AI) reimagines many facets of health care, radiology will be a leading force for developing and leveraging AI-based imaging technologies.1-3 This past decade saw a dramatic rise in the number of commercially available AI products receiving US FDA approval for clinical use in imaging.4 As of October 2022, there are 521 FDA-authorized AI-enabled medical devices, of which 75.2% are for radiology use.5 Of these, neuroimaging applications comprise a large share, with estimates of up to 40% of products on the market.6 With the increasing availability of AI software, a systematic method of integrating these tools into a clinically validated and regulated framework is necessary for the safe and effective deployment of medical imaging AI applications in routine clinical patient care. Whereas AI errors in other industries, such as entertainment and advertising, may be tolerable, errors in medicine can be fatal.
Adoption of an AI-enabled tool requires critical appraisal of its life cycle from development to implementation, with careful consideration of the existing scientific evidence supporting its clinical utility. However, standardized objective metrics to quantitate AI quality and clinical utility are currently lacking, limiting the fair and accurate evaluation and comparison of different AI-enabled tools, especially when multiple products exist for the same clinical task.7
These issues are not new, as they also affect other medical imaging software products, but the number and diversity of AI-enabled tools now reaching the market make this a timely moment to consider practical and unbiased ways of assessing such tools. Thus, the ASFNR/ASNR has created an AI workshop technology working group with the goal of providing a practical approach for evaluating the potential effectiveness of AI technology in clinical practice.
Toward this goal, here we introduce an evaluation system using hierarchical levels of evidence that reflect the rigor of the supporting scientific data (Figure). Demonstration of clinical efficacy and value, at the pinnacle of this evaluation system, is the most important factor for clinical adoption.
Different points in the imaging workflow can be augmented by AI-enabled tools, with a range of clinical applications including but not limited to administrative, operational, patient-centered, and image-centered tasks.8-10 For the purposes of this white paper, the hierarchical levels of evidence system is most useful for imaging and patient-related AI applications. However, the main principles can be generalized to other applications.
Finally, the radiologist continues to be an instrumental gatekeeper of patient care quality and safety, particularly now as we enter the era of AI. As clinical domain experts, radiologists provide important oversight on the effective use of AI software in the clinical setting.11 To better position the radiologist in this role, this white paper presents a structured method of guidance on the critical appraisal of AI software using the levels of evidence system.
Levels of Evidence
To date, there are no agreed-upon levels of evidence for the evaluation of AI-enabled tools; thus, the already-established levels of evidence model in medicine provides a practical starting point for developing such a systematic process.12 We propose a hierarchy of levels of evidence reflecting the critical elements of an AI product’s life cycle from development to the clinical implementation phase (Figure).
The two levels at the base of the hierarchy, levels 6 and 7, are considered fundamental requirements that an AI product must meet before further consideration for implementation in the clinical workflow. For example, an AI product must comply with current legal and regulatory requirements (level 7) such as the Health Insurance Portability and Accountability Act (HIPAA) and FDA clearance. Thereafter, it must be compatible with the information technology infrastructure (level 6) at the site where it will be deployed before proceeding with the other requirements listed in the hierarchy.
Further description of the levels of evidence from 1 to 7 is detailed below, with level 1 denoting the highest quality and strongest evidence for potential impact on patient care and health care outcomes. In addition, Table 1 provides an abbreviated summary, while Table 2 provides an expanded summary of each component of the evaluation system.
Data Quality and AI Model Development
AI models should be developed from data that are large, diverse, and reflective of the intended population. However, in practice, access to comprehensive and “big” data is challenging, and training is often performed on limited data.13 This introduces bias that can affect reproducibility, generalizability, and performance outside the data range on which the model was trained. Thus, peer-reviewed publications including information on the source and characteristics of the data used to train, validate, and test the AI model can help end-users determine overall compatibility with the target patient population of interest.14-16
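For illustration only, the following is a minimal sketch of the kind of data characterization such publications can report: a summary of the demographic makeup of the training, validation, and test sets that accompanies the performance figures. The file name and column names (split, age, sex, site) are hypothetical placeholders for whatever metadata a developer actually has, not a prescribed format.

```python
# Minimal sketch (hypothetical file and column names): summarize the
# demographic makeup of the train/validation/test splits so that it can
# be reported alongside model performance.
import pandas as pd

metadata = pd.read_csv("cohort_metadata.csv")  # one row per examination

for split_name, split_df in metadata.groupby("split"):  # eg, train / validation / test
    print(f"--- {split_name} (n={len(split_df)}) ---")
    print("Median age:", split_df["age"].median())
    print("Sex distribution:")
    print(split_df["sex"].value_counts(normalize=True))
    print("Scanning sites:")
    print(split_df["site"].value_counts())
```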
AI companies and developers do not typically report detailed information publicly on the data used to develop or validate their algorithms, even for products that have undergone the necessary FDA clearance process, limiting the ability of end-users to make informed decisions about these products. Thus, the emphasis in this white paper on more than 1 peer-reviewed publication encourages some level of independent, critical, and structured analysis to provide scientific evidence for verifying the intended use and clinical impact of the AI product.
At a minimum, even if a product does not meet this level of evidence expectation, the responsible course is for a company to provide information about its patient population, including demographic characteristics, model development and validation methods, and indicators of statistical efficacy. Purchasers and end-users should expect and require such statistical evidence and, preferably, consider these levels of evidence as indicators of the strength of a tool’s methodological quality of design, validity, and applicability to patient care.
Barriers to improving AI transparency include competing financial incentives among developers, data privacy and sharing restrictions, and some degree of acceptance of the “black box” nature of AI-based solutions. To overcome these limitations, initiatives have been proposed to establish minimum data reporting standards for AI in health care including but not limited to MINIMAR (MINimum Information for Medical AI Reporting), CONSORT-AI (Consolidated Standard of Reporting Trials-Artificial Intelligence), and CLAIM (Checklist for Artificial Intelligence in Medical Imaging).17-19 Others have also introduced checklists, recommendations, and guidelines toward assessing the suitability of AI-based tools in the health care environment.11,20-22 Our proposal utilizing the levels of evidence builds on these ongoing initiatives, with a greater focus on the availability of peer-reviewed evidence and publications, to improve confidence and trust for all stakeholders using AI-based tools.
Selection of a quality standard of reference during the development phase is critical for an accurate and fair comparison of the AI model’s performance against the current standard of practice.23,24 After all, the adoption of any clinical tool relies on scientific evidence that it imparts some advantage over an already existing approach to the problem. Using subpar proxies for the intended clinical task may overestimate the actual performance of the AI model in the clinical setting. For example, an assessment of an AI-enabled tool for the detection of intracranial hemorrhage might report turnaround time in outpatients with unexpected bleeds as a surrogate metric rather than reporting the overall accuracy of the tool.15,25
To evaluate potential real-world clinical efficacy and generalizability, it is important to gauge an AI tool’s performance on an external data set. Selection bias and reliance on retrospective data can lead to an AI model that too closely aligns with the original data and lacks the ability to generalize to new and unseen data. A recent study of deep learning algorithms for image-based radiologic diagnosis suggests that most demonstrate diminished performance on external data sets, with some showing a substantial performance decrease.25
External validation is increasingly recognized as a critical step for evaluating model performance but has been employed in relatively few published studies,26 likely because of the challenges of obtaining an appropriate external data set. Nonetheless, it remains important to use an external testing data set, separate from the original data used to develop the model, to calculate final performance metrics.15,25 This criterion is used to differentiate level 5A from level 5B. Potential sources of external data include data from a different institution or public databases. Further rigorous verification of performance, generalization, and reproducibility can be achieved through a multi-institution approach.
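As a minimal sketch of this reporting pattern, the same performance metrics can be computed on an internal held-out test set and on an external data set (eg, from another institution or a public database) and presented side by side. The labels and model scores below are small illustrative placeholders, not real results.

```python
# Minimal sketch: report identical metrics on internal and external test sets.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def report_performance(name, y_true, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    auc = roc_auc_score(y_true, y_score)
    print(f"{name}: AUC={auc:.2f}, sensitivity={sensitivity:.2f}, "
          f"specificity={specificity:.2f}")

# Placeholder reference labels and model output scores
internal_labels = [0, 0, 1, 1, 1, 0]
internal_scores = [0.20, 0.40, 0.90, 0.70, 0.60, 0.10]
external_labels = [0, 1, 1, 0, 1, 0]
external_scores = [0.30, 0.55, 0.80, 0.45, 0.40, 0.20]

report_performance("Internal test set", internal_labels, internal_scores)
report_performance("External test set", external_labels, external_scores)
```

A drop in the external numbers relative to the internal ones is the signal of limited generalizability described above.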
To provide appropriate oversight on how AI decisions will impact patients, radiologists must encourage AI vendors to explain steps in the AI product’s life cycle, in a manner that would allow for greater understandability and interpretability of its results. Of particular interest are details of the steps taken to reduce bias and ensure quality during the development process.27 Detecting and mitigating bias in a machine learning model can be one of the most effort-intensive steps in the development process, as bias may be introduced at any point in the product’s life cycle. Various approaches to reducing bias include emphasis on data transparency, mathematical approaches to de-biasing, interpretability/explainability of the decision-making process, and postdeployment surveillance strategies.28
Technical Efficacy versus Clinical Efficacy
There is a need to verify both the technical and clinical efficacy of any AI-enabled tool before clinical implementation.29,30 Interestingly, a study in 2020 found that fewer than 40% of commercially available AI products had published, peer-reviewed evidence available demonstrating their efficacy.4 Receiving FDA clearance for clinical use in radiology in no way guarantees clinical utility or clinical efficacy of the product.
Technical efficacy is defined by the ability of the AI model to correctly perform the task for which it was trained (level 4).31 Scientific evidence that supports technical efficacy is often in the form of retrospective studies and includes peer-reviewed information about the AI model’s data quality, development, and performance metrics, benchmarked against similar or alternatively accepted methods in the literature. For example, an automated brain tumor segmentation task may require initial published results on the Dice coefficient or Jaccard index score to demonstrate technical efficacy. Subsequently, it would be important to provide scientific evidence that performance is reproducible and generalizable across different clinical institutions, patient populations, MR imaging field strengths, and imaging vendors.25
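For the segmentation example above, the overlap metrics named (Dice coefficient and Jaccard index) can be computed as in the following minimal sketch, assuming the AI output and the reference-standard annotation are binary masks of the same shape; the toy 2D masks are illustrative stand-ins for real 3D tumor segmentations.

```python
# Minimal sketch of the overlap metrics used to demonstrate technical efficacy
# of a segmentation task: Dice coefficient and Jaccard index on binary masks.
import numpy as np

def dice_coefficient(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

def jaccard_index(pred, truth):
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return intersection / union

# Toy 2D masks standing in for real 3D tumor segmentations
truth = np.zeros((10, 10), dtype=bool)
truth[2:7, 2:7] = True
pred = np.zeros((10, 10), dtype=bool)
pred[3:8, 3:8] = True

print(f"Dice = {dice_coefficient(pred, truth):.2f}")
print(f"Jaccard = {jaccard_index(pred, truth):.2f}")
```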
Clinical efficacy is defined by the ability of the AI model to change patient care and health care outcomes (level 1). Therefore, this requires a higher level of evidence, often in the form of prospective and randomized clinical trials, to prove that the AI-enabled tool can lead to results that are better than the current standard of care. It is important to note that technical efficacy does not equate to clinical efficacy.29-32 For example, performance metrics such as reproducibility, sensitivity, specificity, positive and negative predictive values, and area under the curve summarize AI model performance well but provide little information on how the model could change patient outcomes. Thus, despite impressive and exciting AI research, we continue to see relatively slow adoption of this technology in the health care setting, attributable in part to the paucity of scientific evidence supporting clinical efficacy.33
Bias and Error Mitigation
AI clinical errors often reflect the interplay of different types of biases introduced by the imperfect process of collecting data, training the model, and applying it in practice (level 2).16,34,35 Additionally, AI-enabled tools can project societal and historical biases that may further exacerbate existing inequities related to sex, age, and socioeconomic differences, among others. Thus, it is important to have a systematic approach for monitoring performance variances in different patient populations.36,37 Other mechanisms that can be used to mitigate errors include ensuring data quality, as described above; verifying generalizability and reproducibility across different clinical sites (level 3); and careful consideration of epidemiologic and statistical factors, such as disease prevalence, that can impact AI performance in a specific population.25,31 A major goal of this white paper is to emphasize the importance of peer-reviewed publications, including robust internal and external validation during model development and subsequent validation at other sites. Differing feature distributions among clinical sites and patient populations, such as sex, ethnicity, age, socioeconomic condition, geographic distribution, disease risk factors, imaging equipment, and image quality, can lead to unexpected model performance errors.
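One practical form of such monitoring is to stratify a deployed tool’s performance by patient subgroup and flag large variances for investigation. The following minimal sketch assumes a hypothetical results table with illustrative column names (sex, label, pred); the values are placeholders, not real performance figures.

```python
# Minimal sketch (hypothetical columns, placeholder data): stratify a deployed
# tool's sensitivity by patient subgroup to surface performance variances.
import pandas as pd

results = pd.DataFrame({
    "sex":   ["F", "F", "M", "M", "F", "M", "F", "M"],
    "label": [1,   0,   1,   1,   1,   0,   0,   1],   # reference standard
    "pred":  [1,   0,   0,   1,   1,   0,   0,   0],   # AI output
})

def sensitivity(group):
    positives = group[group["label"] == 1]
    return (positives["pred"] == 1).mean() if len(positives) else float("nan")

# Per-subgroup sensitivity; large gaps between subgroups warrant investigation.
print(results.groupby("sex").apply(sensitivity))
```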
Health care is a fluid and dynamic landscape, with new and evolving clinical practice standards that will require routine re-evaluation of the performance of the AI-enabled tool. This is further compounded by the yet-to-be-defined process by which AI models continuously learn and evolve over time with new data. Thus, defining a practical mechanism for postdeployment monitoring, including an iterative feedback loop among the radiologist, the AI-enabled tool, and the AI company during the implementation phase, will be critical for adapting to these changes and achieving long-term consistent effectiveness.11,29,30,32
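The following is a minimal sketch of one element such a feedback loop could include, assuming a hypothetical local log of radiologist agreement with the AI output; the window size and alert threshold are illustrative assumptions a site would set for itself, not vendor specifications.

```python
# Minimal sketch (illustrative log, window, and threshold): track radiologist
# agreement with the AI output over a rolling window and flag drops that
# should trigger review with the AI vendor.
import pandas as pd

log = pd.DataFrame({
    "case_id": range(1, 11),
    "radiologist_agrees": [1, 1, 1, 0, 1, 1, 0, 0, 1, 0],  # 1 = AI output accepted
})

window, alert_threshold = 5, 0.7
log["rolling_agreement"] = log["radiologist_agrees"].rolling(window).mean()

for _, row in log.dropna().iterrows():
    if row["rolling_agreement"] < alert_threshold:
        print(f"Case {int(row['case_id'])}: rolling agreement "
              f"{row['rolling_agreement']:.0%} is below {alert_threshold:.0%}; flag for review.")
```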
Legal and Regulatory Frameworks
Policies pertaining to patient consent, data collection, and data usage will vary at the state, local, and institutional levels. However, AI companies and health care systems should have standard operating procedures to maintain HIPAA compliance, patient data safety, confidentiality, and privacy (level 7).36,38,39
AI-enabled tools can be subjected to different regulatory requirements, depending on the proposed clinical setting and intended use. For example, for medically oriented AI-based tools, the FDA has 3 levels of clearance: the 510(k), premarket approval, and de novo pathways, each with its own specific criteria, which have been thoroughly explained elsewhere.40
Additionally, many other innovative and experimental AI research tools are being developed in-house under institutional review board approval, outside the purview of government oversight.
Of the AI-enabled tools that have gone through FDA review, most have received FDA 510(k) clearance, which does not require safety or effectiveness data from clinical trials. Instead, the manufacturer can demonstrate that the product is substantially equivalent to a predicate (another FDA-cleared or -approved product). Thus, this white paper’s emphasis on AI-enabled tools having more than 1 peer-reviewed publication is intended to encourage an independent, critical, and structured analysis of the AI product. In contrast, substantially fewer products have gone through the FDA’s more rigorous premarket approval or, alternatively, the de novo pathway, which is designed for AI-enabled medical devices that are not deemed high risk but do not have a predicate.
Currently, any major change to an AI-enabled tool requires resubmission for FDA approval; thus, most AI algorithms remain “static” or “locked” after they are introduced into the market. However, periodic surveillance and refinement of AI algorithms may be needed to adapt to the evolving health care environment41 without going through the full FDA review process again. This has prompted the FDA to consider more efficient and streamlined regulatory pathways for evaluating continuously learning AI through proposals such as the digital health precertification program and the predetermined change control plan, which are currently under discussion. As of now, however, no official process exists for making major amendments to an existing AI algorithm.
The proposed hierarchical levels of evidence can be used to support an AI product’s life cycle in both the static and the continuously learning environment. For continuously learning AI, there is mobility between the levels of the hierarchy. For example, once an AI-enabled tool has established its baseline technical and clinical efficacy, a modification to the algorithm that requires new FDA approval may allow the tool to move from level 7 back to the upper levels by providing additional scientific data, because the other levels were already supported by scientific evidence during its development phase.
Interoperability and Integration into the IT Infrastructure
AI software should integrate seamlessly into the hospital information system, radiology information system, and PACS to be clinically and functionally useful.30,32 A recent white paper on AI interoperability in imaging has explored the problems and challenges that must be addressed to achieve an ecosystem of interoperable AI products.42 Until such harmonized standards are adopted, AI companies will need to provide a clear plan with defined interoperability standards for integration into the existing digital infrastructure (level 6).43 The AI vendor should also be able to provide an on-site demonstration of the clinical tool in action in real time before full deployment. This will be an important opportunity to observe the AI model’s performance on the target population, impact on workflow, and potential errors in clinical practice.
Added Clinical Value
It can take decades for health care innovations to become fully implemented in clinical practice.44 Thus, the full clinical impact of AI on the health care system is still maturing and may not be completely apparent at present. Although challenging, defining and measuring the added value of an early technology remains the single most important factor for achieving clinical success and adoption.2 No current consensus exists on how to measure the added value of an AI-enabled tool in clinical practice. However, one approach is to consider the tool’s potential to improve patient outcomes relative to the cost of achieving that improvement in a value-based health care system:45-47 Value = Patient Outcome/Cost.

As emphasized previously, AI performance accuracy alone does not necessarily lead to improved patient outcomes; future prospective investigations, clinical trials, or meta-analyses (level 1 evidence) are needed to establish such a link. Similarly, AI-enabled tools may reduce costs to the patient and the health care system by guiding clinical decision-making through a more evidence-based approach (ie, early detection of cerebral ischemia); however, longer-term investigations are still needed to understand the cost-benefit ratio.

Randomized clinical trials are considered the gold standard for determining an intervention’s impact on clinical care, and several recent failures to implement AI-based tools in the clinical setting underscore their relevance for selecting AI products with meaningful clinical benefit, especially given the inherent opacity and incomplete understanding of how AI models actually make predictions.48,49 Toward establishing scientific evidence for clinical efficacy, several AI-enabled tools have successfully demonstrated a positive impact on patient-centered outcomes in clinical trials (level 1 evidence).50 The proposed hierarchical levels of evidence can be used to support an AI product’s potential effectiveness and added value in the context of its available scientific data.
Use Cases
To understand how the levels of evidence can be utilized, the following use cases derive from selected real-world applications of AI-enabled tools in the literature. Employing the levels of evidence can facilitate communication and understanding among stakeholders regarding the strength of peer-reviewed evidence available to support a given tool’s reported goal and potential clinical impact.
Level 1 Evidence.
Strong scientific evidence exists for the positive clinical impact of AI-based tools used to guide clinical decision-making in stroke care.51 Specifically, AI-based ischemic stroke triage and management have been shown to decrease patient morbidity and mortality while improving functional outcomes in multiple practice-defining clinical trials.52,53 There is also emerging evidence that these tools have the potential to reduce overall health care costs.54
Level 3 Evidence.
AI-based tools can be used to augment aneurysm detection and analysis. In this example, the AI-based tool has at least 2 retrospective peer-reviewed publications inclusive of 2 or more different institutions.55,56 However, there are currently no prospective data to assess the clinical impact of such a tool.
Level 5B Evidence.
An AI-based tool designed to segment brain tumors has 1 retrospective study describing model development and performance, without the use of an external data set.
In summary, the levels of evidence are an important component of evidence-based medicine, and the adoption of such a classification system can help end-users prioritize information on the quality of AI products. Most importantly, AI-enabled tools exist on a spectrum of scientific rigor, ranging from products lacking peer-reviewed publications altogether to those that have been well validated through multiple randomized clinical trials. The level of evidence that an AI-enabled tool will need will, of course, depend on its intended task, as illustrated above. As with all classification systems, level 1 evidence does not necessarily mean that these data should be accepted as fact, nor should level 5B data be disregarded. Our goal is to introduce a method of scientific scrutiny to address the disconnect between expectations and reality.
CONCLUSIONS
Barriers to the clinical implementation of AI-enabled tools include the lack of understandability of the AI development and decision-making processes, the lack of standardized criteria for comparing product quality and effectiveness, and the lack of rigorous scientific evidence supporting meaningful impact on patient care and health care outcomes. To overcome some of these challenges, the ASFNR/ASNR AI Workshop Technology Working Group has proposed hierarchical levels of evidence to objectively evaluate the scientific merit and potential effectiveness of AI technologies in clinical practice.
Footnotes
Disclosure forms provided by the authors are available with the full text and PDF of this article at www.ajnr.org.
- Received December 16, 2022.
- Accepted after revision March 16, 2023.
- © 2023 by American Journal of Neuroradiology