SUMMARY:
In this review, concepts of algorithmic bias and fairness are defined qualitatively and mathematically. Illustrative examples are given of what can go wrong when unintended bias or unfairness enters algorithmic development. The importance of explainability, accountability, and transparency in artificial intelligence algorithm development and clinical deployment is discussed, grounded in the concept of "primum non nocere" (first, do no harm). Steps to mitigate unfairness and bias in task definition, data collection, model definition, training, testing, deployment, and feedback are provided. Implementation of fairness criteria that maximize benefit and minimize unfairness and harm to neuroradiology patients is discussed, including suggestions for neuroradiologists to consider as artificial intelligence algorithms gain acceptance into neuroradiology practice and become incorporated into routine clinical workflow.
ABBREVIATION: AI = artificial intelligence
Artificial intelligence (AI) is beginning to transform the practice of radiology, from order entry through image acquisition and reconstruction, workflow management, diagnosis, and treatment decisions. AI will certainly change neuroradiology practice across routine workflow, education, and research. Neuroradiologists are understandably concerned about how AI will affect their subspecialty and how they can shape its development. Multiple published consensus statements advocate the need for radiologists to play a primary role in ensuring that AI software used for clinical care is fair to and unbiased against specific groups of patients.1 In this review, we focus on the need for developing and implementing fairness criteria and how to balance competing interests that minimize harm and maximize patient benefits when implementing AI solutions in neuroradiology. The responsibility for promoting health care equity rests with the entire neuroradiology community, from academic leaders to private practitioners. We all have a stake in establishing best practices as AI enters routine clinical practice.
Definitions
“Ethics,” in a strict dictionary definition, is a theory or system of values that governs the conduct of individuals and groups.2 Ethical physicians should endeavor to promote fairness and avoid bias in their personal treatment of patients and with respect to the health care system at large. A biased object yields 1 outcome more frequently than statistically expected, eg, a 2-headed coin. Similarly, a biased algorithm systematically produces outcomes that are not statistically expected. One proposed definition for algorithmic bias in health care systems is “when the application of an algorithm compounds existing inequities in socioeconomic status, race, ethnic background, religion, sex, disability, or sexual orientation to amplify them and adversely impact inequities in health systems.”3 This definition, while not ideal, calls on developers and end users of AI algorithms in health care to be aware of the risk that poorly designed algorithms will not merely reflect societal imbalances but also amplify existing inequities.
“Fairness” can be defined as the absence of favoritism toward specific subgroups of populations.4 Individual fairness is the principle that any 2 individuals who are similar should be treated equally.5 In contrast, “group fairness,” ie, statistical or demographic parity, is the principle that the demographics of the group receiving positive or negative treatment are the same as the population as a whole.5 Considering harm caused by algorithmic bias, ie, allocational (denial of opportunities or resources6) or representational (reinforcement of negative stereotypes7) harm, may be more intuitive.
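To make the group-fairness definition concrete, the sketch below is a minimal illustration (with hypothetical predictions and subgroup labels, not drawn from any cited study) of the statistical-parity difference: the gap in positive-prediction rates between two subgroups, which is 0 under perfect demographic parity.

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two subgroups.

    A value near 0 indicates group fairness in the statistical-parity sense;
    large absolute values indicate that one group is systematically favored.
    """
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_a = y_pred[group == "A"].mean()   # positive-prediction rate in subgroup A
    rate_b = y_pred[group == "B"].mean()   # positive-prediction rate in subgroup B
    return rate_a - rate_b

# Hypothetical model outputs for 8 patients from two demographic subgroups
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(statistical_parity_difference(y_pred, group))  # 0.75 - 0.25 = 0.50
```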
Algorithmic Bias
We can quantify bias (δ) for AI models with regard to bias features, z, as δ(z) = D(Ŷz, Ŷ), in which D is a distance metric that measures the difference between the model outcomes conditioned on the bias feature (Ŷz) and the expected outcomes (Ŷ). The formula intuitively corresponds to the dictionary definition of “bias” of AI models by measuring how much the model outcomes deviate from those expected. Bias features z can be explicit (eg, sex, race, age, and so forth) or implicit (eg, data set imbalance, model architectures, poorly chosen learning metrics).8
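As an illustration of this formulation, the sketch below estimates δ for a binary classifier by comparing the outcome distribution conditioned on the bias feature z with the overall outcome distribution. The choice of total variation distance as D, and the binary outcome, are assumptions for illustration only; the cited work may use a different metric.

```python
import numpy as np

def bias_delta(y_pred, z, z_value):
    """Estimate bias delta for a binary classifier with respect to z = z_value.

    Uses total variation distance between P(y_hat | z = z_value) and P(y_hat)
    as the distance metric D; other distance metrics could be substituted.
    """
    y_pred = np.asarray(y_pred)
    z = np.asarray(z)
    p_cond = np.array([np.mean(y_pred[z == z_value] == c) for c in (0, 1)])
    p_all  = np.array([np.mean(y_pred == c) for c in (0, 1)])
    return 0.5 * np.abs(p_cond - p_all).sum()  # total variation distance

# Hypothetical predictions and a binary bias feature (eg, sex)
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
z      = [0, 0, 0, 0, 1, 1, 1, 1]
print(bias_delta(y_pred, z, z_value=0))  # deviation of subgroup z=0 from overall
```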
Algorithmic Fairness
Scientists and companies involved in designing and implementing AI solutions across various industries have recognized the importance of fairness and social responsibility in the software they create, embodied in the concept of fairness, accountability, transparency, and ethics in AI.9 For commercial algorithms, there are regulatory considerations. For example, the Federal Trade Commission is empowered to prohibit “unfair or deceptive acts or practices in or affecting commerce,” which include racially biased algorithms.10 A bill introduced in Congress (the Algorithmic Accountability Act) would go further by directing the Federal Trade Commission to require impact assessment around privacy, security, bias, and fairness from companies developing automated decision-making systems.11
Multiple ways to measure algorithmic fairness have been developed.12-15 Corbett-Davies and Goel14 proposed 3 definitions for algorithmic fairness: 1) anticlassification, in which protected features (eg, sex, race) are explicitly excluded from the model; 2) classification parity, in which model performance is equal across groups defined by protected features; and 3) calibration, in which outcomes are independent of protected attributes conditional on the model's risk score. However, the impossibility theorem shows that it is not possible to simultaneously equalize false-positive rates, false-negative rates, and positive predictive values across protected classes while maintaining calibration or anticlassification fairness.12 If only 1 fairness criterion can be achieved, clinical and ethical reasoning will be required to determine which one is appropriate.16
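The sketch below is a minimal illustration (with hypothetical labels and group names) of the quantities underlying classification parity. In practice, the impossibility theorem means that equalizing all three quantities across groups while also preserving calibration or anticlassification fairness is generally unattainable, so one must decide which disparity matters most clinically.

```python
import numpy as np

def group_error_rates(y_true, y_pred, group, g):
    """False-positive rate, false-negative rate, and PPV within one subgroup."""
    y_true = np.asarray(y_true)[np.asarray(group) == g]
    y_pred = np.asarray(y_pred)[np.asarray(group) == g]
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return fpr, fnr, ppv

# Hypothetical ground truth and predictions for two subgroups
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group  = ["A", "A", "A", "A", "B", "B", "B", "B"]
for g in ("A", "B"):
    print(g, group_error_rates(y_true, y_pred, group, g))  # disparities across groups
```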
Techniques have been developed to explain poor fairness scores in AI algorithms. One approach applied the “additive features” decomposition method17 to quantitative fairness metrics14,15 (eg, statistical parity).18 Using simulated data in which features were purposefully manipulated to produce poor statistical parity, this method identified the features most responsible for fairness disparities in the outputs of AI algorithms.
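As a loose illustration of attributing a fairness gap to individual features, the sketch below uses a simple permutation-style proxy rather than the additive-decomposition method of the cited work; the synthetic data, model, and feature roles are all assumptions. Shuffling the feature that is correlated with the sensitive attribute shrinks the statistical-parity gap the most, pointing to it as the main driver of the disparity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic tabular data: feature 0 is correlated with the sensitive group
group = rng.integers(0, 2, size=500)                        # sensitive attribute
X = np.column_stack([group + 0.3 * rng.normal(size=500),    # proxy feature
                     rng.normal(size=500)])                  # neutral feature
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0.5).astype(int)

model = LogisticRegression().fit(X, y)

def parity_gap(model, X, group):
    """Absolute statistical-parity difference between the two groups."""
    pred = model.predict(X)
    return abs(pred[group == 0].mean() - pred[group == 1].mean())

baseline = parity_gap(model, X, group)
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the feature-group link
    drop = baseline - parity_gap(model, X_perm, group)
    print(f"feature {j}: parity gap drops by {drop:.3f} when permuted")
```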
AI Algorithms: What Could Possibly Go Wrong?
Prominent examples from outside of medicine can be instructive in understanding how particular problems in AI processes, namely lack of representative data sets and inadequate validation, may lead to unfair outcomes with the potential for serious consequences. A scarcity of training data from geographically diverse sources can lead to both representational harm (through bias amplification)19 and allocational harm (from algorithms working less accurately).20,21 A study of facial-recognition programs reported that while all software correctly identified white males (<1% error rate), the failure rate for women of color ranged from 21% to 35%.22 A ProPublica23 investigation of an AI algorithm that assessed the risk of recidivism showed that white defendants who re-offended were incorrectly classified as low risk almost twice as often as black defendants who re-offended. In contrast, black defendants who did not re-offend were almost twice as likely as white defendants to be misclassified as at high risk of violent recidivism. These AI algorithms were inadvertently used to perpetuate institutional racism.24 There are many theoretical reasons for the poor performance, with nonrepresentative training data likely the most important factor.
Primum Non Nocere
Embedded in the Hippocratic Oath for physicians is the concept of “primum non nocere” (first, do no harm), which applies to technological advances in medicine, including neuroradiology and AI implementation. AI models deployed in health care can lead to unintended unfair patient outcomes and can exacerbate underlying inequity. Not surprisingly, given the massive interest in applying AI to medical imaging, examples of bias specific to neuroradiology are emerging. In a study that analyzed >80 articles that used AI on head CT examinations, >80% of data sets were found to be from single-center sources, which increases the susceptibility of the models to bias and increases model error rates.25 The prevalence of brain lesions in the training and testing data sets did not match real-world prevalence, which likely inflates reported model performance.25 In a meta-analysis of AI articles on intracranial aneurysm detection, the authors concluded that most studies had a high risk of bias with poor generalizability, with only one-quarter of studies using an appropriate reference standard and only 6/43 studies using an external or hold-out test set.26 They found only low-level evidence for using these AI algorithms, and none of the studies specifically tested for the possibility of bias in algorithm development.26 In a study that used AI models to detect both intracranial hemorrhage and large-vessel occlusion, the algorithm showed similar excellent performance in diverse populations regardless of scanning parameters and geographic distribution, suggesting that it is unbiased.27 However, this study did not use independent data sets to test that assertion formally.27
In the neuroradiology literature, there are currently few studies assessing how bias may affect AI algorithms developed for routine clinical use. In 1 study, training cohort bias in 15O-water PET CBF calculation was evaluated.28 The study showed that predictions in patients with cerebrovascular disease were poorer if only healthy controls were used for training models. However, predictions for healthy controls were unaffected if the models were trained only on patient data.28 Training with data including healthy controls and patients with cerebrovascular disease yielded the best performance.28 These neuroradiology examples suggest that incorporating diverse patient characteristics that reflect the target patient populations in the training and validation sets is a reasonable strategy for mitigating bias.
In health care, there are many potential sources of bias, such as age, sex, ethnicity, culture, geography, environment, and socioeconomic status, along with additional confounders such as disease prevalence and comorbidities.1 It is easy to imagine that physical characteristics present in neuroradiology images could affect algorithm performance if not sufficiently represented in training sets. Inadequate sampling or a mismatch between training and real-world disease prevalence could impair performance in different populations. Population-based studies could have inadequate inclusion of diverse data. In neuroradiology, additional sources of bias include heterogeneity of scanners, scanner parameters, acquisition protocols, and postprocessing algorithms.
Other ethical issues in AI use center on clinical deployment. Will the use of algorithms be equitable across hospital systems, or will only large, urban academic hospitals have access to state-of-the-art tools? Other considerations include whether the AI model will perform robustly across time. Medicine, health care practices, and devices are constantly evolving. Models need to be periodically validated on diverse populations and calibrated with data reflecting current clinical practices if they are expected to remain clinically relevant.29 In medicine, interesting case studies that defy common medical knowledge can improve our understanding of disease and lead to practice changes. One such example is that of a patient who defied the odds of a severe motor vehicle crash to achieve complete recovery.30 How to incorporate these outlier cases into AI algorithms is unclear. Overall, effective, fair, and ethical applications of AI to neuroradiology problems will require balancing competing demands across multiple domains (Online Supplemental Data).
Mitigating Bias and Unfairness
Sources of bias in medical AI have been previously described.16 In brief, there may be biases in the training data set construction, model training, clinician/patient interaction, and model deployment. It is incumbent on all stakeholders to do their part in mitigating bias and unfairness in the development, deployment, and use of AI models in neuroradiology.
Integration of Fairness, Accountability, Transparency, and Ethics Principles in the AI Cycle
Fairness, accountability, transparency, and ethics principles should be integrated1,31,32 into the AI development lifecycle (Fig 1, adapted from Cramer et al31 and the Online Supplemental Data). Diverse stakeholder involvement is critical for all stages. For task definition, one should clearly define the intended long-term effects of the task and model. One should define processes for discovering unintended biases at this stage. This outcome can be achieved by defining fairness requirements.
One should ensure that data collection is ethical and transparent and allows sufficient representation of protected groups. One should check for biases in data sources. Many neuroradiologic AI applications require labeled data, eg, subarachnoid hemorrhage versus subdural hemorrhage. How and by whom are labels generated? Does the labeling match the expected clinical deployment context? One should check for biases in how data are collected, which could lead to underrepresentation of underserved populations. Data collection should preserve privacy. For example, the collection of high-resolution images enables reconstruction of faces that can potentially be cross-linked to the patient's real identity through face-recognition software,33 as demonstrated in a reconstruction of data from an anthropomorphic phantom (Fig 2).34 In a PET study for which CT and MR imaging data were collected for standard uptake value quantification, researchers showed that face-recognition software could match facial reconstructions from CT and MR imaging data to actual face photographs, with correct match rates ranging from 78% (CT) to 97%–98% (MR imaging), leading the researchers to advocate for the routine use of de-identification software.35 When de-identification software was used, recognition rates plummeted to 5% for CT and 8% for MR imaging without affecting standard uptake value quantification.35 A recent report described novel de-identification software that deliberately distorts the ears, nose, and eyes, preventing facial recognition from CT and MR images,36 which may be a viable solution to this privacy concern.
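One simple, concrete check at this stage is to compare the demographic composition of the collected data set against the intended target population before any training begins. The sketch below is a minimal illustration; the category names, target proportions, and flagging threshold are assumptions to be replaced by the characteristics of the intended deployment population.

```python
import pandas as pd

# Hypothetical demographic breakdown of a collected neuroimaging data set
collected = pd.Series(["F", "M", "M", "M", "F", "M", "M", "M", "F", "M"], name="sex")

# Assumed composition of the target patient population
target = {"F": 0.50, "M": 0.50}

observed = collected.value_counts(normalize=True)
for category, expected in target.items():
    gap = observed.get(category, 0.0) - expected
    flag = "  <-- underrepresented" if gap < -0.10 else ""   # assumed 10% tolerance
    print(f"{category}: collected {observed.get(category, 0.0):.0%}, target {expected:.0%}{flag}")
```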
To address patient privacy concerns, many AI applications use synthetic data for training.37 These synthetic data sets are typically produced using generative algorithms38 and have the potential for promoting data-sharing (being unrestricted by regulatory agencies) and for the creation of diverse data sets.37 However, the use of synthetic data can lead to nonrealistic scenarios39 or inadvertently reinforce biases.40,41
For the model definition stage, model assumptions must be clearly defined and potential biases identified. The model architecture must be checked for the introduction of biases, and the cost function must be examined for unintended adverse effects.42
For the training stage, several free online resources are available to detect and mitigate bias43 based on statistical definitions of fairness. Fairlearn44 and AI Fairness 36045 provide tools to detect and mitigate unfairness. ML-fairness-gym takes a slightly different approach, using simulation to evaluate the long-term fairness effects of learning agents in a specified environment.46 The What-If Tool lets one visualize trained models to detect bias with minimal coding.47 In addition, embedding learning methods that can debias AI models may help mitigate unfairness.8 For example, Amini et al8 proposed learning latent space structures to reweight data during training and produce a less biased classifier.
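As one example of what these toolkits provide, the sketch below uses the Fairlearn MetricFrame to report performance metrics disaggregated by a sensitive feature and to summarize the largest between-group difference. The data are hypothetical, and the exact function names reflect recent Fairlearn releases and should be checked against the installed version.

```python
from fairlearn.metrics import MetricFrame, selection_rate, false_positive_rate
from sklearn.metrics import accuracy_score

# Hypothetical labels, model predictions, and a sensitive feature (eg, sex)
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sex    = ["F", "F", "F", "F", "M", "M", "M", "M"]

mf = MetricFrame(
    metrics={"accuracy": accuracy_score,
             "selection_rate": selection_rate,
             "false_positive_rate": false_positive_rate},
    y_true=y_true,
    y_pred=y_pred,
    sensitive_features=sex,
)
print(mf.by_group)        # metric values within each subgroup
print(mf.difference())    # largest between-group gap for each metric
```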
For the testing stage, one should ensure that testing data have not leaked into the training data, match the expected deployed clinical context, and sufficiently represent the expected patient population. Potential issues with data-distribution discrepancies48 can exacerbate unfairness. Variations among data sets can lead to biased learning of features from data sets collected from different sources (ie, domains) under different conditions.49 Comparing differences between the source domain (where training data were collected) and the target domain (the test data for which the AI model will be used) may help explain any biases that are found. Many advanced domain-matching algorithms have been introduced to improve AI fairness by reducing the domain differences for cross-site data sets.50,51
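A common source of such leakage in imaging is splitting at the image or series level, so that different studies from the same patient appear in both training and test sets. The sketch below is a minimal illustration of a patient-level split using scikit-learn; the patient identifiers and labels are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: several images per patient
X = np.arange(10).reshape(-1, 1)                   # image-level features (placeholder)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])       # image-level labels
patient_id = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4", "p5", "p5"]

# GroupShuffleSplit keeps all images from one patient on the same side of the split
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_id))

train_patients = {patient_id[i] for i in train_idx}
test_patients = {patient_id[i] for i in test_idx}
assert train_patients.isdisjoint(test_patients), "patient-level leakage detected"
print("train:", sorted(train_patients), "test:", sorted(test_patients))
```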
In the deployment stage, continued surveillance of performance in terms of fairness and accuracy is needed. One should determine whether detected errors are one-off or systemic problems. There is no consensus yet on who bears this responsibility. Is it the end-users (radiologists/clinicians), the health care system/hospitals, or the vendors who make and sell the product? How will the algorithms be provided to the medical community? Will they be available equitably to diverse communities? One should ideally be able to explain how the trained AI model makes its decisions and predictions.
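As a minimal sketch of such surveillance (hypothetical data; the alert threshold and grouping variable are assumptions to be set by local governance), per-group performance can be tracked over successive monitoring windows and flagged when any subgroup drifts below an agreed floor.

```python
import numpy as np

ALERT_THRESHOLD = 0.85  # assumed minimum acceptable per-group accuracy

def monitor_batch(y_true, y_pred, group):
    """Report per-group accuracy for one monitoring window and flag drops."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    for g in np.unique(group):
        acc = np.mean(y_pred[group == g] == y_true[group == g])
        status = "ALERT" if acc < ALERT_THRESHOLD else "ok"
        print(f"group {g}: accuracy {acc:.2f} [{status}]")

# One hypothetical monitoring window of deployed predictions, grouped by site
monitor_batch(
    y_true=[1, 0, 1, 1, 0, 0, 1, 0],
    y_pred=[1, 0, 1, 1, 0, 1, 0, 1],
    group=["site_A"] * 4 + ["site_B"] * 4,
)
```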
For the feedback stage, use and misuse of the system in the real world should be monitored and corrected in a transparent fashion. Fairness metrics14,15 should be evaluated and then used to refine the model. Accountability for errors needs to be predefined.
Trust, Radiology, and AI: Guiding Principles
Neuroradiologists need to become educated and involved to ensure that AI is used appropriately in the diagnosis, management, and treatment of patients. For neuroradiologists to trust the use of AI in image interpretation, there needs to be greater transparency about the algorithm. Training data are foundationally critical to algorithm development, explaining why “good” data are so valuable. Therefore, trust-building for neuroradiology starts with the quality of data, its collection and management, its evaluation, the quality of its associated labels, and the protection of patient privacy. To many radiologists, the entire field of AI is opaque, where a “black box” takes images and spews out predictive analytics. For AI to gain widespread acceptance by patients and radiologists, everyone needs to comprehend how a particular trained AI model works.52
There are many unresolved issues around the development of AI in radiology. Large amounts of imaging data are needed, which are difficult to share among institutions because there is reluctance to engage in data-sharing agreements when imaging data are financially valuable to industry. Additionally, there are data-use agreements and data-sharing agreements that stipulate noncommercial use. However, some might argue that excluding companies from developing products on the basis of de-identified, shared data is itself counterproductive and cannot be enforced in a meaningful way. Federated learning shows promise in disrupting this sharing-based landscape because it alleviates the need to share patient data by training models that gain knowledge from local data that are retained within the acquiring institution at all times.53,54 However, security concerns54 such as inferential attacks and “model poisoning” from corruption of the AI model and/or data from ≥1 site remain.55,56 Unfairness in federated learning can be exacerbated by the challenge of simultaneously maintaining accuracy and privacy;57 however, these potential limitations are being addressed.58 Informed consent, ownership of data, privacy, and protection of data are major topics that remain in flux without clear best practice guidelines.59
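To make the federated mechanism concrete, the sketch below is a toy illustration of federated averaging with a simple logistic model: only model weights, not patient data, leave each institution. The simulated site data are assumptions, and real deployments add secure aggregation, encryption, and defenses against the attacks noted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=20):
    """One site's training pass; only the updated weights leave the institution."""
    w = weights.copy()
    for _ in range(epochs):
        pred = 1.0 / (1.0 + np.exp(-X @ w))      # logistic model
        w -= lr * X.T @ (pred - y) / len(y)      # gradient step on local data
    return w

# Simulated private data sets at three institutions (never pooled centrally)
sites = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(float)
    sites.append((X, y))

global_w = np.zeros(3)
for communication_round in range(10):
    local_weights = [local_update(global_w, X, y) for X, y in sites]
    global_w = np.mean(local_weights, axis=0)    # federated averaging of site weights
print("global model weights:", np.round(global_w, 2))
```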
For AI algorithm development in academic medical centers, new concepts are necessary. Should we assume that patients who enter a major academic medical center automatically opt-in to allow their anonymized imaging data to be used for research, including AI? Do patients need to explicitly opt-out in writing? If patient data are used to develop AI algorithms, should patients be financially compensated? One viewpoint is that “clinical data should be treated as a form of public good, to be used for the benefit of future patients” once its use for clinical treatment has ended.60 These questions underscore the need to consider both the patient's and society's rights with respect to the use of such data.
The core principles of ethical conduct in patient research include beneficence (do only good), nonmaleficence (do no harm), autonomy, and justice,61 which must also guide AI development in neuroradiology. In the AI era of neuroradiology, conflicts may arise around how much decision-making is retained by the neuroradiologist and how much is willingly ceded to an AI algorithm. Floridi and Cowls52 stated that the “autonomy of humans should be promoted and that the autonomy of machines should be restricted and made intrinsically reversible, should human autonomy need to be protected and re-established.” This principle is precisely what failed when pilots were unable to override an automated, erroneous AI-driven navigation system to prevent nosedives, leading to plane crashes with significant loss of life.62 Justice is conceptually implicit throughout AI development in neuroradiology, from the data chosen to train the model to its validation, so that no harm or unfairness occurs to certain groups of patients.52
Some researchers have articulated the need for a new bioethical consideration specifically to address algorithm development of AI in neuroradiology. Explicability can include explainability (how does it work?) and accountability (who is responsible for how it works?).52,63,64 It is important that both patients and neuroradiologists understand how imaging tools such as AI algorithms are used to render decisions that impact their health and well-being, particularly around potentially life-saving decisions in which neuroradiology has a clear role. For example, a visual saliency map that delineates on images where the AI algorithm focused its attention to arrive at a prediction (ie, intracranial metastatic lesion on a brain MR imaging examination) would be useful to drive its acceptance by both clinicians and patients.65 Neuroradiologists need to think like patients and adopt patient-centered practices when AI is deployed. Neuroradiologists should establish a practice to address real or perceived grievances for any unintended harm attributable to AI use.52 Fear, ignorance, and misplaced anxiety around novel technology can derail the best of scientific intentions and advances, so we need to be prudent as we develop AI and encode bioethical principles into its development and deployment. Transparency can build trust,66 with both code and data sets made publicly available whenever possible. However, for AI applications involving medical images, one must also balance the need for open science with patient privacy.
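As a minimal sketch of how such a saliency map can be produced, the example below uses gradient-based saliency in PyTorch with an untrained placeholder network and a random input; in practice, the trained clinical model and a preprocessed study would be used, and other attribution methods could be substituted.

```python
import torch
import torch.nn as nn

# Placeholder 2D convolutional classifier standing in for a trained clinical model
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 2),   # eg, metastasis present vs absent
)
model.eval()

image = torch.randn(1, 1, 128, 128, requires_grad=True)  # stand-in for an MR slice

logits = model(image)
score = logits[0, 1]          # score for the "lesion present" class
score.backward()              # backpropagate the score to the input pixels

saliency = image.grad.abs().squeeze()   # |d(score)/d(pixel)| as a saliency map
print(saliency.shape)                    # (128, 128), overlayable on the image
```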
Ideally, neuroradiologists should be able to explain in lay language how data are used to build an AI tool, how the AI algorithm rendered a particular prediction, what that prediction means to patient care, and how accurate and reliable those predictions are.64,65 This explanation will require education in AI from residency through fellowship and a process of life-long learning. The American Society of Neuroradiology (ASNR) convened an AI Task Force to make recommendations around education, training, and research in AI so that the ASNR maintains its primacy as a leader in this rapidly evolving field.
Suggestions for Neuroradiologists in AI
Academic neuroradiologists need to lead. It is our responsibility to establish benchmarks for best practices in the clinical use of AI in conjunction with our academic partners in imaging societies such as the American College of Radiology and the Radiological Society of North America, as well as federal stakeholders such as the National Institutes of Health, National Institute of Standards and Technology, the Advanced Research Projects Agency, and the Food and Drug Administration. Although guidelines have been published around the ethical implementation of AI code, more work is needed from all relevant stakeholders, including neuroradiologists, clinicians, patients, institutions, and regulatory bodies, so that consensus builds around best practices that include the new concepts of explainability and accountability while preserving patient privacy and protecting against security breaches such as cyberattacks.1,52,61,65 Quality assurance and quality improvement processes will be needed to detect potential biases in algorithms used in clinical care. Additional processes are needed to redress any perceived grievances and to quantify how AI affects patient outcomes.67 In the Online Supplemental Data, guidelines spanning the AI development lifecycle are listed as essential questions to ask around task definition, data collection, model definition, training and testing, and deployment and feedback, particularly when neuroradiologists are asked to evaluate clinical AI tools for their practices.
Summary
In a joint North American and European consortium white paper,1 the authors recommended that AI in radiology should “promote any use that helps individuals such as patients and providers and should block the use of radiology data and AI algorithms for irresponsible financial gains.” Additionally, all AI algorithms must be informed by bioethical principles so that the benefits of AI outweigh the risks, the potential for harm or bad outcomes is minimized, and the chance that AI will lead to greater health care inequity is reduced. Neuroradiologists need to participate fully in this transformative technology and set best practice standards for fair, ethical, and nonbiased deployment of AI in routine neuroimaging practice.
ACKNOWLEDGMENTS
We acknowledge Yilan Gu for assistance in literature research and Jacob Calkins for assistance with the phantom data.
Footnotes
Disclosure forms provided by the authors are available with the full text and PDF of this article at www.ajnr.org.
- Received January 30, 2023.
- Accepted after revision July 7, 2023.
- © 2023 by American Journal of Neuroradiology