Abstract
BACKGROUND AND PURPOSE: Artificial intelligence tools are being deployed at an increasing pace in educational and clinical settings. However, the use of artificial intelligence by trainees across different levels of experience has not been well-studied. This study investigates the impact of artificial intelligence assistance on the diagnostic accuracy of medical students and resident trainees for the detection of intracranial hemorrhage and large-vessel occlusion.
MATERIALS AND METHODS: This prospective study was conducted between March 2023 and October 2023. Medical students and resident trainees were asked to identify intracranial hemorrhage and large-vessel occlusion in 100 noncontrast head CTs and 100 head CTAs, respectively. One group received diagnostic aid simulating artificial intelligence for intracranial hemorrhage only (n = 26); the other, for large-vessel occlusion only (n = 28). Primary outcomes included accuracy, sensitivity, and specificity for intracranial hemorrhage/large-vessel occlusion detection without and with aid. Study interpretation time was a secondary outcome. Individual responses were pooled and analyzed with the t test; differences in continuous variables were assessed with ANOVA.
RESULTS: Forty-eight participants completed the study, generating 10,779 intracranial hemorrhage or large-vessel occlusion interpretations. With diagnostic aid, medical student accuracy improved by 11.0 percentage points (P < .001), and resident trainee accuracy showed no significant change. Intracranial hemorrhage interpretation time increased with diagnostic aid for both groups (P < .001), while large-vessel occlusion interpretation time decreased for medical students (P < .001). Despite worse performance in the detection of the smallest-versus-largest hemorrhages at baseline, medical students were not more likely to accept a true-positive artificial intelligence result for these more difficult tasks. Both groups were considerably less accurate when disagreeing with the artificial intelligence or when supplied with an incorrect artificial intelligence result.
CONCLUSIONS: This study demonstrated greater improvement in diagnostic accuracy with artificial intelligence for medical students compared with resident trainees. However, medical students were less likely than resident trainees to overrule incorrect artificial intelligence interpretations and were less accurate, even with diagnostic aid, than the artificial intelligence was by itself.
ABBREVIATIONS:
- AI = artificial intelligence
- ICH = intracranial hemorrhage
- LVO = large-vessel occlusion
- MS = medical students
- RT = resident trainees
SUMMARY
PREVIOUS LITERATURE:
Prior work suggests that physicians with less experience in radiology benefit the most from AI assistance, while a recent large-scale study found that experience-based factors do not reliably predict the impact of AI assistance. The factors influencing use and trust across different levels of interpreter expertise remain poorly understood.
KEY FINDINGS:
Diagnostic aid simulating AI improved ICH and LVO detection for MS, but not for RT. Furthermore, MS were less likely than RT to overrule incorrect aid interpretations and were less accurate than the simulated AI alone.
KNOWLEDGE ADVANCEMENT:
AI may provide a greater benefit for nonexperts; however, a threshold level of experience may be necessary for the safe and effective use of deep learning tools.
Over the past several decades, the volume of medical imaging has dramatically increased within the US health care system.1,2 Drivers of high volume include increasing population size and age, growing emphasis on cross-sectional studies, and a lack of widespread adoption of evidence-based guidelines for imaging use.3 Although imaging is intended to improve medical decision-making, increased imaging volume demands increased throughput from radiologists, which increases the risk of diagnostic error; this outcome may have devastating consequences for patient care.4,5 Moreover, medical error is expensive, accounting for an estimated $17 billion to $29 billion in annual excess spending in the United States.6
More recently, there has been an exponential increase in the number of available artificial intelligence (AI) products, which represent 1 solution for managing high study volumes. Several studies have demonstrated that these tools enhance physician performance and may prevent burnout by reducing reading time and improving diagnostic accuracy.7-11 AI is increasingly used to support clinical decision-making and to triage acute findings. In a recent randomized clinical trial of 443 participants across 4 comprehensive stroke centers, Martinez-Gutierrez et al12 showed significantly reduced time to endovascular thrombectomy for patients with large-vessel occlusion (LVO) using an LVO detection AI algorithm that automatically alerts clinicians and radiologists. Although machine learning has demonstrated impressive performance in detecting specific imaging abnormalities, current technology is limited to simple tasks, lacks clinical decision-making capabilities, and continues to require physician oversight.13
The increasing prevalence of AI in radiology raises questions about its role in medical education and resident training. As many as 40% of imaging studies from teaching institutions are cosigned by radiology trainees.14,15 Although several studies have reported improved trainee performance with deep learning tools, the factors influencing use and trust across different levels of interpreter expertise remain poorly understood.16
The purpose of this randomized, controlled trial was to investigate how having an AI result available at the time of interpretation influences accuracy and interpretation time across different levels of medical training and task complexity. We hypothesized that such diagnostic aid would increase accuracy and decrease interpretation time for all trainees but that the effect would be greater for less experienced readers. Similarly, we expected the benefit to be greater for tasks of greater complexity. The study also investigated whether the level of training influences how trainees deal with incorrect diagnostic aid. This article follows the CONSORT reporting guidelines (https://www.bmj.com/content/340/bmj.c869).
MATERIALS AND METHODS
Study Design
This prospective study was conducted at the University of California, Irvine, and approved by our institutional review board. After providing written informed consent, medical students (MS) and resident trainees (RT) were randomized to 1 of 2 groups: 1) intracranial hemorrhage (ICH) detection without diagnostic aid and LVO detection with diagnostic aid; or 2) ICH detection with diagnostic aid and LVO detection without diagnostic aid. The primary interpretation target of LVO detection was identification of occlusions in the M1 segment of the MCA. Randomization and intervention assignment were performed following a 1:1 allocation ratio. To limit the potential for study participants to assess the fixed accuracy of the provided diagnostic aid, we presented positive and negative cases in a random sequence, and false-positive/false-negative diagnostic aid responses were randomly distributed. All medical students attended a 60-minute lecture on the fundamentals of recognizing ICH and LVO on CT scans, covering relevant neuroanatomy and case examples.
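For illustration, the allocation and case-ordering scheme described above can be sketched in a few lines of Python; the function names, participant identifiers, and use of the standard random module are assumptions for illustration only and do not reflect the study's actual software.

```python
# Minimal sketch of 1:1 randomization and random case ordering (illustrative
# assumptions only, not the study's implementation).
import random

def randomize_participants(participant_ids, seed=42):
    """Assign participants 1:1 to the two diagnostic-aid configurations."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    # Group 1: ICH without aid + LVO with aid; Group 2: the reverse.
    return {
        pid: "ICH_unaided/LVO_aided" if i % 2 == 0 else "ICH_aided/LVO_unaided"
        for i, pid in enumerate(ids)
    }

def shuffle_case_order(case_ids, seed=42):
    """Present positive and negative cases in a random sequence."""
    rng = random.Random(seed)
    order = list(case_ids)
    rng.shuffle(order)
    return order
```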
Participants
The medical student group consisted of first- and second-year medical students from University of California medical schools. RT consisted of University of California, Irvine radiology residents in their third-to-fifth postgraduate years. Recruitment occurred between January 2023 and October 2023. Participants who did not complete both assigned tasks were excluded. Participants did not know the accuracy of the AI beforehand.
Viewer
Participants were tasked with completing 2 reading sessions: 100 noncontrast head CTs and 100 CTAs of the head. Both sets were balanced (50:50) between normal and abnormal findings (presence/absence of ICH or LVO). Diagnostic aid was shown to participants as a binary yes/no for the presence of ICH or LVO. Tasks were completed on participants’ devices using an established, research-grade viewing platform offering standard functionality such as zoom and adjustable window/level. Responses were collected in a separate browser window. Diagnostic aid was calibrated to have both a sensitivity and specificity of 80% to ensure a robust set of false-positive/false-negative aid responses.
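A binary aid response with the fixed 80% sensitivity and specificity described above could be simulated roughly as follows; the function name and the use of NumPy are assumptions for illustration, not a description of the study's software.

```python
# Hedged sketch: simulate a yes/no diagnostic-aid response at fixed
# sensitivity/specificity (both 80% in this study). Illustrative only.
import numpy as np

def simulate_aid_responses(ground_truth, sensitivity=0.80, specificity=0.80, seed=0):
    """ground_truth: boolean array, True when ICH/LVO is present.
    Returns a boolean array of simulated aid responses (True = "yes")."""
    rng = np.random.default_rng(seed)
    truth = np.asarray(ground_truth, dtype=bool)
    responses = np.empty_like(truth)
    # Positive cases are flagged correctly with probability = sensitivity.
    responses[truth] = rng.random(truth.sum()) < sensitivity
    # Negative cases are flagged (false-positive) with probability = 1 - specificity.
    responses[~truth] = rng.random((~truth).sum()) < (1 - specificity)
    return responses
```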
Data Set
The data set used for this study included 200 total de-identified CT scans from 200 unique patients: 50 CTAs with LVO, 50 noncontrast head CTs with ICH, 50 CTAs without LVO, and 50 noncontrast head CTs without ICH. The same scans were used for sessions in which participants had or did not have access to diagnostic aid.
Ground Truth Definition
Ground truth was established by a neuroradiologist with 12 years of experience (D.S.C.).
Outcome Measures
The primary outcome measures included reader accuracy, sensitivity, and specificity without or with diagnostic aid. These were determined according to whether the participant’s answer (yes or no to the presence of ICH or LVO) agreed with the ground truth. The secondary outcome measure was interpretation time, which was calculated automatically for each case using Qualtrics survey software (https://www.qualtrics.com/strategy/research/survey-software/).
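The outcome computation can be illustrated with a short sketch; the DataFrame column names ('response', 'truth') are hypothetical and serve only to show how agreement with ground truth translates into accuracy, sensitivity, and specificity.

```python
# Illustrative computation of accuracy, sensitivity, and specificity from
# pooled yes/no responses; column names are hypothetical.
import pandas as pd

def primary_outcomes(df: pd.DataFrame) -> pd.Series:
    """df: one row per interpretation, with boolean columns
    'response' (reader answered yes) and 'truth' (ground truth positive)."""
    tp = (df["response"] & df["truth"]).sum()
    tn = (~df["response"] & ~df["truth"]).sum()
    fp = (df["response"] & ~df["truth"]).sum()
    fn = (~df["response"] & df["truth"]).sum()
    return pd.Series({
        "accuracy": (tp + tn) / len(df),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    })
```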
Subgroup Evaluation
Primary outcomes were evaluated in several subgroups: within tasks (ICH and LVO); according to whether the user agreed or disagreed with the diagnostic aid interpretation; and according to whether the supplied aid interpretation was accurate. After being segmented with a previously validated algorithm, positive ICH cases were split into quintiles according to hemorrhage size, and primary outcome measures were assessed within these quintiles.
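A minimal sketch of the quintile split, assuming hypothetical column names ('case_id', 'volume_ml', 'correct') and the use of pandas.qcut; the study's actual segmentation and grouping code may differ.

```python
# Sketch: split positive ICH cases into quintiles by segmented hemorrhage
# volume and compute mean accuracy within each quintile. Illustrative only.
import pandas as pd

def accuracy_by_size_quintile(responses: pd.DataFrame) -> pd.Series:
    """responses: one row per reader response to an ICH-positive case, with
    columns 'case_id', 'volume_ml' (segmented hemorrhage volume), and
    'correct' (whether the response matched ground truth)."""
    cases = responses[["case_id", "volume_ml"]].drop_duplicates()
    cases = cases.assign(
        size_quintile=pd.qcut(cases["volume_ml"], q=5, labels=[1, 2, 3, 4, 5])
    )
    merged = responses.merge(cases[["case_id", "size_quintile"]], on="case_id")
    return merged.groupby("size_quintile", observed=True)["correct"].mean()
```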
Statistical Analysis
On the basis of an anticipated diagnostic accuracy of 70% for MS and 75% for RT, a desired power of 80%, and an expected higher enrollment rate for medical student participants, we estimated that a total of at least 4000 responses, or 20 participants, would be required in each arm. Statistical analyses were performed using Python Version 3.10 (Python Software Foundation; Pandas Version 2.1.0). De-identified user and response information was stored as raw data within Excel (Microsoft). Answers with response times of >4 SDs above the mean were discarded. Mean accuracy, sensitivity, and specificity were computed for each participant and used as data points; group means were then compared using a t test. A t test was also used to compare median response times. ANOVA was used for comparison of accuracy across different hemorrhage sizes in the ICH task.
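The analysis steps above (outlier removal at >4 SDs, two-sample t test, one-way ANOVA) could look roughly like the following Python/SciPy sketch; all numeric values shown are hypothetical placeholders, not study data.

```python
# Hedged sketch of the described analysis using SciPy; all values are
# hypothetical placeholders, not study data.
import numpy as np
from scipy import stats

def drop_slow_responses(times):
    """Discard answers with response times > 4 SDs above the mean."""
    times = np.asarray(times, dtype=float)
    return times[times <= times.mean() + 4 * times.std()]

# Per-participant mean accuracies in each arm, compared with a t test.
acc_without_aid = [0.62, 0.65, 0.60, 0.66, 0.63]  # hypothetical
acc_with_aid = [0.72, 0.75, 0.71, 0.74, 0.76]     # hypothetical
t_stat, p_value = stats.ttest_ind(acc_without_aid, acc_with_aid)

# One-way ANOVA of accuracy across hemorrhage-size quintiles (hypothetical).
f_stat, anova_p = stats.f_oneway([0.21, 0.25, 0.19],
                                 [0.40, 0.45, 0.38],
                                 [0.60, 0.62, 0.58])
```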
RESULTS
Participants
A total of 93 participants expressed interest in the study and began the consent process. Sixty-eight participants provided written informed consent and were enrolled. Ultimately, 48 participants completed the study (Fig 1). This group included 37 MS and 11 RT. The final medical student group included first- and second-year medical students recruited from the University of California, Irvine (UCI) and University of California, Riverside medical schools. The group of RT included 11 UCI radiology residents. Given this sample size, the minimum difference in accuracy for RT without and with diagnostic aid that could have been detected at 80% power was 8.5%.
Fig 1. Participant enrollment flow chart.
Primary Analysis
With diagnostic aid, the accuracy of MS improved by 11.0 percentage points (62.6% to 73.6%, P < .001, Fig 2), while the accuracy of RT showed no significant change. The sensitivity of MS improved from 48.0% to 68.6% with aid (P < .001, Table 1), while specificity was not significantly different. For RT, sensitivity improved from 74.0% to 86.0% (P = .025), while specificity was not significantly different.
Fig 2. A, Overall accuracy for MS and RT without and with diagnostic aid. B–D, Accuracy, sensitivity, and specificity changes within each task without and with diagnostic aid.
Table 1. Accuracy, sensitivity, and specificity overall and for each individual task, without and with diagnostic aid
Task Analysis
Next, we assessed differences in the benefit of diagnostic aid across tasks. For the ICH task, the accuracy of MS improved from 62.0% to 70.4% (P < .001, Table 1) with aid. On the LVO task, the accuracy of MS improved from 63.2% to 76.7% (P < .001).
For RT performing the ICH task, accuracy and sensitivity were not significantly changed with diagnostic aid, while specificity decreased from 92.0% to 81.6% (P = .041, Table 1). In the LVO task, RT accuracy, sensitivity, and specificity were not significantly changed.
Within the ICH task, we hypothesized that diagnostic aid would be more helpful in the detection of smaller hemorrhages. To test this hypothesis, we segmented positive ICH cases and split hemorrhages into quintiles according to size. For MS, mean accuracies without aid were significantly different across hemorrhage sizes, ranging from 21.1% for the smallest hemorrhages to 75.8% for the largest hemorrhages (ANOVA P < .001; Fig 3). For all except the largest hemorrhages, accuracy improvement with aid was statistically significant (P < .05).
Fig 3. Overall accuracy for MS (A) and RT (B) without and with diagnostic aid across different volumes of intracranial hemorrhage.
For RT, mean accuracies without aid ranged from 33.3% for very small hemorrhages to 96.7% for the largest hemorrhages (ANOVA P < .001, Fig 3). The accuracy benefit conferred by diagnostic aid was not statistically significant within any quintile and did not vary significantly across hemorrhage sizes.
We further analyzed response times according to the truth value of the diagnostic aid response. For both MS and RT completing the ICH task, response times differed significantly across aid response types, and the longest response times were observed in cases of false-positive diagnostic aid (29.3 seconds for MS, 40.4 seconds for RT, Table 2). For MS completing the LVO task, though differences in response time for different aid response types were statistically significant, actual median response times varied only by about 3 seconds. For RT completing the LVO task, differences in response time were not statistically significant (Table 2, Fig 4).
Table 2. Median interpretation time by AI response type
Fig 4. Changes in median interpretation time for each task, without and with diagnostic aid.
Interpreter Disagreement with AI.
Both groups were less accurate when disagreeing with the AI. For MS completing the ICH task, when disagreeing compared with agreeing with the AI, accuracy dropped from 83.9% to 31.9% (P < .001, Fig 5A); sensitivity, from 77.1% to 18.7%; and specificity, from 89.1% to 59.1% (both P < .001). On the LVO task, when disagreeing with the AI, accuracy dropped from 85.7% to 44.7%; sensitivity, from 86.0% to 46.0%; and specificity, from 84.3% to 44.6% (all P < .001). RT fared somewhat better than MS when disagreeing with the AI: on the ICH task, accuracy dropped from 91.4% to 59.3%; sensitivity, from 93.4% to 63.7%; and specificity, from 89.2% to 56.5% (all P < .05). On the LVO task, accuracy dropped from 92.0% to 68.8%; sensitivity, from 92.0% to 69.7%; and specificity, from 92.0% to 70.0% (all P < .05).
Fig 5. A, Accuracy for each task when agreeing versus disagreeing with the diagnostic aid. B, Accuracy for each task when the diagnostic aid interpretation was accurate versus inaccurate. All differences were statistically significant (P < .05).
Effect of Incorrect AI Response on Interpreter Accuracy.
MS were considerably less accurate on both tasks when given incorrect diagnostic aid. For the ICH task, accuracy was 78.0% with a correct aid response and 40.0% with an incorrect response (P < .001, Fig 5B). For the LVO task, accuracy dropped from 85.2% to 42.9% (P < .001). For RT performing the ICH task, accuracy dropped from 87.5% to 67.0% with an incorrect aid response (P < .001). On the LVO task, accuracy dropped from 91.7% to 68.2% (P < .001).
DISCUSSION
Prior work suggests that physicians with less experience in radiology benefit the most from AI assistance, while a recent large-scale study found that experience-based factors do not reliably predict the impact of AI assistance.17,18 Our study, using diagnostic aid to simulate AI, demonstrates a significant increase in accuracy with diagnostic aid for MS but no significant increase for RT. When ICH evaluations were stratified by hemorrhage size, both MS and RT were less accurate at baseline in detecting the smallest-versus-largest hemorrhages (MS, 21.1% versus 75.8%; RT, 33.3% versus 96.7%, both P < .001). However, the benefit conferred by diagnostic aid did not vary significantly across hemorrhage sizes. Diagnostic aid had no statistically significant effects on interpretation time. Both groups were significantly less accurate when disagreeing with the aid interpretation and when supplied with an incorrect aid response. MS, but not RT, were less accurate, even with diagnostic aid, than the simulated AI by itself.
For both MS and RT, essentially all the benefit of diagnostic aid came from increased sensitivity. This is concordant with prior studies that have demonstrated a greater improvement in sensitivity with AI assistance among radiologists.19-21 It may be more difficult for AI assistance to increase specificity, because this would require users to abandon an initial positive read in favor of the true-negative AI response, which, by nature, could not be supported by a discrete finding in the scan. On the other hand, a user considering a true-positive AI result might, on second look, identify the finding that triggered the AI result and more readily change the initial response.
We expected that baseline performance would be lower for more complex tasks and that diagnostic aid would offer greater benefit in these situations. Given that the difficulty of ICH detection depends on the size of the ICH, we split positive cases into quintiles by hemorrhage volume. As expected, both MS and RT demonstrated worse baseline performance in detecting smaller hemorrhages. However, the benefit of diagnostic aid for MS was similar across the smallest 4 quintiles of hemorrhage, implying that for the smallest hemorrhages, MS were more likely to disregard a true-positive aid response. This behavior may have been due to anchoring bias, in which a participant remains fixed to an initial diagnostic interpretation despite being provided with new data suggesting an alternative; this issue has been shown to be a significant bias in radiology.22,23 One strategy for overcoming this bias might be to use diagnostic aids that explicitly identify the suspected abnormality. Currently, AI triage tools are prohibited by FDA regulations from annotating diagnostic images in any way, but annotation may be an important consideration in the implementation of future AI systems.
We expected that interpretation time would decrease across all tasks when a diagnostic aid was available (as seen in a recent prospective study24), but the actual effects were mixed and not statistically significant. For the ICH task, read times were greatest when the diagnostic aid response was a false-positive; this result may reflect time spent searching the entire brain volume for a finding that might have triggered the response. This would be less of an issue on the LVO task, which focused on a small anatomic area around the proximal MCAs. Again, it is likely that an AI system highlighting a suspected abnormality, in addition to providing a categoric result, would show a more robust decrease in read times across tasks.
A recent study by Yu et al18 demonstrated that, contrary to what one might expect, less experienced board-certified radiologists did not benefit more from AI assistance than more experienced radiologists and that, overall, the benefit of AI was small. Gaube et al25 demonstrated a significant improvement in accuracy with AI assistance for nonexpert physicians (internal or emergency medicine) but not for radiologists. Our study demonstrated a significant benefit from diagnostic aid for MS, but not RT. Although the study designs and specific diagnostic tasks investigated are different, these results suggest that as the experience of the user increases, the relationship between AI assistance and accuracy becomes more complicated. However, in our study, despite clearly benefiting from diagnostic aid, MS were still less accurate, even with aid, than the simulated AI was by itself. This result may be a manifestation of the Dunning-Kruger effect, a cognitive bias in which subjects overestimate their ability to perform a task despite having limited task-specific expertise.26
Our results have several implications for the clinical implementation of AI, particularly in an educational setting. Although AI assistance appears to be of greater benefit to trainees, given that MS with diagnostic aid were less accurate than the simulated AI itself, there may be a minimum threshold of competency required to use radiology AI tools safely and effectively. Our results further demonstrated that the primary benefit of diagnostic aid to MS and RT was to increase sensitivity, without decreasing specificity. If trainees are more likely to be influenced by a true-positive AI response than by a false-negative one, future AI algorithms might be most beneficial if calibrated to have high sensitivity, even at the expense of decreased specificity. This approach would also accord with the perspective that, for example, when interpreting emergency department studies overnight, it is more costly for trainees to miss a real positive finding than to report one that is not actually there.
Our study had limitations. Different groups completed each task without or with diagnostic aid, and metrics to establish baseline proficiency were not available, so differences in individual user competence might have affected differences in accuracy. Our group of RT was also relatively small, limiting the ability of the study to detect differences in accuracy without and with diagnostic aid. The simulated diagnostic aid did not provide visual depictions of the suspected abnormalities, though we note that current FDA rules prohibit triage applications from marking up diagnostic images in any way. Additionally, the accuracy, specificity, and sensitivity of the simulated AI were fixed at 80%, which is low compared with currently available tools; however, our results may provide a baseline against which future studies can assess the impact of a more accurate AI. Finally, the study was not conducted during routine clinical practice using a standard PACS, possibly affecting the generalizability of the results. Future work is needed to study the integration of AI assistance into clinical workflow and to assess the effects of different baseline AI accuracies.
CONCLUSIONS
This study demonstrated improvement in ICH and LVO detection with simulated AI for MS, but not for RT, suggesting that AI may provide a greater benefit for nonexperts. However, MS were less likely than RT to overrule incorrect aid interpretations and, in fact, were less accurate than the simulated AI alone, suggesting that a threshold level of experience may be necessary for the safe and effective use of deep learning tools. To aid in the optimal deployment of AI in the educational setting, future work should include additional participants at different levels of experience from other institutions and should investigate different methods of reporting AI results.
Footnotes
This study was funded by the Kenneth T. and Eileen L. Norris Foundation.
Disclosure forms provided by the authors are available with the full text and PDF of this article at www.ajnr.org.
References
- Received March 29, 2024.
- Accepted after revision June 13, 2024.
- © 2024 by American Journal of Neuroradiology