
Research Article | Artificial Intelligence

Artificial Intelligence Efficacy as a Function of Trainee Interpreter Proficiency: Lessons from a Randomized Controlled Trial

David A. Fussell, Cynthia C. Tang, Jake Sternhagen, Varun V. Marrey, Kelsey M. Roman, Jeremy Johnson, Michael J. Head, Hayden R. Troutt, Charles H. Li, Peter D. Chang, John Joseph and Daniel S. Chow
American Journal of Neuroradiology September 2024, DOI: https://doi.org/10.3174/ajnr.A8387
Author affiliations: aDepartment of Radiological Sciences (D.A.F., C.C.T., J.S., V.V.M., J.J., H.R.T., C.H.L., P.D.C., D.S.C.), University of California, Irvine, Irvine, California; bSchool of Medicine (K.M.R., M.J.H.), University of California, Irvine, Irvine, California; cPaul Merage School of Business (J.J.), University of California, Irvine, Irvine, California

Abstract

BACKGROUND AND PURPOSE: Artificial intelligence tools are being deployed with increasing speed in educational and clinical settings, yet their use by trainees across different levels of experience has not been well studied. This study investigates the impact of artificial intelligence assistance on the diagnostic accuracy of medical students and resident trainees for intracranial hemorrhage and large-vessel occlusion.

MATERIALS AND METHODS: This prospective study was conducted between March 2023 and October 2023. Medical students and resident trainees were asked to identify intracranial hemorrhage and large-vessel occlusion in 100 noncontrast head CTs and 100 head CTAs, respectively. One group received diagnostic aid simulating artificial intelligence for intracranial hemorrhage only (n = 26); the other, for large-vessel occlusion only (n = 28). Primary outcomes included accuracy, sensitivity, and specificity for intracranial hemorrhage/large-vessel occlusion detection without and with aid. Study interpretation time was a secondary outcome. Individual responses were pooled and analyzed with the t test; differences in continuous variables were assessed with ANOVA.

RESULTS: Forty-eight participants completed the study, generating 10,779 intracranial hemorrhage or large-vessel occlusion interpretations. With diagnostic aid, medical student accuracy improved by 11.0 percentage points (P < .001), while resident trainee accuracy showed no significant change. Intracranial hemorrhage interpretation time increased with diagnostic aid for both groups (P < .001), while large-vessel occlusion interpretation time decreased for medical students (P < .001). Despite worse baseline performance in detecting the smallest versus the largest hemorrhages, medical students were not more likely to accept a true-positive artificial intelligence result for these more difficult tasks. Both groups were considerably less accurate when disagreeing with the artificial intelligence or when supplied with an incorrect artificial intelligence result.

CONCLUSIONS: This study demonstrated greater improvement in diagnostic accuracy with artificial intelligence for medical students than for resident trainees. However, medical students were less likely than resident trainees to overrule incorrect artificial intelligence interpretations and were less accurate, even with diagnostic aid, than the artificial intelligence alone.

ABBREVIATIONS: AI = artificial intelligence; ICH = intracranial hemorrhage; LVO = large-vessel occlusion; MS = medical students; RT = resident trainees

SUMMARY

PREVIOUS LITERATURE:

Prior work suggests that physicians with less experience in radiology benefit the most from AI assistance, while a recent large-scale study found that experience-based factors do not reliably predict the impact of AI assistance. The factors influencing use and trust across different levels of interpreter expertise remain poorly understood.

KEY FINDINGS:

Diagnostic aid simulating AI improved ICH and LVO detection for medical students, but not for resident trainees. Furthermore, MS were less likely than RT to overrule incorrect aid interpretations and were less accurate than the simulated AI alone.

KNOWLEDGE ADVANCEMENT:

AI may provide a greater benefit for nonexperts; however, a threshold level of experience may be necessary for the safe and effective use of deep learning tools.

Over the past several decades, the volume of medical imaging has dramatically increased within the US health care system.1,2 Drivers of high volume include increasing population size and age, growing emphasis on cross-sectional studies, and a lack of widespread adoption of evidence-based guidelines for imaging use.3 Although imaging is intended to improve medical decision-making, increased imaging volume demands increased throughput from radiologists, which increases the risk of diagnostic error; this outcome may have devastating consequences for patient care.4,5 Moreover, medical error is expensive, accounting for an estimated $17 billion to $29 billion in annual excess spending in the United States.6

More recently, there has been an exponential increase in the number of available artificial intelligence (AI) products, which represent 1 solution for managing high study volumes. Several studies have demonstrated that these tools enhance physician performance and may prevent burnout by reducing reading time and improving diagnostic accuracy.7–11 AI is increasingly used to support clinical decision-making and to triage acute findings. In a recent randomized clinical trial of 443 participants across 4 comprehensive stroke centers, Martinez-Gutierrez et al12 showed significantly reduced time to endovascular thrombectomy for patients with large-vessel occlusion (LVO) using an LVO detection AI algorithm that automatically alerts clinicians and radiologists. Although machine learning has demonstrated impressive performance in detecting specific imaging abnormalities, current technology is limited to simple tasks, lacks clinical decision-making capabilities, and continues to require physician oversight.13

The increasing prevalence of AI in radiology raises questions about its role in medical education and resident training. As many as 40% of imaging studies from teaching institutions are cosigned by radiology trainees.14,15 Although several studies have reported improved trainee performance with deep learning tools, the factors influencing use and trust across different levels of interpreter expertise remain poorly understood.16

The purpose of this randomized controlled trial was to investigate how having an AI result available at the time of interpretation influences accuracy and interpretation time across different levels of medical training and task complexity. We hypothesized that such diagnostic aid would increase accuracy and decrease interpretation time for all trainees but that the effect would be greater for less experienced readers. Similarly, we expected the benefit to be greater for tasks of greater complexity. The study also investigated whether the level of training influences how trainees handle incorrect diagnostic aid. This article follows the CONSORT reporting guidelines (https://www.bmj.com/content/340/bmj.c869).

MATERIALS AND METHODS

Study Design

This prospective study was conducted at the University of California, Irvine, and approved by our institutional review board. After providing written informed consent, medical students (MS) and resident trainees (RT) were randomized to 1 of 2 groups: 1) intracranial hemorrhage (ICH) detection without diagnostic aid and LVO detection with diagnostic aid; or 2) ICH detection with diagnostic aid and LVO detection without diagnostic aid. The primary interpretation target for LVO detection was identification of occlusions in the M1 segment of the MCA. Randomization and intervention assignment followed a 1:1 allocation ratio. To limit participants' ability to infer the fixed accuracy of the provided diagnostic aid, we presented positive and negative cases in a random sequence, and false-positive/false-negative diagnostic aid responses were randomly distributed. All medical students attended a 60-minute lecture covering the fundamentals of recognizing ICH and LVO on CT, taught through neuroanatomy review and case examples.
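
To make the allocation scheme concrete, the sketch below shows one way such a 1:1 randomization with a shuffled case sequence could be implemented in Python (the language used for the study's analyses). All function and variable names are hypothetical; the trial's actual randomization code is not described in the article.

    import random

    def randomize_study(participant_ids, case_ids, seed=42):
        """Assign participants 1:1 to the 2 study arms and shuffle case order.

        Arm A: ICH without aid + LVO with aid.
        Arm B: ICH with aid + LVO without aid.
        (Hypothetical sketch; names are illustrative.)
        """
        rng = random.Random(seed)
        ids = list(participant_ids)
        rng.shuffle(ids)
        half = len(ids) // 2
        arms = {pid: ("A" if i < half else "B") for i, pid in enumerate(ids)}
        # Positive and negative cases are presented in a random sequence so
        # participants cannot infer the aid's fixed accuracy from the order.
        order = list(case_ids)
        rng.shuffle(order)
        return arms, order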

Participants

The medical student group consisted of first- and second-year medical students from the University of California, Irvine, and University of California, Riverside, medical schools. RT were University of California, Irvine radiology residents in their third-to-fifth postgraduate years. Recruitment occurred between January 2023 and October 2023. Participants who did not complete both assigned tasks were excluded. Participants did not know the accuracy of the AI beforehand.

Viewer

Participants were tasked with completing 2 reading sessions: 100 noncontrast head CTs and 100 CTAs of the head. Both sets were balanced (50:50) between normal and abnormal findings (presence/absence of ICH or LVO). Diagnostic aid was shown to participants as a binary yes/no for the presence of ICH or LVO. Tasks were completed on participants' devices using an established, research-grade viewing platform offering standard functionality such as zoom and adjustable window/level. Responses were collected in a separate browser window. Diagnostic aid was calibrated to a sensitivity and specificity of 80% to ensure a robust set of false-positive/false-negative aid responses.
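
The description above implies a fixed operating point with a predetermined, randomly placed set of erroneous aid responses. A minimal Python sketch under that assumption follows (hypothetical names; the study's actual implementation is not published). For a 50:50 set of 100 cases, this yields exactly 10 false-negative and 10 false-positive aid responses.

    import random

    def simulated_aid_responses(truth, sensitivity=0.80, specificity=0.80, seed=0):
        """Generate binary yes/no aid responses with a fixed operating point.

        The exact number of errors implied by the operating point is placed at
        random positions, mirroring the randomly distributed false-positive/
        false-negative responses described above. (Hypothetical sketch.)
        """
        rng = random.Random(seed)
        pos = [i for i, t in enumerate(truth) if t]
        neg = [i for i, t in enumerate(truth) if not t]
        n_fn = round(len(pos) * (1 - sensitivity))   # missed positive cases
        n_fp = round(len(neg) * (1 - specificity))   # falsely flagged negatives
        wrong = set(rng.sample(pos, n_fn)) | set(rng.sample(neg, n_fp))
        return [not t if i in wrong else t for i, t in enumerate(truth)]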

Data Set

The data set included 200 de-identified CT scans from 200 patients: 50 CTAs with LVO, 50 noncontrast head CTs with ICH, and 100 CTAs and noncontrast head CTs with no pathology. The same scans were used for sessions with and without diagnostic aid.

Ground Truth Definition

Ground truth was established by an experienced neuroradiologist (D.S.C., with 12 years of experience).

Outcome Measures

The primary outcome measures included reader accuracy, sensitivity, and specificity without or with diagnostic aid. These were determined according to whether the participant’s answer (yes or no to the presence of ICH or LVO) agreed with the ground truth. The secondary outcome measure was interpretation time, which was calculated automatically for each case using Qualtrics survey software (https://www.qualtrics.com/strategy/research/survey-software/).
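
For illustration, the primary outcome measures can be computed from a reader's yes/no responses and the ground truth as follows. This is a generic sketch, not the authors' code, and the function name is hypothetical.

    def reader_metrics(responses, truth):
        """Accuracy, sensitivity, and specificity of one reader's yes/no calls
        against the neuroradiologist ground truth (parallel boolean lists)."""
        tp = sum(r and t for r, t in zip(responses, truth))
        tn = sum(not r and not t for r, t in zip(responses, truth))
        fn = sum(not r and t for r, t in zip(responses, truth))
        fp = sum(r and not t for r, t in zip(responses, truth))
        return {"accuracy": (tp + tn) / len(truth),
                "sensitivity": tp / (tp + fn),
                "specificity": tn / (tn + fp)}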

Subgroup Evaluation

Primary outcomes were evaluated in several subgroups: within tasks (ICH and LVO); according to whether the user agreed or disagreed with the diagnostic aid interpretation; and according to whether the supplied aid interpretation was accurate. After segmentation with a previously validated algorithm, positive ICH cases were split into quintiles according to hemorrhage size, and primary outcome measures were assessed within these quintiles.
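
A minimal sketch of the quintile split using pandas (which the authors report using), with synthetic values standing in for the real segmented hemorrhage volumes and per-case reader accuracies:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    # One row per positive ICH case: segmented hemorrhage volume plus the
    # pooled reader accuracy for that case (illustrative values only).
    cases = pd.DataFrame({
        "volume_ml": rng.lognormal(mean=1.0, sigma=1.0, size=50),
        "accuracy": rng.uniform(0.2, 1.0, size=50),
    })
    cases["size_quintile"] = pd.qcut(
        cases["volume_ml"], q=5,
        labels=["Q1 (smallest)", "Q2", "Q3", "Q4", "Q5 (largest)"])
    # Mean accuracy within each hemorrhage-size quintile.
    print(cases.groupby("size_quintile", observed=True)["accuracy"].mean())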

Statistical Analysis

On the basis of an anticipated diagnostic accuracy of 70% for MS and 75% for RT, a desired power of 80%, and an expected higher enrollment rate for medical student participants, we estimated that at least 4000 responses, or 20 participants, would be required in each arm. Statistical analyses were performed using Python Version 3.10 (Python Software Foundation) with pandas Version 2.1.0. De-identified user and response information was stored as raw data in Excel (Microsoft). Answers with response times >4 SDs above the mean were discarded. Mean accuracy, sensitivity, and specificity were computed for each participant and used as data points; group means were then compared using a t test. A t test was also used to compare median response times. ANOVA was used to compare accuracy across hemorrhage sizes in the ICH task.
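
The analysis pipeline described here (outlier trimming at 4 SDs, per-participant means compared with a t test, and ANOVA across hemorrhage-size quintiles) can be sketched with SciPy as follows. The numbers are illustrative only, and the use of SciPy is an assumption; the article names only Python and pandas.

    import numpy as np
    from scipy import stats

    def trim_slow_responses(times_s):
        """Discard answers with response times > 4 SDs above the mean."""
        t = np.asarray(times_s, dtype=float)
        return t[t <= t.mean() + 4 * t.std()]

    # Per-participant mean accuracies, one data point per reader
    # (illustrative numbers only, not study data).
    without_aid = np.array([0.58, 0.63, 0.61, 0.66, 0.60, 0.64])
    with_aid = np.array([0.70, 0.75, 0.72, 0.78, 0.71, 0.74])
    t_stat, p_val = stats.ttest_ind(with_aid, without_aid)

    # ANOVA of accuracy across hemorrhage-size quintiles
    # (per-participant accuracies within each quintile; illustrative).
    q1, q2, q3, q4, q5 = ([0.21, 0.25, 0.19], [0.40, 0.45, 0.38],
                          [0.55, 0.60, 0.52], [0.68, 0.72, 0.65],
                          [0.74, 0.78, 0.80])
    f_stat, p_anova = stats.f_oneway(q1, q2, q3, q4, q5)
    print(f"t test P = {p_val:.4f}; ANOVA P = {p_anova:.4f}")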

RESULTS

Participants

A total of 93 participants expressed interest in the study and began the consent process. Sixty-eight participants provided written informed consent and were enrolled. Ultimately, 48 participants completed the study (Fig 1), including 37 MS and 11 RT. The final medical student group included first- and second-year medical students recruited from the University of California, Irvine (UCI), and Riverside medical schools. The RT group comprised 11 UCI radiology residents. Given this sample size, the minimum difference in accuracy for RT without and with diagnostic aid detectable at 80% power was 8.5%.

FIG 1. Participant enrollment flow chart.

Primary Analysis

With diagnostic aid, the accuracy of MS improved by 11.0 percentage points (from 62.6% to 73.6%; P < .001; Fig 2), while the accuracy of RT showed no significant change. MS sensitivity improved from 48.0% to 68.6% with aid (P < .001; Table 1), and RT sensitivity improved from 74.0% to 86.0% (P = .025); specificity was not significantly different for either group.

FIG 2. A, Overall accuracy for MS and RT without and with diagnostic aid. B–D, Accuracy, sensitivity, and specificity changes within each task without and with diagnostic aid.

Table 1: Accuracy, sensitivity, and specificity overall and for each individual task, without and with diagnostic aid

Task Analysis

Next, we assessed differences in the benefit of diagnostic aid across tasks. For the ICH task, MS accuracy improved from 62.0% to 70.4% with aid (P < .001; Table 1). For the LVO task, MS accuracy improved from 63.2% to 76.7% (P < .001).

For RT performing the ICH task, accuracy and sensitivity were not significantly changed with diagnostic aid, and specificity decreased from 92.0% to 81.6% (P = .041; Table 1). In the LVO task, RT accuracy, sensitivity, and specificity were not significantly changed.

Within the ICH task, we hypothesized that diagnostic aid would be more helpful in the detection of smaller hemorrhages. To assess this, we segmented positive ICH cases and split hemorrhages into quintiles according to size. For MS, mean accuracies without aid differed significantly across hemorrhage sizes, ranging from 21.1% for the smallest hemorrhages to 75.8% for the largest (ANOVA P < .001; Fig 3). For all except the largest hemorrhages, the accuracy improvement with aid was statistically significant (P < .05).

FIG 3. Overall accuracy for MS (A) and RT (B) without and with diagnostic aid across different volumes of intracranial hemorrhage.

For RT, mean accuracies without aid ranged from 33.3% for very small hemorrhages to 96.7% for the largest hemorrhages (ANOVA P < .001, Fig 3). The accuracy benefit conferred by diagnostic aid was not statistically significant within any quintile and did not vary significantly across hemorrhage sizes.

We further analyzed response times according to the truth value of the diagnostic aid response. For both MS and RT completing the ICH task, response times differed significantly, with the longest response times observed in cases of false-positive diagnostic aid (29.3 seconds for MS and 40.4 seconds for RT; Table 2). For MS completing the LVO task, although differences in response time across aid response types were statistically significant, median response times varied by only about 3 seconds. For RT completing the LVO task, differences in response time were not statistically significant (Table 2 and Fig 4).

Table 2: Median interpretation time by AI response type

FIG 4. Changes in median interpretation time for each task, without and with diagnostic aid.

Interpreter Disagreement with AI.

Both groups were less accurate when disagreeing with the AI. For MS completing the ICH task, when disagreeing compared with agreeing with the AI, accuracy dropped from 83.9% to 31.9% (P < .001, Fig 5A); sensitivity, from 77.1% to 18.7%; and specificity, from 89.1% to 59.1% (both P < .001). On the LVO task, when disagreeing with AI, accuracy dropped from 85.7% to 44.7%; sensitivity, from 86.0% to 46.0%; and specificity, from 84.3% to 44.6% (all P < .001). RT fared slightly better when disagreeing versus agreeing with the AI: On the ICH task, accuracy dropped from 91.4% to 59.3%; sensitivity, from 93.4% to 63.7%; and specificity, from 89.2% to 56.5% (all P < .05). On the LVO task, accuracy dropped from 92.0% to 68.8%; sensitivity, from 92.0% to 69.7%; and specificity, from 92.0% to 70.0% (all P < .05).

FIG 5. A, Accuracy for each task when agreeing versus disagreeing with the diagnostic aid. B, Accuracy for each task when the diagnostic aid interpretation was accurate versus inaccurate. All differences were statistically significant (P < .05).

Effect of Incorrect AI Response on Interpreter Accuracy.

MS were considerably less accurate on both tasks when given incorrect diagnostic aid. For the ICH task, accuracy was 78.0% with a correct aid response and 40.0% with an incorrect response (P < .001, Fig 5B). For the LVO task, accuracy dropped from 85.2% to 42.9% (P < .001). For RT performing the ICH task, accuracy dropped from 87.5% to 67.0% with an incorrect aid response (P < .001). On the LVO task, accuracy dropped from 91.7% to 68.2% (P < .001).

DISCUSSION

Prior work suggests that physicians with less experience in radiology benefit the most from AI assistance, while a recent large-scale study found that experience-based factors do not reliably predict the impact of AI assistance.17,18 Our study, using diagnostic aid to simulate AI, demonstrates a significant increase in accuracy with diagnostic aid for MS but no significant increase for RT. When ICH evaluations were stratified by hemorrhage size, both MS and RT were less accurate at baseline in detecting the smallest-versus-largest hemorrhages (MS, 21.1% versus 75.8%; RT, 33.3% versus 96.7%; both P < .001). However, the benefit conferred by diagnostic aid did not vary significantly across hemorrhage sizes. The effects of diagnostic aid on interpretation time were mixed: ICH interpretation time increased with aid, while LVO interpretation time decreased for MS. Both groups were significantly less accurate when disagreeing with the aid interpretation and when supplied with an incorrect aid response. MS, but not RT, were less accurate, even with diagnostic aid, than the simulated AI by itself.

For both MS and RT, essentially all the benefit of diagnostic aid came from increased sensitivity. This is concordant with prior studies that have demonstrated a greater improvement in sensitivity with AI assistance among radiologists.19–21 It may be more difficult for AI assistance to increase specificity, because this would require users to abandon an initial positive read in favor of the true-negative AI response, which, by nature, could not be supported by a discrete finding in the scan. On the other hand, a user considering a true-positive AI result might, on second look, identify the finding that triggered the AI result and more readily change the initial response.

We expected that baseline performance would be lower for more complex tasks and that diagnostic aid would offer greater benefit in these situations. Given that the difficulty of ICH detection depends on the size of the ICH, we split positive cases into quintiles by hemorrhage volume. As expected, both MS and RT demonstrated worse baseline performance in detecting smaller hemorrhages. However, the benefit of diagnostic aid for MS was similar across the smallest 4 quintiles, which suggests that for the smallest hemorrhages, MS were more likely to disregard a true-positive aid response. This behavior may reflect anchoring bias, in which a participant remains fixed on an initial diagnostic interpretation despite being provided with new data suggesting an alternative; anchoring has been shown to be a significant bias in radiology.22,23 One strategy for overcoming this bias might be to use diagnostic aids that explicitly identify the suspected abnormality. Currently, AI triage tools are prohibited by FDA regulations from annotating diagnostic images in any way, but annotation may be an important consideration in the design of future AI systems.

We expected that interpretation time would decrease across all tasks when a diagnostic aid was available (as seen in a recent prospective study24), but the actual effects were mixed. For the ICH task, read times were greatest when the diagnostic aid response was a false-positive; this result may reflect time spent searching the entire brain volume for a finding that might have triggered the response. This would be less of an issue on the LVO task, which focused on a small anatomic area around the proximal MCAs. Again, an AI system that highlights a suspected abnormality, in addition to providing a categoric result, would likely show a more robust decrease in read times across tasks.

A recent study by Yu et al18 demonstrated that, contrary to what one might expect, less experienced board-certified radiologists did not benefit more from AI assistance than more experienced radiologists and that, overall, the benefit of AI was small. Gaube et al25 demonstrated a significant improvement in accuracy with AI assistance for nonexpert physicians (internal or emergency medicine) but not for radiologists. Our study demonstrated a significant benefit from diagnostic aid for MS, but not RT. Although the study designs and specific diagnostic tasks differ, these results suggest that as the experience of the user increases, the relationship between AI assistance and accuracy becomes more complicated. However, in our study, despite clearly benefiting from diagnostic aid, MS were still less accurate, even with aid, than the simulated AI by itself. This result may be a manifestation of the Dunning-Kruger effect, a cognitive bias in which subjects overestimate their ability to perform a task despite having limited task-specific expertise.26

Our results have several implications for the clinical implementation of AI, particularly in an educational setting. Although AI assistance appears to be of greater benefit to trainees, given that MS with diagnostic aid were less accurate than the simulated AI itself, there may be a minimum threshold of competency required to use radiology AI tools safely and effectively. Our results further demonstrated that the primary benefit of diagnostic aid to MS and RT was increased sensitivity, without decreased specificity. If trainees are more likely to be influenced by a true-positive AI response than by a false-negative one, future AI algorithms might be most beneficial if calibrated for high sensitivity, even at the expense of decreased specificity. Such a calibration would also accord with the perspective that, for example, when interpreting emergency department studies overnight, it is more costly for trainees to miss a real positive finding than to imagine one that is not actually there.

Our study had limitations. Different groups completed each task without or with diagnostic aid, and metrics to establish baseline proficiency were not available, so individual user competence might have affected differences in accuracy. Our group of RT was also relatively small, limiting the power of the study to detect differences in accuracy without and with diagnostic aid. The simulated diagnostic aid did not provide visual depictions of the suspected abnormalities, though we note that current FDA rules prohibit triage applications from marking up diagnostic images in any way. Additionally, the accuracy, sensitivity, and specificity of the simulated AI were fixed at 80%, which is low compared with currently available tools; however, our results may provide a baseline against which future studies can assess the impact of a more accurate AI. Finally, the study was not conducted during routine clinical practice using a standard PACS, possibly limiting the generalizability of the results. Future work is needed to study the integration of AI assistance into clinical workflow and to assess the effects of different baseline AI accuracies.

CONCLUSIONS

This study demonstrated improvement in ICH and LVO detection with simulated AI for MS, but not for RT, suggesting that AI may provide a greater benefit for nonexperts. However, MS were less likely than RT to overrule incorrect aid interpretations and were, in fact, less accurate than the simulated AI alone, suggesting that a threshold level of experience may be necessary for the safe and effective use of deep learning tools. To aid optimal deployment of AI in the educational setting, future work should include additional participants at different levels of experience from other institutions and investigate different methods of reporting AI results.

Footnotes

  • This study was funded by the Kenneth T. and Eileen L. Norris Foundation.

  • Disclosure forms provided by the authors are available with the full text and PDF of this article at www.ajnr.org.

References

1. Chartrand G, Cheng PM, Vorontsov E, et al. Deep learning: a primer for radiologists. Radiographics 2017;37:2113–31 doi:10.1148/rg.2017170077 pmid:29131760
2. Heit JJ, Iv M, Wintermark M. Imaging of intracranial hemorrhage. J Stroke 2017;19:11–27 doi:10.5853/jos.2016.00563 pmid:28030895
3. Fasen BA, Heijboer RJ, Hulsmans FJ, et al. CT angiography in evaluating large-vessel occlusion in acute anterior circulation ischemic stroke: factors associated with diagnostic error in clinical practice. AJNR Am J Neuroradiol 2020;41:607–11 doi:10.3174/ajnr.A6469 pmid:32165362
4. Matsoukas S, Scaggiante J, Schuldt BR, et al. Accuracy of artificial intelligence for the detection of intracranial hemorrhage and chronic cerebral microbleeds: a systematic review and pooled analysis. Radiol Med 2022;127:1106–23 doi:10.1007/s11547-022-01530-4 pmid:35962888
5. Rava RA, Seymour SE, LaQue ME, et al. Assessment of an artificial intelligence algorithm for detection of intracranial hemorrhage. World Neurosurg 2021;150:e209–17 doi:10.1016/j.wneu.2021.02.134 pmid:33684578
6. Petry M, Lansky C, Chodakiewitz Y, et al. Decreased hospital length of stay for ICH and PE after adoption of an artificial intelligence-augmented radiological worklist triage system. Radiol Res Pract 2022;2022:2141839 doi:10.1155/2022/2141839 pmid:36034496
7. Pinto Dos Santos D, Giese D, Brodehl S, et al. Medical students’ attitude towards artificial intelligence: a multicentre survey. Eur Radiol 2019;29:1640–46 doi:10.1007/s00330-018-5601-1 pmid:29980928
8. Yang L, Ene IC, Arabi Belaghi R, et al. Stakeholders’ perspectives on the future of artificial intelligence in radiology: a scoping review. Eur Radiol 2022;32:1477–95 doi:10.1007/s00330-021-08214-z pmid:34545445
9. Juravle G, Boudouraki A, Terziyska M, et al. Trust in artificial intelligence for medical diagnoses. Prog Brain Res 2020;253:263–82 doi:10.1016/bs.pbr.2020.06.006 pmid:32771128
10. Wagner AR, Borenstein J, Howard A. Overtrust in the robotic age. Commun ACM 2018;61:22–24 doi:10.1145/3241365
11. Borracci RA, Arribalzaga EB. The incidence of overconfidence and underconfidence effects in medical student examinations. J Surg Educ 2018;75:1223–29 doi:10.1016/j.jsurg.2018.01.015 pmid:29397355
12. Martinez-Gutierrez JC, Kim Y, Salazar-Marioni S, et al. Automated large vessel occlusion detection software and thrombectomy treatment times: a cluster randomized clinical trial. JAMA Neurol 2023;80:1182–90 doi:10.1001/jamaneurol.2023.3206 pmid:37721738
13. Skitka LJ, Mosier KL, Burdick M, et al. Automation bias and errors: are crews better than individuals? Int J Aviat Psychol 2000;10:85–97 doi:10.1207/S15327108IJAP1001_5 pmid:11543300
14. Itoh M. Toward overtrust-free advanced driver assistance systems. Cogn Tech Work 2012;14:51–60 doi:10.1007/s10111-011-0195-2
15. Kapoor N, Gaviola G, Wang A, et al. Quantifying and characterizing trainee participation in a major academic radiology department. Curr Probl Diagn Radiol 2019;48:436–40 doi:10.1067/j.cpradiol.2018.07.004 pmid:30144966
16. Arthur W Jr, Bennett W Jr, Stanush PL, et al. Factors that influence skill decay and retention: a quantitative review and analysis. Hum Perform 1998;11:57–101 doi:10.1207/s15327043hup1101_3
17. Li D, Pehrson LM, Lauridsen CA, et al. The added effect of artificial intelligence on physicians’ performance in detecting thoracic pathologies on CT and chest X-ray: a systematic review. Diagnostics (Basel) 2021;11:2206 doi:10.3390/diagnostics11122206 pmid:34943442
18. Yu F, Moehring A, Banerjee O, et al. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat Med 2024;30:837–49 doi:10.1038/s41591-024-02850-w pmid:38504016
19. Jacques T, Cardot N, Ventre J, et al. Commercially-available AI algorithm improves radiologists’ sensitivity for wrist and hand fracture detection on X-ray, compared to a CT-based ground truth. Eur Radiol 2024;34:2885–94 doi:10.1007/s00330-023-10380-1 pmid:37919408
20. Watanabe Y, Tanaka T, Nishida A, et al. Improvement of the diagnostic accuracy for intracranial haemorrhage using deep learning-based computer-assisted detection. Neuroradiology 2021;63:713–20 doi:10.1007/s00234-020-02566-x pmid:33025044
21. Ewals LJ, van der Wulp K, van den Borne BE, et al. The effects of artificial intelligence assistance on the radiologists’ assessment of lung nodules on CT scans: a systematic review. J Clin Med 2023;12:3536 doi:10.3390/jcm12103536
22. Busby LP, Courtier JL, Glastonbury CM. Bias in radiology: the how and why of misses and misinterpretations. Radiographics 2018;38:236–47 doi:10.1148/rg.2018170107 pmid:29194009
23. Lee CS, Nagy PG, Weaver SJ, et al. Cognitive and system factors contributing to diagnostic errors in radiology. AJR Am J Roentgenol 2013;201:611–17 doi:10.2214/AJR.12.10375 pmid:23971454
24. Yacoub B, Varga-Szemes A, Schoepf UJ, et al. Impact of artificial intelligence assistance on chest CT interpretation times: a prospective randomized study. AJR Am J Roentgenol 2022;219:743–51 doi:10.2214/AJR.22.27598 pmid:35703413
25. Gaube S, Suresh H, Raue M, et al. Non-task expert physicians benefit from correct explainable AI advice when reviewing x-rays. Sci Rep 2023;13:1383 doi:10.1038/s41598-023-28633-w pmid:36697450
26. Kruger J, Dunning D. Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments. J Pers Soc Psychol 1999;77:1121–34 doi:10.1037//0022-3514.77.6.1121 pmid:10626367
  • Received March 29, 2024.
  • Accepted after revision June 13, 2024.
  • © 2024 by American Journal of Neuroradiology