TY  - JOUR
T1  - Qualifying Certainty in Radiology Reports through Deep Learning–Based Natural Language Processing
JF  - American Journal of Neuroradiology
JO  - Am. J. Neuroradiol.
SP  - 1755
LP  - 1761
DO  - 10.3174/ajnr.A7241
VL  - 42
IS  - 10
AU  - F. Liu
AU  - P. Zhou
AU  - S.J. Baccei
AU  - M.J. Masciocchi
AU  - N. Amornsiripanitch
AU  - C.I. Kiefe
AU  - M.P. Rosen
Y1  - 2021/10/01
UR  - http://www.ajnr.org/content/42/10/1755.abstract
N2  - BACKGROUND AND PURPOSE: Communication gaps exist between radiologists and referring physicians in conveying diagnostic certainty. We aimed to explore deep learning–based bidirectional contextual language models for automatically assessing diagnostic certainty expressed in the radiology reports to facilitate the precision of communication. MATERIALS AND METHODS: We randomly sampled 594 head MR imaging reports from an academic medical center. We asked 3 board-certified radiologists to read sentences from the Impression section and assign each sentence 1 of the 4 certainty categories: “Non-Definitive,” “Definitive-Mild,” “Definitive-Strong,” “Other.” Using the annotated 2352 sentences, we developed and validated a natural language-processing system based on the state-of-the-art bidirectional encoder representations from transformers (BERT), which can capture contextual uncertainty semantics beyond the lexicon level. Finally, we evaluated 3 BERT variant models and reported standard metrics including sensitivity, specificity, and area under the curve. RESULTS: A κ score of 0.74 was achieved for interannotator agreement on uncertainty interpretations among 3 radiologists. For the 3 BERT variant models, the biomedical variant (BioBERT) achieved the best macro-average area under the curve of 0.931 (compared with 0.928 for the BERT-base and 0.925 for the clinical variant [ClinicalBERT]) on the validation data. 
All 3 models yielded high macro-average specificity (93.13%–93.65%), while the BERT-base obtained the highest macro-average sensitivity of 79.46% (compared with 79.08% for BioBERT and 78.52% for ClinicalBERT). The BioBERT model generalized well to the held-out test data, with a macro-average sensitivity of 77.29%, specificity of 92.89%, and area under the curve of 0.93. CONCLUSIONS: A deep transfer learning model can be developed to reliably assess the level of uncertainty communicated in a radiology report. ABBREVIATIONS: AUC = area under the receiver operating characteristic curve; BERT = bidirectional encoder representations from transformers; BioBERT = biomedical variant of BERT; ClinicalBERT = clinical variant of BERT; NLP = natural language processing; QC-RAD = qualifying certainty in radiology reports
ER  - 