Recent Developments in the Dorfman-Berbaum-Metz Procedure for Multireader ROC Study Analysis

doi:10.1016/j.acra.2007.12.015

Academic Radiology

Volume 15, Issue 5, May 2008, Pages 647-661

https://doi.org/10.1016/j.acra.2007.12.015

Rationale and Objectives

The Dorfman-Berbaum-Metz (DBM) method has been one of the most popular methods for analyzing multireader receiver-operating characteristic (ROC) studies since it was proposed in 1992. Despite its popularity, the original procedure has several drawbacks: it is limited to jackknife accuracy estimates, it is substantially conservative, and it is not based on a satisfactory conceptual or theoretical model. Recently, solutions to these problems have been presented in three papers. Our purpose is to summarize and provide an overview of these recent developments.

Materials and Methods

We present and discuss the recently proposed solutions for the various drawbacks of the original DBM method.

Results

We compare the solutions in a simulation study and find that they result in improved performance for the DBM procedure. We also compare the solutions using two real data studies and find that the modified DBM procedure that incorporates these solutions yields more significant results and clearer interpretations of the variance component parameters than the original DBM procedure.

Conclusions

We recommend using the modified DBM procedure that incorporates the recent developments.

Section snippets

Original DBM Method

The DBM method is typically used with the test × reader × case factorial study design where each case (ie, patient) undergoes each of several diagnostic tests and the resulting images are interpreted once by each reader. Throughout this paper, we assume that the data have been collected using this factorial design. The competing modalities can be compared using the DBM method; in particular, the null hypothesis of no test effect can be tested and confidence intervals for test differences can be
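To make the jackknife step concrete, the sketch below computes case-level pseudovalues of the AUC for a single test-reader combination. It is a minimal illustration, not the authors' implementation: the empirical (Mann-Whitney) AUC is an assumed choice of accuracy estimate, and the function names `empirical_auc` and `jackknife_pseudovalues` are hypothetical. In the full procedure, pseudovalues of this kind would be computed for every test-reader combination and then analyzed with a mixed-model ANOVA, which is not shown here.

```python
def empirical_auc(neg, pos):
    """Empirical (Mann-Whitney) AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos) * len(neg))


def jackknife_pseudovalues(neg, pos):
    """Case-level jackknife pseudovalues of the AUC for one test-reader pair.

    For case k the pseudovalue is n * theta - (n - 1) * theta_without_k,
    where n is the total number of cases (normal plus abnormal).
    Each class needs at least two cases so that one can be deleted.
    """
    neg, pos = list(neg), list(pos)
    n = len(neg) + len(pos)
    theta = empirical_auc(neg, pos)
    pseudo = []
    # Delete each normal case in turn.
    for k in range(len(neg)):
        reduced = neg[:k] + neg[k + 1:]
        pseudo.append(n * theta - (n - 1) * empirical_auc(reduced, pos))
    # Delete each abnormal case in turn.
    for k in range(len(pos)):
        reduced = pos[:k] + pos[k + 1:]
        pseudo.append(n * theta - (n - 1) * empirical_auc(neg, reduced))
    return theta, pseudo


# Illustrative ratings only (made-up data, not from the paper).
theta, pseudo = jackknife_pseudovalues([1, 2, 3, 2], [2, 3, 4, 5])
print(theta, len(pseudo))  # 0.84375 8
```

The function returns one pseudovalue per case, so the case factor can later be treated like a replicate in an ANOVA layout.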

Simulation Study

In a simulation study we examined the performance of the three DBM approaches—original DBM, new model simplification, and new model simplification plus ddf_H—with respect to the empirical type I error rate for testing the null hypothesis of no test effect. The simulation model of Roe and Metz (2) provided continuous decision-variable outcomes generated from a conventional binormal model that treats both cases and readers as random. We used this simulation model to create discrete rating data by
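As a rough illustration of turning continuous binormal decision variables into discrete ratings, the following sketch draws normal and abnormal case scores from two unit-variance normal distributions and bins them at fixed cutpoints. Several things here are assumptions rather than the paper's procedure: the excerpt is cut off before it describes how the ratings were formed, the cutpoints and separation `mu_pos` are arbitrary, and the reader and case random effects of the Roe and Metz model are omitted entirely. The names `to_rating` and `simulate_ratings` are hypothetical.

```python
import bisect
import random


def to_rating(x, cutpoints):
    """Map a continuous score to an ordinal rating 1..len(cutpoints)+1."""
    # bisect_right counts how many (sorted) cutpoints are <= x.
    return bisect.bisect_right(cutpoints, x) + 1


def simulate_ratings(n_neg, n_pos, mu_pos, cutpoints, seed=0):
    """Simulate discrete ratings for one reader and one test.

    Normal cases ~ N(0, 1); abnormal cases ~ N(mu_pos, 1).
    This is a simplified single-reader sketch, not the full
    random-reader, random-case simulation model.
    """
    rng = random.Random(seed)
    neg = [to_rating(rng.gauss(0.0, 1.0), cutpoints) for _ in range(n_neg)]
    pos = [to_rating(rng.gauss(mu_pos, 1.0), cutpoints) for _ in range(n_pos)]
    return neg, pos


# Four illustrative cutpoints give a five-category rating scale.
neg, pos = simulate_ratings(50, 50, 1.5, [-0.5, 0.5, 1.0, 2.0], seed=1)
print(sorted(set(neg + pos)))
```

Under a null hypothesis of no test effect, repeating such a simulation many times and recording how often the test rejects at the nominal level gives the empirical type I error rate.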

Discussion

We have summarized recently proposed solutions for the various drawbacks of the original DBM method and examined the performance of these solutions in a simulation study. The solutions include using normalized pseudovalues that allow DBM results to be based on either the original or the jackknife accuracy estimates; using less data-based model reduction and ddf_H to make DBM less conservative with a type I error rate much closer to the nominal level; and showing that the DBM model can be viewed
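The idea behind normalized pseudovalues can be written compactly. The block below is a sketch assuming the usual definition, in which each raw pseudovalue is shifted by a constant so that the case-average matches the original accuracy estimate; the notation (test i, reader j, case k, c cases in total, accuracy estimate \hat{\theta}_{ij}) is assumed here and does not appear in the excerpt.

```latex
% Raw jackknife pseudovalue; \hat{\theta}_{ij(k)} is the estimate with case k deleted
Y_{ijk} = c\,\hat{\theta}_{ij} - (c-1)\,\hat{\theta}_{ij(k)}

% Normalized pseudovalue: shift so the case-average equals the original estimate
Y^{*}_{ijk} = Y_{ijk} + \bigl(\hat{\theta}_{ij} - \bar{Y}_{ij\cdot}\bigr),
\qquad
\bar{Y}^{*}_{ij\cdot} = \hat{\theta}_{ij}
```

Because the shift is constant within each test-reader combination, it changes the pseudovalue means but not their within-combination variability.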

Acknowledgments

The authors thank Carolyn Van Dyke, MD, for sharing her data set.

References (24)

C.A. Roe et al.
Dorfman-Berbaum-Metz method for statistical analysis of multireader, multimodality receiver operating characteristic data: Validation with computer simulation
Acad Radiol
(1997)
D.D. Dorfman et al.
Monte Carlo validation of a multireader method for receiver operating characteristic discrete rating data: Factorial experimental design
Acad Radiol
(1998)
K.S. Berbaum
God, like the devil, is in the details
Acad Radiol
(2006)
S.L. Hillis et al.
Power estimation for the Dorfman-Berbaum-Metz method
Acad Radiol
(2004)
D.D. Dorfman et al.
Maximum likelihood estimation of parameters of signal-detection theory and determination of confidence intervals: Rating method data
J Math Psychol
(1969)
D.D. Dorfman et al.
Receiver operating characteristic rating analysis: Generalization to the population of readers and patients with the jackknife method
Investig Radiol
(1992)
M.H. Quenouille
Approximate tests of correlation in time series
J R Stat Soc Ser B
(1949)
M.H. Quenouille
Notes on bias in estimation
Biometrika
(1956)
J.W. Tukey
Bias and confidence in not quite large samples
Ann Math Stat
(1958)
S.L. Hillis et al.
A comparison of the Dorfman-Berbaum-Metz and Obuchowski-Rockette methods for receiver operating characteristic (ROC) data
Stat Med
(2005)
S.L. Hillis et al.
Monte Carlo validation of the Dorfman-Berbaum-Metz method using normalized pseudovalues and less data-based model simplification
Acad Radiol
(2005)
S.L. Hillis
A comparison of denominator degrees of freedom methods for multiple observer ROC analysis
Stat Med
(2007)


¹: This research was supported by Grant R01EB000863 from the National Institutes of Health, Bethesda, MD. The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs.

