Multisite Concordance of DSC-MRI Analysis for Brain Tumors: Results of a National Cancer Institute Quantitative Imaging Network Collaborative Project

BACKGROUND AND PURPOSE: Standard assessment criteria for brain tumors that include only anatomic imaging continue to be insufficient. While numerous studies have demonstrated the value of DSC-MR imaging perfusion metrics for this purpose, they have not been incorporated into standard criteria due to a lack of confidence in the consistency of DSC-MR imaging metrics across sites and platforms. This study addresses this limitation with a comparison of multisite/multiplatform analyses of shared DSC-MR imaging datasets of patients with brain tumors.
MATERIALS AND METHODS: DSC-MR imaging data were collected after a preload and during a bolus injection of gadolinium contrast agent using a gradient recalled-echo-EPI sequence (TE/TR = 30/1200 ms; flip angle = 72°). Forty-nine low-grade (n = 13) and high-grade (n = 36) glioma datasets were uploaded to The Cancer Imaging Archive. Datasets included a predetermined arterial input function, enhancing tumor ROIs, and the ROIs necessary to create normalized relative CBV and CBF maps. Seven sites computed 20 different perfusion metrics. Pair-wise agreement among sites was assessed with the Lin concordance correlation coefficient. Distinction of low- from high-grade tumors was evaluated with the Wilcoxon rank sum test, followed by receiver operating characteristic analysis to identify the optimal thresholds based on sensitivity and specificity.
RESULTS: For normalized relative CBV and normalized CBF, 93% and 94% of entries, respectively, showed good or excellent cross-site agreement (0.8 ≤ Lin concordance correlation coefficient ≤ 1.0). All metrics could distinguish low- from high-grade tumors. Optimal thresholds were determined for the pooled data (normalized relative CBV = 1.4, sensitivity/specificity = 90%/77%; normalized CBF = 1.58, sensitivity/specificity = 86%/77%).
CONCLUSIONS: With DSC-MR imaging data obtained after a preload of contrast agent, brain tumor perfusion metrics were substantially consistent across sites, and a common threshold for distinguishing low- from high-grade tumors could be identified.

tumor progression from treatment response. [2][3][4][5][6] Because of these difficulties, patients are often precluded from switching to potentially more effective therapies within treatment windows of 3-5 months. 1 Clearly, better indicators of tumor response that are not confounded by these treatment adverse effects are needed.
Perfusion MR imaging methods, which have repeatedly demonstrated the ability to provide biologically relevant information for treatment management, have the potential to overcome these limitations. For brain perfusion, DSC-MR imaging methods have been the most commonly used. With DSC-MR imaging, T2- or T2*-weighted images are acquired with high temporal resolution during a bolus administration of a gadolinium contrast agent. 7 The derived relative CBV (rCBV) maps have demonstrated the ability to predict tumor grade 8,9 and survival, 10 distinguish treatment effects from recurrent tumor, 11,12 and predict response to antiangiogenic therapy more reliably than standard MR imaging. [13][14][15] Despite this promise, the translation of DSC-MR imaging into routine clinical use has been hindered by a lack of consistency in the methods used and in the rCBV values reported to make the noted distinctions. Moreover, a threshold determined for one purpose, such as distinguishing low- from high-grade tumor, 16 is often used for another purpose, such as predicting outcomes. 17 Consequently, the present confusion may be due to the lack of well-defined studies performed under carefully controlled conditions that test a specific outcome. This study addresses these limitations by providing carefully curated DSC-MR imaging datasets of low-grade glioma (LGG) and high-grade glioma (HGG) to multiple sites that participate in the National Cancer Institute Quantitative Imaging Network. With this approach, variations in image acquisition and preprocessing are eliminated, and postprocessing methods can be directly compared in their ability to distinguish LGGs from HGGs. In addition, the threshold for this distinction can be identified both for each individual site and as a consensus recommendation.

MATERIALS AND METHODS

Patients
All subjects recruited from a single site provided informed written consent according to institutional review board policy. Subjects with histologically confirmed, newly diagnosed, and treatment-naïve glial tumors who had preoperative DSC-MR imaging were included in this study. Subjects with purely oligodendroglial lesions were not included due to demonstrated differences in rCBV compared with astrocytic tumors. 18 Due to the disparity in the number of subjects histologically diagnosed with low- and high-grade tumors, consecutive subjects with low-grade tumors between 2008 and 2014 and high-grade tumors from 2010 to 2014 were identified. Subjects were excluded if anatomic images were not available for lesion delineation or when DSC-MR imaging data were of poor quality.

Central Preprocessing
The preprocessing workflow is schematized in Fig 1. All preprocessing was performed in OsiriX Imaging Software (http://www.osirix-viewer.com) using the IB Delta Suite (Imaging Biometrics, Elm Grove, Wisconsin). Six datasets were provided for each case, including the following: 1) T1+C images; 2) DSC-MR imaging time-series co-registered to the T1+C images; 3) an arterial input function (AIF), which included 3 AIF locations in each file; 4) a whole-brain mask and ROIs of 5) normal-appearing white matter (NAWM) and 6) tumor. The NAWM was used to compute normalized rCBV (nRCBV) and normalized CBF (nCBF) maps.
The DSC-MR imaging volume was co-registered to the T1+C images via the reference scan obtained with the same slice prescription as the DSC-MR imaging. The AIF locations were determined semiautomatically using IB Neuro (Imaging Biometrics) with manual adjustments when necessary. An average signal generated from 3 pixels constituted the AIF. The whole-brain mask was made available to prevent disparity in values that could result from threshold variations that each software platform might use. Using the IB Delta Suite, we determined tumor ROIs from deltaT1 maps, which are standardized difference maps 22 that enable clear visualization of enhancing lesions free of bright signal from blood products or proteinaceous material. Nonenhancing lesions, apparent as areas of dark signal on T1WIs, were delineated by a neuroradiologist with >20 years of experience. Each ROI was created as a 16-bit binary DICOM file that included only whole voxels rather than contoured points. This approach prevents differences in the applied ROIs because each software platform applies different rules regarding whether a voxel is considered inside or outside an ROI. Anonymized datasets were uploaded to The Cancer Imaging Archive, 23,24 where they were further vetted to ensure the compatibility of complete datasets for the analysis platform of each site. All sites were blinded to tumor grade.

Image and Statistical Analyses
Seven sites (1-7) using 7 different software platforms provided 20 different rCBV measurements (Table 1) and 12 different CBF measurements (Table 2). Details for each software platform are listed in the tables. Several sites used more than 1 platform or analysis method. When applicable, the analysis for measurements of standardized rCBV was grouped separately from nRCBV due to image-scale differences. Agreement between each pair of values was assessed by computing the Lin concordance correlation coefficient (LCCC). An LCCC > 0.8 indicates good agreement, and an LCCC > 0.9 indicates excellent agreement.
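As a concrete illustration, the LCCC for one pair of sites can be computed directly from its definition. This is a minimal sketch in Python; the function name and toy values are ours, not the study's implementation:

```python
import numpy as np

def lin_ccc(x, y):
    """Lin concordance correlation coefficient between two sets of
    paired measurements (e.g., one tumor metric from two sites)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    # population (biased) variances and covariance, as in Lin's definition
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike the Pearson correlation, the LCCC penalizes both scale and location shifts between the two sets of measurements, so identical values yield 1.0 while a constant offset between sites lowers the score.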
The ability of each metric to distinguish LGG from HGG was determined using the Wilcoxon rank sum test, with P < .05 indicating significance. A receiver operating characteristic analysis was performed to identify the threshold that provides the optimal sensitivity (SN) and specificity (SP) to distinguish LGG from HGG. The DeLong test for comparing ≥2 receiver operating characteristic curves was used to determine whether there were differences in the ability of each measurement to classify tumors.
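The threshold search can be sketched as an exhaustive scan over candidate cutoffs, scoring each by the Youden index (SN + SP − 1). The function and toy data below are illustrative, not the study's implementation:

```python
import numpy as np

def youden_threshold(values, is_hgg):
    """Pick the cutoff maximizing sensitivity + specificity (Youden J).

    values -- perfusion metric per tumor (e.g., nRCBV)
    is_hgg -- True where the tumor is high-grade
    """
    values = np.asarray(values, float)
    is_hgg = np.asarray(is_hgg, bool)
    best_t, best_j = None, -np.inf
    for t in np.unique(values):                   # candidate cutoffs
        sens = np.mean(values[is_hgg] >= t)       # HGG correctly called positive
        spec = np.mean(values[~is_hgg] < t)       # LGG correctly called negative
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

For perfectly separable groups the scan returns the lowest HGG value as the cutoff with J = 1; with overlapping groups it trades SN against SP, as in the pooled thresholds reported in the Results.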
To determine "consensus" cutoff points, we created boxplots of the sum of SN and SP. Optimal thresholds were identified as those with the maximum mean SN + SP, according to the Youden selection criterion, with small variance. A random effects model was used to assess the reliability of measurements across sites and platforms. The reliability is quantified by the following:

Reliability = σB² / (σB² + σW²),

where σB is the SD between tumors and σW is the SD within tumors (ie, between software platforms). Finally, to assess the clinical relevance of the study observations, we determined the false-positive rate from T1+C images in comparison with each of the perfusion parameters (nRCBV, standardized rCBV, nCBF). The false-positive rate is defined as the proportion of low-grade tumors thought to be aggressive and/or high-grade, as indicated by the decision for tumor resection, relative to all tumors resected. The false-negative rate was not determined because data from all patients, including those who did not undergo an operation, were not available. For the T1+C images, the false-positive rate was defined as the proportion of tumors that were low-grade and had contrast agent-enhancing lesions. For the perfusion parameters, a false positive is defined as a low-grade tumor with a value above the threshold determined for distinguishing low- from high-grade tumors.

FIG 1. Preprocessing workflow. Forty-nine subjects were identified, 13 with a diagnosis of low-grade glioma and 36 with a diagnosis of high-grade glioma. The DSC-MR image volume was co-registered to the T1+C images via the reference scan obtained with the same slice prescription as the DSC-MR imaging. Six datasets were provided for each case, including the following: 1) T1+C images; 2) DSC-MR imaging time-series registered to the T1+C images; 3) an AIF, which included 3 AIF locations in each file; 4) a whole-brain (WB) mask and ROIs of 5) normal-appearing white matter (NAWM) and 6) tumor. Each ROI was created as a 16-bit binary DICOM file that included only whole voxels rather than contoured points. Anonymized datasets were uploaded to The Cancer Imaging Archive. QIN indicates Quantitative Imaging Network.
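Under the random effects model described above, the reliability can be estimated from a tumors-by-platforms matrix of measurements using standard one-way ANOVA variance components. This is a sketch under our assumptions; the study's exact estimator may differ:

```python
import numpy as np

def reliability(measurements):
    """One-way random-effects reliability for an (n_tumors, n_platforms)
    matrix: the share of total variance attributable to tumor differences."""
    m = np.asarray(measurements, float)
    n, k = m.shape
    row_means = m.mean(axis=1)
    # mean squares between tumors and within tumors (across platforms)
    ms_between = k * ((row_means - m.mean()) ** 2).sum() / (n - 1)
    ms_within = ((m - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    # between-tumor variance component (clipped at zero)
    var_tumor = max((ms_between - ms_within) / k, 0.0)
    return var_tumor / (var_tumor + ms_within)
```

A reliability of 0.93, as reported for the 18 nRCBV measurements, means 93% of the total variance is attributable to differences between tumors and 7% to disagreement among analysis platforms.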

RESULTS
Sixty-three subjects met inclusion criteria for this study, with 14 excluded for the following reasons: The contrast agent bolus was delayed during DSC acquisition, preventing capture of the postbolus steady-state signal (n = 4); the contrast agent bolus was injected too slowly and in an irregular pattern (n = 4); the contrast agent bolus was not present during acquisition of images (n = 2); there were severe ghosting and motion artifacts of DSC images (n = 2); or anatomic images were not available for lesion delineation (n = 2). Forty-nine coregistered LGG (n = 13) and HGG (n = 36) DSC-MR imaging datasets were preprocessed, anonymized, and uploaded to The Cancer Imaging Archive (Fig 1). Tumor grade was confirmed with histopathology a median of 3 days (range, 0-4 days) following MR imaging. Examples of postprocessed datasets are shown in Fig 2. LCCC results are displayed in a matrix listing each nRCBV (Fig 3) or nCBF (Fig 4) entry on both the x- and y-axes. For tumor nRCBV, 75% of the entries showed excellent agreement with LCCC ≥ 0.9 and 19% showed good agreement (0.80 ≤ LCCC < 0.90), leaving only 6% with poor concordance (LCCC < 0.80). The concordance was best for nRCBV values determined with leakage correction. For nCBF, only 59% had 0.90 ≤ LCCC ≤ 1.0 and 34% had 0.80 ≤ LCCC ≤ 0.89.
For all software platforms, both nRCBV and nCBF showed statistically significant differences between LGG and HGG (Tables 3 and 4), with a mean nRCBV = 1.4 ± 0.13 and mean nCBF = 1.57 ± 0.24. The SN/SP for nRCBV ranged from 81%-97%/77%-85% and was slightly worse for nCBF, with SN/SP = 64%-97%/69%-85%. By means of the DeLong test, no significant differences were found among the 18 nRCBV metrics (P = .72) in distinguishing LGG from HGG. While differences among the nCBF metrics were borderline significant (P = .05), if the entry with the lowest area under the curve (0.658) was excluded, there was no significant difference among the remaining measures (P = .49). The DeLong test for the standardized rCBV showed no significant distinction between the 2 submissions for this measure (P = .23).
Alternatively, the data can be pooled, as shown by the boxplots of SN + SP (Fig 5), for which median and quartile values are indicated. The maximum sums were the following: nRCBV = 1.4 (SN/SP = 90%/77%) and nCBF = 1.58 (SN/SP = 86%/77%). For these consensus thresholds, the minimum individual SN/SP was 83%/77% for nRCBV and 80%/70% for nCBF. For the 18 nRCBV measurements, the reliability was determined to be 0.93, indicating that 93% of the variation can be attributed to differences in tumors, and 7%, to differences in analysis methods. The reliability was 95% for nRCBV determined with leakage correction and 93% for the group without leakage correction. For the nRCBV computed with one of the most common leakage-correction algorithms (Boxerman-Schmainda-Weisskoff 21 ), the reliability improved to 98%. The reliability of standardized rCBV was 96%. For the 12 nCBF measurements, the reliability was 61%.

DISCUSSION
By means of carefully curated DSC-MR imaging datasets, obtained with a single acquisition approach, all nRCBV and nCBF metrics, processed by 7 different sites, could distinguish LGG from HGG. The optimal nRCBV and nCBF thresholds varied by only 9% and 15%, respectively. Unique to this study, consensus thresholds of nRCBV = 1.4 and nCBF = 1.58 were determined, indicating good accuracy overall and for each individual site. These results should bolster confidence in the ability of DSC-MR imaging to provide reliable and consistent cross-platform perfusion metrics for the evaluation of brain tumors and, specifically, for distinguishing low- from high-grade gliomas.
The range of nRCBV threshold values determined in this study is much tighter than the 0.7-3.0 range previously reported for distinguishing tumor grade, 25,26 predicting differences in survival, 10,17,[27][28][29] and distinguishing true progression from pseudoprogression 30 and tumor from treatment effect. 11,12,31 While this large range of threshold values has been attributed to different acquisition and postprocessing schemes, 20 differences in patient populations and the clinical questions addressed also contribute to the variability. While it is unlikely that a single threshold can be universally applied for all clinical questions, these results suggest that, with well-defined studies addressing a specific outcome under carefully controlled conditions, it is possible to reach consensus.
The present study also demonstrates a greater cross-platform concordance than that previously reported. For example, in one study, 32 2 commercial software packages (nordicICE; NordicNeuroLab, Bergen, Norway; and Brain-STAT; GE Healthcare) were compared. Like the present study, 1 dataset of 24 patients with de novo glioblastoma was used, and ROIs of tumor and reference brain were predetermined. However, unlike the present study, vastly different mathematic algorithms were applied, resulting in very disparate definitions for nRCBV and CBF; thus, a wide range of values was reported. In the present study, most algorithms involved the integration of the concentration-time course and the application of Boxerman-Schmainda-Weisskoff leakage correction, 21 which, in a subanalysis, also showed better reliability. In the previous study, 5 of 10 algorithms relied on the determination of the AIF. 32 Using an AIF to compute nRCBV resulted in coefficients of variation of 15%, but only 2% when an AIF was not used. The challenges of reliably determining the AIF are well-known and may largely explain the poor repeatability. 33,34 That most software platforms in this study did not incorporate an AIF for the nRCBV calculation may therefore also explain the excellent concordance across sites. Yet the computation of CBF requires the determination of an AIF, which is likely a primary reason for its greater variance in comparison with nRCBV (Figs 3 and 4). Also, the individual nCBF thresholds calculated using IB Neuro varied across sites because some sites chose to use circular deconvolution of the AIF for processing while others did not.
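The integration of the concentration-time course mentioned above can be sketched in a few lines: the DSC signal is converted to ΔR2*(t) via the standard gradient-echo relation, then integrated over time. This is a deliberately minimal illustration with hypothetical parameter names; it omits the Boxerman-Schmainda-Weisskoff leakage correction and all other refinements an actual platform applies:

```python
import numpy as np

def delta_r2star(signal, n_baseline, te):
    """ΔR2*(t) = -ln(S(t)/S0)/TE, with S0 the mean prebolus baseline signal."""
    s = np.asarray(signal, float)
    s0 = s[..., :n_baseline].mean(axis=-1, keepdims=True)
    return -np.log(s / s0) / te

def rcbv_uncorrected(signal, n_baseline, te, tr):
    """rCBV proportional to the time integral of ΔR2* (simple Riemann sum);
    no contrast-agent leakage correction is applied."""
    dr2 = delta_r2star(signal, n_baseline, te)
    return dr2.sum(axis=-1) * tr
```

In practice the per-voxel rCBV would then be normalized by the mean value in NAWM to give nRCBV, as done for this study's datasets.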
Five of 10 analysis methods in the previous software comparison study often used γ-variate fitting. 32 Several studies reported a lower SNR 35 as well as greater inaccuracy when γ-variate fitting was used for brain tumor DSC-MR imaging data, especially in the presence of contrast agent leakage. 19,20 Although γ-variate fitting suppresses the postbolus baseline, making it appear that leakage has been corrected, there is no physiologic basis for this correction, and it does not appropriately consider leakage that can occur during the bolus. 36 Gamma-variate fitting was not used by any of the software platforms evaluated in the present study.
In another study, 37 nRCBV values were generated from 3 FDA-approved software packages, including IB Neuro 1.1, FuncTool 4.5.3 (GE Healthcare), and nordicICE 2.3.13, and 1 in-house software platform. While effort was made to use the tools in a similar way, more user interaction was required of some (FuncTool, nordicICE), and FuncTool did not have the option for leakage correction. The largest differences between the in-house and commercial software occurred with the tool that required the most user interaction (nordicICE), further motivating the development of more automated workflows with less need for user interaction. Yet another study comparing these same 3 packages also found significant differences, with the outlying package depending heavily on the type of rCBV metric used. 38 This finding again suggests that it is imperative that the same output metric be used when making such comparisons.
Of relevance to the current study, rCBV maps generated with IB Neuro showed superior leakage correction and stronger correlation with image-guided microvessel quantification, as well as higher accuracy in distinguishing tumor recurrence from pseudoprogression/radiation necrosis, compared with other software platforms. 39 These results are relevant, given the number of sites in the present study that chose to use IB Neuro for their processing.
A limitation of the current study is the use of a DSC-MR imaging dataset that was obtained at a single center using a single approach. Use of a range of acquisition methods would likely result in greater variation in the DSC-MR imaging perfusion results. A previous study confirmed this by comparing a range of acquisition and analysis methods, which also influenced the ability to distinguish high-grade tumor from reference brain. 20 However, a consensus regarding best practices for DSC-MR imaging data acquisition is being reached, as described in a recent review, 40 and includes the approach used for this study. Specifically, use of a preload of contrast agent and a flip angle <90° is proving to be one of the most accurate approaches, further confirmed by 2 recent studies, 19,41 both incorporating sophisticated simulations of DSC-MR imaging data representative of brain tumors. Use of a preload might also be an important reason for greater consistency across postprocessing methods in this study compared with previous studies (eg, Orsingher et al 32 ). Collecting DSC-MR imaging data after preload was shown to decrease the dependence of tumor rCBV on the chosen method of analysis. 20 An additional limitation of this study is the use of laboratory or proprietary commercial packages for which many of the details of the algorithmic implementation are not available and thus cannot be further evaluated as potential sources of differences. Also, the software platforms used for this study were dictated entirely by the platforms being used at each participating site. Consequently, this is not a comprehensive comparison of all available software platforms with DSC-MR imaging postprocessing capabilities. The general application of the results of this study is somewhat limited because the preprocessing steps were carefully controlled so that consistent input data were provided to all sites and software platforms.
In practice, subjective manipulation of the preprocessing steps is common; therefore, consistency is less likely, as the discussion of the previous studies reveals. Yet the identification of preprocessing as a key confound should not inhibit use of DSC-MR imaging but rather motivate improving automation of the preprocessing steps. In fact, several efforts to automate tumor segmentation are well underway, 42,43 which would remove this source of discrepancy entirely.

CONCLUSIONS
This study demonstrates that nRCBV and nCBF can be used to distinguish LGG from HGG in a consistent fashion and using a single consensus threshold. This result should increase confidence in using nRCBV primarily, but also nCBF, on a routine basis, potentially motivating its incorporation into the updated Response Assessment in Neuro-Oncology criteria. Finally, these results provide strong motivation for the development of more automated preprocessing workflows that are less dependent on subjective user interaction.