Stable and Discriminatory Radiomic Features from the Tumor and Its Habitat Associated with Progression-Free Survival in Glioblastoma: A Multi-Institutional Study

BACKGROUND AND PURPOSE: Glioblastoma is an aggressive brain tumor, with no validated prognostic biomarkers for survival before sur-gical resection. Although recent approaches have demonstrated the prognostic ability of tumor habitat (constituting necrotic core, enhancing lesion, T2/FLAIR hyperintensity subcompartments) derived radiomic features for glioblastoma survival on treatment-naive MR imaging scans, radiomic features are known to be sensitive to MR imaging acquisitions across sites and scanners. In this study, we sought to identify the radiomic features that are both stable across sites and discriminatory of poor and improved progression-free survival in glioblastoma tumors. MATERIALS AND METHODS: We used 150 treatment-naive glioblastoma MR imaging scans (Gadolinium-T1w, T2w, FLAIR) obtained from 5 sites. For every tumor subcompartment (enhancing tumor, peritumoral FLAIR-hyperintensities, necrosis), a total of 316 three-dimensional radiomic features were extracted. The training cohort constituted studies from 4 sites ( n ¼ 93) to select the most stable and discriminatory radiomic features for every tumor subcompartment. These features were used on a hold-out cohort ( n ¼ 57) to evaluate their ability to discriminate patients with poor survival from those with improved survival. RESULTS: Incorporating the most stable and discriminatory features within a linear discriminant analysis classi ﬁ er yielded areas under the curve of 0.71, 0.73, and 0.76 on the test set for distinguishing poor and improved survival compared with discriminatory features alone (areas under the curve of 0.65, 0.54, 0.62) from the necrotic core, enhancing tumor, and peritumoral T2/FLAIR hyperintensity, respectively. CONCLUSIONS: Incorporating stable and discriminatory radiomic features extracted from tumors and associated habitats across multisite MR imaging sequences may yield robust prognostic classi ﬁ ers of patient survival in glioblastoma tumors. interrogated the stability of radiomic across inherent site-specific acquisition variations within a multisite retrospective cohort in across this to address 2 in the of building robust prognostic for GBM –

G lioblastoma (GBM) is an aggressive brain tumor in which survival without treatment is typically 3 months. A conven-tional first-line treatment regimen involves maximally safe surgical resection followed by concomitant chemotherapy and radiation therapy. Despite this aggressive treatment, patient outcomes are highly variable with .40% of patients with GBM inevitably developing recurrence within 6 months following treatment initiation. While GBM heterogeneity may be reflected in the MR imaging phenotypes with different intensity profiles across the enhancing tumor and its associated habitat (comprising necrotic core, peritumoral T2/FLAIR hyperintensity), these image-based differences may not be visually appreciable to build prognostic models of survival in patients with GBM using routine MR imaging scans (Gadolinium-enhanced T1w [Gd-T1w], T2w, FLAIR).
Radiomic approaches (ie, high-throughput extraction of quantitative features) have provided a surrogate mechanism for capturing image-based phenotypes of disease heterogeneity on routine MR imaging (ie, Gd-T1w, T2w, FLAIR) toward outcome prediction in GBM and other tumors. Specifically, in GBM tumors, the availability of large, multi-institutional imaging cohorts such as The Cancer Imaging Archive-GBM (TCIA-GBM) has enabled development of prognostic models using a variety of radiomic feature families (ie, Gray, Haralick, 1 Gradient, 2 Laws, 3 Gabor, 4 Co-occurrence of Local Anisotropic Gradient Orientations [CoLlAGe] 5 ). These prognostic models have leveraged radiomic features extracted from the tumor habitat (necrotic core, enhancing tumor, peritumoral T2/FLAIR hyperintensity) using routine MR imaging toward outcome prediction in GBM tumors. 6,7 However, a key challenge in enabling the clinical utility of these radiomic approaches is to demonstrate their generalizability to variations in image-acquisition protocols across scanners and sites. Sources of variations in MR imaging acquisition can include differing slice thicknesses, image contrast, voxel resolutions, magnetic field strengths, echo times, and repetition times. [8][9][10] This issue is of particular concern when including data from publicly available repositories such as TCGA-GBM, where imaging scans are pooled-in from across different institutions (Fig 1A).
While preprocessing steps including bias field correction and intensity standardization, to some extent, have allowed correction of site-and scanner-specific variations in MR imaging, multiple studies have demonstrated 9,11,12 that these steps may not be sufficient to ensure stability of radiomic features during downstream analysis. More recently, studies have used stability measures such as the preparation-induced instability (PI) score 13 to identify radiomic features that are stable (across sites) in the context of lung 14 and prostate cancer. 13,15,16 However, in the context of GBM, most studies have focused on "controlled" test-retest studies 12 or identifying stable features across lesion segmentations obtained from different expert readers. 9,17,18 No work, to our knowledge, has explicitly Overview of the methodology and overall workflow. MR imaging scans (Gd-T1w, T2w, and FLAIR) were preprocessed and segmented into 3 tumor subcompartments (necrosis, enhancing tumor, and peritumoral edema). Next, a set of radiomic features was obtained for each of the MR imaging scans across the 3 tumor subcompartments. The most stable and discriminatory features were identified using the PI score and AUC measures across every subcompartment and every MR imaging protocol. The identified stable and discriminatory features were used for stability assessment across tumor subcompartments (with the associated habitat), individual imaging protocols (Gd-T1w, T2w, FLAIR), and MR imaging acquisition parameters (magnetic field strength, pixel resolution, and slice spacing).
interrogated the stability of radiomic features across inherent sitespecific acquisition variations within a multisite retrospective cohort in GBM tumors. Additionally, existing approaches have not explicitly interrogated the stability of radiomic features from across different subcompartments of the GBM tumor habitat. In this study, we sought to address 2 specific questions in the context of building robust prognostic image-based markers for GBM tumors-Can we identify the following: 1) a subset of radiomic features from across imaging protocols such as Gd-T1w, T2w, and FLAIR scans that are most stable across sites; and 2) a set of cross-site stable radiomic features from each tumor subcompartment that are also discriminatory of progression-free-survival (PFS) in GBM tumors?

Data Description
A total of 150 retrospectively analyzed treatment-naive multiparametric GBM MR imaging scans (Gd-T1w, T2w, FLAIR) along with their PFS information were obtained from 5 different institutions, including studies from TCGA-GBM (n ¼ 60) 19 and the Ivy Glioblastoma Atlas Project (n ¼ 33) 20 cohorts, as illustrated in Table 1. We used median PFS obtained from our training cohort to segregate the studies into improved (PFS . 6.5 months) and poor (PFS # 6.5 months) GBM survivors. Additional details regarding the data curation and preprocessing are provided in the Online Supplemental Data.
Segmentation of the Tumor Habitat. Following preprocessing, each 2D MR imaging slice with visible tumor was annotated by expert readers (V.B.H. with .15 years of experience, V.S. with 12 years of experience, and K.B. with .3 years of experience in neuroradiology) into 3 subcompartments: 1) enhancing tumor, 2) FLAIR hyperintensities (edema and nonenhancing tumor), and 3) necrosis. Tumor lesion annotations were also used to select the nontumor ROI for every patient volume. The nontumor region was defined as the region obtained by mirroring the tumor lesion (all 3 subcompartments in conjunction) onto the contralateral hemisphere. The nontumor region for each study was manually inspected by expert readers to verify that it did not contain any proportion of the tumor region or skull. An overview of the entire radiomics analysis workflow (including preprocessing and segmentation) is provided in Fig 1. Radiomic Feature Extraction. For every tumor subcompartment (enhancing tumor, peritumoral FLAIR hyperintensities, and necrosis), a set of 1264 radiomic features were extracted across all 3 multiparametric MR imaging (Gd-T1w, T2w, and FLAIR) scans. These radiomic features included Laws energy, 3 Haralick, 1 Gabor, 4 CoLlAGe, 5 Gradient, 2 and first-order gray level co-occurrence matrices (additional details provided in the Supplemental Online Data, A.3). A total of 316 three-dimensional radiomic textural features were extracted on a per-voxel basis with window sizes of 3 and 5 from each tumor subcompartment per the MR imaging scan. Four first-order statistics (median, variance, skewness, kurtosis) per feature descriptor were computed within each tumor subcompartment, yielding 1264 statistical features per region (enhancing tumor, peritumoral FLAIR hyperintensity, and necrosis) per the MR imaging protocol. When statistical features were concatenated across all 3 MR imaging sequences (Gd-T1w, T2w, and FLAIR), a total of 3792 texture features were obtained for analysis from each tumor subcompartment. Additionally, to identify stable features across multiple sites, we extracted 3792 statistical features from the nontumor brain parenchyma (mirroring the tumor mask on the opposite brain hemisphere). Following feature extraction, z score feature normalization 21 was applied to ensure that radiomic features extracted from different sites lie within a comparable range of values for leave-one-site-out (LOSO) analysis. The mean and standard deviation (SD) feature values obtained from the training data were further used to apply z score normalization on the independent holdout validation set.
Computation of Feature Instability across Sites. As described in Leo et al, 13 the PI uses a statistical t test to compare the number of feature distributions that are significantly different across data sets from different sites. The PI score ranges between 0 and 1 with a high PI score for a feature representing low stability across sitespecific variations, while a low PI score (closer to 0) indicates that the feature may not be affected by acquisition variations and is likely stable across sites. 13

Statistical Analysis
To perform a rigorous evaluation of feature stability across sites, our multi-institutional GBM cohort was segregated into 3 groups: training set (T ), validation set (VS), and test set (TS) (Fig 2). Using these 3 groups, we assessed radiomic feature stability across subcompartments of the tumor and its associated habitat, individual imaging protocols (Gd-T1w, T2w, and FLAIR), and common sources of variance in MR imaging parameters (magnetic strength, pixel spacing, slice thickness). Specifically, we used sites S1-S4 in a LOSO fashion so that studies from 3 sites at a time consisted of T where i [ (1,...,4), while studies from the remaining fourth site were used as VS i . Finally, the TS consisted of studies obtained from our hold-out site (Cleveland Clinic Foundation), which was used for independent evaluation. For every training set T , i [ (1,...,4), we computed the PI score to measure the variability in radiomic feature values engendered due to acquisition variations across the 3 sites within the T. The features with PI . 0 indicated that they were significantly different in the nontumor regions across different sites in the TS and hence were excluded from further analysis. After this triage, the remaining stable feature set was used to identify the most discriminatory features using minimum redundancy maximum relevance (mRMR) feature selection within a linear discriminant analysis classifier across 100 iterations of bootstrapping. This LOSO analysis was performed separately for each of the 3 tumor subcompartments (enhancing tumor, peritumoral hyperintensity, necrotic core) to identify the most stable and discriminatory features from the training set for every MR imaging protocol (Gd-T1w, T2w, and FLAIR). The average area under the curve (AUC) across different iterations of LOSO was used to obtain discriminability values in identifying poor and improved survival across each of the tumor subcompartments. Furthermore, for each tumor subcompartment, we concatenated and rank-ordered stable and discriminatory features in terms of their frequency of occurrence across 400 trials (100 bootstrapped iterations across 4 runs of LOSO evaluation as shown in Fig 2) of cross-validation run on our training set. The 5 most-frequently selected features using Tand VS were obtained and evaluated on the TS for their efficacy in distinguishing poor from improved survival in GBMs using the AUC and accuracy measures. Additionally, for comparison, we selected radiomic features using only minimum Redundancy-Maximum Relevance (discriminatory features) without inclusion of the PI metric for feature pruning. We performed the LOSO runs on discriminatory features using a linear discriminant analysis classifier to evaluate their performance (for distinguishing poor and improved survival) in terms of the AUC and accuracy on our hold-out validation set.
Lastly, we interrogated the feature distributions of stable and discriminatory radiomic features and compared them with discriminatory features using box-and-whisker plots. Specifically, the comparisons were made using the Wilcoxon rank-sum test in the context of evaluating variability in feature distributions across 3 commonly varying MR imaging parameters: 1) slice thickness (#2 mm, 2-5 mm, $5 mm), 2) pixel spacing (,0.5 mm, $0.5 mm), and 3) magnetic field strength (1.5T versus 3T), across our multi-institutional cohort.

Experiment 1: Assessing the Stability of Radiomic Features across Different Subcompartments of the Tumor and Its Associated Habitat
Figure 3A-C shows 2D scatterplots of discriminability (average AUC on the y-axis) versus instability (PI on the x-axis), where each data point in the 2D space corresponds to 1 of the 3792 radiomic features across every tumor subcompartment (necrosis, enhancing tumor, and FLAIR hyperintensity, respectively). The size of the data point on the scatterplot was used to represent the SD in AUC values for a specific feature across different LOSO runs. Our results suggested that the features with PI closer to zero also had a consistent diagnostic performance (in terms of AUC values) across sites as evident from the small size of the data points (reflecting a small deviation in the AUC) in Fig 3. The top 5 most frequent stable and discriminatory radiomic features per subcompartment selected using LOSO subexperiments are listed in Table 2. Figure 4 illustrates the AUC values for training (T 1 -T 4 ) and validation (VS 1 -VS 4 ) cohorts using the top 5 stable and discriminatory radiomic features versus discriminatory features alone (without inclusion of the stability metric) from each of the 3 tumor subcompartments using LOSO analysis. Confidence intervals derived from across the bootstrapped experiments for the AUC values in the training set are available in the Online Supplemental Data. The AUC values on the training set did not show much improvement between stable and discriminatory and discriminatory classifiers.
However, on the validation set (VS 1 -VS 4 ), the compartmentspecific linear discriminant analysis classifiers trained using the top 5 stable and discriminatory radiomic features yielded at least 5%-10% improvement in AUC values (0.78, 0.64, 0.66, 0.71 for VS 1 -VS 4 respectively), compared with the linear discriminant analysis classifiers trained using discriminatory features alone (0.57, 0.51, 0.52, 0.32 for VS 1 -VS 4 , respectively) in the enhancing tumor and T2/FLAIR hyperintensity subcompartments (0.76, Our compartment-specific classifiers trained using the top 5 stable and discriminatory radiomic features yielded AUCs of 0.71, 0.73, and 0.76, while discriminatory-alone features yielded AUCs of 0.65, 0.54, and 0.62 for necrotic core, enhancing tumor, and FLAIR hyperintensities, respectively, on the hold-out testing cohort. Additionally, combining stable and discriminatory features from across the tumor habitat yielded an AUC of 0.78 compared with using discriminatory features alone, which yielded an AUC of 0.69 on the hold-out test set. Experiment 2: Assessing the Stability of Radiomic Features across Individual Imaging Protocols (Gd-T1w, T2w, and FLAIR) Figure 3D shows stacked barplots corresponding to the total number of features (basic set) along with the proportion of features that were identified as cross-site stable with a PI ¼ 0 (stability-informed set) from each feature family across Gd-T1w, T2w, and FLAIR sequences. Of a total of 3792 features for every protocol (1264 features per subcompartment), a total of 240 features from Gd-T1w, 302 features from T2w and 306 features from FLAIR were found to be stable (using PI ¼ 0) across the studies from the training set (T ) sets. While 20% of features from every feature family were identified as stable across all the 3 MR imaging protocols, the greatest number of stable features was found to belong to Laws, CoLlAGe, and Gabor feature families (Online Supplemental Data).

DISCUSSION
Estimation of PFS often serves as a surrogate marker for predicting therapeutic efficacy in GBM tumors. Previous approaches have investigated radiomic features extracted from different subcompartments of the tumor habitat (constituting necrotic core, enhancing tumor, and T2/FLAIR hyperintensity subcompartments) on routine MR imaging in prognosticating PFS in patients with GBM. However, most of these studies do not explicitly take into account the sensitivity of radiomic features to site-and scanner-specific variations in MR imaging parameters, including variations in slice thicknesses, image contrast, voxel resolutions, and magnetic field strengths, which ultimately impact the generalizability and clinical applicability of these approaches.
More recently, a few approaches have focused on identifying stable and discriminatory features (using preparation-induced instability score 13 ) across multi-institutional studies in the context of other tumor sites (lung, 14 prostate 13,15,16 ). Our work leveraged a similar approach to identify radiomic features that are stable as well as discriminatory of poor-and improved-survival groups in the context of brain tumors. To rigorously evaluate the reproducibility of our selected radiomic features, we performed a comparison of stable and discriminatory features with discriminatoryalone features in a LOSO fashion (across 4 training sites) for distinguishing poor and improved GBM survivors on the basis of their PFS. This analysis was performed across each of the tumor subcompartments of the tumor habitat on routine multiparametric pretreatment MR imaging scans (Gd-T1w, T2w, FLAIR). Lastly, for each tumor subcompartment, the most frequent stable and discriminatory radiomic features in the training set were used to risk-stratify patients with GBM into poor and good survivors on an independent cohort obtained from a collaborating institute (Cleveland Clinic). Our findings on training and test sets are in line with those in previous work, 13,14 in which the use of PI score within feature selection allowed building improved diagnostic and prognostic classifiers using stable and discriminatory features.
In the context of GBM prognosis, a few studies that have interrogated variations in radiomic features have focused on the reproducibility across segmentations of the tumor obtained from across different experts/institutions 9,12,17,18 or, in terms of repeatability, using test-retest studies. 12   from the histogram and co-occurrence matrices were the most robust with an intraclass correlation coefficient value of $0.8. Similarly, Lee et al 22 compared radiomic features (textural, morphometric, and first-order) across segmentations of the tumor habitat obtained from 2 semiautomated segmentation software programs on 45 studies and identified the first-order statistic-feature family (across all 3 tumor subcompartments) to be the most stable across the automated segmentations. In a recent study, Shiri et al 12 evaluated the repeatability of radiomic features across 17 patients with GBM using T1-and T2-weighted MR imaging scans obtained within the same imaging unit on 2 consecutive days. By means of the intraclass correlation coefficient, the study demonstrated that the textural features from the gray-level run length matrix and gray-level dependence matrix were highly repeatable (intraclass correlation coefficient, .95%) with respect to image preprocessing, different image-registration algorithms, and test-retest analysis. Similar features, including difference variance, inverse different moment, fraction, and long-and short-run emphasis were found to be highly reproducible among different field sizes and phantom positions in Rastegar et al. 23 In Suter et al, 24 the authors applied multiple perturbations, including variability in the imaging protocols (voxel size and axial slice spacing), artificial deformation in the segmentation labels, k-space subsampling on MR imaging, and selected robust features based on the intraclass correlation coefficient from a single site. Selected robust features along with machine learning classifiers evaluated on multicenter data demonstrated an improvement in the prognostic power of the models compared with the machine learning classifiers trained with nonrobust features.
Our work is different from previous related studies [25][26][27][28][29] in a few ways. First, our work presented the first approach in the context of GBM tumors to jointly explore the stability and discriminability of radiomic features. Our results suggest that the radiomic features from T2/FLAIR hyperintensity are the most stable and discriminatory across different tumor subcompartments. Second, we carefully interrogated a set of stable features from across different MR imaging sequences (Gd-T1w, T2w, FLAIR). Finally, we leveraged retrospective MR imaging scans acquired from 5 different institutions to build a robust prognostic classifier that incorporated both stable and discriminatory features. Uniquely, we demonstrated that feature distributions across different slice thicknesses, pixel spacing, and magnetic field strengths were more consistent across stable and discriminatory radiomic features than the feature distributions obtained for the discriminatory features.
While the performance accuracy across stable and discriminatory features was similar to that of discriminatory features alone on the training set, there are 2 key points worth noting: 1) The model trained on stable and discriminatory was robust to the variability in training data because the SD of the AUC values was much higher for discriminatory-alone compared with stable and discriminatory features (Online Supplemental Data). 2) We observed an improvement in AUC values on the independent hold-out test set, suggesting that the model trained on stable and discriminatory features may be more generalizable and hence more amenable to providing reliable AUC assessments on the hold-out set, compared with the model trained on discriminatory features alone.
Our work did have some limitations. While our analysis included 150 patients from 5 different institutions, it was limited to a small cohort of studies per site. The retrospective nature of our cohort led to an unequal number of patients per site, which may have further impacted our analysis of radiomic variations in slice thickness, pixel spacing, and magnetic strengths across sites. Notably, our model yielded the lowest AUCs when validated on VS 2 (the Ivy Glioblastoma Atlas Project cohort). A possible reason may be the high variability in acquisition parameters across scanners (GE Healthcare and Siemens), variations in the magnetic field strength, and a high range of slice thicknesses (0.9-6.5 mm) and pixel spacing (0.43-1.05 mm), compared with data sets from the other cohorts. Additional work is warranted to rigorously validate the reproducibility of the identified features across multiple different imaging parameters on a much larger multisite cohort. We will additionally also investigate the repeatability of the radiomic features from across the tumor habitat via a prospectively collected test-retest study for the GBM cohort.

CONCLUSIONS
Our approach demonstrated that the radiomic features from the T2/FLAIR hyperintentensity subcompartment were the most stable and discriminatory of PFS in GBM tumors compared with the features from enhancing tumor and necrotic core. Identification of stable and discriminatory radiomic features across multisite multiparametric MR imaging (Gd-T1w, T2w, FLAIR) sequences from within the tumor and its associated habitat may yield robust prognostic classifiers for patient survival in GBM tumors.