Deep Transfer Learning and Radiomics Feature Prediction of Survival of Patients with High-Grade Gliomas.

BACKGROUND AND PURPOSE
Patient survival in high-grade glioma remains poor, despite the recent developments in cancer treatment. As new chemo-, targeted molecular, and immune therapies emerge and show promising results in clinical trials, image-based methods for early prediction of treatment response are needed. Deep learning models that incorporate radiomics features promise to extract information from brain MR imaging that correlates with response and prognosis. We report initial production of a combined deep learning and radiomics model to predict overall survival in a clinically heterogeneous cohort of patients with high-grade gliomas.


MATERIALS AND METHODS
Fifty patients with high-grade gliomas from our hospital and 128 patients with high-grade glioma from The Cancer Genome Atlas were included. For each patient, we calculated 348 hand-crafted radiomics features and 8192 deep features generated by a pretrained convolutional neural network. We then applied feature selection and Elastic Net-Cox modeling to differentiate patients into long- and short-term survivors.


RESULTS
In the 50 patients with high-grade gliomas from our institution, the combined feature analysis framework classified the patients into long- and short-term survivor groups with a log-rank test P value < .001. In the 128 patients from The Cancer Genome Atlas, the framework classified patients into long- and short-term survivors with a log-rank test P value of .014. For the mixed cohort of 50 patients from our institution and 58 patients from The Cancer Genome Atlas, it yielded a log-rank test P value of .035.


CONCLUSIONS
A deep learning model combining deep and radiomics features can dichotomize patients with high-grade gliomas into long- and short-term survivors.

very poor prognosis and a low rate of treatment response. The standard combined treatment of surgery, temozolomide, and chemoradiation has improved the median overall survival (OS) of GBM to roughly 2 years. Presently, there is no method to reliably predict the OS of patients with GBM as a response to treatment. The absence of such reliable prediction is a barrier in designing clinical trials and selecting optimal treatments for patients.
In the existing literature, MR imaging features of a brain tumor, including its volume, intensity, shape, and texture of contrast enhancement and evidence of tumor necrosis, diffusivity, infiltration, and blood volume, have been demonstrated to correlate with the OS of HGG. 1-6 A large number of these features were included for radiomics, techniques that leverage the wealth of information in images by extracting semiquantitative or quantitative predefined image features to derive a relationship between the features and clinical outcomes of interest.
Radiomics feature analysis has been shown to correlate with molecular and histologic tissue types and outcomes, such as response and OS of HGG, but the correlation remains imperfect. A major weakness that likely constrains the performance of radiomics is that predefined features are loworder features selected on the basis of heuristic knowledge about oncologic imaging. Therefore, extracting higher level and more complicated features and integrating these additional features into the framework of radiomics may result in improved predictive power.
Recently, a deep convolutional neural network (CNN) achieved outstanding performance in many areas of medical image analysis, such as segmentation, 7,8 classification, 9,10 prediction of tumor grade, 11 and patient survival. 12,13 A typical CNN structure is a feed-forward network that includes an input layer, multiple hidden layers, and an output layer. Because the convolutional filters and other parameters are adjusted automatically during the training process, the parameters of CNN are learned in a way that optimizes the use of information contained in the input images. In this process, a CNN may create and select a large number of features at its hidden layers. These features, termed "deep features" in this work, may be exploited for predicting the OS of patients with HGG in an unbiased fashion that does not require any prior definition and may contain extensive abstract information from the hidden layers. In training a CNN, when the available dataset is of limited size, one can apply a pretrained CNN for the task on hand. This practice, called "transfer learning," has been shown to be an effective way of using deep learning in many cases. 12,14,15 In this study, we integrated deep transfer learning and traditional radiomics techniques to explore a very large number of features in brain MR imaging of patients with HGGs. We then classified patients into longer term and shorter term survivors by training a machine learning model to predict OS on the basis of these image features.

MATERIALS AND METHODS
Our general workflow is depicted in Fig 1A, consisting of ROImarking, image-preprocessing, feature extraction by traditional radiomics and deep learning, and statistical analysis.
From the 2 cohorts, we constituted 3 data groups. The first group is the 50 patients from Brigham and Women's Hospital. Scans for this group were all acquired using the SE sequence. The second group is the 128 patients from TCGA, and scans were acquired using SE or gradient recalled-echo sequences. The third group comprised the 50 SE scans from Brigham and Women's Hospital and the 58 selected SE scans from TCGA.

Tumor Segmentation and Image Preprocessing
For both patient cohorts, ROIs were manually traced by a radiologist on the section with the largest tumor area.
Before the extraction of quantitative features, several preprocessing techniques were applied to improve texture discriminations. First, intensity normalization was performed in a nonlinear way to convert MR images into standardized intensity ranges for all subjects. 16 Second, to improve the computational performance and the signal-to-noise ratio of the texture outcome, we used gray-level quantization, which maps the full intensity range of the tumor region to different levels of gray. 17 Two gray-level quantization algorithms (equal-probability quantization, uniform quantization) and 2 numbers of gray levels (16 and 32) were adopted. Finally, all images were resampled to an isotropic pixel size using bilinear interpolation. Scale values of 1 mm (pixel size 4 1 Â 1 mm 3 ) and initial in-plane resolution were both tested.
For deep features, we cropped the MR images by finding a rectangular ROI that enclosed the outlined tumor. Then we resized the tumor patch to a 224 Â 224 square to fulfill the requirement for the input size of the pretrained CNN model that we used. Also considering that the CNN model that we used was pretrained on natural images with a color range of 0-255, we normalized the intensity of tumor patch images to the same color range.

Feature Extraction
Two types of features were extracted. The first type is handcrafted features that were manually extracted from an ROI. Hand-crafted features were divided into 3 groups: 1) nontexture features, including volume, size, and intensity features (such as solidity, eccentricity); 2) first-order histogram-based texture features, including skewness, kurtosis, variance, and others; and 3) second-order texture features, including features from the graylevel co-occurrence matrix, gray-level run length matrix, graylevel size zone matrix, and neighborhood gray-tone difference matrix. In total, we calculated 348 radiomics features for each ROI.
The second type of features were deep features. We chose the VGG-19, 18 which was pretrained on the natural image dataset ImageNet (http://www.image-net.org/), 19 which contains .1.2 million images as our CNN. VGG-19 has 19 layers with weights, formed by 16 convolutional layers and 3 fully connected layers. All the convolutional layers are built with a fixed kernel size of 3 Â 3, and the stride and padding are fixed at 1. The network has 5 max-pooling layers with a window size of 2 Â 2 and uses rectified linear units as the nonlinear activation function. The first 2 fully connected layers have 4096 features each, while the last FC layer has 1000 features with SoftMax activation (https://www.moleculardevices. com/products/microplate-readers/acquisition-and-analysissoftware/softmax-pro-software) ( Fig 1B). After we ran the front propagation of the VGG-19 model with pretrained weights as the initialization, a total number of 8192 deep features were extracted from the first 2 fully connected layers. All features were normalized by transforming the data into z scores with a mean of 0 and an SD of 1.

Feature Reduction
Feature reduction is a critical step because with 8192 deep features and 348 radiomics features, the number of features may result in overfitting in OS prediction. In addition, some features may have zero variance, have high correlations with other features, or have little relevance to the goal of OS prediction. Thus, the number of features needs to be reduced. We adopted 3 steps for feature reduction, namely, median absolute deviation, concordance indices (C-indices), and the Pearson coefficient correlation, to improve the generalizability and interpretability of our model.

Statistical Analysis
For censored survival data, we used the Elastic Net-Cox proportional hazards model 20 to analyze the OS of patients with HGG. We denoted the sample size as N. To maximize the use of the limited data, we applied leave-one-out cross validation N times as the outer loop to split the data into training and test sets. Each pair of training and test sets was examined independently. The model was fit on the basis of the training set by maximizing the penalized partial log-likelihood function for the Cox model with a penalty term. In this model, the penalty parameter l was optimized within the 10-fold cross-validation loop.; a, which was used to determine the influence of the L1 penalty on the L2 penalty was set to be 0.1. After optimization, the model output a survival score for each patient in the training and test sets. We used the median of survival scores in each training set as the threshold to classify each patient in the test set into either a longer term or a shorter term survivor cohort. Finally, Kaplan-Meier analysis was used to estimate survival probabilities of each cohort, 21 and a logrank test was implemented to test the hypothesis that the survival curves differed statistically significantly between the 2 cohorts. A P value representing the statistical significance of the curve separation was used as an index to model predictive performance. For all analyses, a P value , .05 was indicative of statistical significance. All testing was 2-tailed.

Clinical Characteristics of Patients
The demographic and clinical characteristics of patients in all 3 datasets are shown on Table 1. The median and mean OS were 503 days and 690 days for the patients in group one, 352 days and 449 days for patients in group 2, and 490 days and 642 days for group 3.

Feature Extraction
An example of a contrast-enhanced T1-weighted MR image of a longer term survivor and that of a shorter term survivor are shown in Fig 2A, -C. With different quantization algorithms, different numbers of gray-levels, and different scales for isotropic pixel resampling, we collected 348 hand-crafted quantitative radiomics features: 4 nontexture features, 24 first-order histogram-based texture features, 72 second-order texture features from the gray-level co-occurrence matrix, 104 second-order texture features from the gray-level run length matrix, 104 secondorder texture features from the gray-level size zone matrix, and 40 second-order texture features from neighborhood gray-tone difference matrix. Table 2 summarizes the radiomics features described in Chmelik et al. 9 Meanwhile, we generated and preprocessed the tumor patch images (Fig 2B, -D) as the input for the deep CNN architecture. Then we extracted 8192 features from the first 2 fully connected layers of the pretrained CNN model. Finally, a set of 8540 features was generated for each ROI.

Feature Reduction and Multivariate Statistical Analysis
Feature reduction was performed on the training set of each leave-one-out 10-fold cross-validation loop. We noticed that the deep feature matrix is relatively sparse and there are many uninformative deep features with zero variance. We set zero as the threshold of median absolute deviation to reduce about 60% of the total features, which were all from deep  features. The remaining deep features are less noisy, and some of them have high C-indices. The ranges of C-indices among the 3 datasets vary, so the threshold of C-indices was slightly different. Because the distribution of C-indices for groups 2 and 3 was similar, the threshold of C-indices for these 2 datasets were the same at 0.66. The C-indices in the group 1 dataset were relatively higher at 0.685. The threshold for the Pearson coefficient correlation was set at 0.85 for all 3 datasets. Because the feature reduction was performed within each cross-validation loop and was based only on the subselected training set for the loop, the number of surviving features and the names of those features vary among the different training loops. In all loops, the number of surviving features was less than 100 and most of the surviving features were deep features. In this study, individuals who were lost to follow-up or were still alive at the end of the study were right-censored. The ratio of noncensored-to-censored data in the 3 datasets was 45:5 for DATA 1, 97:31 for DATA 2, and 90:18 for DATA 3. The Elastic Net-Cox proportional hazards model was used as the multivariate statistical model to generate survival scores for each patient. On the basis of the survival scores, we were able to dichotomize patients into 2 groups: predicted longterm and short-term survivors. In Fig 3, we demonstrate the overall performance of our method in the group 1 dataset ( Fig  3A) with a log-rank test P value , .001 (hazard ratio = 3.26; 95% CI, 1.7-6.0), group 2 dataset (Fig 3B) with a log-rank test P value = .014 (hazard ratio = 1.65; 95% CI, 1.1-2.4), and the group 3 dataset (Fig 3C) with a log-rank test P value = .035 (hazard ratio = 1.71; 95% CI, 1.0-2.3). The ratio of shorter term-to-longer term survivors and the percentage of shorter term survivors was 22:28 (44%) for DATA 1, 63:65 (49%) for DATA 2, and 57:51 (53%) for DATA 3.

DISCUSSION
In this study, we demonstrated that machine learning-based statistical analysis of an image feature set comprising both radiomics and deep learning features extracted from gadolinium-enhanced brain MR imaging of patients with GBM can be used to distinguish longer term from shorter term survivors. The model was proved efficient to analyze anonymized brain MR imaging data from both publicly available sources and our hospital.
Previous literature on quantitative image-based prediction of OS has suggested that deep features play a complementary role to radiomics features. 12,14,22,23 Our study demonstrates that this remains true when using deep features extracted by VGG-19, an advanced model with documented excellent performance in image classification. Deep features are not limited to previously identified image attributes or even to those understandable by humans. This is an advantage because it leads to the possibility of discovering information in medical images that is not observable to human readers, which, in turn, raises the rational hope of adding diagnostic value beyond simple quantification of information already accessible in MR images. The abstract "features" represented in the weights of the deep CNN have a number of limitations. The meaning of individual features is not easy for humans to clearly understand. Also, it remains uncertain how reproducible the deep feature output is from the current CNN operating with available dataset sizes and processing power. As such, radiomics features were integrated to the pipeline, which were well-defined and selected a priori to comprise image attributes known or rationally expected by human experts to contain predictive information. At least in the near-term, we believe that the combination of radiomics and deep features may be rationally expected to provide greater value than deep feature-based analyses alone.  Besides deep features and radiomics features, we also investigated the effect of demographic features by adding sex and age information into the feature set of the TCGA dataset. This investigation did not change the performance substantially, likely because the areas under the curve of sex and age were both so low (area under the curve of sex = 0.53; area under the curve of age = 0.62) that these features were eliminated during feature reduction. We chose to include a deliberately heterogeneous cohort treated with a wide range of therapies to make the training set and model more readily generalizable to the heterogeneous patient mixture encountered in the clinic. Most of the patients received a standard treatment protocol of chemoradiation with temozolomide. Many received bevacizumab. Additional patients received a range of investigational therapies, including chemotherapies, targeted molecular therapies, and a few immunotherapies. Tumor genomics such as Isocitrate dehydrogenase 1 (IDH-1) mutation, MGMT methylation, and epidermal growth factor receptor application are powerful tumor markers that merit inclusion in future models. Because, unfortunately, tumor genomic data were unavailable for many patients in our cohort, we did not attempt to include this in our modeling. Twelve of 50 patients in our institutional dataset were IDH-1 wild-type, but IDH-1 status was not assessed in the remainder. Similarly, 11 of 50 were epidermal growth factor receptor-amplified, 20 of 50 were epidermal growth factor receptor not highly amplified, and 19 were unknown.
Like any statistical correlation approach, the CNN depends on the comprehensiveness of the training dataset, which is a statistically robust subset of the whole dataset. In general, the CNN model performance improves with increasing size of the dataset. Unfortunately, in the field of medical imaging and particularly in HGG brain MR imaging, annotating a large number of medical images remains challenging and time-consuming. While 50-150 fully annotated GBM datasets with known patient data represent a large amount of data to acquire, it is a small dataset from the perspective of the CNN training set and may result in overfitting. Under this circumstance, the transfer learning technique is introduced to apply CNN models from one field to another. On the other hand, transfer learning can reduce the number of training sets used by several orders of magnitude, but it introduces certain bias into the resulting CNN in the form of the pretrained parameters. It is likely that the success of the transferred model depends, in large part, on which CNN is selected, but to date, there is no systematic way to know which of the large number of available pretrained neural networks is best suited to a given task, which layers are optimal for choosing the features, or which pretrained dataset should be used. We chose VGG-19 to generate deep features for this study because it was trained on a large image data base and validated to provide excellent accuracy in the many applications; and it has been successfully applied to many medically related problems. In future work, it may be important to compare our results with the performance of other models such as VGG-16 (https://neurohive. io/en/popular-networks/vgg16/) 18 or ResNet-50 (https://neurohive. io/en/popular-networks/resnet/). 25 Transfer learning is not a fail-proof solution to the overfitting problem because overfitting may also occur in the retraining of the transferred CNN. Our initial output data matrix was very unbalanced and sparse, with only several dozen patients but 8540 output features. Our applied feature-reduction method eliminated roughly 99% of features and reduced the probability of divergence and computational cost. Reducing the number of features is essential to avoid overfitting in such small datasets, but feature reduction remains complicated and somewhat controversial. 26,27 A number of methods exist. We chose not to use principal component analysis, a well-established method, because it complicates further the interpretability problems intrinsic to radiomics and deep learning models by generating new features in a new coordinate system, yet 1 more step removed from the original image.
In our experiments, we found that there were more downstream features left in DATA 1 than in DATA 2 and DATA 3. One reason could be the data homogeneity, ie, some predictors in the homogeneous dataset may lead to an overfitting problem because they may not be very predictive in a heterogeneous dataset. In fact, in our data, DATA 1 was the most homogeneous one of all 3 datasets because it included only SE MR images and the data were acquired from a single institution only. In addition, the hazard ratio in DATA 1 was found to be higher than in the second and third datasets, suggesting that data homogeneity may affect the performance of a machine learning model in some way.
On a large-scale and high-throughput data mining field, especially medical imaging analysis, machine learning-based statistical analysis techniques are widely used. For example, de Carvalho Filho et al 28 used the support vector machine algorithm for lung nodule classification. Lao et al 12 used the lasso Cox regression model to find a useful subset of reduced features, then constructed radiomics signatures to predict the OS of patients with GBM. Yin et al 27 compared 3 feature-selection methods and 3 classification methods for differentiation of sacral chordoma and sacral giant-cell tumor. Yu et al 29 used an Elastic Net-Cox hazard ratio model to predict survival of patients with squamous cell carcinoma and stage I adenocarcinoma. In our study, we did not model OS as a binomial classification problem because binary classifiers do not consider censored information and are highly dependent on the manually determined survival threshold. Instead we chose Cox regression, a time-to-event model that is better suited to handle censored data and model a continuous range of the survival probabilities. Because the traditional Cox model does not work well on high-dimensional data in which a number of covariates are much larger than the sample size, we selected the Elastic Net-Cox hazard ratio model for statistical analysis. This penalized Cox model has been proved able to handle high-dimensional data and obtain reliable survival prediction. 29 It is possible that other statistical approaches may improve our ability to detect correlations in the data. This is an area that will benefit from future study.
Because postcontrast T1-weighted images are the critical mainstay of routine clinical brain tumor imaging, 30 we chose these as the initial image type around which to build our pipeline. Addition of more sequences (precontrast T1, T2, FLAIR T2, SWI, DWI, PWI) with independent image contrast can be expected to improve model performance and is planned for future work.
Other factors that may affect the final model performance include interoperator variation in ROI selection. To investigate, we had our 2 radiologists label our largest TCGA dataset independently. The intraclass correlation coefficient on the downstream features yielded a median of 0.76. Most individual-feature intraclass correlation coefficients were .0.7. This high reproducibility implies that inter-operator variation in ROI selection had little effect on our dataset. We also performed a supplemental analysis to test the effect of normal tissue in the training set input. In the patients in the TCGA, we removed normal voxels from inside the bounding box and retrained the deep learning model. The result was similar to the performance with the normal tissue voxels retained in the bounding box, with a P value of .016. It remains unknown whether this result would be true using larger bounding boxes comprising more normal tissue. These analyses suggest that our model performance was robust to at least some variation in preprocessing. More investigation is needed with more heterogeneous and independent datasets to determine whether this is generally true.

CONCLUSIONS
We report successful production and initial validation of a deep transfer learning model combining radiomics and deep features to predict OS of patients with GBM from postcontrast T1weighed brain MR images. Further optimization of these results will require systematic attention to each of the critical components of the pipeline, including choice and modification of the model, optimization of feature-reduction method, selection of statistical correlation strategy, incorporation of additional MR imaging data types, and inclusion of tumor genomics data. Necessary future steps before clinical translation will need to include interpretability analysis to determine the clinical significance of the surviving features in the model and testing in a dichotomized design that allows assessment or prediction performance on an individual patient basis.