Abstract
Background and purpose Delayed cerebral ischemia (DCI) is a severe complication in patients with aneurysmal subarachnoid hemorrhage. Several associated predictors have been previously identified. However, their predictive value is generally low. We hypothesize that Machine Learning (ML) algorithms for the prediction of DCI using a combination of clinical and image data lead to higher predictive accuracy than previously applied logistic regressions.
Materials and methods Clinical and baseline CT image data from 317 patients with aneurysmal subarachnoid hemorrhage were included. Three types of analysis were performed to predict DCI. First, the prognostic value of known predictors was assessed with logistic regression models. Second, ML models were created using all clinical variables. Third, image features were extracted from the CT images using an auto-encoder and combined with clinical data to create ML models. Accuracy was evaluated based on the area under the curve (AUC), sensitivity and specificity with 95% CI.
Results The best AUC of the logistic regression models for known predictors was 0.63 (95% CI 0.62 to 0.63). For the ML algorithms with clinical data there was a small but statistically significant improvement in the AUC to 0.68 (95% CI 0.65 to 0.69). Notably, aneurysm width and height were included in many of the ML models. The AUC was highest for ML models that also included image features: 0.74 (95% CI 0.72 to 0.75).
Conclusion ML algorithms significantly improve the prediction of DCI in patients with aneurysmal subarachnoid hemorrhage, particularly when image features are also included. Our experiments suggest that aneurysm characteristics are also associated with the development of DCI.
- aneurysm
- subarachnoid
- hemorrhage
Introduction
Delayed cerebral ischemia (DCI) is one of the most severe complications in patients with aneurysmal subarachnoid hemorrhage (aSAH) and is related to worsening of functional outcome. DCI occurs in 20–30% of patients who suffered from aSAH.1 The selection of patients with a high risk of developing DCI may improve patient outcome as well as reduce the costs related to futile intensive care monitoring for DCI.2
Several studies have identified risk factors associated with the development of DCI, including World Federation of Neurosurgical Societies (WFNS) grade, age, aneurysm treatment (clipping or coiling), intraparenchymal and intraventricular hemorrhage, total blood volume (TBV),3 hypertension, diabetes mellitus, history of smoking, alcohol use, hyperglycemia and Hunt and Hess grade on admission.4
Most studies searching for DCI predictors have relied on univariable and multivariable logistic regression analysis. The accuracy of these regression models is generally low (area under the curve (AUC) of 0.63 in one study5 and 0.65 in another6) and their approaches often do not correct for over-optimistic results by applying bootstrapping or cross-validation strategies.7
The volume and availability of (digital) clinical and image data have enormously increased over the past years, opening up new possibilities for predictive modelling. The integration and interpretation of data from multiple sources of information can be quite challenging.8 Machine Learning (ML) is a field of computer science whose algorithms can learn patterns from large datasets with multiple variables. An advantage of ML algorithms is that, once the outcome label is defined, the algorithms can automatically optimize (learn) their parameters with minimal oversight.9 Unlike regression models, ML algorithms can handle large amounts of data and patient characteristics while taking all their interactions into account.9 Therefore, ML algorithms yield a potential predictive gain in accuracy over regression models.9 10
Recent works that applied ML algorithms to heterogeneous data (data from different sources, such as image and clinical characteristics) presented positive results for classifying Alzheimer’s disease and predicting patients at risk for aortic stenosis.11 12
We hypothesize that ML algorithms can increase the accuracy of DCI prediction compared with traditional logistic regression models. Moreover, since the TBV and blood location present in baseline CT scans have already been proven to be associated with DCI,3 4 6 we hypothesize that the addition of automatically extracted image features from baseline CT scans to clinical data improves the accuracy of DCI prediction. To test these hypotheses, we explored three approaches for predicting the development of DCI in patients with aSAH: (1) using known predictors from the literature and logistic regression; (2) using ML algorithms with all available variables; and (3) combining imaging and clinical data.
Materials and methods
Population
Patients were included from a prospectively collected cohort consisting of consecutive aSAH patients admitted to the Academic Medical Center, Amsterdam, The Netherlands between December 2011 and December 2015. Inclusion criteria were: (1) aSAH with subarachnoid blood visible on admission non-contrast CT or confirmed by xanthochromic cerebrospinal fluid after lumbar puncture, and (2) causative aneurysm proven on angiographic imaging. Patients who were included in the ongoing Ultra-Early Tranexamic Acid After Subarachnoid Hemorrhage (ULTRA) trial were excluded from the analysis because data from ongoing trials should not be used prematurely. Furthermore, we excluded patients for whom the admission CT scan presented severe artifacts. As a result, a total of 317 patients were used for analysis. Of the 317 included patients, 97 (30%) developed DCI. DCI was strictly defined as the occurrence of new focal neurological impairment or a decrease of two points or more on the Glasgow Coma Scale (GCS) (with or without new hypodensity on CT) that could not be attributed to other causes, according to Vergouwen et al.13 All patients received nimodipine orally (6×60 mg daily) as prophylaxis for DCI. The diagnosis was assessed by the treating neurosurgeon and patients were treated with hypertension induction. The medical ethics committee of the Academic Medical Center, Amsterdam, The Netherlands waived ethics approval for this retrospective analysis of pseudonymised patient data. The database has been pseudonymised and patients have given consent for the use of data for research.
Because of the sensitive nature of the data, it is available on request to the corresponding author. All code used is publicly available on the authors' GitHub page.
Machine learning algorithms
We selected the following ML algorithms: Logistic Regression (LR), Support Vector Machine (SVM),14 Random Forest Classifier (RFC),15 Multi-layer Perceptron (MLP),16 Stacked Convolutional Denoising Auto-encoders,17 and Principal Component Analysis (PCA). These algorithms have shown state-of-the-art results in several studies on disease prediction, image segmentation and image feature representation.17 18 The parameters used for these algorithms are presented in online supplemental tables I–III. For the development of ML models, datasets are generally split in two: a training and a testing dataset. ML models are first trained using a training dataset to optimize the prediction. Subsequently, the accuracy of the ML algorithms is evaluated on the testing dataset. The separation of training and testing data adopted in cross-validation7 is important to assess the model performance and generalization to unseen data. In this study we used Monte-Carlo cross-validation with 100 random splits of the dataset (75% for training and 25% for testing) and fivefold cross-validation for optimizing the parameters of each model.
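The splitting scheme described above can be sketched in a few lines of standard-library Python. This is an illustration only, not the authors' code: the function names `monte_carlo_splits` and `k_folds` and the fixed seed are assumptions, and the splits here are not stratified by outcome because the text does not state whether stratification was used.

```python
import random

def monte_carlo_splits(n_samples, n_splits=100, test_fraction=0.25, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random shuffles."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    n_test = round(n_samples * test_fraction)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])

def k_folds(indices, k=5):
    """Partition the training indices into k validation folds for tuning."""
    return [indices[i::k] for i in range(k)]

# 317 patients, 100 random 75%/25% splits
splits = list(monte_carlo_splits(317))
train_idx, test_idx = splits[0]

# Inner fivefold partition of the training part, used only for tuning
folds = k_folds(train_idx)
```

Each of the 100 outer splits holds out a fresh random quarter of the patients for testing, while the inner folds touch only the training part, so parameter tuning never sees the test patients.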
Clinical data
A total of 48 variables were included in this study. The full list of available demographic and clinical variables is presented in online supplemental table IV. The radiological variables collected were the modified Fisher scale on admission and the number, location, height and width of the aneurysm, which were determined from CT angiography image data. Data on treatment (clipping, coiling or no treatment) were also collected.
The percentage of missing values per variable is presented in online supplemental table IV. Missing values in the dataset were imputed using the incremental attribute regression imputation with Random Forest (RF). This imputation technique has shown high accuracy rates in several datasets.19 After data imputation, data normalization was performed by subtracting the mean and scaling to unit variance. For the nominal data, dummies were created. It has been shown that data normalization increases convergence rates (time and number of iterations for training the models) and it is necessary for many ML algorithms.20
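The two preprocessing steps that follow imputation, standardization and dummy coding, can be illustrated as below. This is a minimal sketch using only the Python standard library; the helper names and the example values are hypothetical, and the Random Forest imputation step is deliberately left out.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and scale to unit variance (population SD)."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:  # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

def dummy_code(values):
    """One indicator column per category, in sorted category order."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical values, for illustration only
ages = [40, 50, 60]
z = standardize(ages)

treatments = ["coiling", "clipping", "none", "coiling"]
cats, dummies = dummy_code(treatments)
```

After standardization every continuous variable has mean 0 and variance 1, so variables measured on different scales (for example age in years and blood volume in millilitres) contribute comparably to distance-based or gradient-based learners.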
Image data
The available baseline non-contrast CT image data consists of 512×512×N voxels (where N is the number of slices) with an average voxel spacing of 0.45±0.05 mm and an average slice thickness of 4.9±0.6 mm. Some image-derived features that are well known for being associated with DCI are the TBV and the blood location.3 The manual extraction of features from medical images is a time-consuming task and these features might not be the only important ones available in the images.21
Potentially, each voxel can be considered a feature, so the number of features is too large to be efficiently used in ML algorithms. If the number of training samples is small compared with the number of features, the accuracy of ML algorithms can be strongly reduced; this problem is known as the ‘curse of dimensionality’.22 To avoid this problem and to account for variations in the image data (rotation and translation), in this work we applied image data downsampling and data augmentation following the approach adopted in a previous study.23
Therefore, since the number of image features (voxels) is very large and relevant unknown image features might still be present in the baseline CT scans, we opted for an unsupervised feature learning technique.24 Feature learning is a technique used to automatically extract useful information from image data when building ML models.21 The Stacked Denoising Convolutional Auto-encoder (SDCAE)17 is an unsupervised feature learning technique designed to automatically learn the most relevant features of an image. The parameters used for the auto-encoder are presented in online supplemental table III.
Prediction models
In our experiments we used the implementations of LR, SVM, RFC and MLP algorithms available in the Scikit-learn toolkit.20 The parameters used for optimization are presented in online supplemental tables I and II. The Microsoft Cognitive Toolkit (CNTK)25 was used for the auto-encoder algorithm. In this study we explored three approaches described below.
Prior knowledge variables with logistic regression
We built two models using multivariable logistic regression and clinical variables for which the association with DCI has previously been established in the literature.3 4 The dataset is randomly split into training (75%) and testing (25%) sets to prevent overoptimistic results. Model 1 included the following variables: WFNS, age, aneurysm treatment (clipping or coiling), intraparenchymal and intraventricular hemorrhage and TBV.3 Model 2 included the following variables: hypertension, diabetes mellitus, history of smoking, alcohol use, hyperglycemia and Hunt and Hess grade on admission.4
Clinical variables with ML
We built four predictive models using only clinical variables and ML algorithms (SVM, RFC, LR and MLP) and determined the most important variables. Figure 1 provides an overview of the workflow. First, the dataset is randomly split into training (75%) and testing (25%) sets to prevent overoptimistic results and overfitting. Subsequently, the training set is randomly split into training and validation data using fivefold cross-validation for feature selection and parameter optimization. RF was used to assess feature importance since it is easily interpretable.20 Based on the RF feature importance, variables were recursively eliminated. The variables left after each elimination were used to optimize the models. Finally, the ML models were applied to the testing set and their accuracy was measured. Steps c–e (figure 1) were repeated until only one feature was left. Steps a–e were repeated 100 times using Monte-Carlo cross-validation.7 The averages and 95% CI of the accuracy measures were computed for the 100 cross-validation iterations (figure 1, step f).
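The recursive elimination loop in this workflow can be sketched as follows. The sketch is schematic: `importance_fn` and `score_fn` are placeholders for the RF feature importance and the cross-validated AUC, and the toy functions and numbers below are invented stand-ins used only to make the example run.

```python
def recursive_elimination(features, importance_fn, score_fn):
    """Drop the least important feature each round and score every subset.

    importance_fn(subset) -> {feature: importance}
    score_fn(subset)      -> float (higher is better)
    """
    remaining = list(features)
    history = []
    while remaining:
        history.append((list(remaining), score_fn(remaining)))
        if len(remaining) == 1:  # stop once a single feature is left
            break
        importances = importance_fn(remaining)
        remaining.remove(min(remaining, key=importances.__getitem__))
    best = max(history, key=lambda item: item[1])
    return best, history

# Hypothetical stand-ins for RF importance and validation AUC
TOY_IMPORTANCE = {"TBV": 0.40, "age": 0.25, "GCS": 0.20,
                  "noise_a": 0.10, "noise_b": 0.05}

def toy_importance(subset):
    return {name: TOY_IMPORTANCE[name] for name in subset}

def toy_score(subset):
    # Informative variables add to the score; noise variables subtract a little.
    return sum(-0.02 if name.startswith("noise") else TOY_IMPORTANCE[name]
               for name in subset)

best, history = recursive_elimination(list(TOY_IMPORTANCE), toy_importance, toy_score)
```

In this toy run the loop discards the two noise variables first, and the subset that scores best is the one containing only the three informative variables. In a real analysis both functions would be evaluated on the training data alone, so the test set plays no part in choosing the features.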
Image features and clinical data with ML
We built four models using ML algorithms and a combination of the best clinical variables (determined with RFC) and features automatically extracted from CT images using the auto-encoder (for implementation details see online supplement section 1). The number of features generated by the auto-encoder was much higher than the number of features in the clinical dataset (2048 vs 48). Therefore, to preserve the value of the clinical features, the dimension of image features was reduced using PCA, which transforms the data into a smaller set of features based on the variance, as proposed by Zhang et al.18 The number of PCA components was optimized based on the AUC. The image features obtained with PCA were added to the clinical features (most relevant ones obtained from the ML approach) and the dataset containing the combination of features was used with the workflow presented in figure 1.
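To make the dimensionality reduction step concrete, the sketch below projects a small matrix of image features onto its leading principal components and appends them to the clinical features. It is a didactic, standard-library implementation based on power iteration with deflation, not the library routine used in the study; the function names and the tiny example matrices are assumptions.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract the column means so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def top_component(rows, iters=200, seed=0):
    """Leading principal axis by power iteration: v <- X^T (X v), normalised."""
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in rows[0]]
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(len(v))]
        norm = dot(w, w) ** 0.5
        if norm == 0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    return v

def pca_reduce(rows, k):
    """Return the scores of each sample on the first k principal components."""
    residual = center(rows)
    scores = [[] for _ in rows]
    for _ in range(k):
        v = top_component(residual)
        proj = [dot(r, v) for r in residual]
        for s, p in zip(scores, proj):
            s.append(p)
        # Deflation: remove the variance already captured by this component
        residual = [[a - p * b for a, b in zip(r, v)]
                    for r, p in zip(residual, proj)]
    return scores

# Hypothetical data: 4 patients, 3 image features that vary along one direction
image_features = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
clinical = [[0.1, 1], [0.4, 0], [0.2, 1], [0.9, 0]]

components = pca_reduce(image_features, k=2)
combined = [c + s for c, s in zip(clinical, components)]
```

Because the toy image features all lie on one line, the first component captures all of the variance and the second is numerically zero. With real auto-encoder output the number of retained components is a tuning choice, which the text above reports was optimized on the AUC.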
Model predictive performance assessment
To evaluate the performance of each approach, we computed the average of the AUC of the receiver operating characteristic curve (ROC) and the sensitivity and specificity with 95% CI. Differences in accuracy were considered significant if the CI did not overlap, and if the 95% CI of the difference between AUC distributions did not contain the null value. The specificity and sensitivity were calculated based on the upper left corner of the ROC curve.
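The accuracy measures can be illustrated with a short standard-library sketch. It computes the AUC as the probability that a patient with DCI receives a higher score than a patient without DCI, and it reads "based on the upper left corner" as choosing the ROC point nearest to that corner; that reading, the function names and the example labels and scores are assumptions.

```python
def roc_points(labels, scores):
    """(FPR, TPR, threshold) for every distinct score used as a cut-off."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        points.append((fp / neg, tp / pos, thr))
    return points

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def closest_to_corner(labels, scores):
    """Sensitivity, specificity and threshold at the ROC point nearest (FPR=0, TPR=1)."""
    fpr, tpr, thr = min(roc_points(labels, scores),
                        key=lambda p: p[0] ** 2 + (1 - p[1]) ** 2)
    return tpr, 1 - fpr, thr

# Hypothetical predictions: 1 = DCI, 0 = no DCI
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

model_auc = auc(labels, scores)
sensitivity, specificity, threshold = closest_to_corner(labels, scores)
```

The AUC summarizes ranking quality over all thresholds, whereas sensitivity and specificity describe a single operating point, which is why the two can move differently between models.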
Model interpretation
ML models are often seen as black boxes. However, for clinical decision making it is of utmost importance to understand what variables are considered important for the model and, at a deeper level, what variable influenced each individual prediction. To increase the interpretability of our results we explored the best performing model (RF) by computing the average feature importance and ranking them (from most important to least important) to provide more insight into the impact of those features in the models.
For this purpose we applied a model explanation technique named Local Interpretable Model-agnostic Explanations (LIME).26 LIME automatically creates an interpretable model locally around the prediction boundary of a given model (in our case the ML methods SVM, RF, LR, NN), providing an interpretation of each individual prediction and how the value of each variable affects it. To stress the importance of image features, we compared the models with and without image features and assessed the impact on DCI prediction using LIME. More details about LIME can be found in the online supplement section 2.
Results
The AUC values for the models built with variables manually chosen based on the prior knowledge approach are shown in table 1. The combination of TBV, age, WFNS, treatment (clipping, coiling or no treatment), presence of intraparenchymal and intraventricular hemorrhage (Model 1) yielded the best average AUC of 0.63 (95% CI 0.62 to 0.63).
The most relevant clinical features (with the best AUC) selected by the ML models were, in order of relevance for the model: TBV, presence of intraparenchymal blood, time from ictus to CT, age, GCS, aneurysm height, presence of subdural blood, aneurysm width, treatment (clipping, coiling or no treatment), and aneurysm location. The AUC measures for the ML methods are shown in table 1 and the ROC curves are displayed in figure 2 (top). The RFC had the highest accuracy with an AUC of 0.68 (95% CI 0.65 to 0.69).
The AUCs for the image features and clinical data with the ML approach are shown in table 1 and the ROC curves in figure 2 (bottom). Again, the RFC had the highest accuracy with an AUC of 0.74 (95% CI 0.72 to 0.75), which was the highest accuracy obtained in our experiments. The most relevant features for this approach were, in order of relevance for the model: two automatically extracted image features, TBV, presence of intraparenchymal blood, time from ictus to CT, two other automatically extracted image features, age, aneurysm height, presence of subdural blood, aneurysm width, and GCS. The 95% CI of the difference between the AUC distributions of the clinical variables with the ML approach and the image features and clinical data with the ML approach was 0.04 to 0.07. We can therefore conclude that there is a statistically significant difference between the two distributions, suggesting that the image features extracted using an auto-encoder improved DCI prediction.
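The significance check on the difference between two AUC distributions can be sketched as follows. The text does not state how the interval was computed, so the normal-approximation interval for the mean paired difference used here is an assumption, as are the function names and the five example AUC values per model (the study used 100 iterations).

```python
from statistics import mean, stdev

def mean_ci(values, z=1.96):
    """Normal-approximation 95% CI for the mean of the values."""
    m = mean(values)
    half_width = z * stdev(values) / len(values) ** 0.5
    return m - half_width, m + half_width

def difference_is_significant(auc_a, auc_b):
    """True when the CI of the paired differences (b - a) excludes zero."""
    lo, hi = mean_ci([b - a for a, b in zip(auc_a, auc_b)])
    return (lo > 0 or hi < 0), (lo, hi)

# Hypothetical per-iteration AUCs for two models evaluated on the same splits
auc_clinical = [0.66, 0.68, 0.70, 0.67, 0.69]
auc_combined = [0.72, 0.75, 0.74, 0.73, 0.76]

significant, (lo, hi) = difference_is_significant(auc_clinical, auc_combined)
```

Pairing the two models on the same random splits removes the split-to-split variation from the comparison, so the interval reflects only the difference between the models.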
Online supplement figure 1 presents the feature importance for the best performing model (RF) using only clinical variables and using the combination of clinical variables and auto-encoder image features.
Figure 3 was created using the model explanation technique LIME using clinical features (top) and the combination of clinical and image features (bottom) to explain the decision of the RF model for a specific DCI patient.
We can note that the model without images suggests a lower risk of DCI (0.36), even though some variables point to a higher risk. This occurs because most of the variables point to a lower risk.
After combining the clinical and image features, many still point to a lower risk of DCI, although the majority of image features point to a higher risk of DCI. The combined features increase the total risk of DCI for this patient. More examples can be found in online supplement figures 2.1 and 2.2.
Discussion
In this dataset, most ML methods showed higher accuracy in predicting DCI compared with logistic regression. The prediction accuracy of the models was improved when image features extracted automatically with the auto-encoder were combined into the model. The highest average accuracy was obtained using the RFC and the combination of clinical and image features. Using LIME, we have shown how each feature used in the model affects an individual prediction, providing insight into the ‘black-box’ ML models. This visualization provides insight into a model’s risk prediction. We have further provided a visualization of how the combination of image features improved the accuracy of DCI risk prediction (figure 3).
Our results suggest that TBV, blood location, age, GCS, and treatment are associated with the occurrence of DCI, which is in accordance with previous studies.3 4 6 These previous studies relied mostly on multivariable LR. Notably, the accuracy obtained by LR models was the lowest in our study. With the use of ML algorithms, we found variables that increased the predictive accuracy, which have not been associated with DCI before (time from ictus to CT, presence of subdural blood, GCS, treatment, and aneurysm height, width and location). However, a causal relationship between these features and DCI was not further explored in this study. In our analysis, some of the parameters with value in the prediction of DCI were not identified as risk factors in previous studies. For example, in contrast to previous studies,3 4 aneurysm width and height were included in our ML models. ML models use different mechanisms than commonly used linear regression techniques, which may put these parameters forward in the predictions. However, these parameters were not the most relevant ones in our ML models, as shown by their relatively low importance compared with other parameters (see online supplement figure 1).
In the paper by de Toledo et al,27 the outcome of SAH patients using ML was the main topic. Their family of methods was restricted to decision trees, while in our work we included multiple families of methods such as Neural Networks, Support Vector Machines, Logistic Regression and Ensemble methods. Since the learning processes of these families of methods differ from each other, a higher range of feature relationships could be explored in our set-up. The major contribution of our study comes from the combination of clinical and image data. With an automatic unsupervised feature extraction approach, image features were extracted from baseline CT scans and their combination with clinical features showed significant improvement in DCI prediction. Since our approach is unsupervised, the images do not require any sort of annotation and are less prone to bias from labels and overfitting. A downside of our approach is that the multiple downsampling steps hamper the interpretation of these image features.
There was a significant increase in sensitivity when comparing prior knowledge model 1 to RFC using all clinical variables. This shows that the RFC was better at identifying patients at risk of DCI than the other models, although the models were not statistically significantly better at identifying patients not at risk. The combination of clinical data with image features increased the specificity of the models, making them more precise at identifying patients not at risk of developing DCI, which for clinical practice may be more useful to reduce the costs related to futile intensive care monitoring for DCI.2
A limitation of common regression models is that the number of features that can be included is limited. Based on the Rule of Ten, one should have at least 10 events per feature included in the LR model. It has been shown, however, that this rule is not strict and that models with fewer events per feature (5–9) can still be used with good predictive results.28 In our dataset we had fewer than three events per feature, which makes the LR model prone to overfitting. The NCCT images contained a large number of voxels. Using the whole image for training the auto-encoder increases the risk of overfitting due to the large number of input image features and parameters to optimize. We reduced this risk by downscaling and augmenting the scans and applying cross-validation. The ML algorithms used in this study are able to handle such high dimensional feature spaces with less risk of overfitting, provided that proper approaches, such as data augmentation, cross-validation and regularization, are taken into account.7 8 Even though Monte-Carlo cross-validation was used with 100 iterations, it does not replace the need for validation on an external dataset. Moreover, the loosely formulated definitions used for DCI make external validation even harder, since two datasets with the same definition are needed. In our study, however, DCI was strictly defined according to the definition of Vergouwen et al13 and consistently adopted throughout the dataset.
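The events-per-feature arithmetic behind this argument is simple to reproduce from the figures reported in the methods (97 patients with DCI, 48 variables). The function name below is an assumption introduced for illustration.

```python
def events_per_feature(n_events, n_features):
    """Number of outcome events available per candidate feature."""
    return n_events / n_features

n_dci = 97        # patients who developed DCI (from the Population section)
n_features = 48   # variables included in the study (from the Clinical data section)

epf = events_per_feature(n_dci, n_features)

# Events that the Rule of Ten would require for 48 features
required_events_rule_of_ten = 10 * n_features
```

With roughly two events per feature, a model using all 48 variables falls well short of both the Rule of Ten and the relaxed range of 5–9 events per feature mentioned above.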
Determining the best parameter configurations for the ML models can be computationally expensive and time consuming. Moreover, selecting the range of values used for fine-tuning is difficult. In this study, the selection of the range of values for the parameters was based on previous studies and the Scikit-learn toolkit implementation suggestions.20 29 Nevertheless, it may be worthwhile to study models with different (number of) parameters.
The interpretation of these 3D-image features is challenging, as discussed in other studies.30 This will be the subject of future work which will investigate other feature extraction techniques for the image data that are easier to visualize, to provide insight into the interpretation of the image features.
Conclusion
Our findings indicate that ML algorithms improved prediction of DCI in patients with aSAH in the population studied. We show that features that have not been considered before may increase the accuracy of DCI prediction. Feature visualization using LIME provides a better understanding of the models and might improve clinical decision-making. Imaging features extracted automatically using ML techniques further improve the accuracy in predicting DCI.
Footnotes
Contributors LAR: Machine learning, deep learning, programming, data pre-processing, statistical analysis, study design, literature review and manuscript writing and review. WEvdS: Study design, data pre-processing, literature and manuscript review. RSB: Machine learning, deep learning, programming and manuscript and literature review. CBLMM, RvdB, IJAZ: Study design, data collection, and manuscript review. DV: Data management, study design, manuscript review. WPV: Study design and manuscript review. AHZ: Machine learning, deep learning, statistical analysis, study design and supervision and manuscript review. GJS: Study design, supervision and manuscript review. SDO: Machine learning, deep learning, data analysis, study design and supervision, manuscript writing and review. HAM: Machine learning, deep learning, data analysis, study design and supervision, manuscript writing and review.
Funding This work was supported by ITEA3 grant number 14003 Medolution.
Competing interests None declared.
Patient consent Not required.
Ethics approval The medical ethics committee of Academic Medical Center.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Because of the sensitive nature of the data, it is available upon request to the corresponding author. All code used is publicly available on the authors' GitHub page (https://github.com/L-Ramos/DCI_Prediction.git).