Prediction of Clinical Outcome in Patients with Large-Vessel Acute Ischemic Stroke: Performance of Machine Learning versus SPAN-100

BACKGROUND AND PURPOSE: Traditional statistical models and pretreatment scoring systems have been used to predict the outcome for patients with acute ischemic stroke (AIS). Our aim was to select the most relevant features in terms of outcome prediction on the basis of machine learning algorithms for patients with acute ischemic stroke and to compare the performance between multiple models and the Stroke Prognostication Using Age and National Institutes of Health Stroke Scale (SPAN-100) index model.
MATERIALS AND METHODS: A retrospective multicenter cohort of 1431 patients with acute ischemic stroke was subdivided into recanalized and nonrecanalized patients. Extreme Gradient Boosting machine learning models were built to predict the mRS score at 90 days using clinical, imaging, combined, and best-performing features. Feature selection was performed using the relative weight and frequency of occurrence in the models. The model with the best performance was compared with the SPAN-100 index model using area under the receiver operating curve analysis.
RESULTS: In all 3 groups of patients, the baseline NIHSS was the most significant predictor of outcome among all the parameters, with relative weights of 0.36-0.69; ischemic core volume on CTP ranked as the most important imaging biomarker, with relative weights of 0.29-0.47. The model with the best-performing features had a better performance than the other machine learning models. The area under the curve of the model with the best-performing features was higher than that of the SPAN-100 model and reached statistical significance for the total (P < .05) and the nonrecanalized patients (P < .001).
CONCLUSIONS: Machine learning-based feature selection can identify parameters with higher performance in outcome prediction. Machine learning models with the best-performing features, especially advanced CTP data, had superior performance of the recovery

Ischemic stroke still ranks as the fifth leading cause of death and the second leading cause of disability in the United States. 1 Although recent reports show a trend toward a decreasing incidence of ischemic stroke for individuals 65 years of age or older, the incidence remains stable for individuals 18-65 years of age. 1 Revascularization therapies such as endovascular thrombectomy have extended the treatment window up to 16-24 hours after symptom onset, as demonstrated in selected patients in the Endovascular Therapy Following Imaging Evaluation for Ischemic Stroke (DEFUSE 3) 2 and Clinical Mismatch in the Triage of Wake Up and Late Presenting Strokes Undergoing Neurointervention with Trevo (DAWN) trials. 3 However, up to 55% of patients in the endovascular therapy group and 83% in the medical therapy group remained functionally dependent, with 90-day mRS scores of >2. 2 Therefore, physicians taking care of patients with acute ischemic stroke (AIS) not only need to predict the individual benefit of endovascular treatment but should also be able to estimate prognosis in both treated and untreated patients, to select patients for acute treatment, to inform all involved persons about the prognosis, and to plan for rehabilitation and long-term care. 4 Many publications have addressed the issue of predicting outcome in patients with acute large-vessel ischemic stroke. 
These include (but are not limited to) traditional logistic regression statistical models and pretreatment scoring systems such as the DRAGON score (Dense cerebral artery sign/early infarct signs on admission CT scan, prestroke modified Rankin Scale, Age, Glucose level at baseline, Onset-to-treatment time, and baseline National Institutes of Health Stroke Scale score), 5-7 the Stroke Prognostication Using Age and National Institutes of Health Stroke Scale (SPAN-100) index, 8,9 the Acute Stroke Registry and Analysis of Lausanne (ASTRAL) score, 7 the Pittsburgh Response to Endovascular Therapy (PRE) score, 10 the Totaled Health Risks in Vascular Events (THRIVE) score, 11 the Houston Intra-Arterial Therapy (HIAT) score, and the HIAT2 score. 12 The components considered in these predictive scoring systems were either clinical parameters only, such as age and the NIHSS, or non-contrast-enhanced CT (NECT) parameters such as ASPECTS. None of these models take into account advanced imaging parameters. In addition, these models were built on the hypothesis of a linear relationship between the parameters and the outcome, but some studies have highlighted a nonlinear correlation. 13,14 In comparison with traditional modeling methods, machine learning algorithms have much higher scalability, allowing large numbers of features and parameters to be incorporated into the models. Machine learning models have been trained not only for outcome prediction following intravenous thrombolysis 15 and intra-arterial therapy 16,17 after AIS but also for subtype classification, 18 hemorrhagic transformation, 19 and clot-characteristic identification. 20 All the above-mentioned models use clinical features as input; 2 studies also used baseline NECT 14,16 or MR imaging gradient recalled-echo sequence features, 20 and 1 study used MR perfusion. 19
The hypothesis of our study was that machine learning algorithms can help select the most powerful features in outcome prediction and that the model with features from advanced CTP data would have more robust prognostic ability in comparison with the other machine learning models and the SPAN-100 model. 9

Study Population
This retrospective study was conducted using a registry of 1782 patients with AIS from January 2008 to December 2018 at the Lausanne University Hospital (1310 patients) and Stanford University (472 patients). Approval was obtained from both institutional review boards, with a waiver of informed consent due to the retrospective nature of the study. Inclusion criteria were the following: 18 years of age or older; clinical examination and baseline CT imaging confirming acute ischemic infarction with the infarct area within the ICA/MCA territory; and availability of complete clinical parameters (onset-to-baseline time; baseline NIHSS; glucose, lipid, and blood pressure levels at admission; history of cardiac disease, statin use, and smoking status; stroke mechanism according to the Trial of Org 10172 in Acute Stroke Treatment [TOAST] classification; 21 and treatment and 90-day mRS) and imaging parameters (baseline NECT, CTP, and CTA; early [<72 hours from baseline] recanalization CTA). Patients with subacute, chronic, remote, and/or hemorrhagic infarctions were excluded from this study. The type of revascularization treatment (intravenous thrombolysis and endovascular treatment) was recorded if performed on the basis of the treating physician's decision.

Initial Clinical and Imaging Data
All the clinical and imaging parameters assessed in our study are summarized in Online Table 1. The 90-day mRS was dichotomized into favorable (mRS 0-2) and unfavorable outcome (mRS 3-6).
NECT, CTP, and CTA data were collected at admission as baseline studies. A blinded neuroradiologist evaluated the imaging features for all of the imaging studies. Features including the ASPECTS and hyperdense middle cerebral artery sign were extracted from the NECT. CTP datasets were processed on a workstation (Brain Perfusion, Version 6.0.0; Philips Healthcare). Automatic segmentation of ischemic core and penumbra volumes was performed on the basis of previously published thresholds. 22 The sidedness of cerebral ischemia was evaluated as well. The site of occlusion, Thrombolysis in Myocardial Infarction (TIMI) score, and collateral status were interpreted on the MIP CTA images. The TIMI 23 score was assessed as follows: 0, complete occlusion; 1, subocclusion with no distal branch filling; 2, subocclusion with incomplete or slow distal branch filling; and 3, completely open artery. A previously reported scoring system 24 was used for grading the collaterals into 4 levels in comparison with the normal side on baseline CTA. In addition, the clot burden score 25 (CBS), reflecting the extent of intracranial clot, and the degree of stenosis of the carotid bifurcation according to the NASCET criteria were assessed on baseline CTA images. The total cohort was divided into 2 subgroups depending on the recanalization status. A TIMI score of ≥2 on recanalization studies was considered recanalization, while a score of <2 was considered persistent arterial occlusion.
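The two dichotomizations described above (90-day mRS into favorable versus unfavorable outcome, and follow-up TIMI score into recanalized versus persistent occlusion) can be sketched as follows. This is a minimal illustration rather than the study's code; the function names and the sample records are hypothetical.

```python
def is_favorable_outcome(mrs_90d: int) -> bool:
    """Favorable outcome is a 90-day mRS of 0-2; unfavorable is 3-6."""
    if not 0 <= mrs_90d <= 6:
        raise ValueError("mRS must be between 0 and 6")
    return mrs_90d <= 2


def is_recanalized(timi_followup: int) -> bool:
    """Recanalization is a TIMI score >= 2 on the recanalization study."""
    if not 0 <= timi_followup <= 3:
        raise ValueError("TIMI must be between 0 and 3")
    return timi_followup >= 2


# Hypothetical records: (90-day mRS, TIMI score on recanalization CTA)
patients = [(1, 3), (4, 0), (2, 2), (6, 1)]
labels = [(is_favorable_outcome(m), is_recanalized(t)) for m, t in patients]
```

Each tuple in `labels` holds the outcome class and the subgroup membership for one patient, which is the information needed to split the cohort before model training.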

Model Construction
Our dataset had 2 distinctive characteristics: low dimensionality with <100 features and high nonlinearity for both qualitative and quantitative clinical/imaging features. We, therefore, decided to use Extreme Gradient Boosting (XGB), which is a specialized Gradient Boosting Machine (GBM), for our dataset. There are 2 core elements of the GBM. The first is a decision tree, which is the approach used to generate and approximate a nonlinear mapping between input features and final outcome. The second is boosting. Initially raised by the authors of Adaptive Boosting (AdaBoost), 26 the concept of boosting consists of first creating many weaker, simpler machine learning classifiers during training. Then, the final model is constructed by pooling the results from all weaker models and creating a fine-tuned, stronger classifier. XGB was developed on the basis of the GBM with superiority of performance in multiple data science contests, and its multicore algorithms allow multiple computations to run simultaneously in parallel, thus enabling the algorithm to scale to large datasets. 27 A previous study 28 using GBM demonstrated that machine learning methods with decision tree and boosting algorithms were capable of predicting patient outcomes after AIS. In that study, both XGB and GBM were used, and XGB was found to have a relatively better performance when the cohort was divided into subgroups. XGB was also shown to perform very well in another study when segmenting stroke infarct regions using both clinical and imaging features. 29 Sixteen clinical and 11 imaging parameters were introduced in our models (Online Table 1). The dataset was broken down into 5 groups with a relatively equal number of patients in each group for 5-fold cross-validation. Data of each patient were randomly enrolled into 1 of the 5 folds as a testing set. In the remaining 4 folds, the patient data were used as a training set. 
For each model's training and testing phase, 5 identical models were trained, each using 1 group as the test set, with the remaining 4 groups as a training set. Then the overall model performance was evaluated on the basis of results from all 5 models on 5 test sets. At first, 3 types of feature group combinations (clinical features, imaging features, and clinical plus imaging features) were used in the XGB models to predict the 90-day mRS of the entire cohort and the recanalized and nonrecanalized subgroups, respectively, creating 9 total combinations. To improve the performance of the machine learning models, we selected a subset of clinical and imaging features from all the predictors according to their contributions to the models. Features were selected on the basis of the following criteria: They had a relative weight of ≥0.2, or they had a relative weight of ≥0.1 and were among the 5 highest weights in the 9 above-mentioned models. The SPAN-100 XGB model was built by introducing age and the NIHSS at admission based on the definition.
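The fold assignment and the feature selection rule described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the study's implementation: the function names are hypothetical, the weights are made-up toy values rather than the values in Online Table 2, and the selection rule is read as "weight ≥0.2, or weight ≥0.1 and among the 5 highest weights, in at least 1 of the models", which is only one possible reading of the criteria.

```python
import random
from typing import Dict, List, Set


def assign_folds(n_patients: int, k: int = 5, seed: int = 0) -> List[int]:
    """Randomly assign each patient index to 1 of k folds of near-equal size."""
    fold_ids = [i % k for i in range(n_patients)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids


def select_features(model_weights: List[Dict[str, float]], top_n: int = 5) -> Set[str]:
    """Keep a feature if, in any model, its weight is >= 0.2,
    or its weight is >= 0.1 and it ranks within the top_n weights."""
    selected: Set[str] = set()
    for weights in model_weights:
        ranked = sorted(weights, key=lambda name: weights[name], reverse=True)
        top = set(ranked[:top_n])
        for name, w in weights.items():
            if w >= 0.2 or (w >= 0.1 and name in top):
                selected.add(name)
    return selected


fold_ids = assign_folds(1431)

# Toy relative weights (hypothetical, for illustration only)
clinical_weights = {"nihss": 0.45, "age": 0.15, "glucose": 0.11,
                    "sex": 0.03, "statin": 0.02, "smoking": 0.01}
imaging_weights = {"core_volume": 0.35, "penumbra_volume": 0.12,
                   "cbs": 0.10, "aspects": 0.05}
selected = select_features([clinical_weights, imaging_weights])
```

In a full pipeline, each fold would in turn serve as the test set while a model is trained on the other 4, and the per-model relative weights passed to `select_features` would come from the trained models.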

Statistical Analysis
Overall and by recanalization status, continuous characteristics were summarized as medians and interquartile ranges (IQRs), and categoric characteristics were summarized as counts and percentages. For each of the 3 cohorts, measures of prediction sensitivity, specificity, accuracy, and area under the receiver operating curve (AUC) were estimated for the machine learning models, as well as for the reference SPAN-100 index model, with SPAN-100 defined as the sum of patient age and the NIHSS score. 9 The machine learning model with the highest AUC was then compared with the SPAN-100 index model, with the DeLong test of pair-wise AUCs assessed using the pROC R package (https://www.rdocumentation.org/packages/pROC/versions/1.16.2). 30,31 Finally, confusion matrices for 90-day mRS prediction were constructed, by cohort, for all models on the basis of 5-fold cross-validation and visualized as heatmaps. All analyses were conducted in the R statistical computing framework, 32 Version 3.6 (http://www.r-project.org/), and statistical significance was assessed at the .05 α level.
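As an illustration of the reference model, the SPAN-100 index and its AUC can be computed as in the sketch below. The study itself used R and the pROC package; this Python version is a stand-in that assumes the rank-based (Mann-Whitney) definition of the AUC, does not implement the DeLong test, and uses hypothetical patient records.

```python
from typing import Sequence


def span100(age_years: float, nihss: int) -> float:
    """SPAN-100 index: patient age plus baseline NIHSS score."""
    return age_years + nihss


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a positive case scores higher than a
    negative case, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical patients: (age, NIHSS, unfavorable outcome: 1 = 90-day mRS 3-6)
cohort = [(82, 20, 1), (45, 6, 0), (70, 15, 1),
          (66, 4, 0), (80, 10, 0), (77, 9, 1)]
scores = [span100(age, nihss) for age, nihss, _ in cohort]
labels = [y for _, _, y in cohort]
auc = roc_auc(scores, labels)
```

The same `roc_auc` function could be applied to predicted probabilities from a machine learning model, so that both models are scored on identical patients before a paired comparison.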

RESULTS
There were 1431 patients included in this study, including 899 patients with recanalization and 532 patients with no recanalization (Online Fig 1). Online Table 1 illustrates the clinical and imaging characteristics for the total cohort and for the 2 subgroups.

Feature Selection with Machine Learning
Among the clinical and imaging parameters, the baseline NIHSS was the most important predictor of outcome for the whole cohort, as well as in the recanalized and nonrecanalized groups, with relative weights ranging from 0.36 to 0.69. Age and glucose levels at admission ranked as the next most important parameters in both the model using only clinical parameters and the model using all the clinical and imaging parameters (Online Table 2). The NIHSS and age are both components of the SPAN-100 scoring system.
Among the imaging parameters, ischemic core volume on CTP came in first place for all 3 groups of patients, with relative weights of 0.29-0.47 (Online Table 2). The CTA-CBS score, penumbra volume on CTP, and infarct side were the second strongest imaging predictors in the full cohort, the recanalized patients, and the nonrecanalized patients, respectively.
Clinical features such as baseline NIHSS score and age outweighed all the imaging features in importance in all 3 groups. Glucose level at admission appeared to be the third most important clinical biomarker in the total cohort and in recanalized patients, but not in nonrecanalized patients. In the nonrecanalized group, infarct and penumbra volume on CTP and time from onset to the baseline study came before the glucose level. Accordingly, the model with the best-performing features (total of 6 features) was built by including 3 clinical features (baseline NIHSS, age, glucose at admission) and 3 imaging features (ischemic core volume on CTP, penumbra volume on CTP, and CTA-CBS) (Online Table 3).

Model Performance in the Full Cohort and Recanalized and Nonrecanalized Cohorts
The sensitivity, specificity, accuracy, AUC, and heatmap of each model in the full cohort, as well as in the recanalized and the nonrecanalized subgroups are demonstrated in the Table, Figure,

Comparison between Machine Learning Models and the SPAN Scoring Model
Our best model, the model with the best-performing features, was compared with the SPAN-100 index (Figure and Online Fig 2). The AUCs for the machine learning models with the 6 best-performing features in the total cohort and recanalized and nonrecanalized groups were 0.80, 0.79, and 0.82, respectively. The AUCs for SPAN-100 were 0.78, 0.76, and 0.78, respectively. The optimal cutoff values of SPAN-100 were 85, 94, and 64 for the total, recanalized, and nonrecanalized cohorts, respectively. The AUCs of the XGB models with the 6 best-performing features were higher than those of SPAN-100 and reached statistical significance for the total cohort (P < .05) and the nonrecanalized patients (P < .001). In the recanalized group, the difference was not significant (P = .05).

DISCUSSION
Our study shows that machine learning models trained with best-performing clinical and imaging features, including advanced CTP parameters, can predict the outcome of patients with stroke more accurately than a conventional scoring system.
Bacchi et al 33 used deep learning models to predict the outcome in patients with AIS who underwent intravenous thrombolysis. The combined convolutional-plus-artificial neural network model based on both clinical and imaging data performed best in predicting patient outcomes. Heo et al 34 attempted to predict favorable outcome in a large group of 2043 patients with stroke using 3 machine learning models. By incorporating 38 demographic/clinical variables into their models, they found that the deep neural network model performed better than the other 2 models (random forest and logistic regression) and the ASTRAL score, while the performance of the deep neural network did not differ significantly from the ASTRAL score when trained on only the same 6 variables used for calculating the ASTRAL score. Nishi et al 17 built 9 models, including 5 previously reported scoring models, 1 logistic regression statistical model, and 3 machine learning models, to predict the clinical outcome in a cohort of 387 patients with stroke who underwent endovascular treatment. Machine learning models were superior to the other models. These above-mentioned models used ASPECTS as the only imaging variable to make the outcome prediction, and the large number of clinical variables in these models seemed impractical in an emergency scenario because a physician has to input many variables to get valuable prognostic information. Our models with the best-performing features were trained on more advanced imaging data such as CTP and CTA parameters, which provide improved accuracy compared with models using only parameters from the NECT. Furthermore, clinical features are important predictors, but when the cohort was broken down into recanalized and nonrecanalized groups, CTP imaging data were a more potent contributor, especially for the nonrecanalized patients.
The commonly used machine learning models in cerebrovascular diseases include random forest, support-vector machines, the neural network, decision trees, and logistic regression. In this study, we used a supervised XGB model, which is a decision tree-based machine learning method. Previous publications 28,29,35 highlighted the adaptability of XGB in dealing with redundant and nonlinear datasets. Compared with other machine learning models, XGB makes more powerful predictions with less chance of overfitting, especially in predictions of binary outcomes.
Our modeling filtered 6 parameters that best predicted the 90-day mRS score. Baseline NIHSS, age, and glucose on admission are clinical components of most of the conventional pretreatment prognostic systems developed for patients with stroke. 5-12 Previous studies have shown that baseline NIHSS and age are strongly associated with prognosis. 13,36,37 Hyperglycemia on admission is known to be an independent predictor of worse outcome because of its association with lactic acidosis and accelerated conversion of penumbra to infarct. 38,39 The relevant imaging features (CTP ischemic core volume, penumbra volume, and CTA-CBS) are also well-established stroke imaging biomarkers. 13 Collateral scores and the CBS have been reported to be equally important in outcome prediction. 40 In our study, collaterals played an important role in the recanalized group, but not in the nonrecanalized group.
It is beneficial to have a simple model because it makes clinical deployment faster and easier. A model requiring few features to yield a useful prediction is also less prone to overfitting. In addition, the 3 imaging features used in our model can be automatically extracted within a machine learning pipeline embedded in the daily workflow. It is practical for our best-performing model to provide a prompt outcome prediction.
The SPAN-100 index has been shown to have the ability to predict patient outcome and the risk of complications after endovascular therapy in several stroke cohorts. 9 It has been reported that patients positive on the basis of SPAN-100 had a 9-fold increase in the odds of poor outcome compared with those negative on the basis of SPAN-100, with an AUC of 0.74. The NIHSS ranked as the most highly relevant parameter among all of the clinical and imaging biomarkers in our study, while age was the second-best predictor in nonrecanalized patients and the third-best predictor in all and recanalized patients. When age and the NIHSS were combined with imaging features, the AUC for outcome prediction improved from 0.78, 0.76, and 0.78 to 0.80, 0.79, and 0.82 for all, recanalized, and nonrecanalized patients, respectively. The major limitation of SPAN-100 is its inapplicability to younger patients, who cannot reach a positive status because of the age component. Our model overcomes this limitation and is applicable to any patient with AIS older than 18 years of age.
There are several limitations to this study. First, this was a retrospective study, and our model will need to be validated prospectively. Second, we used only XGB models in this machine learning study, and other machine learning algorithms need to be considered in future study designs. Third, prognostic models other than the SPAN-100 may have superior long-term predictive values for handicap and mortality, which will be incorporated into our future study design. 41
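The age limitation of SPAN-100 noted in the discussion above follows from simple arithmetic, sketched here. Two values are assumptions not stated in this text: the positivity threshold of the published index (age plus NIHSS of at least 100) and the maximum NIHSS score of 42. The function name is hypothetical.

```python
NIHSS_MAX = 42          # assumed upper bound of the NIHSS scale
SPAN_POSITIVE_AT = 100  # assumed positivity threshold: age + NIHSS >= 100


def can_be_span100_positive(age_years: int) -> bool:
    """True if some valid NIHSS score could make this patient SPAN-100 positive."""
    return age_years + NIHSS_MAX >= SPAN_POSITIVE_AT


# Youngest age at which a positive status is reachable at all
min_age = SPAN_POSITIVE_AT - NIHSS_MAX
```

Under these assumptions, a patient younger than `min_age` is SPAN-100 negative regardless of stroke severity, which is why a model that uses age and the NIHSS as continuous inputs remains informative across the full adult age range.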

CONCLUSIONS
Machine learning-based feature selection can identify parameters with higher performance in long-term recovery-outcome prediction for patients with stroke at admission, while removing redundant and less predictive parameters. Moreover, the models with input from the best-performing features had better predictive value than the other models using clinical features only, imaging features only, both clinical and imaging features, and the SPAN-100 index. Finally, the prognostic ability of machine learning models with advanced imaging features such as CTP data can be improved, especially for nonrecanalized patients.

FIGURE. Receiver operating characteristics (ROCs) of XGB prediction models with clinical features, imaging features, both clinical and imaging features, best-performing features, and SPAN-100 for predicting a 90-day mRS score of >2. For all patients and recanalized and nonrecanalized patients, the AUCs of models with the best-performing features were higher than those in SPAN-100, and statistical significance was reached in the total and nonrecanalized groups. The AUCs for machine learning models with the 6 best-performing features in the total cohort and recanalized and nonrecanalized groups were 0.80, 0.79, and 0.82, respectively. The AUCs for SPAN-100 were 0.78, 0.76, and 0.78, respectively. The AUCs of XGB models with the best-performing features were higher than those in SPAN-100 and reached statistical significance for the total cohort (P < .05) and the nonrecanalized patients (P < .001). In the recanalized group, the difference was not significant (P = .05).