Artificial Intelligence and Acute Stroke Imaging

SUMMARY: Artificial intelligence technology is a rapidly expanding field with many applications in acute stroke imaging, including ischemic and hemorrhage subtypes. Early identification of acute stroke is critical for initiating prompt intervention to reduce morbidity and mortality. Artificial intelligence can help with various aspects of the stroke treatment paradigm, including infarct or hemorrhage detection, segmentation, classification, large vessel occlusion detection, Alberta Stroke Program Early CT Score grading, and prognostication. In particular, emerging artificial intelligence techniques such as convolutional neural networks show promise in performing these imaging-based tasks efficiently and accurately. The purpose of this review is twofold: first, to describe AI methods and available public and commercial platforms in stroke imaging, and second, to summarize the literature of current artificial intelligence–driven applications for acute stroke triage, surveillance, and prediction.

as basic logistic or linear regression could be effective. 7 However, if nonindependent, nonlinear relationships are expected between the various chosen features, a more complex model is required. Many such ML classifiers exist, and the most popular include random forest (RF), support vector machine (SVM), k-nearest neighbor clustering, and neural networks. 8 In general, these techniques are modeled by an underlying finite number of adjustable parameters. As a given set of features is passed through the model, these adjustable parameters act to convert the input descriptors into a predicted output class. Starting with randomly initialized parameters, a series of iterative updates is performed until an accurate mapping between numeric features and correct class is achieved, thus "training" the ML model. 9

Deep Learning
DL through neural networks is distinguished by the ability to independently learn abstract, high-order features from data without requiring feature selection. Artificial neural networks (ANNs) are a subtype of DL that mimic biologic neurons and are composed of an input, 1 or more hidden layers, and an output. Generally, in computer vision, convolutional neural networks (CNNs) are most successful and popular for image classification in medical imaging. CNNs represent all recent winning entries within the annual ImageNet Classification challenge, consisting of more than 1 million photographs in 1000 object categories, with a 3.6% classification error rate. 10,11 CNNs are distinguished from traditional ML approaches by automatically identifying patterns in complex imaging datasets, thus combining both feature selection and classification into 1 algorithm and removing the need for direct human interaction during the training process. Recent advances in CNNs have achieved human accuracy in identification of everyday objects such as cats and dogs, which had previously been impossible to model using rigid mathematical formulas. 12 CNNs have already shown promise in the detection of pulmonary nodules, 13 colon cancer, 14 and cerebral microbleeds. 15 Table 1 details performance metrics and limitations of AI methods.
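The training loop described for classical ML models (randomly initialized parameters refined by iterative updates) can be illustrated with a minimal sketch. The example below assumes logistic regression fitted by batch gradient descent; the function names, learning rate, and number of epochs are illustrative choices and do not come from any of the cited studies.

```python
import math
import random


def train_logistic(features, labels, lr=0.5, epochs=3000, seed=0):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    rng = random.Random(seed)
    n_samples = len(features)
    n_features = len(features[0])
    # Adjustable parameters start from small random values.
    weights = [rng.uniform(-0.01, 0.01) for _ in range(n_features)]
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        # One iterative update: step against the average gradient.
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Map a feature vector to a predicted class (0 or 1)."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0 else 0
```

On a small, linearly separable toy set, the fitted parameters reproduce the training labels; real imaging features would need scaling, validation data, and regularization.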

Accuracy
It is imperative that evaluation of ML models assess the accuracy of algorithms. Often, when testing large numbers of potential features, a few numeric descriptors meet the threshold for statistical significance between 2 target classes. However, P values are more often a reflection of the underlying power (sample size) of an experiment and may or may not relate to the clinical significance of the identified difference in features. As a result, it is critical not only to prove that a difference in features exists but also to assess the sensitivity, specificity, and accuracy of the feature(s) to predict a given end point. For classification, receiver operating characteristic curves can evaluate a model's performance, with the area under the curve (AUC) representing an aggregate measure for performance across all possible classification thresholds of a receiver operating characteristic curve. For segmentation analysis, Dice similarity coefficients and Pearson correlation coefficients are typically used. The Dice score measures the spatial overlap between the manually segmented and neural network-derived segmentations. Dice scores range from 0 (no overlap) to 1 (perfect overlap) and are commonly used to evaluate segmentation performance. 16

Limitations
ML and DL approaches have limitations that should be considered. First, the development of algorithms requires datasets that are large, organized, well-classified, and accurate. Interpretability is challenging, especially for DL algorithms. To mitigate this "black box" effect, explainable AI models incorporate tools such as saliency maps. Overfitting is a limitation for ML, when a model mistakenly learns the "noise" instead of the "signal" in a training dataset and thus does poorly with unseen data and is limited in generalizability. 17 More training data, regularization, and batch normalization are ways to mitigate overfitting.
Differences in image acquisition and data storage among institutions and difficulties in sharing data are obstacles to collecting enough data to obtain useful models. Standardization of imaging methods and open-source data collection can address this issue. Additionally, several proprietary ML software platforms have recently been introduced in the market that incorporate various aspects of the stroke pathway into their algorithms; however, comparison and validation of their performance are still necessary to ensure their robustness in routine use. 18 Despite limitations, ML remains a powerful tool for detection and management of stroke and hemorrhage.
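The evaluation metrics named in the Accuracy section (Dice score, sensitivity, specificity, and AUC) can be sketched in a few lines. The following is a minimal illustration on flat lists of binary labels, not the implementation used by any cited study. The handling of two empty masks is an assumed convention, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case).

```python
def dice_score(mask_a, mask_b):
    """Dice = 2|A and B| / (|A| + |B|) for two flat binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    total = sum(1 for a in mask_a if a) + sum(1 for b in mask_b if b)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity); needs at least one case per class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for illustration; established libraries compute the same quantity from sorted scores.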

Open-Source Datasets
Large datasets are required for ML algorithms to perform optimally. However, the availability of high-quality large-scale data remains a challenge given barriers in data sharing across institutions, the complexity of building imaging processing pipelines, and the time and cost of data annotation. To address these challenges, many imaging datasets are now publicly available for ML in stroke (Table 2). 19-24 These datasets are valuable because they are already anonymized, postprocessed, and annotated, and they can be used for testing and comparing algorithms in diagnosing ischemic stroke and hemorrhage. Many of these datasets were initiated as AI challenges such as the RSNA (Radiological Society of North America) Head CT Challenge for Hemorrhage, ASFNR (American Society of Functional Neuroradiology) Head CT Challenge for Ischemic and Hemorrhagic Stroke, and ISLES (Ischemic Stroke Lesion Segmentation) Challenge for Ischemic Stroke, supporting worldwide collaboration and new algorithm development.

Commercially Available Software Platforms
Increasingly, commercially available platforms providing automated information about various components of the acute stroke triage pathway are being integrated into routine clinical practice and clinical trials. 25-28 These tools offer fast and efficient analyses that seek to optimize the delivery of stroke care at spoke and hub hospitals and reduce turnaround times in the clinical workflow. 29 Table 3 lists some of the most popular commercially available stroke platforms and highlights their capabilities and AI-based algorithms. Figs 2-6 show the various web and mobile interfaces of these software platforms.

AI EVALUATION OF ISCHEMIC STROKE
Online Tables 1-4 provide an overview of the AI-based models of evaluating ischemic stroke discussed in this section, including detection and core infarct segmentation, identification of large-vessel occlusion (LVO), Alberta Stroke Program Early CT Score (ASPECTS) grading and additional factors in treatment selection, and prognostication.

Detection Methods
Rapid detection of ischemic infarction is important for triaging patients as potential candidates for thrombolysis because of the narrow window of therapeutic efficacy. Several studies have used ML algorithms for identification of ischemic infarction on CT or MR imaging. Tang et al 30 developed a computer-automated detection (CAD) scheme using a circular adaptive region of interest (CAROI) method on noncontrast head CT to detect subtle changes in attenuation in patients with ischemic stroke. They found that CAD improved detection of stroke for emergency physicians and radiology residents (AUC of 0.879 improved to 0.942 for emergency physicians and AUC of 0.965 improved to 0.990 for radiology residents) but did not significantly improve detection for experienced radiologists, who already had high stroke detection rates. 30 Another study showed that an ANN was able to distinguish acute stroke from stroke mimics within 4.5 hours of onset (which was verified by clinical and CT and MR imaging data), with a mean sensitivity of 80.0% and specificity of 86.2%. 31

Core Infarct Volume Segmentation
Establishing infarct volumes is important to triage patients for appropriate therapy. AI has been able to establish core infarct volumes on DWI through automatic lesion segmentation. For example, 1 study used an ensemble of 2 CNNs to segment DWI lesions of any size and remove false positives. 32 This combined CNN approach had a Dice score of 0.61 for small lesions (<37 pixel size) and 0.83 for large lesions and outperformed other CNNs. 32 Guerrero et al 33 developed a CNN (uResNet) that segmented and differentiated white matter hyperintensities (WMHs) caused by chronic small-vessel disease from cortically or subcortically based strokes. The uResNet CNN mean Dice scores were 0.7 for white matter hyperintensities and 0.4 for strokes. 33 The uResNet slightly outperformed the DeepMedic CNN in distinguishing white matter hyperintensities and strokes compared with expert analysis (R² values 0.951 and 0.791 for white matter hyperintensities and strokes, respectively, using uResNet and 0.942 and 0.688 using DeepMedic). 33 One limitation of the study was the reliance on FLAIR and T1 images that do not fully account for timing of stroke occurrence, and the value of uResNet in detection of acute strokes needs evaluation.
The first study to use a DL approach on CTA source images to detect acute middle cerebral artery ischemic stroke, a 3D CNN (DeepMedic), performed with a sensitivity of 0.93, specificity of 0.82, AUC of 0.93, and Dice score of 0.61. 34 Specificity was maximized when the contralateral cerebral hemisphere on CTA was included, and a marginal reduction in false positives was seen when NCCT was included in the algorithm. 34 Limitations of this CNN were its tendency to overestimate the volume of small infarcts and underestimate large infarcts compared with manual segmentation by expert radiologists and difficulty in distinguishing old versus new strokes. 34
The largest cohort using CTP for core infarct determination based on an ANN was able to accurately identify core infarct volume (AUC = 0.85; sensitivity = 0.9; specificity = 0.62) and was not significantly different from a model incorporating clinical data (AUC = 0.87; sensitivity = 0.91; specificity = 0.65). 35 Although the study minimized the time between CTP and MR imaging DWI reference standard acquisition, any time delay between the CTP and MR imaging may have limited accurate core infarct determination because of core expansion or reversal.
A model incorporating a U-net architecture CNN and RF classifier segmented acute ischemic stroke on NCCT with high concordance with manually segmented DWI core volumes (r = 0.76, P < .001) and manually segmented DWI ASPECTS scores (r = −0.65, P < .001). Furthermore, the agreement approached significance when dichotomizing infarcts using a volume threshold of 70 mL (McNemar test, P = .11). Discrepancies in volumes were attributed to nondetectable early ischemic findings, partial volume averaging, and stroke mimics on CT. 36
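The step from a segmentation mask to a dichotomized infarct volume can be sketched as follows. This is a minimal illustration that assumes the mask has already been reduced to a count of lesion voxels and that voxel spacing is known in millimeters. The function names are hypothetical, and whether a volume of exactly 70 mL falls above or below the cutoff is an assumption, because the text does not specify it.

```python
def lesion_volume_ml(voxel_count, spacing_mm):
    """Volume in mL from a lesion voxel count and (x, y, z) spacing in mm."""
    sx, sy, sz = spacing_mm
    voxel_mm3 = sx * sy * sz
    return voxel_count * voxel_mm3 / 1000.0  # 1 mL = 1000 cubic millimeters


def exceeds_threshold(volume_ml, threshold_ml=70.0):
    """Dichotomize a core volume; strict inequality is an assumed choice."""
    return volume_ml > threshold_ml
```

For example, 100,000 lesion voxels at 1 mm isotropic spacing correspond to 100 mL, which would fall above a 70 mL threshold.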

Large Vessel Occlusion
Diagnosing LVO is essential for identifying candidates who could potentially benefit from mechanical thrombectomy. On NCCT, an SVM algorithm detected the MCA dot sign in patients with acute stroke with high sensitivity (97.5%). 37 A neural network that incorporated various demographic, imaging, and clinical variables in predicting LVO outperformed or equaled most other prehospital prediction scales with an accuracy of 0.820. 38 A CNN-based commercial software, Viz-AI-Algorithm v3.04, detected proximal LVO with an accuracy of 86%, sensitivity of 90.1%, specificity of 82.5%, AUC of 86.3% (95% CI, 0.83-0.90; P ≤ .001), and intraclass correlation coefficient (ICC) of 84.1% (95% CI, 0.81-0.86; P ≤ .001), and Viz-AI-Algorithm v4.1.2 was able to detect LVO with high sensitivity and specificity (82% and 94%, respectively). 39,40 No study has yet shown whether AI methods can accurately identify other potentially treatable lesions such as M2, intracranial ICA, and posterior circulation occlusions.

ASPECTS Grading
ASPECTS is a widely used clinical grading system for assessing extent of early ischemic stroke on NCCT and has been used in randomized clinical trials to select thrombectomy candidates. 26,41,42 However, grading can be challenging, and interobserver agreement is variable. One commercial software platform with automated ASPECTS scoring (e-ASPECTS, Brainomix) performed as well as neuroradiologists when scoring ASPECTS on NCCT in patients with acute stroke (P < .003). 43 However, e-ASPECTS did not perform as well as neuroradiologists when scoring ASPECTS in patients with acute stroke with baseline non-normal-appearing CT (eg, leukoencephalopathy, old infarcts, or other parenchymal defects), demonstrating a correlation coefficient of 0.59 versus 0.71-0.80 for experts. 44 One study found that an automated ASPECTS detection algorithm on NCCT using texture feature extraction to train an RF classifier generated ASPECTS values that had high agreement with expert-generated DWI ASPECTS scores (ICC = 0.76 and κ = 0.6 when used for all 10 ASPECTS regions). 45 Another commercial software platform with automated ASPECTS scoring (Rapid ASPECTS, version 4.9; iSchemaView) showed higher agreement with a consensus ASPECTS grade that takes into account follow-up DWI (κ = 0.9) compared with neuroradiologists' moderate agreement (κ = 0.56-0.57), and the software performed well in the immediate time interval 1 hour after stroke onset (κ = 0.78) and even better 4 hours after stroke onset (κ = 0.92). 46 This platform had better agreement of ASPECTS grading with DWI infarct volume in patients with large hemispheric infarct compared with experienced readers (median DWI ASPECTS, 3 [IQR, 2-4]; Rapid ASPECTS, 3 [IQR, 1-6]; and CT ASPECTS for the clinicians, 5 [IQR, 4-7]). 47
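Agreement statistics such as κ recur throughout the ASPECTS literature, so a short sketch may help make them concrete. The example below assumes each reader marks the 10 ASPECTS regions as affected (1) or unaffected (0), derives the score as 10 minus the number of affected regions, and computes unweighted Cohen's κ between two readers. The cited studies may have used weighted κ or other variants, so this illustrates the statistic rather than reproducing their analysis.

```python
from collections import Counter


def aspects_from_regions(affected):
    """ASPECTS = 10 minus the number of regions marked as affected."""
    if len(affected) != 10:
        raise ValueError("ASPECTS uses exactly 10 regions")
    return 10 - sum(1 for region in affected if region)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With two readers who agree on 8 of 10 regions and each mark 4 regions as affected, observed agreement is 0.80, chance agreement is 0.52, and κ is about 0.58.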

Additional Factors in Treatment Selection
Various factors, including collaterals, penumbra, and stroke onset time, are important for evaluating potentially salvageable tissue and determining treatment eligibility. An automated commercial software program (e-CTA; Brainomix) combining deep and traditional ML techniques for CTA collateral status determination improved consensus scoring among expert neuroradiologists compared with visual inspection alone, with an ICC of 0.58 (0.46-0.67) improving to 0.77 (0.66-0.85; P = .003). 48 Penumbra prediction on a noncontrast MR imaging pseudocontinuous arterial spin labeling technique using a DL model performed well (AUC = 0.958). 49 This algorithm outperformed traditional ML algorithms and was able to predict endovascular treatment eligibility based on DEFUSE 3 (Endovascular Therapy Following Imaging Evaluation for Ischemic Stroke) trial criteria. Another study evaluating various traditional ML models in predicting stroke onset time demonstrated that incorporation of DL features to the models improved AUC compared with the ground truth (ie, a DWI-FLAIR mismatch), with the optimal AUC of 0.765 incorporating logistic regression and DL features of MR imaging and MR perfusion (MRP) images. 50 Lee et al 51 used DWI-FLAIR mismatch to predict stroke onset time <4.5 hours and found that traditional ML models were more sensitive than stroke neurologists (sensitivity = 48.5% for stroke neurologists vs 75.8% for logistic regression, P = .020; 72.7% for SVM, P = .033; 75.8% for RF, P = .013).

Prognostication
Various ML algorithms have been used to predict imaging and clinical outcomes after ischemic stroke. An early classical ML study found that a generalized linear model combining DWI and perfusion-weighted imaging (PWI) MR images was better than DWI (P = .02) or PWI (P = .04) alone at predicting voxelwise tissue outcomes. 52 A CNN-based patch sampling of the Tmax feature on MRP outperformed a single voxel-based regression model in predicting final infarct volume, with a mean accuracy of 85.3 ± 9.1% compared with 78.3 ± 5.5%, respectively. 53 Another CNN performed better than other ML methods in predicting final infarct volume by incorporating MR imaging DWI, MRP, and FLAIR data, with an AUC of 0.88 ± 0.12. 54 This CNN could predict tissue fate based on whether intravenous tissue plasminogen activator was administered, showing significantly different final infarct volumes (P = .048). 54 A CNN based on MRP source images was able to predict final infarct volume with an AUC of 0.871 ± 0.024. 55 A multicenter study showed that an attention-gated U-Net DL algorithm with DWI and MRP as inputs could predict final infarct volume regardless of reperfusion status, with a median AUC of 0.92 (IQR, 0.87-0.96) and significant overlap with the ground truth of a FLAIR sequence obtained 3-7 days after baseline presentation (Dice score, 0.53; IQR, 0.31-0.68). 56
The e-ASPECTS software was able to predict poor clinical outcomes after thrombectomy (Spearman correlation = −0.15; P = .027) and was an independent predictor of poor outcome in a multivariate analysis (OR, 0.79; 95% CI, 0.63-0.99) while also demonstrating high consensus with 3 expert ASPECTS readers (ICC = 0.72, 0.74, and 0.76). 57 Traditional ML techniques combining clinical data and core-penumbra mismatch ratio derived from MR imaging and MRP to determine postthrombolysis clinical outcomes performed with an AUC of 0.863 (95% CI, 0.774-0.951) for short-term (day 7) outcomes and 0.778 (95% CI, 0.668-0.888) for long-term (day 90) outcomes. 58 Decision tree-based algorithms including extreme gradient boosting and gradient boosting machine were able to predict 90-day modified Rankin scale (mRS) > 2 using imaging and clinical data with AUC of 0.746 (extreme gradient boosting) and 0.748 (gradient boosting machine), and performance improved when incorporating NIHSS at 24 hours and recanalization outcomes. 59 ML techniques, including regularized logistic regression, linear SVM, and RF, outperformed existing pretreatment scoring methods in predicting good clinical outcomes (mRS ≤2 at 90 days) of patients with LVO who will undergo thrombectomy, with AUC 0.85-0.86 for ML models compared with 0.71-0.77 for pretreatment scores. 60 A combination CNN and ANN approach incorporating clinical and NCCT data predicted functional thrombolysis outcomes with accuracy 0.71 for 24-hour NIHSS improvement of ≥4 and accuracy 0.74 for 90-day mRS of 0-1. 61
Finally, traditional ML techniques and neural networks were used to predict hemorrhagic transformation of acute ischemic stroke before treatment from MRP source images and DWI, with the highest AUC of 0.837 ± 2.6% using a kernel spectral regression ML technique. 62 One limitation of this study was the variable recanalization of the participants, which may have confounded results.

AI EVALUATION OF HEMORRHAGE
This section focuses primarily on DL methods that have been used for intracranial hemorrhage (ICH) detection and classification, quantification, and prognostication (Online Table 5).

Detection and Classification
A study using two 2D convolutional neural networks, GoogLeNet and AlexNet, to detect basal ganglia hemorrhages on NCCT found that GoogLeNet with augmented data in a pretrained network was the most accurate (AUC = 1.0; sensitivity and specificity = 100%) compared with the highest performing augmented, untrained AlexNet (AUC = 0.95; sensitivity = 100%; and specificity = 80%). 63 False positive results from basal ganglia calcification were seen in some of the methods, and sensitivity of detection of small basal ganglia hemorrhages remains to be investigated.
One of the largest cohorts for detection and classification of ICH examined more than 300,000 NCCTs from different hospitals in India using DL algorithms. 64 The algorithm performed well on 2 different validation datasets, Qure25k and CQ500, achieving AUCs of 0.92 (95% CI, 0.91-0.93) and 0.94 (95% CI, 0.92-0.97), respectively, for detecting ICH. The algorithm was also able to classify subtypes of hemorrhage (parenchymal, intraventricular, subdural, extradural/epidural, and subarachnoid) with AUCs ranging from 0.90 to 0.96 for the Qure25k dataset and 0.93 to 0.97 for the CQ500 dataset. An additional feature of the algorithm was its ability to recognize associated pertinent CT findings, such as calvarial fracture, midline shift, and mass effect.
Another study using a fully 3D CNN with a large patient cohort was able to detect ICH and reprioritize studies as "stat" (defined as a positive ICH study) versus "routine." 65 The AUC was 0.846 (95% CI, 0.837-0.856), specificity was 0.80 (0.790-0.809), and sensitivity was 0.73 (0.713-0.748). The algorithm was integrated into the radiologist's workflow, and time to detection was reduced from 512 to 19 minutes.
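The worklist reprioritization described above can be pictured as a two-level priority queue. The sketch below is hypothetical: the class name, the probability threshold of 0.5, and the first-in, first-out ordering within each priority level are assumptions made for illustration and are not details reported by the cited study.

```python
import heapq
from itertools import count


class Worklist:
    """Reading queue in which suspected-ICH studies are served first."""

    def __init__(self, stat_threshold=0.5):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order
        self.stat_threshold = stat_threshold  # hypothetical cutoff

    def add(self, study_id, ich_probability):
        # Priority 0 ("stat") sorts ahead of priority 1 ("routine").
        priority = 0 if ich_probability >= self.stat_threshold else 1
        heapq.heappush(self._heap, (priority, next(self._arrival), study_id))

    def next_study(self):
        priority, _, study_id = heapq.heappop(self._heap)
        label = "stat" if priority == 0 else "routine"
        return study_id, label
```

Studies flagged by the model move ahead of routine studies regardless of arrival time, which is the mechanism by which time to detection can fall for positive cases.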
An explainable pretrained 2D convolutional neural network system performed at a similar level to expert neuroradiologists on a relatively small cohort of cases when detecting acute ICH and classifying the 5 ICH subtypes on NCCT. 66 The algorithm incorporated techniques such as attention maps and prediction-based modules to help mitigate the "black box" of the DL system. The system displayed a robust performance when detecting ICH on a retrospective dataset of 200 cases (AUC = 0.99; sensitivity = 98%; and specificity = 95%) and prospective dataset of 196 cases (AUC = 0.96; sensitivity = 92%; and specificity = 95%). Furthermore, the overall localization accuracy of the attention maps was 78.1% compared with bleeding points annotated by expert neuroradiologists.

Quantification
A custom DL-trained hybrid 3D-2D CNN was able to detect and quantify ICHs on NCCT in a retrospective training cohort and a prospective testing cohort from the emergency department. 67 Accuracy, AUC, sensitivity, specificity, positive predictive value, and negative predictive value for ICH detection for the training cohort were 0.975, 0.983, 0.971, 0.975, 0.793, and 0.997, respectively, and for the prospective cohort were 0.970, 0.981, 0.951, 0.973, 0.829, and 0.993. For ICH quantification, Dice scores were 0.931, 0.863, and 0.772, and Pearson correlation coefficients were 0.999, 0.987, and 0.953 for intraparenchymal hemorrhage, epidural or subdural hemorrhage, and SAH, respectively, compared with semiautomated segmentation by a radiologist. This study used real-life prospective testing of the algorithm and quantified hemorrhage volume during segmentation. The study also addresses the black box critique with the use of a custom mask ROI-based CNN architecture.
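The gap between high sensitivity and specificity and a lower positive predictive value reflects the dependence of predictive values on prevalence. A minimal sketch of that relationship, using Bayes' rule, is shown below. The function name is hypothetical, and the prevalence is an input supplied by the user; none of the values are taken from the cited cohorts.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv
```

Holding sensitivity and specificity fixed, a lower prevalence of hemorrhage in the scanned population lowers the positive predictive value and raises the negative predictive value.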
A patch-based fully DL CNN simultaneously classified and quantified hemorrhages at a level equal to or above that of expert radiologists (AUC = 0.991 ± 0.006). 68 The algorithm was able to identify some small hemorrhages that were missed by radiologists and performed well on a relatively small dataset. The strongly supervised approach took into account the heterogeneous morphology of hemorrhages and showed perfect sensitivity (1.00) while maintaining high specificity (0.87).

Prognostication
Identifying patients at risk for ICH expansion is important for prognostication. One study showed good performance when applying an SVM that incorporated various clinical and imaging variables to predict hematoma expansion on NCCT (AUC = 0.89; mean sensitivity = 81.3%; and mean specificity = 84.8%). 69 Rapid and accurate identification of ICH by AI methods could aid with triaging of positive studies.

CONCLUSIONS
Prompt detection and treatment of acute cerebrovascular disease are critical to reduce morbidity and mortality. The current application of AI in this field has created opportunities to improve treatment selection and clinical outcomes by aiding in all parts of the diagnostic and treatment pathway, including detection, triage, and outcome prediction. Future studies validating AI techniques are needed to allow for more widespread use in various practice environments.