Ensemble of Convolutional Neural Networks Improves Automated Segmentation of Acute Ischemic Lesions Using Multiparametric Diffusion-Weighted MRI

BACKGROUND AND PURPOSE: Accurate automated infarct segmentation is needed for acute ischemic stroke studies that rely on infarct volumes as an imaging phenotype or biomarker and that require large numbers of subjects. This study investigated whether an ensemble of convolutional neural networks trained on multiparametric DWI maps outperforms single networks trained on solo DWI parametric maps.
MATERIALS AND METHODS: Convolutional neural networks were trained on combinations of DWI, ADC, and low b-value-weighted images from 116 subjects. The performances of the networks (measured by the Dice score, sensitivity, and precision) were compared with one another and with ensembles of 5 networks. To assess the generalizability of the approach, we applied the best-performing model to an independent Evaluation Cohort of 151 subjects. Agreement between manual and automated segmentations for identifying patients with large lesion volumes was calculated across multiple thresholds (21, 31, 51, and 70 cm³).
RESULTS: An ensemble of convolutional neural networks trained on DWI, ADC, and low b-value-weighted images produced more accurate acute infarct segmentation than individual networks (P < .001). Automated volumes correlated with manually measured volumes (Spearman ρ = 0.91, P < .001) for the independent cohort. For the task of identifying patients with large lesion volumes, agreement between manual outlines and automated outlines was high (Cohen κ, 0.86–0.90; P < .001).
CONCLUSIONS: Acute infarcts are more accurately segmented using ensembles of convolutional neural networks trained with multiparametric maps than by using a single model trained with a solo map. Automated lesion segmentation has high agreement with manual techniques for identifying patients with large lesion volumes.

ing multiple b-values, up to 2000 s/mm² (which are typically not acquired in the acute setting), but whether the data were acquired in the acute or subacute stage was not reported, and the effects of using combinations of parameters were not investigated [7]. We hypothesize that a multimodal approach can improve the performance of automated segmentation algorithms. Indeed, most radiologists use other sequences in addition to DWI when assessing the extent of acute infarction. We tested this hypothesis by comparing the accuracy of fully automated acute infarct segmentation algorithms that use solo diffusion parametric maps with the performance of algorithms that combine multiple parametric maps. We also posit that ensemble models that aggregate segmentation results from multiple algorithms will surpass single algorithms. The superior accuracy of ensemble algorithms has been shown for tumor applications [8], but not yet for acute infarct segmentation. Finally, we assessed the generalizability of our approach by evaluating its performance on an independent cohort. We also tested the clinical utility of automated approaches for triaging patients with large infarct volumes who might not benefit from endovascular treatment [9,10].

Subjects
All analyses were performed retrospectively under Partners Human Research Committee review board approval. MR imaging from patients with acute ischemic stroke who were admitted at a single academic medical center between 2005 and 2007, were imaged within 12 hours of when the patient was last known to be well (LKW), and did not receive either thrombolysis before MR imaging or experimental therapy was used for training the convolutional neural networks (CNNs) [11]. An independent cohort [12,13] consisting of nonoverlapping patients admitted to the same center between 1996 and 2012, for whom imaging was performed within 24 hours of LKW and for whom follow-up MR imaging datasets were available, was used for the evaluation group. Both cohorts were drawn from separate repositories for which manual outlines were available; these outlines had been drawn several years earlier for a study of early-stage stroke patterns [11] or for studies predicting lesion expansion [12,13].

MR Imaging
Diffusion-weighted MR imaging was acquired on 1.5T scanners (GE Genesis SIGNA, SIGNA Excite, SIGNA HDx, SIGNA HDxt; GE Healthcare, Milwaukee, Wisconsin) with the following parameters for most subjects: b-value = 1000 s/mm², TR = 5000 ms, TE = 88.9 ms, FOV = 220 mm, 23 slices of 5-mm thickness with a 1-mm gap, and 6 diffusion directions (see the On-line Appendix and On-line Table 1 for details). Diffusion-weighted MR images were corrected for eddy current distortions before calculation of isotropic trace DWI maps (geometric mean of the high-b-value acquisitions) and ADC maps (slope of the linear regression fit of the log of the DWI and LOWB images using techniques described previously) [14]. Manual outlines had been drawn for prior studies [11-13]. No a priori thresholds were used for manual segmentation, but concomitant ADC and LOWB maps were referenced to avoid inclusion of susceptibility artifacts and chronic lesions with elevated ADC values. Tissue was considered an acute infarct if it exhibited hyperintensity on DWI, with hypointensity on the ADC or abnormal T2 prolongation on LOWB. To assess interrater agreement, we randomly selected 10 subjects from the Evaluation Cohort, for whom outlines were also drawn by reader 1, and the 2-way intraclass correlation was calculated.
A neuroradiologist with 12 years of experience (W.A.C.) assigned each patient to 1 of the following categories based on lesion location: brain stem, cerebellum, supratentorial/cortical, or supratentorial/subcortical. The "supratentorial/cortical" designation was used if any portion of ≥1 infarct involved the cortex. Patients with both supra- and infratentorial lesions or lesions involving both the brain stem and cerebellum were assigned to a fifth category, "multiple."
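The ADC calculation mentioned above (the slope of a linear fit to the log of the signal against b-value) can be illustrated for a single voxel. This is only a minimal sketch: the function name, the two-point b-value set (0 and 1000 s/mm²), and the signal values are assumptions made for illustration, not the processing code used in the study.

```python
import math

def adc_from_signals(b_values, signals):
    """Return the ADC (mm^2/s) as the negated least-squares slope of ln(S) vs b."""
    logs = [math.log(s) for s in signals]
    n = len(b_values)
    mean_b = sum(b_values) / n
    mean_log = sum(logs) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (l - mean_log) for b, l in zip(b_values, logs))
    return -sxy / sxx

# Hypothetical voxel: a LOWB (b = 0) signal of 1000 and a trace DWI signal
# simulated from an assumed ADC of 600e-6 mm^2/s at b = 1000 s/mm^2.
lowb_signal = 1000.0
dwi_signal = lowb_signal * math.exp(-1000.0 * 600e-6)
adc = adc_from_signals([0.0, 1000.0], [lowb_signal, dwi_signal])
```

With only two b-values the regression reduces to ADC = ln(S_LOWB / S_DWI) / b; the least-squares form is shown because it also covers acquisitions with more than two b-values.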

Image Preprocessing
DWI, ADC, and LOWB images were resampled to an isotropic voxel size of 1 mm³. The LOWB brain mask was computed using the Brain Extraction Tool (FSL, Version 5.0.9; http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/BET) [15,16]. Mean and SD were calculated from intensities within the brain mask limited to the 1st to 99th percentile range to normalize values to a mean of 0 and SD of 1.0.
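The normalization step described above can be sketched on a flattened list of voxel intensities. This is an illustrative sketch under several assumptions that the text does not specify: the function name is hypothetical, percentiles are computed with the "inclusive" interpolation method, the population SD is used, and the resulting shift and scale are applied to every voxel rather than only to voxels inside the mask.

```python
import statistics

def normalize_in_mask(intensities, mask, lo_pct=1, hi_pct=99):
    """Z-score intensities using the mean and SD of in-mask voxels that lie
    between the lo_pct and hi_pct percentiles (assumed interpretation)."""
    brain = [v for v, inside in zip(intensities, mask) if inside]
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = statistics.quantiles(brain, n=100, method="inclusive")
    lo, hi = cuts[lo_pct - 1], cuts[hi_pct - 1]
    core = [v for v in brain if lo <= v <= hi]
    mean = statistics.fmean(core)
    sd = statistics.pstdev(core)
    return [(v - mean) / sd for v in intensities]

# Hypothetical image: 101 voxels with intensities 0..100, all inside the mask.
voxels = [float(i) for i in range(101)]
brain_mask = [True] * len(voxels)
normalized = normalize_in_mask(voxels, brain_mask)
```

Restricting the statistics to the 1st to 99th percentile range keeps a few extreme voxels (for example, residual artifacts at the mask edge) from dominating the mean and SD.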

CNN Training
CNNs were trained to classify voxels as lesion or non-lesion on an NVIDIA Tesla K40 GPU (NVIDIA, Santa Clara, California) using the DeepMedic (Version 0.7.0; https://biomedia.doc.ic.ac.uk/software/deepmedic/) framework with 2 pathways (see the original publication [17] and the On-line Appendix). On-line Fig 1 shows the architecture. DeepMedic is a 3D-CNN that operates on multiresolution pathways to allow efficient and accurate supervised segmentation. This framework was chosen over other approaches because it performed best in the Ischemic Stroke Lesion Segmentation Challenge (ISLES) 2015 study [18]. Additional studies have also shown that DeepMedic had better or comparable performance compared with other neural network architectures (On-line Appendix). Separate CNNs were trained on single or different combinations of diffusion parametric maps (DWI, ADC, and LOWB individually, and DWI+ADC, ADC+LOWB, DWI+LOWB, DWI+ADC+LOWB). To generate ensemble segmentations, we averaged voxelwise the class posteriors from the softmax layers of 5 independent CNNs.
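The voxelwise averaging of class posteriors described above can be sketched as follows. The function names and posterior values are hypothetical, the posteriors are represented as flat lists of lesion-class probabilities, and the choice of "greater than or equal to" at the 50% threshold is an assumption; the study itself used the DeepMedic outputs directly.

```python
def ensemble_posterior(posteriors):
    """Voxelwise mean of lesion-class posteriors from several networks."""
    n_models = len(posteriors)
    return [sum(voxel) / n_models for voxel in zip(*posteriors)]

def binarize(posterior, threshold=0.5):
    """Label a voxel as lesion when its averaged posterior reaches the threshold."""
    return [p >= threshold for p in posterior]

# Hypothetical lesion posteriors from 5 CNNs for the same 4 voxels.
cnn_outputs = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.1, 0.4, 0.0],
    [0.7, 0.3, 0.7, 0.2],
    [0.9, 0.2, 0.3, 0.1],
    [0.7, 0.2, 0.6, 0.1],
]
mean_posterior = ensemble_posterior(cnn_outputs)
lesion_mask = binarize(mean_posterior)
```

In this toy example the third voxel is labeled lesion by only 3 of the 5 networks, yet its averaged posterior still exceeds 50%; averaging posteriors before thresholding lets confident networks outweigh uncertain ones, which differs from a simple majority vote over binary masks.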
The results of all models were resampled back to the original image resolution, thresholded at 50%, and masked with the resampled brain masks created at the normalization step. Performance within the training data was assessed via 5-fold cross-validation. For subjects in each fold, lesion segmentations were generated using a CNN that was trained on data from the other 4 folds. Training a single CNN with DWI+ADC+LOWB maps on the full Training Cohort of 116 subjects required approximately 16 hours. Applying the trained CNN to an individual subject to segment the lesion took on average 35 seconds. With sequential evaluation of 5 CNNs, merging their output, and resampling, we estimate that a full segmentation would require <5 minutes.
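The 5-fold cross-validation scheme described above can be sketched as a simple split of subject identifiers. How subjects were actually assigned to folds is not stated in the text, so the random, unstratified assignment, the seed, and the function name below are assumptions for illustration only.

```python
import random

def five_fold_splits(subject_ids, n_folds=5, seed=0):
    """Yield (training_subjects, held_out_subjects) pairs, one per fold."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # assumed: random, unstratified assignment
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k, held_out in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train, held_out

# Hypothetical subject identifiers 0..115 standing in for the 116 training subjects.
splits = list(five_fold_splits(range(116)))
```

Each subject appears in exactly one held-out fold, so every segmentation used for the cross-validated metrics comes from a network that never saw that subject during training.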
To evaluate the generalizability of the approach, we retrained the best-performing network on the full Training Cohort and applied it to the independent cohort. The Evaluation Cohort was also segmented with an approach that has been used in clinical trials [19]. In brief, the technique combined thresholding of ADC (<615 × 10⁻⁶ mm²/s), DWI, and exponential attenuation maps with morphologic operations (opening with a 2-voxel structural element). ADC images were first masked with a LOWB brain mask before thresholding. We evaluated the algorithm on images that had been resampled to 1-mm resolution for processing and on images that were segmented at their original resolution. Segmented outputs from all algorithms were evaluated at 1-mm resolution to reduce potential confounds from different MR imaging acquisition resolutions. Effects of lesion volume and location on performance were investigated using univariable and multivariable regression analysis as a function of the manually segmented lesion volumes (MLVs). We also compared algorithm accuracy between very small MLVs of <1 cm³ (group I-A) and larger MLVs ≥1 cm³ (group I-B).
To assess the accuracy of using automatically segmented lesion volumes (ALVs) in place of MLVs for identifying patients who have lesion volumes that are too large to likely benefit from endovascular treatment, we explored the agreement between ALV and MLV for MLV <21 cm³ (group II-A) versus ≥21 cm³ (group II-B), MLV <31 cm³ (group III-A) versus ≥31 cm³ (group III-B), MLV <51 cm³ (group IV-A) versus ≥51 cm³ (group IV-B), and MLV <70 cm³ (group V-A) versus ≥70 cm³ (group V-B) to determine potential misclassification rates of patients with large lesions using automated algorithms compared with manual volumes. The thresholds (21, 31, 51, and 70 cm³) were selected on the basis of values that had been used for enrollment in prospective endovascular clinical trials of expanded-window interventions [9,10]. To be eligible for endovascular treatment using the DWI or CTP Assessment with Clinical Mismatch in the Triage of Wake-Up and Late Presenting Strokes Undergoing Neurointervention With Trevo (DAWN) trial criteria [10], patients had to meet the inclusion and exclusion criteria of 1 of the following 3 groups: group A, 80 years of age or older, NIHSS score ≥10, and infarct volume of <21 cm³; group B, younger than 80 years of age, NIHSS score ≥10, and infarct volume of <31 cm³; group C, younger than 80 years of age, NIHSS score ≥20, and infarct volume of 31 to <51 cm³. For the MR imaging cohort, the infarct volume was measured on DWI. Similarly, to be eligible for late-window endovascular treatment using the Endovascular Therapy Following Imaging Evaluation for Ischemic Stroke 3 (DEFUSE 3) MR imaging criteria [9], patients had to exhibit an infarct volume on DWI of <70 cm³. Although there may be other volume thresholds that might be useful for patient selection [20],

RESULTS
Subject demographics for the Training Cohort (n = 116) and Evaluation Cohort (n = 151) are shown in Table 1. Although there were imbalances in sex and time to MR imaging, likely due to the different inclusion and exclusion criteria of the 2 cohorts (ie, patients for whom follow-up MR imaging is ordered clinically, who made up the Evaluation Cohort, tend to have more severe conditions), there was no statistical difference in the distribution of MLVs. The median volume of the 10 subjects randomly selected from the Evaluation Cohort for intraclass correlation coefficient analysis was 9.7 cm³ (interquartile range [IQR] = 2.7-32.6 cm³), ranging from 1.2 to 94.4 cm³. The intraclass correlation coefficient for the 2 readers was excellent (intraclass correlation coefficient = 1.00, P < .001).

Effect of Selection of Diffusion Parametric Maps on CNN Performance
Significant differences (P < .001) were found among all performance metrics (Dice, precision, sensitivity) across all models (Table 2). Precision could not be calculated for cases in which models could not detect a lesion.
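The three performance metrics, and the reason precision is undefined when a model detects no lesion, can be made concrete with a short sketch. The formulas are the standard voxel-overlap definitions (the text does not spell them out), and the function name and binary masks below are hypothetical.

```python
def segmentation_metrics(auto_mask, manual_mask):
    """Return (dice, precision, sensitivity); a metric is None when undefined."""
    pairs = [(bool(a), bool(m)) for a, m in zip(auto_mask, manual_mask)]
    tp = sum(a and m for a, m in pairs)        # lesion in both masks
    fp = sum(a and not m for a, m in pairs)    # automated only
    fn = sum(m and not a for a, m in pairs)    # manual only
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    return dice, precision, sensitivity

# Hypothetical flattened masks (1 = lesion voxel).
manual = [1, 0, 0, 1, 1]
auto = [1, 1, 0, 0, 1]
dice, precision, sensitivity = segmentation_metrics(auto, manual)

# A model that detects nothing: no automated lesion voxels at all.
missed = segmentation_metrics([0, 0, 0, 0, 0], manual)
```

When the automated mask is empty, the denominator of precision (true-positives plus false-positives) is zero, so the metric cannot be computed, whereas Dice and sensitivity are both 0.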

Individual Diffusion Maps
The CNN trained on DWI yielded significantly higher Dice scores compared with the CNN trained on ADC (P < .001) or LOWB (P < .001) maps (On-line Fig 2 and Table 2). Findings for the CNN precision (DWI versus ADC, P < .001; versus LOWB, P < .001) and sensitivity (DWI versus ADC, P < .001; versus LOWB, P < .001) were analogous to those for the Dice score. Of the networks trained with a single parametric map, the CNN model that used the DWI parametric map performed best, followed by the model based on the ADC map, with the LOWB-based model having the worst scores. DWI+ADC had sensitivity comparable with that of DWI+LOWB (P = .11) and had improved sensitivity with respect to ADC+LOWB (P < .001). DWI+LOWB and ADC+LOWB did not differ significantly in sensitivity (P = .06).

Ensemble of CNNs
Five CNNs were trained, each using either DWI+ADC or DWI+ADC+LOWB, the 2 best-performing models. The Dice performances of each of the 5 individual CNNs were slightly different using DWI+ADC (On-line Table 2 and 2, P < .001). E3 and E2 had similar sensitivity to one another (P = .46) and to the DWI+ADC+LOWB model (versus E2, P = .58; versus E3, P = .12), but outperformed the others (P < .01, Table 2).

Validation on the Independent Cohort
E3 was used for the evaluation studies to assess the generalizability of the approach because E3 tended to perform better than E2.

Table 2 (partial row): 82.2 (64.9-88.9); 83.2 (67.7-93.3); 83.9 (71.9-92.4)
a All metrics are denoted in percentages as median (IQR). Of the nonensemble models, significant differences in Dice, precision, and sensitivity were found (P < .001). The ensemble models, E2 and E3, were superior to all other models (P < .001).

DISCUSSION
We have shown that an ensemble of CNNs trained with multiparametric diffusion maps improves automated segmentation of acute infarcts over methods that use solo maps. Among the individual parameter models, CNNs trained on DWI performed best. However, a model trained on only DWI may incorrectly classify regions with susceptibility artifacts that appear as DWI hyperintensities or wrongly include subacute T2 shine-through regions [21]. Networks trained on only ADC images provided a fair performance because reduced ADC values represent cytotoxic edema that manifests in hyperacute stroke [22], but may undersegment later-stage strokes when ADC pseudonormalizes [21]. CNNs exclusively trained on LOWB performed poorly, likely because our data consisted of mainly patients with early-phase stroke (median, 6 hours from LKW) before vasogenic edema is evident on LOWB [23]. Combining DWI and ADC improved segmentation, consistent with "standard practice" by expert outliners who typically refer to the ADC image to confirm that the DWI hyperintensity coincides with reduced diffusivity to minimize inclusion of artifacts. Combining LOWB with either ADC or DWI increased the Dice score, suggesting that LOWB provides complementary information. Although inclusion of LOWB with DWI+ADC did not result in statistically significant improved performance, a tendency toward more accurate segmentation was observed in the ensemble models.
We have also shown that our model performs comparably with humans as reflected by both high Dice scores and correlation between ALV and MLV. Indeed, the Dice scores of the E3 algorithm results were comparable with the Dice scores between human readers in our subcohort of 10 patients with outlines from both readers. The time for automated segmentation currently is approximately 5 minutes, which may be similar to times required by an experienced human reader, but we expect that with optimization and faster GPUs, the time for segmentation can be further reduced. Furthermore, the primary benefits of our automated approach are that the results will be reproducible, unbiased, and scalable (eg, clinical trials that compare lesion volumes for thousands of subjects). ALV and MLV were closely correlated, but small lesion volumes were overestimated. Accurate estimation of small lesion volumes (<1 cm³) is more difficult because they are harder to detect and small variation from the ground truth leads to greater aberrations of performance metrics. Small-lesion segmentation could possibly be improved by customizing specific CNNs tailored to detecting lesions by volume size. Nevertheless, we have shown that our automated approach performed comparably with manual lesions delineated by our human experts with regard to patient-selection tasks. The cases of disagreement typically occurred when there were image artifacts that led to poor brain extraction, which, in turn, might have led to poor normalization, resulting in oversegmentation. A second reason for this failure might be that the networks have not previously seen context outside the brain during training because it is excluded in most cases in which the masks are correctly computed. We did not manually fix the brain masks because we wanted to evaluate a fully automated approach. Refining the automated brain extraction step will likely further improve our algorithms.
There were several limitations to this study. One is the retrospective nature of our analysis, which resulted in variable MR imaging acquisition protocols that changed across the years with clinical practice. However, this is also a strength because our approach will likely be more generalizable to real-world clinical situations and not dependent on a specific MR imaging protocol, which is often used in clinical trials. This may also explain why the thresholding approach performed poorly on our data compared with other studies for which MR imaging acquisition was harmonized as part of a trial [19]. Another potential limitation is that the manual outlines used for the Evaluation Cohort were created by a different reader than those used for the Training Cohort. However, the accurate segmentation results in both cohorts suggest that the model is not overfitted to 1 particular reader. Another benefit of an automated approach is that it is reproducible and not dependent on the expertise of the reader.
To evaluate the impact of different diffusion maps on segmentation performance, we kept the CNN architecture constant throughout all experiments. In addition to changing the combinations of inputs, we chose to build an ensemble from several CNNs because ensemble learning is known to boost the performances of single-classifier algorithms [8,24]. DeepMedic samples randomly from the Training Cohort (ie, both the selected subjects and extracted samples differ in each training epoch). Although DeepMedic is very robust in its performance, the variation in sampling inherently results in slightly different models, even when trained with the same architecture. Merging the segmentations of several models reduces false-positives and improves overall performance. Although strong single networks are desired and necessary to create a high-performing ensemble, our CNNs may come with bias specific to DeepMedic. Building an ensemble of different CNN architectures might further enhance the performance. Future investigation will need to analyze the benefits of merging more diverse networks to cancel out each other's inherent biases [8]. This diversity of models could be achieved by changing the hyperparameters of DeepMedic, using completely different architectures, or training on a different dataset.

CONCLUSIONS
Ensembles of CNNs trained on multiparametric diffusion MR imaging improved automated segmentation of acute infarcts in comparison with individual CNNs trained on solo diffusion maps, producing results that are comparable with manual lesions drawn by experts.
Manual outlines were drawn using the program Display (McConnell Brain Imaging Centre, Montreal, Canada) by a neuroscientist with 15 years of experience (reader 1: O.W., Training Cohort) and a neuroradiology fellow with 4 years of experience (reader 2: R.B., Evaluation Cohort) interpreting stroke MR imaging. The readers were blinded to the results of the automated segmentation algorithm.
On-line Fig 3, ANOVA P = .02, with differences between CNN 2 and CNN 3, P = .04; and CNN 4 and CNN 5, P = .04) but were similar to one another using DWI+ADC+LOWB (On-line Table 3 and On-line Fig 4, ANOVA P = .60). Aggregating results of the individual CNNs to create ensembles (E2: DWI+ADC CNNs, E3: DWI+ADC+LOWB CNNs) significantly improved the Dice performance over individual CNNs (P < .001). Both ensembles yielded results similar to one another in terms of Dice (P = .66) and precision (P = .62), but both surpassed the other CNNs (Table
we focused on thresholds that were used in positive prospective clinical trials.

Statistical Analysis. Differences between model performance metrics were tested by 2-way ANOVA followed by post hoc paired Wilcoxon signed rank tests. Correlations were assessed via the Spearman correlation coefficient (ρ). Univariate analysis was performed with the Wilcoxon 2-sample rank sum test for continuous variables or the 2-sided Fisher exact test for categoric variables. Cohen κ assessed agreement between MLV and ALV. Statistical tests were conducted with JMP Pro 14.0 (SAS Institute, Cary, North Carolina). P values < .05 were considered significant. Figures of MR imaging data were generated using FSLeyes (Version 0.27; https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FSLeyes).
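The agreement analysis between MLV and ALV at a volume threshold can be sketched with the standard unweighted Cohen κ. The study used JMP Pro for its statistics; the function names and lesion volumes below are hypothetical and serve only to show how a threshold turns two volume measurements into a pair of binary ratings.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen kappa for two equally long lists of categorical ratings."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    # Undefined (division by zero) if both raters use a single identical category.
    return (observed - expected) / (1 - expected)

def large_lesion_agreement(mlv, alv, threshold_cm3):
    """Kappa for classifying each patient as at-or-above versus below a threshold."""
    manual = [v >= threshold_cm3 for v in mlv]
    automated = [v >= threshold_cm3 for v in alv]
    return cohen_kappa(manual, automated)

# Hypothetical manual and automated lesion volumes (cm^3) for 8 patients.
mlv = [5, 12, 25, 40, 60, 80, 3, 22]
alv = [6, 10, 19, 42, 55, 85, 2, 24]
kappa_21 = large_lesion_agreement(mlv, alv, threshold_cm3=21)
```

In this toy example one patient (25 cm³ manually, 19 cm³ automatically) falls on opposite sides of the 21-cm³ threshold, which is the kind of misclassification the threshold analysis is meant to quantify.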

Table 3: Dependency of automated segmentation performance on MLV a
a Performance metrics are in median (IQR) and percentages. Results of E3 applied to the Evaluation Cohort are shown as a function of different volume thresholds.
b Excludes 2 subjects in group A with automatically segmented lesion volumes of zero because precision is undefined in this circumstance.
c P < .001.
d P < .01.
e P < .05 group A versus group B, where group A is the group meeting the threshold criteria and group B is the group not meeting the threshold criteria.