Deep Learning–Based Detection of Intracranial Aneurysms in 3D TOF-MRA

In a retrospective study, the authors established a system for the detection of intracranial aneurysms from 3D TOF-MRA data. The system is based on an open-source neural network, originally developed for segmentation of anatomic structures in medical images. Eighty-five datasets of patients with a total of 115 intracranial aneurysms were used to train the system and evaluate its performance. Manual annotation of aneurysms based on radiologic reports and critical revision of image data served as the reference standard. The highest overall sensitivity of this system for the detection of intracranial aneurysms was 90% with a sensitivity of 96% for aneurysms with a diameter of 3–7 mm and 100% for aneurysms of >7 mm. The best location-dependent performance was in the posterior circulation.
BACKGROUND AND PURPOSE: The rupture of an intracranial aneurysm is a serious incident, causing subarachnoid hemorrhage associated with high fatality and morbidity rates. Because the demand for radiologic examinations is steadily growing, physician fatigue due to an increased workload is a real concern and may lead to mistaken diagnoses of potentially relevant findings. Our aim was to develop a sufficient system for automated detection of intracranial aneurysms.
MATERIALS AND METHODS: In a retrospective study, we established a system for the detection of intracranial aneurysms from 3D TOF-MRA data. The system is based on an open-source neural network, originally developed for segmentation of anatomic structures in medical images. Eighty-five datasets of patients with a total of 115 intracranial aneurysms were used to train the system and evaluate its performance. Manual annotation of aneurysms based on radiologic reports and critical revision of image data served as the reference standard. Sensitivity, false-positives per case, and positive predictive value were determined for different pipelines with modified pre- and postprocessing.
RESULTS: The highest overall sensitivity of our system for the detection of intracranial aneurysms was 90% with a sensitivity of 96% for aneurysms with a diameter of 3–7 mm and 100% for aneurysms of >7 mm. The best location-dependent performance was in the posterior circulation. Pre- and postprocessing sufficiently reduced the number of false-positives.
CONCLUSIONS: Our system, based on a deep learning convolutional network, can detect intracranial aneurysms with a high sensitivity from 3D TOF-MRA data.

Unruptured intracranial aneurysms are common among the general population. It is estimated that approximately 3% of healthy adults have an intracranial aneurysm. 1 These aneurysms often remain undiagnosed unless they become symptomatic (eg, by compression of adjacent neural structures or rupture into the subarachnoid space). 2 Rupture of an intracranial aneurysm is a serious incident with high fatality and morbidity rates. 3 Identification of factors contributing to the risk of intracranial aneurysm development, growth, and rupture is an active field of investigation. Apart from several disorders like polycystic kidney disease or Marfan syndrome, elements such as genetic factors, family history, female sex, and age are linked to an increased risk of aneurysm development. Intracranial aneurysm site, size, and shape are further strongly associated with the risk of rupture. [4][5][6] Detection of an intracranial aneurysm before it becomes symptomatic allows endovascular or surgical treatment of the aneurysm before it ruptures and may thus prevent death or morbidity.
DSA is still considered the criterion standard in evaluating intracranial vessels and detection of intracranial aneurysms 7 ; however, it is inconvenient for primary diagnoses because it is invasive and time-consuming. CTA and MRA are noninvasive methods widely used in clinical routine. Unlike DSA and CTA, which are based on x-ray imaging, MRA does not cause radiation exposure. It is therefore the preferred technique for screening asymptomatic patients for intracranial pathology. The number of radiology examinations performed for diagnoses is steadily increasing. 8,9 Given the growing workload of radiology departments, physician fatigue with the inherent risk of missed diagnosis of potentially significant findings is a relevant concern. Hence, a reliable method for automated detection of intracranial aneurysms from routine diagnostic imaging would be of great utility in clinical routine.
Rapid advances in the field of computing and a growing amount of data prompted the rise of convolutional neural networks (CNNs), a specific type of deep learning network architecture, for segmentation, classification, and detection tasks in medical imaging. [10][11][12] The training process of a CNN is straightforward to implement because the features for discrimination of the desired output classes are not designed but learned in an automated fashion from the input data. 13 Several approaches for automated detection of intracranial aneurysms from noninvasive imaging have been proposed in the literature. [14][15][16][17] However, a deep learning-based method for sufficient detection of intracranial aneurysms from 3D TOF data has not yet been reported, to our knowledge. The aim of this study was to investigate the potential of a deep learning algorithm for automated detection of intracranial aneurysms from 3D TOF-MRA clinical data.

Dataset
This retrospective study was approved by the Independent Ethics Committee at the RWTH Aachen Faculty of Medicine. The requirement for informed consent was waived. From an internal database belonging to our department, we incorporated data from all patients with a 3D TOF-MRA examination of at least 1 previously untreated intracranial aneurysm. Images were obtained for clinical purposes between 2015 and 2017. After we removed protected patient information and substituted subject identifiers, examinations were retrieved from the local PACS. The dataset consisted of 85 examinations. Of those, 72 image sets originated from our department. Sixty of these examinations were performed on a 3T scanner (Magnetom Prisma; Siemens; Erlangen, Germany). Twelve examinations were performed on a 1.5T scanner (Magnetom Aera; Siemens).
Thirteen examinations included in this dataset originated from external departments and were performed on different scanners.
We included all TOF acquisitions with at least 1 previously untreated aneurysm, irrespective of etiology, symptomatology, and configuration (saccular, fusiform, and dissecting). The aneurysms were located in the internal carotid arteries, the anterior cerebral arteries (including the anterior communicating artery), the middle cerebral arteries, or the posterior circulation (including the vertebral, basilar, posterior cerebral, and posterior communicating arteries). One patient had polycystic kidney disease, while the remainder had incidental findings. Exclusion criteria were previous treatment (coil embolization or surgical clipping) or pronounced motion artifacts, preventing accurate segmentation.
The DeepMedic (Version 0.6.1; https://biomedia.doc.ic.ac.uk/software/deepmedic/) CNN was used 18 with the required preprocessing applied to the dataset 19 : voxel size resampling (0.5 × 0.5 × 0.5 mm³) and intensity normalization to a zero-mean, unit-variance space. To evaluate the impact of preprocessing on the performance of the CNN, we modified our dataset using different BET2 skull-stripping settings (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/BET) 20 and N4 bias correction. 21 The ground truth segmentation was performed by a neuroradiology resident experienced in cranial diagnostic imaging. On the basis of radiologic reports, anonymized TOF data were critically reviewed, and aneurysms were manually annotated in a voxelwise manner using the manual segmentation tool of ITK-SNAP (www.itksnap.org). 22 Intrarater reliability was studied using the Pearson correlation coefficient.
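The two required preprocessing steps can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function names are hypothetical, the image is treated as a flat list of intensities, and the resampling helper only computes the grid size that an interpolation routine would have to produce for the 0.5-mm isotropic target.

```python
from statistics import fmean, pstdev

def normalize_intensities(voxels):
    """Scale a flat list of voxel intensities to zero mean and unit variance."""
    mean = fmean(voxels)
    std = pstdev(voxels, mu=mean)
    if std == 0:
        raise ValueError("a constant image cannot be normalized")
    return [(v - mean) / std for v in voxels]

def resampled_shape(shape, spacing_mm, target_mm=0.5):
    """Grid size that covers the same field of view at the target voxel size.

    Assumption: the actual interpolation is done by an imaging library;
    this helper only derives the output dimensions.
    """
    return tuple(round(n * s / target_mm) for n, s in zip(shape, spacing_mm))

# Hypothetical example values, not taken from the study data.
normalized = normalize_intensities([10.0, 20.0, 30.0, 40.0])
new_shape = resampled_shape((256, 256, 100), (0.4, 0.4, 0.6))
```

In practice, normalization is often restricted to voxels inside the brain mask; whether that was done here is not stated, so the sketch normalizes over all voxels.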
After evaluation of the dataset, we trained DeepMedic and performed inference to segment aneurysms. Remarkably, 2 aneurysms that had been previously overlooked were detected by the CNN in this early stage. Consequently, the dataset was validated by another radiologist who was blinded to the radiology reports. Complete ground truth was evaluated once again and adjusted accordingly.
The dataset needed division into training, test, and validation sets, to run the CNN and assess its performance. The training set was used for learning, which describes the process of fitting the parameters of the network to learn features for discriminating the output classes. The validation set was used during training to reduce overfitting to the training data. This is done by comparing the Dice similarity coefficient (DSC) (a measure indicative of segmentation accuracy) of the training samples with the DSC of the unknown validation samples and adjusting the learning rate of the network. The test set is used for evaluation of the trained model. 18 Training the model took about 20 hours; inference per case was about 50 seconds on a Titan XP GPU (Nvidia, Santa Clara, California).
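For reference, the DSC between a predicted and a ground truth segmentation is twice the number of shared voxels divided by the total number of voxels in both. A minimal sketch, assuming each segmentation is given as a set of voxel coordinates (the function name and the convention for two empty masks are assumptions):

```python
def dice_coefficient(pred, truth):
    """DSC = 2 * |P ∩ T| / (|P| + |T|) for two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0  # assumption: two empty masks are treated as perfect agreement
    return 2 * len(pred & truth) / (len(pred) + len(truth))

# Hypothetical toy masks: 2 shared voxels, 3 voxels in each mask.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
truth = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
dsc = dice_coefficient(pred, truth)
```

A DSC of 1 indicates identical masks, and a DSC of 0 indicates no overlap.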

DeepMedic and Evaluation
Segmentation of the aneurysms was executed with the DeepMedic framework, a CNN for voxelwise classification of medical imaging data after training with 3D patches at multiple scales. DeepMedic was developed and evaluated for the segmentation of brain lesions. 23 The network consists of 2 pathways with 11 layers. Both pathways are identical, but the input of the second pathway is a subsampled version of the first (see the full architecture in Fig 1). Parameters were set as proposed by Kamnitsas et al 18 : An initial learning rate of 10⁻³ was used and gradually reduced. For optimization, a Nesterov Momentum of 0.6 was set. For better regularization, drop-out and L1 = 10⁻⁶ and L2 = 10⁻⁴ regularization was performed. To accelerate the convergence, we used Rectified Linear Unit activation functions and batch-normalization as implemented in the DeepMedic framework. 23 We used the proposed DeepMedic hybrid sampling scheme. In this strategy, image segments larger than the neural network's receptive field are given as an input to the network. A training batch is built by extracting segments with 50% probability centered on the foreground or background voxels, facilitating an automatic method for balancing the distribution of training samples regarding the size of the desired class in the segment and therefore preventing class imbalance by adjusting to the true distribution of background and aneurysm voxels. 18 With a probability of 50%, the training images were mirrored on the coronal axis to increase the diversity of the training set.
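The balanced sampling and the mirroring augmentation described above can be sketched as follows. This is an illustration only and not DeepMedic's implementation: the function names are hypothetical, volumes are nested lists indexed [z][y][x], and mirroring is assumed to reverse the x index.

```python
import random

def sample_segment_centers(foreground, background, n, p_foreground=0.5, rng=None):
    """Draw n segment centers, each from the foreground pool with probability
    p_foreground and from the background pool otherwise."""
    rng = rng or random.Random()
    centers = []
    for _ in range(n):
        pool = foreground if rng.random() < p_foreground else background
        centers.append(rng.choice(pool))
    return centers

def maybe_mirror(volume, rng, p=0.5):
    """Return a left-right mirrored copy of the volume with probability p.

    Assumption: the mirrored axis is the last (x) index of the nested lists.
    """
    if rng.random() < p:
        return [[row[::-1] for row in axial_slice] for axial_slice in volume]
    return volume

# Hypothetical voxel pools for one training image.
aneurysm_voxels = [(1, 1, 1), (2, 2, 2)]
background_voxels = [(9, 9, 9), (8, 8, 8), (7, 7, 7)]
centers = sample_segment_centers(aneurysm_voxels, background_voxels, 1000,
                                 rng=random.Random(0))
```

Setting p_foreground to 0.1 would correspond to the 90%/10% background-to-aneurysm ratio that the fine-tuning strategy uses.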
We used the EvaluateSegmentation Tool (https://github.com/Visceral-Project/EvaluateSegmentation) 24 to analyze the segmentation results by determining Hausdorff distances and the DSC. For methodologic reasons, each segmented voxel or connected component of voxels in the output binary segmentation was considered a positive detection. Each positive detection that corresponded to an aneurysm in ground truth was considered a true-positive finding, while each positive detection that did not correspond to an aneurysm in ground truth was considered a false-positive finding. In preliminary studies, this approach led to a very high rate of false-positive detections. Because we observed that compared with true-positive detections, false-positives tended to be rather small, we further examined whether the integration of a detection threshold as a postprocessing step, removing connected components smaller than a given volume, would improve our results. On the basis of the composition of our dataset, detection thresholds of 5, 6, and 7 mm³ were studied (Fig 1). To further reduce the number of false-positives, we fine-tuned the network using a modified training strategy in which 90% of the input samples corresponded to background class; and 10%, to aneurysm class, reflecting a more realistic distribution of aneurysms. The learning rate was lowered to 10⁻⁴ and the pretrained weights of the last 3 layers were changed while the training weights of the other layers were kept constant. To study the reliability of true-positive detections and the capability of the system in predicting aneurysm size, we compared the volume segmented by the algorithm with the manually examined volume of the ground truth.
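The detection threshold and the counting of true- and false-positive detections can be sketched as follows. The sketch rests on several assumptions that the text does not specify: components are 6-connected, a component "corresponds" to an aneurysm if the two share at least 1 voxel, and each voxel has a volume of 0.125 mm³ (0.5 mm isotropic). All function names are hypothetical.

```python
from collections import deque

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def connected_components(voxels):
    """Group a set of (z, y, x) voxels into 6-connected components."""
    remaining = set(voxels)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBORS:
                neighbor = (z + dz, y + dy, x + dx)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def filter_and_score(pred, aneurysms, min_volume_mm3, voxel_volume_mm3=0.125):
    """Drop components below the volume threshold, then count detections.

    Returns (number of detected aneurysms, number of false-positive components).
    """
    kept = [c for c in connected_components(pred)
            if len(c) * voxel_volume_mm3 >= min_volume_mm3]
    false_positives = sum(1 for c in kept if not any(c & a for a in aneurysms))
    detected = sum(1 for a in aneurysms if any(c & a for c in kept))
    return detected, false_positives

# Hypothetical prediction: one 48-voxel block (6 mm³) plus one isolated voxel.
block = {(z, y, x) for z in range(3) for y in range(4) for x in range(4)}
speck = {(20, 20, 20)}
aneurysm = {(1, 1, 1), (1, 1, 2)}
```

In this toy case, a threshold of 5 mm³ removes the isolated voxel while keeping the true detection, whereas a threshold of 7 mm³ also removes the 6-mm³ true detection, which mirrors the trade-off between false-positives and sensitivity examined in the study.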
To assess the impact of preprocessing, we evaluated 4 models (A-D). In model A, only the necessary steps to obtain reasonable results from DeepMedic, resampling to isotropic voxel size and intensity normalization, were performed. Additional skull-stripping is advised in the DeepMedic documentation. 18 We used the well-established BET2 skull-stripping method. Skull-stripping in model B was performed with a fixed fractional intensity threshold of 0.2. In model C, the parameter was adjusted manually in each case to obtain an optimal brain outline, without nonbrain structures such as skull or parts of the ocular muscles and nerves. For model D, we used the skull-stripping masks from model C and performed an additional N4 bias correction 25 to evaluate whether low-frequency intensity inhomogeneities in the acquisitions would have an impact on the performance of the algorithm (Fig 1). In this work, each model is depicted as a preprocessing model identifier (A-D), followed by the detection threshold (0, 5, 6, 7). Full preprocessing per case took about 5 minutes on a Core i7-8700K CPU (Intel, Santa Clara, California). Individual creation of a skull-stripping mask was performed by an experienced user and took about 8 minutes for each sample.
Statistical analysis was performed using SPSS software, Version 25.0 (Released 2017; IBM, Armonk, New York). We used the Shapiro-Wilk test to test for normality. Significance values of normality tests are only reported for cases in which the normality assumption was violated. A Kruskal-Wallis test was used for the split-validation of maximum diameters.

Comparisons among the Models
We hypothesized the 4 different levels of preprocessing to each be improvements over the previous version. Therefore, sensitivity values of each preprocessing model were compared only with those of its closest neighbor by testing for differences in the proportions of hits and misses, using McNemar tests. These tests were chosen over χ² tests because the values obtained from each model were not independent of one another. Comparing each model with its closest neighbor yielded 3 comparisons (A0 versus B0, B0 versus C0, C0 versus D0); thus, significance levels were corrected for 3 comparisons using a Bonferroni correction.
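As an illustration of this paired comparison, the exact (binomial) form of the McNemar test uses only the discordant pairs, that is, aneurysms detected by one model but missed by the other. The following is a minimal sketch with hypothetical function names and counts; the study itself used SPSS, and the Bonferroni step is written here as multiplying the P value by the number of comparisons.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value.

    b: aneurysms detected by model 1 only; c: detected by model 2 only.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p * n_comparisons)

# Hypothetical discordant counts, not taken from the study.
p_raw = mcnemar_exact(1, 5)
p_adjusted = bonferroni(p_raw, 3)
```

Because concordant pairs do not enter the statistic, two models with similar overall sensitivity can still differ if they miss different aneurysms.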
False-positives per case (FPs/case) were compared for each preprocessing model using a Friedman test. Post hoc tests were run using Wilcoxon signed rank tests for each closest neighbor.
DSCs of each preprocessing model were compared using Friedman tests. Post hoc tests were run using Wilcoxon signed rank tests for each closest neighbor. Missing values, caused by the inability of the evaluation tool to analyze volumes with no segmented voxels, were set to zero.
Hausdorff distances of each preprocessing model were analyzed using a linear mixed model, which included a random subject factor and "model" as the sole fixed factor. This linear mixed model was chosen over a repeated-measures ANOVA because the linear mixed model can analyze missing values better; unlike DSCs, a Hausdorff distance of zero would not accurately describe the inability of the tool to analyze a volume with no segmented voxels.
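The reason an empty segmentation yields a missing value rather than zero follows from the definition of the Hausdorff distance, which is the largest distance from any point in one set to its nearest point in the other set. A brute-force sketch, assuming point sets in voxel coordinates and a hypothetical function name (the study used the EvaluateSegmentation Tool):

```python
from math import dist

def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(dist(p, q) for q in b) for p in a)

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance; None when either set is empty,
    because the distance is undefined in that case."""
    if not a or not b:
        return None
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical toy point sets in voxel coordinates.
truth = [(0, 0, 0), (0, 0, 1)]
pred = [(0, 0, 0), (0, 3, 5)]
hd = hausdorff_distance(truth, pred)
```

A single outlying false-positive voxel dominates the result, which is why this metric is sensitive to small, distant false-positive components.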

Comparisons within the Models
We hypothesized that each of the postprocessing models reduces the number of false-positives sequentially. Thus, sensitivity values for each detection threshold were compared with those of the closest neighbor within each model by testing for a difference in the proportions of hits and misses using McNemar tests. This yielded 3 comparisons per preprocessing model (0 versus 5, 5 versus 6, and 6 versus 7).
FPs/case were compared for each detection threshold using a Friedman test. Post hoc tests were run using Wilcoxon signed rank tests comparing each closest neighbor.

Size and Location
Increased aneurysm size is associated with an increased rupture risk. 4 However, consensus classifications of aneurysms based on aneurysm size are missing. To study the impact of aneurysm size on the detection rate, we classified aneurysms on the basis of maximum diameter as follows: In the literature, aneurysms with a maximum diameter of ≤3 mm are generally considered tiny. 26 For simplification, we termed these findings small aneurysms. A distinct increased risk of rupture was identified for aneurysms with a diameter of >7 mm. 6 We therefore defined aneurysms of >3 but ≤7 mm as medium, and those of >7 mm as large. Additionally, aneurysms were categorized on the basis of their location.
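The size categories defined above, and the per-category sensitivity derived from them, can be expressed in a short sketch. The function names and the example values are hypothetical.

```python
from collections import defaultdict

def size_category(max_diameter_mm):
    """Map a maximum diameter to the categories used in this study:
    small (<=3 mm), medium (>3 to <=7 mm), large (>7 mm)."""
    if max_diameter_mm <= 3:
        return "small"
    if max_diameter_mm <= 7:
        return "medium"
    return "large"

def sensitivity_by_size(aneurysms):
    """aneurysms: iterable of (max_diameter_mm, detected) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for diameter, detected in aneurysms:
        category = size_category(diameter)
        totals[category] += 1
        hits[category] += bool(detected)
    return {category: hits[category] / totals[category] for category in totals}

# Hypothetical detections, not taken from the study data.
example = [(2.0, False), (3.0, True), (5.5, True), (7.0, True), (9.0, True)]
per_size = sensitivity_by_size(example)
```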
Sensitivity values of these categories were compared for both categorizations using Fisher exact tests rather than χ² tests because the cases numbered below 5 for certain cells. Spearman rank correlation coefficients were calculated between ground truth and predicted volumes because the normality assumption was violated in all samples.
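The Spearman coefficient is the Pearson correlation of the ranks, which is why it does not require normally distributed volumes. A minimal sketch with hypothetical function names, using average ranks for ties (the study used SPSS):

```python
from math import sqrt
from statistics import fmean

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ground truth and predicted volumes in mm³.
rho = spearman([10.0, 20.0, 30.0, 40.0], [1.0, 4.0, 9.0, 100.0])
```

Because only the ordering matters, the strongly nonlinear but monotonic relation in the example still yields a coefficient of 1.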
The locational distribution of aneurysms was as follows: Forty-two percent of all aneurysms were located in internal carotid arteries; 17%, in the anterior cerebral arteries, including the anterior communicating artery; 23%, in the middle cerebral arteries; and 19%, in the posterior circulation, including the vertebral, basilar, posterior cerebral, and posterior communicating arteries.

Sensitivity among Models
Comparing sensitivity values of the nearest neighbors' preprocessing models (A0, B0, C0, and D0) yielded no significant differences (P = 1, binomial distribution used for all comparisons). Even the models showing the largest difference (A0 versus D0) did not approach significance (P = .29, binomial distribution used, uncorrected for multiple comparisons).

False-Positives per Case among Models
Analyses of false-positive rates between the preprocessing models revealed a significant difference among models (χ²[3] = 136.144, P < .001). Pair-wise comparisons indicated a significant difference between models A0 and B0 (z = 7.425, P < .001), but not B0 and C0 or C0 and D0 (z = 1.878, P = .18 and z = 0.991, P = .97, respectively). For each preprocessing model, the impact of detection thresholds on sensitivity, FPs/case, and positive predictive value was studied (Fig 2).
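The three detection metrics reported for each model can be computed from per-case counts. The sketch below is illustrative only: the function name and the input layout are hypothetical, and it assumes that each detected aneurysm contributes exactly 1 true-positive detection.

```python
def detection_metrics(cases):
    """cases: list of (n_aneurysms, n_detected, n_false_positives) per examination.

    Returns (sensitivity, false-positives per case, positive predictive value).
    """
    total_aneurysms = sum(n for n, _, _ in cases)
    true_positives = sum(tp for _, tp, _ in cases)
    false_positives = sum(fp for _, _, fp in cases)
    sensitivity = true_positives / total_aneurysms
    fps_per_case = false_positives / len(cases)
    ppv = true_positives / (true_positives + false_positives)
    return sensitivity, fps_per_case, ppv

# Hypothetical counts for 3 examinations, not taken from the study.
sensitivity, fps_per_case, ppv = detection_metrics([(2, 2, 1), (1, 0, 3), (1, 1, 0)])
```

Raising the detection threshold lowers the false-positive count, and with it FPs/case, but it can also remove true-positive components, so sensitivity and positive predictive value move in opposite directions.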

Sensitivity within Models
For model A, no significant changes in sensitivity were found between detection thresholds 0, 5, 6, and 7 mm³ (P = 1, binomial distribution used for all comparisons). For model B, a significant decrease in sensitivity was found between thresholds B0 and B5 (P = .05, binomial distribution used). Sensitivity did not differ between thresholds B5 and B6, or B6 and B7 (P = 1, binomial distribution used for both comparisons). For model C, a significant decrease in sensitivity was found between thresholds C0 and C5 (P < .001, binomial distribution used). Sensitivity did not differ between thresholds C5 and C6 or C6 and C7 (P = 1, binomial distribution used for both comparisons). For model D, a significant decrease in sensitivity was found between thresholds D0 and D5 (P < .001, binomial distribution used). Sensitivity did not differ between thresholds D5 and D6 or D6 and D7 (P = 1, binomial distribution used for both comparisons). The resulting decrease in sensitivity ranged between 2% (model A) and 10% (model C).

False-Positives per Case within Models
Normality was violated for all models without thresholding applied (P = .008 for A0, P < .001 for all other models).

Impact of Aneurysm Size
To evaluate the impact of aneurysm size on sensitivity, we divided aneurysms into 3 categories based on maximum diameter, as described above. Detection sensitivity was found to be dependent on aneurysm size (test statistics are shown in Table 1).
The Shapiro-Wilk test revealed that in all cases, the normality assumption was violated by the ground truth volumes and/or the predicted volumes of the models. The ground truth volume showed a negative correlation with the predicted volume of each preprocessing model for the group of small aneurysms. The highest correlation was found in preprocessing model A0 for large aneurysms. The correlation values for all aneurysm sizes combined were, in all models, similar to those of large aneurysms (Table 2).

Impact of Aneurysm Location
Sensitivity values among locations did not show a significant difference (test statistics are shown in Table 3).

Accuracy of Segmentation: DSC and Hausdorff Distance
The distribution of DSCs violated normality for all models and thresholds (P ≤ .001 for all models). DSCs differed significantly among preprocessing models A0, B0, C0, and D0 (χ²[3] = 50.228, P < .001). Pair-wise comparisons between nearest neighbors indicated that this difference originated from the difference between A0 and B0 (z = 5.44, P < .001). DSCs did not differ among models B0, C0, and D0.
DSC and Hausdorff distance values of the different preprocessing models are shown in Table 4. After we fine-tuned model A0, the DSC increased significantly from 0.47 ± 0.28 to 0.50 ± 0.30 (P < .001), and the Hausdorff distance changed from 90.16 ± 22.25 to 85.6 ± 22.69 (P = .004) without significant changes in sensitivity or the number of FPs/case.

Visual Inspection
Two examples of our dataset are shown in Fig 3. The model was able to detect aneurysms of varying size, location, and regional intensity distribution in the 2 displayed volumes. By means of a postprocessing step, false-positive components were removed.

DISCUSSION
Machine learning applications, in particular deep learning, have recently gained increased attention in the domain of medical imaging. These types of algorithms, specifically CNNs, are top performers in most medical-image analysis competitions. The ease of implementation of CNNs in processing pipelines 13 makes them accessible to a broad range of researchers. Machine learning is becoming a tool of growing importance in radiology and will probably change the way radiologists work.
In this study, we demonstrated the great potential of a CNN for reliable detection of intracranial aneurysms from 3D TOF-MRA. Demand for radiologic imaging is constantly growing; therefore, the steadily increasing workload must be managed by radiology departments. 27 Computer-aided detection tools may assist in preventing diagnostic errors that could occur due to a physician's fatigue or lack of concentration. In a clinical setting, cranial imaging is performed for several diagnostic purposes. However, potentially relevant findings are often missed if a conspicuity corresponding to the primary diagnostic purpose of an examination is found. 28 This phenomenon, termed "satisfaction of search," is frequently observed in radiologic practice and could potentially be reduced by sufficient computer-aided detection tools. To evaluate a realistic scenario, we included unselected and therefore rather heterogeneous images (ie, different scanners, different field strengths) with varying image quality (signal-to-noise ratio, motion artifacts).
Solely in terms of overall sensitivity, the best model was A0, without application of skull-stripping or bias correction, with a sensitivity of 90%. However, this model also had a FPs/case value of 6.1, which is rather high. The highest positive predictive value of 0.57 was achieved with model D7, consisting of customized skull-stripping and N4 bias correction. A sensitivity of 79% was achieved with a FPs/case rate of 0.8 ± 1.3. The amount of preprocessing had a significant impact on the rate of false-positives. In terms of sensitivity, no significant differences between preprocessing models were detected. Using a thresholding method that removes segmentation components below a distinct volume, we were able to further decrease the rate of false-positives.
Aneurysm size had a distinct impact on the performance of the CNN: For small aneurysms, a lower sensitivity value was measured. These missed detections resulted in low correlation values between ground truth volumes and the model-predicted volumes for small aneurysm sizes. This correlation increased for medium-sized aneurysms, which were detected with a higher certainty but in some cases lacked segmentation precision. The correlation for large aneurysms and the overall correlation were high, the latter mainly due to a good segmentation capability for medium and large aneurysms. The DSC could be improved significantly by skull-stripping from 47% ± 28% to 53% ± 29%. The Hausdorff distance likewise improved from a value of 90 ± 22 to 70 ± 17.
Small aneurysms were underrepresented in the dataset; increasing this number would possibly improve the ability of the model to segment those aneurysms and predict their size better. A larger dataset would also decrease a possible overfitting of the model to the training data. We endeavored to address this issue using 5-fold cross-validation and flipping the image as a data augmentation concept. The ground truth segmentation is subjective and may differ among radiologists. A similar study reported intra- and interoperator variability of 20% ± 15% and 28% ± 12%, respectively, for the segmentation of brain tumors. 29 We attempted to overcome this issue by having another radiologist evaluate our dataset.
Several approaches for automated detection of intracranial aneurysms from noninvasive cranial imaging have been reported previously. 14,15,17 However, most were limited by either the use of conventional computer-aided diagnosis algorithms or being applicable only on 2D images. For instance, Miki et al 14 increased the number of detections of 2 radiologists using a computer-aided diagnosis tool for MRA images. Their system is based on different handcrafted features 30 and reached a sensitivity of 82% in source and reconstructed images of a 3T MR imaging device. Štepán-Buksakowska et al 15 used a computer-aided diagnosis algorithm that applies global thresholding and region-growing schemes. They achieved a mean sensitivity of 83.6% by combining radiologists' examinations with their tool. Nakao et al 17 used a CNN for detecting aneurysms in 2D MIPs. Their tool detected aneurysms with a sensitivity of 94.2% with 2.9 FPs/case. However, their work is limited to 2D projections.
The main limitation of the presented algorithm is poor specificity. We acknowledge that this issue currently limits clinical utility. However, we demonstrated that an algorithm that was originally developed for segmentation tasks is able to detect aneurysms reliably from noninvasive cranial imaging, and this requires only a very limited number of training samples. We observed that several, easily applicable postprocessing steps allow distinct reduction of the number of false-positives. Because data augmentation is already included, we assume that for further improvement of specificity, enlargement of the sample size would be necessary. Given the low number of untreated aneurysms in MRA, this would require a multi-institutional approach. Fine-tuning the network on a larger dataset with a modified training strategy for a more realistic distribution of classes might improve not only the DSC and Hausdorff distance but also sensitivity and specificity.
In this study, the performance of DeepMedic was validated in a clinical dataset, which was based on radiology reports. To further investigate whether our approach might contribute to an improvement of aneurysm detection in a clinical setting, the performance of DeepMedic should be compared with that of human readers. Another limitation is that the algorithm was trained solely on cases that had intracranial aneurysms. Because DeepMedic works as a voxelwise classifier, this was done for methodologic reasons. The algorithm learns to differentiate between physiologic vessel anatomy and aneurysms by classifying each voxel within a volume as a positive (aneurysm) or negative (no aneurysm) prediction. Every dataset includes not only aneurysms but also physiologic vessels. Hence, every aneurysm-free voxel of a brain vessel could be considered a negative finding in a voxelwise classifier; therefore, one could argue that the algorithm can also learn to separate aneurysms from normal vessel anatomy using only pathologic cases. However, given the relatively low prevalence of intracranial aneurysms in the general population, this approach might lead to overprediction, which explains, to some extent, the relatively high number of false-positive cases observed in our study.
To obtain a highly autonomous system, a robust and automated skull-stripping algorithm for TOF sequences is necessary to obtain a reliable brain mask comprising all relevant vessels without extracranial or nonbrain tissues. Most skull-stripping methods perform best with T1-weighted images and need to be adjusted manually for different acquisition sequences. 31 Finally, in further research, it would be advantageous to compare the performance of DeepMedic in terms of aneurysm detection with that of other CNN architectures.

CONCLUSIONS
This study demonstrates that our CNN-based system can detect intracranial aneurysms with high sensitivity in a 3D TOF-MRA dataset. The dataset, comprising acquisitions of different field strengths and variable image quality, was created to evaluate a scenario similar to clinical reality. Adequate pre- and postprocessing significantly reduced the number of false-positives. The predicted aneurysm volume correlated well with the ground truth volume for medium- and large-sized aneurysms; hence, the system could also serve as a tool to predict aneurysm size.