Evaluating the Effects of White Matter Multiple Sclerosis Lesions on the Volume Estimation of 6 Brain Tissue Segmentation Methods

BACKGROUND AND PURPOSE: The accuracy of automatic tissue segmentation methods can be affected by the presence of hypointense white matter lesions during the tissue segmentation process. Our aim was to evaluate the impact of MS white matter lesions on the brain tissue measurements of 6 well-known segmentation techniques. These include straightforward techniques such as Artificial Neural Network and fuzzy C-means as well as more advanced techniques such as the Fuzzy And Noise Tolerant Adaptive Segmentation Method, fMRI of the Brain Automated Segmentation Tool, SPM5, and SPM8. MATERIALS AND METHODS: Thirty T1-weighted images from patients with MS acquired on 3 different scanners were segmented twice, first including white matter lesions and then masking the lesions before segmentation and relabeling them as WM afterward. The differences in total tissue volume and in tissue volume outside the lesion regions were computed between the images obtained with the 2 approaches. RESULTS: Total gray matter volume was overestimated by all methods as lesion volume increased. The tissue volume outside the lesion regions was also affected by white matter lesions, with differences of up to 20 cm3 on images with a high lesion load (≈50 cm3). SPM8 and the Fuzzy And Noise Tolerant Adaptive Segmentation Method were the methods least influenced by white matter lesions, whereas the effect of white matter lesions was more prominent on fuzzy C-means and the fMRI of the Brain Automated Segmentation Tool. CONCLUSIONS: Although lesions were removed after segmentation to avoid their impact on tissue segmentation, the methods still overestimated GM tissue in most cases. This finding is especially relevant because on images with a high lesion load, this bias will most likely distort actual tissue atrophy measurements.

During the past few years, MR imaging brain tissue segmentation techniques have become important tools in the clinical evaluation of MS and its progression because they make it possible to measure changes in brain atrophy and lesion load. [1][2][3] However, white matter lesions (WMLs) can significantly affect tissue volume measurements if these lesions are included in the segmentation process. [4][5][6] Several studies have analyzed the effects of WMLs on the brain tissue measurements of common segmentation techniques such as SPM5 (http://www.fil.ion.ucl.ac.uk/spm/) 7 and the FMRIB Automated Segmentation Tool (FAST, http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FAST). 8 Chard et al 5 studied the effect of synthetic lesions on SPM5 segmentations for different WML voxel intensities (from 30% to 90% of normal WM intensity) and lesion loads (from 10 to 20 cm3). The authors reported that GM volume was overestimated by ≈2.3%, whereas WM volume was underestimated by ≈3.6% in scans with 15 cm3 of simulated lesions. More recently, Battaglini et al 4 also analyzed the effects of different WML intensities and lesion loads on tissue measurements obtained with FAST. They again showed that total GM volume tended to increase with higher lesion loads in images with simulated lesions. Gelineau-Morel et al 6 performed a similar study of the effects of simulated and real WMLs, but on tissue volume measurements outside lesion regions. They reported that on images with simulated lesions, FAST clearly underestimated GM outside lesion regions as lesion volume increased and as lesion intensities approached those of GM. The effect of WMLs on real scans was smaller, but FAST still tended to underestimate GM with increasing lesion loads.
Various studies have also analyzed the correlation between brain tissue atrophy and MS disability progression. 9,10 These studies reported brain atrophy rates of 0.3% to 0.5% of brain parenchymal volume per year in patients with MS, 9,10 with GM and WM volume decreases of up to 0.4% and 0.2% per year, respectively. 10 These rates, together with results such as those of Battaglini et al 4 and Gelineau-Morel et al, 6 indicate that part of the brain atrophy could be hidden by the inclusion of WMLs in tissue segmentation.
In this study, we quantitatively evaluated the effects of WMLs on brain tissue volume measurements to determine the extent to which tissue estimates are affected by changes in WML volume and intensity. In contrast to other similar studies, 4-6 our analysis extended the number of segmentation methods involved, offering a comparative evaluation of the effects of WMLs on the volume measurements of 6 segmentation methods. Furthermore, given the reported correlation between brain atrophy rates and disability progression, 9,10 extending the analysis of the effects of simulated WMLs to real data from patients with MS may be clinically relevant for the MS community; hence, our analysis focused exclusively on T1-weighted images from patients with clinically confirmed MS.
All T1-weighted patient images were processed with the same pipeline (Fig 2). The internal skull-stripping and intensity-correction options of SPM5, SPM8 (http://www.fil.ion.ucl.ac.uk/spm/software/), and FAST were disabled. Instead, to reduce the differences in brain area and image signal intensity produced by different preprocessing tools, we skull-stripped all images with the Brain Extraction Tool (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/BET) 12 and intensity-corrected them with N3. 13 Next, 2 sets were produced from the preprocessed images: an original set, in which WMLs were included as part of the tissue to be segmented, and a masked set, in which WMLs were masked out before tissue segmentation and relabeled as WM afterward, following the same procedure used by the radiologists of the 3 hospitals.
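The masking and relabeling step can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the study's code: it assumes the skull-stripped, N3-corrected image and a binary lesion mask are already loaded as arrays of the same shape, that setting lesion voxels to 0 makes a segmentation tool treat them as background, and it uses a hypothetical label coding (0 = background, 1 = CSF, 2 = GM, 3 = WM). The function names are illustrative only.

```python
import numpy as np

# Hypothetical label coding (not from the study): 0 = background, 1 = CSF, 2 = GM, 3 = WM.
BACKGROUND, CSF, GM, WM = 0, 1, 2, 3


def mask_lesions(image: np.ndarray, lesion_mask: np.ndarray) -> np.ndarray:
    """Return a copy of the preprocessed image with lesion voxels set to 0.

    Assumption: the segmentation tool treats zero-valued voxels as background,
    so lesion voxels do not enter the tissue intensity distributions.
    """
    masked = image.astype(float)  # astype returns a copy by default
    masked[lesion_mask.astype(bool)] = 0.0
    return masked


def relabel_lesions_as_wm(labels: np.ndarray, lesion_mask: np.ndarray) -> np.ndarray:
    """Assign every lesion voxel to WM in a segmented label map."""
    relabeled = labels.copy()
    relabeled[lesion_mask.astype(bool)] = WM
    return relabeled


# Toy 2 x 2 example: the bottom-right voxel is a lesion.
image = np.array([[10.0, 60.0], [100.0, 70.0]])
lesions = np.array([[0, 0], [0, 1]])
masked = mask_lesions(image, lesions)
# Stand-in for the label map a segmentation method would return for `masked`.
labels = np.array([[CSF, GM], [WM, BACKGROUND]])
final = relabel_lesions_as_wm(labels, lesions)
```

The original set skips `mask_lesions` and segments `image` directly; only the masked set goes through both steps.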

Segmentation Methods
The set of methods comprised 6 well-known automatic brain tissue segmentation techniques: Artificial Neural Network (ANN), fuzzy C-means (FCM), Fuzzy And Noise Tolerant Adaptive Segmentation Method (FANTASM), FAST, SPM5, and SPM8. ANN and FCM were implemented for this study, whereas the remaining methods were obtained from available repositories. The ANN method is based on self-organizing maps, also known as Kohonen networks. 14 ANN was implemented in the Matlab 7.12 environment (MathWorks, Natick, Massachusetts) following the technique proposed by Tian et al. 15 FCM 16 and FANTASM 17 are both based on fuzzy clustering. FCM implements the classic fuzzy-clustering approach, whereas FANTASM adds neighborhood information to increase the robustness of the method to intensity inhomogeneity artifacts and noise. FCM was also implemented in Matlab following the technique described by Pham, 16 with clusters initialized according to Bezdek et al. 18 FANTASM is included in the MIPAV toolbox (http://mipav.cit.nih.gov). FAST 8 guides the segmentation with spatial information through the optimization of a hidden Markov random field, and the method is included in the fMRI of the Brain Software Library toolbox (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/). SPM5 and SPM8 7 are based on the iterative optimization of a Gaussian mixture model, weighting the probability of belonging to a certain tissue class with a priori spatial information from tissue-probability atlases. SPM8, however, includes several changes intended to improve registration and tissue segmentation. Both methods are included in the SPM8 toolbox (http://www.fil.ion.ucl.ac.uk/spm/software/spm8). All methods were run with default parameters.
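To make the intensity-only approach concrete, the classic fuzzy C-means update rules can be sketched in NumPy. This is a minimal sketch, not the Matlab implementation used in the study: the fuzzifier m = 2, the convergence tolerance, and the percentile-based center initialization (used here in place of the Bezdek et al initialization) are all assumptions, and the synthetic intensities are made up for illustration.

```python
import numpy as np


def fuzzy_c_means(x, n_clusters=3, m=2.0, max_iter=100, tol=1e-5):
    """Classic fuzzy C-means on a 1-D array of voxel intensities.

    Returns (centers, memberships); memberships has shape (n_voxels, n_clusters)
    and each row sums to 1.
    """
    x = np.asarray(x, dtype=float).ravel()
    # Assumption: centers start at evenly spaced percentiles of the intensities.
    centers = np.percentile(x, np.linspace(10, 90, n_clusters))
    for _ in range(max_iter):
        # Voxel-to-center distances; the epsilon avoids division by zero.
        dist = np.abs(x[:, None] - centers[None, :]) + 1e-12
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij) ** (2 / (m - 1)).
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=2)
        # Center update: mean of intensities weighted by u ** m.
        w = u ** m
        new_centers = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
        shift = np.max(np.abs(new_centers - centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, u


# Synthetic intensities standing in for CSF, GM, and WM on a T1-weighted image.
rng = np.random.default_rng(0)
intensities = np.concatenate([
    rng.normal(30.0, 5.0, 500),   # CSF-like (darkest on T1)
    rng.normal(70.0, 5.0, 500),   # GM-like
    rng.normal(110.0, 5.0, 500),  # WM-like (brightest on T1)
])
centers, u = fuzzy_c_means(intensities)
hard_labels = np.argmax(u, axis=1)  # index of the cluster with the highest membership
```

Because the memberships depend only on intensity, a hypointense lesion voxel lying closer to the GM center than to the WM center receives its highest membership in GM, which is the behavior examined in the rest of the study.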

Evaluation
Images from both the original and masked sets were segmented into GM, WM, and CSF by each of the 6 segmentation methods. We then computed the normalized tissue volumes as the number of voxels classified as GM, WM, and CSF, respectively, divided by the total number of voxels. Three analyses were performed on these data. First, we analyzed how lesion voxels were classified by each segmentation method to establish to what extent the tissue volumes reported by each algorithm on the original and masked images could be expected to differ. Second, we analyzed the direct effect of lesions on global volume estimation by computing the differences in total tissue volume as the percentage of change between original and masked images. For example, in the case of GM tissue:

$$\Delta\mathrm{NGMV} = 100 \times \frac{\mathrm{NGMV}_{\mathrm{Original}} - \mathrm{NGMV}_{\mathrm{Masked}}}{\mathrm{NGMV}_{\mathrm{Masked}}}$$

where NGMV_Original and NGMV_Masked stand for the normalized gray matter volumes of the original and masked images, respectively. Third, we investigated the indirect effects of lesions on the tissue volume outside lesion regions, that is, tissue volumes estimated from images segmented with lesions included but computed without counting the lesion voxels.
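The 3 analyses can be sketched on label maps as follows. This is an illustrative sketch, not the study's code. Assumptions: the hypothetical label coding 0 = background, 1 = CSF, 2 = GM, 3 = WM; "total number of voxels" is taken to mean non-background (brain) voxels; for the outside-lesion analysis, lesion voxels are dropped from both numerator and denominator; and the percentage of change uses the masked image as reference, so positive values mean overestimation on the original image.

```python
import numpy as np

# Hypothetical label coding: 0 = background, 1 = CSF, 2 = GM, 3 = WM.
BACKGROUND, CSF, GM, WM = 0, 1, 2, 3
CLASSES = {"CSF": CSF, "GM": GM, "WM": WM}


def normalized_volumes(labels, exclude=None):
    """Fraction of brain (non-background) voxels assigned to each tissue class.

    If `exclude` is given (e.g., the lesion mask), those voxels are dropped from
    both the numerator and the denominator.
    """
    brain = labels != BACKGROUND
    if exclude is not None:
        brain &= ~np.asarray(exclude, dtype=bool)
    total = brain.sum()
    return {name: float(((labels == code) & brain).sum() / total)
            for name, code in CLASSES.items()}


def percent_change(original, masked):
    """Percentage of change of a volume on the original image relative to the masked image."""
    return 100.0 * (original - masked) / masked


def lesion_classification(labels, lesion_mask):
    """Percentage of lesion voxels assigned to each tissue class."""
    lesion_labels = labels[np.asarray(lesion_mask, dtype=bool)]
    return {name: 100.0 * float((lesion_labels == code).mean())
            for name, code in CLASSES.items()}


# Toy 1-D example with one lesion voxel (index 5).
lesion = np.array([0, 0, 0, 0, 0, 1, 0], dtype=bool)
seg_original = np.array([GM, GM, WM, WM, CSF, GM, BACKGROUND])  # lesion segmented as GM
seg_masked = np.array([GM, GM, WM, WM, CSF, WM, BACKGROUND])    # lesion relabeled as WM

# Second analysis: total tissue volume.
total_orig, total_mask = normalized_volumes(seg_original), normalized_volumes(seg_masked)
gm_change_total = percent_change(total_orig["GM"], total_mask["GM"])

# Third analysis: tissue volume outside the lesion regions.
out_orig = normalized_volumes(seg_original, exclude=lesion)
out_mask = normalized_volumes(seg_masked, exclude=lesion)
gm_change_outside = percent_change(out_orig["GM"], out_mask["GM"])

# First analysis: how the lesion voxels were classified on the original image.
lesion_pct = lesion_classification(seg_original, lesion)
```

In this toy case the single GM-labeled lesion voxel produces a 50% total GM overestimation but no difference outside the lesion; in real images the outside-lesion difference need not be zero, because the lesion voxels also shift the tissue distributions used to classify every other voxel.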

Statistical Analysis
The correlation among factors (differences in tissue volume, lesion load, and lesion intensity) was calculated with the Pearson linear correlation coefficient (r). The significance level α was set at .05 and was used both to compute 95% confidence intervals and for 2-tailed hypothesis t tests. All statistical analyses were performed in the Matlab environment.

RESULTS
Figure 3 depicts the percentage of WML voxels classified either as WM (Fig 3, top) or as GM (Fig 3, bottom) for each segmentation method and hospital. The proportion of WML voxels classified as GM varied across methods, mostly because of differences among the algorithms. Figure 4 illustrates these differences by showing the output classification of each of the 6 segmentation methods. The differences between hospitals in the percentage of WML voxels classified as GM and WM can be attributed to the acquisition configuration of each scanner, which defines the tissue signal-intensity distributions. The distance between the mean WML and WM signal intensities was largest in H3, as computed by each of the 6 methods (range, from 89.2 ± 4.45% to 92.22 ± 4.45% of WML mean signal intensity with respect to WM), and smallest in H2 (range, from 95.3 ± 1.76% to 100.34 ± 6.39%). As shown in Fig 1, GM/WM contrast is better on the H3 images than on the H1 and H2 images. The correlation between the percentage of lesion classification and lesion size was not significant in any case (r < 0.33, P > .05). In contrast, the percentage of WMLs classified as GM or WM and the distance between the mean WML and WM signal intensities showed a moderate correlation in all hospitals (r > 0.6, P < .01).
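The correlation test can be sketched in plain Python. This is an illustrative sketch, not the Matlab routine used in the study: it computes Pearson r and the t statistic for the null hypothesis of zero correlation, which is then compared against a t distribution with n − 2 degrees of freedom. In practice a library call such as `scipy.stats.pearsonr`, which returns both r and a 2-tailed P value, would do both steps.

```python
import math


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of the same length, with at least 3 values")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: rho = 0; compare with a t distribution with n - 2 df."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

For example, with 10 paired observations, r = 0.6 gives t ≈ 2.12, which falls short of the 2-tailed critical value of about 2.31 for 8 degrees of freedom at α = .05, so the same r can be significant or not depending on the number of images.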
On the basis of our data, the contrast between tissues, computed as the normalized difference between the mean GM and WM signal intensities, was correlated with the distance between the mean WM and WML signal intensities (r = 0.6, P < .001).
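The 2 intensity measures used above can be sketched as follows. The WML measure follows the text (mean WML intensity as a percentage of mean WM intensity). The exact normalization of the GM/WM contrast is not given in this section, so the Michelson-style ratio below, normalized by the sum of the 2 means, is an assumption chosen for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def wml_relative_intensity(wml, wm):
    """Mean WML signal intensity as a percentage of the mean WM signal intensity."""
    return 100.0 * mean(wml) / mean(wm)


def gm_wm_contrast(gm, wm):
    """GM/WM contrast as a normalized difference of mean intensities.

    Assumption: normalization by the sum of the two means (Michelson-style);
    other normalizations, such as dividing by the WM mean, are possible.
    """
    mu_gm, mu_wm = mean(gm), mean(wm)
    return (mu_wm - mu_gm) / (mu_wm + mu_gm)
```

With this convention, a lower WML percentage (as in H3) means lesions lie farther below the WM distribution, and a higher contrast value means better-separated GM and WM distributions.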

Differences in Total Tissue Volume Estimation
The mean percentage differences in total tissue volume between the original and masked images are presented in Table 1. All methods overestimated GM in the original scans, regardless of the hospital, but the overestimation was greater in H2 than in H1 and H3 because of the larger lesion volumes in H2. The differences among methods for the same hospital and tissue were also significantly greater in H2 than in H1 and H3. The abnormally low mean and high SD values observed for SPM5 for both GM (0.10 ± 2.68) and WM (1.04 ± 3.01) in H2 were caused by 2 patients who exhibited very large differences in the opposite direction between their respective original and masked images, which decreased the overall mean difference and increased the SD.
The correlation between the differences in mean total tissue volume and lesion size was significant in all hospitals; that is, lesion size had a direct effect on tissue segmentation. Table 2 shows the Pearson correlation values between differences in tissue volume and lesion size for each method. All methods except SPM5 showed a positive correlation in GM and a negative correlation in WM in H1 and H2. SPM5 showed this correlation in H1 but not in H2, where it was influenced by the abnormal values of the 2 images with the highest lesion loads. In H3, only FCM, FANTASM, and FAST were positively correlated in GM and negatively correlated in WM. The correlation coefficients for ANN, SPM5, and SPM8 in H3 were weak and not significant in both GM and WM.

Volume Estimation of Tissue Outside Lesion Regions
The mean percentage differences in tissue volume outside lesion regions between original and masked images are presented in Table 3. The differences between the images segmented with lesions and images in which the lesions were masked before tissue segmentation were again higher in H2, and the methods still substantially overestimated the GM outside the lesion regions to the detriment of WM, even though analyzed tissues were free of lesion regions. In contrast, only SPM5 and SPM8 reported a noticeable underestimation of GM in H3, also to the detriment of WM.
Differences in tissue volume outside the lesion regions correlated with lesion size for all tissues and hospitals, indicating an effect of lesion size not only on lesion voxels but also on tissue not affected by lesions. Table 4 presents the correlation values for each method. In H1, there was a strong correlation for ANN, FCM, FANTASM, and FAST in all tissues; the values for SPM8 were also significant in GM and CSF. In H2, the correlation was significant for ANN, FCM, and FANTASM in all tissues. In H3, only FCM and FAST showed a significant correlation in all tissues; in WM, FCM, FAST, SPM5, and SPM8 correlated significantly, and all methods except SPM5 and SPM8 showed a significant correlation in CSF.

DISCUSSION
Previous studies have shown that the range of voxel signal intensities composing each tissue distribution can be altered by WMLs if these voxels are included in the segmentation process. 4,5 Lesion load and apparent lesion signal intensity drive the changes in tissue segmentation observed in the original images. For instance, if a portion of the lesion voxels is classified as WM, the overall mean WM intensity decreases, the WM boundary shifts toward darker intensities, and the GM tissue distribution narrows. 4,6 Voxels that should have been classified as GM are then assigned to WM, increasing the WM volume estimate and decreasing the GM volume estimate. If some WML voxels are instead classified as GM, the apparent GM mean intensity increases and the WM tissue distribution narrows. In this case, voxels that would otherwise be classified as WM are assigned to GM, increasing the GM estimate at the expense of the WM estimate. We compared our results with those of previous studies on the effects of WMLs on brain tissue volume measurements. However, given the differences among studies in image data, criterion standards, simulated lesions, and lesion voxel intensities, any comparison beyond an analysis of trends at similar WML intensities and lesion loads should be made with caution. Our experiments follow the same trend as those of Battaglini et al, 4 and both studies show that FAST overestimates total GM volume on images segmented with lesions. Similarly, our results agree with those of Chard et al 5 on simulated data: in both studies, SPM5 overestimated GM on images with lesions. In contrast, our results appear inconsistent with those reported by Gelineau-Morel et al, 6 who found a significant correlation between WML intensity and an underestimation of GM volume outside the lesions, especially when the lesions had intensities similar to the mean GM intensity.
The observed differences can be explained by the distinct signal-intensity profiles of WMLs in each study. In the study of Gelineau-Morel et al, 6 the WML signal intensities were noticeably more hypointense than in our data. The probability of voxels being classified as GM dropped because of the influence of hypointense WML intensities on the tissue distributions. Some WML voxels with signal intensities similar to those of GM were still classified as WM, lowering the signal-intensity threshold between GM and WM. As a result, most of the partial volume voxels with signal intensities at the boundary between GM and WM were classified as WM, artificially reducing the overall number of GM voxels.
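The threshold mechanism described above can be illustrated with a deliberately simplified toy model. It is an assumption-laden sketch, not how any of the 6 methods actually classifies voxels: 2 classes are separated by a nearest-mean rule whose boundary is the midpoint of the GM and WM means, and all intensity values are made up.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def midpoint_threshold(gm, wm):
    """Boundary of a toy nearest-mean rule: voxels below it are labeled GM."""
    return (mean(gm) + mean(wm)) / 2.0


def count_gm(intensities, threshold):
    """Number of voxels the toy rule assigns to GM."""
    return sum(1 for v in intensities if v < threshold)


gm = [55.0, 60.0, 65.0]
wm = [95.0, 100.0, 105.0]
lesions = [75.0, 80.0, 85.0]          # hypothetical hypointense WML intensities
boundary_voxels = [72.0, 78.0, 83.0]  # hypothetical GM/WM partial-volume voxels

baseline = midpoint_threshold(gm, wm)                 # GM mean 60, WM mean 100 -> 80.0
lesions_as_wm = midpoint_threshold(gm, wm + lesions)  # WM mean drops to 90 -> 75.0
lesions_as_gm = midpoint_threshold(gm + lesions, wm)  # GM mean rises to 70 -> 85.0

gm_counts = {
    name: count_gm(boundary_voxels, t)
    for name, t in [("baseline", baseline),
                    ("lesions_as_wm", lesions_as_wm),
                    ("lesions_as_gm", lesions_as_gm)]
}
```

Lesions pooled with WM lower the boundary, so fewer partial-volume voxels are counted as GM (1 instead of 2); lesions pooled with GM raise it, so more are counted as GM (3), matching the 2 directions of bias discussed in the text.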
Our results show that the classification of WML regions is highly dependent on lesion voxel signal intensities and on how these intensities vary relative to the WM signal distribution. Lesion segmentation is clearly determined by this variation, because the more closely WML intensities resemble those of WM, the higher the probability of WML voxels being classified as WM. However, the signal-intensity contrast among tissues also plays an important role because it can influence the number of WML voxels that are classified as GM or WM: as the contrast among distributions increases, more lesion voxels are added to the GM distribution. Although lesion volume is the main factor behind the observed differences in tissue volume across methods, the percentage of lesion voxels classified as GM and WM may also contribute substantially to these differences, especially in images with high lesion loads. Therefore, the relationship between image quality and lesion load may also have to be considered to explain the differences in tissue volume.
SPM8 was the method with the smallest difference in total tissue volume between original and masked images, whereas FAST was the method most affected by lesions. In general, all methods overestimated GM in the original scans, although the differences were larger in H2 than in H1 and H3 because of the higher lesion loads in H2. In H1 and H3, most of the underestimated WM was shifted into GM. The small percentage of lesion voxels segmented as CSF, and especially the low lesion volume, limited the impact of WML voxels on the overall CSF tissue distribution of the original images.
SPM8 and FANTASM were the methods whose tissue volume measurements outside lesion regions were least influenced by WMLs, whereas FCM and FAST showed the largest differences among all methods. Lesion volume also explains the limited effect of WMLs on tissue segmentation outside lesion regions in H1 and H3, compared with images with higher lesion loads such as the H2 images. In H1 and H3, although the behavior differs slightly for each method, the differences in tissues outside the lesion regions are very small. The differences outside the lesion regions are especially important because they highlight the bias introduced by WMLs in the estimation of tissue volume that is not pathologically affected. Comparing the results for total tissue volume and for tissue volume outside lesion regions shows that an important part of the overestimated total GM derives directly from hypointense WML voxels that are classified as GM. The differences among algorithms are also important. Methods such as FCM and ANN, which rely only on signal intensity, introduce more errors in tissue segmentation than methods such as SPM8 and SPM5, which incorporate spatial information. This reinforces the need to select a segmentation algorithm that does not depend on signal intensity alone. However, even though WML voxels were not considered when computing tissue volume outside the lesion regions, there is still a clear tendency toward overestimating GM. On images with a high lesion load, the observed differences in GM volume outside lesion regions reach values equivalent to the expected yearly GM atrophy. 9,10 On the basis of these results, SPM8, FANTASM, and SPM5 are the methods whose brain tissue volume measurements are least influenced by WMLs, especially on images with a high lesion load.
The present study has limitations. The principal limitation is the lack of expert tissue annotations: the study incorporated a relatively large number of images from 3 different hospitals, and this task would have been time-consuming. A second limitation is the sensitivity of the tissue segmentation methods to changes in the skull-stripping mask. Errors in the brain mask may lead to the inclusion of blood vessels with hyperintense signal, such as the internal carotid arteries, which might bias the tissue distributions. A final limitation is the inherent difficulty of comparing our results with those of previous studies, given the differences in the scanner protocols used to acquire the images of patients with MS. Differences in acquisition protocols may explain the differences in lesion intensity profiles between our data and those of previous works. 8,10 Our study shows that such intensity profiles introduce variations in GM and WM tissue distributions.

CONCLUSIONS
The results of this study indicate a direct relationship between the differences in brain tissue volume and changes in lesion load and WML intensity. Of the analyzed methods, SPM8 was the least affected by WMLs in volume estimation, whereas FCM yielded the highest GM overestimation. Furthermore, WMLs affected the tissue volume outside the lesion regions for all methods. SPM8 and FANTASM exhibited the smallest differences in tissue volume outside the lesion regions, whereas the influence of WMLs outside the lesion regions was greater for methods such as FCM and FAST. These latter results are especially important because, even when lesions are masked after segmentation to avoid including lesion voxels segmented as GM in the volume estimation, the methods tend to overestimate GM on images segmented with lesions. On images with a high lesion load, this bias might conceal or distort part of the GM and WM atrophy.

−0.78 b 0.72 0.14 b
SPM8 −0.64 b 0.72 −0.01 b
a Correlation was computed for each method and hospital separately. All values were found to be significant (P < .05) unless otherwise noted.
b Not significant.