Measurement of Cortical Thickness and Volume of Subcortical Structures in Multiple Sclerosis: Agreement between 2D Spin-Echo and 3D MPRAGE T1-Weighted Images

BACKGROUND AND PURPOSE: Gray matter pathology is known to occur in multiple sclerosis and is related to disease outcomes. FreeSurfer and the FMRIB Integrated Registration and Segmentation Tool (FIRST) have been developed for measuring cortical and subcortical gray matter in 3D-gradient-echo T1-weighted images. Unfortunately, most historical MS cohorts do not have 3D-gradient-echo, but 2D-spin-echo images instead. We aimed to evaluate whether cortical thickness and the volume of subcortical structures measured with FreeSurfer and FIRST could be reliably measured in 2D-spin-echo images and to investigate the strength and direction of clinicoradiologic correlations. MATERIALS AND METHODS: Thirty-eight patients with MS and 2D-spin-echo and 3D-gradient-echo T1-weighted images obtained at the same time were analyzed by using FreeSurfer and FIRST. The intraclass correlation coefficient between the estimates was obtained. Correlation coefficients were used to investigate clinicoradiologic associations. RESULTS: Subcortical volumes obtained with both FreeSurfer and FIRST showed good agreement between 2D-spin-echo and 3D-gradient-echo images, with 68.8%–76.2% of the structures having either a substantial or almost perfect agreement. Nevertheless, with FIRST with 2D-spin-echo, 18% of patients had mis-segmentation. Cortical thickness had the lowest intraclass correlation coefficient values, with only 1 structure (1.4%) having substantial agreement. Disease duration and the Expanded Disability Status Scale showed a moderate correlation with most of the subcortical structures measured with 3D-gradient-echo images, but some correlations lost significance with 2D-spin-echo images, especially with FIRST. CONCLUSIONS: Cortical thickness estimates with FreeSurfer on 2D-spin-echo images are inaccurate. Subcortical volume estimates obtained with FreeSurfer and FIRST on 2D-spin-echo images seem to be reliable, with acceptable clinicoradiologic correlations for FreeSurfer.

Gray matter pathology in patients with multiple sclerosis is present from the very early stages of the disease and has been related to long-term disability. 1,2 Therefore, in recent years, research has focused on obtaining accurate markers of GM damage, and different software packages have been developed or optimized for measuring it in MS. FreeSurfer software (http://surfer.nmr.mgh.harvard.edu) 3,4 allows automatic calculation of cortical thickness and the volume of subcortical GM structures by using 3D T1-weighted images. Briefly, the image-processing pipeline includes Talairach transformation of the 3D T1-weighted images and segmentation of the subcortical white matter and deep GM structures, relying on the gray and white matter boundaries and pial surfaces. The FMRIB Integrated Registration and Segmentation Tool (FIRST; http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST) software package 5 automatically segments subcortical GM structures also on the basis of 3D T1-weighted images. Briefly, FIRST is a model-based segmentation and registration program that uses shape and appearance models constructed from manually segmented images. On the basis of the learned models, FIRST searches through linear combinations of shape modes of variation for the most probable shape instance, given the observed intensities in the 3D T1-weighted input images. Both software packages have been shown to be accurate and reproducible. [6][7][8][9][10][11] Studies of cortical pathology using FreeSurfer have shown cortical thinning in patients with MS compared with healthy controls, 12,13 which has been related to lesion volume, disease duration, disability, 12 and cognitive impairment.
14 Also, cortical thinning of the superior frontal gyrus, thalamus, and cerebellum significantly predicted conversion to MS in patients presenting with clinically isolated syndromes, 15 and global cortical thinning over 6 years was significantly associated with a more aggressive disease evolution. 16 The volume of deep GM structures (measured with both FreeSurfer and FIRST) has also been shown to be lower in patients with MS compared with healthy controls, [17][18][19] and it has been related to different clinical disease outcomes such as fatigue, 20 cognitive impairment, [17][18][19]21 disability, 19 and walking function. 22 Both FreeSurfer and FIRST have been optimized for 3D T1-weighted gradient-echo images that incorporate a magnetization-prepared inversion pulse that increases the T1-weighting. 23 Unfortunately, for most of the historical MS cohorts with long-term clinical and radiologic follow-up, only 2D spin-echo (2D-SE) T1-weighted images were acquired, a sequence that does not provide an optimal contrast between gray and white matter, particularly when acquired with high-field magnets. 24 The objectives of this work were the following: 1) to evaluate whether cortical thickness and subcortical volumes obtained with FreeSurfer could be reliably measured with 2D-SE T1-weighted images by using as the criterion standard the same measures obtained with 3D gradient-echo (3D-GE) T1-weighted sequences, 2) to investigate whether subcortical volumes obtained with FIRST could be reliably measured in 2D-SE T1-weighted images by using as the criterion standard the same measures obtained with 3D-GE T1-weighted images, and 3) to assess whether the correlations between clinical outcomes and subcortical normalized volumes obtained with 3D-GE and 2D-SE T1-weighted images had a similar strength and direction.

MATERIALS AND METHODS
Patients with relapsing-remitting MS with 2D-SE and 3D-GE T1-weighted images obtained at the same time were included in the analysis. Clinical and demographic data at the moment of MR imaging acquisition were recorded.

MR Imaging Acquisition and Analysis
All MR images were acquired on a 1.5T scanner (Magnetom Symphony; Siemens, Erlangen, Germany) with a standard head coil. The scanning protocol included 2 precontrast T1-weighted scans: a 3D magnetization-prepared rapid acquisition of gradient echo (TR/TE/TI, 2700/4.88/850 ms; voxel size, 1 × 1 × 1.2 mm³; matrix size, 224 × 256 × 144; flip angle, 10°; receiver bandwidth, 130 Hz; averages, 1; acquisition time, 10 minutes) and a 2D-SE sequence (TR/TE, 450/17 ms; voxel size, 0.98 × 0.98 × 3 mm³; matrix size, 192 × 256 × 46; section gap, 0; receiver bandwidth, 130 Hz; averages, 2; acquisition time, 2 minutes 52 seconds). Both sequences covered the whole brain. FreeSurfer software 3,4 (release Version 5.1.0) was used to obtain cortical thickness and volumes of subcortical structures in all 2D-SE and 3D-GE sequences (Fig 1). Ninety-one GM structures (70 cortical and 21 subcortical) were obtained and used for 2D-SE versus 3D-GE reliability assessment. Cortical parcellation was also grouped into a categoric variable by medial or lateral structures. The estimated total intracranial volume, a measure obtained with FreeSurfer, was used to normalize the volume of subcortical structures as follows: raw subcortical structure volume/estimated total intracranial volume. FIRST software, 5 part of the FMRIB Software Library (FSL; http://www.fmrib.ox.ac.uk/fsl), 25 was used to obtain volumes of 15 subcortical structures in all 2D-SE and 3D-GE sequences (Fig 1). Because FIRST does not provide a measure of total intracranial volume (TIV), this was calculated by obtaining the matrix determinant of each scan (by using the avscale utility of the FMRIB Linear Image Registration Tool [FLIRT; http://www.fmrib.ox.ac.uk], 26,27 part of the FSL) and applying the following formula: (FIRST template volume/matrix determinant) × 1000. Normalized subcortical volume was then calculated as follows: raw subcortical structure volume/TIV.
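
The two normalization formulas above can be illustrated with a minimal sketch. It assumes the matrix determinant has already been read from the avscale output, and it uses the fixed FIRST template volume of 1948.105 quoted in the Discussion; the function names are hypothetical, and because the text does not state units, the code is unit-agnostic.

```python
# Illustrative sketch only; function names are hypothetical, not part of FSL.
FIRST_TEMPLATE_VOLUME = 1948.105  # fixed template volume quoted in the Discussion


def tiv_from_determinant(matrix_determinant, template_volume=FIRST_TEMPLATE_VOLUME):
    """TIV = (FIRST template volume / matrix determinant) x 1000."""
    if matrix_determinant <= 0:
        raise ValueError("matrix determinant must be positive")
    return template_volume / matrix_determinant * 1000.0


def normalized_volume(raw_volume, tiv):
    """Normalized subcortical volume = raw subcortical structure volume / TIV."""
    if tiv <= 0:
        raise ValueError("TIV must be positive")
    return raw_volume / tiv


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the order of operations.
    tiv = tiv_from_determinant(1.25)
    print(normalized_volume(7500.0, tiv))
```

The same `normalized_volume` helper applies to the FreeSurfer estimates if the estimated total intracranial volume is passed in place of the avscale-derived TIV.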

Statistical Analysis
We used the SPSS program (IBM, Armonk, New York) to analyze clinical and demographic data. To assess reliability, we calculated the intraclass correlation coefficient (ICC) between the values obtained with 2D-SE and 3D-GE. ICC estimates of agreement were categorized as the following: slight (0.01-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), and almost perfect agreement (0.81-1.0). Cortical thickness measures were grouped into lateral-versus-medial structures, and 2-tailed χ² tests were used for group comparison. Parametric and nonparametric correlation coefficients were used as appropriate to investigate associations between clinicodemographic data and normalized subcortical volume data. Because this was an exploratory study, correction for multiple comparisons was not performed. Statistical significance was set at P < .05.
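
The reliability step can be sketched as follows. The text does not state which ICC form was computed in SPSS, so the example assumes a two-way, absolute-agreement, single-measure ICC; that choice, the function names, and the "unclassified" label for values outside the listed ranges are assumptions made for illustration. Only the category boundaries come from the text.

```python
# Illustrative sketch; the ICC form is an assumption (the text does not specify it).
def icc_absolute_agreement(x, y):
    """Two-way, absolute-agreement, single-measure ICC for two paired methods."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long series with at least 2 subjects")
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(x) / n, sum(y) / n]
    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (value - row_means[i] - col_means[j] + grand) ** 2
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_err + (k / n) * (ms_cols - ms_err)
    if denom == 0:
        raise ValueError("ICC is undefined when all measurements are identical")
    return (ms_rows - ms_err) / denom


def categorize_icc(icc):
    """Map an ICC to the agreement categories listed in the text."""
    value = round(icc, 2)  # the ranges in the text are given to 2 decimals
    if value < 0.01 or value > 1.0:
        return "unclassified"  # assumption: the text lists no category here
    bounds = (
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    )
    for upper, label in bounds:
        if value <= upper:
            return label


if __name__ == "__main__":
    # Hypothetical paired volumes (2D-SE vs 3D-GE), for illustration only.
    se = [7.1, 6.8, 7.9, 6.2, 7.4]
    ge = [7.3, 6.9, 8.2, 6.1, 7.8]
    icc = icc_absolute_agreement(se, ge)
    print(round(icc, 3), categorize_icc(icc))
```

An absolute-agreement form penalizes a systematic offset between the two sequences, which a consistency form would ignore; this matters when one sequence under- or overestimates a structure.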

RESULTS
Thirty-eight patients were included in the study. Twenty-three patients (60.5%) were women, with a mean age of 36.5 ± 8.8 years, a mean disease duration of 10.5 ± 7.2 years, and a median Expanded Disability Status Scale (EDSS) score of 4 (range, 1-6).

FreeSurfer Reliability Assessment
Cortical thickness measurements had the lowest ICC values, with 40% and 42.9% of the structures having a moderate or fair agreement, respectively. Only 1 structure (the right superior temporal gyrus) had a substantial agreement, and no structures had an almost perfect agreement (Table 1). Cortical thickness measurements of lateral structures had a higher proportion of moderate ICC estimates than medial structures (58.3% versus 15.6%, P = .002). For most cortical structures (except for some right and left frontal lobe structures: frontal pole, rostral middle frontal, medial orbitofrontal, and pars triangularis) thickness values were underestimated with 2D-SE sequences. Estimates of subcortical volumes showed a better agreement between 2D-SE and 3D-GE images, with 76.2% of the structures having either a substantial or almost perfect agreement compared with only 1.4% in the cortical structures (P < .001). The highest ICC values for subcortical volume estimates included relevant structures for MS pathology such as the thalamus, pallidum, caudate, brain stem, and putamen (Table 2). No clear pattern of over- or underestimation when using 2D-SE sequences was seen for subcortical structures.

FIRST Reliability Assessment
With 2D-SE images, a registration error leading to a mis-segmentation occurred in 7 of 38 patients (18%) compared with none when 3D-GE images were used. Nevertheless, measurement of subcortical volumes from the studies that went through segmentation and registration showed a good agreement between 2D-SE and 3D-GE images, with 68.8% of the structures having either a substantial or almost perfect agreement (Table 2). Again, no clear pattern of over- or underestimation when using 2D-SE sequences was seen for subcortical structures.

Clinicoradiologic Correlations
Using normalized subcortical volume estimates obtained with FreeSurfer, we found the following: 1) Age did not correlate with any of the subcortical structures measured with both 3D-GE and 2D-SE images, and 2) disease duration and disability as measured with the Expanded Disability Status Scale showed a significant moderate correlation with most of the subcortical structures measured with both 3D-GE and 2D-SE images; however, when we used 2D-SE images, some correlations lost statistical significance (Table 3). Using normalized subcortical volume measures obtained with FIRST, we found the following: 1) Age did not correlate with any of the subcortical structures measured with both 3D-GE and 2D-SE images, and 2) disease duration and EDSS scores showed a significant moderate correlation with most of the subcortical structures measured with 3D-GE images, while almost all correlations were nonsignificant if estimates from 2D-SE images were used (Table 4).

DISCUSSION
Brain volume loss, as measured with different postprocessing software packages, is known to occur in patients with MS, and it is clinically meaningful because it has been related to long-term motor and cognitive disability outcomes. 2,18,28 Historical MR imaging data from MS cohorts, with long-term follow-up that would provide relevant clinical information, do not include 3D heavily T1-weighted images (such as MPRAGE) but conventional 2D-SE T1-weighted images instead. Thus, reliability assessment of segmentation techniques in 2D-SE T1-weighted images may be of interest for these cohorts. As far as we are aware, this is the first time that FreeSurfer software has been evaluated by using 2D-SE T1-weighted images and that clinical correlations by using 3D-GE and 2D-SE images processed with FIRST and FreeSurfer have been compared. Cortical thickness estimates by using FreeSurfer software have been previously evaluated and have been demonstrated to be a robust and reproducible measure, except when different field strengths were used. [6][7][8] All studies performed to date have used 3D T1-weighted sequences. In this study, we found that measurement of cortical thickness with 2D-SE images yields inaccurate and unreliable results, with ICC values below 0.6 for almost all structures; this finding was especially notable when medial structures were evaluated. Pulse sequence, voxel geometry, and parallel imaging do not seem to influence cortical thickness measurements. [6][7][8] Although FreeSurfer segmentation does not rely on voxel intensity histograms, the different gray-white matter contrast in 2D-SE compared with 3D-GE T1-weighted images (Fig 1) could partially explain this result. The contrast-to-noise ratio was calculated in a small subset sample (n = 10; data not shown). The mean contrast-to-noise ratio between gray and white matter was 10.1 in the 2D-SE sequence and 19.7 in the 3D-GE.
There are fundamental differences between contrast behaviors of spin-echo and gradient-echo sequences, especially when this last sequence is obtained with a magnetization-prepared inversion pulse. 23 This magnetization-prepared inversion pulse produces a strong T1-weighting in the image, resulting in an excellent gray-white matter contrast compared with the standard 2D-SE images, facilitating morphologic evaluation and gray-white matter tissue segmentation. The section thickness of the 2 sequences was quite different (1.2 mm for the 3D-GE sequence versus 3 mm for the 2D-SE sequence). This greater thickness in the 2D-SE sequences may have influenced the volumetric measures obtained with both software packages, since section thickness has already been demonstrated to play a role when estimating lesion volumes in patients with MS. 29 We are aware that visual inspection of intermediate outputs to exclude MR images with segmentation errors could have improved our reliability results 30 ; however, this improvement would probably have been at the expense of introducing operator-derived biases. Furthermore, correcting cortical thickness segmentation errors (especially in 2D-SE T1-weighted images) would have been a very laborious and time-consuming task and very difficult to apply in future research involving a large number of patients. Thus, we decided to analyze the data obtained with FreeSurfer without adding control points to correct for topographic errors.
In this study, subcortical volume estimates of T1-weighted images by using both FreeSurfer and FIRST had a good 2D-SE versus 3D-GE reliability, with most of the ICC values being greater than 0.6. Subcortical GM volumes are estimated in both packages with a registration atlas-based technique (both by using the same atlas), a technique that is robust and insensitive to image contrast. We found that the amygdala and accumbens had the lowest ICC values, similar to descriptions in the literature for test-retest reliability assessment of 3D-GE subcortical segmentations by using FIRST software. 9,10,31 These structures are among the smallest; therefore, smaller volume differences may represent a higher percentage of error. The high ICC values obtained in our study for most of the subcortical structures using FIRST are in agreement with the only study performed to date that evaluated the performance of FIRST in 2D-SE T1-weighted images compared with 3D-GE T1-weighted images. 31 Also, we found segmentation errors in up to 18% of 2D-SE images, similar to what had been described in that work. 31 These mis-segmentations were not corrected because we did not focus on improving 2D-SE FIRST segmentation but on analyzing the performance of FIRST in these sequences in an automated fashion.
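The point about small structures can be made concrete with a short worked example. The volumes below are hypothetical and are not taken from this study; the helper name is also hypothetical. The sketch shows only that the same absolute difference between two estimates is a much larger relative error in a small structure than in a large one.

```python
# Illustrative arithmetic with hypothetical volumes (not data from this study).
def percent_difference(volume_a, volume_b):
    """Absolute difference expressed as a percentage of the mean of both estimates."""
    mean_volume = (volume_a + volume_b) / 2.0
    if mean_volume <= 0:
        raise ValueError("volumes must be positive")
    return abs(volume_a - volume_b) / mean_volume * 100.0


if __name__ == "__main__":
    # Same 50-unit absolute difference in a small and in a large structure.
    small = percent_difference(475.0, 525.0)
    large = percent_difference(7975.0, 8025.0)
    print(f"small structure: {small:.3f}%  large structure: {large:.3f}%")
```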
The main reason to study the performance of both packages in 2D-SE T1-weighted images was to test whether we could obtain measures that could be used to investigate clinical associations. Therefore, we assessed whether the subcortical normalized estimates obtained with 3D-GE and 2D-SE T1-weighted images by using the 2 packages were similarly associated with demographic and disease outcomes. Significant moderate correlations were obtained by using 3D-GE images, with relevant clinical outcomes such as disease duration and disability (measured with EDSS). However, by using 2D-SE scans, only normalized FreeSurfer estimates were significantly associated with clinical outcomes. This result could be partly explained because of the low ICC value of the TIV estimate between 2D-SE and 3D-GE T1-weighted images obtained by using FIRST (Table 2). Unlike FreeSurfer, FIRST does not provide a TIV estimate, and we calculated it by dividing the FIRST template volume (a fixed number of 1948.105) by the matrix determinant. Although both sequences covered the whole brain, the head coverage of the 2D-SE T1-weighted images is usually lower than the 3D template used in FSL, covering a lesser portion of the scalp and lower part of the brain stem (Fig 1). Thus, it is possible that the template volume used to calculate TIV does not match well enough for 2D-SE T1-weighted images.
We are aware that using FSL SIENAX software (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/SIENA) 25 to calculate TIV may have been a better approach. However, we wanted to evaluate whether robust volume measures could be obtained with 2D-SE T1-weighted images by using a single package and without using image-processing options in the volumetric analysis (ie, the adequate threshold) that could introduce biases. Nevertheless, to confirm our interpretation, we used SIENAX software to calculate the normalized brain volume from 2D-SE and 3D-GE sequences in a small subset sample (n = 15, data not shown). The ICC value of the normalized brain volume estimate between 2D-SE and 3D-GE T1-weighted images was 0.85 (almost perfect agreement), and most of the associations between subcortical normalized estimates and disease outcomes obtained by using 3D-GE sequences remained significant when using 2D-SE sequences. These results reinforce our hypothesis that the poor clinicoradiologic correlations seen with 2D-SE sequences are most probably due to a bad TIV estimation when using the avscale tool. There are some limitations to our study. First, it was conducted with 2D-SE and 3D-GE T1-weighted images used in clinical practice with particular acquisition parameters. It is possible that a better optimized 2D-SE sequence obtained with a higher field scanner would have produced better segmentation results. Therefore, our results should be extrapolated with caution when using different 2D sequences. Second, a voxel-by-voxel comparison of the binary masks would have provided relevant information regarding whether the software included the same image points in both sequences. Unfortunately, 2D-SE and 3D-GE FIRST outputs have different resolutions; therefore, an accurate comparison of the binary masks could not be performed.
Nevertheless, we believe this issue does not diminish the relevance of our results because a visual inspection of the outputs was performed to ensure that the FIRST software was correctly measuring subcortical structures in both sequences. Finally, a test-retest variability study with 2D-SE images would have provided more detailed information regarding reliability. Unfortunately, as stated before, the images used for this study were obtained in a clinical practice setting, with specific schedules, and test-retest studies are lacking.

CONCLUSIONS
We have demonstrated the following: 1) Measurement of cortical thickness with FreeSurfer with 2D-SE T1-weighted images is not accurate enough; 2) measurement of subcortical volumes with FreeSurfer and FIRST in 2D-SE images produces acceptable results; but 3) when we used normalized subcortical volumes of 2D-SE images for clinical correlations, FreeSurfer performed better than FIRST. Therefore, FreeSurfer should be preferred if normalized subcortical volume measures are to be used in transversal correlations with clinical and demographic variables but should not be used to measure cortical thickness in 2D-SE images.