MR Imaging of Brain Volumes: Evaluation of a Fully Automatic Software

BACKGROUND AND PURPOSE: Automatic assessment of brain volumes is needed in research and clinical practice. Manual tracing is still the criterion standard but is time-consuming. It is important to validate automatic tools so that clinical studies do not draw conclusions from brain volumes estimated with methodologic errors. The objective of this study was to evaluate a new commercially available, fully automatic software for MR imaging assessment of brain volumes. Automatic and expert manual brain volumes were compared.
MATERIALS AND METHODS: MR imaging (3T, axial T2 and FLAIR) was performed in 41 healthy elderly volunteers (mean age, 70 ± 6 years) and 20 patients with hydrocephalus (mean age, 73 ± 7 years). The software Q Brain was used to manually and automatically measure the following brain volumes: ICV, BTV, VV, and WMHV. The manual method has been previously validated and was used as the reference. Agreement between the manual and automatic methods was evaluated by using linear regression and Bland-Altman plots.
RESULTS: There were significant differences between the automatic and manual methods regarding all volumes. The mean differences were ICV = 49 ± 93 mL (mean ± 2 SD, n = 61), BTV = 11 ± 70 mL, VV = −6 ± 10 mL, and WMHV = 2.4 ± 9 mL. The automatic calculations of brain volumes took approximately 2 minutes per investigation.
CONCLUSIONS: The automatic tool is promising and provides rapid assessment of brain volumes. However, the software needs improvement before it is incorporated into research or daily use. Manual segmentation remains the reference method.
ABBREVIATIONS: A = automatic; BTV = brain tissue volume; FLAIR = fluid-attenuated inversion recovery; ICV = intracranial volume; M = manual; MD = mean difference

Volume quantification of the intracranial compartments is important in several neurologic diseases. For example, hydrocephalus is defined according to the size of the ventricles. 1 The degree and longitudinal evolution of white matter lesions reveal the clinical course of multiple sclerosis and vascular dementia. 2,3 Brain atrophy is used for the diagnosis of Alzheimer disease, and volume changes of brain tumors may be used as markers of prognosis or treatment. 4-7 Volumetric MR imaging was the first noninvasive in vivo technique to assess the volume of the intracranial compartments accurately. 8,9 Today, the techniques of MR imaging volume quantification are mainly manual or semiautomatic. Manual segmentation is performed by an observer tracing the outer contour of a region of interest on each section. The semiautomatic techniques also require input and feedback from the observer. Both manual and semiautomatic techniques are time-consuming and thus expensive; therefore, volumetric estimations are seldom used in clinical routine.
It is important to develop tools to measure volumes fast and reliably. Different kinds of software have been developed to segment the brain volumes in a fully automatic way. 10-13 These types of software are not commercially available and have only been validated and evaluated by the developers. FreeSurfer (http://surfer.nmr.mgh.harvard.edu/fswiki), a freely available software, has been evaluated against manual segmentation in a recent study, but the time needed for automatic computing of the brain volumes is too long for clinical use. 14 A new automatic software, Q Brain (Medis, Leiden, the Netherlands), has been developed to quantify BTV, ICV, brain VV, and WMHV. Using this software, the observer can perform the manual or the fully automatic segmentation of these brain volumes.
The fully automatic segmentation of Q Brain is based on axial MR imaging sequences (FLAIR and T2-weighted); and by using a standard computer, the software automatically calculates the different volumes within a few minutes. This automatic software is promising, but it has not been validated yet.
In this study, we have selected the thorough manual segmentation of an experienced observer as the reference method. We believe that the manual method is the best possible estimate of the volumes, and this belief was also supported in a previous study using MR imaging volume phantoms, which showed that the manual segmentation tool of Q Brain produces accurate and reproducible volume estimates. 1 The aim of this study was to evaluate this new automatic software. In a group of 61 individuals, the Q Brain manual protocol was used as the criterion standard and was compared with the Q Brain automatic protocol.

MR Imaging Investigation
Subjects were studied with a 3T Achieva MR imaging scanner (Philips Healthcare, Best, the Netherlands) with an 8-channel head coil. Axial T2-weighted turbo spin-echo (TE = 80, TR = 3000 ms) and FLAIR (TE = 140, TR = 12000, and TI = 2850 ms) sequences were obtained. The section thickness was 3 mm, the intersection gap was 0.3 mm, and the matrix was 512 × 512 in both T2-weighted and FLAIR sequences.

Subjects
Sixty-one subjects were included. To obtain a large span of brain volumes, we chose 41 healthy elderly volunteers (mean age, 70 ± 6 years; 24 women) and 20 patients with ventriculomegaly due to possible or probable idiopathic normal pressure hydrocephalus (mean age, 73 ± 7 years; 9 women). 15,16 The study was approved by the university ethics board.

Volumetry
Volumetric MR imaging measurements were performed by using the image analysis software Q Brain (Version 2.0). Volumes were calculated by using a standard computer (2.19 GHz, 1.96 GB of RAM). The volume quantifications were first performed with the manual segmentation and second with the fully automatic segmentation algorithms. Using the manual protocol, the observer segmented an area in each section by manually tracing the borders of the region of interest. The software estimates the volume in milliliters. ICV, BTV, VV, and WMHV were assessed. ICV was measured on T2 images; and BTV, VV, and WMHV were measured on FLAIR images (Fig 1). The same brain volumes were measured by using automatic segmentation. The methodology of automatic segmentation was based on SNIPER (Leiden University Medical Center) and has been described previously. 12
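The text does not state how Q Brain converts traced areas into a volume. The following is a minimal sketch, assuming the common section-summation approach in which each traced area is multiplied by the section spacing (here the 3-mm thickness plus the 0.3-mm gap reported above). The function name and the sample areas are hypothetical and are not taken from the software or from the study data.

```python
def volume_ml(section_areas_mm2, thickness_mm=3.0, gap_mm=0.3):
    """Estimate a volume in mL from per-section traced areas (mm^2).

    Assumption: each traced area represents a slab whose height is the
    section thickness plus the intersection gap. This is an illustrative
    convention, not Q Brain's documented formula.
    """
    spacing_mm = thickness_mm + gap_mm
    volume_mm3 = sum(section_areas_mm2) * spacing_mm
    return volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Hypothetical traced areas on four consecutive sections.
areas = [250.0, 400.0, 380.0, 170.0]
print(round(volume_ml(areas), 2))
```

Whether the gap is counted toward the slab height is a design choice; leaving it out would lower every volume by the same proportion and so would not change the comparison between two methods applied to the same sections.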

Manual Segmentation as the Reference Method
The manual method, used as the reference in this study, has been previously validated by using phantom models with well-defined volumes. 1,17 The main observer in both studies (observer 1) was trained by a neuroradiologist and had 3 years' experience in brain volume segmentation. We further investigated the variability of the manual method by letting an additional observer measure brain volumes in 5 patients with hydrocephalus and in 5 healthy controls. The intraobserver variability was assessed by measuring the brain volumes twice by the same observer (observer 1). The time between the first and the second segmentation was always >1 month. For the interobserver variability, a second observer (observer 2) measured the brain volumes in the same 10 subjects and was blinded to the results of the first observer. Inter- and intraobserver variability was expressed as the MD between repeated brain volume measurements with the limits of agreement defined as 2 SD. MRD was also calculated.
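The variability measure described above (MD with limits of agreement of 2 SD) can be sketched as follows. This is an illustration only: the use of the sample SD (n − 1 denominator) is an assumption, the function name is hypothetical, and the readings are made-up values rather than study data.

```python
from statistics import mean, stdev


def agreement(first, second):
    """Return (mean difference, lower limit, upper limit) for paired readings.

    Limits of agreement are taken as MD +/- 2 SD of the paired differences,
    matching the definition in the text.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long series with at least 2 pairs")
    diffs = [a - b for a, b in zip(first, second)]
    md = mean(diffs)
    sd = stdev(diffs)  # sample SD (n - 1 denominator), an assumption
    return md, md - 2 * sd, md + 2 * sd


# Hypothetical repeated ventricular volumes (mL), not data from the study.
reading_1 = [32.0, 45.0, 120.0, 98.0, 27.0]
reading_2 = [30.0, 46.0, 117.0, 99.0, 28.0]
md, lower, upper = agreement(reading_1, reading_2)
print(f"MD = {md:.2f} mL, limits of agreement {lower:.2f} to {upper:.2f} mL")
```

The same function applies to intraobserver, interobserver, and manual-automatic comparisons; only the two input series change.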

Statistics
The statistical analysis was performed with the Statistical Package for the Social Sciences software, Version 12.0 (SPSS, Chicago, Illinois). Correlations between the automatic and the manual brain volumes were investigated by using linear regression analysis. Bland-Altman plots were used. The Shapiro-Wilk test was used to test the normality. Differences between the means of repeated brain volume measurements were analyzed by using the paired t test or the Mann-Whitney test when appropriate. The duration to assess all volumes by using the fully automatic protocol was measured by using a stopwatch. P values < .05 were considered statistically significant.

Results
The average brain volumes of the 61 subjects assessed by the manual and automatic methods are shown in Table 1. Comparisons between manual and automatic volume measurements are shown in Figs 2 and 3. The measurements were highly correlated (VV: R = 0.998, P < .01; ICV: R = 0.936, P < .01; BTV: R = 0.934, P < .01; WMHV: R = 0.961, P < .01).
However, there were significant differences between the mean volumes calculated by the manual and the automatic methods for ICV, BTV, VV, and WMHV (ICV, P < .01; BTV, P = .02; VV, P < .01; WMHV, P < .01). As shown in Figs 2B1, 3A1, and 3B1, the automatic method underestimated ICV, WMHV, and BTV. The systematic differences are displayed in Table 2 and also in the Bland-Altman plots (Figs 2B2, 3A2, and 3B2). There was a systematic overestimation of the VV by using the automatic segmentation (Fig 2A2 and Table 2).
The Bland-Altman plots show a correlation between the mean and the difference of the automatic and manual methods for WMHV (R = 0.48, P < .01) and VV (R = 0.68, P = .01). This was not observed for BTV or ICV (Figs 2B1 and 3B2).
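A correlation between the pairwise mean and the pairwise difference, as reported above, indicates that the disagreement grows or shrinks with the size of the measured volume. The following is a minimal sketch of that check. The function names, the sign convention (automatic minus manual), and the sample values are assumptions made for illustration; reversing the subtraction only flips the sign of the coefficient.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def proportional_bias(automatic, manual):
    """Correlate the pairwise mean with the pairwise difference."""
    means = [(a + m) / 2 for a, m in zip(automatic, manual)]
    diffs = [a - m for a, m in zip(automatic, manual)]  # assumed convention
    return pearson_r(means, diffs)


# Hypothetical volumes (mL): the automatic value exceeds the manual one by 10%.
manual = [20.0, 40.0, 80.0, 160.0]
automatic = [1.1 * v for v in manual]
print(round(proportional_bias(automatic, manual), 3))
```

In this constructed case the difference is exactly proportional to the volume, so the coefficient is at its maximum. A purely constant offset would leave the differences with no variance, and the coefficient would then be undefined rather than zero.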
The variability of the reference method is summarized in Table 2. No significant differences were found for interobserver and intraobserver variability (WMHV, P > .2; VV, P > .08; ICV, P > .17; BTV, P > .06). From the same 10 subjects, the significant differences between the manual and the automatic methods were confirmed for all brain volumes (WMHV, P < .01; VV, P < .01; ICV, P = .03; BTV, P = .04). The manual-automatic limits of agreement were larger than the intraobserver and interobserver limits of agreement, except in the case of the interobserver manual VV method (Table 2).
The automatic algorithm calculated the volumes in 127 ± 9 seconds.

Discussion
In this study, a fully automatic commercially available software for the assessment of brain volumes was evaluated and found to be very fast and user-friendly. Automatic brain volumes correlated strongly to the manual brain volumes. However, there was a significant difference and variability between the automatic and reference methods. For ICV, BTV, and WMHV, the differences can be considered of clinically important magnitude; therefore, the automatic method requires improvement.

Ventricular Volume
Despite the excellent correlation (R = 0.998) between automatic and manual VV, there was a systematic overestimation. Two main sources for the overestimation were observed. First, the pixel-intensity threshold was systematically larger compared with the reference, causing an oversegmentation mainly localized at the lateral ventricles in all subjects (On-line supplemental Fig 3C). This may explain the significant correlation found between the means of automatic and manual VV and their differences (Fig 2A2). Second, the cisterna ambiens (which is the subarachnoid space between splenium of the corpus callosum and the superior aspect of cerebellum) contains difficult anatomic structures; VV was overestimated by the automatic tool (On-line supplemental Fig 3F). Furthermore, the automatic method did not recognize the cerebral aqueduct and the fourth ventricle as a part of the VV.
A recent study using a similar automatic software, FreeSurfer, found an excellent correlation (R > 0.98) between automatic and manual VV. 14 As in our study, the authors found a significant difference in mean VV between the 2 methods, also with a systematic overestimation.
In this study, the agreement between repeated manual measurements with different observers was similar to the agreement between manual and automatic measurements (Table 2). If the threshold pixel intensity to assess VV automatically is adjusted, Q Brain could be a reliable tool to assess VV rapidly; thus, it could replace the traditional linear indices, such as the Evans index. 1,17,18

Intracranial and Brain Tissue Volumes
A limitation of the automatic Q Brain was that it calculates only the total brain tissue volume and does not differentiate the white matter and gray matter. In this study, we, therefore, reported only the results for total BTV.
Despite the high correlations between the automatic and manual protocols for ICV and BTV, there was a difference between the mean manual volume and the mean automatic volume. The main source of misclassification was localized at the top and base of the cranial cavity. The uppermost 5 sections were not perpendicular to the skull or brain parenchyma contour; this positioning increased the partial volume effects and thus complicated the automatic segmentation (On-line Figs 1C and 2F). Miscalculations were also observed at the middle and posterior fossae (On-line Figs 1F and 2C).
The results for Q Brain were similar to findings in the recent evaluation of automatic FreeSurfer software, 14 which overestimated ICV with a mean difference of 133 mL between the manual and the automatic methods. Another automatic software, SIENA (Oxford University; http://www.fmrib.ox.ac.uk/analysis/research/siena/), revealed a difference of 46 mL between the normalized mean manual BTV and the normalized mean automatic BTV. 19 The automatic algorithm of Q Brain is based on the SNIPER tool, 12 which has been used in several clinical studies. 20-22 The capability of SNIPER to assess ICV and WMHV accurately has been validated in a previous study. 12 However, the mean difference between automatic and manual ICV found in our study was much larger than that in this previous study (+49 mL compared with their +3 mL). 12 A possible explanation could be that the automatic Q Brain software has an interscanner variability similar to that in the automatic SPM5 software (Wellcome Department of Imaging Neuroscience, London, UK). In a recent study, 1 healthy subject was scanned with 6 different MR imaging scanners. ICV and BTV were estimated automatically with SPM5 and ranged from 1408 to 1515 mL and 1224 to 1363 mL, respectively. 23 Irrespective of the dependency on MR imaging scanners, the manual ICV and BTV variability showed that the limits of agreement between the 2 methods (manual and automatic) were at least 2-fold larger than those due to inter-/intraobserver variability (Table 2); further improvement of the automatic algorithm is necessary.

Automatic White Matter Hyperintensity Volume
In almost all subjects, the WMHV was underestimated by the automatic tool (Fig 3B1 and On-line supplemental Fig 4). This finding is in agreement with the study using SNIPER. 12 The authors found a slightly smaller systematic underestimation (approximately 1 mL compared with our 2 mL). Their 95% confidence interval of the difference was smaller than ours ([−4 to 5 mL] compared with our [−6 to 11 mL]). We found a significant correlation between the means of automatic and manual WMHV and their differences, as was found in the evaluation study of SNIPER. 12 Interscanner variability could also be a possible contributor to this discrepancy. It is important also to discuss the manual segmentation protocols used to delineate the WMHV. We used FLAIR images, whereas the previous study used T2 images. 12 We believe FLAIR images should be used because they seem to have a higher specificity and accuracy compared with T2 images, especially in the periventricular region. 24,25 As with ICV and BTV, the limits of agreement between automatic and manual were larger compared with inter-/intraobserver limits of agreement. Previous studies 11,13 have attempted to validate their automatic WMHV tool against visual scales. We believe this is not a robust means of validation because visual and volume scales have different properties. 21,26 While manual segmentation is, for many reasons, still considered a standard segmentation-validation method, objective approaches with realistic data for which the true volumes are known are needed for standardized method assessment. Thus, a previous study reported a comparison of different software packages by using simulated MR imaging, the database BrainWeb (http://mouldy.bic.mni.mcgill.ca/brainweb). 27,28 In our study, we have not used this approach because the BrainWeb data do not contain simulated FLAIR images. Without the FLAIR image sequence, the automatic calculation of brain volumes with Q Brain software was not possible.

Conclusions
According to our findings, similar automatic tools should undergo the same evaluation tests. It is important to validate the automatic tools because a number of clinical studies draw conclusions about brain volumes estimated with software that has not yet been validated. The automatic algorithm of the software Q Brain needs improvement to be used by the neuroradiology and neuroscience communities. However, manual segmentation is still the criterion standard, and Q Brain incorporates an excellent toolkit for this purpose.