Pitfalls in the Use of Voxel-Based Morphometry as a Biomarker: Examples from Huntington Disease

BACKGROUND AND PURPOSE: VBM is increasingly used in the study of neurodegeneration, and recently there has been interest in its potential as a biomarker. However, although it is largely “automated,” VBM is rarely implemented consistently across studies, and changing user-specified options can alter the results in a way similar to the very biologic differences under investigation. MATERIALS AND METHODS: This work uses data from patients with HD to demonstrate the effects of several user-specified VBM parameters and analyses: type and level of statistical correction, modulation, smoothing kernel size, adjustment for brain size, subgroup analysis, and software version. RESULTS: The results demonstrate that changing these options can alter results in a way similar to the biologic differences under investigation. CONCLUSIONS: If VBM is to be useful clinically or considered for use as a biomarker, there is a need for greater recognition of these issues and more uniformity in its application for the method to be both reproducible and valid.

ABBREVIATIONS: CAG = cytosine adenine guanine; DARTEL = Diffeomorphic Anatomical Registration Through Exponentiated Lie algebra; EHDN = European Huntington's Disease Network; FDR = false discovery rate; FWE = family-wise error; FWHM = full width at half-maximum; GM = gray matter; HD = Huntington disease; Mod. = modulation; NA = not applicable; SPM = statistical parametric mapping/statistical parametric map; TFC = total functional capacity; TIV = total intracranial volume; UHDRS = Unified Huntington Disease Rating Scale; Uncor. = uncorrected; VBM = voxel-based morphometry

VBM 1 involves voxel-wise statistical analysis of structural MR images and is commonly used to infer regions in which brain volume differs between groups or regions in which brain volume is associated with another variable. VBM is increasingly used in the study of neurodegeneration and is a complementary approach to region-of-interest methods because it is automated in many parts and can be applied across the whole brain and thus does not require a priori hypotheses about particular regions of interest. Although VBM has mainly been used to understand structural differences and behavioral correlates, there is increasing interest in the potential use of VBM as a biomarker, both diagnostic 2 and also in clinical trials of potentially disease-modifying therapies. 3,4 However, although automated in many parts, VBM is rarely implemented consistently across studies, and changing user-specified options can alter the results in a way similar to the biologic differences under investigation.
This article aims to illustrate the above problem by using data from patients with HD, a neurodegenerative disease which has been investigated using VBM. We put this into context with a brief review of the literature on HD, highlighting the wide range of different processing options used in published VBM studies to date. We reference the use of VBM in other areas, including Alzheimer disease, and suggest some changes that could be implemented if this technique is to be considered a useful tool in the context of clinical trials.
The aims of the work were the following: 1) to illustrate that all users need to be aware of these caveats when interpreting results and 2) to show that a more uniform approach to VBM is vital if it is to be considered a robust and valid clinical tool and eventually meet criteria for a biomarker.

Subjects
Subjects were recruited from the HD clinics at the National Hospital for Neurology and Neurosurgery, London, and at Addenbrooke's Hospital, Cambridge, UK. All had a CAG repeat length of >39 in the HD gene. Subjects were classified as "early HD" (stages 1 and 2) 5 or gene carriers without motor signs (ie, "premanifest"). HD gene carriers with UHDRS diagnostic confidence scores <4 were defined as premanifest subjects (n = 21); those with diagnostic confidence scores of 4 were defined as manifest HD (n = 40). 6 Neurologically healthy controls were also recruited (n = 20). These were spouses of patients or subjects from affected families who were known not to carry the HD gene. Subjects gave written informed consent, and the study had local research ethics committee and hospital trust approval. As part of a longitudinal study, all subjects underwent annual assessments including MR imaging and clinical and cognitive evaluations. Baseline MR images were used to determine the impact of VBM parameters on results; details of other findings from the study can be found elsewhere. 7,8 Demographic details are shown in Table 1.

VBM Analysis
In general, images were normalized and segmented by using standard procedures from SPM5 software and DARTEL (Wellcome Department of Imaging Neuroscience, London, United Kingdom). 9 Unless otherwise stated, GM segments were modulated and smoothed at 4-mm FWHM before analysis. At each stage, all segmentations were inspected visually. The main comparison presented in this work is that of controls versus early HD, so most SPMs show regions in which the early HD group has reduced GM volume relative to controls. It is also useful to consider the reverse contrast (where the HD group has increased GM relative to controls) because unpredicted findings in this direction might be an indication of poor registration. Unless otherwise stated, all comparisons controlled for differences in age and head size by including these as covariates. Detailed methods can be found in the supplementary on-line data.
We recognize that VBM can be implemented through other software packages. We have chosen to use SPM5 and DARTEL because they are the latest versions of a commonly used package, but the issues demonstrated here will apply regardless of software type or version. This work should not be interpreted as advocating the use of a particular software package or version.

Varying the Type and Level of Statistical Correction
One of the benefits of VBM is the fact that it examines the whole brain in an unbiased way, but in doing so, many thousands of statistical tests are performed at once. At a standard α level of 0.05, approximately 5000 voxels in an image of 100 000 voxels would be expected to be false-positives. This is often addressed by controlling the FWE rate (ie, controlling the probability of there being at least 1 false-positive voxel in the entire SPM), though this can lack power and hence omit many true-positives 10 ; some authors opt instead to show uncorrected data. This section investigates how variation in the level and type of correction can impact the resulting SPM. Figure 1 shows regions in which HD subjects have GM loss relative to controls, by using 3 different levels of FWE correction and 3 different levels of voxel-wise correction. At very strict levels, the evidence appears to show atrophy confined to the striatum. At an "exploratory" uncorrected level, most of the GM appears to be involved. Even though the underlying contrast is the same, varying the type and level of correction in this way could mimic the effect of increasing disease stage or the passage of time.
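The false-positive arithmetic above, and the way the choice of correction changes how many tests count as "significant," can be illustrated with a minimal sketch. It is not the procedure used in this study: the Bonferroni rule is used here only as a simple stand-in for FWE control (SPM's FWE correction is based on random field theory, which this sketch does not attempt to reproduce), the FDR rule is the standard Benjamini-Hochberg step-up procedure, and the 8 P values are invented for illustration.

```python
def expected_false_positives(n_voxels, alpha):
    """Expected number of false-positive voxels if every null hypothesis is true."""
    return n_voxels * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that bounds the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q):
    """Booleans marking tests declared significant at FDR level q (step-up rule)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose ordered P value satisfies p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            significant[idx] = True
    return significant


# 100 000 voxels tested at alpha = 0.05 with no correction.
print(expected_false_positives(100_000, 0.05))

# Hypothetical P values for 8 tests (illustration only, not study data).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
n_uncorrected = sum(1 for x in p if x <= 0.05)
n_fdr = sum(benjamini_hochberg(p, 0.05))
n_bonferroni = sum(1 for x in p if x <= bonferroni_threshold(len(p), 0.05))
print(n_uncorrected, n_fdr, n_bonferroni)
```

With the same underlying P values, the number of surviving tests shrinks as the correction becomes stricter, which is the single-dataset analog of the changing extent of "atrophy" across the panels of Fig 1.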

Using Modulated or Unmodulated Data
In the earlier formulations of VBM, normalization aimed to correct for global differences in head position and structure (eg, to align the left superior temporal gyri on all subjects) but not for local differences due to atrophy. 1 However, in practice, it is likely that normalization results in some atrophy being lost. To correct for this, a modulation step that multiplies the voxel intensity by the Jacobian determinant from the normalization process was introduced. 11 The Jacobian determinant is an index of how much a voxel was stretched or contracted during normalization, so modulation makes intensity a more accurate representation of volume. With modulated data, one is testing for "regional differences in the absolute amount (volume) of gray matter…," 11 whereas with unmodulated data one is looking at "differences in concentration of gray matter (per unit volume in native space)," 1,11 though this is not to be confused with, for example, the histologic attenuation of neurons. More flexible registration methods such as DARTEL intend to recover finer scale differences (eg, due to atrophy), with a greater proportion of the useful information being transferred to the Jacobian, making modulation of greater importance. In the literature, modulation is not always used, but results are often interpreted similarly regardless of whether this step is included. This section investigates how inclusion of the modulation step might affect results. Figure 2 shows regions of "atrophy" in subjects with HD compared with controls by using both modulated and unmodulated data. With unmodulated data, there is little evidence of putaminal involvement, though both the caudate and insula are shown to be reduced in early HD relative to controls. With modulated data, damage to the insula appears less widespread, while there is much more evidence of caudate and putamen atrophy. The t values are generally higher, indicating that including the information in the Jacobian improves discrimination between the groups.
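The modulation step described above amounts to a voxel-wise multiplication, which the following toy sketch makes concrete. It assumes a 1D strip of 4 template-space voxels and defines the Jacobian determinant as the native-space volume represented by each template voxel; this convention, the function name `modulate`, and all values are illustrative assumptions rather than the SPM implementation.

```python
def modulate(warped_gm, jacobian_det):
    """Multiply each warped GM value by the local Jacobian determinant."""
    return [g * j for g, j in zip(warped_gm, jacobian_det)]


# Control: the structure already fills 4 voxels, so no local volume change is needed.
control_unmod = [1.0, 1.0, 1.0, 1.0]
control_jac = [1.0, 1.0, 1.0, 1.0]

# Patient: an atrophied structure of 2 native voxels is stretched to fill the same
# 4 template voxels, so each template voxel represents half a native voxel.
patient_unmod = [1.0, 1.0, 1.0, 1.0]
patient_jac = [0.5, 0.5, 0.5, 0.5]

control_mod = modulate(control_unmod, control_jac)
patient_mod = modulate(patient_unmod, patient_jac)

# Unmodulated data cannot distinguish the two subjects after a perfect warp.
print(sum(control_unmod), sum(patient_unmod))
# Modulated data recover the native-space volumes (4 vs 2 voxels).
print(sum(control_mod), sum(patient_mod))
```

In this idealized case, all of the group difference has been moved into the Jacobian by the registration, which is why omitting modulation becomes more consequential as registration becomes more accurate.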

Changing the Size of the Smoothing Kernel
A final preprocessing option is the smoothing kernel. Data are convolved with a 3D Gaussian kernel so that voxel intensities become a weighted average of the surrounding voxels; the size of this kernel is user-defined. Smoothing is required to render the data more normally distributed and to correct for some error in the registration process. 1 A range of smoothing kernel sizes has been used in the literature, and this section compares the effect of 3 different smoothing kernels on a single dataset.
Regions in which the HD group has significantly reduced GM relative to controls are shown in Fig 3 for 3 different smoothing kernels (4-, 6-, and 8-mm). As the kernel size increases, so does the extent of the findings, with, for example, the insula and posterior cortical regions becoming increasingly involved. Elsewhere in the work presented here, a kernel of 4 mm was chosen because the increased accuracy of the DARTEL registration algorithm means that smaller kernels should be sufficient to correct for misalignment.
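How kernel size can change the apparent extent of an effect is sketched below in 1 dimension. The conversion from FWHM to the Gaussian standard deviation is the standard one, sigma = FWHM / (2 * sqrt(2 * ln 2)); everything else is an assumption made for illustration: a 1-mm voxel size, a 3-voxel-wide "effect," and a fixed intensity cutoff of 0.1 standing in for a statistical threshold.

```python
import math


def fwhm_to_sigma(fwhm_mm):
    """Convert full width at half-maximum to the Gaussian standard deviation."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


def gaussian_kernel(fwhm_mm, voxel_mm=1.0):
    """Normalized 1D Gaussian weights, truncated at 4 standard deviations."""
    sigma_vox = fwhm_to_sigma(fwhm_mm) / voxel_mm
    radius = int(math.ceil(4 * sigma_vox))
    weights = [math.exp(-0.5 * (i / sigma_vox) ** 2) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, kernel):
    """Convolve a 1D profile with the kernel, treating values outside it as zero."""
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                acc += w * profile[j]
        out.append(acc)
    return out


def extent(fwhm_mm, cutoff=0.1):
    """Number of voxels above the cutoff after smoothing a 3-voxel-wide effect."""
    profile = [0.0] * 61
    profile[29] = profile[30] = profile[31] = 1.0
    smoothed = smooth(profile, gaussian_kernel(fwhm_mm))
    return sum(1 for v in smoothed if v > cutoff)


for fwhm in (4.0, 6.0, 8.0):
    print(fwhm, round(fwhm_to_sigma(fwhm), 3), extent(fwhm))
```

The same 3-voxel effect occupies more voxels above the cutoff as the kernel widens, although its peak value falls. In a real analysis the threshold applies to a t statistic rather than to intensity, so this shows only the spatial spreading, not the full effect on significance.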

Adjusting for Brain Volume
Apart from the effects of pathology, total brain volume in healthy subjects is known to vary with both head size 12 and sex, 13 and it has been shown that adjusting whole-brain volume for TIV eliminates differences due to sex. 14 It is common for volumetric studies to include an adjustment for some index of head size to ensure that these differences are not influencing findings. 15,16 However, few VBM studies of neurodegeneration include an index or measure of TIV as a covariate, though many adjust for total GM volume. In healthy subjects, total GM volume is likely to correlate with TIV, though it will decrease with age. 17 If one adjusts for age, covarying for total GM volume approximates an adjustment for TIV and allows investigation of differences in GM volume that are not caused by differences in overall head size. However, in subjects with a neurodegenerative disease, total GM volume will almost certainly decrease with the duration or severity of the disease; hence, adjusting for it is likely to mask some disease-related effects (Fig 4). At an extreme level, if degeneration proceeded uniformly throughout the brain, then a comparison between healthy controls and patients that was adjusted for total GM volume would find no evidence of group differences. Figure 5 shows the effects of adjusting for TIV and total GM when investigating differences in volume between early HD subjects and controls. In this cohort, there was little effect of adjusting for TIV, though with adjustment, the maximum t value was slightly higher and there was a little more evidence of atrophy in the insula. If one adjusts for GM volume alone, evidence of atrophy outside the striatum almost disappears. When one adjusts for both, there is evidence that striatal atrophy is disproportionately severe (ie, cannot be accounted for by general GM loss or head size).
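The extreme case described above, in which uniform degeneration is hidden by adjusting for total GM volume, can be reproduced with a small noise-free simulation. All numbers are invented assumptions: 5 controls and 5 patients with matched TIV, total GM equal to 45% of TIV in controls, a uniform 15% GM loss in patients, and a regional volume that is always 20 mL per liter of total GM. The helper `ols` is a minimal least-squares solver written for this sketch, not part of any VBM package.

```python
def ols(columns, y):
    """Least-squares coefficients for y ~ columns, solved via the normal equations."""
    p, n = len(columns), len(y)
    # Augmented matrix [X'X | X'y].
    a = [[sum(columns[r][i] * columns[c][i] for i in range(n)) for c in range(p)]
         + [sum(columns[r][i] * y[i] for i in range(n))] for r in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, p):
            factor = a[r][col] / a[col][col]
            for c in range(col, p + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        s = a[r][p] - sum(a[r][c] * beta[c] for c in range(r + 1, p))
        beta[r] = s / a[r][r]
    return beta


# Hypothetical cohort: TIV in liters, identical in both groups.
tiv = [1.3, 1.4, 1.5, 1.6, 1.7] * 2
group = [0.0] * 5 + [1.0] * 5            # 0 = control, 1 = patient
ones = [1.0] * 10

# Uniform degeneration: patients lose 15% of GM everywhere.
total_gm = [0.45 * t * (0.85 if g else 1.0) for t, g in zip(tiv, group)]
regional = [20.0 * gm for gm in total_gm]  # regional volume in mL

group_effect_tiv = ols([ones, group, tiv], regional)[1]
group_effect_gm = ols([ones, group, total_gm], regional)[1]

print(round(group_effect_tiv, 3))   # regional deficit in patients, adjusted for TIV
print(round(group_effect_gm, 6))    # deficit remaining after adjusting for total GM
```

Adjusting for TIV leaves the regional deficit intact, whereas adjusting for total GM removes it entirely, because under uniform loss the regional volume is fully explained by total GM. Real data are noisy and degeneration is not uniform, so the masking would be partial rather than complete.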

Subgroup Analysis
Another common analysis is to use simple regression models to examine the association between a variable of interest and brain volume. While some groups model this as a regression, others choose to compare the outcome of 2 subgroup contrasts (eg, high CAG repeat length versus controls and low CAG repeat length versus controls). 18 This section examines a potential pitfall associated with the latter approach by using subgroups of the early HD group (the 12 subjects with the lowest UHDRS motor scores and the 12 subjects with the highest UHDRS motor scores) and a subgroup of 12 controls (Fig 6).
The 2 SPMs showing the contrast of the low motor group and the high motor group with controls show that atrophy in the high motor group is more widespread and perhaps that group differences are larger. However, the direct contrast of the low and high motor groups shows that there is no evidence that the 2 groups differ from each other (at the same level of statistical correction).
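The pitfall can be shown at a single hypothetical voxel. In the sketch below, all values are invented: 3 groups of 12 share the same spread and differ only in mean, the test is an equal-variance 2-sample t test, and the cutoff of 2.074 is the tabulated 2-sided P < .05 critical value for 22 degrees of freedom, used here without any correction for multiple comparisons.

```python
import math


def pooled_t(a, b):
    """Two-sample t statistic (equal-variance form) for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))


# Hypothetical GM values at one voxel: same spread, different means.
deviations = [-3, -2, -2, -1, -1, 0, 0, 1, 1, 2, 2, 3]
controls = [100.0 + d for d in deviations]
low_motor = [98.5 + d for d in deviations]
high_motor = [97.5 + d for d in deviations]

T_CRIT = 2.074  # 2-sided P < .05 with 22 degrees of freedom (standard t table)

t_low = pooled_t(controls, low_motor)
t_high = pooled_t(controls, high_motor)
t_direct = pooled_t(low_motor, high_motor)

for name, t in (("controls vs low", t_low),
                ("controls vs high", t_high),
                ("low vs high", t_direct)):
    print(name, round(t, 3), "significant" if t > T_CRIT else "not significant")
```

Only the high motor group differs significantly from controls at this voxel, yet the direct contrast between the 2 patient subgroups does not reach the threshold. A voxel that appears in one map but not the other is therefore not evidence that the subgroups differ.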

Effect of Software Version and Preprocessing Strategy
Finally, although for consistency all the work in the above sections has been performed by using SPM5 and DARTEL, 2 further points are worth noting. First, these issues will apply to the other software packages available for whole-brain analysis. Second, software package (and version) is a further source of potential variation between studies and, therefore, needs to be taken into account when interpreting and comparing findings. For example, incremental improvements have been made to the SPM software since it was first introduced in the early 1990s. Although a direct comparison of these software versions is beyond the scope of this article, a brief summary of the features relevant to VBM is outlined below. SPM96 had basic 3D spatial normalization by using basis functions and separate tissue segmentation. SPM99 improved the normalization and added MR imaging bias-field correction to the segmentation. This bias-field estimation was enhanced in SPM2, alongside some major changes to the statistical analysis, including restricted maximum likelihood estimation of variance components followed by maximum likelihood (weighted least squares) parameter estimation and the option of controlling the FDR. SPM5 included a unified segmentation approach, which combined the previously separate processes of spatial normalization and tissue classification. In addition, the introduction of DARTEL provided a major advance in the accuracy of spatial alignment of scans. SPM8 (which was released after the completion of our analysis) provides further refinement to the unified segmentation algorithm and a revised FDR procedure.
As our study shows, improvements in normalization accuracy (and consequently smaller smoothing kernels) and statistical inference can have a noticeable impact on resulting SPMs and, therefore, conclusions about the spatial distribution of atrophy. Software version is 1 source of variation that is beyond the user's control because it is to be expected that users will want to work with the latest versions. However, it does need to be acknowledged if this could (partly) explain differences in findings.
An issue closely intertwined with improvements to the registration and segmentation methods available in different software versions is that of modified pipelines for the combination of these steps. For example, Good et al 11 introduced an "optimized" procedure involving generation of "custom templates" and tissue probability maps and normalization of segmentations followed by re-segmentation. The unified segmentation of SPM5 provides a more theoretically grounded version of this iteration, while DARTEL allows registration to the group-wise average space instead of standard or custom templates (though it still typically relies on the initial unified segmentation results). Subject groups that are poorly represented by the individuals used to create the standard tissue probability maps (eg, very young or very old) may not be well segmented by the standard procedure. Wilke et al 19 propose a method to statistically generate subject-matched tissue probability maps based on a linear model of the variation of tissues in a separate large cohort of subjects.

Fig 4. Graphs demonstrate how TIV and total GM volume vary with age and motor score (an index of HD severity). The top 2 graphs show that the relationship between TIV and both age and motor score is small and not statistically significant. The bottom 2 graphs show that total GM volume decreases with age (r = −0.26, P = .017) and motor score (r = −0.31, P = .0493).

Discussion
This study demonstrates that methodologic and biologic differences can appear very similar in VBM analyses, and this finding opens up a risk of misinterpretation of results, as well as making it hard to generalize between studies and, hence, be confident of the robustness of findings.
Very different pictures can be obtained by varying the level and type of correction used. Uncorrected results in which group numbers or effect sizes are small and hence would not survive FWE correction are often published, though this might result in a large number of false-positives. Conversely, stringent control of the FWE rate is likely to lead to underreporting of true effects. As discussed by Poldrack et al, 20 the risk of false-positives in uncorrected data depends on the smoothness, complicating the comparison between different sets of uncorrected results. For this reason, we prefer corrected results with a lower threshold and/or the presentation of unthresholded maps. 21 This may help emphasize similarities, rather than differences, between studies.
There were differences between findings with modulated and unmodulated data. These need to be interpreted differently because they are not representing the same phenomena. As more precise registration methods are developed, modulation becomes more important to preserve structural differences. In studies of neurodegeneration, the incorporation of a modulation step is the preferred way of ensuring that intersubject alignment preserves intergroup differences in morphology. 22
Ashburner and Friston 1 stated, "Whenever possible, the size of the smoothing kernel should be comparable to the size of the expected regional differences between the groups of brains." New methods of image registration such as DARTEL should have a decreased registration error, and the choice of smaller kernels (eg, 4 or 6 mm) may be sufficient. When studying neurodegeneration, greater smoothing tends to increase sensitivity at the expense of specificity and makes it harder to localize an effect anatomically. 23 This again means that inconsistencies between studies in which different kernels have been used might not reflect true differences in the cohorts being studied.
Often the lack of a statistically significant difference between groups in so-called "nuisance covariates" (eg, sex or TIV) is wrongly assumed to imply that these variables are not having a material influence on the results. In this cohort, the groups had, on average, similar head sizes, and including TIV as a covariate did not greatly impact the SPM. However, because GM volume is related to TIV, including TIV as a covariate reduces some of the unexplained variance in the data and, hence, may increase the significance of the contrast of interest. In this cohort, this was reflected in the finding of slightly more atrophy at slightly higher t values when TIV was included as a covariate compared with when it was not.
When total GM volume was also included, it was clear that this had a marked effect on the results. Adjusting for total GM volume allows investigation of the relative loss or preservation of regions, compared with the amount of global loss. 11,24,25 This is an interesting question in itself but needs careful interpretation. Some studies seem to equate adjustment for total GM volume with adjustment for head size, but in studies of neurodegeneration in particular, adjusting for the former will get rid of some disease-related effects, whereas adjusting for the latter will not, as the current results demonstrate.
The results also demonstrate that while subgroup comparison can yield interesting SPMs, visual comparison of the 2 resulting statistical maps does not constitute a valid statistical comparison in itself. When one compares each group with a relatively homogeneous control group, the SPMs are not identical, but this is not evidence that the groups differ significantly from each other (see guideline "Report Statistical Tests to Support All Claims" in the recent set of guidelines for reporting functional MR imaging studies 20 ). Table 2 summarizes some of the processing methods and levels of correction used in a number of previously published VBM studies in HD.
Within the published VBM studies in HD, there are differences at almost every step. Three of the 17 HD studies that used VBM do not mention modulation 18,26,27 ; hence, the SPMs from these studies may not be showing the same sort of data as the others. The studies cover a wide range of smoothing kernels (from 4- to 12-mm FWHM), which can have a dramatic effect on findings (Fig 3); α levels in these studies range from the conservative 0.001 (controlling the FWE rate) to the more exploratory 0.005 (without correction for multiple comparisons). There is also huge variation between research groups in the covariates they have included in the standard-HD-versus-control comparison: some include age and TIV but some do not. These differences make it hard to interpret the various findings and may mean that results do not generalize to the population as a whole.

Conclusions
The aim of the work presented here was to demonstrate how changes in VBM processing can mimic biologic changes and the potential for misinterpretation that this presents. This can mean that it is hard to generalize findings or to be confident about the robustness of results. This problem is not restricted to VBM or HD, though the methodologic variations in the studies in Table 2 illustrate the difficulties well. In addition, when contradictory results are published, there is a danger that studies are simply repeated; this repetition is a poor use of resources. Image-classification techniques by using VBM-like data have already been used as a diagnostic tool in the early stages of Alzheimer disease 2,28 and to measure brain changes in response to antipsychotic treatment in schizophrenia. 29 If VBM is to be useful clinically or considered for use as a biomarker, there is a need for more uniformity in its application for the method to be both reproducible and valid.