Measuring Brain Volume by MR Imaging: Impact of Measurement Precision and Natural Variation on Sample Size Requirements

BACKGROUND AND PURPOSE: To determine the sample size needed to provide adequate statistical power in studies of brain volume by MR imaging, we examined the precision and variability of measurements in healthy controls. MATERIALS AND METHODS: A cohort of 52 people (mean age, 25.1 years) was examined at weeks 0 and 12 at 1.5T. We used an axial multisection T1-weighted sequence and a contiguous proton-attenuation/T2-weighted sequence. Data were registered to a probabilistic brain atlas, and an automated atlas-based program was used to segment brain tissue by type and by lobe. We assumed that there were no changes in volume because there were no intervening neurologic events. Sample sizes required to yield 80% statistical power in detecting a significant difference in volume were calculated for various experimental designs, assuming a patient-control volume difference of 5% or 2%. RESULTS: The precision of most measurements was excellent, but required sample sizes were larger than anticipated. If the goal was to detect a 5% difference in whole brain volume in a 2-sample cross-sectional study, the required sample was 73 patients and 73 controls because brain volume varies between individuals in a way that is not informative about disease effects. For a similar 2-sample longitudinal study, the required sample size was just 5 patients and 5 controls. CONCLUSIONS: Our results argue strongly for longitudinal studies in preference to cross-sectional studies, especially as research budgets decline. Our findings also suggest that there may be more uncertainty than expected in published MR imaging brain volume studies.

MR imaging makes it possible to visualize the human brain in vivo with exquisite detail and has been used extensively to examine patients with various brain illnesses, including schizophrenia (SZ). There is a large volume of literature to support the notion that there are characteristic brain structural abnormalities in patients with chronic SZ, 1 and a growing amount of literature to support that notion in patients with first-episode SZ. 2 However, experimental variance can be introduced into MR imaging studies 3 during data acquisition (eg, subject position, scanner field variation, image artifacts, scanner-to-scanner variation) or data analysis (eg, image registration, interpolation, bias field correction, manual interaction). These considerations call into question some of the conclusions that have been made about brain volume abnormalities in patients with SZ. 2 We undertook a study of the precision of brain volume measurement by MR imaging in healthy controls to determine the sample size needed to provide adequate statistical power in future MR imaging studies of brain volume. We evaluated sample sizes required for studies of whole brain volume or volume of various smaller structures in the brain. The basic question is, "If we take 2 sets of measurements from a single subject (or from a single brain structure), do we obtain values that are similar enough that they can be used interchangeably?"

Subjects
Data for this study were collected as part of a 2-year randomized double-blind clinical trial that compared the efficacy and safety of olanzapine with that of haloperidol in patients experiencing first-episode SZ. 4-6 Patient data from that trial were not used here; instead, we focused on brain volume data from 52 healthy controls, which were not reported in detail before. 6 Healthy individuals, most of whom were university students, were recruited as patient controls by advertisement, and each person was seen in a face-to-face interview to screen for medical or psychiatric history findings of any kind, which were exclusionary. Controls were imaged at enrollment, then again 12 weeks later on the same scanners, by using the same imaging protocol described for patients. 6 We used data only from subjects who were imaged at both time points and who had no health complaints at either time point. Controls were a mean age of 25.1 ± 4.0 years at first scanning, with 67.3% being male, and the ethnic composition was 59.6% white, 28.9% African-American, and 11.5% other ethnicity.
For the purposes of this study, we assumed that there should be no changes in adult brain volume over a 12-week period in the absence of an intervening neurologic event, and we assumed that any such events would have been reported by controls or detected by clinicians.

Image Acquisition
Rigorous quality-control procedures were used to ensure that all images were acquired and analyzed by identical methods. 3 All MR imaging data were collected at 1.5T and analyzed blind as to group membership. 6 A scout sequence was run on each subject to help in section positioning; then T1-weighted and T2-weighted image sets were acquired from each subject in the axial plane. A 3D T1-weighted inversion-recovery prepared spoiled gradient-recalled acquisition in steady state was acquired (TR = 12.3 msec, TE = 5.4 msec, flip angle = 20°, section thickness = 1.5 mm, FOV = 24 cm, matrix = 256 × 256, 124 sections). Then a contiguous proton-attenuation/T2-weighted fast spin-echo sequence was acquired (TR = 4000 msec, TE = 15 and 105 msec, flip angle = 90°, section thickness = 3.0 mm, FOV = 24 cm, matrix = 256 × 256, 60 sections). Parameters were optimized to show gray matter (GM) and white matter (WM) with good contrast and to yield reproducible segmentation with a fully automated program.

Image Processing
All patient and control images were centrally analyzed in a multistep process designed to minimize operator interaction. 6 Processing included a bias-field correction step to adjust for intensity inhomogeneities in the images. Baseline T1-weighted data were registered to a probabilistic brain atlas, so that all brains could be analyzed and displayed in a standard coordinate system. Then T2-weighted data were registered to the T1-weighted data within the segmentation algorithm. These images formed the basis for a 3-channel segmentation, which used an automatic atlas-based segmentation program (expectation maximization segmentation) to separate brain tissue into GM, WM, and CSF. 7 The probabilistic brain atlas driving the tissue segmentation also provided a Talairach-based parcellation, dividing the left and right hemispheres, which coarsely represented the frontal, temporal, parietal, and occipital lobes. 6 Atlas registration overlaid these representations onto each scan, thereby creating a fully automatic parcellation for each dataset. Caudate volume was obtained by manual outlining of the caudate head, after rigorous operator training and standardization of methods. 6 Most of the tools used were fully automatic (atlas registration, interscan registration, tissue segmentation, parcellation), which made these procedures robust against rater drift.

Data Analysis
All data from healthy control subjects imaged at both week 0 (baseline) and week 12 were analyzed. We did not attempt to evaluate longer follow-up data because the healthy brain can potentially change over long follow-up intervals 8,9 and because we wanted to characterize measurement precision in the absence of biologic change. Statistical analysis was done by using SAS System software (Version 9.1 TS1M2; SAS Institute, Cary, NC) to compare each subject at week 12 with the same subject at week 0, by using no covariates in the analysis.
To determine the sample size required for studies of brain volume, we made several key assumptions. First, we assumed that the minimal acceptable level of power was 80%. Second, we assumed that there were no biologic differences in controls between week 0 and week 12, so that volumetric changes over this time period must have been due only to error or random variation. Finally, we assumed that patients would differ in brain volume from controls by a small amount, either 5% or 2% in different simulations. Then we used the measured variance at week 0 and week 12 and the variance in the change scores between week 0 and week 12 to calculate the sample sizes necessary to detect both a 5% and a 2% change in the week 12 values for several different study designs.
Power and sample size calculations were performed by using PROC POWER in SAS, Version 9.1. For the cross-sectional 2-group study design, calculations were based on a 2-sample t test on means. For the longitudinal 1-group study, calculations were based on a 1-sample t test. For the longitudinal 2-group (change) study, calculations were based on a 2-sample t test on mean change scores. All tests were 2-tailed, and the 2-sample tests assumed equal variance in both samples. Mean brain volume, mean change, mean difference, and difference in mean change were all calculated as both 5% and 2% of the baseline values or as 5% and 2% differences between groups for the cross-sectional comparisons. The computer program used noncentral t distributions based on hypothesized effect sizes to estimate power and the sample size required for 80% statistical power.
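The three designs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard normal approximation rather than the noncentral t distributions used by PROC POWER, so it will tend to give slightly smaller sample sizes when n is small. The function names and the input values (mean volume, SDs) are hypothetical and are not taken from this study.

```python
import math
from statistics import NormalDist


def n_per_group_two_sample(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group for a 2-tailed, equal-variance
    2-sample comparison (normal approximation, not noncentral t)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a 2-tailed test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


def n_one_sample(delta, sd_change, alpha=0.05, power=0.80):
    """Approximate subjects for a 2-tailed 1-sample test on change scores."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_change / delta) ** 2)


# Hypothetical inputs, chosen only for illustration.
mean_volume = 1200.0   # mL, assumed mean whole brain volume
sd_between = 120.0     # mL, assumed SD of volume across subjects
sd_change = 30.0       # mL, assumed SD of week 12 minus week 0 change scores
delta = 0.05 * mean_volume  # a 5% difference, in mL

print("cross-sectional, per group:", n_per_group_two_sample(delta, sd_between))
print("longitudinal 2-group, per group:", n_per_group_two_sample(delta, sd_change))
print("longitudinal 1-group:", n_one_sample(delta, sd_change))
```

Because the required n scales with the square of SD/delta, the design that uses the smaller SD (change scores rather than between-subject volume) needs far fewer subjects for the same delta.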

Comparison of Baseline to Follow-Up Data
The mean difference between week 0 and week 12 was generally quite small. For whole brain volume, there was only a 2.2 cm³ (mL) discrepancy in mean volume between week 0 and week 12 (Table 1). This represented a 0.22% difference, and both the Pearson correlation coefficient and the concordance value were 0.98. We note that because the image parcellation method was fully automated, analyzing the same dataset twice would have produced identical results each time (concordance value = 1.00).
For left frontal WM, there was a 0.6-mL mean difference in volume between baseline and week 12, which represented a 0.58% difference, and both the Pearson and the concordance values were approximately 0.93. Even for the caudate, which shows the largest proportional difference between week 0 and week 12, there was only a 0.8% difference, and both the Pearson and concordance values showed a strong correlation. Data in Table 1 suggest that precision in this study was excellent overall and comparable with other published brain volumetric studies. 2

Analysis of Error
To characterize error that might corrupt MR imaging findings, we did an analysis of the absolute differences between week 0 and week 12 (Table 2). The mean volume difference between week 0 and week 12 was generally quite small (Table 1), but this could have been an artifact. If all differences are random, then one would expect some volumes to increase and others to decrease, so that the net result could be zero because volume increases are offset by volume decreases. However, even if random variations in individual measurements average to a small value, there could still be a substantial reduction in the interchangeability of data between week 0 and week 12. To evaluate this possibility, we calculated the absolute magnitude of the difference between week 0 and week 12. In whole brain, the absolute magnitude of change was roughly 8-fold larger than the mean change, or approximately 2%. The greatest single brain volume decrease was 68.3 mL or −5.7%, whereas the greatest single volume increase was 77.3 mL or +6.5%. Such changes are clearly not consistent with the small changes expected in the volume of an adult brain over 12 weeks.
In the caudate, the largest volume decrease was 44%, whereas the largest volume increase was 49%. Because the caudate is rather small and its margins can be hard to define, it should not be surprising that it is measured with less reliability than whole brain.
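The distinction between the mean (signed) change and the mean absolute change can be illustrated with a short Python sketch. The volumes below are hypothetical values invented for illustration and are not data from this study; the function name is likewise an assumption.

```python
from statistics import mean


def change_summary(week0, week12):
    """Return (mean signed change, mean absolute change) in the input units."""
    diffs = [b - a for a, b in zip(week0, week12)]
    signed = mean(diffs)
    absolute = mean([abs(d) for d in diffs])
    return signed, absolute


# Hypothetical whole brain volumes (mL) for 4 subjects at week 0 and week 12.
week0 = [1200.0, 1100.0, 1000.0, 1300.0]
week12 = [1224.0, 1078.0, 1020.0, 1274.0]

signed, absolute = change_summary(week0, week12)
print("mean signed change (mL):", signed)
print("mean absolute change (mL):", absolute)
```

In this toy example, increases and decreases nearly cancel in the signed mean, whereas the absolute mean shows that individual measurements differ by roughly 2% of volume between sessions.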

Sample Size Estimates
The sample size required to detect a 5% difference in the week 12 values is shown in Table 3 for several different study designs. The most common study design is to compare patients with controls at a single time point, 2 to measure volume differences at baseline. Even in the whole brain, where volume measurements are made with the greatest precision and where artifacts arising from parcellation cannot be a factor, the required sample size for a cross-sectional study design is 146 subjects, equally apportioned between patients and controls. If the study aim is to detect a 5% change in a single group with time so that each subject acts as his or her own control, the required sample size is only 4 subjects. If the study aim is to detect a 5% change in 1 group, in contrast to no change in a second group, the required sample size is 10 subjects, equally apportioned between the groups. For frontal WM, if the study aim is to detect a 5% change in a single group with time, the required sample size is only 7 subjects, all in the same group (Table 3). If the study aim is to detect a 5% change in 1 group, in contrast to no significant change in a second group, the required sample size is 20 subjects, equally apportioned between the 2 groups. Yet, if the goal is to detect a 5% difference between 2 groups, a total sample size of 134-162 subjects is required, equally apportioned between patients and controls.
For the caudate, which is measured with much less precision, the required sample sizes are accordingly larger (Table 3). To detect a 5% change with time in caudate volume in a single group requires a sample size of 47-54 subjects, but to detect a 5% difference in change rate between 2 groups requires a sample size of 180-208 subjects. Finally, to have 80% power to detect a 5% difference in caudate volume between 2 groups at baseline requires a sample of 210-244 subjects.
We also calculated the sample size required to detect a 2% difference between patients and controls for several study designs (Table 4). In the whole brain, if the study aim is to detect a 2% change in a single group with time so that each subject can act as his or her own control, the required sample size is 11 subjects. If the study aim is to detect a 2% change in 1 group, in contrast to no significant change in a second group, the required sample size is 38 subjects, apportioned equally between the 2 groups. However, the required sample size for a cross-sectional study with 80% power to detect a 2% difference between patients and controls is 896 subjects overall.

Table 1: Volume (milliliters) of whole brain and brain lobes in 52 control subjects, measured at baseline and again at 12 weeks postbaseline*
Structure   Baseline   Week 12
* None of the mean volume differences were significant by paired-sample t test, and the Pearson correlation between week 0 and week 12 was generally quite high. Concordance is also a measure of the degree to which values at week 0 and week 12 are correlated; unlike the Pearson correlation, which does not take account of the sample mean, concordance is sensitive to changes in sample mean.

Discussion
Our results suggest that the sample sizes necessary to obtain 80% statistical power to detect a 5% difference in brain volume are considerably larger than anticipated (Table 3), even though the precision of most measurements was quite good (Table 1). The sample sizes required to detect a significant difference are correspondingly larger if an assumption is made that there is actually a 2% difference in brain volume between patients and controls (Table 4). Our findings suggest that there may be more uncertainty than expected in brain volumetric findings.
In considering whether 2 sets of measurements are equivalent, a critical consideration is the intended purpose of the comparison. If researchers are probing for large differences in a cross-sectional study or for large changes in a longitudinal study, then the strict equivalence of 2 measurements is a less important issue. If the effect size sought is large, then measurement precision need not be great to detect such a difference. Conversely, if effect sizes are small, as they are likely to be in most brain imaging studies, then one needs very precise measurements to detect a difference.
There are several ways to determine the interchangeability of 2 sets of measurements that have a continuous distribution. Traditional psychometric testing uses interclass correlations (eg, Pearson correlations), which arise from test theory. High correlation values are required for a measurement to have adequate validity; in test theory, reliability is an upper bound on validity because a test cannot be more valid than it is reliable. However, concordance is more appropriate than the common Pearson product moment correlation for assessing the interchangeability of scores because concordance is sensitive to differences in sample means as well as to the linear relationship between 2 sets of scores. For example, if 2 sets of brain volume measurements were available and all volumes in 1 set were exactly 5-fold larger than in the other set, the Pearson correlation would be 1.00, whereas the concordance correlation would be much less than 1.00, showing that the 2 datasets are not interchangeable. In our analysis, we did not test for statistical significance of the correlation between week 0 and week 12; although this value would have been significant, it would not have been meaningful because the usual significance test for a correlation tests a null hypothesis that there is zero correlation between 2 measurements. Such a null hypothesis is not appropriate in this study; if 2 datasets are to be used interchangeably, they must have a correlation close to unity.
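The 5-fold example above can be checked numerically. The sketch below assumes that "concordance" refers to Lin's concordance correlation coefficient, which is a common choice but is not named explicitly in the text; the volumes and function names are hypothetical.

```python
import math
from statistics import fmean


def _moments(x, y):
    """Population means, variances, and covariance of two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxx = fmean([(a - mx) ** 2 for a in x])
    syy = fmean([(b - my) ** 2 for b in y])
    sxy = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return mx, my, sxx, syy, sxy


def pearson(x, y):
    """Pearson product moment correlation; insensitive to shifts in mean or scale."""
    _, _, sxx, syy, sxy = _moments(x, y)
    return sxy / math.sqrt(sxx * syy)


def concordance(x, y):
    """Lin's concordance correlation (assumed definition); penalizes
    differences in mean and scale as well as departures from linearity."""
    mx, my, sxx, syy, sxy = _moments(x, y)
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical volumes (mL); the second set is exactly 5-fold larger.
set_a = [1000.0, 1100.0, 1200.0, 1300.0]
set_b = [5 * v for v in set_a]

print("Pearson:", pearson(set_a, set_b))
print("Concordance:", concordance(set_a, set_b))
```

The Pearson value is 1 up to rounding because the relationship is perfectly linear, whereas the concordance value is close to zero because the two sets differ greatly in mean and scale.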
The concordance correlations reported here (Table 1) are generally quite high. Concordance for the whole brain is 0.98, whereas the concordance for GM averages 0.94 ± 0.02 and for WM, averages 0.91 ± 0.03. Certain structures have lower concordance correlations (eg, caudate = 0.56), showing that there is more imprecision in the measurement of small structures or structures with indefinite boundaries (such as the head of the caudate).
Spatial resolution of the imaging method was rather low (individual voxels were 0.9 × 0.9 × 1.5 mm in the T1-weighted sequence and 0.9 × 0.9 × 3.0 mm in the T2-weighted sequence), so it is possible that some voxels contained a mixture of GM and WM. A "mixed" voxel would be classified as either GM or WM, depending on the exact proportion of each tissue in the voxel, as well as on a host of other factors, 2 so problems in segmentation could account for some of the variation that we report. However, given the nature of our dataset, we cannot calculate exactly how much experimental variance was due to errors in data acquisition (eg, subject position, scanner-field variation, image artifacts, scanner-to-scanner variation) and how much was due to errors in data analysis (eg, image registration, interpolation, bias field correction, manual interaction).
An unexpected finding is that the sample size required to evaluate whole brain volume differences between patients and controls is substantial, even though large structures can be measured with a great deal of precision (Tables 3 and 4). This is because total brain volume varies substantially from 1 person to another in a way that is probably not informative about health or disease effects. In our study, the smallest brain volume was 783 mL, whereas the largest brain volume was 1414 mL. Therefore, the largest brain was 81% bigger than the smallest brain, even though all of our subjects were healthy, of normal intelligence, and functioning at a high level.
Perhaps it should not be surprising that brain size varies in ways that are not biologically informative. Men have brains that are, on average, ~9% larger than those of women, after controlling for all known covariates including body size. 10,11 GM volume is significantly correlated with verbal intelligence quotient (IQ), performance IQ, and full-scale IQ, yet only 12%-31% of the total variance in IQ can be explained by GM volume. 12 Patients with SZ have brains that are, on average, only 2% lighter in weight than age- and sex-matched controls (P < .04), but the effects of both age and sex are far more significant (P < .0001) than the effect of disease. 13
The sample size required to characterize small structures such as frontal GM is substantially larger than the sample size required to characterize whole brain volume (Tables 3 and 4). This is presumably because of greater imprecision in the measurement of small-volume structures. Yet the volume of frontal GM also has a certain amount of natural person-to-person variation that is unrelated to disease effects, as is true of the whole brain. The difference in sample size required for longitudinal versus cross-sectional studies gives an indication of how sample size is influenced both by measurement precision and by natural variation. If a subject is compared with himself, as in a longitudinal study, then measurement precision is the only factor that can affect the sample size. If a subject is compared with another subject, as in a cross-sectional study, then both measurement precision and natural variation affect the sample size. When a 5% difference between groups is anticipated (Table 3), a cross-sectional 2-group comparison of GM volume requires, on average, 170.5 subjects, whereas a longitudinal 2-group comparison requires, on average, 20.5 subjects. When subtle differences are anticipated between patients and controls (Table 4), a cross-sectional 2-group comparison of GM volume requires, on average, 1048.8 subjects, whereas a longitudinal 2-group comparison requires an average of 111.8 subjects. Thus, a cross-sectional study of GM volume requires 8- to 9-fold more subjects than a longitudinal study, largely because of person-to-person variation. Yet such variation in GM volume may have no clinical significance.
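One way to see why person-to-person variation drives this ratio is a simple additive variance model, sketched below in Python. The model is an assumption made for illustration and is not the analysis performed in this study: each measurement is taken to be a subject's true volume plus independent measurement error, so the cross-sectional variance is the sum of between-subject and error variance, whereas the variance of a change score is twice the error variance. The SD values and the function name are hypothetical.

```python
def sample_size_ratio(sd_between, sd_error):
    """Approximate ratio of cross-sectional to longitudinal (change-score)
    sample size for the same absolute difference, under an assumed additive
    model with independent measurement errors at each session."""
    var_cross = sd_between ** 2 + sd_error ** 2  # natural variation + error
    var_change = 2 * sd_error ** 2               # error at 2 sessions only
    return var_cross / var_change


# Hypothetical SDs (mL): large natural variation, small measurement error.
ratio = sample_size_ratio(sd_between=100.0, sd_error=25.0)
print("cross-sectional / longitudinal sample size ratio:", ratio)
```

Under this sketch, the ratio grows with the square of the between-subject SD relative to the measurement error, which is the qualitative pattern seen in Tables 3 and 4.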
One approach to compensating for individual variation in volume of brain structures is to normalize brain structure volume measurements to the total intracranial volume of each subject. 14 This might make it easier to compare hippocampal volume between subject groups, but this approach has several drawbacks. If hippocampal volume is normalized to intracranial volume, a ratio is formed that should not be analyzed with the same statistical tests that are used for raw (uncorrected) values. Furthermore, this approach will tend to minimize volumetric changes that are likely to be small anyway, thereby making it harder to achieve statistical significance. For example, if hippocampal volume is 7 mL total and the brain volume is 1400 mL, then the ratio of hippocampal volume to brain volume is only 0.005 or 0.5%. Such numbers are more difficult to use in statistical tests than the raw value of 7 mL would be. A stronger experimental design is to match patients and controls for total intracranial volume, but this is not always possible.
What implications do our findings have for the earlier study 6 of brain volume change in patients with SZ receiving olanzapine or haloperidol? That study was a longitudinal analysis of 2 groups, which used change scores as an end point, so a total of only 10 subjects would be required for an evaluation of changes in whole brain volume (Table 3). That study actually included 164 patients, 6 evenly allocated between 2 treatment groups, so there was more than adequate power to detect a 5% change in total brain volume between groups. In fact, power was even adequate to detect a 2% change in total brain volume (Table 4).
However, most studies of brain volume in SZ are cross-sectional, not longitudinal, 2 and our findings could have implications for any such cross-sectional studies. In a recent review of 180 cross-sectional studies of patients with chronic SZ, 1 only 11 studies apportioned at least 146 subjects between 2 study groups. In a meta-analysis of 47 cross-sectional studies of patients with first-episode SZ, 2 no studies apportioned at least 146 subjects between 2 study groups. Thus, much of what we think we know about how SZ affects brain volume is open to question.
It is especially problematic that some of the brain volume changes that have been described, especially in patients with first-episode SZ, involve volume deficits smaller than the 5% difference that we assumed here. For example, a meta-analysis of whole brain volume deficit in patients with first-episode SZ, which included 524 patients and 650 healthy controls in a cross-sectional design, concluded that the first-episode patient brain is only 2.7% smaller than the control brain. 2 This finding agrees well with the finding that brain weight is 2% less in patients with SZ. 13 Yet, if the difference in brain volume between first-episode patients and controls is actually 2%-3%, then none of the contributing studies in the meta-analysis were adequately powered to detect such a small difference.
We note that our results are directly relevant only to studies that use a pixel-count volumetric method to measure brain volume, whereas an alternative approach to characterizing brain volume is provided by voxel-based morphometry (VBM). A direct comparison of a volumetric method with VBM showed that VBM could detect significant hippocampal atrophy in a longitudinal study of patients with Alzheimer disease, whereas a volumetric method could identify no significant change. 15 This study was a longitudinal evaluation of patients and controls who were matched for volume of the hippocampus, so the study design was optimized to characterize hippocampal atrophy by VBM, even with a small sample size. In the absence of volumetric matching, natural variability in brain volume will generally force VBM studies to have a large sample size as well, even if VBM is more precise than volumetric methods.
A potential limitation of our study is that we have assumed that volume changes in the adult brain over a period of 12 weeks are due to imprecision of the MR imaging method. However, there are some studies suggesting that human brain volume, measured by MR imaging, can change rather rapidly. The average large-vessel ischemic stroke volume is 54 mL, and this volume of tissue is lost during just 10 hours of stroke evolution. 16 Lack of fluid intake for 16 hours decreases brain volume by 0.55% (or roughly 7 mL), and rehydration can increase total cerebral volume by 0.72%. 17 Acute brain volume changes have also been described in healthy people dehydrated as a result of airplane travel, 18 in young patients having prolonged febrile seizure, 19 in adults recovering from an eating disorder, 20 in patients with bipolar disorder receiving lithium, 21 in patients with multiple sclerosis either left untreated 22 or treated with methylprednisolone, 23 in patients with obsessive-compulsive disorder given paroxetine, 24 in short-stature youth receiving growth hormone therapy, 25 in patients with renal failure who got hemodialysis, 26 and in patients with SZ treated with haloperidol. 6 Among patients with SZ, acute increases in whole brain volume are associated with exacerbation of psychosis, whereas acute decreases in volume are linked to symptom remission. 27 Among alcohol-dependent men, there can be acute changes in brain volume during alcohol withdrawal, 28 and WM volume is correlated with blood hematocrit. 29 In a small study of alcohol-dependent men imaged before and after 1 month of abstinence, total intracranial volume was reported to vary by only 0.4%, but WM volume increased by an average of 10.3%. 30 However, in all of these previous reports, subjects showing an acute brain volume change were demonstrably not healthy before treatment, whereas the subjects in the present study were all well at baseline and well at follow-up. Even if several of our subjects had health issues that were not detected, our sample of subjects was still characterized by remarkably good health overall.
One potential way to compensate for the experimental imprecision that we demonstrate is to model brain volume by using a more sophisticated method. Currently, each individual brain region or tissue type is typically analyzed as if it were changing independently of all other tissue types. Clearly, if a certain voxel is segmented as GM at baseline and WM at follow-up, then there will be intercorrelated changes in volume of both GM and WM. A statistical method should be developed, on the basis of correlated volumetric changes, that would acknowledge that changes in 1 tissue compartment can be offset by changes in another tissue compartment.
Our main finding is that natural variation among very healthy people can swamp all but the largest experimental or disease effects. Our results argue in strong terms for the utility of longitudinal studies in preference to cross-sectional studies, especially as research budgets decline. If one considers only WM tissues in which a 5% change in volume is expected, the average sample size required for a cross-sectional study is 221 subjects, whereas the average sample size required for a longitudinal study is 41 subjects. Even if one allows that a longitudinal study requires that each subject be imaged twice, a cross-sectional study would require approximately 2.7-fold more images to be acquired than a longitudinal study and would be correspondingly more expensive for a comparable level of statistical power.
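The imaging-cost comparison above is simple arithmetic, reproduced here as a short Python sketch. The subject counts (221 and 41) are the WM averages quoted in the text; the function name and the assumption of exactly 1 scan per cross-sectional subject and 2 scans per longitudinal subject are ours.

```python
def scans_required(n_subjects, scans_per_subject):
    """Total number of imaging sessions needed for a study design."""
    return n_subjects * scans_per_subject


cross_sectional_scans = scans_required(221, 1)  # each subject imaged once
longitudinal_scans = scans_required(41, 2)      # each subject imaged twice

fold_more = cross_sectional_scans / longitudinal_scans
print("cross-sectional scans:", cross_sectional_scans)
print("longitudinal scans:", longitudinal_scans)
print("fold more scans for cross-sectional design:", round(fold_more, 1))
```

This counts imaging sessions only; it does not model other costs such as recruitment or subject retention over the follow-up interval.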