A Validation Study of Multicenter Diffusion Tensor Imaging: Reliability of Fractional Anisotropy and Diffusivity Values

BACKGROUND AND PURPOSE: DTI is increasingly being used as a measure to study tissue damage in several neurologic diseases. Our aim was to investigate the comparability of DTI measures between different MR imaging magnets and platforms. MATERIALS AND METHODS: Two healthy volunteers underwent DTI on five 3T MR imaging scanners (3 Trios and 2 Signas) by using a matched 33 noncollinear diffusion-direction pulse sequence. Within each subject, a total of 16 white matter (corpus callosum, periventricular, and deep white matter) and gray matter (cortical and deep gray) ROIs were drawn on a single image set and then were coregistered to the other images. Mean FA, ADC, and longitudinal and transverse diffusivities were calculated within each ROI. Concordance correlations were derived by comparing ROI DTI values among each of the 5 magnets. RESULTS: Mean concordance for FA was 0.96; for both longitudinal and transverse diffusivities, it was 0.93; and for ADC, it was 0.88. Mean scan-rescan concordance was 0.96–0.97 for all DTI measures. Concordance correlations within platforms were, in general, better than those between platforms for all DTI measures (mean concordance of 0.96). CONCLUSIONS: We found that a 3T magnet and high-angular-resolution pulse sequence yielded comparable DTI measurements across different MR imaging magnets and platforms. Our results indicate that FA is the most comparable measure across magnets, followed by individual diffusivities. The comparability of DTI measures between different magnets supports the feasibility of multicentered clinical trials by using DTI as an outcome measure.

DTI provides a quantitative measurement of the magnitude and direction of water diffusion within tissues. 1,2 Changes in water diffusion reflect alterations in tissue microstructure 3 and thus make DTI a noninvasive technique to study diseases caused by tissue destruction, such as MS. 2,4 The extent of tissue injury is commonly evaluated by using DTI-derived measures such as FA, longitudinal diffusivity (λ∥), transverse diffusivity (λ⊥), and ADC. 2,5 Studies in patients with MS revealed changes in DTI measures not only within MS lesions but also in normal-appearing white matter and gray matter, which are generally not discernible by using conventional T1- and T2-weighted imaging techniques. 6-9 The sensitivity of DTI metrics to ongoing changes in the brain makes DTI an attractive outcome measure for MS therapeutic studies. 10 However, little is known about the validity of DTI measures between scanners. Recently, 2 independent studies 11,12 have attempted to address the reproducibility of DTI measures. They assessed FA 11,12 and λ⊥ 11 in the corpus callosum across different centers, all by using 3T scanners. Both found DTI measures to generally be homogeneous across centers when measured with scanners from the same manufacturer.
However, 2 major issues remain to be resolved: the comparability of DTI measures across different brain regions and the comparability of DTI measures across different scanners. We thus sought to address these issues by comparing DTI measures obtained from 2 subjects at 5 centers. Specifically, we examined the following factors: 1) scanners from different manufacturers, 2) multiple platforms from the same manufacturer, 3) DTI measures from different brain regions with a wide range of anisotropy, and 4) software upgrades.

MR Imaging Scanners
Two healthy male volunteers (both 35 years of age) were imaged on five 3T MR imaging magnets at different sites: 3 Trios (Siemens, Erlangen, Germany) and 2 Signas (GE Healthcare, Milwaukee, Wisconsin). In addition, the same volunteers were imaged annually during the course of 2 years (year 2 and year 3) on one of the Siemens magnets, Trio-1. During the course of the study, a scanner upgrade (TIM upgrade) on the Trio-1 alone between years 1 and 2 allowed comparisons across the upgrade (year 1 versus year 2) as well as independent of the upgrade (year 2 versus year 3). The other 2 Trio scanners (year 1) were used before the TIM upgrade.
We adapted these pulse sequences to the GE scanners, endeavoring to match the parameters. The imaging included anatomic T1 MPRAGE (spoiled gradient-recalled 256 × 256 FOV, 128 × 128 matrix, 124 sections, 1.2 mm thick, TE = 1.3 ms, TR = 2800 ms, TI = 900 ms, bandwidth = 977 Hz/Px) and twice-refocused spin-echo DTI with EPI readout (same as for Siemens, except TE = 94 ms, TR = 4200 ms, bandwidth = 1954 Hz/Px). The diffusion-gradient directions were explicitly matched to be the same on each scanner. The voxel size of DTI for both Siemens and GE scanners was 2.5 × 2.5 × 2.5 mm.

Data Analysis
Image Registration. Images were registered by using a surface-based method, a modified version of the Iterated Closest Point algorithm, 13 which minimized the distance between automatically segmented brain surfaces to determine the optimal rigid transformation (3 rotations + 3 translations). Transformation parameters were determined between the DTI and MPRAGE images acquired at the same site and between the MPRAGE images acquired at each of the 5 sites. These parameters were further combined to obtain a set of final registration transformation parameters, which were used to map ROIs as described below.
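As an illustration only, the combination of rigid transformations can be sketched with 4 × 4 homogeneous matrices. The registration software itself is not described in this report, so the rotation order, the direction conventions, the parameter values, and all function names below are assumptions made for the sketch rather than the method actually used.

```python
import numpy as np

def rigid_transform(rx, ry, rz, tx, ty, tz):
    """Build a 4x4 homogeneous rigid transform from 3 rotations (radians)
    and 3 translations (mm). The Rz @ Ry @ Rx order is an assumption."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [tx, ty, tz]
    return T

def apply_transform(T, points):
    """Apply a 4x4 transform to an (N, 3) array of coordinates."""
    pts = np.asarray(points, dtype=float)
    homog = np.c_[pts, np.ones(len(pts))]
    return (homog @ T.T)[:, :3]

# Hypothetical parameters for one site (not taken from the study).
dti_to_mprage_site = rigid_transform(0.01, -0.02, 0.03, 1.0, -2.0, 0.5)
mprage_site_to_ref = rigid_transform(-0.05, 0.04, 0.02, -3.0, 1.5, 2.0)

# Combined transform: site DTI space -> reference MPRAGE space.
dti_site_to_ref = mprage_site_to_ref @ dti_to_mprage_site

# ROIs are drawn in the reference space, so mapping them onto the
# site DTI images uses the inverse of the combined transform.
ref_to_dti_site = np.linalg.inv(dti_site_to_ref)
```

Because each step is rigid, the combined matrix is also rigid, and its inverse maps ROI coordinates from the reference space back onto a given site's DTI images.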
Mapping ROIs. For each subject, 16 ROIs were drawn on the MPRAGE image from 1 of the magnets (Trio-1, Fig 1), which was used as a reference mask. The ROIs encompassed 5 major areas with differing anisotropic properties: 1) the corpus callosum (genu, splenium), 2) periventricular white matter (parietal, occipital), 3) deep white matter (frontal, parietal, occipital), 4) cortical gray matter (posterior parietal, occipital), and 5) deep gray matter (putamen). ROI volumes ranged from approximately 182 to 742 mm³, with the smallest ROIs located in the cortex. The final registration transformations (described above) were applied to coregister the reference ROI mask to each set of DTI images obtained from the other 4 centers. The ROIs were verified visually and adjusted manually to account for nonlinear distortions inherent to EPI data.
Calculation of FA, λ∥, λ⊥, and ADC. Mean values for FA, λ∥, λ⊥, and ADC were calculated within each ROI. The diffusion tensor and its properties (eigenvalues, FA, and ADC) were calculated by using software developed in-house. First, the diffusion profile for the 33 diffusion directions was calculated on a voxel-by-voxel basis by a standard linear-log fit. 14 The diffusion profile was fit to the 9-element diffusion tensor by using least-squares. The tensor was then diagonalized to obtain the eigenvalues and eigenvectors by using standard routines to derive λ∥ and λ⊥. 15 ADC was calculated by taking the mean of the eigenvalues, and FA was derived accordingly. 1
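The in-house software is not reproduced here, but the steps just described can be sketched for a single voxel as follows. This is a minimal sketch under stated assumptions: the b-value, the synthetic tensor, the random gradient directions, and all function names are hypothetical; the fit solves for the 6 unique elements of the symmetric 3 × 3 (9-element) tensor; and λ∥ is taken as the largest eigenvalue and λ⊥ as the mean of the 2 smaller eigenvalues, which is the common convention but is not stated explicitly above.

```python
import numpy as np

def fit_tensor(signals, s0, bvecs, bval):
    """Log-linear fit of the diffusion profile, then least-squares fit
    of a symmetric 3x3 tensor to that profile."""
    g = np.asarray(bvecs, dtype=float)
    # Apparent diffusivity along each gradient direction.
    d_profile = -np.log(np.asarray(signals, dtype=float) / s0) / bval
    # Design matrix for the 6 unique tensor elements.
    A = np.column_stack([
        g[:, 0] ** 2, g[:, 1] ** 2, g[:, 2] ** 2,
        2 * g[:, 0] * g[:, 1], 2 * g[:, 0] * g[:, 2], 2 * g[:, 1] * g[:, 2],
    ])
    dxx, dyy, dzz, dxy, dxz, dyz = np.linalg.lstsq(A, d_profile, rcond=None)[0]
    return np.array([[dxx, dxy, dxz], [dxy, dyy, dyz], [dxz, dyz, dzz]])

def dti_metrics(D):
    """Diagonalize the tensor and derive FA, ADC, and the 2 diffusivities."""
    evals = np.sort(np.linalg.eigvalsh(D))[::-1]  # descending order
    adc = evals.mean()                            # mean of the eigenvalues
    fa = np.sqrt(1.5 * np.sum((evals - adc) ** 2) / np.sum(evals ** 2))
    return {
        "FA": float(fa),
        "ADC": float(adc),
        "lambda_par": float(evals[0]),            # assumed: largest eigenvalue
        "lambda_perp": float(evals[1:].mean()),   # assumed: mean of the other 2
    }

# Synthetic single-voxel example with made-up values.
rng = np.random.default_rng(0)
g = rng.normal(size=(33, 3))
g /= np.linalg.norm(g, axis=1, keepdims=True)     # 33 unit directions
bval = 1000.0                                     # s/mm^2, assumed
D_true = np.diag([1.7e-3, 0.3e-3, 0.3e-3])        # mm^2/s, hypothetical
s0 = 1000.0
signals = s0 * np.exp(-bval * np.einsum("ij,jk,ik->i", g, D_true, g))

D_fit = fit_tensor(signals, s0, g, bval)
metrics = dti_metrics(D_fit)
```

In this noise-free example the fit recovers the synthetic tensor; with real data, each ROI value would be the mean of such voxelwise metrics.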

Statistical Analysis
All analyses were performed by using SAS (SAS Institute, Cary, North Carolina). The Lin concordance correlation 16 was used to compute concordance among DTI measures from all 16 ROIs. The following comparisons were made to evaluate the relationship between scanners: 1) overall mean concordance (ie, the mean of each paired concordance analysis) among each of the 5 magnets, 2) overall mean concordance within the 3 Siemens and within the 2 GE magnets (same-platform comparison), 3) overall mean concordance between the Siemens and GE magnets (cross-platform comparison), 4) longitudinal comparison (year 1-year 3) in Trio-1, and 5) before and after TIM upgrade in Trio-1. The CV was derived within each ROI among the 5 different scanners and longitudinally within the single scanner, and it was then averaged together across the ROIs and 2 study subjects.
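The analyses were run in SAS; the following Python sketch is offered only to make the 2 summary statistics concrete. It assumes the standard form of the Lin concordance correlation (with 1/n variances) and a sample standard deviation for the CV; the function names and the FA values are hypothetical and are not study data.

```python
from itertools import combinations
import numpy as np

def lin_ccc(x, y):
    """Lin concordance correlation between 2 sets of paired ROI values."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    sxy = np.mean((x - mx) * (y - my))
    return float(2 * sxy / (x.var() + y.var() + (mx - my) ** 2))

def mean_pairwise_ccc(values):
    """Mean of the concordance over every pair of scanners.
    `values` has shape (n_scanners, n_rois)."""
    pairs = combinations(range(values.shape[0]), 2)
    return float(np.mean([lin_ccc(values[i], values[j]) for i, j in pairs]))

def mean_cv_percent(values):
    """CV (SD / mean) across scanners within each ROI, averaged over ROIs."""
    cv = values.std(axis=0, ddof=1) / values.mean(axis=0)
    return float(100 * cv.mean())

# Made-up FA values for 16 ROIs on 5 scanners (illustration only).
base_fa = np.linspace(0.15, 0.80, 16)
scanner_offsets = np.array([0.00, 0.01, -0.01, 0.02, -0.02])
fa_by_scanner = base_fa + scanner_offsets[:, None]

overall_ccc = mean_pairwise_ccc(fa_by_scanner)
overall_cv = mean_cv_percent(fa_by_scanner)
```

Unlike the Pearson correlation, the concordance correlation is reduced by a systematic offset between scanners, because the squared difference in means appears in the denominator.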

Overall Comparison among the 5 Magnets
The regional mean DTI values were comparable among all 5 scanners (Fig 2). The FA, λ∥, λ⊥, and ADC values obtained from regions of low (gray matter), medium (periventricular and deep white matter), and high (corpus callosum) anisotropy were very similar among the 5 scanners. As indicated by error bars in Fig 2, the λ∥, λ⊥, and ADC values were more variable in the cortical gray matter compared with other regions, possibly because these ROIs were smaller and, therefore, more sensitive to noise. Because FA is calculated as a ratio with diffusivities in the numerator and denominator, noise effects may partially cancel out. The overall mean concordance (Table 1) was highest for FA (0.96) and lowest for ADC (0.88), with the values for both λ⊥ and λ∥ being 0.93.

Cross-Platform Comparison
To evaluate manufacturer differences, we compared the DTI measures obtained from the 3 Siemens platforms with those from the 2 GE platforms (On-line Fig 1 and Table 1). The concordance was strongest for FA (0.96), followed by λ∥ (0.92) and λ⊥ (0.91), and weakest for ADC (0.86).

Within-Platform Comparison
Concordance was better in the within-platform than in the cross-platform comparison (Table 1, On-line Fig 1). It was strongest for FA (0.96 for Siemens; 0.97 for GE) and was also strong for λ⊥ and λ∥ (0.93 for Siemens, 0.94 for GE). As in the other comparisons, concordance was weakest for ADC (0.89 for Siemens, 0.90 for GE).

Longitudinal Scan Comparison
Concordance for λ∥, λ⊥, and ADC was modestly higher when the 2 subjects were rescanned on the same Siemens magnet (Table 1 and On-line Fig 2) than between magnets.

Software Upgrade Comparison
The first longitudinal scan (year 1) was obtained before the Siemens TIM upgrade, and the subsequent 2 scans (year 2 and year 3) were obtained after the upgrade. There was excellent correlation both across and after the software upgrade (Table 1 and On-line Fig 2). We observed a small decrease in mean FA in periventricular and deep white matter regions, which was driven predominantly by a corresponding increase in λ⊥ (On-line Fig 2). There was also a small increase in ADC values following the TIM upgrade (On-line Fig 2); however, that did not appear to have any effect on concordance values.

CV
The CV across scanners varied from 4.8% to 9.1%, depending upon the DTI measure (Table 2). CV was lower within a single scanner over time than across different scanners. CV was generally lower when evaluating white matter only.

Discussion
We evaluated the applicability of DTI in multicenter clinical trials by systematically comparing DTI measures obtained by scanning 2 subjects in 5 different 3T MR imaging scanners by using a standardized pulse sequence. Comparability of DTI measures is important when combining values from different magnets and different platforms together into a single dataset. We observed very strong concordances (Fig 2, On-line Fig 1, and Table 1) for FA, λ⊥, and λ∥ values across different magnets within different brain regions encompassing both white and gray matter.

Reliability of FA and λ⊥
We found the highest concordance for FA (0.96) followed by the 2 component diffusivities and the weakest for ADC. One explanation for the lower ADC concordance is that because ADC is the mean of the diffusion tensor eigenvalues, it can reflect variability due to changes in the overall magnitude of diffusivity, anisotropy of diffusivity, or a combination of the 2 effects. Therefore, fitting the diffusion profile to a single isotropic coefficient may lead to considerable variability. Moreover, diffusion in the brain is not isotropic, thereby making the derivation of a precise diffusion coefficient value more complex. A previous simulation study had also predicted that cross-scanner comparisons would find greater variability in ADC than FA, due to the underlying properties of ADC calculations as mentioned above. 17 The results of our study suggest that ADC may provide less statistical power than the other metrics examined. However, our CV estimates found the greatest variability in FA and the least in ADC (Table 2). Because FA is a unitless ratio of eigenvalues, variability in individual eigenvalues may compound when eigenvalues are combined to derive FA, leading to greater variability. Concordance values are a function not only of the absolute difference in values but also the range of values. As evident from Fig 2, there are more significant variations in FA values throughout the brain compared with ADC, thus having a positive impact on concordance, despite the higher CV. In addition, FA becomes non-Gaussian at high values, whereby the same amount of noise in individual eigenvalues will lead to less variability with high FAs compared with low FAs. This non-Gaussian behavior of FA may have contributed to its stronger concordance. Two potential sources of error in our study are inaccuracies in image coregistration and variability in different diffusion pulse sequences across platforms. 
Inaccuracies in coregistration would decrease the observed correlations between ROIs, and this effect is expected to be greater between platforms. FA, λ∥, and λ⊥ have greater regional variability than ADC, which would suggest that FA and component diffusivities would be more susceptible to coregistration error and thus have weaker concordance correlations. We have, in fact, observed the opposite: stronger correlations with regional FA and component diffusivities than with ADC. We used identical pulse sequences within platforms and matched the pulse sequences between platforms. The effect of a less well-matched pulse sequence is not known.
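The dependence of concordance on the range of values, noted earlier in this section, can be illustrated with a small numeric sketch. The values below are invented for illustration and are not study data: the same absolute disagreement between 2 hypothetical scanners (±0.02 in every ROI) is applied to a measure that spans a wide range across ROIs and to one that spans a narrow range, by using the standard form of the Lin concordance correlation.

```python
import numpy as np

def lin_ccc(x, y):
    """Lin concordance correlation between 2 sets of paired ROI values."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    sxy = np.mean((x - mx) * (y - my))
    return float(2 * sxy / (x.var() + y.var() + (mx - my) ** 2))

n_rois = 16
# Identical absolute disagreement in both scenarios: +0.02, -0.02, +0.02, ...
disagreement = 0.02 * np.where(np.arange(n_rois) % 2 == 0, 1.0, -1.0)

# Hypothetical measure with a wide range across ROIs.
wide = np.linspace(0.15, 0.80, n_rois)
# Hypothetical measure with a narrow range across ROIs.
narrow = np.linspace(0.70, 0.80, n_rois)

ccc_wide = lin_ccc(wide, wide + disagreement)
ccc_narrow = lin_ccc(narrow, narrow + disagreement)
print(f"wide range:   {ccc_wide:.3f}")
print(f"narrow range: {ccc_narrow:.3f}")
```

With these made-up numbers, the wide-range measure yields the higher concordance even though the absolute disagreement is the same in every ROI, which is consistent with a measure having a higher CV yet stronger concordance.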
These observations regarding comparability suggest that in multicentered studies, it may be best to rank regional FA followed by component diffusivities as a higher priority outcome measure than regional ADC. Furthermore, our observations suggest that ROI analyses are sufficiently reproducible to apply in multicentered studies. This is of significance especially in MS, because subtle changes within small regions of the brain may go undetected in a whole-brain histogram analysis. Furthermore, changes in DTI values have been shown pathologically to reflect alterations in tissue integrity. 18 We have evaluated 16 ROIs in 2 healthy controls and found similar concordance in both individuals (Fig 2). Although it is unlikely that our observations would change with additional healthy controls, the cross-scanner comparability of DTI measures in individuals with MS is not known. Most important, all the measures from 16 ROIs were used to estimate Lin concordance coefficient. Due to small sample sizes, we did not perform statistical comparisons among the concordance coefficients. Partial volume averaging between fiber populations of different orientations can lead to systematic dependence of diffusion tensor properties on parameters such as voxel volume, shape, or location. 19 Although this study suggests the feasibility of a multicenter DTI trial, the concordances may vary with changes in scan parameters.

Longitudinal Scan and Software Upgrade Comparison
We included a longitudinal scan-rescan analysis, in which we found FA concordance similar to that observed between magnets. Component diffusivities and ADC values were modestly more reproducible on scan-rescan than between scanners. This difference likely reflects cross-scanner differences in the absolute measures of diffusivity, a difference that is greatly reduced when using a scalar measure such as FA. The similarity of FA concordance values between and within magnets suggests that the remaining variability is likely secondary to either biologic variability or errors in coregistration. Scan-rescan reproducibility of DTI measures with 3T scanners has been previously published and has been satisfactory for FA and ADC values (≤5%). 20,21 In our study, we have additionally determined regional values for diffusivity measures and found them to be comparable.
In the longitudinal scan-rescan component of our study, scanner upgrade led to a slight shift in FA values, which appeared to be driven predominantly by a small change in λ⊥. Despite slight changes in diffusion values, DTI scans retained good concordance following the system upgrade. This is important because it is difficult to maintain specific platform software and hardware configurations during the course of a prolonged longitudinal study. This observation suggests that longitudinal control subjects may be useful in longitudinal studies of patients with disease, particularly across system upgrades.
Our study used a high-field magnet (3T) and a high-angular-resolution pulse sequence (33 noncollinear directions), which have been found, in previous studies, to provide robust diffusion tensor estimates. 17 Because low angular resolution (eg, 6-direction DTI) decreases the reliability of tensor estimates, it is likely that concordance would be lower with such sequences. A 1.5T system will, in general, demonstrate a lower signal intensity-to-noise ratio but fewer susceptibility-related artifacts than a 3T system. The choice of field may, therefore, hinge on whether the ROI is located in areas of brain that have susceptibility artifacts, such as the medial temporal lobe. 9 A previous scan-rescan study of 8 healthy adults by using a single 1.5T magnet and a 60-direction DTI pulse sequence found 3%-6% within-patient CV in FA and MD in different organized fiber tracts. 22 When the authors re-analyzed their data by using only 12 directions, there was little change in CV, though the tracts were significantly smaller and had a higher FA and lower MD.
Another scan-rescan study on 10 healthy adults by using two 1.5T scanners, a 6-direction DTI pulse sequence, and an ROI approach found a within-scanner CV of 1.9% for FA in the corpus callosum. 23 The non-Gaussian distribution of high FA values in the corpus callosum may have contributed to the low CV. In the same study, CV increased to 4.5% when comparing between scanners. 23 With the exception of FA, our CVs across scanners within white matter are similar to those reported in these other studies. 22,23 Our higher CV for FA may relate to inclusion of a broad spectrum of FA values, whereas previous studies focused only on highly organized white matter regions, which have high FA values.
Our results are in agreement with a recent small study reporting greater homogeneity in corpus callosum FA and λ⊥ values compared with λ∥ and mean diffusivity across different centers. 11 The authors' observations were based on the comparison of 5 subjects on 2 Siemens scanners. Our experimental design expanded these observations by including the platforms of 2 different manufacturers in 5 different centers. Furthermore, in the same analysis, we have incorporated a longitudinal study and software upgrade, a situation that closely mimics real-life scenarios. We have also not limited our analysis to the corpus callosum (high anisotropic region) alone but included other areas of the brain with a diverse range of anisotropic properties. A second study, 12 which compared both scan-rescan and intercenter reliability, also validated the homogeneity of FA values. As with the previous one, this study did not compare platforms from different manufacturers, and the authors' analysis was restricted to white matter regions of the brain.
A recent study that evaluated brain atrophy measurements in patients with MS, as estimated by 5 different centers, also reported an intercenter concordance value of 0.95 for the percentage of brain volume change and 0.94 for normalized brain volume. 24 The concordance values for brain atrophy are very similar to those observed in our analysis for DTI measures. Adequate characterization of variability among scanner platforms will prove important for multicenter DTI-based trials. Given a larger number of subjects, voxel-based methods such as those used by Takao et al 25 may prove to be more powerful. However, this study shows that even with a limited dataset, it is possible to assess the reproducibility of diffusion tensor measures across a number of platforms.

Conclusions
We have observed strong correlations in DTI measures from different magnets and different manufacturers, with the strongest correlation observed with FA, followed by λ⊥ and λ∥. Despite significant pulse sequence differences between platforms, DTI values appeared to vary little between platforms. In scan-rescan comparisons on the same magnet, we have found excellent correlations in all DTI values. Our study provides strong evidence for the feasibility of multicentered DTI studies by using 3T MR imaging and a high-angular-resolution pulse sequence.