Volumetric Analysis from a Harmonized Multisite Brain MRI Study of a Single Subject with Multiple Sclerosis

BACKGROUND AND PURPOSE: MR imaging can be used to measure structural changes in the brains of individuals with multiple sclerosis and is essential for diagnosis, longitudinal monitoring, and therapy evaluation. The North American Imaging in Multiple Sclerosis Cooperative steering committee developed a uniform high-resolution 3T MR imaging protocol relevant to the quantification of cerebral lesions and atrophy and implemented it at 7 sites across the United States. To assess intersite variability in scan data, we imaged a volunteer with relapsing-remitting MS with a scan-rescan at each site.

MATERIALS AND METHODS: All imaging was acquired on Siemens scanners (4 Skyra, 2 Tim Trio, and 1 Verio). Expert segmentations were manually obtained for T1-hypointense and T2 (FLAIR) hyperintense lesions. Several automated lesion-detection and whole-brain, cortical, and deep gray matter volumetric pipelines were applied. Statistical analyses were conducted to assess variability across sites, as well as systematic biases in the volumetric measurements that were site-related.

RESULTS: Systematic biases due to site differences in expert-traced lesion measurements were significant (P < .01 for both T1 and T2 lesion volumes), with site explaining >90% of the variation (range, 13.0–16.4 mL in T1 and 15.9–20.1 mL in T2) in lesion volumes. Site also explained >80% of the variation in most automated volumetric measurements. Output measures clustered according to scanner models, with similar results from the Skyra versus the other 2 units.

CONCLUSIONS: Even in multicenter studies with consistent scanner field strength and manufacturer after protocol harmonization, systematic differences can lead to severe biases in volumetric analyses.

atrophy, a commonly used supportive outcome measure of the neurodegenerative aspects of the disease in both relapsing-remitting and progressive forms of MS. [9][10][11][12][13][14][15][16][17][18] Together, lesion and atrophy measures provide complementary quantitative information about disease progression that is considered central to patient assessment. 19 Unfortunately, differences in acquisition methods have the potential to bias MR imaging metrics. Factors such as equipment manufacturer, magnetic field strength, and acquisition protocol can affect image contrast and resultant volumetric data. Indeed, several groups have investigated the reliability of volumetric measurements across scanners, [20][21][22][23][24][25][26][27] but little is understood about the variability in volumetric measurements of lesions and atrophy in individuals with MS. Furthermore, many automated segmentation algorithms depend on statistical atlases or models that are built with healthy volunteers or that depend on registration, which can be compromised by the presence of MS pathology. 28

The North American Imaging in Multiple Sclerosis Cooperative (NAIMS) was established to accelerate the pace of imaging research. As a consortium, our first aim was to facilitate multicenter imaging studies by creating harmonized MR imaging protocols across sites. In this article, we describe initial results from our pilot study, which tested the feasibility of multisite standardization of MR imaging acquisitions for the quantification of lesion and tissue volumes. We compare inter- to intrasite scan-rescan variability in various MR imaging output metrics with consistently acquired 3T acquisitions.

Participant
A 45-year-old man with clinically stable relapsing-remitting MS and mild-to-moderate physical disability was imaged at 7 NAIMS sites across the United States (Table). He developed the first symptoms of the disease 13 years before study enrollment and had been relapse-free in the previous year after starting dimethyl fumarate. His last intravenous corticosteroid administration was 5 years previously. His timed 25-foot walk at study entry was 5.3 seconds. His Expanded Disability Status Scale score was 3.5, both at study entry and exit, without any intervening relapses on-study. The participant signed informed consent for this study, which was approved by the institutional review board of each site.

Scan Acquisition
Through consensus agreement in the Cooperative, NAIMS developed a standardized high-resolution 3T MR imaging brain scan protocol. All imaging was acquired with Siemens scanners, which, at the time of the study, were used by most NAIMS sites. Scan-rescan pairs were acquired on these scanners; the most relevant acquisition sequences are shown in the Table. At each site, the scan-rescan experiment was performed on the same day, with the participant removed and repositioned between scans. None of the participant's scans were coregistered to each other, to replicate a "real world" clinical trial setting. The volunteer was also imaged at the National Institutes of Health NAIMS site at the beginning and end of the study (5 months later) to assess disease stability. Raw MR imaging scans were distributed to 4 NAIMS sites for postprocessing.

Expert Lesion Tracing
De-identified images underwent manual quantification to assess total cerebral T1-hypointense lesion volume (T1LV) and T2LV from the native 3D FLAIR and T1 images by the consensus of trained observers (G.K., F.Y.) under the supervision of an experienced observer (S.T.). For T2LV, this process involved manually identifying all lesions on the FLAIR images. For T1LV, lesions were required to show hypointensity on T1-weighted images and at least partial hyperintensity on FLAIR images. The lesions were then segmented by 1 observer (G.K.) with a semiautomated edge-finding tool in Jim (Version 7.0; http://www.xinapse.com/home.php) to determine lesion volumes. Images were presented to the same reading panel for all of the above steps in random order in 1 batch and mixed into a stack of 50 other MS images to reduce scan-to-scan memory effects and preserve blinding.

Automated Analysis
Several fully automated pipelines were also used to estimate T2LV and the volumes of total brain, normal-appearing white matter, and both cortical and deep gray matter structures. To prevent overfitting, we used all pipelines with their default settings, according to published recommendations for each method separately, in which appropriate images were inhomogeneity corrected, rigidly aligned across sequences from each scan session, processed for removal of extracerebral voxels for all processing pipelines, and intensity normalized. For lesion measurements, several algorithms were applied by the laboratories that developed or codeveloped the various methods: Lesion-TOADS (TOpology-preserving Anatomical Segmentation; https://www.nitrc.org/projects/toads-cruise/), 29 a fuzzy C-means-based segmentation technique with topologic constraints; Automated Statistical Inference for Segmentation (OASIS), 30 a logistic-regression-based segmentation method leveraging statistical intensity normalization; Subject Specific Sparse Dictionary Learning (S3DL; https://www.nitrc.org/projects/s3dl/), 31 a patch-based dictionary learning multiclass method; and White Matter Lesion Segmentation (WMLS; https://www.nitrc.org/projects/wmls/), 32 a local support vector machine-based segmentation algorithm developed for vascular lesions that also uses corrective learning. To estimate the volume of gray matter structures, we used Lesion-TOADS; FMRIB Integrated Registration and Segmentation Tool (FSL-FIRST; http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FIRST) 33 (a Bayesian appearance method); Multi-atlas Segmentation with Brain Surface Estimation (MaCRUISE) 34 (a combined multiatlas segmentation and cortical reconstruction algorithm); and MUlti-atlas region Segmentation utilizing Ensembles of registration algorithms (MUSE) 35 (an ensemble multiatlas label-fusion method).
The FSL-FIRST 33 analysis was applied directly to the raw T1 images according to common practice, and OASIS 30 was applied to the T1, FLAIR, and a 3D T2 high-resolution sequence after preprocessing; all other pipelines were applied to appropriately preprocessed T1 and FLAIR images. Not all algorithms measured volumes of the same set of structures. Lesion-filling was not performed. Lesion-TOADS, MaCRUISE, and MUSE also yielded estimates for total brain volume.

Statistical Analysis
All statistical analyses were conducted in the R software environment (http://www.r-project.org/). 36 To compare estimated volumes within and across sites, we computed mean volumes and SDs. T tests were also used for differences in within-site averages between scanner platforms. Correlations between these averages across segmentation algorithms were also explored. The proportion of variation explained by site was computed, and the association with site was assessed with permutation testing. The coefficients of variation were also estimated across sites. To assess associations between session-average measured total brain and lesional volumes and time of day (morning versus afternoon), we used Wald testing within a linear model framework, both marginally and adjusting for scanner platform.
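The site-related quantities described above can be sketched in a few lines. The following is a minimal illustration only, not the analysis code used in this study (which was run in R): it assumes that the proportion of variation explained by site is the ratio of between-site to total sum of squares, that the permutation test shuffles volumes across sites, and that the coefficient of variation is taken over site means. The function names and the volumes in `example_t2lv` are hypothetical.

```python
import random
from statistics import mean, stdev

# Hypothetical scan-rescan lesion volumes (mL) at four sites; not study data.
example_t2lv = {
    "site_A": [18.9, 19.1],
    "site_B": [16.4, 16.7],
    "site_C": [20.0, 19.8],
    "site_D": [16.0, 16.2],
}

def proportion_explained_by_site(scans):
    """Between-site sum of squares divided by total sum of squares."""
    all_values = [v for vols in scans.values() for v in vols]
    grand = mean(all_values)
    ss_total = sum((v - grand) ** 2 for v in all_values)
    ss_between = sum(
        len(vols) * (mean(vols) - grand) ** 2 for vols in scans.values()
    )
    return ss_between / ss_total

def permutation_p_value(scans, n_perm=2000, seed=0):
    """Share of random reassignments of scans to sites that explain at
    least as much variation as the observed site labels."""
    rng = random.Random(seed)
    observed = proportion_explained_by_site(scans)
    sizes = [len(vols) for vols in scans.values()]
    pooled = [v for vols in scans.values() for v in vols]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = {}, 0
        for label, n in enumerate(sizes):
            shuffled[label] = pooled[start:start + n]
            start += n
        # Small tolerance guards against floating-point ties.
        if proportion_explained_by_site(shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def coefficient_of_variation(scans):
    """SD of the site means divided by the mean of the site means."""
    site_means = [mean(vols) for vols in scans.values()]
    return stdev(site_means) / mean(site_means)
```

Calling `proportion_explained_by_site(example_t2lv)` on data in which rescans agree closely within a site but sites differ from one another returns a value near 1, which is the pattern reported for the expert-traced lesion volumes.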

RESULTS
The participant was found to be stable regarding cerebral lesion load during the study. When we compared images acquired at the National Institutes of Health at study entry and exit, the manually measured T2LV in the participant was similar (17.9 mL in September 2015 versus 17.8 mL in February 2016). The T1LV was also stable (15.5 versus 15.1 mL). This imaging stability paralleled his clinical stability (see "Materials and Methods"). The manually estimated T1LV and T2LV for each scan are shown in Fig 1. Site explained 95% of the variation observed in the estimated T2LV and 92% of the variation in the estimated T1LV, indicating marked scanner-to-scanner differences despite protocol harmonization, which clearly exceeded scan-rescan variability within sites. The range of T2LVs was 15.9 to 20.1 mL, indicating that differences of up to 25% of the lesion volume were observed across sites. The range of T1LVs was similarly wide, ranging from 13.0 to 16.4 mL. Further inspection of these volumes across platforms indicated that Skyra (Magnetom Skyra; Siemens, Erlangen, Germany) scanners showed larger lesion volumes compared with other Siemens platforms both on T1LV (Skyra: mean T1, 15.2 mL compared with non-Skyra: mean T1, 13.8 mL; P < .05) and T2LV (Skyra: mean T2, 18.9 mL compared with non-Skyra: mean T2, 16.6 mL; P < .01). An example of the segmented lesions across scanners is provided in Fig 2.

Results from the automated techniques for delineating and measuring T2LV are shown in Fig 3. The automated lesion segmentations showed marked disagreement in the average lesional volume measurements compared with the manually assessed volumes, and all methods showed large site-to-site differences (in some cases up to 7.5 mL, or almost 50% of the manually measured lesion volume), except for Lesion-TOADS (range, 10.5-11.0 mL), which was more stable. For all methods, site explained >50% of the observed variation; 53% of the variation was explained by site (permutation P = .36) for S3DL, 54% for Lesion-TOADS (P = .41), 44% for OASIS (P = .57), and 83% for WMLS (P = .002), which clearly was most prone to site-related variation.
To measure brain structure volumes, we used several automated ...

Finally, the proportion of variation explained by site is shown in Fig 7. Note that in almost all cases, site explained >50% of the variation, with most measurement techniques showing >80% variation due to site for all structures assessed. While all images were acquired on 3T Siemens scanners, the model type appeared to influence the results; there was evidence of systematic differences in many measurements between Skyra and non-Skyra scanners. Figure 8 shows the negative log P values for the comparison of volumes averaged across scan-rescan measurements, with larger values indicating more systematic differences between platforms. The largest platform-associated differences were observed in MaCRUISE measurements of normal-appearing white matter, cortical gray matter, and, consequently, total brain volume. Lesion-TOADS also showed large differences in total brain volume attributable to cortical gray matter, as did S3DL for T2LV measurements. MUSE showed major differences in thalamic volume across scanner models, and FSL-FIRST showed similar discrepancies in the thalamus and caudate. The correlation between site-averaged measurements varied dramatically, especially for lesional and total brain volume measurements (On-line Fig 6); this variation indicates that site differences resulted in contrasting effects on output from the different algorithms. While the other measurements showed less scanner model-related variation, most still showed prominent differences between Skyra and non-Skyra scanners.
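The platform comparison summarized as a negative log P value can be illustrated as follows. This is a sketch under stated assumptions, not the study's procedure: the study used t tests, whereas this example swaps in an exact two-sided permutation test on the difference in means so that it needs only the Python standard library. The site-averaged volumes and the function name are hypothetical.

```python
from itertools import combinations
from math import log10
from statistics import mean

# Hypothetical site-averaged T2 lesion volumes (mL); not study data.
skyra_means = [18.5, 19.2, 18.8, 19.1]   # 4 Skyra sites
non_skyra_means = [16.2, 16.9, 16.7]     # 3 non-Skyra sites

def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation P value for a difference in means,
    enumerating every way of relabeling sites into the two groups."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    count = total = 0
    for idx in combinations(range(len(pooled)), len(group_a)):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            count += 1
        total += 1
    return count / total

p_value = exact_permutation_p(skyra_means, non_skyra_means)
neg_log10_p = -log10(p_value)  # larger values: stronger platform difference
```

With 4 and 3 sites per group there are only 35 possible relabelings, so the smallest attainable P value is 1/35; a t test, as used in the study, is not bounded in this way but relies on distributional assumptions.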
The time of day of scan acquisition was not associated with manually segmented T1 lesion volumes (t = 0.45) or T2 lesion volumes (t = 0.38) or total brain volume, as measured by any of the automated algorithms (On-line Figs 7 and 8).

DISCUSSION
Clinical MS therapeutic trials have traditionally used 1.5T MR imaging platforms to provide metrics on cerebral lesions and atrophy as supportive outcome measures. However, there is growing interest in the use of high-resolution 3T imaging to assess disease activity and disease severity in MS. Such 3T imaging has the potential for increased sensitivity to lesions 37,38 and atrophy, 39 higher reliability, 39,40 and closer relationships to clinical status, 38,39 compared with scanning at 1.5T. The purpose of this study was to evaluate the consistency of metrics obtained from a single MS participant with a high-resolution 3T brain MR imaging protocol distributed to 7 sites. The results of our study indicate that even in multicenter acquisitions from the same scanner vendor after careful protocol harmonization, systematic differences in images led to severe biases in volumetric analyses. These biases were present in manually and automatically measured volumes of white matter lesions, as well as in automatically measured volumes of whole-brain and gray and white matter structures. These biases were also highly dependent on scanning equipment, which resulted from a higher sensitivity to lesions in newer scanners from the same manufacturer compared with earlier models, even at the same field strengths. In comparison with past estimates of reliability of volumetric measurements of brain structures, our findings point to higher between-site variation than previously documented. In particular, Cannon et al 27 reported that between 3% and 26% of the observed variation in global and subcortical volumes were attributable to site; this was a study of 8 healthy participants imaged on 2 successive days across 8 sites with 3T Siemens and GE Healthcare scanners. However, the proportion of explained variation has a different interpretation from that reported here. 
The total variation in Cannon et al consisted of 4 contributors to variance: first, across-site differences; second, across-scan differences; third, across-day differences; and fourth, across-subject differences. In our single-participant study, we isolated only the first 2 variance components, allowing us to compare variation as it is relevant for precision medicine (subject-specific) applications.
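The difference in interpretation can be made concrete with a simple additive variance-components sketch. The notation below is an assumption introduced for illustration (independent random effects with variances denoted by sigma squared); it is not the model fitted in either study.

```latex
% Single participant, scan-rescan at each site (the design used here):
% y_{ij} is the volume from scan j at site i.
y_{ij} = \mu + s_i + \varepsilon_{ij}, \qquad
\operatorname{Var}(y_{ij}) = \sigma^2_{\text{site}} + \sigma^2_{\text{scan}}, \qquad
R^2_{\text{site}} =
  \frac{\sigma^2_{\text{site}}}{\sigma^2_{\text{site}} + \sigma^2_{\text{scan}}}

% Multiple participants imaged on multiple days (a design with all 4 components):
\tilde{R}^2_{\text{site}} =
  \frac{\sigma^2_{\text{site}}}
       {\sigma^2_{\text{site}} + \sigma^2_{\text{scan}}
        + \sigma^2_{\text{day}} + \sigma^2_{\text{subject}}}
```

Under these assumptions, the same between-site variance yields a smaller proportion in the second expression whenever between-subject or between-day variance is nonzero, so the two proportions are not directly comparable.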
Previous work indicated that the observed variation attributable to scanning occasion was small 25,27 ; indeed, Cannon et al 27 found this to constitute <1% of the variation. Thus, we did not scan our participant on subsequent days but rather simply repositioned the participant between scans during the same imaging session. A notable difference between our study and that of Cannon et al is that we did not use data from a standardized phantom concurrently acquired for correction of between-scanner variations in gradient nonlinearity and scaling. Cannon et al found that this correction improved between-site intraclass correlations and greatly reduced differences between scanner manufacturers. Similarly, Gunter et al 41 reported the usefulness of a phantom for scanner harmonization and quality control in the Alzheimer's Disease Neuroimaging Initiative (http://www.adni-info.org/). In future studies, we will focus on applying phantom calibrations across NAIMS sites to extend our current observations. Despite the growing literature on the importance of diurnal variation and hydration status for volumetric analyses, [42][43][44][45] we found no significant associations between time of day and measured volumes. This may indicate that in single-participant analyses, time of day and day-to-day variation may be of less concern than the much larger source of variation of scanner platform. Most interesting, Cannon et al also found that measurements acquired with scanners from the same manufacturer and similar receive coils had higher reliability. In our study, we found that even scanner models (ie, Skyra versus non-Skyra) from the same manufacturer varied markedly in their estimates of lesion volume; this variation highlights the importance of between-scanner differences for assessing MS-related structural changes.
To assess differences across processing pipelines, we used a variety of techniques for automated segmentation of lesion and white and gray matter volumes. Different segmentation algorithms showed a range of variability in their estimates, as well as their sensitivity to differences between scanners. For example, Lesion-TOADS showed much less variable lesion measurements  than any other technique and was not as sensitive to differences in scanner platform. Lesion-TOADS was the only unsupervised lesion-segmentation technique used. Contrast differences between the participant data and the training data of the other supervised methods could be associated with greater sensitivity to scanner differences, and this might be mitigated by specific (albeit potentially laborious) tuning to individual platforms. However, while sensitivity to biologic change is generally higher for methods yielding less noisy estimates, because only a single individual was studied here, our data cannot be taken to indicate that Lesion-TOADS is superior to other methods of estimating thalamic volume, for example. Additionally, both purely intensity-based segmentation algorithms, OASIS and WMLS, appeared to be more sensitive to site differences, which may indicate that methods that rely more on topology, shape, or spatial context may be more stable across scanners. This finding indicates that across-scanner differences may be driven by contrast differences rather than geometric distortions. Future investigation to extend these findings could involve quantitative contrast-to-noise and signal-to-noise comparisons across scanners. Allowing segmentation parameters to vary across sites could also help stability.
A limitation of this study is its single-subject and single time point design, which makes the generalizability of the findings dependent on further investigation. In particular, the degree to which across-site differences might vary by lesion burden and degree of atrophy, as well as demographic variables, requires additional study. Future larger studies of multiple participants across disease stages, including longitudinal measurements, are necessary for understanding the implications of the biases described in this pilot study. Indeed, such studies would also allow the assessment of the trade-off between stability in measures across sites, with sensitivity to biologic differences. Differences between scanning equipment and scanner software versions have also been noted in past studies of reliability, 23,25,27,46,47 but their implications for the assessment of pathology remain unclear. In particular, repeat acquisitions on scanners with different receive coils could provide additional insight concerning reliability. In addition, our study was from a single time point across scanners, whereas clinical trials rely on the quantification of intrasubject longitudinal change. 48 Each participant is typically scanned on the same platform, which may limit the variability in on-study change between participants. Further studies are necessary to assess whether scan platform introduces the same level of acquisition-related variability when assessing longitudinal changes.
Given the intersite differences observed in lesional measurements, statistical adjustment for site is clearly necessary for across-site inference when analyzing volumetrics from multisite studies, even when images are acquired with a harmonized protocol on 3T scanners produced by the same manufacturer. From a single participant, it is unclear what the role of differential sensitivity to lesions might be across individuals with heterogeneity in lesion location. For example, while lesion detection in the supratentorial white matter might be more straightforward and comparable across individuals, detection of lesions in the brain stem, cerebellum, and spinal cord may be more sensitive to differences in equipment. New statistical methods for measuring and correcting systematic biases are warranted, especially for studies in which patient populations may differ across sites. Indeed, intensity normalization and scan-effect removal techniques [49][50][51][52][53][54][55] (akin to batch-effect removal methods in genomic studies 56 ) are an active area of methodologic research and promise to improve comparability of volumetric estimates from automated segmentation methods. After volumes are measured, statistical techniques for modeling estimated volumes from multicenter studies are also rapidly evolving. 18,57 These techniques bring the potential to mitigate site-to-site biases in group-level analyses, with better external validity at the cost of increased sample size.
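As one very simple illustration of statistical adjustment for site, measured volumes from a multi-participant study could be centered on their site means. This sketch is not one of the cited methods and rests on strong assumptions: site effects are purely additive, and patient populations are comparable across sites (otherwise true biologic differences between sites are removed along with the bias). The records and the function name are hypothetical.

```python
from statistics import mean

# Hypothetical (site, volume in mL) records from several participants.
example_records = [
    ("site_A", 12.0), ("site_A", 20.0), ("site_A", 16.0),
    ("site_B", 9.5), ("site_B", 17.5), ("site_B", 13.5),
]

def center_by_site(records):
    """Subtract each site's mean volume and add back the grand mean,
    removing additive site shifts while preserving the overall level."""
    grand = mean(v for _, v in records)
    by_site = {}
    for site, volume in records:
        by_site.setdefault(site, []).append(volume)
    site_means = {site: mean(vols) for site, vols in by_site.items()}
    return [volume - site_means[site] + grand for site, volume in records]

adjusted = center_by_site(example_records)
```

After this adjustment every site has the same mean volume, while within-site differences between participants are unchanged. When populations differ across sites, a model that includes both site and relevant covariates would be needed instead.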

CONCLUSIONS
By imaging the same subject with stable relapsing-remitting MS during 5 months, we assessed scanner-related biases in volumetric measurements at 7 NAIMS centers. Despite careful protocol harmonization and the acquisition of all imaging at 3T on Siemens scanners, we found significant differences in lesion and structural volumes. These differences were especially pronounced when comparing Skyra scanners with other Siemens 3T platforms. The results from this study highlight the potential for interscanner and intersite differences that, unless properly accounted for, might confound MR imaging volumetric data from multicenter studies of brain disorders.
Our findings raise a key issue of the interpretability of MR imaging measurements in the context of personalized medicine, even in carefully controlled studies with harmonized imaging protocols.

FIG 8. Negative logarithm (base 10) P value from t tests describing the difference in average volume between Skyra and non-Skyra platforms with various segmentation methods for different structures in the brain. cGM indicates cortical gray matter; NAWM, normal-appearing white matter; TBV, total brain volume.