Feasibility of Brain Atrophy Measurement in Clinical Routine without Prior Standardization of the MRI Protocol: Results from MS-MRIUS, a Longitudinal Observational, Multicenter Real-World Outcome Study in Patients with Relapsing-Remitting MS

Brain atrophy outcomes of 590 patients were analyzed by the percentage brain volume change measured by structural image evaluation with normalization of atrophy on 2D-T1WI and 3D-T1WI and the percentage lateral ventricle volume change, measured by VIENA on 2D-T1WI and 3D-T1WI and NeuroSTREAM on T2-FLAIR examinations. The median annualized percentage brain volume change was -0.31% on 2D-T1WI and -0.38% on 3D-T1WI. The median annualized percentage lateral ventricle volume change was 0.95% on 2D-T1WI, 1.47% on 3D-T1WI, and 0.90% on T2-FLAIR. The authors conclude that brain atrophy was more readily assessed by estimating the percentage lateral ventricle volume change on T2-FLAIR compared with the percentage brain volume change or percentage lateral ventricle volume change using 2D- or 3D-T1WI.

BACKGROUND AND PURPOSE: Feasibility of brain atrophy measurement in patients with MS in clinical routine, without prior standardization of the MRI protocol, is unknown. Our aim was to investigate the feasibility of brain atrophy measurement in patients with MS in clinical routine.

MATERIALS AND METHODS: Multiple Sclerosis and Clinical Outcome and MR Imaging in the United States (MS-MRIUS) is a multicenter (33 sites), retrospective study that included patients with relapsing-remitting MS who began treatment with fingolimod. Brain MR imaging examinations previously acquired at the baseline and follow-up periods on 1.5T or 3T scanners with no prior standardization were used, to resemble a real-world situation. Brain atrophy outcomes included the percentage brain volume change measured by structural image evaluation with normalization of atrophy on 2D-T1-weighted imaging and 3D-T1WI and the percentage lateral ventricle volume change, measured by VIENA on 2D-T1WI and 3D-T1WI and NeuroSTREAM on T2-fluid-attenuated inversion recovery examinations.

RESULTS: A total of 590 patients, followed for 16 months, were included. There were 585 (99.2%) T2-FLAIR, 425 (72%) 2D-T1WI, and 166 (28.2%) 3D-T1WI longitudinal pairs of examinations available. Excluding MR imaging examinations with scanner changes, the analyses were available on 388 (65.8%) patients on T2-FLAIR for the percentage lateral ventricle volume change, 259 and 257 (43.9% and 43.6%, respectively) on 2D-T1WI for the percentage brain volume change and the percentage lateral ventricle volume change, and 110 (18.6%) on 3D-T1WI for the percentage brain volume change and percentage lateral ventricle volume change. The median annualized percentage brain volume change was −0.31% on 2D-T1WI and −0.38% on 3D-T1WI. The median annualized percentage lateral ventricle volume change was 0.95% on 2D-T1WI, 1.47% on 3D-T1WI, and 0.90% on T2-FLAIR.

CONCLUSIONS: Brain atrophy was more readily assessed by estimating the percentage lateral ventricle volume change on T2-FLAIR compared with the percentage brain volume change or percentage lateral ventricle volume change using 2D- or 3D-T1WI in this observational retrospective study. Although measurement of the percentage brain volume change on 3D-T1WI remains the criterion standard and should be encouraged in future prospective studies, T2-FLAIR–derived percentage lateral ventricle volume change may be a more feasible surrogate when historical or other practical constraints limit the availability of percentage brain volume change on 3D-T1WI.

Multiple sclerosis is an inflammatory and neurodegenerative autoimmune disease of the central nervous system characterized by demyelination and axonal degeneration. 1 Brain atrophy has become one of the most important biomarkers for assessing the extent of neurodegenerative pathology in patients with MS. 2-4 Brain atrophy in patients with MS develops 3-5 times faster than in the healthy aging population, 2,3,5,6 correlates with physical and cognitive disability from the earliest disease stages, 7,8 and continues throughout the course of the disease. 9-13 Evidence is mounting that disability progression in patients with MS is partially independent of the accumulation of active MR imaging lesions and substantially dependent on the development of brain atrophy. 14 Hence, there is increasing interest in evaluating the effect of disease-modifying therapy on slowing brain volume loss in clinical trials and, consequently, in making personalized, patient-centric treatment choices. 2,4,15,16 Therefore, there is an urgent need to translate brain atrophy measurement into clinical routine and to incorporate it into patient-level treatment monitoring. 2,4 While the assessment of brain atrophy at the individual patient level for monitoring treatment effectiveness has recently become a prominent topic in the literature, 15 there is little understanding of how the major obstacles can be overcome. The feasibility of brain atrophy measurement over the short term, midterm, and long term, 2-4,17 using MR imaging sequences available in clinical routine, is currently unknown.
Against this background, we aimed to determine whether it is feasible to measure brain atrophy in patients with MS in clinical routine without prior standardization of the MR imaging protocol, using a multicenter study design that included academic and nonacademic centers in the United States to resemble the real-world situation. The study evaluated MR imaging scanner field strength and type and the quality and characteristics of the pulse sequences, and it investigated, at the group level over the midterm, whether MR imaging changes influence the measurement of brain atrophy on the pulse sequences most commonly used in clinical routine.

MATERIALS AND METHODS
The Multiple Sclerosis and Clinical Outcome and MR Imaging in the United States (MS-MRIUS) study is a multicenter, longitudinal, retrospective, observational chart review of patients with MS treated with fingolimod (Gilenya) in routine clinical practice. The methodologic approach and design of this retrospective study have been previously reported. 18 Briefly, clinical information and digital brain MR imaging data were retrospectively collected from 33 participating academic and nonacademic MS centers across the United States and integrated into a central research database.
The inclusion criteria were the following: 1) adult patients younger than 65 years of age with relapsing-remitting MS (RRMS) who were able to walk 20 m with or without assistance at the index date (defined as the date the patient first received treatment with fingolimod), equivalent to an Expanded Disability Status Scale 19 score of ≤6.5; 2) starting fingolimod at the index date and remaining on fingolimod for at least 28 days; 3) availability of clinical data ±12 months from baseline (index date); and 4) at least 1 index and 1 follow-up (postindex) MR imaging examination, performed from 6 months before to 1 month after the index date and 9-24 months after the index date, respectively. Key exclusion criteria were the following: 1) neurologic disease other than MS affecting CNS structure or function; 2) a history of alcohol or substance abuse; 3) participation in an interventional trial during the study period; 4) prior use of fingolimod or natalizumab; and 5) steroid treatment in the 30 days before the scan dates.
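The imaging time windows in inclusion criterion 4 reduce to a simple date check. The sketch below is only an illustration, assuming the windows are counted in calendar months from the index date and are inclusive at both ends; the helper names (`add_months`, `index_scan_in_window`, `followup_scan_in_window`) are hypothetical and do not describe the study's actual data-management code.

```python
from datetime import date
import calendar


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months (negative values go back in time),
    clamping the day to the length of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def index_scan_in_window(index_date: date, scan_date: date) -> bool:
    """Index MRI: from 6 months before to 1 month after the index date (inclusive)."""
    return add_months(index_date, -6) <= scan_date <= add_months(index_date, 1)


def followup_scan_in_window(index_date: date, scan_date: date) -> bool:
    """Follow-up (postindex) MRI: 9-24 months after the index date (inclusive)."""
    return add_months(index_date, 9) <= scan_date <= add_months(index_date, 24)


if __name__ == "__main__":
    index = date(2012, 3, 15)  # hypothetical first fingolimod dose
    print(index_scan_in_window(index, date(2011, 10, 1)))     # True
    print(followup_scan_in_window(index, date(2013, 5, 20)))  # True
    print(followup_scan_in_window(index, date(2012, 10, 1)))  # False (too early)
```

Counting in calendar months rather than a fixed number of days keeps the boundaries aligned with how the criteria are worded; a fixed-day convention would shift the edges by a few days.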
A subgroup of 184 patients had preindex scans performed 9-24 months before fingolimod initiation (median, 13.9 months), but this was not a required inclusion criterion. This subgroup was used only to investigate serial longitudinal changes in brain volume (3 time points) over approximately 30 months.
All demographic and clinical data required for this study were collected from patient medical records into a study-specific electronic clinical research form. For each patient, clinical information was collected for a 48-month period, including 12-24 months of data in the pre- and postindex periods.
MR imaging examinations were required to have been performed on 1.5T or 3T scanners, and no standardization was expected. To resemble a real-world situation, patients did not need to have study examinations performed on the same scanner type and field strength. 2D- or 3D-T2 fluid-attenuated inversion recovery and 2D- or 3D-T1-weighted images were collected. The participating sites transferred digital images in the standard DICOM format. To ensure that patient privacy was protected and that relevant regulations were followed, the centralized imaging center followed guidance from DICOM PS3.15 2015b, Security and System Management Profiles, Annex E: Attribute Confidentiality Profiles (http://dicom.nema.org/medical/Dicom/2015b/output/html/part15.html#chapter_E). 20 All study scans were automatically de-identified via the on-line transfer portal. This pathway was the simplest and least burdensome for the sites, because all sites had digital transfer capability. DICOM images were automatically anonymized before transmission to the centralized imaging center via encrypted channels, and there was no "burned-in" information on the images.

All scans were inspected by an experienced rater at the centralized imaging center. We evaluated the following quality metrics: section thickness, excessive patient motion ("yes" or "no"), image contrast ("bad," "acceptable," or "good"), and overall quality ("bad," "acceptable," or "good"). The overall quality metric reflected anatomic coverage, the presence of imaging artifacts, noise level, and contrast. Examinations with excessive patient motion or bad image contrast automatically received a bad overall quality rating. Additionally, for each MR imaging examination, differences in hardware model, scanner software, and coil between index and postindex were evaluated. For each MR imaging sequence (2D- or 3D-T2-FLAIR, 2D-T1WI, and 3D-T1WI), differences in orientation, section thickness, and protocol were examined.
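The override rule for the overall quality rating can be expressed as a short function. This is a minimal sketch, assuming the three-level rating scale described above; the function name `overall_quality` and its parameters are hypothetical, and the rater's own overall judgment (coverage, artifacts, noise, contrast) is taken as an input rather than computed.

```python
from typing import Literal

Rating = Literal["bad", "acceptable", "good"]
ALLOWED = ("bad", "acceptable", "good")


def overall_quality(rater_overall: Rating, excessive_motion: bool, contrast: Rating) -> Rating:
    """Return the final overall-quality rating for one examination.

    The rater's overall judgment is kept unless the examination shows
    excessive patient motion or bad image contrast, in which case the
    overall rating is forced to "bad".
    """
    if rater_overall not in ALLOWED or contrast not in ALLOWED:
        raise ValueError("ratings must be 'bad', 'acceptable', or 'good'")
    if excessive_motion or contrast == "bad":
        return "bad"
    return rater_overall


print(overall_quality("good", excessive_motion=True, contrast="good"))         # bad
print(overall_quality("good", excessive_motion=False, contrast="acceptable"))  # good
```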
Then, overall hardware, software, coil, or protocol differences between time points were determined. Hardware change was defined as a change in the MR imaging scanner. Software change was defined as an upgrade to a different software version on the same scanner. Coil change was defined as a change of the coil. Protocol change was defined as a meaningful change in TR/TE/TI/flip angle/geometry. When hardware changes occurred, software, coil, and protocol changes were also observed in almost all instances; therefore, we refer to these collectively as MR imaging scanner changes. When software, coil, or protocol changes occurred without a hardware change, we refer to these as software/coil/protocol MR imaging changes.
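The two change categories defined above form a simple precedence rule: any hardware change dominates, and otherwise any software, coil, or protocol change puts the pair in the second category. The sketch below illustrates that rule under the stated definitions; the `PairChanges` record and `classify_pair` function are hypothetical names, not part of the study's software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairChanges:
    """Changes detected between the index and postindex examinations of one patient."""
    hardware: bool  # different MR imaging scanner
    software: bool  # software version upgrade on the same scanner
    coil: bool      # different coil
    protocol: bool  # meaningful TR/TE/TI/flip angle/geometry change


def classify_pair(changes: PairChanges) -> str:
    """Map detected changes to the categories used in the text."""
    if changes.hardware:
        # Hardware changes almost always came with software, coil, and
        # protocol changes, so they are grouped as a scanner change.
        return "MR imaging scanner change"
    if changes.software or changes.coil or changes.protocol:
        return "software/coil/protocol change"
    return "no change"


print(classify_pair(PairChanges(hardware=False, software=True, coil=False, protocol=True)))
# software/coil/protocol change
```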
Longitudinal brain atrophy outcomes included the percentage brain volume change (PBVC), measured by structural image evaluation with normalization of atrophy (SIENA) 21 on 2D-T1WI and 3D-T1WI, and the percentage lateral ventricle volume change (PLVVC), measured by VIENA 22 on 2D-T1WI and 3D-T1WI and by NeuroSTREAM 23 on T2-FLAIR. Lesions were inpainted before segmentation to reduce the impact of T1 hypointensities. 24 All brain atrophy analysis outputs were assessed by an experienced rater. Because hardware changes can affect longitudinal measurements, 25-27 SIENA PBVC and VIENA PLVVC analyses were considered invalid when a patient was imaged on different hardware. In addition, because NeuroSTREAM PLVVC was previously shown to be robust to hardware changes in a study that included 125 patients with MS and 76 healthy controls, 23 we explored the stability of this measure in patients with and without MR imaging hardware changes in the current real-world dataset.
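The validity rule above, together with the annualized rates reported in the Results, can be sketched as follows. This is a hedged illustration, not the study's procedure: the pipeline labels and function names are hypothetical, and the annualization assumes linear change over the index-to-postindex interval with a 365.25-day year, which the text does not specify.

```python
T1_PIPELINES = {"SIENA", "VIENA"}   # run on 2D- or 3D-T1WI
FLAIR_PIPELINES = {"NeuroSTREAM"}   # run on T2-FLAIR


def is_valid_measurement(pipeline: str, hardware_changed: bool) -> bool:
    """SIENA PBVC and VIENA PLVVC are treated as invalid across a hardware change.
    NeuroSTREAM PLVVC is kept so that its behavior with and without hardware
    changes can still be explored."""
    if pipeline in T1_PIPELINES:
        return not hardware_changed
    if pipeline in FLAIR_PIPELINES:
        return True
    raise ValueError(f"unknown pipeline: {pipeline}")


def annualize(percent_change: float, days_between: float) -> float:
    """Scale an index-to-postindex percentage change to a 1-year rate.
    Assumption: linear change over time and a 365.25-day year."""
    if days_between <= 0:
        raise ValueError("days_between must be positive")
    return percent_change * 365.25 / days_between


# Hypothetical example: a -0.40% PBVC over 487 days (about 16 months)
print(round(annualize(-0.40, 487), 2))  # -0.3
```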
The study adhered to the Health Insurance Portability and Accountability Act and received central and local institutional review board approvals.

Statistical Analyses
All statistical analyses were performed using SAS statistical software (SAS Institute, Cary, North Carolina), on the basis of a statistical analysis plan defined a priori. Summary statistics for continuous variables included the number of patients with valid/missing observations, mean, SD, median, interquartile range, minimum, and maximum. Summary statistics for categoric variables included frequencies and related percentages per class level. Demographic, clinical, and MR imaging characteristic differences were examined with the Student t, χ2, and Wilcoxon rank sum tests, as appropriate. MR imaging acquisition differences between academic and nonacademic centers were examined using the χ2 test. The influence of MR imaging changes during follow-up was investigated by analysis of covariance, adjusted for age, sex, days between time points during follow-up, and baseline volume, as appropriate. To investigate correlations between PLVVC on T2-FLAIR using NeuroSTREAM and PLVVC and PBVC on 2D-T1WI and 3D-T1WI using VIENA and SIENA, respectively, in patients with MS with and without software/coil/protocol MR imaging changes, we performed Spearman rank correlations. A nominal P value of ≤ .05 was considered statistically significant, using 2-tailed tests.

RESULTS
Table 1 outlines the demographic and clinical characteristics of the study subjects according to MR imaging scanner changes between index and postindex. Of the 590 patients with RRMS included in the study, 464 (78.6%) were women. Of the 33 centers participating in the study, 25 (75.8%) were nonacademic specialty and community MS centers and 8 (24.2%) were academic; 398 (67.5%) patients with RRMS were enrolled at nonacademic centers, and 192 (32.5%), at academic centers. Between index and postindex, MR imaging scanner changes occurred in 284 (48.1%) patients with RRMS. The median follow-up between index and postindex was approximately 16 months.

Demographic and Clinical Characteristics at Index and Postindex
The median age was somewhat lower in patients with RRMS with MR imaging scanner changes compared with those without (P = .007). The number of relapses in 2 years before index was significantly higher in patients with RRMS with MR imaging scanner changes compared with those without (P = .001). Patients with RRMS without MR imaging scanner changes had higher Expanded Disability Status Scale scores (P = .002). No significant differences between patients with RRMS with and without MR imaging scanner changes were detected for disease duration or previous type of disease-modifying treatment at index.

MR Imaging Acquisition Characteristics at Index and Postindex
At postindex, a higher proportion of patients was examined on 3T scanners (34.1%) than at index (29.5%, On-line Table 1). All except 3 patients at index (99.5%) and 4 patients at postindex (99.3%) had T2-FLAIR examinations, most of which were 2D acquisitions; only 9 examinations at index and 16 at postindex were 3D. The use of 2D-T1WI decreased during follow-up from 79.7% at index to 75.6% at postindex, whereas the use of 3D-T1WI increased from 31.4% at index to 39.7% at postindex. Only 16.6% of index and 20.7% of postindex examinations included both 2D- and 3D-T1WI. A section thickness of ≤5 mm was present in 40%-50% of the T2-FLAIR and 35%-40% of the 2D-T1WI examinations. More than 85% of 3D-T1WI examinations had a section thickness of ≤2 mm. There was minimal-to-no excessive patient motion for all sequences, and image contrast and overall quality of the examinations were generally acceptable or good. 3D-T1WI examinations were of higher quality than 2D-T1WI, with 80%-90% rated good compared with 60%-70% of T2-FLAIR and 40%-45% of 2D-T1WI examinations. At index and postindex, academic centers had a significantly higher proportion of 3T scanners (P < .001) and more 2D- and 3D-T1WI examinations (On-line Table 2). Section thickness for the various pulse sequences was lower in academic centers (≤5-mm section thickness for T2-FLAIR and 2D-T1WI and ≤2-mm section thickness for 3D-T1WI) (On-line Table 2). No differences in patient motion were found between academic and nonacademic centers, but image contrast and overall scan quality were better in the academic centers (On-line Table 2).

Eligibility of MR Imaging Examinations for Calculation of Brain Atrophy Outcomes
On-line Table 3 shows the eligibility of longitudinal pairs of examinations for calculation of PBVC and PLVVC between index and postindex. Among index-to-postindex pairs, hardware changes were identified in 29.5%; software changes, in 27.3%; and coil changes, in 31.9%. When pooled together, hardware/software/coil changes were identified in about 50% of the pairs, more frequently in academic than in nonacademic centers (59.9% versus 42.2%, P = .002). MR imaging protocol changes during follow-up occurred at similar frequencies for T2-FLAIR, 2D-T1WI, and 3D-T1WI, with higher rates of change in academic than in nonacademic centers.

Changes in Brain Atrophy Measures during Follow-Up
At index (data not shown), no differences were seen in brain atrophy measures for any pulse sequence when stratified according to MR imaging software/coil/protocol changes occurring during follow-up, except for the 3D-T1WI lateral ventricle volume, which was significantly higher in the group that underwent MR imaging changes (P = .05).
In a subgroup of 91 patients with RRMS with available preindex, index, and postindex MR imaging examinations, no significant differences in brain atrophy measures were found between patients with RRMS with and without software/coil/protocol MR imaging changes (On-line Table 4).
Using all patients with RRMS who had PLVVC analysis on T2-FLAIR examinations (n = 554), independent of hardware changes, we found a significant index-to-postindex difference when comparing those with and without MR imaging scanner changes (−0.31% versus 1.13%, P = .007, Table 2), but not in a subgroup of 184 patients with RRMS with available preindex, index, and postindex MR imaging examinations (On-line Table 4). When only MR imaging hardware change was considered, PLVVC differed significantly between patients imaged with hardware changes between index and postindex (n = 166, 30%) and those who were not (n = 388, 70%) (−1.23% versus 0.9%, P = .02).

Correlation among Different Brain Atrophy Measures Using Different MR Imaging Pulse Sequences
On-line Table 5 shows the correlations between PLVVC on T2-FLAIR using NeuroSTREAM and PLVVC and PBVC on 2D-T1WI and 3D-T1WI using VIENA and SIENA, respectively, in patients with MS with and without software/coil/protocol MR imaging changes. PLVVC on 3D-T1WI, 2D-T1WI, and T2-FLAIR was significantly associated with PBVC on 3D-T1WI and 2D-T1WI when all patients with MS were considered. As expected, correlations were stronger in patients with MS who were not imaged with software/coil/protocol changes than in those who were. In patients with MS imaged without MR imaging software/coil/protocol changes during follow-up, PBVC on 3D-T1WI was associated with PLVVC on 3D-T1WI (r = −0.7, P < .0001), PLVVC on T2-FLAIR (r = −0.5, P < .0001), and PLVVC on 2D-T1WI (r = −0.39, P < .0001).
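The correlations above are Spearman rank correlations, which the study computed in SAS. As a language-neutral illustration only, the sketch below computes Spearman's ρ in plain Python as the Pearson correlation of average ranks, so tied values share the mean of the ranks they span. The function names are hypothetical, no P value is computed, and the example values are invented, not study data.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Invented values: more negative PBVC paired with larger PLVVC
pbvc = [-0.8, -0.5, -0.3, -0.1, 0.1]
plvvc = [3.0, 2.1, 1.2, 0.4, -0.2]
print(spearman_rho(pbvc, plvvc))  # -1.0
```

The negative sign mirrors the direction reported in the text: greater whole-brain volume loss (more negative PBVC) goes with greater ventricular enlargement (more positive PLVVC).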

DISCUSSION
This is one of the first multicenter, retrospective, real-world studies to investigate the feasibility of brain atrophy measurement in clinical routine without prior standardization of the MR imaging protocol, using academic and nonacademic centers specialized in the treatment and monitoring of MS in the United States. The main findings of the study can be summarized as follows: 1) the quality of the MR images used for brain atrophy analyses was mostly acceptable or good in all centers; 2) about 70% of the centers used 1.5T field strength, and there was a tendency toward greater use of 3T scanners during follow-up; 3) academic centers performed MR imaging examinations with thinner sections, better contrast, and better overall scan quality; they also used 3T scanners more frequently, with >50% of postindex examinations performed at 3T, and had a higher proportion of 3D-T1WI examinations; 4) T2-FLAIR examinations were available longitudinally in >99% of patients, compared with 72% for 2D-T1WI and 28% for 3D-T1WI; 5) scanner changes occurred in >50% of the patients during follow-up, more frequently in academic than in nonacademic centers; 6) measurement of PLVVC on T2-FLAIR was feasible in 66% of patients, making it the most suitable measure of brain atrophy for clinical routine, whereas PBVC was obtained in 44% of patients on 2D-T1WI and 18% on 3D-T1WI; 7) excluding MR imaging hardware changes, there were no significant differences in brain atrophy outcomes between patients with and without MR imaging software/coil/protocol changes from index to postindex; 8) hardware changes resulted in significant PLVVC differences on T2-FLAIR, though this was not evident in the subgroup of patients with 3 serial MR imaging examinations over 30 months; and 9) PBVC changes on 2D- and 3D-T1WI in patients treated with fingolimod for 16 months were similar to those reported in pivotal and/or open-label observational studies. 28

Clinical routine MR imaging examinations pose many unique challenges. 2,16 They have a lower signal-to-noise ratio and/or spatial resolution due to trade-offs in scanning time. They also lack standardization, a problem compounded by MR imaging hardware changes and/or software upgrades. In the MS-MRIUS study, we confirmed that spatial resolution was lower than that used in MS research clinical trials; however, most of the scans were of acceptable or good quality with negligible patient motion.
The MS-MRIUS study detected important differences in the types of MR imaging scanners used between academic and nonacademic centers in the United States. On the basis of previous reports on the use of MR imaging in the referral of patients with MS to academic centers in the United States, 29 it could be expected that adherence to the Consortium of MS Centers 30 and Magnetic Resonance Imaging in MS (MAGNIMS) 31 MR imaging acquisition protocol guidelines for MS would be somewhat lower in nonacademic than in academic centers, which is what we found in the current study. However, MR imaging scanner changes during follow-up occurred even more frequently in academic than in nonacademic centers, which could be attributed to academic centers upgrading software more frequently or replacing their scanning technology more rapidly.
Image resolution and image contrast are important for reliable and optimal segmentation of brain volume. 3D pulse sequences are preferred as the criterion standard for brain volumetric imaging because, compared with 2D pulse sequences, they reduce partial volume effects and allow more accurate coregistration, especially for serial imaging over time. 2,3,16,17 Although 3D-T1WI is the recommended sequence for calculating brain volume measures, it is considered only an optional sequence in the current MR imaging acquisition guidelines. 30,31 The MS-MRIUS study showed that fewer than one-quarter of patients with MS had eligible 3D-T1WI for estimation of brain volume measures during follow-up, which substantially limits its applicability in clinical routine. On the other hand, T2-FLAIR was available in 99% of patients, allowing reliable PLVVC analysis in 66% of patients during follow-up. Therefore, measurement of PLVVC on T2-FLAIR increased the proportion of patients with reliable brain atrophy measurements by 33% compared with 2D-T1WI and by 73% compared with 3D-T1WI.
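The relative figures above can be checked with simple arithmetic on the rounded proportions (66%, 44%, and 18%). The sketch below assumes the figures were derived from these rounded values; under that assumption, 33% and 73% are reproduced when the difference is divided by the T2-FLAIR proportion, whereas dividing by each T1WI proportion gives about 50% and 267%. The variable and function names are illustrative only.

```python
# Rounded proportions of patients with a reliable measurement (from the text).
t2_flair_plvvc = 66  # % of patients, PLVVC on T2-FLAIR
t1_2d_pbvc = 44      # % of patients, PBVC on 2D-T1WI
t1_3d_pbvc = 18      # % of patients, PBVC on 3D-T1WI


def gain_relative_to(reference: float, better: float, worse: float) -> float:
    """Difference between two proportions, expressed as a percentage of `reference`."""
    return 100 * (better - worse) / reference


# Difference as a share of the T2-FLAIR proportion.
vs_2d_flair_base = round(gain_relative_to(t2_flair_plvvc, t2_flair_plvvc, t1_2d_pbvc))
vs_3d_flair_base = round(gain_relative_to(t2_flair_plvvc, t2_flair_plvvc, t1_3d_pbvc))

# Difference as a share of each T1WI proportion.
vs_2d_t1_base = round(gain_relative_to(t1_2d_pbvc, t2_flair_plvvc, t1_2d_pbvc))
vs_3d_t1_base = round(gain_relative_to(t1_3d_pbvc, t2_flair_plvvc, t1_3d_pbvc))

print(vs_2d_flair_base, vs_3d_flair_base)  # 33 73
print(vs_2d_t1_base, vs_3d_t1_base)        # 50 267
print(t2_flair_plvvc - t1_2d_pbvc, t2_flair_plvvc - t1_3d_pbvc)  # 22 48 (percentage points)
```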
Given that more than half of the patients underwent MR imaging scanner changes during follow-up, there is a strong need for a universally applicable panel of simple brain volume measures that are robust to MR imaging scanner changes. For proper estimation of brain volume changes over time, patients should be imaged on the same hardware and without software/coil/protocol changes. However, this is very difficult to achieve in clinical routine because of a number of logistic factors. 2 The MS-MRIUS study showed no significant impact of software/coil/protocol changes on the measurement of PBVC and PLVVC in clinical routine over 16 months, as evaluated using 3 different software packages on 3 different types of sequences. PLVVC is of particular interest for brain atrophy assessment in clinical routine because it is relatively robust to imprecise positioning, gradient distortions, incomplete head coverage, and motion and wraparound artifacts. 12 In addition, lateral ventricle volume changes at a 5-fold greater rate than whole-brain volume, and the effect sizes of PBVC and PLVVC for estimating disability progression from baseline to 10-year follow-up are similar. 12 A previous study described the correlations between PLVVC on T2-FLAIR using NeuroSTREAM and PLVVC and PBVC on 3D-T1WI using VIENA and SIENA, respectively. 23 The present correlation analysis showed similar associations between PBVC on 3D-T1WI and PLVVC on T2-FLAIR in patients with MS who did not experience software/coil/protocol MR imaging changes during follow-up, corroborating that PLVVC on T2-FLAIR can be used reliably in retrospective and prospective observational studies without prior standardization of the MR imaging protocol.
However, measurement of brain atrophy with PLVVC on T2-FLAIR is still suboptimal compared with PBVC on 3D-T1WI, which should prompt clinicians to use the latter more extensively in clinical practice.
In line with other recent studies, 23,25,26 the MS-MRIUS study showed that scanner changes had an impact on brain volume estimates. While NeuroSTREAM-derived PLVVC was previously shown to be relatively robust to different field strengths when imaged within a short interval (approximately 2% coefficient of variation in a 1.5T versus 3T scan-rescan test within 72 hours), 23 the current study showed that PLVVC on T2-FLAIR differed significantly between patients with RRMS with and without hardware changes. If brain atrophy measurement is to become an additional tool for monitoring individual patients with MS, the error introduced by hardware changes will have to be smaller than the annualized pathologic cutoff rate, 5 or the analyses will have to be considered invalid for those patients. Although the MS-MRIUS study was not designed to answer this important question, it provides the first evidence of the feasibility of brain atrophy measurement in clinical routine without prior standardization of the MR imaging protocol. Future studies should investigate in greater detail the influence of individual components of scanner changes on a variety of brain atrophy measures applicable to clinical routine, over the short term, midterm, and long term.

CONCLUSIONS
We showed, in this retrospective observational study, that T2-FLAIR was the most frequently acquired sequence in clinical routine. To increase the general applicability of brain atrophy measurement in observational studies in clinical routine, brain atrophy can be estimated more feasibly by assessing PLVVC on T2-FLAIR than by assessing PBVC or PLVVC on 2D- or 3D-T1WI. As the most accurate and well-established measurement of brain atrophy, PBVC on 3D-T1WI is, and should remain, the criterion standard of brain volumetric imaging research. However, T2-FLAIR-derived PLVVC may be a more feasible surrogate when historical or other practical constraints limit the availability of PBVC on 3D-T1WI.