Considerations for Mean Upper Cervical Cord Area Implementation in a Longitudinal MRI Setting: Methods, Interrater Reliability, and MRI Quality Control

BACKGROUND AND PURPOSE: Spinal cord atrophy is commonly measured from cerebral MRIs, including the upper cervical cord. However, rescan intraparticipant measures have not been investigated in healthy cohorts. This study investigated technical and rescan variability in the mean upper cervical cord area calculated from T1-weighted cerebral MRIs. MATERIALS AND METHODS: In this retrospective study, 8 healthy participants were scanned and rescanned with non-distortion- and distortion-corrected MPRAGE sequences (11–50 sessions in 6–8 months), and 50 participants were scanned once with distortion-corrected MPRAGE sequences in the Day2day daily variability study. From another real-world observational cohort, we collected non-distortion-corrected MPRAGE scans from 27 healthy participants (annually for 2–4 years) and cross-sectionally from 77 participants. Statistical analyses included coefficient of variation, smallest real difference, intraclass correlation coefficient, Bland-Altman limits of agreement, and paired t tests. RESULTS: Distortion- versus non-distortion-corrected MPRAGE-derived mean upper cervical cord areas were similar; however, a paired t test showed incomparability (t = 11.0, P < .001). Higher variability was found in the mean upper cervical cord areas calculated from an automatic segmentation method. Interrater analysis yielded incomparable measures in the same participant scans (t = 4.5, P < .001). Non-distortion-corrected mean upper cervical cord area measures were shown to be robust in real-world data (t = −1.04, P = .31). The main sources of variability were artifacts from movement, head/neck positioning, and/or metal implants. CONCLUSIONS: Technical variability in cord measures decreased using non-distortion-corrected MRIs, a semiautomatic segmentation approach, and 1 rater. Rescan variability was within ±4.4% for group mean upper cervical cord area when MR imaging quality criteria were met.

Spinal cord (SC) atrophy occurs often in neuroinflammatory diseases and can be measured using structural MRI.1-3 Several studies have identified cervical SC atrophy as predictive of disease progression and/or disability in patients with multiple sclerosis and related diseases; thus, it is of great interest to measure SC atrophy longitudinally.1,4,5 One of the most common SC atrophy measurement methods is the mean upper cervical cord area (MUCCA).6-8 MUCCA has been found to be robust at the C1-C2 and C2-C3 intervertebral levels.9,10 However, there is little consensus on how reliable it is longitudinally or whether SC atrophy measures are ready for use as a clinical trial end point and/or for patient monitoring in neuroinflammatory diseases,11,12 SC injury,13,14 or degenerative cervical myelopathy,15 among other diseases affecting the SC. Most studies comparing MR imaging-based SC atrophy measures between diseased and healthy cohorts have relied on interindividual variability calculated from cross-sectional data.16 Intraindividual variability in healthy participants (HP) is often ignored or assumed to be equivalent to the mean change (or SD) in SC atrophy measures.9 The utility of cerebral 3D MRIs that include the upper cervical cord (UCC) as a source of MUCCA calculations rests on the availability of this sequence in most patients: these scans can be used both for detection of abnormalities in the brain and for retrospective analysis of SC atrophy.8 Although some methods for reducing variability in MUCCA have been described, they require extra calculations and effort that are not yet implemented in many SC atrophy studies.17 Thus, investigations of raw MUCCA in HP are still required to inform the choice of MRI sequence and analysis methods in prospective SC atrophy studies. Technical and rescan variability in HP must be evaluated before pathologic changes in raw MUCCA can be interpreted in a longitudinal setting.
In this study, we hypothesized that the raw MUCCA will differ in HP, dependent on several factors: 1) non-distortion- versus distortion-corrected source MRIs, 2) analysis software used for MUCCA calculation, 3) interrater biases, and 4) MR imaging artifacts/positioning. The aim of our study was to identify factors that impact healthy rescan MUCCA in a longitudinal intraparticipant manner using rescan cerebral distortion- and non-distortion-corrected MPRAGE images, while also accounting for interparticipant variations using cross-sectional cerebral distortion- and non-distortion-corrected MPRAGE images.

MATERIALS AND METHODS
The local institutional review board (NeuroCure Clinical Research Center, Charité-Universitätsmedizin Berlin) approved this retrospective study, and written informed consent was obtained from all participants.

Cohort
We performed a retrospective analysis of data of HP collected from the Day2day daily variability study18 and an ongoing observational single-center cohort (EA1/163/12). Inclusion criteria were as follows: a minimum age of 18 years; no neurologic/psychiatric disorders; no contraindications for an MR imaging examination; and no incidental MR imaging findings. Participants who had undergone a full MR imaging scan protocol between January 2015 and January 2019 were eligible. Age and sex were collected for each healthy participant.

Image Postprocessing
All images were reoriented to standard orientation with fslreorient2std (FSL).19 Longitudinal/rescan images were not coregistered because the SC region is not the center of registration, and in most coregistration pipelines, brain extraction is required, which cuts off the UCC.

Image Analysis
MUCCA was measured in all postprocessed MPRAGE images using the active surface method20 in Jim software (2019; Version 7.0; http://www.xinapse.com/Manual/cord_intro.html) at the C2/C3 intervertebral space level (MUCCA-Jim): the mean of the cross-sectional areas in 5 consecutive slices of each MPRAGE image was obtained by an experienced rater (C.C.). Three center-of-cord seeds were marked by the rater: 1 at the most superior section of the C2/C3 intervertebral space, 1 after moving 2 slices down (center of the C2/C3 intervertebral space), and the last after moving 2 more slices down the cervical cord. The Jim Cord Finder tool was set to a nominal cord diameter of 10 mm, with the number of shape coefficients set to 20, a default order of longitudinal variation of 6, and the "cord is hypointense to CSF" setting unchecked (for a T1 image). The active surface model takes into account the angulation of the cord by calculating cross-sectional areas in a plane perpendicular to the local cord centerline. Unfolded cervical cord images were created from each MPRAGE image using the same 5 consecutive slices by checking the "create an unfolded image of the spinal cord" setting in the Jim 7.0 Cord Finder tool, so that an unfolded image is automatically saved during calculation of the cross-sectional areas of the cord outline. The unfolded images are straightened cord centerline images, in which the image section planes are perpendicular to the straightened cord centerline.21 The unfolded cervical cord images were used as input for automatic MUCCA segmentation (MUCCA-SCT) using the Spinal Cord Toolbox (SCT; PropSeg; 2019; Version 4.1.0; https://sourceforge.net/p/spinalcordtoolbox/wiki/correction_PropSeg/).22
The comparison of the 2 segmentation methods is valuable because the semiautomatic Jim requires roughly 2-3 minutes of manual marking and quality checks per scan after reorienting to standard Montreal Neurological Institute space, while the automatic PropSeg requires only about 3 seconds of computation time after creation of the unfolded cervical cord images. If no unfolded cervical cord images are created, PropSeg is able to segment the full cervical cord in roughly 15 seconds.
For interrater analysis, V.J. (trainee) measured MUCCA-Jim in the same MPRAGE scans from 1 healthy participant, using the same settings and methodology as described above.
Of the 308 rescan MRIs from 8 HP in the Day2day study, 37 (13%) distortion-corrected and 28 (9%) non-distortion-corrected MPRAGE scans were excluded from analyses due to motion artifacts and/or rescan differences in participant positioning in the scanner. From the ongoing observational cohort of 27 HP, we collected 66 longitudinal MRIs, of which 15 (7 HP, 23%) were excluded due to motion and/or metallic implant artifacts and/or differences in positioning in the scanner. In total, 28 HP with 600 longitudinal/rescan MPRAGE scans were included in the statistical analysis.

Statistical Analysis
For the age and sex comparisons between HP from the Day2day study and our center, Kruskal-Wallis rank sum and χ² tests were performed, respectively. Group baseline MUCCAs calculated from different MR imaging scans were compared using an ANOVA test. Effect sizes between groups were evaluated with bootstrapped (n = 5000) confidence intervals.23 Comparisons between MUCCAs derived from distortion- and non-distortion-corrected MPRAGE scans and from Jim versus SCT were performed using the coefficient of variation (CoV) asymptotic test of equality,24 smallest real difference (SRD),25 Bland-Altman limits of agreement, and paired t tests. The SRD (Equation 1) estimates a true difference in measures, accounting for the 95% confidence interval of true observed differences among measures (1.96) and the variance of a difference between 2 measures (√2), scaled by the standard error of measurement (SEM). The SEM is calculated from the SD of the measures and the intraclass correlation coefficient (ICC) (Equation 2), as shown below:

SRD = 1.96 × √2 × SEM (Equation 1)

SEM = SD × √(1 − ICC) (Equation 2)
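As a minimal sketch (ours, not the study's analysis code), Equations 1 and 2 can be implemented in a few lines of Python; the SD and ICC values in the example are hypothetical:

```python
import math

def sem_from_icc(sd: float, icc: float) -> float:
    """Equation 2: standard error of measurement, SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)

def smallest_real_difference(sd: float, icc: float) -> float:
    """Equation 1: SRD = 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2.0) * sem_from_icc(sd, icc)

# Hypothetical values: SD of repeated MUCCA = 2.0 mm^2, ICC = 0.90
sem = sem_from_icc(2.0, 0.90)              # about 0.63 mm^2
srd = smallest_real_difference(2.0, 0.90)  # about 1.75 mm^2
```

A measured change larger than the SRD can then be interpreted as exceeding measurement noise at the 95% level.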
Interrater MUCCAs were compared using the CoV, ICC, SRD, and a paired t test. ICCs were calculated with consistency from a 2-way mixed model. Because the number of longitudinal/rescan MUCCAs differed among HP, we calculated the CoV for each individual participant. All statistics and graphs were produced using R (Version 3.4.0; http://www.r-project.org),26 and statistical significance was set to P < .05.
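For illustration (our sketch, not the study's R code), the per-participant CoV and a 2-way mixed, consistency, single-measure ICC (often written ICC(3,1)) could be computed as follows; the paired MUCCA values are made up:

```python
import math

def cov_percent(values):
    """Coefficient of variation: sample SD as a percentage of the mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return 100.0 * math.sqrt(var) / mean

def icc_consistency(table):
    """ICC(3,1): 2-way mixed model, consistency, single measure.

    `table` is a list of rows; each row holds one scan's measures
    by the k raters."""
    n, k = len(table), len(table[0])
    allv = [v for row in table for v in row]
    grand = sum(allv) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(table[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for v in allv)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)

# Made-up MUCCA values (mm^2): rater 2 reads systematically 1 mm^2 lower
paired = [[72.0, 71.0], [74.5, 73.5], [70.2, 69.2], [73.1, 72.1]]
icc = icc_consistency(paired)   # 1.0 here: consistency ignores a constant offset
cov_r1 = cov_percent([row[0] for row in paired])
```

Because a consistency ICC ignores a constant rater offset, a systematic interrater bias can coexist with a seemingly acceptable ICC; the paired t test is what detects the offset.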

Cohort Demographics
Of the 28 participants with rescan MUCCAs, 21 (75%) were women, and of the 127 participants with cross-sectional MUCCAs, 99 (78%) were women. All relevant demographic data are shown in On-line Table 1.
There were significant differences between the cross-sectional HP in the sex (χ² = 21.3, P < .001) and age (χ² = 33.1, P < .001) distributions. A significant difference in the baseline mean MUCCA-Jim was observed when comparing rescan, longitudinal, and cross-sectional data from the Day2day and clinical participants (F = 14.7, P < .001) (On-line Table 1). However, little difference in the effect sizes of mean MUCCA-Jim derived from distortion- versus non-distortion-corrected MPRAGE scans was found (Fig 1).
The larger effect sizes from differences in MUCCA-Jim appeared to result from small sample sizes when the mean MUCCAs of a small cohort of 8 HP were subtracted from the mean MUCCAs of 20 HP. However, when MUCCAs from larger cohorts were compared (Day2day distortion-corrected cross-sectional versus clinical non-distortion-corrected cross-sectional scans), a much smaller effect size was seen.

Rescan MUCCA from Distortion- versus Non-Distortion-Corrected MPRAGE Sequences
Rescan MUCCA-Jims derived from the Day2day distortion- and non-distortion-corrected MPRAGE scans were compared. The CoVs in repeat MUCCAs for each participant are shown in the Table. An asymptotic test for the equality of the CoV between distortion- and non-distortion-derived MUCCA-Jim was nonsignificant (asymptotic test of equality < 0.001, P = .98). The SRD range in distortion-corrected-derived MUCCA-Jim was ±3.5 mm² (±4.8% of the mean group MUCCA), while it was ±3.2 mm² (±4.4% of the mean group MUCCA) for non-distortion-corrected-derived MUCCA-Jim. Bland-Altman limits of agreement (−2.9 to 6.0 mm²) showed a predisposition for MUCCA-Jim to have a lower value when measured from non-distortion-corrected MPRAGE scans. This result was corroborated when the rescan MUCCA-Jim was evaluated using a paired t test (t = 10.98, P < .001). Although the CoVs from most non-distortion-corrected MPRAGE-derived MUCCA-Jims were lower than those from distortion-corrected-derived MUCCA-Jims, the Table and Fig 1 show that participants with a similar number of analyzed scans had similar CoVs from both types of scans (eg, participants 1, 2, 4, 5, 7, and 8). Given that the CoVs for a similar number of analyzed scans were mostly lower in non-distortion-corrected-derived MUCCA-Jims, and that the mean follow-up scan time per healthy participant was 5.2 days, these measures appear more robust than those from distortion-corrected cerebral scans.
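The Bland-Altman limits of agreement and the paired t statistic used above can be sketched as follows (our illustration with made-up paired MUCCA values, not the study data):

```python
import math

def _mean_sd(values):
    """Mean and sample SD of a list of values."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd

def bland_altman_loa(a, b):
    """Mean bias and 95% limits of agreement for paired measures a, b."""
    diffs = [x - y for x, y in zip(a, b)]
    bias, sd = _mean_sd(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

def paired_t(a, b):
    """Paired t statistic: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    mean, sd = _mean_sd(diffs)
    return mean / (sd / math.sqrt(len(diffs)))

# Made-up paired MUCCA values (mm^2), eg, distortion- vs non-distortion-corrected
dc  = [70.0, 72.0, 71.0, 73.0]
ndc = [69.0, 70.0, 70.0, 71.0]
bias, (lo, hi) = bland_altman_loa(dc, ndc)  # bias 1.5; LoA about (0.37, 2.63)
t = paired_t(dc, ndc)                       # about 5.20
```

A nonzero bias with narrow limits of agreement, as in this toy example, mirrors the pattern reported here: a small but systematic offset that a paired t test flags as significant.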

Rescan MUCCA from Semiautomatic versus Automatic Segmentation
Variability in non-distortion-corrected MPRAGE-derived MUCCA was tested using the semiautomatic Jim (MUCCA-Jim) and the fully automatic PropSeg (MUCCA-SCT) segmentation methods. The CoVs in MUCCA for each participant from the Day2day cohort are shown in On-line Table 2.
An asymptotic test for the equality of the CoV between MUCCA-Jim and MUCCA-SCT was significant (asymptotic test of equality = 130.1, P < .001). The SRD range for MUCCA-Jim was ±3.2 mm² (±4.4% of the mean group MUCCA-Jim), while it was ±9.5 mm² (±14.5% of the mean group MUCCA-SCT) for MUCCA-SCT. Bland-Altman limits of agreement (−2.3 to 16.0 mm²) showed a predisposition for MUCCA-Jim to have a higher value than MUCCA-SCT, which was also evident when evaluating the rescan MUCCA-Jim and MUCCA-SCT with a paired t test (t = 24.42, P < .001). Because PropSeg was optimized on MR imaging acquired with a spine coil rather than the head-neck coil used in this study, differences in the CSF-to-SC contrast-to-noise ratio may cause this segmentation method to underestimate the cord area.27 On-line Figure 2 shows the Bland-Altman and paired t test results comparing MUCCA-Jim and MUCCA-SCT, as well as sample segmentations by Jim and PropSeg.

Single- versus Multirater Rescan MUCCA-Jim
To evaluate the robustness of using non-distortion-corrected MPRAGE scans and semiautomatic software to measure the MUCCA, we performed an interrater analysis for the same participant with 50 weekly scans from the Day2day study. Rater 1 (C.C.) has several years of experience with MUCCA segmentation, while rater 2 (V.J.) had never segmented MUCCA before this study. The CoVs in rescan MUCCA-Jim for the same HP from rater 1 and rater 2 were 1.38% and 1.64%, respectively. An asymptotic test for the equality of the CoV was nonsignificant (asymptotic test of equality = 1.5, P = .23). The ICC was 0.74, and the SRD range in MUCCA-Jim from the 2 raters was ±2.8 mm² (±3.8% of the mean group MUCCA-Jim). The rescan MUCCA-Jim compared using a paired t test still showed a significant difference (t = 4.54, P < .001). On-line Figure 3 illustrates the paired t test comparison of rescan MUCCA-Jim, from the same participant, measured by 2 raters.

Real-World Data Confirmation
To validate the findings from our analysis of the Day2day study cohort, we compared, cross-sectionally, MUCCA-Jim derived from distortion-corrected MPRAGE scans in 50 Day2day HP with MUCCA-Jim derived from non-distortion-corrected MPRAGE scans in 77 observational HP. Our finding of a slightly higher CoV in the MUCCA-Jim from this cohort suggests that a "normal range" of MUCCA may be larger when using distortion-corrected scans for SC atrophy evaluation. We also tested rescan MPRAGE MUCCA-Jim from the observational cohort to evaluate the feasibility of using the raw MUCCA in a longitudinal, real-world setting. The CoVs in rescan MUCCA-Jim for each participant are shown in On-line Table 3.

Sources of Exclusion from Analysis (MPRAGE Quality Control)
On inspection of large variations in consecutive MUCCA-Jim from the Day2day and observational cohorts, several artifacts were identified as yielding rescan MUCCAs inadequate for comparison. Participants with highly variable MUCCAs most often had ≥1 of the following MR imaging qualities: 1) motion artifacts seen in the brain or neck region, 2) head and neck positions that differed between sessions, and/or 3) metal implants in the mouth causing artifacts (eg, retainer, braces, and so forth). Head and neck positioning differences included head tilt angle and different levels of the UCC included in the scan FOV. All of these qualities can be observed in postprocessed MPRAGE images, as shown in Fig 4.

DISCUSSION
We investigated rescan MUCCA variability in HP on the basis of the following: 1) scan settings for source MRIs, 2) segmentation software, 3) interrater bias, and 4) imaging artifacts. Rescan MUCCA was found to be less variable when measured from 3T, non-distortion-corrected cerebral 3D MPRAGE scans. We also showed that using a semiautomatic segmentation approach (Jim 7.0 Cord Finder tool) gave a more robust MUCCA than fully automatic segmentation (SCT PropSeg) in the exact same MR imaging slices. Interrater bias evaluation revealed that MUCCA from a single rater is more consistent and that a systematic bias could occur if measured by different raters. Finally, several sources of highly variable rescan MUCCAs were found through quality control analysis of postprocessed MPRAGE scans. These included imaging artifacts in the brain/mouth/neck regions and, most noticeably, participants positioned differently (ie, different head tilt, different levels of vertebrae in the FOV) and/or with metal implants in the mouth.
Few studies report whether MR imaging sequences are distortion- or non-distortion-corrected using a phantom. MUCCA has been shown to be more robust when normalized with a factor calculated from a cylindric phantom that represents the UCC.17 However, we found that raw MUCCA derived from distortion-corrected MPRAGE scans yielded more variable (higher CoVs and SDs) within-participant rescan measures than those derived from non-distortion-corrected MPRAGE scans. This finding suggests that without an SC-specific phantom for distortion correction, non-distortion-corrected cerebral scans yield less variable and more robust raw MUCCAs for longitudinal SC atrophy evaluation. In retrospective studies, we would recommend using non-distortion-corrected cerebral MPRAGE for higher longitudinal reproducibility, as seen in the results of our real-world clinical data. Furthermore, to reduce variability in repeat MUCCAs as much as possible when using cerebral MPRAGE scans, we propose a strict quality control step before inclusion in a study, which involves the removal of participants with any metallic implants and of scans with any brain-movement artifacts. It would also be beneficial to standardize the positioning of patients in the scanner to achieve cervical cord placement that is as straight as possible and to include a minimum number of vertebral levels.
We compared 2 common SC segmentation tools used to calculate the MUCCA: the semiautomatic Jim Cord Finder and the fully automatic SCT PropSeg. Jim uses an active surface method,20 and SCT uses an iterative propagation of a deformable adaptive contrast method.22 One study found good correlation between Jim and PropSeg using a normalization factor,28 while another study found a systematic difference in the 2 outputs.27 We confirmed this systematic bias in raw MUCCA, in which PropSeg yielded significantly smaller and more variable segmentations than Jim, with higher intraparticipant variability and low consistency in paired measures. Because we segmented MUCCAs using the exact same MR imaging slices for this analysis, our study shows that raw MUCCAs measured by these 2 methods are not comparable. SCT has another functionality, namely DeepSeg (https://github.com/neuropoly/spinalcordtoolbox), which is based on a convolutional neural network algorithm that segments the spinal cord and can also segment intramedullary multiple sclerosis lesions.29 This method was shown to better segment MR imaging scans, including the brain, than PropSeg; however, raw MUCCAs were still significantly different between DeepSeg and Jim.7 Thus, when one compares results from different studies using different segmentation methods, absolute MUCCA cannot be used; rather, we propose the evaluation of effect sizes from the change in the MUCCA from each study.
Many studies have used Jim for MUCCA segmentation, but they had either 1 experienced rater for multiple centers 30 or no raters were mentioned. 31 We found only a moderate intraclass correlation between MUCCAs segmented by 2 raters. The less experienced rater also had higher variability in the MUCCA, as evidenced by large differences in paired measures. These results highlight the need for disclosure of raters in MUCCA studies, standardized training of individuals performing SC segmentation, and/or using 1 rater for longitudinal studies.
We validated our findings in real-world clinical data. For both distortion- and non-distortion-corrected MPRAGE-derived MUCCA, the stability of the rescan MUCCA corresponded to a smallest real difference of roughly ±4% change. This is in line with a recent study that found high reproducibility of the MUCCA within 1 scanner with 1 measurement method, but no comparability among scanners or methods.27 One large metadata study found that, on average, patients with multiple sclerosis lose 1.78%/year of their cervical SC area.16 We need not assume that SC atrophy during 1 year in HP would be zero; however, a large study with >1200 asymptomatic participants spanning 20-80 years of age found that, in both men and women, the C2/C3 intervertebral level cross-sectional area decreased by a minimum of about 1% and a maximum of roughly 4.5% across several decades.32 Thus, we can be fairly certain that during the course of a 1-year follow-up, normal SC atrophy would, most likely, not surpass 1.5%. Our study shows that annual MUCCA may not give enough signal over noise; thus, we suggest that MUCCA be monitored for >24 months to assess SC atrophy.
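As a back-of-the-envelope check (our arithmetic, combining the ±4.4% group SRD from our data with the 1.78%/year mean loss reported for multiple sclerosis16), one can estimate how long expected atrophy takes to exceed the rescan noise floor:

```python
# Rescan noise floor: smallest real difference, as % of group MUCCA (our data)
srd_percent = 4.4
# Mean annualized cervical SC area loss in multiple sclerosis (reference 16)
ms_loss_per_year = 1.78

# Years of average MS atrophy needed to exceed rescan measurement noise
years_to_exceed_noise = srd_percent / ms_loss_per_year
print(round(years_to_exceed_noise, 1))  # 2.5
```

An individual patient losing cord area at the average MS rate would thus need roughly 2.5 years before the change exceeds the SRD, consistent with monitoring MUCCA for more than 24 months.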
The limitations of our study include the small sample size of HP scanned longitudinally. Because our study was performed with retrospective data, we could not evaluate dedicated SC sequences, which may give more robust SC atrophy measurements using fast, automated methods such as PropSeg or DeepSeg. A future direction would be to collect a large amount of healthy participant SC MR imaging for further longitudinal testing. It would be ideal to scan many HP with a wide age range to truly validate whether there is an effect of age or sex on the MUCCA. One study did find an effect of age, sex, height, and weight on the lower cervical cord volume of HP, however with very small effect sizes for their models.33 We were also not able to investigate the time of day of MR imaging acquisition or the hydration levels of HP, which may lead to further fluctuations in the MUCCA. These factors have been found to affect brain morphometric measures34 and SC area.35 However, given the robustness of our real-world longitudinal MUCCA analysis, we postulate that these factors play a smaller role in MUCCA variability.

CONCLUSIONS
Our study illustrates technical and rescan factors important in longitudinal SC atrophy studies using MUCCA: the type and distortion correction of the source MR imaging, the segmentation method, rater training, and MR imaging quality control should all be addressed because each of these factors can increase rescan variability in the MUCCA. With these considerations, the MUCCA has the potential to be used in a longitudinal setting to evaluate and track SC atrophy.