An Automated Statistical Technique for Counting Distinct Multiple Sclerosis Lesions

BACKGROUND AND PURPOSE: Lesion load is a common biomarker in multiple sclerosis, yet it has historically shown modest association with clinical outcome. Lesion count, which encapsulates the natural history of lesion formation and is thought to provide complementary information, is difficult to assess in patients with confluent (ie, spatially overlapping) lesions. We introduce a statistical technique for cross-sectionally counting pathologically distinct lesions. MATERIALS AND METHODS: MR imaging was used to assess the probability of a lesion at each location. The texture of this map was quantified using a novel technique, and clusters resembling the center of a lesion were counted. Validity compared with a criterion standard count was demonstrated in 60 subjects observed longitudinally, and reliability was determined using 14 scans of a clinically stable subject acquired at 7 sites. RESULTS: The proposed count and the criterion standard count were highly correlated (r = 0.97, P < .001) and not significantly different (t59 = −.83, P = .41), and the variability of the proposed count across repeat scans was equivalent to that of lesion load. After accounting for lesion load and age, lesion count was negatively associated (t58 = −2.73, P < .01) with the Expanded Disability Status Scale. Average lesion size had a higher association with the Expanded Disability Status Scale (r = 0.35, P < .01) than lesion load (r = 0.10, P = .44) or lesion count (r = −.12, P = .36) alone. CONCLUSIONS: This study introduces a novel technique for counting pathologically distinct lesions using cross-sectional data and demonstrates its ability to recover obscured longitudinal information. The proposed count allows more accurate estimation of lesion size, which correlated more closely with disability scores than either lesion load or lesion count alone.

M ultiple sclerosis is a neuroinflammatory disorder characterized by demyelinating lesions that occur in the central nervous system. MR imaging is the most commonly used method to observe these lesions, especially in the white matter of the brain. 1 The presence of new lesions on MR imaging is often considered an important clinical marker of disease activity, yet MR imagingbased measures of disease severity have been elusive. 2 The total lesion burden in the white matter or "lesion load"-measured as volume or volume fraction of brain size-is often used in the study of MS, typically as a measure of disease severity 3 and as a clinical trial outcome. 4 However, lesion load has consistently shown a surprisingly weak association with clinical measures of disease severity, calling into question its usefulness as a surrogate and reinforcing the need for further development of MR imaging outcomes for MS. 2,5 In past years, several clinical studies have discussed the number of lesions in a patient's brain as a possible outcome of interest. [6][7][8] In these studies, baseline lesion count has been shown to be correlated with the Expanded Disability Status Scale (EDSS) and changes in lesion count have been shown to be correlated with changes in the EDSS. However, obtaining an accurate count of biologically distinct lesions in the brain can be costly and logistically challenging, typically requiring expert review or automated analysis of scans obtained at frequent follow-up visits. This process is especially difficult in patients with a high lesion load and many confluent lesions. 9 Confluent lesions commonly occur when pathologically distinct lesions (ie, lesions that arise due to spatially separate sources of structural damage in the brain, usually separated in time) occur near each other and create a larger connected region of lesion tissue. Depending on the level of lesion burden, confluent lesions can range from 2 overlapping lesions with a single connecting edge to dozens of connected lesions spanning large stretches of white matter. The existence of such confluent tissue can make it difficult or impossible to estimate the number of distinct lesions in the brain at any given visit. A patient must be scanned regularly, with temporality of appearance serving to separate spatially confluent lesions, to obtain accurate lesion counts. However, MR imaging scans are extremely costly, which can make regular follow-up visits infeasible. Additionally, in patients with a great deal of disease activity, even monthly or bimonthly scans can produce multiple new lesions that overlap in space. 10,11 These considerations render lesion counts unavailable or inaccurate in most clinical settings in which patients are typically scanned yearly or twice yearly.
To address this issue, the current study introduces a statistical analysis technique for obtaining valid and reliable estimates of lesion counts from a single cross-sectional MR imaging study. This fully automated method uses cutting-edge statistical models for segmenting lesion tissue and well-demonstrated mathematic methods for quantifying texture to obtain the number and location of temporally distinct white matter lesions. Additionally, this study provides evidence that the derived lesion counts are associated with clinical measures of disease severity, independent of total lesion volume.

Proposed Lesion-Count Algorithm
To obtain the lesion-count estimate in a given subject, we performed the following steps: First, a map of lesion probability at each voxel in the brain was obtained using preprocessed and coregistered MR imaging volumes from a single visit. Depending on the automated segmentation method used, a combination of T1-weighted, fluid-attenuated inversion recovery, T2weighted, and proton density volumes would be required for probability estimation. A threshold was then applied to the prob-ability map to create a binary mask of regions considered lesion tissue.
With the probability map, the texture of the lesion tissue was quantified to find regions that exhibited the properties expected of the center of a single lesion. Texture was quantified using the eigenvalues of the Hessian matrix. The Hessian matrix was calculated for the intensity of the lesion probability map at every voxel in the lesion mask, with a gradient window of 1 voxel in each direction. In the context of a 3D image, the Hessian matrix describes the second-order variation in image intensity in the local neighborhood around a voxel. When applied to a lesion probability map, the eigenvalues of the Hessian matrix at each voxel represent the 3 primary directions of change in lesion probability at that voxel.
Thus, voxels in the center of a lesion would be expected to have a negative eigenvalue, implying a decrease in probability, in all directions. This follows from the commonly accepted pathology of MS lesions, in which initial damage to a vein causes residual inflammation to spread outward from the vein in a relatively ovoid fashion, with less damage occurring around the periphery of the visible lesion. 12 Therefore, voxels are eliminated if any of the 3 eigenvalues are positive; the elimination indicates that the voxel is less likely to be a lesion than its surroundings in at least 1 direction. The remaining voxels with 3 negative eigenvalues are clustered by location, and connected clusters (operationalized as the centers of distinct lesions) are counted. Figure 1 provides an example of this technique.

Data and Preprocessing
Validation and Clinical-Radiologic Association. Sixty subjects diagnosed with MS were scanned between 2000 and 2008 monthly for Յ5.5 years (mean, 2.2 Ϯ 1.2 years) as part of a natural history study at the National Institute of Neurological Disorders and Stroke in Bethesda, Maryland. The subjects ranged from 18 to 60 years of age, with a mean age of 38 Ϯ 9 years. Of the 60 subjects, 38 were women and 22 were men. Most subjects (n ϭ 44) were diagnosed with relapsing-remitting MS; 13 had secondary-progressive MS; 1 had primary-progressive MS; and 2 were unspecified. Each subject was either untreated or treated with a variety of disease-modifying therapies during the observation period, including both FDA-approved (various preparations of interferon-␤) and experimental therapies.
Details of the image acquisition and preprocessing have been previously published 13 and are briefly summarized in this section. Whole-brain 2D FLAIR, proton density, T2, and 3D T1-weighted volumes were acquired on a 1.5T MR imaging scanner (Signa Excite HDxt; GE Healthcare, Milwaukee, Wisconsin). The 2D FLAIR, proton density, and T2 volumes were acquired using fastspin-echo sequences, and the 3D T1 volume was acquired using a gradient-echo sequence. All scanning parameters were clinically optimized for each acquired image. Each subject was scanned over multiple visits, and subjects' images at each visit were rigidly coregistered longitudinally and across sequences to a template space. 14 All images were N4 bias-corrected, and FLAIR, T2, and proton density volumes for each subject were interpolated and rigidly coregistered to the T1 volume in isotropic 1-mm 3 space. 15 Extra-cerebral voxels were removed using the T1 volume via a skullstripping procedure, 16 and intensity normalization 17 of the volumes based on z scoring was applied. Studies were manually quality-controlled by a researcher with Ͼ5 years' experience with structural MR imaging, and studies with analysis-limiting motion or other artifacts were removed. Following preprocessing and quality control, automatic lesion segmentation was performed on coregistered T1, T2, FLAIR, and proton density volumes using the Automated Statistical Inference for Segmentation (OASIS) model 18 to produce a lesion probability map for each subject. A conservative threshold of 30% was applied to the probability maps to create binary lesion masks.
Reliability. To test reliability, also referred to as repeatability, we analyzed data from a 45-year-old man diagnosed with clinically stable relapsing-remitting MS. This patient was imaged at 7 sites in the United States as part of a pilot study for the North American Imaging in Multiple Sclerosis (NAIMS) Cooperative. He was characterized as having mild-to-moderate physical disability, which was stable between the first and last visits, and had no clinical relapses or radiologic changes during the study. 19 Details of the image acquisition have been previously published 19 and are briefly summarized in this section. Whole-brain 3D high-resolution FLAIR, T2, and T1-weighted volumes were acquired on 7 3T MR imaging scanners across the United States (4 Skyras, 2 Tim Trios, 1 Verio; Siemens, Erlangen, Germany). A standardized high-resolution scanning protocol was developed through a consensus agreement in the NAIMS Cooperative and was used to the extent possible (allowing for different scanner types and software versions) for each scan. The participant was scanned twice on the same day at each site and was removed and repositioned between scan and rescan.
All images were N4 bias-corrected, and the subject's images at each scan were rigidly coregistered across sequences to the T1 volume in isotropic 1-mm 3 space. 15 Extracerebral voxels were removed using the T1 volume via a skull-stripping procedure, 20 and intensity normalization 17 of the volumes based on z scoring was applied. Following preprocessing, automatic lesion segmentation was performed on coregistered T1, T2, and FLAIR volumes using an extension of the OASIS model 21 to produce a lesion-probability map for each scan session. A conservative threshold of 30% was applied to the probability maps to create binary lesion masks.

Statistical Analysis
Validation. Using the longitudinal nature of the data, we developed a criterion standard count of lesions that appeared during the study for validation. A state-of-the-art technique for segmenting new lesions since a previous visit 22 was applied at each visit after baseline, resulting in the number and location of new lesions at each visit for every patient. For the criterion standard count, segmented regions containing lesions separated in space or time were considered distinct. For example, if a large contiguous region at the end of a study consisted of 1 lesion that appeared at the sixth visit and 1 lesion that appeared at the eighth visit, these would be considered 2 lesions in the criterion standard count.
The criterion standard count, henceforth referred to as C G , was compared with 2 counts obtained cross-sectionally at the final observation for each patient. The first, C P , is the count based on the technique proposed in this study. C P was obtained by applying the algorithm described in the "Proposed Lesion-Count Algorithm" section to the images obtained at each patient's final visit, then restricting the count to the number of lesion centers contained in the lesion voxels determined to have appeared during the study. Most important, this restriction means that C P represents a subset of the total number of lesions in a subject's scan and is distinct from the full lesion count later described in the context of the clinical-radiologic analysis. This limitation was implemented to make direct comparison between C P and C G possible because a criterion standard count can only be obtained for lesions that appeared during the study.
The second cross-sectional count, C C , refers to a count based on the standard connected-components technique. C C was obtained by performing lesion segmentation on the images obtained at each patient's final visit, thresholding at a probability of 30%, and labeling lesions as distinct if they were separated in space. C C was then restricted to the number of unique lesion labels contained in the lesion voxels known to have appeared during the study, to facilitate comparison with C P and C G .
Comparison among C G , C C , and C P occurred in 2 ways: First, to compare the linear correspondence between the criterion standard and the different counting techniques, we compared the correlation between C G and C P with that of C G and C C . Then, to determine whether the counts themselves differed meaningfully from the criterion standard, paired t tests were run for C G and C P , as well as C G and C C .
Reliability. Determination of the reliability of the proposed counting method was based on the coefficient of variation (CV) of the counts obtained from the 14 repeat scans. Because the typical connected-components technique for counting automatically or manually segmented lesions yields a stable-but-invalid estimate of the true count, there is no current criterion standard CV for a lesion count. Thus, the CV of the proposed count was compared with a commonly used outcome measure for MS: total cerebral lesion volume ("lesion load").
This comparison took place in 2 contexts. The first represented a fully automated version of the proposed count, in which variation may arise from false-negatives in the segmentation mask, false-positives in the segmentation mask, thresholding of the segmentation mask, and changes in the Hessian structure of the segmentation mask. This coefficient was compared with the CV of the automated lesion load, as determined by the segmentation method.
The second context represented a manually supplemented version of the count, in which a mask of lesion tissue was provided by an expert rater 19 and the count was obtained using the segmentation probability map within the manual lesion tissue mask. In this case, variation in the count arises solely due to changes in the Hessian structure of the segmentation mask and changes in the manual segmentation. This coefficient was compared with the CV of the manually obtained lesion load.
Clinical-Radiologic Association. Because the Expanded Disability Status Scale score is known to be noisy, a more stable measure of neurologic disability was created by averaging the EDSS scores over all visits for each subject in the National Institute of Neurological Disorders and Stroke longitudinal study, 13 hereby referred to as the EDSS avg . One subject had no EDSS information across all follow-ups and was excluded from this analysis. Using the OASIS lesion probability maps, 18 we obtained the lesion load at the final visit for each subject using a probability threshold of 30%. Then, using the lesion-count technique described in the "Proposed Lesion-Count Algorithm" section, we obtained a full count of white matter lesions at the final visit for each subject. Most important, the counts obtained for the clinical-radiologic analysis are distinct from the C P measure described in the "Validation" section because these counts represent the application of the proposed method to the entire brain, while C P represents the application of the proposed method to only lesion tissue that appeared during the longitudinal study.
To determine the clinical relevance of the proposed lesion count independent of other potentially confounding variables, we created a linear regression model for EDSS avg , with age, lesion load, and lesion count as predictors. The added statistical contribution of the lesion count was quantified using a Wald test, which is inferentially identical to a likelihood ratio test in this context, and its added clinical contribution was quantified by the increase in the adjusted R 2 of the model. In this context, R 2 gives the amount of variation in EDSS avg explained by the model. Additionally, the Pearson correlations with EDSS avg were calculated for lesion load and lesion count and a new variable we refer to as "average lesion size" (defined as lesion load divided by lesion count).

Validation
The temporally informed criterion standard count of new lesions appearing during the study, C G , ranged from 0 to 75 among the 60 subjects, with a median of 4 (interquartile range, 1-12). The connected-components count, C C , ranged from 0 to 14 with a median of 2 (interquartile range, 1-5). The proposed count, C P , ranged from 0 to 60 with a median of 4 (interquartile range, 1-15). Figure  2 provides an example of these counting techniques.
The correlation between C P and C G was 0.97, compared with the correlation of 0.67 between C C and C G . Figure 3 shows the scatterplots for the 2 linear associations, along with the line demonstrating a 1-to-1 relationship. The paired t test comparing C C and C G yielded a highly significant result (t 59 ϭ 4.19, P Ͻ .001), with C G being 6.9 lesions larger than C C on average (95% CI, 3.6 -10.2). The paired t test comparing C P and C G did not find a significant difference between the counts (t 59 ϭ Ϫ.83, P ϭ .41), with C P being 0.4 lesions larger than C G on average (95% CI, Ϫ1.3-0.5).

Reliability
For the fully automated count, the coefficient of variation was 0.19, compared with a CV of 0.22 for the automated lesion load.
Using the manual segmentation as a mask, we reduced the CV for the lesion count to 0.12, compared with a CV of 0.10 for the manual lesion load. In 1 case, automated lesion segmentation was discovered to have failed; this failure created a probability map with a drastically different Hessian structure and large regions of false-positive segmentation. With this scan removed, the CV of the fully automated lesion count remained at 0.19 and the CV of the manual segmentation-based lesion count dropped to Ͻ0.06, suggesting that the proposed count has equivalent or lower variability than the current clinical standard of lesion load.

Clinical-Radiologic Association
If we accounted for lesion load and age, the proposed lesion count was negatively associated with EDSS avg (t 58 ϭ Ϫ2.73, P Ͻ .01); this finding suggests that for a given lesion load and age, a higher count is associated with lower disease severity. The inclusion of lesion count in the model explains an additional 10% of the variance in EDSS avg compared with a model with only age and lesion load, providing support to the hypothesis that the proposed count contains disease information independent of other commonly used measures.
The Pearson correlation between lesion load and EDSS avg was small and did not reach significance (r ϭ 0.10, P ϭ .44); the same was true of the correlation between lesion count and EDSS avg (r ϭ Ϫ.12, P ϭ .36). However, average lesion size was significantly correlated with EDSS avg (r ϭ 0.35, P Ͻ .01); this correlation indicated that larger lesions were associated with higher disability.

DISCUSSION
In this article, we introduce a novel technique for obtaining crosssectional counts of pathologically distinct lesions and demonstrate it to be a valid, reliable, and clinically meaningful biomarker for MS disease status. Using information contained in the Hessian structure of lesion probability maps produced by automated segmentation methods, this technique counts distinct lesions by identifying regions that resemble the physiologic traits of distinct lesion centers.
The validity of this measure was established by comparing counts obtained at a single time point with criterion standard counts that incorporated temporal information on lesion development. The proposed count had a correlation of 0.97 with the criterion standard count, indicating the very strong validity of this measure. A count obtained using the connected-components method had only a 0.67 correlation with the criterion standard and appeared to strongly underestimate the number of lesions in individuals who developed Ͼ1 or 2 lesions per year during the study. This underestimation manifested in a highly significant difference between the connected-components counts and the criterion standard counts in a paired t test, whereas no difference was found between the proposed counts and the criterion standard counts. These findings demonstrate that the proposed technique yields a count consistent with the natural history of lesion formation.
Reliability was considered using a rich set of data from the NAIMS Cooperative. 19,23 In the NAIMS pilot study, 24 a clinically and radiologically stable subject was scanned 2 times at each of 7 different sites across the United States. The lesion count was obtained for all 14 scans of this subject, and the coefficient of variation of the counts was compared with that of lesion load in 2 contexts, to judge the reliability of the proposed measure. In the fully automated comparison, lesion count had a slightly lower CV than lesion load. This finding indicates that across repeat scans of the same brain, automated lesion count is a less variable measure than automated lesion load. In the manually supplemented comparison, lesion count had a slightly higher CV than lesion load, implying that manually obtained lesion load is a slightly less variable measure than semiautomated lesion count. On inspection, there appeared to be 1 scan in which automated lesion segmentation failed, producing an abnormal Hessian structure within the manually segmented lesion mask. With this scan removed, the CV of the semiautomated lesion count dropped to slightly more than half that of the manual lesion load. This result suggests that when automated lesion-segmentation methods perform as expected, the semiautomated lesion count is appreciably more reliable than the manual lesion load, a widely used measure of disease severity.
Clinically, the lesion-count measure appears to be a potentially important addition to commonly used radiologic biomarkers for MS. In a model accounting for lesion load and age, lesion count was highly significantly associated with EDSS. Most interesting, this association was negative, indicating that for subjects who have similar lesion loads, better outcomes are associated with more (and smaller) lesions rather than fewer (and larger) lesions. This finding lends support to the idea that neither the number of lesions nor the amount of tissue damage alone captures all relevant clinical information and instead suggests that they should be considered together. One way to conceptualize the combination of these metrics is average lesion size, which taps into the degree to which the brain can halt the growth of lesions and encourage lesional recovery 13,25,26 after incidence.
To investigate this concept more directly, we created a measure of average lesion size by dividing lesion load by lesion count. Pearson correlations with EDSS were then compared for the 3 biomarkers: lesion load, lesion count, and average lesion size. These findings provided further support for the combined importance of lesion load and lesion count, with both showing small and nonsignificant associations with EDSS. However, average lesion size showed a significant positive association with EDSS, consistent with the notion that the ability of the brain to slow or stop lesion growth is clinically relevant. These findings point to the importance of considering lesion count in MS research and provide further evidence of the validity of the proposed counting technique.
A limitation of the current study is the possibility of alternate explanations of confluence that are not accounted for in the design of the proposed count. It has been hypothesized that confluent lesions may occasionally occur as a result of the growth of older lesions or the expansion of pathologic processes. Future research should consider the degree to which this technique does or does not characterize these types of confluence as pathologically distinct lesions. Additionally, the current analyses do not account for the possibility of vascular comorbidity, which is a common and notable occurrence in patients with MS. Future work should investigate the performance of this algorithm in the presence of vascular lesions.
The lesion-count method presented in this article has several appealing features, including its low computational burden and its easy and flexible implementation. Computationally, the counting algorithm takes less than a minute to run once probability maps are obtained. The speed of the full technique varies depending on the lesion-segmentation method used but took approximately 25 minutes per subject as presented in this study. In terms of implementation, this method can be quickly and easily coded in any program capable of calculating the Hessian structure of a 3D image, a feature included in most image-processing packages. It can also be used with any lesionsegmentation method that yields a probability map; thus, it may be added to almost any pipeline regardless of the preferred segmentation algorithm.

CONCLUSIONS
This article introduces a novel and reliable fully automated method for counting pathologically distinct lesions using images obtained at a single time point, allowing an accurate reconstruction of the natural history of lesion formation without longitudinal data. Lesion count was found to be significantly associated with EDSS, independent of potential confounders such as lesion load and age, and the results suggest that individuals with more small lesions may have better clinical outcomes than those with fewer large lesions. This study also demonstrates the importance of obtaining both lesion count and lesion load by using them to construct a new MS biomarker, average lesion size, and showing that average lesion size has a significantly larger association with EDSS than both lesion load and lesion count. With further study, this technique and the findings it produces could set the stage for new lesion-level considerations in research and treatment of MS.