Improved Automatic Detection of New T2 Lesions in Multiple Sclerosis Using Deformation Fields

BACKGROUND AND PURPOSE: Detection of disease activity, defined as new/enlarging T2 lesions on brain MR imaging, has been proposed as a biomarker in MS. However, detection of new/enlarging T2 lesions can be hindered by several factors that can be overcome with image subtraction. The purpose of this study was to improve automated detection of new T2 lesions and reduce user interaction to eliminate inter- and intraobserver variability. MATERIALS AND METHODS: Multiparametric brain MR imaging was performed at 2 time points in 36 patients with new T2 lesions. Images were registered by using an affine transformation and the Demons algorithm to obtain a deformation field. After affine registration, images were subtracted and a threshold was applied to obtain a lesion mask, which was then refined by using the deformation field, intensity, and local information. This pipeline was compared with only applying a threshold, and with a state-of-the-art approach relying only on image intensities. To assess improvements, we compared the results of the different pipelines with the expert visual detection. RESULTS: The multichannel pipeline based on the deformation field obtained a detection Dice similarity coefficient close to 0.70, with a false-positive detection of 17.8% and a true-positive detection of 70.9%. A statistically significant correlation (r = 0.81, P value = 2.2688e-09) was found between visual detection and automated detection by using our approach. CONCLUSIONS: The deformation field–based approach proposed in this study for detecting new/enlarging T2 lesions resulted in significantly fewer false-positives while maintaining most true-positives and showed a good correlation with visual detection annotations. This approach could reduce user interaction and inter- and intraobserver variability.

M R imaging has become a core paraclinical tool for diagnosing and predicting long-term disability and treatment response in patients with multiple sclerosis. Of particular note, several criteria and strategies have been proposed for prompt identification of suboptimal response in individual patients based on a combination of clinical and MR imaging measures assessed during the first 6 -12 months after treatment initiation. [1][2][3][4][5][6] These criteria are related to detection of disease activity on follow-up brain MR imaging studies compared with baseline scans, defined as either gadolinium-enhancing lesions or new/enlarging T2 lesions. However, detection of active T2 lesions in patients with MS can be hindered by several factors, such as a high burden of inactive T2 lesions, the presence of small and confluent lesions, inadequate repositioning, and high interobserver variability. 7 Image subtraction after image registration can overcome these issues by visually cancelling stable disease (lesions that stay the same over time) and providing good visualization and quantification of active T2 lesions (either positively or negatively). 8,9 Techniques for automatic detection of active T2 lesions can be classified into 2 categories: intensity-based and deformationbased approaches. 10 In the former, successive scans are analyzed by point-to-point (voxel-to-voxel) comparison, whereas in the latter, deformation fields obtained by nonrigid registration of the 2 scans are analyzed.
Most of the proposed techniques to detect changes on follow-up images use an image-subtraction process that identifies new T2 lesions [11][12][13] and include statistical models of intensity changes between scans or other, more complex, supervised strategies. Although segmentation of subtraction images enables quantification of new, enlarging, and resolving MS lesions, automated image analysis that differentiates a true lesion change and noise or artifacts would save considerable time and effort.
Nonrigid registration techniques usually provide a discrete vector field that defines deformations occurring between 2 different images. This vector field can be used to detect evolving processes, including new T2 lesions. Several approaches that use deformation fields (DF) to detect positive changes occurring in longitudinal MR studies have been reported. 14,15 These approaches focus on detecting and explaining processes undergoing change (ie, lesions shrinking or growing), but not on detecting new lesions, a measure that is now under consideration as a biomarker for monitoring and predicting treatment response. 16 The purpose of this study was to improve automated detection of new T2 lesions on successive brain MR images, by using a novel approach that combines subtraction and DF analysis. This new pipeline will be compared with other approaches, in which a threshold is applied or a postprocessing step is incorporated on the basis of intensity rules.

Patients
We prospectively analyzed previously acquired data from a cohort of 36 patients with clinically isolated syndrome (CIS) or early relapsing MS (13 women and 23 men; 35.4 Ϯ 7.1 years of age) who underwent brain MR imaging in our center for diagnosis or for monitoring disease evolution or treatment response. All patients with CIS and early relapsing MS demonstrated new T2 lesions on the follow-up scans and were diagnosed according to recent definitions and criteria. 17,18 Two brain MR imaging acquisitions were obtained in each patient, the first within the first 3 months after the onset of symptoms (baseline [BL]) and the second at 12 months' follow-up after onset (FU). The Vall d'Hebron hospital's ethics committee approved the study, and written informed consent was signed by the participating patients.

Expert Analysis
All new and enlarging T2 lesions visually detected on the FU scan were annotated on T2-FLAIR images by using the semiautomated tool included in Jim 5.0 (http://www.xinapse.com/home.php). This task was performed by a trained technician who first detected changes visually by using the BL and FU scan and then delineated them semiautomatically by using a subtraction image and both scans. This task was later confirmed by an expert neuroradiologist. The results of this analysis served as the reference standard for comparisons in the study.

Preprocessing
On both BL and FU PD-weighted images, a brain mask was identified and delineated by using the FSL Brain Extraction Tool (bet2 command) (http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/BET) with the robust center estimation, neck clean-up, and default threshold parameters. The mask was then applied to the other coregistered images (T2-, T2-FLAIR-, and T1-weighted), and the N4 algorithm from the ITK library (http://www.itk.org/) 19 was used to correct for bias with the standard parameters for a maximum of 400 iterations. The last preprocessing step was to normalize BL and FU intensity values by using a histogram-matching approach.

Registration and Subtraction
In each patient, T1-and T2-FLAIR-weighted images from the same study were coregistered to the PD-weighted image by using a 3D affine transformation similar to that in previous works. 20 The Mattes Mutual Information cost function was minimized by Regular Step Gradient Descent Optimization (https://itk.org/ Doxygen320/html/classitk_1_1RegularStepGradientDescent Optimizer.html), and B-spline interpolation was applied. This framework was implemented by using ITK.
The same 3D affine-registration framework was also used before subtraction to warp the BL images to the FU space because patients with CIS and early relapsing MS present with small (or no) overall anatomic changes. 21 The registration was conducted between both PD-weighted images. After the transformation had been obtained, we applied it to the other images by using B-spline interpolation to subtract the BL PD-, T2-, and T2-FLAIRweighted images from their corresponding FU images. In the case of BL T2-FLAIR-weighted images, the 2 affine transformations were combined to avoid interpolating more than once.
Affine registration methods are robust to the presence of lesions, and when new lesions appear, deformable models usually show distortions to compensate for the anomalous regions. On the basis of the characteristics of these approaches, we were able to analyze the DF obtained after applying these nonrigid techniques to the registered images. In this study, we applied the multiresolution Demons registration approach 22 from ITK initialized with the previous affine transformation. Concretely, we used the DemonsRegistrationFilter (SD ϭ 1) (http://www.itk.org/Doxygen320/html/classitk_1_ 1DemonsRegistrationFilter.html) with MultiResolutionPDED-eformableRegistration (http://www.itk.org/Doxygen320/html/ classitk_1_1MultiResolutionPDEDeformableRegistration. html) (iterations ϭ 50, levels ϭ 2). This algorithm can produce large localized deformations and has been widely used in brain MR imaging.

Threshold
New and enlarging T2 lesions appear hyperintense in the subtraction image. However, certain regions outside the white matter may also appear hyperintense due to artifacts, noise, inhomogeneity, registration errors, or small anatomic differences. Because our goal was to detect new and enlarging T2 WM lesions, we restricted our search to areas within the WM. To define this region, we applied an automated tissue-segmentation algorithm 23 to the BL and FU scans. This nonparametric algorithm uses an atlas registered to the T1-weighted image in conjunction with the T1-, T2-, and PD-weighted images. This segmentation was applied before the registration step between the 2 image sets. After registration, a final WM mask was obtained as the union of the 2 WM segmentations in the FU space. After defining WM, we smoothed the subtracted images by using the ITK 3D Gaussian filter (Discrete-GaussianImageFilter; http://www.itk.org/Doxygen/html/classitk_ 1_1DiscreteGaussianImageFilter.html) with a 0.5 SD to reduce the impact of spurious hyperintense regions. 20 An automated threshold was then computed for each subtraction image (PD, T2, and T2-FLAIR) and applied separately to obtain 3 initial lesion masks. The thresholds were computed as the mean of the subtraction image within the WM plus 5 SDs to guarantee that only hyperintense regions were detected and to maintain a large number of true-positives (TPs), as proposed previously. 20 Lesions of Ͻ3 voxels were excluded to reduce the effects of noise.

Lesion Mask Combination
To differentiate between errors and true lesions in each mask, we used the intersection of the 3 masks (PD, T2, and T2-FLAIR). Because differences in the initial masks might still result in falsepositive (FP) detections of 1 or 2 voxels, we also applied the lesion size restriction to the combined mask to reduce this effect.
Afterward, the 2 different postprocessing approaches were used independently in order to compare them.

Postprocessing Based on Intensity
While the aforementioned restrictions usually exclude a large number of FPs, they do not completely eliminate this problem. As has been reported, 20 some FPs can arise from low intensities in the BL images, caused, for example, by skull-stripping errors. To reduce the effect of these factors and to include local information, we applied a set of suggested intensity-based rules to the BL and FU images 20 : • Global rule: To avoid regions with a low intensity, candidate lesions with a mean value under basal Ϫ 2 basal are discarded, where basal and basal are the mean and SD of the basal intensities inside the WM ROI. • Basal neighborhood ratio: New lesions should appear as WM in the basal image. To ensure that, we compute a ratio between the neighboring pixels of the candidate lesions ( lesion / neighbors ). If this ratio is Ͻ0.9, we discard the candidate lesion. That usually means that there is a dark spot that might appear as a hyperintensity in the subtraction image. • Follow-up neighborhood ratio: Similarly, new lesions should actually be lesions in the follow-up image. To ensure that, we compute the same ratio. If this ratio is Ͻ1, the candidate lesion has a lower intensity profile than its neighboring area, so we discard it.

Postprocessing Based on Deformation Fields
The Demons algorithm provides DF representing a transformation from the target image (FU scan) to the source image (BL scan). To compensate for hyperintense lesions, the DF go from outside the lesion to its center (shrinking it), as is illustrated in Figs 1 and 2. Vectors within and in the vicinity of the lesion have a higher modulus than those in other regions of the image. Moreover, no sinking patterns with independent behavior between neighboring vectors are observed far from lesions.
To be able to model and automatically detect this behavior, we defined 3 regional metrics computed from the DF inside each candidate lesion: • Divergence 15 : This vector operator is defined as the volume density of the outward flux of a vector field from an infinitesimal volume around a given point. Given a continuously differentiable vector field F ជ , the divergence at a given point is equal to the scalar-valued function: In our case, where G(i) j is the j component of the gradient in the i component of the vector field volume.
For new T2 lesions, deformations have an inward flux that is represented by a negative value (div DF of Ͻ0). Therefore, we excluded lesions that had a positive mean value. puted the scalar product between the DF vector and this concentric vector. Concentric vector fields should have an absolute mean value close to 1; therefore, we excluded all candidate lesions with an absolute value lower than 0.75. This value indicates that the deformation vector and the concentric vector have a maximum angle of 15°.

Evaluation and Statistical Analysis
To validate use of the DF and the benefits they provide when automatically detecting new T2 lesions, we compared the proposed pipeline to a state-of-the-art approach 20 with detection-based measures.
In this approach, a lesion is considered TP if it overlaps a ground truth lesion, FP is a detected lesion with no overlap, and FN is a lesion that has not been detected. The TP fraction (TPf) and FP fraction (FPf) are the ratio measures of TP versus ground truth lesions and FP versus all lesions found, respectively. Therefore, perfect detection would be 100% TPf and 0% FPf. To complement and summarize these measures, we also computed the Dice similarity coefficient (DSC): Furthermore, we also performed an evaluation of the actual overlap between lesions by using the volumetric DSC.
Finally, we also included the average surface distance measure from the MICCAI MS Lesion Segmentation Challenge 2008 (http://www.ia.unc.edu/MSseg/). 24 The border voxels of segmentation and reference are determined. For each voxel along one border, the closest voxel along the other border is determined (by using unsigned Euclidean distance in real-world distances). All these distances are stored, and their average is computed. This value is zero for a perfect segmentation.
A statistical analysis was performed to evaluate the significance of the results obtained. To determine the performance of each key step in our pipeline, we conducted 3 sets of experiments, each focusing on a different aspect. The naïve approach consisted of applying the threshold defined in the "Materials and Methods" section to each subtraction image. We also applied different postprocessing approaches to the initial masks separately, and finally, we compared the results of the threshold mask combination to our proposal and a state-of-the-art approach.
First, we performed a Lilliefors test on the measures evaluated and their differences. Due to the number of pipelines evaluated and the statistically proved non-normal distribution of the measures, pair-wise t tests were inappropriate. Hence, permutation tests 20,25 were used to determine significant differences among applying a threshold, using intensity and neighborhood rules, and using DF. Permutation tests yield the mean () and SD () of the fraction of times that the difference in a given measure for a given method is smaller than the remaining methods, with a P value Յ .05. The methods were then ranked by the mean and SD of the method with the highest measured value. Methods in the same rank had similar results, whereas methods in different ranks showed significant differences.
We also performed a Wilcoxon rank sum test among the DSC, TPf, and FPf results for each independent image after the threshold was applied. Finally, the Pearson correlation was used to analyze the manual annotations and the automatic detections obtained with our approach.

RESULTS
The mean results for new T2 lesion detection and segmentation by using each of the approaches are summarized in Table 1. The DSC results with our approach were 0.68 in terms of detection (regions) and 0.52 in terms of segmentation (volume). Moreover, we obtained the lowest average surface difference (7.89 mm) in contrast to the joint threshold (13.07 mm) and with intensity rules (30.80 mm). While the volumetric agreement was lower, it was high enough to validate our detection definition of 1 voxel overlap.

Impact of Postprocessing per Image
Our first set of experiments consisted of applying a threshold to PD-weighted, T2-weighted, and T2-FLAIR-weighted images separately. We compared this naïve approach with a state-of-the-art approach 26 based on intensity and spatial rules and the DF rules presented here on each image.
According to Table 1, application of a threshold alone missed some ground truth lesions and resulted in a large number of FPs. Lowering the threshold to include all ground truth detections would be counterproductive because of the number of FPs. In terms of sensitivity alone, both PD-weighted and FLAIR subtractions yielded similar results.
Rankings obtained by statistical permutation testing for the DSC are summarized in Table 2. Negative values indicate lower performance than the method with the highest DSC value. Rank 1 only included approaches that relied on the DF after applying a threshold, whereas rank 2 included approaches that used intensity and neighborhood rules for the PD and T2-FLAIR subtractions. Rank 3 included all methods based on thresholds with a negative P value. Because ranking between the approaches differed, we can conclude that there was a significant difference between using DF and intensity/neighboring rules.
Paired rank sum testing between strategies revealed no significant difference in DSC or TPf among the 3 image subtractions, thus indicating that all 3 images provided similar sensitivity for lesion detection. However, we obtained significant differences for FPf, suggesting that FP detection differed among the images. This difference supports our idea of combining the masks obtained for each subtraction.

Impact of the Lesion Mask Combination
When the initial masks for each image were analyzed independently, almost all new T2-WM lesions were detected. However, FP detections were visually different among the images in most cases and, therefore, highly related to the image being visualized.
To validate the assumption that combining the masks significantly improves the results, we performed a second set of experiments and comparisons by using rank sum testing between the lesion mask after applying a threshold to each image independently and the intersection of all 3 masks. Significant differences were found for FPf and DSC (P Ͻ .05) but not for TPf. Again, this finding suggests that combining all masks reduces the number of FPs without significantly affecting TP detections.

Pipeline Comparison
We also performed an analysis of the last group in Table 1 (mask combination in the 3 different strategies). In this case, we obtained significant differences (P Ͻ .05) for all 3 measures (DSC, TPf, and FPf) between the intersection mask and the 2 approaches based on postprocessing. This result indicates that the DSC improvement was due to the considerable decrease in FPs detrimental to the number of TPs. This is the usual trade-off encountered when dealing with postprocessing techniques, in which some TPs are excluded (eg, due to image artifacts) to reduce the number of FPs. We also found significant differences (P Ͻ .05) in all 3 measures between the 2 automatic approaches (our proposal and that of Ganiler et al 20 ), reinforcing the notion that our DF strategy yields better performance. Qualitative examples of the results obtained with our proposal are shown in Fig 3. A significant correlation (r ϭ 0.81, P ϭ 2.2688e-09) was found between annotations based on visual detection and our automated approach for detecting new T2 lesions (Fig 4). We then analyzed the effect of the 3 DF-based measures and found that they all had a similar impact in most cases; however, some FPs were only detected by one of them, with no apparent pattern.

Lesion Analysis per Volume
Finally, we analyzed lesion detection by groups of similar size. Table 3 summarizes the results before and after postprocessing by using the deformation field obtained. As expected, lesions with a small size (between 3 and 10 voxels) have a low detection rate (42.86%). Due to their small size, the deformation field cannot fully capture them and they are discarded during the postprocessing step. As the volume increases, the deformation field presents a clearer pattern that we can detect with the rules presented in this article. Even though for lesions of a medium size (between 11 and 50 voxels) the TPf is still lower than 50% (48.57%), this value increases for large lesions of Ͼ50 voxels (77.42%). Moreover, the TPf decreases from 69.23% before postprocessing to 23.08% with lesions of Ͻ7 voxels.

DISCUSSION
New/enlarging T2 lesion count is a common measure used to monitor and predict treatment response in patients with MS. [1][2][3][4][5][6] Trained radiologists perform this task by visual analysis of 2 successive MR images, a time-consuming task associated with high interobserver variability. 7 The pipeline proposed in this study may be of value for assisting or even replacing visual analysis for detecting active MS lesions on T2-weighted images. The method is completely autonomous and automated and does not require user input or a training set. Furthermore, the process is computationally fast because it mainly relies on subtraction and registration. With an optimized Demons algorithm, it takes only minutes to segment all new T2 lesions in a single patient, with a low number of FP detections.
We obtained significant results with a data base of 36 patients, and we also tested our algorithm without any modification with a small clinical trial dataset. This dataset had a reduced number of images (n ϭ 10) that were provided by 3 different centers. Even though promising results were obtained with this initial test (DSC for lesions ϭ 0.79, DSC for volume ϭ 0.60, TPf ϭ 74.15, FPf ϭ 9.61), an exhaustive analysis with a larger number of patients should be performed to prove that the method performs similarly with different acquisition setups.
However, currently, it is not possible to detect new black holes (even though a postprocessing step could be included to differentiate between new lesions and new black holes by using the T1weighted images).
Current studies are working on the definition and implementation of a new "no evidence of disease activity" treatment. 6,27 This decision model relies on, among others, the detection of new/enlarging T2 lesions as a biomarker and requires a high specificity and sensitivity because the number of FPs could suggest an undesirable change in treatment. Therefore, reducing the number of FPs when using automatic tools is a key factor. However, current subtraction techniques usually rely on intensity information, which can misguide detection due to local inhomogeneities or small changes. While these FPs can be reduced by using spatial information, a registration technique that overfits a free-form deformation incorporates this local information and provides better insight into changes occurring due to development of a new lesion or one that changes in size.
Automated algorithms usually obtain better scores when lesion count or lesion volume is high, but they often have shortcomings when the lesion volume or volume change is small. [11][12][13]20 We also compared ours to a current state-of-the-art technique that has been validated with 1.5T imaging. 3T imaging provides better resolution and a higher signal-to-noise ratio, from which registration techniques can benefit. Therefore, to demonstrate that DF provide a better means to differentiate subtraction artifacts and true disease activity (in terms of lesions), we used 3T imaging, in which DF provide a better understanding of evolving processes.

CONCLUSIONS
We have presented a new automated pipeline to detect new brain T2 lesions and positive changes in disease activity in patients with clinically isolated syndrome or early relapsing multiple sclerosis.
This technique relies on DF information and provides more reliable measurement of changes occurring between 2 successive MR images than other currently available approaches. Significant differences in accurate lesion detection were found between this technique and other current approaches, and a strong correlation and higher overlap were seen between our approach and visual lesion detection. These findings indicate that the proposed technique may be of value for application in clinical studies investigating disease activity, monitoring, and treatment effects, providing a decrease in user interaction and likely a reduction in inter-and intraobserver variability.