Improving Multiple Sclerosis Plaque Detection Using a Semiautomated Assistive Approach

The authors evaluated and validated a semiautomated software platform to facilitate detection of new lesions and improved MS lesions. Two neuroradiologists retrospectively assessed 161 MR imaging comparison study pairs acquired between 2009 and 2011. More comparison study pairs with new lesions and improved lesions were recorded by using the software compared with original radiology reports.
BACKGROUND AND PURPOSE: Treating MS with disease-modifying drugs relies on accurate MR imaging follow-up to determine the treatment effect. We aimed to develop and validate a semiautomated software platform to facilitate detection of new lesions and improved lesions.
MATERIALS AND METHODS: We developed VisTarsier to assist manual comparison of volumetric FLAIR sequences by using interstudy registration, resectioning, and color-map overlays that highlight new lesions and improved lesions. Using the software, 2 neuroradiologists retrospectively assessed MR imaging MS comparison study pairs acquired between 2009 and 2011 (161 comparison study pairs met the study inclusion criteria). Lesion detection and reading times were recorded. We tested inter- and intraobserver agreement and comparison with original clinical reports. Feedback was obtained from referring neurologists to assess the potential clinical impact.
RESULTS: More comparison study pairs with new lesions (reader 1, n = 60; reader 2, n = 62) and improved lesions (reader 1, n = 28; reader 2, n = 39) were recorded by using the software compared with original radiology reports (new lesions, n = 20; improved lesions, n = 5); the difference reached statistical significance (P < .001). Interobserver lesion number agreement was substantial (≥1 new lesion: κ = 0.87; 95% CI, 0.79–0.95; ≥1 improved lesion: κ = 0.72; 95% CI, 0.59–0.85), and overall interobserver lesion number correlation was good (Spearman ρ: new lesion = 0.910, improved lesion = 0.774). Intraobserver agreement was very good (new lesion: κ = 1.0, improved lesion: κ = 0.94; 95% CI, 0.82–1.00). Mean reporting times were <3 minutes. Neurologists indicated retrospective management alterations in 79% of comparative study pairs with newly detected lesion changes.
CONCLUSIONS: Using software that highlights changes between study pairs can improve lesion detection. Neurologist feedback indicated a likely impact on management.

worldwide, predominantly young adults. 1 During the past decade, a number of novel disease-modifying drugs have emerged that are effective during the early phases of the disease, reducing the frequency of relapses, potentially halting disease progression, and even reversing early neurologic deficits. 2 This choice in therapeutic options allows treating neurologists to alter management strategies when progression is detected. 2 Because most demyelinating events are asymptomatic, MR imaging has been the primary biomarker for disease progression, and both physical disability and cognitive function have been shown to have a nonplateauing association with white matter demyelinating lesion burden, as seen on FLAIR and T2-weighted sequences. 2-6 Recent advances in imaging, including 3T 3D volumetric T2 FLAIR sequences, allow better resolution of small demyelinating lesions, resulting in better clinicoradiologic correlation. 7,8 Despite advances in imaging techniques, conventional side-by-side comparison (CSSC) is often subject to a reader's expertise. 9 The sensitivity of detecting new lesions is also likely to be reduced when the section number is increased and scan planes are unmatched; however, to our knowledge, this reduction has not yet been investigated. In an attempt to facilitate accurate lesion-load and lesion-volume detection, much research has been devoted to fully automated computational approaches, but the results have been unsatisfactory. Robust lesion segmentation has been identified as a critical obstacle to widespread clinical adoption for several reasons: difficulties specific to MS, problems inherent to segmentation, and data variability. 10
A review of fully automated MS segmentation techniques concluded that basic data-driven methods are inherently inaccurate; supervised learning methods (such as artificial neural networks) require costly and extensive training on representative data; deformable models are better; and statistical models are most promising, though these also require training on representative data. 3 An alternative to total automation is to assist manual reporting with partial automation. A few semiautomated lesion-subtraction strategies have been used in the research setting on small patient populations with good lesion detection and interreader correlation. 11,12 Semiautomation without segmentation is inherently easier, more robust, and less affected by data variability because the lesion count is judged manually. The software can present a number of false-positives without a negative impact on accuracy.
Our aim was to design a nonsegmentation semiautomated assistive software platform that can be integrated into vendor-agnostic PACS and validated by application to a large number of existing routine clinical scans in patients with an established diagnosis of multiple sclerosis. The approach is merely to draw the attention of the radiologist to potentially new or improved lesions rather than to automate the entire process, thus preserving the expertise of neuroradiologists in determining whether a finding is real.
Our hypothesis was that CSSCs of volumetric FLAIR studies in patients with MS were prone to false-negative errors in the perception of both new and improved lesions and that more lesions would be identified by using the assistive software, with improved inter- and intrareader reliability. Second, we hypothesized that presenting this information to clinicians would likely have changed patient management.

Software Development
Detecting lesion change in studies obtained at 2 time points, "old" and "new," requires numerous steps including the following: 1) brain-surface extraction and masking (to remove skull and soft tissues of the head and neck), 2) coregistration and resectioning (to accurately align the 2 scans in all axes), 3) normalization of the FLAIR signal intensity (to remove global signal differences), and 4) calculating the difference in signal intensity between the old and new study at each point (to identify new T2 bright plaques and previously abnormal areas that have regained normal white matter signal). Changes between scans were presented as a color map superimposed on conventional FLAIR sequences (Fig 1). This was accomplished by bespoke code (given the trademark VisTarsier, henceforth VTS or "the software") with the inclusion of a number of open-source components (Fig 2).
Step 1: Brain surface was identified by conforming a reference model (by using BrainSuite from the University of Southern California, http://brainsuite.org/) 13 to the FLAIR images. This brain surface was then used to mask out the skull and extracranial soft tissues.
Step 2: The "new" study was coregistered with the "old" study by performing a 6 df (axis of movement) rigid-body transformation, recovered by using mutual information as the distance metric. 14 Both the resulting transformation and the brain surface mask were stored in a separate PACS data base (by using DCM4CHE, http://sourceforge.net/projects/dcm4che/). 15 Both the old and new volumetric FLAIR datasets were resectioned into orthogonal axial, sagittal, and coronal planes, allowing exact comparison of any individual pixel regardless of the orientations at which the original scans were obtained. Trilinear interpolation (by using the ImageJ library; National Institutes of Health, Bethesda, Maryland) 16 was used to preserve image quality and minimize artifacts during these transformations.
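The mutual information metric named in Step 2 can be illustrated with a minimal sketch. This is not the VisTarsier code: the function name `mutual_information`, the bin count, and the toy intensity lists are assumptions made purely for illustration. A real registration would evaluate such a metric repeatedly while searching over the 6 rigid-body parameters.

```python
import math
from collections import Counter

def mutual_information(a, b, bins=8, max_value=256):
    """Mutual information (in nats) between two equally sized intensity lists.

    Hypothetical helper for illustration; intensities are assumed to lie
    in [0, max_value).
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    width = max_value / bins
    # Quantize each intensity into a histogram bin index.
    qa = [min(int(v / width), bins - 1) for v in a]
    qb = [min(int(v / width), bins - 1) for v in b]
    n = len(a)
    joint = Counter(zip(qa, qb))  # joint histogram of paired voxels
    pa = Counter(qa)              # marginal histogram of a
    pb = Counter(qb)              # marginal histogram of b
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * math.log(p_ij / ((pa[i] / n) * (pb[j] / n)))
    return mi

# Toy data (made up): the same "anatomy" with slight noise, and a shuffled copy.
old = [10, 10, 200, 200, 90, 90, 30, 220]
aligned = [12, 11, 198, 205, 95, 88, 28, 215]
misaligned = [200, 90, 10, 30, 220, 10, 90, 200]

mi_aligned = mutual_information(old, aligned)
mi_misaligned = mutual_information(old, misaligned)
```

In this toy case the well-aligned pair scores higher than the shuffled pair, which is the property an optimizer would exploit when recovering the transformation.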
Step 3: Image signal intensity was normalized by using histogram equalization to eliminate global differences.
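As a rough illustration of how histogram equalization removes a global intensity difference, consider the following sketch. The text does not specify which equalization variant was used, so the classic cumulative-distribution form shown here, the function name `equalize`, and the toy values are all assumptions.

```python
def equalize(intensities, levels=256):
    """Histogram-equalize a flat list of integer intensities in [0, levels).

    Hypothetical helper; uses the classic CDF-based mapping.
    """
    if not intensities:
        return []
    n = len(intensities)
    hist = [0] * levels
    for v in intensities:
        hist[v] += 1
    # Cumulative distribution function over intensity levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # Constant image: there is nothing to spread out.
        return list(intensities)
    scale = (levels - 1) / (n - cdf_min)
    return [round((cdf[v] - cdf_min) * scale) for v in intensities]

# Toy data (made up): the "new" scan is uniformly 100 units brighter.
old_scan = [10, 20, 20, 30]
new_scan = [110, 120, 120, 130]
old_eq = equalize(old_scan)
new_eq = equalize(new_scan)
```

After equalization both toy scans map to the same values, so a subsequent subtraction would show no change from the global offset alone.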
Step 4: Using the now masked, coregistered, and normalized volumetric new and old FLAIR sequences, we computed a volumetric image containing signed pixel differences. We used both color and transparency to encode the changes between the 2 studies, transparency to encode the magnitude of change, and color to indicate the type of change, with red indicating new lesions (NLs) and green indicating improved lesions (ILs).
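The encoding described in Step 4 can be sketched as follows. The function name `change_overlay`, the RGBA tuple layout, and the linear scaling of opacity by `max_change` are illustrative assumptions; the text states only that transparency encodes magnitude and color encodes the direction of change.

```python
def change_overlay(old, new, max_change=255):
    """Map per-voxel signed differences to (r, g, b, alpha) overlay values.

    Hypothetical helper; old and new are equally sized flat intensity lists
    that are assumed to be already masked, coregistered, and normalized.
    """
    overlay = []
    for o, n in zip(old, new):
        diff = n - o  # signed: > 0 means brighter on the new study
        # Opacity grows with the magnitude of change, capped at fully opaque.
        alpha = min(abs(diff) / max_change, 1.0)
        if diff > 0:
            color = (255, 0, 0)   # red: candidate new lesion
        elif diff < 0:
            color = (0, 255, 0)   # green: candidate improved lesion
        else:
            color = (0, 0, 0)     # no change: fully transparent anyway
        overlay.append(color + (alpha,))
    return overlay

# Toy data (made up): unchanged voxel, brighter voxel, darker voxel.
overlay = change_overlay([100, 100, 100], [100, 151, 49])
```

The overlay only flags candidates; as described in the text, the reader still decides whether each highlighted area is a true lesion.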
The resultant images were then viewed in a bespoke DICOM viewer, with the reader able to view all 3 planes for both old and new studies, and each point could be correlated in all view panes (Fig 1). The total processing time for steps 1-4 is approximately 1 minute on a typical desktop computer (1.6 GHz), including data retrieval and storage time. Rendering of the 3D and 2D perspectives requires approximately 10 ms per viewpoint, allowing rapid scrolling through the data. Most rendering time consists of trilinear interpolation.
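Because trilinear interpolation dominates rendering time, a minimal sketch of the operation may be useful. The software used the ImageJ library for this step; the pure-Python function below is not that implementation, and its name, the `volume[z][y][x]` nested-list layout, and the boundary clamping are assumptions for illustration.

```python
def trilinear(volume, x, y, z):
    """Sample volume[z][y][x] at a fractional (x, y, z) position.

    Hypothetical helper; assumes non-negative coordinates inside the volume.
    """
    x0, y0, z0 = int(x), int(y), int(z)  # floor for non-negative coordinates
    # Clamp the upper neighbours so positions on the far face stay in bounds.
    x1 = min(x0 + 1, len(volume[0][0]) - 1)
    y1 = min(y0 + 1, len(volume[0]) - 1)
    z1 = min(z0 + 1, len(volume) - 1)
    fx, fy, fz = x - x0, y - y0, z - z0

    def lerp(a, b, t):
        return a + (b - a) * t

    # Interpolate along x on the four edges, then along y, then along z.
    c00 = lerp(volume[z0][y0][x0], volume[z0][y0][x1], fx)
    c10 = lerp(volume[z0][y1][x0], volume[z0][y1][x1], fx)
    c01 = lerp(volume[z1][y0][x0], volume[z1][y0][x1], fx)
    c11 = lerp(volume[z1][y1][x0], volume[z1][y1][x1], fx)
    c0 = lerp(c00, c10, fy)
    c1 = lerp(c01, c11, fy)
    return lerp(c0, c1, fz)

# Toy 2 x 2 x 2 volume (made up).
vol = [[[0, 10], [20, 30]], [[40, 50], [60, 70]]]
center = trilinear(vol, 0.5, 0.5, 0.5)
corner = trilinear(vol, 1, 0, 0)
```

Each sample reads 8 neighbouring voxels and performs 7 linear blends, which is why this step accounts for most of the per-viewpoint cost when resectioned planes are rendered.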

Validation
Institutional ethics approval was obtained for this study. The hospital PACS was queried for MR imaging brain demyelination-protocol studies performed on a single 3T magnet (Tim Trio, 12-channel head coil; Siemens, Erlangen, Germany) between 2009 and 2011 inclusive, for patients who had ≥2 studies during that period, yielding 367 studies. Eligibility criteria were the following: consecutive studies in patients with a confirmed diagnosis of multiple sclerosis (based on information provided on requisition forms) and availability of a diagnostic-quality MR imaging volumetric FLAIR sequence (FOV = 250, 160 sections, section thickness = 0.98 mm, matrix = 258 × 258, TR = 5000 ms, TE = 350 ms, TI = 1800 ms, 72 sel inversion recovery magnetic preparation). One hundred sixty-six comparison pairs (332 studies) met the above inclusion criteria. Of these, 5 comparative study pairs (CSP) had to be excluded due to a lack of exact lesion quantification in the issued radiology reports. A final total of 161 CSP (median time between scans, 343 ± 174 days) of 153 individual patients (women = 116, men = 37; median age, 41.5 ± 10.2 years) with accompanying reports were thus included in the study. MR imaging-trained radiologists at our institution reported all studies. Of the 161 CSP, 43 had initial clinical reports by 1 of the authors (reader 1).
To assess interobserver characteristics and validate the detection ability of the software, 2 fellowship-trained neuroradiologists (readers 1 and 2) with 6 and 3 years' clinical experience, respectively, retrospectively assessed all CSP by using the software. The readers were blinded to each other's findings and to the existing radiology reports (median time between clinical report being issued and assessment with the software was 449 ± 159.7 days). The time required to read a study by using the software was assessed in real-time, by using a digital stopwatch.
The CSP initially clinically read by reader 1 were reread using the PACS a second time, 12 months later, to assess intraobserver characteristics. These same CSP were also again read by reader 1 three months later, still using the software, with all images left-right reversed to reduce the risk of recalling individual lesions. The time taken to reread the studies by reader 1 was also recorded in real-time by using a digital stopwatch.
Figure. Preprocessing for change-detection on receipt of a new study. A pair of old and new studies are required, each containing a volumetric series used for change detection. In our case, this series uses the FLAIR protocol. Due to significant deformation in soft tissues outside the cranium, it is preferable to register the studies by using only the brain tissue. To this end, a brain-surface extraction tool (BrainSuite from the University of Southern California) 13 is fitted (1) and then used to mask the brain in the new study (2). Next, the equivalent series in the old study is retrieved and coregistered to the new study (3) by using the Mutual Information algorithm. The recovered transformation is stored in the PACS data base. Note that it is only necessary to mask the new study during registration and that rigid registration yielded sufficient accuracy after exclusion of the masked areas. DOF indicates degrees of freedom.

Lesion Assessment
NLs were defined as those with new focal regions of increased T2 FLAIR signal in previously normal white matter. Due to the time interval between studies, no concentrically enlarging (worsening) plaques were identified.
ILs were defined as those with either concentric reduction in lesion size or global reduction in abnormal T2 FLAIR signal.
When using the software to assess NLs, the reader scrolled through axial colored change maps, with areas of increased FLAIR signal highlighted in red. Each time a candidate lesion was identified, the area was correlated to coregistered resectioned but otherwise conventional source FLAIR images in all 3 orthogonal planes of both new and old studies. The reader assessed the lesion as one would during conventional reporting (without the aid of any assistive software), judging whether the lesion represented a new demyelinating lesion, other pathology, or artifact. When the reader was satisfied that the lesion was indeed a true finding and represented a new demyelinating plaque, it was marked with 3D Cartesian coordinates.
This process was repeated for all lesions and was similarly repeated when examining the decreased FLAIR signal maps to identify ILs (highlighted in green).
When subsequently analyzing these marked lesions, the recorded coordinates for each read were automatically compared to ensure that the same lesions were being identified. Lesions with coordinates <2 mm apart were considered as 1 lesion. For lesions with coordinates >2 mm apart, both readers performed manual review of each lesion to determine whether both coordinates belonged to 1 large lesion marked in different locations or to 2 separately detected but adjacent lesions.
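The automatic coordinate comparison could be sketched along the following lines. The 2-mm threshold comes from the text; the greedy nearest-neighbour pairing, the function name `match_lesions`, and the example coordinates are assumptions, because the actual matching procedure is not described in detail.

```python
import math

def match_lesions(marks_a, marks_b, tolerance_mm=2.0):
    """Pair lesion marks from two reads whose coordinates lie within tolerance.

    Hypothetical helper. Returns (matched_pairs, unmatched_a, unmatched_b);
    the unmatched marks are the ones that would go to manual review.
    """
    unmatched_b = list(marks_b)
    matched, unmatched_a = [], []
    for a in marks_a:
        # Nearest still-unpaired mark from the other read, if any remain.
        best = min(unmatched_b, key=lambda b: math.dist(a, b), default=None)
        if best is not None and math.dist(a, best) < tolerance_mm:
            matched.append((a, best))
            unmatched_b.remove(best)
        else:
            unmatched_a.append(a)
    return matched, unmatched_a, unmatched_b

# Toy coordinates in millimetres (made up).
reader_1 = [(10.0, 20.0, 30.0), (50.0, 50.0, 50.0)]
reader_2 = [(10.5, 20.0, 30.5), (80.0, 10.0, 5.0)]
matched, only_1, only_2 = match_lesions(reader_1, reader_2)
```

Here the first mark from each reader is treated as the same lesion, and the remaining marks are left for manual review by both readers.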

Statistical Analysis
The Cohen κ statistic for interrater reliability was used to measure and compare the agreement between the 2 readers and between the readers and the originally issued radiology reports. The Spearman correlation coefficient was also used to assess interreader correlation of overall lesion load. Three sets of binary subgroups were considered (≥1 lesion, ≥2 lesions, and ≥3 lesions) when assessing interreader agreement. Univariate χ² and 2-group proportion analyses were conducted to compare the number of new or improved lesions identified by the clinical report and by the readers when using the assistive software. The time taken to complete the assessment was recorded for each reader and reported as averages. The Mann-Whitney rank sum test was used to compare the time taken to read the scan data when using the side-by-side comparison with that when using the software. For all statistical tests, a 2-sided α value of .05 was used to indicate significance. Data were analyzed with STATA (Version 12.1; StataCorp, College Station, Texas).
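For readers less familiar with the agreement statistic, the following sketch shows how a Cohen κ value is obtained from two readers' dichotomized calls (for example, whether a CSP has at least 1 new lesion). The analysis in this study was performed in STATA; this Python function, its name `cohen_kappa`, and the ratings below are illustrative assumptions and do not reproduce study data.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters' categorical labels on the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)

# Made-up dichotomized calls for 10 study pairs (1 = at least 1 new lesion).
reader_1_calls = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
reader_2_calls = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
kappa = cohen_kappa(reader_1_calls, reader_2_calls)
```

In this toy example the readers agree on 9 of 10 pairs (observed agreement 0.9) against a chance agreement of 0.5, giving κ = (0.9 − 0.5)/(1 − 0.5) = 0.8.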

Potential Clinical Impact
Questionnaires were sent to the referring neurologists concerning CSP if there was a change in lesion load when comparing the originally issued radiology report and the report of the readers using the software. Neurologists were asked to indicate whether their management strategies would have been changed retrospectively in regard to medication regimens, clinical follow-up interval, or MR imaging follow-up interval.
To substratify the comparison pairs with NL and IL, we considered 3 sets of dichotomized subgroups (CSP with ≥1 lesion, ≥2 lesions, and ≥3 lesions). For CSP with detected NL and IL, statistics indicating substantial interreader agreement were observed (Table 2). These values were reduced slightly due to reader 2 identifying slightly higher numbers of both NLs and ILs in each subgroup, resulting from an interreader difference in the interpretation of lobulated lesions as either 2 confluent lesions or 1 irregular lesion. The Spearman correlation coefficient demonstrated good overall interreader correlation (Spearman ρ: NL = 0.910, IL = 0.774).
Comparing the subgroups of both NL and IL, readers detected a higher number of CSP with a changed lesion load compared to the original radiology reports (Fig 3A, -B), despite a wide variation of total background lesion load (On-line Fig 2).
Three false-negatives occurred by using the software: 2 NLs and 1 IL, described in 3 respective radiology reports, were not detected by the readers.
Assessment of lesion location accuracy was calculated by using the total agreed base lesion load (defined as the lowest number of NLs or ILs that both readers agreed on per CSP) and showed good interreader location accuracy (NL location accuracy = 94%, 313/333; IL location accuracy = 96%, 70/73).
Despite identifying more NLs and ILs in a greater proportion of CSP when using the software, intraobserver agreement between the first and second read of the "reader 1 subgroup" applying the software was very good and better than that with CSSC, though this did not reach statistical significance with the sample size limited to 43 CSP (Table 3 and On-line Table 1). Mean reporting times per CSP were <3 minutes (reader 1 = 2 minutes 15 seconds ± 1 minute 5 seconds and reader 2 = 2 minutes 45 seconds ± 1 minute 44 seconds), and there was an overall reduction in study reading times as the readers became more familiar with the software (mean read time of the first 25 studies: reader 1 = 3 minutes 9 seconds, reader 2 = 4 minutes 30 seconds versus a mean read time of the last 25 studies: reader 1 = 1 minute 36 seconds, reader 2 = 1 minute 49 seconds).
When we compared VTS with CSSC, there was a significant difference in read times (median, interquartile range): VTS = 1 minute 58 seconds (range, 1 minute 37 seconds to 2 minutes 52 seconds) compared with CSSC = 3 minutes 41 seconds (range, 51 seconds to 4 minutes 12 seconds; P < .001).
Feedback forms were drafted for the 60 CSP that showed interval lesion load change when comparing the originally issued radiology reports and the lesion load detected by using VTS. A total of 47/60 completed feedback forms were returned (respondent rate of 78%). In 79% (37/47) of cases, neurologists reported that they would have been likely to change management strategies if the altered lesion load had been known at the time, prompting a change in either MR imaging follow-up interval, clinical follow-up interval, or therapeutic management (On-line Table 2).

DISCUSSION
Management strategy considerations for MS are based on clinical, biochemical, and imaging findings and are aimed at treating acute attacks, preventing relapses and progression, managing symptoms, and rehabilitation. 2,17 In recent years, a number of new agents have become available, targeting various multiple sclerosis disease pathways. 1,2 MR imaging plays an important role in detecting not only the total demyelinating lesion load but also, possibly more important, interval change in the number of demyelinating lesions, reflecting disease activity and potentially resulting in changes to treatment. 2 Conventional comparative image assessment is subjective, dependent on the skill and consistency of the reviewer. 9 To facilitate time-efficient, reproducible, and accurate lesion-load detection, many algorithms have been proposed for fully automated computer-assistive solutions. 3,18 These methods use different principles, including intensity-gradient features, 19 intensity thresholding, 20 intensity-histogram modeling of expected tissue classes, 21-23 fuzzy connectedness, 24 identification of nearest neighbors in a feature space, 25,26 or a combination of these. Methods such as Bayesian inference, expectation maximization, support-vector machines, k-nearest neighbor majority voting, and artificial neural networks are algorithmic approaches used to optimize segmentation. 18 All of these approaches tend to show promising results; however, the results are usually on small samples, often nonreproducible and unreliable, and have not entered into routine clinical use. 18 In smaller study populations, semiautomated assistive approaches have been investigated with promising results by using both MR imaging subtraction techniques and coregistered comparative volumetric FLAIR color-map overlays. 11,12 Our semiautomated radiology assistive platform is computationally fast and robust, successfully processing all 322 included studies. 
The software allows color maps superimposed on anatomic FLAIR sequences and direct comparison between old and new studies in exactly aligned axes as well as accurate localization of any given point in all 3 planes.
In our study population, the largest reported of its kind (161 CSP; 322 individual studies), significantly more CSP with NLs and ILs were detected when using the assistive software than in the originally issued MR imaging reports generated with CSSC. On the basis of responses by referring neurologists, at least 79% of CSP with changes in detected lesion loads (VTS versus CSSC) were likely to have undergone a change in management if the altered lesion load had been appreciated at the time. This represents 22% (37/161) of the whole cohort: studies reported as "stable" that actually had sufficient change in disease burden to potentially alter management.
In addition to detecting NLs, our approach demonstrates a statistically significant improvement in detecting ILs. There is, however, a larger disparity between readers when assessing ILs. After we reviewed the discrepant lesions, this does not appear to stem from software failure but from intrinsic heterogeneity in lesions that appear to be reducing in size or signal intensity. Some lesions demonstrated an unequivocal concentric reduction in size. Many lesions, however, demonstrated diffuse or ill-defined signal normalization. It was these lesions that resulted in most interreader discordance. This difficulty in clearly defining the nature of improving demyelinating lesions is echoed in studies correlating the MR imaging appearances of demyelinating lesions with lesion pathology, highlighting the heterogeneity that also exists in radiologic-pathologic correlation. 27 Although the significance of new demyelinating lesions on MR imaging has been well established in the literature, 1,2,4-7 the clinical significance of "improved" demyelinating lesions is less clear.

Limitations
Our study has a number of limitations. One is the single-scanner/single-sequence nature of the dataset. The software platform has been designed to be vendor-agnostic and should be able to accept any volumetric FLAIR sequence; however, this has not yet been tested, and performance will likely vary depending on the quality of the source data. Indeed, vascular flow-induced artifacts through the anterior pons and inferior temporal lobes seen in our FLAIR sequence resulted in some difficulty in interpreting signal change in these regions, and improved sequence design may allow even greater lesion detection. Additionally, there is likely to be decreased performance if the 2 volumetric FLAIR sequences compared are from different MR imaging scanners or differ in their specifications, though again transformation, coregistration, and normalization are not dependent on identical sequences. Although not tested in this study, other volumetric sequences such as double inversion recovery should fulfill the criteria to be used with the software.
Other limitations of our study include the inability to comment on interobserver agreement on the original radiology reports because these were single-read by various MR imaging-trained radiologists in our department and a reread of all CSP by both readers was beyond the scope of this study.
Although we also cannot comment on the time it took to read the original MR imaging studies in clinical practice (at our institution, we do not routinely record reporting times), we tried to address this, in part, by measuring the time taken to perform conventional interpretation by using CSSC on the PACS during the second reread by reader 1. Having done so, we nonetheless acknowledge that applying the software clinically may result in unforeseen program-related and user-related time delays. We hope to minimize these by incorporating the platform directly into a PACS workflow and by familiarizing MR imaging readers with the program; these changes will be explored in future work.
Although intrareader correlation was shown to be very good in the subset of cases that reader 1 reread (Table 3), the accuracy of the other radiology reports may have been influenced by factors related to the daily demands of a busy radiology department, such as time pressures and interruptions (factors not simulated when testing the software). All MR imaging readers in our department are experienced; thus, radiologic expertise is unlikely to present a limitation. If anything, it is likely that the lesion-detection improvement would be larger for radiologists with less neuroradiology experience.
We also acknowledge that we did not directly assess the clinical impact of a second read without the software; however, we believe we have, in part, explored this by having reader 1 additionally reread all the studies he originally assessed (n = 43) by using CSSC. The agreement between the 2 reads was high (Table 3). More important, in only a single patient did the reads differ in categorization (On-line Table 1). The reports of a second read by using CSSC would have been, in all except 1 case, indistinguishable from the initial clinical reports; thus, no change in management would be expected. As such, retrospective changes to management reported by treating clinicians are attributable to the software, rather than to merely a second read.

Future Work
Our current development work is focused on deploying the software to the live clinical PACS workflow at our institution, which will allow us to carry out prospective research and ensure that the findings of this study are replicated in terms of improved lesion detection without the burden of false-positives. We are also hoping to make the software available to other institutions for further validation by introducing additional readers and by using a variety of FLAIR sequences. The functionality of the software can also be extended in the future by adding semiautomated segmentation for quantification of lesion volume.