Unsupervised Deep Learning for Stroke Lesion Segmentation on Follow-up CT Based on Generative Adversarial Networks


Hemorrhagic transformation (HT) and malignant cerebral edema are severe complications after acute ischemic stroke (AIS), which frequently result in functional deterioration and death. [1][2][3] Computer-guided visualization and segmentation of these hemorrhagic and infarct lesions can assist radiologists in detecting small lesions. 4,5 Furthermore, lesion volume computed from a segmentation predicts long-term functional outcome 3,6 and can be used to guide additional treatment such as decompressive craniectomy. 7 Compared with an AIS baseline (BL) NCCT, follow-up (FU) NCCT imaging of a hemorrhagic lesion is characterized by an attenuation increase, while infarct lesions are characterized by an attenuation decrease. 8,9 This attenuation change between NCCT scans can be exploited by specific deep learning algorithms to identify tissue changes and, in turn, can be used to obtain lesion segmentations.
Supervised deep learning with convolutional neural networks is the state-of-the-art computer-guided method for volumetric segmentation of hemorrhagic and infarct lesions in NCCT. 4,5,[8][9][10] In the case of segmentation, "supervised" refers to the use of voxelwise annotations made by a human, often an expert radiologist, that represent the ground truth of the lesion on NCCT. These annotations are subsequently used to optimize a convolutional neural network for automated segmentation. 11 However, acquiring manual annotations is time-consuming and is subject to intra- and interrater variability. As a result, it is difficult to create large data sets with comprehensive ground truth annotations. This issue is a challenge for the training of supervised deep learning models, affecting the performance and generalizability of these models.
Generative adversarial networks (GANs) are a type of deep learning model that can be used to generate new images or transform existing images. 12,13 Because a GAN is optimized without an explicitly defined ground truth, such as manual lesion annotations, it is considered an unsupervised deep learning method. Recently, Baumgartner et al 14 introduced the use of a GAN to transform an MR image of a patient with symptoms to a scan before symptom onset of Alzheimer disease. From this transformation, structural maps were extracted to visually represent pathologic changes relative to a generated BL MR imaging scan without Alzheimer disease. 14 Such structural pathology maps could subsequently be used to segment and quantify the pathologic changes.
The aim of this study was to accurately segment stroke lesions on follow-up NCCT scans with a GAN trained in an unsupervised manner. In line with Baumgartner et al, 14 we developed a GAN to remove hemorrhagic and ischemic stroke lesions from follow-up NCCT scans by generating difference maps between follow-up NCCT scans with a lesion and BL NCCT scans without a lesion.

GANs for BL NCCT Generation
The GAN structure as adopted in this study consists of 2 competing deep learning models, referred to as generator and discriminator models. The generator model generates artificial images, while the discriminator model tries to distinguish the generated artificial image from the original images. 12,13 In this study, the generator receives as input a follow-up NCCT scan with the lesion and generates a difference map. This difference map is subtracted from the input follow-up NCCT scan with the lesion to generate an artificial BL NCCT scan without the lesion. Because an infarct lesion is visually subtle in AIS BL NCCTs acquired in the acute stage (0-6 hours after symptom onset), the transformation from a follow-up scan at 24 hours (24H) or 1 week (1W) with a well-defined lesion to a BL scan entails essentially the removal of the lesion. Subsequently, the discriminator model classifies the presented images as being either an original BL or a generated BL NCCT. This classification is used to provide feedback to the generator model and to optimize the difference map. 12,13 The generated difference map is expected to have high positive values at the location of a hemorrhagic lesion and negative values at the location of an infarct lesion on a follow-up NCCT. Similarly, the attenuation change between BL and follow-up NCCT is positive in the case of a HT and negative if edema or brain tissue necrosis occurs in the infarct lesion. Thresholding of the generated difference map values can then be used to obtain a lesion segmentation.
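The subtraction and thresholding steps described above can be illustrated with a minimal NumPy sketch. The function names and the two threshold values are hypothetical and chosen only for illustration; the toy arrays stand in for normalized NCCT slices and a generator output.

```python
import numpy as np

def generated_baseline(fu_slice, diff_map):
    """Subtract the difference map from the follow-up slice to get an artificial BL slice."""
    return fu_slice - diff_map

def threshold_difference_map(diff_map, hemorrhage_thr=0.2, infarct_thr=-0.1):
    """Split a difference map into hemorrhage and infarct masks.

    Both thresholds are illustrative assumptions, not values from the study.
    High positive values mark a hemorrhagic lesion, negative values an infarct.
    """
    hemorrhage = diff_map > hemorrhage_thr
    infarct = diff_map < infarct_thr
    return hemorrhage, infarct

# Toy example: one background voxel, one hyperattenuated voxel, one hypoattenuated voxel.
fu = np.array([[0.0, 0.6, -0.4]])
diff = np.array([[0.0, 0.5, -0.3]])
bl = generated_baseline(fu, diff)
hem, inf = threshold_difference_map(diff)
```

In this toy case the generated BL values are pulled back toward the background, and each lesion type receives its own binary mask.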
The proposed GAN method is optimized with 2 types of loss functions: the voxelwise absolute difference between generated BL and real BL NCCTs (L1-loss) and the binary cross-entropy of the discriminator (adversarial-loss) for classifying generated and real BL NCCTs. 12,13 Figure 1 presents the GAN model architecture we refer to as the follow-up to BL GAN (FU2BL-GAN).
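As a rough illustration of these 2 loss terms, the sketch below computes them with NumPy on arrays that stand in for image slices and discriminator output probabilities. The function names, the `adv_weight` parameter, and the way the terms are summed are assumptions made for this example; the actual weighting used for the FU2BL-GAN is not stated in this section.

```python
import numpy as np

def l1_loss(generated_bl, real_bl):
    """Mean voxelwise absolute difference between generated and real BL slices."""
    return float(np.mean(np.abs(generated_bl - real_bl)))

def binary_cross_entropy(prob, target, eps=1e-7):
    """Binary cross-entropy of predicted probabilities against a 0/1 target."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))))

def generator_loss(generated_bl, real_bl, d_prob_generated, adv_weight=1.0):
    """L1 term plus adversarial term; adv_weight is a hypothetical balancing factor."""
    adversarial = binary_cross_entropy(d_prob_generated, 1.0)  # generator wants "real"
    return l1_loss(generated_bl, real_bl) + adv_weight * adversarial

def discriminator_loss(d_prob_real, d_prob_generated):
    """Discriminator should output 1 for original BL and 0 for generated BL."""
    return binary_cross_entropy(d_prob_real, 1.0) + binary_cross_entropy(d_prob_generated, 0.0)
```

Setting `adv_weight=0.0` in this sketch reduces the generator objective to the L1 term alone, which corresponds in spirit to a model trained with L1-loss only.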

Patient Populations
In this study, 820 patients were included between January 2018 and July 2021 in the training data set from the MR CLEAN-NO-IV (n = 297), MR CLEAN-MED (n = 377), and MR CLEAN-LATE (n = 146) randomized controlled trials if BL and follow-up NCCTs were available. Specific imaging protocols, inclusion, and exclusion criteria of each of these randomized controlled trials have been published previously. [15][16][17] Scans of these 820 patients were used to train the FU2BL-GAN. NCCT scans with lesion annotation from previously published studies by Konduri et al 18 21 In compliance with the Declaration of Helsinki, informed consent has been received for the use of data for substudies from patients included in the training data randomized controlled trials and the validation and test data of ischemic and HT lesions. [15][16][17][18]20 The PrH data from Hssayeni et al 19 was accessed through physionet.org and obtained with a "Restricted Health Data License 1.5.0," because the authors stated that collection and sharing of the retrospectively collected anonymized and defaced CTs were authorized by the Iraq Ministry of Health Ethics board.
FIG 1. The follow-up (FU) NCCT with lesion is clipped between Hounsfield unit ranges of 0-100 and 100-1000 and normalized to a -1 to 1 range (double asterisks). The original BL NCCT is only clipped between 0 and 100 HU and normalized to a -1 to 1 range. The FU NCCT with a lesion is passed through the generator network to compute a difference map. This difference map is subtracted from the input FU NCCT to construct a generated BL NCCT. Original BL and generated BL are optimized on the basis of the absolute voxelwise difference (L1-loss) and the binary cross-entropy loss (adversarial-loss) of the discriminator network's classification (original or generated BL).

Training Data and Training Protocol
All NCCT volumes were converted from DICOM to NIfTI format with dcm2niix available in MRIcroGL, Version 1.2.20211006. 22 Elastix, Version 5.0.0 (https://elastix.lumc.nl/) was used to coregister the follow-up and BL NCCTs of the training data; 23 the scan with the thinnest slices was used as a moving image. Poor coregistration was detected by inspecting the overlay of the 2 images at the 30th, 50th, and 80th percentile sections. Up to 3 follow-up NCCTs were used per patient: one acquired within 8 hours after endovascular treatment or randomization if clinical deterioration occurred (8 hours), and those acquired as part of the imaging protocols at 8-72 hours (24H) and 72 hours to 2 weeks (1W) after AIS. [15][16][17] To ensure stable optimization and prevent overfitting, per training iteration, we used 1 follow-up NCCT section and 1 corresponding BL NCCT section (512 × 512). Slices were sampled at random between the 10th and the 95th percentile sections. Furthermore, to emphasize the variation in attenuation between different tissues, the generator model received 2D slices from follow-up NCCTs with 2 channels on the basis of different Hounsfield unit ranges as input: The attenuation was clipped to 0-100 HU for brain and infarct differentiation and to 100-1000 HU for hemorrhage and skull differentiation. The images were subsequently normalized to a -1 to 1 range. BL NCCTs were only clipped between 0 and 100 HU and normalized to a -1 to 1 range. The discriminator network receives 2D slices of either generated BL or real BL NCCT scans. To make the FU2BL-GAN robust to differences in contrast and noise between the BL and follow-up NCCTs, we applied multiple intensity and noise-altering image augmentations (details available in the Online Supplemental Data). Training used a batch size of 2 and a learning rate of 0.00002 for 500 epochs, after which the learning rate was linearly reduced to zero over the following 500 epochs (Nvidia TITAN V [https://www.nvidia.com/en-us/titan/titan-v/] with 12 GB of RAM).
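The 2-channel clipping and normalization of the follow-up input can be sketched as follows. This is a minimal NumPy illustration under the assumption that normalization is a linear rescaling of each clipped range to -1 to 1; the function names are hypothetical.

```python
import numpy as np

def clip_and_scale(hu, lo, hi):
    """Clip Hounsfield units to [lo, hi] and rescale linearly to the -1 to 1 range."""
    clipped = np.clip(hu, lo, hi)
    return 2.0 * (clipped - lo) / (hi - lo) - 1.0

def follow_up_channels(hu_slice):
    """Build the 2-channel generator input from one follow-up slice in HU."""
    brain = clip_and_scale(hu_slice, 0.0, 100.0)           # brain and infarct contrast
    blood_bone = clip_and_scale(hu_slice, 100.0, 1000.0)   # hemorrhage and skull contrast
    return np.stack([brain, blood_bone], axis=0)

def baseline_channel(hu_slice):
    """BL slices use only the 0-100 HU window."""
    return clip_and_scale(hu_slice, 0.0, 100.0)

# Toy slice: air, water, mid-range brain value, window edge, dense blood/bone values.
toy = np.array([[-50.0, 0.0, 50.0, 100.0, 550.0, 1000.0]])
channels = follow_up_channels(toy)
```

With this linear scaling, 1 HU in the 0-100 HU channel corresponds to 0.02 in normalized units.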
The Online Supplemental Data contains a detailed description of the FU2BL-GAN architecture.
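The learning-rate schedule from the training protocol (constant for 500 epochs, then a linear decrease to zero over the next 500 epochs) can be written as a small function. This is a sketch that assumes zero-based epoch counting and a per-epoch update; neither detail is specified in the text.

```python
def learning_rate(epoch, base_lr=2e-5, constant_epochs=500, decay_epochs=500):
    """Return the learning rate for a zero-based epoch index (assumed convention)."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs

schedule = [learning_rate(e) for e in (0, 499, 750, 1000)]
```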

Lesion Segmentation
To obtain lesion segmentations, we passed validation and test set NCCTs through the generator model to generate difference maps. Due to computational constraints, the validation set difference maps were computed every 10th training epoch. Subsequently, segmentations were obtained by applying a threshold to the difference maps. The resulting Dice similarity coefficient (DSC) of the segmentations relative to the ground truth was used to determine the optimal threshold for the difference map, searched from -0.2 to +0.3 in steps of 0.01 (equivalent to 0.5 HU). An automatically computed brain mask based on intensity thresholds and region growing was used to remove false-positive segmentations that were not located inside the skull. 9 In the Online Supplemental Data, validation set results are depicted. Finally, the optimal epoch and threshold were used to obtain segmentations for the test sets.
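The threshold search can be sketched as a simple grid search that maximizes the DSC. The sketch below handles a single case and assumes that infarct voxels are those below the threshold and hemorrhage voxels those above it; the function names are hypothetical, and in practice the DSC would be aggregated over the whole validation set.

```python
import numpy as np

def dice(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)

def best_threshold(diff_map, truth, brain_mask, lesion="infarct"):
    """Grid-search thresholds from -0.2 to +0.3 in steps of 0.01 and keep the best DSC."""
    thresholds = np.round(np.linspace(-0.2, 0.3, 51), 2)
    best_thr, best_dsc = None, -1.0
    for thr in thresholds:
        pred = diff_map < thr if lesion == "infarct" else diff_map > thr
        pred = np.logical_and(pred, brain_mask)  # drop predictions outside the brain
        score = dice(pred, truth)
        if score > best_dsc:
            best_thr, best_dsc = float(thr), score
    return best_thr, best_dsc

# Toy case: a 2 x 2 hypoattenuated lesion in a 4 x 4 slice.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
diff_map = np.zeros((4, 4))
diff_map[truth] = -0.155
thr, dsc = best_threshold(diff_map, truth, np.ones((4, 4), dtype=bool))
```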

Evaluation and Outcome Metrics
Reported results were based on test set segmentations and were reported relative to expert-based ground truth segmentations. The DSC and Hausdorff distance in millimeters were used to compute spatial correspondence. Results from the FU2BL-GAN approach trained with L1 and adversarial loss (L1+adv) were compared, using the Wilcoxon rank-sum test, with a simpler approach trained with L1-loss only (L1). Furthermore, the results of the FU2BL-GAN were compared with two 2D Unets trained on segmentations from the 24H and 1W infarct validation sets using the no new Unet (nnUnet) framework as a conventional supervised learning baseline. 24 Volumetric correspondence between the ground truth and predicted segmentations was analyzed with Bland-Altman plots with bias (mean difference between methods) and limits of agreement (LoA, ±1.96 SDs from the bias) and the intraclass correlation coefficient (ICC) with 95% CIs. The 2-way mixed-effects approach for consistency of a single fixed rater was used to describe differences between the FU2BL-GAN-based segmentations and the expert-based ground truth lesion segmentations. A subgroup analysis was performed for lesions of >10 mL to address the effect of lesion size on our outcome metrics. Results were reported as median with interquartile range (IQR) or mean with 95% CIs.
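The Bland-Altman quantities named above can be computed directly from paired lesion volumes. The sketch below is a minimal NumPy illustration with made-up toy volumes; the helper names and the use of the sample SD (`ddof=1`) are assumptions.

```python
import numpy as np

def lesion_volume_ml(mask, voxel_volume_mm3):
    """Volume of a binary mask in mL (1 mL = 1000 mm^3)."""
    return float(mask.astype(bool).sum() * voxel_volume_mm3 / 1000.0)

def bland_altman(pred_ml, truth_ml):
    """Return bias and lower/upper limits of agreement (bias +/- 1.96 SD)."""
    diff = np.asarray(pred_ml, dtype=float) - np.asarray(truth_ml, dtype=float)
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))  # sample SD of the paired differences (assumed)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Toy volumes in mL, for illustration only (not study data).
bias, loa_low, loa_high = bland_altman([10.0, 20.0, 30.0], [8.0, 21.0, 29.0])
```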

RESULTS
Ischemic and hemorrhagic lesions in our test sets were relatively small; the distribution of volumes was skewed toward smaller lesions. Ground truth lesion volume of the test sets had a median of 35 mL (IQR, 16-78 mL) in the 24H and 66 mL (IQR, 29-125 mL) in the 1W infarct NCCTs. For the HT and PrH test sets respectively, the mean lesion size was 6 mL (IQR, 2-12 mL) and 6 mL (IQR, 1-12 mL). Characteristics of the training data can be found in the Online Supplemental Data. Training characteristics and the optimal difference map thresholds are available in the Online Supplemental Data.

Quantitative Results
As depicted in Fig 3 and the Online Supplemental Data, DSC and lesion volume were positively related. The median DSC of the FU2BL-GAN was 0.31 (IQR, 0.08-0.59) in the 24H infarct test set, 0.59 (IQR, 0.29-0.74) for the 1W infarct test set, 0.02 (IQR, 0-0.14) for the HT test set, and 0.08 (IQR, 0.01-0.35) for the PrH test set. The FU2BL-GAN (L1+adv) had a statistically significant higher DSC than the model trained with only L1-loss (L1) for all test sets but a significantly lower DSC compared with the nnUnet approach (Fig 3). The subgroup of lesions of >10 mL (Fig 3B) had a higher DSC than the overall population (Fig 3A),   rows, the input NCCT of the PrH test set is the acute-phase NCCT with hemorrhagic lesions. For this case, the generated scan can be regarded as a prehemorrhagic stroke NCCT scan. Although lesions visually appear to be removed accurately, the generator model was not able to completely reconstruct 24H infarcted brain tissue similar to the BL NCCTs (columns 1 versus 3). False-positive hemorrhage segmentations were present when the input NCCT scan had beam-hardening artifacts in the brain close to the skull or when the overall scan attenuation was higher (arrows in PrH and HT column 2). False-negative hemorrhage segmentations were present if the hemorrhage was small and the attenuation increase was low (row 6, poor HT). False-positive infarct segmentations occurred close to the ventricles and other locations, where CSF results in a hypoattenuated region (row 5). False-negative infarct segmentation errors mainly occurred in the 24H infarct data set because the infarct lesion was not yet significantly hypoattenuated (row 5).

DISCUSSION
Our study shows that, with a GAN deep learning structure, it is possible to obtain follow-up ischemic and large hemorrhagic lesion segmentations without using manually annotated training data. Although the visual quality of generated BL scans was not always optimal, lesion segmentation quality was often not affected. External validation in 4 test sets revealed reasonable segmentation quality in terms of DSC and good-to-excellent volumetric correspondence with the ground truth for follow-up infarct lesions in NCCT at 24H and 1W follow-up after AIS. In terms of DSC and volumetric correspondence, our work performs on a par with previous work on supervised deep learning for follow-up infarct lesion segmentation (DSC median, 0.57 [SD:0.26]; ICC, 0.88). 9 However, the presented unsupervised FU2BL-GAN did not outperform the supervised nnUnet benchmark model with respect to all outcome measures. Kuang et al 25 also used a GAN to segment infarct lesions but achieved much higher segmentation quality (DSC mean, 0.70 [SD, 0.12]). However, the approach by Kuang et al required a training set with manual lesion annotations because the adversarial (GAN) loss was used in addition to the supervised loss functions. DSC and volumetric correspondence for segmenting the HT and PrH lesions were worse than those of existing supervised methods. 4,5,8,10 Poor detection and segmentation of hemorrhagic lesions are likely due to the small lesion size in our test sets and an under-representation of hemorrhages in the training data.
The unsupervised approach to training is a major advantage compared with conventional supervised deep learning methods. With the growing availability of unlabeled and weakly labeled imaging databases, unsupervised GAN-based approaches can be used for automated lesion segmentation without the manual annotation effort. However, the downside of the presented approach is the requirement of paired training images, that is, coregistered images with and without lesions from the same patient. Such high-quality registration is often difficult to achieve in medical imaging because most organs, tissues, and body parts deform or move between acquisition moments. Because the brain only slightly deforms and moves between acquisition moments, the use of a GAN-based lesion segmentation method similar to the presented FU2BL-GAN seems promising for other brain pathologies.
One of the main shortcomings of the presented FU2BL-GAN is that it can only be trained on CT slices sampled at random. Because not every section in an NCCT volume of a patient contains an initial AIS lesion and only a minority of the volumes contain a hemorrhagic lesion, the FU2BL-GAN likely experienced an under-representation of brain lesions. This under-representation during training of NCCT slices with a lesion, especially with a hemorrhagic lesion, compared with slices without lesions, is known to result in poorer segmentation performance; in technical literature, this is often referred to as the "class imbalance problem." 26 In contrast, supervised deep learning methods often use adjusted sampling techniques that require ground truth annotations; 8 the majority class (nonlesion tissue) is undersampled relative to the minority class (the lesion) to balance class representation. A valuable improvement in our FU2BL-GAN would be to manually classify slices for lesion presence and volumes for the presence of a hemorrhage. Although these section- or volume-level annotations would take some time to acquire, such sparse annotation methods are still less time-consuming than manually segmenting lesions required for supervised deep learning. Alternatively, automated NCCT-section classification algorithms for infarct or hemorrhage presence can be used to classify NCCT slices on the basis of lesion presence. 27 Subsequently, this information can be used to select training data for further improvement of the FU2BL-GAN.
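If section-level lesion labels were available, one possible way to use them is to sample training slices so that sections with and without a lesion are drawn equally often. The sketch below is a hypothetical illustration of such balanced sampling with NumPy; it was not part of the presented method, and the function name and the 50/50 balance are assumptions.

```python
import numpy as np

def balanced_slice_probabilities(has_lesion):
    """Sampling probabilities that give lesion and nonlesion slices equal total weight."""
    has_lesion = np.asarray(has_lesion, dtype=bool)
    n_pos = int(has_lesion.sum())
    n_neg = int((~has_lesion).sum())
    if n_pos == 0 or n_neg == 0:
        # Only one class present: fall back to uniform sampling.
        return np.full(has_lesion.size, 1.0 / has_lesion.size)
    return np.where(has_lesion, 0.5 / n_pos, 0.5 / n_neg)

# Toy labels: 1 lesion slice among 4 slices.
probs = balanced_slice_probabilities([True, False, False, False])
rng = np.random.default_rng(0)
sampled_index = rng.choice(probs.size, p=probs)  # index of the slice used in one iteration
```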
Although the test sets used in this study are from multiple centers, it remains largely unclear what scanners, settings, and postprocessing methods were used. Furthermore, Konduri et al 18 reported extensive exclusion criteria related to the image quality and noise level, excluding 93 of 280 patients in their data set. These factors influence the ability to generalize results from this study and require additional external validation on subgroups and other data sets.

CONCLUSIONS
The presented FU2BL-GAN is an unsupervised deep learning approach trained without manual lesion annotations to segment stroke lesions. With the FU2BL-GAN, it is feasible to obtain automated infarct lesion segmentations with moderate DSC and good volumetric correspondence.