Fully Automated Segmentation of Globes for Volume Quantification in CT Images of Orbits using Deep Learning


Open-globe injuries are traumatic full-thickness defects of the ocular wall. Although frequently diagnosed on clinical evaluation, open-globe injuries involving the sclera may not be obvious on clinical examination and require surgical exploration for definitive diagnosis and repair.1,2 When thorough ocular examination of the anterior segment is limited by periorbital edema and hemorrhage, blepharospasm, or hyphema, imaging can be helpful to establish the diagnosis of occult open-globe injury.3 CT is the preferred imaging technique for assessment of the extent and severity of suspected traumatic injury to the globe.4,5 Direct CT imaging findings include altered globe contours or volumes, evidence of scleral discontinuity, or intraocular foreign bodies or gas.6 An additional indirect imaging finding is alteration of anterior chamber depth (ACD), which may be either decreased or increased depending on anterior or posterior segment location of injury, respectively.7,8 Prior studies have reported limited sensitivity of CT for the detection of open-globe injuries.9-11 In a case series specifically evaluating occult open-globe injuries, CT had similarly low sensitivity, ranging from 56%-68%.6 Accurate and reliable quantification of globe volumes in the event of ocular trauma can provide clinicians with valuable diagnostic information.6 Manual segmentation of the globe contours by radiologists, though considered the criterion standard, is a time-consuming and labor-intensive process.12,13 Furthermore, it is observer dependent. Automated techniques for globe detection can remedy the pitfalls of manual segmentation.14
Previous works have proposed the use of semiautomated and automated techniques to measure ocular volume from CT images in the context of surgical planning. Bekes et al12 proposed a geometric model-based segmentation of the globes along with the lenses, optic nerves, and optic chiasm in CT images. Because of the lack of a criterion standard, they did not report Dice scores. However, they estimated accuracy by using the simultaneous truth and performance-level estimation algorithm published by Warfield et al15 and reported mean sensitivity values of 97.41% and 98.04% and specificity values of 98.42% and 97.90% for the left and the right globes, respectively. Harrigan et al13 used optimized registration and fusion methods in a multi-atlas framework16 for automated detection of the optic nerves along with the eye globes and muscles on clinically acquired CT images. They reported a mean Dice and Hausdorff distance (HD) of 0.84 and 5.27 mm, respectively. Another work, by Aghdasi et al,17 segmented the optic nerve, globe, and extraocular muscles for skull-base surgeries in a 2-step process. The approximate boundaries of the globe were first determined, followed by 2D shape fitting of the voxels inside the boundary. On 30 publicly available datasets, they reported an average Dice of 0.81 and 0.79 and 95% HD of 3 mm and 2.89 mm for the right and the left globes, respectively.
Deep learning-based convolutional neural networks (CNNs) have recently been applied to a variety of medical image segmentation problems.18-20 UNET (Fig 1A),21 a fully convolutional deep learning CNN with a multiscale encoder-decoder architecture, is a popular choice in many of these semantic segmentation problems. Another popular architecture, ResNet,22,23 is a single-scale setup that improves gradient backpropagation flow, with increased speed of convergence,24 by learning residual features.
In this work, we combine the multiscale framework of UNET with elements that learn residual features and propose a fully automated workflow that allows for fast, accurate, and robust detection of globe contours. The proposed approach uses a deep learning-based CNN, the 2D Modified Residual UNET architecture (MRes-UNET2D), and axial CT images of the orbits to predict globe contours, which are then used to quantify globe volumes.

Convolutional Neural Network
Figure 1C shows the MRes-UNET2D used in this work. The network uses high-resolution 2D axial CT images of the orbits as inputs and yields contours for the globes, which are then used to compute globe volumes. The feature analysis path of the architecture uses a series of residual elements to generate multiscale abstract representations of the input images. The residual element used in our work,25 shown in Fig 1B, uses a convolution layer and a short-range skip connection, followed by batch normalization26 and a rectified nonlinear activation. A dropout27 layer is introduced between the analysis and synthesis paths to improve regularization. The synthesis path of the architecture allows accurate localization by reconstructing high-resolution feature maps while adding contextual information from the corresponding level in the analysis path via long-range skip connections. A final convolution layer combines the feature maps in the native resolution space to yield pixel-wise probability maps for the labels of interest. All convolutions in the main architecture use 2D kernels of size 3 × 3.
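The residual element described above can be sketched in plain NumPy. This is a minimal, single-channel illustration of the forward pass (convolution, short-range skip addition, batch normalization, ReLU), not the trained Keras implementation; the kernel weights and the omission of learned batch-norm scale/shift parameters are simplifying assumptions.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive 3x3 'same'-padded convolution on a single-channel 2D array."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def residual_element(x, kernel, eps=1e-5):
    """Residual element: convolution + short-range skip connection,
    followed by batch normalization and a rectified (ReLU) activation."""
    y = conv2d_same(x, kernel) + x               # conv layer + skip connection
    y = (y - y.mean()) / np.sqrt(y.var() + eps)  # batch norm (no learned scale/shift)
    return np.maximum(y, 0.0)                    # ReLU
```

In the full network, such elements are stacked at every resolution level of the analysis path, with downsampling between levels.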

Study Population and Imaging Protocol
A cohort of 107 consecutive CT orbit subjects (age, 45 ± 20 years; 63 men and 44 women), older than 18 years of age and imaged over a 3-year period between January 2015 and December 2017, was identified retrospectively with the approval of the local institutional review board. These subjects presented no imaging or clinical evidence of open-globe injuries. CT images came from 3 different CT scanners from 2 manufacturers: Aquilion (Toshiba Medical Systems) (77 subjects), Somatom Definition AS+ (22 subjects), and Somatom Definition Flash (Siemens) (8 subjects). CT images for each subject were acquired according to the following clinical protocol: 120 kVp, 150 mAs, matrix size of 512 × 512, field of view ranging from 125 to 240 mm, and in-plane resolution ranging from 0.25 mm to 0.46 mm. The section thickness was either 1 or 2 mm.
Three observers, consisting of a neuroradiologist with a certificate of added qualification and 2 residents, agreed on a protocol to mark the globe contours on the CT images by using an in-house Matlab (MathWorks)-based graphical user interface. This included manually tracing the boundary pixels of the globes on axial cross-sections while excluding the eyelids, insertions of the extraocular muscles, and optic nerves. The observers used the sagittal cross-sections for reference. The graphical user interface provided the observers with tools to adjust the window level (WL), window width (WW), and zoom level, and to edit or delete contours to accurately trace the boundaries at a pixel level. No further processing was done on the contours after they were finalized by the observers.
The subjects in our study were randomly split into 3 groups: 80 subjects in the training cohort, 18 subjects in test cohort 1, and 9 subjects in test cohort 2. To measure interobserver variability, each observer annotated the left and the right globe contours for subjects in test cohort 2, blinded to the other observers' annotations. A consensus observer was generated by applying a majority voting scheme to the individual observer contours. The subjects in the training cohort and test cohort 1 were randomly split between the 3 observers.
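The majority voting scheme used to form the consensus observer can be sketched as follows. This is a minimal illustration on rasterized binary masks; the actual contour-level implementation may differ.

```python
import numpy as np

def consensus_mask(masks):
    """Majority vote over binary observer masks of identical shape.
    A pixel is foreground when more than half of the observers marked it."""
    stack = np.stack(masks).astype(int)
    return (stack.sum(axis=0) * 2 > len(masks)).astype(np.uint8)
```

With 3 observers, a pixel therefore needs at least 2 votes to enter the consensus mask.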
An overview of the proposed workflow is shown in Fig 2.

Network Implementation
A Dice similarity-based loss function was used to maximize the overlap between the predicted globe masks and the ground truth masks. We used the following definition of Dice loss:

$$L_{\text{Dice}} = 1 - \frac{2\sum_{n} r_n p_n}{\sum_{n} r_n + \sum_{n} p_n}$$

Here, $r_n$ and $p_n$ refer to the ground truth and the predicted posterior values at the $n$th pixel, respectively.
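A minimal NumPy version of this Dice loss might look like the following; the smoothing term added to avoid division by zero on empty masks is an assumption, not stated in the text.

```python
import numpy as np

def dice_loss(r, p, smooth=1.0):
    """Dice loss: 1 - 2*sum(r*p)/(sum(r)+sum(p)).
    r: ground-truth mask, p: predicted posteriors (same shape).
    The smooth term (assumed) guards against a 0/0 on empty masks."""
    r = np.asarray(r, dtype=float).ravel()
    p = np.asarray(p, dtype=float).ravel()
    return 1.0 - (2.0 * np.sum(r * p) + smooth) / (np.sum(r) + np.sum(p) + smooth)
```

A perfect prediction gives a loss of 0, and a completely disjoint prediction gives a loss near 1, which is what makes the loss directly optimize mask overlap.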
Two different window settings were used to study the impact of Hounsfield unit (HU) windowing on model performance. The WL and WW ([WL, WW]) for the 2 experiments were selected to be [50, 200] and [0, 200] HU. In these experiments, the training images retained their original image resolutions, which ranged from 0.25 mm to 0.46 mm in-plane. We also trained an additional model in which all training image volumes were resampled to a common grid by using cubic spline interpolation. This was done to test whether resampling the images to a common resolution provides any improvement in model performance. The resolution for the common grid was obtained from the average resolution of the training set: 0.3 mm in-plane and 2-mm section thickness. All experiments were implemented in Python by using Keras (https://keras.io) with a TensorFlow28 backend. The training was performed on a Linux server running Ubuntu, with a Tesla P100 (NVIDIA) GPU and 16 GB of VRAM. The MRes-UNET2D architecture, with approximately 133,000 trainable weights, was trained with the following parameters: optimizer = ADAM,29 maximum epochs = 120, batch size = 5, learning rate = 1e-3, decay factor = 0.1. The learning rate was optimized for Dice loss by monitoring the training and the validation loss curves for convergence across a range of learning rates, along with performance evaluation on the validation images. The MRes-UNET2D model used in this work is also available at https://github.com/spacl-ua/globe-volumes.
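The HU windowing step can be sketched as below. Mapping the clipped window onto [0, 1] for network input is an assumed normalization; the paper does not specify the exact rescaling.

```python
import numpy as np

def apply_window(hu_image, wl, ww):
    """Clip a CT image (in HU) to the window [wl - ww/2, wl + ww/2]
    and rescale the result to [0, 1] (assumed normalization)."""
    lo, hi = wl - ww / 2.0, wl + ww / 2.0
    out = np.clip(np.asarray(hu_image, dtype=float), lo, hi)
    return (out - lo) / (hi - lo)
```

For the [WL = 50, WW = 200] setting, all intensities below -50 HU and above 150 HU are saturated, which emphasizes soft-tissue contrast around the globes.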
We also implemented a standard UNET21 architecture for comparison. The convolution layers were zero-padded, with 3 × 3 convolution kernels, to yield predictions of the same size as the input. The cross-entropy loss function used in the original paper was modified to a binary cross-entropy loss for the binary classification problem. The training and the validation images for this UNET were the same as those used for training the MRes-UNET2D model, with HU windowing set to [WL = 50, WW = 200]. The training parameters for UNET were as follows: optimizer = ADAM,29 maximum epochs = 120, batch size = 5, learning rate = 1e-3, and decay factor = 0.1.

Network Evaluation
The generalizability of the models was evaluated by using the following performance metrics: Dice, precision, recall, 95% HD, and volume difference. These evaluation metrics are defined as follows:

$$\text{Dice} = \frac{2|GT \cap P|}{|GT| + |P|}, \quad \text{Precision} = \frac{|GT \cap P|}{|P|}, \quad \text{Recall} = \frac{|GT \cap P|}{|GT|}, \quad \text{Volume difference} = \frac{|V_P - V_{GT}|}{V_{GT}}$$

Here GT refers to the ground truth and P to the predictions from the network. The one-sided HD between point sets $G_s = \{g_1, g_2, \ldots, g_n\}$ and $P_s = \{p_1, p_2, \ldots, p_n\}$ is $d_H(G_s, P_s) = \max_{g \in G_s} \min_{p \in P_s} \|g - p\|$. We used the 95th percentile ($P_{95}$) of the HD, referred to as 95% HD, because it is slightly more stable to small outliers than the maximum value. $V_P$ and $V_{GT}$ refer to the total globe volumes computed from the predicted globe contours and the ground truth for a subject, respectively. Higher values of Dice, precision, and recall imply good performance. Lower values of 95% HD imply smaller deviation of the predicted contour from the ground truth. Pair-wise Dice similarity scores were calculated on test cohort 2 between the annotations from the 3 observers, the consensus observer, and the predictions from MRes-UNET2D. For each subject, we also calculated the inter-globe volume difference (IGVD), which is the volume difference in milliliters between the left and the right globe.
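These metrics can be computed from binary masks and boundary point sets as in the NumPy sketch below. Taking the 95th percentile over both directed point-to-nearest-point distances is one common symmetrization of the 95% HD and is an assumption here.

```python
import numpy as np

def overlap_metrics(gt, p):
    """Dice, precision, and recall from binary masks GT and P."""
    gt, p = gt.astype(bool), p.astype(bool)
    tp = np.logical_and(gt, p).sum()
    return 2.0 * tp / (gt.sum() + p.sum()), tp / p.sum(), tp / gt.sum()

def hd95(gs, ps):
    """95% Hausdorff distance between boundary point sets (N x 2 arrays):
    the 95th percentile of all point-to-nearest-point distances, pooled
    over both directions (assumed symmetrization)."""
    d = np.sqrt(((gs[:, None, :] - ps[None, :, :]) ** 2).sum(-1))
    return np.percentile(np.concatenate([d.min(axis=1), d.min(axis=0)]), 95)

def volume_difference(v_p, v_gt):
    """Absolute volume difference as a fraction of the ground-truth volume."""
    return abs(v_p - v_gt) / v_gt
```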
To test the generalizability of MRes-UNET2D on cases with suspected globe injuries, we also evaluated the model on 3 subjects with varying degrees of globe injuries, with conspicuity on CT imaging ranging from subtle to obvious.These 3 cases were outside of our study cohort and were identified by the radiologists retrospectively as test cases with globe injuries.

Statistical Analysis
A nonparametric Kruskal-Wallis test was performed to determine whether there were any significant differences in the performance of MRes-UNET2D between different image preprocessing settings. This test was repeated to compare for significant differences between the standard UNET and MRes-UNET2D. A 2-sided paired Wilcoxon signed rank test was performed to assess the null hypothesis that the differences between globe volumes predicted by MRes-UNET2D on test cohort 1 and the ground truth annotations come from a distribution with zero median. The significance level was selected as .05 for all of these tests. A Bland-Altman analysis was performed to assess the agreement in the computed globe volumes per hemisphere between the human observers, MRes-UNET2D, and the consensus observer. To determine the variation between observers, reproducibility coefficient and coefficient of variation statistics were computed. We also tested the null hypothesis that the IGVD values from MRes-UNET2D, the consensus observer, and the 3 observers come from the same distribution by using the nonparametric Kruskal-Wallis test.
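The hypothesis tests described above map directly onto SciPy calls. The sketch below uses synthetic, hypothetical values purely to illustrate the test setup; the study's actual Dice scores and volumes are not reproduced here.

```python
import numpy as np
from scipy.stats import kruskal, wilcoxon

rng = np.random.default_rng(0)

# Hypothetical Dice scores from three preprocessing settings (18 test subjects)
a = rng.normal(0.95, 0.01, 18)
b = rng.normal(0.95, 0.01, 18)
c = rng.normal(0.95, 0.01, 18)
h_stat, p_kw = kruskal(a, b, c)       # Kruskal-Wallis: do any settings differ?

# Hypothetical paired globe volumes (mL): network prediction vs. ground truth
v_pred = rng.normal(7.0, 0.5, 18)
v_gt = v_pred + rng.normal(0.0, 0.05, 18)
w_stat, p_w = wilcoxon(v_pred, v_gt)  # 2-sided paired test of zero-median difference

alpha = 0.05  # significance level used in the study
print(p_kw > alpha, p_w > alpha)      # True means we fail to reject the null
```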

RESULTS
The training of our deep learning model, MRes-UNET2D, took approximately 5 hours. In the test phase, the end-to-end prediction time for a volume (512 × 512 × 40) was approximately 5 seconds on 2 NVIDIA P100 GPUs. The actual prediction time, excluding the pre- and postprocessing times, was approximately 680 ms per volume. Figure 3A shows representative axial CT images corresponding to 2 different window settings. The training and the validation loss curves for 1 instance of the MRes-UNET2D are shown in Fig 3B. A comparison of the effect of preprocessing on the performance of the MRes-UNET2D on test cohort 1 is shown in Table 1. The first 2 columns correspond to the different window settings. The third column shows the performance of the network when all images are resampled to a common grid of 0.3-mm in-plane resolution and 2-mm section thickness. Results of the nonparametric Kruskal-Wallis tests indicated that we were unable to reject the null hypothesis that the Dice scores (P = .39), average volume difference (AVD) (P = .57), or 95% HD (P = .87) from MRes-UNET2D for the different image preprocessing schemes come from the same distribution. Overall, we observe that slight variations in preprocessing did not result in any significant differences in model performance. For subsequent evaluations, we selected the model with windowing [50, 200] because, on average, it yielded the smallest HD and AVD with improved Dice overlap scores among the 3 models.
Figure 4 shows the manual annotations (red) and globe contour predictions from MRes-UNET2D (blue) on a few representative CT images. On average, MRes-UNET2D achieved Dice scores of 0.95 with respect to the ground truth, with high precision and recall values of 96% and 95%, respectively. The average 95% HD was only 1.5 mm, with a 5.3% error in the estimation of total globe volume. The 2-sided paired Wilcoxon signed rank test revealed no significant differences (P = .72) in the median globe volumes from the ground truth and MRes-UNET2D predictions on test cohort 1.
Table 1 also compares the average performance of MRes-UNET2D to a standard UNET architecture on test cohort 1; with 10× fewer trainable parameters, MRes-UNET2D obtains lower mean HD and AVD values while also improving on the mean Dice and precision scores (Fig 5). However, we did not find this difference in performance to be significant for Dice (P = .43), precision (P = .22), recall (P = .72), 95% HD (P = .36), or AVD (P = .55).
Table 2 shows pair-wise Dice overlap metrics for the 3 observers, the consensus observer, and our model on test cohort 2. MRes-UNET2D achieved average Dice scores of 0.97 and 0.95 with respect to the consensus observer and the individual observers, respectively. The average Dice between the observers, calculated as the mean of the Dice scores between observer 1 versus observer 2, observer 2 versus observer 3, and observer 1 versus observer 3, was 0.94, whereas this value was 0.97 with respect to the consensus observer. We also performed Bland-Altman analysis to compare the agreement in globe volumes per hemisphere from the 3 observers and our model, with respect to the consensus observer. We observed tighter limits of agreement (coefficient of variation = 2.1% and reproducibility coefficient = 3.8%) for MRes-UNET2D (Fig 6A) compared with the human observers (Fig 6B-D).
Figure 7A shows the histogram of IGVD values from the entire cohort under study (n = 98, −0.01 ± 0.33 mL), excluding test cohort 2. The boxplot in Fig 7B compares IGVD values from the consensus observer, MRes-UNET2D, and the human observers on test cohort 2. The mean (± standard deviation) IGVD from the network was 0.05 ± 0.24 mL, compared with −0.10 ± 0.25 mL, −0.13 ± 0.23 mL, −0.09 ± 0.5 mL, and 0.20 ± 0.45 mL from the consensus observer and observers 1, 2, and 3, respectively. We were unable to reject the null hypothesis that the IGVD values in test cohort 2 from the consensus observer, MRes-UNET2D, and the 3 observers come from the same distribution (P = .3).
Figure 7C shows the globe contours predicted by MRes-UNET2D for the 3 subjects with suspected globe injuries, along with the IGVD computed for each case. We also computed a z score, a measure of distance in terms of standard deviations from the population mean, for each of the subjects. For subjects 1, 2, and 3 in Fig 7C, the IGVDs were −4.62 mL, 2.32 mL, and 1.22 mL, respectively. The corresponding z scores were 14.16, 7.12, and 3.77. The IGVD for subject 1, for instance, is indicative of a smaller left globe compared with the right. Subject 3 highlights a case with subtle globe injury. The z score quantifies that the IGVD of 1.22 mL is approximately 3.77 standard deviations away from the mean IGVD of the cohort of normal subjects, indicating abnormality in globe volumes.
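The IGVD and z-score computation can be sketched as follows. The left-minus-right sign convention is an assumption (consistent with subject 1's negative IGVD indicating a smaller left globe), and the cohort values in the test are synthetic, not the study data.

```python
import numpy as np

def igvd(v_left, v_right):
    """Inter-globe volume difference (mL): left minus right globe volume
    (assumed sign convention)."""
    return v_left - v_right

def igvd_z_score(igvd_value, normal_igvds):
    """Distance of a subject's IGVD from the normal-cohort mean,
    in units of the cohort's standard deviation."""
    mu = np.mean(normal_igvds)
    sigma = np.std(normal_igvds)
    return abs(igvd_value - mu) / sigma
```

With the reported cohort statistics (mean −0.01 mL, SD 0.33 mL), a large z score flags a subject whose globe-volume asymmetry diverges from the normal population.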

DISCUSSION
In this study, we show that our deep learning network, MRes-UNET2D, can provide accurate and reliable detection of globe contours and quantification of globe volumes. With fast prediction times and performance approaching that of an average human observer, we show that globe contour predictions, as well as volume estimates, can be made available to radiologists in clinically feasible times. We also observe that the proposed deep learning CNN yields improved Dice scores compared with the average Dice scores, ranging from 0.80 to 0.85, of traditional non-deep learning-based schemes described previously in the literature.12,13,17 The mean 95% HD was also lowered to 1.5 mm, compared with approximately 2.89 mm to 3 mm.17 We show that MRes-UNET2D works well across images with different fields of view as well as resolutions. The network does not need any special processing in terms of changing image resolution to a common grid; the images can be trained and tested in their native resolution. We observe that minor variations in window level to change contrast between soft tissue and background bone did not result in a significant performance difference between the models. Furthermore, it is important to note that the training and the testing data in this work come from multiple scanners across manufacturers. Therefore, it can be stated that the proposed network is robust to changes in acquisition parameters and scanner hardware variations across manufacturers. We show that using a deep learning network can provide reliable and consistent contour and volume estimates, thereby reducing the issues associated with interobserver variability. We observe that the deep learning model's predictions are in better agreement (Dice = 0.95) with the individual observer contours than the observers are with one another (Dice = 0.94).
We see that the model, though never trained on images with suspected globe injuries, generalized well to these images on the limited test cases used in this study.The IGVD values and z scores from these cases appear to be useful indicators of suspected globe injuries and provide quantitative information regarding the extent of deviation from a normal cohort.
Our proposed technique has limitations. The training cohort consists entirely of subjects with no imaging or clinical findings of globe injury. Although we observed generalizability of the model on a few cases with globe injuries, we currently do not have ground truth annotations to quantitatively validate the performance of our model on these cases. However, this limitation can be overcome by fine-tuning the MRes-UNET2D model with training data that includes such cases.
Although CT provides superior assessment of the size and location of intraocular foreign bodies compared with competing imaging modalities, it has moderate sensitivity for detecting open-globe injuries.9-11 Using the IGVD values from the normal study population as a baseline, we can quantitatively compare the IGVD for a given CT image with the population IGVD value and automatically identify globe-related abnormalities if the difference between the globe volumes diverges from the normal population distribution. This comparison could potentially provide additional valuable information to a radiologist, in clinically feasible times, to understand whether any subtle globe injuries exist. We will look at introducing additional parameters, such as globe contour distortions, ACD, anterior and posterior segment volumes, and lens thickness, along with IGVD to quantitatively predict the presence of globe injuries and measure the degree of injury on a scale from subtle injuries to globe ruptures.

FIG 6. Evaluation on test cohort 2 (the cohort used for interobserver variability between the observers). Bland-Altman plots depict agreement in globe volume estimates (left and right) from (A) MRes-UNET2D, (B) observer 1, (C) observer 2, and (D) observer 3 with respect to the consensus observer. The consensus observer was created by using a majority voting scheme on the individual observer contours. The coefficient of variation (CV) and reproducibility coefficient (RPC) for each analysis are also shown.

CONCLUSIONS
In this work, we proposed a 2D deep learning architecture, MRes-UNET2D, to detect globe contours in axial CT images of the orbits. We showed that the proposed CNN model, trained and validated on CT images from 80 subjects with no imaging or clinical findings of globe injuries, obtained an average Dice score of 0.95, with less than 5.3% error in globe volume estimates. The performance of MRes-UNET2D approached the interobserver variability between 3 human observers. The analysis of images from subjects with known globe injuries demonstrated the utility of IGVD as a quantitative marker for trauma.

FIG 1. Architectures for (A) a standard 2D UNET and (C) the Modified Residual UNET 2D (MRes-UNET2D). The multiscale architecture in MRes-UNET2D consists of a series of (B) residual elements at every resolution level. Contextual information is propagated by using a series of long- and short-range skip connections. The input to the architecture consists of preprocessed axial CT images of the orbits, and the output image contains contours for the left and right globes.
The following data augmentation schemes were used: random in-plane translations (±10 pixels in each direction), in-plane rotations selected uniformly from [−15°, 15°], left/right image flips, 2D elastic deformations, and image zoom. During the training process, augmented images were generated at run time for every training image batch. Any 3 of the aforementioned augmentation schemes were randomly selected, and an augmented image was generated by sequentially applying the selected schemes to each image in a training batch.
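The run-time augmentation policy can be sketched with scipy.ndimage as below. Elastic deformation and zoom are omitted for brevity, and in actual training the same sampled transform would also be applied to the corresponding label mask.

```python
import numpy as np
from scipy import ndimage

def translate(img, rng):
    # random in-plane shift of up to +/-10 pixels in each direction
    return ndimage.shift(img, rng.integers(-10, 11, size=2), order=1, mode="nearest")

def rotate(img, rng):
    # in-plane rotation drawn uniformly from [-15, 15] degrees
    return ndimage.rotate(img, rng.uniform(-15.0, 15.0), reshape=False,
                          order=1, mode="nearest")

def flip(img, rng):
    # left/right image flip
    return np.fliplr(img)

def augment(img, rng, schemes=(translate, rotate, flip)):
    """Randomly select 3 of the available schemes and apply them
    sequentially to one image of a training batch."""
    for idx in rng.choice(len(schemes), size=3, replace=False):
        img = schemes[idx](img, rng)
    return img
```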

FIG 2. (A) Training phase and (B) test phase of the MRes-UNET2D architecture. The deep learning model's parameters are updated by using image-label pairs in the training set. After the loss converges, the learned network parameters are used to predict the globe contours on test images.

Table 1: Evaluation of MRes-UNET2D and UNET2D on test cohort 1 (n = 18)a
a Values in the table are mean (standard deviation).

FIG 4. Globe contour predictions from MRes-UNET2D. The predicted contours are overlaid in blue on representative axial CT images of the orbits from 2 test subjects. The manual annotations are overlaid in red for reference. The inset shows a close-up comparison of the predictions.

FIG 5. Boxplot comparison of the performances of MRes-UNET2D and UNET2D on test cohort 1. The panels compare the performances of the deep learning models on Dice, 95% Hausdorff distance (95% HD), volume difference (VD), precision, and recall. Among the 3 MRes-UNET2D models, we selected [window level (WL), window width (WW)] = [50, 200] because it yielded the best performance across all evaluation metrics.