Convolutional Neural Network to Stratify the Malignancy Risk of Thyroid Nodules: Diagnostic Performance Compared with the American College of Radiology Thyroid Imaging Reporting and Data System Implemented by Experienced Radiologists

BACKGROUND AND PURPOSE: Comparison of the diagnostic performance for thyroid cancer on ultrasound between a convolutional neural network and visual assessment by radiologists has been inconsistent. Thus, we aimed to evaluate the diagnostic performance of the convolutional neural network compared with the American College of Radiology Thyroid Imaging Reporting and Data System (TI-RADS) for the diagnosis of thyroid cancer using ultrasound images. MATERIALS AND METHODS: From March 2019 to September 2019, 760 thyroid nodules (≥10 mm) in 757 patients were diagnosed as benign or malignant through fine-needle aspiration, core needle biopsy, or an operation. Experienced radiologists assessed the sonographic descriptors of the nodules, and 1 of 5 American College of Radiology TI-RADS categories was assigned. The convolutional neural network provided malignancy risk percentages for nodules based on sonographic images. Sensitivity, specificity, accuracy, positive predictive value, and negative predictive value were calculated with cutoff values using the Youden index and compared between the convolutional neural network and the American College of Radiology TI-RADS. Areas under the receiver operating characteristic curve were also compared. RESULTS: Of 760 nodules, 176 (23.2%) were malignant. At an optimal threshold derived from the Youden index, sensitivity and negative predictive values were higher with the convolutional neural network than with the American College of Radiology TI-RADS (81.8% versus 73.9%, P = .009; 94.0% versus 92.2%, P = .046). Specificity, accuracy, and positive predictive values were lower with the convolutional neural network than with the American College of Radiology TI-RADS (86.1% versus 93.7%, P < .001; 85.1% versus 89.1%, P = .003; and 64.0% versus 77.8%, P < .001). The area under the curve of the convolutional neural network was higher than that of the American College of Radiology TI-RADS (0.917 versus 0.891, P = .017).
CONCLUSIONS: The convolutional neural network provided diagnostic performance comparable with that of the American College of Radiology TI-RADS categories assigned by experienced radiologists.

Thyroid ultrasound (US) is the best tool to evaluate thyroid nodules for ultrasound-guided fine-needle aspiration (US-FNA). 1,2 However, the diagnostic performance of US varies because it is operator-dependent, and interobserver variability is inevitable. 3,4 To overcome this limitation, studies have been conducted on the computerized diagnosis of thyroid cancer with US images. [5][6][7][8] The convolutional neural network (CNN) is a deep learning technique that incorporates fully trainable models and can potentially cover various medical imaging tasks. 9 Recently, multiple CNN models have been investigated for the diagnosis of thyroid cancer. [10][11][12][13][14][15][16][17] Computerized algorithms were designed to predict thyroid cancer, and the deep CNN was used to differentiate malignant and benign thyroid nodules on the basis of US images.
Findings of past studies have been inconsistent when the diagnostic performance of the CNN was compared with visual assessment by radiologists. Even when US images were assessed according to published guidelines, the diagnostic performance of the CNN could be inferior to or favorable compared with that of radiologists, and in some studies even superior. [10][11][12]15 This variation might be due to unpredictable human judgment as well as differing algorithms that were developed by researchers or corporations individually; radiologists have been known to make their own final assessment, with guidelines being simply a point of reference. Thus, we aimed to compare the diagnostic performance of a CNN with a well-established guideline, the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS), which reduces benign FNAs with high specificity and accuracy in an era when the overdiagnosis and overtreatment of thyroid cancer have become issues of concern. 18-22 ACR TI-RADS guides the diagnosis of thyroid cancer through a summation of points assigned to each US feature and then classifies nodules into 5 categories, TI-RADS (TR) 1 to TR5. 23 In our institution, the radiologist performing the US prospectively records the US features of all thyroid nodules expected to undergo US-FNA or US-guided core needle biopsy (US-CNB), and each thyroid nodule is assigned to 1 of the 5 ACR TI-RADS categories, TR1 to TR5, according to the recorded US features.
Therefore, the aim of this study was to evaluate the diagnostic performance of the CNN compared with ACR TI-RADS for the diagnosis of thyroid cancer using US images.

Study Population
From March 2019 to September 2019, US-FNA or US-CNB was initially performed on 1096 thyroid nodules measuring ≥10 mm in 1087 patients 19 years of age or older in Severance Hospital. Of the original 1096 nodules, 259 were excluded because they did not receive further management such as repeat FNA or an operation after US-FNA showed the results as nondiagnostic (n = 125 in FNA; n = 1 in CNB). Exclusions were also due to atypia of undetermined significance/follicular lesion of undetermined significance (n = 107 in FNA), indeterminate (n = 6 in CNB), follicular neoplasm (n = 3 in FNA; n = 4 in CNB), or suspicion for malignancy (n = 13). Seventy-seven nodules were also excluded because they were aspirated by an inexperienced radiologist who had <1 year of experience dedicated to thyroid imaging. The remaining 760 nodules met 1 of the following criteria: 1) nodules with benign or malignant results on US-FNA or US-CNB (n = 551), 2) nodules that underwent an operation (n = 191), and 3) nodules that were confirmed as benign on repeat US-FNA or US-CNB after initial cytology results of nondiagnostic (n = 4) or atypia of undetermined significance/follicular lesion of undetermined significance (n = 14). Finally, 760 thyroid nodules in 757 patients were included (Fig 1). Three patients had 2 nodules that were aspirated from both sides of the thyroid gland.

US Image Acquisition
All US examinations were performed using a 7- to 17-MHz linear transducer (EPIQ 7; Philips Healthcare). One of 5 radiologists dedicated to thyroid imaging with 6-21 years of experience performed the US examinations and subsequent US-FNAs. The radiologist who performed the US-FNA prospectively recorded the US features of each thyroid nodule with respect to composition, echogenicity, shape, margin, and calcifications. 2,24 Composition was assessed as solid, predominantly solid (solid component ≥50%), or predominantly cystic (solid component <50%) or spongiform. Echogenicity was assessed as hyperechoic (hyperechogenicity compared with the surrounding thyroid parenchyma), isoechoic (isoechogenicity compared with the surrounding thyroid parenchyma), hypoechoic (hypoechogenicity compared with the surrounding thyroid parenchyma), or markedly hypoechoic (hypoechogenicity compared with the strap muscles). Shape was assessed as parallel or nonparallel (greater in the anteroposterior dimension than the transverse dimension, taller-than-wide). Margin was assessed as well-defined, microlobulated, or irregular. Calcifications were classified as eggshell calcifications, macrocalcifications, microcalcifications, mixed calcifications, or no calcifications.

Image Analyses
A representative US image of each thyroid nodule was selected by an experienced radiologist (J.Y.K. with 18 years of experience dedicated to thyroid imaging), and the chosen images were stored as JPEG images in the PACS. The radiologist (J.Y.K.) drew a square ROI to cover the targeted thyroid nodule entirely using the Windows 10 Paint program. The extracted ROIs were analyzed by the deep CNN, and malignancy risk was shown as a percentage between 0 and 100 for each thyroid nodule (Fig 2). The deep CNN implementation was based on an algorithm that was trained (fine-tuned) with 589 thyroid nodule datasets from our institution. 10 Using 3 pretrained CNNs, AlexNet, GoogLeNet, and InceptionResNetV2, we created thyroid classifiers and collected the area under the receiver operating characteristic (ROC) curve (AUC) corresponding to each CNN using Matlab 2019a (MathWorks). These classifiers and AUCs were then used to produce the mean of classification scores expressed as posterior probability in which the AUCs were used as weights. This process yields more objective results by gathering various opinions: it tends to hold the final result if the predictions agree and follows the higher score if the predictions contradict each other (see more details in the previous studies). 10,25 One radiologist (G.R.K.) with 7 years of experience dedicated to thyroid imaging arranged the previously recorded US features to match the US descriptors used in the ACR TI-RADS and summed up the score of each nodule as follows: TR1 (0-1 point), TR2 (2 points), TR3 (3 points), TR4 (4-6 points), and TR5 (≥7 points). 21 Regarding the US features of ACR TI-RADS, "predominantly cystic" nodules were considered to have cystic or almost completely cystic composition, and "predominantly solid" nodules were considered to have mixed cystic and solid composition. "Solid" nodules were considered to have solid or almost completely solid composition. An echogenicity of "markedly hypoechoic" was regarded as "very hypoechoic." "Well-defined" margins were regarded as smooth and microlobulated, and "irregular" margins were regarded as lobulated or irregular. "Eggshell calcifications" were regarded as peripheral (rim) calcifications, and "mixed calcifications" and "microcalcifications" were regarded as punctate echogenic foci.

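The two scoring steps described under Image Analyses can be illustrated with a minimal Python sketch. It assumes that the AUC-weighted mean is a simple normalized weighted average of the three posterior probabilities; the function names and the example AUC values are hypothetical and are not taken from the study, which used Matlab. The point-to-category mapping follows the score ranges stated above.

```python
from typing import Sequence


def weighted_malignancy_risk(probs: Sequence[float], aucs: Sequence[float]) -> float:
    """AUC-weighted mean of per-classifier posterior probabilities, as a percentage.

    probs: malignancy posterior probability (0-1) from each classifier.
    aucs:  AUC of each classifier, used as its weight (assumed normalization).
    """
    if len(probs) != len(aucs) or len(probs) == 0:
        raise ValueError("probs and aucs must be non-empty and of equal length")
    total_weight = sum(aucs)
    if total_weight <= 0:
        raise ValueError("AUC weights must sum to a positive value")
    weighted_sum = sum(p * w for p, w in zip(probs, aucs))
    return 100.0 * weighted_sum / total_weight


def tirads_category(points: int) -> str:
    """Map a summed ACR TI-RADS score to TR1-TR5 using the ranges given in the text."""
    if points < 0:
        raise ValueError("points must be non-negative")
    if points <= 1:
        return "TR1"
    if points == 2:
        return "TR2"
    if points == 3:
        return "TR3"
    if points <= 6:
        return "TR4"
    return "TR5"


if __name__ == "__main__":
    # Hypothetical posteriors from three classifiers and hypothetical AUC weights.
    risk = weighted_malignancy_risk([0.70, 0.55, 0.80], [0.85, 0.88, 0.91])
    print(f"Weighted malignancy risk: {risk:.1f}%")
    print(tirads_category(5))
```

Because the weights are normalized by their sum, three classifiers that agree return their common probability unchanged, and a classifier with a higher AUC pulls the result toward its own score when the classifiers disagree.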
Data and Statistical Analysis
Benign results on US-FNA or US-CNB and benign or malignant histopathologic results from an operation and follow-up US-FNA or US-CNB were the reference standards for analysis. On the basis of these results, we calculated the malignancy risk of the 5 categories of ACR TI-RADS, respectively. Each nodule that had its percentage of malignancy risk calculated by the CNN was re-categorized into 1 of the 5 TR categories according to the malignancy risk range suggested for each TR category by ACR TI-RADS. 22,26 Malignancy risk was also calculated for those TR categories created from the CNN (CNN-TR).
Variables were compared between the benign and malignant nodules using the Mann-Whitney U test and the χ² test or the Fisher exact test. Diagnostic performances including sensitivity, specificity, accuracy, positive predictive value, and negative predictive value for predicting thyroid malignancy were calculated for the CNN and ACR TI-RADS with 95% confidence intervals. The cutoff value to diagnose thyroid malignancy was defined using the Youden index in the CNN (malignancy risk percentage as a continuous variable) and ACR TI-RADS (TR category as an ordinal variable). 27 Logistic regression using the generalized estimating equation method was used to test the significance of comparisons with adjustments for correlated observations of clustered data. The AUCs of the CNN using a malignancy risk percentage between 0 and 100 and ACR TI-RADS categories using a TR category from 1 to 5 were compared as continuous values using the DeLong method. 28 All statistical analyses were performed with SAS (Version 9.4; SAS Institute) and SPSS 25.0 for Windows (IBM). Statistical significance was defined with P values < .05.
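The cutoff selection can be sketched as follows. The Youden index is J = sensitivity + specificity - 1, and the chosen cutoff is the one that maximizes J. The Python function below is an illustrative implementation only (the study used SAS and SPSS); the function name and the toy data are hypothetical, and a score at or above the cutoff is assumed to be called malignant.

```python
from typing import Sequence, Tuple


def youden_cutoff(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float, float, float]:
    """Return (cutoff, J, sensitivity, specificity) for the cutoff maximizing J.

    scores: malignancy risk percentages (or ordinal TR categories 1-5).
    labels: 1 for malignant, 0 for benign.
    A score >= cutoff is called malignant; ties in J keep the lowest cutoff.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have equal length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both malignant and benign cases are required")

    best = None
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        sensitivity = true_pos / n_pos
        specificity = true_neg / n_neg
        j = sensitivity + specificity - 1.0
        if best is None or j > best[1]:
            best = (cutoff, j, sensitivity, specificity)
    return best


if __name__ == "__main__":
    # Toy data for illustration only; not study data.
    toy_scores = [5, 12, 30, 48, 55, 60, 71, 90]
    toy_labels = [0, 0, 1, 0, 0, 1, 1, 1]
    print(youden_cutoff(toy_scores, toy_labels))
```

The same function applies to an ordinal variable: passing TR categories coded 1 to 5 as scores yields the category at or above which nodules are called malignant.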

Study Population and Nodule Characteristics
Of the 760 thyroid nodules, 176 (23.2%) were malignant. Final diagnoses of the 176 malignant nodules were confirmed through surgical resection (n = 142; 132 papillary thyroid carcinomas, 5 follicular carcinomas, 2 poorly differentiated carcinomas, 1 medullary carcinoma, 1 Hürthle cell carcinoma, and 1 squamous cell carcinoma) and US-FNA (n = 34; 33 papillary thyroid carcinomas and 1 small-cell carcinoma). The median size of all 176 nodules was 20 mm (interquartile range, 14-30 mm). The median age of the 757 patients was 51 years (interquartile range, …). The US features of the benign and malignant nodules according to ACR TI-RADS and their distributions are described in Table 1. The median size of the benign nodules was 23 mm, which was larger than that of the malignant nodules (median, 14 mm; P < .001). Solid or almost completely solid composition (161 of 176, 91.5%), hypoechoic or very hypoechoic echogenicity (153 of 176, 86.9%), taller-than-wide shape (59 of 176, 33.5%), lobulated or irregular margins (132 of 176, 75.0%), and punctate echogenic foci (95 of 176, 54.0%) were frequently seen in the malignant nodules (P < .001 for each). Table 2 summarizes the malignancy risk of each category in ACR TI-RADS and CNN-TR that was calculated after nodules were re-categorized according to the malignancy-risk ranges suggested by ACR TI-RADS. 22,26 The malignancy risk of ACR TR5 was 77.8% (130 of 167), which was much higher than the suggested malignancy risk of 20%. The malignancy risks of ACR TR1 to TR4 were within the risk ranges suggested by the ACR. According to the CNN, 403 thyroid nodules had malignancy risks higher than 20% and were re-categorized as CNN-TR5. Among these 403 nodules, 167 (41.4%) were thyroid cancers. Of the 760 nodules, 307 that had a malignancy risk of 5%-20% according to the CNN were re-categorized as CNN-TR4, and 9 of these 307 (2.9%) nodules were thyroid cancers.

Comparing the Diagnostic Performances of CNN and ACR TI-RADS
According to the cutoff values found using the Youden index for the CNN and the ACR TR categories, thyroid nodules with a malignancy risk of 52.6% or higher in the CNN or nodules categorized as TR5 according to ACR TI-RADS were considered malignant. The diagnostic performances of the ACR TI-RADS and CNN are summarized in Table 3.
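As a consistency check, the reported performance figures can be reproduced from 2 × 2 counts. In the Python sketch below, the ACR TI-RADS counts follow from figures stated in the text (130 of 167 TR5 nodules were malignant; 176 of 760 nodules were malignant). The CNN counts are not stated in the text and are inferred here from the reported percentages, so they should be read as an assumption. The function name is hypothetical.

```python
from typing import Dict


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard diagnostic metrics (as fractions) from a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# ACR TI-RADS with TR5 as the cutoff: 167 nodules called malignant, 130 of them
# truly malignant; 176 malignant and 584 benign nodules overall (from the text).
ACR_COUNTS = {"tp": 130, "fp": 37, "fn": 46, "tn": 547}

# CNN with a 52.6% cutoff: counts inferred from the reported percentages
# (assumption; the text does not state them directly).
CNN_COUNTS = {"tp": 144, "fp": 81, "fn": 32, "tn": 503}

if __name__ == "__main__":
    for name, counts in (("ACR TI-RADS", ACR_COUNTS), ("CNN", CNN_COUNTS)):
        metrics = diagnostic_metrics(**counts)
        summary = ", ".join(f"{k} {100 * v:.1f}%" for k, v in metrics.items())
        print(f"{name}: {summary}")
```

With these counts, the ACR TI-RADS values round to 73.9%, 93.7%, 89.1%, 77.8%, and 92.2%, and the CNN values round to 81.8%, 86.1%, 85.1%, 64.0%, and 94.0%, matching the sensitivity, specificity, accuracy, positive predictive value, and negative predictive value reported in the abstract.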

DISCUSSION
Our study demonstrates that the CNN shows diagnostic performance comparable with that of ACR TI-RADS when experienced radiologists assign US descriptors and score their observations. The malignancy risk of each ACR TR category in our study was within the range suggested by ACR TI-RADS. In our study, the sensitivity (81.8%), specificity (86.1%), and accuracy (85.1%) of the CNN were within ranges similar to those reported in previous publications on the deep CNN for the diagnosis of thyroid cancer. 10,11,16,17 At an optimal threshold derived from the Youden index, our CNN was more sensitive but less specific and accurate compared with the ACR TI-RADS. The system evaluated by Kim et al 12 had lower specificity (68.2%) and accuracy (73.4%) compared with radiologists for the diagnosis of thyroid cancer, even though it achieved similar sensitivity (81.4%). In our study, the specificity and accuracy of the CNN were somewhat higher than those reported in Kim et al. The different frequencies of punctate echogenic foci (considered as microcalcifications) in malignant nodules (54.0% in our study versus 72.1% in Kim et al) might be 1 explanation because Kim et al suggested the recognition of microcalcifications as a cause of inaccuracy for the CNN in their study. In addition, the inferior performance of the CNN in their study was thought to originate from manual manipulation for segmentation and human-designed features applied to the computer-aided diagnosis system. Moreover, the experience level of the performing operator had an effect on the performance of computer-aided diagnosis because of the manual manipulation required for computer-aided diagnosis. 29
Unlike the traditional machine learning algorithm or the traditional commercial system that is connected to US machines and already applied in clinical practice, 5,12,29 the recently introduced deep CNN is not limited to or influenced by human-designed features known to represent thyroid cancer on US, though its operational principles for diagnosing thyroid malignancy are not yet completely explained by humans. In our study, the radiologist only drew a square ROI covering the entire targeted nodule without any human interference with the diagnostic process of the CNN. Instead of using features engineered by humans, the deep CNN extracts image information directly from imaging data, and the CNN might be able to recognize cancer-specific US features that are not identified explicitly by the naked eye. 30 Because US is performed and interpreted by humans, any diagnosis of thyroid cancer based on US images is subjective, thus requiring experience and expertise. 3,4 Recent studies have evaluated the computer-aided diagnosis of thyroid cancer, which incorporates texture analysis and machine learning and deep learning techniques for US images; the authors reported that computer-aided diagnosis showed comparable and even higher diagnostic performance compared with radiologists. 5,6,11,29 While artificial intelligence (AI) is not yet considered ready for a clinical setting, 31 computer software is already thought to have several strong advantages over radiologists because its use can overcome human variation and provide diagnostic reproducibility and consistency in image interpretation. However, past studies have shown greatly differing results when the diagnostic performance of the CNN is compared with human interpretation. This might be due to the diversity of assessments possible by radiologists as well as the different algorithms developed by individual researchers or corporations.
Despite referring to guidelines, radiologists might eventually reach diagnoses independently on the basis of their individual expertise and experience. On the other hand, we intended to directly compare the performance of our CNN with that of an established guideline, ACR TI-RADS, which is known to have a high specificity and positive predictive value without sacrificing sensitivity, and to further use this knowledge to help radiologists achieve optimal performance with the ACR TI-RADS. 20,21,32,33 We used results found with prospectively recorded descriptors that were obtained during real-time evaluations of entire 3D nodules instead of those collected through a retrospective human review of single US images. This choice might represent ACR TI-RADS more properly and objectively than a new individual human review. The experienced radiologists in the study of Li et al 11 showed a less specific and accurate performance than the CNN; the radiologists in the study of Li et al showed low specificities of 57.1%-68.6% and low accuracies of 72.7%-78.8% compared with the previous studies and our study. Regarding this matter, Li et al replied that their reviewers were burdened due to the larger subject sample and subsequent large amounts of image reviews needed. 11,34,35 In this study, the malignancy risk of each ACR TR category was within the theoretic percentage of malignancy risk, which meant that nodules had been assessed appropriately with ACR TI-RADS. The ACR TR categories of our study showed enough specificity and accuracy for diagnosis, fulfilling the original goals of ACR TI-RADS to decrease biopsies with benign findings and improve accuracy. On the other hand, radiologists have shown a wide range in diagnostic performance with ACR TI-RADS because sensitivity has been reported to be 81.7%-96.7%; specificity, 47.7%-77.3%; and accuracy, 69.3%-84.9%. 20,21,32,36
This inconsistency in performance might be caused by the different experience levels of the radiologists or by the different cutoff values of each study. In our study, we were able to conduct a relatively objective validation of ACR TI-RADS by experienced radiologists using US features and to compare its diagnostic performance with that of the AI diagnosis. The diagnostic performance of the CNN was comparable with that of ACR TI-RADS with a somewhat higher AUC for thyroid cancer. Given that a recent study reported that alteration of ACR TI-RADS by AI led to improvement in specificity, the adequate modification and fusion of the settled guidelines and AI, ie, AI-powered US, might be a potential aid to better diagnostic performance and implementation of AI. 36,37 This study has several limitations. First, US examinations are performed in real-time. The process of image acquisition such as capturing 2D-US images and selecting a representative image from the acquired images is inevitably operator-dependent. Additionally, there are limits to how much 2D US images can represent the entire thyroid nodule. AI studies that analyze 3D-US images might be of more help in the future. 37 Second, we used data prospectively recorded in our institutional database, in which US features were described with different terminology than that suggested by the ACR guidelines. Because information about "anechoic," "ill-defined" or "extrathyroidal," and large comet-tail artifacts was not collected during the study period, despite being listed in the ACR guidelines, this issue might be a limitation of our study. However, we did not conduct an intentional retrospective review for this study because we aimed to investigate ACR TI-RADS itself and not the man-made final assessments. Third, our institution is a tertiary center, and we included thyroid nodules that underwent US-FNA or US-CNB only, which meant that surgical histopathology was unavailable.
Thus, there might be false-negative or false-positive results, even though the rates would be very low with a false-negative rate of <3% and a false-positive rate of about 3%-4%. 38 Fourth, the ROC-derived cutoff value that we used to calculate diagnostic performance cannot be accepted as a diagnostic standard in real clinical practice without further validation.

CONCLUSIONS
The CNN provided diagnostic performance comparable with that of the ACR TI-RADS categories assigned by experienced radiologists. Before AI can be used to diagnose thyroid cancer, a thorough evaluation of AI diagnosis compared with pre-existing guidelines is needed, and our study should be able to present a relatively objective comparison of diagnostic performances between the ACR TI-RADS and CNN for thyroid cancer. Adequate modification and fusion of the ACR TI-RADS and CNN that takes advantage of their unique characteristics will help optimize overall diagnostic performance.