Radiomics Study of Thyroid Ultrasound for Predicting BRAF Mutation in Papillary Thyroid Carcinoma: Preliminary Results

BACKGROUND AND PURPOSE: It is not known how radiomics using ultrasound images contributes to the detection of BRAF mutation. This study aimed to evaluate whether a radiomics study of gray-scale ultrasound can predict the presence or absence of B-Raf proto-oncogene, serine/threonine kinase (BRAF) mutation in papillary thyroid cancer. MATERIALS AND METHODS: The study retrospectively included 96 thyroid nodules that were surgically confirmed papillary thyroid cancers between January 2012 and June 2013. BRAF mutation was positive in 48 nodules and negative in 48 nodules. For analysis, ROIs from the nodules were demarcated manually on both longitudinal and transverse sonographic images. We extracted a total of 86 radiomics features derived from histogram parameters, gray-level co-occurrence matrix, intensity size zone matrix, and shape features. These features were used to build 3 different classifier models (logistic regression, support vector machine, and random forest) using 5-fold cross-validation. The performance of the different models was evaluated in terms of accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the receiver operating characteristic curve. RESULTS: The incidence of high-suspicion nodules diagnosed on ultrasound was higher in the BRAF mutation-positive group than in the mutation-negative group (P = .004). The radiomics approach demonstrated that all classification models showed moderate performance for predicting the presence of BRAF mutation in papillary thyroid cancers with an area under the curve value of 0.651, accuracy of 64.3%, sensitivity of 66.8%, and specificity of 61.8%, on average, for the 3 models. CONCLUSIONS: Radiomics study using thyroid sonography is limited in predicting the BRAF mutation status of papillary thyroid carcinoma. Further studies will be needed to validate our results using various diagnostic methods.

Papillary thyroid carcinoma (PTC) is the most common type of thyroid malignancy and accounts for the rapidly increasing incidence of thyroid cancer worldwide. 1,2 The B-Raf proto-oncogene, serine/threonine kinase (BRAF) mutation plays a central role in the pathogenesis of PTC, promoting carcinogenesis through the action of the mitogen-activated protein kinase pathway. 3,4 The BRAF mutation is known to be the most common genetic alteration in PTC, with a reported frequency ranging from 29% to 83%. 5,6 Many studies have reported that the BRAF mutation is associated with poor clinicopathologic outcomes, such as a high incidence of advanced clinical stage, extrathyroidal extension, and increased recurrence. 6-9 These results suggest that preoperative knowledge of the BRAF mutation status can be helpful in categorizing patients as high risk and planning an appropriate treatment strategy. According to the 2015 American Thyroid Association guidelines, active surveillance has emerged as a safe alternative to surgical intervention in low-risk patients with PTC. 10 In this era, preoperative knowledge of the BRAF mutation status can be one of the preoperative modulators for planning an appropriate treatment strategy, such as the determination of an early surgical intervention.
Several studies have investigated whether gray-scale ultrasound (US) findings could predict the presence of the BRAF mutation in PTC and have reported conflicting results. Kabaker et al 11 reported that most of the suspicious US findings, including a taller-than-wide shape, ill-defined margin, hypoechogenicity, calcifications, and absent halo, were associated with BRAF mutation positivity, and Hahn et al 12 reported that hypoechogenicity and nonparallel orientation were associated with BRAF mutation positivity. Conversely, other studies have found no close correlation between suspicious US features and the BRAF mutation. 9,13,14 Beyond these conflicting results, visual interpretation of US images has limitations, including a high dependency on the radiologist's experience and interobserver variation. In addition, substantial objective information from the image may not be evaluated through visual interpretation. Radiomics, which automatically extracts a large number of high-dimensional features from images, has recently emerged and shows promising results for decision support. 15 Previous studies have reported that histograms and texture analyses of US are useful for differentiating benign and malignant thyroid nodules. 16-21 To our knowledge, there have been no published studies aimed at identifying the presence of BRAF mutation using radiomics features of US.
Therefore, the purpose of this study was to evaluate whether radiomics study of gray-scale US could predict the presence or absence of BRAF mutation in PTC.

MATERIALS AND METHODS
Patient Selection
The institutional review board of our institution (Samsung Medical Center) approved this retrospective study. We retrospectively reviewed our institutional database to identify patients with surgically confirmed PTC who underwent preoperative thyroid US and successful DNA sequencing for BRAF mutations between January 2012 and June 2013. The exclusion criteria were as follows: 1) nodule diameter of <10 mm, because the ROI method has lower accuracy in small nodules and current guidelines do not recommend fine-needle aspiration for nodules with a diameter of <10 mm; 22,23 2) lack of precise correlation between pathology, the BRAF mutation study, and US findings in patients with multiple nodules; and 3) both transverse and longitudinal US images not being available. Finally, this study included a total of 96 PTCs from 96 patients (mean age, 44.9 ± 13.2 years; range, 19-77 years). The final surgical diagnoses and BRAF mutation results of the thyroid nodules were analyzed.

US Examinations and Image Evaluation
All patients underwent preoperative thyroid US using an iU22 Vision 2010 machine (Philips Healthcare, Seattle, Washington) with a commercially available 7- to 12-MHz linear-array transducer. All scans were performed by 1 of 7 radiologists with between 2 and 15 years of experience in thyroid US. Longitudinal and transverse images were obtained for each nodule.
One radiologist (M.-r.K.) retrospectively reviewed preoperative US and assessed image features, and another radiologist (J.H.S.) with 15 years of experience in thyroid US supervised this step. According to the Korean Thyroid Imaging Reporting and Data System (K-TIRADS), 22 all thyroid nodules were evaluated for internal content, echogenicity, shape, orientation, margin, and calcifications. The final category assessment was divided into 5 categories according to the K-TIRADS as follows: category 1, no nodule; category 2, benign nodule; category 3, low-suspicion nodule; category 4, intermediate-suspicion nodule; and category 5, high-suspicion nodule.

Radiomics Feature Analysis
The most representative transverse and longitudinal images of each tumor were selected for radiomics feature extraction. An ROI in the thyroid tumor was delineated manually along the border of each tumor on representative US images using MRIcron software (http://www.mricro.com/mricron) by 2 radiologists (M.-r.K. and J.H.S.). The intraclass correlation coefficient was computed to assess the reproducibility of features using 2 sets of ROIs. The first set of ROIs was used for the radiomics analysis.
A total of 86 radiomics features were extracted using open-source radiomics software, PyRadiomics (https://www.radiomics.io/pyradiomics.html). 24 Forty-three features were computed for each image orientation (transverse and longitudinal), and features from both orientations were considered. The features were grouped into shape (6 features), histogram-based (19 features), intensity size zone matrix (ISZM, 2 features), and gray-level co-occurrence matrix (GLCM, 16 features). The histogram-based features were computed from 64-bin histograms calculated over the intratumoral intensity range. The GLCM features assess textural information and reflect intratumoral heterogeneity using a 2D histogram with 64 bins. A total of 8 matrices corresponding to eight 2D directions with an offset of 1 were computed and then averaged to yield a single matrix. The averaged matrix was used to compute the GLCM features. The ISZM features were also related to texture using blobs of similar intensity and differing sizes. We constructed a 32 × 256 matrix in which the first dimension was binned intensity and the second dimension was the size of the blobs. Further details regarding the features are given in On-line Tables 1-3.
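The GLCM construction described above (64 intensity bins, 8 directions at an offset of 1, averaged into a single matrix) can be illustrated with a minimal pure-Python sketch. This is not the PyRadiomics implementation used in the study. The function names, the use of None to mark pixels outside the ROI, and the choice of contrast as the example feature are assumptions made for illustration only.

```python
def quantize(image, levels=64):
    """Map ROI intensities to integer bins 0..levels-1 (None = outside the ROI)."""
    flat = [v for row in image for v in row if v is not None]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero for a constant ROI
    return [[None if v is None else min(int((v - lo) / span * levels), levels - 1)
             for v in row] for row in image]

# The eight 2D neighbours at an offset of 1 pixel.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def glcm(binned, direction, levels=64):
    """Normalized co-occurrence matrix for a single direction."""
    dr, dc = direction
    counts = [[0.0] * levels for _ in range(levels)]
    total = 0
    for r, row in enumerate(binned):
        for c, a in enumerate(row):
            rr, cc = r + dr, c + dc
            if a is None or not (0 <= rr < len(binned) and 0 <= cc < len(binned[rr])):
                continue
            b = binned[rr][cc]
            if b is None:
                continue
            counts[a][b] += 1
            total += 1
    if total == 0:
        return counts
    return [[v / total for v in row] for row in counts]

def averaged_glcm(image, levels=64):
    """Average the 8 directional matrices into one matrix."""
    binned = quantize(image, levels)
    mats = [glcm(binned, d, levels) for d in DIRECTIONS]
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(levels)]
            for i in range(levels)]

def contrast(p):
    """Example GLCM feature: large when neighbouring pixels differ strongly."""
    return sum(((i - j) ** 2) * v for i, row in enumerate(p) for j, v in enumerate(row))
```

Because the 8 directions come in opposite pairs, the averaged matrix is symmetric, and a perfectly homogeneous ROI yields a contrast of 0.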
Due to the lack of external validation data, we applied 5-fold cross-validation to separate our data into training and test sets to reduce overfitting. Models were built using the training set only and tested on a left-out test set. Each model was trained using 80% of the data (n = 77) and later tested on the remaining 20% of the data (n = 18). Feature selection was performed using minimum redundancy maximum relevance (mRMR) from the training set. 25 The number of features chosen by mRMR was set using a grid search between 3 and 11. Within the cross-validation, the optimal number of features was chosen on the basis of the maximum performance in the test set on average for the 3 classifiers (On-line Figure). The selected features were used as input to train 3 different classifier models: logistic regression, support vector machine using the linear kernel, and random forest with 50 trees. To tune the support vector machine, we tried linear, quadratic, and radial basis function kernels, and the linear kernel worked best. The random forest classifier has built-in feature-selection capabilities, whereas logistic regression and support vector machine do not. We therefore adopted an external feature-selection procedure (ie, mRMR) so that all 3 models were subjected to the same feature-selection procedure.
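The key point of the procedure above, that features are selected from the training folds only, can be sketched as follows. This is a simplified illustration and not the study's code: the greedy score uses absolute Pearson correlation for both relevance and redundancy, whereas mRMR is usually formulated with mutual information, and all function names, the fixed seed, and the fold-assignment scheme are assumptions.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for i in idx:
            folds[position % k].append(i)
            position += 1
    return folds

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def mrmr_select(columns, labels, n_features):
    """Greedy mRMR-style choice: high relevance to the label, low redundancy
    with features already chosen (correlation-based simplification)."""
    relevance = [abs(pearson(col, labels)) for col in columns]
    chosen = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(chosen) < min(n_features, len(columns)):
        def score(j):
            redundancy = sum(abs(pearson(columns[j], columns[c])) for c in chosen) / len(chosen)
            return relevance[j] - redundancy
        candidates = [j for j in range(len(columns)) if j not in chosen]
        chosen.append(max(candidates, key=score))
    return chosen

def select_per_fold(features, labels, k=5, n_features=3, seed=0):
    """For each fold, select features using the training samples only.
    features: one row of feature values per nodule."""
    results = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        train_cols = [[features[i][j] for i in train_idx] for j in range(len(features[0]))]
        train_y = [labels[i] for i in train_idx]
        results.append((test_idx, mrmr_select(train_cols, train_y, n_features)))
    return results
```

In a full pipeline, a classifier would then be fitted on the selected training columns and evaluated on the held-out indices of the same fold. With 96 samples, this particular split produces test folds of 19 or 20 samples.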
The trained classifiers were further tested on a left-out test fold. Because we adopted 5-fold cross-validation, we repeated the procedures of feature selection, model training, and testing steps 5 times, each time leaving out a different test fold. The performance of the classifier models was assessed on the basis of accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and area under the receiver operating characteristic curve (AUC).
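For reference, the performance measures listed above can be computed from the predictions on a test fold as in the following sketch. The function names and the dictionary keys are illustrative assumptions; labels are taken to be 1 for BRAF mutation-positive and 0 for mutation-negative, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case).

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, PPV, and NPV for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true-positive rate
        "specificity": ratio(tn, tn + fp),   # true-negative rate
        "ppv": ratio(tp, tp + fp),           # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
    }

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level ranking, which is the reference point for interpreting the cross-validated results.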

Statistical Analysis
To compare BRAF mutation-positive and -negative PTCs, we analyzed categoric variables using a χ² or Fisher exact test, while continuous variables were analyzed using a Student t test. A P value < .05 was considered statistically significant.

RESULTS
The 96 nodules consisted of 48 BRAF mutation-positive PTCs and 48 BRAF mutation-negative PTCs. Clinical characteristics, including patient age and sex, were not significantly different. The mean tumor size was 1.73 ± 0.85 cm (range, 1-6 cm) and was not significantly different between the 2 groups (BRAF mutation-positive group: 1.59 ± 0.57 cm; BRAF mutation-negative group: 1.87 ± 1.05 cm, P = .12). Central and lateral lymph node metastases were not significantly different (P = .15 and P = 1.00, respectively). Among US characteristics, echogenicity was significantly different between the 2 groups (P = .012). Nonparallel orientation was significantly more frequent in the BRAF mutation-positive group (P = .007) (Table 2). The incidence of K-TIRADS category 5 (high suspicion) was higher than that of K-TIRADS 3 (low suspicion) or 4 (intermediate suspicion) in the BRAF mutation-positive group (P = .004).
The mean intraclass correlation coefficient of the 86 features was 0.89 ± 0.09, as shown in On-line Table 4. Because we adopted 5-fold cross-validation, the selected features varied from fold to fold. Two features were selected >3 times: the mean (histogram-based feature) and the informational measure of correlation (texture feature) of the longitudinal image (Figure).
We also tried a different feature-selection approach (ie, Pearson correlation-based feature selection) to see if it led to better performance. We computed the correlation between all possible pairs of features, and if the correlation exceeded 0.5, we kept the feature of the pair that had the higher correlation with the mutation status. After feature selection, the 3 models were trained and tested. The results of the 3 classifiers using this simple Pearson correlation-based feature selection are given in On-line Table 6. Training performance was better than with the mRMR feature selection, but test performance was worse, which implies that the models were overfitting.
The averages for the 3 classifier models were as follows: accuracy, 58.6% (range, 56.16%-60.42%); sensitivity, 63.9% (range, 60.44%-66.89%); and specificity, 53.8% (range, 52.22%-56.67%). The receiver operating characteristic of the 3 models yielded a relatively low AUC of 0.61 on average. The test performance in terms of AUC showed the lower bound of the confidence interval as 0.52, slightly above the chance level (ie, 0.50).
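The Pearson correlation-based filter described above can be sketched as follows. The 0.5 threshold and the rule of keeping the feature more correlated with the mutation status come from the text. The use of absolute correlation values, the order in which pairs are visited, and the function names are assumptions made for this illustration.

```python
def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def correlation_filter(columns, labels, threshold=0.5):
    """For every pair of features whose correlation exceeds the threshold,
    drop the one that is less correlated with the label.
    columns: one list of values per feature. Returns the kept feature indices."""
    relevance = [abs(pearson(col, labels)) for col in columns]
    keep = set(range(len(columns)))
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            # Assumption: a feature that was already dropped is not compared again.
            if i in keep and j in keep and abs(pearson(columns[i], columns[j])) > threshold:
                keep.discard(j if relevance[i] >= relevance[j] else i)
    return sorted(keep)
```

As with mRMR, such a filter has to be fitted on the training folds only; otherwise the test performance would be optimistically biased.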

DISCUSSION
Since the introduction of radiomics, many previous studies have tried to investigate the relationship between image characteristics and genetic mutations in various malignancies, including lung, colon, brain, and breast cancers. 26-29 They proposed CT- or MR imaging-based radiomics models to detect gene mutation status as a noninvasive method. These models were useful for predicting the presence of gene mutations in malignancies.
To our knowledge, this is the first study to apply radiomics in the estimation of BRAF mutation in patients with PTC. We evaluated the ability of radiomics, using various machine learning approaches, to help predict the presence of the BRAF mutation in patients with PTC. In our study, BRAF-mutated PTCs tended to show nonparallel orientation and marked hypoechogenicity, similar to findings in some previous studies. 12,13 Although visual assessment of thyroid nodules suggested that high-suspicion findings on US were significantly more frequent in BRAF-mutated PTCs, radiomics demonstrated that all classification models failed to show excellent performance for predicting the presence of BRAF mutation in PTCs.
Radiomics is usually performed using tomographic images, including CT, MR imaging, or PET images, because these modalities can acquire 3D volume data and data acquisition can be standardized by setting identical scan parameters on the machines. 15 Compared with tomographic imaging, US has several limitations for quantitative analysis: only 2D data can be acquired, representative features may be lacking because of the limited amount of image data, and the results depend on the operator and on the US machine. 30 These factors may have affected our results. However, US is the most widely used standard imaging tool in thyroid pathology and is very helpful in discriminating between malignant and benign thyroid nodules. A number of published studies have reported favorable results for quantitative features extracted from US images. 16-21 Further studies with a larger amount of data will be necessary.
In our study, 3 different classifiers were applied to demonstrate the effectiveness of the chosen features. A simple model, such as logistic regression, has few parameters and is interpretable; conversely, complex models, such as support vector machine and random forest, are difficult to interpret and have many parameters. No superiority among the classifiers was noted in this study, and the difference among the AUCs of the 3 classifiers was very small; this finding indicates that choosing any classifier did not affect the overall performance. One possible reason for this result could be that the selected features were not very effective; thus, the results remained comparable regardless of the simple or complex classifier model.
In many studies using machine learning, the performance of the unseen test set tends to be lower than that of the training set because data overfitting might occur in the training set. In particular, using too many trees in the random forest classifier might inflate the performance measures in the training set. 31 We conducted a Pearson correlation, one of the other ways to perform feature selection. Results showed that training performance increased, but test performance decreased; these findings imply that the selected features overfitted the training data.
Two features, mean (histogram-based feature) and informational measures of correlation (texture feature) of the longitudinal image, were selected >3 times. Thus, they were important features to explain the BRAF mutation. There was a relatively small overlap between the 2 features when we computed a Pearson correlation (r = 0.15 with P = .14). The mean value reflects echogenicity of the ROI, which is compatible with our visual assessment of US images. 16 The informational measure of correlation is related to the heterogeneity of the ROI and thus could have a potential correlation with pathology. Within a thyroid tumor, cells that harbor the BRAF alteration coexist with cells that do not, forming intratumor heterogeneity. Intratumor heterogeneity may foster tumor evolution and adaptation. 32,33
Our study had several limitations. First, this was a retrospective study from a single institution, which introduces the possibility of selection bias. Additionally, the small dataset in our study made it difficult to achieve reliable results. With added samples, applying deep learning approaches combined with electronic medical records might be possible; this process might improve the overall performance. Second, although preoperative US was performed with the same US machine set with similar parameters to avoid equipment-based variability, equipment settings and patient-related factors may still have influenced the pixel intensity of US images. 30 Third, we focused on predicting the BRAF status of patients with papillary thyroid cancer. Our main goal was not to contrast healthy controls and patients with papillary thyroid cancer. Still, including healthy controls would lead to less positive bias. Last, the lack of external validation data is also a limitation of this study. Our results from this study need to be further validated in a larger dataset to better assess their potential clinical use.

CONCLUSIONS
Our preliminary study shows that radiomics study of thyroid US was limited in predicting BRAF mutation in PTC.