Abstract
BACKGROUND AND PURPOSE: The Neck Imaging Reporting and Data System was introduced to assess the probability of recurrence in surveillance imaging after treatment of head and neck cancer. This study investigated inter- and intrareader agreement in interpreting contrast-enhanced CT after treatment of oral cavity and oropharyngeal squamous cell carcinoma.
MATERIALS AND METHODS: This retrospective study analyzed CT datasets of 101 patients. Four radiologists provided Neck Imaging Reporting and Data System reports for the primary site and neck (cervical lymph nodes). Kendall's coefficient of concordance (W), Fleiss κ (κF), Kendall's rank correlation coefficient (τB), and weighted κ statistics (κw) were calculated to assess inter- and intrareader agreement.
RESULTS: Overall, interreader agreement was strong or moderate for both the primary site (W = 0.74, κF = 0.48) and the neck (W = 0.80, κF = 0.50), depending on the statistics applied. Interreader agreement was higher in patients with proved recurrence at the primary site (W = 0.96 versus 0.56, κF = 0.65 versus 0.30) or in the neck (W = 0.78 versus 0.56, κF = 0.41 versus 0.29). Intrareader agreement was moderate to strong or almost perfect at the primary site (range τB = 0.67–0.82, κw = 0.85–0.96) and strong or almost perfect in the neck (range τB = 0.76–0.86, κw = 0.89–0.95).
CONCLUSIONS: The Neck Imaging Reporting and Data System used for surveillance contrast-enhanced CT after treatment of oral cavity and oropharyngeal squamous cell carcinoma provides acceptable score reproducibility with limitations in patients with posttherapeutic changes but no cancer recurrence.
ABBREVIATIONS:
- BI: Breast Imaging
- CECT: contrast-enhanced CT
- LI: Liver Imaging
- NI: Neck Imaging
- OCSCC: oral cavity squamous cell carcinoma
- OPSCC: oropharyngeal squamous cell carcinoma
- PI: Prostate Imaging
- RADS: Reporting and Data System
Oral cavity squamous cell carcinoma (OCSCC) is the most common malignancy of the head and neck but might soon be overtaken by oropharyngeal squamous cell carcinoma (OPSCC), whose incidence is rising rapidly, mainly because of its association with the human papillomavirus.1-3 Smoking and alcohol use are major risk factors with synergistic effects.4 While some authors use OCSCC for cancers in both locations, we think it is important to separate them. The oral cavity is separated from the oropharynx by the junction of the hard and soft palates above and, below, by the circumvallate papillae located at the transition from the anterior two-thirds to the posterior third of the tongue.5
After completion of curative treatment for OCSCC or OPSCC, patients are enrolled in a program of continuous surveillance imaging and clinical examinations. Surveillance imaging can be performed using CT, MR imaging, or PET/CT and PET/MR imaging.5,6 Radiologists interpreting posttherapeutic imaging studies in these patients typically focus on the detection of submucosal recurrence at the primary cancer site and the identification of suspicious lymph nodes in the neck. Mucosal recurrence might also be seen in surveillance imaging but is a domain of referring clinicians. Especially in patients who underwent high-dose radiation therapy, the best surveillance can be ensured with a combination of clinical examinations, high-resolution imaging, and possibly endoscopy.7
Interpretation of posttherapeutic neck imaging studies in these patients is often challenging for radiologists. In this setting, the probability of cancer recurrence is usually conveyed in nonstandardized wording. Reporting and Data Systems (RADS) provide standardized terminology and guidance toward a final score reflecting the probability of malignancy in patients enrolled in cancer surveillance programs. Following the introduction of such a system for breast imaging (BI-RADS) in 1997, several RADS for different organs and body regions (eg, PI-RADS for the prostate and LI-RADS for the liver) have been published and have become highly appreciated by referring clinicians, not least because they improve comparability and reproducibility.8-10
In 2016, the Neck Imaging Reporting and Data System (NI-RADS) was introduced by the American College of Radiology and has shown a promising initial performance.11-13 Defined features and findings lead to a numeric value that reflects the probability of cancer recurrence and is directly linked to recommendations for measures to be taken for further patient management.
The major motivation for this study was to test the reliability of NI-RADS in interpreting contrast-enhanced CT (CECT), which is, by far, the most common technique used for the surveillance of patients with head and neck cancer in our institution. We sought evidence to support its implementation as a reporting standard for imaging studies and for the discussion of findings with referring physicians from the department of oral and maxillofacial surgery.
MATERIALS AND METHODS
Patient Population
This retrospective study was approved by our institutional review board, and written informed consent was obtained from all patients. In the records of our weekly interdisciplinary conferences (of radiologists and oral and maxillofacial surgeons) held between June 2017 and July 2019, we identified 123 consecutive patients for whom CECT studies performed at our department or by an external institution were available, and 101 patients (41 women, 60 men; median age, 64 years) were finally included in this study. A flow chart of participants is provided in Fig 1. A total of 202 target sites (primary cancer site and neck for each patient) were evaluated. Of the patients included, 72 had OCSCC localized in the mouth floor (n = 22), the anterior two-thirds of the tongue (n = 19), the hard palate (n = 3), and the gingival, labial, or buccal mucosa (n = 28). Twenty-nine patients had OPSCC localized in the posterior third (base) of the tongue (n = 13), the soft palate (n = 2), the palatine tonsils (n = 13), and the posterior oropharyngeal wall (n = 1).
Imaging
Of the 101 CECT studies included, 72 were performed in our department, and 29, by an external institution. In our department, we perform neck CECT scans on an 80-section CT scanner (Aquilion PRIME; Canon Medical Systems, Otawara, Japan). Our standard protocol includes scout-based automated selection of tube voltages between 100 and 130 kV and tube current modulation between 60 and 600 mA, a tube rotation time of 0.75 seconds, collimated section thickness of 0.5 mm, and a pitch factor of 0.813. Seventy-five milliliters of contrast medium (iopamidol, Imeron 400; Bracco, Milan, Italy) is injected as a split bolus: the first bolus of 50 mL at a flow rate of 2.5 mL/s and the second bolus of 25 mL 55 seconds later at a flow rate of 3.5 mL/s, followed by a 40 mL saline chaser at a flow rate of 2.5 mL/s. The helical scan starts with a delay of 18 seconds after the start of the second bolus injection.
Image quality of the CECT datasets was rated on a 4-point scale (1, excellent; 2, good; 3, acceptable; 4, not acceptable) to ensure that the dataset allows adequate assessment of the primary site, which is often and primarily affected by metal artifacts. A rating of 4 means that the primary site cannot be evaluated for cancer recurrence.
Inclusion Criteria
- Status posttreatment of OCSCC or OPSCC and recorded case discussion in our weekly interdisciplinary conference (departments of radiology and of oral and maxillofacial surgery).
- CECT within 3–12 months after treatment or prior surveillance imaging.
- CECT imaging-quality requirements:
  - Split bolus injection of contrast medium resulting in a combined vascular and delayed phase in 1 acquisition.
  - Arms positioned below the head and neck (next to the chest and abdomen).
  - Image quality rating of 1 (excellent), 2 (good), or 3 (acceptable).
- Confirmation study either as:
  - Subsequent surveillance imaging (CECT, MRI, PET) no earlier than 3 months after the CECT study included or
  - Histopathologic study.
Exclusion Criteria
- Failure to meet CECT quality requirements:
  - Single bolus injection of contrast medium resulting in a single delayed phase.
  - Arms positioned over the head.
  - Image quality rating of 4 (not acceptable).
- No subsequent confirmation study.
Readers and Reporting Process
Four radiologists with different levels of experience (A, 3 years and ∼300 prior reports of neck CECT; B, 4 years and ∼300 reports of neck CECT; C, 7 years and ∼700 reports of neck CECT; D, 15 years and ∼3300 reports of neck CECT) reviewed the 101 cases included in our analysis. Radiologists A and B were grouped as less experienced; C and D, as more experienced readers. Radiologist D is specialized in imaging of the head and neck. At no time were any of the 4 radiologists involved in the interdisciplinary conferences from which patients were included in this study. Anonymized patients were reordered using random numbers assigned by Excel (Version 16.16.10; Microsoft, Redmond, Washington). Readers had access to previous imaging studies (before and after treatment, if available), and they were aware of clinical information to simulate a real reporting situation. Subsequent imaging findings, diagnoses, or clinical examination reports were not available to the 4 readers. After 3 months, radiologists A, B, C, and D were asked again to report on the CECT datasets of the same 101 patients now presented in a newly randomized order. Each of the 2 serial rating sessions was performed in 4 rounds with 25, 25, 25, and 26 patients and a break of 1 week between each round. Another radiologist who was not part of the NI-RADS reader group (E, 6 years of experience and ∼400 CECT examinations of the neck) rated the image quality.
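The case ordering described above can be sketched in a few lines. This is only an illustration: the study assigned random numbers in Excel, whereas the sketch uses Python's `random` module, and the function name `make_rounds`, the seeds, and the case IDs 1–101 are hypothetical.

```python
import random


def make_rounds(case_ids, round_sizes=(25, 25, 25, 26), seed=0):
    """Shuffle case IDs and split them into consecutive reading rounds."""
    order = list(case_ids)
    if sum(round_sizes) != len(order):
        raise ValueError("round sizes must add up to the number of cases")
    # The seed is fixed only so that the sketch is reproducible.
    random.Random(seed).shuffle(order)
    rounds, start = [], 0
    for size in round_sizes:
        rounds.append(order[start:start + size])
        start += size
    return rounds


# First reading session and the re-read 3 months later, each in a new order.
session_1 = make_rounds(range(1, 102), seed=1)
session_2 = make_rounds(range(1, 102), seed=2)
```

Each session covers all 101 cases exactly once; only the order and the round membership change between sessions.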
NI-RADS Scoring System
Reports of imaging findings were based on the NI-RADS White Paper published in 2018, which was studied thoroughly and jointly discussed by our readers and the authors.11 NI-RADS scores between 1 and 4, reflecting increasing probabilities of cancer recurrence, are assigned separately for the primary site and for cervical lymph nodes (“neck”). NI-RADS 0 is used only as a preliminary score when prior images have been obtained but are not available at the time of reading; it was therefore not required in our study design. NI-RADS 1 is assigned for expected posttherapeutic changes, such as the typical superficial diffuse linear contrast enhancement at the primary site, and for the absence of residual abnormal, new, or enlarged lymph nodes in the neck. NI-RADS 2 for the primary site is subdivided into 2a for focal superficial enhancement and 2b for deep, ill-defined enhancement. NI-RADS 2 for the neck indicates residual abnormal or new, enlarged lymph nodes without new necrosis or extranodal extension. NI-RADS 3 is assigned for discrete masses at the primary site and for new necrosis or extranodal extension of lymph node involvement in the neck. NI-RADS 4 indicates definitive recurrence at the primary site or in the lymph nodes, proved radiologically or even histopathologically.
Data Analysis
Statistical analysis was performed using RStudio (Version 1.1.383; http://rstudio.org/download/desktop) with the “irr” package installed. The heatmap (Fig 2) was generated using RStudio and the “gplots” package. The flowchart was created using draw.io (Version 10.8.0; JGraph, Northampton, UK).
Subgroups were formed according to readers’ experience (more-versus-less experienced), the results of the confirmation studies (no recurrence versus recurrence), and the probability of cancer recurrence based on the NI-RADS scores of most readers (NI-RADS 1 and 2 versus NI-RADS 3 and 4).
Kendall's W and Fleiss κ (κF) were calculated to test interreader agreement. Calculation of W included a correction factor for tied ranks, and its statistical significance was assessed using the χ2 test. Kendall's rank correlation coefficient τB and the Cohen weighted κ (κw) were computed to quantify either interreader agreement between 2 readers or intrareader agreement. For κw, disagreements were weighted according to their squared distance from perfect agreement.
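As a rough illustration of these statistics, the following sketch implements the textbook formulas for tie-corrected Kendall's W (with its χ2 statistic), Fleiss κ, and quadratic-weighted Cohen κ in plain Python. It is not the code used in the study, which relied on the R “irr” package. The function names are hypothetical, and the sketch assumes that scores are coded as comparable values such as the integers 1 to 4; how 2a and 2b would be ordered is a choice left to the user.

```python
from collections import Counter


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def kendalls_w(ratings):
    """Tie-corrected W and its chi-square statistic (df = n - 1).

    ratings: one list of scores per reader, all covering the same n cases.
    """
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(r) for r in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((x - mean_sum) ** 2 for x in rank_sums)
    # Tie correction: sum of (t^3 - t) over every group of tied scores per reader.
    ties = sum(t ** 3 - t for r in ratings for t in Counter(r).values())
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every reader gives one constant score")
    w = 12 * s / denom
    return w, m * (n - 1) * w


def fleiss_kappa(ratings, categories):
    """Fleiss kappa for m readers who each rate all n cases."""
    m, n = len(ratings), len(ratings[0])
    counts = [[sum(r[i] == c for r in ratings) for c in categories] for i in range(n)]
    p_cat = [sum(row[k] for row in counts) / (n * m) for k in range(len(categories))]
    p_case = [(sum(x * x for x in row) - m) / (m * (m - 1)) for row in counts]
    p_obs = sum(p_case) / n
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


def quadratic_weighted_kappa(a, b, categories):
    """Cohen kappa with disagreement weights (i - j)^2 over ordered categories."""
    k, n = len(categories), len(a)
    idx = {c: i for i, c in enumerate(categories)}
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[idx[x]][idx[y]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = sum((i - j) ** 2 * observed[i][j] for i in range(k) for j in range(k))
    den = sum((i - j) ** 2 * row[i] * col[j] for i in range(k) for j in range(k))
    return 1 - num / den
```

For example, `kendalls_w([scores_a, scores_b, scores_c, scores_d])` would take the four readers' score lists for one target site, and `quadratic_weighted_kappa(first_read, second_read, [1, 2, 3, 4])` would compare one reader's two sessions; these variable names are placeholders. The κ functions are undefined when all ratings fall into a single category.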
W and τB were interpreted on the basis of the guidelines of Schmidt,14 proposing a 5-step classification: 0.10–0.29, very weak agreement; 0.30–0.49, weak agreement; 0.50–0.69, moderate agreement; 0.70–0.89, strong agreement; 0.90–1.00, very strong agreement. Interpretation of κF and κw followed the recommendations of Landis and Koch:15 < 0.20, slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; 0.81–1.00, (almost) perfect agreement.
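The two interpretation scales can be written as simple lookups. The sketch below is illustrative only; the function names are hypothetical. Because the quoted bands are given to 2 decimals, values are rounded to 2 decimals before the lookup. Two boundary choices are assumptions, not taken from the cited guidelines as quoted here: W or τB values below 0.10 receive no label, and κ values below 0.21 (including exactly 0.20) are labeled "slight".

```python
# (lower bound, label) pairs, checked from the highest band down.
SCHMIDT = [(0.90, "very strong"), (0.70, "strong"), (0.50, "moderate"),
           (0.30, "weak"), (0.10, "very weak")]
LANDIS_KOCH = [(0.81, "(almost) perfect"), (0.61, "substantial"),
               (0.41, "moderate"), (0.21, "fair")]


def agreement_label(value, bands, below=None):
    """Return the label of the band containing value (rounded to 2 decimals)."""
    v = round(value, 2)
    for lower, name in bands:
        if v >= lower:
            return name
    return below


def label_w_or_tau(value):
    # The quoted scale starts at 0.10; lower values get no label here.
    return agreement_label(value, SCHMIDT)


def label_kappa(value):
    # Assumption: everything below 0.21 is treated as "slight".
    return agreement_label(value, LANDIS_KOCH, below="slight")
```

Applied to the overall primary-site results, `label_w_or_tau(0.74)` gives "strong" and `label_kappa(0.48)` gives "moderate", which is why the same data can be described as strong or moderate agreement depending on the statistic.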
Recurrence rates were calculated from the NI-RADS scores of most readers. In case of tied scores, the score assigned by the most experienced reader D was decisive.
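The majority rule with its tie-break, and the resulting recurrence rate per score, can be sketched as follows. The function names, the reader keys "A" to "D", and the example values in the usage note are hypothetical; the sketch assumes that each case has exactly one score per reader and one confirmed outcome.

```python
from collections import Counter


def consensus_score(scores_by_reader, tie_breaker="D"):
    """Most frequent score among readers; a tie is settled by one reader's score."""
    counts = Counter(scores_by_reader.values())
    top = max(counts.values())
    leaders = [score for score, c in counts.items() if c == top]
    if len(leaders) == 1:
        return leaders[0]
    return scores_by_reader[tie_breaker]


def recurrence_rates(consensus, recurred):
    """Percentage of confirmed recurrences for each consensus score."""
    totals, events = Counter(consensus), Counter()
    for score, hit in zip(consensus, recurred):
        events[score] += bool(hit)
    return {s: 100 * events[s] / totals[s] for s in sorted(totals)}
```

With 4 readers, a tie is either 2 versus 2 or 4 different scores, so the tie-breaking reader's score is always among the tied candidates. For instance, `consensus_score({"A": 1, "B": 1, "C": 2, "D": 2})` returns 2.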
RESULTS
Figure 2 provides an overview of rating distributions for all 101 patients in the form of a heatmap. It also includes results of the confirmation studies with arrows indicating exemplary cases with perfect or poor agreement among raters. Numbers next to the arrows indicate the figure in which the cases are presented (Figs 3–6).
Depending on the statistical tests used, overall interreader agreement (Table 1) was strong or moderate for both the primary site (W = 0.74, κF = 0.48) and the neck (W = 0.80, κF = 0.50). Less experienced readers showed higher interreader agreement than more experienced readers for the primary site (τB = 0.82 versus 0.50, κw = 0.96 versus 0.80) and the neck (τB = 0.96 versus 0.60, κw = 0.99 versus 0.76). Other subgroups were formed according to the results of the confirmation studies. A total of 13 patients were diagnosed with cancer recurrence. Seven patients had simultaneous cancer recurrence at the primary site and in the neck, while 3 patients each had recurrence only at the primary site or only in the neck. In patients without proved recurrence, interreader agreement was moderate or fair for the primary site (W = 0.56, κF = 0.30) and the neck (W = 0.56, κF = 0.29). By contrast, interreader agreement in patients with proved recurrence was very strong or substantial for the primary site (W = 0.96, κF = 0.65) and strong or moderate for the neck (W = 0.78, κF = 0.41). When forming merged NI-RADS categories according to high and low suspicion of cancer recurrence, we found higher interreader agreement for NI-RADS 3/4 than for NI-RADS 1/2 for both the primary site (W = 0.85 versus 0.51, κF = 0.56 versus 0.23) and the neck (W = 0.59 versus 0.56, κF = 0.44 versus 0.26).
Intrareader agreement (Table 2) for the primary site ranged from moderate to strong (τB = 0.67–0.82) or almost perfect (κw = 0.85–0.96). Intrareader agreement for the neck was strong (τB = 0.76–0.86) or almost perfect (κw = 0.89–0.95).
All statistical analyses conducted to test inter- and intrareader agreement showed statistical significance (P < .05).
Recurrence rates (Table 3) were between 3.57% (NI-RADS 1) and 100% (NI-RADS 4) for the primary site and between 0% (NI-RADS 1) and 83.33% (NI-RADS 4) for lymph nodes. Patients without histopathology for confirmation of their diagnosis were followed up for a median of 351 days (range, 159–772 days), defined by the date of their last surveillance imaging study.
DISCUSSION
Inter- and intrareader agreement is important for estimating the reliability of any diagnostic test. To the best of our knowledge, no study investigating inter- and intrareader agreement of NI-RADS scores has been published. However, we can compare our results for NI-RADS with those obtained by other investigators for the reliability of RADS in other organs. Published data give a very diverse picture. A study similar to ours in terms of statistical methods and results was published by Irshad et al,16 who assessed consecutive versions of BI-RADS including 5 readers and 104 mammographic examinations. They found an overall interreader agreement of 0.65 and 0.57 (Fleiss κ), while overall intrareader agreement was 0.84 and 0.78 (Cohen weighted κ). A study by Smith et al17 determined the reliability of PI-RADS in the interpretation of multiparametric MR imaging of the prostate, including 4 readers and 102 examinations, again similar to our study design. By contrast, however, they reported an overall interreader agreement of 0.24 (Fleiss κ) and an overall intrareader agreement of 0.43–0.67 (Cohen κ).
When we compared the 2 studies with our results, the difference in overall interreader agreement stood out first. Our results obtained with NI-RADS (κF = 0.48 and 0.50) are much better than the findings reported for PI-RADS but inferior to the results achieved with BI-RADS. NI-RADS showed a very high intrareader agreement (κw = 0.85–0.96 and κw = 0.89–0.95), especially compared with the low values obtained in the PI-RADS study. Thus, our results are encouraging because they suggest that there is potential for improving interreader agreement. Given that the NI-RADS lexicon and decision tree can only be used fully when interpreting PET/CT or PET/MR imaging, we expect that interreader agreement can be considerably improved using either of these modalities. In particular, NI-RADS categories 1 and 2 (2a and 2b) are defined more clearly when additional information on FDG uptake is available.
Apart from our findings regarding absolute overall agreement, our analysis also provides some interesting results regarding the subgroups formed. Unexpectedly, overall interreader agreement for both the primary site and the neck was higher between the 2 less experienced readers than between the 2 more experienced readers. Furthermore, interreader agreement for the absence of recurrence in lymph nodes was poorer than we expected. A possible explanation emerged from discussions with the readers after completion of the study: The definition for assigning a lymph node to NI-RADS 2 is “mildly enlarging without specific morphologically abnormal features such as new necrosis or extracapsular spread,” which was perceived as rather vague.11 Some kind of measurable threshold might substantially increase agreement among raters. Other results of our study suggest adequate sensitivity of NI-RADS: Interreader agreement was markedly higher in cases of proved cancer recurrence than in patients without recurrence.
The combination of low recurrence rates in the group classified as NI-RADS 1 and high recurrence rates in the groups with NI-RADS scores of 3 and 4 suggests that NI-RADS is a powerful tool for discriminating patients with a low-versus-high risk of cancer recurrence. No patients assigned scores of 2a for the primary site had cancer recurrence, which might be attributable to the relatively small number of cases or greater variability in the interpretation of findings, as already discussed above. Recurrence rates calculated in our study are based on majority decision but align very well with initially published data.11,18,19
While calculation of κ coefficients is by far the most common statistical test to quantify inter- and intrareader agreement,20,21 there are also more differentiated approaches addressing other aspects of inter- and intrareader agreement.22 Other investigators primarily recommend κ statistics for testing nominal scaled data.23,24 From our standpoint, NI-RADS scores should be regarded as ordinal data because rising values represent a rising probability of cancer recurrence. Therefore, Kendall's coefficient of concordance (used to determine interreader agreement for >2 readers) and Kendall's rank correlation coefficient (interreader agreement with 2 raters or intrareader agreement) should be most appropriate.25 When comparing the paired results of the statistical methods in our study, it is apparent that values of W are always higher than those of κF but values of τB are always lower than those of κw, while their relationships stay basically constant. The intraclass correlation is also used to determine inter- and intrareader agreement; however, it should only be used for underlying continuous data. We therefore chose not to calculate intraclass correlation statistics for the discrete data provided by NI-RADS.
This study, although retrospective, was designed to put readers in a real-world clinical reporting situation. This means that the readers had access to information on OCSCC/OPSCC localization as defined by the multidisciplinary cancer conference, surgical and radiotherapeutic procedures, and pre-existing illnesses. This information is available to reporting radiologists in the clinical setting and is important for appropriately and comprehensively interpreting imaging findings and assessing the patient’s condition. On the other hand, we took steps to reduce possible bias. Cases were presented in randomized order, and patient data were anonymized to lower a possible detection bias. In each rating session, the 101 CECT datasets were split into 4 rounds (25, 25, 25, and 26) to minimize possible over- or underratings because of readers’ raised awareness and altered perception of similarities and differences when comparing cases with others they have recently seen in the artificial reading situation.
Clinically suspected OCSCC or OPSCC and posttherapeutic surveillance are the most frequent indications for neck imaging in our institution, with CECT being much more commonly used than MR imaging. Future studies should investigate inter- and intrareader agreement of NI-RADS, not only for other malignancies (eg, larynx and salivary glands) but also for different imaging modalities (CECT, MR imaging, PET/CT and PET/MR imaging). The role of PET/CT and PET/MR imaging in up- or downgrading lesions seen on CECT or MR imaging without PET should also be of interest in future studies, especially prospectively designed ones.
Limitations
Four radiologists reported imaging findings in this study. While radiologists A and B were relatively close in terms of work experience (years and number of examinations), C and D were wider apart. Although C could easily be classified as more experienced than A and B, a work experience closer to D would have been desirable to ensure ideally balanced subgroups. Subdividing readers into 3 groups with an additional group of intermediate experience might also yield interesting additional results. Because we just started to integrate NI-RADS as a reporting system in our institution, future studies could address these limitations. As readers become more familiar with using NI-RADS and shared experience grows, common approaches might emerge and improve interreader agreement. Although all 4 radiologists were well-acquainted with the literature on NI-RADS, a joint discussion of exemplar cases from our department might have improved interreader and even intrareader agreement. Beyond that, in our opinion, more experience might also lead to higher rates of NI-RADS 2a/b scores being assigned because findings in this category are more difficult to express in prose reports, in which referring clinicians expect a clear decision between “suspected recurrence” and “no suspected recurrence.” We determined recurrence rate as a secondary outcome. Although it attests to the good discriminatory power of NI-RADS, future studies investigating the validity of NI-RADS should define a longer follow-up period of at least 1 year.
CONCLUSIONS
NI-RADS used for interpreting CECT after treatment of OCSCC and OPSCC provides acceptable score reproducibility. A major strength of this standardized approach is the good interreader agreement in patients with proved cancer recurrence and the high intrareader agreement overall. At the same time, there are limitations in terms of interreader agreement in patients with posttherapeutic changes but no cancer recurrence. Although only determined as secondary outcomes, recurrence rates in our patients were similar to those in preliminary published data.
Footnotes
Disclosures: Stefan Markus Niehues—UNRELATED: Grants/Grants Pending: German Research Foundation grant*; Payment for Lectures Including Service on Speakers Bureaus: speakers honorarium for Canon, Guerbet, Bracco, Teleflex. Bernd Hamm—UNRELATED: Grants/Grants Pending: Abbott Laboratories, Actelion Pharmaceuticals, Bayer Schering Pharma, Bayer Vital, Bracco Group, Bristol-Myers Squibb, Charité Research Organization GmbH, Deutsche Krebshilfe, Deutsche Stiftung für Herzforschung, Essex Pharma, European Union Programs, FIBREX Medical Inc, Focused Ultrasound Surgery Foundation, Fraunhofer Gesellschaft, Guerbet, INC Research, InSightec Ltd, Ipsen Biopharmaceuticals, Kendle/MorphoSys AG, Lilly Deutschland GmbH, Lundbeck GmbH, MeVis Medical Solutions AG, Nexus Oncology, Novartis, Parexel CRO Service, Perceptive Innovations, Pfizer GmbH, Philips Healthcare, Sanofi-Aventis SA, Siemens, Spectranetics GmbH, Terumo Medical Corporation, TNS Healthcare GmbH, Toshiba, UCB, Wyeth Pharmaceuticals, Zukunftsfond Berlin/TSB Medici.* *Money paid to the institution.
- Received August 21, 2019.
- Accepted after revision March 8, 2020.
- © 2020 by American Journal of Neuroradiology