Reliability of CT Angiography in Cerebral Vasospasm: A Systematic Review of the Literature and an Inter- and Intraobserver Study

In Part I of this study, articles reporting the reliability of CTA up to May 2018 were systematically searched and evaluated. In Part II, 11 raters independently graded 17 arterial segments in each of 50 patients with SAH for the presence of vasospasm using a 4-category scale. Raters were additionally asked to judge the presence of any moderate/severe vasospasm and whether findings would justify augmentation of medical treatment or conventional angiography ± balloon angioplasty. The systematic review revealed few studies, with heterogeneous vasospasm definitions. In Part II, the authors found interrater reliability to be moderate at best (κ ≤ 0.6), even when results were stratified according to specialty and experience. They conclude that the diagnosis of vasospasm using CTA alone was not sufficiently repeatable among observers to support its general use to guide decisions in the clinical management of patients with SAH.
BACKGROUND AND PURPOSE: Computed tomography angiography offers a noninvasive alternative to DSA for the assessment of cerebral vasospasm following subarachnoid hemorrhage, but there is limited evidence regarding its reliability. Our aim was to perform a systematic review (Part I) and to assess (Part II) the inter- and intraobserver reliability of CTA in the diagnosis of cerebral vasospasm.
MATERIALS AND METHODS: In Part I, articles reporting the reliability of CTA up to May 2018 were systematically searched and evaluated. In Part II, 11 raters independently graded 17 arterial segments in each of 50 patients with SAH for the presence of vasospasm using a 4-category scale. Raters were additionally asked to judge the presence of any moderate/severe vasospasm (≥50% narrowing) and whether findings would justify augmentation of medical treatment or conventional angiography ± balloon angioplasty. Four raters took part in the intraobserver reliability study.
RESULTS: In Part I, the systematic review revealed few studies, with heterogeneous vasospasm definitions. In Part II, we found interrater reliability to be moderate at best (κ ≤ 0.6), even when results were stratified according to specialty and experience. Intrarater reliability was substantial (κ > 0.6) in 3/4 readers. In the per arterial segment analysis, substantial agreement was reached only for the middle cerebral arteries, and only when senior raters’ judgments were dichotomized (presence or absence of ≥50% narrowing). Agreement on the medical or angiographic management of vasospasm based on CTA alone was less than substantial (κ ≤ 0.6).
CONCLUSIONS: The diagnosis of vasospasm using CTA alone was not sufficiently repeatable among observers to support its general use to guide decisions in the clinical management of patients with SAH.

Cerebral vasospasm is the main cause of delayed cerebral ischemia after rupture of an intracranial aneurysm. 1-3 To detect and manage vasospasm, CTA and DSA are commonly used, particularly in comatose or sedated patients, to guide the use of medical and/or endovascular interventions that aim to prevent poor outcomes. 3,4 The reliability of CTA in this context has not been rigorously evaluated. 5-8 This article is divided into 2 parts. First, we systematically reviewed the literature on the CTA evaluation of vasospasm with emphasis on grading classifications and interobserver reliability. Second, we performed a local reliability study to assess whether clinicians agreed in making the diagnosis of moderate or severe vasospasm using CTA and in recommending further investigations or treatments based on CTA results in a series of 50 patients.

MATERIALS AND METHODS
Part I: Systematic Review
Classification systems used to quantify the degree of vasospasm with DSA and/or CTA and intra-/interobserver agreement studies on the diagnosis of vasospasm using CTA were systematically reviewed. A detailed protocol for the search strategy was prepared according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement. 9 The EMBASE, CINAHL, Evidence-Based Medicine (EBM), Cochrane, and MEDLINE databases were searched with no starting date specification, capturing English and French publications up to May 3, 2018. The search strategy is available in On-line Tables 1-5. One author (B.F.) tested the search strategy for its ability to recover pertinent articles. The data were collected and reviewed in detail by 2 authors with 5 and 6 years of experience, respectively (B.F. and L.L.-G.). Discrepancies were resolved by consensus.

Part II: Reliability Study
The Guidelines for Reporting Reliability and Agreement Studies were followed. 10 The Centre de Recherche du Centre Hospitalier Universitaire de Montreal review board waived informed consent to access the patients' clinical and radiologic data. Written informed consent was obtained from all raters participating in the study.
Patients. We assembled a portfolio of 50 patients. The number of patients was predefined using the method of Donner and Rotondi 11 and the kappaSize package 12 in R, Version 3.4.4 (https://www.r-project.org/), 13 taking into account pragmatic factors such as the willingness of observers to complete segment-by-segment evaluations.
All consecutive patients presenting to our institution from January 2005 to May 2017 with nontraumatic nonperimesencephalic SAH 14 and who had undergone at least 2 CTAs (one on admission, the other a follow-up CTA performed 2-21 days later to assess the presence of vasospasm) were retrospectively recovered from our radiology information system. The 2- to 21-day interval was chosen to cover the typical vasospasm window, assuming that the initial CTA could potentially be delayed by up to 24 hours after the onset of symptoms. 4,15,16 The admission CTA was used as a reference when evaluating vasospasm on follow-up CTA. 7,8 To minimize the issue of the κ paradox, CTAs were chosen in approximately equal proportions of vasospasm severity with reference to the official radiology report. 17-19 We did not exclude examinations degraded by coil or clip artifacts unless the study was rendered nondiagnostic (1 patient with 3 coiled aneurysms). All radiologic studies were de-identified and sent to the PACS for this study. The retrieved clinical information included demographic data (age, sex) and initial SAH-related patient characteristics (Hunt and Hess scale, 20 hydrocephalus, ventricular drainage, craniectomy, anatomic location of the culprit vascular lesion causing the SAH, type of treatment of the vascular lesion) as well as the reason for performing the follow-up CTA in search of vasospasm and the time delay between the initial and follow-up CTA. One author (L.L.-G.) retrospectively graded each admission noncontrast head CT for SAH using the modified Fisher scale. 21
Readers. Eleven clinicians from our institution involved in the diagnosis and management of vasospasm, from different specialties (radiology, neurosurgery) and with different levels of experience in the management of vasospasm, participated in the study. Readers were stratified by experience as junior (residents and fellows) or senior (attending physicians with ≥5 years of experience; range, 5-35 years).
Evaluations and Categories. For each case, readers were asked to grade the degree of vasospasm of 17 arterial segments on a 4-category scale (none, mild [<50%], moderate [50%-74%], and severe [≥75% narrowing]) compared with the initial CTA. 8,22 Arterial segments were predefined as proximal (intracranial internal carotid arteries, A1, M1, and P1 segments, basilar and vertebral arteries) or distal (A2-3, M2-3, P2-3 segments), as previously reported. 7 For each patient, there were 3 additional questions (yes/no): 1) Is there moderate-severe vasospasm at any location? 2) Presuming the presence of a new neurologic deficit corresponding to the territory of the artery affected by vasospasm, would you recommend a change in medical management? 3) Would you recommend DSA with or without balloon angioplasty? The latter 2 clinical decisions were based on the readers' clinical experience.
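The 4-category scale and its dichotomization can be written out as a short sketch. This is illustrative only: readers in the study graded segments visually against the admission CTA, so mapping from a numeric percentage is an assumption, as are the function names and the choice to treat 0% narrowing as "none."

```python
# Illustrative encoding of the 4-category vasospasm scale described above.
# Assumption: a numeric percent narrowing is available; in the study,
# readers assigned categories by visual comparison with the admission CTA.

GRADE_LABELS = {0: "none", 1: "mild", 2: "moderate", 3: "severe"}


def grade_narrowing(percent_narrowing):
    """Map percent diameter narrowing to an ordinal grade (0-3)."""
    if percent_narrowing <= 0:
        return 0  # none (assumed: no measurable narrowing)
    if percent_narrowing < 50:
        return 1  # mild: <50%
    if percent_narrowing < 75:
        return 2  # moderate: 50%-74%
    return 3      # severe: >=75%


def is_moderate_severe(grade):
    """Dichotomize a grade: none/mild versus moderate/severe (>=50%)."""
    return grade >= 2


def any_moderate_severe(segment_grades):
    """Per-patient question 1: moderate-severe vasospasm at any location?"""
    return any(is_moderate_severe(g) for g in segment_grades)


# Hypothetical patient with 17 graded segments (values invented).
example_grades = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(any_moderate_severe(example_grades))
```

The same dichotomization applies both to single segments and, through `any_moderate_severe`, to the per-patient verdict.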
All the aforementioned readings were performed using only the pair of de-identified admission and follow-up CTAs, with the reader blinded to the initial report, other clinical information, or follow-up imaging as well as other reader assessments. For the intrarater portion of the study, all cases were read a second time by 4 raters. The second set of readings was permutated and performed at least 4 weeks following the initial readings to minimize recall bias.
Data Analyses. Intra- and interobserver reliability statistics were computed using STATA/IC, Version 14.2 (StataCorp, College Station, Texas) and R 13 using the irr package 23 under the supervision of a statistician (M.C.). Cohen and Fleiss κ reliability coefficients were calculated for intraobserver and multirater interobserver analyses, respectively, using 1000 bootstrap samples (bias-corrected) to obtain 95% confidence intervals. In the per-patient analysis, the 3 main questions generated dichotomous results (yes/no). In the per-segment analysis, the 4-point grading system for vasospasm (none, mild, moderate, severe) generated an ordinal scale (0-3). We did not add weightings to the κ calculation for the latter data. This scale was then dichotomized (none-mild versus moderate-severe, corresponding to <50% versus ≥50% arterial narrowing) for a secondary analysis. An exploratory analysis was also performed, removing all arterial segments obscured by clip or coil artifacts. All κ coefficients were stratified according to experience (junior versus senior) and specialty among senior readers (diagnostic versus interventional neuroradiology). κ coefficients were interpreted using Landis and Koch guidelines, 24 predefining κ > 0.6 as "substantial agreement."
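For readers unfamiliar with these statistics, the following is a minimal sketch of unweighted Cohen κ, Fleiss κ, and a bootstrap confidence interval obtained by resampling patients. It is not the analysis code used in the study, which relied on Stata and the R irr package with bias-corrected intervals; this sketch uses a plain percentile interval for simplicity, and all function names and the small ratings table are hypothetical.

```python
import random
from collections import Counter


def cohen_kappa(first, second):
    """Unweighted Cohen kappa for two sets of readings of the same cases."""
    n = len(first)
    p_o = sum(a == b for a, b in zip(first, second)) / n
    c1, c2 = Counter(first), Counter(second)
    p_e = sum(c1[c] * c2[c] for c in c1) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is used
    return (p_o - p_e) / (1 - p_e)


def fleiss_kappa(ratings):
    """Fleiss kappa; ratings[i] lists the category given by each rater to case i."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    p_bar = 0.0
    for row in ratings:
        counts = Counter(row)
        totals.update(counts)
        agree = sum(c * c for c in counts.values()) - n_raters
        p_bar += agree / (n_raters * (n_raters - 1))
    p_bar /= n_cases
    p_e = sum((t / (n_cases * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


def bootstrap_ci(ratings, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling cases with replacement.
    Simplification: the study reports bias-corrected intervals."""
    rng = random.Random(seed)
    n = len(ratings)
    values = []
    for _ in range(n_boot):
        sample = [ratings[rng.randrange(n)] for _ in range(n)]
        value = statistic(sample)
        if value == value:  # drop undefined (NaN) replicates
            values.append(value)
    values.sort()
    m = len(values)
    lower = values[int((alpha / 2) * m)]
    upper = values[min(m - 1, int((1 - alpha / 2) * m))]
    return lower, upper


# Hypothetical grades (0-3) from 4 raters on 8 cases; values are invented.
demo = [
    [0, 0, 0, 1], [2, 2, 3, 2], [1, 1, 0, 1], [3, 3, 3, 3],
    [0, 1, 0, 0], [2, 1, 2, 2], [0, 0, 0, 0], [1, 2, 1, 1],
]
print("Fleiss kappa:", round(fleiss_kappa(demo), 3))
print("95% CI:", bootstrap_ci(demo, fleiss_kappa))
```

Resampling whole cases (rows) rather than individual ratings keeps each patient's set of ratings together, which matches the unit of sampling in a reliability portfolio.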

RESULTS
Part I: Systematic Review
A total of 5761 titles were reviewed, 2780 abstracts were examined, and 304 full-text articles were read in detail, leaving 14 articles for the systematic review (On-line Fig 1). In these studies, 8 different classification systems were used (with 3-5 categories) with various arbitrary cutoffs (On-line Table 6). All 14 were diagnostic accuracy studies, but 3 also assessed interobserver agreement on cerebral vasospasm using CTA. One of the interobserver variability studies had 3 raters, while the other 2 studies had 2.
No study assessed intraobserver reliability. Results are summarized in On-line Table 7. The degree of blinding of the raters was not reliably reported. The paucity of data and the heterogeneity of methods and end points precluded the performance of a meta-analysis.

Part II: Reliability Study
Patients and their characteristics are summarized in Table 1 and On-line Table 2.
The interobserver agreement for the detection of moderate-to-severe vasospasm (≥50% narrowing in any segment) was only fair (κ = 0.340; 95% CI, 0.232-0.462) for all raters (Fig 1, data in On-line Table 8). Agreement between senior raters improved to moderate (κ = 0.433; 95% CI, 0.266-0.582). Perfect agreement was found for few patients: Six of 50 (12%) patients were judged by all raters to have moderate-severe vasospasm, while 3/50 (6%) were judged by all raters not to have any vasospasm. These proportions improved slightly when only senior readers were considered, reaching 7/50 (14%) and 16/50 (32%), respectively. There were significant differences (P < .001) in the proportions of patients judged to have moderate-to-severe vasospasm between junior and senior raters, as well as between diagnostic and interventional neuroradiologists (On-line Table 9 and On-line Fig 3A).
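The verbal labels attached to these coefficients follow the Landis and Koch bands. A small sketch of that mapping is shown below, applied to the two per-patient values reported above; the function name is hypothetical, and the handling of values falling exactly on a band edge is an assumption.

```python
def landis_koch_label(kappa):
    """Return the Landis and Koch descriptor for a kappa coefficient.
    Bands: <0 poor, 0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate,
    0.61-0.80 substantial, 0.81-1.00 almost perfect."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Coefficients reported in the text for moderate-to-severe vasospasm.
reported = {"all raters": 0.340, "senior raters": 0.433}
for group, kappa in reported.items():
    print(f"{group}: kappa = {kappa:.3f} ({landis_koch_label(kappa)})")
```

Both values fall below the predefined κ > 0.6 threshold for substantial agreement.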
Interobserver agreement regarding augmentation of medical treatment based on CTA alone was fair (κ = 0.245; 95% CI, 0.179-0.336) (Fig 1, data in On-line Table 8). Interobserver agreement on recommending DSA ± balloon angioplasty based on CTA alone was also fair (κ = 0.272; 95% CI, 0.159-0.415) (Fig 1, data in On-line Table 8). There were only 2/50 (4%) cases in which all raters agreed that DSA ± angioplasty should be performed (On-line Fig 3C).
The segment-by-segment analysis evaluating the presence of moderate-severe vasospasm did not reach substantial agreement for any arterial segment when considering all raters (On-line Tables 11 and 12). Judgments regarding proximal segments (M1, A1, vertebral and basilar arteries) were, in general, more repeatable than judgments regarding distal segments, with the exception of the internal carotid artery, for which agreement remained only slight. When only senior raters were considered, agreement for the presence of moderate-severe vasospasm (dichotomous scale) in the M1 segments improved to substantial (On-line Fig 4D). Examples of maximal agreement and maximal disagreement for the M1 segments are illustrated in On-line Fig 5.
An exploratory analysis was conducted to examine the role of metal artifacts from endovascular coils or surgical clips in affecting the repeatability of diagnoses. Thirty-six A1, 13 A2-A3, 11 ICA, and 7 M1 segments were excluded from this analysis, retaining only arterial segments that were not obscured by artifacts (On-line Fig 6, full data in On-line Table 13). κ coefficients remained below the 0.6 threshold, except for the left M1 segment, which reached the substantial agreement threshold among all observers.
Four observers completed the intraobserver study (3 interventional neuroradiologists and 1 radiology resident). Intraobserver reliability was "substantial" (κ > 0.6) in 3/4 readers (Fig 2, data in On-line Table 10) for the detection of moderate-to-severe vasospasm. For the management recommendations based on imaging findings, intraobserver reliability was substantial for 3 readers for medical management and for 2 readers for the decision to perform DSA.

DISCUSSION
The systematic review revealed a wide variation in grading systems of vasospasm. In addition, few studies evaluated the reliability of CTA for the diagnosis of vasospasm. Most reports were primarily diagnostic accuracy studies, dedicated to a comparison with DSA, while agreement studies were limited to 2-3 raters. The relatively good agreement between readers in these studies could not be confirmed in our center.
The main finding of our reliability study, which included a higher number of observers and a wider range of experience compared with prior reports, is that the diagnosis of moderate-to-severe cerebral vasospasm using CTA alone was not substantially repeatable between observers, even when verdicts were dichotomized and even when analyses were restricted to experienced raters.
This problem has previously been identified. 25 When a noninvasive imaging test is proposed to replace a more invasive one (here conventional angiography), the emphasis is usually placed on diagnostic accuracy, rather than on studying the repeatability of judgments made with the new imaging technique by multiple observers. 10,25 A major difficulty is the shift in the clinical spectrum of patients who are undergoing the test, which naturally occurs as the imaging test becomes widely accepted in routine clinical applications. Initially, for the diagnostic accuracy studies, the new test is likely to be compared with DSA for patients on the "severe" end of the spectrum (patients for whom DSA is judged to be indicated). Later on, at the time of clinical usage, the less invasive test is increasingly used in patients who have less severe disease, for whom DSA would not necessarily be performed. The consequence is that the relatively good interobserver agreement typically found between 2 or 3 expert raters in early diagnostic accuracy studies cannot be reproduced in real-world practice.
The poor repeatability of CTA judgments on vasospasm may not be surprising when one considers a number of unresolved problems. Despite the fact that CTA has been used for decades in the diagnosis of post-SAH cerebral vasospasm, there is no consensus on diagnostic criteria and our systematic review revealed a wide range of grading scales with various arbitrary cutoff values. Whether observers can reliably differentiate 25% or 30% luminal narrowing is questionable, given the small caliber of arteries and the limited spatial resolution of CTA. Even if precise and reliable measurements were possible, there is no agreement on which baseline reference value should be used. Some authors, similar to our study, use the baseline examination (when available) as a reference while others use the ipsilateral or contralateral "uninvolved" arteries. Various methods (eyeballing versus measuring with a caliper, for example) are commonly used. The rationale for various grading scales and the exact procedural methodology have not been clearly stated or validated.
In general, scales should be valid, reliable, and clinically relevant. 26 A scale that is too complex or that introduces too many categories is less repeatable, leaving too much room for variations in interpretation by different observers. There is also no standardized way to summarize the findings of each arterial segment in 1 global verdict that concerns the individual patient. We chose a grading system that had been used previously. 8,22 We also predefined a 50% diameter narrowing cutoff to explore whether better agreement could be achieved by simplifying responses through dichotomization and also because the 50% threshold has been suggested to correlate with decreased cerebral perfusion in the setting of cerebral vasospasm. 2,27-29 Further exploration of the diagnostic value of other threshold values may be warranted, given the low reliability found in the current study as well as the possible lack of specificity of the 50% narrowing cutoff to predict ischemia. 15,28,30
The primary end point of the study was the identification of moderate-to-severe vasospasm in any arterial segment for a given patient. Although agreement between observers failed to reach the substantial (κ > 0.6) threshold, even for experienced raters, intrarater agreement was better for most readers. There remains hope that better interobserver agreement could be attained, at least for major proximal arterial segments (excluding the carotid arteries), with more precise definitions and standardized procedures.
No matter how vasospasm is interpreted, the most important questions concern the clinical relevance of the CTA verdict. We were interested in exploring whether agreement existed, if not for the grade of vasospasm, at least for the clinical significance of the CTA findings in terms of whether medical treatment should be increased or whether DSA ± balloon angioplasty should be performed. We were careful not to provide clinical information that could bias the interpretation of the clinical history by different observers. 31 Despite this effort and contrary to a previous report, 7 interrater agreement remained below the substantial level.
A striking finding was the difference in the severity of vasospasm adjudicated by readers of different experiences and specialties. Diagnostic radiologists had a higher proportion of moderate-to-severe vasospasm diagnoses compared with interventional neuroradiologists. Junior readers had a higher proportion of positive answers to the 3 main questions, compared with more experienced readers. Interventional neuroradiologists, who are exposed to the most severe end of the disease spectrum because they may be required to treat these patients, may have a higher threshold for diagnosing severe vasospasm than other clinicians.
Segment-by-segment analyses were performed to explore potential ways to improve the reliability of CTA. This analysis also showed lower agreement than previously reported. 7 However, the relatively low reliability of the assessment of the carotid arteries and the relatively better agreement found in the evaluation of middle cerebral arteries have previously been noted. 6 These observations suggest that readers looking for vasospasm on CTA should perhaps focus on more reliable segments such as the M1 segments and basilar artery and not on less reliable segments such as carotid arteries or distal vessels.
The present study did not select the best cases or exclude patients with clip or coil artifacts, to reflect real-life clinical conditions. Including mainly patients without metal artifacts, as in a previous report, 6 would hardly have reflected normal clinical usage, for most patients at risk of vasospasm have already undergone endovascular or surgical treatment of the aneurysm at the time of the CTA assessment. Metal artifacts remain a limitation of CTA, a problem that could perhaps be mitigated by specific algorithms and/or dual-energy CT. 32 However, agreement improved only minimally when a second analysis excluding arterial segments obscured by beam-hardening artifacts was performed.
Several factors may explain the lower reliability we found compared with previous reports. Reliability is the complex product of interactions among the test, subjects, raters, and the context of assessment. 33,34 κ coefficients based on a limited number of subjects and raters can result in overestimates. 35 Our study included a larger number of readers and index cases than in previous reports. The spectrum of patients included in our study also differed, as we discussed previously. Grading scales, exact methods of measurement, and statistical analyses also differed. One frequent source of artificial inflation of agreement is the lack of blinding of raters to other raters' observations, to the reference test, or to clinical information or clues. 31 In our study, observers independently analyzed anonymized images; they could not access the clinical or radiologic file of patients.
CTA is performed at our institution in patients with severe SAH grades in whom neurologic monitoring is difficult or impossible and/or when transcranial Doppler sonography findings are concerning for cerebral vasospasm. Angiography was performed in only 8% of studied patients; only 1 patient received intra-arterial milrinone. This is in contrast to almost 50% of patients undergoing angioplasty in the study by Shankar et al. 7 A retrospective review at our institution during the same period revealed that most patients with vasospasm treated with angioplasty had lateralizing neurologic symptoms despite maximal medical treatment. They were directly referred to the angiography suite without CTA. In such patients, the addition of CTA in the investigation of vasospasm may be superfluous. It seems that in our institution, CTA is most often used as a screening test for patients in intensive care, who are difficult to monitor clinically. Unfortunately, our study suggests that in this context, CTA interpretations are not repeatable enough to guide management decisions. If CTA is to be used as a screening test in patients with a low prior probability of symptomatic vasospasm, our study suggests that interpretations should be cautious, perhaps using a diagnostic threshold at a higher level than 50% narrowing, limited to more reliable proximal arterial segments.
Our study had several limitations. The portfolio of cases was artificially constructed, and raters were self-selected. A different set of cases and observers could have produced different results. Given our low prevalence of severe vasospasm in individual arterial segments, our κ coefficients could potentially be underestimated due to a skewed rating distribution in the per-segment analysis. 19 We did not try to differentiate focal-versus-diffuse vasospasm when a narrowing was present. 30 The experimental setup and the use of a portfolio differ from the case-by-case evaluation of real patients, and we can only surmise that responders took the time and care to respond as if they were evaluating real patients.
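The effect of a skewed rating distribution on κ can be shown with a small worked example. The two 2 × 2 tables below are entirely hypothetical and are not drawn from the study data; they share the same raw agreement (80%) but differ in how often a positive rating occurs. The function name is likewise an assumption made for illustration.

```python
def cohen_kappa_2x2(both_yes, only_first, only_second, both_no):
    """Cohen kappa from the four cells of a 2 x 2 agreement table."""
    n = both_yes + only_first + only_second + both_no
    p_o = (both_yes + both_no) / n                 # observed agreement
    first_yes = (both_yes + only_first) / n        # rater 1 positive rate
    second_yes = (both_yes + only_second) / n      # rater 2 positive rate
    # Agreement expected by chance from the marginal rates.
    p_e = first_yes * second_yes + (1 - first_yes) * (1 - second_yes)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: 100 segments, 80 agreements in both tables.
balanced = cohen_kappa_2x2(40, 10, 10, 40)  # positives about half the time
skewed = cohen_kappa_2x2(5, 10, 10, 75)     # positives are rare

print(f"balanced prevalence: kappa = {balanced:.2f}")  # about 0.60
print(f"skewed prevalence:   kappa = {skewed:.2f}")    # about 0.22
```

With rare positive ratings, chance agreement on negatives is high, so the same 80% raw agreement yields a much lower coefficient. This is the mechanism by which per-segment κ values may understate agreement when severe vasospasm is uncommon.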

CONCLUSIONS
The systematic review found few reliability studies, limited to 2-3 readers. Our agreement study, which included a larger number of observers, revealed that the diagnosis of moderate-severe vasospasm was not sufficiently repeatable to support the use of CTA alone to guide decisions in the clinical management of patients with SAH. The repeatability of verdicts could potentially be improved by raising the diagnostic threshold above 50% narrowing for substantial vasospasm, by focusing on proximal arterial segments such as the M1 and basilar arteries, and by standardizing interpretation protocols.