Inter- and Intraobserver Agreement in Scoring Angiographic Results of Intra-Arterial Stroke Therapy

BACKGROUND AND PURPOSE: Angiographic results are commonly used as surrogate markers of the success of intra-arterial therapies for acute stroke. Inter- and intraobserver agreement in judging angiographic results remains poorly characterized. Our goal was to assess 2 commonly used revascularization scales. MATERIALS AND METHODS: A portfolio of 148 pre- and posttreatment images of 37 cases of proximal anterior circulation occlusions was electronically sent to 12 expert observers who were asked to grade treatment outcomes according to recanalization (of arterial occlusive lesion) or reperfusion (TICI) scales. Three expert observers had to score treatment outcomes by using a similar portfolio of 32 patients or when they had full access to all angiographic data, twice for each method 3–12 months apart. Results were analyzed by using κ statistics. RESULTS: Agreement among 9 responding observers was moderate for both the TICI (κ = 0.45 ± 0.01) and arterial occlusive lesion (κ = 0.39 ± 0.16) scales. Agreement was similar (moderate) when 3 observers had access to a portfolio (κ = 0.59 ± 0.06 and 0.49 ± 0.07, respectively) or to the full angiographic data (κ = 0.54 ± 0.06 and 0.59 ± 0.07, respectively). Intraobserver agreement was “fair to moderate” for both methods. Interobserver agreement became “substantial” (>0.6) when outcomes were dichotomized into “success” (TICI 2b, 3; arterial occlusive lesion II, III) or “failure”; the results were judged more favorably when the arterial occlusive lesion rather than the TICI scale was used. CONCLUSIONS: There is considerable variability in the assessment of angiographic outcomes of endovascular treatments, invalidating comparisons among publications. A simple dichotomous judgment can be used as a surrogate outcome when treatments are assessed by the same observers in randomized trials.

Current therapies of acute stroke aim at rapid restoration of blood flow or revascularization of the occluded territory to salvage ischemic brain tissue. A gamut of methods and devices has been introduced to accomplish revascularization. [1][2][3][4] While all may agree that the well-being of the patient at the end of treatment is the most important outcome, 5 we also need surrogate markers of mechanistic efficacy, directly linked to the effect we are aiming for, to more expediently determine which method or device should be selected to be tested in a more rigorous fashion, because the heterogeneity of presentations ensures that large trials will be needed to show differences in clinical outcomes. In addition, regulatory agencies approve devices according to their ability to restore blood flow. 6 Thus, angiographic scoring systems and a new vocabulary (such as Thrombolysis in Myocardial Infarction [TIMI], TICI, arterial occlusive lesion [AOL], described below) are now used to adjudicate and compare angiographic results of acute stroke therapies. [7][8][9][10][11][12] The precision of outcome scales must be assessed before their widespread use. Testing can be accomplished by asking various individuals to repeatedly but independently categorize the angiographic results of the same patients and by studying intra- and interobserver agreement of the resulting verdicts. Despite notes of concern 13,14 and except for small studies limited to 2-3 observers introducing unusual scales 15,16 or comparing 2 scoring systems obtained from consensus reading, 17 the inter- and intraobserver agreement among multiple observers for commonly used systems has not been rigorously assessed.
The aim of the present work was to assess the precision and reproducibility of 2 angiographic outcome scales of intra-arterial therapies, one for recanalization and one for reperfusion: The primary arterial occlusive lesion recanalization scoring method, initially proposed for the Interventional Management of Stroke (IMS) I analyses, 17 and the Thrombolysis in Cerebral Infarction perfusion categories, proposed by the Technology Assessment Committees of the American Society of Interventional and Therapeutic Neuroradiology and the Society of Interventional Radiology. 7 These scales (with or without some modifications) are being used in recent trials on intra-arterial stroke therapy, such as IMS II and III, 18

MATERIALS AND METHODS
The primary aim of this work was to evaluate the intra- and interobserver variability in adjudicating outcomes of treatment according to 2 ordinal scales commonly used to assess angiographic results of intra-arterial thrombectomy. The evaluation comprised 3 parts, 2 by electronic surveys; the third evaluation was designed to resemble clinical work and to validate the results obtained by the electronic surveys: 1) an electronic survey to assess interobserver agreement among 9 different expert "external" readers regarding the angiographic outcomes of 37 cases; 2) a similar electronic survey, modified and reduced to match the 32 patients to be analyzed in part 3, to assess intra- and interobserver agreement independently twice, 12 months apart, by 3 expert "internal" readers; and 3) an intra- and interobserver study of the same 32 patients by the same 3 expert readers having access to the full set of angiographic data, directly on the hospital PACS, independently adjudicating results twice, 3-12 months apart, to be compared with the survey of part 2.

Part 1: Electronic Survey with 9 Observers
To minimize variability due to different angiographic equipment, number and type of projections, and selection of final images from various series and to ease the participation of external readers, we assembled a portfolio of 148 images that could be sent electronically to and easily assessed by multiple observers. All anonymized images were retrieved by one author (L.E.) from the PACS of one institution. The portfolio comprised paired (postero-anterior and lateral projections) selected pre- and posttreatment late arterial phase angiograms of 37 cases. Cases included 32 consecutive patients who had been treated endovascularly for acute anterior circulation strokes in a single institution during 9 months (January to September 2011). For part 1, 5 additional cases were constructed by using intermediate-phase results of complex interventions in 5 patients already included, in an attempt to better balance the proportions of the marginal sums of the contingency tables and hopefully minimize paradoxes of statistics. 21 On each page of the electronic version sent to reviewers, 2 pretreatment and 2 posttreatment images were displayed side by side. No clinical information was provided. There was no training of observers. The part 1 electronic portfolio was sent to 12 expert interventional neuroradiologists, selected because they had designed studies or trials on transarterial stroke therapy. Nine, with 5-27 years of clinical experience, answered, working in 6 different centers; 4 were from Canada; 3, from the United States; and 2, from France. One observer answered the questionnaire twice 3 months apart. Observers were given the task of grading each pair of images according to the 4-value AOL scale 8 and the 5-value TICI scale. 7 The explicit definitions of the 2 scales appeared in explanatory boxes beside the answering boxes for each case.

Part 2: Electronic Survey with 3 Readers
The electronic questionnaire, modified to include only the 32 real patients (excluding the 5 "constructed cases" added to part 1), was administered twice, 12 months apart, to the 3 internal interventional neuroradiologists involved in the treatment of acute stroke who participated in part 3 of the study.

Part 3: Intra- and Interobserver Agreement Using All Angiographic Images
To verify that the artificial conditions imposed by the electronic survey did not affect results and to better assess intraobserver agreement, the same 3 observers were asked to grade the angiographic outcomes of the same 32 patients, by using the full set of angiographic data presented by 3 authors not participating in the evaluation of cases (L.E., C.R., M.G.) directly on the PACS, independently twice, 3-12 months apart.

Scores and Dichotomies
To assess intracranial reperfusion, readers were asked to use the TICI score as described by Higashida et al 7 : grade 0, no perfusion, no antegrade flow beyond the point of occlusion; grade 1, penetration with minimal perfusion; grade 2, partial perfusion; grade 2a, only partial filling (less than two-thirds) of the entire vascular territory visualized; grade 2b, complete filling of all of the expected vascular territory visualized but filling more slowly than normal; and grade 3, complete perfusion.
To assess arterial recanalization, readers were asked to use the AOL score 22 : score 0, no recanalization of the primary occlusive lesion; score I, incomplete or partial recanalization of the primary occlusive lesion with no distal flow; score II, incomplete or partial recanalization of the primary occlusive lesion with any distal flow; and score III, complete recanalization of the primary occlusion with any distal flow.
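The dichotomies named in the heading (success defined as TICI 2b or 3, or as AOL II or III, as stated in the abstract and in the Results) can be written as a simple lookup. The following is only an illustrative sketch: the names `SUCCESS`, `VALID`, and `dichotomize` are hypothetical and do not come from the study.

```python
# Grades and cut-offs are those stated in the text; all names are illustrative.
VALID = {
    "TICI": {"0", "1", "2a", "2b", "3"},   # 5-value reperfusion scale
    "AOL": {"0", "I", "II", "III"},        # 4-value recanalization scale
}
SUCCESS = {
    "TICI": {"2b", "3"},
    "AOL": {"II", "III"},
}


def dichotomize(scale, score):
    """Map an ordinal TICI or AOL score to 'success' or 'failure'."""
    if scale not in VALID:
        raise ValueError(f"unknown scale: {scale!r}")
    if score not in VALID[scale]:
        raise ValueError(f"{score!r} is not a valid {scale} score")
    return "success" if score in SUCCESS[scale] else "failure"


if __name__ == "__main__":
    # The same hypothetical case can be a recanalization success
    # and a reperfusion failure.
    print(dichotomize("AOL", "II"), dichotomize("TICI", "2a"))
```

Because the two scales measure different things, a single case may be classified differently by each, which is one reason success rates differ between the scales.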

Statistical Analyses
The multirater κ statistics were computed by using the macro MAGREE with SAS, Version 9.3 (SAS Institute, Cary, North Carolina). This macro implements the methodology of Fleiss et al, 27 measuring the agreement when the number of raters is >2. This method also allowed identifying, for each scale, the categories in which the most frequent disagreements occurred. κ values were interpreted according to Landis and Koch, 27 with coefficients of 0 = poor; 0.01-0.20 = slight; 0.21-0.40 = fair; 0.41-0.60 = moderate; 0.61-0.80 = substantial; and 0.81-1.0 = almost-perfect agreement.
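For readers without access to SAS, the multirater κ and its per-category components can be sketched in a few lines of Python. This is a minimal illustration of the standard Fleiss formulas, not a reimplementation of the MAGREE macro: it does not compute standard errors, the function names are hypothetical, and treating negative values as "poor" is an assumption about how the interpretation bands extend below 0.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Return (overall kappa, per-category kappa) for a fixed number of raters.

    ratings: one list per case, holding one category label per rater.
    categories: every label allowed by the scale.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each case needs the same number (>= 2) of ratings")

    counts = [Counter(row) for row in ratings]  # raters per category, per case
    pairs = n_raters * (n_raters - 1)

    # Share of all assignments that fell into each category.
    p = {c: sum(cnt[c] for cnt in counts) / (n_cases * n_raters) for c in categories}

    # Observed agreement: mean proportion of agreeing rater pairs per case.
    p_obs = sum(
        (sum(cnt[c] ** 2 for c in categories) - n_raters) / pairs for cnt in counts
    ) / n_cases
    # Agreement expected by chance from the category shares.
    p_exp = sum(v * v for v in p.values())
    overall = float("nan") if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)

    per_category = {}
    for c in categories:
        spread = p[c] * (1 - p[c])
        if spread == 0:
            per_category[c] = float("nan")  # category never used or always used
            continue
        discordant = sum(cnt[c] * (n_raters - cnt[c]) for cnt in counts)
        per_category[c] = 1 - discordant / (n_cases * pairs * spread)
    return overall, per_category


def landis_koch(kappa):
    """Verbal label for a kappa value, following the bands quoted in the text."""
    if kappa <= 0:
        return "poor"  # assumption: negative values are also labeled poor
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Made-up verdicts of 3 raters on 4 cases (not study data).
    toy = [["S", "S", "S"], ["S", "S", "F"], ["F", "F", "F"], ["S", "F", "F"]]
    kappa, by_category = fleiss_kappa(toy, ("S", "F"))
    print(round(kappa, 3), landis_koch(kappa), by_category)
```

The per-category values are what allow the most contentious category of a scale to be identified: the category with the lowest κ is the one on which raters diverge most.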

RESULTS

Patients
The portfolio included 32 consecutive patients (17 women; mean age, 63 ± 12). In addition to intra-arterial therapy, patients received IV-rtPA in 61% of the cases. The mean delay between symptoms and thrombectomy was 199 ± 47 minutes. The most frequent occlusions were located on the M1 segment of the middle cerebral artery (n = 19; 60%) or on the distal internal carotid artery (T-occlusion; n = 10; 32%). The most frequent thrombectomy methods used during this period were an aspiration system (n = 13; 41%) or a Stentriever (Trevo; Stryker, Kalamazoo, Michigan) system (n = 14; 43%). Characteristics of patients are summarized in the On-line Table.

Survey with 9 Observers
There were large discrepancies in the adjudication of angiographic outcomes, with, for example, complete perfusion (TICI 3) being attributed to a wide range (8%-49%) of patients or complete recanalization (AOL III) in 22%-65% of patients, depending on observers. Table 1 summarizes the κ values obtained when the 9 observers scored angiographic outcomes according to the TICI reperfusion categories (overall agreement, κ = 0.446 ± 0.013). Table 2 summarizes the κ values when the categories were dichotomized as success (2b, 3) versus failure (0, 1, 2a) (overall agreement, κ = 0.616 ± 0.025). The proportion of κ coefficients of pairs of observers that reached "substantial agreement" (κ > 0.6) increased from 9% to 60% with dichotomization. The TICI category that was the subject of most disagreements was 2b (κ = 0.242 ± 0.025). Table 3 summarizes the κ values when angiographic outcomes were categorized according to the AOL recanalization categories (overall agreement, κ = 0.394 ± 0.016). Table 4 summarizes the κ values obtained when the same outcomes were analyzed as success (II, III) or failure (0, I) (overall agreement, κ = 0.762 ± 0.025). The proportion of κ coefficients of pairs of observers that reached substantial agreement (κ > 0.6) increased from 16% to 91% with dichotomization. The AOL category that was the subject of most disagreements was II (κ = 0.188 ± 0.025).
The endovascular intervention was successful in 68%-87% of patients according to various observers when success was defined in terms of recanalization (AOL II or III) but in only 32%-62% of patients when success was defined in terms of reperfusion (TICI 2b or 3).
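The share of observer pairs reaching substantial agreement can be tallied as sketched below. The sketch assumes unweighted Cohen κ for each pair, which the text does not specify, and uses made-up verdicts rather than study data; `cohen_kappa` and `share_substantial` are hypothetical names.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen kappa for two raters scoring the same cases in the same order."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    if expected == 1:
        return float("nan")  # both raters used one and the same category throughout
    return (observed - expected) / (1 - expected)


def share_substantial(ratings_by_observer, threshold=0.6):
    """Fraction of observer pairs whose kappa exceeds the threshold."""
    kappas = [
        cohen_kappa(x, y)
        for x, y in combinations(ratings_by_observer.values(), 2)
    ]
    return sum(k > threshold for k in kappas) / len(kappas)


if __name__ == "__main__":
    # Made-up dichotomized verdicts (S = success, F = failure) on 6 cases.
    verdicts = {
        "observer_A": ["S", "S", "S", "F", "F", "F"],
        "observer_B": ["S", "S", "S", "F", "F", "F"],
        "observer_C": ["S", "S", "F", "F", "F", "S"],
    }
    print(share_substantial(verdicts))  # 1 of 3 pairs exceeds 0.6
```

With 9 observers there are 36 such pairs, so a pairwise summary complements the single overall multirater coefficient.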

Intra- and Interobserver Agreement with Electronic or Full Datasets
The results of parts 2 and 3 are summarized in Tables 5 and 6. Intraobserver agreement between 2 sessions for experts having access to the full set of angiographic data on 32 patients was only slight to fair (0-0.4) in most cases, with only 1 observer reaching substantial agreement for the AOL scores. The κ values of the interobserver agreement obtained by comparing answers to the electronic questionnaire and the verdicts of the first and second reading sessions when observers had access to all images were similar and always below 0.6 (less than substantial).
When results were dichotomized, intraobserver agreement remained fair for 2 of 3 observers assessing reperfusion with the TICI scale but reached substantial agreement for all 3 readers when they assessed recanalization with the AOL scale; the interobserver agreement improved to substantial (κ > 0.6) with dichotomization of the results of the electronic survey for both scales and for one of the PACS sessions, but not the other.
The endovascular intervention was successful in 72%-79% of patients according to 3 observers having access to all images when success was defined in terms of recanalization (AOL II or III) but in only 35%-59% of patients when success was defined in terms of reperfusion (TICI 2b or 3).

DISCUSSION
The salient features of this work are the following: 1) Agreement in adjudicating angiographic results of endovascular interventions among multiple observers is, at best, fair to moderate; 2) this problem is not limited to divergent interpretations of the definitions of the various categories by various experts because intraobserver agreement was similarly poor when the same experts re-evaluated the same results on 2 independent occasions; 3) difficulties were not caused by the artificial and limited information provided by the survey we used to assess interobserver variations because similarly poor results were obtained when 3 observers were given access to all angiographic images; 4) κ coefficients reached more reassuring substantial agreement values when results were dichotomized into successes and failures; 5) the AOL recanalization scoring system seemed more repeatable than the TICI reperfusion scheme; and 6) the AOL II, III recanalization categories provided a more frequent verdict of success compared with the 2b, 3 TICI categories when dichotomization was used.
To evaluate our interventions, we had no choice but to reduce the variety and heterogeneity naturally found in clinical results to a (preferably small) number of categories and terms to name these categories that will determine what counts as a success or failure, in a common language that will allow tabulation of results and both valid comparisons between groups and communication of results among clinicians. When new categories (such as the TICI scale) are proposed, definitions can be provided as a sort of manual of translation, rules to translate the concrete results obtained in each particular case to a common language. As with any language, translation and communication by using our new terms may fail. If the meaning of such terms as TICI 2b or AOL II can be intentionally defined by explicit descriptions, whether these definitions and categories succeed in fixing the referents (ie, in reidentifying the same angiographic outcomes when they are seen by different observers or by the same observers at different time points) must be empirically tested to ensure that the new language does what it was designed to do, to allow valid comparisons and unambiguous communication. Both TICI and AOL scales had poor concordance when the same results were judged by different observers or by the same observers on 2 different occasions. This finding suggests that results of various case series or registries should not be compared when angiographic outcomes have not been adjudicated by the same observer. 20,28 κ statistics provide a measure of agreement that takes into account the role of chance in the occurrence of concordant verdicts when estimating agreement between observers. 29 Depending on prevalence, κ statistics are liable to paradoxes, such as high agreement but low κ values, when the distribution of the cases is imbalanced among categories. 21 This problem may partly explain the low κ values of the intraobserver study on the 32-patient sample, in which tables were asymmetric.
We believe that paradoxes do not explain the poor precision found in the 37-patient sample assessed by the 9 reviewers, more balanced by the introduction of 5 "intermediate cases" and when agreement was low for each category.
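The prevalence paradox can be made concrete with a small worked example. The two 2 × 2 tables below are invented for illustration and are not study data; both show 80% raw agreement between 2 raters, yet the chance-corrected κ differs sharply once nearly all cases fall into one category. The function name `cohen_kappa_2x2` is hypothetical.

```python
def cohen_kappa_2x2(both_yes, yes_no, no_yes, both_no):
    """Cohen kappa and raw agreement from the 4 cells of a 2 x 2 table.

    both_yes: both raters say success     yes_no: rater 1 success, rater 2 failure
    no_yes:   rater 1 failure, rater 2 success     both_no: both say failure
    """
    n = both_yes + yes_no + no_yes + both_no
    observed = (both_yes + both_no) / n
    r1_yes = (both_yes + yes_no) / n   # marginal share of successes, rater 1
    r2_yes = (both_yes + no_yes) / n   # marginal share of successes, rater 2
    expected = r1_yes * r2_yes + (1 - r1_yes) * (1 - r2_yes)
    return (observed - expected) / (1 - expected), observed


if __name__ == "__main__":
    balanced = cohen_kappa_2x2(40, 10, 10, 40)     # successes and failures equally common
    imbalanced = cohen_kappa_2x2(80, 10, 10, 0)    # almost every case called a success
    print(balanced)     # kappa about 0.60 with 80% raw agreement
    print(imbalanced)   # kappa about -0.11 with the same 80% raw agreement
```

In the imbalanced table, chance alone is expected to produce 82% agreement, so an observed 80% yields a slightly negative κ. This is the mechanism that balancing the marginal sums of the contingency tables is meant to mitigate.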
Difficult questions arise when one tackles the problem of agreement regarding a treatment outcome. Surely there must be some reality regarding revascularization, but in the absence of a standard criterion, truth regarding the verdict of the test, its accuracy, is impossible to capture. To construct and assess our scales, we are left with "validity," a vague concept that attempts to secure the link between the measure and the phenomenon of interest, such as whether the scale makes intuitive sense (face validity), whether it conforms to theory (construct validity), and whether it allows the prediction of an important clinical outcome (predictive validity). 30 These considerations were taken into account when scales were designed. Revascularization can be conceived as angiographic recanalization of the primary arterial occlusive lesion (what the AOL attempts to capture) or by reperfusion in the arterial bed distal to the occlusion. 22 Reperfusion can be assessed by TIMI, originally used to estimate coronary blood flow after percutaneous angioplasty 31 and used in Prolyse in Acute Cerebral Thromboembolism II, 32 or by TICI, introduced by Higashida et al 7 to intuitively adapt the TIMI scheme to the cerebral circulation and used in IMS-II and III. 18 The results obtained with the AOL classification recanalization system in one study should not be compared with those obtained by using the TICI reperfusion scale in another, the former being more frequently associated with a verdict of success than the latter, at least in this study.
Interobserver disagreement in adjudicating treatment results may be caused by multiple problems: intrinsic ambiguities in the definitions of the classifications; discrepancies in the various ways the definitions are interpreted by various readers; and even if the definitions were understood in the same way, discrepancies in applying the definitions to individual cases. The current study cannot disentangle these various reasons for the discordance among observers. One way to improve agreement would be to modify the proposed classification and retest in a trial-and-error fashion the same portfolio to progressively improve repeatability. It was not feasible to independently test all classification systems and their various modifications in the same study by using the same portfolio and the same observers. Other scales could have been more repeatable than the ones we tested.
The TICI system has been criticized for internal inconsistencies, particularly regarding the 2a, 2b, 3 categories, 13 a problem clearly exposed by the present work, which confirms 2b as the category with the most frequent disagreements. The AOL recanalization categories, however, were also subject to discrepancies in interpretation; the contentious category was AOL II. It may be impossible to obtain consensus for these "gray zone" cases.
Modifications of the TICI scale have been proposed, for example, to get rid of the difficult 2b, 3 distinction between "complete" and "complete-but-slow" perfusion. 33 Others have proposed entirely different classifications. 16 The interobserver agreement between 2 radiologists assessing angiographic results according to the TIMI classification scheme in 38 patients has shown low weighted κ values (0.4; 95% CI, 0.2-0.6), 15 similar to those in our study. On the other hand, another report has previously shown better agreement among 3 observers by using the TIMI or a new Qureshi grading system in 15 randomly selected patients, with κ values in the range of 0.7. 16
By adding statistical noise, variability or lack of precision may affect results of studies comparing 2 treatments. This will, by necessity, impose methodologic adjustments such as increasing the number of patients to be recruited in the study to show a difference between 2 groups. The variability we observed in judging the success of the procedure was probably underestimated because there are many other sources of discrepancies in a core lab context compared with the electronic survey: There are more images, from various series, by using various projections and diverse equipment from various centers. Legitimate strategies to enhance precision include standardization of angiographic projections and techniques, using an operation manual, refining criteria defining the score classes, and training (or even certifying) the observers. Repeating the measurements by a number of observers, with resolution of discrepancies by consensus, can succeed in achieving a precision that is artificial; it is unclear, however, if such verdicts are more valid. 34
Predictive validity is certainly important; clinical outcomes have been correlated to revascularization in acute ischemic stroke, 9 though many other factors (collateral circulation, penumbra, eloquence of the vascular territory, and so forth) may also impact outcomes. 7 To attempt to incorporate all potential factors in a single, intuitively appealing but complex scale with multiple categories may increase the variability of interpretations and, consequently, decrease precision, when the time comes to assess interobserver agreement. 35 Various other scales have been described 15,16,36 in attempts to intuitively enhance validity. The real test for any scale, however, will eventually come with usage, if it is used. To propose yet another classification could only add to the confusion that already plagues this field. 14
The present work suggests that it is possible to live with the current reperfusion and recanalization scales, provided a number of precautions are kept in mind. First, any comparison among revascularization methods or devices should be given credibility when performed within a randomized trial, by using predefined, simple ordinal scales adjudicated by an independent central laboratory staffed with experienced observers masked to treatment allocation. While the verdicts of various observers can be alarmingly divergent for the 2 scales we studied, κ values can reach satisfactory levels of agreement (>0.6) when results are dichotomized as success or failure. Surrogate angiographic outcomes can serve a useful function if only to help explain or understand reasons for disappointing clinical results. Clinical outcomes, typically translated into the modified Rankin Scale scores at 3 months, are probably better end points for major clinical trials. 5,18

Limitations
The electronic survey was designed to ease the assessment by multiple interventional neuroradiologists. Nine of 12 potential observers responded. Readers had access to only 4 selected images to evaluate results according to the TICI and AOL scores. We can only speculate what the results would be if missing responses had been available. How seriously observers worked to come to verdicts can always be questioned, and the context of the assessment is certainly different from typical clinical work or from a core lab context. It may not be realistic to expect readers to assess perfusion on a few static images. The intra- or interobserver agreement was, however, no better when 3 neuroradiologists had access to all images. Posterior circulation strokes were not included, and this feature may have decreased variability in interpretations. New recommendations on angiographic revascularization grading scales have now been published. 36 They include a modified TICI scale, slightly different from the one we used, which is also subject to variability in interpretation. 37

CONCLUSIONS
Recanalization or reperfusion scales are interpreted differently by different observers. Rather than yet another classification scheme, we propose to dichotomize results for analyses and comparisons.