Use of Standardized Uptake Value Ratios Decreases Interreader Variability of [18F] Florbetapir PET Brain Scan Interpretation

BACKGROUND AND PURPOSE: Fluorine-18 florbetapir is a recently developed β-amyloid plaque positron-emission tomography imaging agent with high sensitivity, specificity, and accuracy in the detection of moderate-to-frequent cerebral cortical β-amyloid plaque. However, the FDA has expressed concerns about the consistency of interpretation of [18F] florbetapir PET brain scans. We hypothesized that incorporating automated cerebral-to-whole-cerebellar standardized uptake value ratios into [18F] florbetapir PET brain scan interpretation would reduce this interreader variability. MATERIALS AND METHODS: This randomized, blinded-reader study used previously acquired [18F] florbetapir scans from 30 anonymized patients who were enrolled in the Alzheimer's Disease Neuroimaging Initiative 2. In 4 separate, blinded-reading sessions, 5 readers classified 30 cases as positive or negative for significant β-amyloid deposition either qualitatively alone or qualitatively with additional adjunct software that determined standardized uptake value ratios. A κ coefficient was used to calculate interreader agreement with and without the use of standardized uptake value ratios. RESULTS: There was complete interreader agreement on 20/30 cases of [18F] florbetapir PET brain scans by using qualitative interpretation and on 27/30 scans interpreted with the adjunct use of standardized uptake value ratios. The κ coefficient for the studies read with standardized uptake value ratios (0.92) was significantly higher compared with the qualitatively read studies (0.69, P = .006). CONCLUSIONS: Use of standardized uptake value ratios improves interreader agreement in the interpretation of [18F] florbetapir images.

ters the brain and specifically binds to cortical fibrillar β-amyloid. 5,6 Pathologic findings demonstrate the high sensitivity (87%), specificity (95%), and accuracy (90%) of [18F] florbetapir in the detection of moderate-to-frequent cortical β-amyloid plaque by using the Consortium to Establish a Registry for Alzheimer's Disease criteria. 7 Sufficient concern about the consistency of [18F] florbetapir PET brain scan interpretation led the FDA to withhold approval until an interpretation training program was implemented to reduce interreader variability. 8-10 Imaging and pathologic studies have previously demonstrated that cerebral cortical regions, including the frontal lobe, parietal lobe, temporal lobe, precuneus, and anterior and posterior cingulate gyrus, are regions in which β-amyloid deposition is commonly found in patients with AD. 5,11 These findings motivated quantitative analysis of [18F] florbetapir PET brain images, comparing differential binding between these cortical regions to the whole cerebellum, a site not prone to amyloid deposition, expressed as cerebral-to-whole-cerebellar standardized uptake value ratios (SUVr). 5,12 Subsequent pathologic analysis demonstrated high sensitivity (97%), specificity (100%), and accuracy (98%) between [18F] florbetapir PET SUVr and postmortem immunohistochemical measurements of β-amyloid. 5,7,12 These studies showed that a mean [18F] florbetapir SUVr value of >1.17 was strongly associated with an intermediate-to-high likelihood of a neuropathologic diagnosis of AD. 5,12 However, SUVr are onerous to calculate manually, and manual placement of ROIs is prone to variability. Computer assistance could provide an easier method to incorporate SUVr into the interpretation process, 13,14 and semiautomated software has been developed to facilitate SUVr calculations.
The current standardized methodology for amyloid PET brain interpretation does not use SUVr, 14 which might be useful to improve reader agreement. We hypothesized that adding SUVr to qualitative image interpretation of [18F] florbetapir PET brain scans would reduce interreader variability.

Participants
This randomized, blinded-reader study used previously acquired [18F] florbetapir scans from 60 anonymized patients (37 men and 23 women) who were enrolled in the Alzheimer's Disease Neuroimaging Initiative (ADNI) 2 and had already provided written informed consent. Fluorine-18 florbetapir PET brain scans were obtained at multiple sites; however, all sites followed the same ADNI 2 protocol. 15 Studies were randomly selected from the ADNI 2 population database and anonymized. Patient age ranged from 55 to 94 years; each patient had an established clinical diagnosis of normal, early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), or early AD (EAD). We chose these groups (normal, EMCI, LMCI, and EAD) because their [18F] florbetapir PET studies were expected to be the most challenging to interpret and, therefore, could better test the potential benefit of using SUVr. In consultation with a biostatistician before the study, we determined that with 5 readers, approximately 30 cases would provide enough statistical power to test whether the addition of SUVr would provide a substantial improvement in reader consistency for our study. Because of the large ADNI database and the need to include only these groups, an uninvolved third party was used to select cases accordingly. The investigators played no role in choosing the cases, to maintain the blinded nature of the study. This study was approved by the local institutional review board.

Image Analysis and Reader Study
Five nuclear medicine board-certified or eligible physicians with no prior clinical experience in interpreting [18F] florbetapir PET brain scans (though some had experience with the studies in a research setting) and with varying years of clinical experience in nuclear medicine (3 readers had ≤3 years of experience, 1 reader had 8 years of experience, and 1 reader had 18 years of experience) underwent on-line [18F] florbetapir PET clinical training (http://www.amyvidtraining.com; Avid Radiopharmaceuticals, Philadelphia, Pennsylvania) before initiation of the study. The training included information about [18F] florbetapir, β-amyloid, and Alzheimer disease followed by demonstrations and self-assessment cases on study interpretation.
Fluorine-18 florbetapir PET brain scans of 60 participants were given 2 unique and random identifiers; each case was evaluated twice, once qualitatively and once with the inclusion of SUVr information. Each case was classified as either positive or negative for cortical β-amyloid deposition. All 60 cases were assigned to 4 reading sessions separated by at least 72 hours to avoid reader memory. All assignments were random except that no case was repeated during an individual session. For each case, the order of qualitative and SUVr-aided reads was also random.
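The assignment constraints described above can be illustrated with a minimal Python sketch. It is not the procedure actually used in the study, which is not described in further detail: the function name `assign_reads`, the mode labels, and the choice to draw 2 distinct sessions per case (which does not balance session sizes) are assumptions made only for illustration.

```python
import random


def assign_reads(case_ids, n_sessions=4, seed=None):
    """Randomly place the 2 reads of each case into different sessions.

    Hypothetical helper: each case gets one qualitative read and one
    SUVr-aided read, in random order, never within the same session.
    """
    rng = random.Random(seed)
    sessions = list(range(1, n_sessions + 1))
    schedule = {s: [] for s in sessions}
    for case in case_ids:
        # Two distinct sessions guarantee no case repeats within a session.
        early, late = sorted(rng.sample(sessions, 2))
        # Shuffling the modes randomizes which read type comes first.
        modes = ["qualitative", "qualitative+SUVr"]
        rng.shuffle(modes)
        schedule[early].append((case, modes[0]))
        schedule[late].append((case, modes[1]))
    # Randomize presentation order within each session.
    for reads in schedule.values():
        rng.shuffle(reads)
    return schedule


if __name__ == "__main__":
    plan = assign_reads(range(1, 61), seed=2024)
    for session, reads in plan.items():
        print(session, len(reads))
```

With 60 cases this yields 120 reads in total; the minimum 72-hour separation between sessions is a scheduling matter outside the scope of the sketch.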
We divided the 60 cases into 2 groups, with the initial 30 cases used as a lead-in to give all readers a more similar experience in evaluating amyloid PET studies and using the SUVr software. The remaining 30 cases were used as the test set. Readers never received feedback about their interpretations. Readers qualitatively interpreted the scans by using MIMfusion (MIM Software Inc, Cleveland, Ohio) by determining the presence or absence of cortical β-amyloid deposition according to the clinical interpretation methodology. Axial, sagittal, and coronal images were presented, and the reader could manipulate image contrast to accentuate the gray-white interface as recommended by the training program.
SUVr cases were reviewed qualitatively on MIMfusion, and additionally SUVr were calculated automatically by using Scenium (Siemens, Erlangen, Germany). The SUVr software automatically registers [18F] florbetapir scans to the Montreal Neurological Institute atlas space and calculates the SUVr values without requiring any input from the user. 16 However, the reader could review and manipulate the ROIs to fit the cerebral cortical gray matter and whole cerebellum, to ensure proper anatomic registration. Readers recorded average cortical SUVr of these regions for each case along with the positive/negative interpretation. Because the SUVr values for all 5 readers were very similar, the manipulation of ROIs was at most minimal in all cases. Cortex-to-whole-cerebellum SUVr values were automatically calculated and presented by the software for 6 anatomically relevant cortical ROIs: frontal lobe, lateral parietal lobe, lateral temporal lobe, precuneus, and anterior and posterior cingulate gyrus, as well as the mean SUVr of these regions. 12 Prior studies demonstrated that with the aid of the average SUVr from these anatomic regions, sensitivity and specificity of reads increased by using both clinical diagnosis 17 and pathology 7 as comparisons. Although we chose to use the Scenium software to calculate SUVr, commercial programs from other vendors are available for calculating SUVr. 5,12 A recent study 18 showed a high correlation between SUVr calculated by Scenium and other methods. 5,7,19 Prior research suggested that a threshold for amyloid positivity was at SUVr ≥ 1.17. 5,12 All readers were informed of this threshold. However, the SUVr value was available to the reader as an adjunct to assist in the primary qualitative interpretation. Therefore, the final interpretation relied on the reader's overall judgment, incorporating both the qualitative image data and SUVr. Sample cases are shown in Figs 1-3.
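As an illustration of the ratio described above, the following Python sketch computes cortex-to-whole-cerebellum SUVr for the 6 cortical regions and their mean. It does not reproduce the Scenium software, whose registration and ROI handling are not shown here; the function names, the region keys, and the example uptake values are hypothetical, and the comparison uses the ≥ 1.17 threshold quoted in this section purely as an adjunct flag.

```python
from statistics import mean

# Six cortical ROIs named in the text (keys are illustrative labels).
CORTICAL_REGIONS = (
    "frontal",
    "lateral_parietal",
    "lateral_temporal",
    "precuneus",
    "anterior_cingulate",
    "posterior_cingulate",
)
SUVR_THRESHOLD = 1.17  # threshold cited in the text; readers used it as an aid only


def regional_suvr(cortical_uptake, cerebellar_uptake):
    """Return {region: regional uptake / whole-cerebellum uptake}."""
    if cerebellar_uptake <= 0:
        raise ValueError("whole-cerebellum uptake must be positive")
    return {r: cortical_uptake[r] / cerebellar_uptake for r in CORTICAL_REGIONS}


def mean_suvr(cortical_uptake, cerebellar_uptake):
    """Mean SUVr across the 6 cortical regions."""
    return mean(regional_suvr(cortical_uptake, cerebellar_uptake).values())


def exceeds_threshold(suvr, threshold=SUVR_THRESHOLD):
    """True when the SUVr is at or above the threshold."""
    return suvr >= threshold


if __name__ == "__main__":
    # Made-up mean uptake values, for demonstration only.
    uptake = {
        "frontal": 1.40, "lateral_parietal": 1.35, "lateral_temporal": 1.30,
        "precuneus": 1.50, "anterior_cingulate": 1.45, "posterior_cingulate": 1.55,
    }
    value = mean_suvr(uptake, cerebellar_uptake=1.10)
    print(round(value, 2), exceeds_threshold(value))
```

In the study itself the flag would not decide the read: the final classification remained the reader's overall judgment.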

Statistical Analyses
The data were analyzed by using the R Statistical Computing Environment (http://www.r-project.org/). 20 To assess interrater reliability, we calculated the Fleiss multirater κ statistics for the 2 conditions separately (qualitative versus qualitative + SUVr) by using the "irr" package. 21 Confidence intervals for each were calculated separately via bootstrap by using 1000 replicates. Statistical comparison was accomplished by the method for comparing correlated κ statistics described by Vanbelle and Albert. 22 Briefly, for 2 correlated κ values, a difference score can be calculated such that d = κ_SUVr - κ_qualitative. The distribution of this statistic can be estimated via bootstrap by calculating the difference score for q subsamples with replacement. A new estimator is then calculated that under the null hypothesis (κ_qualitative = κ_SUVr) follows a t distribution with q - 1 df: t = mean(d) / SD(d), where the mean and SD are taken over the q bootstrapped difference scores. For the current study, a bootstrap analysis by using 1000 replicates was used for hypothesis testing. An α of .05 was set as the threshold for statistical significance. Kramer tests. We did not measure intrareader data per recommendations of the annual Clinical Trials Methodology Workshop of the Radiological Society of North America (https://www.rsna.org/Clinical_Trials_Methodology_Workshop.aspx).
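The analysis was performed in R with the "irr" package; the Python sketch below only illustrates the general idea and is not the authors' code. It assumes the standard Fleiss κ formula and a paired bootstrap over cases in which the difference score is taken as κ with SUVr minus κ without, and the test statistic as the mean of the bootstrapped differences divided by their SD. The function names and the toy ratings in the usage example are hypothetical.

```python
import math
import random
from statistics import mean, stdev


def fleiss_kappa(ratings):
    """Fleiss kappa for a list of cases, each a list of labels (one per reader)."""
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for case in ratings for label in case})
    # counts[i][j]: number of readers assigning case i to category j
    counts = [[case.count(c) for c in categories] for case in ratings]
    # Observed agreement per case, then averaged over cases.
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = mean(p_i)
    # Chance agreement from the overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_cases * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # undefined when every rating is the same category
    return (p_bar - p_e) / (1 - p_e)


def bootstrap_kappa_difference(qual, suvr, replicates=1000, seed=0):
    """Paired bootstrap of d = kappa(suvr) - kappa(qual); returns (mean, SD, t)."""
    rng = random.Random(seed)
    n = len(qual)
    diffs, attempts = [], 0
    while len(diffs) < replicates:
        attempts += 1
        if attempts > 100 * replicates:
            raise RuntimeError("kappa undefined in too many resamples")
        # Resample the same cases for both conditions to keep the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        k_q = fleiss_kappa([qual[i] for i in idx])
        k_s = fleiss_kappa([suvr[i] for i in idx])
        if math.isnan(k_q) or math.isnan(k_s):
            continue  # skip resamples where kappa is undefined
        diffs.append(k_s - k_q)
    d_mean, d_sd = mean(diffs), stdev(diffs)
    t = d_mean / d_sd if d_sd > 0 else float("nan")
    return d_mean, d_sd, t


if __name__ == "__main__":
    # Toy data: 6 cases, 5 readers, 1 = positive, 0 = negative.
    qual = [[1, 1, 1, 1, 1], [0, 0, 0, 0, 0], [1, 1, 0, 1, 0],
            [0, 0, 1, 0, 0], [1, 1, 1, 1, 0], [0, 0, 0, 0, 0]]
    suvr = [[1] * 5, [0] * 5, [1] * 5, [0] * 5, [1] * 5, [0] * 5]
    print(fleiss_kappa(qual), fleiss_kappa(suvr))
    print(bootstrap_kappa_difference(qual, suvr, replicates=200))
```

Resampling whole cases, rather than individual reads, preserves the correlation between the 2 conditions, which is the point of a method for correlated κ values.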

RESULTS
One of the input images could not be successfully registered to the template for quantification due to nonstandard patient positioning. Most important, no disease process (eg, normal pressure hydrocephalus) or anatomic variant precluded this case being aligned to the template, and no similar issue arose with the other 59 cases. In this case, the readers could do only qualitative interpretation for both the qualitative and qualitative + SUVr reads. There was complete interreader agreement on 20 of 30 cases of [18F] florbetapir PET brain scans by using qualitative-only interpretation and 27 of 30 scans interpreted with the adjunct use of the SUVr (On-line Table). Quantitative measures of interrater reliability confirmed excellent agreement between raters when using qualitative analysis alone (κ = 0.69; 95% CI, 0.50-0.82). The addition of SUVr data resulted in near-perfect agreement (κ = 0.92; 95% CI, 0.79-0.97). Interrater agreement was significantly increased with the addition of SUVr data (t = 2.51, P = .006) after adjusting for the correlated nature of the data.
Group differences in global SUVr were statistically significant (P < .005) with group means and SDs as follows: normal (1.

DISCUSSION
AD is the most common dementia to affect the elderly and traditionally has been diagnosed clinically. 1,2 However, 10%-20% of patients clinically diagnosed with AD lack pathologic findings at postmortem examination. 24 Improved diagnosis could aid in medical and personal decision-making. Furthermore, on the basis of the suspected role of ␤-amyloid in the pathophysiology of AD, it has emerged as a potential drug target. In evaluating and potentially using such therapies, reliably establishing the presence of ␤-amyloid deposition would be of paramount importance. Recently, the Society of Nuclear Medicine and Molecular Imaging and the Alzheimer's Association proposed PET amyloid imaging appropriate-use guidelines for patients who meet specific criteria. 25 For the test to be of greatest clinical utility, interreader variability needs to be minimized. We hypothesized that incorporating a method of quantification to standard image interpretation of [ 18 F] florbetapir PET brain scans by the addition of SUVr would reduce interreader variability.
Our results show a significantly higher (P = .006) interreader agreement when [18F] florbetapir PET scans were evaluated with adjunctive SUVr data (κ = 0.92) compared with qualitative-only assessment of the same studies (κ = 0.69). This κ value of 0.69 is similar to the value in another study with 5 readers visually assessing 59 cases (κ = 0.76) and in a study with 2 readers visually assessing 46 cases (κ = 0.71). 17 The κ values of our study are also similar to those seen with other β-amyloid imaging agents such as 11C Pittsburgh compound-B. 26 The impact of SUVr data on the κ values reported here is also higher than that seen in interobserver variability studies of lesion detection in other organ systems, for example, in the detection of pulmonary nodules with (κ = 0.67) 27 and without the use of computer-assisted detection software (κ = 0.64) 28 and the detection of breast lesions by using automatic breast scanners (κ = 0.8). 29 When [18F] florbetapir PET brain studies were read qualitatively, there was interreader disagreement in 9/30 (30%) cases; however, there was complete agreement between readers for 8 of these cases when independently evaluated with semiquantitative indices. With the adjunct use of the SUVr, there was interreader disagreement on 3/30 (10%) cases. One case was discrepant (interreader disagreement) on both qualitative and SUVr reads (the case in which the software failed to register the study to the atlas). In one of the remaining 2 discrepant cases, there was interreader disagreement on both the qualitative assessment and the assessment with the aid of SUVr; and on the other case, 1 reader interpreted the study differently than the others only on assessment with the aid of SUVr. The interreader average SUVr on these cases were 1.27 (range, 1.24-1.35) and 1.15 (range, 1.13-1.21) and were the closest to the threshold value of 1.17 5,7,12 of the test cases (Fig 4).
Therefore, we hypothesized that interreader disagreement on these 2 cases was probably a result of visually borderline scans because both of these subjects had diagnoses of early mild cognitive impairment, which often has intermediate values of SUVr. 5,7,12 Although we did not directly assess the extent of ROI manipulation by the readers, the manipulation by all readers was, at most, minimal as shown by the small SD between interreader average SUVr (≤5%, On-line Table). These findings are concordant with a prior study that demonstrated minimal variance in the interrater reliability of manual and automated ROI delineation for Pittsburgh compound-B PET. 30 There was no clear relationship between years of post-board certification experience and the number of reads that were discrepant with the majority interpretation of individual cases (Table).
Our results also show a significant difference (P < .0002) in the values of SUVr between normal controls and patients with early mild and late mild cognitive impairment or EAD, with progressively higher SUVr values seen in patients with early mild and late mild cognitive impairment or EAD compared with normal patients, concordant with findings of prior studies. 4,14,19 Comparison of SUVr between the EMCI and LMCI groups did not demonstrate a significant difference.
Our findings demonstrate that cases that had complete interreader agreement as positive for significant β-amyloid deposition with the aid of semiquantitative analysis but not complete agreement when qualitatively assessed had an interreader mean SUVr value of 1.32 ± 0.0 (n = 3). Cases that had complete agreement for positive scan findings by both methods had an interreader mean SUVr of 1.58 ± 0.15 (n = 12). These findings suggest that cases with values of SUVr closer to the cutoff value of 1.17 are often visually challenging and can therefore contribute to discrepant reads with qualitative assessment. However, a similar divergence in the mean SUVr was not seen in cases with discrepant interpretations by qualitative evaluation that were uniformly interpreted as negative with the aid of SUVr (interreader mean SUVr = 0.97 ± 0.1; n = 5) compared with cases that had complete interreader agreement for negative scan findings by both methods (interreader mean SUVr = 1.01 ± 0.09; n = 7). This finding is congruent with findings from prior studies 5,12 and may be explained by the decreased [18F] florbetapir uptake seen in patients without significant β-amyloid cerebral cortical deposition, which creates a narrower range for SUVr values (Fig 4).
We designed our experiment to include 30 practice cases because all 5 readers had no prior clinical experience in interpreting [18F] florbetapir PET brain scans (which had just been approved at the time of this study) and had varying amounts of research experience and varying years of experience in nuclear medicine. Therefore, we thought that the practice cases would help readers gain similar familiarity with the software and method. In the original 30 practice cases used as a lead-in, there was moderate interreader agreement both with qualitative-only interpretation (κ = 0.56) and with the adjunct use of the SUVr (κ = 0.55). These values were significantly different (P < .05) from the values achieved in our test set of 30 cases. This discrepancy is likely the result of inexperience and varying early proficiency in using the software. Improved agreement between readers was seen in our test set of the 30 cases for qualitative-only interpretation, emphasizing the importance of practice on physician performance. Improved agreement between readers on the test set compared with the training set when using SUVr signifies the need for familiarization with image analysis software and is congruent with findings seen in using other computer-aided diagnostic software such as in the detection of lesions on mammograms. 31 Finally, the κ values of all 60 cases evaluated with the aid of SUVr were higher than those of the same cases when only qualitatively evaluated, showing a trend toward statistical significance (P = .09).
We could not determine the accuracy of interpretations in this study because no criterion standard pathologic data were available for the cohort. Clinical diagnosis is not the criterion standard for diagnosing Alzheimer disease 32 and can be an unreliable diagnostic tool with 10%-20% of patients diagnosed with Alzheimer disease lacking pathology on postmortem histopathologic analysis 24 ; clinical diagnosis has intermediate sensitivity (84%) and low specificity (52.5%) in diagnosing Alzheimer disease. 33 Our findings show that 52% of mild cognitive impairment (EMCI + LMCI) cases were interpreted with complete interreader agreement as β-amyloid-positive with the aid of SUVr, while 37% of these mild cognitive impairment cases were interpreted with complete interreader agreement as β-amyloid-positive with qualitative-only assessment. Fourteen percent of cognitively normal cases were interpreted with complete interreader agreement as β-amyloid-positive, both with only qualitative assessment and with the aid of SUVr (On-line Table). These findings are congruent with prior imaging studies 13 and are within range of cortical β-amyloid deposition seen in postmortem case studies in cognitively normal, mildly cognitively impaired, and patients with Alzheimer disease. 33-35 Our findings also demonstrate that the relationship between cognitive decline and the amount of cortical β-amyloid deposition is variable because we see studies with positive findings in all of our experimental groups (normal, EMCI, LMCI, and AD) and they are compatible with prior pathologic studies. 36 The degree of cortical β-amyloid deposition could have prognostic importance because recent studies have demonstrated that cognitively normal, mild cognitively impaired, and subjects with Alzheimer disease who have PET scans positive for amyloid demonstrate greater cognitive and global deterioration during an 18-month 37 and 3-year 38 follow-up than subjects with scans with negative findings.
We did not use the cutoff value of SUVr of 1.17 solely as a method to quantitatively interpret the scans. The study that determined this value had a small sample size of 19 patients, and studies in larger community-based samples with a broader distribution of SUVr would be needed to more definitively establish standard thresholds. Furthermore, the applicability of this value obtained from whole-brain cortical uptake to use in regional values is unknown. 12 In our study, the readers were aware of this empiric threshold and could choose to use it when evaluating the calculated regional and averaged regional SUVr value from [18F] florbetapir PET images.

Table: Comparison of years of experience in diagnostic imaging and discrepant reads (versus majority consensus) in 30 test cases
Years of Experience (after Radiology      Discrepant Reads on         Discrepant Reads
or Nuclear Medicine Residency)            Qualitative Assessment      with Aid of SUVr
2                                         3                           0
3                                         6 a                         2 a
3                                         1                           1
8                                         3                           0
18                                        1                           0
a In 1 case, only qualitative interpretation for both the qualitative and qualitative + SUVr reads could be provided because input images could not be successfully registered to the template for quantification due to nonstandard patient positioning.

We did not measure the intraobserver variability because we wanted to emulate the reading sessions as a standard clinical practice in which studies are assigned a single interpretation by an individual physician reader. In this regard, our methodology was similar to that in other studies examining interreader performance in diagnostic imaging with and without an intervention such as computer-aided diagnostic software. Parallel methodology has been used in lung disease, 39-43 breast imaging, 29,44-46 and Alzheimer disease, 26,47 without determining intraobserver variability.
The most important limitation of our study is the absence of the criterion standard of pathologic analysis to establish the presence or absence of cerebral cortical β-amyloid deposition in our cohort of subjects to determine the accuracy of physician interpretations of the [18F] florbetapir PET studies. As such, we could not determine whether the use of SUVr improved diagnostic accuracy; our primary aim was to assess interreader variability. Prior studies have demonstrated a high correlation between PET SUVr and immunohistochemical measurements of β-amyloid, 5,7,12 suggesting that improving agreement will likely improve diagnostic accuracy. However, while high interreader agreement is desirable in diagnostic testing, it will be important to directly evaluate the effect on diagnostic accuracy in future studies, especially in patients with mild cognitive impairment because prior studies have primarily focused on determining the accuracy of amyloid PET in cognitively normal individuals or patients with probable Alzheimer disease. 5,7 Second, although it was meant as an adjunct tool, we did not determine or prescribe the degree to which readers used the SUVr values in determining their interpretations of scans. Third, due to differences in reader ROI manipulation, there was minimal variance in average SUVr values on the test cases; therefore, this minimal variance is a potential weakness of a semiautomated method. Fourth, 1 case from our 30 test cases did not successfully register to the template for quantification due to nonstandard patient positioning. Therefore, registration errors and other technical failures of the software are an additional potential weakness of such a semiautomated method.

CONCLUSIONS
Our results support the use of SUVr to improve interreader agreement in the interpretation of [18F] florbetapir images. Furthermore, using computer software to obtain values of SUVr can be an appealing and efficient option for nuclear medicine physicians and radiologists in interpreting [18F] florbetapir PET brain scans and other brain imaging agents. The promising results from this initial study support future larger and prospective studies, including determining the performance of semiquantification strategies for [18F] florbetapir and other β-amyloid radiopharmaceuticals to establish ranges for negative and positive, compared against clinical and histopathologic reference standards.

ACKNOWLEDGMENTS
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative database (adni.loni.ucla.edu), a $60 million public-private partnership launched in 2003 by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration, private pharmaceutical companies, and nonprofit organizations. The primary goal of ADNI has been to test whether serial brain imaging studies, clinical and neuropsychological assessments, and other biologic markers can be combined to measure the progression of mild cognitive impairment and early AD to aid researchers and clinicians in developing new treatments and monitoring their effectiveness. ADNI, led by Principal Investigator Michael W. Weiner, MD, VA Medical Center and University of California, San Francisco, is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations, and subjects have been recruited from >50 sites across the United States and Canada. For up-to-date information and descriptions of PET imaging acquisition parameters, see www.adni-info.org.