Visual-Statistical Interpretation of 18F-FDG-PET Images for Characteristic Alzheimer Patterns in a Multicenter Study: Inter-Rater Concordance and Relationship to Automated Quantitative Evaluation

BACKGROUND AND PURPOSE: The role of 18F-FDG-PET in the diagnosis of Alzheimer disease is increasing and should be validated. The aim of this study was to assess the inter-rater variability in the interpretation of 18F-FDG-PET images obtained in the Japanese Alzheimer's Disease Neuroimaging Initiative, a multicenter clinical research project. MATERIALS AND METHODS: This study analyzed 274 18F-FDG-PET scans (67 mild Alzheimer disease, 100 mild cognitive impairment, and 107 cognitively normal) as baseline scans for the Japanese Alzheimer's Disease Neuroimaging Initiative, which were acquired with various types of PET or PET/CT scanners in 23 facilities. Three independent raters interpreted all PET images by using a combined visual-statistical method. The images were classified into 7 (FDG-7) patterns by the criteria of Silverman et al and further into 2 (FDG-2) patterns. RESULTS: Agreement among the 7 visual-statistical categories by at least 2 of the 3 readers occurred in >94% of cases for all groups: Alzheimer disease, mild cognitive impairment, and cognitively normal. Perfect matches by all 3 raters were observed for 62% of the cases by FDG-7 and 76% by FDG-2. Inter-rater concordance was moderate by FDG-7 (κ = 0.57) and substantial by FDG-2 (κ = 0.67) on average. The FDG-PET score, an automated quantitative index developed by Herholz et al, increased as the number of raters who voted for the AD pattern increased (ρ = 0.59, P < .0001), and the FDG-PET score decreased as the number who voted for the normal pattern increased (ρ = −0.64, P < .0001). CONCLUSIONS: Inter-rater agreement was moderate to substantial for the combined visual-statistical interpretation of 18F-FDG-PET and was also significantly associated with automated quantitative assessment.

tical tools (visual-statistical), and automated quantitative analysis, but the relationship between the latter 2 of these approaches has been little explored, to our knowledge. Visual interpretation features comprehensive and flexible assessment of the qualitative radioactivity distribution by the reader, who may look into all features across the brain. This approach appears effective because patients with AD typically present with characteristic temporoparietal hypometabolism known as the "AD pattern." However, inter-rater variability inevitably occurs because each rater has his or her own experience and criteria, especially for borderline cases, and this variability can potentially be increased or decreased when the reader also takes into account statistical information provided by various software display tools.
On the other hand, quantitative analysis traditionally extracts radioactivity uptake values of the region of interest, placement of which is a subjective matter requiring experience. Although a recently developed anatomic standardization technique can define ROIs automatically and further allows voxelwise statistical analysis to generate Z-maps, standardization may not always be accurate and may require adjustment by a human observer. Although these region-of-interest values can be processed into a numeric indicator such as an FDG-PET score 4,5 and a cutoff level can be determined, a single indicator may not be as accurate as complex and comprehensive evaluation by expert readers. As a result, a "combined" approach of visual and quantitative evaluation is often used during image interpretation, in which the readers examine both the tomographic PET images and the result of region-of-interest analysis and/or a Z-map.
Inter-rater variability and comparison between visual reading and software-based evaluation have been studied by some investigators on brain 18F-FDG-PET. Ng et al 6 studied the inter-rater variability of 15 patients with AD and 25 cognitively normal subjects (NCs) and reported that visual agreement between 2 readers was good (κ = 0.56). Tolboom et al 7 studied the variability of 20 patients with AD and 20 NCs and reported that agreement between 2 readers was moderate (κ = 0.56). Rabinovici et al 8 also reported the inter-rater agreement of 18F-FDG (κ = 0.72). However, the data of these preceding studies were acquired with a single scanner in a single site and were evaluated by readers belonging to the institution who were used to the scanner and its image quality. In addition, the studied subjects did not include patients with MCI, in whom PET findings featuring AD, if any, are mild and may make the discrimination challenging. Furthermore, inter-rater variability for combined interpretation of visual and statistical analysis has never been reported, to our knowledge.
In the present study, we analyzed the baseline scans of 18F-FDG in a multicenter clinical project named Japanese Alzheimer's Disease Neuroimaging Initiative (J-ADNI) 9 and evaluated the inter-rater variability among 3 independent expert raters who were blinded to the clinical information and interpreted the PET images to evaluate the characteristic AD pattern in 18F-FDG-PET on the basis of a combined visual-statistical evaluation. The raters looked at the 3D stereotactic surface projection Z-map of 18F-FDG-PET visually as well as the 18F-FDG tomographic images because it is considered the standard means of human interpretation of 18F-FDG-PET images in Japan and therefore was adopted as the official interpretation method in J-ADNI. Images were also assessed by automated quantitative analysis by using an FDG-PET score, which was derived from ADtsum, 4,5 and were compared with the visual-statistical rating by the 3 raters and with their consensus.

Subjects
Data used in the present study were obtained from J-ADNI. 9 This project was approved by the ethics committee of each site in which J-ADNI data were acquired, and written informed consent was obtained from each subject before participating in J-ADNI. All subjects were native Japanese speakers, 60-84 years of age, and were registered as 1 of 3 clinical groups (mild AD, MCI, or NC). Subjects of the mild AD group scored 20-26 in the Mini-Mental State Examination-Japanese and 0.5-1.0 in the Clinical Dementia Rating-Japanese and were compatible with the probable AD criteria of the National Institute of Neurologic and Communicative Disorders and Stroke and the Alzheimer Disease and Related Disorders Association. 10 Subjects of the MCI group scored 24-30 in the Mini-Mental State Examination-Japanese and 0.5 in the Clinical Dementia Rating-Japanese. Subjects of the NC group scored 24-30 in the Mini-Mental State Examination-Japanese and 0 in the Clinical Dementia Rating-Japanese. The exclusion criteria were depression (Geriatric Depression Scale-Japan ≥ 6), cerebrovascular disorders (Hachinski Ischemic Score ≥ 5), and other neurologic or psychiatric disorders.
Enrollment in each clinical group for J-ADNI was primarily determined by the referring physician, and 303 consecutive subjects entered the study to undergo 18F-FDG-PET scanning. A thorough central review of the clinical and behavioral data by expert psychiatrists and psychologists excluded 29 cases that had erroneous assessment of the cognitive test results, depression or cerebrovascular disorders that had been overlooked, prohibited concomitant medications, or other deviations from the criteria. As a result, 274 baseline 18F-FDG-PET scans (67 mild AD, 100 MCI, and 107 NC) were analyzed in the present study.

PET Imaging
As a quality assurance measure necessary for the multicenter study, all PET sites in J-ADNI were qualified for the PET scanner and other devices, resting-state environment, quality of the onsite-produced PET drugs, and so forth before scanning of the first subject. Intersite differences were minimized by standardizing the imaging protocol, and interscanner differences were addressed with the Hoffman 3D phantom data. 11 The data used for the analysis in the present study were acquired with 14 types of PET or PET/CT scanners in 23 PET centers.
In the 18F-FDG-PET scans, all subjects fasted for at least 4 hours and their preinjection blood glucose levels were confirmed to be <180 mg/dL. Intravenous administration of 18F-FDG (185 ± 37 MBq) was followed by a resting period of 30 minutes in a dimly lit and quiet room. Dynamic scans (300 seconds × 6 frames) were obtained starting 30 minutes postinjection in the 3D mode. Attenuation was corrected for by a transmission scan with segmentation for dedicated PET and by a CT scan for PET/CT.
All the PET images acquired in each PET site went through the J-ADNI PET quality control process, 11 in which head motion between frames was corrected for and bad frames were removed to create sum frame images. Then the images were reoriented to the anterior/posterior commissure line with the same matrix size and voxel size so that all camera models presented images of similar orientation and appearance to the viewer and were then passed on to image interpretation.
The 18F-FDG-PET images that had passed through the quality control process above were also treated with a 3D stereotactic surface projection technique to generate z score maps (displayed with upper = 7 and lower = 0) by using iSSP software, Version 3.5 (Nihon Medi-Physics, Tokyo, Japan). The normal database used for generating the Z-maps was made by a method of leave-one-out cross-validation based on 25 healthy subjects of J-ADNI (11 men and 14 women; mean age, 66.0 ± 4.8 years) who were interpreted as having a normal pattern by one of the coauthors of the study. The Z-maps were used not for the automated quantification but as part of the information for human raters in the visual-statistical interpretation.
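The idea of a z score computed against a normal database, with leave-one-out handling of the normal subjects themselves, can be illustrated with a minimal sketch. It is not the iSSP implementation: it treats a single surface pixel at a time, and the sign convention (positive z for uptake below the normal mean) and all function names are assumptions made for illustration. Only the display window of 0 to 7 is taken from the text.

```python
from statistics import mean, stdev


def z_score(value, normals):
    """Z score of one pixel value against normal values at the same pixel.

    Assumed convention: positive z means uptake below the normal mean.
    """
    return (mean(normals) - value) / stdev(normals)


def leave_one_out_z(normals):
    """Z score of each normal subject against the remaining normal subjects."""
    return [
        z_score(v, normals[:i] + normals[i + 1:])
        for i, v in enumerate(normals)
    ]


def clip_for_display(z, lower=0.0, upper=7.0):
    """Clamp a z score to the display window described in the text."""
    return max(lower, min(upper, z))


if __name__ == "__main__":
    # Hypothetical normalized uptake values at one pixel, not J-ADNI data.
    normal_values = [1.0, 1.1, 0.9, 1.0, 1.0]
    print(clip_for_display(z_score(0.8, normal_values)))
    print(leave_one_out_z(normal_values))
```

Leaving each normal subject out of its own reference set avoids comparing a subject against a mean and SD that it helped define.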

Human Interpretation
The 18F-FDG images generated through the quality control process above were independently interpreted with the combined visual-statistical method by 3 expert raters blinded to the clinical group and other clinical and laboratory data. The raters were provided with the 18F-FDG tomographic images on the viewer as well as the Z-map images in PDF format. Information about age and sex was also provided to the raters. Moreover, T1-weighted MR images acquired in 3D mode by using MPRAGE or its equivalent and reformatted in axial sections were provided together with axial T2WI and proton density images, in which the MR imaging sections did not correspond to the PET section positions. The experience of the 3 raters as physicians specializing in nuclear neuroimaging was 17, 19, and 19 years, respectively, when this project started.
After independent interpretation, consensus reads were performed by the 3 raters and 2 other discussants who are experienced nuclear medicine physicians specialized in neuroimaging. The experience of both discussants as physicians specializing in nuclear neuroimaging was 20 years. The same images and information as that in the independent interpretation were also provided for the discussants in the consensus reads. The 7 sessions of consensus reads lasted for 1.5 years in the order of subject enrollment in J-ADNI. In the consensus reads, the cases in which the evaluations by the 3 raters did not completely match were discussed, and the unified visual-statistical interpretation was determined as an official judgment by the J-ADNI PET Core.
For classification of 18F-FDG-PET, the criteria of Silverman et al 1 were adopted for classifying the uptake pattern in J-ADNI. All 3 expert raters and the 2 discussants had attended a training course for the criteria organized by Silverman et al before starting the J-ADNI project. In the criteria of Silverman et al, 18F-FDG uptake patterns were classified into 7 categories: progressive patterns: P1, P1+, P2, and P3, in which P1 represents the characteristic AD pattern and P1+ represents the AD-variant pattern, including the characteristic Lewy body dementia pattern; and nonprogressive patterns: N1, N2, and N3, in which N1 represents the characteristic normal pattern. In addition to these original 7 categories (FDG-7), the present study defined a binary criterion (FDG-2) in which the 7 categories were dichotomized into posterior-predominant hypometabolism (AD and AD-variant) patterns (P1, P1+) and the other patterns (N1, N2, N3, P2, and P3).
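The dichotomization from FDG-7 to FDG-2, and the degree of match among 3 raters, can be sketched as follows. The category labels come from the text; the function names and the returned label strings are illustrative assumptions.

```python
from collections import Counter

# The 7 categories of the Silverman criteria as listed in the text.
FDG7 = ("P1", "P1+", "P2", "P3", "N1", "N2", "N3")
# Posterior-predominant hypometabolism (AD and AD-variant) patterns.
AD_LIKE = {"P1", "P1+"}


def to_fdg2(category):
    """Map an FDG-7 category onto the binary FDG-2 scheme."""
    if category not in FDG7:
        raise ValueError(f"unknown FDG-7 category: {category!r}")
    return "AD/AD-variant" if category in AD_LIKE else "other"


def agreement_level(ratings):
    """Size of the largest group of identical ratings.

    For 3 raters: 3 = perfect match, 2 = two of three agree, 1 = no match.
    """
    return Counter(ratings).most_common(1)[0][1]


if __name__ == "__main__":
    reads = ["P1", "P1+", "N1"]  # hypothetical reads of one scan
    print(agreement_level(reads))                        # FDG-7 level
    print(agreement_level([to_fdg2(c) for c in reads]))  # FDG-2 level
```

Because P1 and P1+ collapse into one class, a scan can show no match under FDG-7 yet a two-of-three match under FDG-2, which is one reason concordance is expected to be higher for the binary scheme.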

Automated Quantitative Evaluation
In the automated quantitative analysis, the FDG-PET score, as a measure of the AD pattern, was calculated from ADtsum 4 by using the Alzheimer's Discrimination Tool in PMOD, Version 3.12 (PMOD Technologies, Zurich, Switzerland) 4,5 with the following equation: FDG-PET score = log2[(ADtsum / 11,089) + 1]. The FDG-PET score was not calculated in 1 case because no significant clusters were determined for the image. 4 This case was excluded from the quantitative analysis.
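The transformation from ADtsum to the FDG-PET score can be written as a short function. This is only a sketch of the stated equation, not of the PMOD tool; the function and constant names are assumptions, and ADtsum itself is taken as a given input.

```python
import math

# Scaling constant from the equation in the text.
ADTSUM_SCALE = 11089.0


def fdg_pet_score(adtsum):
    """FDG-PET score = log2(ADtsum / 11,089 + 1)."""
    if adtsum < 0:
        raise ValueError("ADtsum is expected to be non-negative")
    return math.log2(adtsum / ADTSUM_SCALE + 1.0)


if __name__ == "__main__":
    for adtsum in (0.0, 11089.0, 33267.0):
        print(adtsum, fdg_pet_score(adtsum))
```

With this scaling, an ADtsum of 11,089 maps to a score of exactly 1.0, and an ADtsum of 0 maps to a score of 0.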

Statistical Analysis
Concordance among the 3 raters was evaluated by Cohen κ statistics. As comparisons between human and automated evaluation, the association between the FDG-PET score and the number of raters who interpreted the case as P1 (AD pattern) in FDG-7 was evaluated by the Spearman rank correlation coefficient. Likewise, the association between the FDG-PET score and the number of raters who interpreted the case as N1 (normal pattern) was evaluated. The association was also examined between the FDG-PET score and the number of raters in the FDG-2 classification (ie, how many raters judged the case as the AD and AD-variant patterns [P1, P1+] versus the other patterns [N1, N2, N3, P2, and P3]). A P value < .05 was considered significant. In addition, the FDG-PET score was compared with the final combined visual-statistical interpretation determined by the consensus read and with the clinical group. Receiver operating characteristic analysis was used to obtain the optimum cutoff level for the quantitative index for discrimination.
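Pairwise concordance of this kind can be sketched with an unweighted Cohen κ computed for each pair of raters and then averaged. The unweighted form and the simple averaging over pairs are assumptions; the text states only that Cohen statistics were used. The ratings below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen kappa for two raters over the same cases."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Observed proportion of cases on which the two raters agree.
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_exp == 1.0:
        return 1.0  # both raters used a single identical category
    return (p_obs - p_exp) / (1.0 - p_exp)


def mean_pairwise_kappa(raters):
    """Average kappa over all rater pairs."""
    pairs = list(combinations(raters, 2))
    return sum(cohen_kappa(x, y) for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    r1 = ["P1", "P1", "N1", "N1"]
    r2 = ["P1", "N1", "N1", "N1"]
    r3 = ["P1", "P1", "N1", "N1"]
    print(cohen_kappa(r1, r2))
    print(mean_pairwise_kappa([r1, r2, r3]))
```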
Neither iSSP nor the PMOD Alzheimer's Discrimination Tool was approved for clinical use by the US Food and Drug Administration.

RESULTS
Figure 1 summarizes concordance rates among the 3 raters. Agreement among the 7 visual-statistical categories by at least 2 of the 3 readers occurred in >94% of cases for all groups: NC, MCI, and AD. The κ statistic ± SE for each pair of the 3 raters was 0.59 ± 0.04, 0.54 ± 0.04, and 0.58 ± 0.04 in FDG-7 (average, 0.57), and 0.73 ± 0.04, 0.65 ± 0.0, and 0.64 ± 0.05 in FDG-2 (average, 0.67), respectively.
Fig 1. Breakdown of the 18F-FDG-PET cases into degree of match by 3 raters in a combined visual-statistical human classification into 7 (FDG-7) (A) or 2 (FDG-2) (B) categories. A perfect match by the 3 raters is observed for 62% of the cases for FDG-7 and 76% for FDG-2 in total. The AD group shows the highest concordance followed by the MCI and NC groups, in this order, both for FDG-7 and FDG-2.
Figure 2 illustrates the relationship between the FDG-PET score and the number of raters who visually-statistically interpreted the 18F-FDG-PET image as P1 (Fig 2A) and N1 (Fig 2B). A significant positive association was observed between the FDG-PET score and the number of P1 interpretations (ρ = 0.59, P < .0001). The mean FDG-PET score was 0.46 ± 0.37 (n = 103) for the scans no raters interpreted as P1, but it increased to 0.723 ± 0.39 (n = 34) for those that 1 rater interpreted as P1, to 0.99 ± 0.45 (n = 31) for 2 raters, and to 1.21 ± 0.73 (n = 105) for all 3 raters. Likewise, a significant negative association was observed between the FDG-PET score and the number of N1 interpretations (ρ = −0.64, P < .0001). The FDG-PET score was 1.15 ± 0.69 (n = 146) for the scans no raters interpreted as N1, but it decreased to 0.80 ± 0.39 (n = 28) for those 1 rater interpreted as N1, 0.50 ± 0.25 (n = 40) for 2 raters, and 0.34 ± 0.22 (n = 59) for all 3 raters.
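The association between the number of raters voting for a pattern (0 to 3) and the FDG-PET score can be sketched with a Spearman rank correlation that handles the many ties in the vote counts by average ranks. The data are hypothetical, the function names are assumptions, and the P value is not computed here.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    p1_votes = [0, 0, 1, 2, 3, 3]               # raters calling P1
    scores = [0.3, 0.5, 0.7, 1.0, 1.2, 1.5]     # FDG-PET scores
    print(spearman_rho(p1_votes, scores))
```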
A similar association was observed between the FDG-PET score and the number of raters who interpreted the case as AD and AD-variant patterns, including the Lewy body dementia pattern (P1, P1+) or the other patterns (N1, N2, N3, P2, and P3); and both showed significant positive and negative associations (ρ = 0.60, P < .0001; and ρ = −0.60, P < .0001). Figure 3 illustrates scatterplots of the FDG-PET scores as contrasted to the combined visual-statistical interpretation determined by the consensus read of 18F-FDG-PET for each clinical group. For each group as well as for all subjects, cases with P1 interpretation showed higher FDG-PET scores than those with N1. Receiver operating characteristic analysis on P1 and N1 cases led to a cutoff FDG-PET score of 0.67 for discrimination between P1 and N1. As was expected, NC cases with P1 interpretation had lower FDG-PET scores than MCI and AD cases with P1 interpretation, and the ratio of the cases above-to-below the cutoff level was also lower. As for the cases with other patterns, a large fraction of the cases with N2 interpretation had FDG-PET scores above the cutoff level, though most were below 1.0. The FDG-PET scores of the cases with P1+ and P2 were variable.
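One common way to derive an optimum cutoff from receiver operating characteristic analysis is to maximize the Youden index (sensitivity + specificity − 1). The text does not state which criterion was used, so the Youden index is an assumption here, as are the function name and the hypothetical scores.

```python
def youden_cutoff(scores_pos, scores_neg):
    """Cutoff maximizing J = sensitivity + specificity - 1.

    A case is called positive when its score is >= the cutoff.
    Returns (cutoff, sensitivity, specificity); the lowest cutoff wins ties.
    """
    best = None
    for cutoff in sorted(set(scores_pos) | set(scores_neg)):
        sens = sum(s >= cutoff for s in scores_pos) / len(scores_pos)
        spec = sum(s < cutoff for s in scores_neg) / len(scores_neg)
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Hypothetical FDG-PET scores, not J-ADNI data.
    p1_scores = [0.9, 1.2, 0.7, 1.5]   # consensus read: P1
    n1_scores = [0.2, 0.4, 0.6, 0.8]   # consensus read: N1
    print(youden_cutoff(p1_scores, n1_scores))
```

Other criteria, such as the point closest to the top-left corner of the curve, can give a different cutoff on the same data.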

DISCUSSION
Matches among 7 visual-statistical categories by at least 2 of 3 readers occurred in >94% of cases for each clinical group, and perfect matches among the 3 raters were observed for 62% of the cases for FDG-7 and 76% for FDG-2 categorization schemes in total. The mild AD group showed the highest concordance, followed by MCI and NC, in order, for both FDG-7 and FDG-2. The AD pattern in 18F-FDG-PET is usually seen in the early stage of AD and is expected to predict the onset of AD. 1,12 Because most of the subjects who are clinically diagnosed as having AD may have had an established AD pattern in 18F-FDG-PET, it is reasonable that AD showed the highest concordance. Based on the classification of κ values described by Landis and Koch, 13 agreements were considered to be moderate for FDG-7 and substantial for FDG-2. Inter-rater variability is one of the indices that are often used to evaluate the validity of methods of image interpretation, and it facilitates comparison with other studies. The index of FDG-2 (κ = 0.67) of the present study showed values similar to those of the other studies (κ = 0.56-0.72) evaluated by the binary criteria. 6-8 However, the values observed in the other studies are not directly comparable with those in the present study because we analyzed the interpretation both visually and statistically. Recent studies have shown that the diagnostic capability of visual analysis of 18F-FDG-PET increases when the raters interpret the images in combination with 3D stereotactic surface projections. 14,15 These kinds of visual-statistical methods seem to be a standard approach in clinical settings.
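The Landis and Koch bands can be written out as a small lookup to show how the average κ values reported in the Results map onto the labels used here. The band boundaries are the commonly cited ones; the function name is an assumption.

```python
from statistics import mean

# Upper bound of each band and its label, per Landis and Koch.
BANDS = (
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
)


def landis_koch(kappa):
    """Return the Landis and Koch agreement label for a kappa value."""
    if kappa < 0:
        return "poor"
    for upper, label in BANDS:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


if __name__ == "__main__":
    # Pairwise kappa values reported in the Results.
    fdg7 = round(mean([0.59, 0.54, 0.58]), 2)
    fdg2 = round(mean([0.73, 0.65, 0.64]), 2)
    print(fdg7, landis_koch(fdg7))
    print(fdg2, landis_koch(fdg2))
```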
To increase the concordance rate and diagnostic capability, we need to overcome some problems. We had to degrade the image quality according to the PET with the lowest quality among the 23 facilities of J-ADNI. 11 Therefore, the quality of the images may be improved in the future. In addition to the image quality, development of new methods or new approaches to image interpretation may contribute to increasing the concordance.
This study showed a relationship between combined visual-statistical interpretation and automated quantitative assessment regarding the characteristic AD pattern in brain 18F-FDG-PET. Significant association was observed between the quantitative index (FDG-PET score) and the number of raters who interpreted the scans accordingly. This correlation may have been something expected from reports on similar/automated analysis. 5,6 However, this association was observed in a large-scale multicenter study by using various camera models on a wide spectrum of subjects in the present study.
From the standpoint of detecting the AD pattern, cases evaluated as having positive AD findings by complete agreement of all 3 raters tended to show a higher quantitative index than the cases that fewer than 3 raters interpreted as having positive AD findings. From the standpoint of ruling out the AD pattern, cases evaluated as having negative AD findings by complete agreement of all 3 raters also tended to show a lower quantitative index than the cases that fewer than 3 raters interpreted as having negative AD findings. Therefore, the results suggest that interpretation by 3 raters may be better than that by 2 or fewer raters. The results also indicate that cases that only 1 rater interpreted as having positive (or negative) AD findings presented a different quantitative index from those that no raters interpreted as having positive (or negative) findings. This outcome suggests that there are cases in which the "minority opinion" may not be ignored.
Generally, the minority opinion is somewhat important when a subtle but definite finding is evaluated. However, most of the 18F-FDG-PET images for which the judgment did not agree among the raters showed ambiguous findings. Ng et al 6 reported that experienced raters scored higher accuracy than nonexperienced raters in the interpretation of brain 18F-FDG-PET images for the diagnosis of AD. Such subtle findings in brain 18F-FDG-PET may be difficult to interpret. We need to analyze the difference in detail and develop new methods for interpretation or new diagnostic tools.
When the FDG-PET scores of the cases judged as P1 in the consensus read were examined, NC subjects with P1 interpretation showed lower FDG-PET scores than MCI and AD subjects. This result is probably because many of the NC subjects with P1 interpretation presented with a very mild AD pattern that influenced the FDG-PET score to only a small extent. Those cases, however, presented characteristic findings such as posterior cingulate hypometabolism, which led to the P1 interpretation.
The criterion standard used in this study was the clinical diagnosis at enrollment. Although cases of dementia with Lewy bodies showing the specific symptoms were excluded from enrollment in J-ADNI beforehand, differentiating Lewy body dementia from AD is occasionally difficult in clinical settings. 16 The typical Lewy body dementia pattern of 18F-FDG-PET, evaluated as occipital hypometabolism, is classified into P1+ by the criteria of Silverman et al. 1 Some cases classified into P1+, though limited in the present study, may represent Lewy body dementia. Moreover, the consensus read judged 16 of 107 cases of the NC group to be the AD pattern (P1 and P1+), and 8 of 67 cases in the AD group to be a non-AD pattern (N1 and P2). These disagreements might either be caused by inappropriate clinical diagnosis at enrollment or reflect the limitation of FDG-PET as a diagnostic tool. While these diagnostic discrepancies are not critical in the present study, which analyzed inter-rater concordance, comparison with other criterion standards such as long-term follow-up or postmortem examination is important for this kind of multicenter study in the future.
The FDG-PET score of 1.0, by definition, is proposed as an optimum threshold for the differential diagnosis of AD from healthy subjects. 5 Because the present study deals with comparison of combined visual-statistical human interpretation with automated quantitative analysis, we derived a cutoff level of 0.67 based on discrimination of the P1 from the N1 pattern. This discrepancy may be explained by the difference in the target of discrimination as well as in the profile of subjects, and the lower cutoff would be consistent with a higher sensitivity for visually detecting the AD pattern than for clinically identifying the diagnosis of AD, for which the 1.0 cutoff is designed. In addition, one of the essential factors for this discrepancy seems to be that decisions by visual-statistical interpretation are not completely consistent with the actual clinical diagnosis. Because the diagnostic capability of 18F-FDG-PET is not the subject of the present study, further studies are needed to elucidate the discrepancy.

CONCLUSIONS
Inter-rater agreement was moderate to substantial regarding the combined visual-statistical human interpretation of the characteristic AD pattern in 18F-FDG-PET. In addition, a significant relationship between human interpretation and automated quantitative assessment was found. The human rating as an AD or normal pattern was best predicted by the FDG-PET score when using a cutoff of 0.67.