Introduction

Normal pressure hydrocephalus (NPH), a syndrome discovered by Hakim and Adams [1], classically presents with dementia, gait disturbance, and urinary incontinence [23]. Idiopathic NPH (iNPH) is the most common form of adult-onset hydrocephalus, and the current gold standard for diagnosing iNPH is shunt response (SR), which is also the treatment [21, 23, 37, 45, 51, 65]. Although clinical improvement has been reported in up to 90% of patients following shunt surgery [34], this value has also been as low as 46.7% [5]. For example, Hebb and Cusimano [23] found that 59% of iNPH patients improved post-shunt, with long-term improvement reported in only 29%. This disparity in reported outcomes reflects the difficulty in selecting suitable patients for shunt surgery as many pathologies mimic iNPH symptoms [41, 43, 57].

Existing iNPH guidelines, such as the widely cited Japanese iNPH guidelines (2012) [44], a narrative review, outline various clinical tests that can be used in aiding iNPH diagnosis. Alongside Hakim’s triad, radiological and biochemical markers, these include the tap test (TT), infusion test (IT), extended lumbar drainage (ELD), and intracranial pressure monitoring (ICPM) [44]. However, the current guidelines [44] do not include explicit diagnostic parameters and cut-off values for each test. Hence, in practice, there is a lack of consistency in the method and evaluation of each test.

The current literature is outdated with regard to the analysis of the diagnostic tests and the presenting factors which may predict SR. The most recent literature review, by Nunn et al. [48], was published last year and investigated the accuracy of ELD. This review [48] included 4 papers, of which only one was published after 2003. Furthermore, the last systematic reviews to investigate the diagnosis of iNPH were Relkin et al. (2005) [56] and Hebb and Cusimano (2001) [23]. While both investigated multiple iNPH diagnostic features, there was little comparison within and between diagnostic tests, and no meta-analysis was performed [23, 55]. Therefore, to fill this gap in the literature, this review aims to evaluate the diagnostic effectiveness of presenting features, TT, IT, ELD, and ICPM, by incorporating the latest primary research.

Methods

Literature search

This systematic review was conducted following the Cochrane Collaboration guidelines [9] and Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [42]. A comprehensive search of MEDLINE, Embase, and Scopus was conducted from January 2003 to November 2020. January 2003 was chosen to encompass the literature after the end search year of Relkin et al. [56], the last seminal review on iNPH diagnosis. The search string consisted of the search term “Normal Pressure Hydrocephalus”. Additional articles were identified through manual searches on PubMed, Google Scholar, and JSTOR.

Study inclusion and exclusion criteria

Original articles in the English language that reported SR in relation to diagnostic tests were included. Our study selection criteria included the following: adult iNPH patients, radiological confirmation of hydrocephalus, 1 or more clinical features of NPH, use of cerebrospinal fluid (CSF) shunt, objective system of functional grading of patients preoperatively and a minimum of 3 months postoperatively, and that the diagnostic test was evaluated for the ability to predict SR. Radiological and biochemical studies were excluded, as the depth of literature on the role of both of these tests in iNPH diagnosis warrants separate meta-analyses.

Eligibility assessment, data extraction, and quality assessment

Studies were evaluated for eligibility independently by two reviewers. Disagreements were resolved by consensus after discussion with a third reviewer. Data were extracted using the Covidence data collection tool [10]. All papers were critically appraised using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS)-2 tool [69].

Statistical analysis

An Egger’s regression and asymmetry test [9] was used to assess publication bias (p < 0.05 = significant). A univariate analysis yielded the log diagnostic odds ratio (logDOR) for each diagnostic test and the DerSimonian-Laird (DSL) summary point, visualized in forest plots. The diagnostic odds ratio (DOR) was calculated by exponentiating the logDOR. Cochran’s Q test [9] (p < 0.05 = significant) and Higgins’ I² test [68] (heterogeneity: < 25% = low; 25–50% = moderate; > 50% = high) were used to assess heterogeneity. Finally, a bivariate analysis, which is known to give a slightly more accurate estimation of diagnostic performance than univariate analysis in small sample size analyses, was utilized to plot a summary receiver operating characteristic (SROC) curve. An area under curve (AUC) of 1 indicates perfect diagnostic effectiveness; an AUC ≤ 0.5 indicates an ineffective test. Statistical significance was assumed for p < 0.05. Statistical analysis was carried out using the mada [11] and meta [60] packages with the R software (version 4.0.4) [55].
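The univariate pooling step can be illustrated with a minimal Python sketch. It is not the authors' code, which relied on the mada and meta packages in R. The function names, and the rule of adding 0.5 to every cell only when a cell is zero, are assumptions made for illustration.

```python
import math


def log_dor(tp, fp, fn, tn):
    """Return (logDOR, variance) for one study's 2x2 table."""
    cells = [tp, fp, fn, tn]
    # Assumption: apply a 0.5 continuity correction only if some cell is zero.
    if any(c == 0 for c in cells):
        tp, fp, fn, tn = (c + 0.5 for c in cells)
    estimate = math.log((tp * tn) / (fp * fn))
    variance = 1 / tp + 1 / fp + 1 / fn + 1 / tn
    return estimate, variance


def dersimonian_laird(effects, variances):
    """Pool per-study effects (e.g. logDORs) with the DerSimonian-Laird estimator."""
    k = len(effects)
    w = [1.0 / v for v in variances]  # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Higgins' I^2 as a percentage, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {
        "pooled": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
        "q": q,
        "tau2": tau2,
        "i2": i2,
    }
```

Each study contributes one (logDOR, variance) pair from `log_dor`; passing the collected pairs to `dersimonian_laird` returns the summary point with a Wald-type 95% interval, Cochran's Q, and I².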

Results

The literature search retrieved a total of 7179 papers, of which 359 papers underwent full-text review and 35 studies were included (Fig. 1). The QUADAS-2 tool [69] scored all included studies at low to moderate risk of bias overall, none scoring high risk (Fig. 2a). No significant publication bias was detected by Egger’s test (p = 0.0847) (Fig. 2b) [9].
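As a rough illustration of the publication-bias check, the classic form of Egger's regression regresses each standardized effect (logDOR divided by its standard error) on its precision (1 divided by the standard error); an intercept far from zero suggests funnel-plot asymmetry. The sketch below is an assumption about how this could be coded in plain Python, not the implementation used in this review, and it stops short of the p-value, which would come from a t distribution with n − 2 degrees of freedom.

```python
import math


def egger_regression(effects, std_errors):
    """Ordinary least squares of (effect / SE) on (1 / SE).

    Returns the intercept (asymmetry term), the slope, and the
    standard error of the intercept.
    """
    x = [1.0 / se for se in std_errors]                 # precision
    y = [e / se for e, se in zip(effects, std_errors)]  # standardized effect
    n = len(x)
    if n < 3:
        raise ValueError("need at least three studies")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all standard errors are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    intercept_se = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return {"intercept": intercept, "slope": slope, "intercept_se": intercept_se}
```

When every study reports the same effect regardless of its precision, the intercept is zero and the slope equals that common effect.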

Fig. 1

Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [42] flowchart outlining the study selection process for qualitative synthesis (systematic review) and quantitative synthesis (meta-analysis)

Fig. 2

Fig. 2a shows a risk-of-bias summary plot: a bar chart of the distribution of risk-of-bias judgments for all included studies (n = 35) across the domains of the QUADAS-2 tool [69], in percentages (%). The bottom bar depicts the overall risk of bias, which collates the risk-of-bias judgments for all domains. The reference standard refers to shunt responsiveness. Fig. 2b shows an Egger’s asymmetry test funnel plot [9] of all data points included in the meta-analysis (n = 24; two studies were used twice because they discussed two index tests), indicating the presence and degree of publication bias. A p-value < 0.05 is deemed significant and implies publication bias. Egger’s asymmetry test yielded p = 0.0847, calculated by running an Egger’s regression [9] (see Egger’s regression line) on the collated logDOR and standard errors of all data used in the meta-analysis (n = 24)

Presenting clinical features

Nine studies investigated the effect of presenting symptomology and patient characteristics on SR (Table 1) [3, 30, 35, 36, 38, 39, 44, 50, 61].

Table 1 Presenting clinical features

Symptom dominance

McGirt et al. [40] found that patients with gait disturbance as a primary symptom were twice as likely to respond than others, although complete triad presentation compared to the absence of one symptom did not affect SR. This was reinforced by Marmarou et al. [38] and Murakami et al. [46]. However, Meier and Miethke [41] found that the presence of dementia was significantly associated with a worse outcome (p = 0.01).

Symptom duration

Meier and Miethke [41] found that a symptom duration of under 12 months before shunt surgery was associated with a significantly better outcome. McGirt et al. [40] noted a 13% lower chance of positive shunt outcome for each additional symptom year before surgery. However, Bådagård et al. [3] and Mahr et al. [37] observed no effect of symptom duration on outcome.

Symptom severity

Bådagård et al. [3] found a higher modified iNPH scale score, which reflects less severe symptoms, to significantly predict better prognosis. Mahr et al. [37] saw an association between lower baseline Kiefer scores and SR; a cut-off of > 5 had a sensitivity = 58% and specificity = 71%. Meier and Miethke [41] found worsening and severe dementia to be indicators of poor SR. Thomas et al. [63] reported that those scoring one SD below the age-corrected population average on the following tests fare 4–6 × worse when shunted: Wechsler memory scale (WMS) immediate recall and WMS immediate recall with Rey-Osterrieth complex figure direct copy or Stroop color word. Mahr et al. [37] found modified Rankin scale (mRS), Stein and Langfitt hydrocephalus score (SLHS), and mini mental state examination (MMSE) scores to be on average higher in responders than non-responders. Using an MMSE cut-off of ≥ 21 was reported to have a specificity = 93% and sensitivity = 67% [37]. Similarly, Kazui et al. [30] found those with a high MMSE total score to be 1.1 × more likely to respond. Overall, the worse the symptomology, the worse the prognosis.

Age

Mahr et al. [37] found the mean age in years of non-shunt responders to be 73.9 compared to 68.8 for excellent responders, while Murakami et al. [46] observed the mean age of shunt responders to be 75.8 and that of non-shunt responders to be 79.9. Bådagård et al. [3] split patients into 3 age groups; in those aged < 70, the response rate was 62%, falling to 52% and 39% for those aged 70–80 and > 80, respectively. However, Marmarou et al. [38] and McGirt et al. [40] reported no significant relationship between age and SR.

Co-morbidities

Bådagård et al. [3] reported ischemic stroke and transient ischemic attack to be significant negative predictors of SR, while Kazui et al. [30] found patients with hypertension to be twice as likely not to respond to shunting. However, Bådagård et al. [3] reported that hypertension, cardiovascular disease, diabetes, hyperlipidemia, and anti-platelet and anticoagulant use had no significant impact on SR.

Tap test

Six studies discussed the CSF TT as a predictor of SR using different criteria for TT response, outlined below (Table 2) [24, 26, 27, 64, 68, 71].

Table 2 Tap test

Gait tests

Three studies used gait function to assess TT response, namely the timed up and go (TUG) [26, 27, 71], 10-m walk in time (10Ti) [27], and in step (10St) tests [27]. Using the TUG test, Yamada et al. [73] showed that the absolute difference in time between pre- and post-TT was more accurate in predicting SR than percentage change. Using the absolute time difference, a ≥ 5.6-s threshold within 24 h after TT was reported to have a sensitivity = 83.3% and specificity = 81.0% for predicting an improvement of ≥ 10 s 12-month post-shunt [73]. Ishikawa et al. [26] set the threshold to a 10% improvement in the TUG, yielding sensitivity = 34.2% and specificity = 73.6%. Ishikawa et al. [27] again evaluated the TUG as well as 10Ti and 10St and found the percentage change in TUG on day 1 post-TT at the cut-off value ≥ 11.3% to have the highest sensitivity (78.3%) and specificity (80.0%).
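To compare the reported TUG thresholds on a single scale, each sensitivity and specificity pair can be converted into a diagnostic odds ratio, the ratio of the positive to the negative likelihood ratio. The following Python sketch uses only the values quoted above; the function name and the dictionary labels are illustrative assumptions, and the resulting figures are not reported by the cited studies.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """DOR = positive likelihood ratio / negative likelihood ratio."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos / lr_neg


# (sensitivity, specificity) pairs as quoted in the text above.
tug_thresholds = {
    "absolute change >= 5.6 s [73]": (0.833, 0.810),
    "10% improvement [26]": (0.342, 0.736),
    "percentage change >= 11.3% [27]": (0.783, 0.800),
}

for label, (sens, spec) in tug_thresholds.items():
    print(f"{label}: DOR = {diagnostic_odds_ratio(sens, spec):.2f}")
```

A DOR of 1 corresponds to a test that does not discriminate between responders and non-responders; larger values indicate better discrimination.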

iNPH grading scales

Wikkelsø et al. [70] used an iNPH grading scale (iNPHGS) with 4 domains (gait, neuropsychology, balance, and continence). They found the TT to be limited in its ability to predict SR, with a sensitivity = 52% and specificity = 59% [70]. However, gait improvement with TT correlated significantly with improvement in post-shunt iNPH score (p = 0.02) [70]. Ishikawa et al. [26] utilized a slightly different iNPH grading scale with 3 domains (gait, cognition, and urinary function), but in line with Wikkelsø et al. [70], they also found low sensitivity (< 40%) and specificity (< 40%) values. However, these values improved slightly when the 3 domain scales were added as iNPHGS total change (1-point improvement in any domain), which had sensitivity = 71.3% and specificity = 65% [26]. Improvement in any of iNPHGS, TUG, or MMSE showed the highest sensitivity of 92.5%, but specificity was very low (20%) [26].

Perfusion studies

In addition to clinical examinations, Hertel et al. [24] and Walter et al. [66] used magnetic resonance imaging (MRI) pre- and post-TT investigating cerebral perfusion and showed that TT diagnostic effectiveness can be improved by MRI use. Notably, Walter et al. [66] found that baseline perfusion values in the shunt responder group were significantly lower than in those who did not improve. Furthermore, they found that only the shunt responder group demonstrated a significant increase in cerebral perfusion from pre- to post-TT [66]. Both studies reported clinical examination to not be significantly associated with SR, unlike cerebral perfusion improvement [24, 64].

Infusion test

Eight studies discussed the use of IT in predicting SR, analyzing different aspects of CSF dynamics (Table 3) [2, 4, 15, 29, 35, 39, 56, 59].

Table 3 Infusion test

Rout

Seven studies analyzed the role of Rout, defined as resistance to outflow of CSF which reflects the impedance of CSF absorption [2, 4, 15, 29, 35, 39, 59]. Five studies concluded that Rout significantly predicts SR; however, there were discrepancies in the recommended Rout cut-off value [4, 15, 35, 39, 59]. While Meier and Miethke [41] and Bech-Azeddine et al. [4] reported significant cut-off values at 15 mmHg/mL/min and 16 mmHg/mL/min, respectively, Mahr et al. [37] reported the optimal Rout cut-off to be at 13 mmHg/mL/min. However, Kahlon et al. [29] and Anile et al. [2] found no significant correlation between Rout and SR.

Complementary variables

Besides Rout, the significance of complementary variables was reported by 4 studies [2, 15, 29, 56]. Kahlon et al. [29] concluded that steady-state plateau pressure (PpL) was equal to Rout in terms of predicting SR, with 78% of patients with PpL > 22 mmHg improving. Eide and Brean [15] detected a significant correlation between 12-month shunt outcome and elevated CSF pulse pressure (CSFPP) amplitude of ≥ 4 mmHg measured during lumbar infusion (p < 0.002). Anile et al. [2] found that a cut-off > 0.25 for intracranial elastance (IE) slope gradient clearly differentiated shunt responders and non-responders and was superior to Rout in predicting SR in ventricular infusion test (VIT). Ryding et al. [58] concluded that the volume at plateau level of intracranial venous blood volume (PLIV) decreasing after CSF infusion was positively predictive of shunt outcomes (p < 0.01).

Extended lumbar drainage

Eight studies used an ELD for the prediction of SR using different response criteria for ELD (Table 4) [7, 8, 17, 19, 35, 36, 47, 69].

Table 4 Extended lumbar drainage

ICPM as assessment of ELD response

Eide and Stanisic [17] reported a correlation of ICPM with ELD, showing that all patients with a raised pulsatile ICP of ≥ 5 mmHg in 10% of recording time were ELD responders (PPV = 100% and NPV = 77%). A reduction in ICP wave amplitude during ELD was a powerful indicator of SR (Spearman correlation − 0.6; p < 0.001) [17].

Memory tests as criteria for ELD response

Gallina et al. [19] showed that a 1-day ELD had a sensitivity = 100% and specificity = 75% for SR when using a ≥ 3-point increase in MMSE for ELD response. This was supported by Chotai et al. [8] who used a 4-day ELD with ≥ 2-point MMSE increase, yielding sensitivity = 100% and specificity = 60%. Similarly, Mahr et al. [37] found that an ELD response, defined as a 10% improvement in gait or MMSE score, was predictive of SR in 87.9% of patients. Using a different memory test, Chaudhry et al. [7] assessed ELD response with Rey auditory verbal learning test (RAVLT), with improvement post-ELD yielding PPV = 50% and NPV = 96%.

Global symptom improvement as criterion for ELD response

Panagiotopoulos et al. [47] found that a ≥ 1-point increase in NPH score, an objective assessment of symptom severity, after ELD was significantly associated with SR. This finding was supported by Woodworth et al. [71] who reported that ≥ 1 objective symptom improvement predicted SR with sensitivity = 91% and specificity = 70%. Marmarou et al. [38] showed that when improvement was reported by the patient or family, ELD response was significantly correlated to SR (p < 0.001) and had sensitivity = 95% and specificity = 64%.

Intracranial pressure monitoring

Eleven studies investigated the use of ICPM parameters for predicting SR (Table 5) [12,13,14, 16,17,18, 20, 35, 49, 59, 60].

Table 5 Intracranial pressure monitoring

Static ICPM

Six studies evaluated the use of the static ICP parameter (mean ICP and B-wave activity) as a predictor of SR [13, 16, 35, 49, 59, 60]. Overall, 4 studies reported no significant correlation between mean ICP or B-wave activity and SR [13, 35, 59, 60]. However, Pfisterer et al. [51] found a significant correlation (p < 0.004) between patients with a preoperative basal ICP level > 10 mmHg and SR. Eide and Sorteberg [16] found the mean ICP to be significantly higher in shunt responders (8 mmHg vs. 7 mmHg), although static ICP overall was a poor predictor of SR.

Dynamic ICPM

Seven studies reported a significant correlation of dynamic ICP values with SR [12,13,14, 16,17,18, 20]. Eide and Brean [14] examined pulse amplitude value (PAV) cut-offs of ≥ 4, ≥ 5, and ≥ 6 mmHg in 70%, 40%, and 10% of 6-s time windows, and reported a PPV = 82–90% and NPV = 91–100%. It was also reported previously by Eide [13] that a PAV ≥ 4 mmHg in 70% of the recording time was a significant predictor of SR, PPV = 90% and NPV = 100%. Eide and Sorteberg [16] also reported that a mean ICP wave amplitude cut-off of ≥ 4 mmHg had sensitivity = 98% and specificity = 70%, PPV = 93%, and NPV = 91%. Eide (2005) [13], Eide et al. [18], and Eide (2011) [12] additionally found that ICP wave amplitude values did not correlate with cardiac output (CO) or arterial blood pressure (ABP) wave amplitude increases.
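The amplitude criteria above share one structure: the wave amplitude must reach a threshold in at least a given fraction of the recording. A minimal Python sketch of that rule follows. It assumes the per-window mean wave amplitudes (one value per 6-s window, in mmHg) have already been extracted from the ICP signal; the function names and the example values are hypothetical.

```python
def fraction_at_or_above(amplitudes_mmhg, threshold_mmhg):
    """Fraction of time windows whose amplitude reaches the threshold."""
    if not amplitudes_mmhg:
        raise ValueError("no amplitude windows supplied")
    hits = sum(1 for a in amplitudes_mmhg if a >= threshold_mmhg)
    return hits / len(amplitudes_mmhg)


def meets_amplitude_criterion(amplitudes_mmhg, threshold_mmhg=4.0, min_fraction=0.70):
    """True if the threshold is reached in at least `min_fraction` of windows.

    Defaults mirror the >= 4 mmHg in 70% of recording time criterion quoted
    in the text; other pairs (e.g. 5 mmHg / 0.40) can be passed explicitly.
    """
    return fraction_at_or_above(amplitudes_mmhg, threshold_mmhg) >= min_fraction


# Hypothetical recording of ten 6-s windows.
windows = [4.2, 5.1, 6.0, 3.1, 4.0, 4.4, 5.3, 2.8, 4.1, 4.6]
print(meets_amplitude_criterion(windows))
print(meets_amplitude_criterion(windows, threshold_mmhg=5.0, min_fraction=0.40))
```

Keeping the threshold and the required fraction as separate parameters makes it straightforward to evaluate the three cut-off pairs side by side on the same recording.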

RAP index

The relationship between static and dynamic ICP parameters was measured in 2 studies using the RAP index (Pearson correlation between mean ICP and mean ICP wave amplitude) [16, 35]. Eide and Sorteberg [16] reported that although the RAP average was similar between responders and non-responders, the percentage of RAP ≥ 0.6 or ≥ 0.8 was higher in shunt responders. The use of a RAP threshold of 0.8 yielded sensitivity = 66%, specificity = 48%, PPV = 82%, and NPV = 27% for predicting SR [16]. In line with this, Mahr et al. [37] reported sensitivity = 74% and specificity = 70% for RAP ≥ 0.8.
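The RAP index as defined above can be sketched in a few lines of Python. The source does not state how recordings were windowed, so the use of consecutive non-overlapping windows, the window length, and all function names here are assumptions for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def rap_series(mean_icp, wave_amplitude, window):
    """RAP for each consecutive, non-overlapping window of `window` samples."""
    return [
        pearson(mean_icp[i:i + window], wave_amplitude[i:i + window])
        for i in range(0, len(mean_icp) - window + 1, window)
    ]


def percent_at_or_above(values, cutoff):
    """Percentage of values that reach the cutoff (e.g. RAP >= 0.6 or 0.8)."""
    return 100.0 * sum(1 for v in values if v >= cutoff) / len(values)
```

Given paired series of mean ICP and mean wave amplitude, `rap_series` yields one RAP value per window, and `percent_at_or_above(raps, 0.8)` gives the share of windows meeting the ≥ 0.8 threshold.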

Meta-analysis

In the meta-analysis, sensitivity and specificity of the best-performing diagnostic parameters were selected for each included study to facilitate consistent comparisons. Studies were excluded from the meta-analysis in the case of insufficient reporting of statistical findings. A meta-analysis was not conducted for the diagnostic ability of presenting features due to significant methodological heterogeneity.

The meta-analysis for TT included 5 studies, and the DSL summary point is at 1.35 (95% CI: − 0.18–2.89) (Fig. 3a, Fig. 3e) [24, 26, 64, 68, 71]. The analysis of IT included 5 studies, and the DSL summary point is at 1.74 (95% CI: 0.75–2.73) (Fig. 3b, Fig. 3f) [4, 15, 29, 35, 56]. For ELD, 6 studies were included, and the DSL summary point is at 3.31 (95% CI: 2.19–4.43) (Fig. 3c, Fig. 3g) [8, 17, 19, 36, 47, 69]. Finally, for ICPM, 7 studies were included, and the DSL summary point is at 3.93 (95% CI: 2.79–5.07) (Fig. 3d, Fig. 3h) [12, 13, 16,17,18, 20, 35]. Cochran’s Q test [9] and Higgins’ I² test [68] indicate low heterogeneity for all sub-analyses (Fig. 3).

Fig. 3

Forest plots indicating the log diagnostic odds ratio (logDOR) of TT (tap test) in Fig. 3a, IT (infusion test) in Fig. 3b, ELD (extended lumbar drainage) in Fig. 3c, and ICPM (intracranial pressure monitoring) in Fig. 3d, as well as 95% confidence intervals in brackets. At the bottom of each graph, the DerSimonian-Laird (DSL) summary point is shown, which is the summary diagnostic odds ratio (DOR) on the log scale, drawn as a diamond with width inversely proportional to its standard error. A higher DOR implies higher diagnostic effectiveness. Cochran’s Q test [9] is shown with p < 0.05 indicating significance, and Higgins’ I² test [68] with < 25%, 25–50%, > 50% indicating low, moderate, and high heterogeneity, respectively. Study-specific estimates of sensitivity and false-positive rate values are shown in summary receiver operating characteristics (SROC) curve for TT in Fig. 3e, IT in Fig. 3f, ELD in Fig. 3g, and ICPM in Fig. 3h. The SROC curve is drawn in black, and the confidence region (95%) is drawn in grey. The data points are symbolized by triangles, and the summary estimate as a circle. The area under the curve (AUC) is shown in the bottom right corner with an AUC of 1 indicating highest diagnostic effectiveness and an AUC of 0.5 or below indicating an ineffective test

Overall, ICPM had the highest DOR (50.9) and AUC (0.877), followed by ELD (DOR = 27.7, AUC = 0.753), then IT (DOR = 5.70, AUC = 0.729), and finally TT (DOR = 3.86, AUC = 0.711).
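The DOR values follow from the DSL summary points by exponentiation, since the summary points are on the natural-log scale. The short Python check below back-transforms the reported summary points and confidence limits; the variable names are illustrative. Small differences from the reported DORs (for example for ELD) presumably reflect rounding of the published logDOR values.

```python
import math

# DSL summary points (logDOR) with 95% confidence limits, as reported above.
summary_log_dor = {
    "TT": (1.35, -0.18, 2.89),
    "IT": (1.74, 0.75, 2.73),
    "ELD": (3.31, 2.19, 4.43),
    "ICPM": (3.93, 2.79, 5.07),
}

# Back-transform each point estimate and limit to the odds-ratio scale.
summary_dor = {
    test: tuple(math.exp(value) for value in triple)
    for test, triple in summary_log_dor.items()
}

for test, (dor, lower, upper) in summary_dor.items():
    print(f"{test}: DOR = {dor:.2f} (95% CI: {lower:.2f} to {upper:.2f})")
```

On the odds-ratio scale, a confidence interval that includes 1 (equivalently, a logDOR interval that includes 0, as for TT) is compatible with no discriminative ability.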

Discussion

Correctly identifying suitable patients for shunt surgery remains challenging but crucial, as unlike secondary NPH, a large proportion of iNPH patients have poor shunt outcomes. Given that up to 41% of iNPH patients do not have a positive SR [23] and that shunting carries a significant procedural risk of 22–28% [31, 70], the need for an accurate diagnostic test to predict SR is apparent. Hence, this review offers guidance as to which clinical tests and patient factors are most effective in predicting SR.

Our meta-analysis of 21 studies concluded that ICPM is the most effective clinical predictor of SR, followed by ELD, IT, and TT. The findings suggest that the odds of a positive ICPM result are 50.9 × higher in a patient with shunt-responsive iNPH than in a shunt-unresponsive iNPH patient, compared to 27.7 × higher with ELD. In comparison, the odds of a positive result are only 5.7 × higher with IT and only 3.86 × higher with TT.

The most predictive indicator of SR is a mean pulse amplitude cut-off of ≥ 4 mmHg, in 70% of recording times in dynamic ICPM [17, 19, 29]. The meta-analysis is a reliable indicator of ICPM effectiveness as the methodology of the studies included was similar, all using intraparenchymal ICP monitors and assessing similar index parameters, reflected by the relatively small spread of data (Fig. 3h). However, we found several limitations within the included studies. In papers with Eide as principal author, which comprised 5 of the 7 included ICPM studies [12, 13, 16,17,18], there were notable differences in patient selection, with Sorteberg et al. [61] selecting patients with triad symptoms and ventriculomegaly for ICPM, while Eide and Stanisic [17] and others used the Relkin guidelines (Table 5) [56]. Moreover, 4 of these studies selected patients retrospectively from the same unit between 2002 and 2008, raising concerns about patient selection bias; additionally some patients may have been excluded from retrospective analysis in one study but included in another [13, 16,17,18]. Ultimately, further research is needed in multicenter settings, with more attention on complications, which Eide neglected. Overall, given the limitations of the included ICPM studies, the findings of the meta-analysis for ICPM must be interpreted with caution. Although TT is thought to simulate a shunt procedure making it an accurate assessment of the effect of a physiological reduction of CSF circulating volume, its high false-negative rates may deprive many patients of the potential benefits of shunt surgery. ICPM could help identify these patients when used as a second-line test given its higher specificity. The mechanism of ICPM’s greater accuracy may be due to its ability to uncover CSF pathology, which affects the compliance of brain parenchyma, by direct analysis of its reflection in ICP values. 
However, as the pathophysiology of iNPH remains unknown [44], the biological mechanism behind the statistical diagnostic superiority of ICPM remains elusive. Regardless of this, a significant advantage of ICPM is that it utilizes objective cut-off values based on monitor readings, and, therefore, is not subject to assessor bias like in some subjective assessments of ELD or TT improvement. ICPM is negatively regarded due to a high perceived risk of complications. However, it must be noted that intraparenchymal ICPM has been found to be much safer than external ventricular drain (EVD)-ICPM: Raboel et al. [54] reported a 5% infection risk for intraparenchymal ICPM versus 27% for EVD-ICPM [22]. In line with this, Vonhoff et al. [64] reported no patients with major complications and only 7% with minor complications, such as accidental removal of the probe, in 152 adults undergoing hydrocephalus ICPM. Although currently less accurate than invasive ICPM, non-invasive ICPM, using transcranial Doppler ultrasonography or MRI, might provide an entirely complication-free assessment of SR in the future [22, 54].

ELD is the second most effective test. Despite ELD techniques and response criteria varying significantly between studies, there was relatively low spread in the data (Fig. 3c, Fig. 3g), indicating that ELD is highly accurate regardless of methodological heterogeneity. That said, 3-point MMSE improvement after 1-day ELD had the highest sensitivity and specificity and hence should be given preference [19]. Furthermore, as Chotai et al. [8] found that 1-day ELD is as effective as 4-day ELD, a 1-day ELD should be used to avoid procedural complications and financial costs of a 4-day ELD. However, it must be noted that the study by Chotai et al. [8] had several weaknesses, particularly supplementary tests, such as radionucleotide cisternography and MRI, being used, which hinders drawing valid conclusions regarding ELD as a sole diagnostic test for SR. Interestingly, the Japanese NPH guidelines [44] recommend TT over ELD, citing fewer complications in the TT, while still reporting higher sensitivity and specificity of ELD. However, we only found 3 cases of infections and no other long-term significant complications in 8 ELD studies (n = 425), consistent with Walchenbach et al. [65] who reported only 2 major complications (n = 49) without long-term sequelae. We conclude that the much higher diagnostic effectiveness of ELD relative to TT and IT outweighs the potential complication risk, which has been decreasing in the last decade most likely due to evolving ELD techniques [8].

The financial burden of ICPM and ELD is regarded by many as a significant obstacle to clinical implementation. In 2006, Burnett et al. [6] found that an undiagnosed NPH patient would incur mean costs of $108,842. In comparison, it costs $7000 to insert a shunt and $2870 for an ELD, both values signifying the cost of the respective devices only [6]. However, given the low NPV and specificity of ELD, it may incur additional costs by not selecting shunt-responsive patients (Table 4). This could be avoided by using a highly specific diagnostic test, namely ICPM; however, the same authors reported that a hypothetical diagnostic test with specificity = 80% would incur $83,000 per quality-adjusted life year, due to the complexity of prolonged clinical monitoring, which may exceed cost-effectiveness thresholds [6]. That said, since 2006, many studies have reported ICPM and ELD specificity to be > 80% [13, 15, 17, 20, 47]; hence, we recommend an update on the current cost-effectiveness of both tests.

Given the high accessibility of IT and TT, they can be used as first-line diagnostic procedures. Patients who test negative, but have a high index of clinical suspicion, should be followed up with ICPM, or alternatively with ELD, as these tests have significantly higher specificity and sensitivity. For IT, we recommend using Rout with a cut-off range between 13 and 16 mmHg/mL/min [4, 35, 39], as well as CSFPP with a cut-off ≥ 4 mmHg [15] as predictive parameters of SR. These should be used in conjunction with PPPA [29], PpL [29], PLIV [58], and IE [2], for which future research must identify optimal cut-off values. For TT, TUG is reported as the most accurate measure; specifically, an absolute time improvement of ≥ 5.6 s is advised [73], although there was no clear consensus on which parameter is best. The authors also recommend that TT be followed up with an MRI assessing cerebral perfusion post-TT, to improve its sensitivity [24, 64]. The patient’s age, symptom severity, and co-morbidities all influence SR outcomes and should be considered in patient selection [3, 35, 39, 44, 61]. The longer symptoms are present before surgery, the worse the prognosis is [38, 39]; hence, early identification and treatment are critical.

In 2001, Hebb and Cusimano [23] stated that more prospective studies, which shunt iNPH patients regardless of their diagnostic test results, were needed to yield precise overall sensitivities and specificities of each test. Twenty years later, we found that this remains the most significant limitation of the literature. Given that in many studies the results of the diagnostic test influence the decision to shunt, accurate false-negative values are difficult to ascertain. We emphasize that future research should aim to shunt all patients so that valid conclusions on diagnostic efficiency can be derived.

Radiology is also pivotal in the diagnosis of iNPH; however, until recently, studies have not been able to show an association between radiological features and SR [33]. The iNPH Radscale may potentially play an important role in SR prediction and is a promising new tool in the iNPH patient workup [32]. However, due to its relatively recent introduction in 2017, there is currently insufficient literature evaluating its use. Although Kockum et al. [32] have reported 98.5% predictive accuracy of the iNPH Radscale, another study found it to be unable to predict motor outcomes post-shunting [35]. Nonetheless, the literature on the role of radiology in iNPH is extensive; hence, the authors believe that a separate meta-analysis on the role of radiological tools in the prediction of SR is required. Additionally, biochemical markers have been hypothesized to be valuable in the diagnosis of iNPH. As reported by Leinonen et al. [36], the most important biomarker indicating iNPH may be abnormally low CSF TNF-α concentrations. Nevertheless, this study, as well as a 2017 systematic review by Pfanner et al. [50], reported that no biomarker was able to predict SR. However, Zhang et al. [74] in 2020 reported that studies had found that high concentrations of vascular endothelial growth factor in the CSF may be associated with worse SR. A meta-analysis of both radiology and biochemical markers, together with this meta-analysis, can give more extensive guidance regarding the clinical prediction of SR.

Limitations

The key limitations of this study were the differences in SR criteria and diagnostic test methodologies between studies. Employing a meta-regression on these independent variables might have increased the validity of our findings. The lack of methodological consistency, as well as not all patients receiving the reference standard (shunt), is reflected in the risk analysis (Fig. 2a), which showed a moderate overall risk of bias for almost half of the studies. The number of studies used in each sub-analysis was moderate (n = 5–7); including more studies may have decreased the risk of sample size bias. Finally, the true diagnostic effectiveness of all diagnostic tests may be higher than reported by the meta-analysis, as many studies either used a medium-pressure valve or did not report using a programmable valve, despite pressure adjustments often being necessary postoperatively to avoid over- and underdrainage [7, 28, 36, 46, 47, 69].

Conclusion

Intraparenchymal ICPM is statistically the most effective diagnostic test, followed by ELD, IT, and lastly TT. The best clinical predictors of SR were found to be a mean ICP wave amplitude ≥ 4 mmHg and, secondly, a 1- to 4-day ELD with an MMSE cut-off improvement ≥ 3. If IT is used, Rout with a cut-off range between 13 and 16 mmHg/mL/min, in conjunction with a CSFPP cut-off of ≥ 4 mmHg, should be used. Additionally, for TT, a TUG improvement of ≥ 5.6 s can be used. Despite the statistical superiority of ICPM and ELD, the financial cost associated with their use as well as the potential for complications renders them most useful in identifying SR patients with negative IT or TT results. The latter are clinically easier to perform with low complication rates and hence may be used as first-line diagnostic tests to predict SR. When used in addition to diagnostic tests, the severity of symptoms, patient age, and co-morbidities may aid in predicting SR. In the future, standardized methodologies for each diagnostic test and uniform criteria for SR must become the norm to draw better comparisons.