
NeuroImage

Volume 180, Part A, 15 October 2018, Pages 68-77

Cross-validation failure: Small sample sizes lead to large error bars

https://doi.org/10.1016/j.neuroimage.2017.06.061

Abstract

Predictive models ground many state-of-the-art developments in statistical brain image analysis: decoding, MVPA, searchlight, or extraction of biomarkers. The principled approach to establish their validity and usefulness is cross-validation, testing prediction on unseen data. Here, I would like to raise awareness of the error bars of cross-validation, which are often underestimated. Simple experiments show that the sample sizes of many neuroimaging studies inherently lead to large error bars, e.g., ±10% for 100 samples. The standard error across folds strongly underestimates them. These large error bars compromise the reliability of conclusions drawn with predictive models, such as biomarkers or methods developments where, unlike with cognitive neuroimaging MVPA approaches, more samples cannot be acquired by repeating the experiment across many subjects. Solutions to increase sample size must be investigated, tackling possible increases in heterogeneity of the data.

Introduction

In the past 15 years, machine-learning methods have pushed forward many brain-imaging problems: decoding the neural support of cognition (Haynes and Rees, 2006), information mapping (Kriegeskorte et al., 2006), prediction of individual differences, behavioral or clinical (Smith et al., 2015), rich encoding models (Nishimoto et al., 2011), principled reverse inferences (Poldrack et al., 2009), etc. Replacing in-sample statistical testing by prediction gives more power to fit rich models and complex data (Norman et al., 2006, Varoquaux and Thirion, 2014).

The validity of these models is established by their ability to generalize: to make accurate predictions about some properties of new data. They need to be tested on data independent from the data used to fit them. Technically, this test is done via cross-validation: the available data is split in two, a first part, the train set used to fit the model, and a second part, the test set used to test the model (Pereira et al., 2009, Varoquaux et al., 2017).
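As a concrete illustration of this train/test logic, here is a minimal sketch using scikit-learn; the synthetic data and the logistic-regression classifier are arbitrary choices for the example, not those of any study discussed here.

```python
# Minimal sketch of the train/test logic behind cross-validation.
# The synthetic data and the classifier are illustrative choices only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=100, n_features=50, random_state=0)

# A single split: fit on the train set, test on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("single-split accuracy:", model.score(X_test, y_test))

# Cross-validation repeats the split so that every sample is used for testing
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold accuracies:", scores)
```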

Cross-validation is thus central to statistical control of the numerous neuroimaging techniques relying on machine learning: decoding, MVPA (multi-voxel pattern analysis), searchlight, computer-aided diagnosis, etc. Varoquaux et al. (2017) conducted a review of cross-validation techniques with an empirical study on neuroimaging data. These experiments revealed that the errors cross-validation makes in measuring prediction accuracy are typically around ±10%. Such large error bars are worrying.
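The ±10% figure is consistent with a simple binomial back-of-the-envelope calculation: with on the order of 100 test samples, each contributing only a success or a failure, even a perfectly unbiased accuracy measurement carries roughly that much uncertainty. A sketch of this calculation, assuming for illustration a true accuracy of 70%:

```python
# Width of the 95% binomial interval for accuracy measured on n test samples,
# assuming (for illustration only) a true accuracy of 70%.
from scipy.stats import binom

p_true = 0.7
for n in (30, 100, 1000):
    lo, hi = binom.interval(0.95, n, p_true)  # counts of correct predictions
    print(f"n = {n:4d}: measured accuracy in [{lo / n:.2f}, {hi / n:.2f}]")
# For n = 100, the interval is roughly [0.61, 0.79], i.e. about ±10%.
```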

Here, I show with very simple analyses that the observed errors of cross-validation are inherent to the small number of samples. I argue that they provide loopholes that are exploited in the neuroimaging literature, probably unwittingly. The problems are particularly severe for methods development and inter-subject diagnostic studies. Conversely, cognitive neuroscience studies are less affected, as they often have access to larger sample sizes through multiple trials per subject and multiple subjects. These issues could undermine the potential of machine-learning methods in neuroimaging and the credibility of related publications. I give recommendations on best practices and explore cost-effective avenues to ensure reliable cross-validation results in neuroimaging.

The effects that I describe are related to the “power failure” of Button et al. (2013): lack of statistical power. In the specific case of testing predictive models, the shortcomings of small samples are more stringent and inherent, as they are not offset by large effect sizes. My goal here is to raise awareness that studies based on predictive modeling require larger sample sizes than standard statistical approaches.

Section snippets

Distribution of errors in cross-validation

Cross-validation strives to measure the generalization power of a model: how well it will predict on new data. To simplify the discussion, I will focus on balanced classification, predicting two categories of samples; prediction accuracy can then be measured as a percentage, and chance is at 50%. The cross-validation error is the discrepancy between the prediction accuracy measured by cross-validation and the expected accuracy on new data.
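This quantity can be probed empirically on synthetic data: draw many small datasets, measure prediction accuracy by cross-validation on each, and compare with the accuracy of the same fitted model on a large independent draw from the same distribution. The sketch below does this with arbitrary simulation settings (100 samples, 50 features, a logistic-regression classifier); it illustrates the definition, not the paper's own experiments.

```python
# Empirical distribution of the cross-validation error on synthetic data:
# CV-measured accuracy minus accuracy on a large independent test set.
# All settings (sample size, dimension, classifier) are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

errors = []
for seed in range(100):
    X, y = make_classification(n_samples=10_100, n_features=50, flip_y=0.2,
                               random_state=seed)
    # The small dataset available to the "study" vs. a large pool of new data
    X_small, y_small, X_new, y_new = X[:100], y[:100], X[100:], y[100:]

    cv_accuracy = cross_val_score(LogisticRegression(max_iter=1000),
                                  X_small, y_small, cv=5).mean()
    new_data_accuracy = LogisticRegression(max_iter=1000).fit(
        X_small, y_small).score(X_new, y_new)
    errors.append(cv_accuracy - new_data_accuracy)

errors = np.asarray(errors)
print(f"cross-validation error: mean {errors.mean():+.3f}, "
      f"5th-95th percentiles [{np.percentile(errors, 5):+.3f}, "
      f"{np.percentile(errors, 95):+.3f}]")
```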

Previous results: cross-validation on brain images. Varoquaux

An open door to overfit and confirmation bias

The large error bars are worrying, whether it is for methods development of predictive models or their use to study the brain and the mind. Indeed, a large variance of results combined with publication incentives weaken scientific progress (Ioannidis, 2005).

With conventional statistical hypothesis testing, the danger of vibration effects is well recognized: arbitrary degrees of freedom in the analysis explore the variance of the results and, as a consequence, control on false positives is

Conclusion: improving predictive neuroimaging

With predictive models, even more than with standard statistics, small sample sizes undermine accurate tests. The problem is inherent to the discriminant nature of the test, which measures only a success or failure per observation. Estimates of variance across cross-validation folds give a false sense of security, as they strongly underestimate the error on the prediction accuracy: folds are far from independent. Rather, to avoid the illusion of biomarkers that do not generalize or overly-optimistic
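The gap between fold-based error bars and the real uncertainty can be checked in a simulation: repeat a synthetic "study" many times by drawing 100 samples from a large fixed population, and compare the standard error computed across the 5 folds of each draw with the actual spread of the cross-validated accuracy across draws. This is a sketch with arbitrary settings, in the spirit of the argument above rather than a reproduction of the paper's experiments.

```python
# Fold-based standard error vs. actual variability of the CV estimate
# across replications of a synthetic study. Settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One large fixed population; each replication draws a fresh 100-sample study
X_pop, y_pop = make_classification(n_samples=100_000, n_features=50,
                                   flip_y=0.2, random_state=0)
rng = np.random.RandomState(0)

fold_ses, cv_means = [], []
for _ in range(100):
    idx = rng.choice(len(y_pop), size=100, replace=False)
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             X_pop[idx], y_pop[idx], cv=5)
    fold_ses.append(scores.std() / np.sqrt(len(scores)))  # naive error bar
    cv_means.append(scores.mean())

print("typical fold-based standard error:", np.mean(fold_ses))
print("actual spread of the CV estimate :", np.std(cv_means))
```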

Acknowledgments

Computing resources were provided by the NiConnect project (ANR-11-BINF-0004_NiConnect). I am grateful to Aaron Schurger, Steve Smith, and Russell Poldrack for feedback on the manuscript. I would also like to thank Alexandra Elbakyan for help with the literature review, as well as Colin Brown and Choong-Wan Woo for sharing data of their review papers.

References (59)

  • R. Rosenthal

    The file drawer problem and tolerance for null results

    Psychol. Bull.

    (1979)
  • R. Saxe et al.

    Divide and conquer: a defense of functional localizers

NeuroImage

    (2006)
  • J. Stelzer et al.

Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): random permutations and cluster size control

NeuroImage

    (2013)
  • D.C. Van Essen et al.

The WU-Minn Human Connectome Project: an overview

NeuroImage

    (2013)
  • G. Varoquaux et al.

    Assessing and tuning brain decoders: cross-validation, caveats, and guidelines

    NeuroImage

    (2017)
  • T. Wolfers et al.

    From estimating activation locality to predicting disorder: a review of pattern recognition for neuroimaging-based psychiatric diagnostics

    Neurosci. Biobehav. Rev.

    (2015)
  • G. Ziegler et al.

Individualized Gaussian process-based prediction and detection of local and global gray matter abnormalities in elderly subjects

    NeuroImage

    (2014)
  • S. Arlot et al.

    A survey of cross-validation procedures for model selection

    Stat. Surv.

    (2010)
  • Y. Bengio et al.

    No unbiased estimator of the variance of k-fold cross-validation

    J. Mach. Learn. Res.

    (2004)
  • B. Biswal et al.

    Toward discovery science of human brain function

Proc. Natl. Acad. Sci.

    (2010)
  • U.M. Braga-Neto et al.

    Is cross-validation valid for small-sample microarray classification?

    Bioinformatics

    (2004)
  • C.J. Brown et al.

Machine Learning on Human Connectome Data from MRI

    (2016)
  • K.S. Button et al.

    Power failure: why small sample size undermines the reliability of neuroscience

    Nat. Rev. Neurosci.

    (2013)
  • S.G. Costafreda

Pooling fMRI data: meta-analysis, mega-analysis and multi-center studies

    Front. Neuroinformatics

    (2009)
  • J. Demšar

    Statistical comparisons of classifiers over multiple data sets

    J. Mach. Learn. Res.

    (2006)
  • A. Di Martino et al.

    The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism

Mol. Psychiatry

    (2014)
  • C. Dwork et al.

    The reusable holdout: preserving validity in adaptive data analysis

    Science

    (2015)
  • K.J. Gorgolewski et al.

NeuroVault.org: a web-based repository for collecting and sharing unthresholded statistical maps of the human brain

    Front. Neuroinform.

    (2015)
  • T. Hastie et al.

    The Elements of Statistical Learning

    (2009)