Multivariate Classification of Blood Oxygen Level–Dependent fMRI Data with Diagnostic Intention: A Clinical Perspective

SUMMARY: There has been a recent upsurge of reports about applications of pattern-recognition techniques from the field of machine learning to functional MR imaging data as a diagnostic tool for systemic brain disease or psychiatric disorders. Entities studied include depression, schizophrenia, attention deficit hyperactivity disorder, and neurodegenerative disorders like Alzheimer dementia. We review these recent studies which—despite the optimism from some articles—predominantly constitute explorative efforts at the proof-of-concept level. There is some evidence that, in particular, support vector machines seem to be promising. However, the field is still far from real clinical application, and much work has to be done regarding data preprocessing, model optimization, and validation. Reporting standards are proposed to facilitate future meta-analyses or systematic reviews.

Functional MR imaging based on blood oxygen level-dependent signal changes that are measured by using fast T2*-sensitive echo-planar imaging techniques provides an indirect measure of neural activity in the brain. It has had an enormous impact on basic research in the field of cognitive neurosciences 1 and has been applied in numerous group studies with the aim of clarifying disease mechanisms in psychiatric and neurologic disorders, some of which do not exhibit obvious structural alterations (eg, Zhang and Raichle 2 and Chen et al 3). However, the applicability of fMRI to single subjects in clinical settings has been limited to a few indications, mainly in the context of surgery planning. 4,5 Although there has been a substantial effort to identify neuroimaging biomarkers for psychiatric disorders 6,7 (eg, schizophrenia, 8 depression, 9 and neurodegenerative disorders like Alzheimer dementia 10) with the goal of including biomarkers in official diagnostic criteria, 11 to date capturing functional aspects in diagnostic imaging is almost limited to tracer studies in certain kinds of neurodegeneration. 6,12,13 In clinical practice, neuroradiologic MR imaging examinations are broadly confined to the exclusion of gross structural abnormalities, but normally, actual disease mechanisms are not used as further information in a majority of these individuals. Voxel-based morphometry, DTI, and fMRI have been proposed as potential MR imaging biomarkers that might help overcome this shortcoming in the future. 7,8 A prime drawback of fMRI is the rather high inter- and intraindividual variability of measures in conventional analyses, even in healthy individuals, 14-16 that foils many such attempts.
Conventional fMRI methods mainly comprise univariate activation or cofluctuation (functional-connectivity) analyses based on averaged signals in a few regions of interest or mass-univariate analyses across the whole brain, 1 which come along with high requirements to control for multiple comparisons. 17

Overview of Machine-Learning-Based Classification Techniques for fMRI
In the case of intertrial variability in individual subjects, the problem of differentiating single trials has been overcome in recent years by the rise of multivariate supervised learning methods derived from the fields of machine learning and pattern recognition. Such methods, often termed multivariate or multivoxel pattern analyses (MVPAs), are increasingly adopted in psychologically motivated fMRI studies. The concept of such analyses is that, in a first step, an algorithm is used to derive a decision rule (classifier) on the basis of a set of labeled training data (comprising ≥2 classes; eg, different stimulus categories or tasks). In a second step, this rule is applied to classify an independent set of test data as belonging to one of these classes. A general overview of this approach is shown in Fig 1. In contrast to conventional analyses, these techniques are based on patterns of brain activation or connections, not on individual regions or voxels. 18-20 Recently this concept has been extended to classifying individual subjects with a diagnostic purpose (for earlier, methodologically oriented reviews see Kloppel et al 21 and Orrù et al 22). This article gives a comprehensive overview of MVPA applications to fMRI from a more clinical, particularly neuroradiologic, point of view.
Although there are a large number of supervised machine-learning techniques that can, in principle, be applied in this context, 23 2 groups of methodologies are of particular importance: support vector machines (SVMs) and linear discriminant analyses (LDAs). In SVMs, the classification problem is operationalized as defining a hyperplane that best distinguishes groups of subjects. The classifier is trained, by using a kernel, to maximize the margin of separation between 2 groups on the basis of the examples closest to the separating hyperplane (the support vectors). 22-24 In a typical LDA variant, all data points are projected to a 1D space with the aim of maximizing intergroup separation and minimizing intraclass variation. 22,23 LDAs and SVMs are both very heterogeneous groups, depending on the actual operationalization or the kernel used. Certain kinds of SVMs are mathematically very similar to certain types of LDAs, while there can be important differences between different SVM formulations and parameter sets. 23 The distinction made is, therefore, somewhat artificial.
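For orientation, the 2 ideas can be written in their standard textbook forms. The notation is introduced here for illustration only and is not taken from the reviewed studies: x_i denotes the feature vector of subject i, y_i its class label (coded as -1 or +1 for the SVM), w and b the normal vector and offset of the separating hyperplane, mu_1 and mu_2 the class means, and S_B and S_W the between-class and within-class scatter matrices. The SVM is shown in its simplest linear, hard-margin form; kernel and soft-margin variants modify this problem.

```latex
% Linear hard-margin SVM: minimizing ||w|| maximizes the margin 2/||w||
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{\top}x_i + b)\ \ge\ 1,\qquad i = 1,\dots,n

% Fisher LDA: 1D projection maximizing between-class relative to within-class scatter
J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},
\qquad
S_B = (\mu_1-\mu_2)(\mu_1-\mu_2)^{\top},
\qquad
S_W = \sum_{c\in\{1,2\}}\ \sum_{i:\,y_i=c}(x_i-\mu_c)(x_i-\mu_c)^{\top}

% Maximizer of J(w), provided S_W is invertible
w^{\ast} \propto S_W^{-1}(\mu_1-\mu_2)
```

The closed-form LDA solution requires inverting S_W, which is not possible when the number of features exceeds the number of subjects unless some regularization or dimensionality reduction is applied; this is one way to see why LDA is sensitive to high dimensionality.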
fMRI datasets usually comprise several thousand nonindependent voxels. Yet the number of subjects is usually limited to dozens. This imbalance poses a problem for MVPA because most methods cannot deal with a high dimensionality of the data compared with the number of samples. There is a high risk of overfitting. This means that the classifier is perfectly trained to separate the samples used for training but has a poor ability to generalize to the successful classification of new data. This issue can be dealt with by the selection of classification methods that are less sensitive to a high dimensionality, such as SVMs. In contrast, LDA is usually very sensitive to this. Still, a strict dimensionality reduction is necessary: Primary data are preprocessed to condense redundant information by feature extraction, and features that are decisive are identified before actual classifier training by feature selection or weighting. Filter approaches, partially by using conventional univariate statistics, or wrapper-based approaches are commonly applied for feature selection. 19 An issue that has to be overcome in diagnostic classification is interindividual structural variability regarding the morphology of the cerebral sulci and gyri as well as their relation to histologically and functionally relevant brain areas. 25 Within-subject MVPA analyses often rely on fine-grained patterns on a single-voxel level. 18,19 In contrast, most diagnostic MVPA studies reviewed here focus on another spatial scale: larger functionally coherent brain areas.
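A filter approach based on univariate statistics can be sketched as follows. This is a minimal illustration, not the procedure of any particular study: the helper names `welch_t` and `select_top_k`, the choice of Welch's t-statistic as the ranking score, and the toy data are all assumptions made for the example.

```python
import math

def welch_t(a, b):
    """Welch's t-statistic for two lists of numbers (unequal variances allowed)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / denom if denom > 0 else 0.0

def select_top_k(X, y, k):
    """Return the (sorted) indices of the k features with the largest |t|
    between class 0 and class 1. X is a list of subjects, each a list of features."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        group0 = [row[j] for row, label in zip(X, y) if label == 0]
        group1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(group0, group1)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Hypothetical toy data: 6 subjects x 4 features; only feature 2 differs between groups.
X = [
    [0.1, 1.0, 5.0, 0.3],
    [0.2, 0.9, 5.2, 0.1],
    [0.0, 1.1, 4.9, 0.2],
    [0.1, 1.0, 1.0, 0.2],
    [0.2, 1.1, 1.1, 0.3],
    [0.0, 0.9, 0.8, 0.1],
]
y = [0, 0, 0, 1, 1, 1]
selected = select_top_k(X, y, k=1)
```

Because each feature is scored in isolation, a filter of this kind cannot detect features that are informative only in combination, which is one motivation for the multivariate and wrapper-based alternatives mentioned above.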
A specific feature present in the design of most MVPA-based fMRI studies is that datasets are often small and that classification performance is assessed through cross-validation (CV). Here, feature selection and classifier training are repeated several times. Each time a different range of datasets, often exactly one in the case of leave-one-out CV, is excluded and used as a test set. 19
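The partitioning logic behind CV can be sketched in a few lines. The generator name `cv_folds` is a hypothetical helper introduced for this illustration; it only produces index sets and leaves feature selection and classifier training to the caller. Setting the number of folds equal to the number of subjects yields leave-one-out CV.

```python
def cv_folds(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for contiguous k-fold CV.
    With n_folds == n_samples this is leave-one-out CV."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if f < n_samples % n_folds else 0)
        for f in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Leave-one-out for 5 subjects, and 3-fold CV for 7 subjects.
loo_folds = list(cv_folds(5, 5))
three_folds = list(cv_folds(7, 3))
```

In practice, subjects would usually be shuffled, and folds balanced across diagnostic groups, before partitioning; both steps are omitted here for brevity.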

Recent Diagnostic fMRI Approaches Based on MVPA
There has recently been a remarkable upsurge of scientific articles from the interdisciplinary functional neuroimaging community reporting successful applications of MVPA on fMRI data to various diagnostic problems, especially in the past 3 years. This constitutes a paradigm shift from comparative univariate to discriminative multivariate analyses of fMRI data. An exhaustive overview of these previous studies by using either task-based 26-63 or task-free 55,64-97 fMRI is given in On-line Tables 1 and 2. An overview of particularly reliable studies with above-average statistical power is presented in Fig 2. Although they are promising at first glance, there is a high degree of methodologic heterogeneity of classification algorithms and data-preprocessing steps in these studies. Some of the reported results seem to be mostly add-ons to studies whose designs were primarily aimed at clarifying disease mechanisms or were focused on computational aspects, not primarily done with the aim of developing a diagnostic tool. Until now, no single effort in this field has provided sufficient large-scale validation and systematic optimization of methodologic choices leading to an application in a real medical diagnostic setting. Due to this heterogeneity and because strategies to assess the statistical significance of diagnostic accuracy vary considerably between studies, we did not perform a formal meta-analytical comparison of these reports.
Data Acquisition and Preprocessing. By now a majority of reported approaches are based on conventional task-based fMRI. This means that patients have to perform a specific, mainly neuropsychological task in the MR imaging scanner. Statistical models are designed to evaluate the amount of variance in the acquired EPI data caused by the task-related modulation of brain activity. This corresponds to "brain activation" in conventional fMRI studies. 1 An advantage of this approach is a rather straightforward functional interpretability of such data. Yet in contrast to mainly psychologically motivated studies in young healthy participants, patients' adherence to task instructions constitutes an important source of variability in real clinical settings and may even interfere with diagnostic decision-making.
Recently, a considerable number of studies 55,64-97 have been based on task-free fMRI acquisitions, so-called resting-state fMRI, which focuses on the functional connectivity of distant brain regions in terms of signal cofluctuations and therefore on the integrity of large-scale brain networks. 98,99 A potential benefit of this method is that typical networks seem to be robustly identifiable in individual subjects. However, reports focusing on the reliability of typical resting-state fMRI measures highlight the problem that these are highly dependent on potentially confounding factors such as wakefulness or autonomic arousal. 100,101 Although most resting-state fMRI findings in basic neuroscience are based on short acquisitions of approximately 5 minutes, which seem to be sufficient for network detection, 102 there is recent evidence that retest reliability can be significantly improved by longer acquisitions. 103 Only a minority of resting-state functional connectivity-based MVPA approaches have used acquisitions lasting at least 7 minutes. 67,72,79,84,87,91
As a common analysis step on a single-subject level, feature-extraction methods are used to extract meaningful information from and simultaneously reduce the high dimensionality of the raw EPI time-series data. Prevailing approaches based on prior knowledge are activation modeling, based on general linear models for task-based acquisitions, 1 and seed-to-voxel or region-of-interest to region-of-interest correlation analyses for task-free acquisitions. In addition, recently more complex graph-theoretic approaches have been derived from the ROI-based methods. Another way of analyzing task-based and task-free studies relies on data-driven approaches such as independent component analyses. 98,104 Recent further developments in diagnostic MVPA are not solely based on one of these methods. For example, Du et al 55 combined both task-based and task-free fMRI in schizophrenia in a small study.
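The ROI-to-ROI correlation approach can be illustrated with a short sketch that turns the regional time-series of one subject into a feature vector. The function names `pearson` and `connectivity_features`, the use of the Pearson coefficient, and the toy time-series are assumptions for this example; real pipelines would typically add steps such as nuisance regression and temporal filtering beforehand.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long time-series."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant series carries no cofluctuation information
    return cov / math.sqrt(var_a * var_b)

def connectivity_features(roi_timeseries):
    """Flatten the upper triangle of the ROI-by-ROI correlation matrix
    into one feature vector per subject."""
    n_rois = len(roi_timeseries)
    return [
        pearson(roi_timeseries[i], roi_timeseries[j])
        for i in range(n_rois)
        for j in range(i + 1, n_rois)
    ]

# Hypothetical mean signals of 3 ROIs over 5 time points.
roi_timeseries = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
]
features = connectivity_features(roi_timeseries)
```

For R regions this yields R(R-1)/2 features per subject, so even a moderate parcellation produces far more features than the typical number of subjects, which is why a subsequent feature-selection step is usually needed.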
Additionally, combinations of fMRI measures with volumetric data, 41,48-50,63,76,78-81,86,89 DTI, 46,49,92 as well as genetics 42 and behavioral data, 40,41,50,76 have been used as features in MVPA analyses. However, results reported so far do not allow verified statements about the benefit of such multimodal acquisitions.
Feature Selection, Classifier Training, and Assessment of Classification Accuracy. Figure 2 and On-line Tables 1 and 2 contain information about the multivariate classification methods in the studies included in this review. They also contain information about whether the selection of potentially decisive features was based on conventional univariate analyses or whether it was also guided by multivariate information of distributed network patterns.
With a few exceptions, 72,75-78,80,81,86,88,89,95 the small sample sizes in most studies did not allow testing the classification accuracy in datasets completely independent of those used for classifier training. As stated above, a trick makes approximate assessments of classification accuracy of a set of trained classifiers possible: Most studies use CV to show the generalizability of strongly overlapping classifiers to new test data. This means that in most reports only the diagnostic ability of 1 particular set of dependent classifiers 23 is proved. There is usually no formal test that allows conclusions regarding the ability of whole MVPA approaches (acquisition + feature extraction + feature selection + classifier training) to construct successful diagnostic tools in a particular clinical setting because CV is only used to assess classification of new data but not reliable classifier training independent from particular subjects. Additionally, setting up CV loops that do not strictly keep the test set and training set apart is a known source of error, leading to overoptimistic estimates of diagnostic accuracy. There is still some uncertainty regarding the most appropriate test of significance to be applied in the CV setting. 19 Only a small subset of reports contains systematic comparisons and optimizations of larger sets of classification models used. 49,68,70,75-78,80,81,86,88,89,95
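Keeping test and training data strictly apart means that every data-dependent step, including feature selection, has to be repeated inside each CV fold by using the training subjects only. The following sketch shows one way to structure such a loop. All names (`mean_diff_scores`, `top_k`, `predict_nearest_mean`, `loo_accuracy_nested`) are hypothetical, the mean-difference filter and the nearest-class-mean rule are simple stand-ins for the selection methods and the SVM or LDA classifiers used in the reviewed studies, and the data are invented.

```python
def mean_diff_scores(X, y):
    """Absolute difference of the two class means for every feature."""
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, lab in zip(X, y) if lab == 0]
        g1 = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(sum(g0) / len(g0) - sum(g1) / len(g1)))
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]

def predict_nearest_mean(X_train, y_train, x, features):
    """Assign x to the class with the closer mean, using only selected features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        dist = 0.0
        for j in features:
            mean_j = sum(r[j] for r in rows) / len(rows)
            dist += (x[j] - mean_j) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy_nested(X, y, k):
    """Leave-one-out CV in which feature selection sees only the training fold."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        # Selection is redone per fold; the held-out subject is never used here.
        features = top_k(mean_diff_scores(X_train, y_train), k)
        if predict_nearest_mean(X_train, y_train, X[i], features) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical toy data: 8 subjects x 3 features; only feature 0 is informative.
X = [
    [2.0, 0.1, 0.5], [2.2, 0.4, 0.3], [1.9, 0.3, 0.6], [2.1, 0.2, 0.4],
    [0.1, 0.2, 0.5], [0.0, 0.3, 0.4], [0.2, 0.1, 0.6], [-0.1, 0.4, 0.3],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]
nested_accuracy = loo_accuracy_nested(X, y, k=1)
```

The erroneous variant would compute `top_k(mean_diff_scores(X, y), k)` once on all subjects before entering the loop; the held-out subject would then already have influenced which features are used, which is the kind of leakage that tends to inflate accuracy estimates.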

Potential Clinical Applications and Integration in Diagnostic Workflows
To this point, most studies report applications to distinguish healthy controls and patients with a specific disease. Such applications are a necessary step in developing and assessing diagnostic tools, but is it currently really clinically desirable to strive for such a tool?
In the context of practically illness-defining brain alterations, as in certain kinds of neurodegeneration, MVPA-fMRI methods might compete with radioactive tracer studies in the future. Regarding psychiatric diseases, it seems, for example, desirable to identify patients with a high risk of disease recurrence or progression. Especially in the case of major depressive disorders, there are a number of patients who do not respond to standard pharmacologic treatment; this outcome hints at potentially underlying divergent biologic mechanisms. Prediction of treatment response to a certain group of drugs seems to be a valuable objective as well. 9 To date, some MVPA-fMRI studies have already attempted to classify subjects regarding prognostically relevant subgroups. 36,38,40,46,49,50,52,53,56,58 Another important but overlapping clinical question may be how to distinguish patients with neurobiologically different disease entities but with a similar initial clinical presentation such as unipolar and bipolar depression. Such differential diagnostic aspects have been addressed in a few recent MVPA-fMRI studies as well. 32,39,51,54,57,71,73,75,78,80,81,88,89,94,95 In this context, specific features of most psychiatric diseases should be taken into account when discussing the results of these analyses: The etiology and progression of disease are complex and only partly attributable to biologic causes. The biopsychosocial model of pathogenesis includes major influences of social and life event-related factors 105,106 that do not necessarily lead to correlates that are approachable by biologic measures such as fMRI. 107 Furthermore, many diagnostically relevant symptoms are, by definition, subjective (eg, depressed mood). 108 The burden of suffering is often decisive in terms of indications for treatment. 109 Therefore, fMRI-MVPA-based measures should not be expected to become the criterion standard in diagnostics and replace in-depth history-taking.
The accuracies of the studies reviewed here support this theoretic argument. Still, imaging-based multivariate tools might be able to provide clinically useful additional information: When important information (eg, regarding prognosis) is, by definition, not deducible from the course of disease, these tools might provide the clinician with crucial hints, 7 unraveling the "biologic share" of disease.
In nearly all fMRI-MVPA studies, there was a substantial number of misclassifications (On-line Tables 1 and 2). Partially, they may be attributable to inherent noise in the data and remaining methodologic weaknesses in data analysis. However, misclassification might also be based on biologically and medically meaningful information like the effects of medication 110 and age. 69,85 Sex effects are a much-debated issue in fMRI as well. 111,112 Further investigation of misclassified subjects might even pose a starting point to identify biologically different disease subgroups. Presumably, a practical problem is that the referring physician and the radiologist cannot easily grasp what leads to a single diagnostic decision by fMRI-MVPA. In comparison with other types of diagnostic imaging, it is therefore not directly possible to appreciate the extent of potentially biasing features in a specific subject. Only 5 recent studies have tried to overcome this issue by introducing individual confidence measures. 39,48,54,56,57
As seen in Fig 2 and On-line Tables 1 and 2, the diversity of scientific backgrounds of recent studies is reflected by a striking heterogeneity of reported methodologic details, sample characteristics, validation strategies, and performance measures. This heterogeneity limits efforts to draw more reliable quantitative conclusions about the clinical benefits of MVPA-fMRI at this stage. More specifically designed studies with a sufficiently high statistical power, with the confounding factors of a real clinical setting in mind, and with a more standardized diagnostic end point should be performed to facilitate meta-analytic comparisons in the future. As a stimulus for further debate, we propose reporting standards and standards of study design that, in our opinion, may help overcome some of these issues. They are summarized in the Table. Before MVPA-fMRI could be applied in real clinical settings, potential interscanner variability 113 should also be taken into account.
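One practical aspect of comparable reporting is that a single accuracy figure hides how misclassifications are distributed between patients and controls. The sketch below computes a small set of standard measures from true and predicted labels. The function name `diagnostic_summary`, the selection of measures, and the example labels are assumptions made for illustration and are not the content of the proposed reporting standards.

```python
def diagnostic_summary(y_true, y_pred, positive=1):
    """Sensitivity, specificity, accuracy, and balanced accuracy for a
    two-class problem; `positive` marks the patient class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical cross-validated predictions: 4 patients (1) and 6 controls (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
summary = diagnostic_summary(y_true, y_pred)
```

With unequal group sizes, as in this toy example, overall accuracy and balanced accuracy differ, which is one reason why reporting sensitivity and specificity separately makes results easier to compare across studies.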

CONCLUSIONS
Approximately 70 studies at the proof-of-principle level that use MVPA of fMRI data with a diagnostic intention have been reported. However, there is a wide range of different methodologic decisions, from data-acquisition strategies through preprocessing and feature selection to actual diagnostic classification algorithms and parameter settings and, therefore, a high flexibility in study design. Results reported so far are mainly based on small sets of subjects. Therefore, one has to be cautious in drawing reliable conclusions on the basis of this literature. Published results may just represent the tip of the iceberg, with a lot more unsuccessful unpublished attempts to apply this methodology. Therefore, there might be an important publication bias, and published results regarding the statistical significance of successful diagnostic classification should be interpreted in the light of a potential need to correct for multiple comparisons. 114 Nevertheless, it can meanwhile be regarded as an independently replicated finding that approaches building on task-based and resting-state fMRI, with support vector machines as well as LDAs, have the potential to differentiate patients from healthy subjects in psychiatric disorders, with most repeated findings in dementia, schizophrenia, and depression.
In contrast, there is apparently more uncertainty regarding optimal strategies for data preprocessing and feature selection, advisable steps to allow the classification algorithm to work despite a very high dimensionality and noise level of the original data. Many of these methods are derived from conventional fMRI analysis methods. Hardly any effort seems to have been made to systematically compare and evaluate the influence of these different approaches and parameter-setting selections on diagnostic accuracy.
In conclusion, there is some evidence that MVPA-fMRI is promising for overcoming long-known reliability issues in fMRI and providing clinically important prognostic and differential diagnostic information in psychiatric disorders beyond pure exclusion of gross structural alterations. Despite the optimism coming from the recent discussion in the interdisciplinary functional neuroimaging community, this method is still rather new, and work has to be done to validate methodologic choices and identify those specific clinical settings that really allow a beneficial application. Moreover, a conceivable integration of MVPA-based fMRI into clinical workflow will depend critically on tackling diagnostic problems with a real clinical benefit and effects on therapeutic decision-making.