Evaluating methods for classifying expression data

Michael Z Man; Greg Dyson; Kjell Johnson; Birong Liao

doi:10.1081/BIP-200035491

J Biopharm Stat. 2004 Nov;14(4):1065-84. doi: 10.1081/BIP-200035491.

Authors

Michael Z Man¹, Greg Dyson, Kjell Johnson, Birong Liao

Affiliation

¹ Nonclinical Statistics, Pfizer Global Research and Development - Ann Arbor Laboratories, Ann Arbor, MI 48105, USA. michael.mann@pfizer.com

PMID: 15587980

Abstract

An attractive application of expression technologies is to predict drug efficacy or safety using expression data of biomarkers. To evaluate the performance of various classification methods for building predictive models, we applied these methods to six expression datasets. These datasets were from studies using microarray technologies and had two or more classes. From each of the original datasets, two subsets were generated to simulate two scenarios in biomarker applications. First, a 50-gene subset was used to simulate a candidate gene approach when it might not be practical to measure a large number of genes/biomarkers. Next, a 2000-gene subset was used to simulate a whole genome approach. We evaluated the relative performance of several classification methods by using leave-one-out cross-validation and bootstrap cross-validation. Although all methods perform well on both subsets for a relatively easy dataset with two classes, differences in performance do exist among methods for the other datasets. Overall, partial least squares discriminant analysis (PLS-DA) and support vector machines (SVM) outperform all other methods. We suggest a practical approach to take advantage of multiple methods in biomarker applications.
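
The two error estimates named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the data are synthetic, the classifier is a simple nearest-centroid rule standing in for methods such as PLS-DA or SVM, and the bootstrap estimate uses out-of-bag samples, which is one common variant (the abstract does not specify the exact procedure). The function names (make_data, fit_centroids, predict, loocv_error, bootstrap_cv_error) are hypothetical.

```python
import random

def make_data(n_per_class=15, n_genes=50, n_informative=5, shift=2.0, seed=0):
    """Synthetic two-class 'expression' matrix (illustrative only).
    The first n_informative genes are shifted upward in class 1."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for g in range(n_informative):
                    row[g] += shift
            X.append(row)
            y.append(label)
    return X, y

def fit_centroids(X, y):
    """Compute the mean expression profile of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, row):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], row))
    return min(centroids, key=dist)

def loocv_error(X, y):
    """Leave-one-out CV: hold out each sample once, train on the rest."""
    errors = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors += predict(model, X[i]) != y[i]
    return errors / len(X)

def bootstrap_cv_error(X, y, n_boot=50, seed=1):
    """Train on a bootstrap resample, test on the samples left out of it
    (out-of-bag), and average the error over resamples."""
    rng = random.Random(seed)
    n = len(X)
    rates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        # Skip degenerate resamples: nothing to test on, or only one class drawn.
        if not oob or len({y[i] for i in idx}) < 2:
            continue
        model = fit_centroids([X[i] for i in idx], [y[i] for i in idx])
        wrong = sum(predict(model, X[i]) != y[i] for i in oob)
        rates.append(wrong / len(oob))
    return sum(rates) / len(rates) if rates else float("nan")

X, y = make_data()
loo = loocv_error(X, y)
boot = bootstrap_cv_error(X, y)
print(f"LOOCV error: {loo:.3f}")
print(f"Bootstrap CV error: {boot:.3f}")
```

Swapping a different classifier into fit_centroids and predict while keeping the two estimators fixed gives a like-for-like comparison of methods on the same data, which is the kind of evaluation the abstract describes.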

MeSH terms

Algorithms
Artificial Intelligence
Data Interpretation, Statistical*
Discriminant Analysis
Gene Expression*
Genetic Markers
Least-Squares Analysis
Models, Genetic
Neural Networks, Computer
Oligonucleotide Array Sequence Analysis / statistics & numerical data
Predictive Value of Tests
Principal Component Analysis
Reproducibility of Results
Statistics, Nonparametric

Substances

Genetic Markers