PLS-regression: a basic tool of chemometrics

https://doi.org/10.1016/S0169-7439(01)00155-1

Abstract

PLS-regression (PLSR) is the PLS approach in its simplest, and in chemistry and technology, most used form (two-block predictive PLS). PLSR is a method for relating two data matrices, X and Y, by a linear multivariate model, but goes beyond traditional regression in that it also models the structure of X and Y. PLSR derives its usefulness from its ability to analyze data with many, noisy, collinear, and even incomplete variables in both X and Y. PLSR has the desirable property that the precision of the model parameters improves with an increasing number of relevant variables and observations.

This article reviews PLSR as it has developed into a standard tool of chemometrics, used in chemistry and engineering. The underlying model and its assumptions are discussed, and commonly used diagnostics are reviewed, together with the interpretation of the resulting parameters.

Two examples are used as illustrations: First, a Quantitative Structure–Activity Relationship (QSAR)/Quantitative Structure–Property Relationship (QSPR) data set of peptides is used to outline how to develop, interpret and refine a PLSR model. Second, a data set from the manufacturing of recycled paper is analyzed to illustrate time series modelling of process data by means of PLSR and time-lagged X-variables.

Introduction

In this article we review a particular type of multivariate analysis, namely PLS-regression, which uses the two-block predictive PLS model to model the relationship between two matrices, X and Y. In addition, PLSR models the “structure” of X and of Y, which gives richer results than the traditional multiple regression approach. PLSR and similar approaches provide quantitative multivariate modelling methods, with inferential possibilities similar to multiple regression, t-tests and ANOVA.

The present volume contains numerous examples of the use of PLSR in chemistry, and this article is merely an introductory review, showing the development of PLSR in chemistry until around the year 1990.

PLS-regression (PLSR) is a recently developed generalization of multiple linear regression (MLR) [1], [2], [3], [4], [5], [6]. PLSR is of particular interest because, unlike MLR, it can analyze data with strongly collinear (correlated), noisy, and numerous X-variables, and also simultaneously model several response variables, Y, i.e., profiles of performance. For the meaning of the PLS acronym, see Section 1.2.

The regression problem, i.e., how to model one or several dependent variables, responses, Y, by means of a set of predictor variables, X, is one of the most common data-analytical problems in science and technology. Examples in chemistry include relating Y=properties of chemical samples to X=their chemical composition, relating Y=the quality and quantity of manufactured products to X=the conditions of the manufacturing process, and relating Y=chemical properties, reactivity or biological activity of a set of molecules to X=their chemical structure (coded by means of many X-variables). The latter models are often called QSPR or QSAR. Abbreviations are explained in Section 1.3.

Traditionally, this modelling of Y by means of X is done using MLR, which works well as long as the X-variables are fairly few and fairly uncorrelated, i.e., X has full rank. With modern measuring instrumentation, including spectrometers, chromatographs and sensor batteries, the X-variables tend to be many and also strongly correlated. We shall therefore not call them “independent”, but instead “predictors”, or just X-variables, because they usually are correlated, noisy, and incomplete.
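As a small illustration of this rank requirement (our own sketch in Python/NumPy, not from the article): with two perfectly collinear X-variables, X′X is singular, so the MLR normal equations have no unique solution.

    import numpy as np

    # Two perfectly collinear X-variables make X'X singular, so MLR's
    # normal equations (X'X) b = X'y cannot be solved uniquely.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=20)
    X = np.column_stack([x1, 2.0 * x1])      # second column = 2 * first
    print(np.linalg.matrix_rank(X.T @ X))    # prints 1, not 2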

In handling numerous and collinear X-variables, and response profiles (Y), PLSR allows us to investigate more complex problems than before, and analyze available data in a more realistic way. However, some humility and caution are warranted; we are still far from a good understanding of the complications of chemical, biological, and economic systems. Also, quantitative multivariate analysis is still in its infancy, particularly in applications with many variables and few observations (objects, cases).

The PLS approach was originated around 1975 by Herman Wold for the modelling of complicated data sets in terms of chains of matrices (blocks), so-called path models, reviewed in Ref. [1]. This included a simple but efficient way to estimate the parameters in these models called NIPALS (Non-linear Iterative PArtial Least Squares). This led, in turn, to the acronym PLS for these models (Partial Least Squares). This relates to the central part of the estimation, namely that each model parameter is iteratively estimated as the slope of a simple bivariate regression (least squares) between a matrix column or row as the y-variable, and another parameter vector as the x-variable. So, for instance, the PLS weights, w, are iteratively re-estimated as X′u/(u′u) (see Section 3.10). The “partial” in PLS indicates that this is a partial regression, since the x-vector (u above) is considered as fixed in the estimation. This also shows that we can see any matrix–vector multiplication as equivalent to a set of simple bivariate regressions. This provides an intriguing connection between two central operations in matrix algebra and statistics, as well as giving a simple way to deal with missing data.
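To make this concrete, here is a minimal sketch in Python/NumPy of one two-block PLS component estimated NIPALS-style. The function name nipals_component, the starting choice of u, and the convergence test are our own illustrative choices; the paper itself presents no code.

    import numpy as np

    def nipals_component(X, Y, tol=1e-10, max_iter=500):
        # One two-block PLS component by NIPALS iteration (illustrative).
        # Each update is the slope of a simple bivariate least-squares
        # regression in which the other block's score vector is held
        # fixed -- the "partial" least squares step described above.
        u = Y[:, [0]]                      # start from a column of Y
        for _ in range(max_iter):
            w = X.T @ u / (u.T @ u)        # X-weights: regress X columns on u
            w /= np.linalg.norm(w)         # normalize to unit length
            t = X @ w                      # X-scores
            c = Y.T @ t / (t.T @ t)        # Y-weights: regress Y columns on t
            u_new = Y @ c / (c.T @ c)      # updated Y-scores
            if np.linalg.norm(u_new - u) < tol * np.linalg.norm(u_new):
                u = u_new
                break
            u = u_new
        p = X.T @ t / (t.T @ t)            # X-loadings, used for deflation
        return t, p, w, c, u

Successive components would be extracted by deflating X with the outer product tp′ (and, in some variants, deflating Y analogously) and repeating the iteration on the residuals.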

Gerlach et al. [7] applied multi-block PLS to the analysis of analytical data from a river system in Colorado with interesting results, but this was clearly ahead of its time.

Around 1980, the simplest PLS model with two blocks (X and Y) was slightly modified by Svante Wold and Harald Martens to better suit data from science and technology, and shown to be useful for dealing with complicated data sets where ordinary regression was difficult or impossible to apply. To give PLS a more descriptive meaning, H. Wold et al. have also recently started to interpret PLS as Projection to Latent Structures.

Abbreviations

    AA: Amino Acid
    ANOVA: ANalysis Of VAriance
    AR: AutoRegressive (model)
    ARMA: AutoRegressive Moving Average (model)
    CV: Cross-Validation
    CVA: Canonical Variates Analysis
    DModX: Distance to Model in X-space
    EM: Expectation Maximization
    H-PLS: Hierarchical PLS
    LDA: Linear Discriminant Analysis
    LV: Latent Variable
    MA: Moving Average (model)
    MLR: Multiple Linear Regression
    MSPC: Multivariate SPC
    NIPALS: Non-linear Iterative PArtial Least Squares
    NN: Neural Networks
    PCA: Principal Components Analysis
    PCR: Principal Components Regression
    PLS: Partial Least Squares projection to latent structures
    PLSR: PLS-Regression
    PLS-DA: PLS Discriminant Analysis
    PRESD: Predictive RSD
    PRESS: Predictive Residual Sum of Squares
    QSAR: Quantitative Structure–Activity Relationship
    QSPR: Quantitative Structure–Property Relationship
    RSD: Residual SD
    SD: Standard Deviation
    SDEP, SEP: Standard Error of Prediction
    SECV: Standard Error of Cross-Validation
    SIMCA: Simple Classification Analysis
    SPC: Statistical Process Control
    SS: Sum of Squares
    VIP: Variable Influence on Projection

We shall employ the common notation where column vectors are denoted by bold lower case characters, e.g., v, and row vectors shown as transposed, e.g., v′. Bold upper case characters denote matrices, e.g., X.

    AB: matrix multiplication
    ′: transpose, e.g., v′, X′
    a: index of components (model dimensions); a=1, 2, …, A
    A: number of components in a PC or PLS model
    i: index of objects (observations, cases); i=1, 2, …, N
    N: number of objects (cases, observations)
    k: index of X-variables; k=1, 2, …, K
    m: index of Y-variables; m=1, 2, …, M
    X: matrix of predictor variables, size (N×K)
    Y: matrix of response variables, size (N×M)
    bm: regression coefficient vector of the mth y, size (K×1)
    B: matrix of regression coefficients of all Y's, size (K×M)
    ca: PLSR Y-weights of component a
    C: the (M×A) Y-weight matrix; the ca are its columns
    E: the (N×K) matrix of X-residuals
    fm: residuals of the mth y-variable, an (N×1) vector
    F: the (N×M) matrix of Y-residuals
    G: number of CV groups; g=1, 2, …, G
    pa: PLSR X-loading vector of component a
    P: loading matrix; the pa are its columns
    R2: multiple correlation coefficient; amount of Y “explained” in terms of SS
    RX2: amount of X “explained” in terms of SS
    Q2: cross-validated R2; amount of Y “predicted”
    ta: X-scores of component a
    T: score matrix, size (N×A); its columns are the ta
    ua: Y-scores of component a
    U: score matrix, size (N×A); its columns are the ua
    wa: PLSR X-weights of component a
    W: the (K×A) X-weight matrix; the wa are its columns
    wa*: PLSR weights transformed to be independent between components
    W*: the (K×A) matrix of transformed PLSR weights; the wa* are its columns

Example 1, a quantitative structure–property relationship (QSPR)

We use a simple example from the literature with one Y-variable and seven X-variables. The problem is one of QSPR or QSAR, which differ only in that the response(s) Y are chemical properties in the former and biological activities in the latter. In both cases, X contains a quantitative description of the variation in chemical structure between the investigated compounds.

The objective is to understand the variation of y=DDGTS=the free energy of unfolding of a protein (tryptophane synthase α unit

PLSR and the underlying scientific model

PLSR is a way to estimate parameters in a scientific model, which basically is linear (see Section 4.3 for non-linear PLS models). This model, like any scientific model, consists of several parts: the philosophical, the conceptual, the technical, the numerical, the statistical, and so on. We here illustrate these using the QSPR/QSAR model of Example 1 (see above), but the arguments are similar in most other modelling in science and technology.

Our chemical thinking makes us formulate the influence

Latent Variables

In PLS modelling, we assume that the investigated system or process actually is influenced by just a few underlying variables, latent variables (LV's). The number of these LV's is usually not known, and one aim of the PLSR analysis is to estimate this number. Also, the PLS X-scores, ta, are usually not direct estimates of the LV's, but rather they span the same space as the LV's. Thus, the latter (denoted by V) are related to the former (T) by a usually unknown rotation matrix, R, with the

The initial PLSR analysis of the AA data

The first PLSR analysis (linear model) of the AA data gives one significant component explaining 43% of the Y-variance (R2=0.435, Q2=0.299). In contrast, MLR gives an R2 of 0.788, which is equivalent to PLSR with A=7 components. The full MLR solution, however, has a Q2 of −0.215, indicating that the model is poor and does not predict better than chance.
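For reference, Q2 is obtained from the predictive residual sum of squares (PRESS) of cross-validation as Q2=1−PRESS/SS. A minimal sketch, assuming y_pred_cv already holds, for each observation, the prediction from a model fitted without that observation's CV group (the function name is ours):

    import numpy as np

    def q2_score(y, y_pred_cv):
        # Cross-validated Q2 = 1 - PRESS/SS, with SS taken around the
        # mean of y. y_pred_cv must contain out-of-fold CV predictions.
        press = np.sum((y - y_pred_cv) ** 2)   # predictive residual SS
        ss = np.sum((y - y.mean()) ** 2)       # total SS of y
        return 1.0 - press / ss

A negative Q2, as for the full MLR model above, means that PRESS exceeds the total SS, i.e., the model predicts worse than simply using the mean of y.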

With just one significant PLS component, the only meaningful score plot is that of y against t (Fig. 3). The aromatic AA's, Trp, Phe, and,

Example 2 (SIMCODM)

High and consistent product quality combined with “green” plant operation is important in today's competitive industrial climate. The goal of process data modelling is often to reduce down time and eliminate sources of undesired and deleterious process variability. The second example investigates the possibilities of operating a process industry in an environmentally friendly manner [40].
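As noted in the abstract, the process data are modelled with PLSR and time-lagged X-variables. A minimal sketch of constructing such a lagged predictor matrix (make_lagged_X is our own illustrative helper, not from the paper):

    import numpy as np

    def make_lagged_X(X, lags):
        # Augment process data X (rows = time points) with lagged copies.
        # For lags=[1, 2], row t of the result holds x(t), x(t-1), x(t-2).
        # The first max(lags) rows are dropped because their lagged
        # values do not exist.
        max_lag = max(lags)
        blocks = [X[max_lag:]]                       # unlagged block, x(t)
        blocks += [X[max_lag - L:-L] for L in lags]  # lagged blocks, x(t-L)
        return np.hstack(blocks)

Each row of the augmented matrix then contains the current and recent past values of the process variables, so a single PLSR model can capture (part of) the process dynamics.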

At the Aylesford Newsprint paper-mill in Kent, UK, premium quality newsprint is produced

Summary; how to develop and interpret a PLSR model

(1) Have a good understanding of the stated problem, particularly which responses (properties), Y, are of interest to measure and model, and which predictors, X, should be measured and varied. If possible, i.e., if the X-variables are subject to experimental control, use statistical experimental design [41] for the construction of X.

(2) Get good data, both Y (responses) and X (predictors). Multivariate Y's provide much more information because they can first be separately analyzed by

Regression-like data-analytical problems

A number of seemingly different data-analytical problems can be expressed as regression problems with a special coding of Y or X. These include linear discriminant analysis (LDA), analysis of variance (ANOVA), and time series analysis (ARMA and similar models). With many and collinear variables (rank-deficient X), a PLSR solution can therefore be formulated for each of these.
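For the LDA case discussed below, the standard PLSR route is to code class membership as a dummy Y-matrix and regress it on X, i.e., PLS-DA. A minimal sketch of that coding (the helper name is ours):

    import numpy as np

    def dummy_code_classes(labels):
        # Code class labels as a dummy Y-matrix for PLS-DA: one column
        # per class, 1.0 for members and 0.0 otherwise. PLSR on (X, Y)
        # then plays the role of LDA for collinear, rank-deficient X.
        classes = sorted(set(labels))
        Y = np.zeros((len(labels), len(classes)))
        for i, lab in enumerate(labels):
            Y[i, classes.index(lab)] = 1.0
        return Y, classes

For example, labels ['a', 'b', 'a'] give a 3×2 Y with one column for class 'a' and one for class 'b'.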

In linear discriminant analysis (LDA), and the closely related canonical variates analysis (CVA), one has the X-matrix

Conclusions and discussion

PLSR provides an approach to the quantitative modelling of the often complicated relationships between predictors, X, and responses, Y, an approach that for complex problems is often more realistic than MLR, including its stepwise selection variants. This is because the assumptions underlying PLSR (correlations among the X's, noise in X, model errors) are more realistic than the MLR assumptions of independent and error-free X's.

The diagnostics of PLSR, notably cross-validation and score plots (u, t and t, t) with

Acknowledgements

Support from the Swedish Natural Science Research Council (NFR), and from the Center for Environmental Research in Umeå (CMF) is gratefully acknowledged. Dr. Erik Johansson is thanked for helping with the examples.

References (44)

  • N. Kettaneh-Wold, Analysis of mixture data with partial least squares, Chemom. Intell. Lab. Syst. (1992).
  • H. Wold, Soft modelling. The basic design and some extensions.
  • S. Wold et al., The collinearity problem in linear regression. The partial least squares approach to generalized inverses, SIAM J. Sci. Stat. Comput. (1984).
  • A. Höskuldsson, PLS regression methods, J. Chemom. (1988).
  • A. Höskuldsson.
  • S. Wold et al., PLS—partial least squares projections to latent structures.
  • M. Tenenhaus, La Régression PLS: Théorie et Pratique (1998).
  • N.E. El Tayar et al., Octan-1-ol–water partition coefficients of zwitterionic α-amino acids. Determination by centrifugal partition chromatography and factorization into steric/hydrophobic and polar components, J. Chem. Soc., Perkin Trans. (1992).
  • S. Hellberg et al., The prediction of bradykinin potentiating potency of pentapeptides. An example of a peptide quantitative structure–activity relationship, Acta Chem. Scand., Ser. B (1986).
  • M. Sandberg et al., New chemical dimensions relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids, J. Med. Chem. (1998).
  • J.E. Jackson, A User's Guide to Principal Components (1991).
  • A. Burnham et al., Frameworks for latent variable multivariate regression, J. Chemom. (1996).