PLS-regression: a basic tool of chemometrics
Introduction
In this article we review a particular type of multivariate analysis, namely PLS-regression, which uses the two-block predictive PLS model to model the relationship between two matrices, X and Y. In addition PLSR models the “structure” of X and of Y, which gives richer results than the traditional multiple regression approach. PLSR and similar approaches provide quantitative multivariate modelling methods, with inferential possibilities similar to multiple regression, t-tests and ANOVA.
The present volume contains numerous examples of the use of PLSR in chemistry, and this article is merely an introductory review, showing the development of PLSR in chemistry until around the year 1990.
PLS-regression (PLSR) is a recently developed generalization of multiple linear regression (MLR) [1], [2], [3], [4], [5], [6]. PLSR is of particular interest because, unlike MLR, it can analyze data with strongly collinear (correlated), noisy, and numerous X-variables, and also simultaneously model several response variables, Y, i.e., profiles of performance. For the meaning of the PLS acronym, see Section 1.2.
The regression problem, i.e., how to model one or several dependent variables, responses, Y, by means of a set of predictor variables, X, is one of the most common data-analytical problems in science and technology. Examples in chemistry include relating Y=properties of chemical samples to X=their chemical composition, relating Y=the quality and quantity of manufactured products to X=the conditions of the manufacturing process, and Y=chemical properties, reactivity or biological activity of a set of molecules to X=their chemical structure (coded by means of many X-variables). The latter models are often called QSPR or QSAR. Abbreviations are explained in Section 1.3.
Traditionally, this modelling of Y by means of X is done using MLR, which works well as long as the X-variables are fairly few and fairly uncorrelated, i.e., X has full rank. With modern measuring instrumentation, including spectrometers, chromatographs and sensor batteries, the X-variables tend to be many and also strongly correlated. We shall therefore not call them “independent”, but instead “predictors”, or just X-variables, because they usually are correlated, noisy, and incomplete.
In handling numerous and collinear X-variables, and response profiles (Y), PLSR allows us to investigate more complex problems than before, and to analyze the available data in a more realistic way. However, some humility and caution are warranted; we are still far from a good understanding of the complications of chemical, biological, and economic systems. Also, quantitative multivariate analysis is still in its infancy, particularly in applications with many variables and few observations (objects, cases).
The PLS approach was originated around 1975 by Herman Wold for the modelling of complicated data sets in terms of chains of matrices (blocks), so-called path models, reviewed in Ref. [1]. This included a simple but efficient way to estimate the parameters in these models called NIPALS (Non-linear Iterative PArtial Least Squares). This led, in turn, to the acronym PLS for these models (Partial Least Squares). This relates to the central part of the estimation, namely that each model parameter is iteratively estimated as the slope of a simple bivariate regression (least squares) between a matrix column or row as the y-variable, and another parameter vector as the x-variable. So, for instance, the PLS weights, w, are iteratively re-estimated as X′u/(u′u) (see Section 3.10). The “partial” in PLS indicates that this is a partial regression, since the x-vector (u above) is considered as fixed in the estimation. This also shows that we can see any matrix–vector multiplication as equivalent to a set of simple bivariate regressions. This provides an intriguing connection between two central operations in matrix algebra and statistics, as well as giving a simple way to deal with missing data.
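The iterative estimation described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction of one two-block NIPALS component loop, not the authors' own code; the choice of starting u-vector and the convergence tolerance are assumptions. Note how every update is a set of simple bivariate regressions with the regressor vector held fixed, e.g., the weight update w = X′u/(u′u) from the text.

```python
import numpy as np

def nipals_pls_component(X, Y, n_iter=100, tol=1e-10):
    """Estimate one two-block PLS component by NIPALS.

    Each update is a set of simple bivariate (least squares) regressions
    in which the regressor vector is treated as fixed -- the "partial"
    in Partial Least Squares.
    """
    u = Y[:, [0]]                      # starting Y-score: a column of Y (an assumption)
    for _ in range(n_iter):
        w = X.T @ u / (u.T @ u)        # X-weights: slope of regressing each X-column on u
        w /= np.linalg.norm(w)         # normalize the weight vector
        t = X @ w                      # X-scores
        c = Y.T @ t / (t.T @ t)        # Y-weights: regress each Y-column on t
        u_new = Y @ c / (c.T @ c)      # updated Y-scores
        if np.linalg.norm(u_new - u) < tol:
            u = u_new
            break                      # scores have converged
        u = u_new
    p = X.T @ t / (t.T @ t)            # X-loadings for deflation of X
    return w, t, p, c, u
```

After convergence, X and Y would be deflated (X − tp′, etc.) and the loop repeated for the next component.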
Gerlach et al. [7] applied multi-block PLS to the analysis of analytical data from a river system in Colorado with interesting results, but this was clearly ahead of its time.
Around 1980, the simplest PLS model with two blocks (X and Y) was slightly modified by Svante Wold and Harald Martens to better suit data from science and technology, and was shown to be useful for dealing with complicated data sets where ordinary regression was difficult or impossible to apply. To give PLS a more descriptive meaning, H. Wold et al. have also recently started to interpret PLS as Projection to Latent Structures.
- AA
Amino Acid
- ANOVA
ANalysis Of VAriance
- AR
AutoRegressive (model)
- ARMA
AutoRegressive Moving Average (model)
- CV
Cross-Validation
- CVA
Canonical Variates Analysis
- DModX
Distance to Model in X-space
- EM
Expectation Maximization
- H-PLS
Hierarchical PLS
- LDA
Linear Discriminant Analysis
- LV
Latent Variable
- MA
Moving Average (model)
- MLR
Multiple Linear Regression
- MSPC
Multivariate SPC
- NIPALS
Non-linear Iterative Partial Least Squares
- NN
Neural Networks
- PCA
Principal Components Analysis
- PCR
Principal Components Regression
- PLS
Partial Least Squares projection to latent structures
- PLSR
PLS-Regression
- PLS-DA
PLS Discriminant Analysis
- PRESD
Predictive RSD
- PRESS
Predictive Residual Sum of Squares
- QSAR
Quantitative Structure–Activity Relationship
- QSPR
Quantitative Structure–Property Relationship
- RSD
Residual SD
- SD
Standard Deviation
- SDEP, SEP
Standard error of prediction
- SECV
Standard error of cross-validation
- SIMCA
Simple Classification Analysis
- SPC
Statistical Process Control
- SS
Sum of Squares
- VIP
Variable Influence on Projection
We shall employ the common notation where column vectors are denoted by bold lower case characters, e.g., v, and row vectors shown as transposed, e.g., v′. Bold upper case characters denote matrices, e.g., X.
- ∗
multiplication, e.g., A∗B
- ′
transpose, e.g., v′,X′
- a
index of components (model dimensions); (a=1,2,…,A)
- A
number of components in a PC or PLS model
- i
index of objects (observations, cases); (i=1,2,…,N)
- N
number of objects (cases, observations)
- k
index of X-variables (k=1,2,…,K)
- m
index of Y-variables (m=1,2,…,M)
- X
matrix of predictor variables, size (N∗K)
- Y
matrix of response variables, size (N∗M)
- b_m
regression coefficient vector of the mth y-variable; size (K∗1)
- B
matrix of regression coefficients of all Y's; size (K∗M)
- c_a
PLSR Y-weights of component a
- C
the (M∗A) Y-weight matrix; the c_a are its columns
- E
the (N∗K) matrix of X-residuals
- f_m
residuals of the mth y-variable; (N∗1) vector
- F
the (N∗M) matrix of Y-residuals
- G
number of CV groups (g=1,2,…,G)
- p_a
PLSR X-loading vector of component a
- P
the (K∗A) loading matrix; the p_a are its columns
- R²
multiple correlation coefficient; amount of Y “explained” in terms of SS
- R²X
amount of X “explained” in terms of SS
- Q²
cross-validated R²; amount of Y “predicted”
- t_a
X-scores of component a
- T
the (N∗A) score matrix; the t_a are its columns
- u_a
Y-scores of component a
- U
the (N∗A) score matrix; the u_a are its columns
- w_a
PLSR X-weights of component a
- W
the (K∗A) X-weight matrix; the w_a are its columns
- w_a*
PLSR weights transformed to be independent between components
- W*
the (K∗A) matrix of transformed PLSR weights; the w_a* are its columns
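As a quick consistency check of the notation above, the fragment below builds arrays with the stated sizes and forms the standard two-block PLS reconstructions: X modelled as TP′, Y predicted as TC′, and the regression coefficients obtained as B = W*C′. The numerical values are random and purely illustrative; the sizes N, K, M, A are arbitrary choices.

```python
import numpy as np

# Illustrative sizes (arbitrary choices, not from the article)
N, K, M, A = 10, 7, 2, 3

rng = np.random.default_rng(1)
T = rng.normal(size=(N, A))       # X-score matrix; columns are t_a
P = rng.normal(size=(K, A))       # X-loading matrix; columns are p_a
C = rng.normal(size=(M, A))       # Y-weight matrix; columns are c_a
W_star = rng.normal(size=(K, A))  # transformed weight matrix W*

X_hat = T @ P.T       # (N*K): modelled part of X
Y_hat = T @ C.T       # (N*M): modelled part of Y
B = W_star @ C.T      # (K*M): regression coefficients of all Y's
```

The shapes of X_hat, Y_hat, and B match the (N∗K), (N∗M), and (K∗M) sizes listed for X, Y, and B.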
Section snippets
Example 1, a quantitative structure property relationship (QSPR)
We use a simple example from the literature with one Y-variable and seven X-variables. The problem is one of QSPR or QSAR, which differ only in that the response(s) Y are chemical properties in the former and biological activities in the latter. In both cases, X contains a quantitative description of the variation in chemical structure between the investigated compounds.
The objective is to understand the variation of y=DDGTS=the free energy of unfolding of a protein (tryptophane synthase α unit
PLSR and the underlying scientific model
PLSR is a way to estimate parameters in a scientific model, which basically is linear (see Section 4.3 for non-linear PLS models). This model, like any scientific model, consists of several parts: the philosophical, the conceptual, the technical, the numerical, the statistical, and so on. We here illustrate these using the QSPR/QSAR model of Example 1 (see above), but the arguments are similar in most other modelling in science and technology.
Our chemical thinking makes us formulate the influence
Latent Variables
In PLS modelling, we assume that the investigated system or process actually is influenced by just a few underlying variables, latent variables (LV's). The number of these LV's is usually not known, and one aim with the PLSR analysis is to estimate this number. Also, the PLS X-scores, ta, are usually not direct estimates of the LV's, but rather they span the same space as the LV's. Thus, the latter (denoted by V) are related to the former (T) by a, usually unknown, rotation matrix, R, with the
The initial PLSR analysis of the AA data
The first PLSR analysis (linear model) of the AA data gives one significant component explaining 43% of the Y-variance (R2=0.435, Q2=0.299). In contrast, the MLR gives an R2 of 0.788, which is equivalent to PLSR with A=7 components. The full MLR solution, however, has a Q2 of −0.215, indicating that the model is poor, and does not predict better than chance.
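The contrast above between a high R² (fit) and a negative Q² (prediction) can be made concrete with a small sketch. The fragment below computes R² and the leave-one-out cross-validated Q² = 1 − PRESS/SS for a plain MLR fit rather than for PLSR; the helper name and the data in the usage example are hypothetical, but the diagnostic logic is the one used in the text.

```python
import numpy as np

def r2_q2_mlr(X, y):
    """R2 (fit) and leave-one-out Q2 (prediction) for an MLR model y = Xb.

    X should include a column of ones if an intercept is wanted.
    """
    n = len(y)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ b) ** 2)        # residual SS of the full fit
    press = 0.0                              # PREdictive residual Sum of Squares
    for i in range(n):                       # leave-one-out cross-validation
        mask = np.arange(n) != i
        bi, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        press += (y[i] - X[i] @ bi) ** 2     # error on the left-out object
    ss_tot = np.sum((y - y.mean()) ** 2)     # total SS around the mean
    return 1 - ss_res / ss_tot, 1 - press / ss_tot
```

For least squares fits, PRESS is never smaller than the residual SS, so Q² never exceeds R²; a Q² near or below zero signals a model that fits but does not predict.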
With just one significant PLS component, the only meaningful score plot is that of y against t (Fig. 3). The aromatic AA's, Trp, Phe, and,
Example 2 (SIMCODM)
High and consistent product quality combined with “green” plant operation is important in today's competitive industrial climate. The goal of process data modelling is often to reduce the amount of down time and eliminate sources of undesired and deleterious process variability. The second example shows the investigation of possibilities to operate a process industry in an environment-friendly manner [40].
At the Aylesford Newsprint paper-mill in Kent, UK, premium quality newsprint is produced
Summary; how to develop and interpret a PLSR model
(1) Have a good understanding of the stated problem, particularly which responses (properties), Y, are of interest to measure and model, and which predictors, X, should be measured and varied. If possible, i.e., if the X-variables are subject to experimental control, use statistical experimental design [41] for the construction of X.
(2) Get good data, both Y (responses) and X (predictors). Multivariate Y's provide much more information because they can first be separately analyzed by
Regression-like data-analytical problems
A number of seemingly different data-analytical problems can be expressed as regression problems with a special coding of Y or X. These include linear discriminant analysis (LDA), analysis of variance (ANOVA), and time series analysis (ARMA and similar models). With many and collinear variables (rank-deficient X), a PLSR solution can therefore be formulated for each of these.
In linear discriminant analysis (LDA), and the closely related canonical variates analysis (CVA), one has the X-matrix
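For the LDA/PLS-DA case, the special coding of Y mentioned above is simply a dummy matrix with one 0/1 column per class, so that class membership can be modelled by regression. A minimal sketch, with an assumed helper name:

```python
import numpy as np

def dummy_code(classes):
    """Code class membership as a dummy (0/1) Y-matrix, one column per class.

    With Y coded this way, a discriminant problem can be run as a
    regression problem (e.g., PLS-DA when the regression is PLSR).
    """
    labels = sorted(set(classes))            # one column per distinct class
    Y = np.zeros((len(classes), len(labels)))
    for i, c in enumerate(classes):
        Y[i, labels.index(c)] = 1.0          # 1 in the column of object i's class
    return Y, labels
```

Each row of Y then contains a single 1, in the column of the class to which that object belongs.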
Conclusions and discussion
PLSR provides an approach to the quantitative modelling of the often complicated relationships between predictors, X, and responses, Y, that for complex problems is often more realistic than MLR, including its stepwise selection variants. This is because the assumptions underlying PLSR (correlations among the X's, noise in X, model errors) are more realistic than the MLR assumptions of independent and error-free X's.
The diagnostics of PLSR, notably cross-validation and score plots (u, t and t, t) with
Acknowledgements
Support from the Swedish Natural Science Research Council (NFR), and from the Center for Environmental Research in Umeå (CMF) is gratefully acknowledged. Dr. Erik Johansson is thanked for helping with the examples.
References (44)
- et al., Partial least squares modelling with latent variables, Anal. Chim. Acta (1979)
- et al., Latent variable regression tools, Chemom. Intell. Lab. Syst. (1999)
- Analysis of two partial least squares algorithms for multivariate calibration, Chemom. Intell. Lab. Syst. (1987)
- et al., Missing data methods in PCA and PLS: score calculations with incomplete observation, Chemom. Intell. Lab. Syst. (1996)
- et al., Missing values in principal component analysis, Chemom. Intell. Lab. Syst. (1998)
- et al., Multivariate design of process experiments (M-DOPE), Chemom. Intell. Lab. Syst. (1994)
- et al., Maturity determination of organic matter in coals using the methylphenanthrene distribution, Geochim. Cosmochim. Acta (1987)
- et al., Modified jack-knife estimation of parameter uncertainty in bilinear modeling (PLSR), Food Qual. Preference (2000)
- The latent variable, an editorial, Chemom. Intell. Lab. Syst. (1992)
- et al., PLS discriminant plots
- Analysis of mixture data with partial least squares, Chemom. Intell. Lab. Syst.
- Soft modelling. The basic design and some extensions
- The collinearity problem in linear regression. The partial least squares approach to generalized inverses, SIAM J. Sci. Stat. Comput.
- PLS regression methods, J. Chemom.
- PLS—partial least squares projections to latent structures
- La Regression PLS: Theorie et Pratique
- Octan-1-ol-water partition coefficients of zwitterionic α-amino acids. Determination by centrifugal partition chromatography and factorization into steric/hydrophobic and polar components, J. Chem. Soc., Perkin Trans.
- The prediction of bradykinin potentiating potency of pentapeptides. An example of a peptide quantitative structure–activity relationship, Acta Chem. Scand., Ser. B
- New chemical dimensions relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids, J. Med. Chem.
- A User's Guide to Principal Components
- Frameworks for latent variable multivariate regression, J. Chemom.