Chemometric Spectral Modeling

PeakLab uniquely offers a combination of peak fitting and direct automated multivariate predictive model creation for the most accurate chemometric predictive models you will find in any software product.

PeakLab’s predictive modeling is often used to estimate chromatographic or other lab-intensive values from far faster and simpler spectroscopy. It is a general statistical modeling procedure that can create a multivariate predictive model for any lab-measured value using any data matrix from any source. That data matrix can consist of spectra in wavelength or frequency, chromatographic data sets in time, or any other collection of data series, including principal components and Fourier domain deconvolved spectra.

Spectral modeling has largely relied on Partial Least Squares (PLS) or Principal Component Regression (PCR) predictions. Instead of offering one more implementation of these now ubiquitous procedures, we made a conscious choice to produce a wholly new and far simpler robust method for generating these predictive models. If you have used PLS and PCR for chemometric models, please see our analytic reasoning for this new technology near the bottom of this page.

The PeakLab solution for chemometric modeling consists of the following:

It is not an easy matter to spot an outlier spectrum when plotting the hundreds or thousands of spectra that will be fit to produce a predictive model. PeakLab offers a graphical visualization specifically for this task. In the plot below, a data matrix consisting of 156 spectra is plotted and those with a target y value between 4% and 5% are highlighted.

Before building a predictive model, it is useful to fit the spectra of the known target analyte(s) with a set of equal-width Gaussian or Voigt peaks. This furnishes two essential pieces of information for the predictive modeling. The optimal fit determines the spectral resolution of the instrument, which is then used to set the x-spacing of the wavelengths in the predictive model. The fitting also estimates the wavelengths that should appear as the principal predictors in the chemometric models. In the reference standards example below, there are four principal peaks, all at adjacent wavelengths, and the optimal multiple Gaussian fit estimates an SD width of about 4 nm, a reasonable value for high-resolution FTNIR spectra.
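As a rough illustration of the kind of function such a fit adjusts, the sketch below evaluates a sum of Gaussian peaks that all share one SD width. The struct and function names are hypothetical, and the amplitude form (peak height rather than area) is an assumption; PeakLab's own peak parameterization may differ.

```cpp
#include <cmath>
#include <vector>

// One peak of the model: height at the center, and center position (e.g. in nm).
// Hypothetical type for illustration only.
struct GaussianPeak {
    double amplitude;
    double center;
};

// Evaluate a sum of Gaussian peaks at x, with every peak sharing the same
// SD width `sd`. A fit would adjust the amplitudes, centers, and the single sd.
double equalWidthGaussianSum(const std::vector<GaussianPeak>& peaks,
                             double sd, double x)
{
    double y = 0.0;
    for (const GaussianPeak& p : peaks) {
        const double z = (x - p.center) / sd;   // distance from center in SD units
        y += p.amplitude * std::exp(-0.5 * z * z);
    }
    return y;
}
```

With a shared width, the fitted `sd` is a single number that can be read as an estimate of instrument resolution, such as the roughly 4 nm mentioned above.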

Unlike complex chemometric modeling programs, PeakLab presents all of the modeling settings in one dialog where most options can be left at the defaults. Simply select the variable in the data matrix that is to be modeled and the range of wavelengths you want to use, and apply filters, if needed, to keep the modeling time reasonable. In this example, 60 million candidate models will be fit in about 10 seconds. PeakLab’s filters for higher parameter counts make otherwise impractical full permutation fits viable. Without these smart filters, which remove unproductive predictors as the fitting algorithm proceeds, the fit in the example below would require the fitting of 220 billion models and close to 12 hours.
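To give a sense of why filtering matters, the number of candidate models grows combinatorially with the parameter count. The sketch below assumes that a "full permutation" fit evaluates every unordered subset of k wavelengths drawn from n candidates; that reading, and the function name, are our assumptions rather than a documented PeakLab definition.

```cpp
#include <cstdint>

// Number of distinct k-wavelength subsets of n candidate wavelengths, C(n, k).
// Assumption: each subset corresponds to one candidate model.
// Note: the 64-bit result can overflow for very large n and k.
std::uint64_t candidateModelCount(unsigned n, unsigned k)
{
    if (k > n) return 0;
    if (k > n - k) k = n - k;              // symmetry: C(n, k) == C(n, n - k)
    std::uint64_t count = 1;
    for (unsigned i = 1; i <= k; ++i) {
        // After this step count == C(n - k + i, i), so the division is exact.
        count = count * (n - k + i) / i;
    }
    return count;
}
```

For instance, 100 candidate wavelengths taken 4 at a time already give 3,921,225 subsets, and each additional parameter multiplies the count again.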

Unique to PeakLab, models with nine and higher parameter counts can be added by a “smart stepwise” procedure that optimally begins with the best retained full permutation fits of a lesser parameter count. The algorithm then uses an intelligent multidimensional search where wavelengths that may have been removed or filtered are given new opportunities to be added back when specific combinations of wavelengths make this possible.

Also unique to PeakLab are sparse PLS models. These are not partial least squares models formed in the conventional sense, but rather consist of a weighted combination of the best full permutation and stepwise models containing the most statistically significant wavelengths in the retained fitted models. Unlike a PLS model, where every wavelength is weighted into the overall model, these are robust sparse models with up to 15 select wavelengths. Every wavelength is required to test as statistically significant in the individual models, and those individual models are weighted by prediction goodness of fit into an overall model. Sparse PLS models are a convenience that offers the robustness of combining many different individually effective predictive models across a composite set of deterministic wavelengths.

PeakLab is founded upon advanced proprietary predictive modeling science developed for financial time series. You should find PeakLab’s predictive modeling to be the most effective commercially available chemometric predictive modeling in the world, but we also believe you will find it is the most accessible, straightforward, and intelligent.


In the above predictive model plots, the x-axis contains known chromatographic reference values and the y-axis contains the prediction from the spectroscopy. The lower plot shows the leave-one-out prediction errors from the selected model where each point is an average of multiple spectral replicates. The upper plot shows the prediction error of out-of-sample spectra (wholly unknown to the model) and where, as in field analyses, no replicates were averaged and individual spectra were used for the estimation.
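For readers unfamiliar with leave-one-out errors, the sketch below computes them for the simplest case: a one-wavelength model y = a + b*x, refit once per sample with that sample held out. It is an illustration under those assumptions, with a hypothetical function name; it does not average replicates and is not PeakLab's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square leave-one-out prediction error for y = a + b*x.
// Each sample is predicted by a line fit to all the other samples.
// Returns NaN if there are fewer than 3 samples or a refit is degenerate.
double looRmse(const std::vector<double>& x, const std::vector<double>& y)
{
    const std::size_t n = x.size();
    if (n < 3 || y.size() != n) return std::nan("");

    double sumSq = 0.0;
    for (std::size_t leave = 0; leave < n; ++leave) {
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            if (i == leave) continue;       // hold this sample out
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        const double m = static_cast<double>(n - 1);
        const double denom = m * sxx - sx * sx;
        if (denom == 0.0) return std::nan("");   // all remaining x identical
        const double b = (m * sxy - sx * sy) / denom;
        const double a = (sy - b * sx) / m;
        const double err = y[leave] - (a + b * x[leave]);
        sumSq += err * err;
    }
    return std::sqrt(sumSq / static_cast<double>(n));
}
```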

In direct spectral fits, far better estimates of significant wavelengths are realized. These tend to be much less influenced by secondary and inverse correlations or other artifacts arising from generating factor arrays. In the significance plot below, the contour reflects the weighted statistical significance for each wavelength that appears within the retained (best fit) models. At a certain parameter count and beyond, the significant wavelengths become sharply defined.


From state-of-the-art outlier removal methods that allow you to refit and inspect a predictive model with the outliers removed, to writing the model code for you in C++ or Visual Basic, PeakLab offers a full modeling solution from beginning to end. The C++ code below was generated from one of PeakLab’s sparse PLS models.

C++ Language Code – argument is specific spectra
  double glm01(double *spec)
  {
  // spec[0] X Predictor 1 (1658, index=8)
  // spec[1] X Predictor 2 (1660, index=10)
  // spec[2] X Predictor 3 (1662, index=12)
  // spec[3] X Predictor 4 (1664, index=14)
  // spec[4] X Predictor 5 (1670, index=20)
  // spec[5] X Predictor 6 (1672, index=22)
  // spec[6] X Predictor 7 (1758, index=108)
  // spec[7] X Predictor 8 (1760, index=110)
  // spec[8] X Predictor 9 (1762, index=112)
  // spec[9] X Predictor 10 (1766, index=116)
  // spec[10] X Predictor 11 (1782, index=132)
  // spec[11] X Predictor 12 (1816, index=166)
  // spec[12] X Predictor 13 (1818, index=168)
  // spec[13] X Predictor 14 (1822, index=172)
  // spec[14] X Predictor 15 (1828, index=178)
    // p[0] is the intercept; p[i] multiplies spec[i-1]
    double p[16]=  {
      4.06405812524123, -198.551348386628, -177.594084516741, -141.842381803743, -59.6348303087346,
      280.22152394363, 538.494036162069, -216.618929884493, -759.483492213287, 778.463631058768,
      -830.474230583933, 852.825799066597, 244.969804203827, 173.817451045103, 148.18538374493,
      // only 15 of the 16 coefficients appear in this listing; p[15] (for spec[14])
      // is missing and is zero-initialized here until it is supplied
      };
    int nx = 15;
    double estimate = p[0];
    for(int i=1; i<=nx; i++)
      estimate += p[i]*spec[i-1];
    return estimate;
  }
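
The generated function takes only the 15 selected intensities, so a caller holding a full spectrum would first gather them. A minimal sketch follows, using the index values listed in the comments of the listing above. It assumes those values are zero-based offsets into the full spectrum array, which the listing does not confirm, and the helper name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Index values copied from the comments in the generated glm01 listing.
// Assumption: they are zero-based positions in the full spectrum.
static const int kPredictorIndex[15] = {
    8, 10, 12, 14, 20, 22, 108, 110, 112, 116, 132, 166, 168, 172, 178
};

// Copy the 15 selected intensities out of a full spectrum, in predictor order.
// Returns an empty vector if the spectrum is too short to contain them all.
std::vector<double> gatherPredictors(const std::vector<double>& fullSpectrum)
{
    std::vector<double> spec;
    for (int idx : kPredictorIndex) {
        if (static_cast<std::size_t>(idx) >= fullSpectrum.size())
            return std::vector<double>();
        spec.push_back(fullSpectrum[static_cast<std::size_t>(idx)]);
    }
    return spec;
}
```

When the result is non-empty, its `data()` pointer could be passed as the `spec` argument of the generated function.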

The best direct spectral models should theoretically outperform optimum PLS and PCR models. For example, in PLS, some portion of the correlation captured by the latent variables will consist of random noise correlations and relationships with non-target components, moisture, or solvents. Experienced predictive modelers know that such predictive relationships are suspect, unreliable, and far from robust, and wisely never incorporate them in a predictive system, since they can vanish without warning in real-world systems. In general, such correlations are poorly understood, if at all, and are often blindly included in PLS and PCR modeling.

We have never encountered an ‘easy’ modeling scenario where PLS and PCR optimum factor counts could match the parameter count in direct spectral models, or where the prediction performance was as good. Even a simple case where a component could be efficiently predicted from a single spectral wavelength will typically optimize at three or four PLS factors or PCR principal components. Further, since PLS and PCR models must generally be converted back to the original spectral domain from the factor variables, those models may involve hundreds of parameters to address what a simple one wavelength direct spectral model manages with a higher prediction accuracy. Here is a white paper for a simple modeling example.

Similarly, we have never encountered a ‘difficult’ modeling scenario where the PLS and PCR factor counts weren’t higher than the parameter counts in an optimum spectral model, nor have we seen a case where the PLS or PCR models outperformed the direct spectral fits with respect to prediction accuracy. Here is a white paper for a complex modeling example where reflectance spectroscopy is impacted by particle size, moisture, and a massive obfuscation of the target components by other ingredients in a natural product.

PeakLab’s direct spectral fits use a full permutation procedure that can evaluate as many as a hundred million candidate models in less than a minute, and in doing so determines far more accurately the wavelengths where the deterministic information resides. Since no derived or latent variables are used, nothing is hidden, and random chance correlations and inverse relationships will not dominate the predictions.

The PeakLab direct spectral modeling produces true multivariate linear models, which have been used in predictive modeling for decades and which are the bread-and-butter GLM models found in every major statistical software product. Unlike PLS and PCR, where the factor generation is usually hidden and proprietary, and where different PLS and PCR software will not produce exactly the same results, the PeakLab models can be replicated to full precision in any professional statistical software’s GLM procedure.
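To make the replication point concrete, the sketch below fits a multivariate linear model by ordinary least squares, returning coefficients in an intercept-first layout (p[0] is the intercept). It solves the normal equations with Gauss-Jordan elimination for brevity; professional statistical packages typically use more numerically stable factorizations such as QR, and the function name is hypothetical. It is not PeakLab's fitting code.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Ordinary least squares fit of y[i] = p[0] + sum_j p[j+1] * X[i][j].
// Builds the normal equations (A^T A) p = A^T y, where A is X with a leading
// column of ones, and solves them by Gauss-Jordan elimination with partial
// pivoting. Returns an empty vector on bad input or a singular system.
std::vector<double> fitLinearModel(const std::vector<std::vector<double>>& X,
                                   const std::vector<double>& y)
{
    const std::size_t n = X.size();
    if (n == 0 || y.size() != n) return {};
    const std::size_t k = X[0].size() + 1;            // +1 for the intercept

    // Augmented k x (k+1) matrix [A^T A | A^T y].
    std::vector<std::vector<double>> M(k, std::vector<double>(k + 1, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        if (X[i].size() + 1 != k) return {};
        std::vector<double> row(k, 1.0);              // row[0] stays 1 (intercept)
        for (std::size_t j = 1; j < k; ++j) row[j] = X[i][j - 1];
        for (std::size_t r = 0; r < k; ++r) {
            for (std::size_t c = 0; c < k; ++c) M[r][c] += row[r] * row[c];
            M[r][k] += row[r] * y[i];
        }
    }

    for (std::size_t col = 0; col < k; ++col) {
        std::size_t piv = col;                        // pick the largest pivot
        for (std::size_t r = col + 1; r < k; ++r)
            if (std::fabs(M[r][col]) > std::fabs(M[piv][col])) piv = r;
        if (std::fabs(M[piv][col]) < 1e-12) return {};
        std::swap(M[col], M[piv]);
        for (std::size_t r = 0; r < k; ++r) {         // clear the column elsewhere
            if (r == col) continue;
            const double f = M[r][col] / M[col][col];
            for (std::size_t c = col; c <= k; ++c) M[r][c] -= f * M[col][c];
        }
    }

    std::vector<double> p(k);
    for (std::size_t r = 0; r < k; ++r) p[r] = M[r][k] / M[r][r];
    return p;
}
```

Given the same predictors and reference values, any least squares routine should reproduce the same coefficients up to floating-point rounding.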

In using PeakLab to generate your spectral models, you are using the most stable, sound, and robust mathematical and statistical science available, with generated code that is efficient and simple enough to deploy readily on web servers, embedded systems, and handheld spectrometers. As part of PeakLab’s suite of analytical tools, the software cost for this modeling capability is a fraction of the price of commercial statistical or unscrambling software that is often purchased for chemometric modeling.