PeakLab v1 Documentation Contents AIST Software Home AIST Software Support

Modeling Spectra Tutorial I - Liquid UV-VIS

In this tutorial, we will generate certain of the models documented in the Modeling Spectra - Part I - UV-VIS Data white paper. In this paper we discuss using the 3D HPLC DAD spectra in a reconstruction technique where the UV-VIS spectroscopy is taken directly from the chromatography.

Modeling Basics

In modeling lab variables, such as chromatographic quantities, to spectra, we need to know, or would like to know, the following:

1. Wavelength band where the entities that are being estimated absorb

2. Resolution of the spectral signal, the line-spread width

3. Signal/Noise of the spectral signal

Wavelength band where the entities that are being estimated absorb

In modeling spectra for a given target material, we would ideally build a model using only the WLs (wavelengths) where the entity or entities absorb. We would prefer to exclude indirect relationships in the model, where possible. If a given entity in the spectra is inversely related to the target molecule(s), for example, and the WLs where this inverse material absorbs are included in the model, the predicted estimates will depend on a non-target entity which can change from its state in the design or model data to such a measure, the predictive model can fail or perform poorly. Correlated or inversely correlated items can be as simple, for example, as the moisture one would signal in the near infrared.

In the Modeling Spectra - Part I - UV-VIS Data white paper, we estimate the total curcuminoid percentages in turmeric powders using liquid UV-VIS spectra. From Figures 4, 5, and 6 in this white paper, we know that the total curcuminoids have a strong absorbance from 320 nm in the UV to about 480 nm in the blue, and that these absorbances are almost exclusively the target curcuminoids. Competing entities with similar absorbances are not a significant factor.

Resolution of the spectral signal, the line-spread width of a pure peak in the spectrum

We want to know the resolution of the spectral signal in order to have a fair sense for the sampling density to use for the spectral data in the modeling. In the section that follows, we will use PeakLab's multiple Gaussian fitting to estimate the resolution as a Gaussian SD width. In our experience, we use a rule of thumb where the WL increment is set to approximately half this Gaussian SD.

Signal/Noise of the spectral signal

We also want to know the S/N of the spectral signal. If the S/N is especially strong, the modeling may support a higher sampling density. If weak, the sampling density may need to be somewhat less to minimize the fitting of noise. We will also assess the S/N of the data before we begin the actual modeling.

Gaussian Peak Fitting

From the **File** menu, select **Open...** and select the file **ModelingUVVIS_AvgSpectrum.pfd
**from the program's installed default data directory (\PeakLab\Data).

Click the **Section Data** button in the left panel of the main
window. Enter **320** for **Xi** and **480** for **Xf**, click **Apply**. This will
be the data that we will first fit to determine the resolution of the spectral data. It is the average
of the UV-VIS liquid spectra from 156 turmeric samples. Click **OK** to exit the procedure and **OK**
once again to acknowledge the sectioning and to return to the main window.

For this example, we will fit the data directly without a baseline correction.

Click the **Hidden Peaks - Second Derivative** button in the left panel of the main
window.

Set the options in the dialog as follows:

**No Baseline**

Sm n(1) **10**

Sm alg **order 6 S-S-D**

**Spectroscopy**

**Gauss**

Amp% **1.50**

D2 % **-99.00**

Use IRF,ZDD Config **Check**

Vary width a2 **Uncheck**

These settings use a 6th order Savitzky-Golay D2 (second derivative smoothing) filter with two pre-smoothing
passes of the raw data. The one-sided smoothing window of 10 points produces an overall window of 21 points.
The D2 % can be set anywhere from -99% to +99% of the D2 range. Here we use -99%, meaning all D2 minima
will be treated as peaks irrespective of where the minima occur within the D2 range. Here it is important
that the a_{2} width is not varied - we want to fit a single shared width Gaussian. We will use
that single Gaussian SD as the resolution of this data. With these seetings, ten Gaussian peaks are automatically
placed.

Click the **Modify Peak Fit Preferences** button to the right of the Peak Fit button. Change the Maximum
Total Iterations to **1000** and the a1 Global Constraint to **10**%. These types of fits can involve
a significant shift in center locations and may require a higher count of iterations. Click **OK**.

Click the **Peak Fit** button in the left panel of the dialog and select the last of the fit
strategies, the **Fit with Reduced Data Prefit, Cycle Peaks, 2 Pass, Lock Shared Parameters on Pass 1**
option. Be sure the **Fit using Sequential Constraints** option is checked and then click **OK**
to begin the fitting.

When the fitting is complete click **Review Fit** to enter the Peak
Fit Review.

In the PeakLab graph's toolbar, click the Split Y and Y2 Plots to put the fitted peaks on the same plot as the data and overall fit.

In the PeakLab
graph's toolbar, click the 2D
View button. In the **Grids** section uncheck the **All** box. In the **Lines** secion set
the **Width** to **2** and check the **Smooth (anti-alias) box**. Click **OK**.

In the PeakLab
graph's toolbar, click the Select
Color Scheme button and select the **Page White II** color scheme. Click **OK**.

In the PeakLab
graph's toolbar, click the Select
Function Labels button and select the **Center Parameter a1** option and click **OK**.

Click the **Numeric** button.

**Fitted Parameters**

**r**^{2}** Coef Det****
****DF Adj r**^{2}**
****Fit Std Err****
****F-value****
****ppm uVar**

**0.99999453****
****0.99999370****
****0.30819875****
****1,278,813 ****
****5.47379293**

** Peak****
****Type****
**** a0 ****
**** a1 ****
**** a2 ****
**

**
1****
****Gauss****
****2327.04317****
****312.781753****
****12.7216043****
**

**
2****
****Gauss****
****2813.06354****
****336.027026****
****12.7216043****
**

**
3****
****Gauss****
****3608.05433****
****356.287029****
****12.7216043****
**

**
4****
****Gauss****
****4520.65944****
****377.114717****
****12.7216043****
**

**
5****
****Gauss****
****7068.21501****
****397.873914****
****12.7216043****
**

**
6****
****Gauss****
****8968.81560****
****417.763054****
****12.7216043****
**

**
7****
****Gauss****
****5867.31659****
****432.841836****
****12.7216043****
**

**
8****
****Gauss****
****6989.27350****
****448.096465****
****12.7216043****
**

**
9****
****Gauss****
****2204.41071****
****466.717086****
****12.7216043****
**

**
10****
****Gauss****
****6557.48869****
****516.201847****
****12.7216043****
**

The last of the peaks fails significance in its a_{0} area parameter (grayed), typical of
partial peaks. The fitted Gaussian a_{2} width (SD) is 12.7 nm.

In this type of estimate, we performed a simple multiple Gaussian peak fit where we fit wavelength
(nm) instead of frequency (cm^{-1}) and where, by necessity, a single fitted width is shared across
all peaks.

We perform this type of preliminary fitting to ascertain two important pieces of information. First, the width of the Gaussian peaks helps us to understand the spacing for the predictors used in the full permutation predictive modeling. In this case, a spacing of 5 nm is realistic, given this estimated peak width. If desired, the linear modeling algorithm can readily generate parameters further refined to the smallest wavelength spacing to see if a higher resolution in the final wavelengths offers any significant benefit. An x-spacing that is approximately half the Gaussian SD of the spectral peaks would be approximately 6 nm. Although the retained full permutation direct spectral fits rarely consist of fitting noise when the sample size is sufficient, such can still occur if the x-spacing is too narrow. Further, a high count of closely spaced predictors will produce seriously lengthy full permutation modeling times for no useful purpose.

Second, the center values of the fitted Gaussians represent wavelengths we might reasonably expect to see in the modeling. The peaks shown in this fit are only approximations, apply only to this averaged spectrum of all of the samples, and may not correspond exactly with the pure component absorbances which produce the overall spectra. This does, however, approximate what the linear modeling algorithms will be able to see and process. In this instance, we have three main UV wavelengths 336, 356, and 377 nm. There is one transition wavelength at 398 nm. There are four blue wavelengths at 418, 433, 448, and 467 nm. We will ignore the smallest and largest peaks in the fit above since they are outside the 320-480 nm modeling band.

The fit is a strong one, 5.47 ppm statistical (normalized least squares) error with an F-statistic
of 1.28 million. We follow longstanding statistical modeling methodology where the highest value of the
F-statistic is assumed to correspond with the maximum likelihood of the correct count of component peaks.
From this fit, we would expect an effective predictive model to incorporate WLs close to these fitted
a_{1} center values.

Please note that this type of peak fitting is not used in the predictive modeling. This type of preliminary analysis merely gives us a sense for the wavelengths we might expect to see in linear predictive models, as well as a reasonable wavelength spacing for fast full-permutation fitting.

Click **OK** to close the Peak
Fit Review and return to the main window.

S/N of Spectra

To estimate the S/N of the UV-VIS spectra, we cannot use the average spectrum from the 156 samples. We must import individual spectra. We will import these directly from the spectral data matrix file that we will use for the modeling.

From the **File** menu's, **Import Data Set(s)**, select the **Import Data Sets from Model
Data Matrix...** item. Select the** ModelingUVVIS_SpectralData.csv** file from the program's
installed default data directory (\PeakLab\Data).

From the list, select the first twelve samples in the file, as above, and click **OK**.

We will only import the data in the modeling band. Select **320** for the starting X-column,
and **480** for the ending X-column. Click **OK**. Enter any name you wish, such as First12Spectra,
for the file which will contain these twelve spectra. Click **OK** at the write notification. You will
now see the first 12 spectra in the data matrix.

From the **DSP** menu, select the **Fourier S/N Estimation...** option.

If necessary, set the data taper window to **None** and be sure the **Best Exact N**
Fourier algorithm is selected.

You will see the Fourier spectra for the 12 data sets. Click the **Numeric** Button.

S/N can be expressed as a ppm noise in the overall data, as decibel (dB) values, or by significant digits. We also measure the S/N in both the time and Fourier domains.

Signal Noise Analysis

Algorithm: Best Exact N

Window: None

Size Size Time Time Time Fourier Fourier Data

Data FFT SNRdb Digits ppm SNRdb Digits ID

161 161 41.3216935 2.06608467 8588.46056 58.5532309 2.92766154 s219

161 161 51.6809324 2.58404662 2605.87379 61.3382564 3.06691282 s220

161 161 41.9321351 2.09660675 8005.58819 65.2179760 3.26089880 s222

161 161 43.7644836 2.18822418 6482.99698 62.7763699 3.13881849 s223

161 161 45.0613650 2.25306825 5583.82435 63.0190151 3.15095076 s224

161 161 45.4617571 2.27308786 5332.27015 64.6110313 3.23055156 s225

161 161 45.5542386 2.27771193 5275.79691 65.2302258 3.26151129 s226

161 161 47.2975383 2.36487691 4316.41394 63.2402817 3.16201409 s227

161 161 47.4876148 2.37438074 4222.98228 61.9017618 3.09508809 s228

161 161 41.6075109 2.08037555 8310.44834 63.7479261 3.18739630 s229

161 161 55.0072352 2.75036176 1776.79875 65.4612237 3.27306119 s230

161 161 52.1091906 2.60545953 2480.50707 64.0886330 3.20443165 s231

Avg 46.5238079 2.32619040 5248.49678 63.2654943 3.16327472 All Data

For this data the time domain estimate of S/N is 2.33 significant digits. This is the point where half the precision is lost. The Fourier domain contributes the noise floor, the significant digits where all precision is lost, and this is 3.16 digits. Beyond 3.16 significant digits, this data is only noise. There is nothing exceptional about the S/N of the UV-VIS reconstructed spectra to justify a higher sampling density.

**Close the Signal Noise Analysis numeric window** and click **OK** to close the Fourier
S/N procedure. From the **File** menu select **Close** to clear the current data.

Visualizing the Design Data Matrix

From the **Model** menu's **Import Data Matrix** item, select the **Import Spectroscopic
Data Matrix...** option. We select the same data matrix file as above, **s219-S372_UVVispt75-7m4(99)_Avg.csv**
in the PeakLab/Model folder. Answer **No** when asked if you wish to proceed to fit this data.

From the **Model** menu's **Graph Data Matrix **item, select the **Graph All Data Sets
in Spectroscopic Data Matrix...** option. Again enter **320** for the starting X-column and **480**
for the ending X-column and click **OK**.

Check the **Use Mouse** button and highlight the individual data sets by passing the mouse
over a given spectrum. In the above plot, a sample with a high UV signal and proportionately lower visible
signal is highlighted. Click the **Unit Area** Normalization and place the mouse on the spectra that
is furthest removed from all of the others.

Here we see this same sample. The **Unit Area** format is useful for spotting significant
differences in the spectral absorbances. Click **OK** to close the GLM matrix plot.

Direct Spectral Fit of Total Curcuminoids

Once the Import
Data Matrix | Import Spectroscopic Data Matrix... item in the **Model** menu is used to load a
spectroscopic modeling data matrix into the program, the Fit
GLM Models option is available for fitting full-permutation
GLM models, and for fitting smart
stepwise forward models and sparse
PLS models, both of which use the full permutation models as their starting point.

The PeakLab full permutation GLM models are an alternative to the PLS (partial least squares) and PCR (principal component regression) procedures.

From the **Model** menu, select the the **Fit Full Permutation GLM Models...**
option. This is the entry point for all predictive
modeling.

There are seven sections that govern the modeling.

X Predictors

We want to fit all of the X-predictors through the full eight allowed in the GLM full permutation
algorithm. Click **All** in the X Predictors section to select all of the predictor counts if needed.

X Predictor Filters

We want to fit all of the WLs in the full permutation modeling, based on the X column spacing
we will subsequently specify. Click **All** so that **100%** (no filtering) is shown for all predictor
counts.

Stepwise Fits

We will use the defaults. Be sure **Add stepwise fits** is **checked**, the **Maximum
count of x predictors** is set to **12**, the **Stepwise t to enter** is **2.50**, and the
**Stepwise t to remove** is **2.00**.

For Each X Predictor Count

We will also use the defaults. Be sure **Evaluate n best models** is set to **1000**,
**Keep n best predictions** is **100**, and the **sort by** is set to **predicted r2**. These
numbers specify retaining the best 1000 models (by least squares error in the model fits) at each predictor
count, evaluating them for the best r^{2} of prediction, and keeping 100 models for the GLM
Review.

Data

Select the **pctTotal** column in the data file for the **Selected Y column**, **320**
as the **Starting X column**, and **480** as the **Ending X column**. First, let's enter **1**
for the **X column spacing**. You should see, in red, an estimate of 9.9 trillion fits. On a basic
four core machine, this will require about 67 days. Change the **X column spacing** to **2**, reducing
the total WLs in the full permutation modeling from 161 to 81, The estimate should now be shown in magenta,
and there are 36 billion fits. Again, on a basic machine, this will require about 5.8 hours. We know,
from the peak fitting that the Gaussian width of the spectral absorbances is approximately 12 nm, and
that a spacing much below 6 nm will not contribute to the prediction efficacy of the direct spectral models.
Change the **X column spacing** to **5**. This will use 33 total wavelengths in the full permutation
modeling, and fit 19.5 million fits. For this basic four core machine, about 28 seconds will be required.
We will use this 5 nm spacing.

Prediction Errors

Be sure the Leave One Out is selected.

Custom Fit Options

**Check** the **Refit to X column spacing = 1. **This will perform an additional fitting
step to resolve the predictors to a 1 nm resolution.

Click **OK** to start the fitting.

Because this is a rather basic modeling problem, even the 1-predictor fits have a very high
predicted r^{2}. Click **OK** when the fitting is done and choose a name for the saving the
full fit information to disk. Using this file you can recall this fit in a single import step at any
point further along. You will now see the GLM
Review.

Reviewing the Direct Spectral Fits

The GLM Review assembles the full-permutation GLM models, the smart stepwise forward models, and the sparse PLS models for an intelligent selection of a final predictive model.

The two most important windows will be the Model List window and the main GLM Review window containing the model plot and significance map. The Model List window will initially look similar to the following. The models may be in a different order if you have changed the defaults.

The default Model
List settings will consist of displaying only the predicted r^{2} and predicted average error.
In the **List** menu of the Model
List window, select **Median Error of Prediction**. In the **Filter** menu, select **Single
Predictor Count** and then select **Include 8 Predictor Models**. You see only the 8-predictor retained
models. From this **Filter** menu, select **Include 7 Predictor Models**. You now see only the
7-predictor retained models. In the **Sort** menu, choose **By X Predictor Compliance**. This sorts
the retained 7-predictor fits currently shown in the list by compliance with the weighted significance
map of the all of the retained 7-predictor fits.

The list is now ordered by the models that best match the 7-predictor wavelength significances, weighted by prediction error, across all retained 7-predictor models. You can think of this as a kind of maximum likelihood of wavelengths across the most effective models.

The model plot shows this first selection in the Model List, the model containing the wavelengths most compliant in a prediction performance weighted average of significance for all 7-parameter models retained in the fitting procedure. The lower model plot has the reference Y values of the modeling on the X-axis, and the predicted values on the Y-axis. By default, the lower plot will also show a 95% prediction interval for a linear fit of that reference vs. predicted plot. The upper graph in the model plot is a residuals plot (difference between predicted and reference). The residuals are defined so that the directions match in the two plots. The background colors highlight the five quintiles in the Y-data of the modeling.

Note that the model with the best wavelength significance compliance is not normally the best performing model on any other prediction metric.

Predictor Count Optimization

Click the Fit
Performance button in the main GLM
Review window. Select **Best Only**.

Multivariate parameter count optimization is an extensively studied science, and many different
prediction metrics, or combination of metrics, are often used. The panels consist of r^{2} of
prediction, standard error of prediction, base-10 log of F-statistic inverse, the corrected Akaike Information
Criterion (AICc), the Bayesian Information Criterion (BIC), the Minimum Description Length (MDL), the
median prediction error, and the average prediction error. The first six plots contain a reference (white
curve) which contains the same metric for the model estimation or calibration. This metric from the design
fit will almost universally be better than the prediction value. The median plot in the seventh panel
has as its reference the average error (orange). The median error will almost always be lower than the
average error. The average prediction error plot in the eighth panel has average prediction error references
for the 5 quintiles of the Y-variable. In general, the average errors will tend to increase with the magnitude
of the Y values.

Click the **Toggle Display of Reference Data** button in the graph
toolbar of the Fit Performance graph.

The choice of optimization is both art and science, and experience will often dictate your
preference. We usually defer to an information criterion, but we like to see where the predictions level
off for errors. In this fit, there are only small differences between a 1-predictor fit and higher predictors,
and as such these optimization criteria are not consistent. The best prediction by r^{2}, SE,
median, and average error is with 9-predictor smart stepwise models. The F-statistic and MDL optimize
to just 1 predictor, the AICC to 7, and the BIC to 4. In looking closely at the eight optimizations, we
would tend to go with the 4 predictor, following the BIC. At 4 predictors, you have much of the r^{2},
SE, median and average errors, the AICc flattens out at 4, and the MDL has a second minima at 4.

Click **OK** to close the Fit Performance Plot. In the Model List window's **Filter**
menu, be sure the **Single Predictor Count** option is checked and select the **Include 4 Predictor
Models** option. If the **Sort** is still set to **By X Predictor Compliance**, this is the model
you should see:

If we look at the first 4-predictor model, we see two UV and two VIS wavelengths. The 375 and 445 WLs generate the positive component of the prediction and the 365 and 411 WLs offset. We did see 377 nm and 448 nm in the peak fit we first made to identify target signal frequencies.

Experience and Judgment

In the initial visualization of the data matrix we saw one spectrum that was well removed from the others, nothing between. Is there another variant in nature not as yet included in our model data that would produce an equally strong deviation from the bulk of the spectra, but which would be as different or distinct? If so, there is a way to estimate that kind of error.

Click **OK** in the main GLM Review to exit the procedure and once again from the **Model**
menu, select the the **Fit GLM Models** option. This time check the **Random sampling, minimum prediction
performance** option for Prediction Errors. Leave the defaults in place (2/3 data to model, 1/3 data
to blind prediction, 10 random sets). Click **OK** to refit the data, click **OK** when the fit
has concluded, and enter a different file name for saving this fit to disk.

As a first step, click the **Fit Performance** button.

The Leave One Out prediction estimates allow for very little loss of bracketing in the modeling
problem. Model bracketing issues are emphasized when using a worst-case error in a random sampling procedure.
This assigns the highest error as opposed to an average when multiple blind predictions exist for a given
sample. Note how different the optimizations look. The predicted r^{2} optimizes to 1 (9), the
SE to 1 (9), the F-statistic to 1 (1), the AICc to 1 (7), the BIC to 1 (4), the MDL to 1 (1), the median
error to 2 (9), and the average error to 2 (9). The Leave One Out predictor count optimizations are in
parentheses. If you feel you are susceptible to an incomplete bracketing of the modeling problem in your
design data, this kind of analysis may be helpful. Here it suggests that we may be better off with a single
predictor model if we wish our prediction to defend against this kind of unknown.

Click **OK** in the Fit Performance window to close it. In the **Filter** menu of the
Model
List window, select **Include 1 Predictor Models**.

The best single predictor model, by r^{2} of prediction, is the one with the 445 nm
wavelength, and the stats and errors are only slightly weaker (5858 ppm least-squares error vs 5489 for
the 4-WL model).

Prediction

Click the Prediction
button in the left panel of the **GLM Review** window. Select the file **ModelingUVVIS_PredictionData.csv**
from the program's installed default data directory (\PeakLab\Data). . This file contains
602 spectra which were nowhere used in the modeling.

If the file containing the spectra for prediction doesn't have a matching Y-value name,
you must select it if these spectra have known reference values. Select **TotalC** for the Y-column
and click **OK**. A full Prediction procedure window opens and the model graph's upper plot is modified
to show this prediction instead of the residuals from the leave one out prediction from the fitting.

In this case, we see only a few outliers in the upper plot's prediction. The r^{2}
of this different prediction is weaker, 0.9914 vs. 0.9941 for the leave one out prediction error estimated
during the modeling.

**Rank****
****733****
****GLM[1](445)**

**Prediction
Statistics - from Original Fit - Leave One Out**

**Data:
s219-S372_UVVispt75-7m4(99)_Avg.csv - 156 Observations**

**Fits:
s219-s372_uvvispt75-7m4(99)_avg_nofilter_totalc_1nm.bin**

**Y
Column in Data Matrix:**** ****6 (pctTotal)**

** ****r**^{2}**
****SE****
****F-stat****
****AICc****
****BIC****
****MDL****
****DOF**

**0.9941419 ****
****0.1545550 ****
****2.613e+04 ****
****-135.7143 ****
****-126.7226 ****
****-98.23023 ****
**** 154
**

**
****Median Err****
****Avg Err****
****Avg Err
Q1****
****Avg Err
Q2****
****Avg Err
Q3****
****Avg Err
Q4****
****Avg Err
Q5**

**0.0966062 ****
****0.1193078 ****
****0.0678179 ****
****0.0947165 ****
****0.0969568 ****
****0.1377332 ****
****0.1968144
**

**
****Quintile****
****Lower****
****Upper**

**
First ****
****1.4797533 ****
****3.0649067
**

**
Second ****
****3.0649067 ****
****3.3511367
**

**
Third ****
****3.3511367 ****
****3.6682000
**

**
Fourth ****
****3.6682000 ****
****4.1988067
**

**
Fifth ****
****4.1988067 ****
****13.198050
**

**Prediction
Statistics - from Prediction Data**

**s001-218_SpectraWtAdj2_T123_m17(PI99).csv
- 602 Observations**

** ****r**^{2}**
****SE****
****F-stat****
****AICc****
****BIC****
****MDL****
****DOF**

**0.9913499 ****
****0.2212434 ****
****6.876e+04 ****
****-103.7852 ****
****-90.62456 ****
****-60.60203 ****
**** 600
**

**
****Median Err****
****Avg Err****
****Avg Err
Q1****
****Avg Err
Q2****
****Avg Err
Q3****
****Avg Err
Q4****
****Avg Err
Q5**

**0.0960807 ****
****0.1382589 ****
****0.0906968 ****
****0.1154860 ****
****0.1391623 ****
****0.1389409 ****
****0.2568171
**

**Prediction
Statistics - 10 Outliers Removed**

**s001-218_SpectraWtAdj2_T123_m17(PI99).csv
- 592 Observations**

** ****r**^{2}**
****SE****
****F-stat****
****AICc****
****BIC****
****MDL****
****DOF**

**0.9946643 ****
****0.1601656 ****
****1.1e+05 ****
****-484.4910 ****
****-471.3813 ****
****-439.9328 ****
**** 590
**

**
****Median Err****
****Avg Err****
****Avg Err
Q1****
****Avg Err
Q2****
****Avg Err
Q3****
****Avg Err
Q4****
****Avg Err
Q5**

**0.0935228 ****
****0.1235726 ****
****0.0906968 ****
****0.1154860 ****
****0.1391623 ****
****0.1389409 ****
****0.1952699
**

The prediction window will help you see if a prediction stands up to blind data. For this one-predictor model, we see an average leave one out prediction error of 0.1193. The predictions from the new data have an average error of 0.1382 with the red points included. By default, the prediction procedure trims out predictions outside of 3 SE of the reference fit. In this case, 10 possible outliers were removed and the average error dropped to 0.1235, respectably close to the leave one out error from the modeling. In general, you should expect some measure of incomplete bracketing in the model data relative to fully blind prediction data. We see that here, although this is small - what prediction scientists want to see.

In the Prediction window's **Outliers** menu, choose **Outside of 2 SE** in the
**Multiples of SE**.

**Prediction
Statistics - 36 Outliers Removed**

**s001-218_SpectraWtAdj2_T123_m17(PI99).csv
- 566 Observations**

** ****r**^{2}**
****SE****
****F-stat****
****AICc****
****BIC****
****MDL****
****DOF**

**0.9952925 ****
****0.1376405 ****
****1.192e+05 ****
****-634.6026 ****
****-621.6296 ****
****-589.8934 ****
**** 564
**

**
****Median Err****
****Avg Err****
****Avg Err
Q1****
****Avg Err
Q2****
****Avg Err
Q3****
****Avg Err
Q4****
****Avg Err
Q5**

**0.0885558 ****
****0.1103185 ****
****0.0869513 ****
****0.1154860 ****
****0.1285117 ****
****0.1248069 ****
****0.1582827
**

With this stricter removal of potential outliers, the count of predictions has dropped by another 26 spectra and the average prediction error from these 566 remaining spectra is now .1103.

**Scroll through the Model List** while looking at the Prediction
window and in particular, note the median error with no outliers removed.

Click **OK** to close the GLM Review.

Comparison with PLS and PCR

Partial Least Squares (PLS) and Principal Component Regression (PCR) are often used in chemometric modeling. How did our full permutation direct spectral modeling in this example compare?

The Unscrambler software application automatically selects the 1-factor PLS when fitting
320-480 at a 1 nm spacing. This 1-factor PLS model, has a 0.992103 r^{2} prediction (7893 ppm
error), an average absolute prediction error of 0.135, and a median absolute prediction error of .115.
The optimum PLS by best Leave One Out r^{2} of prediction is a four factor model:

This PLS model, 4-factor, 161 wavelengths, 162 coefficients, has an r^{2} prediction
of 0.993917 (6083 ppm error), an average absolute prediction error of 0.115, and a median absolute prediction
error of .929.

The optimum PCR by a Leave One Out r^{2} of prediction is a five principal component
model. This PCR model, 5-principal components, 161 wavelengths, 162 coefficients, has an r^{2}
prediction of 0.993896 (6104 ppm error), almost identical to the best performing PLS model by the Leave
One Out prediction method, an average absolute prediction error of 0.117 and a median absolute prediction
error of .934.

In this specific modeling problem, the direct spectral fits, even the simple 1-predictor model, outperformed the best PLS and PCR models.

A More Challenging Modeling Problem

In the Modeling Spectra - Part I - UV-VIS Data white paper, we discuss the challenges associated with fitting the bis-demthoxycurcumin (BDMC) in the UV-VIS.

Click **OK** in the main GLM
Review to exit the procedure and once again from the **Model** menu, select the the **Fit GLM
Models** option. Once again check the **Leave One Out** Prediction Errors and change the Selected
Y Column to **pctBDMC**. Click **OK** to refit the data.

You will note a very different look to the progress bar, and there is an indication
the average of the best 10 r^{2} of prediction models occurs with 4 predictors.

Click **OK** when the fit has concluded, and enter a different file name for saving
this fit to disk.

As a first step, click the **Fit Performance** button.

If you look closely at the BIC in the fifth panel and the MDL in the sixth you will see that both show an optimum at 4 predictors.

Click **OK** to close the Fit Performance Plot, In the Model List window's **Filter**
menu, be sure the **Single Predictor Count** option is checked and select the **Include 4 Predictor
Models** option. In the **Sort** menu, select **By X Predictor Compliance.**

**
**Click the** Select ****Color
Scheme**** and ****Customize
Colors**** **button in the graph's
toolbar**. **Select the **Cyan** color scheme and click **OK**. This is the model you should
see:

The errors are much smaller in magnitude only because there is far less BDMC in most
turmeric samples. The r^{2} of prediction is much weaker. The red points in the model plot are
those where the absolute residual is greater than 3 standard errors of this linear fit between the measured
and predicted. There is one spectrum amongst the 156 that is outside 3 SE. The yellow points in the model
plot are those outside 2 SE. There are four such spectra.

We will first get a sense for what a fit might look like if these red and yellow spectra points in the lower model plot were treated as outliers, where the measurements are treated as suspect, and these were removed from the data and a refit was made.

Click **Prediction** in the left panel of the GLM Review window and select the same
**s219-S372_UVVispt75-7m4(99)_Avg.csv** file we used for the modeling. In doing this, we have only
used the Prediction feature to replicate the model plot. Right click the model plot and select the **Mark
Inactive Outside Prediction Interval** option.

The last line in the titles and the upper plot reflect an improvement in the linear
fit from 0.9667 to 0.9757 by the removal of points outside the default 95% prediction
interval (this interval can be changed in the graph
toolbar). Note that eight spectra were removed by this prediction interval criterion. If we felt that
these truly were outliers in the data, we could use the **Data** option to flag outliers using any
number of statistical criteria, and we could then utilize its **File** menu **Save Data w/Flagged
Outliers Removed** option to save a revised data matrix without these spectra present. The revised data
matrix would then be fitted. In this instance, we believe the wider error band is more likely due to the
difficulty in modeling this component. Even if certain BDMC estimates are outliers from a measurement
error, their exclusion will not get us anywhere near the accuracy seen in the total curcuminoids model.

The model we are looking at has its principal positive contribution at 414 nm. In the Modeling Spectra - Part I - UV-VIS Data white paper Fig. 5, we demonstrate that the BDMC central peak is at a 417 nm blue absorbance and its chromatographic coeluting companion is at about 412 nm. The 414 central WL makes sense, as do the offsets between 435-450 nm that address the zone where the three components are most strongly separated.

In this model, no UV WLs at all were used.

Click the **Significance** button in the left panel of the main GLM
Review window, and select **Contour** if it is not already selected. **Zoom the contour plot to
show only the 3-8 predictor models**.

In the contour map, the UV wavelengths do not have a strong presence until 5 predictors are used in the models. Note that the five and higher predictor models have a very different significance map with respect to the four predictor models.

Click **OK** in the Significance window to close it. In the Model List window's **Filter**
menu, be sure the **Single Predictor Count** option is checked and select the **Include 5 Predictor
Models** option. In the **Sort** menu, select **By X Predictor Compliance**.

The compliance brought 5-predictor models with one or more UV wavelengths to the top
of the list. This first model in the list has an r^{2} of prediction that is weaker, even though
the median and average errors are less. If you look closely at the quintile 5 (Q5) average prediction
error, you will see that it is appreciably higher in this model. This fifth quintile has the lightest
color background in the plot and consists of the samples with the highest BDMC content.

We will filter the list so that only models with 400-450 nm visible wavelengths are
shown. In the Model
List **Filter** menu, select **No Filter** to clear all filtering. In the **Sort** menu,
choose **By Predicted rČ**. From this same **Filter** menu, select **Custom X Band Filtering**.

Enter the filter as above and click **OK**. Select the 2nd model in the list, the
**Rank****
****2****
****Step[5,4](414,420,444,447,449)**
model. This is a stepwise model that incorporates a great deal of tail information for the visible spectral
feature.

Did we benefit from this additional predictor versus the 4-predictor model? The r^{2}
of prediction is weaker, the F-stat is appreciably less, and there was no significant improvement in any
of the errors.

In the Model
List **Filter** menu, select **No Filter** to clear the custom filtering and then select the
**Include 4-predictor Models**. Our four-predictor model is again shown.

Exporting the Model

It is a simple matter to export the selected model.

C++

Click **Numeric**, and in the Numeric Summary window's **Options** menu, select
the **C++ Code - Input Specific X-Predictors**. You will see the following at the bottom of the Numeric
Summary:

**C++
Language Code - argument is specific spectra**

double glm01(double *spec)

{

// spec[0] X Predictor 1 (414, index=94)

// spec[1] X Predictor 2 (439, index=119)

// spec[2] X Predictor 3 (444, index=124)

// spec[3] X Predictor 4 (449, index=129)

double p[5]= {

0.0221445801426611, 0.0202444570210056, -0.01915233028586, 0.0189152012010083, -0.021939480431503,

};

int nx = 4;

double estimate = p[0];

for(int i=1; i<=nx; i++)

estimate += p[i]*spec[i-1];

return estimate;

}

The model is as simple as storing the constant and performing a dot product with four coefficients and four spectral values.

Visual BASIC

In the Numeric Summary window's **Options** menu, select the **VB Code - Input Specific
X-Predictors**. You will now see the following at the bottom of the Numeric Summary:

**VBA
- Excel - specific spectra across columns in a single row**

Function GLMEval(Spectra As Range) As Double

' Spectra in Spectra.Cells(1,1) X Predictor 1 (414, index=95)

' Spectra in Spectra.Cells(1,2) X Predictor 2 (439, index=120)

' Spectra in Spectra.Cells(1,3) X Predictor 3 (444, index=125)

' Spectra in Spectra.Cells(1,4) X Predictor 4 (449, index=130)

Dim N As Integer

Dim i As Integer

Dim eval As Double

Dim parms(21) As Double

parms(1) = 0.0221445801426611

parms(2) = 0.0202444570210056

parms(3) = -0.01915233028586

parms(4) = 0.0189152012010083

parms(5) = -0.021939480431503

eval = parms(1)

N = 4

For i = 1 To N

eval = eval + Spectra.Cells(1, i) * parms(i + 1)

Next i

GLMEval = eval

End Function

This code is structured to work within the VBA Excel development environment.

Excel In-Cell Formulas

Click the **Data** button in the left panel of the main GLM
Review window and from the Data window's **File** menu, select **Save Summary And X Data As...**
option. Save the CSV file, and if you wish, open this CSV file in Excel.

In this export, the formula in cell L2 is a string. Convert it to a formula with a leading = symbol and copy it to the additional rows in the L column where there is spectral data in the G-J columns

Column L now contains the model evaluations matching the results you see in the Data summary.

Evaluating the Prediction in Various Applications

The Prediction should still be shown in the upper plot in the GLM
Review graph. Right click the graph and choose **Refresh - Clear Inactive**. The upper Prediction
plot and the lower Model plot should now match. All of the following options will allow you to validate
the models in other application software.

Maple

In the Prediction
window's **Edit** menu select **Copy Prediction Evaluations for Maple**.

Parms:=[-0.049541042518362,0.011068653042172]:

WL:=[445]:

mSpec:=[[1,382.708145121588],

[1,398.806969759026],

[1,370.069754244102],

...

[1,275.992993955259],

[1,240.21593729655]]:

pred :=array(1..602):

for i from 1 to 602 do

p := 0: for j from 1 to 2 do p := p + mSpec[i,j]*Parms[j]: od:

pred[i] := p:

od:

print(pred);

Paste this code into Maple to evaluate all of the spectra imported in the prediction.

Mathematica

In the Prediction
window's **Edit** menu select **Copy Prediction Evaluations for Mathematica**.

Parms:={-0.049541042518362,0.011068653042172}

WL:={445}

mSpec:={{1,382.708145121588},

{1,398.806969759026},

{1,370.069754244102},

..

{1,275.992993955259},

{1,240.21593729655}}

Table[mSpec[[i ;; i, 1 ;; 2]].Parms, {i, 1, 602}]

Paste this code into Mathematica to evaluate all of the spectra imported in the prediction.

Excel

In the Prediction
window's **Edit** menu select **Copy Prediction Evaluations for Excel**. You can also use the Prediction
window's **File** menu option Save Summary and X-Predictor Data As... to write this information to
a CSV file. It is similar to the the Data window's **File** menu **Save Summary And X Data As...**
option except that the imported prediction data is written to the file.

C++

#include "stdafx.h"

double Parms[2]={-0.049541042518362,0.011068653042172};

double WL[1]={445};

double mSpec[602][1]={

{382.708145121588},

{398.806969759026},

...

{275.992993955259},

{240.21593729655}};

double glmEval(int nx, double *parms, double *spec)

{

double estimate = parms[0];

for(int i=1; i<=nx; i++)

estimate += parms[i]*spec[i-1];

return estimate;

}

double *glmEvalAll()

{

int nobs = 602;

int nx = 1;

double *estimate = (double *)calloc(nobs, sizeof(double));

for(int i=0; i<nobs; i++)

estimate[i] = glmEval(nx, Parms, mSpec[i]);

return estimate;

}

The glmEvalAll() function will evaluate all of the predicted data, glmEval(_) evaluates a single prediction.

Click **OK** to close the GLM Review.

Modeling Spectra Tutorial II - Powder FTNIR

In the second modeling tutorial, we will address spectral data that is far harder to model. We address both obfuscating absorbances, as well as latent variables of powder particle size (distribution density) and powder moisture content.