PFChrom v5 Documentation

The Quest for a Universal Chromatographic Fit

Over the course of the five years PFChrom was in development, much came to light that helped us better grasp the concept of an ideal chromatographic nonlinear fit. In simplest terms, an ideal chromatographic nonlinear fit was seen to require:

(1) A model capable of accounting for nearly all of the variance in the data while retaining statistical significance in all fitted parameters, and where each of the components of that model is in line with prevailing theory. The model would need to manage all types of chromatography, LC and GC, including gradient HPLC and preparative shapes.

(2) A fitting procedure that would manage such a complex model swiftly and efficiently, and which resolved the global solution in the iterative optimization in a single step requiring no user intervention.

(3) A certainty that we 'got it right', that everything that could be estimated by a statistical modeling procedure was being fitted. As a test, we wanted to be able to fit higher concentrations with much more strongly fronted and tailed shapes and see equally effective fits. We also wanted to see gradient HPLC peaks and higher-overload preparative shapes fitted by the same models, ideally to the same measure of efficacy.

What Would an Ideal Fit Look Like with Respect to Goodness of Fit?

We will use one of the samples from the Chromatographic Experiments tutorial, one that has a good S/N and no impact from an additive:

We can generate a Fourier spectrum of the data that isolates the power level where the noise in the data begins, and thus how many significant figures exist before all determinacy is lost:

The noise floor begins at about -120 dB and finishes at the highest frequencies at about -130 dB. This corresponds to approximately six significant figures of information. The sixth significant figure in the data is likely to be the equivalent of random noise.

If we zoom in, we see the noise starting at about -80 dB both in the decay and in the first oscillation. This corresponds to approximately four significant figures of information. This means the fourth significant figure should be attainable to full accuracy. For the count of points in this data set, this would be an F-statistic of 10^8, or 100 million. Using this loosely as a benchmark, we can say a 'perfect' fit would have every parameter significant and a goodness of fit F-statistic of at least 100 million. Since the noise only begins at the -80 dB threshold, we could easily enough assert a higher F-statistic, perhaps closer to 10^9, one billion, as the target for a 'perfect' fit to this real-world data.
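The dB-to-significant-figures reasoning above can be sketched numerically. The following is a rough illustration, not PFChrom's actual spectral procedure: it estimates a noise floor from the high-frequency portion of the Fourier power spectrum of a synthetic peak, then converts it to significant figures at 20 dB per figure. The Hann window and the use of the top quarter of the spectrum as the noise region are our own assumptions.

```python
import numpy as np

def noise_floor_sig_figs(y):
    """Estimate usable significant figures in a uniformly sampled signal
    from its Fourier power spectrum (illustrative sketch only)."""
    n = len(y)
    spectrum = np.fft.rfft(y * np.hanning(n))  # window to limit leakage
    mag = np.abs(spectrum)
    power_db = 20 * np.log10(mag / mag.max() + 1e-300)
    # Take the noise floor as the median power in the top quarter of the
    # frequency range, where the peak's energy has decayed away.
    floor_db = np.median(power_db[3 * len(power_db) // 4:])
    # 20 dB of dynamic range corresponds to one significant figure.
    return floor_db, -floor_db / 20.0

# Synthetic example: a Gaussian peak plus white noise at 1e-4 of full scale
t = np.linspace(0, 10, 2048)
rng = np.random.default_rng(0)
y = np.exp(-0.5 * ((t - 5) / 0.2) ** 2) + 1e-4 * rng.standard_normal(t.size)
floor_db, sig_figs = noise_floor_sig_figs(y)
print(floor_db, sig_figs)  # roughly -90 dB, i.e. 4-5 significant figures
```

For noise near -80 dB, as in the zoomed spectrum above, the same conversion yields approximately four significant figures.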

The GenHVL<ge> Fit

If we fit the data to an HVL once-generalized chromatographic peak model with the <ge> IRF, the sum of half-Gaussian and exponential distortions, we see the following:

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value      ppm uVar
0.99999854       0.99999853     0.00698952    1.3617e+08   1.46243271

Peak  Type        a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}       a_{6}       a_{7}
1     GenHVL<ge>  3.81823449  2.87894324  0.03850284  -0.0058968  0.01509544  0.00507165  0.04195172  0.65914916

Parameter Statistics

Peak 1 GenHVL<ge>

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

Area 3.81823449 0.00017319 22046.9454 3.81778778 3.81868121 0.00000

Center 2.87894324 0.00016563 17382.1215 2.87851602 2.87937045 0.00000

Width 0.03850284 2.7689e-05 1390.56223 0.03843142 0.03857426 0.00000

Distortn -0.0058968 2.4275e-06 -2429.1650 -0.0059031 -0.0058906 0.00000

Z-Asym 0.01509544 5.5308e-05 272.932932 0.01495278 0.01523810 0.00000

g-sd 0.00507165 0.00015753 32.1939309 0.00466531 0.00547798 0.00000

e-tau 0.04195172 3.4493e-05 1216.23128 0.04186275 0.04204069 0.00000

g-frac 0.65914916 0.00103868 634.604392 0.65647003 0.66182828 0.00000

For a 99% statistical significance (99% confidence the parameter is non-zero), with this count of data, the magnitude of the t-value, the ratio of the parameter estimate to its standard error, should be 2.5 or higher. Only the half-Gaussian narrow width component, which we know must approximate multiple effects, has anything other than an exceptional significance. Even the weakest parameter is well removed from this threshold of statistical insignificance. For the moment, we will postpone discussion of the assumptions associated with the least-squares confidence statistics.
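The t-value test described above is simple enough to verify by hand. The sketch below recomputes the t-value for the g-sd row of the table from its reported value and standard error; the 99% confidence band uses a large-sample normal approximation (z = 2.576), which is our assumption, not necessarily PFChrom's exact critical value.

```python
def t_significance(value, std_error, z99=2.576, threshold=2.5):
    """t is the ratio of the parameter estimate to its standard error;
    for large data counts, |t| >= ~2.5 indicates 99% confidence the
    parameter is non-zero. CI uses a normal approximation (assumed)."""
    t = value / std_error
    ci = (value - z99 * std_error, value + z99 * std_error)
    return t, abs(t) >= threshold, ci

# g-sd row from the GenHVL<ge> table: value 0.00507165, std error 0.00015753
t, significant, (lo, hi) = t_significance(0.00507165, 0.00015753)
print(round(t, 2), significant)  # 32.19 True, matching the table's t-value
```

Even this weakest parameter sits more than an order of magnitude above the 2.5 threshold.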

The GenHVL<pe> Fit

If we fit the GenHVL<pe> model, where the IRF's narrow component approximately models interphase mass-transfer resistances with an order 1.5 kinetic decay instead of the half-Gaussian intended to model axial dispersion, we see a slight improvement, an F-statistic of 145 million and 1.38 ppm error, but the a_{5} (k-tau) time constant parameter has slightly less significance. A <pe> IRF is harder to fit than the <ge> IRF since the long tail of the 1.5 power narrow width component is more correlated with the higher width exponential. Note also that a_{7} (k-frac), the area fraction of this narrow IRF component, has a wider confidence band.

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value      ppm uVar
0.99999862       0.99999862     0.00678137    1.4466e+08   1.37662531

Peak  Type        a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}       a_{6}       a_{7}
1     GenHVL<pe>  3.82107931  2.88269228  0.03873919  -0.0058753  0.01521039  0.00022985  0.04187664  0.55635860

Parameter Statistics

Peak 1 GenHVL<pe>

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

Area 3.82107931 0.00022505 16978.9483 3.82049883 3.82165979 0.00000

Center 2.88269228 5.3479e-05 53903.2631 2.88255433 2.88283022 0.00000

Width 0.03873919 1.5543e-05 2492.35269 0.03869910 0.03877928 0.00000

Distortn -0.0058753 1.8227e-06 -3223.4798 -0.0058800 -0.0058706 0.00000

Z-Asym 0.01521039 5.4004e-05 281.655247 0.01507109 0.01534968 0.00000

k-tau 0.00022985 7.9302e-06 28.9842808 0.00020940 0.00025031 0.00000

e-tau 0.04187664 3.3724e-05 1241.75128 0.04178966 0.04196363 0.00000

k-frac 0.55635860 0.00602294 92.3731858 0.54082325 0.57189395 0.00000

Limitations of Nonlinear Modeling

Just as we discussed in the topic Understanding PFChrom's Models with respect to spectroscopy, there will be factors which can be modeled and those which cannot. Just as the two different kinds of Lorentzian broadening can't be identified in Voigt model fits, a similar argument applies to this narrow width IRF component in chromatography.

We know there are multiple narrow-width IRF effects, but non-linear modeling can only statistically manage a single representation of this IRF component. We successfully model nearly all of the variance that can be modeled, given the S/N of the system, using a single narrow width component. There is too little variance remaining to add a second narrow width component to the IRF and quantify both.

We cannot simultaneously model the amount and width of a postulated half-Gaussian axial dispersion, the amount and width of postulated interphase mass-transfer resistances through porous media, and the amount and time constant of a higher-width first-order system delay. In nonlinear peak fitting, we can only model the amount and width of one assumed form of narrow-width component, as well as the amount and time constant of a higher-width exponential.

Benefits of Nonlinear Modeling

For a spectroscopic peak, both natural and collision broadening have the same theoretical shape, the Lorentzian. The instrumental distortion typically has the same theoretical shape as Doppler broadening, the Gaussian. The convolution actually has a closed-form solution (in the complex domain), a lovely simplification, but fitting that convolution, the Voigt model, cannot distinguish the two types of Lorentzian spectral broadening, nor can peak fitting separate the IRF and Doppler sources of Gaussian broadening.

For chromatographic peaks, apart from this issue of insufficient information to process more than one narrow-width IRF, there is far less ambiguity. Each of the parameters in a PFChrom once-generalized model describes a unique feature of a peak:

a_{0} - the area (zero moment); fitting is the only way to quantify accurate peak areas when peaks
are overlapped or there are small hidden peaks in the rise or decay of larger peaks

a_{1} - this will be the center of mass (first moment) of one of the deconvolutions with the impact
of the IRF removed. This will be the 'true' peak center, and depending on the model selected, this can
reflect or remove any concentration dependency. In a conventional integration, you see only the mode of
the peak, with the IRF's distortion altering the observed retention.

a_{2} - depending on model selected, this will be either a diffusion width (square root of the
second moment) or a kinetic time constant, a width independent of the IRF and multiple-site binding effects;
for a kinetic model, this will be the solute
desorption time constant.

a_{3} - the concentration-dependent chromatographic tailing and fronting, the shape unique to
chromatographic peaks that can only be realized by peak fitting; for the kinetic models, this will estimate
the equilibrium
constant for adsorption.

a_{4} - the zero-distortion density (ZDD) third moment asymmetry, which likely accounts for multiple-site
kinetics. It is this parameter that allows a generalized chromatographic model to fit the HVL theoretical
model, the NLC theoretical model, and all shapes between, as well as those of a greater asymmetry as is
routinely seen in chromatographic peaks. This a_{4} parameter is generally treated as a constant
across all peaks.

a_{5} - the SD width or time constant of the narrow width instrument response function (IRF) component;
its limited impact on peak shape requires this be specified as a half-Gaussian (for modeling axial dispersion),
an exponential (for fast first-order kinetic distortions), or a 1.5 fractional-order kinetic (to approximate
mass-transfer resistances with a second order step in an overall sequence). While this a_{5} parameter
is sensitive to run conditions and prep, the impact is usually small enough that this factor can be treated
as constant across all peaks.

a_{6} - the time constant of the wider IRF component; always a first-order exponential, and generally
very close to constant, independent of run conditions and prep, specific to the instrument flow path and
detection.

a_{7} - the area fraction of the narrow component of the IRF; also very close to independent of
process variables. This a_{7} parameter is also easily treated as constant across all peaks in
the data.
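As a compact summary of the parameter roles described above, here is a hypothetical container (our own illustration, not part of PFChrom's API) annotating the eight GenHVL<ge> parameters, populated with the fitted values from the GenHVL<ge> example:

```python
from dataclasses import dataclass

@dataclass
class GenHVLgeParams:
    """Hypothetical container for the eight GenHVL<ge> parameters."""
    a0_area: float        # zero moment: peak area
    a1_center: float      # first moment of the deconvolved peak: 'true' center
    a2_width: float       # diffusion width or kinetic (desorption) time constant
    a3_distortion: float  # concentration-dependent fronting/tailing operator
    a4_zdd_asym: float    # third-moment ZDD asymmetry (multiple-site kinetics)
    a5_irf_narrow: float  # SD or time constant of the narrow IRF component
    a6_irf_tau: float     # time constant of the wide exponential IRF component
    a7_irf_frac: float    # area fraction of the narrow IRF component

# Fitted values from the GenHVL<ge> example above
fit = GenHVLgeParams(3.81823449, 2.87894324, 0.03850284, -0.0058968,
                     0.01509544, 0.00507165, 0.04195172, 0.65914916)
print(fit.a3_distortion)  # negative a3: a fronted peak
```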

Do We Have It Right?

One could make a strong argument with respect to the orthogonality of the parameters in the once-generalized
chromatographic models. Parameters a_{0}, a_{1}, a_{2}, and a_{4} correspond
directly with the zero, first, second, and third moments of the zero distortion density (ZDD) upon which
a generalized model is built. The a_{3} parameter is the concentration operator that produces
the chromatographic tailing and fronting using this ZDD as its starting point. We cannot imagine a more
compact optimum with respect to this orthogonality of moment-mapped parameters. In the twice generalized
models, we do use one more parameter in the core model, but this corresponds directly with an adjustment
to the fourth moment of the ZDD.

In a peak fit, there are ways to have a reasonable certainty of a correct model and an optimum fit. Because it is a statistical regression procedure with confidence statistics and probabilities, there are statistical metrics that readily catch incorrect models and overfitting. In the fits above, the probabilities (the probability of the values actually being indistinguishable from zero) are all 0. Again we will defer discussing the assumptions which underlie these confidence statistics. We will instead give several examples of what happens when fitting an incorrect or overspecified IRF or an incorrect core model.

Examples Where Parameter Significance Fails

Convolving Rather than Summing the Two Components of the IRF - An Incorrect IRF

We will start by fitting the same generalized core model, but the IRF will consist of the half-Gaussian and exponential convolving one another instead of summing together in a simultaneous distortion:

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value   ppm uVar
0.99643976       0.99642188     0.34474151    65,072    3560.24004

Peak  Type         a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}       a_{6}
1     GenHVL<gex>  3.69149765  2.89959539  0.04570381  -0.0051920  0.01352180  0.00013341  0.00736533

Parameter Statistics

Peak 1 GenHVL<gex>

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

Area 3.69149765 0.00829569 444.989798 3.67010010 3.71289521 0.00000

Center 2.89959539 0.00331705 874.149369 2.89103953 2.90815124 0.00000

Width 0.04570381 0.00100273 45.5794441 0.04311741 0.04829020 0.00000

Distortn -0.0051920 7.5952e-05 -68.358426 -0.0053879 -0.0049961 0.00000

Z-Asym 0.01352180 0.00397407 3.40250724 0.00327125 0.02377236 0.00069

g-sd 0.00013341 1157.31869 __1.1527e-07__ -2985.1394 2985.13966 __1.00000__

e-tau 0.00736533 0.00187375 3.93080002 0.00253226 0.01219840 0.00009

This is an example of how critical it is to get the IRF correct. We are fitting the same GenHVL peak model with only the IRF changed. We even use the same two IRF components, but in a convolution instead of a sum. The fit is poor, at least by contrast, with an error of 3560 ppm, and the half-Gaussian width is statistically indistinguishable from 0. The grayed values in the table indicate a failed significance for a given parameter. This can occur with a model that is incorrect, as in this case, as well as with a model which is overspecified, when two or more parameters are strongly correlated and it becomes a mathematical tossup as to which portion of the variance each of these parameters captures.
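The difference between a summed and a convolved two-component IRF can be made concrete numerically. The sketch below builds area-normalized half-Gaussian and exponential responses with values loosely taken from the fits above (g-sd ≈ 0.005, e-tau ≈ 0.042, g-frac ≈ 0.659) and compares a <ge>-style weighted sum against a <gex>-style convolution; the discretization is our own and purely illustrative:

```python
import numpy as np

def half_gaussian(t, sd):
    """Half-Gaussian response (zero for t < 0), area-normalized."""
    h = np.where(t >= 0, np.exp(-0.5 * (t / sd) ** 2), 0.0)
    return h / h.sum()

def exponential(t, tau):
    """First-order exponential decay response, area-normalized."""
    h = np.where(t >= 0, np.exp(-t / tau), 0.0)
    return h / h.sum()

dt = 0.001
t = np.arange(0.0, 0.5, dt)
g = half_gaussian(t, sd=0.005)   # narrow component (g-sd from the fit)
e = exponential(t, tau=0.042)    # wide component (e-tau from the fit)

# <ge>-style IRF: weighted SUM of the two responses (simultaneous paths)
irf_sum = 0.659 * g + (1 - 0.659) * e

# <gex>-style IRF: CONVOLUTION of the two responses (sequential stages)
irf_conv = np.convolve(g, e)[: t.size]
irf_conv /= irf_conv.sum()

# The two kernels differ sharply near t = 0: the sum retains the sharp
# onset of both components, while the convolution smooths it away.
print(irf_sum[0] > 10 * irf_conv[0])
```

Two quite different distortion kernels result from the same two components, which is why the fit statistics so decisively reject the convolved form for this data.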

Fitting an IRF with Two Narrow-Width Components - An Overspecified IRF

We will now fit an example of an overspecified IRF. We have created a GenHVL<gpe> UDF where the IRF is a five-parameter sum of a half-Gaussian and order 1.5 kinetic for the narrow component, and the same first order exponential for the higher width component. This is an example of fitting a model that describes two narrow width IRF components, the axial dispersion as a half-Gaussian and the mass transfer resistances with a 1.5 power kinetic decay. There are two additional IRF parameters, the additional narrow width term, and a second area fraction:

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value      ppm uVar
0.99999870       0.99999869     0.00658858    1.1919e+08   1.29760045

Peak  Type              a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}       a_{6}       a_{7}       a_{8}       a_{9}
1     GenHVL<gpe>-udf1  3.82081148  2.87969655  0.03855468  -0.0058916  0.01515923  0.00418343  0.00093313  0.04178429  0.58254686  0.09016889

Parameter Statistics

Peak 1 GenHVL<gpe>-udf1

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

a0 3.82081148 0.00028478 13416.8184 3.82007694 3.82154603 0.00000

a1 2.87969655 0.00047178 6103.92815 2.87847966 2.88091344 0.00000

a2 0.03855468 2.8851e-05 1336.33152 0.03848026 0.03862910 0.00000

a3 -0.0058916 2.8715e-06 -2051.7479 -0.0058990 -0.0058841 0.00000

a4 0.01515923 5.3469e-05 283.513409 0.01502132 0.01529715 0.00000

a5 0.00418343 0.00026756 15.6353479 0.00349329 0.00487357 0.00000

a6 0.00093313 0.00122949 __0.75895921__ -0.0022382 0.00410442 __0.44801__

a7 0.04178429 8.1879e-05 510.316549 0.04157309 0.04199548 0.00000

a8 0.58254686 0.10733620 5.42731038 0.30568753 0.85940618 0.00000

a9 0.09016889 0.11126955 __0.81036448__ -0.1968360 0.37717378 __0.41787__

The fit does have a better r^{2} goodness of fit than either the GenHVL<ge> model with the
half-Gaussian narrow IRF component or the GenHVL<pe> model with the 1.5 order kinetic decay. The
F-statistic, however, suggests a weaker description of the data, and more importantly, the 1.5 order kinetic
time constant, a_{6}, and the 1.5 order area fraction, a_{9}, test as insignificant. This
is why PFChrom limits all built-in IRFs to three parameters and no more than two components. There is
not enough information in the data for two narrow width components to be realistically fitted.

Fitting a Core Model with a Fourth Moment Adjustment instead of a Third Moment Adjustment - An Incorrect Core Model

Here we fit a model where the kurtosis or fourth moment of the ZDD is adjustable, but where the third moment or skewness is constrained to be zero. The ZDD is thus symmetric and only the thinness or fatness of the tails is adjusted:

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value     ppm uVar
0.99999753       0.99999751     0.00908698    80,564,528  2.47183679

Peak  Type           a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}       a_{6}       a_{7}
1     GenHVL[Q]<ge>  3.81959030  2.88406266  0.03922594  -0.0058327  1.94088791  0.00044012  0.04209852  0.64684284

Parameter Statistics

Peak 1 GenHVL[Q]<ge>

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

Area 3.81959030 0.00022522 16959.7284 3.81900939 3.82017121 0.00000

Center 2.88406266 0.00070015 4119.21186 2.88225673 2.88586860 0.00000

Width 0.03922594 3.926e-05 999.129240 0.03912467 0.03932720 0.00000

Distortn -0.0058327 3.946e-06 -1478.1241 -0.0058429 -0.0058225 0.00000

Q-power 1.94088791 0.00029393 6603.15370 1.94012975 1.94164607 0.00000

g-sd 0.00044012 0.00117210 __0.37549847__ -0.0025831 0.00346339 __0.70735__

e-tau 0.04209852 4.4059e-05 955.493268 0.04198487 0.04221216 0.00000

g-frac 0.64684284 0.49246444 __1.31348131__ -0.6234006 1.91708632 __0.18924__

In this instance we fit the same IRF, and only the higher moment in the ZDD is changed. This [Q] ZDD is capable of reproducing the HVL, but not the NLC, since the NLC's ZDD is the asymmetric Giddings. Here we see that we have an exceptional 2.47 ppm error, but both the half-Gaussian width and its area fraction failed the significance testing. In the conventional analysis, we want a t-value > 2.5 and a probability of the parameter being zero < 0.01. This is an example of how critical it is to get the ZDD correct.

Examples Where Parameter Significance Succeeds

Fitting a Twice-Generalized Core Model with Both Third and Fourth Moment Adjustments - An Extension to An Appropriate Core Model

If the orthogonality of moments translates to a lack of intercorrelation between the parameters, we should
also be able to fit a twice-generalized model to full significance:

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value      ppm uVar
0.99999864       0.99999863     0.00673985    1.2814e+08   1.35884292

Peak  Type         a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}       a_{6}       a_{7}       a_{8}
1     Gen2HVL<ge>  3.81849882  2.87996856  0.03868929  -0.0058778  1.98538288  0.01146987  0.00417975  0.04196685  0.66548681

Parameter Statistics

Peak 1 Gen2HVL<ge>

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

Area 3.81849882 0.00016906 22587.0020 3.81806276 3.81893488 0.00000

Center 2.87996856 0.00021214 13575.8441 2.87942138 2.88051574 0.00000

Width 0.03868929 3.1929e-05 1211.72535 0.03860693 0.03877165 0.00000

Distortn -0.0058778 2.9055e-06 -2022.9989 -0.0058853 -0.0058703 0.00000

Y-power 1.98538288 0.00137469 1444.24106 1.98183706 1.98892871 0.00000

Y-asym 0.01146987 0.00034598 33.1514053 0.01057745 0.01236229 0.00000

g-sd 0.00417975 0.00020540 20.3492876 0.00364994 0.00470955 0.00000

e-tau 0.04196685 3.3052e-05 1269.73547 0.04188160 0.04205210 0.00000

g-frac 0.66548681 0.00136143 488.815224 0.66197519 0.66899843 0.00000

While the statistics confirm the third moment Y-asym and the half-Gaussian IRF g-sd width are the most weakly determined of the parameters, all test to full 99% significance without difficulty. The fourth moment parameter, Y-power, fits to 1.985, just shy of the 2.0 power of a Gaussian decay. Note that the F-statistic of 128 million for the twice-generalized model is less than the 136 million of the once-generalized model. In this instance, there is no statistical basis for using a twice-generalized model which also adjusts the fourth moment. We only note the stability of a fit where parameters which strictly adjust the specific moments are used. This twice-generalized model fit is one with every parameter significant; it is simply not a 'better' model for this data. The F-statistic is used to select the most appropriate model for the data.

Fitting a Different Once-Generalized Model that Performs a Different Third Moment Adjustment

In this case, we will fit an alternative model that adjusts the third moment or skewness of the ZDD. The GenHVL[G] model uses the Skew Normal or GMG as the ZDD instead of the default generalized normal ZDD.

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value      ppm uVar
0.99999884       0.99999883     0.00623526    1.7111e+08   1.16383188

Peak  Type           a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}       a_{6}       a_{7}
1     GenHVL[G]<ge>  3.81848259  2.86191875  0.03643718  -0.0052848  0.02150691  0.00501900  0.04194502  0.65928185

Parameter Statistics

Peak 1 GenHVL[G]<ge>

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

Area 3.81848259 0.00015440 24731.5265 3.81808435 3.81888084 0.00000

Center 2.86191875 0.00013431 21308.3103 2.86157231 2.86226518 0.00000

Width 0.03643718 2.4197e-05 1505.83501 0.03637476 0.03649959 0.00000

Distortn -0.0052848 3.2396e-06 -1631.2892 -0.0052931 -0.0052764 0.00000

G-sd 0.02150691 2.5111e-05 856.473158 0.02144214 0.02157168 0.00000

g-sd 0.00501900 0.00013860 36.2122220 0.00466150 0.00537650 0.00000

e-tau 0.04194502 3.0735e-05 1364.73281 0.04186574 0.04202429 0.00000

g-frac 0.65928185 0.00090713 726.779448 0.65694205 0.66162166 0.00000

In this case, we have a slightly better F-statistic, and thus potentially a better model for describing this specific data, although as you shall see shortly, this is not the case when a small measure of overload is present. In this fit, all of the parameters are statistically significant. This does illustrate that there is more than one way to model the ZDD asymmetry arising from multiple-site adsorption and other effects which directly impact the actual chromatographic separation. Both the logarithmic scaling of the generalized normal and the half-Gaussian convolution of the skew normal can produce a statistically viable picture of the intrinsic ZDD skewness. We selected the generalized normal of the GenHVL and GenNLC as the default for the once-generalized chromatographic models since it can fit both the HVL and NLC to a much higher precision than the data can be sampled, and, as you shall see, it is appreciably more robust with respect to modeling wide ranges of concentration.

Fitting a Different Once-Generalized Model with Two Third-Moment ZDD Adjustments - A Possible Overspecification

In this fit we use the GenHVL[V] as the core model. The 'V' ZDD uses two third-moment adjustments, the logarithmic scaling factor as well as a half-Gaussian convolution in the ZDD. It can be overspecified since there are two separate adjustments of the third moment skewness in the ZDD.

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value      ppm uVar
0.99999885       0.99999885     0.00619136    1.5185e+08   1.14667712

Peak  Type           a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}       a_{6}       a_{7}       a_{8}
1     GenHVL[V]<ge>  3.81854086  2.86127001  0.03625329  -0.0051807  0.02337363  -0.0031897  0.00445524  0.04195416  0.66326512

Parameter Statistics

Peak 1 GenHVL[V]<ge>

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

Area 3.81854086 0.00015426 24754.0990 3.81814297 3.81893875 0.00000

Center 2.86127001 0.00021502 13306.8336 2.86071539 2.86182463 0.00000

Width 0.03625329 6.2735e-05 577.880863 0.03609147 0.03641511 0.00000

Distortn -0.0051807 3.6767e-05 -140.90579 -0.0052755 -0.0050859 0.00000

G-sd 0.02337363 0.00062766 37.2393064 0.02175467 0.02499260 0.00000

Z-Asym -0.0031897 0.00108503 __-2.9397491__ -0.0059884 -0.0003910 __0.00334__

g-sd 0.00445524 0.00028504 15.6301759 0.00372002 0.00519047 0.00000

e-tau 0.04195416 3.0661e-05 1368.33747 0.04187508 0.04203325 0.00000

g-frac 0.66326512 0.00196434 337.652224 0.65819836 0.66833187 0.00000

Although in this example the significance of all parameters met the 99% limits, the logarithmic adjustment represented in the Z-Asym parameter only barely did so. The ppm error is the best of the models, but again, one is at a threshold of significance in one of the parameters, and the F-statistic suggests that the model, while not overspecified per se, may be more complex than necessary. One should always use the simplest model that accurately represents the data, even if that model does not have the best r^{2} goodness of fit. For model selection, we rely almost exclusively on the F-statistic.

Fitting a One Component IRF - An Insufficient Description of the IRF

In this next fit, we omit the narrow width component in the IRF and fit only the higher width first order exponential:

Fitted Parameters

r^{2} Coef Det   DF Adj r^{2}   Fit Std Err   F-value   ppm uVar
0.99934974       0.99934694     0.14727979    429,084   650.263870

Peak  Type       a_{0}       a_{1}       a_{2}       a_{3}       a_{4}       a_{5}
1     GenHVL<e>  3.77081642  2.85630628  0.03159503  -0.0063686  0.01236693  0.02538398

Parameter Statistics

Peak 1 GenHVL<e>

Parameter Value Std Error t-value 99% Conf Lo 99% Conf Hi P>|t|

Area 3.77081642 0.00318714 1183.13488 3.76259566 3.77903719 0.00000

Center 2.85630628 0.00052933 5396.06983 2.85494095 2.85767161 0.00000

Width 0.03159503 0.00029123 108.489870 0.03084385 0.03234620 0.00000

Distortn -0.0063686 4.6122e-05 -138.08015 -0.0064876 -0.0062496 0.00000

Z-Asym 0.01236693 0.00088629 13.9536428 0.01008088 0.01465298 0.00000

e-tau 0.02538398 0.00013983 181.531310 0.02502330 0.02574465 0.00000

Although all parameters are significant at the 99% threshold, note the much higher error, 650 ppm relative to the single-digit ppm values of the better models. Each of the parameters is significant in the fitted model, but the model itself isn't a particularly good one. One wants a fit with close to zero error and full statistical significance in each parameter.

Confidence Statistics and Assumptions

In a typical regression analysis, one may capture 90% of the variance of the data. The remaining 10% will consist of residuals (the difference between the data and fit at each point in the data). In such a fit, one wants the assumptions underlying the confidence statistics to be met. The normal statistical assumptions of 'IID' (independent and identically distributed) residuals, of a normal (Gaussian) density, are necessary to treat the statistics of the least-squares fit as valid. One does not want systematic trends in the residuals (the residuals should not be correlated across adjacent points), and the histogram of the residuals should consist of a normal density. If these requirements are not met, the confidence statistics are deemed inaccurate, and a more complex estimate of error, as realized by non-parametric methods or a computationally intense bootstrap, is then used.
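These two residual checks, independence across adjacent points and normality of the density, can be approximated with very little code. The following sketch (our own diagnostics, not PFChrom's SNP procedure) computes a lag-1 autocorrelation and an excess kurtosis for a well-behaved residual series and for a systematically trended one:

```python
import numpy as np

def residual_diagnostics(residuals):
    """Quick checks of the IID-normal assumption (illustrative only)."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    # 1. Lag-1 autocorrelation: near 0 for independent residuals,
    #    near +1 when adjacent residuals trend together systematically.
    lag1 = np.dot(r[:-1], r[1:]) / np.dot(r, r)
    # 2. Excess kurtosis: near 0 for a normal density.
    kurt = (r ** 4).mean() / (r ** 2).mean() ** 2 - 3.0
    return lag1, kurt

rng = np.random.default_rng(1)
iid = rng.standard_normal(2000)      # well-behaved residuals
trended = np.cumsum(iid) / 40        # strongly correlated residuals

lag_iid, kurt_iid = residual_diagnostics(iid)
lag_trend, _ = residual_diagnostics(trended)
print(abs(lag_iid) < 0.1, lag_trend > 0.9)  # True True
```

A residual series with the strong systematic trend described below for the GenHVL<ge> fit would show a lag-1 autocorrelation far from zero, flagging the violated independence assumption.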

For a PFChrom fit where the unaccounted variance is 1 ppm, as in this example, the residuals will not consist of 10% of the variance or power within the data. In this instance, they represent just one part per million of the variance, 0.0001%, five orders of magnitude less than this 10%. It is our experience that PFChrom fits can be so close to complete that you can see as many different fits and error distributions as you like simply by making exceptionally small differences in the intricacies of fitting the baseline in the baseline correction step. In fact, in the kind of baselines observed in higher concentration analyses, you can see differences as great as 100 ppm error vs 1 ppm simply by fitting a linear rather than non-parametric baseline. A PFChrom fit to analytic chromatographic data will be such that the error you see will likely be a product of the accuracy of the baseline correction.

For the GenHVL<ge> fit, the residuals show a strong systematic trend. One point's value is strongly correlated with the points prior and after. This is fully expected. We know we are not capturing the narrow width components of the IRF as they actually exist. We have made a simplification where a single component must be fitted for all narrow width IRF processes. Since we fit a single peak standard in this example, this systematic trend is not a baseline correction issue. The different systematic oscillations do correspond with that which is not being accounted for in the model. If our model were 'perfect' at this 1 ppm error, we would see uncorrelated random residuals.

When systematic trends exist, the density of the residuals will seldom be Gaussian. For this GenHVL<ge> fit, the density is clearly far from normal. The overall shape of this density will reflect the subtle nuances in the baseline correction.

Although in this case the lack of normality is obvious, PFChrom does offer a stabilized normal probability (SNP) plot of the residuals with 90, 95, 99, and 99.9% critical limits. For a density to be assumed normal to a 99.9% confidence, not one single point should lie above the upper (or below the lower) red horizontal line. Clearly the normality assumption for the residuals fails.

We have just illustrated that the confidence and significance error statistics for the 1 ppm fit are not deemed statistically valid. And yet we have already shown that they catch incorrect or overspecified IRFs, and they catch incorrect, insufficient, and even close-to-overspecified core peak models.

One could argue that anything representing just 1 ppm of the data should never be analyzed at all, akin to studying six-sigma outliers or attempting to compute a 99.999% confidence interval, but we have found the fit statistics at near-zero error levels incredibly useful for screening models for correctness and the absence of overspecification. From a pragmatic perspective, we know a 95% confidence limit is not actually a true 95% band because of these assumptions being violated. At the same time, we do not dismiss the 1 ppm fit errors as having no value and nothing to tell us, much as we illustrated in the examples above. What we have chosen to do is strictly ad hoc; we often use the 99% confidence statistics and simply assume they are probably closer to 90% confidence values.

It is also a practical consideration. For such exceptionally low error fits, we could only conceive of using bootstrap methods with a subsampling of the data to estimate a more accurate error in fitting. Fourier methods require uniformly-spaced x-values. The fit we made of the above GenHVL<ge> data using Fourier methods required just 0.42 seconds for this 1400 point data set. To fit a non-uniformly sampled subset of this data as one element of a bootstrap, fitting the actual integral with an exceptionally fast quadrature routine, required 6.5 minutes. Since no bootstrap at a 95% level would be deemed valid without at least 1000 samples, this simple analysis would require about 4.5 days. For a data set with many peaks, each of which would have to be independently fitted with the convolution integral, and for data sets of 10,000 or more points, a true error estimate might require a month or more of continuous computation.
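The 4.5 day figure follows directly from the quoted timings:

```python
# Back-of-envelope cost of a bootstrap error estimate, using the timings
# quoted above (0.42 s Fourier fit vs 6.5 min quadrature fit per resample).
fit_minutes = 6.5        # one quadrature fit of a resampled subset
samples = 1000           # minimum resamples for a valid 95% bootstrap
total_days = fit_minutes * samples / (60 * 24)
print(round(total_days, 1))  # 4.5 days for a single peak
```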

We will also note that statisticians may not be your best source for validating the modeling within PFChrom. They may not see a goodness of fit near 1 ppm error as anything other than overfitting. It is altogether possible statisticians may go their entire careers, with extensive experience in regression analysis, and never see data comparable to that observed in modern chromatographic instruments, or models which so precisely describe real-world data.

The Concentration Test

The higher-moment adjustments to the ZDD are amplified in the a_{3} chromatographic distortion operator, which produces the observed fronting and tailing in the peaks. This makes the fitting of different concentrations a kind of litmus test for the robustness and universality of a chromatographic model. It must hold up both at dilute concentrations and at analytic concentrations where a small measure of overload slips in, as often occurs in practice, especially when the object of the analysis is the lower-area component peaks.

If we fit the GenHVL<ge> model to this standard peak at concentrations of 5, 10, 25, and 50 ppm, we would want the fits to be close to equally effective at all of these concentrations. These may seem like small differences in concentration, but with respect to concentration-dependent shapes, in this case fronting, the differences are immense, as the area-normalized plots of the data above suggest. Note that the 5 and 10 ppm curves (white, yellow) track one another in the initial rise, but the 25 ppm curve (green) deviates slightly, and the 50 ppm curve (blue) significantly, suggesting a small measure of overload in these two higher-concentration samples.
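The area normalization behind such overlay plots can be sketched in a few lines. This is a generic trapezoidal-rule illustration with synthetic data, not PFChrom's internal procedure.

```python
# Minimal sketch of area normalization for overlaying peaks of different
# concentrations. The data here are synthetic placeholders.

def trapezoid_area(t, y):
    """Trapezoidal-rule area under y(t)."""
    return sum((y[i] + y[i + 1]) * (t[i + 1] - t[i]) / 2.0
               for i in range(len(t) - 1))

def area_normalize(t, y):
    """Scale a peak to unit area so shapes can be compared across concentrations."""
    area = trapezoid_area(t, y)
    return [v / area for v in y]

# Synthetic example: a triangular 'peak' sampled at unit spacing.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.0, 1.0, 2.0, 1.0, 0.0]
yn = area_normalize(t, y)
print(round(trapezoid_area(t, yn), 6))  # prints 1.0 (unit area)
```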

If we fit these four data sets to the GenHVL<ge> model, we realize fits of 2.11, 1.46, 1.92, and 7.21 ppm error, respectively. One would expect the fits to improve with concentration (because the S/N improves), but only to the point where a measure of overload begins to appear (since the once-generalized models do not address overload). We have four vastly different peak shapes, and only the last of these has a somewhat higher fit error. For a model to fit such highly differentiated fronted shapes, it must be capable of accurately representing the true ZDD, since the a_{3} chromatographic operator translates this ZDD into these concentration-dependent shapes. To fit this variation in shapes, both the ZDD model and the estimation of the IRF must be exceptionally accurate.
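A sketch of how such a ppm error figure can be computed follows. We assume here that the ppm error is the unexplained variance fraction of the fit, (1 - r²), scaled to parts per million; the data below are synthetic, not the fits from the text.

```python
# Hedged sketch: we assume the 'ppm error' is the unexplained variance
# fraction of the fit scaled to parts per million. Synthetic data only.

def ppm_error(y, y_fit):
    """Unexplained variance of the fit, in parts per million."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((v - f) ** 2 for v, f in zip(y, y_fit))
    return 1e6 * ss_res / ss_tot

# Synthetic check: a fit capturing all but a tiny residual.
y = [0.0, 1.0, 4.0, 1.0, 0.0]
y_fit = [0.0, 1.001, 3.999, 1.001, 0.0]
print(f"{ppm_error(y, y_fit):.2f} ppm")  # prints 0.28 ppm
```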

For example, the GenHVL[G]<ge> model, which uses the Skew Normal or GMG ZDD, and which performed so well with the 10 ppm data, realizes 1.86, 1.16, 5.63, and 124.8 ppm errors across these four concentrations. The model works well at low concentrations but not at higher ones.

If we look at the Gen2HVL<ge> twice-generalized model, which adds a fourth-moment adjustment to the once-generalized default chromatographic model, we see fits with 1.07, 1.36, 1.10, and 2.79 ppm error. This is exactly what a twice-generalized model should do, the fourth-moment adjustment managing the presence of this small amount of overload.
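Tabulating the ppm errors quoted above makes the robustness comparison across the three models immediate; the worst case across the four concentrations is a simple figure of merit. The numbers are taken verbatim from the text.

```python
# The ppm fit errors quoted in the text, one row per model, one column
# per sample concentration (5, 10, 25, 50 ppm).
errors = {
    "GenHVL<ge>":    [2.11, 1.46, 1.92, 7.21],
    "GenHVL[G]<ge>": [1.86, 1.16, 5.63, 124.8],
    "Gen2HVL<ge>":   [1.07, 1.36, 1.10, 2.79],
}

# The worst-case error across the concentrations highlights robustness:
# the twice-generalized Gen2HVL<ge> holds up where the GMG-based model fails.
for model, errs in errors.items():
    print(f"{model:15s} worst fit: {max(errs):6.2f} ppm error")
```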

The Ideal Fit

Have we found the ideal fit? In the real world of statistical modeling, we doubt any such entity exists. An ideal fit would account for everything in the physical process, however small, and we know the PFChrom models cannot do this. No statistical model could.

Have we found a suitable universal model for analytic non-gradient shapes? We will leave that assessment to PFChrom's users. We will, however, note that we think it likely you will find the once-generalized GenHVL and GenNLC models, and the twice-generalized Gen2HVL and Gen2NLC models, to be precisely this.

We must again note the absolute limitations of the nonlinear fitting process. A data set consisting entirely of intrinsically tailed peaks is unlikely to fit the IRF accurately in an IRF-bearing model, since the direction of each peak's native distortion will be the same as that of the IRF, and thus correlated to some degree. An ideal fit would process a set of intrinsically tailed peaks as effortlessly as a data set containing one or more intrinsically fronted peaks; the nature of the nonlinear fitting process makes that impossible. For such tailed-peak data, the IRF must be independently estimated and the data then preprocessed using Fourier deconvolution in order to realize this same kind of effective fit.

The addition of a gradient, and of overload shapes, is managed quite well by the twice-generalized models, but a better gradient fit is realized by first modeling and then unwinding the gradient prior to fitting, and the preparative shapes are better fit using an extension of the twice-generalized model that allows the two sides of the ZDD to have independent widths.

The PFChrom chromatographic models are based on a statistical generalization of the Haarhoff-VanderLinde and Wade-Thomas theoretical models. This generalization accounts for multiple-adsorption-site and other asymmetries, as well as adding two-component IRFs that map nearly all of the distortions in a peak that are not part of the actual chromatographic separation. In our view, these models are not merely built on sound science, but upon the very finest of that science.

While we may not have managed an absolute ideal, and while there may be more of a tool set of functions for this highest-accuracy modeling than a single universal model, we have mathematically accounted for nearly everything that can be fitted within the nonlinear modeling of chromatographic peaks. Much is new. You can now quantify the 'aggressiveness' of adsorption with the a_{4} third-moment parameter which is added to the core HVL and NLC models. You can quantify the narrow-width and higher-width components of the instrumental and system distortions, identifying differences in preps, run conditions, flow paths, and detectors. You will be able to quantify the changes in a column's performance with time. If you are designing columns, you will be able to quantify design changes in particles, particle treatments, pore sizes, materials, and transport.