PeakLab v1 Documentation

Robust Fitting

Robust fitting can be applied to any peak fit, including those containing UDFs, by selecting a robust minimization procedure in the Peak Fit Preferences option.

Limitations of Least-Squares

Least-squares fitting involves the minimization of the sum of the squared residuals. There are two instances where this minimization produces a less than satisfactory fit. The first is where significant outliers are present. In this case, the square of the residuals of these outlier points may, within a given region, significantly shift the fitted curve away from the bulk of the data.

The other instance is when the Y-data spans more than several orders of magnitude. The squared residuals of the largest-valued Y-points can overwhelm the influence of the squared residuals of the smallest-valued Y-points, causing those smallest points to be either poorly fitted or not fitted at all. Data that requires a logarithmic Y-scale to see all of the points may be a good candidate for robust fitting, especially if four or more major log divisions are present.
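The scale of this imbalance is easy to see with a little arithmetic. In the sketch below (illustrative values only, not PeakLab code), two points each carry the same 1% relative error, yet the larger point's squared residual outweighs the smaller's by a factor of one hundred million:

```python
# Illustrative arithmetic only (not PeakLab code): two data points each
# carry the same 1% relative error, but the squared residual of the point
# at y = 1e4 outweighs that of the point at y = 1 by a factor of 1e8.
y_large, y_small = 1.0e4, 1.0
r_large = 0.01 * y_large      # absolute residual = 100.0
r_small = 0.01 * y_small      # absolute residual = 0.01

ratio = r_large ** 2 / r_small ** 2
print(ratio)                  # ~1e8: the small point barely registers
```

With four major log divisions between the points, the smaller point contributes essentially nothing to the sum of squares being minimized.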

Note that PeakLab overcomes this least-squares Y-dynamic range limitation in specialty fitting strategies that manage the fitting of very small amplitude peaks.

Least Absolute Deviation

The essence of robust fitting is to use a minimization that is less influenced by outliers and the dynamic range of the Y-variable. Instead of minimizing the sum of the squares of the residuals, an obvious alternative is the minimization of the sum of the absolute value of the residuals. This is probably the best known robust method, though not necessarily the best, and is usually designated as least absolute deviation or LAD. Of PeakLab's three robust methods, least absolute deviation is the least powerful in terms of managing outliers.
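As a sketch of the idea outside PeakLab (the data, model, and SciPy-based solver here are illustrative assumptions, not PeakLab's implementation), a LAD line fit can be compared with least squares on data containing one gross outlier:

```python
import numpy as np
from scipy.optimize import minimize

# LAD versus least squares on a line with one gross outlier.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0                 # true line: slope 2, intercept 1
y[9] += 50.0                      # one severe outlier

# Ordinary least squares (closed form via polyfit).
ls_slope, ls_intercept = np.polyfit(x, y, 1)

# LAD: minimize the sum of the absolute residuals.
def sum_abs_dev(p):
    return np.sum(np.abs(y - (p[0] * x + p[1])))

lad = minimize(sum_abs_dev, x0=[ls_slope, ls_intercept],
               method="Nelder-Mead")
lad_slope, lad_intercept = lad.x

print(ls_slope, lad_slope)
```

The outlier drags the least-squares slope well away from the true value of 2, while the LAD slope remains close to it.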

Lorentzian Minimization

The intermediate method in terms of power of robustness is a Lorentzian minimization. Here what is actually being minimized is the sum of LN(1+(ABS(residual))^2). If you are uncertain of which robust method to use, we strongly recommend this Lorentzian minimization. It is very effective when you have noisy data with outliers or if your peak data spans many orders of magnitude in Y and you are not getting the fit you feel you should see on very low magnitude peaks. These Lorentzian robust fits also tend to converge as rapidly as the least absolute deviation minimization.
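For illustration outside PeakLab, SciPy's least_squares routine offers a 'cauchy' loss that minimizes 0.5 multiplied by the sum of LN(1+residual^2), which is proportional to this Lorentzian criterion (the data here are invented for the example):

```python
import numpy as np
from scipy.optimize import least_squares

# SciPy's 'cauchy' loss minimizes 0.5 * sum(ln(1 + r^2)), which is
# proportional to the Lorentzian criterion sum(ln(1 + |r|^2)).
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0                 # true line: slope 2, intercept 1
y[9] += 50.0                      # one severe outlier

def residuals(p):
    return y - (p[0] * x + p[1])

fit_ls = least_squares(residuals, x0=[1.0, 0.0])                  # least squares
fit_lor = least_squares(residuals, x0=[1.0, 0.0], loss="cauchy")  # Lorentzian loss

print(fit_ls.x[0], fit_lor.x[0])
```

The Lorentzian-loss slope lands near the true value while the least-squares slope is pulled toward the outlier.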

Pearson VII Limit Minimization

This is the most robust of the three methods. Here the minimization is of the sum of LN(SQRT(1+(ABS(residual))^2)). With this method, outliers tend to have almost no impact on the fit. This minimization represents an extreme case in which wild and random errors are expected as a natural course. Such errors are rarely encountered in chromatography and spectroscopy data.

Gaussian Error Distribution

Each of the various minimization formulas corresponds with a maximum likelihood probability distribution. For least squares, this corresponding error distribution is Gaussian, or normal, the blue curve in the plot above. Relative to the fit standard error (SE), a least-squares goodness of fit, 95.4% of the data points should have residuals with a magnitude less than 2 SE (1 in 21.98 points should lie outside 2 SE). The value is 99.73% for residuals within 3 SE (1 in 370.4 points should lie outside 3 SE). The Gaussian distribution decays very rapidly. It suggests that only 1 of 15787 points should lie outside 4 SE, only 1 of 1.74 million points should lie outside 5 SE, and those outside 6 SE ("six sigma") should number only 1 in half a billion. Where data is subject to human error, it is generally agreed that major errors are more probable than the Gaussian distribution suggests.
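These tail figures follow directly from the normal survival function; the short check below (using SciPy, not PeakLab) reproduces them:

```python
from scipy.stats import norm

# Two-sided Gaussian tail: the fraction of points expected outside +/- k SE
# is 2 * (1 - Phi(k)); its reciprocal is the "1 in N" figure quoted above.
for k in (2, 3, 4, 5, 6):
    outside = 2.0 * norm.sf(k)
    print(k, 100.0 * (1.0 - outside), 1.0 / outside)
# 2 -> 95.45% within, 1 in 21.98 outside
# 3 -> 99.73% within, 1 in 370.4 outside
# 4 -> 1 in 15787 outside; 5 -> 1 in ~1.744 million; 6 -> ~1 in 5.1e8
```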

Double-Sided Exponential Error Distribution

The least absolute deviation fitting corresponds with a double-sided exponential probability distribution, the amber colored curve above. While such a distribution produces reasonably wide tails, it also has a discontinuous first derivative at the peak center. As such, the actual error profile isn't likely to match this double-sided exponential in this center region. The tails of the double-sided exponential may, however, be very practical. This distribution suggests that 75.7% of the points should be within 2 SE and 88.0% within 3 SE. It expects 1 of 16.9 points to be outside 4 SE, 1 of 34.3 to be outside 5 SE, and 1 of 69.6 to be outside 6 SE. It is important here to recognize that such SE-relative values are based upon a least-squares goodness of fit standard deviation.
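These figures can be reproduced from the survival function of a Laplace (double-sided exponential) density, EXP(-k/b). The scale b = SQRT(2) used below is an assumption inferred from the quoted figures themselves, not something stated in this documentation:

```python
import math

# Laplace (double-sided exponential) tail: P(|r| > k) = exp(-k / b).
# The scale b = sqrt(2) is an assumption; it reproduces every figure above.
b = math.sqrt(2.0)
for k in (2, 3, 4, 5, 6):
    outside = math.exp(-k / b)
    print(k, 100.0 * (1.0 - outside), 1.0 / outside)
# 2 -> 75.7% within; 3 -> 88.0% within
# 4 -> 1 in 16.9 outside; 5 -> 1 in 34.3; 6 -> 1 in 69.6
```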

Lorentzian Error Distribution

The Lorentzian minimization is strongly recommended for data with significant outliers because the Lorentzian distribution is a very natural one both at the center and the very wide tails. It is the green curve in the above plot. Such broad tails in effect state that significant errors are expected or likely and that points with such errors should minimally influence the fit. The Lorentzian distribution suggests that 68.5% of the points should be within 2 SE and 83.3% within 3 SE. In terms of outliers, the Lorentzian expects 1 of 7.9 points to be outside of 4 SE, 1 of 9.8 to be outside of 5 SE, and 1 of 11.4 to be outside of 6 SE. Even out to 10 SE, the Lorentzian contains only 94.9% of the points, less than the Gaussian contains within 2 SE.

Pearson VII Limit Distribution

PeakLab also offers a distribution with the same naturalness about the peak center as the Lorentzian but with extremely wide tails. This is based upon the Pearson VII function with a power term of 0.5, the red density in the plot. This is the smallest power term that can be used in the Pearson VII function and still have the peak converge to a finite area. It is difficult to estimate coverage for this distribution since area convergence does not occur until near infinity. We can, however, assume a functional range of +/- 50 SE. In this case, 38.3% of the points are expected to be within 2 SE and 45.9% within 3 SE. Here 1 of every 2.25 points is likely to be outside 5 SE and 1 of every 3.22 is likely to be outside 10 SE. If you know your model is very appropriate to your data, but there is a high likelihood that a significant number of the data points are very seriously in error, or perhaps even fabricated, this highly robust method may be a good choice for removing the impact of such points.

Analysis of Residuals

For least-squares fits, the parameter confidence limits, parameter standard errors, and fitted curve confidence and prediction bands can be considered valid only if the hypothesis of normally distributed errors is confirmed. PeakLab offers two procedures for confirming this normality of residuals.

The Distribution Graph option within the Review's Residuals Graph option displays a histogram of the residuals. Although subjective, this may assist you in judging whether the errors are normally distributed. The principal approach PeakLab offers for determining normality is a Delta Stabilized Normal Probability plot. This plot contains critical limits that can confirm or reject the assumption of normality for the residuals.
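For readers who wish to check residual normality outside PeakLab, SciPy offers comparable tools. Note that neither of these is PeakLab's Delta Stabilized plot, and the simulated residuals below are purely illustrative:

```python
import numpy as np
from scipy import stats

# Simulated residuals stand in for a fit's residuals here.
rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 1.0, size=200)

# Normal probability plot: normally distributed residuals fall on a
# straight line; r is the correlation coefficient of that line fit.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals)

# Shapiro-Wilk test: a p-value above the chosen alpha (e.g. 0.05)
# means normality is not rejected.
stat, p = stats.shapiro(residuals)
print(r, p)
```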

Evenness of Residuals Distribution

All of the minimization models are symmetric ones, which assume a symmetric distribution of residuals. That may not be the case: the errors at one end of the X-range may be of a different magnitude, or even a different sign, from those at the other end. In such cases, a more robust method may be desirable to lessen the impact of this uneven distribution of residuals.
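A symmetry check of this kind can be sketched outside PeakLab with a moment-based skew test (the simulated residual sets below are illustrative assumptions, not PeakLab output):

```python
import numpy as np
from scipy import stats

# Simulated residual sets: one symmetric, one with one-sided errors.
rng = np.random.default_rng(1)
symmetric = rng.normal(size=500)
skewed = rng.exponential(size=500) - 1.0   # mean zero but heavily skewed

p_sym = stats.skewtest(symmetric).pvalue
p_skew = stats.skewtest(skewed).pvalue     # near zero: symmetry rejected
print(p_sym, p_skew)
```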

Using a Robust Procedure or Least-Squares

Small peaks will influence an overall fit’s merit function far less in a least-squares fit than in a robust minimization. If you fit with very open constraints, or with no constraints at all, and small peaks are not sufficiently defined by the data to remain in place during the fitting, a robust procedure may be of value. The phenomena of peaks diminishing to zero or negative amplitudes, narrowing to near-zero width, or shifting outside the X-range of the data occur less frequently when using a robust method. We again point out that PeakLab has special least-squares fitting strategies that automatically manage the fitting of very small amplitude peaks. For small-area peaks, a robust procedure should be a choice of last resort unless known outliers or data artifacts are present.

Unless you have a data set of relatively few points with obvious outliers, or the least-squares fit clearly failed to fit the peaks successfully, you will probably not benefit from a robust fit. Bear in mind that the robust methods do not converge as readily as least squares; additional iterations and lengthier fits will be the price paid for using one of the robust methods.

Least-squares is perfectly sufficient most of the time. We have not found instances with chromatographic data where robust methods were necessary to realize an effective fit. If uncertain, feel free to experiment, but please note the following issues associated with goodness of fit statistics.

Curve-Fit Statistics

To maintain a true basis for comparing fits in PeakLab, all goodness of fit statistics are based upon sum-of-squares, even when a fit is made using a robust method. This raises several important points:

· Standard errors, confidence ranges, and prediction and confidence intervals are all computed from sums of squares and can be assumed accurate only when the residuals are normally distributed. If the residuals distribution has narrower or wider tails than the Gaussian, or has appreciable asymmetry, then this assumption of normality fails, and these statistics should not be used as absolute measures of uncertainty. This is true for both least-squares and robust minimizations.

· It is quite possible to use a robust method to deal with rare outliers or a wide dynamic range on Y, and still see a residuals distribution that is essentially normal. In this case, even though the maximum likelihood estimation was a robust minimization rather than least squares, the reported confidence statistics can still be considered valid. If this seems confusing, consider what happens when a least-squares non-linear fit is stopped somewhat prematurely. Although the true least-squares minimum was not quite realized, the reported confidence statistics are not invalid because of this--they simply fail to reflect the least-squares fit at full minimization. A robust fit is similar in that it will likewise not reflect this least-squares minimum.

· It should be borne in mind that the various goodness of fit statistics are sum-of-squares relative, and as such the results from different minimizations cannot be compared, even for the same peak setup. Even though a robust fit may offer a clearly superior estimation by visual inspection, the goodness of fit statistics for robust fits may approach, but will never match or exceed, those for a least-squares fit. This is simply one more reason to rely on graphical inspection to assess the value of a robust fit. The instances where a robust fit is less influenced by outliers, or where it fully manages a wide dynamic range in Y, will not be at all apparent from the statistical indices of fit.
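This last point can be demonstrated directly: least squares minimizes the sum of squared residuals by construction, so any other parameter estimate, including a robust one, can only match or exceed it on that metric. A sketch with invented data and a LAD fit standing in for the robust method (not PeakLab code):

```python
import numpy as np
from scipy.optimize import minimize

# Least squares minimizes the sum of squared residuals (SSE) by
# construction, so a robust (here LAD) fit of the same model can only
# match or exceed the least-squares SSE.
rng = np.random.default_rng(2)
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=20)
y[19] += 30.0                              # one outlier

def sse(p):
    return np.sum((y - (p[0] * x + p[1])) ** 2)

ls = np.polyfit(x, y, 1)                   # least-squares parameters
lad = minimize(lambda p: np.sum(np.abs(y - (p[0] * x + p[1]))),
               x0=ls, method="Nelder-Mead").x

print(sse(ls), sse(lad))                   # robust SSE is never smaller
```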