PFChrom v5 Documentation Contents AIST Software Home AIST Software Support
Statistical Density Modeling - Coronavisus - Analysis Protocol
Data Preparation
In order to prepare the COVID-19 data for a component density modeling using PFChrom, we found that a high degree of specialized smoothing was needed in the raw data. To do this we use a three-step Savitzky-Golay filter method on the cumulative data, and smooth to a first derivative on the third pass.
Fitting the Smoothed Data to Component Densities
Estimating the skew of each cluster or peak in the overall density won't be possible due to the high measure of overlap, but a single shared asymmetry can be fitted across all peaks. This forces all peaks in a country's density to have the same shape.
We generated a large variety of theoretical SEIR shapes and fit this data to most of the closed-form statistical models. Unless the shape was extreme, only the Pearson IV model was able to fit the noise-free SEIR shapes to an r2 of 0.9999 or better, but we were not able to successfully fit the Pearson IV model to multiple densities within COVID-19 data to statistical significance in all parameters. Instead we had the best success using the simplest model furnishing this third moment adjustment, an asymmetric generalized normal. This is a model that uses a logarithmic transform of the x variable to generate a left or right skewness to the peak.
In the China data, where there are clearly multiple densities, the Pearson IV did not offer better fits than the generalized normal, despite its capability of approximating SEIR shapes.
In the procedure used for these data, PFChrom locates the local maximum and hidden peaks by generating a smooth second derivative of the density data, as smoothed above. With some experience, it actually becomes rather easy to spot the hidden peaks in a data set. This second derivative method for finding hidden peaks is often successful, although there are instances where peaks must be manually added.
As we look at fits of six generalized normal peaks to the China data at four different smoothing levels, note the third (green) peak which shifts inside the yellow one as the smoothing increases. In the China data, the reporting changed at the timing of the apex, and the bimodality is apparent at the first smoothing level. While we would prefer the 15-point data, we generally had to use the 17-point data in order to generate fits with no less than 20 ppm unaccounted variance. It was also the case that the 15-point fits, as a consequence of less smoothing and weaker fits, often failed to fit one or more of the parameters to 95% statistical significance. In this example, we can't be fully sure sure the yellow and green peaks are real or an artifact of the revision in how the data were reported. For the China data, the only country with a high death-rate and a full cycle of resolution, it seemed best to use the 17-point window. The yellow and green peaks remain slightly separate, and the fit comes in under the 20 ppm error, every fitted parameter significant.
Fitting Normals
If the shared a3 asymmetry in a generalized normal fit failed significance (could not be determined to be non-zero), the a3 would be locked at 1E-6 (an effective zero, the generalized normal has a singularity at an a3=0.0), and normals would be fitted. If a fit produced a statistically significant asymmetry for the peaks, normals were not fitted.
Interpreting Parameter Values
"China COVID-19 Deaths 1/10-4/13 SG-D1-15"
Fitted Parameters
r2 Coef Det DF Adj r2 Fit Std Err F-value ppm uVar
0.99996112 0.99995127 0.29067528 108,599 38.8773735
Peak Type a0 a1 a2 a3
1 GenNorm[m] 1212.88797 28.7141999 8.20045436 0.14021384
2 GenNorm[m] 1103.63283 36.3993629 5.20358287 0.14021384
3 GenNorm[m] 613.786433 44.1422965 3.86816664 0.14021384
4 GenNorm[m] 252.204970 55.4436691 3.88850774 0.14021384
5 GenNorm[m] 163.311673 67.6108332 7.28514474 0.14021384
6 GenNorm[m] 37.2405316 85.0297173 4.96768995 0.14021384
"China COVID-19 Deaths 1/10-4/13 SG-D1-17"
Fitted Parameters
r2 Coef Det DF Adj r2 Fit Std Err F-value ppm uVar
0.99998185 0.99997725 0.19823722 232,566 18.1545461
Peak Type a0 a1 a2 a3
1 GenNorm[m] 984.623525 27.2205059 7.64891252 0.14304894
2 GenNorm[m] 1457.95617 36.9814473 6.09135240 0.14304894
3 GenNorm[m] 523.304294 44.4536671 4.27119265 0.14304894
4 GenNorm[m] 246.996076 56.3964751 4.13734336 0.14304894
5 GenNorm[m] 127.109781 68.8648378 6.25977748 0.14304894
6 GenNorm[m] 43.6927874 84.7888273 5.52886284 0.14304894
"China COVID-19 Deaths 1/10-4/13 SG-D1-19"
Fitted Parameters
r2 Coef Det DF Adj r2 Fit Std Err F-value ppm uVar
0.99999078 0.99998845 0.14100014 458,103 9.21665497
Peak Type a0 a1 a2 a3
1 GenNorm[m] 622.566703 24.2420345 6.48785455 0.12406784
2 GenNorm[m] 2007.89322 37.0502339 7.13599414 0.12406784
3 GenNorm[m] 373.330184 44.3870080 4.53260676 0.12406784
4 GenNorm[m] 233.033247 57.3043002 4.51396101 0.12406784
5 GenNorm[m] 95.3151223 69.5862089 5.34954858 0.12406784
6 GenNorm[m] 52.4394777 84.0799064 6.16389235 0.12406784
"China COVID-19 Deaths 1/10-4/13 SG-D1-21"
Fitted Parameters
r2 Coef Det DF Adj r2 Fit Std Err F-value ppm uVar
0.99999098 0.99998870 0.13921254 468,222 9.01748019
Peak Type a0 a1 a2 a3
1 GenNorm[m] 343.810792 21.5325174 5.58867669 0.09737871
2 GenNorm[m] 2427.98735 36.6557543 8.19289212 0.09737871
3 GenNorm[m] 275.992561 43.7190769 4.88761593 0.09737871
4 GenNorm[m] 203.702870 58.1623187 4.79586971 0.09737871
5 GenNorm[m] 69.2937023 69.5872201 4.81245958 0.09737871
6 GenNorm[m] 66.3130297 82.7669562 7.37179214 0.09737871
The above parameters are from the four fits above. The a0 area parameter will consist of the number of deaths in that cluster or population this peak represents. The a1 parameter is the location in time, t0=0, t1=first non-zero day where a death was reported. The generalized normal used is one parameterized to where the a1 parameter is the mean of the asymmetric density, not the mean of the underlying or deconvolved Gaussian. The a2 is the standard deviation of the underlying Gaussian. The a3 is the statistical asymmetry in the generalized normal, positive for a right skew and negative for a left skew. The a0 to a3 parameters thus estimate moments 0, 1, 2, and 3. PFChrom offers just about every generalized normal parameterization you could wish. You can even fit the moments directly.
When six peaks are fitted, this means there are six statistically distinguishable clusters. Again, each identifiable and fitted peak may consist of any number of blended densities and an estimated shared asymmetry will reflect any bias. In this data the shared a3 fit to an appreciable right-skew in each of the peaks. This was consistent, diminishing with the magnitude of smoothing. If you look at the peaks in the fits above, this asymmetry is visually apparent in the larger peaks.
Measured Values
Measured Values
Peak Type Amplitude Center FWHM Asym50 FW Base Asym10
1 GenNorm[m] 51.8828785 25.5875634 17.7305349 1.18344213 35.8944807 1.35931130
2 GenNorm[m] 96.4682261 35.6810234 14.1200381 1.18344214 28.5852309 1.35931130
3 GenNorm[m] 49.3808644 43.5418234 9.90082318 1.18344214 20.0436652 1.35931130
4 GenNorm[m] 24.0614635 55.5132065 9.59055428 1.18344213 19.4155431 1.35931129
5 GenNorm[m] 8.18415083 67.5284577 14.5104553 1.18344204 29.3756088 1.35931124
6 GenNorm[m] 3.18513237 83.6084876 12.8161612 1.18344214 25.9456047 1.35931130
Peak Type Area % Area Mean StdDev Skewness Kurtosis
1 GenNorm[m] 984.622842 29.1225427 27.2205247 7.76724171 0.43435072 3.33713955
2 GenNorm[m] 1457.95617 43.1224922 36.9814472 6.18561260 0.43432324 3.33723578
3 GenNorm[m] 523.304294 15.4779587 44.4536671 4.33728715 0.43432418 3.33724508
4 GenNorm[m] 246.996075 7.30549144 56.3964750 4.20136617 0.43432244 3.33722823
5 GenNorm[m] 127.027835 3.75714780 68.8468408 6.31869478 0.39385582 3.16407953
6 GenNorm[m] 41.0573237 1.21436718 83.9955337 4.75019925 -0.0758490 2.46357299
All Total 3380.96454 100.000000
The measured values estimate the FWHM (full-width at half-maximum) and half-height asymmetry, and well as integrated moments. For the generalized normal, where analytic moments are available, you should rely on those for the areas since these analyses are only done for the range of sampled data. For partial peaks, these measured values will only reflect the portion of the peak that was actually sampled. Each of these values for a partial peak will fail to reflect the values for the full peak which extends beyond the range of the data.
Analytic Moments
Analytic Moments
Peak Type FnArea % FnArea FnMean FnStdDev FnSkewness FnKurtosis
1 GenNorm[m] 984.623525 29.0991689 27.2205059 7.76727549 0.43432419 3.33724514
2 GenNorm[m] 1457.95617 43.0878521 36.9814473 6.18561293 0.43432419 3.33724514
3 GenNorm[m] 523.304294 15.4655253 44.4536671 4.33728715 0.43432419 3.33724514
4 GenNorm[m] 246.996076 7.29962300 56.3964751 4.20136660 0.43432419 3.33724514
5 GenNorm[m] 127.109781 3.75655151 68.8648378 6.35664430 0.43432419 3.33724514
6 GenNorm[m] 43.6927874 1.29127912 84.7888273 5.61441914 0.43432419 3.33724514
All Total 3383.68263 100.000000
If analytic moments exist, these will estimate the exact moments for the entire peak. The sum of the FnArea will be the estimated overall deaths.
You can use the sum of the areas in the analytic moments and the sum of the areas in the measured values to estimate how far along things are toward an endpoint. In the case of the China data above, all is nearly finished (3381/3384). In most of the examples for the different countries, an estimate is given which is far from an endpoint.
Advanced Area Analysis
Advanced Area Analysis
Peak Type Area % Area ApexAsym
1 GenNorm[m] 984.622842 29.1225427 1.25669762
2 GenNorm[m] 1457.95617 43.1224922 1.25669569
3 GenNorm[m] 523.304294 15.4779587 1.25669569
4 GenNorm[m] 246.996075 7.30549144 1.25669568
5 GenNorm[m] 127.027835 3.75714780 1.25524072
6 GenNorm[m] 41.0573237 1.21436718 1.12057621
All Total 3380.96454 100.000000
The ApexAsym is the area to the right of the apex of the component to the area to the left. With a shared a3, these values will be a constant if the whole of the peak has been sampled.
Parameter Statistics
Parameter Statistics
Peak 1 GenNorm[m]
Parameter Value Std Error t-value 95% Conf Lo 95% Conf Hi
Area 984.623525 101.529507 9.69790507 782.409986 1186.83706
Mean 27.2205059 0.95746305 28.4298240 25.3135530 29.1274589
Width 7.64891252 0.50347316 15.1922944 6.64615881 8.65166623
Shape 0.14304894 0.02446285 5.84760020 0.09432696 0.19177092
Peak 2 GenNorm[m]
Parameter Value Std Error t-value 95% Conf Lo 95% Conf Hi
Area 1457.95617 61.7191538 23.6224264 1335.03182 1580.88052
Mean 36.9814473 0.56421530 65.5449206 35.8577151 38.1051795
Width 6.09135240 0.25347875 24.0310184 5.58650573 6.59619908
Shape 0.14304894 0.02446285 5.84760020 0.09432696 0.19177092
Peak 3 GenNorm[m]
Parameter Value Std Error t-value 95% Conf Lo 95% Conf Hi
Area 523.304294 79.4870882 6.58351320 364.992038 681.616551
Mean 44.4536671 0.13283299 334.658332 44.1891072 44.7182269
Width 4.27119265 0.07595286 56.2347859 4.11991943 4.42246588
Shape 0.14304894 0.02446285 5.84760020 0.09432696 0.19177092
Peak 4 GenNorm[m]
Parameter Value Std Error t-value 95% Conf Lo 95% Conf Hi
Area 246.996076 15.6774287 15.7548844 215.771770 278.220381
Mean 56.3964751 0.32439963 173.848764 55.7503772 57.0425730
Width 4.13734336 0.09815704 42.1502453 3.94184667 4.33284005
Shape 0.14304894 0.02446285 5.84760020 0.09432696 0.19177092
Peak 5 GenNorm[m]
Parameter Value Std Error t-value 95% Conf Lo 95% Conf Hi
Area 127.109781 12.1909201 10.4265945 102.829459 151.390103
Mean 68.8648378 0.56713681 121.425442 67.7352870 69.9943887
Width 6.25977748 0.46990459 13.3213796 5.32388137 7.19567359
Shape 0.14304894 0.02446285 5.84760020 0.09432696 0.19177092
Peak 6 GenNorm[m]
Parameter Value Std Error t-value 95% Conf Lo 95% Conf Hi
Area 43.6927874 4.60841407 9.48108974 34.5143353 52.8712395
Mean 84.7888273 0.45444831 186.575296 83.8837151 85.6939396
Width 5.52886284 0.29872367 18.5082851 4.93390309 6.12382259
Shape 0.14304894 0.02446285 5.84760020 0.09432696 0.19177092
Parameters which fail the specified confidence level will be grayed. You will want to be certain each of the parameters pass significance. This may mean that you have to use a higher level of smoothing than you would prefer. For post-apex data, the 17-length data usually offered full significance for all parameters once the correct count of peaks was fitted.
Overlap Areas
Overlap Areas
Peak Type Peak 1 Peak 2 Peak 3 Peak 4 Peak 5 Peak 6
1 GenNorm[m] 984.622842 502.889623 116.211691 11.9978066 2.12140979 0.03063219
2 GenNorm[m] 502.889623 1457.95617 440.972222 50.9775550 6.59025892 0.05937452
3 GenNorm[m] 116.211691 440.972222 523.304294 60.1440743 5.38023221 0.01258276
4 GenNorm[m] 11.9978066 50.9775550 60.1440743 246.996075 39.2356240 0.42000080
5 GenNorm[m] 2.12140979 6.59025892 5.38023221 39.2356240 127.027835 14.5886470
6 GenNorm[m] 0.03063219 0.05937452 0.01258276 0.42000080 14.5886470 41.0573237
This table can be used to see how much each of the populations (identified peaks) overlap. The diagonals will report the area the individual peak.
PFChrom Methodology
(1) Copy the cumulative deaths data from the CSV file in Excel. Copy just the one column of cumulative information. Include an initial starting day with a 0 value (day 0 will have 0 deaths). In PFChrom, File | Import Data Set(s) | Import from Clipboard. Use the default starting value of 0 and x-increment of 1. OK, enter a title, OK. Right click the main graph, select Replicate this Data Set a Specified Number of Times, enter 3 and OK. You will now have four identical cumulatives in the main window graphs.
(2) Data | Smooth, select S-G 1st deriv (do not check the Quadratic option). Double click the first graph. Enter the percent width of the data necessary to produce a window of n pts=15. Double click again to restore all graphs. Double click the second graph. Enter the percent of the overall data necessary to generate a window of 17 n pts. Double click once more to return to all graphs. Double click the third graph. Enter the percent of the overall data necessary to generate a window of 19 n pts. Double click to return to all graphs. Double click the fourth graph. Enter the percent of the overall data necessary to generate a window of 21 n pts. Double click once more to return to all graphs. Then OK, OK, OK. You will then be back in the main window with four different smoothed densities derived from this one raw cumulative.
(3) Hidden Peaks - Second Derivative. No baseline. Sm n(1)=2 (data is already smoothed). Statistical family, GenNorm[m] model. Amp% 1.5, D2% 0, vary width a2 checked, shape a3 unchecked. Right click first graph, Copy Numeric Estimates from this Data Set. For all other graphs, right click, Paste Copied Numeric Estimates to this Data Set. You will now have identical starting estimates in all four data sets consisting of the automatic estimates for the first, the least smoothed data set.
(4) Click the Modify Peak Fit Preferences button to the right of the Peak Fit button. Set Maximum Total Iterations to 5000, Converge to Significant Digits to 12, and Merit Must Improve % in 100 Iterations to 0.0. Set the Maximum Iterations per Sequence to 50, and the Iterations/Each Separate Peak Fit to 50. Remove the a2 Constrain Parameter check in the Global Constraints - Sequential Fits. Click OK. This should allow all fits to go to completion without having to restart them as a consequence of constraints which guide the overall fitting.
(5) Click Peak Fit. Select the Fit with Reduced Data Prefit, Cycle Peaks, 2 Pass, Lock Shared Parameters on Pass 1. Be sure the Fit using Sequential Constraints box checked. Click OK. Click Review Fit when the fit is finished. Use the Review's Numeric Summary to see which fits are wholly specified (all of the parameters are significant or non-grayed). The length 15 data sets will possibly be insufficiently smoothed. The length 21 data set will be probably be oversmoothed. You can use the Review's selection option to display only the fits with all parameters significant and the Average Multiple Fits option in the Numeric Summary to produce an average of just those specific fits.
Inferences
When the data represent a finished density, the amplitudes can be used to estimate any expected post-apex peaks. In the China data, the first peak in the post-apex region has an amplitude of 24 as compared to a maximum of 103, about a quarter. If another country saw a post-apex region peak which comes in at a quarter or less of the maximum daily death total, there would be at least a suggestion that a country was matching China's success.
With the data from the fits, you can answer just about any question you wish. For example, in the China data the fourth blue peak occurs at 56.4 days. With an overall apex of 36 days, the peak appears about 20 days post-apex, with a half-height width of 10 days. If a post-apex region peak appears 2-3 weeks after the apex, and the deaths are about a quarter of the maximum per day, and those further drop to about an eighth of that maximum in about five more days, a country would be equaling China's success at this stage in the process.
To see overall area percentiles, use the right click Evaluate Fit Model feature. If the Analytic Moments areas sum to 3383, as they do in in the China data, one can integrate from 0 (in X) to an entered value (in X) to get an area. If one were to enter different values looking for 99% of the 3383 value (3349), 81 days was required for this 99 percentile, of 81-36=45 days post-apex.
The total estimated deaths will be the sum of Analytic Areas. One important estimate is how close one is to the projected endpoint as the the density currently stands. The sum of measured Areas should be very close to the current cumulative deaths. Bear in mind that China recorded three post-apex region peaks, and four peaks past the actual recorded apex in the data.