Statistical Density Modeling - Coronavirus

Statistical Density Modeling - Coronavirus - Analysis Protocol

Data Preparation

To prepare the COVID-19 data for component density modeling in PFChrom, we found that the raw data needed a high degree of specialized smoothing. We apply a three-pass Savitzky-Golay filter to the cumulative data. The first two passes smooth the cumulative counts, and the third pass smooths to a first derivative.
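The three passes can be sketched in a few lines of Python. This is a minimal illustration, not PFChrom's implementation: it assumes a quadratic Savitzky-Golay polynomial (the page does not state the order), unit spacing between days, and it simply drops the points at each end where the full window does not fit. The function names are hypothetical.

```python
def sg_smooth_coeffs(m):
    """Quadratic Savitzky-Golay smoothing weights for a (2m+1)-point window."""
    denom = (2 * m + 3) * (2 * m + 1) * (2 * m - 1)
    return [(3 * (3 * m * m + 3 * m - 1) - 15 * i * i) / denom
            for i in range(-m, m + 1)]


def sg_deriv_coeffs(m):
    """Quadratic Savitzky-Golay first-derivative weights (unit spacing)."""
    denom = m * (m + 1) * (2 * m + 1) / 3  # sum of i^2 over the window
    return [i / denom for i in range(-m, m + 1)]


def convolve_valid(y, weights):
    """Apply the weights at every position where the full window fits."""
    n = len(weights)
    return [sum(w * y[k + j] for j, w in enumerate(weights))
            for k in range(len(y) - n + 1)]


def three_pass_density(cumulative, window=17):
    """Two smoothing passes, then a first-derivative pass (assumed scheme)."""
    m = window // 2
    once = convolve_valid(cumulative, sg_smooth_coeffs(m))
    twice = convolve_valid(once, sg_smooth_coeffs(m))
    return convolve_valid(twice, sg_deriv_coeffs(m))


# Synthetic check: a quadratic cumulative curve t^2 has the density 2t.
cumulative = [t * t for t in range(60)]
density = three_pass_density(cumulative, window=5)
print(len(density), density[0], density[-1])
```

Each pass shortens the series by window-1 points in this sketch. A production filter would normally treat the ends explicitly rather than discard them.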

Fitting the Smoothed Data to Component Densities

Estimating the skew of each cluster or peak in the overall density is not possible because of the high degree of overlap, but a single shared asymmetry can be fitted across all peaks. This forces all peaks in a country's density to have the same shape.

We generated a large variety of theoretical SEIR shapes and fit them to most of the closed-form statistical models. Unless the shape was extreme, only the Pearson IV model was able to fit the noise-free SEIR shapes to an r² of 0.9999 or better. We were not, however, able to fit the Pearson IV model to multiple densities within COVID-19 data with statistical significance in all parameters. Instead we had the best success using the simplest model furnishing this third moment adjustment, an asymmetric generalized normal. This is a model that uses a logarithmic transform of the x variable to generate a left or right skewness in the peak.
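To make the logarithmic transform concrete, here is a hedged sketch of a mean-parameterized generalized normal. The functional form is an assumption: it is the common three-parameter lognormal form of the generalized normal, written so that a positive a3 gives a right skew, and it is not copied from the PFChrom function reference. The function name is hypothetical. Under this assumption the density does reproduce the apex tabulated later on this page for peak 2 of the 17-point China fit (amplitude 96.468 at a center of 35.681).

```python
import math


def gen_norm_mean(x, a0, a1, a2, a3):
    """Assumed generalized normal: area a0, mean a1, underlying-Gaussian
    standard deviation a2, asymmetry a3 (must be nonzero)."""
    # Offset between the mean and the point where the transformed variable is 0.
    shift = (a2 / a3) * (math.exp(a3 * a3 / 2.0) - 1.0)
    centre = a1 - shift
    arg = 1.0 + a3 * (x - centre) / a2
    if arg <= 0.0:
        return 0.0  # outside the support of the density
    z = math.log(arg) / a3  # the logarithmic transform of x
    return a0 * math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * a2 * arg)


# Peak 2 of the 17-point China fit, evaluated at its measured center.
peak2 = (1457.95617, 36.9814473, 6.09135240, 0.14304894)
apex = gen_norm_mean(35.6810234, *peak2)
print(apex)
```

As a3 approaches zero the transformed variable tends to (x - a1)/a2 and the density tends to an ordinary normal, which is why a very small locked a3 can stand in for a normal fit.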

In the China data, where there are clearly multiple densities, the Pearson IV did not offer better fits than the generalized normal, despite its capability of approximating SEIR shapes.

In the procedure used for these data, PFChrom locates the local maxima and hidden peaks by generating a smooth second derivative of the density data, as smoothed above. A hidden peak appears as a local minimum in the second derivative even when it produces no local maximum in the density itself. With some experience, it becomes rather easy to spot the hidden peaks in a data set. This second derivative method is often successful, although there are instances where peaks must be added manually.
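The idea behind second derivative peak detection can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not PFChrom's detector: it uses a plain second difference on noise-free data (real data would need the smoothed derivative described above), a hypothetical 1% relative threshold, and two overlapping normal peaks where the smaller one forms only a shoulder.

```python
import math


def second_difference(y):
    """Unscaled second difference; entry j belongs to y[j + 1]."""
    return [y[i - 1] - 2.0 * y[i] + y[i + 1] for i in range(1, len(y) - 1)]


def local_maxima(y):
    """Indices of strict local maxima in the data itself."""
    return [i for i in range(1, len(y) - 1) if y[i - 1] < y[i] > y[i + 1]]


def peaks_from_second_derivative(y, rel_threshold=0.01):
    """Indices of negative local minima in the second difference."""
    d2 = second_difference(y)
    floor = -rel_threshold * max(abs(v) for v in d2)
    return [j + 1 for j in range(1, len(d2) - 1)
            if d2[j] < floor and d2[j - 1] > d2[j] < d2[j + 1]]


# A main peak at 0 and a smaller shoulder peak at 2.2 (both unit width).
x = [-4.0 + 0.1 * i for i in range(111)]
y = [math.exp(-0.5 * t * t) + 0.3 * math.exp(-0.5 * (t - 2.2) ** 2) for t in x]

visible = [x[i] for i in local_maxima(y)]
found = [x[i] for i in peaks_from_second_derivative(y)]
print("local maxima:", visible)
print("second derivative minima:", found)
```

The shoulder produces no local maximum, but it does produce a second minimum in the second derivative. Note that the detected position is pulled away from the true center of the hidden peak by the overlap, which is one reason the detected positions serve only as starting estimates for the fit.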

As we look at fits of six generalized normal peaks to the China data at four different smoothing levels, note the third (green) peak, which shifts inside the yellow one as the smoothing increases. In the China data, the reporting changed at about the time of the apex, and the bimodality is apparent at the first smoothing level. While we would prefer the 15-point data, we generally had to use the 17-point data in order to generate fits with no more than 20 ppm unaccounted variance. It was also the case that the 15-point fits, as a consequence of less smoothing and weaker fits, often failed to fit one or more of the parameters to 95% statistical significance. In this example, we can't be fully sure whether the yellow and green peaks are real or an artifact of the revision in how the data were reported. For China, the only country with a high death rate and a full cycle of resolution, it seemed best to use the 17-point window. The yellow and green peaks remain slightly separate, the fit comes in under the 20 ppm error (18.2 ppm, against 38.9 ppm for the 15-point fit), and every fitted parameter is significant.
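The 20 ppm criterion can be checked directly from r². The sketch below assumes that the ppm unaccounted variance reported in the fit summaries is (1 - r²) expressed in parts per million, an assumption that agrees with the tabulated values on this page to the precision of the printed r².

```python
def ppm_unaccounted_variance(r_squared):
    """Unexplained fraction of the variance, in parts per million (assumed)."""
    return (1.0 - r_squared) * 1.0e6


# r² values of the four China fits, keyed by smoothing window length.
fits = {15: 0.99996112, 17: 0.99998185, 19: 0.99999078, 21: 0.99999098}
for window, r2 in fits.items():
    ppm = ppm_unaccounted_variance(r2)
    verdict = "within 20 ppm" if ppm <= 20.0 else "exceeds 20 ppm"
    print(window, round(ppm, 2), verdict)
```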

Fitting Normals

If the shared a₃ asymmetry in a generalized normal fit failed significance (could not be determined to be non-zero), the a₃ was locked at 1E-6 (an effective zero, since the generalized normal has a singularity at a₃=0) and normals were fitted. If a fit produced a statistically significant asymmetry for the peaks, normals were not fitted.
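The decision rule reads naturally as a small function. This is a sketch under stated assumptions: significance is judged by a two-sided t-test of a₃ against zero, the critical value of about 1.99 is inferred from the width of the 95% confidence limits in the parameter statistics on this page, and the standard errors in the usage lines are hypothetical, since the page does not list one for a₃.

```python
def choose_peak_model(a3, a3_std_error, t_critical=1.99):
    """Return (model, a3 to use). t_critical is an assumed 95% two-sided value."""
    t_value = a3 / a3_std_error
    if abs(t_value) > t_critical:
        return "generalized normal", a3
    # Not distinguishable from zero: lock a3 at an effective zero.
    return "normal (a3 locked)", 1e-6


# Hypothetical standard errors, for illustration only.
print(choose_peak_model(0.143, 0.02))
print(choose_peak_model(0.03, 0.05))
```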

Interpreting Parameter Values

"China COVID-19 Deaths 1/10-4/13 SG-D1-15"

Fitted Parameters

r² Coef Det   DF Adj r²    Fit Std Err   F-value   ppm uVar
0.99996112    0.99995127   0.29067528    108,599   38.8773735

Peak  Type        a0          a1          a2          a3
1     GenNorm[m]  1212.88797  28.7141999  8.20045436  0.14021384
2     GenNorm[m]  1103.63283  36.3993629  5.20358287  0.14021384
3     GenNorm[m]  613.786433  44.1422965  3.86816664  0.14021384
4     GenNorm[m]  252.204970  55.4436691  3.88850774  0.14021384
5     GenNorm[m]  163.311673  67.6108332  7.28514474  0.14021384
6     GenNorm[m]  37.2405316  85.0297173  4.96768995  0.14021384

"China COVID-19 Deaths 1/10-4/13 SG-D1-17"

Fitted Parameters

r² Coef Det   DF Adj r²    Fit Std Err   F-value   ppm uVar
0.99998185    0.99997725   0.19823722    232,566   18.1545461

Peak  Type        a0          a1          a2          a3
1     GenNorm[m]  984.623525  27.2205059  7.64891252  0.14304894
2     GenNorm[m]  1457.95617  36.9814473  6.09135240  0.14304894
3     GenNorm[m]  523.304294  44.4536671  4.27119265  0.14304894
4     GenNorm[m]  246.996076  56.3964751  4.13734336  0.14304894
5     GenNorm[m]  127.109781  68.8648378  6.25977748  0.14304894
6     GenNorm[m]  43.6927874  84.7888273  5.52886284  0.14304894

"China COVID-19 Deaths 1/10-4/13 SG-D1-19"

Fitted Parameters

r² Coef Det   DF Adj r²    Fit Std Err   F-value   ppm uVar
0.99999078    0.99998845   0.14100014    458,103   9.21665497

Peak  Type        a0          a1          a2          a3
1     GenNorm[m]  622.566703  24.2420345  6.48785455  0.12406784
2     GenNorm[m]  2007.89322  37.0502339  7.13599414  0.12406784
3     GenNorm[m]  373.330184  44.3870080  4.53260676  0.12406784
4     GenNorm[m]  233.033247  57.3043002  4.51396101  0.12406784
5     GenNorm[m]  95.3151223  69.5862089  5.34954858  0.12406784
6     GenNorm[m]  52.4394777  84.0799064  6.16389235  0.12406784

"China COVID-19 Deaths 1/10-4/13 SG-D1-21"

Fitted Parameters

r² Coef Det   DF Adj r²    Fit Std Err   F-value   ppm uVar
0.99999098    0.99998870   0.13921254    468,222   9.01748019

Peak  Type        a0          a1          a2          a3
1     GenNorm[m]  343.810792  21.5325174  5.58867669  0.09737871
2     GenNorm[m]  2427.98735  36.6557543  8.19289212  0.09737871
3     GenNorm[m]  275.992561  43.7190769  4.88761593  0.09737871
4     GenNorm[m]  203.702870  58.1623187  4.79586971  0.09737871
5     GenNorm[m]  69.2937023  69.5872201  4.81245958  0.09737871
6     GenNorm[m]  66.3130297  82.7669562  7.37179214  0.09737871

The parameters above are from the four fits at the 15-, 17-, 19-, and 21-point smoothing levels. The a₀ area parameter is the number of deaths in the cluster or population that the peak represents. The a₁ parameter is the location in time, with t=0 as the origin and t=1 the first day on which a non-zero death count was reported. The generalized normal used here is parameterized so that a₁ is the mean of the asymmetric density, not the mean of the underlying or deconvolved Gaussian. The a₂ is the standard deviation of the underlying Gaussian. The a₃ is the statistical asymmetry in the generalized normal, positive for a right skew and negative for a left skew. The a₀ to a₃ parameters thus estimate moments 0, 1, 2, and 3. PFChrom offers just about every generalized normal parameterization you could wish for. You can even fit the moments directly.

When six peaks are fitted, this means there are six statistically distinguishable clusters. Again, each identifiable and fitted peak may consist of any number of blended densities, and an estimated shared asymmetry will reflect any bias. In these data the shared a₃ fit to an appreciable right skew in each of the peaks. This was consistent across the smoothing levels, falling from about 0.14 at the 15- and 17-point windows to 0.097 at the 21-point window. If you look at the peaks in the fits above, this asymmetry is visually apparent in the larger peaks.

Measured Values

Peak  Type        Amplitude   Center      FWHM        Asym50      FW Base     Asym10
1     GenNorm[m]  51.8828785  25.5875634  17.7305349  1.18344213  35.8944807  1.35931130
2     GenNorm[m]  96.4682261  35.6810234  14.1200381  1.18344214  28.5852309  1.35931130
3     GenNorm[m]  49.3808644  43.5418234  9.90082318  1.18344214  20.0436652  1.35931130
4     GenNorm[m]  24.0614635  55.5132065  9.59055428  1.18344213  19.4155431  1.35931129
5     GenNorm[m]  8.18415083  67.5284577  14.5104553  1.18344204  29.3756088  1.35931124
6     GenNorm[m]  3.18513237  83.6084876  12.8161612  1.18344214  25.9456047  1.35931130

Peak  Type        Area        % Area      Mean        StdDev      Skewness    Kurtosis
1     GenNorm[m]  984.622842  29.1225427  27.2205247  7.76724171  0.43435072  3.33713955
2     GenNorm[m]  1457.95617  43.1224922  36.9814472  6.18561260  0.43432324  3.33723578
3     GenNorm[m]  523.304294  15.4779587  44.4536671  4.33728715  0.43432418  3.33724508
4     GenNorm[m]  246.996075  7.30549144  56.3964750  4.20136617  0.43432244  3.33722823
5     GenNorm[m]  127.027835  3.75714780  68.8468408  6.31869478  0.39385582  3.16407953
6     GenNorm[m]  41.0573237  1.21436718  83.9955337  4.75019925  -0.0758490  2.46357299
All   Total       3380.96454  100.000000

The measured values estimate the FWHM (full width at half maximum) and half-height asymmetry, as well as integrated moments. These measurements are made numerically and only over the range of the sampled data, so for the generalized normal, where analytic moments are available, you should rely on the analytic values for the areas. For a partial peak, one that extends beyond the range of the data, the measured values reflect only the portion that was actually sampled and will not match the values for the full peak.
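For a fully sampled peak, the measured width and asymmetry can be cross-checked against closed-form values. The sketch below assumes the three-parameter lognormal form of the generalized normal (an assumption, not a statement of PFChrom's internals), for which the ratio of right to left half-width at a given fraction of the apex height depends only on a₃. That is consistent with the Asym50 and Asym10 columns being the same for every fully sampled peak in a shared-a₃ fit. The function name is hypothetical.

```python
import math


def width_and_asymmetry(a2, a3, fraction=0.5):
    """Full width and right/left half-width ratio at `fraction` of apex height,
    for the assumed lognormal-form generalized normal (a3 nonzero)."""
    k = a3 * math.sqrt(2.0 * math.log(1.0 / fraction))
    full_width = 2.0 * (a2 / a3) * math.exp(-a3 * a3) * math.sinh(k)
    return full_width, math.exp(k)


# Peak 2 of the 17-point China fit: a2 = 6.09135240, shared a3 = 0.14304894.
fwhm, asym50 = width_and_asymmetry(6.09135240, 0.14304894)
_, asym10 = width_and_asymmetry(6.09135240, 0.14304894, fraction=0.1)
print(fwhm, asym50, asym10)
```

The FW Base column is not reproduced here, since the page does not say how the base width is defined.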

Analytic Moments

Peak  Type        FnArea      % FnArea    FnMean      FnStdDev    FnSkewness  FnKurtosis
1     GenNorm[m]  984.623525  29.0991689  27.2205059  7.76727549  0.43432419  3.33724514
2     GenNorm[m]  1457.95617  43.0878521  36.9814473  6.18561293  0.43432419  3.33724514
3     GenNorm[m]  523.304294  15.4655253  44.4536671  4.33728715  0.43432419  3.33724514
4     GenNorm[m]  246.996076  7.29962300  56.3964751  4.20136660  0.43432419  3.33724514
5     GenNorm[m]  127.109781  3.75655151  68.8648378  6.35664430  0.43432419  3.33724514
6     GenNorm[m]  43.6927874  1.29127912  84.7888273  5.61441914  0.43432419  3.33724514
All   Total       3383.68263  100.000000

If analytic moments exist, these will estimate the exact moments for the entire peak. The sum of the FnArea will be the estimated overall deaths.
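As an illustration of where such analytic moments come from, the sketch below uses the standard lognormal moment formulas, assuming the generalized normal is the three-parameter lognormal form with a₃ acting as the log-scale standard deviation. This is an assumption about the model form, but it reproduces the FnStdDev, FnSkewness, and FnKurtosis columns above for the 17-point fit. The function name is hypothetical.

```python
import math


def analytic_moments(a1, a2, a3):
    """Mean, standard deviation, skewness, and kurtosis of the assumed
    lognormal-form generalized normal (a3 nonzero)."""
    w = math.exp(a3 * a3)
    std_dev = (a2 / abs(a3)) * math.sqrt(w * (w - 1.0))
    skewness = math.copysign((w + 2.0) * math.sqrt(w - 1.0), a3)
    kurtosis = w**4 + 2.0 * w**3 + 3.0 * w**2 - 3.0
    return a1, std_dev, skewness, kurtosis


# Peak 1 of the 17-point China fit.
mean, std_dev, skewness, kurtosis = analytic_moments(27.2205059, 7.64891252, 0.14304894)
print(mean, std_dev, skewness, kurtosis)
```

Skewness and kurtosis depend only on a₃ in this form, which is why every peak in a shared-a₃ fit lists the same FnSkewness and FnKurtosis.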

You can compare the sum of the areas in the measured values with the sum of the areas in the analytic moments to estimate how far along things are toward an endpoint. In the case of the China data above, all is nearly finished (3381 of an estimated 3384 deaths). In most of the examples for the different countries, the estimate is still far from an endpoint.
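The comparison is a simple ratio. The sketch below uses the totals from the measured values and analytic moments of the 17-point China fit, plus the sixth peak as an example of a component that is not yet fully sampled. The function name is hypothetical.

```python
def fraction_complete(measured_area, analytic_area):
    """Share of the estimated total that lies inside the sampled range."""
    return measured_area / analytic_area


overall = fraction_complete(3380.96454, 3383.68263)   # All peaks
peak6 = fraction_complete(41.0573237, 43.6927874)     # Peak 6 only
print(round(100.0 * overall, 2), round(100.0 * peak6, 2))
```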

Advanced Area Analysis

Peak  Type        Area        % Area      ApexAsym
1     GenNorm[m]  984.622842  29.1225427  1.25669762
2     GenNorm[m]  1457.95617  43.1224922  1.25669569
3     GenNorm[m]  523.304294  15.4779587  1.25669569
4     GenNorm[m]  246.996075  7.30549144  1.25669568
5     GenNorm[m]  127.027835  3.75714780  1.25524072
6     GenNorm[m]  41.0573237  1.21436718  1.12057621
All   Total       3380.96454  100.000000

The ApexAsym is the ratio of the area to the right of the component's apex to the area to the left. With a shared a₃, these values will be constant across peaks if the whole of each peak has been sampled.
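Why the ratio is constant can be seen from a short calculation. Assuming the lognormal form of the generalized normal (an assumption about the model, not a quotation of PFChrom's formula), the apex sits at a standard normal deviate of -a₃ in the transformed variable, so the area to the right of the apex is Φ(a₃) and the ratio depends on a₃ alone. The function name is hypothetical.

```python
import math


def apex_asymmetry(a3):
    """Area right of the apex divided by area left of it (assumed model form)."""
    right = 0.5 * (1.0 + math.erf(a3 / math.sqrt(2.0)))  # standard normal CDF at a3
    return right / (1.0 - right)


ratio = apex_asymmetry(0.14304894)  # shared a3 of the 17-point China fit
print(ratio)
```

The value agrees with the ApexAsym listed for the fully sampled peaks. Peaks 5 and 6 show lower measured values, as expected for components whose right tails extend beyond the sampled range.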

Parameter Statistics

Peak 1 GenNorm[m]

Parameter  Value       Std Error   t-value     95% Conf Lo  95% Conf Hi
Area       984.623525  101.529507  9.69790507  782.409986   1186.83706
Mean       27.2205059  0.95746305  28.4298240  25.3135530   29.1274589
Width      7.64891252  0.50347316  15.1922944  6.64615881   8.65166623