PFChrom v5 Documentation Contents            AIST Software Home            AIST Software Support

Statistical Density Modeling - Coronavisus - Analysis Protocol


Data Preparation

In order to prepare the COVID-19 data for a component density modeling using PFChrom, we found that a high degree of specialized smoothing was needed in the raw data. To do this we use a three-step Savitzky-Golay filter method on the cumulative data, and smooth to a first derivative on the third pass.

Fitting the Smoothed Data to Component Densities

Estimating the skew of each cluster or peak in the overall density won't be possible due to the high measure of overlap, but a single shared asymmetry can be fitted across all peaks. This forces all peaks in a country's density to have the same shape.

We generated a large variety of theoretical SEIR shapes and fit this data to most of the closed-form statistical models. Unless the shape was extreme, only the Pearson IV model was able to fit the noise-free SEIR shapes to an r2 of 0.9999 or better, but we were not able to successfully fit the Pearson IV model to multiple densities within COVID-19 data to statistical significance in all parameters. Instead we had the best success using the simplest model furnishing this third moment adjustment, an asymmetric generalized normal. This is a model that uses a logarithmic transform of the x variable to generate a left or right skewness to the peak.

In the China data, where there are clearly multiple densities, the Pearson IV did not offer better fits than the generalized normal, despite its capability of approximating SEIR shapes.

In the procedure used for these data, PFChrom locates the local maximum and hidden peaks by generating a smooth second derivative of the density data, as smoothed above. With some experience, it actually becomes rather easy to spot the hidden peaks in a data set. This second derivative method for finding hidden peaks is often successful, although there are instances where peaks must be manually added.

Corona_China_9.png

As we look at fits of six generalized normal peaks to the China data at four different smoothing levels, note the third (green) peak which shifts inside the yellow one as the smoothing increases. In the China data, the reporting changed at the timing of the apex, and the bimodality is apparent at the first smoothing level. While we would prefer the 15-point data, we generally had to use the 17-point data in order to generate fits with no less than 20 ppm unaccounted variance. It was also the case that the 15-point fits, as a consequence of less smoothing and weaker fits, often failed to fit one or more of the parameters to 95% statistical significance. In this example, we can't be fully sure sure the yellow and green peaks are real or an artifact of the revision in how the data were reported. For the China data, the only country with a high death-rate and a full cycle of resolution, it seemed best to use the 17-point window. The yellow and green peaks remain slightly separate, and the fit comes in under the 20 ppm error, every fitted parameter significant.

Fitting Normals

If the shared a3 asymmetry in a generalized normal fit failed significance (could not be determined to be non-zero), the a3 would be locked at 1E-6 (an effective zero, the generalized normal has a singularity at an a3=0.0), and normals would be fitted. If a fit produced a statistically significant asymmetry for the peaks, normals were not fitted.

Interpreting Parameter Values

 

"China COVID-19 Deaths 1/10-4/13 SG-D1-15"

Fitted Parameters

r2 Coef Det      DF Adj r2      Fit Std Err      F-value      ppm uVar

0.99996112      0.99995127      0.29067528      108,599      38.8773735

 Peak      Type       a0       a1       a2       a3

    1      GenNorm[m]      1212.88797      28.7141999      8.20045436      0.14021384

    2      GenNorm[m]      1103.63283      36.3993629      5.20358287      0.14021384

    3      GenNorm[m]      613.786433      44.1422965      3.86816664      0.14021384

    4      GenNorm[m]      252.204970      55.4436691      3.88850774      0.14021384

    5      GenNorm[m]      163.311673      67.6108332      7.28514474      0.14021384

    6      GenNorm[m]      37.2405316      85.0297173      4.96768995      0.14021384

 

"China COVID-19 Deaths 1/10-4/13 SG-D1-17"

Fitted Parameters

r2 Coef Det      DF Adj r2      Fit Std Err      F-value      ppm uVar

0.99998185      0.99997725      0.19823722      232,566      18.1545461

 Peak      Type       a0       a1       a2       a3

    1      GenNorm[m]      984.623525      27.2205059      7.64891252      0.14304894

    2      GenNorm[m]      1457.95617      36.9814473      6.09135240      0.14304894

    3      GenNorm[m]      523.304294      44.4536671      4.27119265      0.14304894

    4      GenNorm[m]      246.996076      56.3964751      4.13734336      0.14304894

    5      GenNorm[m]      127.109781      68.8648378      6.25977748      0.14304894

    6      GenNorm[m]      43.6927874      84.7888273      5.52886284      0.14304894

 

"China COVID-19 Deaths 1/10-4/13 SG-D1-19"

Fitted Parameters

r2 Coef Det      DF Adj r2      Fit Std Err      F-value      ppm uVar

0.99999078      0.99998845      0.14100014      458,103      9.21665497

 Peak      Type       a0       a1       a2       a3

    1      GenNorm[m]      622.566703      24.2420345      6.48785455      0.12406784

    2      GenNorm[m]      2007.89322      37.0502339      7.13599414      0.12406784

    3      GenNorm[m]      373.330184      44.3870080      4.53260676      0.12406784

    4      GenNorm[m]      233.033247      57.3043002      4.51396101      0.12406784

    5      GenNorm[m]      95.3151223      69.5862089      5.34954858      0.12406784

    6      GenNorm[m]      52.4394777      84.0799064      6.16389235      0.12406784

 

"China COVID-19 Deaths 1/10-4/13 SG-D1-21"

Fitted Parameters

r2 Coef Det      DF Adj r2      Fit Std Err      F-value      ppm uVar

0.99999098      0.99998870      0.13921254      468,222      9.01748019

 Peak      Type       a0       a1       a2       a3

    1      GenNorm[m]      343.810792      21.5325174      5.58867669      0.09737871

    2      GenNorm[m]      2427.98735      36.6557543      8.19289212      0.09737871

    3      GenNorm[m]      275.992561      43.7190769      4.88761593      0.09737871

    4      GenNorm[m]      203.702870      58.1623187      4.79586971      0.09737871

    5      GenNorm[m]      69.2937023      69.5872201      4.81245958      0.09737871

    6      GenNorm[m]      66.3130297      82.7669562      7.37179214      0.09737871

 

The above parameters are from the four fits above. The a0 area parameter will consist of the number of deaths in that cluster or population this peak represents. The a1 parameter is the location in time, t0=0, t1=first non-zero day where a death was reported. The generalized normal used is one parameterized to where the a1 parameter is the mean of the asymmetric density, not the mean of the underlying or deconvolved Gaussian. The a2 is the standard deviation of the underlying Gaussian. The a3 is the statistical asymmetry in the generalized normal, positive for a right skew and negative for a left skew. The a0 to a3 parameters thus estimate moments 0, 1, 2, and 3. PFChrom offers just about every generalized normal parameterization you could wish. You can even fit the moments directly.

When six peaks are fitted, this means there are six statistically distinguishable clusters. Again, each identifiable and fitted peak may consist of any number of blended densities and an estimated shared asymmetry will reflect any bias. In this data the shared a3 fit to an appreciable right-skew in each of the peaks. This was consistent, diminishing with the magnitude of smoothing. If you look at the peaks in the fits above, this asymmetry is visually apparent in the larger peaks.

Measured Values

 

Measured Values

 Peak      Type       Amplitude       Center       FWHM       Asym50       FW Base       Asym10

    1      GenNorm[m]      51.8828785      25.5875634      17.7305349      1.18344213      35.8944807      1.35931130

    2      GenNorm[m]      96.4682261      35.6810234      14.1200381      1.18344214      28.5852309      1.35931130

    3      GenNorm[m]      49.3808644      43.5418234      9.90082318      1.18344214      20.0436652      1.35931130

    4      GenNorm[m]      24.0614635      55.5132065      9.59055428      1.18344213      19.4155431      1.35931129

    5      GenNorm[m]      8.18415083      67.5284577      14.5104553      1.18344204      29.3756088      1.35931124

    6      GenNorm[m]      3.18513237      83.6084876      12.8161612      1.18344214      25.9456047      1.35931130

 

 Peak      Type       Area       % Area       Mean       StdDev       Skewness       Kurtosis

    1      GenNorm[m]      984.622842      29.1225427      27.2205247      7.76724171      0.43435072      3.33713955

    2      GenNorm[m]      1457.95617      43.1224922      36.9814472      6.18561260      0.43432324      3.33723578

    3      GenNorm[m]      523.304294      15.4779587      44.4536671      4.33728715      0.43432418      3.33724508

    4      GenNorm[m]      246.996075      7.30549144      56.3964750      4.20136617      0.43432244      3.33722823

    5      GenNorm[m]      127.027835      3.75714780      68.8468408      6.31869478      0.39385582      3.16407953

    6      GenNorm[m]      41.0573237      1.21436718      83.9955337      4.75019925      -0.0758490      2.46357299

  All      Total      3380.96454      100.000000

 

The measured values estimate the FWHM (full-width at half-maximum) and half-height asymmetry, and well as integrated moments. For the generalized normal, where analytic moments are available, you should rely on those for the areas since these analyses are only done for the range of sampled data. For partial peaks, these measured values will only reflect the portion of the peak that was actually sampled. Each of these values for a partial peak will fail to reflect the values for the full peak which extends beyond the range of the data.

Analytic Moments

 

Analytic Moments

 Peak      Type       FnArea       % FnArea       FnMean       FnStdDev      FnSkewness      FnKurtosis

    1      GenNorm[m]      984.623525      29.0991689      27.2205059      7.76727549      0.43432419      3.33724514

    2      GenNorm[m]      1457.95617      43.0878521      36.9814473      6.18561293      0.43432419      3.33724514

    3      GenNorm[m]      523.304294      15.4655253      44.4536671      4.33728715      0.43432419      3.33724514

    4      GenNorm[m]      246.996076      7.29962300      56.3964751      4.20136660      0.43432419      3.33724514

    5      GenNorm[m]      127.109781      3.75655151      68.8648378      6.35664430      0.43432419      3.33724514

    6      GenNorm[m]      43.6927874      1.29127912      84.7888273      5.61441914      0.43432419      3.33724514

  All      Total      3383.68263      100.000000

 

If analytic moments exist, these will estimate the exact moments for the entire peak. The sum of the FnArea will be the estimated overall deaths.

You can use the sum of the areas in the analytic moments and the sum of the areas in the measured values to estimate how far along things are toward an endpoint. In the case of the China data above, all is nearly finished (3381/3384). In most of the examples for the different countries, an estimate is given which is far from an endpoint.

Advanced Area Analysis

 

Advanced Area Analysis

 Peak      Type       Area        % Area       ApexAsym

    1      GenNorm[m]      984.622842      29.1225427      1.25669762

    2      GenNorm[m]      1457.95617      43.1224922      1.25669569

    3      GenNorm[m]      523.304294      15.4779587      1.25669569

    4      GenNorm[m]      246.996075      7.30549144      1.25669568

    5      GenNorm[m]      127.027835      3.75714780      1.25524072

    6      GenNorm[m]      41.0573237      1.21436718      1.12057621

  All      Total      3380.96454      100.000000       

 

The ApexAsym is the area to the right of the apex of the component to the area to the left. With a shared a3, these values will be a constant if the whole of the peak has been sampled.

Parameter Statistics

 

Parameter Statistics

Peak 1 GenNorm[m]

 Parameter      Value      Std Error      t-value      95% Conf Lo      95% Conf Hi

         Area      984.623525      101.529507      9.69790507      782.409986      1186.83706

         Mean      27.2205059      0.95746305      28.4298240      25.3135530      29.1274589

        Width      7.64891252      0.50347316      15.1922944      6.64615881      8.65166623

        Shape      0.14304894      0.02446285      5.84760020      0.09432696      0.19177092

 

Peak 2 GenNorm[m]

 Parameter      Value      Std Error      t-value      95% Conf Lo      95% Conf Hi

         Area      1457.95617      61.7191538      23.6224264      1335.03182      1580.88052

         Mean      36.9814473      0.56421530      65.5449206      35.8577151      38.1051795

        Width      6.09135240      0.25347875      24.0310184      5.58650573      6.59619908

        Shape      0.14304894      0.02446285      5.84760020      0.09432696      0.19177092

 

Peak 3 GenNorm[m]

 Parameter      Value      Std Error      t-value      95% Conf Lo      95% Conf Hi

         Area      523.304294      79.4870882      6.58351320      364.992038      681.616551

         Mean      44.4536671      0.13283299      334.658332      44.1891072      44.7182269

        Width      4.27119265      0.07595286      56.2347859      4.11991943      4.42246588

        Shape      0.14304894      0.02446285      5.84760020      0.09432696      0.19177092

 

Peak 4 GenNorm[m]

 Parameter      Value      Std Error      t-value      95% Conf Lo      95% Conf Hi

         Area      246.996076      15.6774287      15.7548844      215.771770      278.220381

         Mean      56.3964751      0.32439963      173.848764      55.7503772      57.0425730

        Width      4.13734336      0.09815704      42.1502453      3.94184667      4.33284005

        Shape      0.14304894      0.02446285      5.84760020      0.09432696      0.19177092

 

Peak 5 GenNorm[m]

 Parameter      Value      Std Error      t-value      95% Conf Lo      95% Conf Hi

         Area      127.109781      12.1909201      10.4265945      102.829459      151.390103

         Mean      68.8648378      0.56713681      121.425442      67.7352870      69.9943887

        Width      6.25977748      0.46990459      13.3213796      5.32388137      7.19567359

        Shape      0.14304894      0.02446285      5.84760020      0.09432696      0.19177092

 

Peak 6 GenNorm[m]

 Parameter      Value      Std Error      t-value      95% Conf Lo      95% Conf Hi

         Area      43.6927874      4.60841407      9.48108974      34.5143353      52.8712395

         Mean      84.7888273      0.45444831      186.575296      83.8837151      85.6939396

        Width      5.52886284      0.29872367      18.5082851      4.93390309      6.12382259

        Shape      0.14304894      0.02446285      5.84760020      0.09432696      0.19177092

 

Parameters which fail the specified confidence level will be grayed. You will want to be certain each of the parameters pass significance. This may mean that you have to use a higher level of smoothing than you would prefer. For post-apex data, the 17-length data usually offered full significance for all parameters once the correct count of peaks was fitted.

Overlap Areas

 

Overlap Areas

 Peak      Type       Peak 1       Peak 2       Peak 3       Peak 4       Peak 5       Peak 6

    1      GenNorm[m]      984.622842      502.889623      116.211691      11.9978066      2.12140979      0.03063219

    2      GenNorm[m]      502.889623      1457.95617      440.972222      50.9775550      6.59025892      0.05937452

    3      GenNorm[m]      116.211691      440.972222      523.304294      60.1440743      5.38023221      0.01258276

    4      GenNorm[m]      11.9978066      50.9775550      60.1440743      246.996075      39.2356240      0.42000080

    5      GenNorm[m]      2.12140979      6.59025892      5.38023221      39.2356240      127.027835      14.5886470

    6      GenNorm[m]      0.03063219      0.05937452      0.01258276      0.42000080      14.5886470      41.0573237

 

This table can be used to see how much each of the populations (identified peaks) overlap. The diagonals will report the area the individual peak.

PFChrom Methodology

(1) Copy the cumulative deaths data from the CSV file in Excel. Copy just the one column of cumulative information. Include an initial starting day with a 0 value (day 0 will have 0 deaths). In PFChrom, File | Import Data Set(s) | Import from Clipboard. Use the default starting value of 0 and x-increment of 1. OK, enter a title, OK. Right click the main graph, select Replicate this Data Set a Specified Number of Times, enter 3 and OK. You will now have four identical cumulatives in the main window graphs.

(2) Data | Smooth, select S-G 1st deriv (do not check the Quadratic option). Double click the first graph. Enter the percent width of the data necessary to produce a window of n pts=15. Double click again to restore all graphs. Double click the second graph. Enter the percent of the overall data necessary to generate a window of 17 n pts. Double click once more to return to all graphs. Double click the third graph. Enter the percent of the overall data necessary to generate a window of 19 n pts. Double click to return to all graphs. Double click the fourth graph. Enter the percent of the overall data necessary to generate a window of 21 n pts. Double click once more to return to all graphs. Then OK, OK, OK. You will then be back in the main window with four different smoothed densities derived from this one raw cumulative.

(3) Hidden Peaks - Second Derivative. No baseline. Sm n(1)=2 (data is already smoothed). Statistical family, GenNorm[m] model. Amp% 1.5, D2% 0, vary width a2 checked, shape a3 unchecked. Right click first graph, Copy Numeric Estimates from this Data Set. For all other graphs, right click, Paste Copied Numeric Estimates to this Data Set. You will now have identical starting estimates in all four data sets consisting of the automatic estimates for the first, the least smoothed data set.

(4) Click the Modify Peak Fit Preferences button to the right of the Peak Fit button. Set Maximum Total Iterations to 5000, Converge to Significant Digits to 12, and Merit Must Improve % in 100 Iterations to 0.0. Set the Maximum Iterations per Sequence to 50, and the Iterations/Each Separate Peak Fit to 50. Remove the a2 Constrain Parameter check in the Global Constraints - Sequential Fits. Click OK. This should allow all fits to go to completion without having to restart them as a consequence of constraints which guide the overall fitting.

(5) Click Peak Fit. Select the Fit with Reduced Data Prefit, Cycle Peaks, 2 Pass, Lock Shared Parameters on Pass 1. Be sure the Fit using Sequential Constraints box checked. Click OK. Click Review Fit when the fit is finished. Use the Review's Numeric Summary to see which fits are wholly specified (all of the parameters are significant or non-grayed). The length 15 data sets will possibly be insufficiently smoothed. The length 21 data set will be probably be oversmoothed. You can use the Review's selection option to display only the fits with all parameters significant and the Average Multiple Fits option in the Numeric Summary to produce an average of just those specific fits.

Inferences

When the data represent a finished density, the amplitudes can be used to estimate any expected post-apex peaks. In the China data, the first peak in the post-apex region has an amplitude of 24 as compared to a maximum of 103, about a quarter. If another country saw a post-apex region peak which comes in at a quarter or less of the maximum daily death total, there would be at least a suggestion that a country was matching China's success.

With the data from the fits, you can answer just about any question you wish. For example, in the China data the fourth blue peak occurs at 56.4 days. With an overall apex of 36 days, the peak appears about 20 days post-apex, with a half-height width of 10 days. If a post-apex region peak appears 2-3 weeks after the apex, and the deaths are about a quarter of the maximum per day, and those further drop to about an eighth of that maximum in about five more days, a country would be equaling China's success at this stage in the process.

To see overall area percentiles, use the right click Evaluate Fit Model feature. If the Analytic Moments areas sum to 3383, as they do in in the China data, one can integrate from 0 (in X) to an entered value (in X) to get an area. If one were to enter different values looking for 99% of the 3383 value (3349), 81 days was required for this 99 percentile, of 81-36=45 days post-apex.

The total estimated deaths will be the sum of Analytic Areas. One important estimate is how close one is to the projected endpoint as the the density currently stands. The sum of measured Areas should be very close to the current cumulative deaths. Bear in mind that China recorded three post-apex region peaks, and four peaks past the actual recorded apex in the data.














 



c:\1pf\v5 help\home.gif