eISSN: 2576-4500

Aeronautics and Aerospace Open Access Journal

Research Article Volume 6 Issue 3

Statistical investigation on covid 19 epidemics

Sebastiano Tosto

ENEA Casaccia Research Center, Italy

Correspondence: Sebastiano Tosto, ENEA Casaccia Research Center, via Anguillarese 301 00060 Roma, Italy

Received: July 08, 2022 | Published: July 19, 2022

Citation: Tosto S. Statistical investigation on covid 19 epidemics. Aeron Aero Open Access J. 2022;6(3):83-88. DOI: 10.15406/aaoaj.2022.06.00144


Abstract

This paper describes a statistical investigation of the Covid-19 infection in Italy. The analysis of epidemic data, together with simulation tests, demonstrates the usefulness of a statistical approach for foreseeing the spread of the contagion in advance and for organizing preventive containment measures. The approach is general in character: it is valid and applicable to other countries as well.

Keywords: covid, epidemic, analysis, statistics, simulation tests

Introduction

The persistent state of alert imposed worldwide by the recent Covid-19 epidemic has stimulated the search for new medical therapies and has highlighted the usefulness of predicting, in some way, the spread of the infection across territories. The latter aspect, although prevalently non-medical in character, is of considerable social interest because it allows local health institutions to plan and organize financial interventions in advance. For this reason, the present statistical study was carried out on the number $n_i = n_i(t)$ of infected individuals as a function of time, and on the related consequences, in Italy. The basic questions that motivated this study were: was the progressive spread of the infection somehow predictable from the early days of its evolution? Could the number of deaths be reasonably estimated and the probability of recovery rationally assessed? Although the core of these questions is epidemiological, there is a reasonable chance that a statistical investigation of the data released by healthcare institutions proves useful, whatever the key medical factors governing the actual evolution of the infection might be. The ramp-up stage of $n_i$ is intuitively understandable. Infections are acknowledged only after the natural incubation time of the virus and the subsequent medical tests that reveal its presence in patients; this implies a latency period during which the number $n_i$ of newly infected persons, sometimes asymptomatic, grows faster than its conversion into the numbers of recoveries $n_h = n_h(t)$ and deaths $n_d = n_d(t)$. The ramp-down of $n_i$ is also understandable, for several reasons: promptly organized containment measures, the improved efficacy of therapies fine-tuned in the meantime, the statutory quarantine of infected individuals and, eventually, herd protection during the advanced stage of the epidemic.
Clearly, between these two stages there is a transient one where both effects balance, whence the occurrence of a peak of $n_i$ as a function of time. This is in brief the statistical core of the epidemic infection, which consists of numbers and mathematical functions that mirror and codify the ongoing activities of hospitals and research institutes. The ability of statistical methods to summarize the behaviour of large numbers of events of any kind suggests the possibility of developing a calculation model that accounts for the numerical aspects of the collective medical problem.

The present paper aims to investigate what kind of information can be obtained by examining exclusively the infection data available in Italy, along with those of recoveries $n_h$ and deaths $n_d$. As the results appear encouraging, nothing prevents an analogous study from being carried out elsewhere.

The statistical model

Figure 1 shows the source data released by “Protezione Civile”: the ordinate reports the values of the three aforesaid numbers, and the abscissa expresses the time in days in the year 2020. The initial time is February 24th and the final time is May 10th, the last reading used for the statistical calculations presented here. This time window is assumed to be sufficient to delineate the phenomenon for the purposes of the present paper. The data used for the calculations are cumulative at the national level and gender inclusive (male + female); by definition, they also disregard the age of the subjects, their pre-existing state of health and the different realities of the regional health care organizations. The basic hypothesis is that $n_i$ can in any case be represented by a time function $F = F(t)$ characterized by a maximum at $t = t_{max}$, to which corresponds a peak number $n_i = n_i^{max} = n_i(t_{max})$ of infected individuals, as shown in Figure 1. The following calculations are carried out as a function of time expressed in days starting from February 24th, which becomes day 1.

Figure 1 Source of data released by Protezione civile.

As any statistical analysis requires homogeneous and comparable samples, the following basic assumptions hold for the mathematical processing of the data:

  1. The plots of Figure 1 are built on numbers sufficiently homogeneous to allow the self-consistency of a statistical model throughout the data sampling period. This implies, for example, that the initial measures to contain the virus infection do not change as a function of time, because this would perturb the natural course of recoveries and deaths with respect to the initial conditions.
  2. Changes in the environmental conditions to which the virus seems sensitive, typically the temperature or the degree of humidity, are not explicitly considered. A deep knowledge of the vitality of the virus and of its importance from an epidemiological point of view is certainly essential for predicting its behaviour and impact on the population. Yet, lacking pertinent information, these factors and even the ability of the virus to modify itself are neglected.

The data in Figure 1 are the reference standpoint for any subsequent reasoning. In all of the following plots, the graphic symbols represent the respective experimental data, while the curves show the trend of the mathematical functions intended to represent them. The overlap of curves and data is shown systematically throughout this paper to verify that the medical assessment can be described correctly via appropriate mathematical functions. This point is not trivial, because the statistical homogeneity of the data used for the calculations cannot be controlled “a priori”. The statistical parameters of the curves calculated by the statistical software are numbers with several decimal places, omitted or rounded for brevity; what is essential is to demonstrate that best fit analytical functions consistent with the observed data actually exist, whatever the actual numerical outputs of the computer might be. In other words, the match between experimental symbols and representative curves, rather than the underlying mathematical details, is the simplest proof of the reliability of the present calculation model.

The paper consists of two parts

The first one concerns all infection data available up to the time considered in the present paper, with the aim of describing the whole infection event as comprehensively as possible.

The second part repeats some selected calculations considering a few early data only: the aim is to show that actually these initial data are enough to extrapolate reasonably the whole event, thus evidencing that crucial features like peak time and related maximum numbers of infected, recoveries and deaths could have been decently estimated since the beginning of the infection. In fact, the calculations able to reproduce the observed data also provide extrapolated information.

This is precisely the main purpose of the model. Reference books on the standard statistical techniques are listed in refs 1–4. The present calculations were carried out with the “nonlinear fit” statistical packages of the MAPLE software. An ample literature has recently been produced to mobilize medical resources and urgently face the pandemic; as expected, these papers prevalently concern the medical and clinical aspects of the virus problem. See ref. 5 for the infection in the U.K., ref. 6 for virus diagnostics, ref. 7 for prevention, ref. 8 for the implications of infection, ref. 9 for transmission from animals to humans and for asymptomatic contacts, and ref. 10 for the infection in Germany.

Among all medical aspects of the Covid 19 infection, the present paper focuses the attention of the reader on the impact of the virus aggressiveness on the population, strictly related to the statistical parameters emerging from the calculations.

Results of the statistical analysis of the trends depicted in the Figure 1

The search for a test function $F$ suitable to represent the observed data of virus spread among the population, $n_i = n_i^{obs}$, must account for some mathematical requirements:

  1. It must have a maximum as a function of time;
  2. It must tend to zero as time tends to infinity, otherwise the number of new infections would persist indefinitely, contrary to what emerges from historical epidemiological data;
  3. The calculated peak must be later than the first detection date, otherwise F would automatically be incorrect;
  4. Subsequently, a similar analysis will be also extended to the statistics of deaths and recoveries.

A preliminary data analysis using a Gaussian "bell" type test function, typically used to represent statistical phenomena that in principle fulfil the aforementioned boundary conditions, did not succeed: before the maximum, the Gaussian curve increases too slowly to account for the values of $n_i$ in Figure 1, which instead show a much steeper exponential growth. For brevity this result is not reported here. A better representative function was instead the following modified Gaussian

$F_i = A\,(t+t_1)^x \exp\left(-k\,(t-t_0)^2\right)$,  (1)

with $A$, $t_0$, $t_1$, $k$, $x$ best fit numerical parameters to be found. The problem is precisely to calculate these constants in order to reproduce, and then extrapolate, the few initial data available as reliably as possible. The graph in Figure 2 shows the result of the nonlinear best fit regression of the representative function (1). Omitting here the mathematical details on the choice of the analytical form of $F$, we emphasize that the time factor $(t+t_1)^4$ is essential to modify the Gaussian according to the high rate of increase of the early $n_i$; this factor is thus representative of the clinical aggressiveness and infectiveness of the virus.

Figure 2 Result of nonlinear best fit regression of the representative function.

With this representative function the agreement with the experimental contagion data is reasonable, while the negative exponential in (1) actually predicts $t_{max}$ and the subsequent descent of the number of contagions as a function of time, previously postulated “a priori”.

With this function it is simple to calculate the data of interest, summarized in the following eqs (2), which quantitatively account for the calculated peak number of infected persons $n_i^{max}$ and the time $t(n_i^{max})$ to reach this peak, corresponding to $\partial F/\partial t = 0$, i.e.:

$t(n_i^{max}) = 56.8, \quad n_i^{max} = 108897$;

the comparison of $n_i^{max}$ with the observed number of cases $n_i$ on the day of the peak yields

$n_i^{obs} = 108014, \quad \text{deviation} = 0.82\%$.  (2)
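
The stationarity condition behind eqs (2) can be made explicit. The following is a minimal sketch, assuming that eq (1) reads $F = A\,(t+t_1)^x \exp(-k\,(t-t_0)^2)$ with $k > 0$: setting the derivative of $\ln F$ to zero gives $x/(t+t_1) = 2k\,(t-t_0)$, which is a quadratic equation in $t$. The function names and the parameter values are hypothetical placeholders chosen only for illustration, since the fitted parameters are not reported in the text; only the two numbers in the deviation check are those quoted in eqs (2).

```python
import math

def modified_gaussian(t, A, t0, t1, k, x):
    """Eq (1) as read here: A * (t + t1)**x * exp(-k * (t - t0)**2)."""
    return A * (t + t1) ** x * math.exp(-k * (t - t0) ** 2)

def peak_time(t0, t1, k, x):
    """Root of d(ln F)/dt = x/(t + t1) - 2k(t - t0) = 0 with t > -t1.

    The condition is equivalent to the quadratic
    2k*t**2 + 2k*(t1 - t0)*t - (2k*t0*t1 + x) = 0; its larger root is the peak.
    """
    a = 2.0 * k
    b = 2.0 * k * (t1 - t0)
    c = -(2.0 * k * t0 * t1 + x)
    disc = b * b - 4.0 * a * c
    return (-b + math.sqrt(disc)) / (2.0 * a)

def relative_deviation_percent(calculated, observed):
    """Percentage deviation of the calculated value from the observed one."""
    return 100.0 * (calculated - observed) / observed

# Hypothetical parameters, NOT the fitted values of the paper.
params = dict(A=1.0, t0=40.0, t1=5.0, k=1.0e-3, x=4.0)
t_peak = peak_time(params["t0"], params["t1"], params["k"], params["x"])
n_peak = modified_gaussian(t_peak, **params)

# Deviation between the calculated and observed peak numbers quoted in eqs (2).
deviation = relative_deviation_percent(108897, 108014)
print(f"hypothetical peak time: {t_peak:.2f} days")
print(f"deviation: {deviation:.2f}%")
```

With the numbers of eqs (2) the deviation evaluates to about 0.82%, in agreement with the value quoted in the text.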

The fractional time could in principle mean that the peak is reached between the 56th and 57th day after February 24th; actually, this result must be understood as a numerical output with merely statistical meaning. What matters is that the deviation between the calculated and observed numbers of infected individuals is satisfactory. The peak number $n_i^{max}$ of infected people is, as of today's date, about 1.4 per thousand of the Italian population with the present data used for the calculations.

At first sight this calculation seems redundant, since all experimental values of $n_i$ before and after the peak are already visible. Actually, these results are significant for the comparison with the corresponding extrapolated values of the next section: they make it possible to assess quantitatively the predictive ability of analogous calculations based on a few early data only.

The next plot, Figure 3, represents in an analogous way the numbers of recoveries and deaths, to complete the description of the global infection. The figure is obtained from the data corresponding to the days of Figure 1 via the respective best fit functions, which are

$F_h = \dfrac{A\,(t+t_0)^3}{1 + B\,(t+t_1)^4}, \qquad F_d = A\,\dfrac{C\,(t+t_0) + D\,(t+t_0)^2 + (t+t_0)^3}{1 + B\,(t+t_1)^3}$,  (3)

Figure 3 Numbers of recoveries and deaths to complete the description of the global infection.

where the capital letters and $t_0$, $t_1$ are the statistical parameters used to calculate explicitly the time trends of $n_h$ and $n_d$. Figures 2 and 3 concern raw observed data, to show that these can actually be represented via mathematical functions. A further important point supporting the validity of the present approach is the statistical nature of the infection itself, which appears confirmed by defining the function

$\pi_h \pi_d = \dfrac{n_h}{n_i}\,\dfrac{n_d}{n_i}$   (4)

plotted in Figure 4. Assuming that both $n_h$ and $n_d$ are consequences of a unique cluster $n_i$ of infected individuals, $\pi_h$ and $\pi_d$ are definable as probabilities. The ordinate of the plot reports the joint probability that among $n_i$ cases of infection there can be both recoveries and deaths, whose respective probabilities are combined according to the law of independent events. Indeed it is reasonable to think that $\pi_h$ and $\pi_d$ concern independent classes of healthy and unhealthy, smoker and non-smoker individuals of different gender and age, whose immune defences imply different reactions to the infection.

Figure 4 Statistical nature of the infection.
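
Eq (4), read here as the product of the two fractions $n_h/n_i$ and $n_d/n_i$, is a one-line computation. The sketch below only illustrates that arithmetic: the function name is hypothetical and the counts are made up, not the Protezione Civile data.

```python
def joint_probability(n_i, n_h, n_d):
    """Product (n_h / n_i) * (n_d / n_i), the quantity read off eq (4)."""
    if n_i <= 0:
        raise ValueError("n_i must be positive")
    pi_h = n_h / n_i   # fraction of recoveries among the infected
    pi_d = n_d / n_i   # fraction of deaths among the infected
    return pi_h * pi_d

# Made-up illustrative counts (NOT the observed data): (day, n_i, n_h, n_d).
series = [
    (10, 1000, 50, 20),
    (30, 50000, 10000, 6000),
    (60, 100000, 60000, 25000),
]
joint = {day: joint_probability(n_i, n_h, n_d) for day, n_i, n_h, n_d in series}
for day, value in joint.items():
    print(day, round(value, 4))
```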

On the one hand it is reasonable to think that $\pi_h$ and $\pi_d$ depend specifically upon the current clinical history of the individuals, upon sex and upon all the other variables that contribute to the cumulative data of Figure 1. So, the collective trends of $\pi_h$ and $\pi_d$ concern independent classes of healthy and unhealthy people.

On the other hand, from a statistical point of view, in a sample of many subjects the differences from one individual to another appear summarized in a single collective behaviour whose global result is uniquely definable, the one summarized in the figure 1 that is the target of the present paper.

Just the regular trend of this figure encourages thinking that, despite the wide variety of biological parameters triggered by the complex virus/human body interaction, it is anyway possible to identify a well-defined collective behaviour hidden in the apparent randomness of personal situations. The identification of this joint probability is interesting as it concerns the chance of combining the available data to infer more information and deserves a tentative explanation.

Let $\pi_1 = p_h P_d$ indicate an unhealthy patient with small probability $p_h$ of surviving and thus more likely to die, with high probability $P_d$, whereas $\pi_2 = P_h p_d$ concerns the opposite case of a healthy patient; $P_h$ and $P_d$ add further information to the left-hand side of eq (4), because the probabilities $\pi_1$ and $\pi_2$ now emphasize explicitly the health states of the respective patients. Let the numbers $n_h$ be large enough to gather individuals that in any case recover at the end of their therapeutic cycle, regardless of whether their initial health conditions are classifiable with low or high recovery probabilities; a typical example is that some individuals recover without need of intensive therapy, whereas others unfortunately need invasive and prolonged hospital treatments but recover anyway. So, the initial $p_h$ and $P_h$ are not in themselves crucial in determining the final recovery probability $n_h/n_i$; in other words, different patients classified as $p_h$ and $P_h$ merge into a global $n_h$ defining a unique $n_h/n_i$. An analogous consideration holds for $n_d/n_i$ regardless of $p_d$ and $P_d$: it is not true that only individuals in intensive therapy, i.e. the ones classified as $P_d$, die. So when calculating $\pi_1 \pi_2$ the respective health conditions represented by the probabilities $p_h P_h$ and $p_d P_d$ are statistically equivalent to, and simply summarized by, $n_h$ and $n_d$ of eq (4). As a matter of fact, the respective curve of Figure 4 is well defined and thus explainable with $P_h p_h$ and $P_d p_d$ described by a unique time behaviour.

Although this conclusion was not strictly evident “a priori”, such a correlation is directly confirmed in Figure 5. The plot shows that the numbers of recoveries $n_h$ and deaths $n_d$ are both functions of time, mutually correlated through their common origin from the number of infected $n_i$ only, and thus are uniquely definable: the implicit functions $n_h = H(n_i)$ and $n_d = D(n_i)$ merge together into the function

$\dfrac{\text{recoveries}}{\text{deaths}} = A\,t + B\,t^4$.  (5)

Figure 5 Number of recoveries and deaths.

The regression coefficients at the right of the equation, determined as described above, provide quantitative estimates of the progress of the contagion containment as a function of time.

The trend of $n_h$ vs $n_d$ shows that the numbers of deaths and recoveries are well correlated in the population infected by the virus, despite the diversity of the individuals. It is reasonable to guess that during the initial incubation period, soon after the virus attack, the $n_i$ patients are in a transient state with suspended probabilities $\pi_h$ and $\pi_d$ before they turn into either of the observed states defined by $n_h$ or $n_d$; here too the solid line indicates the existence of a single ideal cumulative probability only, without discriminating between serious patients who recovered after intensive care and mild patients treated with simple drug therapy. The left-hand side represents the therapeutic efficacy; the different powers of $t$ on the right-hand side concern the short-term and long-term ratios, the second addend expectedly becoming predominant in the long term. So (5) measures the success of the full therapeutic cycle.
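
Because the right-hand side of eq (5), read here as $A\,t + B\,t^4$, is linear in $A$ and $B$, the two coefficients can also be estimated by ordinary linear least squares. The sketch below is only an illustration under stated assumptions: it replaces the MAPLE nonlinear fit used in the paper with a closed-form solution of the 2x2 normal equations, the function name is hypothetical, and the ratios are synthetic values generated from hypothetical coefficients rather than observed data.

```python
def fit_ratio(times, ratios):
    """Least-squares estimate of (A, B) in ratio = A*t + B*t**4."""
    s11 = sum(t ** 2 for t in times)    # sum of t * t
    s12 = sum(t ** 5 for t in times)    # sum of t * t**4
    s22 = sum(t ** 8 for t in times)    # sum of t**4 * t**4
    r1 = sum(t * y for t, y in zip(times, ratios))
    r2 = sum(t ** 4 * y for t, y in zip(times, ratios))
    det = s11 * s22 - s12 * s12
    A = (r1 * s22 - r2 * s12) / det
    B = (s11 * r2 - s12 * r1) / det
    return A, B

# Synthetic ratios from hypothetical coefficients (NOT values from the paper).
A_true, B_true = 2.0e-2, 1.5e-7
times = [float(t) for t in range(1, 77)]
ratios = [A_true * t + B_true * t ** 4 for t in times]

A_fit, B_fit = fit_ratio(times, ratios)
print(A_fit, B_fit)
```

On noise-free synthetic input the fit returns the generating coefficients up to rounding; with real data the residuals would indicate how well the two-term form describes the observed ratio.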

Herd immunity

The statistical model described so far has considered the raw $n_i$ of Figure 1 to describe in more detail, in Figure 2, the ramp-up law of the total number of infected people as a function of time towards and around the peak. Presumably, in this time range no further considerations are necessary besides the obvious requirement that the data be homogeneous and statistically representative. Yet there is no reason to exclude long-term effects, such as the so-called “herd immunity”. While the mere numerical analysis of data does not in itself contain epidemiological information, the latter should hopefully be inferable from the former: no statistical model can be considered exhaustive without the ability to extract information about this protective effect, possibly hidden in the raw data. Since herd immunity is reasonably ineffective for the earliest data but becomes more important subsequently, the second half of the curve in Figure 6 has been drawn cautiously as a dashed line; without accounting for herd immunity, this part of the plot in principle estimates only the upper limit of the number of infections. Although the match between the raw observed $n_i$ and the calculated curve is good enough to make corrections unnecessary, at least with the data presently available, this section shows how to account for and assess the expected deviation in the long-term data.

In principle, realistic values $n_i^{herd} < n_i$ accounting for the herd effect should be recalculated from the curve of Figure 6 by introducing a corrective factor $f(n_i) < 1$ at $t \gg t_{peak}$ through iterative calculations until convergence; however, this requires hypothesizing an average number of contacts among a selected group of infected and recovered individuals in order to verify quantitatively the probability of successful herd protection. In practice, an expression linking $n_i^{herd}$ and $n_i$ is more simply found by introducing a controllable correction factor that reasonably modifies the short-term results already obtained.

Figure 6 Observed $n_i$ and calculated curve.

An appropriate way to represent the herd effect mathematically is

$n_i^{herd} = n_i\, f(n_i) = \dfrac{n_i}{1 + C\, n_i\, n_h}$   (6)

where $C$ is a proportionality constant; i.e. $n_i$ is reduced by a variable factor which becomes progressively more important with the increase of $n_i$ itself and of the number $n_h$ of recoveries.

For $C = 0$, $f(n_i) = 1$, i.e. the herd effect is zero.

If $C \neq 0$, then for $n_h \to \infty$ the function (6) is proportional to $n_h^{-1}$. That is, if the number of recoveries increases then the number of infected individuals tends to zero, which is precisely what herd protection requires: from a probabilistic point of view, a healthy person surrounded only by recovered or non-infected people has little chance of becoming infected.
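
The two limiting cases can be checked numerically. This is a minimal sketch, assuming that eq (6) reads $n_i^{herd} = n_i / (1 + C\, n_i\, n_h)$; the function name, the counts and the values of $C$ are illustrative assumptions, not the observed data.

```python
def herd_corrected(n_i, n_h, C):
    """Eq (6) as read here: n_i * f(n_i) with f = 1 / (1 + C * n_i * n_h)."""
    return n_i / (1.0 + C * n_i * n_h)

# Illustrative counts (hypothetical, NOT the observed data).
n_i, n_h = 100_000.0, 50_000.0

no_herd = herd_corrected(n_i, n_h, C=0.0)       # C = 0: f = 1, no correction
strong = herd_corrected(n_i, n_h, C=1.0e-10)    # larger C: stronger reduction
weak = herd_corrected(n_i, n_h, C=1.0e-11)      # smaller C: milder reduction

# Large-n_h limit: the corrected number approaches 1 / (C * n_h).
n_h_large = 1.0e12
limit_value = herd_corrected(n_i, n_h_large, C=1.0e-10)
asymptote = 1.0 / (1.0e-10 * n_h_large)

print(no_herd, round(strong, 2), round(weak, 2))
print(limit_value, asymptote)
```

For these illustrative inputs the correction leaves $n_i$ unchanged at $C = 0$ and reduces it by one third when $C\, n_i\, n_h = 0.5$, which shows how sensitive the corrected curve is to the order of magnitude of $C$.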

Eq (6) is calculable with $n_i$ and $n_h$ extrapolated from the short-term data, where the herd effect is ineffective. Although a heuristic “ab initio” calculation model would be preferable, this equation yields the sought information once more by best fit statistical analysis of the long-term infection data. In fact, the approach based on (6) is sensible only if the virus does not change: the early trend summarized by the short-term data is a fixed fingerprint of a given virus, whereas only the long-term probability of infection by contact with recovered neighbours changes. Yet, if the virus evolves as a function of time, then a unique $n_i$ curve is no longer representative; it should be replaced by two curves pertinent to the virus before and after modification, i.e. the herd effect is no longer definable via correction of a unique short-term curve. However, the analysis of the long-term curve could in principle provide updated information about the virus modification that occurred in the meantime; in this respect it is still sensible to check how the long-term curve is modified by the constant coefficient $C$, in order to infer information about the different short-term trend.

An example of simulated results is shown in Figure 6 with two purposely chosen test values of the constant $C$, i.e. $C = 10^{-10}$ and $C = 10^{-11}$.

For comparison, this figure again reports the data of Figure 1 with and without the correction introduced by eq (6); indeed, the main plot is just that of Figure 2 plus an extrapolated part to be compared with the two dotted herd curves. The results with $C = 10^{-10}$ show $t_{max}$ equal to about 35 days, while the peak number of infected people $n_{max}$ decreases to about 70,000. It is clear that this value of $C$ foresees too optimistic a herd effect, which should have already been observed experimentally 50 days after February 24; since this is not realistic, the simulation was repeated with $C = 10^{-11}$, which foresees the peak at 60 days with a number of infected equal to about 102,000. As expected, the alteration of the raw data of $n_i$ is in any case practically zero at $t < t_{max}$; the differences with respect to Figure 2 become relevant only much later, reasonably at increasing $n_i$ after the peak. The dotted line that represents the long-term extrapolation of the best fit curve of $n_i$ suggests its tendency to approach that of $n_i^{herd}$. This tendency of the early curves to merge supports the validity of this reasonable value of $C$.

It can be concluded that this constant can be either (i) predetermined in order to simulate a herd effect practically negligible initially but progressively increasing with the increase of the infected and of the recoveries or (ii) determined uniquely by the statistical treatment of long term “tail data”.

Early data analysis

This section concerns the results obtainable by analysing the first 20 initial data only, $n_i^{red} < 20$ of Figure 1; the notation stands for the “reduced” number of early observed data. Figure 7 shows the result of the best fit analysis of these few data only, which of course are insufficient to match closely the true trend of all observed data, also reported for comparison. Nonetheless, the peak time is still quite accurate: it differs from the true peak time by about one day, whereas the related number of infected is underestimated by only about 10%. These results show that the main epidemiological features of the virus infection could have been realistically anticipated by a prompt analysis of statistical data. This is the true meaning of the present model.

Figure 7 Result of best fit analysis.
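
The early-data idea can be sketched under strong simplifying assumptions. If eq (1) is read as $F = A\,(t+t_1)^x \exp(-k\,(t-t_0)^2)$ and both $t_1$ and $x$ are held fixed at hypothetical values, then $\ln F - x \ln(t+t_1)$ is a quadratic polynomial in $t$, so $k$ and $t_0$ follow from a linear least-squares parabola fitted to the first 20 days alone, and the peak time can then be extrapolated. This linearized fit is a stand-in for the MAPLE nonlinear regression used in the paper, and it runs on noise-free synthetic data generated from hypothetical parameters; all names below are illustrative.

```python
import math

def solve3(M, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(M, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = a[r][n] - sum(a[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = s / a[r][r]
    return sol

def fit_early(times, counts, t1, x):
    """Fit ln(F) - x*ln(t + t1) = c0 + c1*t + c2*t**2 and return (k, t0)."""
    ys = [math.log(n) - x * math.log(t + t1) for t, n in zip(times, counts)]
    S = [sum(t ** p for t in times) for p in range(5)]             # power sums
    R = [sum(y * t ** p for t, y in zip(times, ys)) for p in range(3)]
    M = [[S[0], S[1], S[2]], [S[1], S[2], S[3]], [S[2], S[3], S[4]]]
    c0, c1, c2 = solve3(M, R)
    k = -c2                # -k*(t - t0)**2 contributes -k*t**2
    t0 = c1 / (2.0 * k)    # and +2*k*t0*t
    return k, t0

def peak_time(t0, t1, k, x):
    """Larger root of 2k*t**2 + 2k*(t1 - t0)*t - (2k*t0*t1 + x) = 0."""
    b = 2.0 * k * (t1 - t0)
    disc = b * b + 8.0 * k * (2.0 * k * t0 * t1 + x)
    return (-b + math.sqrt(disc)) / (4.0 * k)

# Synthetic "epidemic" from hypothetical parameters (NOT the paper's fit).
A, t0_true, t1, k_true, x = 1.0e-3, 40.0, 5.0, 1.0e-3, 4.0
days = [float(d) for d in range(1, 21)]      # first 20 days only
counts = [A * (d + t1) ** x * math.exp(-k_true * (d - t0_true) ** 2) for d in days]

k_fit, t0_fit = fit_early(days, counts, t1, x)
t_peak = peak_time(t0_fit, t1, k_fit, x)
print(k_fit, t0_fit, t_peak)
```

On exact synthetic input the early points recover the generating parameters, so the extrapolated peak coincides with the true one; with real, noisy data the full nonlinear fit of all five parameters would be needed and the extrapolation error would be correspondingly larger.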

Discussion

The overall indication that emerges from the present statistical model is that even a minimal alteration of environmental data causes a drastic variation in the spread of the virus and in its ability to infect, which is quantified by the best fit parameters of the exponential eq (1). On the one hand, owing to the exponential form, even small changes of these parameters strongly affect the dependence of $n_i$ upon time. On the other hand, this very mathematical feature of the contagion suggests that even elementary protection measures adopted preventively, e.g. a safety distance between individuals or the use of protective masks, are effective in controlling the subsequent epidemic data. This conclusion appears evident in Figure 6: even small changes of $C$, purposely introduced in eq (6) to simulate a typical environmental effect such as herd protection, are enough to modify significantly the long-term evolution of the infection in progress. Also, Figure 2 shows an anomalous dispersion of the data just after the peak, which implies an initial rate of decrease of $n_i$ slower than predicted by the exponential of (1).

This is not surprising. It is known in general that when a strong effect predominantly governs the kinetics of a process, an external perturbation simply adds a minor spread of data around the leading trend obtainable via statistical considerations only. An equilibrium state in nature is instead, by definition, the resultant of many counterbalancing effects, and is thus particularly sensitive to any external perturbation; specifically, the possible consequences of the previous health conditions of the patients and of their lifestyle appear around the peak time.

Another problem is that the strict restrictive measures initially put in place by the Government authorities were partially loosened, probably because the stage after the peak of infected patients was interpreted too optimistically as a sign of reduced danger of contagion. After this short stage, however, the further $n_i$ data resumed the behaviour predicted by eq (1).

It is commonly believed that the turning point of the infection advancement is identifiable with the peak time $t_{max}$ of $n_i$; undoubtedly this is correct as concerns the effectiveness of the measures to prevent new contagions, but it does not adequately highlight the current state of the epidemic evolution.

Actually, the analysis carried out so far suggests the significance of the relationship between the incoming number $n_i$ of hospitalized patients and the outgoing numbers $n_h$ and $n_d$, which represent the two possible evolutions of $n_i$ after a successful or vain therapy cycle. Yet the experimental data of Figure 1 show the existence of a time such that $n_i - n_h < 0$, which also implies $n_i - (n_h + n_d) < 0$; the time evolution of $n_i$ and $n_h$ observed at increasing times therefore suggests the relevance of the number

$N = n_i - (n_h + n_d)$

that involves explicitly the effectiveness of the therapies on the $n_h$ patients. Clearly $N > 0$ implies overcrowding of hospitalized patients exceeding the capacity for their efficient management; $N < 0$ implies a decreasing rate of newly infected patients with respect to their conversion rate to either $n_h$ or $n_d$. If however the growth rate of $n_i$ decreases until becoming null at the peak time, while in the meantime the sum $n_h + n_d$ continues to increase, then one expects first $N = 0$ at a given $t^*$ and next even $N < 0$ at $t > t^*$. Hence, assuming that $n_i$ still evolves at $t > t_{max}$ into $n_h(t)$ and $n_d(t)$, it happens that: (i) $n_i$ tends to zero while contributing to the growth of $n_h$, (ii) $n_d$ tends towards its final value of total deaths, (iii) $N$ becomes more and more negative.

Figure 8 confirms these ideas: it appears that $t^* \approx 65$ days, after which $N < 0$ indicates a progressively decreasing number of new infected, $n_i < n_h + n_d$.

Figure 8 Number of new infected $n_i < n_h + n_d$ progressively decreasing.
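
The turning point $t^*$ can be located directly from the three daily series. The sketch below assumes the definition $N = n_i - (n_h + n_d)$ read from the text and finds the first change of sign of $N$ by linear interpolation between consecutive readings; the function names are hypothetical and the series are made-up values, not the observed data.

```python
def net_balance(n_i, n_h, n_d):
    """N = n_i - (n_h + n_d) for each reading."""
    return [i - (h + d) for i, h, d in zip(n_i, n_h, n_d)]

def first_crossing(days, values):
    """Interpolated time at which values first pass from > 0 to <= 0, else None."""
    for k in range(1, len(values)):
        prev, curr = values[k - 1], values[k]
        if prev > 0 and curr <= 0:
            frac = prev / (prev - curr)          # fraction of the interval elapsed
            return days[k - 1] + frac * (days[k] - days[k - 1])
    return None

# Made-up illustrative series (NOT the observed data), in arbitrary units.
days = [50, 55, 60, 65, 70]
n_i = [100, 108, 107, 100, 90]
n_h = [40, 55, 70, 85, 100]
n_d = [20, 24, 27, 29, 30]

N = net_balance(n_i, n_h, n_d)
t_star = first_crossing(days, N)
print(N, t_star)
```

In this toy series $n_i$ peaks at day 55 while $N$ changes sign only between day 60 and day 65, which illustrates why the zero of $N$ need not coincide with the peak of $n_i$.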

On the one hand it is not surprising that the time $t^*$ at which $N = 0$ does not necessarily coincide with $t_{max} \approx 57$ days, calculated considering $n_i$ only. On the other hand it is clear why $N$ and not $n_i$ is the crucial indicator of the turning point between the aggressive stage of the virus and its progressive attenuation; when comparing $n_i$ and $n_h + n_d$, the sign of $N$ depends upon the relative rates at which the addends prevail over each other.

All of this is explained in the Figure 9, showing that in effect N  consists of three statistical distributions that end down with negative values; the three contributions of representative best fit functions resolved by the peak analysis are reasonably related to the three respective basic stages of contagion with . The peak splitting equation has the analytical form

$A\,(t-t_1)^4 \exp\left(-a\,(t-34)^4\right) + B\,(t-t_2)^6 \exp\left(-b\,(t-42)^2\right) + C\,(t-t_3)^4 \exp\left(-c\,(t-45)^2\right)$   (7)

Figure 9 $N$ consists of three statistical distributions.

Conclusion

The statistical investigation provides reliable and significant indications on the epidemiological evolution of COVID-19, even extensible to extrapolated times. Although further data are still expected in the next days beyond the considered interval, the most critical aspects of the infection seem adequately represented even at the current end time.

The present information methodology has general character; it is valid and applicable without need of specific hypotheses besides the basic assumptions of any statistical model.

-The results show that the evolution of the virus infection could have been predicted with reasonable approximation from the early days, which means a good chance of organizing care structures and medical materials (e.g. face masks) useful in the most acute stage of the infection.

-The plots allow estimating quantitatively the efficacy of therapeutic protocols through the condition $n_i - n_h - n_d = 0$, which holds twice: at the peak time of $n_i$ and before the onset of negative values $n_i - n_h - n_d < 0$ that defines the final stage of predominant contagion.

The approach is useful to forecast and plan future containment and coordination activities in support of the sick.

Acknowledgments

None.

Conflicts of interest

The author declares that there is no conflict of interest.

References

©2022 Tosto. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and build upon your work non-commercially.