Research Article Volume 6 Issue 3
ENEA Casaccia Research Center, Italy
Correspondence: Sebastiano Tosto, ENEA Casaccia Research Center, via Anguillarese 301 00060 Roma, Italy
Received: July 08, 2022 | Published: July 19, 2022
Citation: Tosto S. Statistical investigation on covid 19 epidemics. Aeron Aero Open Access J. 2022;6(3):83-88. DOI: 10.15406/aaoaj.2022.06.00144
This paper describes a statistical investigation of the Covid-19 virus infection in Italy. The analysis of epidemic data and simulation tests demonstrates the usefulness of a statistical approach for foreseeing the spread of the contagion and organizing preventive containment measures. The approach is general in character: it is valid and applicable to other countries as well.
Keywords: covid, epidemic, analysis, statistics, simulation tests
The persistent state of alert imposed worldwide by the recent Covid-19 epidemic has stimulated the search for new medical therapies and has highlighted the usefulness of predicting, in some way, the spread of the infection across territories. The latter aspect, although mainly non-medical in character, is of considerable social interest, because it allows local health institutions to plan and organize financial interventions in advance. For this reason, the present statistical study was carried out on the number $n_i = n_i(t)$ of infected individuals as a function of time, and on the related consequences, in Italy. The basic questions that motivated this study were: was the progressive spread of the infection somehow predictable from the early days of its evolution? Could the number of deaths be reasonably estimated and the probability of recovery rationally assessed? Although the core of these questions is epidemiological, there is a reasonable chance that a statistical investigation of the data released by healthcare institutions proves useful, whatever the key medical factors governing the actual evolution of the infection might be. The ramp-up stage of $n_i$ is intuitively understandable. In fact, the $n_i$ cases are acknowledged after the natural incubation time of the virus and the subsequent medical tests that reveal its presence in the patients; all of this implies a latency period during which the number $n_i$ of newly infected persons, sometimes asymptomatic, grows faster than its conversion into the numbers of recoveries $n_h = n_h(t)$ and deaths $n_d = n_d(t)$. The ramp-down of $n_i$ is also understandable, for several reasons: promptly organized containment measures, improved efficacy of therapies fine-tuned in the meantime, statutory quarantine of infected individuals and, eventually, herd protection during the advanced stage of the epidemic.
Clearly, between these two regimes there is a transient stage where both effects balance, whence the occurrence of a peak of $n_i$ as a function of time. This is, in brief, the statistical core of the infection, which consists of numbers and mathematical functions that mirror and codify the ongoing activities of hospitals and research institutes. The ability of statistical methods to summarize the behaviour of large numbers of events of any kind suggests the possibility of elaborating a calculation model that accounts for the numerical aspects of the collective medical problem.
The present paper aims to investigate what kind of information can be obtained by examining only the infection data available in Italy, along with those on recoveries $n_h$ and deaths $n_d$. As the results appear encouraging, nothing prevents an analogous study from being carried out elsewhere.
The statistical model
Figure 1 shows the data released by the “Protezione Civile”: the ordinate reports the values of the aforesaid three numbers, and the abscissa expresses the time in days in the year 2020; the initial time is February 24th and the final time is May 10th, corresponding to the last reading used for the statistical calculations presented here. These time boundaries are assumed to be sufficient to delineate the phenomenon for the purposes of the present paper. The data used for the calculations are cumulative at national level and gender inclusive (male + female); also, by definition, they are irrespective of the age of the subjects, of their pre-existing state of health and of the different realities of the regional health care organizations. The basic hypothesis is that $n_i$ is in any case representable by a time function $F = F(t)$ characterized by a maximum at $t = t_{max}$, to which corresponds a peak number $n_i^{max} = n_i(t_{max})$ of infected, as shown in Figure 1. The following calculations are carried out as a function of time expressed in days starting from February 24th, which becomes day 1.
As any statistical analysis requires homogeneous and comparable samples, the following basic assumptions hold for the mathematical processing of the data:
The data in Figure 1 is the reference standpoint for any subsequent reasoning. In all of the following plots, the graphic symbols signify the respective experimental data, the curves characterise the trend of the mathematical functions aimed to represent them. The overlapping of plots and data is systematically shown throughout this paper to verify the chance of describing correctly the medical assessment via appropriate mathematical functions. This point is not trivial, because in fact the statistical homogeneity of data implemented for calculations is not controllable “a priori”. The statistical parameters of the curves calculated by the statistical software are numbers with several decimal places, omitted or rounded for brevity; it is essential to demonstrate that best fit analytical functions consistent with the observed data effectively exist, whatever the actual computer numerical outputs might be. In other words, the matching between experimental symbols and representative curves rather than the inherent mathematical details is the simplest proof about the reliability of the present calculation model.
The paper consists of two parts
The first one concerns all infection data available up to the time considered in the present paper, with the aim of describing the whole infection event as comprehensively as possible.
The second part repeats some selected calculations considering a few early data only: the aim is to show that actually these initial data are enough to extrapolate reasonably the whole event, thus evidencing that crucial features like peak time and related maximum numbers of infected, recoveries and deaths could have been decently estimated since the beginning of the infection. In fact, the calculations able to reproduce the observed data also provide extrapolated information.
This is precisely the main purpose of the model. Reference books on the standard statistical techniques are listed in refs. 1–4. The present calculations were carried out with the “nonlinear fit” statistical packages of the MAPLE software. An ample literature has been produced recently to mobilize medical resources and face the pandemic urgently; as expected, these papers mainly concern the medical and clinical aspects of the virus problem. See ref. 5 for the infection in the U.K., ref. 6 for virus diagnostics, ref. 7 for prevention, ref. 8 for the implications of infection, ref. 9 for transmission from animals to humans and for asymptomatic contacts, and ref. 10 for the infection in Germany.
Among all medical aspects of the Covid 19 infection, the present paper focuses the attention of the reader on the impact of the virus aggressiveness on the population, strictly related to the statistical parameters emerging from the calculations.
Results of the statistical analysis of the trends depicted in the Figure 1
The search for a test function $F$ suitable to represent the observed data $n_i = n_i^{obs}$ on the spread of the virus among the population must account for some mathematical requirements:
A preliminary data analysis using a Gaussian "bell" test function, typically used to represent statistical phenomena that fulfil in principle the aforementioned boundary conditions, did not succeed: before the maximum, the Gaussian curve increases too slowly to account for the values of $n_i$ in Figure 1, which instead show a more drastic exponential growth. For brevity this result is not reported here. A better representative function was instead the following modified Gaussian
$$F_i = A\,(t+t_1)^{x}\,\exp\!\left[-k\,(t-t_0)^{2}\right], \qquad (1)$$
with $A$, $t_0$, $t_1$, $k$, $x$ best-fit numerical parameters to be found. The problem is precisely to calculate these constants so as to reproduce, and then extrapolate, the few initial data available as reliably as possible. The graph in Figure 2 shows the result of the nonlinear best-fit regression of the representative function (1). Omitting here the mathematical details on the choice of the analytical form of $F$, briefly sketched below, we emphasize that the time factor $(t+t_1)^4$ is essential to modify the Gaussian according to the high growth rate of the early $n_i$; this factor is thus representative of the clinical aggressiveness and infectiveness of the virus.
It is noted that with this representative function the agreement with the experimental contagion data is reasonable, while the negative exponential in (1) predicts $t_{max}$ and the subsequent decline of the number of contagions as a function of time, previously postulated “a priori”.
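As an illustration of how the constants of eq (1) can be estimated, the following is a minimal sketch and not the MAPLE nonlinear fit used in the paper. It swaps in a simpler technique: with $t_1$ and $x$ held fixed (an assumption made here only for simplicity), $\ln F_i - x\ln(t+t_1)$ is a parabola in $t$, so $A$, $t_0$ and $k$ follow from ordinary linear least squares. All function names, parameter values and the synthetic data are hypothetical.

```python
import math

def modified_gaussian(t, A, t0, t1, k, x):
    """Eq. (1): F = A (t + t1)^x exp(-k (t - t0)^2)."""
    return A * (t + t1) ** x * math.exp(-k * (t - t0) ** 2)

def solve_linear(M, b):
    """Solve M y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y

def fit_log_linear(times, counts, t1, x):
    """Estimate A, t0, k of eq. (1), with t1 and x held fixed (assumption)."""
    mean_t = sum(times) / len(times)
    u = [t - mean_t for t in times]  # centred times improve conditioning
    # Remove the known power-law factor, leaving a parabola in u.
    z = [math.log(n) - x * math.log(t + t1) for t, n in zip(times, counts)]
    # Normal equations for z ~ c0 + c1*u + c2*u^2.
    S = [sum(ui ** p for ui in u) for p in range(5)]
    M = [[S[i + j] for j in range(3)] for i in range(3)]
    b = [sum(zi * ui ** p for zi, ui in zip(z, u)) for p in range(3)]
    c0, c1, c2 = solve_linear(M, b)
    k = -c2
    d = c1 / (2.0 * k)               # d = t0 - mean_t
    A = math.exp(c0 + k * d * d)
    return A, mean_t + d, k

# Synthetic, noise-free data from hypothetical parameters (not the Italian data).
true = dict(A=2.0, t0=30.0, t1=5.0, k=0.002, x=4.0)
days = list(range(1, 41))
data = [modified_gaussian(t, **true) for t in days]
A_fit, t0_fit, k_fit = fit_log_linear(days, data, t1=true["t1"], x=true["x"])
```

On noise-free synthetic input the sketch recovers the generating parameters; with real counts, a full nonlinear regression over all five parameters, as done in the paper, would be the appropriate tool.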
With this function it is simple to calculate the data of interest, summarized in the following eqs (2), which quantitatively account for the calculated peak number of infected persons $n_i^{max}$ and the time $t(n_i^{max})$ needed to reach this peak, corresponding to $\partial F/\partial t = 0$, i.e.:
$$t(n_i^{max}) = 56.8, \qquad n_i^{max} = 108897;$$
the comparison of $n_i^{max}$ with the observed number of cases $n_i^{obs}$ on the day of the peak yields
$$n_i^{obs} = 108014, \qquad \text{deviation} = 0.82\%. \qquad (2)$$
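The peak condition can be made explicit. For eq (1), $\partial \ln F/\partial t = x/(t+t_1) - 2k(t-t_0)$; setting this to zero gives a quadratic in $t$ whose admissible root is $t_{max} = \tfrac{1}{2}\left[(t_0 - t_1) + \sqrt{(t_0+t_1)^2 + 2x/k}\right]$. The sketch below checks this closed form numerically and reproduces the percentage deviation quoted in eqs (2). The fitted parameters are not reported in the paper, so the values of $t_0$, $t_1$, $k$, $x$ used here are hypothetical and serve only to exercise the formula.

```python
import math

def peak_time(t0, t1, k, x):
    """Admissible root of dF/dt = 0 for eq. (1)."""
    return 0.5 * ((t0 - t1) + math.sqrt((t0 + t1) ** 2 + 2.0 * x / k))

def log_slope(t, t0, t1, k, x):
    """d(ln F)/dt for eq. (1); it vanishes at the peak."""
    return x / (t + t1) - 2.0 * k * (t - t0)

def percent_deviation(calculated, observed):
    """Relative deviation of the calculated value from the observed one, in %."""
    return 100.0 * (calculated - observed) / observed

# Hypothetical parameters, not the fitted values of the paper.
params = dict(t0=30.0, t1=5.0, k=0.002, x=4.0)
t_peak = peak_time(**params)
residual = log_slope(t_peak, **params)

# Peak values quoted in eqs (2).
deviation = percent_deviation(108897, 108014)
```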
The fractional time could in principle mean that the peak is reached between the 56th and 57th day after February 24th; actually, this result must be understood as a numerical output with purely statistical meaning. What is important is that the deviation between the calculated and observed numbers of infected is satisfactory. With the data used for the present calculations, the peak number $n_i^{max}$ of infected people is about 1.8 per thousand of the Italian population (taking roughly 60 million inhabitants).
At first sight this calculation seems redundant, since all experimental values of $n_i$ before and after the peak are already visible. Actually, these results are significant for the comparison with the corresponding extrapolated values of the section on early data analysis, where the predictive ability of analogous calculations based on a few early data only is assessed quantitatively.
The next plot, Figure 3, represents in an analogous way the numbers of recoveries and deaths, to complete the description of the global infection. The figure is obtained from the data corresponding to the days of Figure 1 via the respective best-fit functions
$$F_h = \frac{A\,(t+t_0)^3}{1 + B\,(t+t_1)^4}, \qquad F_d = A'\,\frac{C\,(t+t'_0) + D\,(t+t'_0)^2 + (t+t'_0)^3}{1 + B'\,(t+t'_1)^3}, \qquad (3)$$
where the capital letters and $t_{0,1}$ are the statistical parameters needed to calculate explicitly the time trends of $n_h$ and $n_d$. Figures 2 and 3 concern raw observed data, to show the actual possibility of representing them via mathematical functions. A further important point supporting the validity of the present approach is the statistical nature of the infection itself, which appears in fact confirmed by defining the function
$$\pi_h \pi_d = \frac{n_h}{n_i}\,\frac{n_d}{n_i} \qquad (4)$$
plotted in Figure 4. Assuming that both $n_h$ and $n_d$ are consequences of a unique cluster $n_i$ of infected individuals, $\pi_h$ and $\pi_d$ are definable as probabilities. The ordinate of the plot reports the joint probability that among $n_i$ cases of infection there can be both recoveries and deaths, whose respective probabilities are combined according to the law of independent events. Indeed it is reasonable to think that $\pi_h$ and $\pi_d$ concern independent classes of healthy and unhealthy, smoker and non-smoker individuals of different gender and age, whose immune defences imply different reactions to the infection.
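A minimal sketch of how the ordinate of Figure 4 is computed from the three daily counts follows. The function name and the sample counts are hypothetical; they are not read from Figure 1.

```python
def joint_probability(n_i, n_h, n_d):
    """Eq. (4): pi_h * pi_d = (n_h / n_i) * (n_d / n_i)."""
    if n_i <= 0:
        raise ValueError("n_i must be positive")
    return (n_h / n_i) * (n_d / n_i)

# Hypothetical counts for a single day (illustration only).
example = joint_probability(n_i=1000, n_h=200, n_d=50)
```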
On the one hand it is reasonable to think that πh and πd depend specifically upon the current clinical anamnesis of individuals with their own past stories, upon sex and all the other variables that contribute to cumulative (Figure 1). So, the collective trends of πh and πd concern in fact independent classes of healthy and unhealthy people.
On the other hand, from a statistical point of view, in a sample of many subjects the differences from one individual to another appear summarized in a single collective behaviour whose global result is uniquely definable, the one summarized in the figure 1 that is the target of the present paper.
Just the regular trend of this figure encourages thinking that, despite the wide variety of biological parameters triggered by the complex virus/human body interaction, it is anyway possible to identify a well-defined collective behaviour hidden in the apparent randomness of personal situations. The identification of this joint probability is interesting as it concerns the chance of combining the available data to infer more information and deserves a tentative explanation.
Let $\pi_1 = p_h P_d$ refer to an unhealthy patient, with small probability $p_h$ of surviving and thus high probability $P_d$ of dying, whereas $\pi_2 = P_h p_d$ concerns the opposite case of a healthy patient; $P_h$ and $P_d$ add further information to the left-hand side of eq (4), because now the probabilities $\pi_1$ and $\pi_2$ express explicitly the health states of the respective patients. Let the numbers $n_h$ be large enough to include all individuals who recover at the end of their therapeutic cycle, regardless of whether their initial health conditions are classifiable with low or high recovery probability; for example, some individuals recover without need of intensive therapy, whereas others unfortunately need invasive and prolonged hospital treatments but recover anyway. So the initial $p_h$ and $P_h$ are not crucial in themselves in determining the final recovery probability $n_h/n_i$; in other words, different patients classified as $p_h$ and $P_h$ merge into a global $n_h$ defining a unique $n_h/n_i$. An analogous consideration holds for $n_d/n_i$ regardless of $p_d$ and $P_d$: it is not true that only individuals in intensive therapy, i.e. those classified as $P_d$, die. So, when calculating $\pi_1 \pi_2$, the respective health conditions represented by the probabilities $p_h P_h$ and $p_d P_d$ are statistically equivalent to, and simply summarized by, $n_h$ and $n_d$ of eq (4). As a matter of fact, the curve of Figure 4 is well defined and thus explainable with $P_h p_h$ and $P_d p_d$ described by a unique time behaviour.
Although this conclusion was not strictly evident “a priori”, such a correlation is directly confirmed in Figure 5. The plot shows that the number of recoveries nh and deaths nd are both functions of time mutually correlated through their common origin from the number of infected ni only and thus are in fact uniquely definable: the implicit functions nh=H(ni) and nd=D(ni) merge together into the function
$$\frac{\text{recoveries}}{\text{deaths}} = A\,t + B\,t^4. \qquad (5)$$
The regression coefficients at the right of the equation, determined as described above, provide quantitative estimates of the progress of the contagion containment as a function of time.
The trend of $n_h$ vs $n_d$ shows that the numbers of deaths and recoveries are well correlated in the population infected by the virus, despite the diversity of the individuals. It is reasonable to guess that during the initial incubation period, soon after the virus attack, the $n_i$ patients are in a transient state with suspended probabilities $\pi_h$ and $\pi_d$ before they turn into either observed state, defined by $n_h$ or $n_d$; here too the solid line indicates the existence of a single ideal cumulative probability, without discriminating between serious patients recovered after intensive care and mild patients treated with simple drug therapy. The left-hand side of (5) represents the therapeutic efficacy; the different powers of $t$ on the right-hand side concern the short-term and long-term ratios, the second addend becoming predominant, as expected, at long times. So (5) measures the success of the full therapeutic cycle.
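The transition between the short-term and long-term regimes of eq (5) can be located explicitly: the two addends are equal when $A\,t = B\,t^4$, i.e. at $t = (A/B)^{1/3}$, beyond which the $t^4$ term dominates. The sketch below assumes $A > 0$ and $B > 0$, which the paper does not state, and uses hypothetical coefficients since the regression values are not reported.

```python
def recovery_death_ratio(t, A, B):
    """Right-hand side of eq. (5): A*t + B*t^4."""
    return A * t + B * t ** 4

def crossover_time(A, B):
    """Time at which A*t equals B*t^4, i.e. t = (A/B)^(1/3)."""
    if A <= 0 or B <= 0:
        raise ValueError("A and B must be positive for a real crossover")
    return (A / B) ** (1.0 / 3.0)

# Hypothetical coefficients (illustration only).
t_cross = crossover_time(A=8.0, B=1.0)
ratio_at_cross = recovery_death_ratio(t_cross, A=8.0, B=1.0)
```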
Herd immunity
The statistical model described so far has considered the raw $n_i$ of Figure 1 to describe in more detail, in Figure 2, the ramp-up law of the total number of infected people as a function of time up to and around the peak. Presumably in this time range no further considerations are necessary besides the obvious requirement that the data be homogeneous and statistically representative. Yet there is no reason to exclude long-term effects, such as the so-called “herd immunity”. While the mere numerical analysis of the data does not in itself contain epidemiological information, the latter should hopefully be inferable from the former: no statistical model can be considered exhaustive without the ability to extract information about this protective effect possibly hidden in the raw data. Since herd immunity is reasonably ineffective for the first initial data but becomes more important subsequently, the second half of the curve in Figure 6 has been drawn cautiously with a dashed line; without accounting for herd immunity, this part of the plot estimates in principle only the upper limit of the number of infections. Although the matching between the raw observed $n_i$ and the calculated curve is good enough to make corrections unnecessary, at least with the data presently available, this section shows how to account for and assess the expected deviation in the long-term data.
In principle, realistic values $n_i^{herd} \neq n_i$ accounting for the herd effect should be recalculated from the curve of Figure 6 by introducing a corrective factor $f(n_i) < 1$ at $t \gg t_{peak}$ through iterative calculations until convergence; however, this requires hypothesizing an average number of contacts among a selected group of infected and recovered individuals, to verify quantitatively the probability of a successful herd protection. In practice, an expression linking $n_i^{herd}$ and $n_i$ is found more simply by introducing a controllable correction factor that reasonably modifies the short-term results already obtained.
An appropriate strategy to represent the herd effect mathematically is
$$n_i^{herd} \rightarrow n_i\,f(n_i) = \frac{n_i}{1 + C\,n_i\,n_h} \qquad (6)$$
where $C$ is a proportionality constant; i.e. $n_i$ is reduced by a variable factor that becomes progressively more important as $n_i$ itself and the number $n_h$ of recoveries increase.
For $C = 0$, $f(n_i) \equiv 1$, i.e. the herd effect is zero.
If $C \neq 0$, then for $n_h \to \infty$ the function (6) is proportional to $n_h^{-1}$. That is, if the number of recoveries increases then the number of infected individuals tends to zero, which is precisely what the protective effect of herd protection requires: from a probabilistic point of view, a healthy person surrounded only by recovered or non-infected people has little chance of becoming infected.
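The two limiting cases above can be checked with a minimal sketch of the correction (6). The function name and the sample counts are hypothetical; the two non-zero values of $C$ are the test values used in the paper's simulation.

```python
def herd_corrected(n_i, n_h, C):
    """Eq. (6): n_i reduced by the factor 1 / (1 + C * n_i * n_h)."""
    return n_i / (1.0 + C * n_i * n_h)

# Hypothetical counts (illustration only).
n_i, n_h = 100_000, 50_000

no_effect = herd_corrected(n_i, n_h, C=0.0)    # C = 0: no herd effect
mild = herd_corrected(n_i, n_h, C=1e-11)       # weak correction
strong = herd_corrected(n_i, n_h, C=1e-10)     # stronger correction
```

With these sample counts the product $C\,n_i\,n_h$ equals 0.05 and 0.5 respectively, so a tenfold change of $C$ moves the corrected value from a few percent below $n_i$ to about two thirds of it.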
Eq (6) is calculable with $n_i$ and $n_h$ extrapolated from the short-term data, where the herd effect is ineffective. Although a heuristic “ab initio” calculation model would be preferable, this equation yields the sought information once more by best-fit statistical analysis of long-term infection data. In fact, the approach based on (6) is sensible only if the virus does not change: the early trend summarized by the short-term data is a fixed fingerprint of a given virus, whereas only the long-term probability of infection by contact with recovered neighbours changes. If instead the virus evolves as a function of time, then a unique $n_i$ curve is no longer representative; it should be replaced by two curves pertinent to the virus before and after modification, i.e. the herd effect is no longer definable via correction of a unique short-term curve. However, the analysis of the long-term curve could in principle provide updated information about the virus modification that occurred in the meantime; in this respect it is still sensible to check how the long-term curve is modified by the constant coefficient $C$, to infer information about the different short-term trend.
An example of simulated results is shown in Figure 6 with two purposely chosen test values of the constant $C$, i.e. $C = 10^{-10}$ and $C = 10^{-11}$.
This figure again reports, for comparison, the data of Figure 1 with and without the correction introduced by eq (6); indeed, the main plot is just that of Figure 2 plus an extrapolated part to be compared with the two dotted herd curves. The results with $C = 10^{-10}$ show $t_{max}$ equal to about 35 days, while the peak number of infected people $n^{max}$ decreases to about 70,000. It is clear that this value of $C$ predicts a too optimistic herd effect, which should already have been observed experimentally 50 days after February 24th; since this is not realistic, the simulation was repeated with $C = 10^{-11}$, which predicts the peak at 60 days with a number of infected equal to about 102,000. As expected, the alteration of the raw data of $n_i$ is practically zero at $t < t_{max}$; the differences with respect to Figure 2 become relevant only much later, after the peak. The dotted line representing the long-term extrapolation of the best-fit curve of $n_i$ suggests its tendency to approach that of $n_i^{herd}$. This tendency of the curves to merge supports the validity of this value of $C$.
It can be concluded that this constant can be either (i) predetermined in order to simulate a herd effect practically negligible initially but progressively increasing with the increase of the infected and of the recoveries or (ii) determined uniquely by the statistical treatment of long term “tail data”.
Early data analysis
This section concerns the results obtainable by analysing only the first 20 data of Figure 1, denoted $n_i^{red}$, where the superscript stands for the “reduced” number of early observed data. Figure 7 shows the result of the best-fit analysis of these few data only, which of course are insufficient to match closely the true trend of all observed data, also reported for comparison. Nonetheless, the peak time is still quite correct: it differs from the true peak time by about one day, whereas the related number of infected is underestimated by only about 10%. These results show that the main epidemiological features of the virus infection could have been realistically anticipated by a prompt analysis of statistical data. This is the true meaning of the present model.
The overall indication that emerges from the present statistical model is that even a minimal alteration of environmental data causes a drastic variation in the spread of the virus and its ability of contagion, which is quantified by the best fit parameters of the exponential eq (1). On the one hand, owing to the exponential form, even small changes of these parameters affect strongly the dependence of ni upon time. On the other hand, just this mathematical feature describing the contagion suggests that even elementary protection measures preventively actuated, e.g. safety distance between individuals or use of protection masks and so on, are effective in controlling the subsequent epidemic data. This conclusion appears evident in the Figure 6: even small changes of C, purposely introduced in the eq (6) to simulate a typical environmental effect such as herd protection, are enough to modify significantly the long-term evolution of the infection in progress. Also, the fig 2 evidences an anomalous dispersion of data just after the peak, which implies an initial decreasing rate of ni slower than predicted by the exponential of (1).
This is not surprising. It is known in general that when a strong effect predominantly governs the kinetics of a process, an external perturbation simply adds a minor spread of the data around the leading trend obtainable via statistical considerations only. An equilibrium state in nature is instead, by definition, the resultant of many counterbalancing effects, and is thus particularly sensitive to any external perturbation; specifically, the possible consequences of the previous health conditions of the patients and of their lifestyle appear around the peak time.
Another problem is that the strict containment measures initially put in place by the Government authorities were partially loosened, probably because the stage after the peak of infected patients was too optimistically misinterpreted as a sign of reduced danger of contagion. After this short stage, however, the further $n_i$ data resumed the behaviour predicted by eq (1).
It is commonly believed that the turning point of the infection advancement is identifiable with the peak time tmax of ni ; undoubtedly this is correct as concerns the effectiveness of the measures to prevent new contagions, but it does not highlight adequately the current state of the epidemic evolution.
Actually, the analysis so far carried out suggests the significance of the relationship between the incoming number ni of hospitalized patients and outcoming numbers nh and nd representing either possible evolution of ni after a successful or vain therapy cycle. Yet the experimental data of Figure 1 show the existence of a time such that ni−nh<0 , which also implies ni−(nh+nd)<0 ; the time evolution of ni and nh observed at increasing times suggests therefore the relevance of the number
$$N = n_i - (n_h + n_d)$$
that involves explicitly the effectiveness of the therapies on nh patients. Clearly N>0 implies overcrowding of hospitalized patients exceeding the chance of their efficient management; N<0 implies the decreasing rate of new infected patients with respect to their conversion rate to either nh or nd . If however the growing rate of ni decreases until becoming null at the peak time, while in the meantime the sum nh+nd continues to increase, then one expects first N=0 at a given t* and next even N<0 at t>t* . Hence, assuming that ni still evolves at t>tmax into nh(t) and nd(t) , it happens that: (i) ni tends to zero while contributing to the growth of nh , (ii) nd tends towards its final value of total deaths, (iii) N becomes more and more negative.
Figure 8 confirms these ideas: $N = 0$ at $t^* \approx 65$ days, after which $N < 0$ indicates a number of infected $n_i < n_h + n_d$ that progressively decreases.
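The determination of $t^*$ from tabulated data can be sketched as follows: compute $N$ day by day and return the first day on which it is no longer positive. The function names and the short toy series are hypothetical and are not the data of Figure 1 or Figure 8.

```python
def net_balance(n_i, n_h, n_d):
    """N = n_i - (n_h + n_d), evaluated day by day."""
    return [i - (h + d) for i, h, d in zip(n_i, n_h, n_d)]

def first_nonpositive_day(balance):
    """Return the 1-based day on which N first becomes <= 0, or None."""
    for day, value in enumerate(balance, start=1):
        if value <= 0:
            return day
    return None

# Hypothetical toy series (illustration only).
n_i = [10, 40, 80, 90, 70, 40]
n_h = [0, 5, 20, 50, 80, 110]
n_d = [0, 1, 4, 8, 12, 15]

N = net_balance(n_i, n_h, n_d)
t_star = first_nonpositive_day(N)
```

In this toy series $n_i$ peaks on day 4 while $N$ changes sign on day 5, which mirrors the point that $t^*$ need not coincide with $t_{max}$.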
On the one hand, it is not surprising that the time $t^*$ at which $N = 0$ does not necessarily coincide with $t_{max} \approx 57$ days, calculated considering $n_i$ only. On the other hand, it is clear why $N$, and not $n_i$, is the crucial indicator of the turning point between the aggressive stage of the virus and its progressive attenuation: when comparing $n_i$ and $n_h + n_d$, the sign of $N$ depends upon the relative rates at which the two terms grow.
All of this is explained in the Figure 9, showing that in effect N consists of three statistical distributions that end down with negative values; the three contributions of representative best fit functions resolved by the peak analysis are reasonably related to the three respective basic stages of contagion with . The peak splitting equation has the analytical form
$$A\,(t-t_1)^4 \exp\!\left[-a\,(t-34)^4\right] + B\,(t-t_2)^6 \exp\!\left[-b\,(t-42)^2\right] + C\,(t-t_3)^4 \exp\!\left[-c\,(t-45)^2\right] \qquad (7)$$
The statistical investigation provides reliable and significant indications on the epidemiological evolution of COVID-19, even extensible to extrapolated times. Although further data are still expected in the next days beyond the considered interval, the most critical aspects of the infection seem adequately represented even at the current end time.
The present information methodology has general character; it is valid and applicable without need of specific hypotheses besides the basic assumptions of any statistical model.
-The results show that the evolution of virus infection could have been predictable with reasonable approximation since the early days, which means good chance of organizing care structures and medical materials (e.g. face masks) useful in the most acute stage of the infection.
-The plots allow estimating quantitatively the efficacy of therapeutic protocols through the condition ni−nh−nd=0 , which holds twice: at the peak time of ni and before the onset of negative values ni−nh−nd<0 that defines the final stage of predominant contagion.
The approach is useful to forecast and plan future containment and coordination activities in support of the sick.
The author declares that there is no conflict of interest.
©2022 Tosto. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.