Loading [MathJax]/jax/output/CommonHTML/jax.js
Submit manuscript...
eISSN: 2378-315X

Biometrics & Biostatistics International Journal

Research Article Volume 12 Issue 4

The Pezeta regression model: an alternative to unit Lindley regression model

Lucas D Ribeiro Reis

Department of Economics, Federal University of Alagoas, Brazil

Correspondence: Lucas D Ribeiro Reis, Department of Economics, Federal University of Alagoas, Brazil

Received: July 18, 2023 | Published: August 4, 2023

Citation: Reis LDR. The Pezeta regression model: an alternative to unit Lindley regression model. Biom Biostat Int J. 2023;12(4):107-112. DOI: 10.15406/bbij.2023.12.00393

Download PDF

Abstract

A new probability distribution is proposed in this paper. This new distribution has support on the interval (0,1)  and was obtained after transforming the random variable with exponential distribution. The mode, quantile function, median, ordinary moments and density function belongs to exponential family of distributions are demonstrated. The maximum likelihood method is used to obtain the parameter estimate. A regression model for the median of the distribution is also proposed. Closed-form expressions for the score vector and Fisher’s information matrix are demonstrated. A simulation study and an application to real data showed the good performance of the proposed regression model.

Keywords: unit interval, exponential family, exponential distribution, mode, ordinary moments, regression model

Introduction

The probability density function (pdf) of a random variable W with exponential distribution is given by

r(w;λ)=λeλw,w>0,

where λ>0  is scale parameter.

Taking Y=1/(1+W) , the cdf and pdf of Y are

F(y;λ)=exp[λ(1y1)],0<y<1  and

f(y;λ)=λy2exp[λ(1y1)],0<y<1,   1

respectively.

Here, we will call the random variable with pdf (1) of Pezeta distribution, and denote this random variable as Ypezeta(λ) . The Figure 1 shows some forms of the density function (1) for selected values of λ . This figure reveals that the peseta distribution is unimodal, and may also present positive (when λ approaches 0 ) and negative (whenmoves away from ) asymmetry.

Figure 1 Some forms of the pdf (1), for special cases.

The first derivative of the log-pdf is

ζ(y)=ddyln(f(y;λ))=2y+λy2.  

Solving ζ(y)=0 , the mode of Y  is

mode(Y)={λ2,λ<2,1,λ2.

The r th ordinary moment of Y is

E(Yr)=10λyr2exp[λ(1y1)]dy=λeλEr(λ),

where En(x)=1znexzdz denotes the exponential integral function.1

By inverting F(y;λ)=p , the quantile function is given by

Q(p;λ)=[1λ1ln(p)]1,0<p<1.

The median is obtained when p=0.5 . So, the median of Y  is

median(Y)=[1λ1ln(0.5)]1.

Using the quantile function, the random variable

Y=[1λ1ln(V)]1 has density function (1), where V  is a uniform random variable over the interval (0,1) .

The paper is structured as follows. In Section 4, it is shown that the distribution belongs to the exponential distribution family. The mean and variance of the sufficient statistic are also presented. In Section 5, the maximum likelihood method to obtain the parameter estimate is presented. Analytical expressions for the bias correction of the maximum likelihood estimator are also presented. In Section 6, a new regression model is introduced. In Sections 7 and 8, numerical and empirical results are presented, respectively. Finally, Section 9 concludes the paper.

Exponential family

Let the random variable Y with pdf f(y;θ) , in which θ is the parameter that indexes the distribution. This random variable belongs to the exponential family if its pdf can be written as

f(y;θ)=h(y)exp[η(θ)t(y)b(θ)],   2

where the functions η(θ) , b(θ) , t(y) and h(y) assume values in subsets of the reals.

Note that, the pdf (1) can be written as

f(y;λ)=1y2exp[λ(1y1)(ln(λ))].  

that belongs to exponential family (2), where η(λ)=λ , t(y)=1y1 , b(λ)=ln(λ) and h(y)=1/y2 . Thus, by the factorization criterion t(y) is sufficient statistics for λ . The fact that Y belongs to exponential family, the mean and variance of t(Y) are given by

E[t(Y)]=1λ  and V[t(Y)]=1λ2,

respectively.

Maximum likelihood estimation

For a random sample of size n of the random variable Y with density function (1), the log-likelihood function for λ is given by

0(λ)=nln(λ)+λni=1(1y1i)2ni=1ln(yi).

The maximum likelihood estimator (MLE) of λ is the solution of

0(λ)λ=nλ+ni=1(1y1i)=0.

So, the MLE of λ is

ˆλ=nni=1(1y1i) .

The second derivative of 0(λ)  is given as

20(λ)λ2=nλ2<0,

showing that ˆλ  really is the point that a maximizes the function 0(λ) . It can be further shown that the variance and standard error of ˆλ  are expressed as V(ˆλ)=ˆλ2/n and se(ˆλ)=ˆλ/n , respectively.

MLE bias correction

Generally, when n is small, the MLEs tends to be biased. Here, a bias correction of the MLE of the parameter that indexes the Pezeta distribution will be presented. Here, the bias of ˆλ  can be expressed2 as

B(ˆλ)=V(λ)2(12κλλλ+κλλ,λ),

where κλλλ=E[d30(λ)dλ3]  and κλλ,λ=E[d20(λ)dλ2d0(λ)dλ] .

Note that κλλλ=2n/λ3  and

20(λ)λ20(λ)λ=n2λ3nλ2ni=1(1y1i).

From Section 4, follows that

ni=1E[(1y1i)]=ni=1E[t(yi)]=nλ,

resulting in

κλλ,λ=n2λ3nλ2E[ni=1(1y1i)]=n2λ3nλ2ni=1E[(1y1i)]=n2λ3+n2λ3=0 .

Thus, the bias of ˆλ  is

B(ˆλ)=(ˆλ2n)2(2n2ˆλ3)=λn.

Finally, it follows that the bias-corrected MLE ˆλ  is given by

ˆλBC=ˆλB(ˆλ)=ˆλ(11n)  .

The Pezeta regression model

Starting from the Pezeta distribution, in this section a new regression model will be introduced for the dependent variable with support at (0,1). This model has a regression structure on the median of the distribution. Thus, in the presence of outliers in the data, this new regression model has an advantage over regression models with a mean structure.

By taking median(Y)=τ and isolating for λ , results in

λ=ln(0.5)1τ1.

Under this parameterization, the density function (1) becomes

f(y;τ)=ln(0.5)y2(1τ1)exp[ln(0.5)1τ1(1y1)],0<y<1,   3

and the corresponding cdf and quantile function are given by

F(y;τ)=exp[ln(0.5)1τ1(1y1)],0<y<1   4

and

Q(p;τ)=[11τ1ln(0.5)ln(p)]1,0<p<1,

respectively, where 0<τ<1  denotes the median of Y.

The random variable Y with pdf (3) is denoted as Ypezeta(τ) . Some plots of the pdf (3) are shown in Figure 2. These plots reveal that the pdf can be asymmetric to the left and asymmetric to the right.

Figure 2 Some forms of the pdf (3), for special cases.

Here, the regression model for the median has the following regression structure

ηi=g(τi)=XTiβ .

where β=(β1,,βk) is k -vector of unknown parameters, xi=(xi1,,xik) is vector of k explanatory variables (k<n) , which are assumed fixed and known and ηi  is the linear predictor. For model with intercept, it is assumed that xi1=1, i . The g() is a link function strictly monotonic and twice differentiable, such that g:(0,1) . Examples of some link functions can be: (i) standard logistic quantile function g(τ)=ln[τ/(1τ)] ; and (ii) standard Cauchy quantile function g(τ)=tan(π(τ0.5)) .

From Equation (3) the log-likelihood function for a random sample of size n  is given by

 (β)=ni=1i(τi),  

where

i(τi)=ln(ln(0.5)1τ1i)+ln(0.5)1τ1i(1y1i)2ln(yi).

Differentiating i(τi)  with respect to τi 

i(τi)τi=1τiτ2iln(0.5)τ2i(1τ1i)2(1y1i)
=ai(1+˙yi),   5

 where ai=1/(τiτ2i)  and ˙yi=ln(0.5)(1y1i)/(1τ1i) .5 Since that E[(τi)/τi]=0 , then E[˙yi]=1 , i .

The differential total of (β) is given by

(β)βj=ni=1i(τi)τidτidηiηiβj   .6

Note that, dτi/dηi=1/g'(τi) and ηi/βj=xij , then the score vector of βj  is given by

(β)βj=ni=1ai(1+˙yi)g(τi)xij.

The score vector in matrix form is U(β)=XGv , where X is a n×k matrix whose i th row is xi,G=diag{1/g(τ1),1/g(τn)}   (diagonal matrix) and v=(a1(1+˙y1),,an(1+˙yn)) .

The MLE of β , say ˆβ , is the solution of U(β)=0 . There is no analytical solution for this nonlinear system, and so the MLE of β  must be obtained numerically, from iterative methods. However, these iterative methods require initial guesses for parameter values. As in Ribeiro-Reis,3 the initial guess for ˆβ  will be the ordinary least squares estimator of the regression g(y) on X , which is ˆβ(0)=(XX)1Xg(y) .

From Equation (6), the second derivative of (β) with respect to βl  is

2(β)βjβl=ni=1τi(i(τi)τi1g(τi)xij)dτidηiηiβl
=ni=1[2i(τi)τ2i1g(τi)xij+i(τi)τiτi(1g(τi))xij]1g(τi)xil.  

Once that E[i(τi)/τi]=0 , then

E[2(β)βjβl]=ni=1E[2i(τi)τ2i]1g'(τi)2xijxil.  

From Equation (5), follows that

2i(τi)τ2i=a'i(1+˙yi)+ai˙y'i,

where a'i=ai/τi  and ˙y'i=˙yi/τi .

Since that E[˙yi]=1 , then the expected value is

E[2i(τi)τ2i]=a'i(1+E[˙yi])+aiE[˙y'i]=aiE[˙y'i].

We still have to
˙y'i=ln(0.5)τ2i(1τ1i)2(1y1i)=ai˙yi,

resulting in E[˙y'i]=aiE[˙yi]=ai and hence E[2i(τi)τ2i]=a2i .

Finally,

E[2(β)βjβl]=ni=1a2ig'(τi)2xijxil.

Let P=diag{a21,,a2n} , the expression in matrix form is

E[2(β)βjβl]=XPG2X.

So, the Fisher expected information matrix is

K(β)=XPG2X.

Under the usual regularity conditions for MLEs, when the sample size is large,

βaNk(β,K(β)1),

where a  denotes asymptotic distribution. So, confidence intervals and hypothesis testing can be performed using the normal distribution. Based on asymptotic distribution, the 100(1α)%  confidence intervals for βj  is given by ˆβj±z(1α/2)θjj,j=1,k,

where z(1α/2)  is the (1α/2)  quantile of the standard normal distribution and θjj  denotes the j th diagonal element of the matrix K(β)1 .

Residuals

Residual analysis is a good indicator to tell if an estimated model is well-adjusted.3 If the residuals do not show an adequate behavior, then the estimated model is poor. Here, the Dunn-Smyth4 residuals will be addressed. The Dunn-Smyth residuals are defined as

ˆri=QN(F(yi;ˆτi)),

in which QN()  denotes the quantile function of the standard normal distribution and F(yi;ˆτi) is the cdf (4) evaluated in ˆτ . If the model is well estimated, then the Dunn-Smyth residuals are expected to have a random behavior around zero, with approximately 95% of the values falling within the range (2,2)  .5,6

Simulation

To show the performance of the MLEs for the proposed regression model, a numerical study using Monte Carlo simulations, with 10000 repetitions, is performed. The simulated regression model is given by

ln(τi1τi)=β1+β2xi2+β3xi3+β4xi3,

in which all explanatory variables x ’s are generated from the standard normal distribution. Three sample sizes n={50,100,300} are considered, with the true values of the parameters being:β1=1.7, β2=2.4, β3=0.9  and β4=4.2

The performance measures analyzed in the simulations will be based on the average estimates (AEs), mean squared errors (MSEs) and the 95% coverage rates (CRs) for the parameters. The simulations were done in the matrix programming language Ox Console.7

The simulation results are shown in Table 1. As can be seen, as the sample size increases, the MLEs and CRs converge to their true values, and the MSEs decrease. Thus, we can see the good performance of the estimates for the regression model introduced here.

n  

Parameter

AE

MSE

CR (95%)

50

β1  

1.740808

0.022806

93.79

  β2  

2.401397

0.039183

93.51

  β3  

0.900474

0.018413

93.35

  β4  

4.19932

0.041534

94.14

150

β1  

1.713127

0.006826

94.91

  β2  

2.402275

0.008375

94.59

  β3  

0.900578

0.005515

94.65

  β4  

4.20142

0.007643

94.55

300

β1  

1.707051

0.003405

94.97

  β2  

2.400968

0.003944

94.90

  β3  

0.90126

0.003165

94.93

  β4  

4.200144

0.00353

94.64

Table 1 Simulations results

Application

The Pezeta regression model is compared with the unit-Lindley (UL) regression model, which was introduced by Mazucheli et al.8 The density function of the UL model is given by

fUL(y;τ)=(1τ)2τ(1y)3exp{y(1τ)τ(1y)},0<y<1 , where 0<τ<1  denotes the mean of the distribution.

The data used here were analyzed by Smithson & Verkuilen.9 The response variable (y)  is the accuracy that presents scores on a test of reading accuracy taken by 44 children in Australian. The explanatory variables are dyslexia (x2)  and nonverbal intelligence quotient x3. The variable x2  is a categorical variable that takes value 1  if the child has dyslexia and value 0  if the child does not have dyslexia. The variable x3  is converted into z  scores. These data are available in the betareg package.10

The fitted model is given by

ln(τi1τi)=β1+β2xi2+β3xi3+β4(xi2×xi3),i=1,2,,44,

where τi  refers to the median for the Pezeta regression model and to the mean for the UL regression model.

To discriminate between the two regression models, the usual statistics were used: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC) and Hannan–Quinn Information Criterion (HQIC). The model that presents the smallest values of these statistics is chosen as a superior model for the data in question. The formulas for the AIC, BIC and HQIC statistics can be consulted at Ribeiro-Reis.6

All calculations in this application were made using the language Ox Console.7 The results of the estimates for the Pezeta and UL regression models are shown in Table 2. Note that the two models share the same sign for the parameter estimates. It is also noticed that all estimates of the coefficients for the Pezeta regression model are highly significant. In turn, in the UL regression model the β4  estimate was not statistically significant.

Parameter

Estimate

Std error

z  -value

p  -value

   

Pezeta

   
β1  

1.98376

0.235234

8.433134

0.000000

β2  

1.248062

0.376479

3.315087

0.000916

β3  

1.204827

0.249365

4.831581

0.000001

 β4  

1.256459

0.375852

3.34296

0.000829

   

 UL

   
β1  

3.18122

0.166414

19.11631

0.000000

β2  

3.169079

0.27772

11.41107

0.000000

β3  

0.293898

0.176392

1.666164

0.095681

β4  

0.358143

0.275959

1.297811

0.194352

Table 2 Summary estimates for Pezeta and UL regression models

The statistics for the choices of the two models are in Table 3. It is noted that all three statistics have their lowest values for the Pezeta regression model, indicating that this model is more appropriate for the data in question.

Model

      AIC

     BIC

    HQIC

Pezeta

80.1893

73.0526

77.5427

UL

76.5169

69.3802

  73.8703

Table 3 Information criteria

The Dunn-Smyth residuals, with their respective simulated envelopes, for the Pezeta and UL regression models are shown in Figures 3 & 4, respectively. It is verified that the residuals for the Pezeta model presents a more random behavior around zero, than the UL model. The simulated envelope corroborates this, since in the Pezeta model there are only  of the observations outside the simulated envelope. In contrast, in the UL model, the number of observations outside the simulated envelope is 72.73%, indicating the poor fit of the UL model.

Figure 3 Dunn-Smyth residuals for Pezeta regression model.
(a) residuals versus index
(b) simulated envelope

Figure 4 Dunn-Smyth residuals for unit Lindley regression model.
(a) residuals versus index
(b) simulated envelope

Conclusions

In this paper, a new probability distribution with support on the interval (0,1)  was proposed. This new distribution is obtained through a transformation of the random variable with exponential distribution. Several properties were discussed, such as mode, ordinary moments, quantile function, random number generation, exponential family and maximum likelihood estimation (with and without bias correction).

Subsequently, a regression model for the dependent variable in the unit interval was introduced. The regression is structured on the median of the distribution, which means that, in the presence of outliers in the data, the proposed regression model is more robust than the regression models with structure on the mean. The maximum likelihood method is considered for parameter estimation. Analytical expressions are obtained for the score vector and for the Fisher information matrix. Fisher’s information matrix is very important to obtain the standard errors of the estimated coefficients.

A simulation study on finite samples showed that the maximum likelihood estimators are consistent, indicating that as the sample size increases, the estimators converge to their true parameters. An application to real data is also made, to show the usefulness of the model in practice. The proposed regression model is compared with the unit Lindley regression model. The results showed that the regression model proposed in this paper is superior to the unit Lindley regression model.

Suggestions for future research can be: (i) bias correction for the estimated coefficients of the regression model; (ii) introduce the version of the regression model for time series.

Acknowledgments

None.

Conflicts of interest

The author declare that there is no conflicts of interest.

Funding

None.

References

Creative Commons Attribution License

©2023 Reis. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.