Research Article Volume 12 Issue 4
The Pezeta regression model: an alternative to unit Lindley regression model
Lucas D Ribeiro Reis
Department of Economics, Federal University of Alagoas, Brazil
Correspondence: Lucas D Ribeiro Reis, Department of Economics, Federal University of Alagoas, Brazil
Received: July 18, 2023 | Published: August 4, 2023
Citation: Reis LDR. The Pezeta regression model: an alternative to unit Lindley regression model. Biom Biostat Int J. 2023;12(4):107-112. DOI: 10.15406/bbij.2023.12.00393
Abstract
A new probability distribution is proposed in this paper. This new distribution has support on the interval
and is obtained by transforming a random variable with exponential distribution. The mode, quantile function, median and ordinary moments are derived, and the density function is shown to belong to the exponential family of distributions. The maximum likelihood method is used to estimate the parameter. A regression model for the median of the distribution is also proposed. Closed-form expressions for the score vector and Fisher’s information matrix are derived. A simulation study and an application to real data show the good performance of the proposed regression model.
Keywords: unit interval, exponential family, exponential distribution, mode, ordinary moments, regression model
Introduction
The probability density function (pdf) of a random variable
with exponential distribution is given by
where
is scale parameter.
Taking
, the cdf and pdf of
are
and
1
respectively.
We call the distribution with pdf (1) the Pezeta distribution, and denote the corresponding random variable as
. Figure 1 shows some forms of the density function (1) for selected values of
. This figure reveals that the Pezeta distribution is unimodal and may present positive (when
approaches
) and negative (when moves away from ) asymmetry.
Figure 1 Some forms of the pdf (1), for special cases.
The first derivative of the log-pdf is
Solving
, the mode of
is
The
th ordinary moment of
is
where
denotes the exponential integral function.1
By inverting
, the quantile function is given by
The median is obtained when
. So, the median of
is
Using the quantile function, the random variable
has density function (1), where
is a uniform random variable over the interval
.
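The inversion idea used above can be sketched in code. Since the Pezeta quantile function is not reproduced in this text, the sketch applies the same technique to the parent exponential distribution, assuming the scale parameterization, for which the quantile function is Q(u) = -scale * log(1 - u). The function name `rexp_inversion`, the seed and the value of `scale` are illustrative choices, not part of the paper.

```python
import math
import random

def rexp_inversion(n, scale, rng):
    """Draw n exponential variates by inversion: Q(u) = -scale * log(1 - u)."""
    # rng.random() lies in [0, 1), so 1 - u lies in (0, 1] and the log is finite.
    return [-scale * math.log(1.0 - rng.random()) for _ in range(n)]

rng = random.Random(2023)  # fixed seed so the run is reproducible
sample = rexp_inversion(100_000, scale=2.0, rng=rng)
sample_mean = sum(sample) / len(sample)  # should be close to the scale, 2.0
```

Replacing the exponential quantile function by the Pezeta quantile function would give Pezeta variates in the same way.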
The paper is structured as follows. In Section 4, it is shown that the distribution belongs to the exponential family of distributions, and the mean and variance of the sufficient statistic are presented. In Section 5, maximum likelihood estimation of the parameter is presented, together with analytical expressions for the bias correction of the maximum likelihood estimator. In Section 6, a new regression model is introduced. In Sections 7 and 8, numerical and empirical results are presented, respectively. Finally, Section 9 concludes the paper.
Exponential family
Let the random variable
with pdf
, in which
is the parameter that indexes the distribution. This random variable belongs to the exponential family if its pdf can be written as
2
where the functions
,
,
and
assume values in subsets of the reals.
Note that the pdf (1) can be written as
which belongs to the exponential family (2), where
,
,
and
. Thus, by the factorization criterion,
is a sufficient statistic for
. Since
belongs to the exponential family, the mean and variance of
are given by
and
respectively.
Maximum likelihood estimation
For a random sample of size
of the random variable
with density function (1), the log-likelihood function for
is given by
The maximum likelihood estimator (MLE) of
is the solution of
So, the MLE of
is
.
The second derivative of
is given as
showing that
is indeed the point that maximizes the function
. It can further be shown that the variance and standard error of
are expressed as
and
, respectively.
MLE bias correction
Generally, when
is small, MLEs tend to be biased. Here, a bias correction for the MLE of the parameter that indexes the Pezeta distribution is presented. The bias of
can be expressed2 as
where
and
.
Note that
and
From Section 4, it follows that
resulting in
.
Thus, the bias of
is
Finally, it follows that the bias-corrected MLE
is given by
.
The Pezeta regression model
Starting from the Pezeta distribution, this section introduces a new regression model for a dependent variable with support on (0,1). The model has a regression structure on the median of the distribution. Thus, in the presence of outliers in the data, this new regression model has an advantage over regression models with a mean structure.
Taking
and solving for
yields
Under this parameterization, the density function (1) becomes
3
and the corresponding cdf and quantile function are given by
4
and
respectively, where
denotes the median of
The random variable
with pdf (3) is denoted as
. Some plots of the pdf (3) are shown in Figure 2. These plots reveal that the pdf can be asymmetric to the left and asymmetric to the right.
Figure 2 Some forms of the pdf (3), for special cases.
Here, the regression model for the median has the following regression structure
.
where
is a
-vector of unknown parameters,
is a vector of
explanatory variables
, which are assumed fixed and known, and
is the linear predictor. For a model with an intercept, it is assumed that
. The function
is a strictly monotonic and twice differentiable link function, such that
. Examples of link functions are: (i) the standard logistic quantile function
; and (ii) the standard Cauchy quantile function
.
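The two link functions named above can be written out as a short sketch, using the standard definitions of the logistic and Cauchy quantile functions. The function names (`logit`, `cauchit` and their inverses) and the value of `eta` are illustrative, not the paper's notation.

```python
import math

def logit(mu):
    """Standard logistic quantile function: maps (0, 1) onto the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse of logit: maps a linear predictor back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def cauchit(mu):
    """Standard Cauchy quantile function: maps (0, 1) onto the real line."""
    return math.tan(math.pi * (mu - 0.5))

def inv_cauchit(eta):
    """Inverse of cauchit: maps a linear predictor back to (0, 1)."""
    return 0.5 + math.atan(eta) / math.pi

eta = 0.8  # a hypothetical value of the linear predictor
median_logit = inv_logit(eta)
median_cauchit = inv_cauchit(eta)
```

Both links are strictly increasing and twice differentiable on (0, 1), so the inverse link always returns a valid median in the unit interval.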
From Equation (3) the log-likelihood function for a random sample of size
is given by
where
Differentiating
with respect to
5
where
and
.5 Since
, then
,
.
The total differential of
is given by
.6
Since
and
, the score vector of
is given by
The score vector in matrix form is
, where
is a
matrix whose
th row is
(diagonal matrix) and
.
The MLE of
, say
, is the solution of
. There is no analytical solution for this nonlinear system, so the MLE of
must be obtained numerically by iterative methods. These methods require initial guesses for the parameter values. As in Ribeiro-Reis,3 the initial guess for
is the ordinary least squares estimator from the regression of
on
, which is
.
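A minimal sketch of this starting value follows, assuming that the regression is of the link-transformed responses on the covariates, so that the initial guess is (X'X)^{-1} X'z with z_i = g(y_i). The helper names `solve_linear` and `ols_start`, the design matrix and the coefficient values are hypothetical; the responses are noise-free, so the true coefficients are recovered exactly.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols_start(X, y, link):
    """Initial guess (X'X)^{-1} X'z, with z_i = link(y_i) (assumed form)."""
    z = [link(v) for v in y]
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xtz = [sum(row[j] * zi for row, zi in zip(X, z)) for j in range(p)]
    return solve_linear(xtx, xtz)

logit = lambda u: math.log(u / (1.0 - u))
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # intercept + one covariate
beta_true = [0.5, 1.0]                                  # hypothetical coefficients
y = [1.0 / (1.0 + math.exp(-(beta_true[0] + beta_true[1] * row[1]))) for row in X]
beta0 = ols_start(X, y, logit)  # starting point for the iterative maximization
```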
From Equation (6), the second derivative of
with respect to
is
Since
, then
From Equation (5), it follows that
where
and
.
Since
, the expected value is
We also have
resulting in
and hence
.
Finally,
Let
, the expression in matrix form is
So, the Fisher expected information matrix is
Under the usual regularity conditions for MLEs, when the sample size is large,
where
denotes asymptotic distribution. Hence, confidence intervals and hypothesis tests can be based on the normal distribution. Based on this asymptotic distribution, the
confidence interval for
is given by
where
is the
quantile of the standard normal distribution and
denotes the
th diagonal element of the matrix
.
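The normal-based inference described above can be illustrated with a short sketch. It takes one estimate and its standard error exactly as they are printed in Table 2 of the Application section and computes the Wald statistic, the two-sided p-value and the 95% confidence interval. The function name `wald_summary` is illustrative.

```python
from statistics import NormalDist

def wald_summary(estimate, std_error, level=0.95):
    """Wald z statistic, two-sided p-value and normal-based confidence interval."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - (1.0 - level) / 2.0)
    z_value = estimate / std_error
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z_value)))
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    return z_value, p_value, (lower, upper)

# Second Pezeta row of Table 2, as printed: estimate 1.248062, std error 0.376479.
z_value, p_value, ci = wald_summary(1.248062, 0.376479)
```

Up to rounding, the computed statistic and p-value agree with the printed 3.315087 and 0.000916, which suggests that the reported p-values are two-sided and based on the standard normal distribution.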
Residuals
Residual analysis helps to assess whether an estimated model fits well.3 If the residuals do not behave adequately, the fit is poor. Here, the Dunn-Smyth4 residuals are considered. The Dunn-Smyth residuals are defined as
in which
denotes the quantile function of the standard normal distribution and
is the cdf (4) evaluated at
. If the model is well estimated, the Dunn-Smyth residuals are expected to behave randomly around zero, with approximately 95% of the values falling within the range
.5,6
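As a sketch of the computation, the residual for a continuous response is the standard normal quantile of the fitted cdf evaluated at the observation. Because the cdf (4) is not reproduced in this text, the example below uses a stand-in cdf on (0, 1), a power-function distribution parameterized by its median. The stand-in, the function names and the data are assumptions for illustration only.

```python
import math
from statistics import NormalDist

def dunn_smyth_residuals(y, medians, cdf):
    """r_i = Phi^{-1}(F(y_i; median_i)) for a continuous response."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(cdf(yi, mi)) for yi, mi in zip(y, medians)]

def power_cdf(y, median):
    """Stand-in cdf on (0, 1) with the given median; NOT the Pezeta cdf (4)."""
    shape = math.log(0.5) / math.log(median)  # chosen so that F(median) = 0.5
    return y ** shape

y = [0.20, 0.55, 0.90]               # hypothetical observed responses
fitted_medians = [0.30, 0.55, 0.70]  # hypothetical fitted medians
residuals = dunn_smyth_residuals(y, fitted_medians, power_cdf)
```

An observation equal to its fitted median gives a residual of zero, observations below the median give negative residuals, and observations above it give positive ones.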
Simulation
To show the performance of the MLEs for the proposed regression model, a numerical study using Monte Carlo simulations, with 10000 repetitions, is performed. The simulated regression model is given by
in which all explanatory variables
’s are generated from the standard normal distribution. Three sample sizes
are considered, with the true values of the parameters being:
and
The performance measures analyzed in the simulations will be based on the average estimates (AEs), mean squared errors (MSEs) and the 95% coverage rates (CRs) for the parameters. The simulations were done in the matrix programming language Ox Console.7
The simulation results are shown in Table 1. As the sample size increases, the AEs approach the true parameter values, the CRs approach the nominal 95% level, and the MSEs decrease. This indicates good finite-sample performance of the MLEs for the regression model introduced here.
| Sample size | Parameter | AE | MSE | CR (95%) |
| 50 | | 1.740808 | 0.022806 | 93.79 |
| | | 2.401397 | 0.039183 | 93.51 |
| | | 0.900474 | 0.018413 | 93.35 |
| | | 4.19932 | 0.041534 | 94.14 |
| 150 | | 1.713127 | 0.006826 | 94.91 |
| | | 2.402275 | 0.008375 | 94.59 |
| | | 0.900578 | 0.005515 | 94.65 |
| | | 4.20142 | 0.007643 | 94.55 |
| 300 | | 1.707051 | 0.003405 | 94.97 |
| | | 2.400968 | 0.003944 | 94.90 |
| | | 0.90126 | 0.003165 | 94.93 |
| | | 4.200144 | 0.00353 | 94.64 |
Table 1 Simulation results
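The three performance measures can be computed as in the following sketch. Fitting the Pezeta regression model requires the density (3), which is not reproduced in this text, so the sketch uses a stand-in problem: estimating the mean of a normal sample with a normal-based 95% interval. Only the bookkeeping for AE, MSE and CR carries over; the function name, the seed and all numerical settings are assumptions.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def monte_carlo_summary(true_value, n, reps, rng):
    """Average estimate, mean squared error and 95% coverage rate (in %)."""
    z = NormalDist().inv_cdf(0.975)
    estimates, covered = [], 0
    for _ in range(reps):
        # Stand-in estimator: sample mean of N(true_value, 1) data.
        sample = [rng.gauss(true_value, 1.0) for _ in range(n)]
        est = mean(sample)
        se = stdev(sample) / math.sqrt(n)
        estimates.append(est)
        covered += (est - z * se <= true_value <= est + z * se)
    ae = mean(estimates)
    mse = mean((e - true_value) ** 2 for e in estimates)
    cr = 100.0 * covered / reps
    return ae, mse, cr

rng = random.Random(12345)  # fixed seed for reproducibility
ae, mse, cr = monte_carlo_summary(true_value=1.7, n=50, reps=2000, rng=rng)
```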
Application
The Pezeta regression model is compared with the unit-Lindley (UL) regression model, which was introduced by Mazucheli et al.8 The density function of the UL model is given by
, where
denotes the mean of the distribution.
The data used here were analyzed by Smithson & Verkuilen.9 The response variable
is accuracy, the score on a test of reading accuracy taken by 44 children in Australia. The explanatory variables are dyslexia
and nonverbal intelligence quotient
The variable
is a categorical variable that takes value 1 if the child has dyslexia and value 0 if the child does not have dyslexia. The variable
is converted into
scores. These data are available in the betareg package.10
The fitted model is given by
where
refers to the median for the Pezeta regression model and to the mean for the UL regression model.
To discriminate between the two regression models, the usual statistics were used: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the Hannan–Quinn Information Criterion (HQIC). The model with the smallest values of these statistics is preferred for the data in question. The formulas for the AIC, BIC and HQIC statistics can be found in Ribeiro-Reis.6
All calculations in this application were made using the language Ox Console.7 The results of the estimates for the Pezeta and UL regression models are shown in Table 2. Note that the two models share the same sign for the parameter estimates. It is also noticed that all estimates of the coefficients for the Pezeta regression model are highly significant. In turn, in the UL regression model the
estimate was not statistically significant.
| Parameter | Estimate | Std error | z-value | p-value |
| Pezeta | | | | |
| | 1.98376 | 0.235234 | 8.433134 | 0.000000 |
| | 1.248062 | 0.376479 | 3.315087 | 0.000916 |
| | 1.204827 | 0.249365 | 4.831581 | 0.000001 |
| | 1.256459 | 0.375852 | 3.34296 | 0.000829 |
| UL | | | | |
| | 3.18122 | 0.166414 | 19.11631 | 0.000000 |
| | 3.169079 | 0.27772 | 11.41107 | 0.000000 |
| | 0.293898 | 0.176392 | 1.666164 | 0.095681 |
| | 0.358143 | 0.275959 | 1.297811 | 0.194352 |
Table 2 Summary estimates for Pezeta and UL regression models
The model selection statistics are reported in Table 3. All three statistics take their lowest values for the Pezeta regression model, indicating that this model is more appropriate for the data in question.
| Model | AIC | BIC | HQIC |
| Pezeta | -80.1893 | -73.0526 | -77.5427 |
| UL | -76.5169 | -69.3802 | -73.8703 |
Table 3 Information criteria
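The three criteria can be checked with a short sketch, using the standard definitions AIC = -2l + 2k, BIC = -2l + k log(n) and HQIC = -2l + 2k log(log(n)), where l is the maximized log-likelihood. The sketch assumes k = 4 estimated coefficients and n = 44 observations, and reads the Table 3 entries as negative numbers. That reading is an inference: only with negative signs do the entries satisfy BIC - AIC = k(log(n) - 2) and favor the Pezeta model, as the text states. The log-likelihood itself is not reported and is back-calculated here from the AIC.

```python
import math

def information_criteria(loglik, k, n):
    """AIC, BIC and HQIC from a maximized log-likelihood (standard definitions)."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    hqic = -2.0 * loglik + 2.0 * k * math.log(math.log(n))
    return aic, bic, hqic

# Assumption: log-likelihood recovered from AIC = -80.1893 with k = 4.
loglik_pezeta = (2.0 * 4 - (-80.1893)) / 2.0
aic, bic, hqic = information_criteria(loglik_pezeta, k=4, n=44)
```

With these inputs the BIC and HQIC come out at about -73.0525 and -77.5426, matching the Pezeta row of Table 3 up to rounding.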
The Dunn-Smyth residuals, with their respective simulated envelopes, for the Pezeta and UL regression models are shown in Figures 3 & 4, respectively. The residuals for the Pezeta model behave more randomly around zero than those for the UL model. The simulated envelope corroborates this, since in the Pezeta model there are only of the observations outside the simulated envelope. In contrast, in the UL model, 72.73% of the observations lie outside the simulated envelope, indicating the poor fit of the UL model.
Figure 3 Dunn-Smyth residuals for Pezeta regression model.
(a) residuals versus index
(b) simulated envelope
Figure 4 Dunn-Smyth residuals for unit Lindley regression model.
(a) residuals versus index
(b) simulated envelope
Conclusions
In this paper, a new probability distribution with support on the interval
was proposed. This new distribution is obtained through a transformation of the random variable with exponential distribution. Several properties were discussed, such as mode, ordinary moments, quantile function, random number generation, exponential family and maximum likelihood estimation (with and without bias correction).
Subsequently, a regression model for the dependent variable in the unit interval was introduced. The regression is structured on the median of the distribution, which means that, in the presence of outliers in the data, the proposed regression model is more robust than the regression models with structure on the mean. The maximum likelihood method is considered for parameter estimation. Analytical expressions are obtained for the score vector and for the Fisher information matrix. Fisher’s information matrix is very important to obtain the standard errors of the estimated coefficients.
A simulation study on finite samples showed that the maximum likelihood estimators are consistent, indicating that as the sample size increases, the estimators converge to their true parameters. An application to real data is also made, to show the usefulness of the model in practice. The proposed regression model is compared with the unit Lindley regression model. The results showed that the regression model proposed in this paper is superior to the unit Lindley regression model.
Suggestions for future research are: (i) bias correction for the estimated coefficients of the regression model; and (ii) a version of the regression model for time series.
Acknowledgments
Conflicts of interest
The author declares that there are no conflicts of interest.
Funding
References
©2023 Reis. This is an open access article distributed under the terms of the,
which
permits unrestricted use, distribution, and build upon your work non-commercially.