Research Article Volume 6 Issue 1
Department of Biostatistics Program, LSUHSC School of Public Health, USA
Correspondence: Evrim Oral, Biostatistics Program, LSUHSC School of Public Health, 2020 Gravier Street, 3rd floor New Orleans, LA 70112, USA, Tel (504) 568-6094
Received: April 25, 2017 | Published: May 25, 2017
Citation: Oral E. Modified maximum likelihood estimation in Poisson regression. Biom Biostat Int J. 2017;6(1):249-254. DOI: 10.15406/bbij.2017.06.00154
In Generalized Linear Models, the likelihood equations are intractable and have no explicit solutions; they must therefore be solved by Newton-type algorithms. Solving these equations by iteration, however, can be problematic: the iterations might converge to wrong values, or they might not converge at all. In this study, we derive the modified maximum likelihood estimators for the Poisson regression model and study their properties. We also investigate the robustness of these estimators when there are outliers in the covariates.
Keywords: count data, Poisson regression, modified maximum likelihood, Newton-type algorithms, Dixon's outlier model
Poisson regression is widely used for modeling count data, especially when there is no over- or under-dispersion.1 Since the likelihood equations from this model are intractable, solving them requires iterative methods such as Newton-Raphson or Fisher scoring. However, using iterative methods to find maximum likelihood estimators (MLEs) can be problematic in general, and time should be spent investigating the stability of such solutions.2-4 Specifically, the following difficulties can arise: the iterations might converge to wrong values if the likelihood equations have multiple roots, or the iterations might not converge at all. See5,6 for discussions of situations where these difficulties are encountered in solving for the MLEs. Common software can be sensitive to the numerical iterations; in Stata, for example, researchers have reported problems in obtaining Poisson regression estimates with the "poisson" command, which can fail to locate the maximum and does not converge.7 Note that in Poisson regression modeling additional problems might occur. The most common is over- or under-dispersion in the data, in which case a more flexible model such as negative binomial regression is more appropriate. The second is analogous to the complete or quasi-complete separation problem in binary regression: the MLEs may not exist for certain data configurations; see7,8. In this study we consider neither of these problems and focus only on the case where it is appropriate to model the data with Poisson regression.
Unlike the maximum likelihood (ML) technique, the modified maximum likelihood (MML) methodology produces explicit estimates. MML achieves this by linearizing the intractable functions in the likelihood equations using order statistics.9 Asymptotically, MML estimators (MMLEs) are known to be unbiased and to have minimum variances, i.e. they are fully efficient. For small sample sizes, MMLEs have negligible bias and variances only marginally bigger than the minimum variance bounds, i.e. they are highly efficient.10-15
In the GLM setting, Tiku and Vaughan3 used the MML methodology to extend the techniques of traditional logistic regression to non-logistic density functions. Oral and Gunay16 and Oral17 later extended the work in3 to the binary regression model with one stochastic covariate. Oral18 derived the MMLEs in general GLMs with a canonical link when there is only one risk factor. In this study, we derive the explicit MMLEs of the Poisson regression model, generalize the derivations to more than one covariate, and study their robustness properties via simulations.
The univariate Poisson regression model is given by
E(Y_i | X_i = x_i) = μ_i = exp(z_i) (1)
where z_i = α + βx_i for 1 ≤ i ≤ n, and the outcome Y has the probability distribution
f_Y(y_i) = exp(−μ_i) μ_i^{y_i} / y_i!, y_i = 0, 1, 2, ... (2)
Note that in equation (2), Y is presumed to increase with X, so that β is a priori greater than zero. For the model given in (1)-(2), the log-likelihood function of the random sample (y_i, x_i) can be written as
ln L ∝ ∑_{i=1}^{n} y_i z_i − ∑_{i=1}^{n} g(z_i), (3)
where g(z_i) = exp(z_i). The likelihood equations for estimating α and β do not have explicit solutions because of the nonlinear function g(z_i), 1 ≤ i ≤ n. To obtain the MMLEs, we first express the likelihood equations in terms of the ordered variates z_(1) ≤ z_(2) ≤ ... ≤ z_(n). The likelihood equations can be re-written as
∂ln L/∂α = ∑_{i=1}^{n} {y_[i] − g(z_(i))} = 0 (4)
and
∂ln L/∂β = ∑_{i=1}^{n} x_(i) {y_[i] − g(z_(i))} = 0, (5)
where y_[i] is the concomitant of x_(i) (1 ≤ i ≤ n) and g(z_(i)) = exp(z_(i)). Linearizing the intractable function g(z_(i)) by using the first two terms of its Taylor series expansion around the population quantiles t_(i) = E(z_(i)) (1 ≤ i ≤ n), we find
g(z_(i)) ≅ a_i + b_i z_(i), (6)
where a_i = exp(t_(i)){1 − t_(i)} and b_i = exp(t_(i)) for 1 ≤ i ≤ n. In order to calculate the t_(i) = E(z_(i)) values, we define a dummy random variable U with the probability density function
f(u) = exp(u), u < 0, (7)
and re-write the model (1) as
E(Y_i | X_i = x_i) = μ_i = F(z_i), (8)
where F(u) = exp(u), u < 0. Thus, the t_(i) values can be obtained from the equation
t_(i) = ln(i/(n + 1)) (9)
for 1 ≤ i ≤ n; asymptotically, z_(i) − t_(i) ≅ 0. Alternatively, when n is large, one can utilize the standard normal distribution:
t_(i) = Φ⁻¹(i/(n + 1)). (10)
Incorporating (6) into (4)-(5) and solving the resulting modified likelihood equations yields the explicit MMLEs below:
α̂ = δ/m − β̂ x̄_a and β̂ = ∑_{i=1}^{n} δ_i (x_(i) − x̄_a) / ∑_{i=1}^{n} b_i (x_(i) − x̄_a)², (10)
where
δ = ∑_{i=1}^{n} δ_i, δ_i = y_[i] − a_i, m = ∑_{i=1}^{n} b_i, (11)
and
x̄_a = (∑_{i=1}^{n} b_i x_(i)) / m. (12)
The MMLEs derived above are asymptotically equivalent to their corresponding MLEs and therefore share the same attractive asymptotic properties; however, one can refine the estimates by re-calculating the a_i and b_i values, replacing the theoretical population quantiles with their estimated values t_(i) = α̂ + β̂ x_(i) (1 ≤ i ≤ n). This process may be repeated until a desired convergence criterion is met; stabilization is generally reached within a few iterations.
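The estimator and revision scheme of equations (6)-(12) can be sketched in a few lines. The following is a minimal illustration assuming NumPy, with illustrative names; it is not the author's FORTRAN implementation:

```python
import numpy as np

def mml_poisson(x, y, tol=1e-8, max_rev=50):
    # Univariate Poisson-regression MMLEs, equations (6)-(12);
    # illustrative sketch, not the author's FORTRAN program.
    n = len(x)
    order = np.argsort(x)
    xs, ys = x[order], y[order]                # x_(i) and its concomitant y_[i]
    t = np.log(np.arange(1, n + 1) / (n + 1))  # initial t_(i), equation (9)
    alpha = beta = 0.0
    for _ in range(max_rev):
        a = np.exp(t) * (1.0 - t)              # a_i from the linearization (6)
        b = np.exp(t)                          # b_i
        m = b.sum()
        xbar_a = (b * xs).sum() / m            # equation (12)
        d = ys - a                             # delta_i, equation (11)
        beta_new = (d * (xs - xbar_a)).sum() / (b * (xs - xbar_a) ** 2).sum()
        alpha_new = d.sum() / m - beta_new * xbar_a
        done = max(abs(alpha_new - alpha), abs(beta_new - beta)) < tol
        alpha, beta = alpha_new, beta_new
        if done:
            break
        t = alpha + beta * xs                  # revise t_(i) with estimated values
    return alpha, beta
```

Each pass through the loop is one "revision" in the sense described above: the closed-form solutions (10)-(12) are evaluated, and the population quantiles are replaced by their estimates.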
Asymptotic variances and covariances
Vaughan and Tiku15 proved rigorously that the MMLEs are asymptotically unbiased with variances and covariances exactly the same as those of the MLEs. In the present situation, therefore, the asymptotic variances and covariance of α̂ and β̂ are given by I⁻¹(α, β), where I is the Fisher information matrix with elements −E(∂²ln L*/∂α²), −E(∂²ln L*/∂α∂β), and −E(∂²ln L*/∂β²). From the modified likelihood equations
∂ln L/∂α ∝ ∂ln L*/∂α = ∑_{i=1}^{n} {y_i − (a_i + b_i z_i)} = 0,
∂ln L/∂β ∝ ∂ln L*/∂β = ∑_{i=1}^{n} x_i {y_i − (a_i + b_i z_i)} = 0,
the inverse of the Fisher information matrix can be easily obtained as
V = I⁻¹(α, β) = [ ∑_{i=1}^{n} Q_i      ∑_{i=1}^{n} Q_i x_i
                  ∑_{i=1}^{n} Q_i x_i  ∑_{i=1}^{n} Q_i x_i² ]⁻¹, (13)
where Q_i = exp(z_i). V is estimated by replacing Q_i with its estimate Q̂_i = exp(ẑ_i), ẑ_i = α̂ + β̂ x_i (1 ≤ i ≤ n). Since the z_i values converge to the t_i values as n tends to infinity, the a_i and b_i values are treated as constant coefficients for large n; see also3. Hence, the asymptotic variances can be estimated by
Var(α̂) = ∑_{i=1}^{n} Q̂_i x_i² / {∑_{i=1}^{n} Q̂_i ∑_{i=1}^{n} Q̂_i x_i² − (∑_{i=1}^{n} Q̂_i x_i)²}, (14)
Var(β̂) = ∑_{i=1}^{n} Q̂_i / {∑_{i=1}^{n} Q̂_i ∑_{i=1}^{n} Q̂_i x_i² − (∑_{i=1}^{n} Q̂_i x_i)²}. (15)
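The variance estimates (14)-(15) amount to inverting the 2×2 matrix in (13) with Q_i replaced by its estimate. A minimal sketch, assuming NumPy and illustrative names:

```python
import numpy as np

def mml_poisson_se(x, alpha_hat, beta_hat):
    # Asymptotic standard errors from equations (14)-(15); sketch only.
    q = np.exp(alpha_hat + beta_hat * x)         # estimated Q_i
    s0, s1, s2 = q.sum(), (q * x).sum(), (q * x ** 2).sum()
    det = s0 * s2 - s1 ** 2                      # determinant of the 2x2 matrix in (13)
    return np.sqrt(s2 / det), np.sqrt(s0 / det)  # SE(alpha_hat), SE(beta_hat)
```

The two returned values are the square roots of the diagonal elements of the inverted information matrix.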
Hypothesis testing
Testing the null hypothesis H_0: β = 0 is of great practical importance in Poisson regression modelling. The likelihood ratio statistic for testing H_0 is LR = −2(L_0 − L_1), where L_0 and L_1 denote the maximized log-modified-likelihood functions under the null and alternative hypotheses, respectively. The null distribution of LR is asymptotically chi-square with 1 degree of freedom, and large values of LR lead to the rejection of H_0. Alternatively, the Wald statistic W (the ratio of β̂ to its standard error) might be used. Since β̂ is asymptotically equivalent to the MLE, the null distribution of W is asymptotically N(0, 1); large values of W lead to the rejection of H_0.
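The Wald test needs only β̂ and its standard error; a small sketch (illustrative helper, standard library only):

```python
from math import erf, sqrt

def wald_test(beta_hat, se_beta):
    # Two-sided Wald test of H0: beta = 0; W is asymptotically N(0, 1).
    w = beta_hat / se_beta
    p = 1.0 - erf(abs(w) / sqrt(2.0))  # equals 2 * (1 - Phi(|w|))
    return w, p
```

For the horseshoe crab fit in Table 1, for example, `wald_test(0.164, 0.0199)` gives W ≈ 8.24.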
Numerical example
To compare the ML and MML estimates numerically, we analyzed the data given on page 82 of Agresti.19 The data come from a study of nesting horseshoe crabs, where the response Y is the number of satellites that each female crab has and the covariate X is the carapace width of each of the 173 crabs. The study investigates the relationship between Y and X. We calculated the MMLEs from equations (10)-(12) and their approximate standard errors from (14)-(15). The FORTRAN code written to carry out the calculations can be obtained from the author. Our results are completely consistent with those given in19, which is expected; see Table 1.
| Method | Coefficient | Estimate | SE     | W     | LR    |
|--------|-------------|----------|--------|-------|-------|
| ML     | α           | -3.3048  | 0.5422 |       |       |
|        | β           | 0.164    | 0.02   | 8.2   | 64.9  |
| MML    | α           | -3.3047  | 0.5423 |       |       |
|        | β           | 0.164    | 0.0199 | 8.241 | 64.91 |
Table 1 MLEs and MMLEs along with their standard errors for horseshoe crab data
Remark: In solving (10)-(12), Oral18 proposed calculating the initial t_(i) values from the least squares estimators (LSEs), a different approach from using equation (9) (Approach 1) or equation (10) (Approach 2). Since t_(i) = E(α + β x_(i)), the t_(i) values can be approximated by t̃_(i) = α̃ + β̃ x_(i), where
α̃ = ȳ − β̃ x̄ and β̃ = ∑_{i=1}^{n} (x_i − x̄) y_i / ∑_{i=1}^{n} (x_i − x̄)²
are the LSEs; see also3,20. When using this approach (say, Approach 3), the t_(i) values need to be revised after the first iteration with their estimated values t_(i) = α̂ + β̂ x_(i) (1 ≤ i ≤ n), as described above. Estimating the population quantiles from the LSEs changes neither the derivations/solutions (10)-(12) nor the results; however, the number of revisions needed for stabilization differs between the approaches, see also3 page 889. To compare the performance of these three approaches, we conducted a simulation study in which we calculated the bias values and variances of the resulting MMLEs as well as the coverage probabilities. We also provide the average number of revisions needed for stabilization under each approach. We set α to zero and considered various values of β and the sample size n. Our results from 10,000 Monte Carlo runs are given in Table 2.
As can be seen from Table 2, all approaches provide the same biases and variances after stabilization, as expected. The resulting coverage probabilities from all approaches are close to 0.95, and as the sample size increases both the biases and the variances decrease, also as expected. For a given (α, β) value, the fastest stabilization is achieved by Approach 2 (i.e. equation (10)). Thus, although all three approaches yield the same results, we suggest calculating the initial t_(i) values (1 ≤ i ≤ n) from equation (10), since stabilization from this approach is the fastest.
|     |            |                  | β = 0.1 |        | β = 0.5 |        | β = 1.0 |        |
| n   | Approach   |                  | α̂       | β̂      | α̂       | β̂      | α̂       | β̂      |
|-----|------------|------------------|---------|--------|---------|--------|---------|--------|
| 30  | Approach 1 | Bias             | 0.0355  | 0.0006 | 0.0341  | 0.0067 | 0.0278  | 0.0078 |
|     |            | Variance         | 0.0373  | 0.0389 | 0.0403  | 0.0369 | 0.0451  | 0.0328 |
|     |            | Coverage prob.   | 0.9521  | 0.9537 | 0.9522  | 0.9541 | 0.9525  | 0.9488 |
|     |            | No. of revisions | 4.67    |        | 3.45    |        | 6.10    |        |
|     | Approach 2 | Bias             | 0.0355  | 0.0006 | 0.0341  | 0.0067 | 0.0279  | 0.0079 |
|     |            | Variance         | 0.0373  | 0.0388 | 0.0403  | 0.0369 | 0.0451  | 0.0328 |
|     |            | Coverage prob.   | 0.9521  | 0.9537 | 0.9522  | 0.9541 | 0.9525  | 0.9488 |
|     |            | No. of revisions | 2.65    |        | 2.00    |        | 1.81    |        |
|     | Approach 3 | Bias             | 0.0355  | 0.0006 | 0.0341  | 0.0067 | 0.0278  | 0.0079 |
|     |            | Variance         | 0.0372  | 0.0389 | 0.0402  | 0.0368 | 0.0451  | 0.0328 |
|     |            | Coverage prob.   | 0.9521  | 0.9537 | 0.9523  | 0.9540 | 0.9525  | 0.9488 |
|     |            | No. of revisions | 2.98    |        | 3.04    |        | 4.63    |        |
| 100 | Approach 1 | Bias             | 0.0098  | 0.0007 | 0.0079  | 0.0002 | 0.0081  | 0.0017 |
|     |            | Variance         | 0.0104  | 0.0104 | 0.0113  | 0.0094 | 0.0126  | 0.0074 |
|     |            | Coverage prob.   | 0.9504  | 0.9518 | 0.9470  | 0.9494 | 0.9538  | 0.9495 |
|     |            | No. of revisions | 4.78    |        | 3.24    |        | 6.59    |        |
|     | Approach 2 | Bias             | 0.0098  | 0.0007 | 0.0079  | 0.0001 | 0.0081  | 0.0016 |
|     |            | Variance         | 0.0103  | 0.0104 | 0.0114  | 0.0095 | 0.0126  | 0.0073 |
|     |            | Coverage prob.   | 0.9504  | 0.9518 | 0.9470  | 0.9494 | 0.9538  | 0.9496 |
|     |            | No. of revisions | 2.91    |        | 2.01    |        | 1.32    |        |
|     | Approach 3 | Bias             | 0.0098  | 0.0007 | 0.0078  | 0.0002 | 0.0082  | 0.0017 |
|     |            | Variance         | 0.0104  | 0.0104 | 0.0113  | 0.0094 | 0.0126  | 0.0073 |
|     |            | Coverage prob.   | 0.9505  | 0.9518 | 0.9470  | 0.9494 | 0.9539  | 0.9495 |
|     |            | No. of revisions | 2.99    |        | 3.00    |        | 4.69    |        |
| 250 | Approach 1 | Bias             | 0.0041  | 0.0002 | 0.0041  | 0.0000 | 0.0013  | 0.0007 |
|     |            | Variance         | 0.0041  | 0.0040 | 0.0044  | 0.0036 | 0.0049  | 0.0026 |
|     |            | Coverage prob.   | 0.9506  | 0.9452 | 0.9500  | 0.9479 | 0.9507  | 0.9513 |
|     |            | No. of revisions | 4.78    |        | 3.11    |        | 6.87    |        |
|     | Approach 2 | Bias             | 0.0041  | 0.0002 | 0.0040  | 0.0000 | 0.0012  | 0.0007 |
|     |            | Variance         | 0.0040  | 0.0041 | 0.0044  | 0.0036 | 0.0049  | 0.0026 |
|     |            | Coverage prob.   | 0.9506  | 0.9452 | 0.9500  | 0.9479 | 0.9507  | 0.9513 |
|     |            | No. of revisions | 2.90    |        | 2.00    |        | 1.12    |        |
|     | Approach 3 | Bias             | 0.0041  | 0.0002 | 0.0041  | 0.0001 | 0.0013  | 0.0007 |
|     |            | Variance         | 0.0041  | 0.0041 | 0.0044  | 0.0036 | 0.0049  | 0.0026 |
|     |            | Coverage prob.   | 0.9506  | 0.9452 | 0.9500  | 0.9479 | 0.9507  | 0.9513 |
|     |            | No. of revisions | 2.99    |        | 3.00    |        | 4.76    |        |
Table 2 Bias values, variances, coverage probabilities and average numbers of revisions using three different approaches to calculate the t_(i) values (1 ≤ i ≤ n)
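The three initialization approaches compared above can be sketched as follows. The helper re-implements the univariate revision scheme of (10)-(12) and counts revisions until stabilization; this is illustrative code with assumed names, not the simulation program used for Table 2:

```python
import numpy as np
from statistics import NormalDist

def mml_with_init(x, y, t_init, tol=1e-6, max_rev=100):
    # Run the MML revision scheme from a given set of initial t_(i)
    # values; return the estimates and the number of revisions used.
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    t, alpha, beta = t_init, 0.0, 0.0
    for rev in range(1, max_rev + 1):
        a, b = np.exp(t) * (1.0 - t), np.exp(t)
        m = b.sum()
        xbar_a = (b * xs).sum() / m
        d = ys - a
        beta_new = (d * (xs - xbar_a)).sum() / (b * (xs - xbar_a) ** 2).sum()
        alpha_new = d.sum() / m - beta_new * xbar_a
        if max(abs(alpha_new - alpha), abs(beta_new - beta)) < tol:
            return alpha_new, beta_new, rev
        alpha, beta = alpha_new, beta_new
        t = alpha + beta * xs
    return alpha, beta, max_rev

# one simulated data set with alpha = 0, beta = 0.5
rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 * x))
p = np.arange(1, n + 1) / (n + 1)
xs = np.sort(x)
t1 = np.log(p)                                        # Approach 1, equation (9)
t2 = np.array([NormalDist().inv_cdf(q) for q in p])   # Approach 2, equation (10)
b_ls = ((x - x.mean()) * y).sum() / ((x - x.mean()) ** 2).sum()
t3 = (y.mean() - b_ls * x.mean()) + b_ls * xs         # Approach 3, LSE initial values

for name, t0 in [("Approach 1", t1), ("Approach 2", t2), ("Approach 3", t3)]:
    a_hat, b_hat, revs = mml_with_init(x, y, t0)
    print(name, round(b_hat, 4), revs)
```

All three starts reach the same fixed point; only the number of revisions differs, which is the quantity reported in the last row of each panel of Table 2.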
Generalization to multivariable case
Now consider k (k ≥ 2) covariates, and assume without loss of generality that all of them take positive values. The Poisson regression model with k covariates can be written as
E(Y_i | x_i1, x_i2, ..., x_ik) = F(z_i), (16)
where F(z_i) = exp(z_i) and
z_i = β_0 + ∑_{j=1}^{k} β_j x_ij, (17)
for 1 ≤ i ≤ n, 1 ≤ j ≤ k. Following the same lines as3, in order to rank the z-values we can assume that all covariates are equally effective in increasing the response Y, i.e. we initially take the β_j's all equal, and order the z-values to correspond to the ordered x-values, where x_i = x_i1 + x_i2 + ... + x_ik (1 ≤ i ≤ n). In other words, the ordered z-values become
z_(i) = β_0 + β_1 x*_i1 + ... + β_k x*_ik, (18)
where the vector [1  x*_i1  x*_i2  ...  x*_ik] is the i-th row of the matrix
X* = [ 1  x*_11  x*_12  ...  x*_1k
       1  x*_21  x*_22  ...  x*_2k
       ⋮     ⋮      ⋮           ⋮
       1  x*_n1  x*_n2  ...  x*_nk ], (19)
which is constructed by arranging the rows of the X matrix
X = [ 1  x_11  x_12  ...  x_1k
      1  x_21  x_22  ...  x_2k
      ⋮    ⋮     ⋮          ⋮
      1  x_n1  x_n2  ...  x_nk ],
so that its rows correspond to the ordered x_(i) values (1 ≤ i ≤ n). The MMLEs can be obtained along the same lines as in the univariate case:
Γ̂ = (X*ᵀ M X*)⁻¹ X*ᵀ Δ, (20)
where Δ = [δ_1 δ_2 ... δ_n]ᵀ, δ_i is given by (11), and M is the n × n diagonal matrix
M = diag(b_1, b_2, ..., b_n).
The asymptotic variance-covariance matrix V of the estimators can be derived from the Fisher information matrix, V = I⁻¹(β_0, β_1, ..., β_k), as given below:
V = [ ∑Q_i            ∑Q_i x_1i       ...  ∑Q_i x_ki
      ∑Q_i x_1i       ∑Q_i x_1i²      ...  ∑Q_i x_ki x_1i
      ⋮                  ⋮                      ⋮
      ∑Q_i x_ki       ∑Q_i x_ki x_1i  ...  ∑Q_i x_ki² ]⁻¹ (21)
where Q_i = exp(z_i). V is estimated by replacing Q_i by Q̂_i = exp(ẑ_(i)), where ẑ_(i) = β̂_0 + β̂_1 x*_i1 + ... + β̂_k x*_ik (1 ≤ i ≤ n).
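Equation (20) translates directly into a few lines of linear algebra. The sketch below (illustrative names, assuming NumPy) orders the rows by the covariate sums as described above and revises the t_(i) values until stabilization:

```python
import numpy as np

def mml_poisson_multi(X, y, tol=1e-8, max_rev=50):
    # k-covariate MMLEs via equation (20):
    # Gamma_hat = (X*' M X*)^{-1} X*' Delta, with M = diag(b_i).
    # Illustrative sketch, not the author's program.
    n, k = X.shape
    order = np.argsort(X.sum(axis=1))              # rank by x_i = x_i1 + ... + x_ik
    Xs = np.column_stack([np.ones(n), X[order]])   # the X* design matrix, (19)
    ys = y[order]
    t = np.log(np.arange(1, n + 1) / (n + 1))      # initial t_(i), equation (9)
    gamma = np.zeros(k + 1)
    for _ in range(max_rev):
        a, b = np.exp(t) * (1.0 - t), np.exp(t)
        delta = ys - a                             # the Delta vector, from (11)
        XtM = Xs.T * b                             # X*' M, using M = diag(b)
        gamma_new = np.linalg.solve(XtM @ Xs, Xs.T @ delta)
        if np.max(np.abs(gamma_new - gamma)) < tol:
            return gamma_new
        gamma = gamma_new
        t = Xs @ gamma                             # revise t_(i) with estimates
    return gamma
```

With k = 1 this reduces algebraically to the univariate solutions (10)-(12), which is a useful sanity check on the matrix form.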
Robustness
Measures of influence considered in linear regression models, such as high leverage values, have analogues in the GLM framework. Large leverage values typically mean that there are outliers in the covariates. When outliers are present in the data, inferences based on the MLEs become unreliable; in fact, it has been shown that the MLEs are not robust in GLMs.21 In the Poisson regression setting, outliers in a continuous covariate can influence the estimates. Thus, we also studied the robustness properties of the derived MMLEs under several outlier models. For simplicity, we considered the univariate model given by (1). We assumed α = 0 and performed a Monte Carlo study for different values of β, where (n − r) of the observations X_1, X_2, ..., X_n (we do not know which) come from a normal distribution with scale σ = 1 and the remaining r observations come from a normal distribution with scale cσ, where c is a positive constant. We calculated the value of r from the equation r = [0.1n + 0.5] (Dixon's outlier model). The outlier models considered are:
(a) (n-r) come from N(0,1) and r come from N(0,1) (No outlier situation),
(b) (n-r) come from N(0,1) and r come from N(0,1.5),
(c) (n-r) come from N(0,1) and r come from N(0,2),
(d) (n-r) come from N(0,1) and r come from N(0,4).
Note that model (a) above does not involve outliers and is given for the sake of comparison. To make direct comparisons possible, after generating the X values we divided them by the standard deviation of the underlying distribution for each model. We then calculated z_i = α + βx_i and μ_i = exp(z_i) for 1 ≤ i ≤ n and generated the Y_i values from a Poisson distribution with mean μ_i (1 ≤ i ≤ n). The values obtained from 5,000 runs are given in Table 3.
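The data-generation step under Dixon's outlier model can be sketched as follows. This is a minimal illustration with assumed names; in particular, rescaling by the mixture standard deviation is our reading of the standardization described above:

```python
import numpy as np

def dixon_sample(n, c, beta, alpha=0.0, rng=None):
    # One Monte Carlo sample under Dixon's outlier model:
    # (n - r) of the x's from N(0, 1) and r = [0.1n + 0.5] from N(0, c^2).
    # Rescaling by the mixture sd makes the models directly comparable.
    if rng is None:
        rng = np.random.default_rng()
    r = int(0.1 * n + 0.5)
    x = np.concatenate([rng.normal(0.0, 1.0, n - r), rng.normal(0.0, c, r)])
    sd = np.sqrt(((n - r) + r * c ** 2) / n)   # sd of the mixture distribution
    x = x / sd
    y = rng.poisson(np.exp(alpha + beta * x))  # mu_i = exp(alpha + beta * x_i)
    return x, y
```

Model (a) corresponds to c = 1, and models (b)-(d) to c = 1.5, 2 and 4, respectively.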
|     |     | Model (a): no outlier |          |         |         | Model (b) |          |         |         |
| β   | n   | Bias(α̂) | Bias(β̂) | Var(α̂) | Var(β̂) | Bias(α̂) | Bias(β̂) | Var(α̂) | Var(β̂) |
|-----|-----|---------|----------|---------|---------|-----------|----------|---------|---------|
| 0.1 | 30  | 0.0326 | 0.0075 | 0.0371 | 0.0385 | 0.0330 | 0.0021 | 0.0377 | 0.0388 |
|     | 50  | 0.0195 | 0.0004 | 0.0214 | 0.0219 | 0.0211 | 0.0033 | 0.0214 | 0.0218 |
|     | 100 | 0.0082 | 0.0039 | 0.0103 | 0.0102 | 0.0094 | 0.0008 | 0.0103 | 0.0104 |
| 0.2 | 30  | 0.0316 | 0.0020 | 0.0377 | 0.0388 | 0.0302 | 0.0034 | 0.0388 | 0.0386 |
|     | 50  | 0.0202 | 0.0029 | 0.0217 | 0.0215 | 0.0200 | 0.0007 | 0.0219 | 0.0212 |
|     | 100 | 0.0108 | 0.0000 | 0.0104 | 0.0102 | 0.0115 | 0.0002 | 0.0108 | 0.0103 |
| 0.4 | 30  | 0.0323 | 0.0033 | 0.0391 | 0.0376 | 0.0346 | 0.0012 | 0.0407 | 0.0385 |
|     | 50  | 0.0207 | 0.0005 | 0.0224 | 0.0209 | 0.0165 | 0.0004 | 0.0230 | 0.0209 |
|     | 100 | 0.0086 | 0.0004 | 0.0109 | 0.0097 | 0.0089 | 0.0009 | 0.0113 | 0.0099 |

|     |     | Model (c) |          |         |         | Model (d) |          |         |         |
| β   | n   | Bias(α̂) | Bias(β̂) | Var(α̂) | Var(β̂) | Bias(α̂) | Bias(β̂) | Var(α̂) | Var(β̂) |
|-----|-----|----------|----------|---------|---------|-----------|----------|---------|---------|
| 0.1 | 30  | 0.02827 | 0.0003 | 0.0381 | 0.0413 | 0.0385 | 0.0026 | 0.0372 | 0.0477 |
|     | 50  | 0.0192 | 0.0013 | 0.0220 | 0.0220 | 0.0211 | 0.0004 | 0.0213 | 0.0246 |
|     | 100 | 0.0095 | 0.0001 | 0.0105 | 0.0104 | 0.0099 | 0.0002 | 0.0108 | 0.0107 |
| 0.2 | 30  | 0.0362 | 0.0036 | 0.0396 | 0.0419 | 0.0350 | 0.0080 | 0.0375 | 0.0467 |
|     | 50  | 0.0163 | 0.0014 | 0.0208 | 0.0211 | 0.0189 | 0.0057 | 0.0221 | 0.0242 |
|     | 100 | 0.0111 | 0.0010 | 0.0106 | 0.0105 | 0.0109 | 0.0013 | 0.0099 | 0.0103 |
| 0.4 | 30  | 0.0313 | 0.0015 | 0.0396 | 0.0370 | 0.0314 | 0.0017 | 0.0379 | 0.0461 |
|     | 50  | 0.0178 | 0.0030 | 0.0226 | 0.0207 | 0.0166 | 0.0001 | 0.0218 | 0.0223 |
|     | 100 | 0.0079 | 0.0003 | 0.0110 | 0.0093 | 0.0074 | 0.0039 | 0.0111 | 0.0088 |
Table 3 Empirical biases and variances from Dixon’s outlier models
As can be seen from the table, the biases in the estimates are negligible for all models. The variances Var(β̂) (and hence the Wald statistics W) are almost the same for a given n across models (a)-(d), which indicates that the MMLEs are robust to outliers in the covariate. Note that the MML methodology achieves this robustness through the t_(i) (1 ≤ i ≤ n) values.
Poisson regression serves as a useful technique for modeling count data. The MLEs in Poisson regression are obtained via Newton-type algorithms; however, these algorithms might not converge, or might converge to inaccurate values. In this study we derived the explicit MMLEs for Poisson regression. We also considered the case where there are outliers in a (continuous) covariate, which is frequently the case in real-life applications, and investigated the properties of the derived MMLEs under several such data violations. Although the scope of the simulations reported here is limited, we can conclude that the MML methodology provides robust estimation in Poisson regression.
We are grateful to the reviewer for the helpful comments, which improved this manuscript.
Conflicts of interest: None.
©2017 Oral. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and build upon your work non-commercially.