Application of GLM (logistic regression) on serological data of malaria infection

doi:10.15406/bbij.2019.08.00261

eISSN: 2378-315X

Biometrics & Biostatistics International Journal

Mini Review Volume 8 Issue 1

Getachew Tekle

Department of Statistics, Wachemo University, Ethiopia

Correspondence: Getachew Tekle, MSc. in Biostatistics, Department of Statistics, Wachemo University, Ethiopia

Received: October 22, 2018 | Published: January 9, 2019

Citation: Tekle G. Application of GLM (logistic regression) on serological data of malaria infection. Biom Biostat Int J. 2019;8(1):1-4. DOI: 10.15406/bbij.2019.08.00261

Abstract

As a nation reduces the burden of falciparum malaria, identifying areas of transmission becomes increasingly difficult. Over the past decade, the field of utilizing malaria serological assays to measure exposure has grown rapidly, and a variety of serological methods for data acquisition and analysis of human IgG against falciparum antigens are available.¹

The main objective of this case study is to model the probability of infection (the prevalence of malaria infection) as a function of age.

Introduction

Variables

The predictor variable (age) is continuous, and the dependent variable (serology/disease status) is binary: sero-positive or sero-negative.

Data: Serological data of malaria

Serology is the scientific study of plasma serum and other bodily fluids. In practice, the term usually refers to the diagnostic identification of antibodies in the serum. Serological tests may be performed for diagnostic purposes when an infection is suspected, in rheumatic illnesses, and in many other situations, such as checking an individual's blood type.²

Antibodies produced in response to an infectious disease like malaria remain in the body after the individual has recovered from the disease. A serological test detects the presence or absence of such antibodies. An individual with such antibodies is termed sero-positive.

The sample is taken at a single time point, and the following information is recorded for each individual:

Age at test.
Infected or not.

Prevalence of sero-positivity in the sample: this is the probability of having been infected before the age at test.
In this example, the information about each subject is the disease status (infected by malaria or not) and the age group of the subject.
The variables are the sample size in each age group, the number of sero-positive subjects in each age group (= the number of infected subjects), and the age.

Binary data

Binary data may occur in two forms:

Ungrouped, in which the variable can take one of two values, say success/failure.
Grouped, in which the variable is the number of successes in a given number of trials.

The natural distribution for such data is the Binomial (n, p) distribution; where in the first case n = 1.
The observation is a binary variable which takes the value of 1 with probability P.

$P = \frac{e^{\alpha + \beta \cdot \text{age}}}{1 + e^{\alpha + \beta \cdot \text{age}}}$ (1)

Here $P$ is the probability of infection.
If $\beta > 0$, there is a positive association between the probability and age: the probability of infection increases with age.
If $\beta < 0$, there is a negative association between the probability and age: the probability of infection decreases with age.
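The effect of the sign of the slope can be checked numerically. The following is a minimal Python sketch of equation (1); the intercept, the two slopes and the ages are hypothetical values chosen only for illustration, not estimates from the article.

```python
import math

def infection_probability(age, alpha, beta):
    """Equation (1): P = e^(alpha + beta*age) / (1 + e^(alpha + beta*age))."""
    eta = alpha + beta * age
    return 1.0 / (1.0 + math.exp(-eta))  # algebraically the same as e^eta / (1 + e^eta)

ages = [5, 20, 40, 60]   # hypothetical ages, for illustration only
alpha = -2.0             # hypothetical intercept

rising = [infection_probability(a, alpha, beta=0.05) for a in ages]    # positive slope
falling = [infection_probability(a, alpha, beta=-0.05) for a in ages]  # negative slope

for a, up, down in zip(ages, rising, falling):
    print(f"age={a:3d}  P(beta=+0.05)={up:.3f}  P(beta=-0.05)={down:.3f}")
```

With a positive slope the printed probabilities rise with age, with a negative slope they fall, and in both cases they stay strictly between 0 and 1.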

Generalized linear models (GLM)

Generalized linear models (GLM) are used to fit fixed-effect models to certain types of data that are not normally distributed. "Generalized" means that the models are not limited to normally distributed data; "linear" means that they use a linear combination of variables to "predict" the response. The binomial distribution belongs to the exponential family (Dobson³):

$Z_{ij} = \begin{cases} 1 \\ 0 \end{cases} \to Y_{i} = \sum_{j = 1}^{n_{i}} Z_{ij} \to Y_{i} \sim B(n_{i}, \pi_{i})$

$p(y_{i} \mid \pi_{i}) = \exp\left\{ y_{i} \log\left[\frac{\pi_{i}}{1 - \pi_{i}}\right] + n_{i} \log(1 - \pi_{i}) + \log\binom{n_{i}}{y_{i}} \right\}$, so the natural parameter is $\theta_{i} = \log\left[\frac{\pi_{i}}{1 - \pi_{i}}\right]$.

The link function

$g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right)$

$a_{i}(\phi) = 1, \quad b(\theta_{i}) = \log(1 + \exp(\theta_{i}))$

$c(y) = \log\binom{n_{i}}{y_{i}}$

$\log\left(\frac{\mu}{1 - \mu}\right) = \log\left(\frac{\frac{e^{\theta}}{1 + e^{\theta}}}{\frac{1}{1 + e^{\theta}}}\right) = \log(e^{\theta}) = \theta$

$E(y) = \mu = b'(\theta_{i}) = \frac{e^{\theta_{i}}}{1 + \exp(\theta_{i})}, \quad var(y) = \frac{\mu(1 - \mu)}{n}$ (2.2)
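The identities in (2.2) can be verified numerically. The sketch below, in Python, differentiates $b(\theta) = \log(1 + e^{\theta})$ by central finite differences and compares the result with the closed forms; the value of $\theta$ and the helper names are arbitrary choices made for this illustration.

```python
import math

def b(theta):
    """Cumulant function b(theta) = log(1 + e^theta)."""
    return math.log1p(math.exp(theta))

def mean_from_theta(theta):
    """Closed form of b'(theta) = e^theta / (1 + e^theta)."""
    return math.exp(theta) / (1.0 + math.exp(theta))

def central_diff(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

theta = 0.7                                          # arbitrary natural parameter
mu = mean_from_theta(theta)
mu_numeric = central_diff(b, theta)                  # approximates b'(theta)
var_numeric = central_diff(mean_from_theta, theta)   # approximates b''(theta)
logit_mu = math.log(mu / (1.0 - mu))                 # canonical link applied to mu

print(mu, mu_numeric)                # the two values should agree
print(mu * (1.0 - mu), var_numeric)  # b''(theta) should equal mu * (1 - mu)
print(theta, logit_mu)               # the logit of mu should recover theta
```

The check covers a single trial ($n = 1$); for a proportion based on $n$ trials the variance is divided by $n$, as in (2.2).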

Components of GLM

Random component: the probability distribution of the response.
Systematic component (linear predictor): the predictor variables (e.g., X1, X2, etc.), which enter the model in a linear manner.
Link function: specifies the relationship between the mean of the random component (i.e., E(Y)) and the systematic component.

Random component

$Y_{ij} = \begin{cases} 1 & \text{sero-positive} \\ 0 & \text{sero-negative} \end{cases}$

then

$E(Y_{ij}) = P(Y_{ij} = 1) = \pi_{ij}$

which is estimated by the sample proportion

$\sum_{j} \frac{Y_{ij}}{n_{i}} = \frac{Z_{i}}{n_{i}}$

, where

$Z_{i} = \sum_{j} Y_{ij}$

To show that the sum of Bernoulli variables is binomially distributed, note that each indicator $Y_{ij}$ takes the value 1 (sero-positive) or 0 (sero-negative), with

$Y_{ij} \sim Bin(1, \pi_{ij})$ (2.3)

If the $n_{i}$ subjects in age group $i$ are independent and share a common probability $\pi_{ij} = \pi_{i}$, then

$Z_{i} = \sum_{j} Y_{ij} \sim Bin(n_{i}, \pi_{i})$

$Z_{i}$: number of sero-positive subjects in age group $i$; $n_{i}$: sample size in age group $i$.
$P_{i} = \pi_{i}$ is the probability of being infected (the prevalence). We use logistic regression to model the prevalence as a function of age.
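The claim that a sum of independent Bernoulli indicators with a common probability is binomial can be illustrated by simulation. The Python sketch below uses a hypothetical group size, a hypothetical prevalence and a fixed random seed; none of these values come from the article's data.

```python
import math
import random
from collections import Counter

def binomial_pmf(k, n, p):
    """P(Z = k) for Z ~ Bin(n, p)."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

random.seed(1)
n_i, pi_i = 10, 0.3   # hypothetical group size and prevalence
n_sim = 20000         # number of simulated age groups

# Each simulated Z is the sum of n_i independent Bernoulli(pi_i) indicators.
sums = [sum(1 if random.random() < pi_i else 0 for _ in range(n_i))
        for _ in range(n_sim)]
counts = Counter(sums)

for k in range(n_i + 1):
    simulated = counts[k] / n_sim
    print(f"k={k:2d}  simulated={simulated:.4f}  Bin(n,p)={binomial_pmf(k, n_i, pi_i):.4f}")
```

The simulated frequencies should track the binomial probabilities closely, up to Monte Carlo error.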

Systematic component (the linear predictor): the dependence of the response on the predictors. The systematic component of the model consists of a set of explanatory variables and some linear function of them.⁴

$\pi_{i} = f(\text{seropositive}_{i}) = f(S_{i}) \leftrightarrow \pi_{i} = f(S_{i}) = f(\beta_{0} + \beta_{1} S_{i})$ (2.4)

Binomial link functions

Logit link function: $\eta(p) = \ln\left(\frac{p}{1 - p}\right)$

$\mu = \frac{\exp(X\beta)}{1 + \exp(X\beta)} = \frac{1}{1 + \exp(-X\beta)}$ (mean of the response with the logit link)

Probit link function: $\eta(p) = \Phi^{-1}(p)$
Complementary log-log link function: $\eta(p) = \ln(-\ln(1 - p))$
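The three binomial links and their inverses can be written out directly. This is a small Python sketch using only the standard library; the function names are chosen for this illustration, and the probit link relies on `statistics.NormalDist` for the standard normal distribution function and its inverse.

```python
import math
from statistics import NormalDist

_std_normal = NormalDist()  # standard normal distribution, used for the probit link

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def probit(p):
    return _std_normal.inv_cdf(p)

def inv_probit(eta):
    return _std_normal.cdf(eta)

def cloglog(p):
    return math.log(-math.log(1.0 - p))

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

for p in (0.1, 0.5, 0.9):
    print(f"p={p}: logit={logit(p):+.4f}  probit={probit(p):+.4f}  cloglog={cloglog(p):+.4f}")
```

The logit and probit links are symmetric about $p = 0.5$ (both are zero there), whereas the complementary log-log link is not.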

Analysis of design matrices

For logistic regression

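Define a (design) matrix $X$ so that, for the response variable, the linear predictor is $X\beta$, where $\beta$ is a vector of parameters and $X$ is a design matrix of predictors.

For binomial model

The linear predictor takes the same form, $X\beta$, where $\beta$ is a vector of parameters and $X$ is a design matrix of predictors.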
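For a model with an intercept and a single predictor, the design matrix has a column of ones and a column of ages. The Python sketch below builds such a matrix and computes the linear predictor; the ages are hypothetical midpoints chosen for illustration, while the coefficient vector is the intercept and slope of the logit fit reported later in the article.

```python
ages = [5.0, 15.0, 25.0, 35.0]   # hypothetical age-group midpoints
beta = [-2.714074, 0.044672]     # (intercept, slope) from the article's logit fit

# Design matrix X: a column of ones for the intercept and a column for age.
X = [[1.0, age] for age in ages]

# Linear predictor eta = X beta, one value per row of X.
eta = [sum(x_k * b_k for x_k, b_k in zip(row, beta)) for row in X]

for row, value in zip(X, eta):
    print(row, round(value, 4))
```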

Model Selection Techniques

The most commonly used model selection criteria, the Akaike Information Criterion (AIC) (Sakamoto, 1986) and the log-likelihood, were used.

$AIC = -2 \log L + 2P$

where $-2 \log L$ is twice the negative log-likelihood value for the model and

$P$ is the number of estimated parameters.

The model with the smallest AIC is the best.
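As a check, the AIC can be recomputed as $-2 \log L + 2P$ from the log-likelihoods listed in Table 1. The Python sketch below assumes that the second "Logit" row of that table is the log-link model and that ".30.59063" stands for -30.59063; both readings are assumptions, supported by the AIC values printed in the R summaries.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = -2 log L + 2 P."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Log-likelihoods as read from Table 1; every model has P = 2 parameters.
log_lik = {
    "logit": -31.1941,
    "log": -29.9179,
    "identity": -33.3445,
    "c-log-log": -30.59063,
}

aic_values = {name: aic(ll, 2) for name, ll in log_lik.items()}
best = min(aic_values, key=aic_values.get)

for name, value in sorted(aic_values.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} AIC = {value:.3f}")
print("smallest AIC:", best)
```

Under these readings the recomputed values agree with the AIC column of Table 1 to three decimals, and the log link has the smallest AIC.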

Results and discussions

Exploratory analysis of data
The plot (Figure 1) indicates that the prevalence of malaria infection increases with age: as age increases, the probability of infection increases. Thus, there is an almost linear relationship between the probability of malaria infection and age. The line shows the fitted proportion of infection, which is linear in age on the logit scale, as given below:

Figure 1 Plot of prevalence of malaria vs. age, posi/N.

$\text{logit}(\hat{p}_{i}) = -2.71 + 0.044 \cdot \text{age}$ (3.1)

Model Diagnosis
As the plot of residuals against fitted values (Figure 2) shows, the residuals follow a pattern and their spread is not constant across the fitted values; the variation around the predicted probability of infection is not the same throughout. This indicates that the constant-variance assumption of the model has not been satisfied.

Figure 2 Plot of residuals vs. Fitted values.

The normal plot (Figure 3) shows that the normality assumption has been satisfied.

Figure 3 Normal plot.

Models with different link functions
Model with logit link
Deviance Residuals:

     Min        1Q    Median        3Q       Max
-2.78685  -1.31863  -0.05053   0.66752   2.38275
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -2.714074    0.151740  -17.886    <2e-16 ***
agei         0.044672    0.004511    9.904    <2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.037 on 8 degrees of freedom
Residual deviance: 21.865 on 7 degrees of freedom
AIC: 66.388

Model with complementary log-log (c-log-log) link:
Deviance Residuals:

    Min       1Q   Median       3Q      Max
-2.6301  -1.3864  -0.1393   0.6994   2.5276
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -2.709235    0.139261   -19.45    <2e-16 ***
agei         0.039671    0.003746    10.59    <2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.037 on 8 degrees of freedom
Residual deviance: 20.658 on 7 degrees of freedom
AIC: 65.181

Model with log link:

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.428  -1.474  -0.146   0.751   2.682
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -2.699659    0.126483   -21.34    <2e-16 ***
agei         0.034705    0.002997    11.58    <2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.037 on 8 degrees of freedom
Residual deviance: 19.312 on 7 degrees of freedom
AIC: 63.836

Model with Identity link:

glm(formula = dew ~ agei, family = binomial(link = "identity"))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.2921  -0.8959  -0.1462   0.8583   3.0276
Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) 0.0381457   0.0123993    3.076   0.00209 **
agei        0.0063542   0.0006656    9.547   < 2e-16 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.037 on 8 degrees of freedom
Residual deviance: 26.165 on 7 degrees of freedom
AIC: 70.689
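The null and residual deviances in the four summaries can be combined into a likelihood-ratio test for the age effect, together with a rough overdispersion check. The Python sketch below is not part of the article's analysis; it relies on the standard fact that a chi-square variable with 1 degree of freedom has upper-tail probability $\text{erfc}(\sqrt{x/2})$, and the helper name is chosen for this illustration.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

null_deviance, null_df = 124.037, 8
resid_df = 7
residual = {              # residual deviances reported in the four summaries
    "logit": 21.865,
    "c-log-log": 20.658,
    "log": 19.312,
    "identity": 26.165,
}

results = {}
for link, dev in residual.items():
    lr_stat = null_deviance - dev      # drop in deviance from adding age
    p_value = chi2_sf_1df(lr_stat)     # null_df - resid_df = 1 degree of freedom
    ratio = dev / resid_df             # values well above 1 hint at overdispersion
    results[link] = (lr_stat, p_value, ratio)
    print(f"{link:10s} LR={lr_stat:8.3f}  p={p_value:.2e}  deviance/df={ratio:.2f}")
```

For every link the deviance drop is large relative to 1 degree of freedom, in line with the very small p-values for age, while the residual deviance is roughly three times its degrees of freedom, which is consistent with the lack of fit noted under Model Diagnosis.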

Models Comparison
Selection of terms for deletion or inclusion is based on Akaike's information criterion (AIC). In R, the function extractAIC(model) gives the AIC (Table 1). According to the AIC and the log-likelihood, the model with the log link function is chosen as the best model: although its slope estimate is the second smallest after the identity link, its AIC is the smallest and its log-likelihood is the largest of all. Hence, the chosen model with the log link function is given as follows:

Model	Estimate $(\beta)$	Log-likelihood	No. parameters	AIC
Logit	0.044672	-31.1941	2	66.388
Log	0.034705	-29.9179	2	63.836
Identity	0.006354	-33.3445	2	70.689
C-log-log	0.039671	-30.59063	2	65.181

Table 1 Model comparison

$\hat{P}_{i} = e^{-2.699659 + 0.034705 \cdot \text{age}}$

$\log E(Y) = -2.699659 + 0.034705 \cdot \text{age}$, which indicates that for a unit increase in age, the proportion developing the antibodies is multiplied by $e^{0.0347} \approx 1.035$, an increase of about 3.5%.
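To see how the four fitted models differ in practice, the reported coefficients can be passed through the inverse of each link. The Python sketch below takes the intercepts and slopes from the R summaries; the ages are hypothetical values chosen for illustration and are not the study's age groups.

```python
import math

# (intercept, slope) pairs taken from the four R summaries in the article.
coef = {
    "logit": (-2.714074, 0.044672),
    "c-log-log": (-2.709235, 0.039671),
    "log": (-2.699659, 0.034705),
    "identity": (0.0381457, 0.0063542),
}

# Inverse link functions: map the linear predictor eta back to a fitted proportion.
inverse_link = {
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "c-log-log": lambda eta: 1.0 - math.exp(-math.exp(eta)),
    "log": lambda eta: math.exp(eta),
    "identity": lambda eta: eta,
}

def predict(link, age):
    """Fitted prevalence at a given age under the named link."""
    intercept, slope = coef[link]
    return inverse_link[link](intercept + slope * age)

for age in (10, 30, 50, 70, 90):   # hypothetical ages
    row = "  ".join(f"{link}={predict(link, age):.3f}" for link in coef)
    print(f"age={age:2d}  {row}")
```

The log and identity links do not constrain fitted values to the interval [0, 1]: with the reported coefficients, the log-link prediction passes 1 just below age 78, so that model should only be read within the age range actually observed.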

The odds ratio: point estimator

How to calculate the odds ratio? For continuous predictor the odds ratio is given by

$\theta = \exp(\beta)$.

The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, exp(B) is easier to interpret: it represents the ratio change in the odds of the event of interest for a one-unit change in the predictor.⁵ Here exp(0.0347) = 1.0353, so each one-unit increase in age multiplies the estimate by 1.0353, an increase of about 3.5%. Strictly, exp(B) is an odds ratio only under the logit link; for the logit fit, exp(0.044672) ≈ 1.0457, so the odds of being sero-positive rise by about 4.6% per unit of age.
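The point estimate exp(B) is a one-line computation. The Python sketch below evaluates it for the value quoted in the text and for the logit-link slope; the Wald interval built from the reported standard error goes beyond the point estimator discussed in the article and is included only as an illustration, with 1.96 as the usual normal quantile for 95% coverage.

```python
import math

def odds_ratio(beta, se=None, z=1.96):
    """Return exp(beta) and, if se is given, the Wald interval exp(beta -/+ z*se)."""
    point = math.exp(beta)
    if se is None:
        return point, None
    return point, (math.exp(beta - z * se), math.exp(beta + z * se))

# Value quoted in the text (slope of the log-link fit).
quoted, _ = odds_ratio(0.0347)

# Slope and standard error of the logit-link fit, where exp(beta) is an odds ratio.
or_logit, ci_logit = odds_ratio(0.044672, se=0.004511)

# Effect of a 10-unit age difference on the odds: exp(10 * beta).
or_10_units = math.exp(10 * 0.044672)

print(round(quoted, 4))
print(round(or_logit, 4), tuple(round(v, 4) for v in ci_logit))
print(round(or_10_units, 3))
```

Because the odds ratio is multiplicative, the effect of a larger age difference is obtained by raising the one-unit odds ratio to the corresponding power rather than by multiplying it.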

Conclusion

Serological data were explored and analyzed as shown above. The model summaries show that, in all fitted models, the p-value is very small and the predictor variable age is significant for predicting the prevalence of malaria. Comparison of the four models indicated that the model with the log link function is the best model based on the AIC, in which case the estimated coefficient of age is 0.0347; that is, for a unit increase in age the proportion of malaria infection is multiplied by about 1.035 (an increase of about 3.5%).

Acknowledgements

None.

Conflict of interest

The author declares that there is no conflict of interest.

References

1. Rogier E, Wiegand R, Moss D, et al. Multiple comparisons analysis of serological data from an area of low Plasmodium falciparum transmission. Malaria Journal. 2015;4(14):436.
2. Collett D. Modelling Binary Data. London: Chapman & Hall; 1991.
3. Dobson AJ. An Introduction to Generalized Linear Models. 2nd edn. London: Chapman & Hall; 2001.
4. McCullagh P, Nelder JA. Generalized Linear Models. 2nd edn. London: Chapman & Hall; 1989.
5. Lindsey JK, Mersch G. Fitting and comparing probability distributions with log linear models. Comput Statist Data Anal. 1992;13:373–384.