eISSN: 2378-315X

Biometrics & Biostatistics International Journal

Editorial Volume 3 Issue 5

On Bayesian inference with complex survey data

Joseph Kang, Kyle Bernstein

The Centers for Disease Control and Prevention, USA

Correspondence: Joseph Kang, Statistics Team Lead, 1600 Clifton Rd, MS-02, Atlanta, GA, USA

Received: April 18, 2016 | Published: May 11, 2016

Citation: Kang J, Bernstein K. On Bayesian inference with complex survey data. Biom Biostat Int J. 2016;3(5):155-156. DOI: 10.15406/bbij.2016.03.00076


Editorial

Nationally representative probability sample surveys (e.g., the National Health and Nutrition Examination Survey, NHANES,1 and the National Survey of Family Growth, NSFG2) are of immense value in developing estimates of the prevalence of disease, morbidity, and risk factors.3,4 These surveys are designed so that survey weights can be used to approximate nationally representative parameters. Survey data analysis that uses the survey weights is called design-based estimation. Such analyses are implemented in major statistical software programs (e.g., SAS5 and R6).

While commonly used, weighting complex survey data is a foreign concept to Bayesian modelers, according to Gelman.7 A typical Bayesian analyst does not use weights, but focuses instead on updating assumed prior distributions with the likelihood of the observed data. Much of this disconnect may be a function of differing goals; Bayesian approaches focus on building reliable statistical models8 rather than on assessing the degree to which their estimates are nationally representative. However, Bayesian approaches, which have been successfully applied to multilevel data,8 missing data, and measurement errors,8 may represent a natural partner in complex survey data analysis. Measurement errors, missing data, and multilevel variables in complex survey data sets can all be treated as unobserved random variables in the Bayesian framework, and they can be assessed by updating the assumed prior distributions of related parameters with the observed data.8,9

Rod Little proposed a method called “Calibrated Bayes”10 which can be used to adopt survey weights within the Bayesian paradigm. The concept of calibration was originally proposed by non-Bayesian statisticians.11 The calibration technique modifies a survey data set by changing the survey weights so that the weighted data reproduce known nationally representative features. For example, suppose the average male age in a complex survey data set differs from the figure published by the U.S. Census Bureau. By changing the survey weights, the calibration method matches the weighted average male age of the survey data set to the Census Bureau’s figure. Though the calibration technique can in theory be used with Bayesian methods, the Calibrated Bayes method has not been used in major health science journals, partly because it is relatively new and partly because its theory has been discussed mainly from a statistical point of view.10,12
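The age-matching example can be illustrated with a minimal sketch of linear (chi-square distance) calibration with a single constraint, in the spirit of the calibration estimators cited above.11 The function names, the ages, the design weights, and the benchmark of 40 years are all hypothetical and chosen only for illustration; this sketch constrains the weighted mean only and does not preserve the weight total, which a production calibration routine would usually also constrain.

```python
def weighted_mean(weights, values):
    """Design-weighted mean: sum(w * x) / sum(w)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def calibrate_to_mean(weights, values, target_mean):
    """Adjust weights so the weighted mean of `values` equals `target_mean`.

    Uses the linear calibration form w*_i = w_i * (1 + lam * z_i) with
    z_i = x_i - target_mean, and picks lam so that sum(w*_i * z_i) == 0.
    """
    z = [x - target_mean for x in values]
    num = sum(w * zi for w, zi in zip(weights, z))
    den = sum(w * zi * zi for w, zi in zip(weights, z))
    if den == 0:
        raise ValueError("every value equals the target; nothing to calibrate")
    lam = -num / den
    new_weights = [w * (1 + lam * zi) for w, zi in zip(weights, z)]
    if min(new_weights) <= 0:
        # Linear calibration can yield non-positive weights in extreme cases.
        raise ValueError("calibration produced a non-positive weight")
    return new_weights


# Hypothetical ages of sampled men and their design weights (made-up data).
ages = [22, 35, 41, 58, 67]
design_w = [1200.0, 900.0, 1100.0, 800.0, 500.0]
census_mean_age = 40.0  # illustrative benchmark, not a real Census figure

calibrated_w = calibrate_to_mean(design_w, ages, census_mean_age)
print(weighted_mean(design_w, ages))      # weighted mean before calibration
print(weighted_mean(calibrated_w, ages))  # weighted mean after calibration
```

After calibration the weighted mean age equals the benchmark up to floating-point error, while each weight moves only slightly from its design value.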

Although Bayesian methods are useful in dealing with complex problems, to our knowledge, none of the popular Bayesian software programs (e.g., BUGS, Bayesian inference Using Gibbs Sampling, Cambridge Institute of Public Health13) provide built-in facilities for adopting survey weights. Even so, and in contrast to the largely theoretical Calibrated Bayes method, it is still possible to use Bayesian software and survey weights together in a relatively simple way via the R software program.

Consider, for example, an analyst who wants to use NHANES data to estimate a disease prevalence using established survey weights. NHANES provides complex survey data with multilevel structures and missing data. The analyst can run the BUGS program from R to build Bayesian models on the basis of well-documented examples (Congdon provides extensive illustrations).9 Depending on the type of target parameter, survey weights can be omitted or treated as a fixed variable in the Bayesian analysis. The analyst can thus obtain point estimates of disease conditions for NHANES study subjects by modeling multilevel data structures, missing data, and measurement errors. Design-based variance estimation can then be done with the Bayesian point estimates using the jackknife method14 in the R survey package. That is, the Bayesian modeling is performed on each of the jackknife samples to produce point estimates. However, it will be computationally burdensome if Markov chain Monte Carlo simulation is performed for each of the jackknife samples to assess the posterior means of the parameters. The computational burden can be relieved if a weighted posterior likelihood is maximized to obtain posterior modes instead.
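To make the last point concrete, the following is a minimal sketch of a weighted posterior mode for a single prevalence parameter. It assumes a Bernoulli outcome with a Beta(a, b) prior and a pseudo-likelihood in which each unit's log-likelihood contribution is multiplied by its survey weight; the rescaling of weights to sum to the sample size, the function name, and the data are assumptions made for illustration, not part of the method described in the text. Under these assumptions the mode has a closed form, so no Markov chain Monte Carlo run is needed.

```python
def weighted_posterior_mode(y, w, a=2.0, b=2.0):
    """Mode of the weighted log posterior for theta = P(y = 1).

    Log posterior (up to a constant):
        sum_i w_i * [y_i*log(theta) + (1 - y_i)*log(1 - theta)]
        + (a - 1)*log(theta) + (b - 1)*log(1 - theta)
    Weights are rescaled to sum to n (an assumption) so that the prior
    is not swamped by population-scale weights. Requires a >= 1, b >= 1.
    """
    if a < 1 or b < 1:
        raise ValueError("this closed form assumes a >= 1 and b >= 1")
    n = len(y)
    scale = n / sum(w)
    w_norm = [wi * scale for wi in w]
    s1 = sum(wi * yi for wi, yi in zip(w_norm, y))        # weighted cases
    s0 = sum(wi * (1 - yi) for wi, yi in zip(w_norm, y))  # weighted non-cases
    return (s1 + a - 1) / (s1 + s0 + a + b - 2)


# Made-up binary outcomes and survey weights.
y = [1, 0, 0, 1, 0]
w = [2.0, 1.0, 1.0, 4.0, 2.0]

print(weighted_posterior_mode(y, w, a=1.0, b=1.0))  # flat prior
print(weighted_posterior_mode(y, w, a=2.0, b=2.0))  # mildly informative prior
```

With a flat Beta(1, 1) prior the mode reduces to the usual weighted proportion; a more informative prior pulls the estimate toward the prior mode.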

In terms of a statistical formula, let $y$ denote a binary disease condition, $w$ denote the survey weights, and $\theta = P(y=1)$ denote the probability of having the disease, which is the estimand of interest. The usual estimate of $\theta$ with $w$ is $\hat{\theta} = \sum wy / \sum w$, where $\sum$ indicates summation over all sampled units. Because $y$ is subject to multilevel data structures, missing data, and measurement error biases, it can be modeled using a Bayesian probability model $P(y=1 \mid x)$, where $x$ denotes a vector of auxiliary covariates that are associated with $y$. $\hat{\theta}$ can be re-parameterized by $E(\sum wy / \sum w \mid x)$, where the mathematical expectation $E(\cdot)$ is taken with respect to the Bayesian model $P(y=1 \mid x)$. $\hat{\theta}$ can be estimated within each of the jackknife samples.
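As an illustration of these two quantities, the sketch below computes the design-weighted proportion and its model-based counterpart. Because the weights are treated as fixed, the expectation of the weighted proportion given the covariates is the weighted average of the unit-level probabilities of disease under the model. The sketch assumes that posterior draws of those unit-level probabilities are already available (for example, exported from a BUGS run); the function names and all numbers are hypothetical.

```python
def weighted_prevalence(w, y):
    """Design-weighted estimate: sum(w * y) / sum(w)."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def model_based_prevalence(w, prob_draws):
    """Posterior mean of sum(w * p) / sum(w).

    prob_draws[d][i] is posterior draw d of P(y_i = 1 | x_i); the draws are
    assumed to come from a separately fitted Bayesian model.
    """
    total_w = sum(w)
    per_draw = [
        sum(wi * pi for wi, pi in zip(w, draw)) / total_w
        for draw in prob_draws
    ]
    return sum(per_draw) / len(per_draw)


# Made-up weights, observed outcomes, and two posterior draws per unit.
w = [3.0, 1.0, 2.0, 4.0]
y = [1, 0, 0, 1]
prob_draws = [
    [0.8, 0.2, 0.3, 0.9],
    [0.6, 0.4, 0.1, 0.7],
]

theta_design = weighted_prevalence(w, y)
theta_model = model_based_prevalence(w, prob_draws)
print(theta_design, theta_model)
```

In practice there would be thousands of draws rather than two, and units with missing or mismeasured outcomes would contribute through their modeled probabilities.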

To summarize this estimation process, only three steps are needed to conduct the Bayesian analysis with survey weights, as follows:

  1. Divide the data into jackknife samples.
  2. Obtain Bayes' point estimates of the target estimands (e.g., a disease outcome) for each jackknife sample.
  3. Summarize the sample mean and sample variance of the jackknifed estimates.
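The three steps can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes an unstratified delete-one-PSU jackknife, uses a closed-form weighted posterior mode under a Beta(a, b) prior as the Bayes point estimate, and applies the conventional jackknife scaling factor (G - 1)/G to the squared deviations instead of the plain sample variance. The function names, PSU labels, and data are hypothetical; a real NHANES analysis would also respect strata, for example through replicate weights from the R survey package.

```python
def bayes_point_estimate(y, w, a=2.0, b=2.0):
    """Weighted posterior mode of P(y = 1) under a Beta(a, b) prior.

    Weights are rescaled to sum to the number of units (an assumption).
    """
    scale = len(y) / sum(w)
    s1 = sum(wi * scale * yi for wi, yi in zip(w, y))
    s0 = sum(wi * scale * (1 - yi) for wi, yi in zip(w, y))
    return (s1 + a - 1) / (s1 + s0 + a + b - 2)


def jackknife_bayes(y, w, psu, a=2.0, b=2.0):
    """Delete-one-PSU jackknife around a Bayes point estimate."""
    groups = sorted(set(psu))
    g = len(groups)
    replicates = []
    for left_out in groups:
        # Step 1: form the jackknife sample by dropping one PSU.
        keep = [i for i, p in enumerate(psu) if p != left_out]
        # Step 2: Bayes point estimate on the remaining units.
        replicates.append(
            bayes_point_estimate([y[i] for i in keep], [w[i] for i in keep], a, b)
        )
    # Step 3: summarize the replicate estimates.
    mean_rep = sum(replicates) / g
    variance = (g - 1) / g * sum((r - mean_rep) ** 2 for r in replicates)
    return mean_rep, variance, replicates


# Made-up data: eight units in four primary sampling units (PSUs).
psu = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1, 0, 0, 0, 1, 1, 0, 1]
w = [2.0, 1.0, 1.0, 3.0, 2.0, 2.0, 1.0, 4.0]

estimate, variance, replicates = jackknife_bayes(y, w, psu)
print(estimate, variance ** 0.5)  # point estimate and jackknife standard error
```

Replacing `bayes_point_estimate` with a call to a full Bayesian model fit would follow the same pattern, at the computational cost discussed earlier.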

As described previously, Bayesian modeling is generally known to be suitable for handling multilevel data structures, missing data, and measurement errors. However, the Bayesian modeling itself does not provide proper variance estimates in the sense of design-based estimation. Using the jackknife resampling method, the Bayesian point estimates can yield design-based variance estimates. Alternatives to the jackknife method are the bootstrap resampling method, balanced repeated replication, and other resampling methods. In this way, analysts can benefit from the Bayesian methodology for multilevel data, missing data, and measurement errors as well as the calibration technique to report nationally representative estimates.

Acknowledgments

None.

Conflicts of interest

The authors declare that there are no conflicts of interest.

References

  1. http://www.cdc.gov/nchs/nhanes.
  2. http://www.cdc.gov/nchs/nsfg.htm.
  3. Fanfair RN, Zaidi A, Taylor LD, et al. Trends in seroprevalence of herpes simplex virus type 2 among non-Hispanic blacks and non-Hispanic whites aged 14 to 49 years--United States, 1988 to 2010. Sex Transm Dis. 2013;40(11):860–864.
  4. Xu F, Sternberg MR, Kottiri BJ, et al. Trends in herpes simplex virus type 1 and type 2 seroprevalence in the United States. JAMA. 2006;296(8):964–973.
  5. SAS Institute Inc. Cary, North Carolina 27513, USA. 2008.
  6. http://www.R-project.org/.
  7. Gelman A. Struggles with survey weighting and regression modeling. Statistical Science. 2007;22(2):153–164.
  8. Gelman A, Carlin JB, Stern HS. Bayesian data analysis. 2nd edn. Chapman and Hall/CRC 690. 2003.
  9. Congdon P. Bayesian statistical modeling. 2nd edn. John Wiley & Sons. 2006. p. 1–573.
  10. Little R. Calibrated Bayes, an alternative inferential paradigm for official statistics. Journal of official statistics. 2012;28(3):309–334.
  11. Deville JC, Sarndal CE. Calibration Estimators in Survey Sampling. Journal of the American Statistical Association. 1992;87(418):376–382.
  12. Little R. Calibrated Bayes, for statistics in general, and missing data in particular. Statistical Science. 2011;26(2):162–174.
  13. Lunn D, Spiegelhalter D, Thomas A, et al. The BUGS project: Evolution, critique and future directions. Stat Med. 2009;28(25):3049–3067.
  14. Canty AJ, Davison AC. Resampling-based variance estimation for labor force surveys. Journal of the Royal Statistical Society. 1999;48(3):379–391.

©2016 Kang, et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and build upon your work non-commercially.