Research Article Volume 2 Issue 4
^{1}Center for Human Genetics, USA
^{2}Computation and Informatics in Biology and Medicine, USA
^{3}Department of Genetics, Stanford University, USA
Correspondence: Steven J Schrodi, Center for Human Genetics, 1000 N Oak Ave_MLR, Marshfield, WI, 54449 USA, Tel 715 221 6443, Fax 715 389 4950
Received: June 29, 2015  Published: September 2, 2015
Citation: Schrodi SJ, Jones H. Calculating exact p-values from the McNemar transmission/disequilibrium test statistic. J Investig Genomics. 2015;2(4):85-88. DOI: 10.15406/jig.2015.02.00032
The transmission/disequilibrium test (TDT) is a popular method for analyzing genetic data in studies of complex disease. It is often assumed that the P-values for the test are well approximated by the asymptotic chi-squared distribution. However, that assumption is not always accurate. We derive a formula for the exact P-value of the TDT McNemar statistic and show that the asymptotic P-values can often depart considerably from the exact P-values, even when sample sizes are relatively large. Notably, the asymptotic P-values can be either too small or too large, leading to false positive or false negative results, respectively. Since the exact P-value for this statistic is simple to calculate, it is preferable to do so. We also anticipate that our derivation may find utility in other applications of the McNemar statistic where the underlying variables are binomially distributed.
Keywords: transmission/disequilibrium test, TDT, McNemar statistic, exact p-value, disease gene mapping
Genetic association studies have become increasingly common in recent years. These studies aim to detect an increase in the frequency of a disease-predisposing variant in a population of affected individuals compared with a control population. In most cases, the predisposing allele cannot be interrogated directly, and instead a dense set of genetic markers is used as a surrogate. The association study then aims to detect a significant difference in the frequency of one or more alleles at the markers. Such an increase depends on the existence of linkage disequilibrium between a predisposing allele and one or more genetic markers. Because linkage disequilibrium only extends over a short distance, the most commonly used genetic markers are single nucleotide polymorphisms (SNPs), as they are numerous enough to provide dense coverage of the genome Reich et al.,^{1} Reich et al.^{2} and are easily and inexpensively assayed on high-density arrays with well-validated analysis techniques Guo et al.^{3}
The simplest experimental design for a genetic association study is to compare a population of cases (patients with the disease being studied) with a population of controls (unaffected individuals). This classic case-control design has been extensively studied in the field of epidemiology with many refinements being incorporated (unequal numbers of cases and controls, different methods for “matching”, related cases and unrelated controls, etc.) Breslow & Day,^{4} Risch & Teng,^{5} Teng & Risch,^{6} Slager & Schaid.^{7} A significant drawback of the case-control design is the potential for confounding, which can lead to false positive and false negative results. This arises when an unknown factor causes the populations to differ, even though it may not contribute to the phenotype being examined. In terms of genetics, this may arise when one population is more homogeneous than the other. For example, suppose that cases and controls are sampled from different geographic locations exhibiting different genetic histories. Allele frequency discrepancies at a particular marker between cases and controls may be due to the sampling bias rather than disease status. In this simple example, it would likely be relatively easy to tell that the two populations were not well matched (that is, that the background level of relatedness was not equal in the two populations). A simple analysis of markers from across the genome would show that the cases showed greater genetic homogeneity for all the markers Devlin & Roeder,^{8} Pritchard et al.,^{9} Ardlie et al.^{10} However, this confounding (or stratification) can exist in much subtler forms and can lead to spurious results arising from case-control studies.
Debate as to the extent of this bias between cases and controls is ongoing, and several methods have been developed to either remove genetic background outliers or adjust by principal components derived from large numbers of SNPs Price et al.^{11} Additionally, the primary hypothesis tested with case-control designs is independence between disease status and genotype counts. This may have limitations in that truly causative variants generate many genetic patterns in data sets that are not fully interrogated by basic analyses conducted on case-control data: (1) there is a well-described decay of statistical association patterns with declining linkage disequilibrium from causal sites Schrodi et al.,^{12} Garcia et al.;^{13} (2) Hardy-Weinberg disequilibrium will exist in affected individuals at the causal site under many disease models Nielsen et al.;^{14} and (3) causative variants tend to segregate in families with disease status (i.e., linkage) Mohr,^{15} Bernstein,^{16} Haldane & Smith.^{17}
An alternative design is the Transmission/Disequilibrium Test (TDT) Spielman et al.^{18} This is a family-based method that requires the parents of the affected individual to be available for genotyping. The idea is qualitatively similar to the case-control design except that the population of controls comprises the non-transmitted alleles from the parents. That is, of the four parental alleles, two are transmitted to the affected child. The other two are not transmitted and hence should be a random sample from the population from which the cases were selected. These two alleles are used as the control genotype. In this way, the case and control populations are well matched. Importantly, only heterozygous parents are informative, and so the effective sample size may be much smaller than the total number of families in the study. Hence, highly polymorphic markers that tag chromosomes can be a significant advantage when conducting transmission-based tests. Subtly, the test evaluates the simple hypothesis of Mendel’s law of segregation for parents to offspring, rather than independence between disease status and genotypes. Similar to case-control studies and affected sibling pair linkage studies, the TDT aims to combine signal across a large number of small families and as such may lose substantial power under disease models of high locus heterogeneity. It should be noted that numerous extensions to the TDT have been proposed, including those that extend to larger families and multiplex situations Martin et al.^{19}
Data from a TDT association study are analyzed by comparing the transmitted allele to the untransmitted allele. Under the disease model, a causative allele should be transmitted to affected offspring more often than the alternative allele(s) at the site interrogated. That is, the difference in the frequencies of transmission for each allele is greater than expected under Mendel’s law of segregation (the null hypothesis), where each allele has an equal probability of being transmitted to the offspring. A McNemar test statistic was originally proposed as the TDT test statistic Spielman et al.^{18} Assuming a biallelic marker segregating alleles A_{1} and A_{2}, it takes the form
$T=\frac{{\left({X}_{1}-{X}_{2}\right)}^{2}}{{X}_{1}+{X}_{2}};$ (1)
where X_{1} and X_{2} are the numbers of transmissions of the A_{1} and A_{2} alleles, respectively, from parents that are heterozygous at the locus evaluated. Researchers tend to calculate p-values from this statistic using the asymptotic result, namely the chi-squared limiting distribution with one degree of freedom. Let N denote the total number of transmissions from heterozygous parents (N=X_{1}+X_{2}); then,
${\mathrm{lim}}_{N\to \infty}\frac{1}{dt}P\left[T\in \left(t,t+dt\right)\right]=\frac{1}{\sqrt{2\pi t}}\mathrm{exp}\left(-\frac{t}{2}\right)$ (2)
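As a minimal sketch of eqns (1) and (2), the Python fragment below computes T from two transmission counts and the corresponding asymptotic p-value. The function names `tdt_statistic` and `asymptotic_p_value` are illustrative and not part of the original work. The tail probability relies on the identity P[chi-squared(1) >= t] = erfc(sqrt(t/2)), which holds only for one degree of freedom.

```python
import math


def tdt_statistic(x1, x2):
    """McNemar-type TDT statistic of eqn (1) from two transmission counts."""
    n = x1 + x2
    if n == 0:
        raise ValueError("no informative transmissions")
    return (x1 - x2) ** 2 / n


def asymptotic_p_value(t):
    """Upper tail of the chi-squared distribution with one degree of freedom.

    For one degree of freedom, P[chi2 >= t] = erfc(sqrt(t / 2)).
    """
    return math.erfc(math.sqrt(t / 2.0))
```

For instance, `tdt_statistic(60, 40)` evaluates to 4.0, and `asymptotic_p_value(4.0)` is approximately 0.0455.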
The density of X_{1} under the null hypothesis of no linkage and no association with disease under Mendel’s first law is simply
$P\left[{X}_{1}=x\right]=\left(\begin{array}{c}N\\ x\end{array}\right){2}^{-N}$ (3)
For finite values of N, eqn (2) does not strictly hold and hence using this limiting distribution to determine a p-value is prone to error. For example, the variance of T is
$Var\left[T\right]=\frac{1}{{N}^{2}}\left\{E\left[{\left(2{X}_{1}-N\right)}^{4}\right]-{N}^{2}\right\}=\frac{2\left(N-1\right)}{N},$ (4)
as opposed to 2 under the limiting distribution. This departure is non-negligible for small values of N.
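The variance result can be checked numerically. The sketch below enumerates X_{1} over its binomial null distribution and returns the exact mean and variance of T as rational numbers, so the comparison with 2(N-1)/N involves no rounding. The function name `exact_moments_of_t` is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb


def exact_moments_of_t(n):
    """Return (mean, variance) of T under the null, X1 ~ Binomial(n, 1/2)."""
    total = 2 ** n
    mean = Fraction(0)
    second_moment = Fraction(0)
    for x in range(n + 1):
        prob = Fraction(comb(n, x), total)   # null probability of X1 = x
        t = Fraction((2 * x - n) ** 2, n)    # T written as (2*X1 - N)^2 / N
        mean += prob * t
        second_moment += prob * t * t
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    for n in (10, 20, 100):
        mean, var = exact_moments_of_t(n)
        print(n, mean, var, Fraction(2 * (n - 1), n))
```

For N = 20 the exact variance is 19/10, five percent below the limiting value of 2.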
The exact density of T can be derived, and we use this to calculate the appropriate p-value and examine the rate of convergence to the p-value calculated under the limiting distribution. As the McNemar statistic is commonly used in numerous scenarios within genetics and other fields, there may be additional applications for the exact density of T.
$P\left[T=t\right]=P\left\{\left[{X}_{1}=\frac{N+\sqrt{Nt}}{2}\right]\cup \left[{X}_{1}=\frac{N-\sqrt{Nt}}{2}\right]\right\}$ (5)
Since these are disjoint events for t>0,
$=P\left[{X}_{1}=\frac{N+\sqrt{Nt}}{2}\right]+P\left[{X}_{1}=\frac{N-\sqrt{Nt}}{2}\right]$ (6)
Employing eqn (3), and noting that the two terms are equal by the symmetry of the binomial coefficient,
$=\frac{N!}{{2}^{N-1}\left(\frac{N+\sqrt{Nt}}{2}\right)!\left(\frac{N-\sqrt{Nt}}{2}\right)!}$ (7)
For t=0 the two events coincide, and the probability is the single term given by eqn (3) with x=N/2. So, for an observed value of the statistic T=t_{obs}>0, a p-value can be calculated directly with
$P\left[T\ge {t}_{obs}\right]=\frac{N!}{{2}^{N-1}}{\sum}_{u\ge {t}_{obs}}{\left[\left(\frac{N+\sqrt{Nu}}{2}\right)!\left(\frac{N-\sqrt{Nu}}{2}\right)!\right]}^{-1}$ (8)
where the sum runs over the attainable values u of T.
To exemplify the use of eqn (8), suppose that one observed 60 transmissions of the A_{1} allele from a total of 100 informative transmissions. The T statistic then takes a value of 4. Using the limiting chi-squared distribution with one degree of freedom as the null distribution, the p-value would be calculated as 0.0455, whereas eqn (8) yields an exact p-value for the McNemar statistic of 0.0569. Thus, in this example, the asymptotic approach exaggerates the significance of these data. Further, the departure of the p-value calculated using the chi-squared distribution may be positive or negative. That is, the asymptotic test may be either anticonservative or conservative depending on the region of the parameter space. For example, in a highly significant case where A_{1} has 90 transmissions from a total of 100 informative transmissions, the asymptotic p-value is 1.24×10^{-15}, while the exact result is 3.06×10^{-17}.
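The two worked examples can be reproduced with a short calculation. The sketch below evaluates the exact p-value as a two-sided binomial tail, summing the null probabilities of every count x whose statistic is at least as large as the one observed. The function name `exact_tdt_p_value` is illustrative, and comparing |2x - N| as integers (rather than comparing values of T) is a design choice made here to avoid floating-point ties.

```python
from math import comb


def exact_tdt_p_value(x1, n):
    """Exact P[T >= t_obs] under the null, with X1 ~ Binomial(n, 1/2).

    x1 is the observed number of A1 transmissions and n the number of
    informative transmissions. T >= t_obs is equivalent to
    |2*x - n| >= |2*x1 - n|, so the comparison is done on integers.
    """
    d_obs = abs(2 * x1 - n)
    favourable = sum(
        comb(n, x) for x in range(n + 1) if abs(2 * x - n) >= d_obs
    )
    return favourable / 2 ** n


if __name__ == "__main__":
    print(exact_tdt_p_value(60, 100))  # about 0.0569
    print(exact_tdt_p_value(90, 100))  # about 3.06e-17
```

Because the tail is summed with exact integer arithmetic before the final division, very small p-values such as the second example do not suffer from cancellation.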
Figure 1 shows the ratio of the exact to asymptotic p-values as the number of transmitted alleles varies, assuming a total of 100 informative transmissions. When the number of transmissions is close to the null expectation of 50, the two p-values are very similar and therefore the ratio is close to unity. As the number of transmissions increases, the p-value given by the chi-squared approximation is less than the exact p-value, giving a ratio greater than unity. In this case, the approximation overestimates the significance, potentially leading to false positive results. Note that the region of the parameter space where the proportion of transmissions is only slightly greater than the null may be the most realistic scenario for a study (e.g., transmission of the predisposing allele of ~60% compared to the null of 50%). Thus, for realistic values of transmission, the asymptotic result can lead to false positive results where association is deemed to exist when it does not. For higher levels of transmission (>75%), the situation is reversed, with the asymptotic p-value being greater than the exact p-value, underestimating the true significance of the data and leading to false negative results.
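The qualitative shape described for Figure 1 can be explored with a small sweep over the number of A_{1} transmissions. This is only a sketch: the function name `p_value_ratio` and the step of five transmissions are assumptions, and it does not reproduce the plotting of the original figure.

```python
from math import comb, erfc, sqrt


def p_value_ratio(x1, n):
    """Ratio of the exact to the asymptotic p-value for x1 of n transmissions."""
    d = abs(2 * x1 - n)
    # Exact two-sided binomial tail under the null.
    exact = sum(comb(n, x) for x in range(n + 1) if abs(2 * x - n) >= d) / 2 ** n
    # Chi-squared (1 df) tail at T = d^2 / n.
    asymptotic = erfc(sqrt(d * d / n / 2.0))
    return exact / asymptotic


if __name__ == "__main__":
    for x1 in range(50, 100, 5):
        print(x1, p_value_ratio(x1, 100))
```

Ratios above one mark the region where the asymptotic test is anticonservative, and ratios below one mark where it is conservative.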
Table 1 presents asymptotic and exact p-values for a variety of different sample sizes and transmission frequencies. Here again, the asymptotic p-value can be either greater than or less than the true value. Simulation studies were also carried out to verify these results. Table 2 shows the p-value from one million simulations and the corresponding chi-squared probability. Again, using the asymptotic p-value can lead to substantial errors that may be conservative or anticonservative. To calculate statistical power or carry out Bayesian derivations, the probability that the T statistic takes a given value under the alternative hypothesis is needed. That is, a formula analogous to eqn (7) is required for probabilities of transmission that deviate from one half. For a general transmission probability q, this is given as,
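$P{\left[T={t}_{obs}\right]}_{\text{disease model}}={\left[q\left(1-q\right)\right]}^{\frac{1}{2}\left(N-\sqrt{N{t}_{obs}}\right)}\left[{\left(1-q\right)}^{\sqrt{N{t}_{obs}}}\left(\begin{array}{c}N\\ \frac{N-\sqrt{N{t}_{obs}}}{2}\end{array}\right)+{q}^{\sqrt{N{t}_{obs}}}\left(\begin{array}{c}N\\ \frac{N+\sqrt{N{t}_{obs}}}{2}\end{array}\right)\right]$ (9)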
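As a hedged illustration of how eqn (9) might be used for a power calculation, the sketch below evaluates the probability of each attainable value of T under a transmission probability q and sums the values at or beyond a critical threshold. The names `prob_t_alternative` and `power`, and the parameterisation by d = sqrt(N t), are assumptions introduced here. The case d = 0 is handled separately because the two events then coincide, and floating-point arithmetic limits the sketch to moderate N.

```python
from math import comb


def prob_t_alternative(d, n, q):
    """P[|2*X1 - n| = d] for X1 ~ Binomial(n, q), where d = sqrt(n * t)."""
    if d < 0 or d > n or (n - d) % 2 != 0:
        return 0.0  # T cannot take the value d^2 / n
    lo, hi = (n - d) // 2, (n + d) // 2
    common = (q * (1 - q)) ** lo  # factor [q(1-q)]^((N - sqrt(Nt)) / 2)
    if d == 0:
        # The two events coincide, so the single term is counted once.
        return comb(n, lo) * common
    return common * ((1 - q) ** d * comb(n, lo) + q ** d * comb(n, hi))


def power(d_crit, n, q):
    """P[T >= d_crit^2 / n] under transmission probability q."""
    return sum(prob_t_alternative(d, n, q) for d in range(d_crit, n + 1))


if __name__ == "__main__":
    # Probability of observing T >= 4 with 100 informative transmissions.
    print(power(20, 100, 0.5), power(20, 100, 0.6))
```

With q = 0.5 the function reduces to the null tail probability, which offers a simple consistency check against the exact p-value of 0.0569 quoted for T = 4.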
| Proportion of transmissions to affected offspring | p-value | Number of informative transmissions: 20 | 60 | 100 | 200 |
|---|---|---|---|---|---|
| 55% | Exact | 0.825 | 0.529 | 0.368 | 0.09 |
| 55% | Asymptotic | 0.655 | 0.439 | 0.317 | 0.157 |
| 55% | Ratio | 1.26 | 1.18 | 1.16 | 0.57 |
| 65% | Exact | 0.263 | 0.027 | 0.003 | 1.3×10^{-5} |
| 65% | Asymptotic | 0.189 | 0.02 | 0.004 | 2.2×10^{-5} |
| 65% | Ratio | 1.46 | 1.35 | 0.75 | 0.6 |
| 75% | Exact | 0.041 | 0.041 | 5.6×10^{-7} | 4.2×10^{-13} |
| 75% | Asymptotic | 0.025 | 0.025 | 5.7×10^{-7} | 1.5×10^{-12} |
| 75% | Ratio | 1.64 | 1.64 | 0.98 | 0.27 |

Table 1 Asymptotic and exact p-values for a variety of different sample sizes and transmission frequencies
| Transmissions | Replicates | Quantile | T Quantile | Chi-Squared Probability | % Error |
|---|---|---|---|---|---|
| 20 | 1000000 | 0.95 | 3.2 | 0.0736 | 47.3 |
| 20 | 1000000 | 0.99 | 7.2 | 0.00729 | 27.1 |
| 20 | 1000000 | 0.999 | 9.8 | 0.00175 | 74.5 |
| 20 | 1000000 | 0.9999 | 12.8 | 0.000347 | 246.6 |
| 40 | 1000000 | 0.95 | 3.6 | 0.0578 | 15.6 |
| 40 | 1000000 | 0.99 | 6.4 | 0.0114 | 14.6 |
| 40 | 1000000 | 0.999 | 10 | 0.00157 | 56.5 |
| 40 | 1000000 | 0.9999 | 14.4 | 0.000148 | 47.8 |
| 100 | 1000000 | 0.95 | 4 | 0.0455 | 9 |
| 100 | 1000000 | 0.99 | 6.8 | 0.00932 | 6.8 |
| 100 | 1000000 | 0.999 | 10.2 | 0.00137 | 37.4 |

Table 2 The p-value from one million simulations and the corresponding chi-squared probability
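A simulation of this kind can be sketched as follows. This is not the authors' simulation code: the function names, the fixed seed, and the default of 100,000 replicates (reduced from one million to keep the run short) are all assumptions. Each replicate draws X_{1} as the number of set bits among N fair random bits, computes T, and the requested empirical quantile of T is then compared with the chi-squared tail probability at that value.

```python
import math
import random


def simulate_t_quantile(n, quantile, replicates=100_000, seed=1):
    """Empirical quantile of T under the null from simulated transmissions."""
    rng = random.Random(seed)
    stats = []
    for _ in range(replicates):
        # The number of set bits in n fair random bits is Binomial(n, 1/2).
        x1 = bin(rng.getrandbits(n)).count("1")
        stats.append((2 * x1 - n) ** 2 / n)
    stats.sort()
    # Smallest simulated value whose empirical CDF reaches the quantile.
    index = math.ceil(quantile * replicates) - 1
    return stats[index]


def chi2_tail_1df(t):
    """P[chi-squared(1) >= t], valid for one degree of freedom only."""
    return math.erfc(math.sqrt(t / 2.0))


if __name__ == "__main__":
    n, quantile = 100, 0.95
    t_q = simulate_t_quantile(n, quantile)
    nominal = 1.0 - quantile
    chi2_prob = chi2_tail_1df(t_q)
    pct_error = 100.0 * abs(chi2_prob - nominal) / nominal
    print(t_q, chi2_prob, pct_error)
```

The percentage error here is taken as the relative difference between the chi-squared probability and the nominal tail area, which is one reading of the "% Error" column and is itself an assumption.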
The TDT is a commonly used method of carrying out disease mapping studies. Because it requires parental DNA to be available and the parents to be heterozygous for the marker being interrogated, sample sizes will often be modest. This is especially true if biallelic SNPs are being used. Results in Figure 1 & Table 1 show that the standard method of calculating a p-value by appealing to the asymptotic distribution can lead to both false positive and false negative results. Given the time and cost of genetic studies, such errors can be problematic. False positive results may lead a researcher to continue to pursue a region of the genome that does not harbor a predisposing allele. Conversely, false negatives may result in regions of the genome being excluded, even though they contain genetic factors that play a role in the disease of interest. Most notably, in the example given here, the asymptotic test is anticonservative in the region of the parameter space most likely to be observed in a genetic association study. This leads to the dangerous situation where evidence for association with disease is believed to be established at a given significance level when, in fact, it is not. Given the legion of problems that can arise from false positive and false negative results, it is important to correctly calculate the probability of the observed data under the null hypothesis, especially when the sample size is limited.
None.
The authors declare that there is no conflict of interest.
©2015 Schrodi, et al. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work noncommercially.