Many statistical applications and inferences rely on the validity of the underlying distributional assumption. Symmetry of the underlying distribution is essential in many statistical inference and modeling procedures. There are several tests of symmetry in the literature; however, most of these tests suffer from low statistical power. Tests have been suggested by Butler,1 Rothman & Woodroofe,2 Hill & Rao,3 Baklizi,4 and McWilliams.5 McWilliams5 showed, using simulation, that his runs test of symmetry is more powerful than those provided by Butler,1 Rothman & Woodroofe,2 and Hill & Rao3 for various asymmetric alternatives. However, Tajuddin6 introduced a distribution-free test of symmetry based on the Wilcoxon two-sample test which is more powerful than the runs test.
Moreover, Modarres & Gastwirth7 modified the McWilliams5 runs test by using Wilcoxon scores to weight the runs. The new test improved the power for testing symmetry about a known center, but did not perform well when the asymmetry is concentrated in regions close to the median of the distribution. Mira8 introduced a distribution-free test of symmetry based on Bonferroni's measure, and showed that it outperforms the tests introduced by Modarres & Gastwirth7 and Tajuddin.6 Recently, Samawi et al.9 provided a test of symmetry based on a nonparametric overlap measure, and demonstrated that it outperformed other tests of symmetry in the literature, including the runs test. Samawi & Helu10 introduced a runs test of conditional symmetry which is powerful enough to detect even small asymmetry in the shape of the conditional distribution. In addition, the Samawi & Helu10 test requires neither approximations nor extra computations, such as kernel estimation of the density function, that are needed by other tests in the literature.
This paper uses the Kullback-Leibler information to test for the symmetry of the underlying distribution. Let $f_1(x)$ and $f_2(x)$ be two probability density functions, and assume samples of observations are drawn from continuous distributions. The Kullback-Leibler discrimination information function is given by

$$D(f_1, f_2) = \int f_1(x)\,\log\frac{f_1(x)}{f_2(x)}\,dx, \qquad (1)$$

as defined by Kullback & Leibler.11 For simplicity we will write (1) as $D(1,2)$, where $D(1,2) = E_{f_1}\!\left[\log\{f_1(X)/f_2(X)\}\right]$.
This measure can be directly applied to discrete distributions by replacing the integrals with summations. It is well known that $D(f_1, f_2) \geq 0$, and the equality holds if and only if $f_1 = f_2$ almost everywhere. The discrimination function $D(f_1, f_2)$ measures the disparity between $f_1$ and $f_2$.
Many authors have used the discrimination function $D(f_1, f_2)$ for testing the goodness of fit of some distributions; see, for example, Alizadeh & Arghami.12,13
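As a minimal numerical sketch of (1), the discrimination function can be evaluated by direct quadrature; the two normal densities below are illustrative assumptions, not part of the original study.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Numerical sketch of the discrimination function (1):
# D(f1, f2) = integral of f1(x) * log(f1(x) / f2(x)) dx.
# Two normal densities are used purely for illustration.
f1 = stats.norm(loc=0.0, scale=1.0).pdf
f2 = stats.norm(loc=0.5, scale=1.0).pdf

D, _ = quad(lambda x: f1(x) * np.log(f1(x) / f2(x)), -10, 10)
print(f"D(f1, f2) = {D:.4f}")  # D >= 0, with D = 0 iff f1 = f2 a.e.;
                               # the closed form here is (0.5**2)/2 = 0.125
```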
In this paper we consider testing the null hypothesis of symmetry for an underlying absolutely continuous distribution $F$ with known location parameter and density denoted by $f(x)$. Under the null hypothesis of symmetry, if we let $f_1(x) = f(x)$ and $f_2(x) = f(-x)$, then $D(f_1, f_2) = 0$.
Since kernel density estimation procedures are readily available in various statistical software packages such as SAS, STATA, S-Plus, and R, we were interested in developing a new test of symmetry using kernel density estimation of $f_1$ and $f_2$. This paper introduces a powerful test of symmetry based on the Kullback-Leibler discrimination information function. The Kullback-Leibler information test of symmetry and its asymptotic properties are introduced in Section 2. A simulation study is provided in Section 3. Illustrations of the test using base deficit score data and final comments are given in Section 4.
Assume that a random sample, $X_1, X_2, \ldots, X_n$, is drawn from an absolutely continuous distribution having known median, assumed to be 0. If the median, or the center of the distribution, is not known, the data can be centered by a consistent estimate of the median. However, the implications of centering the data around a consistent estimator of the median for the asymptotic properties are not straightforward; further investigation is needed to study the robustness of the proposed test of symmetry and to compare it with other available tests of symmetry when the median is unknown. In this paper we discuss only the case where the median of the underlying distribution is assumed known.
Consider testing for symmetry

$$H_0: f(x) = f(-x) \text{ for all } x \quad \text{versus} \quad H_1: f(x) \neq f(-x) \text{ for some } x.$$

Let $f_1(x) = f(x)$ and $f_2(x) = f(-x)$. Under the null hypothesis, $D(f_1, f_2) = 0$. An equivalent hypothesis for testing symmetry is therefore $H_0: D(f_1, f_2) = 0$ versus $H_1: D(f_1, f_2) > 0$.

Let $\hat{D}(f_1, f_2)$ be a consistent nonparametric estimator of $D(f_1, f_2)$. Under the null hypothesis of symmetry and some regularity assumptions, which will be discussed later in this paper, we propose the following test of symmetry:

$$Z = \frac{\hat{D}(f_1, f_2)}{\hat{\sigma}_{\hat{D}}} \;\dot{\sim}\; N(0, 1)$$

for large $n$, where $\hat{\sigma}_{\hat{D}}$ is a consistent estimator of the standard error of $\hat{D}(f_1, f_2)$. An asymptotic significance test procedure at level $\alpha$ is to reject $H_0$ if $Z > z_{\alpha}$, where $z_{\alpha}$ is the upper $\alpha$ percentile of the standard normal distribution.
Kernel estimation of $D(f_1, f_2)$
For the i.i.d. sample $X_1, X_2, \ldots, X_n$, let $\hat{f}$ be an estimate of $f$. To determine which estimator of $D(f_1, f_2)$ is appropriate for our inference procedure, we need to state some necessary conditions:
C1: $f$ is continuous (smoothness condition).
C2: $f$ is $k$ times differentiable (smoothness condition).
C3: a tail condition, stated in terms of $[X]$, the integer part of $X$.
C4: a second tail condition.
C5: a peak condition (note that this is also a mild tail condition).
C6: $f$ is bounded (peak condition).
Some suggested estimators for $D(f_1, f_2)$ may be found in the literature. These include plug-in estimates of entropy, which are based on a consistent density estimate $\hat{f}$ of $f$; an example is the integral estimate of entropy introduced by Dmitriev & Tarasenko.14 Joe15 considers estimating the entropy when the underlying density is a multivariate pdf, but he points out that the calculation becomes more difficult when $\hat{f}$ is a kernel estimator and the dimension of the integral is more than two. He therefore excludes the integral estimate from further study. The integral estimator can, however, be easily calculated if, for example, $\hat{f}$ is a histogram.
The re-substitution estimate proposed by Ahmad & Lin16 is

$$\hat{H}(f) = -\frac{1}{n}\sum_{i=1}^{n}\log \hat{f}(X_i), \qquad (3)$$

where $\hat{f}$ is a kernel density estimator and $H(f) = -\int f(x)\log f(x)\,dx$ denotes the entropy of $f$. They showed the mean square consistency of (3), that is, $E[\hat{H}(f) - H(f)]^2 \to 0$ as $n \to \infty$.
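A minimal sketch of a re-substitution estimate in the spirit of (3), assuming a Gaussian kernel with its default bandwidth rule; the sample and the comparison value are illustrative only.

```python
import numpy as np
from scipy import stats

# Re-substitution entropy estimate, cf. (3):
# H_hat = -(1/n) * sum_i log f_hat(X_i), where f_hat is a kernel
# density estimate evaluated back at the sample points.
rng = np.random.default_rng(1)
x = rng.normal(size=200)               # illustrative N(0, 1) sample

f_hat = stats.gaussian_kde(x)          # Gaussian kernel, Scott's bandwidth rule
H_hat = -np.mean(np.log(f_hat(x)))     # plug the sample back into f_hat
print(f"H_hat = {H_hat:.3f}")          # true N(0, 1) entropy is about 1.419
```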
Joe15 considers the estimation of entropy for multivariate pdfs by an entropy estimate of the re-substitution type (3), also based on a kernel density estimate. He obtained asymptotic bias and variance terms and showed that non-unimodal kernels satisfying certain conditions can reduce the mean square error. His analysis and simulations suggest that the sample size needed for good estimates increases rapidly with the dimension of the multivariate density. His results rely heavily on conditions C4 and C6. Hall & Morton17 investigated the properties of an estimator of type (3) both when $\hat{f}$ is a histogram density estimator and when it is a kernel estimator. For the histogram estimator, they established the asymptotic behavior of $\hat{H}(f)$ under certain tail and smoothness conditions. (4)
Other estimators based on sample spacings are investigated by Tarasenko,18 Beirlant & van Zuijlen,19 Hall,20 Cressie,21 Dudewicz & van der Meulen,22 and Beirlant.23 Finally, other nonparametric estimators have been discussed by many authors, including Vasicek,24 Dudewicz & van der Meulen,22 Bowman,25 and Alizadeh.26 Among these various entropy estimators, Vasicek's sample entropy has been the most widely used in developing entropy-based statistical procedures. However, the asymptotic distribution of that estimator is hard to establish. Therefore, in this paper we adopt the kernel re-substitution estimate proposed by Ahmad & Lin.16
We will adopt the notation of Samawi et al.9 Our proposed test of symmetry is as follows. Let $X_1, X_2, \ldots, X_n$ be a random sample from an absolutely continuous distribution $F$ which is continuously differentiable with uniformly bounded derivatives and has known median. Let $K$ be a kernel function satisfying the condition $\int K(t)\,dt = 1$. For simplicity, the kernel $K$ will be assumed to be a symmetric density function with mean 0 and finite variance; an example is the standard normal density. The kernel estimators for $f_1(x) = f(x)$ and $f_2(x) = f(-x)$ are:
$$\hat{f}_1(x) = \frac{1}{nh_1}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h_1}\right) \qquad (6)$$

and

$$\hat{f}_2(x) = \frac{1}{nh_2}\sum_{i=1}^{n} K\!\left(\frac{x + X_i}{h_2}\right), \qquad (7)$$
respectively, where $C$ is the number of bins and depends on the sample size; as in Samawi et al.,9 we suggest taking $C$ to be the integer part of the quantity recommended there. In addition, $h_1$ and $h_2$ are the bandwidths of the kernel estimators, satisfying the conditions that $h_j \to 0$ and $nh_j \to \infty$ as $n \to \infty$. There are many choices of the bandwidths ($h_1, h_2$). In our procedure we use the method suggested by Silverman.27 Using the normal distribution as the parametric family, the bandwidths of the kernel estimators are

$$h_j = 0.9\,\sigma_j^{*}\,n^{-1/5}, \qquad j = 1, 2, \qquad (8)$$

where $\sigma_j^{*} = \min\{\text{standard deviation}, \text{interquartile range}/1.349\}$ of the corresponding sample. The form (8) is found to be an adequate choice of bandwidth for many purposes, approximately minimizing the integrated mean squared error (IMSE).
We will use the Samawi et al.9 suggestion to calculate the bins; the bin endpoints and midpoints are constructed from the sample as described there.
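A short sketch of the estimators (6) and (7) with the rule-of-thumb bandwidth (8), assuming the reconstructed forms above; the sample, the evaluation grid standing in for the $C$ bin points, and the seed are illustrative assumptions.

```python
import numpy as np

def silverman_bandwidth(x):
    # Rule-of-thumb bandwidth (8): h = 0.9 * sigma_star * n**(-1/5),
    # with sigma_star = min(standard deviation, interquartile range / 1.349).
    n = len(x)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    sigma_star = min(np.std(x, ddof=1), iqr / 1.349)
    return 0.9 * sigma_star * n ** (-0.2)

def kde(x_eval, sample, h):
    # Kernel density estimate with a standard normal kernel, cf. (6)-(7).
    u = (x_eval[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
x = rng.normal(size=100)        # sample with known median 0 (illustrative)
grid = np.linspace(-3, 3, 7)    # stand-in for the C bin points

h = silverman_bandwidth(x)
f1_hat = kde(grid, x, h)        # estimate of f(x), cf. (6)
f2_hat = kde(grid, -x, h)       # estimate of f(-x), cf. (7)
```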
Using the above kernel estimators and the re-substitution form of Ahmad & Lin,16 the nonparametric kernel estimator of $D(f_1, f_2)$ under the null hypothesis is given by

$$\hat{D}(f_1, f_2) = \frac{1}{n}\sum_{i=1}^{n}\log\frac{\hat{f}_1(X_i)}{\hat{f}_2(X_i)},$$

which can be approximated by a finite sum over the $C$ bins described above. The approximate variance of $\hat{D}(f_1, f_2)$ is obtained from the functional representation derived in the next section.
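A compact end-to-end sketch of the proposed test, assuming the re-substitution form of $\hat{D}$ above. The standard error used here is a simple plug-in estimate from the log-ratio terms, not the paper's influence-function variance, and the kernel bandwidth is scipy's default rather than (8); both are simplifying assumptions.

```python
import numpy as np
from scipy import stats

def kl_symmetry_test(x, alpha=0.05):
    # Sketch of the KL-based test of symmetry about a known median 0:
    # D_hat = mean of log(f1_hat(X_i) / f2_hat(X_i)), compared with N(0, 1).
    n = len(x)
    f1 = stats.gaussian_kde(x)    # kernel estimate of f(x)
    f2 = stats.gaussian_kde(-x)   # kernel estimate of f(-x)
    log_ratio = np.log(f1(x) / f2(x))
    D_hat = log_ratio.mean()
    se = log_ratio.std(ddof=1) / np.sqrt(n)    # simple plug-in SE (assumption)
    z = D_hat / se
    return D_hat, z, z > stats.norm.ppf(1 - alpha)  # one-sided rejection rule

rng = np.random.default_rng(3)
print(kl_symmetry_test(rng.normal(size=100)))   # symmetric: H0 usually retained
print(kl_symmetry_test(rng.exponential(size=100) - np.log(2)))  # skewed, median 0
```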
Asymptotic properties of $\hat{D}(f_1, f_2)$

The nonparametric kernel estimator $\hat{D}(f_1, f_2)$ of $D(f_1, f_2)$ is based on the univariate kernel density estimator $\hat{f}$. The necessary regularity conditions imposed on the univariate kernel for density estimation are:

I. $\int K(t)\,dt = 1$;
II. $\int t\,K(t)\,dt = 0$;
III. $\int t^2 K(t)\,dt < \infty$;
IV. $K$ is bounded, i.e., $\sup_t |K(t)| < \infty$.

These conditions may be found in Silverman27 (Chapter 3) or Wand & Jones28 (Chapter 2).
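As a quick illustrative check (under the reconstruction of conditions I-III above), the standard normal kernel satisfies them, which can be verified numerically:

```python
import numpy as np
from scipy.integrate import quad

# Numerical check that the standard normal kernel satisfies I-III.
K = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

print(quad(K, -np.inf, np.inf)[0])                        # I:   1.0
print(quad(lambda t: t * K(t), -np.inf, np.inf)[0])       # II:  0.0
print(quad(lambda t: t ** 2 * K(t), -np.inf, np.inf)[0])  # III: 1.0 (finite)
```

It is also bounded, so condition IV holds as well.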
To show consistency of $\hat{D}(f_1, f_2)$, apply the kernel density asymptotic properties found in Silverman27 (Chapter 3) or Wand & Jones28 (Chapter 2). Under conditions I-IV, and assuming that the density $f$ is continuous at each $x_i$, $i = 1, 2, \ldots, C$,

$$E[\hat{f}(x_i)] \to f(x_i) \qquad (12)$$

and

$$\operatorname{Var}[\hat{f}(x_i)] \to 0 \qquad (13)$$

as $n \to \infty$, so that $\hat{f}(x_i) \xrightarrow{p} f(x_i)$. If $f(\cdot)$ is uniformly continuous, then the kernel density estimate is strongly consistent. Moreover, as in Ahmad & Lin,16 the re-substitution estimate (3) is mean square consistent, and hence $\hat{D}(f_1, f_2) \xrightarrow{p} D(f_1, f_2)$; in particular, since $D(f_1, f_2) = 0$ under the null hypothesis, $\hat{D}(f_1, f_2) \xrightarrow{p} 0$ under $H_0$.
To derive the asymptotic distribution of $\hat{D}(f_1, f_2)$, we will define $D(f_1, f_2)$ as a functional of the pair $(f_1, f_2)$. Using the previously stated regularity conditions, some regularity conditions given by Serfling,29 and assuming that the Gâteaux derivatives of the functional exist, the partial influence functions of the functional30 can be derived; note that they have mean zero at the true densities. Now, using this functional representation of $\hat{D}(f_1, f_2)$, then as in Samawi et al.30 and Serfling,29

$$\sqrt{n}\left(\hat{D}(f_1, f_2) - D(f_1, f_2)\right) \xrightarrow{d} N(0, \sigma_D^2),$$

where $\sigma_D^2$ is the asymptotic variance determined by the partial influence functions. A consistent estimate of $\sigma_D^2$ is obtained by evaluating the influence functions at the kernel density estimates, where in our case $\hat{f}_1$ and $\hat{f}_2$ are given by (6) and (7).
For a discussion of methods addressing the performance of kernel density estimation at the boundary, see Hall & Park.31
As in Samawi et al.,9 to gain some insight into our procedure, a simulation study was conducted to investigate the performance of the new test of symmetry based on $\hat{D}(f_1, f_2)$. We compared our proposed test of symmetry with the tests proposed by McWilliams,5 Modarres & Gastwirth,32 Mira's8 Bonferroni-based test, and the Samawi et al.9 test of symmetry.
As in McWilliams,5 the runs test is described as follows. For any random sample of size $n$, let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote the sample values ordered from smallest to largest according to their absolute value (signs are retained), and let $S_1, S_2, \ldots, S_n$ denote indicator variables designating the sign of the $X_{(i)}$ values [$S_i = 1$ if $X_{(i)} > 0$, and $S_i = 0$ otherwise]. The test statistic used for testing symmetry is

$$R = \text{the number of runs in the } S_1, S_2, \ldots, S_n \text{ sequence} = 1 + \sum_{i=2}^{n} I(S_i \neq S_{i-1}),$$

where $I(\cdot)$ is the indicator function. We reject the null hypothesis if $R$ is smaller than a critical value ($r_\alpha$) at the pre-specified value of $\alpha$. Moreover, Mira's8 Bonferroni-based test statistic is $\hat{\gamma}_1 = 2(\bar{X} - \hat{m})$, where $\bar{X}$ is the sample mean and $\hat{m}$ is the sample median. The procedure is to reject the null hypothesis if $|\hat{\gamma}_1|/\hat{\sigma}_{\hat{\gamma}_1} > z_{\alpha/2}$, where $\hat{\sigma}_{\hat{\gamma}_1}$ is a consistent estimate of the standard error of $\hat{\gamma}_1$.
The Modarres & Gastwirth32 test is a hybrid test, applying the sign test in the first stage and a percentile-modified two-sample Wilcoxon test (see Gastwirth33) in the second stage. Finally, the Samawi et al.9 test of symmetry is based on a kernel estimate of the overlap measure.
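A minimal sketch of the McWilliams runs statistic described above; the samples are illustrative.

```python
import numpy as np

def runs_statistic(x):
    # McWilliams runs test of symmetry about 0: order the observations by
    # absolute value (keeping signs), then count runs in the sign sequence.
    signs = np.sign(x[np.argsort(np.abs(x))])
    return 1 + int(np.sum(signs[1:] != signs[:-1]))

rng = np.random.default_rng(4)
print(runs_statistic(rng.normal(size=50)))                   # symmetric: many runs
print(runs_statistic(rng.exponential(size=50) - np.log(2)))  # skewed: fewer runs
# Small values of the statistic lead to rejection of symmetry.
```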
In the following simulation, SAS version 9.3 {proc kde; method=srot} is used. As in McWilliams,5 the generalized lambda distribution (see Ramberg & Schmeiser34) is used in our simulation, with the parameter sets corresponding to the nine cases of McWilliams.5 To generate the observations we used

$$X = \lambda_1 + \frac{U^{\lambda_3} - (1 - U)^{\lambda_4}}{\lambda_2},$$

where $U$ is a uniform random number. The significance level used in the simulation is $\alpha = 0.05$, with sample sizes $n$ = 30, 50, and 100. To investigate the Type I error, the symmetric distributions used in the simulation are the first case of the generalized lambda and the normal. Our simulation is based on 5000 simulated samples. The 95% confidence interval for the estimated probability of type I error under the null hypothesis with $\alpha = 0.05$ is (0.04396, 0.05604).
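A short sketch of the inversion step above for generating generalized lambda observations; the parameter values shown are a commonly cited symmetric set approximating the standard normal, used here only as an illustration (the nine simulation cases themselves are taken from McWilliams5).

```python
import numpy as np

def gld_sample(n, lam1, lam2, lam3, lam4, rng):
    # Ramberg & Schmeiser generalized lambda via inversion:
    # X = lam1 + (U**lam3 - (1 - U)**lam4) / lam2, with U ~ Uniform(0, 1).
    u = rng.uniform(size=n)
    return lam1 + (u ** lam3 - (1 - u) ** lam4) / lam2

rng = np.random.default_rng(5)
# Illustrative symmetric case (approximately standard normal).
x = gld_sample(30, 0.0, 0.1975, 0.1349, 0.1349, rng)
```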
Table 1.1 shows the estimated probability of type I error. Our test is an asymptotic test, with a slight bias in $\hat{D}(\cdot, \cdot)$ and in the variance estimate for small sample sizes. For sample sizes of more than 30, the test has an estimated probability of type I error close to the nominal value 0.05. However, Bonferroni's test appears to be a conservative procedure, while the Modarres & Gastwirth test is slightly conservative for small sample sizes. Tables 1.2 and 1.3 show that the test based on $\hat{D}(\cdot, \cdot)$ is more powerful than the McWilliams,5 Bonferroni's, Modarres & Gastwirth,32 and Samawi et al.9 tests in all of the presented cases, and its power advantage increases with the sample size.
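For reference, the binomial confidence interval quoted above follows from the normal approximation with 5000 replications:

```python
import numpy as np

# 95% CI for an estimated rejection rate when the true rate is 0.05
# and each entry is based on 5000 simulated samples.
p, reps = 0.05, 5000
half = 1.96 * np.sqrt(p * (1 - p) / reps)
print(p - half, p + half)   # approximately (0.04396, 0.05604)
```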
| Distribution | n | Runs Test | Test Based on the Overlap | Bonferroni's Test | Modarres and Gastwirth (1998) Test | Test Based on Kullback-Leibler Information |
|---|---|---|---|---|---|---|
| Case #1 generalized lambda | 30 | 0.046 | 0.056 | 0.030 | 0.027 | 0.051 |
| | 50 | 0.052 | 0.051 | 0.032 | 0.044 | 0.047 |
| | 100 | 0.058 | 0.052 | 0.027 | 0.046 | 0.051 |
| Normal (0, 1) | 30 | 0.052 | 0.057 | 0.030 | 0.030 | 0.052 |
| | 50 | 0.048 | 0.055 | 0.030 | 0.043 | 0.051 |
| | 100 | 0.051 | 0.052 | 0.032 | 0.048 | 0.052 |

Table 1.1 Probability of type I error under the null hypothesis (α = 0.05)
| Case # | n | Runs Test | Test Based on the Overlap | Bonferroni's Test | Modarres and Gastwirth (1998) Test | Test Based on Kullback-Leibler Information |
|---|---|---|---|---|---|---|
| 2 | 30 | 0.282 | 0.501 | 0.253 | 0.495 | 0.948 |
| | 50 | 0.456 | 0.839 | 0.352 | 0.941 | 0.992 |
| | 100 | 0.781 | 0.999 | 0.500 | 1.000 | 1.000 |
| 3 | 30 | 0.444 | 0.846 | 0.508 | 0.610 | 0.980 |
| | 50 | 0.678 | 0.953 | 0.756 | 0.990 | 0.999 |
| | 100 | 0.913 | 1.000 | 0.966 | 1.000 | 1.000 |
| 4 | 30 | 0.120 | 0.380 | 0.154 | 0.179 | 0.684 |
| | 50 | 0.134 | 0.541 | 0.260 | 0.474 | 0.854 |
| | 100 | 0.245 | 0.761 | 0.488 | 0.845 | 0.946 |
| 5 | 30 | 0.141 | 0.451 | 0.231 | 0.247 | 0.810 |
| | 50 | 0.201 | 0.601 | 0.410 | 0.652 | 0.920 |
| | 100 | 0.336 | 0.839 | 0.741 | 0.954 | 0.980 |

Table 1.2 Power of the Kullback-Leibler information based test compared with other tests under alternative hypotheses (α = 0.05)
| Case # | n | Runs Test | Test Based on the Overlap | Bonferroni's Test | Modarres and Gastwirth (1998) Test | Test Based on Kullback-Leibler Information |
|---|---|---|---|---|---|---|
| 6 | 30 | 0.051 | 0.161 | 0.034 | 0.033 | 0.191 |
| | 50 | 0.055 | 0.174 | 0.040 | 0.055 | 0.225 |
| | 100 | 0.053 | 0.210 | 0.059 | 0.120 | 0.331 |
| 7 | 30 | 0.101 | 0.189 | 0.091 | 0.092 | 0.452 |
| | 50 | 0.111 | 0.241 | 0.155 | 0.210 | 0.611 |
| | 100 | 0.122 | 0.361 | 0.336 | 0.478 | 0.737 |
| 8 | 30 | 0.544 | 0.980 | 0.643 | 0.655 | 0.993 |
| | 50 | 0.752 | 0.999 | 0.888 | 0.992 | 1.000 |
| | 100 | 0.961 | 1.000 | 0.996 | 1.000 | 1.000 |
| 9 | 30 | 0.571 | 1.000 | 0.685 | 0.676 | 0.993 |
| | 50 | 0.810 | 1.000 | 0.916 | 0.995 | 0.999 |
| | 100 | 0.963 | 1.000 | 0.999 | 1.000 | 1.000 |

Table 1.3 Power of the Kullback-Leibler information based test compared with other tests under alternative hypotheses (α = 0.05), continued
Note: The values of skewness ($\sqrt{\beta_1}$) and kurtosis ($\beta_2$) are from McWilliams.5