Brief Report Volume 11 Issue 4
1Department of Biostatistics, East Tennessee State University, USA
2Center for Nursing Research, East Tennessee State University, USA
Correspondence: Shimin Zheng, Department of Biostatistics and Epidemiology, East Tennessee State University, USA
Received: September 04, 2022 | Published: September 20, 2022
Citation: Zheng S, Cao Y. Correlation analysis for different types of variables and relationship between different correlation coefficients. Biom Biostat Int J. 2022;11(4):127-129. DOI: 10.15406/bbij.2022.11.00365
The purpose of this article is to provide a summary of statistical correlation analysis and of the relationships between simple, multiple, and partial correlation coefficients.
Statistical correlation analysis and regression analysis are related but different. Correlation analysis quantifies the strength of the linear relationship between two variables or between two sets of variables, most often continuous ones, whereas regression analysis is used to determine the relationship, in the form of an equation, between two variables or two sets of variables. Unlike regression analysis, correlation analysis does not require us to distinguish cause and effect, or dependent and independent variables.

Most often, the simple correlation coefficient is used. It is also called the Pearson product-moment correlation coefficient.1 It is a measure of the strength and direction of the association between two variables measured on at least an interval scale, and it can range from -1 to 1. However, the maximum (or minimum) values of some simple correlations cannot reach unity (i.e., 1 or -1).
Correlation analysis does not always deal with one-to-one correlation, i.e., the correlation between two variables. It can be a partial correlation (an adjusted one-to-one correlation), or a one-to-many, i.e., multiple, correlation.2 In statistics, the coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables; it is the correlation between the variable's values and the best predictions that can be computed linearly from the predictive variables.
The formula to compute the simple correlation coefficient between x and y is

$$ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2}\,\sqrt{\sum (y-\bar{y})^2}} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} \qquad (1) $$
The t-statistic $\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ (df = n - 2) is used to conduct the hypothesis test $H_0: \rho = 0$ vs $H_a: \rho \neq 0$.
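As a quick numerical illustration (not part of the original derivation), the sketch below evaluates equation (1) and the corresponding t-test in Python; numpy, scipy, and the toy data are our own choices.

```python
import numpy as np
from scipy import stats

# Toy data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 5.2, 4.8, 6.3])
n = len(x)

# Equation (1): simple (Pearson product-moment) correlation coefficient
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2)))

# t-statistic with n - 2 degrees of freedom for H0: rho = 0
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)

print(r, t, p_value)
print(stats.pearsonr(x, y))   # cross-check against scipy's built-in test
```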
The formula to compute the multiple correlation coefficient between y and x1, x2, ..., xk is

$$ r = \sqrt{R^2} = \sqrt{1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2}} = \sqrt{1 - \frac{SSE}{SST}} = \sqrt{\frac{SSR}{SST}} \qquad (2) $$
The F-statistic $\frac{SSR/k}{SSE/(n-k-1)} = \frac{MSR}{MSE} = \frac{(n-k-1)R^2}{k(1-R^2)} \sim F(k, n-k-1)$ is used to conduct the hypothesis test $H_0: \rho^2 = 0$ vs $H_a: \rho^2 \neq 0$.
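The following sketch illustrates equation (2) and the F-test by fitting an ordinary least squares regression with numpy; the simulated data, sample size, and coefficients are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 2
X = rng.normal(size=(n, k))                                  # predictors x1, x2
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.7, size=n)

# OLS fit with an intercept column
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta

SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)
SSR = SST - SSE

R = np.sqrt(1 - SSE / SST)                                   # equation (2)
F = (SSR / k) / (SSE / (n - k - 1))                          # F-statistic
p_value = stats.f.sf(F, k, n - k - 1)
print(R, F, p_value)
```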
The multiple correlation coefficient between y and x1, x2 can be calculated using simple correlation coefficients:
$$ r = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\, r_{yx_1} r_{yx_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}} \qquad (3) $$
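Formula (3) can be checked numerically against the regression-based value from equation (2); the sketch below does so on simulated data (our own setup).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.4 * x1 + rng.normal(size=n)
y = 0.6 * x1 - 0.3 * x2 + rng.normal(size=n)

r_yx1 = np.corrcoef(y, x1)[0, 1]
r_yx2 = np.corrcoef(y, x2)[0, 1]
r_x1x2 = np.corrcoef(x1, x2)[0, 1]

# Formula (3): multiple correlation from the three simple correlations
R_formula = np.sqrt((r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2)
                    / (1 - r_x1x2 ** 2))

# Regression-based value as in equation (2)
Xd = np.column_stack([np.ones(n), x1, x2])
SSE = np.sum((y - Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]) ** 2)
R_reg = np.sqrt(1 - SSE / np.sum((y - y.mean()) ** 2))

print(R_formula, R_reg)   # the two values agree
```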
Generally, the multiple correlation coefficient between y and x1, x2, ..., xk can be calculated using simple correlation coefficients.3,4,5
$$ r = \sqrt{1 - \frac{\det(R)}{R_{11}}} \qquad (4) $$
where
$$ R = \begin{bmatrix} 1 & r_{01} & r_{02} & \cdots & r_{0k} \\ r_{01} & 1 & r_{12} & \cdots & r_{1k} \\ r_{02} & r_{12} & 1 & \cdots & r_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{0k} & r_{1k} & r_{2k} & \cdots & 1 \end{bmatrix} $$
and $R_{11}$ is the cofactor of the $(1,1)$th element of the matrix $R$, $\det(R)$ is the determinant of $R$, $r_{0j}$ is the correlation coefficient between $y$ and $x_j$, $j = 1, 2, \ldots, k$, and $r_{ij}$ is the correlation coefficient between $x_i$ and $x_j$, $i, j = 1, 2, \ldots, k$. Let $(r_{ij})^{-1} = (r^{ij})$; then we have
$$ r = \sqrt{1 - \frac{1}{r^{00}}} \qquad (5) $$
Let $(q_{ij})_{0 \le i,j \le k}$ be the dispersion matrix of $y, x_1, x_2, \ldots, x_k$ and let $(q_{ij})^{-1} = (q^{ij})$; then we have
$$ r = \sqrt{1 - \frac{1}{q_{00}\, q^{00}}} \qquad (6) $$
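Formulas (4)-(6) are easy to verify with a correlation matrix and a dispersion (covariance) matrix built by numpy; the sketch below uses simulated data and our own variable names.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 3
X = rng.normal(size=(n, k))
y = X @ np.array([0.5, -0.4, 0.3]) + rng.normal(size=n)

data = np.column_stack([y, X])          # column order: y, x1, ..., xk
R = np.corrcoef(data, rowvar=False)     # correlation matrix of (y, x1, ..., xk)
Q = np.cov(data, rowvar=False)          # dispersion (covariance) matrix

# Formula (4): R_11 is the cofactor of the (1,1)th element,
# i.e. the determinant of R with its first row and column removed
R11 = np.linalg.det(R[1:, 1:])
r4 = np.sqrt(1 - np.linalg.det(R) / R11)

# Formula (5): via the leading element of the inverse correlation matrix
r5 = np.sqrt(1 - 1 / np.linalg.inv(R)[0, 0])

# Formula (6): via the dispersion matrix and its inverse
r6 = np.sqrt(1 - 1 / (Q[0, 0] * np.linalg.inv(Q)[0, 0]))

print(r4, r5, r6)   # all three coincide with the multiple correlation of y on x1..xk
```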
The multiple correlation coefficient can also be calculated using simple and partial correlation coefficients (Kendall3):
$$ 1 - r_{y\cdot x_1 x_2 \ldots x_k}^2 = \left(1 - r_{yx_1}^2\right)\left(1 - r_{yx_2\cdot x_1}^2\right)\left(1 - r_{yx_3\cdot x_1 x_2}^2\right)\cdots\left(1 - r_{yx_k\cdot x_1 x_2 \ldots x_{k-1}}^2\right) \qquad (7) $$
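Formula (7) can also be confirmed numerically; the sketch below computes the required partial correlations through the residual definition given in equation (8) further below (simulated data, our notation).

```python
import numpy as np

def partial_corr(a, b, Z):
    """Correlation of the residuals of a and b after regressing each on Z."""
    Zd = np.column_stack([np.ones(len(a)), Z])
    e_a = a - Zd @ np.linalg.lstsq(Zd, a, rcond=None)[0]
    e_b = b - Zd @ np.linalg.lstsq(Zd, b, rcond=None)[0]
    return np.corrcoef(e_a, e_b)[0, 1]

rng = np.random.default_rng(3)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.5 * x1 - 0.3 * x2 + 0.2 * x3 + rng.normal(size=n)

# Right-hand side of formula (7)
rhs = ((1 - np.corrcoef(y, x1)[0, 1] ** 2)
       * (1 - partial_corr(y, x2, np.column_stack([x1])) ** 2)
       * (1 - partial_corr(y, x3, np.column_stack([x1, x2])) ** 2))

# Left-hand side: 1 - R^2 from the regression of y on x1, x2, x3
Xd = np.column_stack([np.ones(n), x1, x2, x3])
SSE = np.sum((y - Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]) ** 2)
lhs = SSE / np.sum((y - y.mean()) ** 2)

print(lhs, rhs)   # the two sides match
```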
Formally, the partial correlation between x and y given $z_1, z_2, \ldots, z_n$ is written as $r_{xy\cdot \underline{z}}$, where $\underline{z}$ is an n-dimensional vector, $\underline{z} = \{z_1, z_2, \ldots, z_n\}$. Let N denote the number of observations; then
$$ r_{xy\cdot \underline{z}} = \frac{N\sum_{i=1}^{N} e_{x,i}e_{y,i} - \sum_{i=1}^{N} e_{x,i}\sum_{i=1}^{N} e_{y,i}}{\sqrt{N\sum_{i=1}^{N} e_{x,i}^2 - \left(\sum_{i=1}^{N} e_{x,i}\right)^2}\;\sqrt{N\sum_{i=1}^{N} e_{y,i}^2 - \left(\sum_{i=1}^{N} e_{y,i}\right)^2}} = \frac{N\sum_{i=1}^{N} e_{x,i}e_{y,i}}{\sqrt{N\sum_{i=1}^{N} e_{x,i}^2}\;\sqrt{N\sum_{i=1}^{N} e_{y,i}^2}} \qquad (8) $$
where $e_x$ and $e_y$ are the residuals from the linear regression of x with $\underline{z}$ and of y with $\underline{z}$, respectively.
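In code, equation (8) amounts to correlating the two residual series; a minimal sketch follows (simulated data, our variable names).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
z = rng.normal(size=(n, 2))                         # conditioning variables z1, z2
x = 0.6 * z[:, 0] + rng.normal(size=n)
y = -0.4 * z[:, 0] + 0.5 * z[:, 1] + 0.3 * x + rng.normal(size=n)

# Residuals from regressing x on z and y on z (intercepts included)
Zd = np.column_stack([np.ones(n), z])
e_x = x - Zd @ np.linalg.lstsq(Zd, x, rcond=None)[0]
e_y = y - Zd @ np.linalg.lstsq(Zd, y, rcond=None)[0]

# Equation (8); because the residuals have mean zero it reduces to the last form
r_xy_z = np.sum(e_x * e_y) / np.sqrt(np.sum(e_x ** 2) * np.sum(e_y ** 2))
print(r_xy_z, np.corrcoef(e_x, e_y)[0, 1])          # identical values
```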
Especially, if we have z1 only, the partial correlation between x and y given z1 is
$$ r_{xy\cdot z_1} = \frac{r_{xy} - r_{xz_1} r_{yz_1}}{\sqrt{\left(1 - r_{xz_1}^2\right)\left(1 - r_{yz_1}^2\right)}} \qquad (9) $$
The partial correlation between x and y given z1 and z2 is
$$ r_{xy\cdot z_1 z_2} = \frac{r_{xy\cdot z_1} - r_{xz_2\cdot z_1} r_{yz_2\cdot z_1}}{\sqrt{\left(1 - r_{xz_2\cdot z_1}^2\right)\left(1 - r_{yz_2\cdot z_1}^2\right)}} \qquad (10) $$
Formula (10) can be extended to the more general case: the partial correlation between x and y given z1, z2, ..., zk (Kendall3) is
$$ r_{xy\cdot z_1 z_2 \ldots z_k} = \frac{r_{xy\cdot z_2 z_3 \ldots z_k} - r_{xz_1\cdot z_2 z_3 \ldots z_k}\, r_{yz_1\cdot z_2 z_3 \ldots z_k}}{\sqrt{1 - r_{xz_1\cdot z_2 z_3 \ldots z_k}^2}\,\sqrt{1 - r_{yz_1\cdot z_2 z_3 \ldots z_k}^2}} \qquad (11) $$
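The recursion in (9)-(11) is straightforward to implement from a correlation matrix; the sketch below does so and checks the result against the residual definition in equation (8) (simulated data, hypothetical function names).

```python
import numpy as np

def partial_corr_recursive(C, i, j, given):
    """Partial correlation of variables i and j given the index list `given`,
    computed from the correlation matrix C by the recursion in formula (11)."""
    if not given:
        return C[i, j]
    k, rest = given[0], given[1:]
    r_ij = partial_corr_recursive(C, i, j, rest)
    r_ik = partial_corr_recursive(C, i, k, rest)
    r_jk = partial_corr_recursive(C, j, k, rest)
    return (r_ij - r_ik * r_jk) / np.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def partial_corr_residual(a, b, Z):
    """Residual-based partial correlation as in equation (8)."""
    Zd = np.column_stack([np.ones(len(a)), Z])
    e_a = a - Zd @ np.linalg.lstsq(Zd, a, rcond=None)[0]
    e_b = b - Zd @ np.linalg.lstsq(Zd, b, rcond=None)[0]
    return np.corrcoef(e_a, e_b)[0, 1]

rng = np.random.default_rng(5)
n = 600
z1, z2, z3 = rng.normal(size=(3, n))
x = 0.5 * z1 - 0.2 * z3 + rng.normal(size=n)
y = 0.4 * z1 + 0.3 * z2 + 0.3 * x + rng.normal(size=n)

C = np.corrcoef(np.column_stack([x, y, z1, z2, z3]), rowvar=False)

# r_{xy.z1z2z3} by the recursion (11) and by the residual definition (8)
print(partial_corr_recursive(C, 0, 1, [2, 3, 4]),
      partial_corr_residual(x, y, np.column_stack([z1, z2, z3])))
```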
The partial correlation can also be calculated using multiple correlation. For example, the partial correlation between x and y given z1,z2 is
$$ r_{xy\cdot z_1 z_2} = \sqrt{\frac{r_{x\cdot y z_1 z_2}^2 - r_{x\cdot z_1 z_2}^2}{1 - r_{x\cdot z_1 z_2}^2}} \qquad (12) $$
The partial correlation between x and y given z1, z2 and z3 is
$$ r_{xy\cdot z_1 z_2 z_3} = \sqrt{\frac{r_{x\cdot y z_1 z_2 z_3}^2 - r_{x\cdot z_1 z_2 z_3}^2}{1 - r_{x\cdot z_1 z_2 z_3}^2}} \qquad (13) $$
Generally, the partial correlation between x and y given z1, z2, ..., zk is
$$ r_{xy\cdot z_1 z_2 \ldots z_k} = \sqrt{\frac{r_{x\cdot y z_1 z_2 \ldots z_k}^2 - r_{x\cdot z_1 z_2 \ldots z_k}^2}{1 - r_{x\cdot z_1 z_2 \ldots z_k}^2}} \qquad (14) $$
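As a check of formula (14), the sketch below compares the multiple-correlation expression with the residual-based partial correlation; note that the square root only recovers the magnitude (simulated data, our notation).

```python
import numpy as np

def R2(dep, preds):
    """Squared multiple correlation of `dep` on the columns of `preds` (OLS)."""
    Xd = np.column_stack([np.ones(len(dep)), preds])
    resid = dep - Xd @ np.linalg.lstsq(Xd, dep, rcond=None)[0]
    return 1 - np.sum(resid ** 2) / np.sum((dep - dep.mean()) ** 2)

rng = np.random.default_rng(6)
n = 500
Z = rng.normal(size=(n, 2))                        # z1, z2
x = 0.5 * Z[:, 0] + rng.normal(size=n)
y = 0.4 * Z[:, 1] - 0.3 * x + rng.normal(size=n)

# Formula (14) for k = 2 (i.e. formula (12)); it yields the magnitude only
r_formula = np.sqrt((R2(x, np.column_stack([y, Z])) - R2(x, Z)) / (1 - R2(x, Z)))

# Residual-based partial correlation for comparison (sign retained)
Zd = np.column_stack([np.ones(n), Z])
e_x = x - Zd @ np.linalg.lstsq(Zd, x, rcond=None)[0]
e_y = y - Zd @ np.linalg.lstsq(Zd, y, rcond=None)[0]
print(r_formula, abs(np.corrcoef(e_x, e_y)[0, 1]))   # magnitudes agree
```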
Suppose we have z1 only; the t-statistic $\frac{r_{xy\cdot z_1}\sqrt{n-\upsilon}}{\sqrt{1-r_{xy\cdot z_1}^2}} \sim t(n-\upsilon)$ is used to conduct the hypothesis test $H_0: \rho_{xy\cdot z_1} = 0$ vs $H_a: \rho_{xy\cdot z_1} \neq 0$, where n is the sample size and $\upsilon$ is the total number of variables employed in the analysis; here $\upsilon = 3$ since we have three variables x, y and z1.
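The test is easy to carry out in code; the sketch below obtains the partial correlation from formula (9) and applies the t-statistic above (toy simulated data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 80
z1 = rng.normal(size=n)
x = 0.5 * z1 + rng.normal(size=n)
y = 0.4 * z1 + 0.3 * x + rng.normal(size=n)

# Partial correlation r_{xy.z1} via formula (9)
r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z1)[0, 1]
r_yz = np.corrcoef(y, z1)[0, 1]
r_p = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# t-test with df = n - v, where v = 3 variables (x, y, z1)
v = 3
t = r_p * np.sqrt(n - v) / np.sqrt(1 - r_p ** 2)
p_value = 2 * stats.t.sf(abs(t), df=n - v)
print(r_p, t, p_value)
```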
In addition, correlation analysis can be used to determine the association between one set of many variables and another set of many variables (many-to-many) through canonical correlation analysis (CCA),6 which has variants including deep CCA, sparse CCA, kernel CCA, generalized CCA, regularized CCA, and nonlinear CCA. CCA is a standard tool of multivariate statistical analysis for discovering and quantifying associations between two sets of variables.
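For the many-to-many case, a basic linear CCA is available in scikit-learn; the sketch below is a minimal usage example on simulated data and does not cover the deep, sparse, kernel, generalized, regularized, or nonlinear variants mentioned above.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(8)
n = 300
latent = rng.normal(size=n)                                             # shared signal
X = np.column_stack([latent + rng.normal(size=n) for _ in range(3)])    # first set
Y = np.column_stack([latent + rng.normal(size=n) for _ in range(2)])    # second set

cca = CCA(n_components=2)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# Canonical correlations: correlations between the paired canonical variates
print([np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)])
```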
Correlation analysis is not restricted to continuous variables. It can also be used to determine the association between two categorical variables, or between one continuous variable and another categorical variable. The polychoric correlation is used to measure the association between ordered-category variables under the assumption of an underlying joint continuous distribution.7,8 A categorical variable is often a rough measurement of an underlying continuous variable. For instance, a dichotomous variable (adult or not) is observed as ‘Yes’ when age is 18 years or above, and as ‘No’ when age is below 18 years; the underlying variable, age, is continuous. Hence, it is reasonable to assume that a continuous variable underlies a categorical (dichotomous or polychotomous) observed variable. Therefore, we can estimate the polychoric correlation coefficient via Markov chain Monte Carlo methods, assuming the underlying distribution is multivariate normal. Especially, the polychoric correlation between two observed binary variables is also known as the tetrachoric correlation.9 Suppose we have a 2×2 table of two binary variables with cell counts n11, n12, n21 and n22; then
$$ \text{Tetrachoric correlation} = \cos\left(\frac{\pi}{1+\sqrt{\dfrac{n_{11}\times n_{22}}{n_{12}\times n_{21}}}}\right). $$
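The cosine approximation above is a one-liner given the four cell counts; the 2×2 table below is hypothetical.

```python
import numpy as np

# Hypothetical 2x2 table of counts:
#               second variable = 0   second variable = 1
#   first = 0          n11                   n12
#   first = 1          n21                   n22
n11, n12, n21, n22 = 40, 10, 15, 35

# Cosine approximation to the tetrachoric correlation
r_tet = np.cos(np.pi / (1 + np.sqrt((n11 * n22) / (n12 * n21))))
print(r_tet)
```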
On the other hand, the point biserial correlation is used to determine an association between one continuous variable and another naturally binary variable.10 For example, the correlation between gender and salary is called point biserial correlation. The formula for the point biserial correlation coefficient is
$$ r_{pb} = \frac{Q_1 - Q_0}{s_n}\sqrt{pq} \qquad (15) $$
where Q1 is the mean of the positive or ‘Yes’ group defined by the dichotomous variable, Q0 is the mean of the negative or ‘No’ group defined by the same dichotomous variable, sn is the standard deviation of all observations, p is the ‘Yes’ proportion, and q is the ‘No’ proportion.
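Equation (15) can be computed directly and, because the point biserial correlation is algebraically the Pearson correlation between the 0/1 coded variable and the continuous one, checked against np.corrcoef; the data below are simulated.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 120
group = rng.integers(0, 2, size=n)                     # naturally binary variable
salary = 50 + 8 * group + rng.normal(scale=5, size=n)  # continuous variable

Q1 = salary[group == 1].mean()                         # 'Yes' group mean
Q0 = salary[group == 0].mean()                         # 'No' group mean
p = np.mean(group == 1)
q = 1 - p
s_n = salary.std()                                     # population SD (ddof = 0)

r_pb = (Q1 - Q0) / s_n * np.sqrt(p * q)                # equation (15)
print(r_pb, np.corrcoef(group, salary)[0, 1])          # identical values
```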
Biserial correlation is very close to point biserial correlation, but one of the associated variables is dichotomous ordinal with an underlying continuity.11 For example, depression level can be measured on a continuous scale, such as the PHQ-9 (the nine-item depression scale of the Patient Health Questionnaire) or the Hamilton Rating Scale for Depression, but can be classified dichotomously as high/low. The formula for the biserial correlation coefficient between a dichotomous ordinal variable (W) and a continuous variable (M) is
$$ r_b = \frac{(M_1 - M_0)\times (pq/u)}{\sigma_M} \qquad (16) $$
where M1 is the mean score of M when W = 1, M0 is the mean score of M when W = 0, p is the proportion with W = 1, q is the proportion with W = 0, σM is the population standard deviation of M, and u is the height (ordinate) of the standard normal distribution at the point z that divides it into proportions p and q, i.e., P(z′ < z) = q and P(z′ > z) = p.
If the point biserial correlation is known, the biserial correlation can also be obtained with the following formula (Sheskin12):
$$ r_b = \frac{r_{pb}\sqrt{pq}}{u} \qquad (17) $$
where u is the ordinate of the standard normal density at the threshold z separating the proportions p and q,
$$ u = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \qquad (18) $$
and z satisfies
$$ P(z' < z) = \Phi(z) = q. \qquad (19) $$
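A minimal sketch of this conversion, assuming the relationship in (17) with the ordinate u obtained as in (18)-(19); the numerical inputs are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical point biserial correlation and group proportions
r_pb = 0.45
p, q = 0.6, 0.4

# Equations (18)-(19): threshold z from the 'No' proportion and the normal ordinate
z = stats.norm.ppf(q)
u = stats.norm.pdf(z)

# Equation (17): biserial correlation from the point biserial correlation
r_b = r_pb * np.sqrt(p * q) / u
print(r_b)
```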
We can have a natural extension of the model above if we have more than two ordered rating levels.
When we compute a polyserial correlation coefficient (and its standard error) between a quantitative variable and an ordinal variable, we can assume that the joint distribution of the quantitative variable and a latent continuous variable underlying the ordinal variable is bivariate normal. Either the maximum-likelihood (ML) estimator or a quicker ‘two-step’ approximation can be used. For the ML estimator, estimates of the thresholds and the covariance matrix of the estimates are also available.
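A rough sketch of the two-step idea is shown below, assuming the common ad hoc estimator r·s_y/Σφ(τ̂_j) with thresholds estimated from the ordinal margins; a full ML estimator would instead maximize the bivariate-normal likelihood, and the helper name is ours.

```python
import numpy as np
from scipy import stats

def polyserial_two_step(x, y_ord):
    """Two-step (ad hoc) polyserial correlation between a continuous x and an
    ordinal y coded with consecutive integers; thresholds come from the margins."""
    cats = np.sort(np.unique(y_ord))
    cum_props = np.cumsum([np.mean(y_ord == c) for c in cats])[:-1]
    tau = stats.norm.ppf(cum_props)               # estimated thresholds
    r = np.corrcoef(x, y_ord)[0, 1]               # Pearson r with the coded variable
    return r * y_ord.std() / stats.norm.pdf(tau).sum()

# Simulated example: bivariate-normal latent pair, one margin discretized
rng = np.random.default_rng(10)
n, rho = 2000, 0.6
x = rng.normal(size=n)
latent = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
y_ord = np.digitize(latent, bins=[-0.5, 0.5, 1.2]) + 1   # four ordered categories

print(polyserial_two_step(x, y_ord))              # close to the true rho of 0.6
```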
In this article, we have discussed the Pearson product-moment correlation coefficient and the simple, multiple, and partial correlations, the relationships among them, and the concepts and formulas used to compute each specific coefficient. We have also discussed the multivariate canonical correlation between two sets of many variables. In addition, we have discussed the tetrachoric and polychoric correlations between two observed binary variables or between two ordered-multiple-category variables, the polyserial correlation between a quantitative variable and an ordinal variable, the point biserial correlation between a continuous variable and a naturally binary variable, and the biserial correlation, which is very close to the point biserial correlation except that one of the associated variables is dichotomous ordinal with an underlying continuity. Extending the relationships among the Pearson product-moment, simple, multiple, and partial correlations to other kinds of correlation, such as the polychoric and polyserial correlations, can be a topic for further study.
None.
The authors declare no conflicts of interest.
©2022 Zheng, et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon your work non-commercially.