Brief Report Volume 11 Issue 4
Correlation analysis for different types of variables and relationship between different correlation coefficients
Shimin Zheng,1
Yan Cao2
1Department of Biostatistics, East Tennessee State University, USA
2Center for Nursing Research, East Tennessee State University, USA
Correspondence: Shimin Zheng, Department of Biostatistics and Epidemiology, East Tennessee State University, USA
Received: September 04, 2022 | Published: September 20, 2022
Citation: Zheng S, Cao Y. Correlation analysis for different types of variables and relationship between different correlation coefficients. Biom Biostat Int J. 2022;11(4):127-129. DOI: 10.15406/bbij.2022.11.00365
Introduction
The purpose of this article is to summarize statistical correlation analysis and the relationships among simple, multiple, and partial correlation coefficients.
Statistical correlation analysis and regression analysis are related but different. Correlation analysis quantifies the strength of the linear relationship between two variables or between two sets of variables, most often continuous ones, whereas regression analysis is used to express the relationship between two variables or two sets of variables in the form of an equation. Unlike regression analysis, correlation analysis does not require us to distinguish cause from effect, or dependent from independent variables.
Most often, the simple correlation coefficient is used. It is also called the Pearson product-moment correlation coefficient.1 It is a measure of the strength and direction of the linear association between two variables measured on at least an interval scale. It can range from -1 to 1. However, the maximum (or minimum) values of some simple correlations cannot reach unity (i.e., 1 or -1), for example when the two variables have differently shaped marginal distributions, as with one dichotomous and one continuous variable.
Correlation analysis does not always deal with one-to-one correlation, i.e., the correlation between two variables. It can involve partial correlation (adjusted one-to-one correlation). It can also be one-to-many, or multiple, correlation.2 In statistics, the coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables. It is the correlation between the variable’s values and the best predictions that can be computed linearly from the predictor variables.
Relationship between simple and multiple correlation coefficients
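The formula to compute the simple correlation coefficient between variables $X$ and $Y$ is
$$r_{XY} = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}} \tag{1}$$
where $N$ is the number of observations. The t-statistic $t = r_{XY}\sqrt{N-2}\,/\sqrt{1-r_{XY}^2}$, with $N-2$ degrees of freedom, is used to conduct the hypothesis test $H_0: \rho_{XY} = 0$ vs $H_1: \rho_{XY} \neq 0$.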
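The simple correlation coefficient and its t-statistic can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function names and the toy data are assumptions introduced here, not part of the article.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)
t = t_statistic(r, len(x))
```

The resulting `t` would be compared with a t distribution on `len(x) - 2` degrees of freedom to obtain a p-value.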
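The formula to compute the multiple correlation coefficient between $Y$ and $X_1, X_2, \ldots, X_k$ is
$$R_{Y\cdot X_1\cdots X_k} = \sqrt{\frac{\sum_{i=1}^{N}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2}} \tag{2}$$
where $\hat{y}_i$ is the fitted value from the least-squares regression of $Y$ on $X_1, X_2, \ldots, X_k$ and $N$ is the number of observations. The F-statistic $F = \dfrac{R^2/k}{(1-R^2)/(N-k-1)}$, with $R = R_{Y\cdot X_1\cdots X_k}$, is used to conduct the hypothesis test $H_0: \rho_{Y\cdot X_1\cdots X_k} = 0$ vs $H_1: \rho_{Y\cdot X_1\cdots X_k} \neq 0$.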
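As a hedged illustration of the multiple correlation coefficient and its F-statistic, the sketch below fits an ordinary least-squares regression with NumPy and takes the square root of the ratio of the regression sum of squares to the total sum of squares. The function name `multiple_r_and_f` and the simulated data are assumptions made for this example.

```python
import numpy as np

def multiple_r_and_f(y, X):
    """Return (R, F) for the least-squares regression of y on the columns of X."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])      # add an intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    ss_reg = np.sum((fitted - y.mean()) ** 2)      # regression sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
    r2 = ss_reg / ss_tot
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    return np.sqrt(r2), f_stat

rng = np.random.default_rng(0)                     # hypothetical simulated data
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=50)
R, F = multiple_r_and_f(y, X)
```

Under this construction `R` coincides with the simple correlation between `y` and the fitted values, which matches the description of multiple correlation given in the Introduction.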
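The multiple correlation coefficient between $Y$ and $X_1, X_2$ can be calculated using simple correlation coefficients:
$$R^2_{Y\cdot X_1X_2} = \frac{r_{YX_1}^2 + r_{YX_2}^2 - 2\,r_{YX_1}r_{YX_2}r_{X_1X_2}}{1 - r_{X_1X_2}^2} \tag{3}$$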
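Generally, the multiple correlation coefficient between $Y$ and $X_1, X_2, \ldots, X_k$ can be calculated using simple correlation coefficients,3,4,5
$$R^2_{Y\cdot X_1\cdots X_k} = 1 - \frac{|\mathbf{R}|}{R_{11}}, \tag{4}$$
where $\mathbf{R}$ is the $(k+1)\times(k+1)$ correlation matrix of $(Y, X_1, \ldots, X_k)$ with $Y$ listed first, and $R_{11}$ is the cofactor of the $(1,1)$th element of matrix $\mathbf{R}$, $|\mathbf{R}|$ is the determinant of matrix $\mathbf{R}$, $r_{YX_i}$ is the correlation coefficient between $Y$ and $X_i$, $i = 1, \ldots, k$, and $r_{X_iX_j}$ is the correlation coefficient between $X_i$ and $X_j$, $i, j = 1, \ldots, k$. Let $\mathbf{c} = (r_{YX_1}, r_{YX_2}, \ldots, r_{YX_k})'$ and $\mathbf{R}_{XX} = (r_{X_iX_j})$, then we have
$$R^2_{Y\cdot X_1\cdots X_k} = \mathbf{c}'\,\mathbf{R}_{XX}^{-1}\,\mathbf{c} \tag{5}$$
Let $\boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{YY} & \boldsymbol{\sigma}_{YX}' \\ \boldsymbol{\sigma}_{YX} & \boldsymbol{\Sigma}_{XX} \end{pmatrix}$ be the dispersion matrix of $Y$ and $\mathbf{X} = (X_1, \ldots, X_k)'$, then we have
$$R^2_{Y\cdot X_1\cdots X_k} = \frac{\boldsymbol{\sigma}_{YX}'\,\boldsymbol{\Sigma}_{XX}^{-1}\,\boldsymbol{\sigma}_{YX}}{\sigma_{YY}} \tag{6}$$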
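The matrix expressions for the squared multiple correlation can be checked numerically. The sketch below assumes the two standard identities $R^2 = 1 - |\mathbf{R}|/R_{11}$ (determinant and cofactor form) and $R^2 = \mathbf{c}'\mathbf{R}_{XX}^{-1}\mathbf{c}$ (quadratic form), with the response placed in the first row and column of the correlation matrix; the function name and the simulated data are hypothetical.

```python
import numpy as np

def r2_from_correlations(corr):
    """Squared multiple correlation of variable 0 on the remaining variables.

    `corr` is the full correlation matrix with the response in row/column 0.
    Returns the value from the determinant form and from the quadratic form.
    """
    corr = np.asarray(corr, dtype=float)
    r_xx = corr[1:, 1:]                 # correlations among the predictors
    c = corr[1:, 0]                     # correlations of predictors with response
    cofactor_11 = np.linalg.det(r_xx)   # cofactor of the (1,1) element
    r2_det = 1.0 - np.linalg.det(corr) / cofactor_11
    r2_quad = c @ np.linalg.solve(r_xx, c)
    return r2_det, r2_quad

rng = np.random.default_rng(1)          # hypothetical simulated data
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.8]) + rng.normal(size=200)
corr = np.corrcoef(np.column_stack([y, X]), rowvar=False)
r2_det, r2_quad = r2_from_correlations(corr)
```

Both values should agree with each other and with the $R^2$ of an ordinary least-squares fit of `y` on `X`, up to floating-point error.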
Relationship between simple, multiple and partial correlation coefficients
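The multiple correlation coefficient can also be calculated using simple and partial correlation coefficients (Kendall).3
$$1 - R^2_{Y\cdot X_1\cdots X_k} = \left(1-r^2_{YX_1}\right)\left(1-r^2_{YX_2\cdot X_1}\right)\cdots\left(1-r^2_{YX_k\cdot X_1\cdots X_{k-1}}\right) \tag{7}$$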
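Formally, the partial correlation between $X$ and $Y$ given $\mathbf{Z}$ is written as $r_{XY\cdot\mathbf{Z}}$, where $\mathbf{Z}$ is an n-dimensional vector, $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_n)$. Let $N$ denote the number of observations, then
$$r_{XY\cdot\mathbf{Z}} = \frac{N\sum_{i=1}^{N} e_{X,i}\,e_{Y,i} - \sum_{i=1}^{N} e_{X,i}\sum_{i=1}^{N} e_{Y,i}}{\sqrt{N\sum_{i=1}^{N} e_{X,i}^2 - \left(\sum_{i=1}^{N} e_{X,i}\right)^2}\,\sqrt{N\sum_{i=1}^{N} e_{Y,i}^2 - \left(\sum_{i=1}^{N} e_{Y,i}\right)^2}} \tag{8}$$
where $e_{X,i}$ and $e_{Y,i}$ are the residuals resulting from the linear regression of $X$ with $\mathbf{Z}$ and of $Y$ with $\mathbf{Z}$, respectively.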
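In particular, if we have a single conditioning variable $Z$ only, the partial correlation between $X$ and $Y$ given $Z$ is
$$r_{XY\cdot Z} = \frac{r_{XY} - r_{XZ}\,r_{YZ}}{\sqrt{\left(1-r_{XZ}^2\right)\left(1-r_{YZ}^2\right)}} \tag{9}$$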
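The two routes to a first-order partial correlation, through simple correlations and through regression residuals, can be compared numerically. The following is a minimal sketch assuming the standard first-order formula $(r_{XY} - r_{XZ}r_{YZ})/\sqrt{(1-r_{XZ}^2)(1-r_{YZ}^2)}$; the function names and the simulated data are assumptions made for illustration.

```python
import numpy as np

def partial_corr_formula(x, y, z):
    """First-order partial correlation of x and y given z, from simple correlations."""
    r = np.corrcoef([x, y, z])
    r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

def partial_corr_residuals(x, y, z):
    """Same quantity, as the correlation of residuals after regressing on z."""
    def residuals(v):
        slope, intercept = np.polyfit(z, v, 1)   # simple linear regression on z
        return v - (intercept + slope * z)
    return np.corrcoef(residuals(x), residuals(y))[0, 1]

rng = np.random.default_rng(2)                    # hypothetical simulated data
z = rng.normal(size=100)
x = 0.7 * z + rng.normal(size=100)
y = -0.4 * z + 0.5 * x + rng.normal(size=100)
r_formula = partial_corr_formula(x, y, z)
r_resid = partial_corr_residuals(x, y, z)
```

The two results should match up to floating-point error, which is a convenient sanity check when implementing either form.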
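The partial correlation between $X$ and $Y$ given $Z_1$ and $Z_2$ is
$$r_{XY\cdot Z_1Z_2} = \frac{r_{XY\cdot Z_1} - r_{XZ_2\cdot Z_1}\,r_{YZ_2\cdot Z_1}}{\sqrt{\left(1-r^2_{XZ_2\cdot Z_1}\right)\left(1-r^2_{YZ_2\cdot Z_1}\right)}} \tag{10}$$
Formula (10) can be extended to the more general case: the partial correlation between $X$ and $Y$ given $Z_1, Z_2, \ldots, Z_n$ (Kendall)3 is
$$r_{XY\cdot Z_1\cdots Z_n} = \frac{r_{XY\cdot Z_1\cdots Z_{n-1}} - r_{XZ_n\cdot Z_1\cdots Z_{n-1}}\,r_{YZ_n\cdot Z_1\cdots Z_{n-1}}}{\sqrt{\left(1-r^2_{XZ_n\cdot Z_1\cdots Z_{n-1}}\right)\left(1-r^2_{YZ_n\cdot Z_1\cdots Z_{n-1}}\right)}} \tag{11}$$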
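The recursive structure of higher-order partial correlations lends itself to a short recursive function. The sketch below assumes the standard recursion in which one conditioning variable is removed at a time; the function name, the index convention, and the simulated data are hypothetical. The cross-check uses the inverse-correlation-matrix identity, a standard result that is not taken from this article.

```python
import numpy as np

def partial_corr(corr, i, j, given):
    """Partial correlation of variables i and j given the indices in `given`,
    computed recursively by removing one conditioning variable at a time."""
    if not given:
        return corr[i][j]
    *rest, last = given
    r_ij = partial_corr(corr, i, j, rest)
    r_il = partial_corr(corr, i, last, rest)
    r_jl = partial_corr(corr, j, last, rest)
    return (r_ij - r_il * r_jl) / np.sqrt((1 - r_il**2) * (1 - r_jl**2))

rng = np.random.default_rng(3)                 # hypothetical simulated data
# Columns play the roles of X, Y, Z1, Z2 after a random linear mixing.
data = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
corr = np.corrcoef(data, rowvar=False)
r_xy_z1z2 = partial_corr(corr, 0, 1, [2, 3])

# Cross-check: with P the inverse correlation matrix, the partial correlation
# given all remaining variables is -P_ij / sqrt(P_ii * P_jj).
P = np.linalg.inv(corr)
r_check = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])
```

The recursion makes three recursive calls per level, so for many conditioning variables the matrix-inverse route is usually the more economical choice.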
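The partial correlation can also be calculated using multiple correlation. For example, the partial correlation between $X$ and $Y$ given $Z$ satisfies
$$r^2_{XY\cdot Z} = \frac{R^2_{Y\cdot XZ} - r^2_{YZ}}{1 - r^2_{YZ}} \tag{12}$$
The partial correlation between $X$ and $Y$ given $Z_1$, $Z_2$ and $Z_3$ satisfies
$$r^2_{XY\cdot Z_1Z_2Z_3} = \frac{R^2_{Y\cdot XZ_1Z_2Z_3} - R^2_{Y\cdot Z_1Z_2Z_3}}{1 - R^2_{Y\cdot Z_1Z_2Z_3}} \tag{13}$$
Generally, the partial correlation between $X$ and $Y$ given $Z_1, Z_2, \ldots, Z_n$ satisfies
$$r^2_{XY\cdot Z_1\cdots Z_n} = \frac{R^2_{Y\cdot XZ_1\cdots Z_n} - R^2_{Y\cdot Z_1\cdots Z_n}}{1 - R^2_{Y\cdot Z_1\cdots Z_n}} \tag{14}$$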
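The link between partial and multiple correlation can be illustrated with two regressions. This sketch assumes the standard identity that the squared partial correlation equals the gain in $R^2$ from adding $X$ to a model that already contains the conditioning variables, divided by the unexplained fraction of that reduced model. The function names and the simulated data are assumptions made for the example.

```python
import numpy as np

def r_squared(y, X):
    """R^2 of the least-squares regression of y on the columns of X (with intercept)."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def squared_partial_from_multiple(y, x, Z):
    """Squared partial correlation of x and y given the columns of Z."""
    r2_full = r_squared(y, np.column_stack([x, Z]))   # y on x and Z
    r2_reduced = r_squared(y, Z)                      # y on Z only
    return (r2_full - r2_reduced) / (1.0 - r2_reduced)

rng = np.random.default_rng(4)                        # hypothetical simulated data
Z = rng.normal(size=(150, 3))
x = Z @ np.array([0.4, 0.0, -0.6]) + rng.normal(size=150)
y = 0.5 * x + Z @ np.array([0.3, 0.2, 0.1]) + rng.normal(size=150)
pr2 = squared_partial_from_multiple(y, x, Z)
```

Note that this route yields only the squared partial correlation; the sign has to be taken from the regression coefficient of `x` or from a residual-based calculation.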
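Suppose we have $Z$ only; the t-statistic $t = r_{XY\cdot Z}\sqrt{\dfrac{N-m}{1-r^2_{XY\cdot Z}}}$, with $N-m$ degrees of freedom, is used to conduct the hypothesis test $H_0: \rho_{XY\cdot Z} = 0$ vs $H_1: \rho_{XY\cdot Z} \neq 0$, where $N$ is the sample size and $m$ is the total number of variables employed in the analysis; here $m = 3$ since we have three variables $X$, $Y$ and $Z$.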
Canonical correlation analysis
In addition, correlation analysis can be used to determine the association between one set of variables and another (many-to-many) through canonical correlation analysis (CCA),6 whose variants include deep CCA, sparse CCA, kernel CCA, generalized CCA, regularized CCA, and nonlinear CCA. CCA is a standard tool of multivariate statistical analysis for discovering and quantifying associations between two sets of variables.
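For the classical linear case, canonical correlations can be obtained with a QR decomposition followed by a singular value decomposition. The sketch below is one common numerical approach, not necessarily the one used in the cited reference; it assumes both data matrices have full column rank, and the function name and simulated data are hypothetical.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the column sets X and Y (classical linear CCA).

    Uses the QR/SVD approach: the singular values of Qx' Qy, where Qx and Qy
    are orthonormal bases of the centered data, are the canonical correlations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(Xc)
    qy, _ = np.linalg.qr(Yc)
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

rng = np.random.default_rng(5)                 # hypothetical simulated data
latent = rng.normal(size=200)
X = np.column_stack([latent + rng.normal(size=200), rng.normal(size=200)])
Y = np.column_stack([latent + rng.normal(size=200),
                     rng.normal(size=200),
                     rng.normal(size=200)])
rho = canonical_correlations(X, Y)
```

When one of the two sets contains a single variable, the first canonical correlation reduces to the multiple correlation coefficient, which ties CCA back to the one-to-many case.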
Polychoric and tetrachoric correlation
Correlation analysis is not always used to determine the association between continuous or ordinary variables. It can also be used to determine the association between two categorical variables, or between one continuous variable and one categorical variable. The polychoric correlation measures the association between ordered-category variables under the assumption of an underlying joint continuous distribution.7,8 A categorical variable is often a rough measurement of an underlying continuous variable. For instance, a dichotomous variable (adult or not) is observed as ‘Yes’ when age is 18 years or above, and as ‘No’ when age is below 18 years. The underlying variable is age, which is continuous. Hence, it is reasonable to assume that a continuous variable underlies a categorical (dichotomous or polychotomous) observed variable. The polychoric correlation coefficient can therefore be estimated via Markov chain Monte Carlo methods, assuming the underlying distribution is multivariate normal. In particular, the polychoric correlation between two observed binary variables is also known as the tetrachoric correlation.9 Suppose we have a
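$2 \times 2$ table with two binary variables, $X$ and $Y$, whose cell counts are $a$ and $b$ in the first row and $c$ and $d$ in the second row, so that $a$ and $d$ are the concordant cells; then
Tetrachoric correlation $\approx \cos\left(\dfrac{\pi}{1 + \sqrt{ad/(bc)}}\right)$.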
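A minimal Python sketch of one common closed-form approximation to the tetrachoric correlation, the cosine-pi approximation based on the odds ratio of the $2 \times 2$ table. The cell labels `a`, `b`, `c`, `d` and the function name are assumptions made for illustration; exact estimates require maximum likelihood under the bivariate normal model.

```python
import math

def tetrachoric_approx(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation for the 2x2 table
    [[a, b], [c, d]], where a and d are the concordant cells.
    Assumes b > 0 and c > 0 (no zero discordant cells)."""
    odds_ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(odds_ratio)))

r_independent = tetrachoric_approx(25, 25, 25, 25)   # odds ratio 1
r_positive = tetrachoric_approx(40, 10, 10, 40)      # odds ratio 16
```

An odds ratio of 1 gives an approximation of zero, and larger odds ratios push the value toward 1.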
Point biserial correlation and biserial correlation
On the other hand, the point biserial correlation is used to determine the association between one continuous variable and one naturally binary variable.10 For example, the correlation between gender and salary is a point biserial correlation. The formula for the point biserial correlation coefficient is
$$r_{pb} = \frac{M_1 - M_0}{s}\sqrt{pq} \tag{15}$$
where $M_1$ is the mean of the positive or ‘Yes’ group, defined by the dichotomous variable, $M_0$ is the mean of the negative or ‘No’ group, defined by the same dichotomous variable, $s$ is the standard deviation for all observations, $p$ is the ‘Yes’ proportion and $q$ is the ‘No’ proportion.
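The point biserial coefficient can be computed directly from the two group means, the overall standard deviation, and the group proportions. The sketch below assumes the standard form $(M_1 - M_0)/s \cdot \sqrt{pq}$ with $s$ taken as the population standard deviation (divisor $n$), in which case the result equals the Pearson correlation with the binary variable coded 0/1. The function name and toy data are hypothetical.

```python
import math

def point_biserial(values, groups):
    """Point biserial correlation between a continuous variable and a 0/1 variable.
    Uses the population standard deviation (divisor n) of all values."""
    n = len(values)
    yes = [v for v, g in zip(values, groups) if g == 1]
    no = [v for v, g in zip(values, groups) if g == 0]
    p = len(yes) / n
    q = len(no) / n
    mean_all = sum(values) / n
    sd_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    return (sum(yes) / len(yes) - sum(no) / len(no)) / sd_all * math.sqrt(p * q)

# Hypothetical toy data, for illustration only.
salary = [30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
group = [0, 0, 1, 0, 1, 1]
r_pb = point_biserial(salary, group)
```

If the sample standard deviation (divisor $n - 1$) were used instead, the value would differ from the Pearson correlation by a factor of $\sqrt{(n-1)/n}$.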
Biserial correlation is closely related to point biserial correlation, but the dichotomous variable is ordinal and has an underlying continuity.11 For example, depression level can be measured on a continuous scale, such as PHQ-9, the nine-item depression scale of the patient health questionnaire, or the Hamilton rating scale for depression, but can be classified dichotomously as high/low. The formula for the biserial correlation coefficient between a dichotomous ordinal variable ($W$) and one continuous variable ($M$) is
$$r_b = \frac{\bar{M}_1 - \bar{M}_0}{\sigma_M}\cdot\frac{pq}{h} \tag{16}$$
where $\bar{M}_0$ is the mean score of $M$ when $W = 0$, $\bar{M}_1$ is the mean score of $M$ when $W = 1$, $q$ is the proportion for $W = 0$, $p$ is the proportion for $W = 1$, $\sigma_M$ is the population standard deviation, and $h$ is the height of the standard normal distribution at $z$, where $P(Z' < z) = q$ and $P(Z' > z) = p$.
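If the point biserial correlation $r_{pb}$ is known, the biserial correlation can also be found with the following formula (Sheskin):12
$$r_b = \frac{r_{pb}\sqrt{pq}}{h} \tag{17}$$
where $p$ and $q$ are the proportions of the two groups and $h$ is the height of the standard normal density at the point $z$ that cuts off the lower proportion $q$,
$$h = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2} \tag{18}$$
$$z = \Phi^{-1}(q) \tag{19}$$
with $\Phi$ the standard normal distribution function.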
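The conversion from a point biserial to a biserial correlation can be sketched with the standard library's `statistics.NormalDist`. This assumes the relationship $r_b = r_{pb}\sqrt{pq}/h$, where $h$ is the standard normal density at the cut point that leaves proportion $q$ below it; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def biserial_from_point_biserial(r_pb, p):
    """Convert a point biserial correlation to a biserial correlation.
    `p` is the proportion in the W = 1 group; q = 1 - p is the W = 0 proportion."""
    q = 1.0 - p
    std_normal = NormalDist()           # standard normal, mean 0 and sd 1
    z = std_normal.inv_cdf(q)           # point with P(Z < z) = q
    h = std_normal.pdf(z)               # height of the density at z
    return r_pb * math.sqrt(p * q) / h

r_b = biserial_from_point_biserial(0.40, 0.5)   # hypothetical input values
```

Because $\sqrt{pq}/h$ exceeds 1 for every split, the biserial value is always larger in magnitude than the point biserial value it is derived from.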
The model above extends naturally to the case of more than two ordered rating levels.
When computing a polyserial correlation coefficient (and its standard error) between a quantitative variable and an ordinal variable, we can assume that the joint distribution of the quantitative variable and a latent continuous variable underlying the ordinal variable is bivariate normal. Either the maximum-likelihood (ML) estimator or a quicker ‘two-step’ approximation can be used. For the ML estimator, estimates of the thresholds and the covariance matrix of the estimates are also available.
Conclusion
In this article we have discussed the Pearson product-moment (simple) correlation coefficient, multiple correlation, and partial correlation, the relationships among them, and the concepts and formulas used to compute each coefficient. We have also discussed the multivariate canonical correlation between two sets of variables. In addition, we have discussed the tetrachoric correlation between two observed binary variables and the polychoric correlation between two ordered-category variables, as well as the polyserial correlation between a quantitative variable and an ordinal variable, the point biserial correlation between one continuous variable and one naturally binary variable, and the biserial correlation, which is closely related to the point biserial correlation except that the dichotomous variable is ordinal and has an underlying continuity. Extending the relationships among simple, multiple, and partial correlation coefficients to other kinds of correlation, such as polychoric and polyserial correlation, is a topic for further study.
Acknowledgments
Conflicts of interest
The authors declare no conflicts of interest.
References
©2022 Zheng, et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and build upon your work non-commercially.