Brief Report Volume 11 Issue 4
1Department of Biostatistics, East Tennessee State University, USA
2Center for Nursing Research, East Tennessee State University, USA
Correspondence: Shimin Zheng, Department of Biostatistics and Epidemiology, East Tennessee State University, USA
Received: September 04, 2022 | Published: September 20, 2022
Citation: Zheng S, Cao Y. Correlation analysis for different types of variables and relationship between different correlation coefficients. Biom Biostat Int J. 2022;11(4):127-129. DOI: 10.15406/bbij.2022.11.00365
The purpose of this article is to provide a summary of statistical correlation analysis and of the relationships between simple, multiple, and partial correlation coefficients.
Statistical correlation analysis and regression analysis are related but different. Correlation analysis quantifies the strength of the linear relationship between two variables or between two sets of variables, most often continuous ones, whereas regression analysis is used to determine the relationship, in the form of an equation, between two variables or two sets of variables. Unlike regression analysis, correlation analysis does not require us to distinguish cause and effect, or dependent and independent variables.

Most often, the simple correlation coefficient is used. It is also called the Pearson product-moment correlation coefficient.1 It is a measure of the strength and direction of the association between two variables measured on at least an interval scale, and it can range from -1 to 1. However, the maximum (or minimum) values of some simple correlations cannot reach unity (i.e., 1 or -1).
Correlation analysis does not always deal with one-to-one correlation, i.e., the correlation between two variables. It can be a partial correlation (an adjusted one-to-one correlation), or a one-to-many, i.e., multiple, correlation.2 In statistics, the coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables; it is the correlation between the variable's values and the best predictions that can be computed linearly from the predictive variables.
The formula to compute the simple correlation coefficient between x and y is

$$ r = \frac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2}\,\sqrt{\sum (y-\bar{y})^2}} = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} \qquad (1) $$
The t-statistic $\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ (df = n - 2) is used to conduct the hypothesis test $H_0: \rho = 0$ vs $H_a: \rho \neq 0$.
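As a quick numerical illustration (not part of the original derivation), the sketch below evaluates equation (1) and the corresponding t-test in Python; numpy, scipy, and the toy data are our own choices.

```python
import numpy as np
from scipy import stats

# Toy data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 5.2, 4.8, 6.3])
n = len(x)

# Equation (1): simple (Pearson product-moment) correlation coefficient
r = np.sum((x - x.mean()) * (y - y.mean())) / (
    np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2)))

# t-statistic with n - 2 degrees of freedom for H0: rho = 0
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_value = 2 * stats.t.sf(abs(t), df=n - 2)

print(r, t, p_value)
print(stats.pearsonr(x, y))   # cross-check against scipy's built-in test
```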
The formula to compute the multiple correlation coefficient between y and x1, x2, ..., xk is

$$ r = \sqrt{R^2} = \sqrt{1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2}} = \sqrt{1 - \frac{SSE}{SST}} = \sqrt{\frac{SSR}{SST}} \qquad (2) $$
The F-statistic $\frac{SSR/k}{SSE/(n-k-1)} = \frac{MSR}{MSE} = \frac{(n-k-1)R^2}{k(1-R^2)} \sim F(k, n-k-1)$ is used to conduct the hypothesis test $H_0: \rho^2 = 0$ vs $H_a: \rho^2 \neq 0$.
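The following sketch illustrates equation (2) and the F-test by fitting an ordinary least squares regression with numpy; the simulated data, sample size, and coefficients are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 50, 2
X = rng.normal(size=(n, k))                                  # predictors x1, x2
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.7, size=n)

# OLS fit with an intercept column
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta

SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y.mean()) ** 2)
SSR = SST - SSE

R = np.sqrt(1 - SSE / SST)                                   # equation (2)
F = (SSR / k) / (SSE / (n - k - 1))                          # F-statistic
p_value = stats.f.sf(F, k, n - k - 1)
print(R, F, p_value)
```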
The multiple correlation coefficient between y and x1, x2 can be calculated using simple correlation coefficients:
$$ r = \sqrt{\frac{r_{yx_1}^2 + r_{yx_2}^2 - 2\, r_{yx_1} r_{yx_2} r_{x_1 x_2}}{1 - r_{x_1 x_2}^2}} \qquad (3) $$
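Formula (3) can be checked numerically against the regression-based value from equation (2); the sketch below does so on simulated data (our own setup).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.4 * x1 + rng.normal(size=n)
y = 0.6 * x1 - 0.3 * x2 + rng.normal(size=n)

r_yx1 = np.corrcoef(y, x1)[0, 1]
r_yx2 = np.corrcoef(y, x2)[0, 1]
r_x1x2 = np.corrcoef(x1, x2)[0, 1]

# Formula (3): multiple correlation from the three simple correlations
R_formula = np.sqrt((r_yx1 ** 2 + r_yx2 ** 2 - 2 * r_yx1 * r_yx2 * r_x1x2)
                    / (1 - r_x1x2 ** 2))

# Regression-based value as in equation (2)
Xd = np.column_stack([np.ones(n), x1, x2])
SSE = np.sum((y - Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]) ** 2)
R_reg = np.sqrt(1 - SSE / np.sum((y - y.mean()) ** 2))

print(R_formula, R_reg)   # the two values agree
```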
Generally, the multiple correlation coefficient between y and x1, x2, ..., xk can be calculated using simple correlation coefficients.3,4,5
$$ r = \sqrt{1 - \frac{\det(R)}{R_{11}}} \qquad (4) $$
where
$$ R = \begin{bmatrix} 1 & r_{01} & r_{02} & \cdots & r_{0k} \\ r_{01} & 1 & r_{12} & \cdots & r_{1k} \\ r_{02} & r_{12} & 1 & \cdots & r_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{0k} & r_{1k} & r_{2k} & \cdots & 1 \end{bmatrix} $$
and $R_{11}$ is the cofactor of the $(1,1)$th element of the matrix $R$, $\det(R)$ is the determinant of $R$, $r_{0j}$ is the correlation coefficient between $y$ and $x_j$, $j = 1, 2, \ldots, k$, and $r_{ij}$ is the correlation coefficient between $x_i$ and $x_j$, $i, j = 1, 2, \ldots, k$. Let $(r_{ij})^{-1} = (r^{ij})$; then we have
$$ r = \sqrt{1 - \frac{1}{r^{00}}} \qquad (5) $$
Let $(q_{ij})_{0 \le i,j \le k}$ be the dispersion matrix of $y, x_1, x_2, \ldots, x_k$ and let $(q_{ij})^{-1} = (q^{ij})$; then we have
$$ r = \sqrt{1 - \frac{1}{q_{00}\, q^{00}}} \qquad (6) $$
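Formulas (4)-(6) are easy to verify with a correlation matrix and a dispersion (covariance) matrix built by numpy; the sketch below uses simulated data and our own variable names.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 3
X = rng.normal(size=(n, k))
y = X @ np.array([0.5, -0.4, 0.3]) + rng.normal(size=n)

data = np.column_stack([y, X])          # column order: y, x1, ..., xk
R = np.corrcoef(data, rowvar=False)     # correlation matrix of (y, x1, ..., xk)
Q = np.cov(data, rowvar=False)          # dispersion (covariance) matrix

# Formula (4): R_11 is the cofactor of the (1,1)th element,
# i.e. the determinant of R with its first row and column removed
R11 = np.linalg.det(R[1:, 1:])
r4 = np.sqrt(1 - np.linalg.det(R) / R11)

# Formula (5): via the leading element of the inverse correlation matrix
r5 = np.sqrt(1 - 1 / np.linalg.inv(R)[0, 0])

# Formula (6): via the dispersion matrix and its inverse
r6 = np.sqrt(1 - 1 / (Q[0, 0] * np.linalg.inv(Q)[0, 0]))

print(r4, r5, r6)   # all three coincide with the multiple correlation of y on x1..xk
```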
The multiple correlation coefficient can also be calculated using simple and partial correlation coefficients (Kendall3):
$$ 1 - r_{y\cdot x_1 x_2 \ldots x_k}^2 = \left(1 - r_{yx_1}^2\right)\left(1 - r_{yx_2\cdot x_1}^2\right)\left(1 - r_{yx_3\cdot x_1 x_2}^2\right)\cdots\left(1 - r_{yx_k\cdot x_1 x_2 \ldots x_{k-1}}^2\right) \qquad (7) $$
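Formula (7) can also be confirmed numerically; the sketch below computes the required partial correlations through the residual definition given in equation (8) further below (simulated data, our notation).

```python
import numpy as np

def partial_corr(a, b, Z):
    """Correlation of the residuals of a and b after regressing each on Z."""
    Zd = np.column_stack([np.ones(len(a)), Z])
    e_a = a - Zd @ np.linalg.lstsq(Zd, a, rcond=None)[0]
    e_b = b - Zd @ np.linalg.lstsq(Zd, b, rcond=None)[0]
    return np.corrcoef(e_a, e_b)[0, 1]

rng = np.random.default_rng(3)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.5 * x1 - 0.3 * x2 + 0.2 * x3 + rng.normal(size=n)

# Right-hand side of formula (7)
rhs = ((1 - np.corrcoef(y, x1)[0, 1] ** 2)
       * (1 - partial_corr(y, x2, np.column_stack([x1])) ** 2)
       * (1 - partial_corr(y, x3, np.column_stack([x1, x2])) ** 2))

# Left-hand side: 1 - R^2 from the regression of y on x1, x2, x3
Xd = np.column_stack([np.ones(n), x1, x2, x3])
SSE = np.sum((y - Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]) ** 2)
lhs = SSE / np.sum((y - y.mean()) ** 2)

print(lhs, rhs)   # the two sides match
```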
Formally, the partial correlation between x and y given $z_1, z_2, \ldots, z_n$ is written as $r_{xy\cdot \underline{z}}$, where $\underline{z}$ is an n-dimensional vector, $\underline{z} = \{z_1, z_2, \ldots, z_n\}$. Let N denote the number of observations; then
$$ r_{xy\cdot \underline{z}} = \frac{N\sum_{i=1}^{N} e_{x,i}e_{y,i} - \sum_{i=1}^{N} e_{x,i}\sum_{i=1}^{N} e_{y,i}}{\sqrt{N\sum_{i=1}^{N} e_{x,i}^2 - \left(\sum_{i=1}^{N} e_{x,i}\right)^2}\;\sqrt{N\sum_{i=1}^{N} e_{y,i}^2 - \left(\sum_{i=1}^{N} e_{y,i}\right)^2}} = \frac{N\sum_{i=1}^{N} e_{x,i}e_{y,i}}{\sqrt{N\sum_{i=1}^{N} e_{x,i}^2}\;\sqrt{N\sum_{i=1}^{N} e_{y,i}^2}} \qquad (8) $$
where $e_x$ and $e_y$ are the residuals from the linear regression of x with $\underline{z}$ and of y with $\underline{z}$, respectively.
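In code, equation (8) amounts to correlating the two residual series; a minimal sketch follows (simulated data, our variable names).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
z = rng.normal(size=(n, 2))                         # conditioning variables z1, z2
x = 0.6 * z[:, 0] + rng.normal(size=n)
y = -0.4 * z[:, 0] + 0.5 * z[:, 1] + 0.3 * x + rng.normal(size=n)

# Residuals from regressing x on z and y on z (intercepts included)
Zd = np.column_stack([np.ones(n), z])
e_x = x - Zd @ np.linalg.lstsq(Zd, x, rcond=None)[0]
e_y = y - Zd @ np.linalg.lstsq(Zd, y, rcond=None)[0]

# Equation (8); because the residuals have mean zero it reduces to the last form
r_xy_z = np.sum(e_x * e_y) / np.sqrt(np.sum(e_x ** 2) * np.sum(e_y ** 2))
print(r_xy_z, np.corrcoef(e_x, e_y)[0, 1])          # identical values
```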
Especially, if we have z1 only, the partial correlation between x and y given z1 is
$$ r_{xy\cdot z_1} = \frac{r_{xy} - r_{xz_1} r_{yz_1}}{\sqrt{\left(1 - r_{xz_1}^2\right)\left(1 - r_{yz_1}^2\right)}} \qquad (9) $$
The partial correlation between x and y given z1 and z2 is
$$ r_{xy\cdot z_1 z_2} = \frac{r_{xy\cdot z_1} - r_{xz_2\cdot z_1} r_{yz_2\cdot z_1}}{\sqrt{\left(1 - r_{xz_2\cdot z_1}^2\right)\left(1 - r_{yz_2\cdot z_1}^2\right)}} \qquad (10) $$
Formula (10) can be extended to the more general case: the partial correlation between x and y given z1, z2, ..., zk (Kendall3) is
$$ r_{xy\cdot z_1 z_2 \ldots z_k} = \frac{r_{xy\cdot z_2 z_3 \ldots z_k} - r_{xz_1\cdot z_2 z_3 \ldots z_k}\, r_{yz_1\cdot z_2 z_3 \ldots z_k}}{\sqrt{1 - r_{xz_1\cdot z_2 z_3 \ldots z_k}^2}\,\sqrt{1 - r_{yz_1\cdot z_2 z_3 \ldots z_k}^2}} \qquad (11) $$
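The recursion in (9)-(11) is straightforward to implement from a correlation matrix; the sketch below does so and checks the result against the residual definition in equation (8) (simulated data, hypothetical function names).

```python
import numpy as np

def partial_corr_recursive(C, i, j, given):
    """Partial correlation of variables i and j given the index list `given`,
    computed from the correlation matrix C by the recursion in formula (11)."""
    if not given:
        return C[i, j]
    k, rest = given[0], given[1:]
    r_ij = partial_corr_recursive(C, i, j, rest)
    r_ik = partial_corr_recursive(C, i, k, rest)
    r_jk = partial_corr_recursive(C, j, k, rest)
    return (r_ij - r_ik * r_jk) / np.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def partial_corr_residual(a, b, Z):
    """Residual-based partial correlation as in equation (8)."""
    Zd = np.column_stack([np.ones(len(a)), Z])
    e_a = a - Zd @ np.linalg.lstsq(Zd, a, rcond=None)[0]
    e_b = b - Zd @ np.linalg.lstsq(Zd, b, rcond=None)[0]
    return np.corrcoef(e_a, e_b)[0, 1]

rng = np.random.default_rng(5)
n = 600
z1, z2, z3 = rng.normal(size=(3, n))
x = 0.5 * z1 - 0.2 * z3 + rng.normal(size=n)
y = 0.4 * z1 + 0.3 * z2 + 0.3 * x + rng.normal(size=n)

C = np.corrcoef(np.column_stack([x, y, z1, z2, z3]), rowvar=False)

# r_{xy.z1z2z3} by the recursion (11) and by the residual definition (8)
print(partial_corr_recursive(C, 0, 1, [2, 3, 4]),
      partial_corr_residual(x, y, np.column_stack([z1, z2, z3])))
```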
The partial correlation can also be calculated using multiple correlation. For example, the partial correlation between x and y given z1,z2 is
$$ r_{xy\cdot z_1 z_2} = \sqrt{\frac{r_{x\cdot y z_1 z_2}^2 - r_{x\cdot z_1 z_2}^2}{1 - r_{x\cdot z_1 z_2}^2}} \qquad (12) $$
The partial correlation between x and y given z1, z2 and z3 is
$$ r_{xy\cdot z_1 z_2 z_3} = \sqrt{\frac{r_{x\cdot y z_1 z_2 z_3}^2 - r_{x\cdot z_1 z_2 z_3}^2}{1 - r_{x\cdot z_1 z_2 z_3}^2}} \qquad (13) $$
Generally, the partial correlation between x and y given z1, z2, ..., zk is
$$ r_{xy\cdot z_1 z_2 \ldots z_k} = \sqrt{\frac{r_{x\cdot y z_1 z_2 \ldots z_k}^2 - r_{x\cdot z_1 z_2 \ldots z_k}^2}{1 - r_{x\cdot z_1 z_2 \ldots z_k}^2}} \qquad (14) $$
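As a check of formula (14), the sketch below compares the multiple-correlation expression with the residual-based partial correlation; note that the square root only recovers the magnitude (simulated data, our notation).

```python
import numpy as np

def R2(dep, preds):
    """Squared multiple correlation of `dep` on the columns of `preds` (OLS)."""
    Xd = np.column_stack([np.ones(len(dep)), preds])
    resid = dep - Xd @ np.linalg.lstsq(Xd, dep, rcond=None)[0]
    return 1 - np.sum(resid ** 2) / np.sum((dep - dep.mean()) ** 2)

rng = np.random.default_rng(6)
n = 500
Z = rng.normal(size=(n, 2))                        # z1, z2
x = 0.5 * Z[:, 0] + rng.normal(size=n)
y = 0.4 * Z[:, 1] - 0.3 * x + rng.normal(size=n)

# Formula (14) for k = 2 (i.e. formula (12)); it yields the magnitude only
r_formula = np.sqrt((R2(x, np.column_stack([y, Z])) - R2(x, Z)) / (1 - R2(x, Z)))

# Residual-based partial correlation for comparison (sign retained)
Zd = np.column_stack([np.ones(n), Z])
e_x = x - Zd @ np.linalg.lstsq(Zd, x, rcond=None)[0]
e_y = y - Zd @ np.linalg.lstsq(Zd, y, rcond=None)[0]
print(r_formula, abs(np.corrcoef(e_x, e_y)[0, 1]))   # magnitudes agree
```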
Suppose we have z1 only; the t-statistic $\frac{r_{xy\cdot z_1}\sqrt{n-\upsilon}}{\sqrt{1-r_{xy\cdot z_1}^2}} \sim t(n-\upsilon)$ is used to conduct the hypothesis test $H_0: \rho_{xy\cdot z_1} = 0$ vs $H_a: \rho_{xy\cdot z_1} \neq 0$, where n is the sample size and $\upsilon$ is the total number of variables employed in the analysis; here $\upsilon = 3$ since we have three variables x, y and z1.
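The test is easy to carry out in code; the sketch below obtains the partial correlation from formula (9) and applies the t-statistic above (toy simulated data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 80
z1 = rng.normal(size=n)
x = 0.5 * z1 + rng.normal(size=n)
y = 0.4 * z1 + 0.3 * x + rng.normal(size=n)

# Partial correlation r_{xy.z1} via formula (9)
r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z1)[0, 1]
r_yz = np.corrcoef(y, z1)[0, 1]
r_p = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# t-test with df = n - v, where v = 3 variables (x, y, z1)
v = 3
t = r_p * np.sqrt(n - v) / np.sqrt(1 - r_p ** 2)
p_value = 2 * stats.t.sf(abs(t), df=n - v)
print(r_p, t, p_value)
```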
In addition, correlation analysis can be used to determine the association between one set of many variables and another set of many variables (many-to-many) through canonical correlation analysis (CCA),6 which has variants including deep CCA, sparse CCA, kernel CCA, generalized CCA, regularized CCA, and nonlinear CCA. CCA is a standard tool of multivariate statistical analysis for discovering and quantifying associations between two sets of variables.
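For the many-to-many case, a basic linear CCA is available in scikit-learn; the sketch below is a minimal usage example on simulated data and does not cover the deep, sparse, kernel, generalized, regularized, or nonlinear variants mentioned above.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(8)
n = 300
latent = rng.normal(size=n)                                             # shared signal
X = np.column_stack([latent + rng.normal(size=n) for _ in range(3)])    # first set
Y = np.column_stack([latent + rng.normal(size=n) for _ in range(2)])    # second set

cca = CCA(n_components=2)
cca.fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# Canonical correlations: correlations between the paired canonical variates
print([np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)])
```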
Correlation analysis is not restricted to continuous variables. It can also be used to determine the association between two categorical variables, or between one continuous variable and another categorical variable. The polychoric correlation is used to measure the association between ordered-category variables under the assumption of an underlying joint continuous distribution.7,8 A categorical variable is often a rough measurement of an underlying continuous variable. For instance, a dichotomous variable (adult or not) is observed as ‘Yes’ when age is 18 years or above, and as ‘No’ when age is below 18 years; the underlying variable, age, is continuous. Hence, it is reasonable to assume that a continuous variable underlies a categorical (dichotomous or polychotomous) observed variable. Therefore, we can estimate the polychoric correlation coefficient via Markov chain Monte Carlo methods, assuming the underlying distribution is multivariate normal. Especially, the polychoric correlation between two observed binary variables is also known as the tetrachoric correlation.9 Suppose we have a 2×2 table of two binary variables with cell counts n11, n12, n21 and n22; then
$$ \text{Tetrachoric correlation} = \cos\left(\frac{\pi}{1+\sqrt{\dfrac{n_{11}\times n_{22}}{n_{12}\times n_{21}}}}\right). $$
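The cosine approximation above is a one-liner given the four cell counts; the 2×2 table below is hypothetical.

```python
import numpy as np

# Hypothetical 2x2 table of counts:
#               second variable = 0   second variable = 1
#   first = 0          n11                   n12
#   first = 1          n21                   n22
n11, n12, n21, n22 = 40, 10, 15, 35

# Cosine approximation to the tetrachoric correlation
r_tet = np.cos(np.pi / (1 + np.sqrt((n11 * n22) / (n12 * n21))))
print(r_tet)
```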
On the other hand, the point biserial correlation is used to determine an association between one continuous variable and another naturally binary variable.10 For example, the correlation between gender and salary is called point biserial correlation. The formula for the point biserial correlation coefficient is
$$ r_{pb} = \frac{Q_1 - Q_0}{s_n}\sqrt{pq} \qquad (15) $$
where Q1 is the mean of the positive or ‘Yes’ group defined by the dichotomous variable, Q0 is the mean of the negative or ‘No’ group defined by the same dichotomous variable, sn is the standard deviation of all observations, p is the ‘Yes’ proportion, and q is the ‘No’ proportion.
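Equation (15) can be computed directly and, because the point biserial correlation is algebraically the Pearson correlation between the 0/1 coded variable and the continuous one, checked against np.corrcoef; the data below are simulated.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 120
group = rng.integers(0, 2, size=n)                     # naturally binary variable
salary = 50 + 8 * group + rng.normal(scale=5, size=n)  # continuous variable

Q1 = salary[group == 1].mean()                         # 'Yes' group mean
Q0 = salary[group == 0].mean()                         # 'No' group mean
p = np.mean(group == 1)
q = 1 - p
s_n = salary.std()                                     # population SD (ddof = 0)

r_pb = (Q1 - Q0) / s_n * np.sqrt(p * q)                # equation (15)
print(r_pb, np.corrcoef(group, salary)[0, 1])          # identical values
```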
Biserial correlation is very close to point biserial correlation, but one of the associated variables is dichotomous ordinal with an underlying continuity.11 For example, depression level can be measured on a continuous scale, such as the PHQ-9 (the nine-item depression scale of the Patient Health Questionnaire) or the Hamilton Rating Scale for Depression, but can be classified dichotomously as high/low. The formula for the biserial correlation coefficient between a dichotomous ordinal variable (W) and a continuous variable (M) is
$$ r_b = \frac{(M_1 - M_0)\times (pq/u)}{\sigma_M} \qquad (16) $$
where M1 is the mean score of M when W = 1, M0 is the mean score of M when W = 0, p is the proportion with W = 1, q is the proportion with W = 0, σM is the population standard deviation of M, and u is the height (ordinate) of the standard normal distribution at the point z that divides it into proportions p and q, i.e., P(z′ < z) = q and P(z′ > z) = p.
If the point biserial correlation is known, the biserial correlation can also be obtained with the following formula (Sheskin12):
$$ r_b = \frac{r_{pb}\sqrt{pq}}{u} \qquad (17) $$
where u is the ordinate of the standard normal density at the threshold z separating the proportions p and q,
$$ u = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2} \qquad (18) $$
and z satisfies
$$ P(z' < z) = \Phi(z) = q. \qquad (19) $$
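A minimal sketch of this conversion, assuming the relationship in (17) with the ordinate u obtained as in (18)-(19); the numerical inputs are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical point biserial correlation and group proportions
r_pb = 0.45
p, q = 0.6, 0.4

# Equations (18)-(19): threshold z from the 'No' proportion and the normal ordinate
z = stats.norm.ppf(q)
u = stats.norm.pdf(z)

# Equation (17): biserial correlation from the point biserial correlation
r_b = r_pb * np.sqrt(p * q) / u
print(r_b)
```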
We can have a natural extension of the model above if we have more than two ordered rating levels.
When we compute a polyserial correlation coefficient (and its standard error) between a quantitative variable and an ordinal variable, we can assume that the joint distribution of the quantitative variable and a latent continuous variable underlying the ordinal variable is bivariate normal. Either the maximum-likelihood (ML) estimator or a quicker ‘two-step’ approximation can be used. For the ML estimator, estimates of the thresholds and the covariance matrix of the estimates are also available.
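A rough sketch of the two-step idea is shown below, assuming the common ad hoc estimator r·s_y/Σφ(τ̂_j) with thresholds estimated from the ordinal margins; a full ML estimator would instead maximize the bivariate-normal likelihood, and the helper name is ours.

```python
import numpy as np
from scipy import stats

def polyserial_two_step(x, y_ord):
    """Two-step (ad hoc) polyserial correlation between a continuous x and an
    ordinal y coded with consecutive integers; thresholds come from the margins."""
    cats = np.sort(np.unique(y_ord))
    cum_props = np.cumsum([np.mean(y_ord == c) for c in cats])[:-1]
    tau = stats.norm.ppf(cum_props)               # estimated thresholds
    r = np.corrcoef(x, y_ord)[0, 1]               # Pearson r with the coded variable
    return r * y_ord.std() / stats.norm.pdf(tau).sum()

# Simulated example: bivariate-normal latent pair, one margin discretized
rng = np.random.default_rng(10)
n, rho = 2000, 0.6
x = rng.normal(size=n)
latent = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
y_ord = np.digitize(latent, bins=[-0.5, 0.5, 1.2]) + 1   # four ordered categories

print(polyserial_two_step(x, y_ord))              # close to the true rho of 0.6
```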
In this article, we have discussed the Pearson product-moment correlation coefficient and the simple, multiple, and partial correlations, the relationships among them, and the concepts and formulas used to compute each specific coefficient. We have also discussed the multivariate canonical correlation between two sets of many variables. In addition, we have discussed the tetrachoric and polychoric correlations between two observed binary variables or between two ordered-multiple-category variables, the polyserial correlation between a quantitative variable and an ordinal variable, the point biserial correlation between a continuous variable and a naturally binary variable, and the biserial correlation, which is very close to the point biserial correlation except that one of the associated variables is dichotomous ordinal with an underlying continuity. Extending the relationships among the Pearson product-moment, simple, multiple, and partial correlations to other kinds of correlation, such as the polychoric and polyserial correlations, can be a topic for further study.
None.
The authors declare no conflicts of interest.
©2022 Zheng, et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon your work non-commercially.