Correlation for binary variates

While for continuous variates there exist numerous distinct correlation metrics, such as Pearson correlation $\rho$, Spearman's rho $\rho_S$, and Kendall's tau $\tau$, all of these become equivalent when considering binary variates instead: $\rho(X, Y)=\rho_S(X, Y)=\tau(X, Y)$. The Yule phi coefficient [1] (also known as the mean square contingency coefficient, or as the Matthews correlation coefficient in the ML literature) is a measure of association between two binary variables, and it coincides with Pearson's correlation coefficient in the case of dichotomous variables.

When considering a pair of binary variates $(X,Y)\in \{0,1\}\times\{0,1\}$, the correlation coefficient $\rho$ between the two cannot in general span the full range $[-1,1]$. Instead, denoting by $p_j=\mathbb{P}(j=1)$ the marginal probability of $j\in\{X,Y\}$ and by $q_j = 1- p_j$ its complement, and assuming $0<p_X,p_Y<1$ so that $\rho$ is well defined, correlations are bounded as
\begin{equation}\label{eq:achievable_corrs}
\rho_{\text{min}} \le \rho \le \rho_{\text{max}},
\end{equation}
where
\begin{equation}
\begin{split}
\rho_{\text{min}} &= \max\left(-\sqrt{\frac{p_X p_Y}{ q_X q_Y}}, -\sqrt{\frac{q_X q_Y}{p_Xp_Y}}\right)\\
\rho_{\text{max}} &= \min\left(\sqrt{\frac{p_X q_Y}{ p_Y q_X}}, \sqrt{\frac{p_Y q_X}{p_X q_Y}}\right).
\end{split}
\label{eq:correlation_bounds}
\end{equation}
To see this, let us start by recalling that a Bernoulli random vector $(X, Y)$ takes values in the Cartesian product space $\{0,1\}\times \{0,1\}$, with probability mass function given by
\begin{equation}\label{eq:bernoulli_bivariate}
f(x, y) = p_{11}^{xy}p_{10}^{x(1-y)}p_{01}^{(1-x)y}p_{00}^{(1-x)(1-y)},
\end{equation}
where $p_{ij}=\mathbb{P}(X=i, Y=j)$ and $p_{00}+p_{01}+p_{10}+p_{11}=1$. The marginal probabilities of $X$ and $Y$ are then given by
\begin{equation}
\begin{split}
p_X &= p_{10} + p_{11},\\
p_Y &= p_{01} + p_{11}.
\end{split}
\end{equation}
Since $X$ and $Y$ are Bernoulli variates, $\mathbb{E}(X) = p_X$ and $\mathbb{E}(Y) = p_Y$, with variances $\sigma_X^2=p_Xq_X$ and $\sigma_Y^2=p_Yq_Y$.
Therefore, recalling that
\begin{equation}
\begin{split}
\rho &= \frac{\text{cov}\left(X, Y\right)}{\sigma_X\sigma_Y}\\
&= \frac{\mathbb{E}\left[XY\right]-\mathbb{E}\left[X\right]\mathbb{E}\left[Y\right]}{\sqrt{p_Xq_Xp_Yq_Y}}\\
&= \frac{\mathbb{E}\left[XY\right]-p_Xp_Y}{\sqrt{p_Xq_Xp_Yq_Y}}
\end{split}
\end{equation}
and noticing that
\[
\mathbb{E}[XY] = \sum_{x,y}xy\,p_{xy}=p_{11}
\]
(since all terms where either $x$ or $y$ is zero vanish), one obtains
\begin{equation}
\rho = \frac{p_{11}-p_Xp_Y}{\sqrt{p_Xq_Xp_Yq_Y}}.
\end{equation}
Equivalently, $p_{11}=\rho\sqrt{p_Xq_Xp_Yq_Y}+p_Xp_Y$, so $\rho$ is an increasing linear function of $p_{11}$ once the marginals are fixed. At this point, notice that one must always have $p_{11}\le \min\left(p_X, p_Y\right)$, because the event $\{X=1, Y=1\}$ is contained in both $\{X=1\}$ and $\{Y=1\}$. Hence:
\begin{equation}
\begin{split}
\rho\sqrt{p_Xq_Xp_Yq_Y}+p_Xp_Y &\le \min\left(p_X, p_Y\right)\\
\rho &\le \min\left(\frac{p_X-p_Xp_Y}{\sqrt{p_Xq_Xp_Yq_Y}}, \frac{p_Y-p_Xp_Y}{\sqrt{p_Xq_Xp_Yq_Y}}\right)\\
&= \min\left(\frac{p_Xq_Y}{\sqrt{p_Xq_Xp_Yq_Y}}, \frac{p_Yq_X}{\sqrt{p_Xq_Xp_Yq_Y}}\right)\\
&= \min\left(\sqrt{\frac{p_Xq_Y}{q_Xp_Y}}, \sqrt{\frac{p_Yq_X}{p_Xq_Y}}\right)\\
&=\rho_{\text{max}}.
\end{split}
\end{equation}
This proves the upper bound in \eqref{eq:correlation_bounds}. To prove the lower bound instead, consider that every joint probability must be non-negative: $p_{ij}\ge0$ for all $i$ and $j$. In particular,
\begin{equation}
\begin{split}
p_{00}&=1-p_{01}-p_{10}-p_{11}\\
&=1-p_X-p_Y+p_{11}\ge0\\
\implies p_{11}&\ge p_X+p_Y-1,
\end{split}
\end{equation}
which, together with $p_{11}\ge0$, implies
\begin{equation}
p_{11} \ge \max\left(0, p_X+p_Y-1\right) .
\end{equation}
As before, this results in:
\begin{equation}
\begin{split}
\rho\sqrt{p_Xq_Xp_Yq_Y}+p_Xp_Y &\ge \max\left(0, p_X+p_Y-1\right)\\
\rho &\ge \max\left(-\frac{p_Xp_Y}{\sqrt{p_Xq_Xp_Yq_Y}}, \frac{-p_Xp_Y+p_X+p_Y-1}{\sqrt{p_Xq_Xp_Yq_Y}}\right)\\
&= \max\left(-\sqrt{\frac{p_Xp_Y}{q_Xq_Y}}, -\frac{q_Xq_Y}{\sqrt{p_Xq_Xp_Yq_Y}}\right)\\
&= \max\left(-\sqrt{\frac{p_Xp_Y}{q_Xq_Y}}, -\sqrt{\frac{q_Xq_Y}{p_Xp_Y}}\right)\\
&=\rho_{\text{min}},
\end{split}
\end{equation}
where the third line uses $-p_Xp_Y+p_X+p_Y-1=-(1-p_X)(1-p_Y)=-q_Xq_Y$.

The bounds on the correlation $\rho$ from \eqref{eq:correlation_bounds} are plotted in Figure 1. Notice in particular that $|\rho_{\text{min}}|$ is maximal when $p_X=q_Y$, while $|\rho_{\text{max}}|$ is maximal when $p_X=p_Y$. Conversely, the constraint on how negative correlations can get ($\rho_{\text{min}}$) is more binding when both marginals are small ($p_X\approx p_Y\approx0$) or both are large ($p_X\approx p_Y\approx1$). Likewise, the constraint on how positive correlations can get ($\rho_{\text{max}}$) is more binding when $|p_X-p_Y|\approx1$, that is, when one marginal is large and the other small. Finally, the full range of possible correlations $[-1,1]$ is achievable only for $p_X=p_Y=\frac{1}{2}$.
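
As a quick numerical check of \eqref{eq:correlation_bounds}, consider the illustrative marginals $p_X=0.2$ and $p_Y=0.5$ (values chosen here only as an example, so $q_X=0.8$ and $q_Y=0.5$). The bounds evaluate to
\begin{equation}
\begin{split}
\rho_{\text{min}} &= \max\left(-\sqrt{\frac{0.2\cdot0.5}{0.8\cdot0.5}}, -\sqrt{\frac{0.8\cdot0.5}{0.2\cdot0.5}}\right)=\max\left(-\tfrac{1}{2}, -2\right)=-\tfrac{1}{2},\\
\rho_{\text{max}} &= \min\left(\sqrt{\frac{0.2\cdot0.5}{0.5\cdot0.8}}, \sqrt{\frac{0.5\cdot0.8}{0.2\cdot0.5}}\right)=\min\left(\tfrac{1}{2}, 2\right)=\tfrac{1}{2}.
\end{split}
\end{equation}
The same values follow directly from the admissible range of $p_{11}$: here $\max(0, p_X+p_Y-1)=0\le p_{11}\le\min(p_X,p_Y)=0.2$ and $\sqrt{p_Xq_Xp_Yq_Y}=0.2$, so $\rho=(p_{11}-0.1)/0.2$ runs from $-\tfrac{1}{2}$ at $p_{11}=0$ to $\tfrac{1}{2}$ at $p_{11}=0.2$.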

Fig.2: Correlation bounds.
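
The curves in such a plot can be tabulated numerically. The following is a minimal Python sketch, not the code used to produce the figure: the function names `correlation_bounds`, `rho_from_joint` and `tabulate_bounds` are hypothetical, and the fixed value $p_Y=0.5$ and the grid of $p_X$ values are arbitrary choices made for illustration. It evaluates the closed-form bounds and cross-checks them against $\rho$ computed at the two extreme admissible values of $p_{11}$.

```python
import math


def correlation_bounds(p_x: float, p_y: float) -> tuple[float, float]:
    """Closed-form (rho_min, rho_max) for Bernoulli marginals in (0, 1)."""
    if not (0.0 < p_x < 1.0 and 0.0 < p_y < 1.0):
        raise ValueError("marginals must lie strictly between 0 and 1")
    q_x, q_y = 1.0 - p_x, 1.0 - p_y
    rho_min = max(-math.sqrt(p_x * p_y / (q_x * q_y)),
                  -math.sqrt(q_x * q_y / (p_x * p_y)))
    rho_max = min(math.sqrt(p_x * q_y / (p_y * q_x)),
                  math.sqrt(p_y * q_x / (p_x * q_y)))
    return rho_min, rho_max


def rho_from_joint(p11: float, p_x: float, p_y: float) -> float:
    """Correlation (p11 - p_x p_y) / sqrt(p_x q_x p_y q_y)."""
    q_x, q_y = 1.0 - p_x, 1.0 - p_y
    return (p11 - p_x * p_y) / math.sqrt(p_x * q_x * p_y * q_y)


def tabulate_bounds(p_y: float, p_x_grid: list[float]) -> list[tuple[float, float, float]]:
    """Return (p_x, rho_min, rho_max) rows for a fixed p_y."""
    rows = []
    for p_x in p_x_grid:
        rho_min, rho_max = correlation_bounds(p_x, p_y)
        # Cross-check: rho is linear in p11, so its extremes are attained
        # at the smallest and largest admissible values of p11.
        p11_lo = max(0.0, p_x + p_y - 1.0)
        p11_hi = min(p_x, p_y)
        assert math.isclose(rho_min, rho_from_joint(p11_lo, p_x, p_y), abs_tol=1e-12)
        assert math.isclose(rho_max, rho_from_joint(p11_hi, p_x, p_y), abs_tol=1e-12)
        rows.append((p_x, rho_min, rho_max))
    return rows


if __name__ == "__main__":
    # Illustrative grid; p_y = 0.5 is an arbitrary choice.
    grid = [k / 10 for k in range(1, 10)]
    for p_x, rho_min, rho_max in tabulate_bounds(0.5, grid):
        print(f"p_X={p_x:.1f}  rho_min={rho_min:+.3f}  rho_max={rho_max:+.3f}")
```

For $p_Y=\frac{1}{2}$ the two bounds are symmetric around zero for every $p_X$, and the interval widens to $[-1,1]$ only at $p_X=\frac{1}{2}$, in line with the discussion above.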

References

[1] "On the Methods of Measuring Association Between Two Attributes", G. Udny Yule, 1912
