Simpson's Paradox is a Special Case of Omitted Variable Bias
The goal of this post is to illustrate a point made in a recent tweet by Amit Gandhi: Simpson's Paradox is a special case of omitted variable bias (OVB).
I will use an example from the video How SIMPSON'S PARADOX explains weird COVID19 statistics. (This example is for illustrative purposes only; this post is about statistics, not COVID-19.)
The video compares those diagnosed with COVID-19 in China and Italy between March and May 2020.
Outcome by country: people infected with COVID-19 were more likely to survive in China than Italy.
Outcome by country-age group: at each age bracket, people infected with COVID-19 were more likely to survive in Italy than China.
Let's illustrate this with a simulation in the Julia Language.
The variables are defined in the table below:
\begin{array}{|l|l|l|}
\hline
\text{Outcome: } Y_{i} & \text{Treatment: } X_{i} & \text{Confounder: } Z_{i}\\
\hline
Y_{i} \equiv 0 \text{ if person i survives} & X_{i} \equiv 0 \text{ if person i is in China} & Z_{i} \equiv 0 \text{ if person i's age } \leq 59 \\
Y_{i} \equiv 1 \text{ if person i dies} & X_{i} \equiv 1 \text{ if person i is in Italy} & Z_{i} \equiv 1 \text{ if person i's age } > 59 \\
\hline
\end{array}
Let's assume the true data generating process (DGP) is: $Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \varepsilon_{i}$
Assume $\beta_{0}=10, \beta_{xy} = -5, \beta_{zy} = 10$ (coefficients are in percentage points, so the baseline probability of death for a young patient in China is 10%).
Let's generate some artificial data.
Suppose we have $N=200$ observations (half in China, half in Italy).
Suppose 80% of the patients in China are young ($Z_{i} = 0$) and 20% are old ($Z_{i} = 1$).
Suppose 20% of the patients in Italy are young ($Z_{i} = 0$) and 80% are old ($Z_{i} = 1$).
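using DataFrames, Plots
using Statistics  # provides mean, which is used below
N = 200  # 200 observations = 100 in China + 100 in Italy
β_0 = 10.0; β_Italy = -5.0; β_Age = 10.0;  # DGP coefficients, in percent
# Each block of Y is one country-age cell: ones are deaths, zeros are survivors.
df = DataFrame(
    Y = [ones(8); zeros(80-8);      # China, young:  8 of 80 die
         ones(4); zeros(20-4);      # China, old:    4 of 20 die
         ones(1); zeros(20-1);      # Italy, young:  1 of 20 dies
         ones(12); zeros(80-12)],   # Italy, old:   12 of 80 die
    Intercept = ones(N),
    Italy = [zeros(100); ones(100)],   # 0 = China, 1 = Italy
    Age = [zeros(80); ones(100-80);    # China: 80 young, 20 old
           zeros(20); ones(100-20)],   # Italy: 20 young, 80 old
)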
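The death counts hard-coded in `Y` (8, 4, 1 and 12) are what the DGP implies for each country-age cell. Here is a minimal sketch of that arithmetic, assuming the deaths in each cell equal the cell size times the DGP probability with no sampling noise. The `cells` list and the `cell_prob` helper are illustrative names, not part of the original code.

```julia
β_0, β_Italy, β_Age = 10.0, -5.0, 10.0   # DGP coefficients, in percent

# Hypothetical helper: death probability (in percent) for a country-age cell.
cell_prob(italy, old) = β_0 + β_Italy * italy + β_Age * old

# (label, Italy dummy, Age dummy, cell size) for the four cells.
cells = [
    ("China, young", 0, 0, 80),
    ("China, old",   0, 1, 20),
    ("Italy, young", 1, 0, 20),
    ("Italy, old",   1, 1, 80),
]

# Deaths per cell = cell size × probability (converted from percent).
deaths = [n * cell_prob(italy, old) / 100 for (_, italy, old, n) in cells]

for ((label, _, _, n), d) in zip(cells, deaths)
    println(label, ": ", d, " deaths out of ", n)
end
```

Because the counts match the DGP probabilities exactly, the regressions below recover the coefficients without any estimation error.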
1. Let's estimate the unconditional probability of death from COVID-19 in this data: $$Y_{i} = \beta_{0} + \varepsilon_{i}$$
y = df.Y; X = ones(N);
β = X \ y # 0.125, i.e. 12.5%
mean(y) # 12.5%: regressing on a constant returns the sample mean
\begin{array}{|l|}
\hline
P\left( \text{Death from COVID-19} \right) = 12.5\% \\
\hline
\end{array}
2. Let's estimate the probability of death from COVID-19 conditional only on country:
$$Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \varepsilon_{i}$$
y = df.Y; X = [ df.Intercept df.Italy];
β = X \ y
β[1] # China: 12%
β[1] + β[2] # Italy: 13%
\begin{array}{|l|}
\hline
P\left( \text{Death from COVID-19 } | \text{ China} \right) = 12\% \\
\hline
P\left( \text{Death from COVID-19 } | \text{ Italy} \right) = 13\% \\
\hline
\end{array}
3. Let's estimate the probability of death from COVID-19 conditional on country and age:
$$Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \varepsilon_{i}$$
y = df.Y; X = [ df.Intercept df.Italy df.Age];
β = X \ y
β[1] # China, young: 10%
β[1] + β[2] # Italy, young: 5%
β[1] + β[3] # China, old: 20%
β[1] + β[2] + β[3] # Italy, old: 15%
Summarize $P\left( \text{Death from COVID-19 } | \text{ Country, Age} \right)$ in the following table:
\begin{array}{|l|l|l|}
\hline
& \text{Young } (Z_{i} = 0) & \text{Old } (Z_{i} = 1) \\
\hline
\text{China } (X_{i} = 0) & 10\% & 20\% \\
\hline
\text{Italy } (X_{i} = 1) & 5\% & 15\% \\
\hline
\end{array}
Without conditioning on age, patients in Italy have a 1 percentage point higher probability of death than patients in China (13% vs 12%).
Conditioning on age, patients in Italy have a 5 percentage point lower probability of death than patients in China within both age brackets.
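The same reversal can be read directly off group means, without running a regression. The sketch below rebuilds the example data so that it runs on its own, then uses `groupby` and `combine` from DataFrames.jl. The column name `death_rate` is an illustrative choice.

```julia
using DataFrames, Statistics

# Rebuild the example data (same 200 observations as above).
df = DataFrame(
    Y = [ones(8); zeros(72); ones(4); zeros(16);
         ones(1); zeros(19); ones(12); zeros(68)],
    Italy = [zeros(100); ones(100)],
    Age = [zeros(80); ones(20); zeros(20); ones(80)],
)

# Death rate by country only: rows are China (Italy = 0), then Italy (Italy = 1).
by_country = combine(groupby(df, :Italy; sort = true), :Y => mean => :death_rate)

# Death rate by country and age bracket: one row per cell of the table above.
by_country_age = combine(groupby(df, [:Italy, :Age]; sort = true),
                         :Y => mean => :death_rate)

println(by_country)
println(by_country_age)
```

In the first table Italy has the higher death rate (0.13 vs 0.12). In the second, Italy has the lower death rate within each age bracket (0.05 vs 0.10 for the young, 0.15 vs 0.20 for the old).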
Next we will show how Simpson's paradox is a special case of OVB.
Suppose the true model is: $$Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \nu_{i}$$
Suppose you omit $Z_{i}$ and instead estimate: $$Y_{i} = \beta_{0} + \beta_{xy} X_{i} + u_{i} \Rightarrow u_{i} = \beta_{zy} Z_{i} + \nu_{i} $$
Suppose $X_{i}$ predicts $Z_{i}$: $$Z_{i} = \delta_{0} + \delta_{xz} X_{i} + w_{i} \Rightarrow \delta_{xz} = \frac{\sigma_{xz}}{\sigma_{x}^2} = \rho_{xz}\times \frac{\sigma_{z}}{\sigma_{x}} $$
Denote the OLS estimate (from the equation that omits age) $\hat{\beta}_{xy}$.
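We have: $E[\hat{\beta}_{xy}|X] = \beta_{xy} + \underbrace{\delta_{xz} \beta_{zy}}_{\text{bias}}$ [1]
$\text{Bias} = \delta_{xz} \beta_{zy} = \rho_{xz}\times \frac{\sigma_{z}}{\sigma_{x}} \times \beta_{zy}$, where $\beta_{zy}$ is the coefficient on $Z_{i}$ holding $X_{i}$ fixed: $\beta_{zy} = \frac{\sigma_{y}}{\sigma_{z}} \times \frac{\rho_{zy} - \rho_{xy}\rho_{xz}}{1 - \rho_{xz}^2}$. It reduces to the simple-regression slope $\rho_{zy}\times \frac{\sigma_{y}}{\sigma_{z}}$ only when $\rho_{xz} = 0$, in which case there is no bias to begin with.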
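The bias is the product of (1) the association between the treatment and the omitted variable (OV), $\delta_{xz}$, and (2) the impact of the OV on the outcome, $\beta_{zy}$.
The estimate will be unbiased if either (1) the treatment is uncorrelated with the OV ($\delta_{xz} = 0$), or (2) the OV has no effect on the outcome once the treatment is controlled for ($\beta_{zy} = 0$).
Simpson's paradox is the special case in which the bias is large enough to flip the sign of the estimate, so that $\hat{\beta}_{xy}$ and $\beta_{xy}$ have opposite signs.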
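As a numerical check on the example data, the sketch below computes $\delta_{xz}$ from the covariance formula and confirms that the coefficient from the regression that omits age equals the coefficient from the full regression plus the bias term. The variable names (`β_long`, `β_short`, `δ_xz`, `bias`) are illustrative, and the values are on the 0 to 1 probability scale rather than in percent.

```julia
using Statistics, LinearAlgebra

# Same 200 observations as in the DataFrame, as plain vectors.
y = [ones(8); zeros(72); ones(4); zeros(16);
     ones(1); zeros(19); ones(12); zeros(68)]       # 1 = dies
x = [zeros(100); ones(100)]                          # 1 = Italy
z = [zeros(80); ones(20); zeros(20); ones(80)]       # 1 = age > 59

β_long  = [ones(200) x z] \ y    # regression that includes age
β_short = [ones(200) x] \ y      # regression that omits age

δ_xz = cov(x, z) / var(x)        # slope from regressing z on x
bias = δ_xz * β_long[3]          # δ_xz × β_zy

println("β_xy with age included: ", round(β_long[2], digits = 4))
println("δ_xz:                   ", round(δ_xz, digits = 4))
println("bias:                   ", round(bias, digits = 4))
println("β_xy with age omitted:  ", round(β_short[2], digits = 4))
```

Here $\delta_{xz} = 0.6$ and $\beta_{zy} = 0.10$, so the bias is $+0.06$. Added to the true $\beta_{xy} = -0.05$, it gives the $+0.01$ estimated when age is omitted.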
Alternatively we can think about this through the Law of Iterated Expectations:
\begin{align*}
E[Y_i|\text{ China }] =& E[Y_i|\text{ China, Young}] \times P\left(\text{Young } | \text{ China}\right) + E[Y_i|\text{ China, Old}] \times P\left(\text{Old } | \text{ China}\right) \\
12\% =& 10\% \times 80\% + 20\% \times 20\% \\
=& 8\% + 4\%
\end{align*}
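The same calculation for Italy, and the difference between the two countries, can be written out as follows. This is a worked sketch that uses only the numbers from the tables above. It relies on the additive DGP, so that $E[Y_i | X_i, Z_i] = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i}$.

\begin{align*}
E[Y_i|\text{ Italy }] =& 5\% \times 20\% + 15\% \times 80\% = 1\% + 12\% = 13\% \\
E[Y_i|\text{ Italy }] - E[Y_i|\text{ China }] =& \beta_{xy} + \beta_{zy} \times \left[ P\left(\text{Old } | \text{ Italy}\right) - P\left(\text{Old } | \text{ China}\right) \right] \\
=& -5\% + 10\% \times (80\% - 20\%) \\
=& -5\% + 6\% = 1\%
\end{align*}

Because $X_{i}$ is binary, the difference in the share of old patients, $80\% - 20\% = 0.6$, is exactly $\delta_{xz}$, so the second line is the OVB formula.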
[1] To derive the bias, slightly abuse notation by stacking a column of ones and $X_{i}$ into a matrix $X$, and stack $\beta_{0}$ and $\beta_{xy}$ into $\beta$. Assume $E[\nu|X] = 0$ and $E[Z|X] = X\delta$, where $\delta$ stacks an intercept $\delta_{0}$ and the slope $\delta_{xz}$:
\begin{align*}
\hat{\beta} =& (X'X)^{-1} X'Y \\
=& (X'X)^{-1} X'(X\beta + Z\beta_{zy} + \nu) \\
=& \beta + (X'X)^{-1} X'Z \beta_{zy} + (X'X)^{-1} X'\nu \\
E[\hat{\beta}|X ] =& \beta + (X'X)^{-1} X'E[Z|X] \beta_{zy} \\
=& \beta + \delta \beta_{zy}
\end{align*}
The second element of this vector is $E[\hat{\beta}_{xy}|X] = \beta_{xy} + \delta_{xz} \beta_{zy}$.
Let's start with a definition:
Case Fatality Rate (CFR): the proportion of people who die from a specified disease among all individuals diagnosed with the disease over a certain period of time.
The non-causal interpretation of $\beta_{xy}$ is:
Assumption:
The causal interpretation of $\beta_{xy}$ is:
Assumption:
AZ: even w/o causal assumption, still useful for non-causal prediction
$X \sim N\left(\mu, \sigma^2 \right)$