Simpson’s Paradox is a Special Case of Omitted Variable Bias



The goal of this post is to illustrate a point made in a recent tweet by Amit Ghandi that Simpson’s Paradox is a special case of omitted variable bias.

Let’s start with some definitions:
Simpson’s Paradox: a statistical phenomenon where an association between two variables in a population emerges, disappears or reverses when the population is divided into subpopulations.
Omitted Variable Bias (OVB): the bias that arises when a statistical model leaves out one or more variables that are correlated with both the treatment and the outcome.
Case Fatality Rate (CFR): the proportion of people who die from a specified disease among all individuals diagnosed with the disease over a certain period of time.

Example: COVID-19 in China versus Italy

Let’s use an example from the video How Simpson’s paradox explains weird COVID19 statistics. (This example is for illustrative purposes only; this post is about interpreting statistics, not COVID-19.) The video compares those diagnosed with COVID-19 in China and Italy between March and May 2020.
CFR by country: people infected with COVID-19 were more likely to die in Italy than China.
CFR by country-age group: at each age bracket, people infected with COVID-19 were more likely to die in China than Italy.

Simulate Data

Let’s illustrate this with a simulation in the Julia Language.
The variables are defined in the table below:

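\(\begin{array}{lll} \text{Outcome: } Y_i & \text{Treatment: } X_i & \text{Confounder: } Z_i \\ \hline Y_i = 0 \text{ if person } i \text{ survives} & X_i = 0 \text{ if person } i \text{ is in China} & Z_i = 0 \text{ if person } i \text{'s age} \leq 59 \\ Y_i = 1 \text{ if person } i \text{ dies} & X_i = 1 \text{ if person } i \text{ is in Italy} & Z_i = 1 \text{ if person } i \text{'s age} > 59 \end{array}\)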

Let’s assume the true data generating process (DGP) is: $Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \varepsilon_{i}$
Under the true DGP, $\text{CFR}\left(X_i, Z_i \right) = P\left(Y_i =1 | X_i, Z_i \right) =E\left[Y_i | X_i, Z_i \right]$, because $Y_i$ is binary.

Assume $\beta_{0}=10, \beta_{xy} = -5, \beta_{zy} = 10$ (coefficients are in %).
China-Young: $\text{CFR}\left(0, 0\right) = P\left(Y_i =1 | X_i=0, Z_i=0 \right) = \beta_{0} = 10\%$
China-Old: $\text{CFR}\left(0, 1\right) = P\left(Y_i =1 | X_i=0, Z_i=1 \right) = \beta_{0} + \beta_{zy} = 20\%$
Italy-Young: $\text{CFR}\left(1, 0\right) = P\left(Y_i =1 | X_i=1, Z_i=0 \right) = \beta_{0} + \beta_{xy} = 5\%$
Italy-Old: $\text{CFR}\left(1, 1\right) = P\left(Y_i =1 | X_i=1, Z_i=1 \right) = \beta_{0} + \beta_{xy} + \beta_{zy} = 15\%$

Let’s generate artificial data consistent with the DGP.
Suppose we have N=200 observations (half China, half Italy).
Suppose 80% of China’s population is young $Z_{i} =0$ and 20% is old $Z_{i} = 1$.
Suppose 20% of Italy’s population is young $Z_{i} =0$ and 80% is old $Z_{i} = 1$.

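  using DataFrames, Plots, Statistics
  N = 200; # 200 obs = 100 in China + 100 in Italy.
  β_0 = 10.0; β_Italy = -5.0; β_Age = 10.0; # true coefficients, in %
  # Y is coded 0/1, so the estimates below are fractions (e.g. 0.125 = 12.5%).
  df = DataFrame(
      Y         = [
                   ones(8);  zeros(80-8);   # China-Young: 8/80 die
                   ones(4);  zeros(20-4);   # China-Old:   4/20 die
                   ones(1);  zeros(20-1);   # Italy-Young: 1/20 die
                   ones(12); zeros(80-12)   # Italy-Old:  12/80 die
                  ],
      Intercept = ones(N),
      Italy     = [zeros(100); ones(100)],
      # China: 80 young then 20 old; Italy: 20 young then 80 old (same order as Y).
      Age       = [zeros(80); ones(100-80); zeros(20); ones(100-20)],
  )
  y = df.Y;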
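Before running any regressions, it may help to see where the country-level CFRs come from. The sketch below is only an illustration: the dictionary names `cfr` and `share` and the helper `aggregate_cfr` are not from the original post. It computes each country's aggregate CFR as the age-share-weighted average of the age-specific CFRs assumed above.

```julia
# Age-specific CFRs implied by the DGP (in %), keyed by (country, age group).
cfr = Dict(
    ("China", "Young") => 10.0, ("China", "Old") => 20.0,
    ("Italy", "Young") => 5.0,  ("Italy", "Old") => 15.0,
)

# Age shares assumed in the post: China is mostly young, Italy mostly old.
share = Dict(
    ("China", "Young") => 0.8, ("China", "Old") => 0.2,
    ("Italy", "Young") => 0.2, ("Italy", "Old") => 0.8,
)

# Aggregate CFR = sum over age groups of (age share) * (age-specific CFR).
aggregate_cfr(country) =
    sum(share[(country, age)] * cfr[(country, age)] for age in ("Young", "Old"))

cfr_china = aggregate_cfr("China")   # 0.8 * 10 + 0.2 * 20
cfr_italy = aggregate_cfr("Italy")   # 0.2 * 5  + 0.8 * 15

println("China: ", cfr_china, "%  Italy: ", cfr_italy, "%")
```

Italy's CFR is lower in every age bracket, yet its aggregate is higher because most of its weight falls on the high-CFR old group.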

Estimate CFR conditional on: nothing/country/country & age

1) Let’s estimate the unconditional probability of death from COVID-19 in this data:
\(Y_{i} = \beta_{0} + \varepsilon_{i}\)

X = hcat(df.Intercept);
β = X \ y   # 12.5%
mean(y)     # 12.5% 
\(P\left( \text{Death from COVID-19} \right) = 12.5\%\)

2) Let’s estimate the probability of death from COVID-19 conditional only on country:
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \varepsilon_{i}\)

X = hcat(df.Intercept, df.Italy);
β = X \ y   
β[1]         # 12% = CFR in China
β[1] + β[2]  # 13% = CFR in Italy
\(P\left( \text{Death from COVID-19 } | \text{ China} \right) = 12\%\)
\(P\left( \text{Death from COVID-19 } | \text{ Italy} \right) = 13\%\)

3) Let’s estimate the probability of death from COVID-19 conditional on country and age:
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \varepsilon_{i}\)

X = hcat(df.Intercept, df.Italy, df.Age);
β = X \ y   
β[1]                # 10% = CFR for China-Young
β[1] + β[3]         # 20% = CFR for China-Old
β[1] + β[2]         #  5% = CFR for Italy-Young
β[1] + β[2] + β[3]  # 15% = CFR for Italy-Old

Summarize $P\left( \text{Death from COVID-19 } | \text{ Country, Age} \right)$ in the following table:

\(\begin{array}{lcc} & \text{Young } (Z_i = 0) & \text{Old } (Z_i = 1) \\ \hline \text{China } (X_i = 0) & 10\% & 20\% \\ \text{Italy } (X_i = 1) & 5\% & 15\% \end{array}\)

Without conditioning on age, patients in Italy have a 1 percentage point higher probability of death than patients in China (13% vs 12%).
Conditioning on age, patients in Italy have a 5 percentage point lower probability of death than patients in China within both age brackets.
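The regression estimates above can be cross-checked with plain group means, since regressing a binary outcome on group dummies reproduces group averages. The following is a minimal sketch that rebuilds the same 200 simulated patients without DataFrames; the helper `cfr` and the variable names are illustrative, not from the original post.

```julia
using Statistics

# Same cell counts as the simulated data: deaths first, then survivors.
y     = [ones(8); zeros(72); ones(4); zeros(16); ones(1); zeros(19); ones(12); zeros(68)]
italy = [zeros(100); ones(100)]
age   = [zeros(80); ones(20); zeros(20); ones(80)]

# CFR (in %) among the patients selected by a Boolean mask.
cfr(mask) = 100 * mean(y[mask])

# Aggregate comparison: Italy looks worse.
cfr_china = cfr(italy .== 0)
cfr_italy = cfr(italy .== 1)
println("All ages: China ", cfr_china, "%  Italy ", cfr_italy, "%")

# Within each age bracket: Italy looks better.
for (label, z) in (("Young", 0), ("Old", 1))
    china_z = cfr((italy .== 0) .& (age .== z))
    italy_z = cfr((italy .== 1) .& (age .== z))
    println(label, ": China ", china_z, "%  Italy ", italy_z, "%")
end
```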


Next we will show how Simpson’s paradox is a special case of OVB.
Suppose the true model is: \(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \nu_{i}\)
Suppose you omit $Z_{i}$ and instead estimate: \(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + u_{i} \Rightarrow u_{i} = \beta_{zy} Z_{i} + \nu_{i}\)
Suppose $X_{i}$ predicts $Z_{i}$: \(Z_{i} = \delta_{xz} X_{i} + w_{i} \Rightarrow \delta_{xz} = \frac{\sigma_{xz}}{\sigma_{x}^2} = \rho_{xz}\times \frac{\sigma_{z}}{\sigma_{x}}\)
Denote the OLS estimate (from the equation that omits age) $\hat{\beta}_{xy}$.
We have: $E\left[ \hat{\beta}_{xy} | X_{i} \right] = \beta_{xy} + \underbrace{\delta_{xz} \beta_{zy}}_{\text{Bias}}$ (see the derivation in footnote 1).
$ \text{Bias} = \delta_{xz} \beta_{zy} = \rho_{xz}\times \frac{\sigma_{z}}{\sigma_{x}} \times \beta_{zy} $
Note that $\beta_{zy}$ is the partial effect of $Z_{i}$ holding $X_{i}$ fixed, so in general it differs from the simple regression slope $\rho_{zy}\times \frac{\sigma_{y}}{\sigma_{z}}$ (in the simulated data, 10% versus 7%).
The bias is the product of (1) the association between the treatment and the omitted variable (OV), $\delta_{xz}$, and (2) the impact of the OV on the outcome, $\beta_{zy}$.
The estimate will be unbiased if either (1) the treatment is uncorrelated with the OV ($\delta_{xz} = 0$), or (2) the OV has no effect on the outcome once the treatment is controlled for ($\beta_{zy} = 0$).

Simpson’s reversal occurs when the sign of the estimated coefficient switches after including the confounder (when the bias is big enough in the opposite direction of the true effect): $\text{sign}\left( \hat{\beta}_{xy} \right) \neq \text{sign}\left( \beta_{xy} \right)$ $\Leftrightarrow$ $\text{sign}\left( \beta_{xy} + \delta_{xz} \beta_{zy} \right) \neq \text{sign}\left( \beta_{xy} \right)$.

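In our case, the true effect is $\beta_{xy} = -5\%$ and the bias is $\delta_{xz} \beta_{zy}=6\%$, where $\delta_{xz} = 80\% - 20\% = 60\%$ is the difference in the share of old patients between Italy and China. The OVB is big enough to cause a reversal: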
\(\begin{align*} \hat{\beta}_{xy} &= 1\% && \text{Non-causal effect, estimated when excluding } Z \\ \beta_{xy} &= -5\% && \text{Causal effect, estimated when including } Z \\ \delta_{xz} \beta_{zy} &= 60\% \times 10\% = 6\% && \text{Bias} \\ \beta_{xy} + \delta_{xz} \beta_{zy} &= -5\% + 60\% \times 10\% = 1\% \end{align*}\)
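The arithmetic above can be checked numerically on the simulated data. This is a rough sketch under the post's setup: it rebuilds the 200 observations as plain vectors, and the names `δ_xz`, `β_long`, `β_short`, `bias` and `reversal` are chosen here for illustration. It compares the coefficient from the short regression (age omitted) with the long-regression coefficient plus the bias term.

```julia
using Statistics

# Same cell counts as the simulated data: deaths first, then survivors.
y = [ones(8); zeros(72); ones(4); zeros(16); ones(1); zeros(19); ones(12); zeros(68)]
x = [zeros(100); ones(100)]                    # 1 = Italy
z = [zeros(80); ones(20); zeros(20); ones(80)] # 1 = old

# Slope from regressing the omitted variable (age) on the treatment (country).
δ_xz = cov(x, z) / var(x)

β_long  = [ones(200) x z] \ y   # includes age
β_short = [ones(200) x] \ y     # omits age

# Omitted variable bias on the country coefficient.
bias = δ_xz * β_long[3]

# Simpson's reversal: the country coefficient changes sign once age is included.
reversal = sign(β_short[2]) != sign(β_long[2])

println("δ_xz = ", δ_xz, ", bias = ", bias)
println("short = ", β_short[2], ", long + bias = ", β_long[2] + bias)
println("reversal: ", reversal)
```

Because Y is coded 0/1, the printed values are fractions rather than percentages.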

Levels of Interpretation

Suppose we estimate: \(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \varepsilon_{i}\)

Non-causal interpretation $\hat{\beta}_{xy} = 1\%$: the probability of a diagnosed patient dying from COVID-19 is 1 percentage point higher in Italy than in China.
Assumption 1: the Chinese and Italian data was correctly measured and reported.
Note: the assumption required for the non-causal interpretation is relatively mild.

Causal interpretation $\hat{\beta}_{xy} = 1\%$: if we intervene and move a diagnosed patient from China to Italy, the probability of the patient dying from COVID-19 will be 1 percentage point higher.
Assumption 1: the Chinese and Italian data was correctly measured and reported.
Assumption 2: the “treatment” (China vs Italy) is uncorrelated with unobserved determinants of survival. This is the famous conditional mean independence (CMI) assumption: $E\left[ \varepsilon | X \right] = 0$.
Note: the identifying assumption (CMI) required for a causal interpretation is very strong. Treatments are often correlated with variables that are also correlated with the outcome. In this case, the confounder is age: Italy’s population is older than China’s.

Each reader can decide how convincing they find the identifying assumption, and thus how to interpret an estimate. Importantly, non-causal estimates are often still very useful in contexts where our goal is to make predictions.

Additional Practice

The true DGP above assumed the treatment effect was the same across age bins: \(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \nu_{i}\)
Thus the CFR was 5 percentage points lower in Italy for both young and old patients ($\beta_{xy} = -5\%$).
Suppose there was treatment effect heterogeneity and the true DGP was:
\(Y_{i} = \beta_{0} + \beta_{xy} X_{i} + \beta_{zy} Z_{i} + \beta_{xzy} X_{i} Z_{i} + \nu_{i}\)
In this case, estimating the model omitting the interaction effect (omitted non-linearity) would also cause OVB.
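As a starting point for this exercise, here is a small sketch with made-up numbers. It assumes a hypothetical Italy-Old death count of 20/80 (25%) instead of 12/80, which implies $\beta_{xzy} = +10\%$; this value is not from the original post. The additive model omits the interaction, while the saturated model includes it.

```julia
# Hypothetical data: identical to the simulation except Italy-Old, where 20/80 die.
y = [ones(8); zeros(72); ones(4); zeros(16); ones(1); zeros(19); ones(20); zeros(60)]
x = [zeros(100); ones(100)]                    # 1 = Italy
z = [zeros(80); ones(20); zeros(20); ones(80)] # 1 = old
xz = x .* z                                    # interaction term
c = ones(200)                                  # intercept

β_additive  = [c x z] \ y      # omits the interaction
β_saturated = [c x z xz] \ y   # includes the interaction

println("additive:  ", round.(β_additive; digits=4))
println("saturated: ", round.(β_saturated; digits=4))
```

In this hypothetical dataset the saturated model recovers an effect of -5% for young patients and +5% for old patients, while the additive model reports a country coefficient of roughly zero, masking both.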

  1. To derive the bias, slightly abuse notation by stacking a column of ones and $X_{i}$ into a matrix “X”, and stacking $\beta_{0}$ and $\beta_{xy}$ into $\beta$:
    \(\begin{align*} \hat{\beta} &= (X'X)^{-1} X'Y \\ &= (X'X)^{-1} X'(X\beta + Z\beta_{zy} + \nu) \\ &= \beta + (X'X)^{-1} X'Z \beta_{zy} + (X'X)^{-1} X'\nu \\ \delta_{xz} &\equiv (X'X)^{-1} X'Z \\ \hat{\beta} &= \beta + \delta_{xz} \beta_{zy} + (X'X)^{-1} X'\nu \\ E\left( \hat{\beta} | X \right) &= \beta + \delta_{xz} \beta_{zy} \end{align*}\)
    The last line treats $\delta_{xz}$ as fixed and uses $E\left[ \nu | X, Z \right] = 0$, so the term $(X'X)^{-1} X'\nu$ has mean zero.
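The matrix form of the derivation also holds exactly in sample when the true coefficients are replaced by the long-regression estimates. The sketch below checks this on the simulated data from the post; the variable names are illustrative. Because "X" includes the column of ones, `δ` has two entries, so the intercept is biased as well as the country coefficient.

```julia
# Same cell counts as the simulated data: deaths first, then survivors.
y = [ones(8); zeros(72); ones(4); zeros(16); ones(1); zeros(19); ones(12); zeros(68)]
x = [zeros(100); ones(100)]                    # 1 = Italy
z = [zeros(80); ones(20); zeros(20); ones(80)] # 1 = old

X = [ones(200) x]               # column of ones stacked with the treatment

δ = (X' * X) \ (X' * z)         # (X'X)^{-1} X'Z: intercept and slope of Z on X
β_long  = [X z] \ y             # regression including age
β_short = X \ y                 # regression omitting age

# In-sample identity: short coefficients = long coefficients + δ * (age coefficient).
implied_short = β_long[1:2] + δ * β_long[3]

println("δ             = ", round.(δ; digits=4))
println("β_short       = ", round.(β_short; digits=4))
println("implied_short = ", round.(implied_short; digits=4))
```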
