Advanced Social Epidemiology PhD Course
University of Copenhagen
2026-06-10
Ultimately, we want to know why health inequalities are changing over time—what changed?
Unpacking the ‘components’ of health inequality is an opportunity to better integrate the monitoring of health inequalities with the etiology of health inequalities.
These techniques often involve various kinds of ‘counterfactual’ scenarios.
We want to understand this:
By evaluating something like this:
Recall that we can write the CI as:
\[RCI= \frac{2}{n\mu} \sum_{i=1}^{n}y_{i}R_{i}-1\]
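where \(y_{i}\) is the health of individual \(i\), \(\mu\) is the mean of \(y\), \(n\) is the sample size, and \(R_{i}\) is individual \(i\)'s fractional rank in the socioeconomic distribution (poorest first).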
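A minimal sketch of how the formula above could be computed, assuming unweighted data and no ties in the ranking variable (with income deciles or survey weights, the fractional rank would need mean ranks and weights). The function names `fractional_rank` and `rci` are illustrative, not from any particular package.

```python
import numpy as np

def fractional_rank(ses):
    """Fractional rank R_i in (0, 1), poorest first. Assumes no ties (illustrative)."""
    ses = np.asarray(ses, dtype=float)
    n = ses.size
    order = np.argsort(ses, kind="stable")
    rank = np.empty(n)
    rank[order] = (np.arange(1, n + 1) - 0.5) / n
    return rank

def rci(y, ses):
    """Relative concentration index: 2 / (n * mu) * sum(y_i * R_i) - 1."""
    y = np.asarray(y, dtype=float)
    r = fractional_rank(ses)
    return 2.0 / (y.size * y.mean()) * np.sum(y * r) - 1.0

# Toy data: ill health held entirely by the two poorest of four people.
print(rci([1, 1, 0, 0], [1, 2, 3, 4]))  # -0.5
```

A negative value means the outcome is concentrated among the poor; a value of zero means it is spread evenly across the socioeconomic ranking.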
Decomposition:
Kakwani, Wagstaff, and Doorslaer (1997)
Since the \(RCI\) is a function of health \((y_{i})\) and a socioeconomic rank variable \((R_{i})\), i.e. \[RCI= \frac{2}{n\mu} \sum_{i=1}^{n}{\color{red}{y_{i}}}R_{i}-1\]
Then suppose that one can write a regression equation expressing the health outcome of interest \((y_{i})\) as a function of \(k\) determinants \(x_{k}\) (e.g., age, gender, urban/rural status): \[\color{red}{y_{i}}=\alpha + \sum_{k}{\beta_{k}x_{ki}}+\epsilon_{i}\]
Wagstaff, Doorslaer, and Watanabe (2003)
Since RCI is a function of \(y_{i}\) and socioeconomic rank, one can then re-express the relative concentration index as:
\[RCI=\sum{(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}}+gRCI_{e}/\mu\]
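where \(\bar{x}_{k}\) is the mean of determinant \(x_{k}\), \(RCI_{k}\) is the relative concentration index of \(x_{k}\) (computed in the same way as the \(RCI\) for \(y\)), and \(gRCI_{e}\) is the generalized concentration index of the error term \(\epsilon\).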
The basic idea: how much of the overall inequality is due to other factors (e.g., smoking) that are both differentially distributed by income and also affect \(y\)?
This equation results in 2 components of socioeconomic inequality:
\[RCI=\sum{(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}}+gRCI_{e}/\mu\]
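One part \((\beta_{k}\bar{x}_{k}/\mu)RCI_{k}\) is ‘explained’: it is due to the association between each determinant and health (the elasticity), combined with the association between each determinant and income (\(RCI_{k}\)).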
The other part \((gRCI_{e}/\mu)\) is ‘unexplained’, i.e., inequality that cannot be explained by systematic variation across income groups in the determinants of health.
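The influence of each determinant depends on two things: its elasticity with respect to \(y\), \(\beta_{k}\bar{x}_{k}/\mu\), and how unequally it is distributed across income, \(RCI_{k}\).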
1. Estimate a regression equation predicting \(y\) (‘health’) from its determinants:
\[\color{red}{y_{i}}=\alpha + \sum_{k}{\beta_{k}x_{ki}}+\epsilon_{i}\]
2. Calculate the mean of \(y\) \((\mu)\) and of each of the \(x_{k}\) determinants (e.g., education, age).
3. Calculate the overall concentration index for the health variable (\(RCI\)) and for each determinant in the equation predicting health \((RCI_{k})\). This means using each determinant \(x_{k}\) as the “outcome” and estimating an \(RCI\) for age, an \(RCI\) for education, etc.
4. Calculate the absolute contribution of each determinant:
\[(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}\]
5. Calculate the percentage contribution of each determinant:
\[[(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}]/RCI\]
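The steps above can be sketched in code. This is an illustration under stated assumptions, not the procedure used for the results in these slides: it uses a linear regression fitted by ordinary least squares (not logistic regression with marginal effects), ignores survey weights, assumes no ties in the ranking variable, and runs on simulated data. All names (`frac_rank`, `rci`, `decompose`, and the simulated variables) are hypothetical.

```python
import numpy as np

def frac_rank(ses):
    """Fractional rank, poorest first; assumes no ties in the ranking variable."""
    n = len(ses)
    order = np.argsort(ses, kind="stable")
    r = np.empty(n)
    r[order] = (np.arange(1, n + 1) - 0.5) / n
    return r

def rci(v, r):
    """Relative concentration index of v given fractional ranks r."""
    return 2.0 / (len(v) * v.mean()) * np.sum(v * r) - 1.0

def decompose(y, X, ses):
    n, k = X.shape
    r = frac_rank(ses)
    # Step 1: OLS of y on the determinants (with intercept).
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    beta = coef[1:]
    resid = y - design @ coef
    # Step 2: means of y and of each determinant.
    mu = y.mean()
    xbar = X.mean(axis=0)
    # Step 3: concentration index for y and for each determinant.
    rci_y = rci(y, r)
    rci_k = np.array([rci(X[:, j], r) for j in range(k)])
    # Step 4: absolute contributions (elasticity times RCI_k).
    contrib = (beta * xbar / mu) * rci_k
    # Step 5: share of the overall RCI.
    pct = contrib / rci_y
    # Residual term: generalized concentration index of the residuals over mu.
    residual = (2.0 / n) * np.sum(resid * r) / mu
    return {"rci": rci_y, "contrib": contrib, "pct": pct, "residual": residual}

# Simulated data (purely illustrative).
rng = np.random.default_rng(0)
n = 2000
income = rng.normal(size=n)
low_ed = (rng.normal(size=n) - 0.8 * income > 0).astype(float)
age = rng.uniform(15, 90, size=n)
y = 0.3 + 0.1 * low_ed + 0.001 * age + rng.normal(0, 0.1, size=n)
X = np.column_stack([low_ed, age])

result = decompose(y, X, income)
print(result)
```

With OLS and an intercept, the contributions plus the residual term add up to the overall \(RCI\) exactly, which gives a convenient check on the implementation.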
| Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| Current smoker | 2 | 0 | 0.1 | 0.2 | 0.0 | 0.0 | 1.0 |
| Income decile | 10 | 0 | 6.6 | 2.6 | 1.0 | 7.0 | 10.0 |
| Age (years) | 76 | 0 | 54.4 | 18.6 | 15.0 | 56.0 | 90.0 |
| Low education (ISCED 1–4) | 2 | 0 | 0.4 | 0.5 | 0.0 | 0.0 | 1.0 |
| Female | 2 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 |
| Binge drinking (monthly+) | 2 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 |
| Obese (BMI ≥ 30) | 2 | 0 | 0.1 | 0.4 | 0.0 | 0.0 | 1.0 |
| Married | 2 | 0 | 0.5 | 0.5 | 0.0 | 0.0 | 1.0 |
| Survey weight | 132 | 0 | 1.0 | 0.5 | 0.3 | 0.8 | 4.0 |
Source: Author’s calculations
Overall RCI = -0.20
The poorest 50% of the population accounts for around 70% of current smokers.
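A statement like this is a reading of the concentration curve at one point. As a rough sketch, assuming unweighted data and a hypothetical helper `cumulative_share`, the share of an outcome held by the poorest fraction \(p\) can be computed as follows (the ten observations are toy values, not the survey data):

```python
import numpy as np

def cumulative_share(y, ses, p):
    """Share of total y held by the poorest fraction p of the sample (unweighted)."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(np.asarray(ses), kind="stable")  # poorest first
    k = int(np.floor(p * y.size))                       # size of the poorest group
    return y[order][:k].sum() / y.sum()

# Toy example: 4 smokers among 10 people, one person per income decile.
smoker = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
decile = np.arange(1, 11)
share = cumulative_share(smoker, decile, 0.5)
print(share)  # 0.75
```

Plotting this share against \(p\) for all \(p\) between 0 and 1 traces out the concentration curve.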
| Variable | β | OR | 95% CI | p |
|---|---|---|---|---|
| Age (years) | -0.001 | 0.999 | (0.99, 1.01) | 0.906 |
| Female | 0.141 | 1.152 | (0.73, 1.83) | 0.547 |
| Low education (ISCED 1–4) | 0.359 | 1.432 | (0.90, 2.27) | 0.130 |
| Married | -0.989 | 0.372 | (0.21, 0.65) | <0.001 |
| Binge drinking (monthly+) | 0.38 | 1.463 | (0.92, 2.32) | 0.107 |
| Obese (BMI ≥ 30) | 0.495 | 1.64 | (0.90, 2.99) | 0.106 |
This is Step (1): Estimate a regression equation predicting \(y\) (‘health’) from its determinants \((\beta_{k}x_{k})\)
Note that income decile (the ranking variable) is excluded from the regression.
This is deliberate, since including it can be misleading:
“the residual component will be zero, or close to zero, suggesting that we have explained all or most of the variation in the Concentration Index”
Excluding it means the residual will:
The appropriate interpretation is that the covariates in the model account for part of the income gradient in smoking, and the rest remains either directly attributable to income or unexplained by the observed determinants.
\[RCI=\sum{(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}}+gRCI_{e}/\mu\]
With these parameters, the elasticity of smoking with respect to low education is: (0.024 * 0.452 / .073) = 0.148
Interpretation: a 1% increase in the prevalence of low education is associated with a 0.148% increase in smoking prevalence (a relative change, not percentage points!).
What about the RCI for low education?
This is Step (2): Calculate the mean of \(y\) \((\mu)\) and of each of the \(x_{k}\) determinants
Note the y-axis is cumulative share of low education
The poorest 50% account for roughly 60% of those with low education.
This is Step (3): Calculate the \(CI\) for each of the potential determinants \(x_{k}\)
Recall the decomposition formula:
\[RCI=\sum{(\color{red}{\beta_{k}\bar{x}_{k}/\mu})\color{blue}{RCI_{k}}}+gRCI_{e}/\mu\]
So the elasticity of smoking (from the previous slide) with respect to low education is (0.024 * 0.452 / .073) = 0.148
Now we have the RCI for low education = -0.265
So now we can calculate the contribution of low education as:
\[\text{Elasticity}\times RCI_{ed} = 0.148 \times (-0.265) = -0.039\]
Thus low education accounts for \(-0.039 / -0.203 \approx 19\%\) of the overall \(RCI\)
Decomposition of Income-Related Inequality in Smoking: Sweden
Overall RCI = -0.203
Contributions of each factor can be positive or negative
| Variable | Elasticity | \(RCI_{k}\) | Absolute contribution | % contribution |
|---|---|---|---|---|
| Age | -0.030 | 0.001 | -0.000 | 0.018 |
| Binge drinker | 0.152 | 0.103 | 0.016 | -7.693 |
| Female | 0.060 | -0.031 | -0.002 | 0.898 |
| Obese | 0.068 | -0.013 | -0.001 | 0.436 |
| Low education | 0.144 | -0.265 | -0.038 | 18.827 |
| Married | -0.353 | 0.284 | -0.100 | 49.366 |
| Residual | — | — | -0.077 | 38.148 |

Note: Logistic regression; marginal effects used as elasticity weights.
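As a quick check on how the columns relate, the following sketch recomputes the absolute and percentage contributions from the rounded elasticities and concentration indices in the table, and backs out the residual as the overall \(RCI\) minus the explained part. Because the inputs are rounded to three decimals, the results match the table only approximately.

```python
rci_total = -0.203

# (elasticity, RCI_k) pairs as reported in the table, rounded to three decimals.
rows = {
    "Age": (-0.030, 0.001),
    "Binge drinker": (0.152, 0.103),
    "Female": (0.060, -0.031),
    "Obese": (0.068, -0.013),
    "Low education": (0.144, -0.265),
    "Married": (-0.353, 0.284),
}

# Absolute contribution = elasticity * RCI_k.
contrib = {name: e * c for name, (e, c) in rows.items()}
explained = sum(contrib.values())
# Residual = overall RCI minus the sum of the explained contributions.
residual = rci_total - explained
# Percentage contribution = absolute contribution / overall RCI.
pct = {name: 100 * v / rci_total for name, v in contrib.items()}

for name in rows:
    print(f"{name:15s} {contrib[name]: .3f} {pct[name]: .1f}%")
print(f"{'Residual':15s} {residual: .3f} {100 * residual / rci_total: .1f}%")
```

A negative percentage, as for binge drinking, means the factor works against the overall pro-poor concentration of smoking: it is positively associated with smoking but concentrated among the better-off.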
Each contribution is a product of two estimated quantities:
\[\underbrace{\frac{\hat{\beta}_k \bar{x}_k}{\hat{\mu}}}_{\text{elasticity}} \times \underbrace{\widehat{RCI}_k}_{\text{inequality in }x_k}\]
Both carry sampling variability — so do the contributions.
The bootstrap resamples the full dataset with replacement and re-runs the decomposition \(B\) times to get empirical confidence intervals.
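A minimal sketch of a percentile bootstrap for the contributions, under the same simplifying assumptions as a basic linear decomposition: OLS in place of logistic marginal effects, no survey weights, and simulated data. The function names `contributions` and `bootstrap_ci` are hypothetical. A design-based bootstrap that respects survey weights and sampling strata would need more care.

```python
import numpy as np

def contributions(y, X, ses):
    """Absolute contribution of each column of X: (beta_k * xbar_k / mu) * RCI_k."""
    n = len(y)
    order = np.argsort(ses, kind="stable")
    r = np.empty(n)
    r[order] = (np.arange(1, n + 1) - 0.5) / n          # fractional rank, poorest first
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # OLS with intercept
    xbar = X.mean(axis=0)
    rci_k = 2.0 / (n * xbar) * (X * r[:, None]).sum(axis=0) - 1.0
    return coef[1:] * xbar / y.mean() * rci_k

def bootstrap_ci(y, X, ses, B=500, alpha=0.05, seed=0):
    """Percentile intervals from B resamples of whole observations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((B, X.shape[1]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                # sample rows with replacement
        draws[b] = contributions(y[idx], X[idx], ses[idx])
    lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2], axis=0)
    return lower, upper

# Simulated data (purely illustrative).
rng = np.random.default_rng(1)
n = 1000
income = rng.normal(size=n)
low_ed = (rng.normal(size=n) - 0.8 * income > 0).astype(float)
married = (rng.normal(size=n) + 0.5 * income > 0).astype(float)
y = 0.3 + 0.1 * low_ed - 0.1 * married + rng.normal(0, 0.1, size=n)
X = np.column_stack([low_ed, married])

point = contributions(y, X, income)
lower, upper = bootstrap_ci(y, X, income, B=200)
print(point, lower, upper)
```

Resampling whole rows keeps each person's outcome, determinants, and income together, so every replicate re-estimates the regression, the means, and the concentration indices jointly.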
Decomposition results will be sensitive to the choice of determinants included (i.e., how well-specified the model is for predicting \(y\)).
The regression equations are predictive and not causal models.
Main utility is not in estimating the potential impact on \(y\) of changing the distribution of socioeconomic position, but in indicating the potential role that other factors may play in generating socioeconomic inequalities in health.
The core idea is to explain the distribution of the outcome variable in question by a set of factors that vary systematically with exposure status.
Thus, we want to know, on average, why the mean level of health or disease differs between exposed and unexposed groups.
Since, for most health outcomes there are multiple determinants, we may want to know which of these determinants plays more or less important roles in explaining the difference in average outcomes.
“Unpacking” or “decomposing” difference.
Evelyn Kitagawa (1955) was a sociologist and demographer who devised a non-parametric method for decomposing differences between rates, refined by Prithwis Das Gupta in 1978.
Studies by Oaxaca (1973) and Blinder (1973) applied regression-based decomposition methods to analyze the wage gap between men and women and between whites and blacks in the USA.
Decomposition methods are based on regression analyses, and thus all of the usual caveats about good specification apply.
If regressions are purely descriptive, they reveal the associations that characterize the health inequality. Then inequality is explained in a statistical sense but implications for policies to reduce inequality are limited.
If data allow identification of causal effects, then the factors that generate the inequality are identified.
Then one can (potentially) draw conclusions about how policies would impact on inequality.
O’Donnell et al. (2008)
Left: OLS regression line always passes through \((\bar{x}, \bar{y})\) because residuals sum to zero. Right: the gap in group means decomposes into endowments (different \(\bar{x}\), same coefficients — grey triangle to blue dot) and coefficients (same \(\bar{x}\), different slopes/intercepts — grey triangle to red dot).
The overall gap between exposed and unexposed can be written as a function of the differences in covariate means and the differences in beta coefficients between the groups, where \(\Delta\bar{x} = \bar{x}^{exp} - \bar{x}^{unexp}\) and \(\Delta\beta = \beta^{exp} - \beta^{unexp}\) (with the intercept treated as the coefficient on a constant). There are two equivalent ways to write it:
Coefficients of unexposed, means of exposed:
\(\bar{y}^{exp} - \bar{y}^{unexp} = \Delta\bar{x}\beta^{unexp}+\Delta\beta \bar{x}^{exp}\)
Coefficients of exposed, means of unexposed:
\(\bar{y}^{exp} - \bar{y}^{unexp} = \Delta\bar{x}\beta^{exp}+\Delta\beta \bar{x}^{unexp}\)
In the first, the differences in the \(X\)s are weighted by the coefficients of the unexposed group and the differences in the coefficients are weighted by the \(X\)s of the exposed group: \[\bar{y}^{exp} - \bar{y}^{unexp} = \Delta\bar{x}\beta^{unexp}+\Delta\beta \bar{x}^{exp}\]
whereas, in the second, the differences in the \(X\)s are weighted by the coefficients of the exposed group and the differences in the coefficients are weighted by the \(X\)s of the unexposed group: \[\bar{y}^{exp} - \bar{y}^{unexp} = \Delta\bar{x}\beta^{exp}+\Delta\beta \bar{x}^{unexp}\]
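A small sketch of the two-fold decomposition, assuming separate OLS regressions in each group and simulated data; the names `ols` and `twofold` are hypothetical. Writing \(\Delta\bar{x} = \bar{x}^{exp} - \bar{x}^{unexp}\) and \(\Delta\beta = \beta^{exp} - \beta^{unexp}\), and treating the intercept as the coefficient on a constant, both weightings reproduce the gap in means as \(\Delta\bar{x}\beta + \Delta\beta\bar{x}\), because each OLS line passes through its own group's means.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def twofold(X_exp, y_exp, X_unexp, y_unexp):
    b_exp, b_unexp = ols(X_exp, y_exp), ols(X_unexp, y_unexp)
    # Mean vectors with a leading 1 so the intercept is handled like any coefficient.
    m_exp = np.concatenate([[1.0], X_exp.mean(axis=0)])
    m_unexp = np.concatenate([[1.0], X_unexp.mean(axis=0)])
    d_x = m_exp - m_unexp          # difference in means (endowments)
    d_b = b_exp - b_unexp          # difference in coefficients
    gap = y_exp.mean() - y_unexp.mean()
    # First form: unexposed coefficients weight d_x, exposed means weight d_b.
    first = (d_x @ b_unexp, d_b @ m_exp)
    # Second form: exposed coefficients weight d_x, unexposed means weight d_b.
    second = (d_x @ b_exp, d_b @ m_unexp)
    return gap, first, second

# Simulated data (purely illustrative): one covariate, two groups.
rng = np.random.default_rng(0)
x_exp = rng.normal(60, 15, size=(500, 1))
x_unexp = rng.normal(50, 15, size=(500, 1))
y_exp = 22 + 0.08 * x_exp[:, 0] + rng.normal(0, 3, size=500)
y_unexp = 21 + 0.05 * x_unexp[:, 0] + rng.normal(0, 3, size=500)

gap, first, second = twofold(x_exp, y_exp, x_unexp, y_unexp)
print(gap, first, second)
```

The two forms give the same total but usually split it differently between endowments and coefficients, which is why the choice of reference coefficients should be reported alongside the results.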
See Oaxaca (1973), Blinder (1973) and Cotton (1988) for details.
What is the average difference in body mass index (BMI) between those with low vs. high education?
How much of this difference is due to the fact that determinants of BMI (age, gender, smoking, drinking, income) differ between education groups?
Any residual difference is due to education differences in the associations between those risk factors and BMI — i.e., the coefficients differ.
European Social Survey Round 11, Italy (n = 1743)
Body mass index as outcome (kg/m²): weight / height²
Overall difference by education: High ed (ISCED 5–7) vs Low ed (ISCED 1–4)
Potential determinants (the Xs): age, sex, current smoking, binge drinking, marital status, and income decile.
Low-educated Italians differ from high-educated on several covariates
These differences could explain part of the BMI gap
Each covariate may have a different association with BMI by education
Differences in returns form the unexplained component
Endowments: 38%; Coefficients: 62%
| Variable | Endowments: Est | Endowments: SE | Coefficients: Est | Coefficients: SE |
|---|---|---|---|---|
| Total | 0.608 | 0.108 | 1.030 | 0.259 |
| Age (years) | 0.415 | 0.074 | 0.865 | 0.691 |
| Female | 0.080 | 0.051 | 0.466 | 0.181 |
| Current smoker | 0.004 | 0.008 | -0.145 | 0.115 |
| Binge drinking (monthly+) | 0.001 | 0.007 | 0.003 | 0.104 |
| Married | 0.012 | 0.016 | -0.037 | 0.208 |
| Income (decile) | 0.095 | 0.071 | 0.133 | 0.425 |
| Intercept | — | — | -0.256 | 0.837 |
Low ed betas used as reference; rows show each variable’s contribution to the overall gap of 1.61 kg/m²
| Variable | Endowments: Est | Endowments: SE |
|---|---|---|
| Total | 0.608 | 0.108 |
| Age (years) | 0.415 | 0.074 |
| Female | 0.080 | 0.051 |
| Current smoker | 0.004 | 0.008 |
| Binge drinking (monthly+) | 0.001 | 0.007 |
| Married | 0.012 | 0.016 |
| Income (decile) | 0.095 | 0.071 |
Interpretation:
If the low-educated had the same covariate means as the high-educated (keeping the low-educated group’s own coefficients), their BMI would be 0.608 \(kg/m^2\) lower — closing 38% of the gap.
Most of this is due to older age among the low-educated, which predicts higher BMI.
Interpretation:
If the high-educated group’s covariates had the same relationship with BMI as they do in the low-educated group (i.e., if low-ed coefficients applied to high-ed covariate means), their BMI would be 1.030 \(kg/m^2\) higher, accounting for 62% of the gap.
Note that the smoking term is negative, since smoking predicts higher BMI among the high-educated and is more common among the high-educated.
| Variable | Coefficients: Est | Coefficients: SE |
|---|---|---|
| Total | 1.030 | 0.259 |
| Age (years) | 0.865 | 0.691 |
| Female | 0.466 | 0.181 |
| Current smoker | -0.145 | 0.115 |
| Binge drinking (monthly+) | 0.003 | 0.104 |
| Married | -0.037 | 0.208 |
| Income (decile) | 0.133 | 0.425 |
| Intercept | -0.256 | 0.837 |
Variables where groups differ and that predict BMI contribute most
Income and age are the biggest drivers
Recent work attempts to reconcile the non-causal framework of Kitagawa-Blinder-Oaxaca (KBO) decomposition with causal mediation methods, leading to new estimators.
Jackson (2021)
Various decomposition techniques exist that may be useful for analyzing social determinants of health:
All of these techniques make assumptions that need to be evaluated in the course of analysis.
When used properly, decomposition techniques can help to provide key evidence on why health inequalities exist and change over time.