Decomposition Techniques for Social Epidemiology

Overview

Concentration Index Decomposition

Kitagawa-Blinder-Oaxaca Decomposition

Overview of Decomposition Techniques

Today:

Inequality decomposition: Concentration Index
Decomposing two-group differences: Kitagawa-Blinder-Oaxaca

Not covered here:

Life table decomposition
Effect decomposition (i.e., mediation)
Decomposition of population rates
Inequality decomposition for nominal social groups

Moving from Description to Explanation

Ultimately, we want to know why health inequalities are changing over time—what changed?
- Risk factors?
- Demographic composition?
- Social conditions?
Unpacking the ‘components’ of health inequality is an opportunity to better integrate the monitoring of health inequalities with the etiology of health inequalities.
These techniques often involve various kinds of ‘counterfactual’ scenarios.

Concentration Index Decomposition

We want to understand this:

By evaluating something like this:

Relative Concentration Curve

Formula for writing the Concentration Index

Recall that we can write the CI as:

\[RCI= \frac{2}{n\mu} \sum_{i=1}^{n}y_{i}R_{i}-1\]

where

\(\mu\) is the mean of \(y_{i}\) (e.g., smoking);
\(R_{i}\) is the fractional rank of the ith person in the socioeconomic (i.e., income) distribution.
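
As an illustration of the formula above, here is a minimal Python sketch that computes the \(RCI\) from individual-level data. The function names (`fractional_rank`, `rci`) and the toy data are hypothetical, survey weights are ignored, and ties in the ranking variable are simply broken by input order.

```python
import numpy as np

def fractional_rank(ses):
    """Fractional rank R_i = (position - 0.5) / n after sorting by socioeconomic status.

    Ties are broken by input order; averaging ranks within ties is a common refinement.
    """
    ses = np.asarray(ses, dtype=float)
    n = ses.size
    rank = np.empty(n)
    rank[np.argsort(ses, kind="stable")] = (np.arange(1, n + 1) - 0.5) / n
    return rank

def rci(y, ses):
    """Relative concentration index: (2 / (n * mu)) * sum(y_i * R_i) - 1."""
    y = np.asarray(y, dtype=float)
    return 2.0 / (y.size * y.mean()) * np.sum(y * fractional_rank(ses)) - 1.0

# Hypothetical toy data: five people ordered from poorest to richest.
income = np.array([10, 20, 30, 40, 50])
smoker = np.array([1, 1, 0, 0, 0])  # smoking concentrated among the poor

print(rci(smoker, income))          # negative: smoking is concentrated among the poor
print(rci(np.ones(5), income))      # an outcome that is equal for everyone gives 0
```

A negative value indicates that the outcome is concentrated among the poor; a positive value indicates concentration among the rich.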

Decomposition:

The basic idea here is to develop a model for predicting \(y\) using several determinants, then plug that model back into the equation for the \(RCI\).
This allows us to estimate how much of the overall inequality in \(y\) is due to the association between income and other factors that predict health, and how much is ‘unexplained’ by these factors.

Kakwani, Wagstaff, and Doorslaer (1997)

Decomposition of the RCI

The \(RCI\) is a function of health \((y_{i})\) and a socioeconomic rank variable \((R_{i})\): \[RCI= \frac{2}{n\mu} \sum_{i=1}^{n}{\color{red}{y_{i}}}R_{i}-1\]

Now suppose that one can write a regression equation expressing the health outcome of interest \((y_{i})\) as a function of several determinants \(x_{k}\) (e.g., age, gender, urban/rural status): \[\color{red}{y_{i}}=\alpha + \sum_{k}{\beta_{k}x_{ki}}+\epsilon_{i}\]

Wagstaff, Doorslaer, and Watanabe (2003)

Decomposition of the RCI

Since RCI is a function of \(y_{i}\) and socioeconomic rank, one can then re-express the relative concentration index as:

\[RCI=\sum{(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}}+gRCI_{e}/\mu\]

Where

\(\mu\) is the mean of \(y\),
\(\bar{x}_{k}\) is the mean of determinant \(x_{k}\),
\(\beta_{k}\) is the regression coefficient for \(x_{k}\),
\(RCI_{k}\) is the relative concentration index for \(x_{k}\), and
\(gRCI_{e}\) is the generalized concentration index for the error term \(\epsilon\).

The basic idea: how much of the overall inequality is due to other factors \(x_{k}\) that are both differentially distributed by income and also affect \(y\) (e.g., smoking)?

Explained and unexplained components

This equation results in 2 components of socioeconomic inequality:

\[RCI=\sum{(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}}+gRCI_{e}/\mu\]

One part \((\beta_{k}\bar{x}_{k}/\mu)RCI_{k}\) that is due to the association between:

determinants that predict health (i.e., \(\beta_{k}\bar{x}_{k}/\mu\)); and
the systematic variation in these determinants across income groups (i.e., \(RCI_{k}\))

The other part \((gRCI_{e}/\mu)\) is ‘unexplained’, i.e., inequality that cannot be explained by systematic variation across income groups in the determinants of health.
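
A brief sketch of where the two components come from. The intermediate steps are a reconstruction rather than part of the original slides, and they assume fractional ranks that average 1/2, residuals with mean zero (as in OLS with an intercept), and \(gRCI_{e} = \frac{2}{n}\sum_{i}\epsilon_{i}R_{i}\). Substituting the regression equation for \(y_{i}\) into the \(RCI\) formula gives:

\[
\begin{aligned}
RCI &= \frac{2}{n\mu}\sum_{i=1}^{n}\Big(\alpha + \sum_{k}\beta_{k}x_{ki} + \epsilon_{i}\Big)R_{i} - 1 \\
    &= \frac{\alpha}{\mu} + \sum_{k}\frac{\beta_{k}\bar{x}_{k}}{\mu}\left(RCI_{k} + 1\right) + \frac{gRCI_{e}}{\mu} - 1 \\
    &= \sum_{k}\frac{\beta_{k}\bar{x}_{k}}{\mu}RCI_{k} + \frac{gRCI_{e}}{\mu} + \underbrace{\frac{\alpha + \sum_{k}\beta_{k}\bar{x}_{k}}{\mu} - 1}_{=\,0}
\end{aligned}
\]

The second line uses \(\frac{2}{n}\sum_{i}R_{i} = 1\) and \(\frac{2}{n}\sum_{i}x_{ki}R_{i} = \bar{x}_{k}(RCI_{k}+1)\); the last term vanishes because \(\mu = \alpha + \sum_{k}\beta_{k}\bar{x}_{k}\) when the residuals average zero.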

Two types of ‘explained’ components

The influence of determinants depends on two things:

the strength of the relationship between each factor and income

\(\color{blue}{RCI_{k}}\)

the strength of the relationship between each factor and health, and its prevalence in the population (elasticity).

\(\color{red}{\beta_{k}\bar{x}_{k}/\mu}\)

Procedure for decomposing the Concentration Index

1. Estimate a regression equation predicting \(y\) (‘health’) from its determinants \((\beta_{k}x_{k})\):

\[\color{red}{y_{i}}=\alpha + \sum_{k}{\beta_{k}x_{ki}}+\epsilon_{i}\]

2. Calculate the mean of \(y\) \((\mu)\) and of each of the \(x_{k}\) determinants (e.g., education, age).
3. Calculate the overall Concentration Index for the health variable (\(RCI\)) and for each determinant in the equation predicting health \((RCI_{k})\). This means using each determinant \(x_{k}\) as the “outcome” and estimating an \(RCI\) for age, an \(RCI\) for education, etc.

Procedure for decomposing the Concentration Index

4. Calculate the absolute contribution of each determinant by multiplying its ‘elasticity’ by its concentration index \((RCI_{k})\):

\[(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}\]

5. Calculate the percentage contribution of each determinant:

\[[(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}]/RCI\]
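
The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code behind the results in these slides: it uses a linear (OLS) model rather than logistic regression with marginal effects, ignores survey weights, and runs on synthetic data. All names (`decompose_rci`, `low_ed`, `married`, ...) are hypothetical.

```python
import numpy as np

def fractional_rank(ses):
    """R_i = (position - 0.5) / n after sorting by socioeconomic status."""
    n = len(ses)
    rank = np.empty(n)
    rank[np.argsort(ses, kind="stable")] = (np.arange(1, n + 1) - 0.5) / n
    return rank

def rci(v, rank):
    """Relative concentration index of v with respect to the given ranks."""
    return 2.0 / (v.size * v.mean()) * np.sum(v * rank) - 1.0

def decompose_rci(y, X, ses):
    n, rank = y.size, fractional_rank(ses)
    # Step 1: OLS of y on the determinants (with an intercept).
    design = np.column_stack([np.ones(n), X])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    beta, resid = coef[1:], y - design @ coef
    # Step 2: means of y and of each determinant.
    mu, xbar = y.mean(), X.mean(axis=0)
    # Step 3: RCI of y and of each determinant.
    rci_y = rci(y, rank)
    rci_k = np.array([rci(X[:, k], rank) for k in range(X.shape[1])])
    # Step 4: absolute contribution = elasticity * RCI_k.
    elasticity = beta * xbar / mu
    absolute = elasticity * rci_k
    # Residual term gRCI_e / mu, with gRCI_e = (2 / n) * sum(e_i * R_i).
    residual = 2.0 / n * np.sum(resid * rank) / mu
    # Step 5: percentage contribution.
    percent = 100.0 * absolute / rci_y
    return {"rci": rci_y, "elasticity": elasticity, "rci_k": rci_k,
            "absolute": absolute, "percent": percent, "residual": residual}

# Synthetic (made-up) data: low education is more common among the poor and
# raises smoking; marriage is more common among the rich and lowers smoking.
rng = np.random.default_rng(0)
n = 5000
income = rng.normal(size=n)
low_ed = (rng.random(n) < 1 / (1 + np.exp(income))).astype(float)
married = (rng.random(n) < 1 / (1 + np.exp(-income))).astype(float)
smoker = (rng.random(n) < 0.10 + 0.10 * low_ed - 0.05 * married).astype(float)

out = decompose_rci(smoker, np.column_stack([low_ed, married]), income)
print("overall RCI:             ", out["rci"])
print("contributions + residual:", out["absolute"].sum() + out["residual"])
```

With OLS and an intercept, the explained contributions plus the residual term reproduce the overall \(RCI\) exactly (up to floating-point error), which is a useful check on any implementation.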

Example: Decomposing Socioeconomic Inequality in Current Smoking

Data: European Social Survey Round 11 for Sweden

	Unique	Mean	SD	Min	Median	Max
Current smoker	2	0.1	0.2	0.0	0.0	1.0
Income decile	10	6.6	2.6	1.0	7.0	10.0
Age (years)	76	54.4	18.6	15.0	56.0	90.0
Low education (ISCED 1–4)	2	0.4	0.5	0.0	0.0	1.0
Female	2	0.5	0.5	0.0	0.0	1.0
Binge drinking (monthly+)	2	0.5	0.5	0.0	0.0	1.0
Obese (BMI ≥ 30)	2	0.1	0.4	0.0	0.0	1.0
Married	2	0.5	0.5	0.0	0.0	1.0
Survey weight	132	1.0	0.5	0.3	0.8	4.0

Source: Author’s calculations

Concentration curve for smoking

Overall RCI = -0.20

The poorest 50% of the population accounts for around 70% of current smokers.
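
For readers who want to see how a statement like this is read off the curve, here is a small Python sketch of the concentration curve. The data are a made-up toy example (not the Swedish survey), and the function name `concentration_curve` is hypothetical.

```python
import numpy as np

def concentration_curve(y, ses):
    """Cumulative population share (x-axis) and cumulative outcome share (y-axis),
    with people ordered from poorest to richest."""
    order = np.argsort(ses, kind="stable")
    y_sorted = np.asarray(y, dtype=float)[order]
    cum_pop = np.arange(1, y_sorted.size + 1) / y_sorted.size
    cum_share = np.cumsum(y_sorted) / y_sorted.sum()
    return cum_pop, cum_share

# Toy data: ten people ranked by income, five of whom smoke.
income = np.arange(1, 11)
smoker = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])

pop, share = concentration_curve(smoker, income)
# Share of all smokers found in the poorest half of the population.
share_poorest_half = share[np.searchsorted(pop, 0.5)]
print(share_poorest_half)
```

A curve lying above the 45-degree line of equality (here, the poorest half holds more than half of all smokers) corresponds to a negative \(RCI\).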

Predictors of current smoking: Sweden

Variable	β	OR	95% CI	p
Age (years)	-0.001	0.999	(0.99, 1.01)	0.906
Female	0.141	1.152	(0.73, 1.83)	0.547
Low education (ISCED 1–4)	0.359	1.432	(0.90, 2.27)	0.130
Married	-0.989	0.372	(0.21, 0.65)	<0.001
Binge drinking (monthly+)	0.38	1.463	(0.92, 2.32)	0.107
Obese (BMI ≥ 30)	0.495	1.64	(0.90, 2.99)	0.106

This is Step (1): Estimate a regression equation predicting \(y\) (‘health’) from its determinants \((\beta_{k}x_{k})\)

Should we include income in the regression?

Note that income decile (the ranking variable) is excluded from the regression.

This is deliberate, since including it can be misleading:

“the residual component will be zero, or close to zero, suggesting that we have explained all or most of the variation in the Concentration Index”

Excluding it means the residual will:

Capture unexplained income-related variation
Allow for a direct income → smoking pathway
Not artificially inflate the “explained” share

The appropriate interpretation is that the covariates in the model account for part of the income gradient in smoking, and the rest remains either directly attributable to income or unexplained by the observed determinants.

See Kessels and Erreygers (2013) and Erreygers and Kessels (2016)

Estimation for a specific factor: Education

\[RCI=\sum{(\beta_{k}\bar{x}_{k}/\mu)RCI_{k}}+gRCI_{e}/\mu\]

Estimated \(\beta\) coeff on low education (logit scale): 0.359 (OR = 1.43)
Marginal effect on probability scale: 0.024 (2.4 pct points)
Mean low education: 0.452
Mean smoking rate: 7.3%

With these parameters, the elasticity of smoking with respect to low education is: (0.024 * 0.452 / .073) = 0.148

Interpretation: a 1% (relative) increase in the prevalence of low education is associated with a 0.148% increase in smoking prevalence (a relative change, not percentage points!).

What about the RCI for low education?

This is Step (2): Calculate the mean of \(y\) \((\mu)\) and of each of the \(x_{k}\) determinants

Concentration curve for low education

Note the y-axis is cumulative share of low education

The poorest 50% account for roughly 60% of those with low education.

This is Step (3): Calculate the \(RCI_{k}\) for each of the potential determinants \(x_{k}\)

Estimation for a specific factor: Low education

Recall the decomposition formula:

\[RCI=\sum{(\color{red}{\beta_{k}\bar{x}_{k}/\mu})\color{blue}{RCI_{k}}}+gRCI_{e}/\mu\]

So the elasticity of smoking (from the previous slide) with respect to low education is (0.024 * 0.452 / .073) = 0.148

Now we have the RCI for low education = -0.265

So now we can calculate the contribution of low education as:

\[\text{Elasticity}\times RCI_{ed} = 0.148 \times (-0.265) = -0.039\]

Thus low education accounts for \((-0.039)/(-0.203) \approx 19\%\) of the overall \(RCI\).
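
The arithmetic for low education can be reproduced in a few lines of Python. The inputs are the rounded values quoted on these slides, so the last digit of each result may differ slightly from the figures reported here; the variable names are hypothetical.

```python
# Rounded inputs quoted on the slides.
marginal_effect = 0.024   # effect of low education on P(smoking), probability scale
mean_low_ed = 0.452       # mean of the determinant
mean_smoking = 0.073      # mean of the outcome
rci_low_ed = -0.265       # concentration index of low education
rci_overall = -0.203      # overall concentration index of smoking

elasticity = marginal_effect * mean_low_ed / mean_smoking
contribution = elasticity * rci_low_ed
share = contribution / rci_overall

print(f"elasticity            = {elasticity:.3f}")
print(f"absolute contribution = {contribution:.3f}")
print(f"share of overall RCI  = {100 * share:.1f}%")
```

Both the contribution and the overall index are negative, so their ratio is positive: low education accounts for part of the concentration of smoking among the poor.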

Decomposition of Income-Related Inequality in Smoking: Sweden

Overall RCI = -0.203

Contributions of each factor can be positive or negative

Variable	Elasticity	RCI_k	Absolute contribution	Contribution (%)
Age	-0.030	0.001	-0.000	0.018
Binge drinker	0.152	0.103	0.016	-7.693
Female	0.060	-0.031	-0.002	0.898
Obese	0.068	-0.013	-0.001	0.436
Low education	0.144	-0.265	-0.038	18.827
Married	-0.353	0.284	-0.100	49.366
Residual	—	—	-0.077	38.148

Note: Logistic regression; marginal effects used as elasticity weights.

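Binge drinking's percentage contribution is negative: it is associated with more smoking but is more common among higher-income groups, so it offsets part of the inequality (its absolute contribution is positive, while the overall RCI is negative).
Marriage's percentage contribution is positive: it is associated with less smoking and is more common among higher-income groups, so it adds to the concentration of smoking among the poor.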
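
As a quick consistency check, the table can be rebuilt from its own elasticity and concentration-index columns. This Python sketch only re-multiplies the rounded values reported above (no new data), so small rounding differences from the printed columns are expected.

```python
rci_overall = -0.203
residual = -0.077

# variable: (elasticity, concentration index of the determinant), as reported above
rows = {
    "Age": (-0.030, 0.001),
    "Binge drinker": (0.152, 0.103),
    "Female": (0.060, -0.031),
    "Obese": (0.068, -0.013),
    "Low education": (0.144, -0.265),
    "Married": (-0.353, 0.284),
}

# Absolute contribution = elasticity * concentration index of the determinant.
absolute = {name: e * c for name, (e, c) in rows.items()}
# Percentage contribution = absolute contribution / overall RCI.
percent = {name: 100 * a / rci_overall for name, a in absolute.items()}

for name in rows:
    print(f"{name:15s} {absolute[name]:8.3f} {percent[name]:8.1f}%")

total = sum(absolute.values()) + residual
print(f"contributions + residual = {total:.3f} (overall RCI = {rci_overall})")
```

Because the overall index is negative, a determinant with a positive absolute contribution (binge drinking) ends up with a negative percentage contribution.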

Uncertainty in the RCI Decomposition

Each contribution is a product of two estimated quantities:

\[\underbrace{\frac{\hat{\beta}_k \bar{x}_k}{\hat{\mu}}}_{\text{elasticity}} \times \underbrace{\widehat{RCI}_k}_{\text{inequality in }x_k}\]

Both carry sampling variability — so do the contributions.

Bootstrap resamples the full dataset and re-runs the decomposition \(B\) times to get empirical confidence intervals.

O’Donnell et al. (2008); Hosseinpoor, Doorslaer, and Speybroeck (2006)
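
A minimal sketch of the bootstrap idea in Python, assuming simple random sampling (no survey weights or design features), a linear (OLS) model in place of the logistic model used in the example, and percentile intervals. The data are synthetic and the function names (`contributions`, `bootstrap_ci`) are hypothetical.

```python
import numpy as np

def contributions(y, X, ses):
    """Absolute contribution of each determinant: (beta_k * xbar_k / mu) * RCI_k."""
    n = y.size
    rank = np.empty(n)
    rank[np.argsort(ses, kind="stable")] = (np.arange(1, n + 1) - 0.5) / n
    design = np.column_stack([np.ones(n), X])
    coef = np.linalg.lstsq(design, y, rcond=None)[0]
    mu = y.mean()
    out = []
    for k in range(X.shape[1]):
        xk = X[:, k]
        rci_k = 2.0 / (n * xk.mean()) * np.sum(xk * rank) - 1.0
        out.append(coef[k + 1] * xk.mean() / mu * rci_k)
    return np.array(out)

def bootstrap_ci(y, X, ses, B=200, level=0.95, seed=0):
    """Percentile confidence intervals from B resamples of whole rows."""
    rng = np.random.default_rng(seed)
    n = y.size
    draws = np.empty((B, X.shape[1]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample individuals with replacement
        draws[b] = contributions(y[idx], X[idx], ses[idx])
    alpha = (1 - level) / 2
    return np.quantile(draws, [alpha, 1 - alpha], axis=0)

# Synthetic (made-up) data.
rng = np.random.default_rng(1)
n = 2000
income = rng.normal(size=n)
low_ed = (rng.random(n) < 1 / (1 + np.exp(income))).astype(float)
married = (rng.random(n) < 1 / (1 + np.exp(-income))).astype(float)
smoker = (rng.random(n) < 0.10 + 0.10 * low_ed - 0.05 * married).astype(float)
X = np.column_stack([low_ed, married])

point = contributions(smoker, X, income)
ci = bootstrap_ci(smoker, X, income)
for name, est, lo, hi in zip(["low_ed", "married"], point, ci[0], ci[1]):
    print(f"{name:8s} {est:7.3f}  [{lo:7.3f}, {hi:7.3f}]")
```

Resampling whole rows keeps the outcome, the determinants, and the ranking variable together, so each replicate re-estimates the regression, the ranks, and every \(RCI_{k}\) jointly.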

Caveats for decomposing the RCI

Decomposition results will be sensitive to the choice of determinants included (i.e., how well-specified the model is for predicting \(y\)).

The regression equations are predictive and not causal models.

Main utility is not in estimating the potential impact on \(y\) of changing the distribution of socioeconomic position, but in indicating the potential role that other factors may play in generating socioeconomic inequalities in health.

Kitagawa-Blinder-Oaxaca Decomposition

Idea for Decomposition of Means

The core idea is to explain the distribution of the outcome variable in question by a set of factors that vary systematically with exposure status.

Thus, we want to know, on average, why the mean level of health or disease differs between exposed and unexposed groups.

Since, for most health outcomes there are multiple determinants, we may want to know which of these determinants plays more or less important roles in explaining the difference in average outcomes.

“Unpacking” or “decomposing” difference.

Origins

Evelyn Kitagawa, a sociologist and demographer, devised a non-parametric method for decomposing differences between rates (Kitagawa 1955), which was refined by Prithwis Das Gupta in 1978.

Focused on understanding group contributions to rate differences.

Studies by Oaxaca (1973) and Blinder (1973) applied regression-based decomposition methods to analyze the wage gap between men and women and between whites and blacks in the USA.

Focused on how much of the wage gap was ‘explained’ by differences in observable characteristics.

Brief note on interpretation

Decomposition methods are based on regression analyses, and thus all of the usual caveats about good specification apply.

If regressions are purely descriptive, they reveal the associations that characterize the health inequality. Then inequality is explained in a statistical sense but implications for policies to reduce inequality are limited.

If data allow identification of causal effects, then the factors that generate the inequality are identified.

Then one can (potentially) draw conclusions about how policies would impact on inequality.

O’Donnell et al. (2008); Jiménez-Rubio and Hernández-Quevedo (2010)

Kitagawa-Blinder-Oaxaca: Basic Idea

Two potential sources of mean differences in outcomes

1. Means

Differences in the prevalence of determinants of outcome

2. Coefficients

Differences in the coefficient of a given determinant on the outcome (i.e., effect measure modification)

Regression reproduces the mean

Left: OLS regression line always passes through \((\bar{x}, \bar{y})\) because residuals sum to zero. Right: the gap in group means decomposes into endowments (different \(\bar{x}\), same coefficients — grey triangle to blue dot) and coefficients (same \(\bar{x}\), different slopes/intercepts — grey triangle to red dot).
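
The property in the left panel is easy to verify numerically. A small Python sketch on synthetic data (the variable names and the data-generating values are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 10, size=200)                 # e.g., age
y = 20 + 0.1 * x + rng.normal(0, 2, size=200)    # e.g., BMI

# OLS with an intercept.
design = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(design, y, rcond=None)[0]

residuals = y - (intercept + slope * x)
pred_at_mean = intercept + slope * x.mean()

print("sum of residuals:    ", residuals.sum())   # zero up to floating-point error
print("prediction at mean x:", pred_at_mean)
print("mean of y:           ", y.mean())          # same as the line above
```

This is what lets the decomposition work with group means: within each group, the mean outcome equals the fitted coefficients evaluated at the group's covariate means.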

Two ways of expressing the mean difference in \(y\)

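The overall gap between exposed and unexposed can be written as the sum of two parts: differences in the covariate means and differences in the respective beta coefficients, each evaluated at one group's values. Throughout, \(\Delta\bar{x} = \bar{x}^{exp} - \bar{x}^{unexp}\) and \(\Delta\beta = \beta^{exp} - \beta^{unexp}\), with the intercept included in \(\beta\) (paired with a constant of 1 in \(\bar{x}\)).

First method

Coefficients of unexposed
Means of exposed

\(\bar{y}^{exp} - \bar{y}^{unexp} = \Delta\bar{x}\beta^{unexp}+\Delta\beta \bar{x}^{exp}\)

Second method

Coefficients of exposed
Means of unexposed

\(\bar{y}^{exp} - \bar{y}^{unexp} = \Delta\bar{x}\beta^{exp}+\Delta\beta \bar{x}^{unexp}\)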

The two methods are equally valid

In the first, the differences in the \(X\)s are weighted by the coefficients of the unexposed group and the differences in the coefficients are weighted by the \(X\)s of the exposed group (with \(\Delta\) denoting exposed minus unexposed): \[\bar{y}^{exp} - \bar{y}^{unexp} = \Delta\bar{x}\beta^{unexp}+\Delta\beta \bar{x}^{exp}\]

whereas, in the second, the differences in the \(X\)s are weighted by the coefficients of the exposed group and the differences in the coefficients are weighted by the \(X\)s of the unexposed group: \[\bar{y}^{exp} - \bar{y}^{unexp} = \Delta\bar{x}\beta^{exp}+\Delta\beta \bar{x}^{unexp}\]

See Oaxaca (1973), Blinder (1973) and Cotton (1988) for details.
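
Both weightings can be checked with a short Python sketch. It assumes separate OLS models per group, defines every difference as exposed minus unexposed (so the two parts add up to the gap), and uses synthetic data with a single covariate; the function names (`ols`, `kbo_twofold`) are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(design, y, rcond=None)[0]

def kbo_twofold(X_exp, y_exp, X_unexp, y_unexp):
    b_exp, b_unexp = ols(X_exp, y_exp), ols(X_unexp, y_unexp)
    # Covariate means, with a leading 1 to pair with the intercept.
    m_exp = np.concatenate([[1.0], X_exp.mean(axis=0)])
    m_unexp = np.concatenate([[1.0], X_unexp.mean(axis=0)])
    d_x = m_exp - m_unexp          # difference in means
    d_b = b_exp - b_unexp          # difference in coefficients
    gap = y_exp.mean() - y_unexp.mean()
    # First method: coefficients of unexposed, means of exposed.
    first = {"endowments": d_x @ b_unexp, "coefficients": d_b @ m_exp}
    # Second method: coefficients of exposed, means of unexposed.
    second = {"endowments": d_x @ b_exp, "coefficients": d_b @ m_unexp}
    return gap, first, second

# Synthetic (made-up) data: groups differ in mean age and in the age slope.
rng = np.random.default_rng(42)
n = 1000
age_exp = rng.normal(55, 10, n)
age_unexp = rng.normal(45, 10, n)
y_exp = 20 + 0.10 * age_exp + rng.normal(0, 2, n)
y_unexp = 21 + 0.05 * age_unexp + rng.normal(0, 2, n)

gap, first, second = kbo_twofold(age_exp.reshape(-1, 1), y_exp,
                                 age_unexp.reshape(-1, 1), y_unexp)
print("gap:          ", gap)
print("first method: ", first, "sum =", sum(first.values()))
print("second method:", second, "sum =", sum(second.values()))
```

Both methods reproduce the same total gap but can split it differently between endowments and coefficients, which is why the choice of reference group should be reported.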

Example: Decomposing Educational Differences in BMI in Italy

Basic question

What is the average difference in body mass index (BMI) between those with low vs. high education?

How much of this difference is due to the fact that determinants of BMI (age, gender, smoking, drinking, income) differ between education groups?

Any residual difference is due to education differences in the associations between those risk factors and BMI — i.e., the coefficients differ.

Example data

European Social Survey Round 11, Italy (n = 1743)

Body mass index as outcome (kg/m²): weight / height²

Overall difference by education: High ed (ISCED 5–7) vs Low ed (ISCED 1–4)

Potential determinants (the Xs):

age (years)
gender (female = 1)
smoking (1 = current smoker)
binge drinking (1 = monthly or more)
married (1 = currently married)
income decile (1–10)

BMI distribution by education (High ed: 23.6, Low ed: 25.2, Gap = 1.6)

Differences in determinants

Low-educated Italians differ from high-educated on several covariates
These differences could explain part of the BMI gap

Differences in coefficients

Each covariate may have a different association with BMI by education
Differences in coefficients (‘returns’) form the unexplained component

Decomposition results: variable contributions

Endowments: 38%; Coefficients: 62%

	Endowments		Coefficients
	Est	SE	Est	SE
Total	0.608	0.108	1.030	0.259
Age (years)	0.415	0.074	0.865	0.691
Female	0.080	0.051	0.466	0.181
Current smoker	0.004	0.008	-0.145	0.115
Binge drinking (monthly+)	0.001	0.007	0.003	0.104
Married	0.012	0.016	-0.037	0.208
Income (decile)	0.095	0.071	0.133	0.425
Intercept	—	—	-0.256	0.837

Low ed betas used as reference; rows show each variable’s contribution to the overall gap of 1.61 kg/m²
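
To make the structure of the table explicit, the per-variable rows add up to the Total row in each column (up to rounding). A small Python check using only the rounded estimates reported above:

```python
# Per-variable contributions as reported in the table (kg/m^2).
endowments = {
    "Age": 0.415, "Female": 0.080, "Current smoker": 0.004,
    "Binge drinking": 0.001, "Married": 0.012, "Income": 0.095,
}
coefficients = {
    "Age": 0.865, "Female": 0.466, "Current smoker": -0.145,
    "Binge drinking": 0.003, "Married": -0.037, "Income": 0.133,
    "Intercept": -0.256,
}

total_endowments = sum(endowments.values())
total_coefficients = sum(coefficients.values())

print(f"endowments total   = {total_endowments:.3f}  (table: 0.608)")
print(f"coefficients total = {total_coefficients:.3f}  (table: 1.030)")
```

Note that the intercept appears only in the coefficients column: a difference in intercepts is a difference in coefficients, not in covariate means.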

Decomposition results: summary

Endowments: 38%; Coefficients: 62%

	Endowments
	Est	SE
Total	0.608	0.108
Age (years)	0.415	0.074
Female	0.080	0.051
Current smoker	0.004	0.008
Binge drinking (monthly+)	0.001	0.007
Married	0.012	0.016
Income (decile)	0.095	0.071

Interpretation:

If the low-educated had the same covariate means as the high-educated (keeping the low-educated group’s own coefficients), their BMI would be 0.608 \(kg/m^2\) lower — closing 38% of the gap.
Most of this is due to older age among the low-educated, which predicts higher BMI.

Decomposition results: coefficients

Endowments: 38%; Coefficients: 62%

Interpretation:

If the high-educated group’s covariates had the same relationship with BMI as they do in the low-educated group (i.e., if low-ed coefficients applied to high-ed covariate means), their BMI would be 1.030 kg/m² higher, accounting for 62% of the gap.
Note the smoking contribution is negative: the association between smoking and BMI is more positive among the high-educated than among the low-educated, so applying low-ed coefficients would lower predicted BMI for high-educated smokers.

	Coefficients
	Est	SE
Total	1.030	0.259
Age (years)	0.865	0.691
Female	0.466	0.181
Current smoker	-0.145	0.115
Binge drinking (monthly+)	0.003	0.104
Married	-0.037	0.208
Income (decile)	0.133	0.425
Intercept	-0.256	0.837

Which covariates drive the endowments?

Variables where groups differ and that predict BMI contribute most
Income and age are the biggest drivers

Methods frontier

Recent work attempts to reconcile the non-causal framework of KBO with mediation methods and to develop new estimators.

Jackson (2021)

Summary

Various decomposition techniques exist that may be useful for analyzing social determinants of health:

Regression-based decomposition of Concentration Index
Kitagawa-Blinder-Oaxaca decomposition of mean health between groups

All of these techniques make assumptions that need to be evaluated in the course of analysis.

When used properly, decomposition techniques can help to provide key evidence on why health inequalities exist and change over time.

References

Blinder, Alan S. 1973. “Wage Discrimination: Reduced Form and Structural Estimates.” Journal of Human Resources 8 (4): 436–55. https://doi.org/10.2307/144855.

Cotton, Jeremiah. 1988. “On the Decomposition of Wage Differentials.” The Review of Economics and Statistics 70 (2): 236. https://doi.org/10.2307/1928307.

Erreygers, Guido, and Roselinde Kessels. 2016. “Structural Equation Modeling for Decomposing Rank-Dependent Indicators of Socioeconomic Inequality of Health: An Empirical Study.” Health Economics Review 6 (1). https://doi.org/10.1186/s13561-016-0134-2.

Hosseinpoor, Ahmad Reza, Eddy van Doorslaer, and Niko Speybroeck. 2006. “Decomposing Socioeconomic Inequality in Infant Mortality in Iran.” International Journal of Epidemiology 35 (5): 1211–19. https://doi.org/10.1093/ije/dyl164.

Jackson, John W. 2021. “Meaningful Causal Decompositions in Health Equity Research: Definition, Identification, and Estimation Through a Weighting Framework.” Epidemiology 32 (2): 282–290. https://doi.org/10.1097/EDE.0000000000001319.

Jiménez-Rubio, Dolores, and Cristina Hernández-Quevedo. 2010. “Inequalities in the Use of Health Services Between Immigrants and the Native Population in Spain: What Is Driving the Differences?” The European Journal of Health Economics 12 (1): 17–28. https://doi.org/10.1007/s10198-010-0220-z.

Kakwani, Nanak, Adam Wagstaff, and Eddy van Doorslaer. 1997. “Socioeconomic Inequalities in Health: Measurement, Computation, and Statistical Inference.” Journal of Econometrics 77 (1): 87–103. https://doi.org/10.1016/S0304-4076(96)01807-6.

Kessels, Roselinde, and Guido Erreygers. 2013. “Regression-Based Decompositions of Rank-Dependent Indicators of Socioeconomic Inequality of Health.” In Health and Inequality, 227–59. Emerald Group Publishing Limited. https://doi.org/10.1108/s1049-2585(2013)0000021010.

Kitagawa, Evelyn M. 1955. “Components of a Difference Between Two Rates.” Journal of the American Statistical Association 50 (272): 1168–94. https://doi.org/10.2307/2281213.

O’Donnell, Owen, Eddy van Doorslaer, Adam Wagstaff, and Magnus Lindelow. 2008. Analyzing Health Equity Using Household Survey Data. Washington, DC: World Bank.

Oaxaca, Ronald. 1973. “Male-Female Wage Differentials in Urban Labor Markets.” International Economic Review 14 (3): 693–709. https://doi.org/10.2307/2525981.

Wagstaff, Adam, Eddy van Doorslaer, and Naoko Watanabe. 2003. “On Decomposing the Causes of Health Sector Inequalities with an Application to Malnutrition Inequalities in Vietnam.” Journal of Econometrics 112 (1): 207–23. https://doi.org/10.1016/S0304-4076(02)00161-6.