Understanding mechanisms: opening the “black box” in observational studies

New statistical methods are increasingly being used in epidemiology. One of the most interesting developments is the new field of causal inference. There is no magic bullet for causality of course, but the new statistical thinking in this area is fruitful in that it clarifies issues. It is important that statistical analyses are transparent, but one should not shy away from the effort to learn about new concepts such as counterfactual or mediation analysis. These ideas are likely to play an increasing role in many fields, including epidemiology. The design and execution of cohort studies are often excellent. The issue then is how to learn as much as possible from the data, and new statistical methods can in some cases yield more than classical ones.

Workers with a low baseline socioeconomic status (SES) were at increased risk of selection out of paid employment in a 10-year follow-up, whereas self-reported ill-health at baseline was associated with an increased risk of selection out of work as well as a reduced likelihood of return to work after exit. These are the main results of a Dutch study published in this issue of the Scandinavian Journal of Work, Environment & Health (1). The study was based on a rich data set that allowed the separation of different pathways out of work: unemployment, early retirement, disability pension, or becoming economically inactive. Schuring et al recommend that “[P]olicies to improve labor force participation, especially among low socioeconomic level workers, should protect workers with health problems against exclusion from the labor force”. 

This appears to be sound advice. The logical prevention would then be targeted at workers with health problems. Such problems could be detected in health examinations at the workplace, or interventions could be targeted at risk groups at an early stage, eg, based on sickness absence registrations.

Baseline SES or health determining the subsequent working career could, in a life course terminology (2), be classified as effects of a critical period. It is however reasonable to consider more complex causal models based on the dynamic relationships between SES, health, and work. The influence of exit from work on subsequent health could be important in this respect (3, 4). Results in some studies suggest that retirement will be beneficial to health (5–7), but there are indications that the picture is more complex (4, 8). Effects on health and well-being are also likely to be dependent on exit pathways, being less favorable after unemployment (9) or involuntary retirement (10). Studies addressing employment often concern different time-dependent states, such as unemployment, sickness absence, disability, and retirement, which add to the analytical complexity (1, 11). In fact, the principal investigator of the Schuring et al study (1) addressed this issue and the call for multistate analytical models in a recent editorial (12). 

Given this complexity, the question is how to formulate the proper research questions in order to give a sound basis for prevention (13). One solution would be to ask whether changes in determinants affect changes in outcome entities (14). This requires considering the time dimension not only for the outcome but also for the independent variables. Schuring et al (1) acknowledge the problem of not having measurements of changes in determinants and the potential for such changes influencing employment transitions. Alternative research questions could be: to what extent will changes in SES or changes in health influence changes in employment? How will changes in employment influence SES and health? 

This calls for including time-dependent covariates in the analysis. One interesting possibility that has gained increased attention recently is to estimate direct and indirect effects in mediation analysis. Assessing the role of an intermediate between exposure and outcome – a mediator – could be considered as a way to open risk factor epidemiology’s “black box”, thereby improving the mechanistic understanding of an exposure–outcome relationship (15). The mediator of interest could be SES later in life as an intermediate between childhood SES and chronic disease in adulthood (16) or physical work environment as an intermediate between SES and sickness absence (17). One realistic and interesting possibility is that the mediator might be more accessible for intervention than the exposure and therefore more valuable from a preventive point of view (15). For example, it could be easier to intervene on the work environment than on SES level.

How should direct and indirect effects be unraveled? The traditional way has been to compare exposure effects in regression models with and without the mediator (18). This strategy will often produce flawed results, one of the problems being confounding from time-dependent covariates (16, 19). Analyzing time-to-event data by comparing Cox regression models with and without the mediator involves an additional problem: the important assumption of proportional hazards can never be satisfied for both models. One alternative to the traditional regression methods (eg, logistic or Cox regression) is to apply a counterfactual framework (20). The concept behind this is, to a large degree, based on ideas of intervention described in Judea Pearl’s famous book “Causality” (21). The aim is to analyze observational data in a way that mimics randomized experiments. Among the options for making counterfactual deductions from observational data, marginal structural models (MSM) with inverse probability weighting (IPW), combined with causal diagrams (directed acyclic graphs or DAG), are the most common (16, 20, 22). This approach is well suited for causal mediation analysis (16). Dynamic path analysis is an alternative for estimating direct and indirect effects (17, 20, 23). This method is particularly useful when time-to-event is the outcome, ie, in survival analysis (17, 20). Lange & Hansen (17) estimated the mediating role of the physical work environment on the relationship between SES and sickness absence. Two methods were compared: a traditional analysis, in which Cox models with and without the covariate (work environment) were compared, and a counterfactual analysis, in which exposure (SES) and covariates were time-dependent and the rate was modeled using the Aalen additive hazard model (17). The traditional analysis followed an earlier study of the same data, in which the degree of reduction in rate ratios in models including the covariate was interpreted as a corresponding degree of mediation through that factor (24). The results of the counterfactual analysis were qualitatively in accordance with the mediating effect in the ordinary Cox regression analysis, but the quantitative effect was stronger (17).
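
The traditional comparison of models with and without the mediator can be illustrated with a minimal sketch. It assumes the simplest possible setting: a linear outcome model, no exposure–mediator interaction, and no confounding, which is the setting where the difference in exposure coefficients does recover the indirect effect. The data are simulated, and the variable names (`ses`, `work_env`, `absence`) and effect sizes are hypothetical choices made for illustration; they are not taken from any of the studies cited here, and the sketch does not reproduce the Cox or Aalen analyses discussed above.

```python
import numpy as np


def ols(X, y):
    """Least-squares coefficients for y ~ X (X must already contain an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def difference_method(exposure, mediator, outcome):
    """Traditional mediation analysis: compare the exposure coefficient
    in models without and with the mediator."""
    ones = np.ones(len(outcome))
    # Model without the mediator: the exposure coefficient is the total effect.
    total = ols(np.column_stack([ones, exposure]), outcome)[1]
    # Model with the mediator: the exposure coefficient is the direct effect.
    direct = ols(np.column_stack([ones, exposure, mediator]), outcome)[1]
    indirect = total - direct
    return {
        "total": total,
        "direct": direct,
        "indirect": indirect,
        "proportion_mediated": indirect / total,
    }


# Simulated data (all values hypothetical).
rng = np.random.default_rng(0)
n = 20_000
ses = rng.binomial(1, 0.5, n).astype(float)            # 1 = low SES (exposure)
work_env = 0.8 * ses + rng.normal(size=n)              # mediator depends on exposure
absence = 0.5 * ses + 0.6 * work_env + rng.normal(size=n)  # outcome

result = difference_method(ses, work_env, absence)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this simulation the true direct effect is 0.5 and the true indirect effect is 0.8 × 0.6 = 0.48, so the estimates should land close to those values. The agreement depends entirely on the simplifying assumptions: with time-dependent confounders or with hazard ratios from Cox models, the same subtraction no longer has this interpretation, which is the concern raised above.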

High-quality event data (1, 11, 24) from registers and other sources have become increasingly available. Counterfactual frameworks could be a valuable basis for analytical progress and increased causal understanding of social and biologic processes and, thereby, for correct interventions. These causal methods have limitations; for example, they depend on the assumption of no unmeasured confounding of the exposure or the mediator (19, 20). This problem should be minimized by careful collection of covariates that could act as confounders of the exposure–outcome or mediator–outcome relationships (19), as well as by sensitivity analyses (25).
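
To make concrete what such assumptions are needed for, the direct and indirect effects can be written in counterfactual notation. The following is a sketch in notation commonly used in the causal mediation literature; the symbols are introduced here for illustration and do not appear in the studies discussed above. It assumes a binary exposure A, a mediator M, and an outcome Y.

```latex
% Notation assumed for illustration:
%   Y(a, m): outcome had the exposure been set to a and the mediator to m
%   M(a):    mediator had the exposure been set to a
\begin{align*}
\text{Total effect (TE)}              &= E[Y(1, M(1))] - E[Y(0, M(0))] \\
\text{Natural direct effect (NDE)}    &= E[Y(1, M(0))] - E[Y(0, M(0))] \\
\text{Natural indirect effect (NIE)}  &= E[Y(1, M(1))] - E[Y(1, M(0))] \\
\text{TE}                             &= \text{NDE} + \text{NIE}
\end{align*}
```

The term E[Y(1, M(0))] refers to a situation that is never observed for any individual: exposed, but with the mediator at the value it would have taken without exposure. Estimating it from observational data is what requires the no-unmeasured-confounding assumptions mentioned above.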

Emerging tools and techniques open new avenues for identification and estimation of causal effects in observational studies. This is one reason why they have become increasingly popular. One example is the 2012 awarding of the Rothman Epidemiology Prize to first author Theis Lange for his article “Direct and indirect effects in a survival context” (26).

Counterfactual causal analysis requires methods that could be challenging to the social or occupational epidemiologist. Does this mean that the floor will be completely taken over by biostatisticians? There is no reason to suspect such a scenario: the study of social or biological processes will depend on asking the right questions (13). Howards et al (25) provide an excellent example of the need for biologic knowledge and reasoning, pursuing the right model for analyzing the relationship between smoking and miscarriage. In this sense, the driving force behind causal inference is still biologic knowledge and hypotheses. Rather than fearing undue dominance by biostatisticians, there should be a call for a more intimate collaboration in all stages of studies applying the new causal methods.

