Causal inference and potential outcome framework

In this blog post, we introduce the basic concepts of causal inference and the potential outcome framework.

1. Causality terminology

  1. Unit: The fundamental notion is that causality is tied to an action (or manipulation, treatment, or intervention) applied to a unit. A unit here can be a physical object, a firm, or a person, at a particular point in time. The same physical object or person at a different time is a different unit. For instance, when you have a headache and you decide to take an aspirin to relieve it, you could also have chosen not to take the aspirin, or you could have chosen to take an alternative medicine. In this framework, articulating with precision the nature and timing of the action sometimes requires a certain amount of imagination. For example, if we define race solely in terms of skin color, the action might be a pill that alters only skin color. Such a pill may not currently exist (but, then, neither did surgical procedures for heart transplants hundreds of years ago), but we can still imagine such an action.
  2. Active treatment vs. Control treatment: Often, one of these actions corresponds to a more active treatment (e.g., taking an aspirin) in contrast to a more passive action (e.g., not taking the aspirin). We refer to the first action as the active treatment and to the second as the control treatment.
  3. Potential Outcome: Given a unit and a set of actions, we associate each action-unit pair with a potential outcome. We refer to these outcomes as potential outcomes because only one will ultimately be realized and therefore possibly observed: the potential outcome corresponding to the action actually taken. The other potential outcomes cannot be observed because the corresponding actions that would lead to them being realized were not taken.
  4. Causal Effect: The causal effect of one action or treatment relative to another involves the comparison of these potential outcomes, one realized and the others not realized and therefore not observable.

Suppose we have a ‘treatment’ variable A with two levels, 1 and 0, and an outcome variable Y with two levels, 1 (death) and 0 (survival). Treatment A has a causal effect on an individual’s outcome Y if the potential outcomes under a = 1 and a = 0 are different. The causal effect of the treatment involves the comparison of these potential outcomes. That is, A has a causal effect on Y if:

Y(a = 1) ≠ Y(a = 0)

For example, consider the case of a single unit, I, at a particular point in time, contemplating whether or not to take an aspirin for my headache. That is, there are two treatment levels, taking an aspirin, and not taking aspirin. There are therefore two potential outcomes, Y(Aspirin) and Y(No Aspirin), one for each level of the treatment.

Table 1 illustrates this situation, assuming the values Y(Aspirin) = No Headache and Y(No Aspirin) = Headache.

  1. A fundamental problem of causal inference: There are two important aspects of the definition of a causal effect. First, the definition of the causal effect depends on the potential outcomes, but it does not depend on which outcome is observed. Specifically, whether I take aspirin (and am therefore unable to observe the state of my headache with no aspirin) or do not take aspirin (and am thus unable to observe the outcome with an aspirin) does not affect the definition of the causal effect. Second, the causal effect is the comparison of potential outcomes, for the same unit, at the same moment in time post-treatment. In particular, the causal effect is not defined in terms of comparisons of outcomes at different times, as in a before-and-after comparison of my headache before and after deciding to take or not to take the aspirin. “The fundamental problem of causal inference” (Holland, 1986, p.947) is therefore the problem that at most one of the potential outcomes can be realized and thus observed. If the action you take is Aspirin, you observe Y(Aspirin) and will never know the value of Y(No Aspirin) because you cannot go back in time.
  2. Causal Estimands / Average Treatment Effect: Consider a population of units, indexed by i = 1, …, N. Each unit in this population can be exposed to one of a set of treatments.
  • Let Ti (or Wi elsewhere) denote the set of treatments to which unit i can be exposed.

Ti = T = {0, 1}

  • For each unit i, and for each treatment in the common set of treatments, there are corresponding potential outcomes Yi(0) and Yi(1).
  • Comparisons of Yi(1) and Yi(0) are unit-level causal effects:

Yi(1) – Yi(0)
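To make the unit-level effect concrete, here is a minimal sketch in Python. The population size, the random binary potential outcomes, and the variable names are all assumptions made for illustration; in real data both potential outcomes are never available for the same unit, so this table can only exist in a simulation.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

N = 6  # hypothetical population size

# Hypothetical binary potential outcomes for each unit i: (Yi(0), Yi(1)).
# In real data we never see both values; here we simulate them on purpose.
potential = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(N)]

# Unit-level causal effect: Yi(1) - Yi(0), one value per unit.
unit_effects = [y1 - y0 for (y0, y1) in potential]

# Average of the unit-level effects over the population.
average_effect = sum(unit_effects) / N

for i, ((y0, y1), effect) in enumerate(zip(potential, unit_effects), start=1):
    print(f"unit {i}: Yi(0)={y0}, Yi(1)={y1}, effect={effect}")
print("average effect:", average_effect)
```

Each unit-level effect here is -1, 0, or 1, and the average is simply their mean over the N units.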

2. Potential Outcomes Framework

2.1       Introduction

The potential outcome framework, formalized for randomized experiments by Neyman (1923) and developed for observational settings by Rubin (1974), defines such potential outcomes for all individuals, only some of which are subsequently observed. This framework dominates applications in epidemiology, medical statistics, and economics, stating the conditions under which causal effects can be estimated in rigorous mathematical language.

The potential outcomes approach was designed to quantify the magnitude of the causal effect of a factor on an outcome, NOT to determine whether it is a cause or not. Its goal is to estimate the effects of causes, not the causes of an effect. Quantitative counterfactual inference helps us predict what would happen under different circumstances, but it is agnostic about what is or is not a cause.

2.2       Counterfactual

A potential outcome is the value of the outcome under a given level of treatment. Suppose we have a ‘treatment’ variable X with two levels, 1 (treat) and 0 (not treat), and an outcome variable Y with two levels, 1 (death) and 0 (survival). If we expose a subject, we observe Y1 but we do not observe Y0. Indeed, Y0 is the value we would have observed if the subject had not been exposed. The unobserved variable is called a counterfactual. The variables (Y0, Y1) are also called potential outcomes. We have enlarged our set of variables from (X, Y) to (X, Y, Y0, Y1). A small dataset might look like this:

[Table: example dataset with columns X, Y, Y0, Y1, with unobserved values marked by asterisks]

The asterisks indicate unobserved variables. Causal questions involve the distribution p(y0, y1) of the potential outcomes. We can interpret p(y1) as p(y|set X = 1) and we can interpret p(y0) as p(y|set X = 0). For each unit, we can observe at most one of the two potential outcomes, the other is missing (counterfactual).

Causal inference under the potential outcome framework is essentially a missing data problem. Suppose now that X is a binary variable that represents some exposure. So X = 1 means the subject was exposed and X = 0 means the subject was not exposed. We can address the problem of predicting Y from X by estimating E(Y|X = x). To address causal questions, we introduce counterfactuals. Let Y1 denote the response if the subject is exposed. Let Y0 denote the response if the subject is not exposed. Then

Y = Y1 if X = 1, and Y = Y0 if X = 0. More succinctly, Y = X·Y1 + (1 – X)·Y0.

Potential outcomes and assignments jointly determine the values of the observed and missing outcomes:

Y(obs) = X·Y1 + (1 – X)·Y0

Y(mis) = (1 – X)·Y1 + X·Y0
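As a rough illustration of how exposure and potential outcomes together decide what is seen, the sketch below builds a tiny dataset and prints it with an asterisk for each unobserved value. The four units and their values are invented for this example and are not the data from the original post; the helper names `observed_outcome` and `missing_outcome` are likewise hypothetical.

```python
# Hypothetical units: exposure X and both potential outcomes (Y0, Y1).
units = [
    {"X": 1, "Y0": 0, "Y1": 1},
    {"X": 1, "Y0": 1, "Y1": 1},
    {"X": 0, "Y0": 0, "Y1": 1},
    {"X": 0, "Y0": 1, "Y1": 0},
]


def observed_outcome(x, y0, y1):
    """Outcome that is realized: Y1 if exposed, Y0 otherwise."""
    return x * y1 + (1 - x) * y0


def missing_outcome(x, y0, y1):
    """Counterfactual outcome, which is never observed."""
    return (1 - x) * y1 + x * y0


rows = []
for u in units:
    x, y0, y1 = u["X"], u["Y0"], u["Y1"]
    y = observed_outcome(x, y0, y1)
    # What an analyst actually sees: the unobserved potential outcome is "*".
    shown_y0 = y0 if x == 0 else "*"
    shown_y1 = y1 if x == 1 else "*"
    rows.append((x, y, shown_y0, shown_y1))

print("X  Y  Y0 Y1")
for x, y, s0, s1 in rows:
    print(f"{x}  {y}  {s0}  {s1}")
```

Every row has exactly one asterisk, which is the missing-data view of causal inference: half of the potential outcome table is always hidden.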

It is impossible to observe the counterfactual for a given individual or set of individuals. Instead, evaluators must compare outcomes for two otherwise similar sets of beneficiaries, one exposed to the intervention and one not, with the latter group representing the counterfactual.

2.3       Confounding

In some cases, it is not feasible or ethical to do a randomized experiment, and we must use data from observational (non-randomized) studies. Smoking and lung cancer is an example. Can we estimate causal parameters from observational studies? The answer is: sort of.

In an observational study, the treated and untreated groups will generally not be comparable. Maybe the healthy people chose to take the treatment and the unhealthy people didn’t. In other words, X is not independent of (Y0, Y1). The treatment may have no effect, but we would still see a strong association between Y and X. In other words, the association α = E(Y|X = 1) – E(Y|X = 0) may be large even though the causal effect θ = E(Y1) – E(Y0) is 0.

Here is a simplified example. Suppose X denotes whether someone takes vitamins and Y is some binary health outcome (with Y = 1 meaning “healthy”)

[Table: X, Y, Y0 and Y1 for each person in the vitamin example]

In this example, there are only two types of people: healthy and unhealthy. The healthy people have (Y0, Y1) = (1,1). These people are healthy whether or not they take vitamins. The unhealthy people have (Y0, Y1)= (0,0). These people are unhealthy whether or not they take vitamins.

The observed data are:

[Table: observed X and Y for each person, with the unobserved potential outcome marked by an asterisk]

In this example, θ (causation) = 0 but α (association) = 1. The problem is that people who choose to take vitamins are different from people who choose not to take vitamins. That’s just another way of saying that X is not independent of (Y0, Y1).
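The vitamin example can be checked numerically. The sketch below assumes four healthy people who all take vitamins and four unhealthy people who do not; the group sizes are made up, and only the pattern matters. In the code, `theta` stands for the causal effect E(Y1) – E(Y0) and `alpha` for the association E(Y|X = 1) – E(Y|X = 0).

```python
# Hypothetical population: 4 healthy and 4 unhealthy people (counts are made up).
# Healthy people have (Y0, Y1) = (1, 1); unhealthy people have (Y0, Y1) = (0, 0).
# Assumed self-selection: only the healthy people choose vitamins (X = 1).
people = (
    [{"X": 1, "Y0": 1, "Y1": 1} for _ in range(4)]
    + [{"X": 0, "Y0": 0, "Y1": 0} for _ in range(4)]
)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Observed outcome for each person: Y1 if they took vitamins, Y0 otherwise.
for p in people:
    p["Y"] = p["Y1"] if p["X"] == 1 else p["Y0"]

# Causal effect: computable here only because we wrote down both potential outcomes.
theta = mean(p["Y1"] for p in people) - mean(p["Y0"] for p in people)

# Association: computable from the observed (X, Y) pairs alone.
alpha = mean(p["Y"] for p in people if p["X"] == 1) - mean(
    p["Y"] for p in people if p["X"] == 0
)

print("theta (causation):", theta)    # 0.0
print("alpha (association):", alpha)  # 1.0
```

Vitamins change nobody's outcome, yet the observed data show a perfect association, purely because of who chose to take them.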

To account for the differences in the groups, we can measure confounding variables. These are the variables that affect both X and Y. These variables explain why the two groups of people are different. In other words, these variables account for the dependence between X and Y. By definition, there are no such variables in a randomized experiment. The hope is that if we measure enough confounding variables Z, then perhaps the treated and untreated groups will be comparable, conditional on Z. This means that X is independent of (Y0, Y1) conditional on Z.
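One simple way to use a measured confounder is stratification: compare treated and untreated people within each level of Z, then average the within-stratum differences weighted by stratum size. The simulation below is only a sketch under made-up assumptions (a single binary Z, invented probabilities, and a treatment with no effect at all), not a general recipe.

```python
import random

random.seed(1)


def simulate_person():
    """Return (z, x, y) for one hypothetical person; all probabilities are made up."""
    z = 1 if random.random() < 0.5 else 0          # confounder, e.g. health-conscious
    x = 1 if random.random() < (0.8 if z else 0.2) else 0  # Z affects treatment choice
    y0 = 1 if random.random() < (0.7 if z else 0.3) else 0  # Z affects the outcome
    y1 = y0                                        # the treatment has no causal effect
    y = y1 if x == 1 else y0                       # observed outcome
    return z, x, y


data = [simulate_person() for _ in range(20000)]


def mean_y(rows):
    return sum(y for _, _, y in rows) / len(rows)


def diff_in_means(rows):
    treated = [r for r in rows if r[1] == 1]
    control = [r for r in rows if r[1] == 0]
    return mean_y(treated) - mean_y(control)


# Naive comparison that ignores Z.
naive = diff_in_means(data)

# Stratified comparison: within-stratum differences weighted by stratum size.
adjusted = 0.0
for z_value in (0, 1):
    stratum = [r for r in data if r[0] == z_value]
    adjusted += (len(stratum) / len(data)) * diff_in_means(stratum)

print("naive difference:", round(naive, 3))
print("adjusted difference:", round(adjusted, 3))
```

With these settings the naive difference is clearly positive while the adjusted one is close to zero, which matches the true (null) effect. This only works because Z is, by construction, the sole reason the groups differ.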

2.4       Measuring the Average Causal Effect

The mean treatment effect or mean causal effect is defined by

θ = E(Y1) – E(Y0) = E(Y|set X=1) – E(Y|set X=0)

The parameter θ has the following interpretation: θ is the mean response if we exposed everyone minus the mean response if we exposed no one.

In a randomized experiment, a natural estimator of θ is the difference in means: the sample mean of Y among exposed subjects minus the sample mean of Y among unexposed subjects.
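The following sketch shows the difference-in-means estimator at work in a simulated randomized experiment. The sample size and outcome probabilities are assumptions chosen for illustration; because exposure is assigned by a coin flip, it is independent of the potential outcomes, and the estimate should land close to the true mean causal effect in the simulated population.

```python
import random
from statistics import fmean

random.seed(7)

N = 10000  # hypothetical number of subjects

# Simulated potential outcomes (1 = death, 0 = survival); probabilities are made up.
y0 = [1 if random.random() < 0.5 else 0 for _ in range(N)]
y1 = [1 if random.random() < 0.3 else 0 for _ in range(N)]

# True mean causal effect in this simulated population: E(Y1) - E(Y0).
theta = fmean(y1) - fmean(y0)

# Randomized exposure: a fair coin flip, independent of (Y0, Y1).
x = [random.randint(0, 1) for _ in range(N)]

# Only one potential outcome per subject is observed.
y = [y1[i] if x[i] == 1 else y0[i] for i in range(N)]

# Difference-in-means estimator: mean of Y among exposed minus mean among unexposed.
exposed = [y[i] for i in range(N) if x[i] == 1]
unexposed = [y[i] for i in range(N) if x[i] == 0]
theta_hat = fmean(exposed) - fmean(unexposed)

print("true mean causal effect:", round(theta, 3))
print("difference-in-means estimate:", round(theta_hat, 3))
```

The estimator uses only the observed (X, Y) pairs, while `theta` uses the full potential outcome table that would be unavailable in practice.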


Hernán MA, Robins JM (2020). Causal Inference: What If
Imbens, G., & Rubin, D. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences.
Judea Pearl (2000). Causality: Models, Reasoning and Inference

Introduction to Healthcare Data Science

Introduction to Healthcare Data Science (Overview)

Healthcare analytics is the collection and analysis of data in the healthcare field to study the determinants of disease in human populations and to identify and mitigate risk by predicting outcomes. This post introduces some common epidemiological study designs and gives an overview of the modern healthcare data analytics process.

Types of Epidemiologic Studies

In general, epidemiologic studies can be classified into three types: interventional studies, observational studies, and meta-analyses.

Interventional study/ Randomized controlled trial

Clinical medicine relies on an evidence base built on strong research to inform best practices and improve clinical care. The gold standard study design for providing such evidence is the randomized controlled trial (RCT). The main idea of this kind of research is to establish the root cause of a disease or the causal effect of a treatment. RCTs are performed in fairly homogeneous patient populations, in which participants are allocated by chance to two similar groups. The researchers then apply different interventions or treatments to the two groups and compare the outcomes.

As an example, a study was conducted to assess whether improved lifestyle habits could reduce the hemoglobin A1c (HbA1c) levels of employees. In the experiment, the intervention consisted of a 3-month competition among employees to adopt healthier lifestyle habits (Eat better, Move more, and Quit smoking) or keep their current lifestyle. After the intervention, employees with elevated HbA1c had significantly reduced HbA1c levels, while HbA1c levels did not change among employees without elevated HbA1c or among employees without the intervention.

Under ideal conditions, there are no confounding variables in a randomized experiment, so RCTs are often designed to investigate the causal relationship between exposure and outcome. However, RCTs have several limitations: they are often costly, time- and labor-intensive, and slow, and they can enroll homogeneous patients, so their results seldom generalize to every patient population.

Observational studies

Unlike RCTs, observational studies have no active interventions, which means the researchers do not interfere with their participants. In contrast with interventional studies, observational studies are usually performed in heterogeneous patient populations. In these studies, researchers often define an outcome of interest (e.g., a disease) and use data collected on patients, such as demographics, labs, vital signs, and disease states, to explore the relationship between exposures and the outcome, determine which factors contributed to the outcome, and attempt to draw inferences about the effects of different exposures on the outcome. Findings from observational studies can subsequently be developed and tested with RCTs in targeted patient populations.

Observational studies tend to be less time- and cost-intensive than RCTs. There are three main study designs in observational studies: prospective, retrospective, and cross-sectional.

Follow-up study/ Prospective study/ Longitudinal (incidence) study

A prospective study is a study in which a group of disease-free individuals is identified at baseline and followed over time until some of them develop the disease. The development of disease over time is then related to other variables measured at baseline, generally called exposure variables. The study population in a prospective study is often called a cohort.

Retrospective study/ Case-Control study

A retrospective study is a study in which two groups of individuals are initially identified: (1) a group that has the disease under study (the cases) and (2) a group that does not have the disease under study (the controls). Usually, a retrospective history of health habits before the onset of the disease is obtained. An attempt is then made to relate prior health habits to current disease status. This type of study is also sometimes called a case-control study.

Cross-sectional study/ Prevalence study

A cross-sectional study is one in which a study population is ascertained at a single point in time. This type of study is sometimes called a prevalence study because the prevalence of disease at one point in time is compared between exposed and unexposed individuals. The prevalence of a disease is obtained by dividing the number of people who currently have the disease by the number of people in the study population.
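As a small worked example of the prevalence calculation, the sketch below compares prevalence between exposed and unexposed groups. The counts and the function name `prevalence` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def prevalence(n_with_disease, n_in_population):
    """Number of people who currently have the disease divided by population size."""
    if n_in_population <= 0:
        raise ValueError("study population must contain at least one person")
    if not 0 <= n_with_disease <= n_in_population:
        raise ValueError("cases must be between 0 and the population size")
    return n_with_disease / n_in_population


# Hypothetical cross-sectional counts at a single point in time.
prev_exposed = prevalence(30, 200)    # 30 of 200 exposed people have the disease
prev_unexposed = prevalence(20, 400)  # 20 of 400 unexposed people have the disease

print(f"prevalence among exposed:   {prev_exposed:.1%}")
print(f"prevalence among unexposed: {prev_unexposed:.1%}")
print(f"prevalence difference:      {prev_exposed - prev_unexposed:.1%}")
```

In this made-up example the prevalence is 15% among the exposed and 5% among the unexposed. Because everything is measured at one point in time, such a comparison describes association only.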

Meta-analysis

Often more than one investigation is performed to study a particular research question, with some research groups reporting significant differences for a particular finding and other research groups reporting no significant differences. Therefore, in a meta-analysis, researchers collect and synthesize findings from many existing studies to provide a clearer picture of the factors associated with the development of a certain disease. These results may be used for ranking and prioritizing risk factors in further research.

Modern Healthcare Data analytics approach

Secondary Analysis and modern healthcare data analytics approach

In primary research, designing a large-scale randomized controlled trial (RCT) is expensive and sometimes infeasible. An alternative approach that gives access to extensive data is to use electronic health records (EHRs). In contrast with primary analysis, secondary analysis performs retrospective research using data collected for purposes other than research, such as EHRs. Modern healthcare data analytics projects apply advanced data analysis methods, such as machine learning, and perform integrative analysis to leverage the wealth of deep clinical and administrative data with longitudinal history in EHRs, in order to get a more comprehensive understanding of the patient’s condition.

Electronic Health Record (EHR)

EHRs are data generated during routine patient care. They contain large amounts of longitudinal data and a wealth of detailed clinical information. Thus, the data, if properly analyzed and meaningfully interpreted, could vastly improve our conception and development of best practices. Common data in EHRs include the following:

  • Demographics

    Age, gender, occupation, marital status, ethnicity

  • Physical measurement

    SBP, DBP, Height, Weight, BMI, waist circumference

  • Anthropometry

    Stature, sitting height, elbow width, weight, subscapular, triceps skinfold measurement

  • Laboratory

    Creatinine, hemoglobin, white blood cell count (WBC), total cholesterol, cholesterol, triglyceride, gamma-glutamyl transferase (GGT)

  • Symptoms

    frequency in urination, skin rash, stomachache, cough

  • Medical history and Family diseases

    diabetes, traumas, dyslipidemia, hypertension, cancer, heart disease, stroke, arthritis, etc.

  • Lifestyle habit

    Behavioral risk factors from questionnaires, such as physical activity, dietary and nutritional habits, smoking, alcohol consumption, sleep, cognitive function, work history, and digestive health

  • Treatment

    Medications (prescriptions, dose, timing), procedures, etc.

Using EHR to Conduct Outcome and Health Services Research

In secondary analysis, the process of analyzing data often includes the following steps:

  1. Problem Understanding and Formulating the Research Question: In this step, a clinical question is transformed into a research question. There are 3 key components of the research question: the study sample (or patient cohort), the exposure of interest (e.g., patient demographics, lifestyle habits, medical history, regular health checkup test results), and the outcome of interest (e.g., whether a patient has diabetes after 5 years).
  2. Data Preparation and Integration: Extracted raw data can come from different data sources, or sit in separate datasets with different representations and formats. Data preparation and integration is the process of combining and reorganizing data derived from various data sources (such as databases, flat files, etc.) into a consistent dataset that contains all the information required for the desired statistical analysis.
  3. Exploratory Data Analysis/ Data Understanding: Before statistical and machine learning models are employed, exploring the data is an important step for understanding the type of information that has been collected and what it means. Data exploration consists of investigating the distributions of variables and the patterns in the data, and checking the quality of the underlying data. This preliminary examination will influence which methods are most suitable for the data preprocessing step and the choice of an appropriate predictive model.
  4. Data Preprocessing: Data preprocessing is one of the most important steps and is critical to the success of machine learning techniques. Electronic health records (EHRs) are often collected for clinical purposes, so these databases can have many data quality issues. Preprocessing aims at assessing and improving the quality of data to allow for reliable statistical analysis.
  5. Feature Selection: The final dataset may have several hundred data fields, and not all of them are relevant for explaining the target variable. In many machine learning algorithms, high dimensionality can cause overfitting or reduce the accuracy of the model instead of improving it. Feature selection algorithms are used to identify features that have an important predictive role. These techniques do not change the content of the initial feature set; they only select a subset of it. The purpose of feature selection is to help create optimized, cost-effective models with better prediction performance.
  6. Predictive Model: Statistical models and machine learning algorithms can be employed to develop prediction models. The purpose of machine learning is to design and develop prediction models by allowing the computer to learn from data or experience to solve a certain problem. These models are useful for understanding the system under study. They can be divided according to the type of outcome that they produce into classification, regression, and clustering models.
  7. Prediction and Model Evaluation: This step evaluates the performance of the predictive models. The evaluation should include internal and external validation. Internal validation refers to evaluating model performance in the same dataset in which the model was developed. External validation is the evaluation of a prediction model in other populations with different characteristics to assess the generalizability of the model.
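To give a feel for how some of these steps fit together, here is a deliberately tiny sketch using only the Python standard library. Everything in it is an assumption made for illustration: the two simulated data sources, the made-up relationship between age, HbA1c and diabetes, the median imputation, and the one-feature threshold "model". A real project would use proper EHR extracts and dedicated statistical or machine learning tooling.

```python
import random
import statistics

random.seed(42)

# Step 2, data integration: combine two hypothetical sources by patient ID.
demographics, labs, outcomes = {}, {}, {}
for pid in range(1, 301):
    age = random.randint(30, 70)
    hba1c = 4.5 + 0.03 * age + random.gauss(0, 0.5)  # made-up relationship
    demographics[pid] = {"age": age}
    # About 10% of lab values are missing, as can happen in routine care data.
    labs[pid] = {"hba1c": None if random.random() < 0.1 else hba1c}
    # Made-up outcome: diabetes within 5 years is more likely at higher HbA1c.
    outcomes[pid] = 1 if hba1c + random.gauss(0, 0.3) > 6.2 else 0

records = [
    {"id": pid, **demographics[pid], **labs[pid], "diabetes": outcomes[pid]}
    for pid in demographics
]

# Step 7 setup: hold out part of the same dataset for internal validation.
random.shuffle(records)
split = int(0.7 * len(records))
train, test = records[:split], records[split:]

# Step 4, preprocessing: impute missing HbA1c with the median of the training data.
median_hba1c = statistics.median(
    r["hba1c"] for r in train if r["hba1c"] is not None
)
for r in records:
    if r["hba1c"] is None:
        r["hba1c"] = median_hba1c


def accuracy(rows, threshold):
    """Share of rows where the rule 'HbA1c >= threshold' matches the outcome."""
    correct = sum((r["hba1c"] >= threshold) == bool(r["diabetes"]) for r in rows)
    return correct / len(rows)


# Step 6, predictive model: a one-feature threshold classifier fitted on train.
candidates = [5.0 + 0.1 * k for k in range(21)]  # thresholds from 5.0 to 7.0
best_threshold = max(candidates, key=lambda t: accuracy(train, t))

# Step 7, evaluation: performance on the held-out part of the same dataset.
print("chosen threshold:", round(best_threshold, 1))
print("train accuracy:", round(accuracy(train, best_threshold), 3))
print("held-out accuracy:", round(accuracy(test, best_threshold), 3))
```

The imputation value is computed from the training part only, so that nothing from the held-out records leaks into model development. External validation would repeat the last step on data from a different population, which this sketch does not attempt.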


[1] Fundamentals of Biostatistics – Bernard Rosner, Harvard University

[2] Secondary Analysis of Electronic Health Records – Springer Open
