Introduction to Healthcare Data Science (Overview)
Healthcare analytics is the collection and analysis of data in the healthcare field to study the determinants of disease in human populations and to identify and mitigate risk by predicting outcomes. This post introduces some common epidemiological study designs and gives an overview of the modern healthcare data analytics process.
Types of Epidemiologic Studies
In general, epidemiologic studies can be classified into three types: interventional studies, observational studies, and meta-analysis studies.
Interventional or Randomized controlled study
Clinical medicine relies on an evidence base built from strong research to inform best practices and improve clinical care. The gold standard study design for providing this evidence is the randomized controlled trial (RCT). The main aim of this kind of research is to establish the cause of a disease or the causal effect of a treatment. RCTs are performed in fairly homogeneous patient populations, in which participants are allocated by chance to two similar groups. The researchers then apply different interventions or treatments to the two groups and compare the outcomes.
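To make allocation by chance concrete, here is a minimal sketch of simple randomization into two arms. The function name `randomize`, the participant ids, and the fixed seed are illustrative assumptions; real trials often use more elaborate schemes.

```python
import random


def randomize(participant_ids, seed=None):
    """Allocate participants by chance to two arms of (nearly) equal size."""
    rng = random.Random(seed)          # a seed makes the allocation reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)              # random order, so arm membership is random
    half = len(shuffled) // 2
    return {
        "intervention": sorted(shuffled[:half]),
        "control": sorted(shuffled[half:]),
    }


# Hypothetical example: 20 participants with ids 1..20
arms = randomize(range(1, 21), seed=42)
print(arms["intervention"])
print(arms["control"])
```

Because membership in each arm depends only on the shuffle, measured and unmeasured characteristics tend to be balanced between the two groups as the sample grows.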
As an example, a study was conducted to assess whether improved lifestyle habits could reduce the hemoglobin A1c (HbA1c) levels of employees. In the experiment, employees either took part in a 3-month competition to adopt healthier lifestyle habits (eat better, move more, and quit smoking) or kept their current lifestyle. After the intervention, employees with elevated HbA1c had significantly reduced their HbA1c levels, while the HbA1c levels of employees without the intervention were unchanged.
In ideal conditions, a randomized experiment has no confounding variables, so RCTs are often designed to investigate the causal relationship between exposure and outcome. However, RCTs have several limitations: they are often costly, slow, and labor-intensive, and their homogeneous patient samples seldom generalize to every patient population.
Observational study
Unlike RCTs, observational studies have no active intervention, which means the researchers do not interfere with their participants. In contrast with interventional studies, observational studies are usually performed in heterogeneous patient populations. In these studies, researchers often define an outcome of interest (e.g., a disease) and use data collected on patients, such as demographics, labs, vital signs, and disease states, to explore the relationship between exposures and the outcome, determine which factors contributed to the outcome, and draw inferences about the effects of different exposures on it. Findings from observational studies can subsequently be developed and tested in RCTs in targeted patient populations.
Observational studies tend to be less time- and cost-intensive than RCTs. There are three main observational study designs: prospective, retrospective, and cross-sectional.
Follow-up study/ Prospective study/ Longitudinal (incidence) study
A prospective study is a study in which a group of disease-free individuals is identified at baseline and followed over a period of time until some of them develop the disease. The development of disease over time is then related to other variables measured at baseline, generally called exposure variables. The study population in a prospective study is often called a cohort.
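One common way to relate disease development to a baseline exposure in a cohort is the relative risk, the ratio of cumulative incidence in exposed versus unexposed individuals. The sketch below assumes a single binary exposure; the counts are hypothetical and the function names are illustrative.

```python
def cumulative_incidence(new_cases, at_risk):
    """Proportion of initially disease-free people who develop the disease."""
    return new_cases / at_risk


def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Ratio of cumulative incidence in the exposed to that in the unexposed."""
    return (
        cumulative_incidence(exposed_cases, exposed_total)
        / cumulative_incidence(unexposed_cases, unexposed_total)
    )


# Hypothetical cohort: 30 of 200 exposed and 15 of 300 unexposed develop the disease
rr = relative_risk(30, 200, 15, 300)   # 0.15 / 0.05, i.e. about 3
print(round(rr, 2))
```

A relative risk above 1 suggests the disease developed more often among exposed individuals, although in an observational cohort this alone does not establish causation.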
Retrospective study/ Case-Control study
A retrospective study is a study in which two groups of individuals are initially identified: (1) a group that has the disease under study (the cases) and (2) a group that does not have the disease under study (the controls). Usually, a retrospective history of health habits before the onset of the disease is obtained. An attempt is then made to relate these prior health habits to current disease status. This type of study is also sometimes called a case-control study.
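Because the investigator chooses how many cases and controls to include, disease risk cannot be estimated directly from a case-control study; the association between a prior habit and disease status is usually summarized with an odds ratio instead. This is a minimal sketch with hypothetical counts and an illustrative function name.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_in_cases = exposed_cases / unexposed_cases
    odds_in_controls = exposed_controls / unexposed_controls
    return odds_in_cases / odds_in_controls


# Hypothetical study: 100 cases (40 exposed) and 100 controls (20 exposed)
or_estimate = odds_ratio(
    exposed_cases=40, unexposed_cases=60,
    exposed_controls=20, unexposed_controls=80,
)
print(round(or_estimate, 2))   # (40 * 80) / (60 * 20), roughly 2.67
```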
Cross-sectional study/ Prevalence study
A cross-sectional study is one in which a study population is ascertained at a single point in time. This type of study is sometimes called a prevalence study because the prevalence of disease at one point in time is compared between exposed and unexposed individuals. The prevalence of a disease is obtained by dividing the number of people who currently have the disease by the number of people in the study population.
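The prevalence calculation and the exposed-versus-unexposed comparison can be sketched as follows. The survey data, field names, and the use of a prevalence ratio as the comparison are assumptions made for illustration.

```python
def prevalence(people):
    """Number of people who currently have the disease divided by group size."""
    return sum(person["has_disease"] for person in people) / len(people)


# Hypothetical survey taken at a single point in time (200 people)
population = (
    [{"exposed": True, "has_disease": True}] * 10
    + [{"exposed": True, "has_disease": False}] * 40
    + [{"exposed": False, "has_disease": True}] * 15
    + [{"exposed": False, "has_disease": False}] * 135
)

exposed = [p for p in population if p["exposed"]]
unexposed = [p for p in population if not p["exposed"]]

overall = prevalence(population)               # 25 / 200
prevalence_exposed = prevalence(exposed)       # 10 / 50
prevalence_unexposed = prevalence(unexposed)   # 15 / 150
prevalence_ratio = prevalence_exposed / prevalence_unexposed

print(overall, prevalence_exposed, prevalence_unexposed)
```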
Meta-analysis study
Often more than one investigation is performed to study a particular research question, with some research groups reporting significant differences for a particular finding and others reporting none. In a meta-analysis, researchers therefore collect and synthesize findings from many existing studies to provide a clearer picture of the factors associated with the development of a certain disease. These results may be used to rank and prioritize risk factors in other research.
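The post does not name a pooling method, so the sketch below uses one common choice as an assumption: a fixed-effect, inverse-variance weighted average, in which more precise studies receive more weight. The study estimates and standard errors are hypothetical.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted average of study estimates (fixed-effect model)."""
    weights = [1 / se ** 2 for se in standard_errors]   # precise studies weigh more
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical log odds ratios and standard errors from three studies
log_odds_ratios = [0.20, 0.50, 0.35]
standard_errors = [0.10, 0.20, 0.10]

pooled, pooled_se = pool_fixed_effect(log_odds_ratios, standard_errors)
ci_low = pooled - 1.96 * pooled_se    # approximate 95% confidence interval
ci_high = pooled + 1.96 * pooled_se
print(round(pooled, 3), round(ci_low, 3), round(ci_high, 3))
```

When the studies differ substantially in design or population, a random-effects model is often preferred over this fixed-effect version.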
Modern Healthcare Data analytics approach
Secondary Analysis and modern healthcare data analytics approach
In primary research, designing a large-scale randomized controlled trial (RCT) is expensive and sometimes infeasible. An alternative way to obtain extensive data is to use electronic health records (EHRs). In contrast with primary analysis, secondary analysis performs retrospective research using data that were collected for purposes other than research, such as EHR data. Modern healthcare data analytics projects apply advanced data analysis methods, such as machine learning, and perform integrative analysis to leverage the deep clinical and administrative data and longitudinal history in EHRs, gaining a more comprehensive understanding of the patient’s condition.
Electronic Health Record (EHR)
EHRs contain data generated during routine patient care. They hold large amounts of longitudinal data and a wealth of detailed clinical information. If properly analyzed and meaningfully interpreted, these data could vastly improve our conception and development of best practices. Common types of data in EHRs include the following:
- Demographics: age, gender, occupation, marital status, ethnicity
- Vital signs and body measurements: systolic and diastolic blood pressure (SBP, DBP), height, weight, BMI, waist circumference
- Anthropometric measurements: stature, sitting height, elbow width, weight, subscapular and triceps skinfold measurements
- Laboratory tests: creatinine, hemoglobin, white blood cell count (WBC), total cholesterol, triglyceride, gamma-glutamyl transferase (GGT)
- Symptoms: frequent urination, skin rash, stomachache, cough
- Medical history and family diseases: diabetes, traumas, dyslipidemia, hypertension, cancer, heart diseases, stroke, arthritis, etc.
- Behavioral risk factors and other questionnaire data: physical activity, dietary and nutritional habits, smoking, alcohol consumption, sleep, cognitive function, work history, digestive health, etc.
- Medications (prescriptions, dose, timing), procedures, etc.
Using EHR to Conduct Outcome and Health Services Research
In secondary analysis, the data analysis process often includes the following steps:
- Problem Understanding and Formulating the Research Question: In this step, a clinical question is transformed into a research question. A research question has 3 key components: the study sample (or patient cohort), the exposure of interest (e.g., patient demographics, lifestyle habits, medical history, regular health checkup results), and the outcome of interest (e.g., whether or not a patient has diabetes after 5 years).
- Data Preparation and Integration: Extracted raw data can come from different data sources, or sit in separate datasets with different representations and formats. Data preparation and integration is the process of combining and reorganizing data from these sources (such as databases and flat files) into a consistent dataset that contains all the information required for the desired statistical analysis.
- Exploratory Data Analysis/ Data Understanding: Before statistical and machine learning models are employed, the data must be explored to understand what type of information has been collected and what it means. Data exploration consists of investigating the distributions of variables and the patterns in the data, and checking the quality of the underlying data. This preliminary examination influences which methods are most suitable for the data preprocessing step and which predictive model is appropriate.
- Data Preprocessing: Data preprocessing is one of the most important steps and is critical to the success of machine learning techniques. EHR data are usually collected for clinical purposes rather than for research, so these databases can have many data quality issues. Preprocessing aims to assess and improve the quality of the data to allow for reliable statistical analysis.
- Feature Selection: The final dataset may have several hundred data fields, and not all of them are relevant for explaining the target variable. In many machine learning algorithms, high dimensionality can cause overfitting or reduce the accuracy of the model instead of improving it. Feature selection algorithms are used to identify the features that have an important predictive role. These techniques do not change the content of the initial feature set; they only select a subset of it. The purpose of feature selection is to help create optimized, cost-effective models with better prediction performance.
- Predictive Model: Statistical models and machine learning algorithms can be employed to develop prediction models. The purpose of machine learning is to design and develop prediction models by allowing the computer to learn from data or experience to solve a certain problem. These models are also useful for understanding the system under study. They can be divided according to the type of outcome they produce into classification, regression, and clustering models.
- Prediction and Model Evaluation: This step evaluates the performance of the predictive models. The evaluation should include internal and external validation. Internal validation refers to evaluating model performance in the same dataset in which the model was developed. External validation is the evaluation of a prediction model in other populations with different characteristics to assess the generalizability of the model.
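A few of the steps above can be sketched end to end on toy data. Everything in this example is an assumption made for illustration: the patient records, the field names, median imputation as the preprocessing choice, and a fixed HbA1c cutoff standing in for a trained model. It uses only the Python standard library.

```python
from statistics import median

# Hypothetical extracts from two EHR sources, keyed by patient id
demographics = {
    1: {"age": 54}, 2: {"age": 61}, 3: {"age": 47},
    4: {"age": 70}, 5: {"age": 39}, 6: {"age": 58},
}
labs = {
    1: {"hba1c": 6.9}, 2: {"hba1c": None}, 3: {"hba1c": 5.4},
    4: {"hba1c": 7.8}, 5: {"hba1c": 5.1}, 6: {"hba1c": 6.6},
}
# Outcome of interest: 1 = has diabetes at follow-up, 0 = does not
outcome = {1: 1, 2: 1, 3: 0, 4: 1, 5: 0, 6: 0}


def integrate(*sources):
    """Data integration: inner join of all sources on patient id."""
    shared_ids = set.intersection(*(set(source) for source in sources))
    return {
        pid: {key: value for source in sources for key, value in source[pid].items()}
        for pid in sorted(shared_ids)
    }


def impute_median(rows, field):
    """Preprocessing: fill missing values of `field` with the observed median."""
    observed = [row[field] for row in rows.values() if row[field] is not None]
    fill_value = median(observed)
    for row in rows.values():
        if row[field] is None:
            row[field] = fill_value
    return fill_value


def predict(row, cutoff=6.5):
    """Stand-in for a trained model: flag HbA1c at or above a hypothetical cutoff."""
    return 1 if row["hba1c"] >= cutoff else 0


def evaluate(rows, labels):
    """Model evaluation: sensitivity and specificity against known outcomes."""
    tp = fp = tn = fn = 0
    for pid, row in rows.items():
        predicted, actual = predict(row), labels[pid]
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp)}


dataset = integrate(demographics, labs)
fill_value = impute_median(dataset, "hba1c")
metrics = evaluate(dataset, outcome)
print(fill_value, metrics)
```

Evaluating on the same records used to set up the model corresponds to internal validation. External validation would call `evaluate` on records from a different population, and a real project would typically replace the fixed cutoff with a model fitted to the data.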