Introduction to Healthcare Data Science

Introduction to Healthcare Data Science (Overview)

Healthcare analytics is the collection and analysis of data in the healthcare field in order to study determinants of disease in human populations, identify and mitigate risk by predicting outcomes. This post introduces  some common epidemiological study designs and overview of modern healthcare data analytics process.

Types of Epidemiologic Studies

In general, epidemiologic studies can be classified into 3 types: Interventional study, Observational study, Meta-analysis study

Interventional or Randomized control study

Clinical medicine relies on evidence base found in strong research to inform best practices and improve clinical care. The gold-standard for study design to provide evidence is Randomized control (RCT). The main idea of this kind of research is to prove the root cause that makes a certain disease happen or causal effect of a treatment. RCTs are performed in fairly homogeneous patient populations when participants are allocated by chance to two similar groups. The researchers then try with different interventions or treatments to these two groups and compare the outcomes.

As an example, a study was conducted to assess whether improved lifestyle habits could reduce hemoglobin A1c (HbA1c) levels of employees. In the experiment, the intervention consisted of a 3-month competition among employees to adopt  healthier lifestyle habits (Eat better, Move more, and Quit smoking) or keep their current lifestyle. After the intervention, employees with elevated HbA1c significantly reduced their HbA1c levels while employees without elevated HbA1c levels of employees without intervention was not changed.

In ideal conditions, there are no confounding variables in a randomized experiment, thus RCTs are often designed to investigate causal relationship between exposure and outcome. However, RCTs have several limitations, RCTs are often costly, time-intensive, labor-intensive, slow, and can consist of homogeneous patients that are seldom generalizable to every patient population.

Observational studies

Unlike RCTs, Observational studies have no active interventions which means the researchers do not interfere with their participants. In contrast with interventional studies, observations studies are usually performed in heterogeneous patient populations. In these studies, researchers often define an outcome of interest (e.g a disease) and use data collected on patients such as demographic, labs, vital signs, and disease states to explore the relationship between exposures and outcome, determine which factors contributed to the outcome and attempt to draw inferences about the effects of different exposures on the outcome. Findings from observational studies can subsequently be developed and tested with the use of RCTs in targeted patient populations.

Observational studies tend to be less time- and cost- intensive. There are three main study designs in observational studies: prospective study design, retrospective study design, and cross-sectional study design.

Follow-up study/ Prospective study/ Longitudinal (incidence) study

A prospective study is a study in which a group of disease-free individuals is identified as baseline and are followed over a period of time until some of them develop the disease. The development of disease over time is then related to other variables measured at baseline, generally called exposure variables. The study population in a prospective study is often called a cohort.

Retrospective study/ Case-Control study

A retrospective study is a study in which two groups of individuals are initially identified: (1) a group that has the disease under study (the cases) and (2) a group that does not have the disease under study (the controls). Cases are individuals who have a specific disease investigated in the research. Controls are those who did not have disease of interest in the research. Usually, a retrospective history of health habits prior to getting disease is obtained. An attempt is then made to relate their prior health habits to their current disease status. This type of study is also sometimes called a case-control study.

Cross-sectional (Prevalence) study/ Prevalence study

A cross-sectional study is one in which a study population is ascertained at a single point in time. This type of study is sometimes called a prevalence study because the prevalence of disease at one point in time is compared between exposed and unexposed individuals. . Prevalence of a disease is obtained by dividing the number of people who currently have the disease by the number of people in the study population.

Meta-data analysis

Often more than one investigation is performed to study a particular research question, by different research groups reporting significant differences for a particular finding and other research groups reporting no significant differences. Therefore, in meta-analysis researchers collect and synthesize findings from many existing studies and provides a clear picture of factors associated with development of a certain disease. These results may be utilized for ranking and prioritizing risk factors in other researches.

Modern Healthcare Data analytics approach

Secondary Analysis and modern healthcare data analytics approach

In primary research infrastructure, designing a large-scale randomized controlled trial (RCTs) is expensive and sometimes unfeasible.  The alternative approach for expansive data is to utilize electronic health records (EHR). In contrast with the primary analysis, secondary analysis performs retrospective research using data collected for purposes other than research such as Electronic Health Record (EHR). Modern healthcare data analytic projects apply advanced data analysis methods, such as machine learning and perform integrative analysis in order to leverage a wealth of deep clinical and administrative data with longitudinal history from EHR to get a more comprehensive understanding of the patient condition.

Electronic Health Record (EHR)

EHRs are data generated during routine patient care. Electronic health records contain large amounts of longitudinal data and a wealth of detailed clinical information. Thus, the data, if properly analyzed and meaningfully interpreted, could vastly improve our conception and development of best practices. Common data in EHR are listed as the following:

  • Demographics

    Age, gender, occupation, marital status, ethnicity

  • Physical measurement

    SBP, DBP, Height, Weight, BMI, waist circumference

  • Anthropometry

    Stature, sitting height, elbow width, weight, subscapular, triceps skinfold measurement

  • Laboratory

    Creatinine, hemoglobin, white blood cell count (WBC), total cholesterol, cholesterol, triglyceride, gamma-glutamyl transferase (GGT)

  • Symptoms

    frequency in urination, skin rash, stomachache, cough

  • Medical history and Family diseases

    diabetes, traumas, dyslipidemia, hypertension, cancer, heart diseases, stroke, diabetes, arthritis, etc

  • Lifestyle habit

    Behavior risk factors from Questionnaires such as Physical activity, dietary habit, smoking, drinking alcohol, sleeping, diet, nutritional habits, cognitive function, work history and digestive health, etc

  • Treatment

    Medications (prescriptions, dose, timing), procedures, etc.

Using EHR to Conduct Outcome and Health Services Research

In the secondary analysis, the process of analyzing data often includes steps:

  1. Problem Understanding and Formulating the Research Question: In this step, the process of transforming a clinical question into research is defined. There are 3 key components of research question: the study sample (or patient cohort), the exposure of interest (e.g., information about patient demographic, lifestyle habit, medical history, regular health checkup test result) and the outcome of interest (e.g., a patient has diabetes or not after 5 years):
  2. Data Preparation and Integration: Extracted raw data can be collected from different data sources, or be in separate datasets with different representation and formats. Data Preparation and Integration is the process of combining and reorganizing data derived from various data sources (such as databases, flat files, etc.) into a consistent dataset that contains all the information required for desired statistical analysis
  3. Exploratory Data Analysis/ Data Understanding: Before statistics and machine learning models are employed, there is an important step of exploring data which is important for understanding the type of information that has been collected and what they mean. Data Exploration consists of investigating the distribution of variables, patterns and nature inside the data and checking the quality of the underlying data. This preliminary examination will influence which methods will be most suitable for data preprocessing step and choosing the appropriate predictive model
  4. Data Preprocessing: Data preprocessing is one of the most important steps and critical in the success of machine learning techniques. Electronic health records (EHR) often were collected for clinical purposes. Therefore, these databases can have many data quality issues. Preprocessing aims at assessing and improving the quality of data to allow for reliable statistical analysis.
  5. Feature Selection: Since the final dataset may have several hundreds of data fields, and not all of them are relevant to explain the target variable. In many machine learning algorithms, high-dimensionality can cause overfitting or reduce the accuracy of the model instead of improving it. Features selection algorithms are used to identify features that have important predictive role. These techniques do not change the content of initial features set, only select a subset of them. The purpose of feature selection is help to create optimize and cost-benefit models for enhancing prediction performance.
  6. Predictive Model: To develop prediction models with statistical models and machine learning algorithms could be employed. The purpose of machine learning is to design and develop prediction models, by allowing the computer learn from data or experience to solve a certain problem. These models are useful for understanding the system under study, the models can be divided according to the type of outcome that they produce which includes Classification model, Regression model or Clustering model
  7. Prediction and Model Evaluation: This process is to evaluate the performance of predictive models. The evaluation should include internal and external validation. Internal validation refers to the model performance evaluation in the same dataset in which the model was developed. The external validation is evaluation of prediction model in other populations with different characteristics to assess the generalizability of model.
    Please also check our Healthcare Data Science example


[1] Fundamentals of Biostatistics – Bernard Rosner, Harvard University

[2] Secondary Analysis of Electronic Health Records – Springer Open

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

Healthcare Data Science Project

Please check about Healthcare Data Science about actual Data Science project examples.

Vietnam AI / Data Science Lab

Vietnam AI Lab

Please also visit Vietnam AI Lab


Implementation for Adversarially Constrained Autoencoder Interpolation (ACAI)


Autoencoders provide a powerful framework for learning compressed representations by encoding all of the information needed to reconstruct a data point in a latent code. In some cases, autoencoders can “interpolate”: By decoding the convex combination of the latent codes for two datapoints, the autoencoder can produce an output which semantically mixes characteristics from the datapoints. In this paper, we propose a regularization procedure which encourages interpolated outputs to appear more realistic by fooling a critic network which has been trained to recover the mixing coefficient from interpolated data. We then develop a simple benchmark task where we can quantitatively measure the extent to which various autoencoders can interpolate and show that our regularizer dramatically improves interpolation in this setting. We also demonstrate empirically that our regularizer produces latent codes which are more effective on downstream tasks, suggesting a possible link between interpolation abilities and learning useful representations. – [1]

The idea comes from the paper “Implementation from paper: Understanding and Improving Interpolation in Autoencoders via an Adversarial Regularizer” (, also known as ACAI framework.

Today I will walk through the implementation of this fantastic idea. The implementation is based on tensorflow 2.0 and python 3.6. Let’s start!


First, we need to import some dependency packages.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Input, Dense, Reshape, Flatten, Dropout, multiply, GaussianNoise
from tensorflow.keras.layers import BatchNormalization, Activation, Embedding, ZeroPadding2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import UpSampling2D, Conv2D, Reshape
from tensorflow.keras.layers import Lambda
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import losses
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import mnist
import keras.backend as K

import matplotlib.pyplot as plt

import numpy as np
import tqdm
import os

import io
from PIL
import Image
from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE
import seaborn as sns
import pandas as pd

Next, we define the overall framework of ACAI, which composes of 3 parts: encoder, decoder and discriminator (also called as critic in the paper).

class ACAI():
    def __init__(self, img_shape=(28,28), latent_dim=32, disc_reg_coef=0.2, ae_reg_coef=0.5, dropout=0.0):
        self.latent_dim = latent_dim
        self.ae_optim = Adam(0.0001)
        self.d_optim = Adam(0.0001)
        self.img_shape = img_shape
        self.dropout = dropout
        self.disc_reg_coef = disc_reg_coef
        self.ae_reg_coef = ae_reg_coef
        self.intitializer = tf.keras.initializers.VarianceScaling(
                            scale=0.2, mode='fan_in', distribution='truncated_normal')
        self.initialize_models(self.img_shape, self.latent_dim)

    def initialize_models(self, img_shape, latent_dim):
        self.encoder = self.build_encoder(img_shape, latent_dim)
        self.decoder = self.build_decoder(latent_dim, img_shape)
        self.discriminator = self.build_discriminator(latent_dim, img_shape)
        img = Input(shape=img_shape)
        latent = self.encoder(img)
        res_img = self.decoder(latent)
        self.autoencoder = Model(img, res_img)
        discri_out = self.discriminator(img)

    def build_encoder(self, img_shape, latent_dim):
        encoder = Sequential(name='encoder')
        encoder.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        encoder.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        return encoder
    def build_decoder(self, latent_dim, img_shape):
        decoder = Sequential(name='decoder')
        decoder.add(Dense(1000, input_dim=latent_dim, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        decoder.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        decoder.add(Dense(, activation='sigmoid'))
        return decoder

    def build_discriminator(self, latent_dim, img_shape):
        discriminator = Sequential(name='discriminator')
        discriminator.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        discriminator.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))

        # discriminator.add(Reshape((-1,)))
        discriminator.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))
        return discriminator

Some utility functions for monitoring the results:

def make_image_grid(imgs, shape, prefix, save_path, is_show=False):
    # Find the implementation in below github repo

def flip_tensor(t):
    # Find the implementation in below github repo

def plot_to_image(figure):
    # Find the implementation in below github repo

def visualize_latent_space(x, labels, n_clusters, range_lim=(-80, 80), perplexity=40, is_save=False, save_path=None):
     # Find the implementation in below github repo

Next, we define the training worker, which is called at each epoch:

def train_on_batch(x, y, model: ACAI):
    # Randomzie interpolated coefficient alpha
    alpha = tf.random.uniform((x.shape[0], 1), 0, 1)
    alpha = 0.5 - tf.abs(alpha - 0.5)  # Make interval [0, 0.5]

    with tf.GradientTape() as ae_tape, tf.GradientTape() as d_tape:
        # Constructs non-interpolated latent space and decoded input
        latent = model.encoder(x, training=True)
        res_x = model.decoder(latent, training=True)

        ae_loss = tf.reduce_mean(tf.losses.mean_squared_error(tf.reshape(x, (x.shape[0], -1)), tf.reshape(res_x, (res_x.shape[0], -1))))

        inp_latent = alpha * latent + (1 - alpha) * latent[::-1]
        res_x_hat = model.decoder(inp_latent, training=False)

        pred_alpha = model.discriminator(res_x_hat, training=True)
        # pred_alpha = K.mean(pred_alpha, [1,2,3])
        temp = model.discriminator(res_x + model.disc_reg_coef * (x - res_x), training=True)
        # temp = K.mean(temp, [1,2,3])
        disc_loss_term_1 = tf.reduce_mean(tf.square(pred_alpha - alpha))
        disc_loss_term_2 = tf.reduce_mean(tf.square(temp))

        reg_ae_loss = model.ae_reg_coef * tf.reduce_mean(tf.square(pred_alpha))

        total_ae_loss = ae_loss + reg_ae_loss
        total_d_loss = disc_loss_term_1 + disc_loss_term_2

    grad_ae = ae_tape.gradient(total_ae_loss, model.autoencoder.trainable_variables)
    grad_d = d_tape.gradient(total_d_loss, model.discriminator.trainable_variables)

    model.ae_optim.apply_gradients(zip(grad_ae, model.autoencoder.trainable_variables))
    model.d_optim.apply_gradients(zip(grad_d, model.discriminator.trainable_variables))

    return {
        'res_ae_loss': ae_loss,
        'reg_ae_loss': reg_ae_loss,
        'disc_loss': disc_loss_term_1,
        'reg_disc_loss': disc_loss_term_2


Next, we need to define a main training function:

def train(model: ACAI, x_train, y_train, x_test,
          batch_size, epochs=1000, save_interval=200,
    n_epochs = tqdm.tqdm_notebook(range(epochs))
    total_batches = x_train.shape[0] // batch_size
    if not os.path.exists(save_path):
    for epoch in n_epochs:
        offset = 0
        losses = []
        random_idx = np.random.randint(0, x_train.shape[0], x_train.shape[0])
        for batch_iter in range(total_batches):
            # Randomly choose each half batch
            imgs = x_train[offset:offset + batch_size,::] if (batch_iter < (total_batches - 1)) else x_train[offset:,::]
            offset += batch_size

            loss = train_on_batch(imgs, None, model)

        avg_loss = avg_losses(losses)
        # wandb.log({'losses': avg_loss})
        if epoch % save_interval == 0 or (epoch == epochs - 1):
            sampled_imgs = model.autoencoder(x_test[:100])
            res_img = make_image_grid(sampled_imgs.numpy(), (28,28), str(epoch), save_path)
            latent = model.encoder(x_train, training=False).numpy()
            latent_space_img = visualize_latent_space(latent, y_train, 10, is_save=True, save_path=f'./latent_space/{epoch}.png')
            # wandb.log({'res_test_img': [wandb.Image(res_img, caption="Reconstructed images")],
            #            'latent_space': [wandb.Image(latent_space_img, caption="Latent space")]})
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype(np.float32) / 255.
x_test = x_test.astype(np.float32) / 255.
ann = ACAI(dropout=0.5)


Some of the result from ACAI after finishing:

First is the visualization of MNIST dataset after encoded by the encoder, we can see that the cluster is well separated and applying downstream tasks on latent space will lead to significant improvement in comparison to raw data (such as clustering, try KMeans and check it out yourself :D).

ACAI result


Second is the visualization of interpolation power on latent space:

  • Interpolation with alpha values in range [0,1.0] with step 0.1.
  • 1st row and final row are source and destination image, respectively.
  • Formula:
mix_latent = alpha * src_latent + (1 - alpha) * dst_latent


We can see that there is a very smooth morphing from the digits at the top row to the digits at the bottom row.

The whole running code is available at github (acai_notebook). It’s your time to play with the paper :D.


[1] David Berthelot, Colin Raffel, Aurko Roy, and Ian Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer, 2018.

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

Data Science Project

Please check about experiences for Data Science Project

Vietnam AI / Data Science Lab

Vietnam AI Lab

Please also visit Vietnam AI Lab

Pytorch part 1: Introducing Pytorch

Pytorch is a deep learning framework and a scientific computing package
The scientific computing aspect of PyTorch is primarily a result PyTorch’s tensor library and associated tensor operations. That means you can take advantage of Pytorch for many computing tasks, thanks to its supporting tensor operation, without touching deep learning modules.

Important to note that PyTorch tensors and their associated operations are very similar to numpy n-dimensional arrays. A tensor is actually an n-dimensional array.

Pytorch build its library around Object Oriented Programming(OOP) concept. With object oriented programming, we orient our program design and structure around objects. The tensor in Pytorch is presented by the object torch.Tensor which is created from numpy ndarray objects. Two objects share memory. This makes the transition between PyTorch and NumPy very cheap from a performance perspective.

With PyTorch tensors, GPU support is built-in. It’s very easy with PyTorch to move tensors to and from a GPU if we have one installed on our system. Tensors are super important for deep learning and neural networks because they are the data structure that we ultimately use for building and training our neural networks.
Talking a bit about history.

The initial release of PyTorch was in October of 2016, and before PyTorch was created, there was and still is, another framework called Torch which is also a machine learning framework but is based on the Lua programming language. The connection between PyTorch and this Lua version, called Torch, exists because many of the developers who maintain the Lua version are the individuals who created PyTorch. And they have been working for Facebook since then till now.

Below are the primary PyTorch modules we’ll be learning about and using as we build neural networks along the way.

Pytorch package description
Image 1. Pytorch package description

Why use Pytorch for deep learning?

  • PyTorch’s design is modern, Pythonic. When we build neural networks with PyTorch, we are super close to programming neural networks from scratch. When we write PyTorch code, we are just writing and extending standard Python classes, and when we debug PyTorch code, we are using the standard Python debugger. It’s written mostly in Python, and only drops into C++ and CUDA code for operations that are performance bottlenecks.
  • It is a thin framework, which makes it more likely that PyTorch will be capable of adapting to the rapidly evolving deep learning environment as things change quickly over time.
  • Stays out of the way and this makes it so that we can focus on neural networks and less on the actual framework.

Why PyTorch is great for deep learning research

The reason for this research suitability is that Pytorch use dynamic computational graph, in contrast with tensorfow which uses static computational graph, in order to calculate derivatives.

Computational graphs are used to graph the function operations that occur on tensors inside neural networks. These graphs are then used to compute the derivatives needed to optimize the neural network. Dynamic computational graph means that the graph is generated on the fly as the operations are created. Static graphs that are fully determined before the actual operations occur.

k-Nearest Neighbors algorithms

In this blog post, I am going to introduce one of the most intuitive algorithms in the field of Supervised Learning[1], the k-Nearest Neighbors algorithm (kNN).

The original k-Nearest Neighbors algorithm

The kNN algorithm is very intuitive. Indeed, with the assumption that items close together in the dataset are typically similar, kNN infers the output of new sample by first constructing the distance score with every sample in the training dataset. From there, it creates a ‘neighbor zone’ by selecting samples that are ‘near’ the candidate one, and does the supervised tasks based on samples lie inside that zone. The task could be either classification or regression.

Let’s start with the basic kNN algorithm. Let $L = \{(y_i, x_i), i=1, \ldots, N\}$ be our training dataset with $N$ samples belong to $c$ classes, where $y_i \in \{1, \ldots, c\}$ is the class of one sample, and $x_i \in \mathbb{R}^{1\times d}$ denotes the corresponding feature vector that describes the characteristics of that sample. Furthermore, it is necessary to define the suitable distance metric, since it drives the algorithm to select neighbors and make prediction later on. The distance metric $d$, is a mapping $d: X\times X\xrightarrow{}\mathbb{R}^{+}\cup\{0\}$ over a vector space $X \in \mathbb{R}^{d}$, where the following conditions are satisfied $\forall x_i, x_j, x_k \in X$:

  • $d(x_i, x_j) \geq 0$
  • $d(x_i, x_j) = d(x_j, x_i)$
  • $d(x_i, x_j) \leq d(x_i, x_k) + d(x_j, x_k)$
  • $d(x_i, x_j) = 0 \iff x_i = x_j$

In the following steps to describe the k-Nearest Neighbors algorithm, the Euclidean distance will be used as the distance metric $d$.

For any new instance $x^{\prime}$:

  • Find $\{(y_j, x_j)\} \in S_k$ where $S_k$ is the set of $k$ samples that are closest to $x^\prime$
  • The way to define the nearest neighbors is based on distance metric $d$ (Note that we are using Euclidean distance).

$$ \begin{aligned} d_{Euclidean}(x_i, x_j) = \Bigg(\sum_{s=1}^{p}|x_{is} - x_{js}|^{2}\Bigg)^{\frac{1}{2}} \end{aligned} $$

  • The classifier $h$ is defined as:
    $$\ \begin{aligned} h(x^\prime) = \arg\max_{r} \Bigg(\sum_{i=1}^k I(y_i = r)\Bigg) \end{aligned} $$
    where $I(.)$ is the unit function. Note that for the regression problem, the function $h(x^\prime)$ will just an average of all response values $y$ from neighbor samples.

Please also check Vietnam AI Lab

Weighted k-Nearest Neighbors

In the kNN algorithm, we weight all neighbors equally. It may affect the inference steps, especially when the neighbor zone becomes bigger and bigger. To strengthen the effect of ‘close’ neighbors than others, the weighted scheme of k-Nearest Neighbors is applied.

Weighted k-Nearest Neighbors is based on the idea that, within $S_k$, such observations that are closer to $x^\prime$, should get a higher weight than the further neighbors. Now it is necessary to note some properties of any weighting schemes $K$ on any distance metric $d$:

  • $K(a) \geq 0, \forall a \in R^+\cup\{0\}$
  • $\arg\max_{a} K(a) = 0$
  • $K(a)$ decreases monotonously for $d\xrightarrow{} \pm\infty$

For any new instance $x^\prime$:

  • We find $\{(y_j, x_j)\} \in S_k$ where $S_k$ is the set of $k$ samples that are closest to $x^\prime$
  • The $(k+1)$th neighbor is used for standardization of the $k$ smallest distance: $$ \begin{aligned} d_{standardized}(x_i, x^\prime) = \frac{d(x_i, x^\prime)}{d(x_{k+1}, x^\prime)} \end{aligned} $$
  • We transform the standardized distance $d_{\text{standardized}}$ with any kernel function $K$ into weights $w_i = K(d_{standardized}(x_i, x^\prime))$.
  • The classifier $\hat{h}$ is defined as:
    $$ \begin{aligned} \hat{h}(x^\prime) = \arg\max_{r} \Bigg(\sum_{i=1}^kw_i I(y_i = r)\Bigg) \end{aligned} $$

The pros and cons of kNN, and further topics

The kNN and weighted kNN do not rely on any specific assumption on the distribution of the data, so it is quite easy to apply it on many problems as the baseline model. Furthermore, kNN (and its family) is very intuitive for understanding and implementing, which again make it a worthy try-it-first approach for many supervised problems.

Despite those facts, kNN still has challenges in some aspects: It is computationally expensive – especially when the dataset size becomes huge. Another challenge is choosing the ‘correct’ distance metric that best follows the assumption for using this algorithm: items close together in the data set should be typically similar. Lastly, the curse of dimensionality heavily affects the distance metric. Beyer et al.[2] proves that, under some preconditions, in high dimension space, all points converge to the same distance from the query point. In this case, the concept of ‘nearest neighbors’ is no longer meaningful.

Hypothesis Testing for One – Sample Mean

I. A Brief Overview

Consider an example of a courtroom trial:

A car company C is accused of not manufacturing environment-friendly vehicles. The average CO2 emission per car from different manufacturers based on a survey from the previous year is 120.4 grams per kilometer. But for a random batch of 100 cars produced at C’s factory, the average CO2 emission is 121.2 grams per kilometer with a standard deviation 1.8.

At the trial, Company C is not considered to be guilty as long as their wrongdoing is not proven. A public prosecutor tries to prove that C is guilty and can only succeed when enough evidence is presented.

The example above illustrates the concepts of hypothesis testing; specifically, there are two conflicting hypotheses:

i) C is not guilty; or

ii) C is guilty

The first is called the null hypothesis (denoted by H0), and the second the alternative hypothesis (denoted by HA). At the start of the trial the null hypothesis is temporarily accepted, until proven otherwise. The goal of hypothesis testing is to perform some sort of transformed comparison between the two numbers 121.2 and 120.4 to either reject H0 and accept HA or vice versa. This is a one-sample mean testing because we are comparing the average value obtained from one sample (121.2) with the average value assumed to represent for the whole population (120.4)

II. Required Steps for Hypothesis Testing

The six steps below must be followed to conduct a hypothesis test. The details will be elaborated with our example afterwards.

1) Set up null and alternative hypotheses and check conditions.

2) Determine the significance level, alpha.

3) Calculate the test statistic.

4) Calculate the probability value (a.k.a the p-value), or find the rejection region. For the following example, we will use the p-value.

5) Make a decision about the null hypothesis.

6) State the overall conclusion.

III. A step-by-step example

1) Set up hypotheses:

We already mentioned in the beginning the two hypotheses. But now we will formalize them:

Null hypothesis:

Company C’s CO2 mean (denoted by μ ) is equal to the population mean (denoted by μ0):           μ = μ0

Alternative hypothesis:

Company C’s CO2 mean is greater than the population mean: μ > μ0

The hypothesis testing for one-sample mean we are conducting requires the data to come from an approximately normal distribution or a large enough sample size, which can be quite subjective. To keep things simple, we decide that the data gathered from company C is big enough with the sample size being 100 cars.

2) Determine the significance level, alpha, or confidence level

The significance level and its complementary, the confidence level, provide a level of probability cutoff for our test to make decisions about the hypotheses. A common value for alpha is 5%, which is the same as a confidence level of 95%.

3) Calculate the test statistic

For one-sample mean test, we calculate the t* test statistic using the formula:

t* test statistic

where s is the standard deviation from the sample we are testing, 1.8, and n is the size of the sample, 100.

[AI Lab] Hiring Data Scientist

If you want to join in exciting and challenging projects, MTI Tech could be the next destination for your career.


Please contact below mail.

MTI Technology specializes in creating smart mobile contents and services that transform and transcend customers’ lives. We design and develop our products using agile methods bringing the best deliverable results to the table in the shortest amount of time. MTI stands for an attitude: seeking a balance in excellence, pragmatism and convenience for customers. With the original members of 20 people, we grow our members up to more than 100 bright talents and continue to grow more. Looking for a place to grow your talents and be awesome? This is the place!

The Job

We are looking for Data Scientists who would like to participate in the project to use existing various data to apply to AI, moreover, combine with other data to create new value.

Currently, we are looking for candidates with experiences in Algorithms/ Natural Language Processing (NLP), but any other fields of AI will be considered too.

Example of data

  • Data of medical examination results, medical questionnaire results or image of them in health check, or general medical examination.
  • Data of athletes’ training results and vital.
  • Pregnancy activities data of pregnant women.
  • Life data such as weather and navigation.
  • Text data such as newspapers.
    We have many project style for outsourcing/offshoring, research etc.

Example of application

  • By combining and analyzing Healthcare data such as medical examination/ medical questionnaire results and labor data such as mental health check and overtime records etc., we can find out future health risks at an earlier stage.

Programming Language

  • Python, R, MATLAB, SPSS
  • Java, JavaScript, Golang, Haskell, Erlang/Elixir is a plus.

Currently, development is mainly in Python. It is good to understand object thinking programming in Java etc. It is also good if you have parallel processing experience in the server-side language (Golang, etc.).

In addition, engineers who can use functional languages (Haskell, Erlang / Elixir) are treasures of talented people. Such people are interested in various programming languages, have mathematical curiosity, and many of them study by themselves. Although we do not have many opportunities to use these languages in actual development, we welcome such engineers as well.

Operational Environment

  • AWS, Google Cloud Platform, Microsoft AZURE, Redmine, GitHub etc.

 Who we are looking for

Recently, in the development of Web services, engineers who have experienced APIs usage and libraries related to prominent AI such as open source, Google, IBM etc. from the standpoint of “User”, are often classified as “AI Engineers”. What MTI Group seeks is not such a technician, we find an experienced person who has learned the data sciences themselves deeply. On the other hand, if a person who has learned data science, and has not much experience in actual work, we still consider.

  • Have a great ambition and ability to study the most leading-edge research by yourself and apply them to your own development.
  • Have technical skills and creativity to build new technologies from scratch by yourself if it is necessary but does not exist yet.
  • Adapt yourself to our working culture in a team such as discussion or sharing together. Personality is preferred. Excellent person has a variety of personalities. However, being able to work only on your own becomes a problem.
  • Have experiences in research and study related to Statistical Mathematic such as Regression analysis, SVM or Information theory.
  • Have taken part in research/business about AI, Machine Learning, Natural language processing (NLP), Neural network and so on.
  • Have experiences in research and study related to Engineering and Science, Econometrics, Behavior Psychology, Medical Statistic and so on.
  • Have working experiences in Statistical Analysis or Data Scientist.
  • In AI development, Trial & Error repeats many times to solve the problem with unclear specification or unfixed answer (result). For this reason, we are looking for the individual with the following requirements:
    • Be agile in the cycle of Trial & Error (speed of use your thought in code)
    • Be concerned with even small issues/ problems and solve all problems efficiently by your logical thought.
    • Be curious about knowledge. The person who is greatly interested in and curious about knowledge surely grows the most.
  • Have deep experiences in research.
  • English skill: Be able to use your English reading skill to gain information related to AI.

More on MTI – what is it like to work in MTI?

At MTI Technology, our goal is to empower every individual to learn, discover, be able to communicate openly and honestly to create the best services based on effective teamwork.

Bias in Data Science – the Good, the Bad and the Avoidable !?

In recent years, there have been a few prominent examples of accidental bias in machine-learning applications, such as smart phones’ beauty filters (that essentially ended up whitening skin) [1] or Microsoft’s from-innocent-teen-to-racist-in-24-hours chatbot [2,3]. Examples such as these obviously fell victim to inherently biased data being fed into algorithms too complex to allow for much transparency. Hidden bias continues to be an issue on ubiquitous social media platforms, such as Instagram, whose curators appear to profess themselves both regretful AND baffled [4]. Unfortunately, any model will somewhat regurgitate what it has been fed and interventions at this level of model complexity may prove tricky.  

Interestingly, bias itself does not need to be harmful and is often built into a model’s design on purpose, either to address only a subset of the overall population or to model a real-world state, for instance when predicting house prices from its size and number of bedrooms, the model’s bias parameter often represents the average house price in the data set. Thus, we need to distinguish between conscious and unconscious bias in data analysis. Additionally, there is the factor of intent, i.e. is the person conducting the analysis well-intentioned to follow good scientific method or trying to manipulate it to achieve a particular outcome.

In the following, I will only discuss aspects of unintentional and unconscious bias, meaning bias hidden from the researcher or data scientist introducing it. This is by no means an exhaustive discussion, but merely a highlight of some pervasive aspects:

A. Data availability bias

B. Coherent theory bias

C. Method availability/popularity bias

A. Data availability bias

A. The problem of scientists’ selecting their data out of convenience rather than suitability or representativeness for the current task has been around for a while [4], e.g. the ideal data set may not be available in machine-readable format or would require higher costs and more time for processing, in short, several obstacles to doing an analysis quickly. For instance, in the area of Natural Language Processing, the major European languages, like English, French and German etc. tend to receive more attention because both data and tools to analyze them are widely available. Similarly, psychology research has mostly focused on so-called WEIRD societies (White, Educated, Industrialized, Rich, Democratic) [5] and out of convenience often targets the even smaller population of “North American college students” that unsurprisingly have been found to not represent human populations at large.

B. Coherent theory bias

B. Various studies suggest that we as people strongly favour truths that fit into our pre-existent world view, and why would scientists be exempt from this? Thus, it appears when people analyze data they are often biased by their underlying beliefs about the outcome and are then less likely to yield to unexpected non-significant results [6]. This does not include scientists disregarding new evidence because of conflicting interests [7]. This phenomenon is commonly referred to as confirmation bias or, more fittingly, “myside” bias.

C. Method availability/popularity bias

C. There is a tendency of hailing new trendy algorithms as one-fits-all solutions for whatever task or application. The solution is presented before examining the problem and its actual requirements. While more complex models are often more powerful, this comes at the cost of interpretability, which in some cases is not advisable. Additionally, some methods, both simple and complex ones, enjoy popularity primarily because they come ready-to-use in one’s favourite programming language.

Going forward… 

We as data scientists should:

a. Select our data carefully with our objective in mind. Get to know our data and its limitations.

b. Be honest with ourselves about possible emotional investment in our analyses’ outcomes and resulting conflicts.

c. Examine the problem and its (theoretical) solutions BEFORE making any model design choices.


[1] accessed 21.10.2020)

 [2] accessed 21.10.2020)

 [3] accessed 21.10.2020)

[4] accessed 21.10.2020)

[5] Joseph Rudman (2003) Cherry Picking in Nontraditional Authorship Attribution Studies, CHANCE, 16:2,26-32, DOI: 10.1080/09332480.2003.10554845

[6] Henrich, Joseph; Heine, Steven J., and Norenzayan, Ara. The Weirdest People in the World? Behavioral and Brain Sciences, 33(2-3):61–83, 2010. doi: 10.1017/S0140525X0999152X.

[7] Hewitt CE, Mitchell N, Torgerson DJ. Heed the data when results are not significant. BMJ. 2008;336(7634):23-25. doi:10.1136/bmj.39379.359560.AD

[8] Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124. doi:10.1371/journal.pmed.0020124

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

Data Science Project

Please check about experiences for Data Science Project

Vietnam AI / Data Science Lab

Vietnam AI Lab

Please also visit Vietnam AI Lab

Efficient Algorithms: An overview


What makes computers useful for us is primarily the ability to solve problems. The procedure in which computers solve a problem is an algorithm.  In the recent context of increasing number of algorithms available for solving data-related problems, there is increasing demand for higher level of understanding of algorithm’s performance in order for data scientists to choose the right algorithms for their problems.

Having a general perception for efficiency of an algorithm would help shaping the thought process for creating or choosing better algorithms. With this intention in mind, I would like to create a series of posts to discuss about what makes a good algorithm in practice, or in short, efficient algorithm. And this article is the first step of the journey.

Define Efficiency

An algorithm is considered efficient if its resource consumption, also known as computational cost, is at or below some acceptable level. Roughly speaking, ‘acceptable’ means: it will run in a reasonable amount of time or space on an available computer, typically as a function of the size of the input.

There are many ways in which the resources used by an algorithm can be measured: the two most common measures are speed and memory usage. In the next 2 sections, we will be looking at the two different perspectives for measuring efficiency of an algorithm from theoretician and practitioners.

Theoreticians perspective

Theoreticians are interested in measuring efficiency of algorithm without actually have to run it in several machines and input size. The key idea is that they are not going to actually consider the runtimes of the algorithm on any particular input. Rather, they look at what are known as asymptotic runtimes. Or in other words, they look at how the runtime scale with input size (n) as n gets larger. Does the output scale proportional to n, or proportional to n squared, or maybe exponential in n? In fact, these rate of growth are so different that as long as n is sufficiently large, constant multiples that come from other measures like temporary disk usage, long-term disk usage would be relatively small and neglected.
time complexity using asymptotic notation
Fig1:An illustration for time complexity using asymptotic notation for different functions

Practitioner perspective

While certainly useful, the asymptotic runtime of an algorithm doesn’t tell the whole story. There are algorithms that have good asymptotic runtime, but constants that are so huge that they effectively can’t be used. Ever. Lipton calls them Galactic Algorithms. A galactic algorithm is an algorithm that is wonderful in its asymptotic behavior, but is never used to actual compute anything.
 A fun exchange between a theoretician and practitioner
Fig2: A fun exchange between a theoretician and practitioner

In practice, there are other factors which can affect the efficiency of an algorithm, such as requirements for accuracy and/or reliability. As detailed below, the way in which an algorithm is implemented can also have a significant effect on actual efficiency, though many aspects of this relate to optimization issues.

Implementation issues can also have an effect on efficiency, such as the choice of programming language, or the way in which the algorithm is actually coded, or the choice of a compiler for a particular language, or the compilation options used, or even the operating system being used. In many cases a language implemented by an interpreter may be much slower than a language implemented by a compiler.

Binomial Theorem

Can you expand on $(x+y)^{2}$? I guess you would find that is quite easy to do. You can easily find that $(x+y)^{2} = x^{2}+ 2xy +y^{2}$.

How about the expansion of $(x+y)^{10}$. It is no longer easy.

It is no longer easy isn’t it. However, if we use Binomial Theorem, this expansion becomes an easy problem.

Binomial Theorem is a very intriguing topic in mathematics and it has wide range of applications.


Let $x$$y$ be real numbers (or complex, or polynomial). For any positive integer $n$, we have:

$$\begin{align*} (x+y)^{n} &amp;= \binom{n}{0}x^{n} + \binom{n}{1}x^{n-1}y + \dots + \binom{n}{n-1}xy^{n-1} + \binom{n}{n}y^{n}\\ &amp;= \sum_{k=0}^{n} x^{n-k}y^{k} \end{align*}$$


$$\begin{align*} \binom{n}{k} = \frac{n!}{k!(n-k)!} \end{align*}$$


We will use prove by induction. The base case $n=1$ is obvious. Now suppose that the theorem is true for the case $n-1$, that is assume that:

$$\begin{align*} (x+y)^{n-1} = \binom{n-1}{0}x^{n-1} + \binom{n-1}{1}x^{n-2}y + \dots + \binom{n-1}{n-2}xy^{n-2} + \binom{n-1}{n-1}y^{n-1} \end{align*}$$


we will need to  show that, this is true for

$$\begin{align*} (x+y)^{n} &amp;= \binom{n}{0}x^{n} + \binom{n}{1}x^{n-1}y + \dots + \binom{n}{n-1}xy^{n-1} + \binom{n}{n}y^{n} \label{eq1} \end{align*}$$

Let us consider the left hand side of equation above

$$\begin{align*} (x+y)^{n} &amp;= (x+y) (x+y)^{n-1} \\ &amp;= (x+y) \bigg( \binom{n-1}{0}x^{n-1} + \binom{n-1}{1}x^{n-2}y + \dots \\ &amp;+ \binom{n-1}{n-2}xy^{n-2} + \binom{n-1}{n-1}y^{n-1}\bigg) \\ &amp;= x^{n} + \bigg( \binom{n-1}{0} + \binom{n-1}{1}\bigg) x^{n-1}y + \bigg( \binom{n-1}{1} + \binom{n-1}{2}\bigg) x^{n-2}y^{1} + \dots \\ &amp;+\bigg( \binom{n-1}{n-2} + \binom{n-1}{n-1}\bigg) xy^{n-1} + y^{n} \end{align*}$$

We can now apply Pascal’s identity:


$$\begin{align*} \binom{n-1}{k-1} + \binom{n-1}{k} = \binom{n}{k} \end{align*}$$

The equation above can be simplified to:

$$\begin{align*} (x+y)^{n} &amp;= x^{n} + \binom{n}{1}x^{n-1}y + \binom{n}{1}x^{n-1}y + \dots + \binom{n}{n-1}xy^{n-1} +y^{n}\\ &amp; = \binom{n}{0}x^{n} + \binom{n}{1}x^{n-1}y + \dots + \binom{n}{n-1}xy^{n-1} + \binom{n}{n}y^{n} \end{align*}$$

as we desired.

Example 1:  Power rule in Calculus


In calculus, we always use the power rule that $\frac{d}{dx} x^{n} = n x^{n-1}$


We can prove this rule using Binomial Theorem.


Recall that derivative for any continuous function f(x) is defined as:


$$\begin{align*} \frac{d}{dx} f(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}. \end{align*}$$

Let $n$ be a positive integer and let $f(x) = x^{n}$


The derivative of f(x) is:


$$\begin{align*} &amp;\frac{d}{dx} x^{n} = \lim_{h \rightarrow 0} \frac{(x+h)^{n} - x^{n}}{h}\\ &amp;= \lim_{h \rightarrow 0} \frac{\bigg( \binom{n}{0} x^{n} + \binom{n}{1}x^{n-1}h + \dots + \binom{n}{n} h^{n} \bigg) - x^{n}}{h}\\ &amp; = \lim_{h \rightarrow 0} \frac{ \binom{n}{1}x^{n-1}h + \binom{n}{2}x^{n-2}h^{2}+ \dots + \binom{n}{n} h^{n}}{h}\\ &amp; = \lim_{h \rightarrow 0} \bigg( \binom{n}{1}x^{n-1} + \binom{n}{2}x^{n-2}h+ \dots + \binom{n}{n} h^{n-1} \bigg)\\ &amp;= n \lim_{h \rightarrow 0} \bigg( \binom{n-1}{0}x^{n-1} + \binom{n-1}{1}x^{n-2}h+ \dots + \binom{n-1}{n-1}h^{n-1} \bigg)\\ &amp;= n \lim_{h \rightarrow 0} (x+h)^{n-1}\\ &amp;= n x^{n-1}. \end{align*}$$

Example 2:  Binomial Distribution 

Let X be the number of Head a sequence of n independent coin tossing. X is usually model by binomial distribution in probability model. Let $ p \in [0,1]$ be the probability that a head show up in a toss, and let $k = 0,1,\dots,n$. The probability that there is $k$ head in the sequence of $n$ toss is:

$$\begin{align*} P(X = k) = \binom{n}{k} p^{k}(1-p)^{n-k} \end{align*}$$

We know that sum of all the probability must equal to 1. In order to show this, we can use Binomial Theorem. We have:


$$\begin{align*} \sum_{k=0}^{n} P(X = k) &amp;= \sum_{k=0}^{n} \binom{n}{k} p^{k}(1-p)^{n-k}\\ &amp; = (p + 1 - p)^{n}\\ &amp;= 1. \end{align*}$$

Please also check another article Gaussian Samples and N-gram language models ,Bayesian model , Monte Carlo for statistics knowledges.

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

Data Science Project

Please check about experiences for Data Science Project

Vietnam AI / Data Science Lab

Vietnam AI Lab

Please also visit Vietnam AI Lab

Monte Carlo Simulation

On a nice day 2 years ago, when I was on financial field. My boss sent our team an email. In this email, he would like to us propose some machine learning techniques to predict stock price.

So, after accepting the assignment from my manager, our team begin to research and apply some approaches for prediction. When we talk about Machine Learning, we often think of supervised and unsupervised learning. But one of the algorithms we applied is one that we usually forgotten  however equally highly effective algorithm: Monte Carlo Simulation.

What is Monte Carlo simulation

Monte Carlo method is a technique that uses random numbers and probability to solve complex problems. The Monte Carlo simulation, or probability simulation, is a technique used to understand the impact of risk and uncertainty in financial sectors, project management, costs, and other forecasting machine learning models.[1]

Now let’s jump into python implementation to see how it applies,

Python Implementation

In this task we used data of DXG stock dataset from 2017/01/01 to 2018/08/24 and we would like to know what is stock price after 10 days, 1 months and 3 months, respectively

Monte Carlo Simulation

We will simulate the return of stock and next price will be calculated by

P(t) = P(0) * (1+return_simulate(t))

Calculate mean and standard deviation of stock returns

miu = np.mean(stock_returns, axis=0)
dev = np.std(stock_returns)

Simulation process


simulation_df = pd.DataFrame()
last_price = init_price
for x in range(mc_rep):
    count = 0
    daily_vol = dev
    price_series = []
    price = last_price * (1 + np.random.normal(miu, daily_vol))
    for y in range(train_days):
        if count == train_days-1:
        price = price_series[count] * (1 + np.random.normal(miu, daily_vol))
        count += 1
    simulation_df[x] = price_series

Visualization Monte Carlo Simulation

fig = plt.figure()
fig.suptitle('Monte Carlo Simulation')
plt.axhline(y = last_price, color = 'r', linestyle = '-')

Monte Carlo Simulation

Now, let’s check with actual stock price after 10 days, 1 month and 3 months

plt.hist(simulation_df.iloc[9,:],bins=15,label ='histogram')
plt.axvline(x = test_simulate.iloc[10], color = 'r', linestyle = '-',label ='Price at 10th')
plt.title('Histogram simulation and last price of 10th day')

Monte Carlo Simulation

We can see the most frequent occurrence price is pretty close with the actual price after 10th

If the forecast period is longer, the results is not good gradually

Simulation for next 1 month

Monte Carlo Simulation

After 3 months

Monte Carlo Simulation


Monte Carlo simulation is used a lot in finance, although it has some weaknesses, but hopefully through this article, you will have a new look on the simulation application for forecasting.


[1] Pratik Shukla, Roberto Iriondo, “Monte Carlo Simulation An In-depth Tutorial with Python”, medium,

Please also check Gaussian Samples and N-gram language models,
Bayesian Statistics  for statistics knowledges.


Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

Data Science Project

Please check about experiences for Data Science Project

Vietnam AI / Data Science Lab

Vietnam AI Lab

Please also visit Vietnam AI Lab