Word Embeddings – blessing or curse in disguise?

As word embeddings become more and more ubiquitous in language applications, a key issue has emerged alongside them. The ability of embeddings to learn complex, underlying relationships between words is also their greatest weakness:

How do we know when we have trained a good embedding?

It’s important to differentiate between a good embedding in a general sense and a good embedding for a specific downstream task. Although some evaluation methods, such as word similarity/analogy tasks, have been proposed, most remain somewhat controversial as to their validity as well as their relevance to actual target applications (e.g. Faruqui et al. (2016)).

In this context, one distinguishes between two types of evaluation: intrinsic, where one typically employs specifically designed, human-moderated data sets, and extrinsic, where embeddings are tested on simpler proxy NLP tasks to estimate their performance.

For both types, it is still unclear to what extent good performance correlates with genuinely useful embeddings. In many, if not most, state-of-the-art neural networks, the embeddings are trained alongside the model and tailored to the task at hand.

Here, we want to evaluate two different embeddings (Skipgram and CBOW)  trained on a Japanese text corpus (300K) with the aim of assessing which algorithm is more suitable.

Our setup is as follows: 

Data: Japanese text corpus, containing full texts and their matching summaries (300K) 

Preprocessing: Subword segmentation using SentencePiece (Kudo et al., 2018)

Embedding: Train two models, Skipgram and CBOW, with vector size 300 and a 40K vocabulary, using FastText (Athiwaratkun et al., 2018).
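
To make this concrete, here is a minimal sketch of how the two models could be trained with gensim's FastText implementation (used purely as an illustration; the corpus variable and all hyperparameters except the vector size are assumptions):

from gensim.models import FastText

# `corpus` is assumed to be a list of subword-tokenized sentences produced by SentencePiece
skipgram = FastText(sentences=corpus, vector_size=300, sg=1, min_count=5, epochs=10)
cbow = FastText(sentences=corpus, vector_size=300, sg=0, min_count=5, epochs=10)

skipgram.save("skipgram_ja.model")
cbow.save("cbow_ja.model")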

Japanese is a non-space-separated language and needs to be segmented as part of preprocessing. This can be done with morphological analyzers, such as Mecab (Kudo, 2006), or with language-independent algorithms, such as SentencePiece (Kudo et al., 2018). Because the concept of a “word” is therefore highly arbitrary, different methods can return different segmentations, all of which may be appropriate for specific target applications.

To tackle the ubiquitous Out-of-Vocabulary (OOV) problem, we segment our texts into “subwords” using SentencePiece. These subword units are typically smaller than, and do not align with, the “word” segmentations returned by Mecab.
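
A minimal sketch of this segmentation step, assuming a recent version of the sentencepiece Python package and a hypothetical corpus file corpus_ja.txt:

import sentencepiece as spm

# train a subword model with a 40K vocabulary (file names and model type are assumptions)
spm.SentencePieceTrainer.train(
    input="corpus_ja.txt", model_prefix="sp_ja", vocab_size=40000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="sp_ja.model")
print(sp.encode("今日は良い天気です。", out_type=str))  # list of subword pieces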

If we wanted to evaluate our embeddings on an intrinsic task such as word similarity, we could use the Japanese word similarity data set (Sakaizawa et al., 2018), containing word similarity ratings for pairs of words across different word types by human evaluators.

However, preliminary vocabulary comparisons showed that, because of differences in segmentation, there was little to no overlap between the words in our word embeddings and those in the data set. For instance, the largest common group occurred in nouns: only 50 out of 1000 noun comparison pairs.

So instead we are going to propose a naïve approach to comparing two word embeddings: a Synonym Vector Mapping Approach.

For the current data set, we would like to see whether the model can map information from the full text and its summary correctly, even when different expressions are being used, i.e. we would like to test the model’s ability to pair information from two texts that use different words. 

Pre-processing Data

In Data Science, before building a predictive model from a particular data set, it is important to explore and pre-process the data. In this blog, we will illustrate some typical steps of data pre-processing.

In this particular exercise, we will build a simple Decision Tree model to classify the food cuisine from the list of ingredients. The data for this exercise can be taken from:

https://www.kaggle.com/kaggle/recipe-ingredients-dataset

This exercise will show the importance of data pre-processing. The blog is organized as follows:

  1. Data Exploration and Pre-processing.
  2. Imbalanced Data.

1.  Data Exploration and Pre-processing

When you are given a set of data, it is important to explore and analyze them before constructing a predictive model. Let us first explore this data set.


Looking at the first 10 items of this data set, we observe that even within a particular cuisine, the list of ingredients can differ from recipe to recipe.

From this data set, we can see that there are 20 different cuisines and that the recipe distribution is not uniform. For example, recipes from the ‘Italian’ cuisine make up 19.7% of the data set, while only 1.17% of the recipes come from the ‘Brazilian’ cuisine.
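
A small sketch of this exploration with pandas, assuming the train.json file from the Kaggle dataset has been downloaded locally:

import pandas as pd

df = pd.read_json("train.json")  # columns: id, cuisine, ingredients
print(df.head(10))

# share of recipes per cuisine, in percent
print(df["cuisine"].value_counts(normalize=True).mul(100).round(2))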


Now, let us explore this data set further and look at the top 15 ingredients.


If we look at the top 15 ingredients, we see that they include “salt”, “water”, “sugar”, etc. They are all generic and can be found in every cuisine. Intuitively, if we remove these ingredients from the classification model, the accuracy of the classification should not be affected.

In the classification model, we would prefer the recipes in each cuisine to have ingredients unique to that country. This helps the model identify which cuisine a recipe comes from.

After removing  all the generic ingredients (salt, water, sugar, etc) from the data set, we look at the top 15 ingredients again.
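
A minimal sketch of counting the most frequent ingredients and filtering out a hand-picked set of generic ones (the exact set below is an assumption; in practice it would be chosen by inspection or domain knowledge):

from collections import Counter

counts = Counter(ing for ingredients in df["ingredients"] for ing in ingredients)
print(counts.most_common(15))

generic = {"salt", "water", "sugar", "olive oil", "vegetable oil", "garlic", "onions"}
df["ingredients"] = df["ingredients"].apply(lambda ings: [i for i in ings if i not in generic])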


It looks like we could remove a few more ingredients, but the decision of which ones to remove is best left to someone with a bit more domain knowledge in cooking. For example, some countries may use ‘onion’ in their recipes while others use ‘red onion’, so it is better not to filter out too many generic ingredients.

Now, we look at the distribution of ingredients in each recipe in the data set.


Some recipes have only 1 or 2 ingredients, while others may have up to 60. It is probably best to remove the recipes with very few ingredients from the data set, as their number of ingredients may not be representative enough for the classification model. What is the minimum number of ingredients required to classify the cuisine? The short answer is that no one knows. It is best to experiment: remove recipes with 1, 2, 3, etc. ingredients, re-train the model, and compare the accuracy to decide which threshold works best for your model.
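
For example, recipes with very few ingredients could be dropped with a simple filter; the threshold of 3 below is only an illustration and should be tuned as described above:

df["n_ingredients"] = df["ingredients"].str.len()
print(df["n_ingredients"].describe())

df = df[df["n_ingredients"] >= 3]  # drop recipes with fewer than 3 ingredients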

The ingredients in the recipes are all words, so in order to do further pre-processing we will need some NLP (Natural Language Processing).
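
As a rough sketch of where this is heading, the ingredient lists can be turned into a bag-of-words representation and fed to the Decision Tree mentioned at the start of this post; this is only an illustrative baseline, not a tuned model:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# join each ingredient list into one string so CountVectorizer can tokenize it
docs = df["ingredients"].apply(lambda ings: " ".join(i.replace(" ", "_") for i in ings))

X = CountVectorizer().fit_transform(docs)
y = df["cuisine"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))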

AWS Lake Formation – Data Lake (Setup)

Overview: Build a Data Lake with AWS (Beginner)

AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository storing all your structured and unstructured data, at any scale. You can store your data as-is, without having to structure it first. And you can run different types of analytics to better guide decision-making, from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

The challenges of data lakes

The main challenge to data lake administration stems from the storage of raw data without content oversight. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.

Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.

Lake Formation allows you to control data lake access and audit those who access data. The AWS Glue Data Catalog integrates data access policies, making sure of compliance regardless of the data’s origin.

Set up the S3 bucket and put the dataset.

Set up Data Lake with AWS Lake Formation.

Step 1: Create a data lake administrator

First, designate yourself as a data lake administrator to allow access to any Lake Formation resource.


Step 2: Register an Amazon S3 path

Next, register an Amazon S3 path to contain your data in the data lake.


Step 3: Create a database

Next, create a database in the AWS Glue Data Catalog to contain the datasetsample00 table definitions.

  • For Database, enter datasetsample00-db.
  • For Location, enter your S3 bucket/datasetsample00.
  • For New tables in this database, do not select Grant All to Everyone.

Step 4: Grant permissions

Next, grant permissions for AWS Glue to use the datasetsample00-db database. For IAM role, select your user and AWSGlueServiceRoleDefault.

Grant your user and AWSServiceRoleForLakeFormationDataAccess permissions to use your data lake using a data location:

  • For IAM role, choose your user and AWSServiceRoleForLakeFormationDataAccess.
  • For Storage locations, enter s3://datalake-hiennu-ap-northeast-1.

Step 5: Crawl the data with AWS Glue to create the metadata and table

In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.

Create a table using an AWS Glue crawler. Use the following configuration settings:

  • Crawler name: samplecrawler.
  • Data stores: Select this field.
  • Choose a data store: Select S3.
  • Specified path: Select this field.
  • Include path: s3://datalake-hiennu-ap-northeast-1/datasetsample00.
  • Add another data store: Choose No.
  • Choose an existing IAM role: Select this field.
  • IAM role: Select AWSGlueServiceRoleDefault.
  • Run on demand: Select this field.
  • Database: Select datasetsample00-db.
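
The same crawler can also be created programmatically; the sketch below uses boto3 with the names and paths from the console steps above (the region is an assumption):

import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")

glue.create_crawler(
    Name="samplecrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="datasetsample00-db",
    Targets={"S3Targets": [{"Path": "s3://datalake-hiennu-ap-northeast-1/datasetsample00"}]},
)
glue.start_crawler(Name="samplecrawler")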

Step 6: Grant access to the table data

Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database.

  • In the navigation pane, choose Tables.
  • Choose Grant.
  • Provide the following information:
    1. For IAM role, select your user and AWSGlueServiceRoleDefault.
    2. For Table permissions, choose Select all.

Step 7: Query the data with Athena

Query the data in the data lake using Athena.

  • In the Athena console, choose Query Editor and select the datasetsample00-db
  • Choose Tables and select the datasetsample00 table.
  • Choose Table Options (three vertical dots to the right of the table name).
  • Select Preview table.

Athena issues the following query: SELECT * FROM datasetsample00 limit 10;
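
The same query can also be started programmatically with boto3; the query result output location below is a hypothetical S3 prefix:

import boto3

athena = boto3.client("athena", region_name="ap-northeast-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM datasetsample00 LIMIT 10;",
    QueryExecutionContext={"Database": "datasetsample00-db"},
    ResultConfiguration={"OutputLocation": "s3://datalake-hiennu-ap-northeast-1/athena-results/"},
)
print(response["QueryExecutionId"])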

Combinatorial Optimization: From Supervised Learning to Reinforcement Learning – Part 1

Recently, I was asked to solve an interesting problem: sorting an array.

Input:

An array A contains n unique elements, whose values are integers. The length (n) of A ranges from 2 to 10.

Output:

A sorted array B in ascending order. The length of array B must be the same as the length of array A.

Examples: A = [3,0] -> B = [0,3], A = [1,3,2] -> B = [1,2,3], A = [5,9,1,3,7] -> B = [1,3,5,7,9]

Array sorting is not a new problem. There are many sorting algorithms, such as Straight Insertion, Shell Sort, Bubble Sort, Quick Sort, Selection Sort, Heap Sort, etc. The problem above becomes much more interesting if we consider it as a Combinatorial Optimization problem, where various Machine Learning approaches can be applied.

Combinatorial Optimization

“Combinatorial Optimization is a category of problems which requires optimizing a function over a combination of discrete objects and the solutions are constrained. Examples include finding shortest paths in a graph, maximizing value in the Knapsack problem and finding boolean settings that satisfy a set of constraints. Many of these problems are NP-Hard, which means that no polynomial time solution can be developed for them. Instead, we can only produce approximations in polynomial time that are guaranteed to be some factor worse than the true optimal solution.”

Source: Recent Advances in Neural Program Synthesis (https://arxiv.org/abs/1802.02353)

Traditional solvers often rely on handcrafted designs to make decisions. In recent years, many Machine Learning (ML) techniques have been used to solve combinatorial optimization problems. The related techniques range from supervised learning to modern reinforcement learning.

Using the above sorting list problem, we will see how the problem can be solved using different ML techniques.

Agenda

In this series, we will start with some supervised techniques, then apply neuro-evolution, and finally use some modern RL techniques.

Part 1: Supervised learning: Gradient Boosting, Fully Connected Neural Networks, Seq2Seq.

Part 2: Deep Neuro-Evolution: NEAT, Evolution Strategies, Genetic Algorithms.

Part 3: Reinforcement Learning: Deep Q-Network, Actor-Critic, PPO with Pointer network and Attention based model.

Code for Part 1:  https://colab.research.google.com/drive/1MXpJt9FwNJWizcjMM3ZWRh-drwIOnIvM?usp=sharing

(Note: Enable Colab GPU to speed up running time)

Supervised learning

Supervised machine learning algorithms are designed to learn by example. If we want to use Supervised learning, we have to have data.

First, we will generate 3 data sets of different sizes: 1000, 5000, and 50000 samples. Then we will train several models on these data sets and compare their sorting abilities after the learning process.

1. Generate training data

How do we generate the data? One possible approach: if we consider each element of the input list as a feature and each element of the sorted list as a label, we can easily convert the data into a tabular form:

     In1  In2  In3  In4  In5  In6  In7  In8  In9  In10
0     -1   -1   -1   -1   -1    7    0    1    2     3
1      1    7    6    3    4    5    0    2    8     9
2     -1   -1    6    2    7    4    1    0    3     8

     Out1 Out2 Out3 Out4 Out5 Out6 Out7 Out8 Out9 Out10
0     -1   -1   -1   -1   -1    0    1    2    3     7
1      0    1    2    3    4    5    6    7    8     9
2     -1   -1    0    1    2    3    4    6    7     8

Then we can use any multi-label regression or multi-label classification model on this training data.
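
A minimal sketch of such a generator in numpy; the left-padding with -1 follows the table above, while the exact value range is an assumption:

import numpy as np

def make_dataset(n_samples, max_len=10, low=0, high=10, pad=-1, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    for _ in range(n_samples):
        n = rng.integers(2, max_len + 1)  # array length between 2 and 10
        arr = rng.choice(np.arange(low, high), size=n, replace=False)  # unique integers
        X.append(np.concatenate([np.full(max_len - n, pad), arr]))
        y.append(np.concatenate([np.full(max_len - n, pad), np.sort(arr)]))
    return np.array(X), np.array(y)

X_train, y_train = make_dataset(5000)
X_test, y_test = make_dataset(500, seed=1)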

2. Multi-label regression

For this tabular dataset, I will use 2 common techniques: Gradient Boosting (using the XGBoost library) and a simple Fully Connected Neural Network (FCNN).

from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor
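
Continuing with the generated tabular data from the sketch above, a minimal way to fit the gradient-boosting baseline (the hyperparameters are illustrative):

import numpy as np

model = MultiOutputRegressor(XGBRegressor(n_estimators=200, max_depth=6))
model.fit(X_train, y_train)

# predictions are real-valued, so round them back to integers
pred = np.round(model.predict(X_test)).astype(int)
print("exactly sorted arrays:", (pred == y_test).all(axis=1).mean())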

AWS – Batch Process and Data Analytics

Overview

Today, I will describe our current reference architecture for batch processing and data analytics for the sales report system. Our mission is to create a big data analytics system that incorporates Machine Learning features for insights and prediction. Nowadays, many AI companies are researching and applying machine learning features to their systems to improve their services. We are also researching and applying machine learning features to our system, such as NLP, forecasting, and OCR, which give us the opportunity to provide better service to our customers.

Reference Architecture for Batch Data Processing
  1. Data Source
    We have various sources from multiple systems, both on-premise and on-cloud, with large datasets and unpredictable, frequent updates.
  2. Data lake storage
    We use S3 as our data lake storage, which supports unlimited data types and volumes and lets us scale the system easily.
  3. Machine Learning
    We focus on Machine Learning to build AI solutions from our dataset. Machine learning models predict insights and are integrated directly into our system as microservices.
  4. Compute
    Compute machines are the most important part of our system. We choose the compute services, either infrastructure or serverless, that best optimize cost and performance.
    – We use AWS Lambda functions for small jobs such as calling AI services, processing small datasets, and integration.
    – We use AWS Glue ETL to build ETL pipelines and AWS Step Functions to build custom pipelines.
    – We also provide web and API services for end users; the system is built on a microservices architecture and uses AWS Fargate for hosting services.
  5. Report datastore
    After processing the data and exporting insights from predictions, we store the data in DynamoDB and RDS for visualization and built-in AWS insight features.
  6. Data analytics with visualization and insights
    We use AWS QuickSight and its insight features for data analytics.

Depending on the specific dashboard, we have a specific data processing pipeline to process and prepare data for visualization and insights. Here is our standard pipeline, orchestrated with AWS Step Functions. We want to keep everything as basic as possible: we use AWS Events for scheduling and Lambda functions as the processing steps.
– Advantages: serverless, easy to develop and scale.

– Disadvantages: strong dependence on AWS services, and challenges when exceeding AWS service limits, such as the Lambda function processing-time limit.
– Most common use case: batch data processing pipelines.
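
As a very rough sketch (not our actual definition), a minimal Step Functions state machine chaining two Lambda processing steps could be created with boto3 like this; all names and ARNs below are placeholders:

import json
import boto3

sfn = boto3.client("stepfunctions", region_name="ap-northeast-1")

definition = {
    "StartAt": "ExtractData",
    "States": {
        "ExtractData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:extract-data",
            "Next": "TransformData",
        },
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:ap-northeast-1:123456789012:function:transform-data",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="batch-report-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsExecutionRole",
)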

Data processing pipeline
  • Uses serverless compute for data processing and a microservices architecture
  • Easy to develop, deploy, and scale the system without modification
  • Flexibility to use built-in services or to build custom Machine Learning models with AWS SageMaker
  • Decoupled from the data source systems at run time
  • Effective data lake processing

This architecture is basic, and I hope you can take something away from it. We have focused on describing our current system; it will grow beyond this design with more features in the future, so we should choose flexible solutions that are easy to maintain and replace. When you choose an architecture or service to solve your problems, consider which service best fits your current situation; no architecture or service can solve every problem, as it always depends on the specific problem. And avoid hunting for problems just to fit a solution. 😀

— — — — — — — — — — — — — — — —

EXPLORING UNIVERSAL SENTENCE ENCODER MODEL (USE)

In NLP, encoding text is at the heart of understanding language. There are many implementations, such as GloVe, Word2vec, and fastText, that provide word embeddings. However, these embeddings are only useful at the word level and may not perform well if we would like to encode sentences or, in general, spans longer than one word. In this post, we would like to introduce one of the SOTA models for this task: the Universal Sentence Encoder model.

1. What is USE (Universal Sentence Encoder)?

The Universal Sentence Encoder (USE) encodes text into high-dimensional vectors (embedding vectors, or just embeddings). These vectors are supposed to capture the textual semantics. But why do we even need them?

A vector is an array of numbers of a particular dimension. With the vectors in hand, it’s much easier for computers to work on textual data. For example, we can say two data points are similar or not just by calculating the distance between the two points’ embedding vectors.


(Image source: https://amitness.com/2020/06/universal-sentence-encoder/)

The embedding vectors, in turn, can be used for other downstream NLP tasks such as text classification, semantic similarity, and clustering.
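
For instance, a short sketch of computing sentence similarity with a pre-trained USE model from TensorFlow Hub (the module URL below is the commonly published one and may change over time):

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = ["How old are you?", "What is your age?"]
vectors = embed(sentences).numpy()  # shape: (2, 512)

# cosine similarity between the two sentence embeddings
cos = np.dot(vectors[0], vectors[1]) / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
print(cos)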

2. USE architecture

USE comes in two variants, with the main difference residing in the embedding part. One is equipped with the encoder part of the famous Transformer architecture, while the other uses a Deep Averaging Network (DAN).

2.1 Transformer encoder

The Transformer architecture is designed to handle sequential data, but not sequentially like RNN-based architectures. It uses the attention mechanism to compute context-aware representations of the words in a sentence, taking into account both the ordering and the significance of all the other words. The encoder takes as input a lowercased PTB-tokenized string and outputs the representation of the sentence as a fixed-length encoding vector by computing the element-wise sum of the representations at each word position. Thanks to this design, the Transformer allows for much more parallelization than RNNs and therefore reduced training times.

Universal Sentence Encoder uses only the encoder branch of Transformer to take advantage of its strong embedding capacity.


(Image source: https://arxiv.org/abs/1706.03762)

2.2 Deep Averaging Network (DAN):

DAN is a simple neural network that takes the average of the embeddings of the words and bi-grams and then passes the “combined” vector through a feedforward deep neural network (DNN) to produce the sentence embedding. Similar to the Transformer encoder, DAN takes as input a lowercased PTB-tokenized string and outputs a 512-dimensional sentence embedding.


(Image source: https://medium.com/tech-that-works/deep-averaging-network-in-universal-sentence-encoder-465655874a04)

The two variants trade off accuracy against computational cost. While the Transformer encoder has higher accuracy, it is computationally more intensive. The DAN encoder is computationally less expensive, with slightly lower accuracy.
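
To make the DAN idea concrete, here is a tiny illustrative Keras sketch with the same shape of computation (average the token embeddings, then pass them through a feedforward network); it is not the actual USE implementation, and all sizes are placeholders:

import tensorflow as tf

def build_dan(vocab_size=30000, embed_dim=128, output_dim=512):
    tokens = tf.keras.Input(shape=(None,), dtype="int32")  # token ids
    x = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(tokens)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)  # average the word embeddings
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    x = tf.keras.layers.Dense(512, activation="relu")(x)
    sentence_embedding = tf.keras.layers.Dense(output_dim)(x)  # 512-d sentence vector
    return tf.keras.Model(tokens, sentence_embedding)

model = build_dan()
model.summary()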

3. How was it trained?

The key idea for training this model is to make the model work for generic tasks such as:

  • Modified Skip-thought
  • Conversational input-response prediction
  • Natural language inference.

3.1 Modified skip-thought:

Given a sentence, the model needs to predict the sentences around it.

 


3.2 Conversational input-response prediction:

In this task, the model needs to predict the correct response for a given input among a list of correct responses and other randomly sampled responses.

Radar Image Compression with the Huffman Algorithm

In AI Lab, one of our projects is “Weather Forecasting”, in which we process data from weather radars and make predictions using nowcasting techniques, i.e. forecasting over a short period such as 60 minutes.

To develop the forecasting model and evaluate it, we need to run many test cases. Each test case corresponds to one rainfall event, whose duration can be several hours. In the rainy season, one rainfall event can last for several days.

Consider one event and one area of interest (80 km2 of Tokyo, for example). In our actual work, a list of arrays of rainfall intensity needs to be stored on the machine. Each array corresponds to one minute of radar observation. If the event lasts for 5 days, then the number of arrays is

60 minutes x 24 hours x 5 days = 7200 arrays

Therefore, storing such an amount of data costs a lot of memory.

One solution is to modify the way data is fed to the evaluation module. However, for the scope of this blog, let us assume that we would like to keep that amount of data on the computer. How do we compress the data, and by how much can it be compressed?

The aim of this post is to introduce the Huffman algorithm, one of the basic methods for data compression. By learning about the Huffman algorithm, we can easily understand the mechanism behind more advanced compression methods.

Let us start with a question: how does a computer store data?

Data in computer language

Computers use bits of {0, 1} to represent numbers. A single character 0 or 1 is called a bit. Any string of bits is called a code.

Each number is encoded by a unique string of bits, or code. The encoder should satisfy the “prefix” condition, which states that no code is the prefix of any other code. This makes it always possible to decode a bit-string back to a number.

Let us consider an example. Assume that we have data d as a list of numbers, d = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3].

To store this list d, the computer needs to encode the three numbers 1, 2, and 3. Two possible encoders are:

Number    1     2     3
Code 1   '0'  '01'  '11'
Code 2   '0'  '10'  '11'

Code 1 violates the ‘prefix’ condition, as ‘0’ is a prefix of ‘01’, while Code 2 does not. Therefore, Code 2 can be used for encoding.

So, how many bits are needed to represent N different numbers? Theoretically, we need codes of k-bit length to encode 2^k numbers.

For example, to encode 2 numbers, we need codes of 1-bit length, which are ‘0’ and ‘1’.

To encode 4 = 2^2 numbers, we need codes of 2-bits length, which are: ’00’, ’01’, ’10’, ’11’

To encode 256 = 2^8 numbers, we need codes of 8-bits length, which are: … [we skip it :)]

Consider one radar image of dimension 401×401 containing float values from 0 to 200 mm/h. If we are interested in only one decimal place, this data can be multiplied by 10 and saved as integers. The array then contains 401×401 integers, whose values range from 0 to 2000.

Following the above logic, to encode 2000 different numbers we need 11-bit codes, because 2^11 = 2048. The total number of bits needed to encode this data is then 401 x 401 x 11 bits = 1,768,811 bits.

Compressing the data means reducing the total number of bits used to encode it; in other words, finding an encoder with a minimum average code length.

Expected code length

The expected or average code length (ECL) of data d encoded with a code C is defined as

ECL(d, C) = Σ_x f(x) · l(x)

where x ranges over the values in the data, f(x) is the frequency (relative proportion) of x, and l(x) is the length of the code used to encode the value x. This quantity tells us how many bits, on average, are used to encode one number of the data d with the code C.
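
A small sketch of computing this quantity (together with the entropy discussed in the next section) in Python, using the example list d and Code 2 from above:

import math
from collections import Counter

def expected_code_length(values, code):
    # `code` maps each value to its bit-string, e.g. {1: '0', 2: '10', 3: '11'}
    counts = Counter(values)
    total = sum(counts.values())
    return sum(counts[v] / total * len(code[v]) for v in counts)

def entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

d = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
print(expected_code_length(d, {1: '0', 2: '10', 3: '11'}))  # 1.7 bits per value
print(entropy(d))  # about 1.571 bits per value, the lower bound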

Minimum expected code length

It can be proved mathematically that the ECL of a code is bounded below by the entropy of the given data. What is the entropy of data? It is a quantity that measures information, initially introduced by Shannon (a mathematician and the father of Information Theory). It tells us how much uncertainty the data contains.

From a mathematical point of view, the entropy of the data is defined as

H(d) = - Σ_x f(x) · log2 f(x)

and by the source coding theorem, the following inequality always holds

ECL(d, C) ≥ H(d)

which is equivalent to

Σ_x f(x) · l(x) ≥ - Σ_x f(x) · log2 f(x)

The ECL achieves this minimum value when l(x) = -log2 [f(x)], meaning that the length of the code for value x depends on the frequency of that value in the data. For example, in our radar images the most frequent value is 0 (0.0 mm/h), so we should use a short bit-string to encode the value 0. On the contrary, the value 2000 (200.0 mm/h) has a very low frequency, so a longer bit-string must be used. In this way, the average number of bits used to encode one value is reduced toward the entropy, which is the lower bound of the compression. If we used a code that yielded an ECL smaller than the entropy, the compression would lose information.

In summary, the expected code length is minimized when the code lengths follow the formula l(x) = -log2 [f(x)].

The Huffman Encoding

The Huffman algorithm is one method of encoding data into bit-strings whose ECL approaches this lower bound of compression. It is a very simple algorithm that can be implemented in a few lines of Python. For details of the Huffman algorithm, you can refer to this link:

https://www.geeksforgeeks.org/huffman-coding-greedy-algo-3/
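
Below is a minimal illustrative Huffman-code builder in Python (a sketch written for this post, not the implementation used in our project):

import heapq
from collections import Counter

def huffman_code(values):
    """Build a prefix code (value -> bit-string) from the value frequencies."""
    freq = Counter(values)
    # each heap entry: (frequency, tie-breaker, {value: code-so-far})
    heap = [(f, i, {v: ""}) for i, (v, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct value
        return {v: "0" for v in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {v: "0" + code for v, code in c1.items()}
        merged.update({v: "1" + code for v, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

d = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
codes = huffman_code(d)
encoded = "".join(codes[v] for v in d)
print(codes, len(encoded), "bits")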

Encoding and decoding a radar image

As described above, a radar image contains 401×401 data points, each an integer taking values from 0 to 2000.

By running the Huffman encoder, we obtained the following codes (Figure 1).

Figure 1. A part of Huffman codes for 2000 integer numbers

Using these codes, the total number of bits for one specific image is 416,924, which is much smaller than with the fixed 11-bit codes.

In actual work, we can convert an image to bits by concatenating the codes value by value, then writing everything to a binary file. As the computer groups every 8 bits into a byte, the binary file consumes 416,924 / 8 bytes ≈ 52 KB.
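
For reference, a small sketch of writing such a concatenated bit-string to a binary file; `encoded` stands for the concatenated Huffman codes of one image (as in the sketch above), and the original bit length must be stored separately so the padding can be stripped when decoding:

def bits_to_bytes(bitstring):
    padded = bitstring + "0" * (-len(bitstring) % 8)  # pad to a multiple of 8 bits
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

with open("radar_image.bin", "wb") as f:
    f.write(bits_to_bytes(encoded))  # `encoded`: concatenated Huffman codes of the image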

To decode the binary file back into the image array, we load the binary file and transform it back into bits. As the Huffman code satisfies the ‘prefix’ condition, by searching and mapping, the original data can always be reconstructed without losing any information.

If we write the image array in the numpy format, the storage can be up to 1.3 MB per array. With Huffman compression, the storage is reduced to 52 KB.

If we use a ‘ZIP’ application to compress the numpy file, the size of the zip file is only 48 KB, meaning that ZIP outperforms our Huffman encoding. That is because ZIP uses another encoding algorithm that gets closer to the lower bound of compression.


Introduction to Healthcare Data Science

Introduction to Healthcare Data Science (Overview)

Healthcare analytics is the collection and analysis of data in the healthcare field in order to study the determinants of disease in human populations and to identify and mitigate risk by predicting outcomes. This post introduces some common epidemiological study designs and gives an overview of the modern healthcare data analytics process.

Types of Epidemiologic Studies

In general, epidemiologic studies can be classified into 3 types: interventional studies, observational studies, and meta-analysis studies.

Interventional or Randomized controlled study

Clinical medicine relies on an evidence base built from strong research to inform best practices and improve clinical care. The gold-standard study design for providing evidence is the randomized controlled trial (RCT). The main idea of this kind of research is to prove the root cause of a certain disease or the causal effect of a treatment. RCTs are performed in fairly homogeneous patient populations, in which participants are allocated by chance to two similar groups. The researchers then apply different interventions or treatments to these two groups and compare the outcomes.

As an example, a study was conducted to assess whether improved lifestyle habits could reduce the hemoglobin A1c (HbA1c) levels of employees. In the experiment, the intervention consisted of a 3-month competition among employees to adopt healthier lifestyle habits (eat better, move more, and quit smoking) or keep their current lifestyle. After the intervention, employees with elevated HbA1c significantly reduced their HbA1c levels, while the HbA1c levels of employees without the intervention did not change.

Under ideal conditions, there are no confounding variables in a randomized experiment, so RCTs are often designed to investigate the causal relationship between exposure and outcome. However, RCTs have several limitations: they are often costly, time-intensive, labor-intensive, and slow, and they can consist of homogeneous patients whose results are seldom generalizable to every patient population.

Observational studies

Unlike RCTs, observational studies have no active interventions, which means the researchers do not interfere with their participants. In contrast with interventional studies, observational studies are usually performed in heterogeneous patient populations. In these studies, researchers often define an outcome of interest (e.g. a disease) and use data collected on patients, such as demographics, labs, vital signs, and disease states, to explore the relationship between exposures and the outcome, determine which factors contributed to the outcome, and attempt to draw inferences about the effects of different exposures on the outcome. Findings from observational studies can subsequently be developed and tested with RCTs in targeted patient populations.

Observational studies tend to be less time- and cost-intensive. There are three main study designs in observational studies: the prospective study design, the retrospective study design, and the cross-sectional study design.

Follow-up study/ Prospective study/ Longitudinal (incidence) study

A prospective study is a study in which a group of disease-free individuals is identified at baseline and followed over a period of time until some of them develop the disease. The development of disease over time is then related to other variables measured at baseline, generally called exposure variables. The study population in a prospective study is often called a cohort.

Retrospective study/ Case-Control study

A retrospective study is a study in which two groups of individuals are initially identified: (1) a group that has the disease under study (the cases) and (2) a group that does not have the disease under study (the controls). Cases are individuals who have the specific disease investigated in the research. Controls are those who do not have the disease of interest. Usually, a retrospective history of health habits prior to getting the disease is obtained. An attempt is then made to relate their prior health habits to their current disease status. This type of study is also sometimes called a case-control study.

Cross-sectional study/ Prevalence study

A cross-sectional study is one in which a study population is ascertained at a single point in time. This type of study is sometimes called a prevalence study because the prevalence of disease at one point in time is compared between exposed and unexposed individuals. The prevalence of a disease is obtained by dividing the number of people who currently have the disease by the number of people in the study population.

Meta-analysis

Often more than one investigation is performed to study a particular research question, with some research groups reporting significant differences for a particular finding and other groups reporting no significant differences. In a meta-analysis, therefore, researchers collect and synthesize findings from many existing studies to provide a clear picture of the factors associated with the development of a certain disease. These results may be utilized for ranking and prioritizing risk factors in other research.

Modern Healthcare Data analytics approach

Secondary Analysis and modern healthcare data analytics approach

In the primary research setting, designing a large-scale randomized controlled trial (RCT) is expensive and sometimes unfeasible. An alternative approach for obtaining expansive data is to utilize electronic health records (EHR). In contrast with primary analysis, secondary analysis performs retrospective research using data collected for purposes other than research, such as the Electronic Health Record (EHR). Modern healthcare data analytics projects apply advanced data analysis methods, such as machine learning, and perform integrative analysis in order to leverage a wealth of deep clinical and administrative data with longitudinal history from the EHR, gaining a more comprehensive understanding of the patient's condition.

Electronic Health Record (EHR)

EHRs are data generated during routine patient care. Electronic health records contain large amounts of longitudinal data and a wealth of detailed clinical information. Thus, if properly analyzed and meaningfully interpreted, these data could vastly improve our understanding and development of best practices. Common data in an EHR are listed below:

  • Demographics

    Age, gender, occupation, marital status, ethnicity

  • Physical measurement

    SBP, DBP, Height, Weight, BMI, waist circumference

  • Anthropometry

    Stature, sitting height, elbow width, weight, subscapular, triceps skinfold measurement

  • Laboratory

    Creatinine, hemoglobin, white blood cell count (WBC), total cholesterol, cholesterol, triglyceride, gamma-glutamyl transferase (GGT)

  • Symptoms

    frequency in urination, skin rash, stomachache, cough

  • Medical history and Family diseases

    diabetes, traumas, dyslipidemia, hypertension, cancer, heart diseases, stroke, diabetes, arthritis, etc

  • Lifestyle habit

    Behavior risk factors from Questionnaires such as Physical activity, dietary habit, smoking, drinking alcohol, sleeping, diet, nutritional habits, cognitive function, work history and digestive health, etc

  • Treatment

    Medications (prescriptions, dose, timing), procedures, etc.

Using EHR to Conduct Outcome and Health Services Research

In secondary analysis, the process of analyzing data often includes the following steps:

  1. Problem Understanding and Formulating the Research Question: In this step, a clinical question is transformed into a research question. There are 3 key components of a research question: the study sample (or patient cohort), the exposure of interest (e.g., information about patient demographics, lifestyle habits, medical history, regular health checkup test results), and the outcome of interest (e.g., whether a patient has diabetes after 5 years).
  2. Data Preparation and Integration: Extracted raw data can come from different data sources, or be in separate datasets with different representations and formats. Data preparation and integration is the process of combining and reorganizing data derived from various data sources (such as databases, flat files, etc.) into a consistent dataset that contains all the information required for the desired statistical analysis.
  3. Exploratory Data Analysis/ Data Understanding: Before statistics and machine learning models are employed, there is an important step of exploring the data, which is essential for understanding the type of information that has been collected and what it means. Data exploration consists of investigating the distribution of variables, the patterns and nature of the data, and checking the quality of the underlying data. This preliminary examination will influence which methods will be most suitable for the data preprocessing step and for choosing the appropriate predictive model.
  4. Data Preprocessing: Data preprocessing is one of the most important steps and is critical to the success of machine learning techniques. Electronic health records (EHR) are often collected for clinical purposes, so these databases can have many data quality issues. Preprocessing aims at assessing and improving the quality of the data to allow for reliable statistical analysis.
  5. Feature Selection: The final dataset may have several hundred data fields, and not all of them are relevant for explaining the target variable. In many machine learning algorithms, high dimensionality can cause overfitting or reduce the accuracy of the model instead of improving it. Feature selection algorithms are used to identify features that have an important predictive role. These techniques do not change the content of the initial feature set; they only select a subset of it. The purpose of feature selection is to help create optimized and cost-effective models with enhanced prediction performance.
  6. Predictive Model: To develop prediction models, statistical models and machine learning algorithms can be employed. The purpose of machine learning is to design and develop prediction models by allowing the computer to learn from data or experience to solve a certain problem. These models are useful for understanding the system under study, and they can be divided according to the type of outcome that they produce: classification models, regression models, or clustering models.
  7. Prediction and Model Evaluation: This process evaluates the performance of the predictive models. The evaluation should include internal and external validation. Internal validation refers to evaluating the model's performance on the same dataset in which the model was developed. External validation is the evaluation of the prediction model in other populations with different characteristics, to assess the generalizability of the model. A toy sketch of steps 6 and 7 is shown after this list.
    Please also check our Healthcare Data Science example
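
As a toy illustration of steps 6 and 7 (not a real study), a prediction model on a hypothetical EHR-derived table might look like the sketch below; the file name, columns, and outcome are all made up for the example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# hypothetical cohort table: one row per patient, outcome = diabetes within 5 years
df = pd.read_csv("ehr_cohort.csv")
X = df[["age", "bmi", "sbp", "hba1c", "smoking"]]  # example exposures
y = df["diabetes_5y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("internal validation AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))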

References:

[1] Fundamentals of Biostatistics – Bernard Rosner, Harvard University

[2] Secondary Analysis of Electronic Health Records – Springer Open


Importance of TinyML

Introduction

Tiny Machine Learning (TinyML) [1] is, unsurprisingly, a machine learning technique, but one that is used to build machine learning applications that require high performance on limited hardware: a tiny neural network running on a microcontroller with very low power requirements (sometimes <1 mW).


Figure 1: Tiny ML, the next AI revolution [5]

TinyML is often implemented in low-energy systems such as microcontrollers or sensors to perform automated tasks. One typical example is Internet of Things (IoT) devices. However, the biggest challenge in implementing TinyML is that it requires “full-stack” engineers or data scientists who have profound knowledge of building hardware, designing system architectures, and developing software and applications.
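
A common workflow for getting a small model onto such hardware is to convert and quantize it with TensorFlow Lite, as sketched below; the `model` variable is assumed to be a small trained tf.keras model, and deploying to a microcontroller additionally requires TensorFlow Lite Micro on the device side:

import tensorflow as tf

# `model` is assumed to be a small trained tf.keras model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)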

TinyML,  IoT and embedded system

In [2], the Internet of Things (IoT) is described as the network of physical objects (a.k.a. things) that are embedded with sensors, software, and other technologies for the purpose of connecting and exchanging data with other devices and systems over the Internet. Therefore, most IoT devices could apply TinyML to enhance their data collection and data processing. In other words, as argued by many machine learning experts, the relationship between TinyML, IoT, and embedded systems will be a long-lasting one (TinyML belongs to IoT).


Applications


Figure 2: One commercial application of TinyML in a smart house [6]

In the future, in the era of information explosion, TinyML will enable humans to deliver many brilliant applications that help us reduce the burden of processing data. Some examples include:

In agriculture: Profit losses due to animal illnesses can be reduced by using wearable devices. These smart sensors can help monitor health vitals such as heart rate, blood pressure, and temperature, and TinyML will be useful in predicting the onset of diseases and epidemics.

In industry:  TinyML can prevent downtime due to equipment failure by enabling real-time decisions without human interaction in the manufacturing sector. It can signal workers to perform preventative maintenance when necessary, based on equipment conditions.

In retail: TinyML can help increase profits in indirect ways by providing effective means for warehouse or store monitoring. As smart sensors will likely become popular in the future, they could be used in small stores, supermarkets, or hypermarkets to monitor in-store shelves. TinyML will be useful for processing those data and preventing items from going out of stock. Humans will enjoy endless benefits from these ML-based applications in the economic sector.

In mobility: TinyML will give sensors more power for ingesting real-time traffic data. Once those sensors are applied in practice, humans will no longer need to worry about traffic-related issues (such as traffic jams and traffic accidents).

Imagine all the sensors in the embedded systems mentioned in the applications above connected over a super-fast Internet connection, with every TinyML algorithm coordinated by a giant ML system. That is a time when humans can take advantage of computing power to perform the boring tasks. We will certainly feel happier, have more time for our families, and have more time to make important decisions.


First glance at the potential of TinyML

According to a survey by ABI [3], by 2030 almost 250 billion microcontrollers in our printers, TVs, cars, and pacemakers will be able to perform tasks that previously only our computers and smartphones could handle. All of our devices and appliances are getting smarter thanks to microcontrollers. In addition, in [4] Silent Intelligence predicts that TinyML can reach more than $70 billion in economic value by the end of 2025. From 2016 to 2020, the number of microcontrollers (MCUs) increased rapidly, and this figure is predicted to keep rising over the next 3 years.

Machine Learning with Amazon Rekognition

What is Amazon Rekognition?

Amazon Rekognition is a service that makes it easy to add powerful image- and video-based visual analysis to your applications.

Rekognition Image lets you easily build powerful applications to search, verify, and organize millions of images.

Rekognition Video lets you extract motion-based context from stored or live-stream videos and helps you analyze them.

You just provide an image or video to the Rekognition API, and the service can identify objects, people, text, scenes, and activities. It can detect inappropriate content as well.
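
For example, a minimal boto3 call to detect labels in an image stored in S3 might look like this (the bucket, key, and region are placeholders):

import boto3

rekognition = boto3.client("rekognition", region_name="ap-northeast-1")

response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-image-bucket", "Name": "photos/sample.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)
for label in response["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))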

Amazon Rekognition also provides highly accurate facial analysis and facial recognition. You can detect, analyze, and compare faces for a wide variety of use cases, including user verification, cataloging, people counting, and public safety.

Amazon Rekognition is a HIPAA-eligible service.

You need to ensure that the Amazon S3 bucket you want to use is in the same region as your Amazon Rekognition API endpoint.

How does it work?

Amazon Rekognition provides two API sets: Amazon Rekognition Image, for analyzing images, and Amazon Rekognition Video, for analyzing videos.

Both API sets perform detection and recognition analysis of images and videos.
Amazon Rekognition Video can be used to track the path of people in a stored video.
Amazon Rekognition Video can also search a streaming video for persons whose facial descriptions match facial descriptions already stored by Amazon Rekognition.

The RecognizeCelebrities API returns information for up to 100 celebrities detected in an image.

Use cases for Amazon Rekognition

Searchable image and video libraries

Amazon Rekognition makes images and stored videos searchable.

Face-based user verification

It can be used for building access or similar applications; it compares a live image to a reference image.

Sentiment and demographic analysis

Amazon Rekognition detects emotions such as happiness, sadness, or surprise, and demographic information such as gender from facial images.

Rekognition can analyze images and send the emotion and demographic attributes to Amazon Redshift for periodic reporting on trends, such as by in-store location and similar scenarios.

Facial recognition

Images, stored videos, and streaming videos can be searched for faces that match those in a face collection. A face collection is an index of faces that you own and manage.

Unsafe Content Detection

Amazon Rekognition can detect explicit and suggestive adult content in images and videos.

Examples include social and dating sites, photo-sharing platforms, blogs and forums, apps for children, e-commerce sites, entertainment, and online advertising services.

Celebrity recognition

Amazon Rekognition can recognize thousands of celebrities (in politics, sports, business, entertainment, and media) within supplied images and videos.

Text detection

Detecting text in an image allows for extracting textual content from images.

Benefits

Integrate powerful image and video recognition into your apps

Amazon Rekognition removes the complexity of building image recognition capabilities into applications by making powerful and accurate analysis available through a simple API.

Deep learning-based image and video analysis

Rekognition uses deep learning technology to accurately analyze images, find and compare faces in images, and detect objects and scenes within images and videos.

Scalable image analysis

Amazon Rekognition enables the analysis of millions of images.
This allows for curating and organizing massive amounts of visual data.

Low cost

Clients pay for the images and videos they analyze and for the face metadata that is stored. There are no minimum fees or upfront commitments.