EXPLORING UNIVERSAL SENTENCE ENCODER MODEL (USE)

In NLP, encoding text is at the heart of understanding language. There are many implementations such as GloVe, Word2vec, and fastText that provide word embeddings. However, these embeddings work at the word level and may not perform well when we want to encode sentences or, more generally, text longer than one word. In this post, we introduce one of the state-of-the-art models for this task: the Universal Sentence Encoder.

1. What is USE (UNIVERSAL SENTENCE ENCODER MODEL)?

The Universal Sentence Encoder (USE) encodes text into high-dimensional vectors (embedding vectors, or just embeddings). These vectors are meant to capture the textual semantics. But why do we even need them?

A vector is an array of numbers of a particular dimension. With these vectors in hand, it is much easier for computers to work on textual data. For example, we can tell whether two data points are similar simply by calculating the distance between their embedding vectors.

[Figure: the Universal Sentence Encoder maps sentences to embedding vectors. Image source: https://amitness.com/2020/06/universal-sentence-encoder/]

The embedding vectors can in turn be used for downstream NLP tasks such as text classification, semantic similarity, and clustering.
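As a quick illustration, here is a minimal sketch of using the model; it assumes the tensorflow, tensorflow_hub, and numpy packages and the public TF Hub module URL below (version 4, the DAN-based variant).

import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
sentences = ["How old are you?", "What is your age?", "The weather is nice today."]
vectors = embed(sentences).numpy()           # shape (3, 512): one embedding per sentence

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))        # paraphrases -> relatively high similarity
print(cosine(vectors[0], vectors[2]))        # unrelated sentence -> lower similarity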

2. USE architecture

USE comes in two variants whose main difference lies in the encoding part. One is equipped with the encoder from the famous Transformer architecture; the other uses a Deep Averaging Network (DAN).

2.1 Transformer encoder

The Transformer architecture is designed to handle sequential data, but not sequentially like RNN-based architectures. It uses the attention mechanism to compute context-aware representations of the words in a sentence, taking into account both the ordering and the significance of all the other words. The encoder takes a lowercased PTB-tokenized string as input and outputs a fixed-length encoding vector for each sentence by computing the element-wise sum of the representations at each word position. Thanks to this design, the Transformer allows for much more parallelization than RNNs and therefore shorter training times.

The Universal Sentence Encoder uses only the encoder branch of the Transformer to take advantage of its strong embedding capacity.

[Figure: the Transformer architecture. Image source: https://arxiv.org/abs/1706.03762]

2.2 Deep Averaging Network (DAN):

DAN is a simple neural network that takes the average of the embeddings of words and bi-grams and then passes the combined vector through a feedforward deep neural network (DNN) to produce sentence embeddings. Like the Transformer encoder, DAN takes a lowercased PTB-tokenized string as input and outputs a 512-dimensional sentence embedding.

[Figure: the Deep Averaging Network. Image source: https://medium.com/tech-that-works/deep-averaging-network-in-universal-sentence-encoder-465655874a04]

The two variants trade off accuracy against computational cost. The Transformer encoder achieves higher accuracy but is computationally more intensive, while the DAN encoder is computationally cheaper at the cost of slightly lower accuracy.

3. How was it trained?

The key idea for training this model is to make the model work for generic tasks such as:

  • Modified Skip-thought
  • Conversational input-response prediction
  • Natural language inference.

3.1 Modified skip-thought:

Given a sentence, the model needs to predict the sentences around it.

Radar Image Compression by the Huffman Algorithm

In AI Lab, one of our projects is “Weather Forecasting”, in which we process data from weather radars and perform prediction using nowcasting techniques, i.e. forecasting over a short period such as 60 minutes.

To develop a forecasting model and evaluate it, we need to run many test cases. Each test case corresponds to one rainfall event, whose duration can be several hours. In the rainy season, one rainfall event can last for several days.

Consider one event and one area of interest (for example, 80 km² of Tokyo). In our actual work, a list of arrays of rainfall intensity needs to be stored on a machine. Each array corresponds to one minute of radar observation. If the event lasts for 5 days, then the number of arrays is

60 minutes x 24 hours x 5 days = 7200 arrays

Therefore, storing such an amount of data costs a lot of memory.

One solution is to change the way data is fed to the evaluation module. However, for the scope of this blog, let us assume that we want to keep that amount of data on the computer. How can we compress the data, and how much can it be compressed?

This post introduces the Huffman algorithm, one of the basic methods for data compression. By learning about the Huffman algorithm, we can easily understand the mechanism behind more advanced compression methods.

Let us start with a question: how does a computer store data?

Data in computer language

Computers use bits {0, 1} to represent numbers. One character, 0 or 1, is called a bit. Any string of bits is called a code.

Each number is encoded by a unique string of bits, or code. The encoder should satisfy the “prefix” condition, which states that no code is a prefix of any other code. This makes it always possible to decode a bit-string back to a number.

Let us consider an example. Assume that we have data d as a list of numbers, d = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3].

To store this list d, the computer needs to encode the three numbers 1, 2, 3. Two possible encoders are:

Number:   1      2      3
Code 1:  '0'   '01'   '11'
Code 2:  '0'   '10'   '11'

Code 1 violates the ‘prefix’ condition, because ‘0’ is a prefix of ‘01’, while Code 2 does not. Therefore, Code 2 can be used for encoding.

So, how many bits are needed to represent N different numbers? In theory, we need codes of k-bit length to encode 2^k numbers.

For example, to encode 2 numbers, we need codes of 1-bit length, which are ‘0’ and ‘1’.

To encode 4 = 2^2 numbers, we need codes of 2-bits length, which are: ’00’, ’01’, ’10’, ’11’

To encode 256 = 2^8 numbers, we need codes of 8-bits length, which are: … [we skip it :)]

Consider one radar image of dimension 401×401 containing float values from 0 to 200 mm/h. If we are interested in one decimal place, the data can be multiplied by 10 and saved as integers. The array then contains 401×401 integers whose values range from 0 to 2000.

Following the above logic, to encode 2000 different numbers we need 11-bit codes, because 2^11 = 2048. The total number of bits to encode this data is then 401 x 401 x 11 bits = 1,768,811 bits.

Compressing the data aims to reduce the total number of bits used to encode it; in other words, to find an encoder with the minimum average code length.

Expected code length

The expected or average code length (ECL) of a data d is defined as:

ECL(d, C) = Σ_x f(x) · l(x)

where x ranges over the values in the data, f(x) is the frequency of the value x, and l(x) is the length of the code used to encode x. This quantity tells us how many bits, on average, are used to encode one number of the data d with a code C.
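As a minimal sketch, we can compute the ECL of the example data d = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3] under Code 2 directly in Python (the entropy computed here is the lower bound discussed in the next section).

from collections import Counter
import math

d = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
code2 = {1: '0', 2: '10', 3: '11'}

freq = {x: c / len(d) for x, c in Counter(d).items()}      # f(x), relative frequencies
ecl = sum(freq[x] * len(code2[x]) for x in freq)           # expected code length of Code 2
entropy = -sum(f * math.log2(f) for f in freq.values())    # lower bound (see next section)
print(ecl, entropy)                                        # 1.7 vs. about 1.57 bits per value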

Minimum expected code length

It can be proved mathematically that the ECL of a code is bounded below by the entropy of the given data. What is the entropy of data? It is a quantity that measures information, initially introduced by Shannon (a mathematician and the father of information theory). This quantity tells us how much uncertainty the data contains.

From the mathematical point of view, the entropy of the data is defined as

H(d) = -Σ_x f(x) · log2 f(x)

And by the source coding theorem, the following inequality is always true

ECL(d, C) ≥ H(d)

which is equivalent to

Σ_x f(x) · l(x) ≥ -Σ_x f(x) · log2 f(x)

The ECL achieves its minimum value when l(x) = -log2 [f(x)], meaning that the length of the code for a value x depends on the frequency of that value in the data. For example, in our radar images the most frequent value is 0, so we should use a short bit-string to encode the value 0. On the contrary, the value 2000 (i.e., 200 mm/h) has a very low frequency, so a longer bit-string must be used. In this way, the average number of bits used to encode one value is reduced toward the entropy, which is the lower bound of the compression. If a code yielded an ECL smaller than the entropy, the compression would lose information.

In summary, the expected code length is minimized when the lengths of the codes follow the formula l(x) = -log2 [f(x)].

The Huffman Encoding

The Huffman algorithm is one of the methods to encode data into bit-strings whose ECL approaches the lower bound of compression. It is a very simple algorithm that can be implemented in a few lines of Python. For details of the Huffman algorithm, you can refer to the link:

https://www.geeksforgeeks.org/huffman-coding-greedy-algo-3/
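As a minimal sketch of the idea, here is a compact greedy implementation using Python's heapq (not tuned for production use):

import heapq
from collections import Counter

def huffman_code(values):
    # Return a prefix code {value: bit-string} built from the value frequencies.
    counts = Counter(values)
    if len(counts) == 1:                                   # degenerate case: one distinct value
        return {next(iter(counts)): '0'}
    # Each heap entry: [total_count, tie_breaker, [(value, code), ...]]
    heap = [[c, i, [(v, '')]] for i, (v, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)                           # two least frequent subtrees
        hi = heapq.heappop(heap)
        lo_pairs = [(v, '0' + code) for v, code in lo[2]]  # left branch gets a leading 0
        hi_pairs = [(v, '1' + code) for v, code in hi[2]]  # right branch gets a leading 1
        heapq.heappush(heap, [lo[0] + hi[0], tie, lo_pairs + hi_pairs])
        tie += 1
    return dict(heap[0][2])

d = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
print(huffman_code(d))   # e.g. {3: '0', 1: '10', 2: '11'}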

Encode and Decode radar images

As described above, a radar image contains 401×401 data points, each point is an integer taking value from 0 to 2000.

By running the Huffman encoder, we obtained the codes shown in Figure 1.

Figure 1. A part of the Huffman codes for 2000 integer values

Using these codes, the total number of bits for one specific image is 416,924 bits, which is much smaller than with the fixed 11-bit codes.

In actual work, we can convert the image data to bits by concatenating the codes value by value, then write everything to a binary file. As the computer groups every 8 bits into a byte, the binary file consumes 416,924 / 8 ≈ 52,000 bytes, i.e., about 52 KB.

To decode the binary file back into the image array, we load the binary file and transform it back into bits. Since the Huffman code satisfies the ‘prefix’ condition, by searching and mapping, the original data can always be reconstructed without losing any information.
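Here is a minimal sketch of this encode/decode round trip; the helper names are illustrative, and a real implementation would also store the number of encoded values so that the padding bits at the end of the last byte are ignored.

def encode_to_bytes(values, code):
    # code maps each value to its bit-string, e.g. the Huffman code built above
    bits = ''.join(code[v] for v in values)
    bits += '0' * (-len(bits) % 8)                         # pad to a whole number of bytes
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def decode_from_bytes(data, code, n_values):
    decode_table = {bit_str: v for v, bit_str in code.items()}
    bits = ''.join(format(byte, '08b') for byte in data)
    values, current = [], ''
    for b in bits:
        current += b
        if current in decode_table:                        # prefix property: first match is a value
            values.append(decode_table[current])
            current = ''
            if len(values) == n_values:
                break
    return values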

If we write the image array in the numpy format, the storage can be up to 1.3 MB per array. With Huffman compression, the storage is reduced to 52 KB.

If we use a ‘ZIP’ application to compress the numpy file, the zip file takes only 48 KB, meaning that ‘ZIP’ outperforms our Huffman encoding. That is because ZIP uses another encoding algorithm that gets closer to the lower bound of compression.


Introduction to Healthcare Data Science

Introduction to Healthcare Data Science (Overview)

Healthcare analytics is the collection and analysis of data in the healthcare field to study the determinants of disease in human populations and to identify and mitigate risk by predicting outcomes. This post introduces some common epidemiological study designs and gives an overview of the modern healthcare data analytics process.

Types of Epidemiologic Studies

In general, epidemiologic studies can be classified into three types: interventional studies, observational studies, and meta-analysis studies.

Interventional or Randomized control study

Clinical medicine relies on an evidence base built on strong research to inform best practices and improve clinical care. The gold standard study design for providing such evidence is the randomized controlled trial (RCT). The main idea of this kind of research is to establish the root cause of a certain disease or the causal effect of a treatment. RCTs are performed in fairly homogeneous patient populations in which participants are allocated by chance to two similar groups. The researchers then try different interventions or treatments on these two groups and compare the outcomes.

As an example, a study was conducted to assess whether improved lifestyle habits could reduce the hemoglobin A1c (HbA1c) levels of employees. In the experiment, the intervention consisted of a 3-month competition among employees to adopt healthier lifestyle habits (eat better, move more, and quit smoking) or keep their current lifestyle. After the intervention, employees with elevated HbA1c who received the intervention significantly reduced their HbA1c levels, while the levels of employees without the intervention did not change.

In ideal conditions there are no confounding variables in a randomized experiment, so RCTs are often designed to investigate the causal relationship between exposure and outcome. However, RCTs have several limitations: they are often costly, time-intensive, labor-intensive, and slow, and they can consist of homogeneous patients whose results are seldom generalizable to every patient population.

Observational studies

Unlike RCTs, observational studies have no active intervention, which means the researchers do not interfere with their participants. In contrast with interventional studies, observational studies are usually performed in heterogeneous patient populations. In these studies, researchers often define an outcome of interest (e.g., a disease) and use data collected on patients, such as demographics, labs, vital signs, and disease states, to explore the relationship between exposures and the outcome, determine which factors contributed to the outcome, and attempt to draw inferences about the effects of different exposures on the outcome. Findings from observational studies can subsequently be developed and tested with RCTs in targeted patient populations.

Observational studies tend to be less time- and cost-intensive. There are three main study designs in observational studies: prospective study design, retrospective study design, and cross-sectional study design.

Follow-up study/ Prospective study/ Longitudinal (incidence) study

A prospective study is a study in which a group of disease-free individuals is identified at baseline and followed over time until some of them develop the disease. The development of disease over time is then related to other variables measured at baseline, generally called exposure variables. The study population in a prospective study is often called a cohort.

Retrospective study/ Case-Control study

A retrospective study is a study in which two groups of individuals are initially identified: (1) a group that has the disease under study (the cases) and (2) a group that does not have the disease under study (the controls). Cases are individuals who have a specific disease investigated in the research. Controls are those who did not have the disease of interest in the research. Usually, a retrospective history of health habits before getting the disease is obtained. An attempt is then made to relate their prior health habits to their current disease status. This type of study is also sometimes called a case-control study.

Cross-sectional (Prevalence) study/ Prevalence study

A cross-sectional study is one in which a study population is ascertained at a single point in time. This type of study is sometimes called a prevalence study because the prevalence of disease at one point in time is compared between exposed and unexposed individuals. The prevalence of a disease is obtained by dividing the number of people who currently have the disease by the number of people in the study population.

Meta-analysis

Often more than one investigation is performed to study a particular research question, with different research groups reporting significant differences for a particular finding and other research groups reporting no significant differences. Therefore, in a meta-analysis researchers collect and synthesize findings from many existing studies to provide a clearer picture of the factors associated with the development of a certain disease. These results may be used for ranking and prioritizing risk factors in other research.

Modern Healthcare Data analytics approach

Secondary Analysis and modern healthcare data analytics approach

In a primary research infrastructure, designing a large-scale randomized controlled trial (RCT) is expensive and sometimes unfeasible. An alternative approach to obtaining extensive data is to utilize electronic health records (EHR). In contrast with primary analysis, secondary analysis performs retrospective research using data collected for purposes other than research, such as EHR data. Modern healthcare data analytics projects apply advanced data analysis methods, such as machine learning, and perform integrative analysis to leverage a wealth of deep clinical and administrative data with longitudinal history from EHR to get a more comprehensive understanding of a patient’s condition.

Electronic Health Record (EHR)

EHRs are data generated during routine patient care. Electronic health records contain large amounts of longitudinal data and a wealth of detailed clinical information. Thus, these data, if properly analyzed and meaningfully interpreted, could vastly improve our understanding and development of best practices. Common data in EHR are listed as follows:

  • Demographics

    Age, gender, occupation, marital status, ethnicity

  • Physical measurement

    SBP, DBP, Height, Weight, BMI, waist circumference

  • Anthropometry

    Stature, sitting height, elbow width, weight, subscapular, triceps skinfold measurement

  • Laboratory

    Creatinine, hemoglobin, white blood cell count (WBC), total cholesterol, cholesterol, triglyceride, gamma-glutamyl transferase (GGT)

  • Symptoms

    frequency in urination, skin rash, stomachache, cough

  • Medical history and Family diseases

    diabetes, traumas, dyslipidemia, hypertension, cancer, heart diseases, stroke, diabetes, arthritis, etc

  • Lifestyle habit

    Behavior risk factors from Questionnaires such as Physical activity, dietary habit, smoking, drinking alcohol, sleeping, diet, nutritional habits, cognitive function, work history, and digestive health, etc

  • Treatment

    Medications (prescriptions, dose, timing), procedures, etc.

Using EHR to Conduct Outcome and Health Services Research

In secondary analysis, the process of analyzing data often includes the following steps:

  1. Problem Understanding and Formulating the Research Question: In this step, a clinical question is transformed into a research question. There are 3 key components of the research question: the study sample (or patient cohort), the exposure of interest (e.g., information about patient demographics, lifestyle habits, medical history, regular health checkup results), and the outcome of interest (e.g., whether a patient has diabetes after 5 years).
  2. Data Preparation and Integration: Extracted raw data can come from different data sources or sit in separate datasets with different representations and formats. Data preparation and integration is the process of combining and reorganizing data derived from various data sources (such as databases, flat files, etc.) into a consistent dataset that contains all the information required for the desired statistical analysis.
  3. Exploratory Data Analysis / Data Understanding: Before statistical and machine learning models are employed, there is an important step of exploring the data, which is key to understanding the type of information that has been collected and what it means. Data exploration consists of investigating the distributions of variables, the patterns and nature of the data, and checking the quality of the underlying data. This preliminary examination will influence which methods are most suitable for the data preprocessing step and for choosing the appropriate predictive model.
  4. Data Preprocessing: Data preprocessing is one of the most important steps and is critical to the success of machine learning techniques. Electronic health records (EHR) were often collected for clinical purposes, so these databases can have many data quality issues. Preprocessing aims at assessing and improving the quality of the data to allow for reliable statistical analysis.
  5. Feature Selection: The final dataset may have several hundred data fields, and not all of them are relevant to explaining the target variable. In many machine learning algorithms, high dimensionality can cause overfitting or reduce the accuracy of the model instead of improving it. Feature selection algorithms are used to identify features that have an important predictive role. These techniques do not change the content of the initial feature set; they only select a subset of it. The purpose of feature selection is to help create optimized, cost-effective models with better prediction performance.
  6. Predictive Modeling: To develop prediction models, statistical models and machine learning algorithms can be employed. The purpose of machine learning is to design and develop prediction models by allowing the computer to learn from data or experience to solve a certain problem. These models are useful for understanding the system under study, and they can be divided according to the type of outcome they produce: classification, regression, or clustering models.
  7. Prediction and Model Evaluation: This step evaluates the performance of the predictive models. The evaluation should include internal and external validation. Internal validation refers to evaluating model performance on the same dataset in which the model was developed. External validation is the evaluation of a prediction model on other populations with different characteristics, to assess the generalizability of the model.
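To make steps 4 to 7 above concrete, here is a minimal, hypothetical sketch using scikit-learn; the file name and the column names (age, bmi, hba1c, diabetes_after_5y, and so on) are placeholders and not from a real dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("health_checkup.csv")                     # hypothetical EHR-derived dataset
X = df[["age", "bmi", "sbp", "dbp", "hba1c", "triglyceride"]]
y = df["diabetes_after_5y"]                                # outcome of interest

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),          # step 4: preprocessing
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=4)),               # step 5: feature selection
    ("clf", LogisticRegression(max_iter=1000)),            # step 6: predictive model
])
model.fit(X_train, y_train)
print("internal validation AUC:",                          # step 7: model evaluation
      roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))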

References:

[1] Fundamentals of Biostatistics – Bernard Rosner, Harvard University

[2] Secondary Analysis of Electronic Health Records – Springer Open


 

Importance of TinyML

Introduction

Tiny Machine Learning (TinyML) [1] is, unsurprisingly, a machine learning technique, but one that is often utilized in building machine learning applications that require high performance yet have limited hardware: typically a tiny neural network running on a microcontroller with very low power requirements (sometimes < 1 mW).

Figure 1: TinyML, the next AI revolution [5]

TinyML is often implemented in low-energy systems such as microcontrollers or sensors to perform automated tasks; Internet of Things (IoT) devices are a typical example. However, the biggest challenge in implementing TinyML is that it requires “full-stack” engineers or data scientists with profound knowledge of building hardware, designing system architectures, and developing software and applications.

TinyML, IoT, and embedded systems

As described in [2], the Internet of Things (IoT) refers to the network of physical objects (a.k.a. things) that are embedded with sensors, software, and other technologies to connect and exchange data with other devices and systems over the Internet. Therefore, TinyML can be applied to most IoT devices to enhance their data collection and data processing. In other words, as argued by many machine learning experts, the relationship between TinyML, IoT, and embedded systems will be a long-lasting one (TinyML belongs to the IoT world).


Applications

Figure 2: One commercial application of TinyML in a smart house [6]

In the coming era of information explosion, TinyML will enable many brilliant applications that help us reduce the stress of processing data. Some examples include:

In agriculture: Profit losses due to animal illnesses can be reduced by using wearable devices. These smart sensors can help monitor health vitals such as heart rate, blood pressure, and temperature, and TinyML will be useful in predicting the onset of diseases and epidemics.

In industry:  TinyML can prevent downtime due to equipment failure by enabling real-time decisions without human interaction in the manufacturing sector. It can signal workers to perform preventative maintenance when necessary, based on equipment conditions.

In retail: TinyML can help increase profits in indirect ways by providing effective means for warehouse or store monitoring. As smart sensors will likely become popular in the future, they could be utilized in small stores, supermarkets, or hypermarkets to monitor in-store shelves, and TinyML will be useful in processing those data to prevent items from going out of stock. Humans will enjoy the endless benefits that come from these ML-based applications in the economic sector.

In mobility: TinyML will give sensors more power to ingest real-time traffic data. Once those sensors are applied in reality, humans will no longer need to worry about traffic-related issues (such as traffic jams and traffic accidents).

Imagine when all the sensors in the embedded systems mentioned in the above applications are connected through a super-fast Internet connection and every TinyML algorithm is coordinated by a giant ML system. That is a time when humans can take advantage of computing power for performing boring tasks. We will certainly feel happier, have more time for our families, and have more time to come up with important decisions.


First glance at the potential of TinyML

According to a survey done by ABI [3], by 2030 there will be almost 250 billion microcontrollers in our printers, TVs, cars, and pacemakers that can perform tasks that previously only our computers and smartphones could handle. All of our devices and appliances are getting smarter thanks to microcontrollers. In addition, in [4] Silent Intelligence also predicts that TinyML can reach more than $70 billion in economic value by the end of 2025. From 2016 to 2020, the number of microcontrollers (MCUs) increased rapidly, and this figure is predicted to keep rising over the next 3 years.

Machine Learning with AWS Rekognition

What is AWS Rekognition?

Amazon Rekognition is a service that makes it easy to add powerful image- and video-based visual analysis to your applications.

Rekognition Image lets you easily build powerful applications to search, verify, and organize millions of images.

Rekognition Video lets you extract motion-based context from stored or live-streamed videos and helps you analyze them.

You just provide an image or video to the Rekognition API, and the service can identify objects, people, text, scenes, and activities. It can detect inappropriate content as well.

Amazon Rekognition also provides highly accurate facial analysis and facial recognition. You can detect, analyze, and compare faces for a wide variety of use cases, including user verification, cataloging, people counting, and public safety.

Amazon Rekognition is a HIPAA-eligible service.

You need to ensure that the Amazon S3 bucket you want to use is in the same region as your Amazon Rekognition API endpoint.

How does it work?

Amazon Rekognition provides two API sets: Amazon Rekognition Image for analyzing images and Amazon Rekognition Video for analyzing videos.

Both API sets perform detection and recognition analysis of images and videos.
Amazon Rekognition Video can be used to track the path of people in a stored video.
Amazon Rekognition Video can also search a streaming video for persons whose facial descriptions match facial descriptions already stored by Amazon Rekognition.

The RecognizeCelebrities API returns information for up to 100 celebrities detected in an image.
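As a minimal sketch of calling the service (it assumes the boto3 package, configured AWS credentials, and placeholder bucket and object names), detecting labels in an image stored on S3 looks like this:

import boto3

client = boto3.client("rekognition", region_name="us-east-1")

response = client.detect_labels(
    Image={"S3Object": {"Bucket": "my-bucket", "Name": "photo.jpg"}},   # placeholders
    MaxLabels=10,
    MinConfidence=80,
)
for label in response["Labels"]:
    print(label["Name"], round(label["Confidence"], 1))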

Use cases for AWS Rekognition

Searchable image and video libraries

Amazon Rekognition makes images and stored videos searchable.

Face-based user verification

It can be used in building-access and similar applications that compare a live image to a reference image.

Sentiment and demographic analysis

Amazon Rekognition detects emotions such as happiness, sadness, or surprise, and demographic information such as gender, from facial images.

Rekognition can analyze images and send the emotion and demographic attributes to Amazon Redshift for periodic reporting on trends, such as by in-store location and similar scenarios.

Facial recognition

Images, stored videos, and streaming videos can be searched for faces that match those in a face collection. A face collection is an index of faces that you own and manage.

Unsafe Content Detection

Amazon Rekognition can detect explicit and suggestive adult content in images and videos.

This is useful, for example, for social and dating sites, photo-sharing platforms, blogs and forums, apps for children, e-commerce sites, entertainment, and online advertising services.

Celebrity recognition

Amazon Rekognition can recognize thousands of celebrities (politicians, sports figures, business people, entertainers, and media personalities) within supplied images and videos.

Text detection

Detecting text in an image allows textual content to be extracted from images.

Benefits

Integrate powerful image and video recognition into your apps

Amazon Rekognition removes the complexity of building image recognition capabilities into applications by making powerful and accurate analysis available with a simple API.

Deep learning-based image and video analysis

Rekognition uses deep learning technology to accurately analyze images, find and compare faces in images, and detect objects and scenes within images and videos.

Scalable image analysis

Amazon Rekognition enables the analysis of millions of images.
This allows for curating and organizing massive amounts of visual data.

Low cost

Clients pay for the images and videos they analyze and the face metadata they store. There are no minimum fees or upfront commitments.

Basic time-related machine learning models

Introduction

With data that has time-related information, time features can be created to possibly add more information to the models.

Since how to handle time series in machine learning is a broad topic, this article only aims to introduce basic ways to create time features for those models.

Type of data that is expected for this application

Transaction data, or any kind of data similar to it, is expected to be the most common type for this approach. Other kinds of data that have timestamp information for each data point should also work with this approach to some extent.

Considering before the attempt: the need to analyze the problem and scope

Data with a time element can be presented as a time series: a set of data points describing an entity, ordered by time index. One aspect to consider for time series is that observations are expected to depend on previous ones in the sequence, with each value correlated to the one before it. In those cases, using time series models for forecasting is a straightforward way to use this data. Another way is to use feature engineering to transform the data into features that can be used by supervised machine learning models, which is the focus of this article.

Whether to use a time series model or to adapt a machine learning model depends on the situation. In some cases, domain knowledge or a business requirement will influence this decision. It is better to analyze the problem first to see whether one or both types of models are needed.

Regardless of the domain knowledge or business requirement aspects, the decision should always consider the efficiency the approach will bring in terms of accuracy and computational cost.

Basic methods

A first preprocessing step for the first set of time features: extracting time information from timestamps

The most straightforward thing to do is to extract basic time units, such as hour, date, month, and year, into separate features. Another kind of information that can also be extracted is the characteristic of the time, which could be whether the time falls in a certain part of the day (morning, afternoon), whether it is a weekend, whether it is a holiday, etc.

In some business requirements or domains, those initial features are already needed to see whether the value of an observation follows those factors. For example, suppose the data is a record of timestamps of customers visiting a shop and their purchases. There is a need to know at which hour, date, or month a customer would come and purchase, so that follow-up actions can be made to increase sales.
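A minimal sketch of this first step with pandas, assuming a hypothetical transaction DataFrame with a timestamp column and an amount column:

import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-03-01 09:15", "2021-03-06 14:40", "2021-03-07 20:05"]),
    "amount": [12.0, 35.5, 8.9],
})

df["hour"] = df["timestamp"].dt.hour
df["day"] = df["timestamp"].dt.day
df["month"] = df["timestamp"].dt.month
df["year"] = df["timestamp"].dt.year
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5
df["part_of_day"] = pd.cut(df["hour"], bins=[0, 6, 12, 18, 24], right=False,
                           labels=["night", "morning", "afternoon", "evening"])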

Aggregate techniques

Regarding feature engineering for time data, a well-known and commonly used technique is to aggregate features by taking statistics (variance, max, min, etc.) of the set of values grouped by the desired time unit: hours, days, months, and so on.

Apart from that, a time window can be defined and aggregates computed by rolling or expanding that window, as sketched right after the list below:

  • Rolling: use a fixed time window size; to predict a value for a data point at a given time, features are computed by aggregating the backward number of time steps covered by the window.
  • Expanding: from the data point, the window covers the whole record of past time steps.
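A minimal pandas sketch of these rolling and expanding aggregates, reusing the hypothetical transaction DataFrame from the previous sketch:

daily = df.set_index("timestamp")["amount"].resample("D").sum()   # aggregate by day

rolling_mean_7d = daily.rolling(window=7).mean()    # fixed 7-day window ending at each day
rolling_max_7d = daily.rolling(window=7).max()
expanding_mean = daily.expanding().mean()           # all past days up to each point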

There are also two aspects of aggregating:

  • Aggregating to create new features for the current data points. In this case, the model is considered to include the time series characteristic, meaning a moment is likely related to other moments in the recent past.
  • Aggregating to create a new set of data points, with a corresponding new set of features, from the current ones. Here, the number of data points considered by the model changes, and each new data point is a summary of the information from a subset of the initial data points. As a result, the objects of the model may be shifted, as mentioned in the earlier section on considerations. If the data contains only the record of one entity, in other words only one time series, then through this technique the new computed data points can be summaries of the other features' values in the chosen time unit. On the other hand, if more entities are observed in the data set, each new data point is then the summary information of one observed entity.

How to decide on the focus objects for the problem and the approach is situational, but for a fresh problem and fresh data with no specific requirement or prior domain knowledge, it is better to consider all of them for the model and run feature selection to see whether the created time features have any value.

Dealing with hours of a day – Circular data

For some needs, a specific time of the day needs to be the focus. A use case for detecting fraudulent transactions is a good example. To find something like the most frequent time at which a kind of behavior is performed, using the arithmetic mean may be misleading and is not a good representation. An important point to consider is that the hour of the day is circular data, and it should be represented on a circular axis with values between 0 and 2π. To better represent the mean, using the von Mises distribution to obtain a periodic mean is a suitable approach for this situation (Mishtert, 2019).
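A minimal sketch of such a periodic mean for hour-of-day values, using the standard circular-mean formula rather than a full von Mises fit:

import numpy as np

hours = np.array([22, 23, 0, 1, 2])                 # events clustered around midnight
angles = 2 * np.pi * hours / 24
mean_angle = np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())
mean_hour = (mean_angle * 24 / (2 * np.pi)) % 24
print(mean_hour)    # 0.0 (midnight); the arithmetic mean would misleadingly give 9.6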

Validation for the model

Before building the model, a validation set needs to be selected from the data. In the usual case, to avoid overfitting, data is randomly shuffled and then divided into a training set and a validation set. However, for this kind of problem it should not be done that way, to avoid the mistake of having past data in the validation set and future data in the training set; in other words, using the future to predict the past.
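A minimal sketch of a time-ordered split on the hypothetical transaction DataFrame, together with scikit-learn's TimeSeriesSplit, which applies the same idea to cross-validation:

from sklearn.model_selection import TimeSeriesSplit

df = df.sort_values("timestamp")
cutoff = int(len(df) * 0.8)
train, valid = df.iloc[:cutoff], df.iloc[cutoff:]   # past -> training, future -> validation

for train_idx, valid_idx in TimeSeriesSplit(n_splits=3).split(df):
    # each split only ever validates on rows that come after its training window
    pass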

Anytime Algorithm For Machine Learning

Definition

In computer science, most algorithms run to completion: they provide a single answer after performing some fixed amount of computation. Nowadays, however, in the machine learning era, models take a long time to train and predict, and the user may wish to terminate the algorithm before completion. That is where the anytime algorithm comes in: an algorithm that can return a valid solution even if it is interrupted before it ends. The longer it keeps running, the better the solution the user gets.

What makes anytime algorithms unique is their ability to return many possible outcomes for any given input. An anytime algorithm uses well-defined quality measures to monitor progress in problem-solving and the use of distributed computing resources. While this may sound like dynamic programming, the difference is that an anytime algorithm is fine-tuned through random adjustments rather than sequentially.

Figure 1: The expected performance of an anytime algorithm

Algorithm prerequisites

Initialize: While some algorithms start with immediate guesses, an anytime algorithm takes a more calculated approach and has a start-up period before making any guesses.

Growth direction: How the quality of the program's output, or result, varies with run time.

Growth rate: The amount of improvement at each step. Does it change constantly or unpredictably?

End condition: The amount of runtime needed.

Case studies

  • Clustering with time series

In clustering, we must compute the distance or similarity between pairs of time series, and we also have a lot of measurements. Much research has shown that dynamic time warping (DTW) is more robust than other measures, but the O(N²) complexity of DTW makes the computation troublesome.

With an anytime algorithm, life is easier than ever. Below is pseudocode for an anytime clustering algorithm.

Algorithm [Clusters] = AnytimeClustering(Dataset)
1.  aDTW = BuildApproDistMatrix(Dataset)
2.  Clusters = Clustering(aDTW, Dataset)
3.  Disp("Setup is done, interruption is possible")
4.  O = OrderToUpdateDTW(Dataset)
5.  For i = 1:Length(O)
6.     aDTW(O(i)) = DTW(Dataset, O(i))
7.     if UserInterruptIsTrue()
8.        Clusters = Clustering(aDTW, Dataset)
9.        if UserTerminateIsTrue()
10.          return
11.       endif
12.    endif
13. endfor
14. Clusters = Clustering(aDTW, Dataset)

Firstly, we approximate the distance matrix of the dataset.

Secondly, we cluster to get a very basic result.

Thirdly, we use a heuristic to find a good order in which to update the distance matrix; after each update, we get a better result than before.

Lastly, if the algorithm keeps running long enough, it finishes the task and we get the final, optimal output.

  • Learning optimal Bayesian networks

Another common example of an anytime algorithm is searching for an optimal result in Bayesian networks. As we know, learning an optimal Bayesian network is NP-hard. If we have too many variables or events, an exact algorithm will fail due to limited time and memory. That is why the anytime weighted A* algorithm is often applied to find an optimal result.

Figure 2: A Bayesian network of 4 variables

The weighted A* algorithm minimizes the following cost function:

f(n) = g(n) + ε * h(n)

where n is the current node on the graph, g(n) is the cost of the path from the start node to n, h(n) is a heuristic that estimates the cost of the cheapest path from n to the goal node, and ε is gradually lowered to 1. The anytime algorithm relies on this cost function to stop and continue at any time.
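Below is a minimal, generic sketch of one weighted A* pass over an abstract graph (not the Bayesian-network-specific search itself); the anytime behaviour comes from re-running it with ε lowered toward 1 and keeping the best path found so far.

import heapq
from itertools import count

def weighted_a_star(start, goal, neighbors, heuristic, epsilon=2.0):
    # One pass that expands nodes in order of f(n) = g(n) + epsilon * h(n).
    tie = count()                                    # tie-breaker so nodes are never compared
    open_heap = [(epsilon * heuristic(start), next(tie), start)]
    g = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, _, n = heapq.heappop(open_heap)
        if n == goal:                                # reconstruct the path back to start
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return list(reversed(path)), g[goal]
        for m, cost in neighbors(n):                 # neighbors(n) yields (node, edge_cost) pairs
            new_g = g[n] + cost
            if new_g < g.get(m, float('inf')):
                g[m] = new_g
                parent[m] = n
                heapq.heappush(open_heap, (new_g + epsilon * heuristic(m), next(tie), m))
    return None, float('inf')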

Overfitting in Machine Learning

In any data science project, once the variables that correspond to business objectives are defined, typically the project proceeds with the following aims: (1) to understand the data quantitatively, (2) to describe the data by producing a ‘model’ that represents adequately the data at hand, and (3) to use this model to predict the outcome of those variables from ‘future’ observed data.

Anyone dealing with machine learning is bound to be familiar with the phenomenon of overfitting, one of the first things taught to students, and probably one of the problems that will continue to shadow us in any data-centric job: when the predictive model works very well in the lab, but behaves poorly when deployed in the real world.

What is Overfitting?

Overfitting is defined as “the production of an analysis which corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.”

Whether a regression, a decision tree, or a neural network, different models are constructed differently: some depend on a few features (in the case of a linear model), while some are too complex to visualize and understand at a glance (in the case of a deep neural network). Since there is inherent noise in data, the model fitting (‘learning’) process can either learn too little of the pattern (‘underfit’) or learn too many ‘false’ patterns discernible in the ‘noise’ of the seen data (‘overfit’) that are not meant to be present in the intended application and end up distorting the model. Typically, this can be illustrated in the diagram below:

[Figure: illustration of underfitting, good fit, and overfitting]

Detecting overfitting

In Machine Learning (ML), training and testing are often done on different parts of the available data. The models — whether they be trees, neural networks, or equations — that have characterized the training data very well are then validated on a ‘different’ part of the data. Overfitting is detected when the model has learned too many ‘false patterns’ from the training data that do not generalize to the validation data. As such, most ML practitioners reduce the problem of overfitting to a matter of knowing when to stop training, typically by choosing an ‘early stopping’ value around the red dotted line, illustrated in the loss vs. number of iterations graph below.

[Figure: training and validation loss vs. number of iterations, with the early-stopping point marked by the red dotted line]

If detecting overfitting is reduced to a matter of “learning just enough”, inherent to the contemporary ML training process, it is assumed to be “solved” when the model works on a portion of the held-out data. In some cases, k-fold cross-validation is performed, which splits the data into k ‘folds’: each fold is held out in turn as the test set while the remaining k-1 folds are used for training. Given that most models do not make it to the crucible of the production environment, most practitioners stop at demonstrating (in a lab-controlled environment) that the model “works”, when the numbers from the cross-validation statistics check out fine, showing consistent performance across all folds.
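A minimal sketch of such a check with scikit-learn's cross_val_score on a bundled dataset; roughly consistent scores across the folds are what practitioners usually read as "no gross overfitting":

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print(scores, scores.mean(), scores.std())   # similar scores across folds, small spread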

But.. is it so simple?

In simpler or more trivial models, overfitting may not even be noticeable! However, with more rigorous testing, or when the model is lucky enough to be deployed to the production stage, there is usually more than one validation set (or more than one client’s data!). It is not infrequent that various validation/real-world datasets do not ‘converge’ at the same rate, making the task of choosing an early stopping point more challenging. With just two different validation data sets, one can observe the following:

[Figure: loss curves for two different validation sets]

What about this curve?

[Figure: another set of loss curves where the validation losses diverge]

When we get curves that look like the graphs above, it is harder to answer these questions: When does overfitting ‘begin’? How do we deal with multiple loss functions? What is happening here?

Digging your way out

Textbook ML prescribes many ‘solutions’ — which might just work if you are lucky — sometimes without you knowing exactly what is the problem:

  • Data augmentation:

    Deep down, most overfitting problems would go away if we had an abundant amount of ‘good enough’ data, which for most data scientists is a utopian dream. Limited resources make it impossible to collect the perfect data: clean, complete, unbiased, independent, and cheap. For many neural network ML pipelines, data augmentation has become de rigueur; it aims to keep the model from learning characteristics unique to the dataset by multiplying the amount of training data available. Without collecting more data, the data can be ‘multiplied’ by varying it through small perturbations so that the samples seem different to the model. Whether or not this approach is effective will depend on what your model is learning!

  • Regularization:

    The regularization term is routinely added in model fitting to penalize overly complex models. In regression methods, L1 or L2 penalty terms are often added to encourage smaller coefficients and thus ‘simpler’ models. In decision trees, methods such as pruning or limiting the maximum tree depth are typical ways to ‘keep the model simple’. Combined with ensemble methods, e.g. bagging or boosting, they can be used to avoid overfitting and make use of multiple weak learners to arrive at a robust predictor. In DL techniques, regularization is achieved by introducing a dropout layer that randomly turns off neurons during training; a minimal sketch follows below.
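A minimal Keras sketch combining an L2 weight penalty with a dropout layer; the layer sizes and the 20-feature input are arbitrary placeholders.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

model = tf.keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(20,),
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 penalty on the weights
    layers.Dropout(0.3),                                      # randomly turns off 30% of neurons
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])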

If you have gone through all those steps outlined above and still feel somehow that you just got lucky with a particular training/validation data set, you are not alone. We need to dig deeper to truly understand what is happening and get out of this sticky mess, or accept the risk of deploying such limited models and set up a future trap for ourselves!

Implementation for Adversarially Constrained Autoencoder Interpolation (ACAI)

Introduction

Autoencoders provide a powerful framework for learning compressed representations by encoding all of the information needed to reconstruct a data point in a latent code. In some cases, autoencoders can “interpolate”: By decoding the convex combination of the latent codes for two data points, the autoencoder can produce an output that semantically mixes characteristics from the data points. In this paper, we propose a regularization procedure that encourages interpolated outputs to appear more realistic by fooling a critic network that has been trained to recover the mixing coefficient from interpolated data. We then develop a simple benchmark task where we can quantitatively measure the extent to which various autoencoders can interpolate and show that our regularizer dramatically improves interpolation in this setting. We also demonstrate empirically that our regularizer produces latent codes which are more effective on downstream tasks, suggesting a possible link between interpolation abilities and learning useful representations. – [1]

The idea comes from the paper “Understanding and Improving Interpolation in Autoencoders via an Adversarial Regularizer” (https://arxiv.org/abs/1807.07543), also known as the ACAI framework.

Today I will walk through the implementation of this fantastic idea. The implementation is based on TensorFlow 2.0 and Python 3.6. Let's start!

Implementation

First, we need to import some dependency packages.

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.layers import Input, Dense, Reshape, Flatten, Dropout, multiply, GaussianNoise
from tensorflow.keras.layers import BatchNormalization, Activation, Embedding, ZeroPadding2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import LeakyReLU
from tensorflow.keras.layers import UpSampling2D, Conv2D
from tensorflow.keras.layers import Lambda
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import losses
from tensorflow.keras.utils import to_categorical
import tensorflow.keras.backend as K

import matplotlib.pyplot as plt

import numpy as np
import tqdm
import os

import io
from PIL import Image
from sklearn.decomposition import PCA
# from sklearn.manifold import TSNE
import seaborn as sns
import pandas as pd

Next, we define the overall framework of ACAI, which is composed of three parts: encoder, decoder, and discriminator (also called the critic in the paper).

class ACAI():
    def __init__(self, img_shape=(28,28), latent_dim=32, disc_reg_coef=0.2, ae_reg_coef=0.5, dropout=0.0):
        self.latent_dim = latent_dim
        self.ae_optim = Adam(0.0001)
        self.d_optim = Adam(0.0001)
        self.img_shape = img_shape
        self.dropout = dropout
        self.disc_reg_coef = disc_reg_coef
        self.ae_reg_coef = ae_reg_coef
        self.intitializer = tf.keras.initializers.VarianceScaling(
                            scale=0.2, mode='fan_in', distribution='truncated_normal')
        self.initialize_models(self.img_shape, self.latent_dim)

    def initialize_models(self, img_shape, latent_dim):
        self.encoder = self.build_encoder(img_shape, latent_dim)
        self.decoder = self.build_decoder(latent_dim, img_shape)
        self.discriminator = self.build_discriminator(latent_dim, img_shape)
        
        img = Input(shape=img_shape)
        latent = self.encoder(img)
        res_img = self.decoder(latent)
        
        self.autoencoder = Model(img, res_img)
        discri_out = self.discriminator(img)


    def build_encoder(self, img_shape, latent_dim):
        encoder = Sequential(name='encoder')
        encoder.add(Flatten(input_shape=img_shape))
        encoder.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        encoder.add(Dropout(self.dropout))
        encoder.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        encoder.add(Dropout(self.dropout))
        encoder.add(Dense(latent_dim))
        
        encoder.summary()
        return encoder
    
    def build_decoder(self, latent_dim, img_shape):
        decoder = Sequential(name='decoder')
        decoder.add(Dense(1000, input_dim=latent_dim, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        decoder.add(Dropout(self.dropout))
        decoder.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        decoder.add(Dropout(self.dropout))
        decoder.add(Dense(np.prod(img_shape), activation='sigmoid'))
        decoder.add(Reshape(img_shape))
        
        decoder.summary()
        return decoder

    def build_discriminator(self, latent_dim, img_shape):
        discriminator = Sequential(name='discriminator')
        discriminator.add(Flatten(input_shape=img_shape))
        discriminator.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        discriminator.add(Dropout(self.dropout))
        discriminator.add(Dense(1000, activation=tf.nn.leaky_relu, kernel_initializer=self.intitializer))
        discriminator.add(Dropout(self.dropout))
        discriminator.add(Dense(latent_dim))

        # discriminator.add(Reshape((-1,)))
        discriminator.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))
        
        discriminator.summary()
        return discriminator

Some utility functions for monitoring the results:

def make_image_grid(imgs, shape, prefix, save_path, is_show=False):
    # Find the implementation in below github repo

def flip_tensor(t):
    # Find the implementation in below github repo

def plot_to_image(figure):
    # Find the implementation in below github repo

def visualize_latent_space(x, labels, n_clusters, range_lim=(-80, 80), perplexity=40, is_save=False, save_path=None):
     # Find the implementation in below github repo

Next, we define the training worker, which is called at each epoch:

@tf.function
def train_on_batch(x, y, model: ACAI):
    # Randomize the interpolation coefficient alpha
    alpha = tf.random.uniform((x.shape[0], 1), 0, 1)
    alpha = 0.5 - tf.abs(alpha - 0.5)  # Make interval [0, 0.5]

    with tf.GradientTape() as ae_tape, tf.GradientTape() as d_tape:
        # Constructs non-interpolated latent space and decoded input
        latent = model.encoder(x, training=True)
        res_x = model.decoder(latent, training=True)

        ae_loss = tf.reduce_mean(tf.losses.mean_squared_error(tf.reshape(x, (x.shape[0], -1)), tf.reshape(res_x, (res_x.shape[0], -1))))

        inp_latent = alpha * latent + (1 - alpha) * latent[::-1]
        res_x_hat = model.decoder(inp_latent, training=False)

        pred_alpha = model.discriminator(res_x_hat, training=True)
        # pred_alpha = K.mean(pred_alpha, [1,2,3])
        temp = model.discriminator(res_x + model.disc_reg_coef * (x - res_x), training=True)
        # temp = K.mean(temp, [1,2,3])
        disc_loss_term_1 = tf.reduce_mean(tf.square(pred_alpha - alpha))
        disc_loss_term_2 = tf.reduce_mean(tf.square(temp))

        reg_ae_loss = model.ae_reg_coef * tf.reduce_mean(tf.square(pred_alpha))

        total_ae_loss = ae_loss + reg_ae_loss
        total_d_loss = disc_loss_term_1 + disc_loss_term_2

    grad_ae = ae_tape.gradient(total_ae_loss, model.autoencoder.trainable_variables)
    grad_d = d_tape.gradient(total_d_loss, model.discriminator.trainable_variables)

    model.ae_optim.apply_gradients(zip(grad_ae, model.autoencoder.trainable_variables))
    model.d_optim.apply_gradients(zip(grad_d, model.discriminator.trainable_variables))

    return {
        'res_ae_loss': ae_loss,
        'reg_ae_loss': reg_ae_loss,
        'disc_loss': disc_loss_term_1,
        'reg_disc_loss': disc_loss_term_2

    }

Next, we need to define a main training function:

def train(model: ACAI, x_train, y_train, x_test,
          batch_size, epochs=1000, save_interval=200,
          save_path='./images'):
    n_epochs = tqdm.tqdm_notebook(range(epochs))
    total_batches = x_train.shape[0] // batch_size
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    for epoch in n_epochs:
        offset = 0
        losses = []
        for batch_iter in range(total_batches):
            # Take the next contiguous batch (the last batch absorbs the remainder)
            imgs = x_train[offset:offset + batch_size,::] if (batch_iter < (total_batches - 1)) else x_train[offset:,::]
            offset += batch_size

            loss = train_on_batch(imgs, None, model)
            losses.append(loss)

        avg_loss = avg_losses(losses)  # helper that averages the per-batch loss dicts (see the github repo linked below)
        # wandb.log({'losses': avg_loss})
            
        if epoch % save_interval == 0 or (epoch == epochs - 1):
            sampled_imgs = model.autoencoder(x_test[:100])
            res_img = make_image_grid(sampled_imgs.numpy(), (28,28), str(epoch), save_path)
            
            latent = model.encoder(x_train, training=False).numpy()
            latent_space_img = visualize_latent_space(latent, y_train, 10, is_save=True, save_path=f'./latent_space/{epoch}.png')
            # wandb.log({'res_test_img': [wandb.Image(res_img, caption="Reconstructed images")],
            #            'latent_space': [wandb.Image(latent_space_img, caption="Latent space")]})
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype(np.float32) / 255.
x_test = x_test.astype(np.float32) / 255.
ann = ACAI(dropout=0.5)
train(model=ann,
        x_train=x_train,
        y_train=y_train,
        x_test=x_test,
        batch_size=x_train.shape[0]//4,
        epochs=2000,
        save_interval=50,
        save_path='./images')

Results

Some of the results from ACAI after training finishes:

First is the visualization of the MNIST dataset after being encoded by the encoder. We can see that the clusters are well separated, and applying downstream tasks on the latent space should lead to significant improvement in comparison to raw data (for clustering, try KMeans and check it out yourself :D).

[Figure: visualization of the MNIST latent space produced by the ACAI encoder]

Second is the visualization of interpolation power on latent space:

  • Interpolation with alpha values in range [0,1.0] with step 0.1.
  • The first and final rows are the source and destination images, respectively.
  • Formula:
mix_latent = alpha * src_latent + (1 - alpha) * dst_latent

[Figure: interpolation grid from source digits (top row) to destination digits (bottom row)]

We can see that there is a very smooth morphing from the digits in the top row to the digits in the bottom row.

The whole running code is available on GitHub (acai_notebook). Now it's your turn to play with the paper :D.

Reference

[1] David Berthelot, Colin Raffel, Aurko Roy, and Ian Goodfellow. Understanding and improving interpolation in autoencoders via an adversarial regularizer, 2018.


Pytorch part 1: Introducing Pytorch

PyTorch is a deep learning framework and a scientific computing package.
The scientific computing aspect of PyTorch is primarily a result of PyTorch's tensor library and its associated tensor operations. That means you can take advantage of PyTorch for many computing tasks, thanks to its tensor operation support, without touching the deep learning modules.

It is important to note that PyTorch tensors and their associated operations are very similar to NumPy n-dimensional arrays. A tensor is actually an n-dimensional array.


PyTorch builds its library around the Object-Oriented Programming (OOP) concept. With object-oriented programming, we orient our program design and structure around objects. A tensor in PyTorch is represented by the torch.Tensor object, which can be created from a NumPy ndarray; the two objects then share memory. This makes the transition between PyTorch and NumPy very cheap from a performance perspective.


With PyTorch tensors, GPU support is built in. It is very easy with PyTorch to move tensors to and from a GPU if we have one installed on our system. Tensors are super important for deep learning and neural networks because they are the data structure that we ultimately use for building and training our neural networks.
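A minimal sketch of both points: a tensor created with torch.from_numpy shares memory with its NumPy array, and moving a tensor to a GPU is a one-liner when CUDA is available.

import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])
t = torch.from_numpy(a)       # shares the underlying buffer with `a`
a[0] = 100.0
print(t)                      # tensor([100., 2., 3.], dtype=torch.float64)

device = "cuda" if torch.cuda.is_available() else "cpu"
t_gpu = t.to(device)          # copies to the GPU, or stays on the CPU if none is installed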
Talking a bit about history.

The initial release of PyTorch was in October 2016. Before PyTorch was created, there was, and still is, another framework called Torch, which is also a machine learning framework but is based on the Lua programming language. The connection between PyTorch and this Lua version, called Torch, exists because many of the developers who maintain the Lua version are the individuals who created PyTorch, and they have been working at Facebook ever since.


Below are the primary PyTorch modules we’ll be learning about and using as we build neural networks along the way.

Image 1. PyTorch package description

Why use Pytorch for deep learning?

  • PyTorch’s design is modern, Pythonic. When we build neural networks with PyTorch, we are super close to programming neural networks from scratch. When we write PyTorch code, we are just writing and extending standard Python classes, and when we debug PyTorch code, we are using the standard Python debugger. It’s written mostly in Python and only drops into C++ and CUDA code for operations that are performance bottlenecks.
  • It is a thin framework, which makes it more likely that PyTorch will be capable of adapting to the rapidly evolving deep learning environment as things change quickly over time.
  • It stays out of the way, which lets us focus on neural networks and less on the actual framework.

Why PyTorch is great for deep learning research


The reason for this research suitability is that PyTorch uses a dynamic computational graph, in contrast with TensorFlow, which uses a static computational graph, to calculate derivatives.


Computational graphs are used to graph the function operations that occur on tensors inside neural networks. These graphs are then used to compute the derivatives needed to optimize the neural network. A dynamic computational graph means that the graph is generated on the fly as the operations are created, whereas static graphs are fully determined before the actual operations occur.
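A minimal sketch of the dynamic graph: the graph is built as the operations execute, so ordinary Python control flow shapes it, and backward() then walks it to compute derivatives.

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2
if y > 1:              # plain Python branching takes part in building the graph
    z = 3 * y
else:
    z = y
z.backward()
print(x.grad)          # dz/dx = 6x = 12.0 at x = 2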