## k-Nearest Neighbors algorithms

In this blog post, I am going to introduce one of the most intuitive algorithms in the field of Supervised Learning[1], the k-Nearest Neighbors algorithm (kNN).

# The original k-Nearest Neighbors algorithm

The kNN algorithm is very intuitive. Under the assumption that items close together in the dataset are typically similar, kNN infers the output of a new sample by first computing a distance score against every sample in the training dataset. From there, it creates a ‘neighbor zone’ by selecting the samples that are ‘near’ the candidate one, and performs the supervised task based on the samples lying inside that zone. The task can be either classification or regression.

Let’s start with the basic kNN algorithm. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ be our training dataset with $N$ samples belonging to $C$ classes, where $y_i$ is the class of one sample and $x_i$ denotes the corresponding feature vector that describes the characteristics of that sample. Furthermore, it is necessary to define a suitable distance metric, since it drives the algorithm to select neighbors and make predictions later on. A distance metric $d$ is a mapping $d: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ over a vector space $\mathcal{X}$, where the following conditions are satisfied for all $x, y, z \in \mathcal{X}$:

• $d(x, y) \geq 0$, and $d(x, y) = 0$ if and only if $x = y$
• $d(x, y) = d(y, x)$ (symmetry)
• $d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality)

In the following steps to describe the k-Nearest Neighbors algorithm, the Euclidean distance $d(x, x') = \lVert x - x' \rVert_2$ will be used as the distance metric.

For any new instance $x$:

• Find $N_k(x)$, the set of the $k$ samples in the training dataset that are closest to $x$.
• The nearest neighbors are selected according to the distance metric $d$ (note that we are using the Euclidean distance).
• The classifier is defined as:

$$\hat{y} = \arg\max_{c \in \{1, \dots, C\}} \sum_{x_i \in N_k(x)} \mathbb{1}(y_i = c),$$

where $\mathbb{1}$ is the unit (indicator) function. Note that for the regression problem, the prediction is simply the average of the response values of the neighboring samples.
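To make the steps above concrete, here is a minimal sketch (not from the original post) of the plain kNN classifier, using Euclidean distance; the toy data, array names, and the choice of k are illustrative only.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest samples (the 'neighbor zone')
    neighbor_idx = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[neighbor_idx])
    return votes.most_common(1)[0][0]

# Toy usage: two small clusters in 2-D
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> 1
```

For the regression case, the majority vote would simply be replaced by the mean of `y_train[neighbor_idx]`.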


# Weighted k-Nearest Neighbors

In the kNN algorithm, we weigh all neighbors equally. This may hurt the inference step, especially when the neighbor zone becomes bigger and bigger. To strengthen the influence of ‘close’ neighbors relative to more distant ones, a weighted scheme of k-Nearest Neighbors is applied.

Weighted k-Nearest Neighbors is based on the idea that, within $N_k(x)$, observations that are closer to $x$ should get a higher weight than the more distant neighbors. Now it is necessary to note a key property of any weighting scheme (kernel) $K$ applied to a distance metric $d$:

• $K(d)$ decreases monotonically as $d$ grows, for $d \geq 0$

For any new instance $x$:

• We find $N_k(x)$, the set of the $k$ samples in the training dataset that are closest to $x$.
• The $(k+1)$-th neighbor is used for standardization of the $k$ smallest distances: $d^{std}_{i} = \frac{d(x, x_i)}{d(x, x_{(k+1)})}$.
• We transform the standardized distances with any kernel function $K$ into weights $w_i = K(d^{std}_{i})$.
• The classifier is defined as:

$$\hat{y} = \arg\max_{c \in \{1, \dots, C\}} \sum_{x_i \in N_k(x)} w_i \, \mathbb{1}(y_i = c).$$
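Below is a small sketch of the weighted variant under the same assumptions as above; the Gaussian kernel used to turn standardized distances into weights is just one possible choice of $K$.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_new, k=5):
    """Weighted kNN: closer neighbors contribute larger votes (assumes len(X_train) > k)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    order = np.argsort(distances)
    neighbor_idx = order[:k]
    # Standardize the k smallest distances by the distance of the (k+1)-th neighbor
    d_std = distances[neighbor_idx] / distances[order[k]]
    # Turn standardized distances into weights with a kernel (Gaussian here)
    weights = np.exp(-0.5 * d_std ** 2)
    # Weighted vote per class
    scores = {}
    for w, label in zip(weights, y_train[neighbor_idx]):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)
```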

# The pros and cons of kNN, and further topics

The kNN and weighted kNN algorithms do not rely on any specific assumption about the distribution of the data, so they are quite easy to apply to many problems as a baseline model. Furthermore, kNN (and its family) is very intuitive to understand and implement, which again makes it a worthy try-it-first approach for many supervised problems.

Despite those facts, kNN still has challenges in some aspects: it is computationally expensive, especially when the dataset becomes huge. Another challenge is choosing the ‘correct’ distance metric that best follows the assumption behind the algorithm: items close together in the dataset should be typically similar. Lastly, the curse of dimensionality heavily affects the distance metric. Beyer et al.[2] prove that, under some preconditions, in high-dimensional space all points converge to the same distance from the query point. In that case, the concept of ‘nearest neighbors’ is no longer meaningful.
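As a quick illustration of this effect (a toy experiment of my own, not the one from Beyer et al.), we can draw random points in increasing dimensions and watch the contrast between the nearest and farthest neighbor of a query point vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n_points, d))   # random 'training' points
    q = rng.uniform(size=d)               # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # As d grows, the max/min distance ratio tends towards 1:
    # all points look roughly equally far from the query.
    print(f"d={d:>5}: min={dist.min():.3f}  max={dist.max():.3f}  ratio={dist.max() / dist.min():.2f}")
```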

## Machine Learning development with AWS SageMaker

Make your Machine Learning team's work easier, focus more on the business, and deploy quickly with the AWS managed service SageMaker.

Today, Machine Learning (ML) is solving complex problems that create more business value for customers, and many companies apply ML to tackle hard business problems. ML brings many benefits, but also many challenges in building models with high accuracy. I currently work on the AI team, helping the company deliver AI/ML projects quickly and helping the Data Scientist (DS) team develop data pipelines and machine learning pipelines, so that projects grow and are delivered quickly with high quality.

# Overview of Machine Learning development

Here is the basic machine learning process, which reflects the practice of big companies. It includes multiple phases (business analysis, data processing, model training, and deployment), multiple steps within each phase, and a fleet of tools that we use to carry out dedicated steps.

Business problems: the problems that challenge the business and for which ML can be a better solution.
ML problem framing: the phase that helps DS and engineering define the ML problem, propose ML solutions, design the data pipeline, and plan the work.
Data processing (collection, integration, preparation and cleaning, visualization, and analysis): this phase includes multiple steps that prepare data for visualization and ML training.
Model training (feature engineering, model training and parameter tuning, model evaluation): DS and developers work in this phase to engineer features, prepare data for a specific model, and train the model using frameworks such as TensorFlow or PyTorch.

When we don’t use a platform such as AWS SageMaker or Azure ML Studio, we spend more time developing a complex skill stack. We need many skills in compute, networking, storage, ML frameworks, programming languages, feature engineering, and more.

When we develop an ML model with a lack of those skills and many complex components, it takes more time to handle tasks around programming and compute, and it creates challenges for the engineering team that uses and deploys the model. In Figure 2, we have multiple tiers of cloud computing (Infrastructure as a Service, Platform as a Service, Application as a Service) that provide resource types according to business needs at different levels of control. We can choose a specific layer, or combine layers, to meet the business objective. For Machine Learning projects and research environments, I would highly recommend that any DS or developer use PaaS and AaaS first to meet the business requirements, deliver quickly, and reduce cost and effort. That is the main reason I want to describe the AWS SageMaker service, which can be used as a standard platform for an ML team to quickly develop and deploy ML models, focus on solving ML business problems, and improve model quality.

# AWS SageMaker offers the basis of an end-to-end machine learning development environment

When we develop an ML model, we need to take care of multiple parts, and sometimes we want to try a new model and get accuracy feedback as quickly as possible. That also depends on questions such as “Is there enough data for processing and training?” and “How much time will training a new model take?”. AWS SageMaker is used by thousands of AI companies and was developed by experts following ML best practices, which helps improve the ML process and the working environment. In my opinion, when I want to focus on building a model and solving a challenging problem, I want to make everything else easy at a basic level and spend more time on the main problems first. SageMaker provides me a notebook solution, and the notebook is a great space for coding and analyzing data for training. Working together with the SageMaker SDK, I can easily connect to and use other resources inside AWS, such as S3 buckets and training jobs. All of this helps me quickly develop and deliver a new model. Below, I want to highlight the main benefits we got from this service, along with its disadvantages.

* 💰 Cost-effective:

– SageMaker provides training jobs that are distributed, elastic, and high-performance, and that can use spot instances to save up to 90% of cost, paying only for training time in seconds (see the documentation, and the sketch after this list).
– Elastic Inference: this feature helps save cost for computations that need a GPU to serve deep learning models, for example at prediction time (documentation).
* 🎯 Reduce the skills gap and focus on solving business problems. We can easily set up a training environment from a notebook with a few clicks, with elastic CPUs/GPUs.
* 🌐 Connectivity and easy deployment
– AWS SageMaker is an AWS managed service and is easy to integrate with other AWS services inside a private network. This also benefits big data solutions: ETL-processed data can stay inside a private network, reducing transfer costs.
– The endpoint feature helps DS/developers deploy a trained model with a few clicks or from the SDK (documentation).
* 🏢 Easy to manage: when working with multiple teams on AWS, more resources appear every day, and it is a challenge for the IT team to manage resources and roles, which impacts cost and security. An AWS managed service helps reduce the number of resources we need to create.

* AWS SageMaker is a managed service that implements best practices and focuses on popular frameworks; sometimes it does not match your requirements, and this should be considered before choosing it.
* 🎿 Learning new skills and basic knowledge of AWS Cloud: when working on the AWS cloud, basic knowledge about cloud infrastructure is necessary, plus new knowledge of whichever AWS managed service you want to use.
* 👮 It is also more expensive than a plain EC2 instance because it provides dedicated ML support, so we need to choose the right resources for development to save cost.
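As referenced in the cost bullet above, here is a minimal sketch of a managed spot training job followed by an endpoint deployment with the SageMaker Python SDK (v2). The role ARN, image URI, and S3 paths are placeholders, and parameter names may differ slightly between SDK versions.

```python
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
# Placeholder values: replace with your own role, training image, and S3 locations
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
image_uri = "<training-image-uri>"

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,   # managed spot training to cut training cost
    max_run=3600,              # maximum training time in seconds
    max_wait=7200,             # maximum time to wait for spot capacity
    output_path="s3://<your-bucket>/model-artifacts/",
    sagemaker_session=session,
)

# Launch the training job, then expose the trained model as an HTTPS endpoint
estimator.fit({"train": "s3://<your-bucket>/train-data/"})
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```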

AWS SageMaker is a well-suited service for the production environment. It helps to build quality models and a standard environment, which reduces risk in product development. We trade off a little to get most of the benefits and quickly achieve the team's goals. Thank you so much for reading, and please let me know if you have any concerns.


# I. A Brief Overview

Consider an example of a courtroom trial:

A car company C is accused of not manufacturing environment-friendly vehicles. The average CO2 emission per car from different manufacturers based on a survey from the previous year is 120.4 grams per kilometer. But for a random batch of 100 cars produced at C’s factory, the average CO2 emission is 121.2 grams per kilometer with a standard deviation of 1.8.

At the trial, Company C is not considered to be guilty as long as their wrongdoing is not proven. A public prosecutor tries to prove that C is guilty and can only succeed when enough evidence is presented.

The example above illustrates the concepts of hypothesis testing; specifically, there are two conflicting hypotheses:

i) C is not guilty; or

ii) C is guilty

The first is called the null hypothesis (denoted by H0), and the second the alternative hypothesis (denoted by HA). At the start of the trial, the null hypothesis is temporarily accepted, until proven otherwise. The goal of hypothesis testing is to perform some sort of transformed comparison between the two numbers 121.2 and 120.4 to either reject H0 and accept HA, or vice versa. This is a one-sample mean test, because we are comparing the average value obtained from one sample (121.2) with the average value assumed to represent the whole population (120.4).

# II. Required Steps for Hypothesis Testing

The six steps below must be followed to conduct a hypothesis test. The details will be elaborated on with our example afterward.

1) Set up null and alternative hypotheses and check conditions.

2) Determine the significance level, alpha.

3) Calculate the test statistic.

4) Calculate the probability value (a.k.a the p-value), or find the rejection region. For the following example, we will use the p-value.

5) Decide on the null hypothesis.

6) State the overall conclusion.

# III. A step-by-step example

## 1) Set up hypotheses:

We already mentioned in the beginning the two hypotheses. But now we will formalize them:

### Null hypothesis:

Company C’s CO2 mean (denoted by μ) is equal to the population mean (denoted by μ0): μ = μ0

### Alternative hypothesis:

Company C’s CO2 mean is greater than the population mean: μ > μ0

The hypothesis test for the one-sample mean that we are conducting requires the data to come from an approximately normal distribution or a large enough sample size, which can be quite subjective. To keep things simple, we decide that the data gathered from company C is big enough, with the sample size being 100 cars.

## 2) Determine the significance level, alpha, or confidence level

The significance level and its complementary, the confidence level, provide a level of probability cutoff for our test to make decisions about the hypotheses. A common value for alpha is 5%, which is the same as a confidence level of 95%.

## 3) Calculate the test statistic

For the one-sample mean test, we calculate the t* test statistic using the formula:

$$t^{*} = \frac{\bar{x} - \mu_{0}}{s / \sqrt{n}},$$

where $\bar{x}$ is the sample mean (121.2), $\mu_{0}$ is the assumed population mean (120.4), s is the standard deviation of the sample we are testing (1.8), and n is the size of the sample (100).
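Putting steps 3 to 5 together for the numbers in our example (sample mean 121.2, assumed population mean 120.4, s = 1.8, n = 100), a short scipy-based sketch might look like this (the variable names are mine):

```python
import numpy as np
from scipy import stats

x_bar, mu0, s, n = 121.2, 120.4, 1.8, 100
alpha = 0.05

# Step 3: the test statistic
t_star = (x_bar - mu0) / (s / np.sqrt(n))   # 0.8 / 0.18, roughly 4.44

# Step 4: one-sided p-value with n - 1 degrees of freedom
p_value = stats.t.sf(t_star, df=n - 1)

# Step 5: decision on the null hypothesis
print(f"t* = {t_star:.2f}, p-value = {p_value:.2e}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With t* around 4.44, the p-value is far below 0.05, so in this example we would reject H0 and conclude that company C's mean emission is greater than the population mean.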

## Bias in Data Science – the Good, the Bad and the Avoidable !?

In recent years, there have been a few prominent examples of accidental bias in machine-learning applications, such as smartphones’ beauty filters (that essentially ended up whitening skin) [1] or Microsoft’s from-innocent-teen-to-racist-in-24-hours chatbot [2,3]. Examples such as these fell victim to inherently biased data being fed into algorithms too complex to allow for much transparency. Hidden bias continues to be an issue on ubiquitous social media platforms, such as Instagram, whose curators appear to profess themselves both regretful AND baffled [4]. Unfortunately, any model will somewhat regurgitate what it has been fed and interventions at this level of model complexity may prove tricky.

Interestingly, bias itself does not need to be harmful and is often built into a model’s design on purpose, either to address only a subset of the overall population or to model a real-world state; for instance, when predicting house prices from size and number of bedrooms, the model’s bias parameter often represents the average house price in the data set. Thus, we need to distinguish between conscious and unconscious bias in data analysis. Additionally, there is the factor of intent, i.e., is the person conducting the analysis well-intentioned, following good scientific method, or trying to manipulate it to achieve a particular outcome?

In the following, I will only discuss aspects of unintentional and unconscious bias, meaning bias hidden from the researcher or data scientist introducing it. This is by no means an exhaustive discussion, but merely a highlight of some pervasive aspects:

A. Data availability bias

B. Coherent theory bias

C. Method availability/popularity bias

# A. Data availability bias

The problem of scientists selecting their data out of convenience rather than suitability or representativeness for the task at hand has been around for a while [4]; e.g., the ideal data set may not be available in a machine-readable format, or would require higher costs and more time for processing: in short, several obstacles to doing an analysis quickly. For instance, in the area of Natural Language Processing, the major European languages, like English, French, and German, tend to receive more attention because both data and tools to analyze them are widely available. Similarly, psychology research has mostly focused on so-called WEIRD societies (White, Educated, Industrialized, Rich, Democratic) [5] and out of convenience often targets the even smaller population of “North American college students”, who unsurprisingly have been found not to represent human populations at large.

# B. Coherent theory bias

Various studies suggest that we as people strongly favor truths that fit into our pre-existing worldview, and why would scientists be exempt from this? Thus, it appears that when people analyze data, they are often biased by their underlying beliefs about the outcome and are then less likely to yield unexpected non-significant results [6]. This does not even include scientists disregarding new evidence because of conflicting interests [7]. This phenomenon is commonly referred to as confirmation bias or, more fittingly, “my side” bias.

# C. Method availability/popularity bias

There is a tendency to hail new trendy algorithms as one-size-fits-all solutions for whatever task or application. The solution is presented before examining the problem and its actual requirements. While more complex models are often more powerful, this comes at the cost of interpretability, which in some cases is not advisable. Additionally, some methods, both simple and complex ones, enjoy popularity primarily because they come ready-to-use in one's favorite programming language.

Going forward…

We as data scientists should:

a. Select our data carefully with our objective in mind. Get to know our data and its limitations.

b. Be honest with ourselves about possible emotional investment in our analyses’ outcomes and resulting conflicts.

c. Examine the problem and its (theoretical) solutions BEFORE making any model design choices.

# References:

[2] https://www.bbc.com/news/technology-35902104 (last accessed 21.10.2020)

[5] Joseph Rudman (2003) Cherry Picking in Nontraditional Authorship Attribution Studies, CHANCE, 16:2,26-32, DOI: 10.1080/09332480.2003.10554845

[6] Henrich, Joseph; Heine, Steven J., and Norenzayan, Ara. The Weirdest People in the World? Behavioral and Brain Sciences, 33(2-3):61–83, 2010. doi: 10.1017/S0140525X0999152X.

[7] Hewitt CE, Mitchell N, Torgerson DJ. Heed the data when results are not significant. BMJ. 2008;336(7634):23-25. doi:10.1136/bmj.39379.359560.AD

[8] Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124. doi:10.1371/journal.pmed.0020124


# Motivation

What makes computers useful for us is primarily the ability to solve problems. The procedure by which a computer solves a problem is an algorithm. In the recent context of an increasing number of algorithms available for solving data-related problems, there is an increasing demand for a deeper understanding of algorithm performance, so that data scientists can choose the right algorithms for their problems.

Having a general perception of the efficiency of an algorithm helps to shape the thought process for creating or choosing better algorithms. With this intention in mind, I would like to create a series of posts discussing what makes a good algorithm in practice, or, for short, an efficient algorithm. This article is the first step of the journey.

# Define Efficiency

An algorithm is considered efficient if its resource consumption, also known as computational cost, is at or below some acceptable level. Roughly speaking, ‘acceptable’ means: it will run in a reasonable amount of time or space on an available computer, typically as a function of the size of the input.

There are many ways in which the resources used by an algorithm can be measured: the two most common measures are speed and memory usage. In the next two sections, we will look at two different perspectives on measuring the efficiency of an algorithm: the theoretician's and the practitioner's.

# Theoretician's perspective

Theoreticians are interested in measuring the efficiency of an algorithm without actually having to run it on several machines and input sizes. The key idea is that they do not consider the runtime of the algorithm on any particular input. Rather, they look at what are known as asymptotic runtimes. In other words, they look at how the runtime scales with the input size (n) as n gets larger. Does the runtime scale proportionally to n, or to n squared, or maybe exponentially in n? These rates of growth are so different that, as long as n is sufficiently large, constant factors coming from other considerations (such as temporary or long-term disk usage) are relatively small and can be neglected.
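As a toy illustration of why constant factors are neglected (my own example, not from the original post), compare a linear-time cost carrying a large constant with a quadratic-time cost as n grows:

```python
# Even with a large constant factor, a linear-time algorithm eventually
# beats a quadratic one as the input size n grows.
for n in [10, 100, 1000, 10_000, 100_000]:
    linear = 1000 * n      # e.g. an O(n) algorithm with a big constant
    quadratic = n ** 2     # an O(n^2) algorithm with constant 1
    winner = "linear" if linear < quadratic else "quadratic"
    print(f"n={n:>7}: 1000*n={linear:>15,}  n^2={quadratic:>15,}  cheaper: {winner}")
```

For small n the quadratic algorithm looks cheaper, but once n exceeds 1000 the linear one wins no matter how large its constant, which is exactly the behavior asymptotic analysis captures.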

# Practitioner's perspective

While certainly useful, the asymptotic runtime of an algorithm doesn’t tell the whole story. Some algorithms have good asymptotic runtime, but constants that are so huge that they effectively can’t be used. Ever. Lipton calls them Galactic Algorithms. A galactic algorithm is an algorithm that is wonderful in its asymptotic behavior but is never used to compute anything.

In practice, other factors can affect the efficiency of an algorithm, such as requirements for accuracy and/or reliability. As detailed below, how an algorithm is implemented can also have a significant effect on actual efficiency, though many aspects of this relate to optimization issues.

Implementation issues can also affect efficiency, such as the choice of programming language, how the algorithm is coded, the choice of compiler for a particular language, the compilation options used, or even the operating system being used. In many cases, a language implemented by an interpreter may be much slower than a language implemented by a compiler.

## Binomial Theorem

Can you expand $(x+y)^{2}$? I guess you would find that quite easy to do. You can easily find that $(x+y)^{2} = x^{2} + 2xy + y^{2}$.

How about the expansion of a higher power, say $(x+y)^{10}$? It is no longer so easy.

It is no longer easy, is it? However, if we use the Binomial Theorem, this expansion becomes an easy problem.

Binomial Theorem is a very intriguing topic in mathematics and it has a wide range of applications.

# Theorem

Let $x, y$ be real numbers (or complex numbers, or polynomials). For any positive integer $n$, we have:

$$(x+y)^{n} = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^{k},$$

where

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$

Proof:

We will use proof by induction. The base case $n = 1$ is obvious. Now suppose that the theorem is true for the case $n = m$, that is, assume that:

$$(x+y)^{m} = \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^{k}.$$

We will need to show that this is true for $n = m + 1$:

$$(x+y)^{m+1} = \sum_{k=0}^{m+1} \binom{m+1}{k} x^{m+1-k} y^{k}.$$

Let us consider the left-hand side of the equation above:

$$(x+y)^{m+1} = (x+y)(x+y)^{m} = (x+y) \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^{k} = \sum_{k=0}^{m} \binom{m}{k} x^{m+1-k} y^{k} + \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^{k+1}.$$

Re-indexing the second sum and grouping terms with the same powers, we can now apply Pascal's identity:

$$\binom{m}{k-1} + \binom{m}{k} = \binom{m+1}{k}.$$

The equation above can be simplified to:

$$(x+y)^{m+1} = \sum_{k=0}^{m+1} \binom{m+1}{k} x^{m+1-k} y^{k},$$

as we desired.

### Example 1:  Power rule in Calculus

In calculus, we always use the power rule:

$$\frac{d}{dx} x^{n} = n x^{n-1}.$$

We can prove this rule using the Binomial Theorem.

Proof:

Recall that the derivative of any continuous function f(x) is defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.$$

Let $n$ be a positive integer and let $f(x) = x^{n}$.

The derivative of f(x) is:

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^{n} - x^{n}}{h} = \lim_{h \to 0} \frac{\sum_{k=0}^{n} \binom{n}{k} x^{n-k} h^{k} - x^{n}}{h} = \lim_{h \to 0} \sum_{k=1}^{n} \binom{n}{k} x^{n-k} h^{k-1} = n x^{n-1},$$

since every term with $k \geq 2$ still contains a factor of $h$ and vanishes in the limit.

### Example 2:  Binomial Distribution

Let X be the number of heads in a sequence of n independent coin tosses. X is usually modeled by the binomial distribution in probability theory. Let $p$ be the probability that a head shows up in a toss, and let $q = 1 - p$. The probability that there are $k$ heads in the sequence of $n$ tosses is:

$$P(X = k) = \binom{n}{k} p^{k} q^{n-k}.$$

We know that the sum of all the probabilities must equal 1. To show this, we can use the Binomial Theorem. We have:

$$\sum_{k=0}^{n} \binom{n}{k} p^{k} q^{n-k} = (p + q)^{n} = 1^{n} = 1.$$
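As a quick sanity check of both results, here is a small Python snippet (my own, with arbitrary choices of n, x, y, and p) that verifies the expansion numerically and confirms that the binomial probabilities sum to 1:

```python
from math import comb

n, x, y = 7, 1.3, 2.1
# Binomial Theorem: (x + y)^n equals the sum of C(n, k) * x^(n-k) * y^k
lhs = (x + y) ** n
rhs = sum(comb(n, k) * x ** (n - k) * y ** k for k in range(n + 1))
print(abs(lhs - rhs) < 1e-9)   # True

# Binomial distribution: the probabilities over k = 0..n sum to (p + q)^n = 1
p = 0.3
q = 1 - p
total = sum(comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1))
print(abs(total - 1.0) < 1e-12)  # True
```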

Please also check our other articles on Gaussian Samples and N-gram language models, Bayesian models, and Monte Carlo for more statistics knowledge.


## Monte Carlo Simulation

On a nice day two years ago, when I was working in the financial field, my boss sent our team an email. In this email, he asked us to propose some machine learning techniques to predict the stock price.

So, after accepting the assignment from my manager, our team began to research and apply some approaches for prediction. When we talk about Machine Learning, we often think of supervised and unsupervised learning. But one of the algorithms we applied is an often forgotten yet equally effective one: Monte Carlo simulation.

# What is Monte Carlo simulation?

The Monte Carlo method is a technique that uses random numbers and probability to solve complex problems. The Monte Carlo simulation, or probability simulation, is a technique used to understand the impact of risk and uncertainty in financial sectors, project management, costs, and other forecasting machine learning models.[1]

Now let's jump into the Python implementation to see how it applies.

# Python Implementation

In this task, we used the DXG stock dataset from 2017/01/01 to 2018/08/24, and we would like to know the stock price after 10 days, 1 month, and 3 months, respectively.

We will simulate the stock returns, and the next price will be calculated as

P(t) = P(t-1) * (1 + return_simulate(t))

Calculate the mean and standard deviation of the stock returns:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

miu = np.mean(stock_returns, axis=0)
dev = np.std(stock_returns)

Simulation process:

# mc_rep (number of Monte Carlo repetitions), train_days (forecast horizon in days)
# and init_price (the last observed price) are assumed to be defined earlier.
simulation_df = pd.DataFrame()
last_price = init_price
for x in range(mc_rep):
    count = 0
    daily_vol = dev
    price_series = []
    # first simulated day
    price = last_price * (1 + np.random.normal(miu, daily_vol))
    price_series.append(price)
    # simulate the remaining days of the horizon
    for y in range(train_days):
        if count == train_days - 1:
            break
        price = price_series[count] * (1 + np.random.normal(miu, daily_vol))
        price_series.append(price)
        count += 1
    # store this simulated price path as one column
    simulation_df[x] = price_series

Visualization of the Monte Carlo simulation:

fig = plt.figure()
fig.suptitle('Monte Carlo Simulation')
plt.plot(simulation_df)
plt.axhline(y = last_price, color = 'r', linestyle = '-')
plt.xlabel('Day')
plt.ylabel('Price')
plt.show()

Now, let’s check with actual stock price after 10 days, 1 month and 3 months

plt.hist(simulation_df.iloc[9,:],bins=15,label ='histogram')
plt.axvline(x = test_simulate.iloc[10], color = 'r', linestyle = '-',label ='Price at 10th')
plt.legend()
plt.title('Histogram simulation and last price of 10th day')
plt.show()

We can see that the most frequently occurring simulated price is pretty close to the actual price after the 10th day.

As the forecast period gets longer, the results gradually become less accurate.

Simulation for next 1 month

After 3 months

# Conclusion

Monte Carlo simulation is used a lot in finance. Although it has some weaknesses, hopefully through this article you will gain a new perspective on applying simulation to forecasting.

# Reference

[1] Pratik Shukla, Roberto Iriondo, “Monte Carlo Simulation: An In-depth Tutorial with Python”, Medium, https://medium.com/towards-artificial-intelligence/monte-carlo-simulation-an-in-depth-tutorial-with-python-bcf6eb7856c8

Please also check Gaussian Samples and N-gram language models,
Bayesian Statistics for more statistics knowledge.


## Bayesian estimator of the Bernoulli parameter

In this post, I will explain how to calculate a Bayesian estimator. The example taken is very simple: estimating the parameter θ of a Bernoulli distribution.

A random variable X which has the Bernoulli distribution is defined as a variable taking only the values 0 and 1, with

$$P(X = 1) = \theta, \qquad P(X = 0) = 1 - \theta,$$

where $\theta \in [0, 1]$.

In this case, we can write $X \sim \mathrm{Bernoulli}(\theta)$.

In reality, the simplest way to estimate θ is to sample X, count how many times the event occurs, and estimate the probability of the event as that fraction. This is exactly what the frequentists do.

In this post, I will show how Bayesian statisticians estimate θ. Although this doesn't have a deeply meaningful application, it helps to understand how Bayesian statistics works. Let's start.

# The posterior distribution of θ

Denote Y as the observed number of occurrences of the event. Given the parameter θ, if we sample the event n times, then the probability that the event occurs k times is (this is the probability mass function of the observations under the Bernoulli model):

$$p(Y = k \mid \theta) = \binom{n}{k}\, \theta^{k} (1-\theta)^{n-k}.$$

In Bayesian statistics, we would like to calculate

$$p(\theta \mid Y = k).$$

By using the Bayes formula, we have

$$p(\theta \mid Y = k) = \frac{p(Y = k \mid \theta)\, p(\theta)}{p(Y = k)}.$$

With the prior distribution of θ taken as a Uniform distribution, p(θ) = 1, and it is easy to prove that

$$p(Y = k) = \int_{0}^{1} \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \, d\theta = \binom{n}{k} \frac{\Gamma(k+1)\,\Gamma(n-k+1)}{\Gamma(n+2)},$$

where Γ is the Gamma function. Hence, the posterior distribution is

$$p(\theta \mid Y = k) = \frac{\Gamma(n+2)}{\Gamma(k+1)\,\Gamma(n-k+1)}\, \theta^{k} (1-\theta)^{n-k}.$$

Fortunately, this is the density function of the Beta distribution $\mathrm{Beta}(k+1,\ n-k+1)$.

We use the following properties for evaluating the posterior mean and variance of θ.

If $\theta \sim \mathrm{Beta}(a, b)$, then

$$E[\theta] = \frac{a}{a+b}, \qquad \mathrm{Var}[\theta] = \frac{ab}{(a+b)^{2}(a+b+1)}.$$

# Simulation

In summary, the Bayesian estimator of θ is the Beta distribution with the mean and variance given above. Here is the Python code for simulating data and estimating θ:

import numpy as np
from scipy.stats import beta

def bayes_estimator_bernoulli(data, a_prior=1, b_prior=1, alpha=0.05):
    '''Input:
        data: a numpy array with binary values, which has the distribution B(1, theta)
        a_prior, b_prior: parameters of the prior distribution Beta(a_prior, b_prior)
        alpha: significance level of the posterior confidence interval for the parameter
    Model:
        for estimating the parameter theta of a Bernoulli distribution;
        the prior distribution for theta is Beta(1,1) = Uniform[0,1]
    Output:
        a, b: the two parameters of the posterior distribution Beta(a,b)
        pos_mean: posterior estimation for the mean of theta
        pos_var: posterior estimation for the var of theta'''
    n = len(data)
    k = sum(data)
    # Posterior parameters, assuming the uniform Beta(1,1) prior
    a = k + 1
    b = n - k + 1
    pos_mean = 1. * a / (a + b)
    pos_var = 1. * (a * b) / ((a + b + 1) * (a + b) ** 2)
    # Posterior confidence (credible) interval
    theta_inf, theta_sup = beta.interval(1 - alpha, a, b)
    print('Prior distribution: Beta(%3d, %3d)' % (a_prior, b_prior))
    print('Number of trials: %d, number of successes: %d' % (n, k))
    print('Posterior distribution: Beta(%3d,%3d)' % (a, b))
    print('Posterior mean: %5.4f' % pos_mean)
    print('Posterior variance: %5.8f' % pos_var)
    print('Posterior std: %5.8f' % (np.sqrt(pos_var)))
    print('Posterior Confidence Interval (%2.2f): [%5.4f, %5.4f]' % (1 - alpha, theta_inf, theta_sup))
    return a, b, pos_mean, pos_var

# Example
n = 129 # sample size
data = np.random.binomial(size=n, n=1, p=0.6)
a, b, pos_mean, pos_var = bayes_estimator_bernoulli(data)

And the result is

Prior distribution: Beta(  1,   1)
Number of trials: 129, number of successes: 76
Posterior distribution: Beta( 77, 54)
Posterior mean: 0.5878
Posterior variance: 0.00183556
Posterior std: 0.04284341
Posterior Confidence Interval (0.95): [0.5027, 0.6703]

In the simulation, we generated 129 data points from the Bernoulli distribution with θ = 0.6. The Bayesian estimate of θ is the posterior mean, which is 0.5878.
This is a very simple example of Bayesian estimation. In reality, it is usually tricky to determine a closed-form solution of the posterior distribution from a given prior distribution. In that case, the Monte Carlo technique is one of the solutions for approximating the posterior distribution.
Please also check Gaussian Samples and N-gram language models for more statistics knowledge.


# Background

In part 1 of my project, I built a unigram language model: it estimates the probability of each word in a text simply based on the fraction of times the word appears in that text.

The text used to train the unigram model is the book “A Game of Thrones” by George R. R. Martin (called train). The texts on which the model is evaluated are “A Clash of Kings” by the same author (called dev1), and “Gone with the Wind” — a book from a completely different author, genre, and time (called dev2).

In this part of the project, I will build higher n-gram models, from bigram (n=2) to 5-gram (n=5). These models are different from the unigram model in part 1, as the context of earlier words is taken into account when estimating the probability of a word.

# Higher n-gram language models

## Training the model

For a given n-gram model, the probability of a word given the previous n−1 words is estimated from the training text as the count of the full n-gram divided by the count of its (n−1)-word context:

$$P(w_{n} \mid w_{1}, \dots, w_{n-1}) = \frac{\mathrm{count}(w_{1}, \dots, w_{n})}{\mathrm{count}(w_{1}, \dots, w_{n-1})}$$

The example below shows how to calculate the probability of a word in a trigram model: the probability of a word given the two preceding words is the count of that trigram divided by the count of the preceding bigram.

# Dealing with words near the start of a sentence

In higher n-gram language models, the words near the start of each sentence do not have a long enough context to apply the formula above. To make the formula consistent for those cases, we pad these n-grams with sentence-starting symbols [S]. Below are two such examples under the trigram model, using the sentence “I have a dream”:

$$P(\text{I} \mid [S], [S]) = \frac{\mathrm{count}([S], [S], \text{I})}{\mathrm{count}([S], [S])}, \qquad P(\text{have} \mid [S], \text{I}) = \frac{\mathrm{count}([S], \text{I}, \text{have})}{\mathrm{count}([S], \text{I})}$$

From the above formulas, we see that the n-grams containing the starting symbols are just like any other n-gram. The only difference is that we count them only when they are at the start of a sentence. Lastly, the count of n-grams containing only [S] symbols is naturally the number of sentences in our training text:
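To make the counting concrete, here is a minimal sketch (not the author's actual code) of trigram counting with [S] padding and the maximum-likelihood probability estimate described above; the toy sentences and function names are mine:

```python
from collections import Counter

def trigram_counts(sentences):
    """Count trigrams and their bigram prefixes, padding sentence starts with [S]."""
    tri, bi = Counter(), Counter()
    for sentence in sentences:
        tokens = ["[S]", "[S]"] + sentence.split()
        for i in range(2, len(tokens)):
            tri[tuple(tokens[i - 2:i + 1])] += 1   # (w1, w2, w3)
            bi[tuple(tokens[i - 2:i])] += 1        # (w1, w2)
    return tri, bi

def trigram_prob(word, context, tri, bi):
    """P(word | context) = count(context + word) / count(context); zero if the context is unseen."""
    if bi[context] == 0:
        return 0.0
    return tri[context + (word,)] / bi[context]

sentences = ["the king rode north", "the king is dead"]
tri, bi = trigram_counts(sentences)
print(trigram_prob("king", ("[S]", "the"), tri, bi))  # 1.0: every sentence starts with 'the king'
```

Note that the count of the ([S], [S]) bigram equals the number of sentences, matching the observation above.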

# Dealing with unknown n-grams

Similar to the unigram model, the higher n-gram models will encounter n-grams in the evaluation text that never appeared in the training text. This can be solved by adding pseudo-counts to the n-grams in the numerator and/or denominator of the probability formula, a.k.a. Laplace smoothing. However, as outlined in part 1 of the project, Laplace smoothing is nothing but interpolating the n-gram model with a uniform model, where the latter assigns all n-grams the same probability.

Hence, for simplicity, for an n-gram that appears in the evaluation text but not the training text, we just assign zero probability to that n-gram. Later, we will smooth it with the uniform probability.

# Background

Language modeling, that is, predicting the probability of a word in a sentence, is a fundamental task in natural language processing. It is used in many NLP applications such as autocomplete, spelling correction, or text generation.

Currently, language models based on neural networks, especially transformers, are the state of the art: they predict very accurately a word in a sentence based on surrounding words. However, in this project, I will revisit the most classic language model: the n-gram models.

# Data

In this project, my training data set — appropriately called train — is “A Game of Thrones”, the first book in the George R. R. Martin fantasy series that inspired the popular TV show of the same name.

Then, I will use two evaluation texts for our language model: “A Clash of Kings” by the same author (called dev1), and “Gone with the Wind”, a book from a completely different author, genre, and time (called dev2).

# Unigram language model

## What is a unigram?

In natural language processing, an n-gram is a sequence of n words. For example, “statistics” is a unigram (n = 1), “machine learning” is a bigram (n = 2), “natural language processing” is a trigram (n = 3), and so on. For longer n-grams, people just use their lengths to identify them, such as 4-gram, 5-gram, and so on. In this part of the project, we will focus only on language models based on unigrams i.e. single words.

## Training the model

A language model estimates the probability of a word in a sentence, typically based on the words that have come before it. For example, for the sentence “I have a dream”, our goal is to estimate the probability of each word in the sentence based on the previous words in the same sentence:

The unigram language model makes the following assumptions: