Gaussian samples – Part (3)


The goal of this project is to generate Gaussian samples in 2-D from uniform samples, the latter of which can be readily generated using built-in random number generators in most computer languages.

In part 1 of the project, the inverse transform sampling was used to convert each uniform sample into respective x and y coordinates of our Gaussian samples, which are themselves independent standard normal (having mean of 0 and standard deviation of 1):

Generate x and y coordinates from uniform samples (U₁ and U₂) using inverse transform sampling

However, this method uses the inverse cumulative distribution function (CDF) of the Gaussian distribution, which is not well-defined. Therefore, we approximated this function using the simple Taylor series. However, this only samples accurately near the Gaussian mean, and under-samples more extreme values at both ends of the distribution.

A: left-side area (uniformly sampled). Tᵢ: the iᵗʰ-degree Taylor series approximation of the inverse Gaussian CDF: √2 Erfinv(2A-1)

In part 2 of the project, we used the Box-Muller transform, a more direct method to transform the uniform samples into Gaussian ones. The implementation of the algorithm is quite simple, as seen below, but its derivation requires some clever change of variables: instead of sampling Gaussian x and y coordinates for each point, we will sample a uniformly distributed angle (from 0 to 2π) and an exponentially-distributed random variable that represents half of a squared distance of the sample to the origin.

Generate x and y coordinates from uniform samples (U₁ and U₂) using Box-Muller transform

In this part of the project, I will present an even simpler method than the above two methods. Even better, this method is one that every statistics students are already familiar with.

The Central Limit Theorem

It turns out, we can rely on the most fundamental principle of statistics to help us generate Gaussian samples: the central limit theorem. In very simple terms, the central limit theorem states that:

Given n independent random samples from the same distribution, their sum will converge to a Gaussian distribution as n gets large.

Therefore, to generate a Gaussian sample, we can just generate many independent uniform samples and add them together! We then repeat this routine to generate the next standard Gaussian sample until we get enough samples for our x-coordinates. Finally, we just repeat the same steps to generate the y coordinates.



Note that this method will work even if the samples that we start with are not uniform — they are Poisson-distributed, for example. This is because the central limit theorem holds for virtually all probability distributions *cough let’s not talk about the Cauchy distribution cough*.

Generate Gaussian samples by central limit theorem

To generate, say, 1000 Gaussian sums (n_sums = 1000) where each is the sum of 100 uniform samples (n_additions = 100):

n_additions = 100 
n_points = 1000# 0. Initialize random number generator
rng = np.random.RandomState(seed=24) # 1. Generate matrix of uniform samples
uniform_matrix = rng.uniform(size=(n_additions, n_points))# 2. Sum uniform elements down each column to get all Gaussian sums
gaussians = uniform_matrix.sum(axis=0)

We can apply the above method to generate Gaussian samples for each coordinate, using different random number generators for x and for y to ensure that the coordinates are independent of each other. Visualizing the intermediate sums after each addition, we see that:

Gaussian samples – Part (2)


In part 1 of this project, I’ve shown how to generate Gaussian samples using the common technique of inversion sampling:

The bad news: the Gaussian inverse CDF is not well-defined, so we have to approximate that function, and the simple Taylor series was used. However, this only samples accurately near the Gaussian mean, and under-samples more extreme values at both ends of the distribution.

A: left-side area (uniformly sampled). Tᵢ: the iᵗʰ-degree Taylor series approximation of the inverse Gaussian CDF: √2 Erfinv(2A-1)

Therefore, in this part of the project, we will investigate a more “direct” sampling method that does not depend on the approximation of the Gaussian inverse CDF. This method is called the Box-Muller transform, after the two mathematicians who invented the method in 1958: the British George E. P. Box and the American Mervin E. Muller.

Left: George E. P. Box (1919–2013). Right: Mervin E. Muller (1928–2018)

How does the Box-Muller transform work?

For this project, my goal is to generate Gaussian samples in two dimensions i.e. generating samples whose x and y coordinates are independent standard normals (Gaussian with zero mean and standard deviation of 1). In part 1, I used the inverse Gaussian CDF to generate separately the x and y coordinates from their respective uniform samples (U₁ and U₂):

Generate x and y coordinates from uniform samples (U₁ and U₂) using inverse transform sampling

For the Box-Muller transform, I will also start with the same two uniform samples. However, I will transform these uniform samples into the x and y coordinates using much simpler formulas:

Generate x and y coordinates from uniform samples (U₁ and U₂) using Box-Muller transform

Despite the strong coupling between U₁ and U₂ in each of the two formulas above, the generated x and y coordinates, which are both standard Gaussians, are still surprisingly independent from each other! In the derivation of the Box-Muller transform that follows, I will demonstrate why this is indeed this case.

Derivation of Box-Muller transform

We know that for any two independent random variables x and y, the joint probability density f(x, y) is simply the product of the individual density functions: f(x) and f(y). Furthermore, the Pythagoras theorem allows us to combine the x² and y² terms in each of the Gaussian density functions. This results in the -r²/2 term in the exponential of the joint distribution, where r is the distance from the origin to the 2-D Gaussian sample.



To make it simpler, we then define a variable s that is equal to r²/2. In other words, s is simply half of the squared distance from the origin of our Gaussian sample. Written this way, the joint PDF is simply the product of a constant (1/2π) and an exponential (e⁻ˢ).

Gaussian samples – Part (1)


Gaussian sampling — that is, generating samples from a Gaussian distribution — plays an important role in many cutting-edge fields of data science, such as Gaussian processvariational autoencoder, or generative adversarial network. As a result, you often see functions like tf.random.normal in their tutorials.

But, deep down, how does computer know how to generate Gaussian samples? This series of blog posts will show 3 different ways that we can program our computer (via Python) to do so. You will also see how R and Python generate Gaussian samples using modified versions of some of these methods.

Starting point: the uniform number generator

Of course, we can’t generate Gaussian samples from thin air. Instead, we start with a random number generator that exists in almost all programming languages: the uniform random number generator. It generates a random number that could take any value between 0 and 1. For Python, the numpy.random module uses the Mersenne twister to generate a uniformly-distributed float that is in the interval [0.0, 1.0).

Since Gaussians are better visualized in 2 dimensions — we are all familiar with the Gaussian “blob” in the xy-plane — I will demonstrate the 3 sampling methods in 2-D, especially since one of the methods do generate Gaussians in two dimensions at the same time.



Step 1: Generate standard Gaussian samples in 2-D. Step 2: Transform standard Gaussian samples to have given means, variances, and covariance between x and y

As a result, this series is broken down into 3 parts (see accompanying image):

Method 1: Inverse transform sampling

This is the most basic, yet most common, way to convert a uniform random sample into a random sample of any distribution, including Gaussian. This method works by applying the inverse function of the Gaussian CDF (cumulative distribution function) to transform a uniform sample into a Gaussian sample.

To make sure that the Gaussian samples for the x- and y-coordinates are independent, we can use two different uniform samples, one for x (U₁), and one for y (U₂). These two uniform samples can be generated using two different random number generators (two different RandomState initialized by different seeds, for example) so that they are independent in the first place.

Generate 2-D Gaussian samples from uniform samples

How does this work?

This method works by exploiting a mind-blowing principle:

For any distribution, the cumulative probability is always uniformly distributed.

The arithmetic proof of this principle is straightforward but rather boring, and you can view it from this lecture. Instead, I will show the geometric interpretation of this principle. But first, let’s clarify what cumulative probability is:

Practice Design for Try/Fail Fast

At the moment, AI/ML/DL are hot keywords in the trend of Software development. The world have more successful projects based on AI technologies such as Google Translate, AWS Alexa, …AI makes machine smarter than. So, the way from idea to successfully have many challenges if want to make great solutions. I have some time working with AI projects and start-ups to build great solutions based on Algorithms and ML; I aimed to propose and implement solutions that help the development team working smoothly. Today, I would like to describe the development process, architecture, CI/CD, and Programming for quickly implement multiple AI approaches with Agile software development methodology.

Continues Integration and Continues Deployment

– Batch Processing, Parallel Processing
– Data-Driven Development and Test Driven Development (to be continued)


AI project including multiple services with domains focus on: AI/ML/DL, and engineering that develop independent, integration and verification automatically. Popular, the ML services very specially with Engineering Service, resolve challenges problems linking with technologies: Machine learning, deep learning, big data, distributed computing … Microservices architecture, in this case, is a first choice, that helps to separate businesses problem into specific services and can be resolve by specific domain knowledge of Data Science team, Engineering team. And more advantage of microservices with Agile development, more information here. With the AI project, there will focus more on “How to resolve business by AI technology?”.

Microservices maybe not the best choice but that help to quickly development and delivery with Agile methodology.

Continues Integration and Continues Deployment

When a project including multiple teams, multiple services challenges at the integration and deployment. CI/CD is most popular with software development but I got more specific from Data Science(DS) team. The big question of DS is “We have more solutions to resolve this problem, Could you help me propose a solution to quickly evaluation and integration?

With the Engineering team, CI/CD pipeline is so general. With AI solution, you will meet some challenges linking to:
– How to running on distributed computing? We choose batch jobs
– How to save money with long-time jobs? We choose AWS spot instances
– How do parallel jobs improve performance? We running parallel jobs and parallel on structure design(Python coding)
– How to control Data versions, Model versions? we choose Data Version Control and AWS S3 to versioning training/evaluation data and models

All solutions applied to my project aimed to resolve challenges of AI technology, but it interesting. A good abstract of structure will help to quickly integrate and deliver multiple approaches.

This pipeline can implement with any CI/CD framework such as Gitlab CI, Jenkins, AWS Code Build … So, each framework should have a function for custom distributed and parallel jobs. Because the jobs in the pipeline need specific resources and the resource should be auto-scale. Example for Training Jobs need more GPUs and System Evaluation need more CPUs for parallels, a scalable resource is most important to save the cost.

CI/CD pipeline including training and system will help fast try and fast result, the implementation can easy to integrate quickly, trust and ablility to control quality.

N-gram language models – Part 3



In previous parts of my project, I built different n-gram models to predict the probability of each word in a given text. This probability is estimated using an n-gram — a sequence of words of length n — which contains the word. The below formula shows how the probability of the word “dream” is estimated as part of the trigram “have a dream”:

N-gram language models
The vertical line denotes the probability of “dream” given the previous words “have a”

We train the n-gram models on the book “A Game of Thrones” by George R. R. Martin (called ). We then evaluate the models on two texts: “A Clash of Kings” by the same author (called ), and “Gone with the Wind” — a book from a completely different author, genre, and time (called ).

N-gram language models


The metric to evaluate the language model is average log-likelihood: the average of the log probability that the model assigns to each word in the evaluation text.

N-gram language models
N_eval: total number of words in the evaluation text

Often, the log of base 2 is applied to each probability, as is the case in the first two parts of the project. Nevertheless, in this part, I will use a natural log, as it makes it simpler to derive the formulas that we will be using.


In part 2, the various n-gram models — from unigram to 5-gram — were evaluated on the evaluation texts ( and , see graphs below).

N-gram language models


From this, we notice that:

  • The Bigram model performs slightly better than the unigram model. This is because the previous word to the bigram can provide important context to predict the probability of the next word.
  • Surprisingly, the trigram model and up is much worse than bigram or unigram models. This is largely due to the high number of trigrams, 4-grams, and 5-grams that appear in the evaluation texts but nowhere in the training text. Hence, their predicted probability is zero.
  • For most n-gram models, their performance is slightly improved when we interpolate their predicted probabilities with the uniform model. This seems rather counter-intuitive since the uniform model simply assigns equal probability to every word. However, as explained in part 1, interpolating with this “dumb” model reduces the overfit and variance of the n-gram models, helping them generalize better to the evaluation texts.

In this part of the project, we can extend model interpolation even further: instead of separately combining each n-gram model with the uniform, we can interpolate different n-gram models with one another, along with the uniform model.

What to interpolate?

The first question to ask when interpolating multiple models together is:

To answer this question, we use the simple strategy outlined below:

  1. First, we start with the uniform model. This model will have very low average log-likelihoods on the evaluation texts since it assigns every word in the text the same probability.
  2. Next, we interpolate this uniform model with the unigram model and re-evaluate it on the evaluation texts. We naively assume that the models will have equal contribution to the interpolated model. As a result, each model will have the same interpolation weight of 1/2.
  3. We then add the bigram model to the mix. Similarly, in this 3-model interpolation, each model will simply have the same interpolation weight of 1/3.
  4. We keep adding higher n-gram models to the mix while keeping the mixture weights the same across model, until we reach the 5-gram model. After each addition, the combined model will be evaluated against the two evaluation texts,  and .

Coding the interpolation

In part 2, each evaluation text had a corresponding probability matrix. This matrix has 6 columns — one for each model — and each row of the matrix represents the probability estimates of each word under the 6 models. However, since we want to optimize the model performance on both evaluation texts, we will vertically concatenate these probability matrices into one big evaluation probability matrix (803176 rows × 6 columns):

Duality theorems


Duality theorems
Find x₁ and x₂ to minimize f(x₁, x₂). Source

Optimization shows up everywhere in machine learning, from the ubiquitous gradient descent to quadratic programming in SVM, to expectation-maximization algorithm in Gaussian mixture models.

However, one aspect of optimization that always puzzled me is duality: what on earth are a primal form and dual form of an optimization problem, and what good do they serve?

Therefore, in this project, I will:

  • Go over the primal and dual forms for the most basic optimization problem: linear programming.
  • Show that by solving one form of the optimization problem, we will have also solved the other one. This requires us to prove two fundamental duality theorems in linear programming: weak duality theorem and strong duality theorem. The former theorem will be proven in this part, while the latter will be proven in the next part of the project.
  • Explain why we should care about duality by showing its application to some data science problems.

Linear programming


All (constrained) optimization problems have three components:

1.Objective function: the function whose value we are trying to optimize, which can mean minimize or maximize depending on the problem. The value of the objective function will be called the  from here on.

2.Decision variables: the variables in the objective function whose value will be fine-tuned to give the objective function its .

3.Constraints: additional equations or inequalities that the decision variables must conform to.

With these components, we can define linear programming as such:

Linear programming is an optimization problem where the objective function and constraints are all linear functions of the decision variables.

This principle can be seen in the following formulation of a linear program:

Duality Theorems



x: vector containing the decision variables

c: vector containing coefficients for each decision variable in the objective function. For simplicity, we will call these coefficients the objective coefficients from here on.

A: matrix in which each row contains the coefficients of each constraint

b: vector containing the limiting values of each constraint

Note that the vector inequalities in the above formula imply element-wise inequalities. For example, x ≥ 0 means every element of x must be greater or equal to zero.

Geometric interpretation of a linear program

Although the above formula of a linear program seems quite abstract, let’s see what it looks like using a simple example.

  • Suppose we have only had 2 decision variables, x₁, and x₂. Therefore, our vector x is simply [x₁, x₂]

Duality Theorems

For More Detail, Please check the Link

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

Data Science Blog

Please check our other Data Science Blog

AI / Data Science Project

Please check about experiences for Data Science Project

Vietnam AI / Data Science Lab

Vietnam AI Lab
Vietnam AI Lab

Please also visit Vietnam AI Lab