Gaussian samples – Part 2

Background

In part 1 of this project, I showed how to generate Gaussian samples using the common technique of inverse transform sampling:

The bad news: the Gaussian inverse CDF has no closed-form expression, so we had to approximate it, using a simple Taylor series. However, this approximation is only accurate near the Gaussian mean, and it under-samples the more extreme values at both ends of the distribution.


A: left-side area (uniformly sampled). Tᵢ: the iᵗʰ-degree Taylor series approximation of the inverse Gaussian CDF: √2 Erfinv(2A-1)

Therefore, in this part of the project, we will investigate a more “direct” sampling method that does not depend on the approximation of the Gaussian inverse CDF. This method is called the Box-Muller transform, after the two mathematicians who invented the method in 1958: the British George E. P. Box and the American Mervin E. Muller.


Left: George E. P. Box (1919–2013). Right: Mervin E. Muller (1928–2018)

How does the Box-Muller transform work?

For this project, my goal is to generate Gaussian samples in two dimensions, i.e., samples whose x and y coordinates are independent standard normals (Gaussian with zero mean and a standard deviation of 1). In part 1, I used the inverse Gaussian CDF to generate the x and y coordinates separately from their respective uniform samples (U₁ and U₂):


Generate x and y coordinates from uniform samples (U₁ and U₂) using inverse transform sampling

For the Box-Muller transform, I will also start with the same two uniform samples. However, I will transform these uniform samples into the x and y coordinates using much simpler formulas:


Generate x and y coordinates from uniform samples (U₁ and U₂) using Box-Muller transform

Despite the strong coupling between U₁ and U₂ in each of the two formulas above, the generated x and y coordinates, which are both standard Gaussians, are still surprisingly independent of each other! In the derivation of the Box-Muller transform that follows, I will demonstrate why this is indeed the case.
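These formulas are simple enough to implement in a few lines of numpy. Below is a minimal sketch (sample size and seed are arbitrary; U₁ is shifted away from zero so the logarithm is always defined):

import numpy as np

rng = np.random.RandomState(42)
u1 = 1 - rng.random_sample(100000)  # shift [0, 1) to (0, 1] so log(u1) is finite
u2 = rng.random_sample(100000)

# Box-Muller transform: two independent standard normals from two uniforms
r = np.sqrt(-2 * np.log(u1))        # radius of the 2-D sample
x = r * np.cos(2 * np.pi * u2)
y = r * np.sin(2 * np.pi * u2)

Plotting x against y should produce the familiar circular Gaussian blob.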

Derivation of Box-Muller transform

We know that for any two independent random variables x and y, the joint probability density f(x, y) is simply the product of the individual density functions f(x) and f(y). Furthermore, Pythagoras' theorem lets us combine the x² and y² terms from the two Gaussian density functions. This results in the −r²/2 term in the exponential of the joint distribution, where r is the distance from the origin to the 2-D Gaussian sample.


To simplify, we then define a variable s equal to r²/2. In other words, s is simply half the squared distance from the origin to our Gaussian sample. Written this way, the joint PDF is simply the product of a constant (1/2π) and an exponential (e⁻ˢ).
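Putting the steps above into one line, with f(x) and f(y) the standard normal densities, r² = x² + y², and s = r²/2:

$$f(x)\,f(y) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}\cdot\frac{1}{\sqrt{2\pi}}e^{-y^2/2} = \frac{1}{2\pi}e^{-(x^2+y^2)/2} = \frac{1}{2\pi}e^{-r^2/2} = \frac{1}{2\pi}e^{-s}$$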

Gaussian samples – Part 1

Background

Gaussian sampling — that is, generating samples from a Gaussian distribution — plays an important role in many cutting-edge fields of data science, such as Gaussian processes, variational autoencoders, and generative adversarial networks. As a result, you often see functions like tf.random.normal in their tutorials.

But, deep down, how does a computer know how to generate Gaussian samples? This series of blog posts will show 3 different ways that we can program our computer (via Python) to do so. You will also see how R and Python generate Gaussian samples using modified versions of some of these methods.

Starting point: the uniform number generator

Of course, we can’t generate Gaussian samples from thin air. Instead, we start with a random number generator that exists in almost all programming languages: the uniform random number generator. It generates a random number that could take any value between 0 and 1. For Python, the numpy.random module uses the Mersenne twister to generate a uniformly-distributed float that is in the interval [0.0, 1.0).
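In numpy, for example (the seed is arbitrary):

import numpy as np

rng = np.random.RandomState(0)  # Mersenne twister-based generator
u = rng.random_sample(5)        # uniform floats in [0.0, 1.0)
print(u)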

Since Gaussians are better visualized in 2 dimensions — we are all familiar with the Gaussian “blob” in the xy-plane — I will demonstrate the 3 sampling methods in 2-D, especially since one of the methods does in fact generate Gaussians in two dimensions at the same time.


Step 1: Generate standard Gaussian samples in 2-D. Step 2: Transform standard Gaussian samples to have given means, variances, and covariance between x and y

As a result, this series is broken down into 3 parts (see accompanying image):

Method 1: Inverse transform sampling

This is the most basic, yet most common, way to convert a uniform random sample into a random sample of any distribution, including Gaussian. This method works by applying the inverse function of the Gaussian CDF (cumulative distribution function) to transform a uniform sample to a Gaussian sample.

To make sure that the Gaussian samples for the x- and y-coordinates are independent, we can use two different uniform samples, one for x (U₁), and one for y (U₂). These two uniform samples can be generated using two different random number generators (two different RandomStates initialized with different seeds, for example) so that they are independent in the first place.


Generate 2-D Gaussian samples from uniform samples
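Here is a minimal sketch of this method in code, using scipy's norm.ppf as the inverse Gaussian CDF instead of the Taylor-series approximation discussed in this series (seeds and sample size are arbitrary):

import numpy as np
from scipy.stats import norm

rng_x = np.random.RandomState(1)  # two generators seeded differently,
rng_y = np.random.RandomState(2)  # so U1 and U2 are independent

u1 = rng_x.random_sample(100000)
u2 = rng_y.random_sample(100000)

# inverse transform sampling: apply the inverse Gaussian CDF to each uniform
x = norm.ppf(u1)
y = norm.ppf(u2)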

How does this work?

This method works by exploiting a mind-blowing principle:

For any distribution, the cumulative probability is always uniformly distributed.

The arithmetic proof of this principle is straightforward but rather boring; you can view it in this lecture. Instead, I will show the geometric interpretation of this principle. But first, let’s clarify what cumulative probability is:
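In the meantime, here is a quick numerical check of the principle (a sketch; any distribution works, an exponential is used here):

import numpy as np
from scipy.stats import expon

rng = np.random.RandomState(0)
samples = expon.rvs(size=100000, random_state=rng)
cum_probs = expon.cdf(samples)  # cumulative probability of each sample

# if the principle holds, these bin frequencies should all be close to 0.1
hist, _ = np.histogram(cum_probs, bins=10, range=(0, 1))
print(hist / len(cum_probs))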

Practice Design for Try/Fail Fast


At the moment, AI/ML/DL are hot keywords in software development. The world has more and more successful projects based on AI technologies, such as Google Translate and AWS Alexa; AI makes machines smarter than ever. Even so, the path from idea to a successful product has many challenges if you want to build a great solution. I have spent some time working on AI projects and at a start-up building solutions based on algorithms and ML, where I aimed to propose and implement solutions that help the development team work smoothly. Today, I would like to describe a development process, architecture, CI/CD setup, and programming practices for quickly implementing multiple AI approaches with the Agile software development methodology.

Sections:
– Architecture
– Continuous Integration and Continuous Deployment
– Batch Processing, Parallel Processing
– Data-Driven Development and Test-Driven Development (to be continued)

Architecture

An AI project includes multiple services across two domains, AI/ML/DL and engineering, that must be developed independently and integrated and verified automatically. Typically, the ML services differ greatly from the engineering services: they tackle challenging problems involving machine learning, deep learning, big data, and distributed computing. A microservices architecture is the natural first choice here, because it separates the business problem into specific services that can each be solved with the specific domain knowledge of the data science or engineering team. Microservices also pair well with Agile development (more information here). In an AI project, the focus is more on “How do we solve the business problem with AI technology?”

Microservices may not be the best choice in every case, but they enable quick development and delivery with the Agile methodology.

Continuous Integration and Continuous Deployment

When a project includes multiple teams and multiple services, integration and deployment become challenging. CI/CD is standard in software development, but the Data Science (DS) team gave me a more specific requirement. Their big question was: “We have several candidate solutions for this problem. Could you propose a way to evaluate and integrate them quickly?”

For the engineering team, the CI/CD pipeline is fairly standard. For an AI solution, however, you will meet some additional challenges:
– How to run on distributed computing? We chose batch jobs.
– How to save money on long-running jobs? We chose AWS spot instances.
– How to parallelize jobs to improve performance? We run parallel jobs and design the (Python) code structure for parallelism (see the sketch after this list).
– How to control data and model versions? We chose Data Version Control (DVC) and AWS S3 to version training/evaluation data and models.
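As an illustration of the parallel code structure, here is a minimal sketch in Python; evaluate_model and the approach names are hypothetical placeholders:

from multiprocessing import Pool

def evaluate_model(approach):
    """Hypothetical job: train/evaluate one approach and return its metric."""
    # ... load data, run the approach, compute a score ...
    return {"approach": approach, "score": 0.0}

if __name__ == "__main__":
    approaches = ["approach_a", "approach_b", "approach_c"]  # hypothetical
    with Pool(processes=3) as pool:
        results = pool.map(evaluate_model, approaches)  # evaluate in parallel
    print(results)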

All of the solutions applied in my project aim to resolve the challenges of AI technology, and the work is interesting. A good abstraction of the code structure helps to integrate and deliver multiple approaches quickly.

This pipeline can be implemented with any CI/CD framework, such as GitLab CI, Jenkins, or AWS CodeBuild. The chosen framework should support custom distributed and parallel jobs, because the jobs in the pipeline need specific resources and those resources should auto-scale. For example, training jobs need more GPUs while system evaluation needs more CPUs for parallelism, so scalable resources are the most important factor in saving cost.

A CI/CD pipeline that includes training and system evaluation enables fast trials and fast results: implementations can be integrated quickly, trusted, and kept under quality control.

N-gram language models – Part 3

 


Background

In previous parts of my project, I built different n-gram models to predict the probability of each word in a given text. This probability is estimated using an n-gram — a sequence of words of length n — which contains the word. The formula below shows how the probability of the word “dream” is estimated as part of the trigram “have a dream”:


The vertical line denotes the probability of “dream” given the previous words “have a”
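Assuming the count-based maximum-likelihood estimate used in the earlier parts of the project, the formula in the image reads:

$$P(\text{dream} \mid \text{have a}) = \frac{C(\text{have a dream})}{C(\text{have a})}$$

where C(·) counts how many times a sequence appears in the training text.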

We train the n-gram models on the book “A Game of Thrones” by George R. R. Martin. We then evaluate the models on two texts: “A Clash of Kings” by the same author, and “Gone with the Wind” — a book from a completely different author, genre, and time.


The metric to evaluate the language model is average log likelihood: the average of the log probability that the model assigns to each word in the evaluation text.


N_eval: total number of words in the evaluation text

Often, log of base 2 is applied to each probability, as is the case in the first two parts of the project. Nevertheless, in this part, I will use natural log, as it makes it simpler to derive the formulas that we will be using.
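In code, the metric is a one-liner. A minimal sketch, assuming word_probs holds the probability the model assigns to each word of the evaluation text:

import numpy as np

def average_log_likelihood(word_probs):
    """Mean natural-log probability over all N_eval words."""
    return np.log(np.asarray(word_probs)).mean()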

Problem

In part 2, the various n-gram models — from unigram to 5-gram — were evaluated on the two evaluation texts (see the graphs below).


From this, we notice that:

  • The bigram model performs slightly better than the unigram model, because the previous word in the bigram provides important context for predicting the probability of the next word.
  • Surprisingly, the trigram model and up are much worse than the bigram or unigram models. This is largely due to the high number of trigrams, 4-grams, and 5-grams that appear in the evaluation texts but nowhere in the training text; their predicted probabilities are therefore zero.
  • For most n-gram models, performance improves slightly when we interpolate their predicted probabilities with the uniform model. This seems counter-intuitive, since the uniform model simply assigns equal probability to every word. However, as explained in part 1, interpolating with this “dumb” model reduces the overfitting and variance of the n-gram models, helping them generalize better to the evaluation texts.

In this part of the project, we can extend model interpolation even further: instead of separately combining each n-gram model with the uniform, we can interpolate different n-gram models with one another, along with the uniform model.

What to interpolate?

The first question to ask when interpolating multiple models together is: which models should be part of the mix?

To answer this question, we use the simple strategy outlined below:

  1. First, we start with the uniform model. This model will have very low average log likelihoods on the evaluation texts, since it assigns every word in the text the same probability.
  2. Next, we interpolate this uniform model with the unigram model and re-evaluate it on the evaluation texts. We naively assume that the models will have equal contribution to the interpolated model. As a result, each model will have the same interpolation weight of 1/2.
  3. We then add the bigram model to the mix. Similarly, in this 3-model interpolation, each model will simply have the same interpolation weight of 1/3.
  4. We keep adding higher n-gram models to the mix, while keeping the mixture weights the same across models, until we reach the 5-gram model. After each addition, the combined model is evaluated against the two evaluation texts; a sketch of this procedure follows the list.
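A minimal sketch of this procedure, assuming a per-word probability matrix whose columns are ordered from uniform to 5-gram (as described in the next section):

import numpy as np

def stepwise_interpolation(prob_matrix):
    """Average log likelihood as each model joins the equal-weight mix.

    prob_matrix: (n_words, 6) probabilities under
    [uniform, unigram, bigram, trigram, 4-gram, 5-gram].
    """
    scores = []
    for k in range(1, prob_matrix.shape[1] + 1):
        mixed = prob_matrix[:, :k].mean(axis=1)  # equal weights of 1/k
        scores.append(np.log(mixed).mean())
    return scores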

Coding the interpolation

In part 2, each evaluation text had a corresponding probability matrix. This matrix has 6 columns — one for each model — and each row represents the probability estimates of one word under the 6 models. However, since we want to optimize model performance on both evaluation texts, we will vertically concatenate these probability matrices into one big evaluation probability matrix (803176 rows × 6 columns):
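The concatenation itself is a single numpy call. In this sketch the two matrix names and their individual row counts are stand-ins; only their total of 803176 rows comes from the project:

import numpy as np

rng = np.random.RandomState(0)
prob_matrix_1 = rng.uniform(size=(300000, 6))  # placeholder for one text's matrix
prob_matrix_2 = rng.uniform(size=(503176, 6))  # placeholder for the other's

eval_prob_matrix = np.vstack([prob_matrix_1, prob_matrix_2])
print(eval_prob_matrix.shape)  # (803176, 6)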

Duality theorems


Introduction


Find x₁ and x₂ to minimize f(x₁, x₂). Source

Optimization shows up everywhere in machine learning, from the ubiquitous gradient descent, to quadratic programming in SVMs, to the expectation-maximization algorithm in Gaussian mixture models.

However, one aspect of optimization that always puzzled me is duality: what on earth is a primal form and dual form of an optimization problem, and what good do they really serve?

Therefore, in this project, I will:

  • Go over the primal and dual forms for the most basic of optimization problems: linear programming.
  • Show that by solving one form of the optimization problem, we will have also solved the other one. This requires us to prove two fundamental duality theorems in linear programming: weak duality theorem and strong duality theorem. The former theorem will be proven in this part, while the latter will be proven in the next part of the project.
  • Explain why we should care about duality by showing its application to some data science problems.

Linear programming

Definition

All (constrained) optimization problems have three components:

1. Objective function: the function whose value we are trying to optimize, which can mean minimize or maximize depending on the problem. The value of the objective function will be called the objective value from here on.

2. Decision variables: the variables in the objective function whose values will be fine-tuned to give the objective function its optimal value.

3. Constraints: additional equations or inequalities that the decision variables must conform to.

With these components, we can define linear programming as such:

Linear programming is an optimization problem where the objective function and constraints are all linear functions of the decision variables.

This principle can be seen in the following formulation of a linear program:

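One common standard form consistent with the definitions below (whether the original formulation maximizes or minimizes is an assumption on my part):

$$\begin{aligned} \text{maximize} \quad & c^\top x \\ \text{subject to} \quad & Ax \le b \\ & x \ge 0 \end{aligned}$$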

where

x: vector containing the decision variables

c: vector containing coefficients for each decision variable in the objective function. For simplicity, we will call these coefficients the objective coefficients from here on.

A: matrix in which each row contains the coefficients of each constraint

b: vector containing the limiting values of each constraint

Note that the vector inequalities in the above formula imply element-wise inequalities. For example, x ≥ 0 means every element of x must be greater than or equal to zero.

Geometric interpretation of a linear program

Although the above formula of a linear program seems quite abstract, let’s see what it looks like using a simple example.

  • Suppose we only have 2 decision variables, x₁ and x₂. Our vector x is then simply [x₁, x₂] (see the worked example below).
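To make this concrete, here is a tiny linear program in these two variables solved with scipy; the objective coefficients and constraints are made-up numbers for illustration:

from scipy.optimize import linprog
import numpy as np

c = np.array([-3.0, -2.0])   # linprog minimizes, so negate to maximize 3x1 + 2x2
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])   # one row of coefficients per constraint
b = np.array([4.0, 5.0])     # limiting values of the constraints

# default bounds are x >= 0, matching the formulation above
result = linprog(c, A_ub=A, b_ub=b)
print(result.x, -result.fun)  # optimal decision variables and objective value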

