Practice Design for Try/Fail Fast

At the moment, AI/ML/DL are hot keywords in trend of Software development. The world have more successful projects based on AI technologies such as Google Translate, AWS Alexa, …AI make machine smarter than. So, the way from idea to successfully have many challenges if want to make great solution. I have some time working with AI projects and start-up build great solution based on Algorithms and ML; I aimed to propose and implement solutions that help development team working smoothly. Today, I would like to describe about development process, Architecture, CI/CD and Programming for quickly implement multiple AI approaches with Agile software development methodology.

Continues Integration and Continues Deployment

– Batch Processing, Parallel Processing
– Data Driven Development and Test Driven Development (to be continues)


AI project including multiple services with domains focus on: AI/ML/DL and engineering that develop independent, integration and verification automatically. Popular, the ML services very specially with Engineering Service, resolve challenges problems linking with technologies: Machine learning, deep learning, big data, distributed computing … Microservices architecture in this case is a first choose, that help to separate business problem to specific services and can be resolve by specific domain knowledge of Data Science team, Engineering team. And more advantage of microservices with Agile development, more information here. With AI project, there will focus more on “How to resolve business by AI technology?”.

Microservices maybe not a best choose but that help to quickly development and delivery with Agile methodology.

Continues Integration and Continues Deployment

When project including multiple teams, multiple services which challenges at the integration and deployment. CI/CD is most popular with software development but i got more specific from Data Science(DS) team. The big question of DS is “We have more solutions to resolve this problem, Could you help me propose solution to quickly evaluation and integration?

With Engineering team, CI/CD pipeline is so general. With AI solution, you will meet some challenges linking to:
– How to running on distributed computing? We choose batch jobs
– How to save money with long time jobs? We choose AWS spot instances
– How to parallel jobs to improvement performance? We running parallel jobs and parallel on structure design(Python coding)
– How to control Data versions, Model versions? we choose Data Version Control and AWS S3 to versioning training/evaluation data and models

All solutions applied on my project aimed to resolve challenges of AI technology, but it interesting. The good abstract of structure will help to quickly integration and deliver multiple approaches.

This pipeline can implement with any CI/CD framework such as Gitlab CI, Jenkins, AWS Code Build … So, each framework should have function for custom distributed and parallel jobs. Because the jobs in the pipeline need specific resource and the resource should be auto scale. Example for Training Jobs need more GPUs and System Evaluation need more CPUs for parallels, scalable resource is most important to save the cost.

CI/CD pipeline including training and system will help fast try and fast result, the implementation can easy to integrate quickly, trust and able to control quality.

N-gram language models -Part3



In previous parts of my project, I built different n-gram models to predict the probability of each word in a given text. This probability is estimated using an n-gram — a sequence of word of length n — which contains the word. The below formula shows how the probability of the word “dream” is estimated as part of the trigram “have a dream”:

N-gram language models

The vertical line denotes the probability of “dream” given the previous words “have a”

We train the n-gram models on the book “A Game of Thrones” by George R. R. Martin (called ). We then evaluate the models on two texts: “A Clash of Kings” by the same author (called ), and “Gone with the Wind” — a book from a completely different author, genre, and time (called ).

N-gram language models

The metric to evaluate the language model is average log likelihood: the average of the log probability that the model assigns to each word in the evaluation text.

N-gram language models

N_eval: total number of words in the evaluation text

Often, log of base 2 is applied to each probability, as is the case in the first two parts of the project. Nevertheless, in this part, I will use natural log, as it makes it simpler to derive the formulas that we will be using.


In part 2, the various n-gram models — from unigram to 5-gram — were evaluated on the evaluation texts ( and , see graphs below).

N-gram language models

From this, we notice that:

  • Bigram model perform slightly better than unigram model. This is because the previous word to the bigram can provide important context to predict the probability of the next word.
  • Surprisingly, trigram model and up are much worse than bigram or unigram models. This is largely due to the high number of trigrams, 4-grams, and 5-grams that appear in the evaluation texts but nowhere in the training text. Hence, their predicted probability is zero.
  • For most n-gram models, their performance is slightly improved when we interpolate their predicted probabilities with the uniform model. This seems rather counter-intuitive, since the uniform model simply assigns equal probability to every word. However, as explained in part 1, interpolating with this “dumb” model reduces the overfit and variance of the n-gram models, helping them generalize better to the evaluation texts.

In this part of the project, we can extend model interpolation even further: instead of separately combining each n-gram model with the uniform, we can interpolate different n-gram models with one another, along with the uniform model.

What to interpolate?

The first question to ask when interpolating multiple models together is:

To answer this question, we use the simple strategy outlined below:

  1. First, we start with the uniform model. This model will have very low average log likelihoods on the evaluation texts, since it assigns every word in the text the same probability.
  2. Next, we interpolate this uniform model with the unigram model and re-evaluate it on the evaluation texts. We naively assume that the models will have equal contribution to the interpolated model. As a result, each model will have the same interpolation weight of 1/2.
  3. We then add the bigram model to the mix. Similarly, in this 3-model interpolation, each model will simply have the same interpolation weight of 1/3.
  4. We keep adding higher n-gram models to the mix, while keeping the mixture weights the same across models, until we reach the 5-gram model. After each addition, the combined model will be evaluated against the two evaluation texts,  and .

Coding the interpolation

In part 2, each evaluation text had a corresponding probability matrix. This matrix has 6 columns — one for each model — and each row of the matrix represents the probability estimates of each word under the 6 models. However, since we want to optimize the model performance on both evaluation texts, we will vertically concatenate these probability matrices into one big evaluation probability matrix (803176 rows × 6 columns):

Duality theorems


Duality theorems

Find x₁ and x₂ to minimize f(x₁, x₂). Source

Optimization shows up everywhere in machine learning, from the ubiquitous gradient descent, to quadratic programming in SVM, to expectation-maximization algorithm in Gaussian mixture models.

However, one aspect of optimization that always puzzled me is duality: what on earth is a primal form and dual form of an optimization problem, and what good do they really serve?

Therefore, in this project, I will:

  • Go over the primal and dual forms for the most basic of optimization problem: linear programming.
  • Show that by solving one form of the optimization problem, we will have also solved the other one. This requires us to prove two fundamental duality theorems in linear programming: weak duality theorem and strong duality theorem. The former theorem will be proven in this part, while the latter will be proven in the next part of the project.
  • Explain why we should care about duality by showing its application to some data science problems.

Linear programming


All (constrained) optimization problems have three components:

1.Objective function: the function whose value we are trying to optimize, which can mean minimize or maximize depending on the problem. The value of the objective function will be called  from here on.

2.Decision variables: the variables in the objective function whose value will be fine-tuned to give the objective function its .

3.Constraints: additional equations or inequalities that the decision variables must conform to.

With these components, we can define linear programming as such:

Linear programming is an optimization problem where the objective function and constraints are all linear functions of the decision variables.

This principle can be seen in the following formulation of a linear program:

Duality theorems


x: vector containing the decision variables

c: vector containing coefficients for each decision variable in the objective function. For simplicity, we will call these coefficients the objective coefficients from here on.

A: matrix in which each row contains the coefficients of each constraint

b: vector containing the limiting values of each constraint

Note that the vector inequalities in the above formula implies element-wise inequalities. For example, x ≥ 0 means every element of x must be greater or equal to zero.

Geometric interpretation of a linear program

Although the above formula of a linear program seems quite abstract, let’s see what it looks like using a simple example.

  • Suppose we have only have 2 decision variables, x₁ and x₂. Therefore, our vector x is simply [x₁, x₂]

Duality theorems

For More Detail, Please check the Link