With data that have time-related information, time features can be created to to possibly add more information to the models.
Since how to consider time series for machine learning is a broad topic, this article only aims to introduced basic ways to create time features for those models.
Type of data that is expected for this application
It is expected that transaction data type or any kind that similar to it will be the common one for this application. Other kinds of data that have timestamps information for each data points should also be applicable to some extent of this approach well.
Considering before attempt: a need of analyzing the problem and scope
For data with time element, it can be presented as a time series, which is how a set of data points described an entity are follow ordered indexes of time. One aspect considered for time series is that observations is expected to depend on previous one in time sequence, with the later is correlated to the one before. In those cases, using time series models for forecasting is a straightforward approach to use this data. Another way to approach it is to use feature engineering to transform data to have features that can be used for supervised machine learning model, which is the focus of this article.
Using time series model or adapting machine learning model is depended. In some cases, domain knowledge or business requirement will influence this decision. It is better to analyze the problem first to see the need of using either one or both type of models.
Regardless of the domain knowledge or business requirement aspects, the approach decision should have always considering the efficiency the approach will bring for in term of accuracy and computation cost.
A first preprocessing step to have a first set of time features: extracting time information from timestamp
The most straightforward thing to do is to extract basic time units, which for instance are hours, date, month, year into separates features. Another kind of information that can also be extracted is the characteristic of the time period, which could be whether the time is at a part of days (morning, afternoon), is weekend or not or is it a holiday, etc.
In some business requirements or domains’ aspects, those initial features at this level is already needed to see if the value of observation is follow those factors or not. For example, the data is the record of timestamps of customer visiting a shop and their purchase. There is a need to know at which hours, date, month… a customer would come and purchase so that follow up actions can be made to increase sales.
Regarding feature engineering for time data, the well-known technique that commonly used is aggregate features by taking statistics (variance, max, min, etc.) of the set of values grouped by a desired time unit: hours, days, months…
Apart from that, a time window could be defined and compute aggregate by rolling or expanding time window.
- Rolling: have a fixed time window size and to predict a value for the data point at a time, features will be computed from aggregating backward number of time steps according to the time window
- Expanding: from the data point, the window will be the whole record of past time steps.
There are also two aspects of aggregating:
- Aggregating to create new features for the current data points. For the first case, the model is considered to include the time series characteristic, meaning a moment will likely to be related by other moments from the recent past.
- Aggregating to create a new set of data points with corresponding new set of features from the current ones. For the second one, the new number of data points considering for the model is changed and each new data point is the summary of information from a subset of initial data points. As a results, objects for the models may be shifted like being mentioned in considering before part. If the data only about information record of one entity, or in other word only contains one time series of an entity, through this techniques, the new computed data points can be the summary of other features’ value in the chosen time unit. On the other hand, if there are more entities observed in the data set, each new data points is then the summary information of each observed entities.
How to decide on the focus objects for the problem and the approach is situational but for fresh problem and fresh data with no specific requirement or prior domain knowledge, it is better to consider all of them for the model and execute feature selection to see if the created time features is of any value.
Dealing with hours of a day – Circular data
For some needs, specific time of a day is required to be focus at. A use case of of detecting fraud transactions is a good example for this. To find something like the most frequent time that a kind of behaviour is performed for instance, using arithmetic mean mean may be misleading and is not a good representation. An important point need to be considered is that hours of day is a circular data and it should be represented on a circular axis with its’ value between 0 to 2π. To have a better representation of the mean, using von Mises distribution to have periodic mean is a suitable approach for this situation (Mishtert, 2019).
Validation for the model
Before the model building, a validation set is needed to be selected from the data first. In the usual cases, to avoid overfitting data will be randomly shuffled and then will be divided into training set and validation set. However, for this kind of situations it shouldn’t be done so to avoid the mistake of having past data in the validation and the future data in the training, in other words using future data to predict the past.
In this blog post, I am going to introduce one of the most intuitive algorithms in the field of Supervised Learning, the k-Nearest Neighbors algorithm (kNN).
The original k-Nearest Neighbors algorithm
The kNN algorithm is very intuitive. Indeed, with the assumption that items close together in the dataset are typically similar, kNN infers the output of new sample by first constructing the distance score with every sample in the training dataset. From there, it creates a ‘neighbor zone’ by selecting samples that are ‘near’ the candidate one, and does the supervised tasks based on samples lie inside that zone. The task could be either classification or regression.
Let’s start with the basic kNN algorithm. Let be our training dataset with samples belong to classes, where is the class of one sample, and denotes the corresponding feature vector that describes the characteristics of that sample. Furthermore, it is necessary to define the suitable distance metric, since it drives the algorithm to select neighbors and make prediction later on. The distance metric , is a mapping over a vector space , where the following conditions are satisfied :
In the following steps to describe the k-Nearest Neighbors algorithm, the Euclidean distance will be used as the distance metric .
For any new instance :
- Find where is the set of samples that are closest to
- The way to define the nearest neighbors is based on distance metric (Note that we are using Euclidean distance).
- The classifier is defined as:
where is the unit function. Note that for the regression problem, the function will just an average of all response values from neighbor samples.
Please also check Vietnam AI Lab
Weighted k-Nearest Neighbors
In the kNN algorithm, we weight all neighbors equally. It may affect the inference steps, especially when the neighbor zone becomes bigger and bigger. To strengthen the effect of ‘close’ neighbors than others, the weighted scheme of k-Nearest Neighbors is applied.
Weighted k-Nearest Neighbors is based on the idea that, within , such observations that are closer to , should get a higher weight than the further neighbors. Now it is necessary to note some properties of any weighting schemes on any distance metric :
- decreases monotonously for
For any new instance :
- We find where is the set of samples that are closest to
- The th neighbor is used for standardization of the smallest distance:
- We transform the standardized distance with any kernel function into weights .
- The classifier is defined as:
The pros and cons of kNN, and further topics
The kNN and weighted kNN do not rely on any specific assumption on the distribution of the data, so it is quite easy to apply it on many problems as the baseline model. Furthermore, kNN (and its family) is very intuitive for understanding and implementing, which again make it a worthy try-it-first approach for many supervised problems.
Despite those facts, kNN still has challenges in some aspects: It is computationally expensive – especially when the dataset size becomes huge. Another challenge is choosing the ‘correct’ distance metric that best follows the assumption for using this algorithm: items close together in the data set should be typically similar. Lastly, the curse of dimensionality heavily affects the distance metric. Beyer et al. proves that, under some preconditions, in high dimension space, all points converge to the same distance from the query point. In this case, the concept of ‘nearest neighbors’ is no longer meaningful.
I. A Brief Overview
Consider an example of a courtroom trial:
A car company C is accused of not manufacturing environment-friendly vehicles. The average CO2 emission per car from different manufacturers based on a survey from the previous year is 120.4 grams per kilometer. But for a random batch of 100 cars produced at C’s factory, the average CO2 emission is 121.2 grams per kilometer with a standard deviation 1.8.
At the trial, Company C is not considered to be guilty as long as their wrongdoing is not proven. A public prosecutor tries to prove that C is guilty and can only succeed when enough evidence is presented.
The example above illustrates the concepts of hypothesis testing; specifically, there are two conflicting hypotheses:
i) C is not guilty; or
ii) C is guilty
The first is called the null hypothesis (denoted by H0), and the second the alternative hypothesis (denoted by HA). At the start of the trial the null hypothesis is temporarily accepted, until proven otherwise. The goal of hypothesis testing is to perform some sort of transformed comparison between the two numbers 121.2 and 120.4 to either reject H0 and accept HA or vice versa. This is a one-sample mean testing because we are comparing the average value obtained from one sample (121.2) with the average value assumed to represent for the whole population (120.4)
II. Required Steps for Hypothesis Testing
The six steps below must be followed to conduct a hypothesis test. The details will be elaborated with our example afterwards.
1) Set up null and alternative hypotheses and check conditions.
2) Determine the significance level, alpha.
3) Calculate the test statistic.
4) Calculate the probability value (a.k.a the p-value), or find the rejection region. For the following example, we will use the p-value.
5) Make a decision about the null hypothesis.
6) State the overall conclusion.
III. A step-by-step example
1) Set up hypotheses:
We already mentioned in the beginning the two hypotheses. But now we will formalize them:
Company C’s CO2 mean (denoted by μ ) is equal to the population mean (denoted by μ0): μ = μ0
Company C’s CO2 mean is greater than the population mean: μ > μ0
The hypothesis testing for one-sample mean we are conducting requires the data to come from an approximately normal distribution or a large enough sample size, which can be quite subjective. To keep things simple, we decide that the data gathered from company C is big enough with the sample size being 100 cars.
2) Determine the significance level, alpha, or confidence level
The significance level and its complementary, the confidence level, provide a level of probability cutoff for our test to make decisions about the hypotheses. A common value for alpha is 5%, which is the same as a confidence level of 95%.
3) Calculate the test statistic
For one-sample mean test, we calculate the t* test statistic using the formula:
where s is the standard deviation from the sample we are testing, 1.8, and n is the size of the sample, 100.
Can you expand on ? I guess you would find that is quite easy to do. You can easily find that .
How about the expansion of . It is no longer easy.
It is no longer easy isn’t it. However, if we use Binomial Theorem, this expansion becomes an easy problem.
Binomial Theorem is a very intriguing topic in mathematics and it has wide range of applications.
Let , be real numbers (or complex, or polynomial). For any positive integer , we have:
We will use prove by induction. The base case is obvious. Now suppose that the theorem is true for the case , that is assume that:
we will need to show that, this is true for
Let us consider the left hand side of equation above
We can now apply Pascal’s identity:
The equation above can be simplified to:
as we desired.
Example 1: Power rule in Calculus
In calculus, we always use the power rule that
We can prove this rule using Binomial Theorem.
Recall that derivative for any continuous function f(x) is defined as:
Let be a positive integer and let
The derivative of f(x) is:
Example 2: Binomial Distribution
Let X be the number of Head a sequence of n independent coin tossing. X is usually model by binomial distribution in probability model. Let be the probability that a head show up in a toss, and let . The probability that there is head in the sequence of toss is:
On a nice day 2 years ago, when I was on financial field. My boss sent our team an email. In this email, he would like to us propose some machine learning techniques to predict stock price.
So, after accepting the assignment from my manager, our team begin to research and apply some approaches for prediction. When we talk about Machine Learning, we often think of supervised and unsupervised learning. But one of the algorithms we applied is one that we usually forgotten however equally highly effective algorithm: Monte Carlo Simulation.
What is Monte Carlo simulation
Monte Carlo method is a technique that uses random numbers and probability to solve complex problems. The Monte Carlo simulation, or probability simulation, is a technique used to understand the impact of risk and uncertainty in financial sectors, project management, costs, and other forecasting machine learning models.
Now let’s jump into python implementation to see how it applies,
In this task we used data of DXG stock dataset from 2017/01/01 to 2018/08/24 and we would like to know what is stock price after 10 days, 1 months and 3 months, respectively
We will simulate the return of stock and next price will be calculated by
P(t) = P(0) * (1+return_simulate(t))
Calculate mean and standard deviation of stock returns
miu = np.mean(stock_returns, axis=0) dev = np.std(stock_returns)
simulation_df = pd.DataFrame() last_price = init_price for x in range(mc_rep): count = 0 daily_vol = dev price_series =  price = last_price * (1 + np.random.normal(miu, daily_vol)) price_series.append(price) for y in range(train_days): if count == train_days-1: break price = price_series[count] * (1 + np.random.normal(miu, daily_vol)) price_series.append(price) count += 1 simulation_df[x] = price_series
Visualization Monte Carlo Simulation
fig = plt.figure() fig.suptitle('Monte Carlo Simulation') plt.plot(simulation_df) plt.axhline(y = last_price, color = 'r', linestyle = '-') plt.xlabel('Day') plt.ylabel('Price') plt.show()
Now, let’s check with actual stock price after 10 days, 1 month and 3 months
plt.hist(simulation_df.iloc[9,:],bins=15,label ='histogram') plt.axvline(x = test_simulate.iloc, color = 'r', linestyle = '-',label ='Price at 10th') plt.legend() plt.title('Histogram simulation and last price of 10th day') plt.show()
We can see the most frequent occurrence price is pretty close with the actual price after 10th
If the forecast period is longer, the results is not good gradually
Simulation for next 1 month
After 3 months
Monte Carlo simulation is used a lot in finance, although it has some weaknesses, but hopefully through this article, you will have a new look on the simulation application for forecasting.
 Pratik Shukla, Roberto Iriondo, “Monte Carlo Simulation An In-depth Tutorial with Python”, medium, https://medium.com/towards-artificial-intelligence/monte-carlo-simulation-an-in-depth-tutorial-with-python-bcf6eb7856c8
In this post, I will explain how to calculate a Bayesian estimator. The taken example is very simple: estimate the parameter θ of a Bernoulli distribution.
A random variable X which has the Bernoulli distribution is defined as
In this case, we can write
In reality, the simplest way to eatimate θ is to sample X, count how many time the event occurs, then estimate the probability of occuring of event. This is exactly what the frequestists do.
In this post, I will show how do the Bayesian statisticians estimate θ. Although this doesn’t have a meaningful application, but it helps to understand how do the Bayesian statistics work. Let’s start.
The posterior distribution of θ
Denote Y as the observation of the event. Given the parameter θ, if we sample the event n time, then the probability that the event occurs k time is (this is called the probability density function of Bernoulli )
In Bayesian statistics, we would like to calculate
By using the Bayesian formula, we have
With the prior distribution of theta as an Uniform distribution, p(θ) = 1, and it is easy to prove that
where Γ is the Gamma distribution. Hence, the posterior distribution is
Fortunately, this is the density function of the Beta distribution:
We use the following properties for evaluating the posterior mean and variance of theta.
If , then
In summary, the Bayesian estimator of theta is the Beta distribution with the mean and variance as above. Here is the Python codes for simulating data and estimating theta
def bayes_estimator_bernoulli(data, a_prior=1, b_prior=1, alpha=0.05): '''Input: data is a numpy array with binary value, which has the distribution B(1,theta) a_prior, b_prior: parameters of prior distribution Beta(a_prior, b_prior) alpha: significant level of the posterior confidence interval for parameter Model: for estimating the parameter theta of a Bernoulli distribution the prior distribution for theta is Beta(1,1)=Uniform[0,1] Output: a,b: two parameters of the posterior distribution Beta(a,b) pos_mean: posterior estimation for the mean of theta pos_var: posterior estimation for the var of theta''' n = len(data) k = sum(data) a = k+1 b = n-k+1 pos_mean = 1.*a/(a+b) pos_var = 1.*(a*b)/((a+b+1)*(a+b)**2) ## Posterior Confidence Interval theta_inf, theta_sup = beta.interval(1-alpha,a,b) print('Prior distribution: Beta(%3d, %3d)' %(a_prior,b_prior)) print('Number of trials: %d, number of successes: %d' %(n,k)) print('Posterior distribution: Beta(%3d,%3d)' %(a,b)) print('Posterior mean: %5.4f' %pos_mean) print('Posterior variance: %5.8f' %pos_var) print('Posterior std: %5.8f' %(np.sqrt(pos_var))) print('Posterior Confidence Interval (%2.2f): [%5.4f, %5.4f]' %(1-alpha, theta_inf, theta_sup)) return a, b, pos_mean, pos_var # Example n = 129 # sample size data = np.random.binomial(size=n, n=1, p=0.6) a, b, pos_mean, pos_var = bayes_estimator_bernoulli(data)
And the result is
Prior distribution: Beta( 1, 1) Number of trials: 129, number of successes: 76 Posterior distribution: Beta( 77, 54) Posterior mean: 0.5878 Posterior variance: 0.00183556 Posterior std: 0.04284341 Posterior Confidence Interval (0.95): [0.5027, 0.6703]
In part 1 of my project, I built a unigram language model: it estimates the probability of each word in a text simply based on the fraction of times the word appears in that text.
The text used to train the unigram model is the book “A Game of Thrones” by George R. R. Martin (called train). The texts on which the model is evaluated are “A Clash of Kings” by the same author (called dev1), and “Gone with the Wind” — a book from a completely different author, genre, and time (called dev2).
In this part of the project, I will build higher n-gram models, from bigram (n=2) all the way to 5-gram (n=5). These models are different from the unigram model in part 1, as the context of earlier words is taken into account when estimating the probability of a word.
Higher n-gram language models
Training the model
For a given n-gram model:
- The probability of each word depends on the n-1 words before it. For a trigram model (n = 3), for example, each word’s probability depends on the 2 words immediately before it.
- This probability is estimated as the fraction of times this n-gram appears among all the previous (n-1)-grams in the training set. In other words, training the n-gram model is nothing but calculating these conditional probabilities from the training text.
The example below shows the how to calculate the probability of a word in a trigram model:
Dealing with words near the start of sentence
In higher n-gram language models, the words near the start of each sentence will not have a long enough context to apply the formula above. To make the formula consistent for those cases, we will pad these n-grams with sentence-starting symbols [S]. Below are two such examples under the trigram model:
From the above formulas, we see that the n-grams containing the starting symbols are just like any other n-gram. The only difference is that we count them only when they are at the start of a sentence. Lastly, the count of n-grams containing only [S] symbols is naturally the number of sentences in our training text:
Dealing with unknown n-grams
Similar to the unigram model, the higher n-gram models will encounter n-grams in the evaluation text that never appeared in the training text. This can be solved by adding pseudo-counts to the n-grams in the numerator and/or denominator of the probability formula a.k.a. Laplace smoothing. However, as outlined part 1 of the project, Laplace smoothing is nothing but interpolating the n-gram model with a uniform model, the latter model assigns all n-grams the same probability:
Hence, for simplicity, for an n-gram that appears in the evaluation text but not the training text, we just assign zero probability to that n-gram. Later, we will smooth it with the uniform probability.
Language modeling — that is, predicting the probability of a word in a sentence — is a fundamental task in natural language processing. It is used in many NLP applications such as autocomplete, spelling correction, or text generation.
Currently, language models based on neural networks, especially transformers, are the state of the art: they predict very accurately a word in a sentence based on surrounding words. However, in this project, I will revisit the most classic of language model: the n-gram models.
- In part 1 of the project, I will introduce the unigram model i.e. model based on single words. Then, I will demonstrate the challenge of predicting the probability of “unknown” words — words the model hasn’t been trained on — and the Laplace smoothing method to alleviate this problem. I will then show that this smoothing method can be viewed as model interpolation, which is the combination of different models together.
- In part 2 of the project, I will introduce higher n-gram models and show the effect of model interpolation of these n-gram models.
- In part 3 of the project, aside from interpolating the n-gram models with equal proportions, I will use the expectation-maximization algorithm to find the best weights to combine these models.
In this project, my training data set — appropriately called train — is “A Game of Thrones”, the first book in the George R. R. Martin fantasy series that inspired the popular TV show of the same name.
Then, I will use two evaluating texts for our language model:
- The first text, called dev1, is the book “A Clash of Kings”, the second book in the same series as the training text. The word dev stands for development, as we will later use these texts to fine-tune our language model.
- The second text, called dev2, is the book “Gone with the Wind”, a 1939 historical fiction book set in the American South, written by Margaret Mitchell. This book is selected to show how our language model will fare when evaluated on a text that is very different from its training text.
Unigram language model
What is a unigram?
In natural language processing, an n-gram is a sequence of n words. For example, “statistics” is a unigram (n = 1), “machine learning” is a bigram (n = 2), “natural language processing” is a trigram (n = 3), and so on. For longer n-grams, people just use their lengths to identify them, such as 4-gram, 5-gram, and so on. In this part of the project, we will focus only on language models based on unigrams i.e. single words.
Training the model
A language model estimates the probability of a word in a sentence, typically based on the the words that have come before it. For example, for the sentence “I have a dream”, our goal is to estimate the probability of each word in the sentence based on the previous words in the same sentence:
The unigram language model makes the following assumptions:
The goal of this project is to generate Gaussian samples in 2-D from uniform samples, the latter of which can be readily generated using built-in random number generators in most computer languages.
In part 1 of the project, the inverse transform sampling was used to convert each uniform sample into respective x and y coordinates of our Gaussian samples, which are themselves independent standard normal (having mean of 0 and standard deviation of 1):
However, this method uses the inverse cumulative distribution function (CDF) of the Gaussian distribution, which is not well-defined. Therefore, we approximated this function using the simple Taylor series. However, this only samples accurately near the Gaussian mean, and under-samples more extreme values at both ends of the distribution.
In part 2 of the project, we used the Box-Muller transform, a more direct method to transform the uniform samples into Gaussian ones. The implementation of the algorithm is quite simple, as seen below, but its derivation requires some clever change of variables: instead of sampling Gaussian x and y coordinates for each point, we will sample a uniformly-distributed angle (from 0 to 2π) and an exponentially-distributed random variable that represents half of squared distance of the sample to the origin.
In this part of the project, I will present an even simpler method than the above two methods. Even better, this method is one that every statistics students are already familiar with.
The Central Limit Theorem
It turns out, we can rely on the most fundamental principle of statistics to help us generate Gaussian samples: the central limit theorem. In very simple terms, the central limit theorem states that:
Given n independent random samples from the same distribution, their sum will converge to a Gaussian distribution as n gets large.
Therefore, to generate a Gaussian sample, we can just generate many independent uniform samples and add them together! We then repeat this routine to generate the next standard Gaussian sample until we get enough samples for our x-coordinates. Finally, we just repeat the same steps to generate the y coordinates.
Note that this method will work even if the samples that we start with are not uniform — they are Poisson-distributed, for example. This is because the central limit theorem holds for virtually all probability distributions *cough let’s not talk about the Cauchy distribution cough*.
Generate Gaussian samples by central limit theorem
To generate, say, 1000 Gaussian sums (
n_sums = 1000) where each is the sum of 100 uniform samples (
n_additions = 100):
- First, we generate a 100 x 1000 matrix —
uniform_matrix— in which every element is a uniformly-distributed sample between 0 and 1.
- Then, calculating the 1000 sums is nothing but summing the uniform samples down each column of this matrix, hence the
axis=0argument. The result is a NumPy array
gaussians, which contains the 1000 Gaussian samples.
n_additions = 100 n_points = 1000# 0. Initialize random number generator rng = np.random.RandomState(seed=24) # 1. Generate matrix of uniform samples uniform_matrix = rng.uniform(size=(n_additions, n_points))# 2. Sum uniform elements down each column to get all Gaussian sums gaussians = uniform_matrix.sum(axis=0)
We can apply the above method to generate Gaussian samples for each coordinate, using different random number generator for x and for y to ensure that the coordinates are independent from each other. Visualizing the intermediate sums after each addition, we see that:
In part 1 of this project, I’ve shown how to generate Gaussian samples using the common technique of inversion sampling:
- First, we sample from the uniform distribution between 0 and 1 — green points in the below animation. These uniform samples represent the cumulative probabilities of a Gaussian distribution i.e. the area under the distribution to the left of some point.
- Next, we apply the inverse Gaussian cumulative distribution function (CDF) to these uniform samples. This will transform them to our desired Gaussian samples — blue points in the below animation.
The bad news: the Gaussian inverse CDF is not well-defined, so we have to approximate that function, and the simple Taylor series was used. However, this only samples accurately near the Gaussian mean, and under-samples more extreme values at both ends of the distribution.
Therefore, in this part of the project, we will investigate a more “direct” sampling method that does not depend on the approximation of the Gaussian inverse CDF. This method is called the Box-Muller transform, after the two mathematicians who invented the method in 1958: the British George E. P. Box and the American Mervin E. Muller.
How does the Box-Muller transform work?
For this project, my goal is to generate Gaussian samples in two dimensions i.e. generating samples whose x and y coordinates are independent standard normals (Gaussian with zero mean and standard deviation of 1). In part 1, I used the inverse Gaussian CDF to generate separately the x and y coordinates from their respective uniform samples (U₁ and U₂):
For the Box-Muller transform, I will also start with the same two uniform samples. However, I will transform these uniform samples into the x and y coordinates using much simpler formulas:
Despite the strong coupling between U₁ and U₂ in each of the two formulas above, the generated x and y coordinates, which are both standard Gaussians, are still surprisingly independent from each other! In the derivation of the Box-Muller transform that follows, I will demonstrate why this is indeed this case.
Derivation of Box-Muller transform
We know that for any two independent random variables x and y, the joint probability density f(x, y) is simply the product of the individual density functions: f(x) and f(y). Furthermore, Pythagorus theorem allows us to combine the x² and y² term in each of the Gaussian density function. This results in the -r²/2 term in the exponential of the joint distribution, where r is the distance from the origin to the 2-D Gaussian sample.
To make it simpler, we then define a variable s that is equal to r²/2. In other words, s is simply the half of squared distance from the origin of our Gaussian sample. Written this way, the joint PDF is simply the product of a constant (1/2π) and an exponential (e⁻ˢ).