The post [AI Lab] Hiring Data Scientist first appeared on MTI Technology AI Lab - Data Science in Vietnam.
Please contact the email address below.
recruitment@mti-tech.vn
MTI Technology specializes in creating smart mobile content and services that transform and transcend customers' lives. We design and develop our products using agile methods, bringing the best results to the table in the shortest amount of time. MTI stands for an attitude: seeking a balance of excellence, pragmatism and convenience for customers. From an original team of 20 people, we have grown to more than 100 bright talents and continue to grow. Looking for a place to grow your talents and be awesome? This is the place!
We are looking for Data Scientists who would like to participate in projects that apply our existing data to AI and combine it with other data to create new value.
Currently, we are looking for candidates with experience in Natural Language Processing (NLP), but any other field of AI will be considered too.
Development is currently mainly in Python. An understanding of object-oriented programming in Java etc. is a plus, as is parallel-processing experience in a server-side language (Golang, etc.).
In addition, engineers who can use functional languages (Haskell, Erlang/Elixir) are treasured talents. Such people tend to be interested in various programming languages, have mathematical curiosity, and often study on their own. Although we do not have many opportunities to use these languages in actual development, we welcome such engineers as well.
Recently, in web-service development, engineers who have used APIs and libraries of prominent AI providers (open source, Google, IBM, etc.) from the standpoint of a "user" are often classified as "AI Engineers". What the MTI Group seeks is not such a technician, but an experienced person who has studied data science itself deeply. On the other hand, we will still consider candidates who have studied data science but do not yet have much practical experience.
At MTI Technology, our goal is to empower every individual to learn, discover, and communicate openly and honestly, so we can create the best services through effective teamwork.
The post Bias in Data Science – the Good, the Bad and the Avoidable !? first appeared on MTI Technology AI Lab - Data Science in Vietnam.
Interestingly, bias itself need not be harmful and is often built into a model's design on purpose, either to address only a subset of the overall population or to model a real-world state. For instance, when predicting house prices from size and number of bedrooms, the model's bias parameter often represents the average house price in the data set. Thus, we need to distinguish between conscious and unconscious bias in data analysis. Additionally, there is the factor of intent, i.e. whether the person conducting the analysis is well-intentioned and follows good scientific method, or is trying to manipulate it to achieve a particular outcome.
In the following, I will only discuss aspects of unintentional and unconscious bias, meaning bias hidden from the researcher or data scientist introducing it. This is by no means an exhaustive discussion, but merely a highlight of some pervasive aspects:
A. Data availability bias
B. Coherent theory bias
C. Method availability/popularity bias
A. The problem of scientists selecting their data out of convenience rather than suitability or representativeness for the task at hand has been around for a while [5]: the ideal data set may not be available in machine-readable format, or may require higher costs and more time for processing; in short, there are several obstacles to doing an analysis quickly. For instance, in the area of Natural Language Processing, the major European languages like English, French and German tend to receive more attention because both data and tools to analyze them are widely available. Similarly, psychology research has mostly focused on so-called WEIRD societies (Western, Educated, Industrialized, Rich, Democratic) [6] and, out of convenience, often targets the even smaller population of "North American college students", who unsurprisingly have been found not to represent human populations at large.
B. Various studies suggest that we as people strongly favour truths that fit our pre-existing world view, and why would scientists be exempt from this? Thus, it appears that when people analyze data they are often biased by their underlying beliefs about the outcome, and are then less likely to accept unexpected non-significant results [7]. This does not even include scientists disregarding new evidence because of conflicting interests [8]. This phenomenon is commonly referred to as confirmation bias or, more fittingly, "myside" bias.
C. There is a tendency to hail new, trendy algorithms as one-size-fits-all solutions for whatever task or application: the solution is presented before the problem and its actual requirements are examined. While more complex models are often more powerful, this comes at the cost of interpretability, which in some cases is not an acceptable trade-off. Additionally, some methods, both simple and complex, enjoy popularity primarily because they come ready-to-use in one's favourite programming language.
Going forward…
We as data scientists should:
a. Select our data carefully with our objective in mind. Get to know our data and its limitations.
b. Be honest with ourselves about possible emotional investment in our analyses’ outcomes and resulting conflicts.
c. Examine the problem and its (theoretical) solutions BEFORE making any model design choices.
References:
[1] https://www.theguardian.com/technology/2017/apr/25/faceapp-apologises-for-racist-filter-which-lightens-users-skintone (last accessed 21.10.2020)
[2] https://www.bbc.com/news/technology-35902104 (last accessed 21.10.2020)
[3] https://spectrum.ieee.org/tech-talk/artificial-intelligence/machine-learning/in-2016-microsofts-racist-chatbot-revealed-the-dangers-of-online-conversation (last accessed 21.10.2020)
[4] https://www.theguardian.com/technology/2020/aug/09/instagrams-censorship-of-black-models-photo-shoot-reignites-claims-of-race-bias-nyome-nicholas-williams (last accessed 21.10.2020)
[5] Joseph Rudman (2003). Cherry Picking in Nontraditional Authorship Attribution Studies. CHANCE, 16:2, 26-32. DOI: 10.1080/09332480.2003.10554845
[6] Henrich, Joseph; Heine, Steven J., and Norenzayan, Ara. The Weirdest People in the World? Behavioral and Brain Sciences, 33(2-3):61–83, 2010. doi: 10.1017/S0140525X0999152X.
[7] Hewitt CE, Mitchell N, Torgerson DJ. Heed the data when results are not significant. BMJ. 2008;336(7634):23-25. doi:10.1136/bmj.39379.359560.AD
[8] Ioannidis JP. Why most published research findings are false. PLoS Med. 2005;2(8):e124. doi:10.1371/journal.pmed.0020124
The post Efficient Algorithms: An overview first appeared on MTI Technology AI Lab - Data Science in Vietnam.
What makes computers useful to us is primarily their ability to solve problems. The procedure by which a computer solves a problem is an algorithm. With the growing number of algorithms available for solving data-related problems, there is an increasing demand for a deeper understanding of algorithm performance, so that data scientists can choose the right algorithm for their problems.
Having a general perception of an algorithm's efficiency helps shape the thought process for creating or choosing better algorithms. With this intention in mind, I would like to create a series of posts discussing what makes an algorithm good in practice, or in short, an efficient algorithm. This article is the first step of that journey.
An algorithm is considered efficient if its resource consumption, also known as computational cost, is at or below some acceptable level. Roughly speaking, ‘acceptable’ means: it will run in a reasonable amount of time or space on an available computer, typically as a function of the size of the input.
There are many ways in which the resources used by an algorithm can be measured: the two most common measures are speed and memory usage. In the next two sections, we will look at two different perspectives on measuring the efficiency of an algorithm, that of theoreticians and that of practitioners.
In practice, there are other factors which can affect the efficiency of an algorithm, such as requirements for accuracy and/or reliability. As detailed below, the way in which an algorithm is implemented can also have a significant effect on actual efficiency, though many aspects of this relate to optimization issues.
Implementation issues can also affect efficiency, such as the choice of programming language, the way in which the algorithm is actually coded, the choice of compiler for a particular language, the compilation options used, or even the operating system. In many cases a language implemented by an interpreter may be much slower than a language implemented by a compiler.
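To make this concrete, here is a small sketch (timings vary by machine and interpreter, so treat the numbers as illustrative) using Python's timeit module to measure the speed side of efficiency. It compares the same membership test backed by a list (linear scan) and by a set (hash lookup):

```python
import timeit

# Same problem, two implementations: membership test in a list (O(n) scan)
# versus in a set (average O(1) hash lookup).
n = 100_000
data_list = list(range(n))
data_set = set(data_list)

t_list = timeit.timeit(lambda: (n - 1) in data_list, number=200)
t_set = timeit.timeit(lambda: (n - 1) in data_set, number=200)

print(f"list lookup: {t_list:.4f}s, set lookup: {t_set:.4f}s")
```

On any machine the set lookup wins by orders of magnitude, which is exactly the kind of gap that asymptotic analysis predicts and wall-clock measurement confirms.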
The post Binomial Theorem first appeared on MTI Technology AI Lab - Data Science in Vietnam.
How about the expansion of a higher power, say $(x+y)^{10}$? It is no longer easy, is it? However, if we use the Binomial Theorem, this expansion becomes an easy problem.
The Binomial Theorem is a very intriguing topic in mathematics and it has a wide range of applications.
Let $x$, $y$ be real numbers (or complex numbers, or polynomials). For any positive integer $n$, we have:
$$(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k$$
where
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
We will prove it by induction. The base case $n=1$ is obvious. Now suppose that the theorem is true for the case $n=m$, that is, assume that:
$$(x+y)^m = \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^k$$
We will need to show that it is then true for $n=m+1$:
$$(x+y)^{m+1} = \sum_{k=0}^{m+1} \binom{m+1}{k} x^{m+1-k} y^k$$
Let us consider the left-hand side of the equation above:
$$(x+y)^{m+1} = (x+y)(x+y)^m = \sum_{k=0}^{m} \binom{m}{k} x^{m+1-k} y^k + \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^{k+1}$$
Shifting the index in the second sum and applying Pascal's rule, $\binom{m}{k} + \binom{m}{k-1} = \binom{m+1}{k}$, the equation above can be simplified to:
$$(x+y)^{m+1} = \sum_{k=0}^{m+1} \binom{m+1}{k} x^{m+1-k} y^k$$
as we desired.
In calculus, we always use the power rule:
$$\frac{d}{dx} x^n = n x^{n-1}$$
We can prove this rule using the Binomial Theorem.
Proof:
Recall that the derivative of a function $f(x)$ is defined as:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
Let $n$ be a positive integer and let $f(x) = x^n$.
The derivative of $f(x)$ is:
$$f'(x) = \lim_{h \to 0} \frac{(x+h)^n - x^n}{h} = \lim_{h \to 0} \frac{1}{h}\left(\sum_{k=0}^{n} \binom{n}{k} x^{n-k} h^k - x^n\right) = \lim_{h \to 0}\left(n x^{n-1} + \sum_{k=2}^{n} \binom{n}{k} x^{n-k} h^{k-1}\right) = n x^{n-1}$$
since every term in the remaining sum contains a positive power of $h$ and vanishes in the limit.
Let $X$ be the number of heads in a sequence of $n$ independent coin tosses. In probability models, $X$ is usually modeled by the binomial distribution. Let $p$ be the probability that a head shows up in a toss, and let $q = 1 - p$. The probability that there are $k$ heads in the sequence of $n$ tosses is:
$$P(X = k) = \binom{n}{k} p^k q^{n-k}$$
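As a quick numerical sanity check (a side sketch, not from the original post), Python's math.comb lets us verify both the expansion and the fact that the binomial probabilities sum to one, which is just the theorem applied to $(p+q)^n$ with $p+q=1$:

```python
from math import comb

# Verify the Binomial Theorem numerically: (x + y)**n should equal
# the sum of C(n, k) * x**(n-k) * y**k over k = 0..n.
x, y, n = 3.0, 2.0, 10
expansion = sum(comb(n, k) * x**(n - k) * y**k for k in range(n + 1))
assert abs(expansion - (x + y)**n) < 1e-6

# Binomial probabilities P(X = k) = C(n, k) p^k q^(n-k) sum to 1.
p = 0.4
q = 1 - p
total = sum(comb(n, k) * p**k * q**(n - k) for k in range(n + 1))
assert abs(total - 1.0) < 1e-12
print("checks passed")
```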
Please also check the other articles Gaussian Samples and N-gram language models, Bayesian model, and Monte Carlo for more statistics knowledge.
The post Monte Carlo Simulation first appeared on MTI Technology AI Lab - Data Science in Vietnam.
So, after accepting the assignment from my manager, our team began to research and apply several approaches for prediction. When we talk about Machine Learning, we often think of supervised and unsupervised learning. But one of the algorithms we applied is one that is often forgotten yet equally effective: Monte Carlo Simulation.
The Monte Carlo method is a technique that uses random numbers and probability to solve complex problems. Monte Carlo simulation, or probability simulation, is used to understand the impact of risk and uncertainty in financial sectors, project management, costs, and other forecasting models [1].
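As a classic warm-up illustration of the method (a side sketch, not part of the stock-forecasting task), Monte Carlo can estimate π by sampling random points in the unit square and counting how many land inside the quarter circle:

```python
import numpy as np

# The fraction of uniform points in the unit square that fall inside
# the quarter circle of radius 1 approaches pi/4 as samples grow.
rng = np.random.RandomState(seed=0)
points = rng.uniform(size=(1_000_000, 2))
inside = (points ** 2).sum(axis=1) <= 1.0
pi_estimate = 4 * inside.mean()
print(pi_estimate)  # close to 3.1416
```

The same logic, replacing randomness about geometry with randomness about daily returns, underlies the stock-price simulation below.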
Now let's jump into the Python implementation to see how it applies.
In this task we used the DXG stock dataset from 2017/01/01 to 2018/08/24, and we would like to know the stock price after 10 days, 1 month and 3 months, respectively.
We will simulate the stock's returns, and the next price will be calculated by:
P(t) = P(0) * (1+return_simulate(t))
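The post does not show how stock_returns is computed from the price series; a minimal sketch, with hypothetical values standing in for the DXG closing prices, is:

```python
import numpy as np

# Hypothetical closing prices; in the post these come from the
# DXG dataset (2017/01/01 - 2018/08/24).
prices = np.array([20.0, 20.4, 19.9, 20.7, 21.1])

# Simple daily returns: r_t = P_t / P_{t-1} - 1
stock_returns = prices[1:] / prices[:-1] - 1
print(stock_returns)
```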
Calculate the mean and standard deviation of the stock returns:
miu = np.mean(stock_returns, axis=0)
dev = np.std(stock_returns)
Simulation process:
simulation_df = pd.DataFrame()
last_price = init_price
for x in range(mc_rep):
    count = 0
    daily_vol = dev
    price_series = []
    price = last_price * (1 + np.random.normal(miu, daily_vol))
    price_series.append(price)
    for y in range(train_days):
        if count == train_days - 1:
            break
        price = price_series[count] * (1 + np.random.normal(miu, daily_vol))
        price_series.append(price)
        count += 1
    simulation_df[x] = price_series
Visualizing the Monte Carlo simulation:
fig = plt.figure()
fig.suptitle('Monte Carlo Simulation')
plt.plot(simulation_df)
plt.axhline(y = last_price, color = 'r', linestyle = '-')
plt.xlabel('Day')
plt.ylabel('Price')
plt.show()
Now, let's check against the actual stock price after 10 days, 1 month and 3 months:
plt.hist(simulation_df.iloc[9,:], bins=15, label='histogram')
plt.axvline(x = test_simulate.iloc[10], color = 'r', linestyle = '-', label='Price at 10th')
plt.legend()
plt.title('Histogram simulation and last price of 10th day')
plt.show()
We can see that the most frequently occurring simulated price is quite close to the actual price on the 10th day.
As the forecast horizon grows longer, however, the results gradually become less accurate.
Simulation for next 1 month
After 3 months
Monte Carlo simulation is used a lot in finance. Although it has some weaknesses, hopefully through this article you will have a new perspective on using simulation for forecasting.
[1] Pratik Shukla, Roberto Iriondo, “Monte Carlo Simulation An In-depth Tutorial with Python”, medium, https://medium.com/towards-artificial-intelligence/monte-carlo-simulation-an-in-depth-tutorial-with-python-bcf6eb7856c8
Please also check Gaussian Samples and N-gram language models, and Bayesian Statistics for more statistics knowledge.
The post Hiring - Data Scientist (Algorithm Theory) first appeared on MTI Technology AI Lab - Data Science in Vietnam.
Job Title: Data Scientist (Algorithm Theory)
Location: Ho Chi Minh
Contact: recruitment@mti-tech.vn
Employment: Fulltime
Level: Middle/Senior
Report to: Line Manager
The post Bayesian estimator of the Bernoulli parameter first appeared on MTI Technology AI Lab - Data Science in Vietnam.
A random variable $X$ which has the Bernoulli distribution is defined as
$$P(X = 1) = \theta, \qquad P(X = 0) = 1 - \theta$$
with $0 \le \theta \le 1$.
In this case, we can write $X \sim \text{Bernoulli}(\theta)$.
In practice, the simplest way to estimate $\theta$ is to sample $X$ repeatedly, count how many times the event occurs, and use that fraction as the estimate of the probability of the event. This is exactly what the frequentists do.
In this post, I will show how Bayesian statisticians estimate $\theta$. Although this does not have a meaningful application by itself, it helps to understand how Bayesian statistics works. Let's start.
Denote $Y$ as the number of times the event is observed. Given the parameter $\theta$, if we sample the event $n$ times, then the probability that the event occurs $k$ times is (this is the binomial probability mass function):
$$P(Y = k \mid \theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k}$$
In Bayesian statistics, we would like to calculate $p(\theta \mid Y = k)$.
By using Bayes' formula, we have:
$$p(\theta \mid Y = k) = \frac{P(Y = k \mid \theta)\, p(\theta)}{\int_0^1 P(Y = k \mid t)\, p(t)\, dt}$$
With the prior distribution of $\theta$ as a uniform distribution, $p(\theta) = 1$, and it is easy to prove that
$$\int_0^1 \theta^k (1-\theta)^{n-k}\, d\theta = \frac{\Gamma(k+1)\,\Gamma(n-k+1)}{\Gamma(n+2)}$$
where $\Gamma$ is the Gamma function. Hence, the posterior distribution is
$$p(\theta \mid Y = k) = \frac{\Gamma(n+2)}{\Gamma(k+1)\,\Gamma(n-k+1)}\, \theta^k (1-\theta)^{n-k}$$
Fortunately, this is the density function of the Beta distribution $\text{Beta}(k+1,\, n-k+1)$.
We use the following properties for evaluating the posterior mean and variance of $\theta$:
If $\theta \sim \text{Beta}(a, b)$, then
$$E[\theta] = \frac{a}{a+b}, \qquad \text{Var}[\theta] = \frac{ab}{(a+b)^2\,(a+b+1)}$$
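These closed-form expressions are easy to check empirically; the sketch below (not part of the original derivation) compares them against the sample mean and variance of Beta draws:

```python
import numpy as np

# Empirical check of the Beta(a, b) mean and variance formulas.
a, b = 4, 7
rng = np.random.RandomState(seed=24)
samples = rng.beta(a, b, size=1_000_000)

theoretical_mean = a / (a + b)                          # E[theta]
theoretical_var = a * b / ((a + b) ** 2 * (a + b + 1))  # Var[theta]

print(samples.mean(), theoretical_mean)
print(samples.var(), theoretical_var)
```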
In summary, the Bayesian estimator of $\theta$ is the Beta posterior distribution with the mean and variance given above. Here is the Python code for simulating data and estimating $\theta$:
import numpy as np
from scipy.stats import beta

def bayes_estimator_bernoulli(data, a_prior=1, b_prior=1, alpha=0.05):
    '''Input:
        data: numpy array of binary values, with distribution B(1, theta)
        a_prior, b_prior: parameters of the prior distribution Beta(a_prior, b_prior)
        alpha: significance level of the posterior confidence interval for theta
    Model:
        estimates the parameter theta of a Bernoulli distribution;
        the default prior distribution for theta is Beta(1,1) = Uniform[0,1]
    Output:
        a, b: the two parameters of the posterior distribution Beta(a, b)
        pos_mean: posterior estimation for the mean of theta
        pos_var: posterior estimation for the var of theta'''
    n = len(data)
    k = sum(data)
    # Conjugate update: Beta prior + Bernoulli likelihood -> Beta posterior
    a = k + a_prior
    b = n - k + b_prior
    pos_mean = 1.*a/(a+b)
    pos_var = 1.*(a*b)/((a+b+1)*(a+b)**2)
    ## Posterior Confidence Interval
    theta_inf, theta_sup = beta.interval(1-alpha, a, b)
    print('Prior distribution: Beta(%3d, %3d)' %(a_prior, b_prior))
    print('Number of trials: %d, number of successes: %d' %(n, k))
    print('Posterior distribution: Beta(%3d,%3d)' %(a, b))
    print('Posterior mean: %5.4f' %pos_mean)
    print('Posterior variance: %5.8f' %pos_var)
    print('Posterior std: %5.8f' %(np.sqrt(pos_var)))
    print('Posterior Confidence Interval (%2.2f): [%5.4f, %5.4f]' %(1-alpha, theta_inf, theta_sup))
    return a, b, pos_mean, pos_var
# Example
n = 129 # sample size
data = np.random.binomial(size=n, n=1, p=0.6)
a, b, pos_mean, pos_var = bayes_estimator_bernoulli(data)
And the result is the printed posterior summary: the posterior parameters, mean, variance, standard deviation and confidence interval.
The post N-gram language models - Part 2 first appeared on MTI Technology AI Lab - Data Science in Vietnam.
In part 1 of my project, I built a unigram language model: it estimates the probability of each word in a text simply based on the fraction of times the word appears in that text.
The text used to train the unigram model is the book “A Game of Thrones” by George R. R. Martin (called train). The texts on which the model is evaluated are “A Clash of Kings” by the same author (called dev1), and “Gone with the Wind” — a book from a completely different author, genre, and time (called dev2).
In this part of the project, I will build higher n-gram models, from bigram (n=2) all the way to 5-gram (n=5). These models are different from the unigram model in part 1, as the context of earlier words is taken into account when estimating the probability of a word.
For a given n-gram model, the conditional probability of a word given the previous $n-1$ words is estimated from counts in the training text:
$$P(w_n \mid w_1 \ldots w_{n-1}) = \frac{\text{count}(w_1 \ldots w_n)}{\text{count}(w_1 \ldots w_{n-1})}$$
For example, under a trigram model, the probability of a word given the two words before it is the count of that trigram divided by the count of the two-word context.
In higher n-gram language models, the words near the start of each sentence will not have a long enough context to apply the formula above. To make the formula consistent for those cases, we will pad these n-grams with sentence-starting symbols [S]. Below are two such examples under the trigram model:
From the above formulas, we see that the n-grams containing the starting symbols are just like any other n-gram. The only difference is that we count them only when they are at the start of a sentence. Lastly, the count of an n-gram containing only [S] symbols is simply the number of sentences in our training text.
Similar to the unigram model, the higher n-gram models will encounter n-grams in the evaluation text that never appeared in the training text. This can be solved by adding pseudo-counts to the n-grams in the numerator and/or denominator of the probability formula, a.k.a. Laplace smoothing. However, as outlined in part 1 of the project, Laplace smoothing is nothing but interpolating the n-gram model with a uniform model, where the latter assigns the same probability to all n-grams.
Hence, for simplicity, for an n-gram that appears in the evaluation text but not the training text, we just assign zero probability to that n-gram. Later, we will smooth it with the uniform probability.
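This counting scheme, including the [S] padding and the zero probability for unseen n-grams, can be sketched in a few lines of Python (a toy illustration with a made-up corpus, not the project's actual code):

```python
from collections import Counter

# Toy corpus: one sentence per string.
sentences = ["the king rides", "the king sleeps", "a knight rides"]

bigrams = Counter()
contexts = Counter()
for sentence in sentences:
    words = ["[S]"] + sentence.split()  # pad with a sentence-start symbol
    for prev, word in zip(words, words[1:]):
        bigrams[(prev, word)] += 1
        contexts[prev] += 1

def bigram_prob(prev, word):
    # Unseen n-grams get zero probability; smoothing is applied later.
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

print(bigram_prob("[S]", "the"))    # "the" starts 2 of the 3 sentences
print(bigram_prob("king", "rides"))
```

Note that the count of the [S] context equals the number of sentences, matching the observation above.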
The post N-gram language models - Part 1 first appeared on MTI Technology AI Lab - Data Science in Vietnam.
Language modeling — that is, predicting the probability of a word in a sentence — is a fundamental task in natural language processing. It is used in many NLP applications such as autocomplete, spelling correction, or text generation.
Currently, language models based on neural networks, especially transformers, are the state of the art: they predict very accurately a word in a sentence based on surrounding words. However, in this project, I will revisit the most classic of language models: the n-gram models.
In this project, my training data set — appropriately called train — is “A Game of Thrones”, the first book in the George R. R. Martin fantasy series that inspired the popular TV show of the same name.
Then, I will use two texts to evaluate our language model: "A Clash of Kings" by the same author (called dev1), and "Gone with the Wind", a book from a completely different author, genre, and time (called dev2).
In natural language processing, an n-gram is a sequence of n words. For example, “statistics” is a unigram (n = 1), “machine learning” is a bigram (n = 2), “natural language processing” is a trigram (n = 3), and so on. For longer n-grams, people just use their lengths to identify them, such as 4-gram, 5-gram, and so on. In this part of the project, we will focus only on language models based on unigrams i.e. single words.
Training the model
A language model estimates the probability of a word in a sentence, typically based on the words that have come before it. For example, for the sentence "I have a dream", our goal is to estimate the probability of each word in the sentence based on the previous words in the same sentence:
The unigram language model makes the simplifying assumption that the probability of each word is independent of any words before it; it depends only on the fraction of times the word appears in the training text.
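Under this assumption, training the model reduces to counting words. A minimal sketch (the tiny corpus below is a stand-in for the actual training text):

```python
from collections import Counter

# Stand-in training text; the project uses "A Game of Thrones".
train_text = "i have a dream a dream of words"
words = train_text.split()

counts = Counter(words)
total = len(words)

# Unigram probability: fraction of times the word appears in the text.
unigram_prob = {word: count / total for word, count in counts.items()}

print(unigram_prob["a"])      # appears 2 times out of 8 words
print(unigram_prob["dream"])  # appears 2 times out of 8 words
```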
The post Gaussian samples – Part(3) first appeared on MTI Technology AI Lab - Data Science in Vietnam.
The goal of this project is to generate Gaussian samples in 2-D from uniform samples, the latter of which can be readily generated using built-in random number generators in most computer languages.
In part 1 of the project, the inverse transform sampling was used to convert each uniform sample into respective x and y coordinates of our Gaussian samples, which are themselves independent standard normal (having mean of 0 and standard deviation of 1):
However, this method uses the inverse cumulative distribution function (CDF) of the Gaussian distribution, which has no closed-form expression. Therefore, we approximated this function using a simple Taylor series. However, this samples accurately only near the Gaussian mean, and under-samples more extreme values at both ends of the distribution.
In part 2 of the project, we used the Box-Muller transform, a more direct method to transform the uniform samples into Gaussian ones. The implementation of the algorithm is quite simple, as seen below, but its derivation requires some clever change of variables: instead of sampling Gaussian x and y coordinates for each point, we will sample a uniformly-distributed angle (from 0 to 2π) and an exponentially-distributed random variable that represents half of squared distance of the sample to the origin.
In this part of the project, I will present an even simpler method than the two above. Even better, it is a method that every statistics student is already familiar with.
It turns out, we can rely on the most fundamental principle of statistics to help us generate Gaussian samples: the central limit theorem. In very simple terms, the central limit theorem states that:
Given n independent random samples from the same distribution, their sum will converge to a Gaussian distribution as n gets large.
Therefore, to generate a Gaussian sample, we can just generate many independent uniform samples and add them together! (Strictly speaking, to get a standard Gaussian sample we also standardize the sum, since a sum of n Uniform[0, 1] samples has mean n/2 and variance n/12.) We then repeat this routine to generate the next Gaussian sample until we have enough samples for our x-coordinates. Finally, we just repeat the same steps to generate the y-coordinates.
Note that this method will work even if the samples that we start with are not uniform; they could be Poisson-distributed, for example. This is because the central limit theorem holds for virtually all probability distributions *cough* let's not talk about the Cauchy distribution *cough*.
To generate, say, 1000 Gaussian sums (n_points = 1000), where each is the sum of 100 uniform samples (n_additions = 100):
1. Generate a matrix of uniform samples (uniform_matrix), in which every element is a uniformly-distributed sample between 0 and 1.
2. Sum the uniform elements down each column of the matrix using the axis=0 argument. The result is a NumPy array gaussians, which contains the 1000 Gaussian samples.
import numpy as np

n_additions = 100
n_points = 1000

# 0. Initialize random number generator
rng = np.random.RandomState(seed=24)

# 1. Generate matrix of uniform samples
uniform_matrix = rng.uniform(size=(n_additions, n_points))

# 2. Sum uniform elements down each column to get all Gaussian sums
gaussians = uniform_matrix.sum(axis=0)
We can apply the above method to generate Gaussian samples for each coordinate, using different random number generator for x and for y to ensure that the coordinates are independent from each other. Visualizing the intermediate sums after each addition, we see that:
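Putting the pieces together, here is a hedged end-to-end sketch (the seed values are arbitrary) that generates both coordinates with independent generators and standardizes each sum — a sum of n Uniform[0, 1] samples has mean n/2 and variance n/12 — so that the resulting samples are approximately standard Gaussian:

```python
import numpy as np

n_additions = 100
n_points = 1000

rng_x = np.random.RandomState(seed=24)
rng_y = np.random.RandomState(seed=42)  # independent generator for y

def gaussian_samples(rng, n_additions, n_points):
    # Sum uniform samples down each column, then standardize the sums.
    sums = rng.uniform(size=(n_additions, n_points)).sum(axis=0)
    return (sums - n_additions / 2) / np.sqrt(n_additions / 12)

x = gaussian_samples(rng_x, n_additions, n_points)
y = gaussian_samples(rng_y, n_additions, n_points)

print(x.mean(), x.std())  # both close to 0 and 1, respectively
```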