Japanese Project Leader in AI / Data Science Lab Vietnam

Hiring Japanese Project Leader in AI / Data Science Lab Vietnam

Job Title: Japanese Project Leader
Location: Ho Chi Minh
Contact: recruitment@mti-tech.vn
Employment: Full-time
Level:
Report to: Line Manager

 

If you want to join exciting and challenging projects, MTI Tech AI / Data Science Lab could be the next destination for your career.

MTI Technology Vietnam specializes in creating smart mobile content and services that transform and transcend customers’ lives. We design and develop AI / Data Science solutions for our customers. Starting from an original team of 20 people, we have grown to more than 180 bright talents and continue to grow. Looking for a place to grow your talents and be awesome? This is the place!

What will you do as a Project Leader in the AI / Data Science Lab?

Required skills

  • Japanese speaking: JLPT N1 certification or native Japanese speaker only. Holders of other levels (N2, N3, etc.) do not meet the requirement for this job.
  • English: at least business-level English.
  • Logical thinking.
  • A team-work mindset and the ability to work well with Data Scientists and Engineers on AI / Data Science projects.
  • A willingness to take on new challenges.
  • Experience in software development projects as a leader or BSE.
  • Experience leading a team.
  • Experience in delivery management.
  • Experience communicating with business stakeholders.
  • At least 2 years of experience in the IT field.

Optional skills

  • Experience in coding / software programming.

You will love working here!

  • Competitive salary + bonus + other benefits
  • Yearly salary review
  • 13th-month salary (paid in two installments, in June and December)
  • Performance bonus
  • Monthly allowances for lunch, gasoline, and the internal cafeteria
  • Seniority allowance
  • Technical allowance
  • Japanese language allowance
  • Private medical insurance
  • Employee referral incentive
  • Fun activities: English class, sports clubs, happy hours, team building
  • Chance to work at the Japan site
  • Yearly staff trip and company party

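Vietnam AI Lab – Project Leader (Japanese-speaking talent)

The work mainly consists of leading projects and supporting natural language processing. Being a native Japanese speaker or holding Japanese Language Proficiency Test N1, and being able to use business English without problems, are important hiring requirements.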

About MTI Technology AI Lab

Please check below information about our AI / Data Science Lab.

Hiring Data Scientist / Engineer

We are looking for Data Scientist and Engineer.
Please check our Career Page.

Data Science Project

Please see our experience with Data Science projects.

Vietnam AI / Data Science Lab

Vietnam AI Lab

Please also visit Vietnam AI Lab to learn about our Data Scientists and our AI / Data Science technology (machine learning, etc.).

 

EXPLORING UNIVERSAL SENTENCE ENCODER MODEL (USE)

In NLP, encoding text is at the heart of understanding language. There are many word-embedding implementations, such as GloVe, Word2vec, and fastText. However, these embeddings are only useful at the word level and may not perform well when we want to encode sentences or, more generally, any text longer than one word. In this post, we introduce one of the state-of-the-art (SOTA) models for this task: the Universal Sentence Encoder.

1. What is USE (Universal Sentence Encoder)?

The Universal Sentence Encoder (USE) encodes text into high-dimensional vectors (embedding vectors, or just embeddings). These vectors are meant to capture the semantics of the text. But why do we even need them?

A vector is an array of numbers of a particular dimension. With the vectors in hand, it is much easier for computers to work on textual data. For example, we can tell whether two data points are similar just by calculating the distance between their embedding vectors.
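To make the distance idea concrete, here is a minimal sketch that compares toy vectors with Euclidean distance and cosine similarity. The 4-dimensional vectors are made up for illustration; they are not real USE outputs, which have 512 dimensions.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors (smaller = more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (closer to 1 = more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, invented for this example only.
embeddings = {
    "how old are you": [0.9, 0.1, 0.0, 0.2],
    "what is your age": [0.8, 0.2, 0.1, 0.3],
    "the weather is nice": [0.0, 0.9, 0.7, 0.1],
}

query = embeddings["how old are you"]
for sentence, vector in embeddings.items():
    print(sentence,
          round(euclidean_distance(query, vector), 3),
          round(cosine_similarity(query, vector), 3))
```

With these toy values, the two age-related sentences end up much closer to each other than either is to the weather sentence.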

[Figure: Universal Sentence Encoder model]

(Image source: https://amitness.com/2020/06/universal-sentence-encoder/)

The embedding vectors can in turn be used for downstream NLP tasks such as text classification, semantic similarity, and clustering.

2. USE architecture

USE comes in two variants, whose main difference lies in the embedding part. One uses the encoder part of the well-known Transformer architecture; the other uses a Deep Averaging Network (DAN).

2.1 Transformer encoder

The Transformer architecture is designed to handle sequential data but, unlike RNN-based architectures, it does not need to process the data in order. Because of this, the Transformer allows much more parallelization than RNNs and therefore reduces training time. It uses the attention mechanism to compute context-aware representations of the words in a sentence, taking into account both the ordering and the significance of all the other words. The encoder takes a lowercased PTB-tokenized string as input and outputs a fixed-length encoding vector for each sentence by computing the element-wise sum of the representations at each word position.

Universal Sentence Encoder uses only the encoder branch of the Transformer to take advantage of its strong embedding capacity.

[Figure: Transformer architecture]

(Image source: https://arxiv.org/abs/1706.03762)

2.2 Deep Averaging Network (DAN)

DAN is a simple neural network that averages the embeddings of words and bi-grams and then passes the “combined” vector through a feedforward deep neural network (DNN) to produce sentence embeddings. Like the Transformer encoder, DAN takes a lowercased PTB-tokenized string as input and outputs a 512-dimensional sentence embedding.

[Figure: Deep Averaging Network]
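The averaging-then-feedforward idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the embedding table, the layer weights, the tanh activation, and the 2-to-3 dimensional sizes are all invented here, whereas the real model learns its parameters and outputs 512 dimensions.

```python
import math

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def dense(x, weights, bias):
    """One feedforward layer, tanh(W x + b); the activation is an assumption."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def dan_encode(tokens, embedding_table, layers):
    """Average word and bi-gram embeddings, then apply the feedforward layers."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    vectors = [embedding_table[unit] for unit in tokens + bigrams
               if unit in embedding_table]
    hidden = average(vectors)
    for weights, bias in layers:
        hidden = dense(hidden, weights, bias)
    return hidden

# Hypothetical toy parameters (a trained model would learn these).
embedding_table = {
    "hello": [1.0, 0.0],
    "world": [0.0, 1.0],
    "hello world": [0.5, 0.5],
}
layers = [([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])]

sentence_embedding = dan_encode(["hello", "world"], embedding_table, layers)
print(sentence_embedding)
```

Because the word vectors are averaged before the network sees them, the cost grows only linearly with sentence length, which is consistent with DAN being the cheaper of the two variants.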

(Image source: https://medium.com/tech-that-works/deep-averaging-network-in-universal-sentence-encoder-465655874a04)

The two variants trade off accuracy against computational cost. The variant with the Transformer encoder has higher accuracy but is more computationally intensive. The variant with the DAN encoder is computationally cheaper, with slightly lower accuracy.

3. How was it trained?

The key idea in training this model is to make it work across generic tasks such as:

  • Modified skip-thought
  • Conversational input-response prediction
  • Natural language inference

3.1 Modified skip-thought

Given a sentence, the model needs to predict the sentences around it.

 


3.2 Conversational input-response prediction

In this task, the model needs to pick the correct response for a given input from a list containing the correct response and other randomly sampled responses.
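One way to picture this task is as a ranking problem over candidate responses. The sketch below assumes that the input and each candidate have already been encoded into vectors, and that candidates are scored by a dot product followed by a softmax. The scoring rule and all the vectors are assumptions made for illustration; the text above only states that the model must pick the correct response.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_responses(input_vector, candidate_vectors):
    """Return the index of the best-scoring candidate and all probabilities."""
    scores = [dot(input_vector, c) for c in candidate_vectors]
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best, probabilities

# Hypothetical encoded input and candidates: index 0 plays the role of the
# correct response, the others stand in for randomly sampled responses.
input_vector = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

best, probabilities = rank_responses(input_vector, candidates)
print(best, [round(p, 3) for p in probabilities])
```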

Full Stack Engineer in AI Lab

We are looking for a Full Stack Engineer for the MTI Tech AI Lab in Ho Chi Minh, Vietnam.

Software development, cloud experience (AWS, etc.), English, communication

Job Title: Full Stack Engineer in AI Lab
Location: Ho Chi Minh
Contact: recruitment@mti-tech.vn
Employment: Full-time
Level: Senior
Report to: Line Manager

 

If you want to join exciting and challenging projects, MTI Tech could be the next destination for your career.

MTI Technology specializes in creating smart mobile content and services that transform and transcend customers’ lives. MTI stands for an attitude: seeking a balance of excellence, pragmatism, and convenience for customers. Starting from an original team of 20 people, we have grown to more than 180 bright talents and continue to grow. Looking for a place to grow your talents and be awesome? This is the place!

The Job

What you will do (required skills / experience) in the AI / Data Science Lab

  • Logical thinking
  • Strong interest in new fields
  • A mindset for taking on new technologies
  • Enthusiasm for delivering joyful user experiences
  • At least 5 years of experience in the IT field
  • At least 5 years of industry experience as a web Software Engineer
  • At least 3 years of experience with .NET or Python
  • Strong experience in both backend and frontend development
  • Strong skills and practical knowledge of database performance
  • Strong practical experience with cloud computing (AWS, Azure, or GCP)
  • Ability to write test cases and test code during development
  • Strong technical problem-solving skills
  • Strong English communication skills (especially speaking and listening)
  • Strong experience with CI/CD environments

Nice to have (optional) in the AI / Data Science Lab

  • Leadership skills for managing a team
  • A team-work mindset
  • Smooth communication with team members
  • Experience with AI-related technology
  • Experience with mobile development
  • Experience with big data is a strong plus
  • Interest in mathematics and statistics
  • Experience with Agile / Scrum

Machine learning skills are not required to join MTI Technology as an engineer.

More on MTI – what is it like to work in the MTI AI Lab?

“Delivering business value through Data Science”

The MTI Technology AI Lab has two types of staff: Data Scientists and Engineers.
Engineers support the Data Scientists with cloud and software development.
We help solve our clients’ business problems with AI / Data Science.

You will love working here!

  • Competitive salary + bonus + other benefits
  • Yearly salary review
  • 13th-month salary (paid in two installments, in June and December)
  • Performance bonus
  • Monthly allowances for lunch, gasoline, and the internal cafeteria
  • Seniority allowance
  • Technical allowance
  • Japanese language allowance
  • Private medical insurance
  • Employee referral incentive
  • Fun activities: English class, sports clubs, happy hours, team building
  • Chance to work at the Japan site
  • Yearly staff trip and company party

“Due to the volume of applications, we regret that only shortlisted candidates will be notified.”


Hiring AI Engineer (Senior)

Hiring AI Engineer(Senior) in AI / Data Science Lab

Job Title: AI Engineer
Location: Ho Chi Minh
Contact: recruitment@mti-tech.vn
Employment: Full-time
Level: Senior
Report to: Line Manager

 

If you want to join exciting and challenging projects, MTI Tech AI / Data Science Lab could be the next destination for your career.

MTI Technology specializes in creating smart mobile content and services that transform and transcend customers’ lives. We design and develop AI / Data Science solutions for our customers. MTI stands for an attitude: seeking a balance of excellence, pragmatism, and convenience for customers. Starting from an original team of 20 people, we have grown to more than 180 bright talents and continue to grow. Looking for a place to grow your talents and be awesome? This is the place!

What will you do as an Engineer in the AI / Data Science Lab?

Who we are looking for in the AI / Data Science Lab

  • A fast trial-and-error cycle when writing code
  • Logical thinking
  • Ability to work well with Data Scientists on AI / Data Science projects
  • 5 years of experience with .NET or Python
  • Strong interest in new fields
  • Enthusiasm for delivering joyful user experiences
  • At least 5 years of experience in the IT or data science field
  • At least 5 years of industry experience as a Software Engineer or AI Engineer
  • Strong skills and practical knowledge of database performance
  • Strong cloud computing skills (AWS, Azure, or GCP)
  • Strong English and communication skills with team members and managers
  • Willingness to learn new programming languages
  • A love of mathematics and statistics

Nice to have in the AI / Data Science Lab

  • Leadership skills for managing a team
  • Experience with AWS, Google Cloud Platform, or Microsoft Azure is a strong plus
  • Kaggle experience is a strong plus
  • Experience with big data is a strong plus
  • Fluent Japanese is a plus
  • Experience with AI products

More on MTI AI / Data Science Lab – what is it like to work at MTI?

At MTI Technology, our goal is to empower every individual to learn, discover, and communicate openly and honestly, so that we can create the best services through effective teamwork.

You will love working here!

  • Competitive salary + bonus + other benefits
  • Yearly salary review
  • 13th-month salary (paid in two installments, in June and December)
  • Performance bonus
  • Monthly allowances for lunch, gasoline, and the internal cafeteria
  • Seniority allowance
  • Technical allowance
  • Japanese language allowance
  • Private medical insurance
  • Employee referral incentive
  • Fun activities: English class, sports clubs, happy hours, team building
  • Chance to work at the Japan site
  • Yearly staff trip and company party

“Due to the volume of applications, we regret that only shortlisted candidates will be notified.”


[Hiring]Lead / Senior Data Scientist in Vietnam

We are looking for a Lead / Senior Data Scientist in Vietnam
(with team work and leadership experience).

If you want to join exciting and challenging projects, MTI Tech could be the next destination for your career.

Contact

Please contact us at the email address below.
recruitment@mti-tech.vn

MTI Technology specializes in creating smart mobile content and services that transform and transcend customers’ lives. We design and develop our products using agile methods, delivering the best results in the shortest amount of time. MTI stands for an attitude: seeking a balance of excellence, pragmatism, and convenience for customers. Starting from an original team of 20 people, we have grown to more than 100 bright talents and continue to grow. Looking for a place to grow your talents and be awesome? This is the place!

The Job

We are looking for Data Scientists who would like to take part in projects that apply AI to a variety of existing data and combine it with other data to create new value.

Currently, we are looking for candidates with experience in algorithms or Natural Language Processing (NLP), but any other field of AI will be considered too.

Example of data

  • Medical examination results, medical questionnaire results, or images of them, from health checks or general medical examinations.
  • Athletes’ training results and vital signs.
  • Pregnancy activity data from pregnant women.
  • Everyday-life data such as weather and navigation.
  • Text data such as newspapers.

We have many project styles, such as outsourcing/offshoring and research.

Example of application

  • By combining and analyzing healthcare data (such as medical examination and medical questionnaire results) with labor data (such as mental health checks and overtime records), we can identify future health risks at an earlier stage.

Programming Language

  • Python, R, MATLAB, SPSS
  • Java, JavaScript, Golang, Haskell, and Erlang/Elixir are a plus.

Currently, development is mainly in Python. It is good to understand object-oriented programming in Java or similar languages. It is also good if you have experience with parallel processing in a server-side language (Golang, etc.).

In addition, engineers who can use functional languages (Haskell, Erlang/Elixir) are a treasure trove of talent. Such people are interested in a variety of programming languages, have mathematical curiosity, and many of them are self-taught. Although we do not have many opportunities to use these languages in actual development, we welcome such engineers as well.

Operational Environment

  • Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, Redmine, GitLab, data lake, etc.

Who we are looking for

Recently, in web service development, engineers who have used AI-related APIs and libraries (open source, Google, IBM, etc.) from the standpoint of a “user” are often classified as “AI Engineers”. MTI Group is not looking for that kind of technician; we are looking for an experienced person who has studied data science itself in depth. On the other hand, we will still consider someone who has studied data science but does not have much practical work experience.

  • Have great ambition and the ability to study leading-edge research on your own and apply it to your own development.
  • Have experience working in a team with Data Scientists and AI Engineers, and the leadership to lead project members.
  • Have the technical skills and creativity to build new technologies from scratch yourself when something necessary does not yet exist.
  • Adapt to our team working culture of discussion and sharing. We value personality, and excellent people come in a variety of personalities. However, being able to work only on your own is a problem.
  • Have research or study experience in statistical mathematics, such as regression analysis, SVM, or information theory.
  • Have taken part in research or business involving AI, machine learning, natural language processing (NLP), neural networks, and so on.
  • Have research or study experience in engineering and science, econometrics, behavioral psychology, medical statistics, and so on.
  • Have work experience in statistical analysis or as a Data Scientist.
  • In AI development, trial and error is repeated many times to solve problems with unclear specifications or no fixed answer. For this reason, we are looking for individuals who:
    • Are agile in the trial-and-error cycle (quick to turn thoughts into code).
    • Pay attention to even small issues and solve problems efficiently through logical thinking.
    • Are curious about knowledge. People who are greatly interested in and curious about knowledge surely grow the most.
  • Have deep research experience.
  • English skill: able to read English to gather information related to AI.

More on MTI – what is it like to work at MTI?

At MTI Technology, our goal is to empower every individual to learn, discover, and communicate openly and honestly, so that we can create the best services through effective teamwork.

[Hiring] 2 more Data Scientists

We are now looking for Data Scientists in HCM, Vietnam.
(Senior, Middle, and Leader levels)

If you or your friends are interested in a Data Scientist role,
please check the job description below!
We have many business projects, in many project styles such as outsourcing/offshoring and research.

Job Description for Data Scientist

https://www.linkedin.com/jobs/view/2279641602/


[AI Lab] Hiring Data Scientist

If you want to join exciting and challenging projects, MTI Tech could be the next destination for your career.

Contact

Please contact us at the email address below.
recruitment@mti-tech.vn

MTI Technology specializes in creating smart mobile content and services that transform and transcend customers’ lives. We design and develop our products using agile methods, delivering the best results in the shortest amount of time. MTI stands for an attitude: seeking a balance of excellence, pragmatism, and convenience for customers. Starting from an original team of 20 people, we have grown to more than 100 bright talents and continue to grow. Looking for a place to grow your talents and be awesome? This is the place!

The Job

We are looking for Data Scientists who would like to take part in projects that apply AI to a variety of existing data and combine it with other data to create new value.

Currently, we are looking for candidates with experience in algorithms or Natural Language Processing (NLP), but any other field of AI will be considered too.

Example of data

  • Medical examination results, medical questionnaire results, or images of them, from health checks or general medical examinations.
  • Athletes’ training results and vital signs.
  • Pregnancy activity data from pregnant women.
  • Everyday-life data such as weather and navigation.
  • Text data such as newspapers.

We have many project styles, such as outsourcing/offshoring and research.

Example of application

  • By combining and analyzing healthcare data (such as medical examination and medical questionnaire results) with labor data (such as mental health checks and overtime records), we can identify future health risks at an earlier stage.

Programming Language

  • Python, R, MATLAB, SPSS
  • Java, JavaScript, Golang, Haskell, and Erlang/Elixir are a plus.

Currently, development is mainly in Python. It is good to understand object-oriented programming in Java or similar languages. It is also good if you have experience with parallel processing in a server-side language (Golang, etc.).

In addition, engineers who can use functional languages (Haskell, Erlang/Elixir) are a treasure trove of talent. Such people are interested in a variety of programming languages, have mathematical curiosity, and many of them are self-taught. Although we do not have many opportunities to use these languages in actual development, we welcome such engineers as well.

Operational Environment

  • AWS, Google Cloud Platform, Microsoft Azure, Redmine, GitHub, etc.

Who we are looking for

Recently, in web service development, engineers who have used AI-related APIs and libraries (open source, Google, IBM, etc.) from the standpoint of a “user” are often classified as “AI Engineers”. MTI Group is not looking for that kind of technician; we are looking for an experienced person who has studied data science itself in depth. On the other hand, we will still consider someone who has studied data science but does not have much practical work experience.

  • Have great ambition and the ability to study leading-edge research on your own and apply it to your own development.
  • Have the technical skills and creativity to build new technologies from scratch yourself when something necessary does not yet exist.
  • Adapt to our team working culture of discussion and sharing. We value personality, and excellent people come in a variety of personalities. However, being able to work only on your own is a problem.
  • Have research or study experience in statistical mathematics, such as regression analysis, SVM, or information theory.
  • Have taken part in research or business involving AI, machine learning, natural language processing (NLP), neural networks, and so on.
  • Have research or study experience in engineering and science, econometrics, behavioral psychology, medical statistics, and so on.
  • Have work experience in statistical analysis or as a Data Scientist.
  • In AI development, trial and error is repeated many times to solve problems with unclear specifications or no fixed answer. For this reason, we are looking for individuals who:
    • Are agile in the trial-and-error cycle (quick to turn thoughts into code).
    • Pay attention to even small issues and solve problems efficiently through logical thinking.
    • Are curious about knowledge. People who are greatly interested in and curious about knowledge surely grow the most.
  • Have deep research experience.
  • English skill: able to read English to gather information related to AI.

More on MTI – what is it like to work at MTI?

At MTI Technology, our goal is to empower every individual to learn, discover, and communicate openly and honestly, so that we can create the best services through effective teamwork.

Hiring - Data Scientist (Algorithm Theory)

Job Title: Data Scientist (Algorithm Theory)
Location: Ho Chi Minh
Contact: recruitment@mti-tech.vn
Employment: Full-time
Level: Middle/Senior
Report to: Line Manager

If you want to join exciting and challenging projects, MTI Tech could be the next destination for your career.

MTI Technology specializes in creating smart mobile content and services that transform and transcend customers’ lives. We design and develop our products using agile methods, delivering the best results in the shortest amount of time. MTI stands for an attitude: seeking a balance of excellence, pragmatism, and convenience for customers. Starting from an original team of 20 people, we have grown to more than 100 bright talents and continue to grow. Looking for a place to grow your talents and be awesome? This is the place!

The Job

We are looking for Data Scientists who would like to take part in projects that apply AI to a variety of existing data and combine it with other data to create new value.

Currently, we are looking for candidates with experience in algorithms or Natural Language Processing (NLP), but any other field of AI will be considered too.

Example of data

  • Medical examination results, medical questionnaire results, or images of them, from health checks or general medical examinations.
  • Athletes’ training results and vital signs.
  • Pregnancy activity data from pregnant women.
  • Everyday-life data such as weather and navigation.
  • Text data such as magazines.

We have many project styles, such as outsourcing/offshoring and research.

Example of application

  • By combining and analyzing healthcare data (such as medical examination and medical questionnaire results) with labor data (such as mental health checks and overtime records), we can identify future health risks at an earlier stage.

Programming Language

  • Python, R, MATLAB, SPSS
  • Java, JavaScript, Golang, Haskell, and Erlang/Elixir are a plus.

Currently, development is mainly in Python. It is good to understand object-oriented programming in Java or similar languages. It is also good if you have experience with parallel processing in a server-side language (Golang, etc.).

In addition, engineers who can use functional languages (Haskell, Erlang/Elixir) are a treasure trove of talent. Such people are interested in a variety of programming languages, have mathematical curiosity, and many of them are self-taught. Although we do not have many opportunities to use these languages in actual development, we welcome such engineers as well.

Operational Environment

  • AWS, Google Cloud Platform, Microsoft Azure, Redmine, GitHub, etc.

Who we are looking for

Recently, in web service development, engineers who have used AI-related APIs and libraries (open source, Google, IBM, etc.) from the standpoint of a “user” are often classified as “AI Engineers”. MTI Group is not looking for that kind of technician; we are looking for an experienced person who has studied data science itself in depth. On the other hand, we will still consider someone who has studied data science but does not have much practical work experience.

  • Have research or study experience in algorithm theory, such as discrete mathematics or search algorithms. This is the top-priority skill.
  • Have research or study experience in mathematics.
  • Have great ambition and the ability to study leading-edge research on your own and apply it to your own development.
  • Have the technical skills and creativity to build new technologies from scratch yourself when something necessary does not yet exist.
  • Adapt to our team working culture of discussion and sharing. We value personality, and excellent people come in a variety of personalities. However, being able to work only on your own is a problem.
  • Have research or study experience in statistical mathematics, such as regression analysis, SVM, or information theory.
  • Have taken part in research or business involving AI, machine learning, natural language processing (NLP), neural networks, and so on.
  • Have research or study experience in engineering and science, econometrics, behavioral psychology, medical statistics, and so on.
  • Have work experience in statistical analysis or as a Data Scientist.
  • In AI development, trial and error is repeated many times to solve problems with unclear specifications or no fixed answer. For this reason, we are looking for individuals who:
    • Are agile in the trial-and-error cycle (quick to turn thoughts into code).
    • Pay attention to even small issues and solve problems efficiently through logical thinking.
    • Are curious about knowledge. People who are greatly interested in and curious about knowledge surely grow the most.
  • Have deep research experience.
  • English skill: able to read English to gather information related to AI.

More on MTI – what is it like to work at MTI?

At MTI Technology, our goal is to empower every individual to learn, discover, and communicate openly and honestly, so that we can create the best services through effective teamwork.

N-gram language models - Part 2

Background

In part 1 of my project, I built a unigram language model: it estimates the probability of each word in a text simply based on the fraction of times the word appears in that text.
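As a quick reminder of what that looks like in code, here is a minimal sketch of unigram estimation on a toy token list. The function name and the example tokens are made up for illustration and are not taken from the project’s code.

```python
from collections import Counter

def train_unigram(tokens):
    """Estimate P(word) as the fraction of tokens equal to that word."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

# Toy "training text", already lower-cased and tokenized.
tokens = "the cat sat on the mat".split()
probabilities = train_unigram(tokens)
print(probabilities["the"], probabilities["cat"])
```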


The text used to train the unigram model is the book “A Game of Thrones” by George R. R. Martin (called train). The texts on which the model is evaluated are “A Clash of Kings” by the same author (called dev1), and “Gone with the Wind” — a book from a completely different author, genre, and time (called dev2).


In this part of the project, I will build higher n-gram models, from bigram (n=2) all the way to 5-gram (n=5). These models are different from the unigram model in part 1, as the context of earlier words is taken into account when estimating the probability of a word.

Higher n-gram language models

Training the model

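For a given n-gram model, the probability of each word depends on the n-1 words before it, and is estimated as the fraction of times that word follows that (n-1)-word context in the training text.

The example below shows how to calculate the probability of a word in a trigram model:

[Figure: probability calculation for a word under the trigram model]

For simplicity, all words are lower-cased in the language model, and punctuation is ignored. The presence of the [END] tokens is explained in part 1.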
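A count-based estimate of this kind can be sketched as follows. The helper names and the two toy sentences are hypothetical; words near the start of a sentence are simply skipped here, since they are dealt with in the next section.

```python
from collections import Counter

def train_counts(sentences, n):
    """Count every n-gram and the (n-1)-word context it starts with."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence + ["[END]"]
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts

def probability(word, context, ngram_counts, context_counts):
    """P(word | context) = count(context + word) / count(context)."""
    context = tuple(context)
    if context_counts[context] == 0:
        return 0.0  # context never seen in training
    return ngram_counts[context + (word,)] / context_counts[context]

# Toy training sentences, already lower-cased with punctuation removed.
sentences = [["i", "have", "a", "dream"], ["i", "have", "a", "plan"]]
ngram_counts, context_counts = train_counts(sentences, n=3)

p_dream = probability("dream", ["have", "a"], ngram_counts, context_counts)
p_end = probability("[END]", ["a", "dream"], ngram_counts, context_counts)
print(p_dream, p_end)
```

Here “have a” occurs twice and is followed by “dream” once, so the trigram estimate for “dream” is one half.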

Dealing with words near the start of sentence

In higher n-gram language models, the words near the start of each sentence will not have a long enough context to apply the formula above. To make the formula consistent for those cases, we will pad these n-grams with sentence-starting symbols [S]. Below are two such examples under the trigram model:

[Figure: two examples of trigram probabilities padded with [S] symbols]

From the above formulas, we see that the n-grams containing the starting symbols are just like any other n-gram. The only difference is that we count them only when they are at the start of a sentence. Lastly, the count of n-grams containing only [S] symbols is naturally the number of sentences in our training text:

$$\mathrm{count}(\text{[S]} \cdots \text{[S]}) = S_{\text{train}}$$

where S_train is the number of sentences in the training text.
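The padding rule can be sketched as below. The function names and toy sentences are assumptions made for illustration; the check at the end confirms that the all-[S] context is counted exactly once per sentence.

```python
from collections import Counter

def pad(sentence, n):
    """Prepend n-1 start symbols and append the end token."""
    return ["[S]"] * (n - 1) + sentence + ["[END]"]

def count_padded(sentences, n):
    """Count n-grams and their (n-1)-word contexts over padded sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = pad(sentence, n)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts

sentences = [["i", "have", "a", "dream"], ["i", "have", "a", "plan"]]
ngram_counts, context_counts = count_padded(sentences, n=3)

# The context made only of [S] symbols appears once per sentence.
print(context_counts[("[S]", "[S]")], len(sentences))
# P(i | [S] [S]): both toy sentences start with "i".
print(ngram_counts[("[S]", "[S]", "i")] / context_counts[("[S]", "[S]")])
```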

Dealing with unknown n-grams

Similar to the unigram model, the higher n-gram models will encounter n-grams in the evaluation text that never appeared in the training text. This can be solved by adding pseudo-counts to the n-grams in the numerator and/or denominator of the probability formula, a technique known as Laplace smoothing. However, as outlined in part 1 of the project, Laplace smoothing is nothing but interpolating the n-gram model with a uniform model, where the latter assigns all n-grams the same probability:

[Figure: Laplace smoothing for the unigram model, where each unigram receives a pseudo-count of k. N: total number of words in the training text. V: number of unique unigrams in the training text.]
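For the unigram case, the equivalence between Laplace smoothing and interpolation can be written out explicitly. The derivation below is a reconstruction from the definitions of k, N, and V given above, not a copy of the original figure:

```latex
P_{\text{Laplace}}(w)
  = \frac{\mathrm{count}(w) + k}{N + kV}
  = \underbrace{\frac{N}{N + kV}}_{\text{weight}}
    \cdot \underbrace{\frac{\mathrm{count}(w)}{N}}_{\text{unigram model}}
  + \underbrace{\frac{kV}{N + kV}}_{\text{weight}}
    \cdot \underbrace{\frac{1}{V}}_{\text{uniform model}}
```

The two weights sum to 1, and a larger pseudo-count k shifts more weight from the unigram model to the uniform model.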

Hence, for simplicity, for an n-gram that appears in the evaluation text but not the training text, we just assign zero probability to that n-gram. Later, we will smooth it with the uniform probability.
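A minimal sketch of that later smoothing step is shown below, assuming a single interpolation weight and a uniform probability of 1/V over the vocabulary. The function name, the weight of 0.9, and the vocabulary size of 1000 are hypothetical values chosen for the example.

```python
def interpolate(p_ngram, vocab_size, weight):
    """Mix an n-gram probability with the uniform probability 1 / vocab_size.

    `weight` is the share given to the n-gram model; the rest goes to the
    uniform model. The name and the single-weight scheme are assumptions.
    """
    return weight * p_ngram + (1 - weight) * (1 / vocab_size)

# An n-gram unseen in training has probability zero before smoothing...
unseen = interpolate(0.0, vocab_size=1000, weight=0.9)
# ...while an n-gram seen in training keeps most of its estimated probability.
seen = interpolate(0.5, vocab_size=1000, weight=0.9)
print(unseen, seen)
```

After interpolation the unseen n-gram has a small but non-zero probability, so it no longer drives the probability of the whole evaluation text to zero.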

N-gram language models - Part 1

Background

Language modeling, that is, predicting the probability of a word in a sentence, is a fundamental task in natural language processing. It is used in many NLP applications such as autocomplete, spelling correction, and text generation.

Currently, language models based on neural networks, especially transformers, are the state of the art: they predict a word in a sentence very accurately based on the surrounding words. However, in this project, I will revisit the most classic of language models: the n-gram model.

Data

In this project, my training data set — appropriately called train — is “A Game of Thrones”, the first book in the George R. R. Martin fantasy series that inspired the popular TV show of the same name.

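Then, I will use two evaluation texts for our language model:

  • dev1: “A Clash of Kings” by the same author.
  • dev2: “Gone with the Wind”, a book from a completely different author, genre, and time.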

Unigram language model

What is a unigram?

In natural language processing, an n-gram is a sequence of n words. For example, “statistics” is a unigram (n = 1), “machine learning” is a bigram (n = 2), “natural language processing” is a trigram (n = 3), and so on. For longer n-grams, people just use their lengths to identify them, such as 4-gram, 5-gram, and so on. In this part of the project, we will focus only on language models based on unigrams i.e. single words.
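For illustration, extracting the n-grams of a piece of text takes only a few lines. The function name is made up for this sketch, and splitting on whitespace is a simplifying assumption.

```python
def extract_ngrams(text, n):
    """Return all sequences of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(extract_ngrams("statistics", 1))
print(extract_ngrams("natural language processing", 2))
print(extract_ngrams("natural language processing", 3))
```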

Training the model

A language model estimates the probability of a word in a sentence, typically based on the words that have come before it. For example, for the sentence “I have a dream”, our goal is to estimate the probability of each word in the sentence based on the previous words in the same sentence:

[Figure: probability of each word in “I have a dream” given the words before it]

For simplicity, all words are lower-cased in the language model, and punctuation is ignored. The [END] token marks the end of the sentence, and will be explained shortly.
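Written out with the chain rule of probability, the estimate for this example sentence takes the following form. This is a reconstruction from the description above, with lower-cased words and the [END] token, rather than a copy of the original figure:

```latex
P(\text{i have a dream [END]})
  = P(\text{i})
  \cdot P(\text{have} \mid \text{i})
  \cdot P(\text{a} \mid \text{i have})
  \cdot P(\text{dream} \mid \text{i have a})
  \cdot P(\text{[END]} \mid \text{i have a dream})
```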

The unigram language model makes the following assumptions: