Saturday, September 8, 2018

n-gram model

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams are typically collected from a text or speech corpus. When the items are words, n-grams may also be called shingles. Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram", size 2 is a "bigram" (or, less commonly, a "digram"), and size 3 is a "trigram".
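
For example, here is a minimal sketch in Python (my own illustration, not from the article, assuming word-level items and simple whitespace tokenization) that extracts the n-grams of a sentence:

def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 1))  # unigrams: ('to',), ('be',), ('or',), ...
print(ngrams(tokens, 2))  # bigrams: ('to', 'be'), ('be', 'or'), ...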

An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence, in the form of an (n − 1)-order Markov model. n-gram models are now widely used in probability, communication theory, and computational linguistics (for instance, in statistical natural language processing).
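
To make the Markov idea concrete, here is a tiny sketch of a bigram (i.e. first-order Markov) model, assuming maximum-likelihood estimates from raw counts and no smoothing:

from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Estimate P(next | current) by maximum likelihood from bigram counts."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] += 1
    model = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        model[cur] = {nxt: c / total for nxt, c in nxts.items()}
    return model

tokens = "to be or not to be that is the question".split()
model = train_bigram_model(tokens)
print(model["to"])  # {'be': 1.0} -- in this toy corpus 'to' is always followed by 'be'
print(model["be"])  # {'or': 0.5, 'that': 0.5}

A real language model would add smoothing for unseen n-grams, but the prediction rule is exactly this conditional table.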

n-grams can also be used for efficient approximate matching. n-grams have been used to design kernels that allow machine learning algorithms such as support vector machines* to learn from string data.
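
As a sketch of the approximate-matching idea (my illustration; the padding markers and the choice of Jaccard similarity are assumptions, not the specific kernel construction mentioned above), two strings can be compared by the overlap of their character n-gram sets:

def char_ngrams(s, n=3):
    """Set of character n-grams of s, with '#' padding so edges count too."""
    padded = f"##{s}##"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b, n=3):
    """Jaccard similarity of the two strings' character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(jaccard("colour", "color"))    # 0.5: near-duplicate spellings share many trigrams
print(jaccard("colour", "flavour"))  # ~0.21: only the common suffix overlaps

String kernels for SVMs are built on the same principle: the similarity of two strings is a function of the n-grams they share.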

To choose a value for n in an n-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that trigram models (i.e. models over triplets of words) are a common choice with large training corpora (millions of words), whereas bigram models are often used with smaller ones.
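
The trade-off is essentially data sparsity: as n grows, almost every n-gram occurs only once, so its count gives an unreliable probability estimate. A quick sketch on a toy corpus (my own example; real corpora show the same pattern at much larger scale):

from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the quick brown fox jumps over the lazy dog and the quick dog runs".split()
for n in (1, 2, 3):
    counts = ngram_counts(tokens, n)
    print(f"n={n}: {len(counts)} distinct out of {sum(counts.values())} total")
    # n=1: 10 distinct out of 14 total
    # n=2: 12 distinct out of 13 total
    # n=3: 12 distinct out of 12 total -- every trigram is a singleton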

source: https://en.wikipedia.org/wiki/N-gram

BONUS: Google Ngram Viewer - historical frequency of some AI terms
 

* In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

source: https://en.wikipedia.org/wiki/Support_vector_machine
