In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are typically collected from a text or speech corpus. When the items are words, n-grams may also be called shingles. Using Latin numerical prefixes, an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram".
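As a quick illustration, here is a minimal Python sketch of word-level n-gram extraction (the whitespace tokenizer is a simplifying assumption; real corpora call for proper tokenization):

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()  # naive whitespace tokenization
print(ngrams(tokens, 1))  # unigrams: [('to',), ('be',), ('or',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('to', 'be'), ('be', 'or'), ...]
print(ngrams(tokens, 3))  # trigrams: [('to', 'be', 'or'), ...]
```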
An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence, in the form of an (n − 1)-order Markov model. n-gram models are now widely used in probability, communication theory, and computational linguistics (for instance, in statistical natural language processing).
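To make the Markov view concrete, here is a minimal sketch of the bigram case, i.e. a first-order Markov model with unsmoothed maximum-likelihood estimates; a practical model would also add smoothing to cope with unseen word pairs:

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Unsmoothed MLE bigram model: P(w | prev) = count(prev, w) / count(prev)."""
    counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts[prev][word] += 1
    model = {}
    for prev, ctr in counts.items():
        total = sum(ctr.values())
        model[prev] = {w: c / total for w, c in ctr.items()}
    return model

model = train_bigram_model("to be or not to be".split())
print(model["to"])  # {'be': 1.0} -- next-word distribution after 'to'
```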
n-grams can also be used for efficient approximate matching. n-grams have been used to design kernels that allow machine learning algorithms such as support vector machines* to learn from string data.
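As a sketch of the string-kernel idea, the snippet below computes a simple character n-gram "spectrum" kernel, the dot product of the two strings' n-gram count vectors; the choice n = 3 is illustrative, and the same overlap score doubles as a crude approximate-matching measure:

```python
from collections import Counter

def char_ngrams(s, n=3):
    """Multiset of the character n-grams of a string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def spectrum_kernel(a, b, n=3):
    """Dot product of character n-gram count vectors; higher means more similar."""
    ca, cb = char_ngrams(a, n), char_ngrams(b, n)
    return sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())

print(spectrum_kernel("colour", "color"))   # shares 'col' and 'olo' -> 2
print(spectrum_kernel("colour", "theory"))  # no shared trigrams -> 0
```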
To choose a value for n in an n-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. In practice, this means that trigrams (i.e., triplets of words) are a common choice with large training corpora (millions of words), whereas bigrams are often used with smaller ones.
source: https://en.wikipedia.org/wiki/N-gram
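A rough way to see this trade-off on one's own corpus is to check how sparse the counts get as n grows; the singleton-fraction probe below is an illustrative heuristic (and corpus.txt is a hypothetical file), not a standard recipe:

```python
from collections import Counter

def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams occurring exactly once: higher means
    sparser counts, and therefore less stable probability estimates."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(1 for c in counts.values() if c == 1) / len(counts)

tokens = open("corpus.txt").read().split()  # hypothetical corpus file
for n in (1, 2, 3):
    print(n, singleton_fraction(tokens, n))  # typically rises with n
```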
BONUS: Google Ngram Viewer - historical frequency of some AI terms
* In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
source: https://en.wikipedia.org/wiki/Support_vector_machine