Saturday, September 8, 2018

Bag-of-words model

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). It is commonly used in document classification, where the frequency of occurrence of each word serves as a feature for training a classifier.

In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping multiplicity. Such a bag can be stored as a set of key-value pairs: each key is a word, and each value is the number of occurrences of that word in the text.
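
The key-value view of a bag can be sketched in a few lines of Python. This is only an illustration: the function name bag_of_words, the sample sentence, and the tokenization rule (lowercase, then keep runs of letters, digits, and apostrophes) are assumptions made for this example, not part of the model's definition.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each word to its number of occurrences."""
    # Assumed tokenizer: lowercase the text, then keep runs of
    # letters, digits, and apostrophes as words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


bag = bag_of_words("John likes to watch movies. Mary likes movies too.")
for word, count in sorted(bag.items()):
    print(word, count)

# Word order is discarded: a shuffled version of the text gives the same bag.
shuffled = bag_of_words("movies too likes Mary. movies watch to likes John")
print(bag == shuffled)
```

A Counter is a natural fit because it is a dictionary whose values are counts, which matches the multiset description directly.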

Example usage: spam filtering

In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words drawn from one of two probability distributions: one representing spam and one representing legitimate e-mail. To classify a message, the filter assumes that it is a pile of words poured out at random from one of the two bags, and uses Bayesian probability to determine which bag the message is more likely to have come from.
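
The two-bags idea can be sketched as a small naive Bayes classifier. Everything concrete here is an assumption made for illustration: the four training messages are invented, the label "ham" is used for legitimate e-mail, words are split on whitespace, and add-one (Laplace) smoothing is used so that unseen words do not produce a zero probability. A real filter would be trained on far more data.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")  # "ham" stands for legitimate e-mail


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train(messages):
    """Build one bag of words per label from (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    label_counts = Counter()
    for text, label in messages:
        word_counts[label].update(tokenize(text))
        label_counts[label] += 1
    return word_counts, label_counts


def log_score(text, label, word_counts, label_counts, vocab_size):
    """Log of prior times the smoothed likelihood of each word in the text."""
    total_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for word in tokenize(text):
        # Add-one smoothing (an assumption of this sketch).
        count = word_counts[label][word] + 1
        score += math.log(count / (total_words + vocab_size))
    return score


def classify(text, word_counts, label_counts):
    """Return the label whose bag the text more likely came from."""
    vocab = set()
    for label in LABELS:
        vocab |= set(word_counts[label])
    return max(
        LABELS,
        key=lambda label: log_score(
            text, label, word_counts, label_counts, len(vocab)
        ),
    )


# Invented toy training data.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]
word_counts, label_counts = train(training)

print(classify("win cheap money", word_counts, label_counts))
print(classify("lunch meeting at noon", word_counts, label_counts))
```

Scores are summed as logarithms rather than multiplied as probabilities, which avoids numeric underflow on long messages. Because each message is treated as a bag, reordering its words does not change the result.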

source: https://en.wikipedia.org/wiki/Bag-of-words_model
