Bag of Words

Bag of words is a simple modeling concept in which only the words a document contains (and, typically, how often each one occurs) matter. It simplifies the document for modeling purposes by discarding the order of the words. Let's say there is a document with the following content.

Taj Mahal

Construction of the mausoleum was essentially completed in 1643 but work continued on other phases of the project for another 10 years. The Taj Mahal complex is believed to have been completed in its entirety in 1653 at a cost estimated at the time to be around 32 million rupees, which in 2015 would be approximately 52.8 billion rupees (U.S. $827 million). The construction project employed some 20,000 artisans under the guidance of a board of architects led by the court architect to the emperor, Ustad Ahmad Lahauri.

For a human reader, both the words (referred to as terms) and their exact order matter. But for a computer, a model that keeps track of word order is too complex for some use cases. What if we throw away the order of the words and just treat the document as a “bag of words”? This kind of modeling loses part of the meaning of the document: you will not be able to tell the difference between “David killed Goliath” and “Goliath killed David”. If the task at hand is clustering, i.e. putting documents that talk about similar things together, or classification, then BoW modeling is often good enough. Both documents talk about David, Goliath and killing, so they are likely to be similar.
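
To make the David and Goliath example concrete, here is a minimal Python sketch. The function name `bag_of_words` and the tokenization rule (lowercase the text, keep runs of letters and digits) are assumptions made for illustration; real systems usually tokenize more carefully.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a Counter mapping each lowercased word to how often it occurs.

    Hypothetical helper: word order is discarded, only counts are kept.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


bag_a = bag_of_words("David killed Goliath")
bag_b = bag_of_words("Goliath killed David")

# The two sentences mean opposite things, but their bags are identical.
print(bag_a == bag_b)  # True
```

Because the two bags compare equal, any clustering or classification step built on them would treat the two sentences as the same document, which is exactly the trade-off described above.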
