Bow
Bag of Words (BoW)¶
- Bag of Words (BoW) is a simple and widely used technique in Natural Language Processing (NLP) to represent text data as numerical features for machine learning models.
- The BoW model converts text into a "bag" of individual words, ignoring grammar and word order but preserving the frequency of words.
Key Concepts of Bag of Words:¶
Vocabulary: BoW creates a set of all unique words (or "tokens") from the text data, which forms the vocabulary.
Word Frequency: Each word in the text is assigned a frequency count, representing how many times it appears in the document.
Document Representation: Each document (or sentence) is represented as a vector, where each dimension corresponds to a word from the vocabulary, and the value represents the frequency of that word in the document.
How BoW Works¶
-
Tokenization: First, the text is split into individual tokens (words or phrases).
-
Vocabulary Building: A unique vocabulary of words from the entire corpus is constructed.
-
Vector Representation: For each document, a vector is created, where each position corresponds to a word in the vocabulary, and the value in that position is the count of the word in the document. If the word does not appear, the count is 0.
Example of Bag of Words¶
- Let’s assume we have two sentences:
- Sentence 1: "The cat sat on the mat."
- Sentence 2: "The dog lay on the rug."
Step 1: Tokenization
- Extract the words (ignoring punctuation):
- Sentence 1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
- Sentence 2: ['the', 'dog', 'lay', 'on', 'the', 'rug']
Step 2: Vocabulary Building
- Create a unique list of words (vocabulary) from both sentences:
- Vocabulary: ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'lay', 'rug']
Step 3: Vector Representation
- Create a frequency vector for each sentence based on the vocabulary:
- Sentence 1: [2, 1, 1, 1, 1, 0, 0, 0]
- 'the' appears twice, 'cat', 'sat', 'on', and 'mat' appear once, and 'dog', 'lay', 'rug' do not appear.
- Sentence 2: [2, 0, 0, 1, 0, 1, 1, 1]
- 'the' appears twice, 'dog', 'lay', 'on', and 'rug' appear once, and 'cat', 'sat', 'mat' do not appear.
Advantages of BoW¶
- Simplicity: BoW is easy to understand and implement.
- Effective for Simple Tasks: It works well for tasks where word frequency is important, like document classification or spam detection.
Limitations of BoW¶
- Ignores Word Order: BoW disregards the order of words in the document, so it cannot capture the meaning or context of words based on how they are arranged.
- Example: The sentences "The dog chased the cat" and "The cat chased the dog" would have the same BoW representation, despite having different meanings.
- High Dimensionality: For large corpora, the vocabulary can become very large, resulting in high-dimensional feature vectors that may be inefficient to process.
- Sparse Representation: Most BoW vectors are sparse, meaning that many positions in the vector are 0, which can be inefficient in terms of memory and computation.
Variations of BoW¶
- TF-IDF (Term Frequency-Inverse Document Frequency): Instead of using raw word counts, TF-IDF assigns weights to words based on their importance within the document and across all documents in the corpus. This reduces the importance of common words and highlights words that are more meaningful in specific contexts.
- N-grams: BoW typically treats each word as an independent token, but in the N-gram model, contiguous sequences of N words (bigrams, trigrams, etc.) are considered as tokens to preserve some word order.
Implementing BoW in Python¶
from sklearn.feature_extraction.text import CountVectorizer
# Sample sentences
documents = ["The cat sat on the mat.", "The dog lay on the rug."]
# Create a Bag of Words model
vectorizer = CountVectorizer()
# Fit and transform the documents
bow_matrix = vectorizer.fit_transform(documents)
# View the vocabulary
print(vectorizer.get_feature_names_out())
# View the Bag of Words representation
print(bow_matrix.toarray())
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
# Tokenize sentences
sentence1 = "The cat sat on the mat."
sentence2 = "The dog lay on the rug."
# Tokenize and count word frequencies
tokens1 = word_tokenize(sentence1.lower())
tokens2 = word_tokenize(sentence2.lower())
bow1 = Counter(tokens1)
bow2 = Counter(tokens2)
print(bow1)
print(bow2)