TF-IDF (Term Frequency-Inverse Document Frequency)¶
- TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical technique used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (called the corpus).
- Unlike simple word count or Bag of Words (BoW), TF-IDF not only considers how often a word appears in a document but also accounts for how common or rare the word is across the entire corpus.
- This helps identify the most relevant words while minimizing the impact of commonly occurring words (like "the", "is", "and").
Key Components of TF-IDF¶
- Term Frequency (TF):
- This measures how often a term (word) appears in a document. A higher term frequency means that the word is more important in that specific document.
- Formula:
- TF(t,d) = (Number of times term t appears in document d)/(Total number of terms in document d)
- Example: If the word "cat" appears 3 times in a document with 100 words, the TF for "cat" would be:
- TF(cat,document) = 3/100 = 0.03
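As a quick check of the TF formula, here is a minimal Python sketch (the helper name `term_frequency` and the filler tokens are just for illustration):

```python
# Term frequency: count of the term divided by the total number of terms
def term_frequency(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

# "cat" appears 3 times in a 100-word document (97 filler words assumed)
doc = ["cat"] * 3 + ["word"] * 97
print(term_frequency("cat", doc))  # 0.03
```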
- Inverse Document Frequency (IDF):
- This measures how important a word is across the entire corpus. Words that appear in many documents get a lower score, while words that appear in fewer documents get a higher score.
- Formula:
- IDF(t,D)=log(Total number of documents/Number of documents containing the term t)
- Example: If the corpus contains 1000 documents, and the word "cat" appears in 50 of them, the IDF for "cat" (using the natural logarithm) would be:
- IDF(cat,corpus) = log(1000/50) = log(20) ≈ 3.0
- TF-IDF Score:
- TF-IDF combines the term frequency (TF) and inverse document frequency (IDF) to assign a weight to each word in a document. Words that are frequent in a document but rare in the corpus receive a higher score, indicating their importance.
- Formula:
- TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)
- Example: Using the previous values for TF and IDF of the word "cat", the TF-IDF score for "cat" in a document would be:
- TF-IDF(cat) = 0.03 × 3.0 = 0.09
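The two factors can be combined in a few lines of Python. Note that the IDF value depends on the logarithm base: the natural log gives ≈ 3.0 for this example, while base-10 gives ≈ 1.3; most libraries (including scikit-learn) use the natural log.

```python
import math

n_docs = 1000          # documents in the corpus
n_docs_with_term = 50  # documents containing "cat"
tf_cat = 3 / 100       # from the TF example above

idf_cat = math.log(n_docs / n_docs_with_term)  # natural log of 20
tfidf_cat = tf_cat * idf_cat

print(round(idf_cat, 2))    # 3.0
print(round(tfidf_cat, 2))  # 0.09
```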
Example of TF-IDF¶
- Let's calculate TF-IDF for two simple documents:
- Document 1: "The cat is on the mat."
- Document 2: "The dog sat on the rug."
Step 1: Calculate Term Frequency (TF)
- Document 1: "The cat is on the mat." (Total words: 6)
- TF("the") = 2/6 = 0.33
- TF("cat") = 1/6 = 0.17
- TF("mat") = 1/6 = 0.17
- TF("is") = 1/6 = 0.17
- TF("on") = 1/6 = 0.17
- Document 2: "The dog sat on the rug." (Total words: 6)
- TF("the") = 2/6 = 0.33
- TF("dog") = 1/6 = 0.17
- TF("sat") = 1/6 = 0.17
- TF("on") = 1/6 = 0.17
- TF("rug") = 1/6 = 0.17
Step 2: Calculate Inverse Document Frequency (IDF)
- For a corpus of two documents:
- IDF("the"): Appears in both documents, so:
- IDF(the) = log(2/2) = log(1) = 0
- (Common words like "the" get an IDF of 0, meaning they have no distinguishing power.)
- IDF("cat"): Appears in Document 1 only:
- IDF(cat) = log(2/1) = log(2) ≈ 0.693
- IDF("on"): Appears in both documents:
- IDF(on) = log(2/2) = 0
Step 3: Calculate TF-IDF Scores
- For Document 1:
- TF-IDF("cat") = 0.17 × 0.693 = 0.118
- TF-IDF("the") = 0.33 × 0 = 0
- TF-IDF("on") = 0.17 × 0 = 0
- For Document 2:
- TF-IDF("dog") = 0.17 × 0.693 = 0.118
- TF-IDF("the") = 0.33 × 0 = 0
- TF-IDF("on") = 0.17 × 0 = 0
- Thus, terms like "cat" and "dog" get a higher TF-IDF score because they are less frequent across the entire corpus, while common words like "the" and "on" get lower scores.
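The hand calculation above can be reproduced with a short script (the helper names are just for illustration; the small difference from 0.118 comes from rounding TF to 0.17 in the tables above):

```python
import math

docs = [
    "the cat is on the mat".split(),
    "the dog sat on the rug".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Distinctive terms score > 0; words appearing in every document score 0
print(round(tfidf("cat", docs[0], docs), 3))  # 0.116 (0.118 with TF rounded to 0.17)
print(round(tfidf("the", docs[0], docs), 3))  # 0.0
print(round(tfidf("dog", docs[1], docs), 3))  # 0.116
```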
Advantages of TF-IDF:¶
- Relevance Weighting: TF-IDF assigns higher scores to words that are more important in a specific document while downplaying the impact of common words that appear frequently across documents.
- Effective for Document Search: TF-IDF is widely used in search engines to rank documents based on relevance to a query.
- Simplicity: It’s simple to compute and works well as a baseline for text representation.
Limitations of TF-IDF:¶
- Does Not Capture Semantics: TF-IDF only considers word frequency and does not capture the meaning or context of the words.
- Sparsity: For large vocabularies, the document-term matrix generated by TF-IDF can be sparse (lots of zeros), leading to inefficiencies in memory and computation.
- Fixed Vocabulary: TF-IDF requires a fixed vocabulary from the training corpus, making it less suitable for dynamic environments where new words are frequently introduced.
Implementing TF-IDF in Python:¶
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = ["The cat is on the mat.", "The dog sat on the rug."]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Display the vocabulary
print(vectorizer.get_feature_names_out())

# Display the TF-IDF representation
print(tfidf_matrix.toarray())
```
```python
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK stop words (if you don't have them already)
nltk.download('stopwords')
from nltk.corpus import stopwords

# Sample documents
documents = [
    "The cat is on the mat.",
    "The dog sat on the rug.",
    "Cats and dogs are great pets."
]

# Load English stop words
stop_words = stopwords.words('english')

# Initialize the TF-IDF vectorizer with the NLTK stop-word list
vectorizer = TfidfVectorizer(stop_words=stop_words)

# Fit the model and transform the documents into a TF-IDF representation
tfidf_matrix = vectorizer.fit_transform(documents)

# Display the vocabulary (unique terms)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Display the TF-IDF values for each document
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Optionally: inspect the matrix as a DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("\nTF-IDF DataFrame:")
print(tfidf_df)
```