TF-IDF (Term Frequency-Inverse Document Frequency)¶
- TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical technique used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (called the corpus).
- Unlike simple word count or Bag of Words (BoW), TF-IDF not only considers how often a word appears in a document but also accounts for how common or rare the word is across the entire corpus.
- This helps identify the most relevant words while minimizing the impact of commonly occurring words (like "the", "is", "and").
Key Components of TF-IDF¶
- Term Frequency (TF):
- This measures how often a term (word) appears in a document. A higher term frequency means that the word is more important in that specific document.
- Formula:
- TF(t,d) = (Number of times term t appears in document d)/(Total number of terms in document d)
- Example: If the word "cat" appears 3 times in a document with 100 words, the TF for "cat" would be:
- TF(cat,document) = 3/100 = 0.03
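As a quick check of the TF formula, here is a minimal Python sketch (the helper name `term_frequency` and the filler tokens are just for illustration):

```python
# Term frequency: count of the term divided by the total number of terms
def term_frequency(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

# "cat" appears 3 times in a 100-word document (97 filler words assumed)
doc = ["cat"] * 3 + ["word"] * 97
print(term_frequency("cat", doc))  # 0.03
```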
- Inverse Document Frequency (IDF):
- This measures how important a word is across the entire corpus. Words that appear in many documents get a lower score, while words that appear in fewer documents get a higher score.
- Formula:
- IDF(t,D)=log(Total number of documents/Number of documents containing the term t)
- Example: If the corpus contains 1000 documents, and the word "cat" appears in 50 of them, the IDF for "cat" (using the natural logarithm) would be:
- IDF(cat,corpus) = log(1000/50) = log(20) ≈ 3.0
- TF-IDF Score:
- TF-IDF combines the term frequency (TF) and inverse document frequency (IDF) to assign a weight to each word in a document. Words that are frequent in a document but rare in the corpus receive a higher score, indicating their importance.
- Formula:
- TF-IDF(t,d,D) = TF(t,d) × IDF(t,D)
- Example: Using the previous values for TF and IDF of the word "cat", the TF-IDF score for "cat" in a document would be:
- TF-IDF(cat) = 0.03 × 3.0 = 0.09
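The two factors can be combined in a few lines of Python. Note that the IDF value depends on the logarithm base: the natural log gives ≈ 3.0 for this example, while base-10 gives ≈ 1.3; most libraries (including scikit-learn) use the natural log.

```python
import math

n_docs = 1000          # documents in the corpus
n_docs_with_term = 50  # documents containing "cat"
tf_cat = 3 / 100       # from the TF example above

idf_cat = math.log(n_docs / n_docs_with_term)  # natural log of 20
tfidf_cat = tf_cat * idf_cat

print(round(idf_cat, 2))    # 3.0
print(round(tfidf_cat, 2))  # 0.09
```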
Example of TF-IDF¶
- Let's calculate TF-IDF for two simple documents:
- Document 1: "The cat is on the mat."
- Document 2: "The dog sat on the rug."
Step 1: Calculate Term Frequency (TF)
- Document 1: "The cat is on the mat." (Total words: 6)
- TF("the") = 2/6 = 0.33
- TF("cat") = 1/6 = 0.17
- TF("mat") = 1/6 = 0.17
- TF("is") = 1/6 = 0.17
- TF("on") = 1/6 = 0.17
- Document 2: "The dog sat on the rug." (Total words: 6)
- TF("the") = 2/6 = 0.33
- TF("dog") = 1/6 = 0.17
- TF("sat") = 1/6 = 0.17
- TF("on") = 1/6 = 0.17
- TF("rug") = 1/6 = 0.17
Step 2: Calculate Inverse Document Frequency (IDF)
- For a corpus of two documents:
- IDF("the"): Appears in both documents, so:
- IDF(the) = log(2/2) = log(1) = 0
- (Common words like "the" get an IDF of 0, meaning they have no distinguishing power.)
- IDF("cat"): Appears in Document 1 only:
- IDF(cat) = log(2/1) = log(2) ≈ 0.693
- IDF("on"): Appears in both documents:
- IDF(on) = log(2/2) = 0
Step 3: Calculate TF-IDF Scores
- For Document 1:
- TF-IDF("cat") = 0.17 × 0.693 = 0.118
- TF-IDF("the") = 0.33 × 0 = 0
- TF-IDF("on") = 0.17 × 0 = 0
- For Document 2:
- TF-IDF("dog") = 0.17 × 0.693 = 0.118
- TF-IDF("the") = 0.33 × 0 = 0
- TF-IDF("on") = 0.17 × 0 = 0
- Thus, terms like "cat" and "dog" get a higher TF-IDF score because they are less frequent across the entire corpus, while common words like "the" and "on" get lower scores.
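The hand calculation above can be reproduced with a short script (the helper names are just for illustration; the small difference from 0.118 comes from rounding TF to 0.17 in the tables above):

```python
import math

docs = [
    "the cat is on the mat".split(),
    "the dog sat on the rug".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Distinctive terms score > 0; words appearing in every document score 0
print(round(tfidf("cat", docs[0], docs), 3))  # 0.116 (0.118 with TF rounded to 0.17)
print(round(tfidf("the", docs[0], docs), 3))  # 0.0
print(round(tfidf("dog", docs[1], docs), 3))  # 0.116
```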
Advantages of TF-IDF:¶
- Relevance Weighting: TF-IDF assigns higher scores to words that are more important in a specific document while downplaying the impact of common words that appear frequently across documents.
- Effective for Document Search: TF-IDF is widely used in search engines to rank documents based on relevance to a query.
- Simplicity: It’s simple to compute and works well as a baseline for text representation.
Limitations of TF-IDF:¶
- Does Not Capture Semantics: TF-IDF only considers word frequency and does not capture the meaning or context of the words.
- Sparsity: For large vocabularies, the document-term matrix generated by TF-IDF can be sparse (lots of zeros), leading to inefficiencies in memory and computation.
- Fixed Vocabulary: TF-IDF requires a fixed vocabulary from the training corpus, making it less suitable for dynamic environments where new words are frequently introduced.
Implementing TF-IDF in Python:¶
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = ["The cat is on the mat.", "The dog sat on the rug."]

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Display the vocabulary
print(vectorizer.get_feature_names_out())

# Display the TF-IDF representation
print(tfidf_matrix.toarray())
```
```python
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK stop words (if you don't have them already)
nltk.download('stopwords')
from nltk.corpus import stopwords

# Sample documents
documents = [
    "The cat is on the mat.",
    "The dog sat on the rug.",
    "Cats and dogs are great pets."
]

# Load English stop words
stop_words = stopwords.words('english')

# Initialize the TF-IDF vectorizer with the NLTK stop-word list
vectorizer = TfidfVectorizer(stop_words=stop_words)

# Fit the model and transform the documents into a TF-IDF representation
tfidf_matrix = vectorizer.fit_transform(documents)

# Display the vocabulary (unique terms)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Display the TF-IDF values for each document
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Optionally: inspect the matrix as a DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("\nTF-IDF DataFrame:")
print(tfidf_df)
```