TF-IDF (Term Frequency-Inverse Document Frequency)

  • TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical technique used in Natural Language Processing (NLP) to evaluate the importance of a word in a document relative to a collection of documents (called the corpus).
  • Unlike simple word count or Bag of Words (BoW), TF-IDF not only considers how often a word appears in a document but also accounts for how common or rare the word is across the entire corpus.
  • This helps identify the most relevant words while minimizing the impact of commonly occurring words (like "the", "is", "and").

Key Components of TF-IDF

  1. Term Frequency (TF):

    • This measures how often a term (word) appears in a document. A higher term frequency means that the word is more important in that specific document.
    • Formula:
    • TF(t,d) = (Number of times term t appears in document d)/(Total number of terms in document d)
    • Example: If the word "cat" appears 3 times in a document with 100 words, the TF for "cat" would be:
    • TF(cat,document) = 3/100 = 0.03
  2. Inverse Document Frequency (IDF):

    • This measures how important a word is across the entire corpus. Words that appear in many documents get a lower score, while words that appear in fewer documents get a higher score.
    • Formula (the base-10 logarithm is used in this example; other implementations, including scikit-learn, use the natural log):
    • IDF(t,D)=log(Total number of documents/Number of documents containing the term t)
    • Example: If the corpus contains 1000 documents, and the word "cat" appears in 50 of them, the IDF for "cat" would be:
    • IDF(cat,corpus)=log(1000/50)=log(20)≈1.3
  3. TF-IDF Score:

    • TF-IDF combines the term frequency (TF) and inverse document frequency (IDF) to assign a weight to each word in a document. Words that are frequent in a document but rare in the corpus receive a higher score, indicating their importance.
    • Formula:
    • TF-IDF(t,d,D)=TF(t,d)×IDF(t,D)
    • Example: Using the previous values for TF and IDF of the word "cat", the TF-IDF score for "cat" in a document would be:
    • TF-IDF(cat)=0.03×1.3=0.039
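The three formulas above can be sketched in a few lines of Python. The `tf`, `idf`, and `tf_idf` helpers below are hypothetical (not from any library), and the base-10 log is used to match the worked numbers:

```python
import math
from collections import Counter

def tf(term, tokens):
    """Number of times `term` appears, divided by total tokens."""
    return Counter(tokens)[term] / len(tokens)

def idf(term, documents):
    """log10(total documents / documents containing `term`)."""
    n_containing = sum(1 for d in documents if term in d)
    return math.log10(len(documents) / n_containing)

def tf_idf(term, tokens, documents):
    return tf(term, tokens) * idf(term, documents)

# Toy corpus: "cat" appears 3 times in a 100-word document,
# and 50 of the 1000 documents contain "cat"
doc = ["cat"] * 3 + ["word"] * 97
corpus = [doc] + [["cat"]] * 49 + [["dog"]] * 950

print(round(tf("cat", doc), 2))              # 0.03
print(round(idf("cat", corpus), 1))          # 1.3
print(round(tf_idf("cat", doc, corpus), 3))  # 0.039
```

The three printed values reproduce the TF, IDF, and TF-IDF numbers computed by hand above.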

Example of TF-IDF

  • Let's calculate TF-IDF for two simple documents:
  • Document 1: "The cat is on the mat."
  • Document 2: "The dog sat on the rug."

Step 1: Calculate Term Frequency (TF)

  • Document 1: "The cat is on the mat." (Total words: 6)
    • TF("the") = 2/6 = 0.33
    • TF("cat") = 1/6 = 0.17
    • TF("mat") = 1/6 = 0.17
    • TF("is") = 1/6 = 0.17
    • TF("on") = 1/6 = 0.17
  • Document 2: "The dog sat on the rug." (Total words: 6)
    • TF("the") = 2/6 = 0.33
    • TF("dog") = 1/6 = 0.17
    • TF("sat") = 1/6 = 0.17
    • TF("on") = 1/6 = 0.17
    • TF("rug") = 1/6 = 0.17

Step 2: Calculate Inverse Document Frequency (IDF)

  • For a corpus of two documents (using the natural logarithm):
    • IDF("the"): Appears in both documents, so:
      • IDF(the)=log(2/2)=log(1)=0
      • (Common words like "the" get an IDF of 0, meaning they have no distinguishing power.)
    • IDF("cat"): Appears in Document 1 only:
      • IDF(cat)=log(2/1)=log(2)≈0.693
    • IDF("on"): Appears in both documents:
      • IDF(on)=log(2/2)=0

Step 3: Calculate TF-IDF Scores

  • For Document 1:
    • TF-IDF("cat") = 0.17 × 0.693 = 0.118
    • TF-IDF("the") = 0.33 × 0 = 0
    • TF-IDF("on") = 0.17 × 0 = 0
  • For Document 2:
    • TF-IDF("dog") = 0.17 × 0.693 = 0.118
    • TF-IDF("the") = 0.33 × 0 = 0
    • TF-IDF("on") = 0.17 × 0 = 0
  • Thus, terms like "cat" and "dog" get a higher TF-IDF score because they are less frequent across the entire corpus, while common words like "the" and "on" get lower scores.
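The two-document walk-through can be reproduced in a short script. This is a sketch using the natural log, as in the IDF values above; small differences from the hand-computed numbers (e.g. 0.116 vs. 0.118) come from rounding TF to two decimals by hand:

```python
import math
from collections import Counter

docs = [
    "the cat is on the mat".split(),
    "the dog sat on the rug".split(),
]

def tf(term, tokens):
    return Counter(tokens)[term] / len(tokens)

def idf(term, documents):
    n = sum(1 for d in documents if term in d)
    return math.log(len(documents) / n)  # natural log

for term in ["cat", "dog", "the", "on"]:
    for i, tokens in enumerate(docs, start=1):
        score = tf(term, tokens) * idf(term, docs)
        print(f"TF-IDF({term}, doc{i}) = {score:.3f}")
```

As expected, "cat" and "dog" receive nonzero scores in their own documents, while "the" and "on" score 0 everywhere because their IDF is 0.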

Advantages of TF-IDF:

  • Relevance Weighting: TF-IDF assigns higher scores to words that are more important in a specific document while downplaying the impact of common words that appear frequently across documents.
  • Effective for Document Search: TF-IDF is widely used in search engines to rank documents based on relevance to a query.
  • Simplicity: It’s simple to compute and works well as a baseline for text representation.

Limitations of TF-IDF:

  • Does Not Capture Semantics: TF-IDF only considers word frequency and does not capture the meaning or context of the words.
  • Sparsity: For large vocabularies, the document-term matrix generated by TF-IDF can be sparse (lots of zeros), leading to inefficiencies in memory and computation.
  • Fixed Vocabulary: TF-IDF requires a fixed vocabulary from the training corpus, making it less suitable for dynamic environments where new words are frequently introduced.

Implementing TF-IDF in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = ["The cat is on the mat.", "The dog sat on the rug."]

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Display the vocabulary
print(vectorizer.get_feature_names_out())

# Display the TF-IDF representation
print(tfidf_matrix.toarray())

A second version filters out common English stop words with NLTK before computing TF-IDF:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# Download NLTK stop words (if you don't have them already)
nltk.download('stopwords')

from nltk.corpus import stopwords

# Sample documents
documents = [
    "The cat is on the mat.",
    "The dog sat on the rug.",
    "Cats and dogs are great pets."
]

# Load English stop words
stop_words = stopwords.words('english')

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words=stop_words)

# Fit the model and transform the documents into TF-IDF representation
tfidf_matrix = vectorizer.fit_transform(documents)

# Display the vocabulary (unique terms)
print("Vocabulary:", vectorizer.get_feature_names_out())

# Display the TF-IDF values for each document
print("\nTF-IDF Matrix:")
print(tfidf_matrix.toarray())

# Optionally: inspect the matrix in a readable way
import pandas as pd

# Create a DataFrame for better visualization
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("\nTF-IDF DataFrame:")
print(tfidf_df)