Inverse Document Frequency
IDF is one of the most basic terms of modern search engine relevance calculation. It is used to determine how rare a term is and how relevant it is to the original query. For example take the query “the Golden State Warriors”. This query is difficult because there is no invidual word that identifies our intention to search for a basketball team. Instead we need to look at groups of words and weigh how relevant each set is to the overall query. This is the basics of flat search query relevance and it all starts with IDF.
Before we can calculate IDF we need to associate each document or query with a set of features. For this tutorial we will use only n-grams. An n-gram is one or more words. We can use python’s string methods to quickly extract features from a document or query.
def extract_features( document ): terms = tuple(document.lower().split()) features = set() for i in range(len(terms)): for n in range(1,4): if i+n <= len(terms): features.add(terms[i:i+n]) return features print(extract_features(â€˜The Golden State Warriors’))
Next we need to calculate Document Frequency, then invert it. The formula for IDF starts with the total number of documents in our database: N. Then we divide this by the number of documents containing our term: tD. This will never result in a number less than 1, because 1 indicates that the term is present in all documents, there is no document frequency more common than that limit. Next we normally take the logarithm of this whole term, because we may be indexing billions of documents, and the IDF can get pretty unwieldy unless we refer to it in terms of order of magnitude.
Here we can calculate the IDF for all of our features in a small database of documents.
def extract_features( document ): terms = tuple(document.lower().split()) features = set() for i in range(len(terms)): for n in range(1,4): if i+n <= len(terms): features.add(terms[i:i+n]) return features documents = [ "This article is about the Golden State Warriors", "This article is about the Golden Arches", "This article is about state machines", "This article is about viking warriors"] def calculate_idf( documents ): N = len(documents) from collections import Counter tD = Counter() for d in documents: features = extract_features(d) for f in features: tD[" ".join(f)] += 1 IDF =  import math for (term,term_frequency) in tD.items(): term_IDF = math.log(float(N) / term_frequency) IDF.append(( term_IDF, term )) IDF.sort(reverse=True) return IDF for (IDF, term) in calculate_idf(documents): print(IDF, term)
As you can see in the output, rare terms are assigned higher IDF and thus can be weighted higher in relevancy calculation.