How to Calculate TF-IDF (Term Frequency–Inverse Document Frequency) in Python

Kemal Toprak Uçar
Published in iyzico.engineering
3 min read · Nov 19, 2018

As I mentioned in my previous post, I am going to implement TF-IDF on a text: a biography of the Beatles.

Bag of Words is an effective model for representing documents as numerical vectors, but it does not go beyond simple enumeration. TF-IDF is a technique that measures how important a word is in a given document.


TF (Term Frequency) measures how frequently a word occurs in a document.

TF = (Number of times the word occurs in the document) / (Total number of words in the document)

IDF (Inverse Document Frequency) measures how much information a word provides, i.e., how rare it is across the corpus. Stop words such as “a”, “into”, and “and” carry little information despite their frequent occurrence.

IDF = log(Total number of documents / Number of documents containing the word t)

Thus, the TF-IDF is the product of TF and IDF:

TF-IDF = TF * IDF
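As a minimal sketch, the formulas above can be computed by hand on a toy corpus (the documents below are made up for illustration; the standard definition applies a logarithm to the IDF ratio, as shown):

```python
import math

# A toy corpus of three one-sentence "documents" (illustrative only).
docs = [
    "the beatles played in hamburg",
    "the beatles recorded in london",
    "fans in london loved the band",
]
tokenized = [d.split() for d in docs]

def tf(word, doc_tokens):
    # Term frequency: occurrences of the word over document length.
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus):
    # Inverse document frequency: log of total documents over
    # the number of documents containing the word.
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)

def tf_idf(word, doc_tokens, corpus):
    return tf(word, doc_tokens) * idf(word, corpus)

# "beatles" appears in 2 of 3 documents, "band" in only 1,
# so "band" earns the higher IDF and thus a higher TF-IDF here.
print(tf_idf("beatles", tokenized[0], tokenized))
print(tf_idf("band", tokenized[2], tokenized))
```

Note that scikit-learn's implementation adds smoothing to this formula, so its exact numbers differ slightly, but the intuition is the same.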

To acquire good results with TF-IDF, a large corpus is normally necessary. In this example I used only a small corpus, but since stop words are removed, the results are still reasonable.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import pandas as pd
import re

As in my previous code piece, we start by importing the modules whose methods we will use. In this example we use scikit-learn alongside NumPy, pandas, and regular expressions. Scikit-learn is a free machine learning library for Python. We use CountVectorizer to convert a collection of text documents to a matrix of token counts, and TfidfTransformer to transform a count matrix into a normalized TF or TF-IDF representation.

sentences = list()
with open("resources/beatles_biography") as file:
    for line in file:
        for l in re.split(r"\.\s|\?\s|\!\s|\n", line):
            if l:
                sentences.append(l)

The data is read from the ‘beatles_biography’ file and parsed into sentences. The regular expression splits the text at sentence-ending punctuation, and the resulting sentences are collected in the sentences list.
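As a quick check, the same split pattern can be applied to a sample string (the text below is a made-up snippet, not taken from the actual biography file):

```python
import re

# A hypothetical snippet in the style of the biography file.
text = "The Beatles formed in Liverpool. Did they tour Hamburg? Yes!\nThey did."

# Split on ". ", "? ", "! " (any whitespace after the mark) or a bare
# newline, then drop empty strings.
parts = [p for p in re.split(r"\.\s|\?\s|\!\s|\n", text) if p]
print(parts)
# → ['The Beatles formed in Liverpool', 'Did they tour Hamburg', 'Yes', 'They did.']
```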

cvec = CountVectorizer(stop_words='english', min_df=3, max_df=0.5, ngram_range=(1,2))
sf = cvec.fit_transform(sentences)

We pass four parameters to CountVectorizer. The first is stop_words, which removes words that occur frequently but carry little information; scikit-learn provides a built-in English stop word list. We can pass None if we do not want to remove any words, or supply our own list to choose which words are swept. The min_df parameter is a threshold: terms with a document frequency lower than min_df are ignored. max_df is the counterpart of min_df: if the document frequency of a word exceeds max_df, it is ignored. The last parameter, ngram_range=(x, y), defines the range of n values for the n-grams to extract, where x is the minimum and y the maximum n. fit_transform returns the transformed version of the sentences.

transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(sf)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
# get_feature_names_out() in scikit-learn >= 1.0 (formerly get_feature_names())
weights_df = pd.DataFrame({'term': cvec.get_feature_names_out(), 'weight': weights})

Here we transform the count matrix into a normalized TF-IDF representation to measure the weights. As I mentioned above, the words with the highest weights provide the most information about the document. After the transformation, we obtain a DataFrame of terms and their average weights.

weights_df.sort_values(by='weight', ascending=False).head(10)

Finally, we can print the 10 highest-weighted terms for the given document.
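As a side note, scikit-learn also offers TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in a single step. A minimal sketch on a toy corpus (the sentences are made up, and min_df is lowered so terms survive the small input):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative toy corpus, not the biography text.
toy = [
    "the beatles played in hamburg",
    "the beatles recorded in london",
    "fans in london loved the band",
]

# TfidfVectorizer = CountVectorizer + TfidfTransformer in one object,
# accepting the same vocabulary parameters.
tv = TfidfVectorizer(stop_words='english', min_df=1, ngram_range=(1, 2))
tfidf = tv.fit_transform(toy)
print(tfidf.shape)  # (number of documents, number of terms)
```

Keeping the two steps separate, as in this article, is useful when you want to reuse the raw counts; otherwise the combined vectorizer is more convenient.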

TF-IDF is a numerical statistic widely used in information retrieval and text mining, and I wanted to share my experience with it. You can find the full code in my repository. In our next article we will continue by implementing word2vec to model relationships between words.

To say me “hi” or ask me anything:

e-mail: toprakucar@gmail.com

linkedin: https://www.linkedin.com/in/ktoprakucar/

github: https://github.com/ktoprakucar
