Bag of Words (NLP) implementation in Python

Kemal Toprak Uçar
Published in iyzico.engineering
4 min read · Apr 26, 2018


I have been doing research in the Natural Language Processing (NLP) area for a while, and I think it is time to share some of my work with beginners. In short, NLP is a branch of Artificial Intelligence that enables computers to understand and process human language. Example NLP tasks are semantic analysis, text classification, and sentiment analysis of text.

source, words on the Wall of Love

In this article, I am going to implement a Bag of Words representation of text, using text data stored in a database. The example code will show how to:

  • create a database connection with account parameters,
  • fetch and process text data from the database,
  • finally, calculate the frequency of each word.

This implementation does not take word order into account; it generates a single count for each word.

In order to run the following code, Python and MySQL are required. Python modules can be obtained from PyPI; pandas and PyMySQL can be installed via the pip commands below:

pip install PyMySQL
pip install pandas

import pandas as pd
import operator
import re
from pymysql import cursors, connect

Initially, we import the modules so that we can call their methods from Python. We will use the Pandas framework to store the data on the client side. The operator module will be used for sorting and selecting words and word counts, the re module is the regular expression library that helps tokenize text into separate words, and finally the pymysql module handles database connectivity.

connection = connect(host='localhost',
                     user='myusername',
                     password='mypassword',
                     db='mydatabasename',
                     charset='utf8mb4',
                     cursorclass=cursors.DictCursor)

Note that we are using the UTF-8 character set to be able to process Unicode characters. Fetching text data from the database requires a connection to an active database. If you are using a database server on your local computer, you can start your MySQL server with the "mysql.server start" or "mysqld start" command.

myFrame = pd.read_sql("select first_column_name, second_column_name from table_name", connection)
firstColumnData = myFrame['first_column_name']
secondColumnData = myFrame['second_column_name']
texts = firstColumnData + secondColumnData

The text data is stored in the "myFrame" Pandas DataFrame. In this example we retrieve text from two different columns. Sometimes text data such as first name and last name is stored separately, and the number of columns can change depending on your requirements. In our case, we combine the text pieces into a single Pandas Series object.

def clean_turkish_chars(word):
    edited_word = word.replace("Ş", "s")
    edited_word = edited_word.replace("ş", "s")
    edited_word = edited_word.replace("Ç", "c")
    edited_word = edited_word.replace("ç", "c")
    edited_word = edited_word.replace("Ğ", "g")
    edited_word = edited_word.replace("ğ", "g")
    edited_word = edited_word.replace("İ", "i")
    edited_word = edited_word.replace("ı", "i")
    edited_word = edited_word.replace("Ö", "o")
    edited_word = edited_word.replace("ö", "o")
    edited_word = edited_word.replace("Ü", "u")
    edited_word = edited_word.replace("ü", "u")
    edited_word = edited_word.lower()
    return edited_word

In the next step, we clean the text. The "clean_turkish_chars()" function converts Turkish characters to their Latin equivalents and then lowercases the text. With this function, "Şahap" becomes "sahap", so the same token is used for a particular word regardless of how it is capitalized in the sentence.
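As a quick sanity check, here is what a few illustrative inputs produce when run through the function above (the sample words are just examples, not taken from the dataset):

# Assumes clean_turkish_chars() from above is already defined.
print(clean_turkish_chars("Şahap"))     # sahap
print(clean_turkish_chars("ÜRÜN"))      # urun
print(clean_turkish_chars("İstanbul"))  # istanbul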

word_list = {}
for desc in texts:
    for word in re.split(r"\n|,|\s|\r|\t", desc):
        if word.isalpha():
            edited_word = clean_turkish_chars(word)
            if edited_word in word_list:
                word_list[edited_word] += 1
            else:
                word_list[edited_word] = 1

Next, we implement a word count script. In the loop above, we split each text piece into word tokens, clean the token characters, and use a hash table (a Python dictionary) to keep and increment the occurrence count of each token.

Please note that there was no numerical data in our dataset, so only alphabetic characters are handled. Thus, after splitting the text into word tokens, we use the "isalpha()" method to eliminate tokens that contain numerical characters. This part can be adapted according to the contents and goals of the project.
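For example, the tokens below show how "isalpha()" behaves (these sample strings are only for illustration):

print("urun".isalpha())     # True, the token is kept
print("30cm".isalpha())     # False, the token is dropped
print("x-large".isalpha())  # False, the token is dropped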

At the end of the loop, the word_list dictionary maps words to word occurrence counts. If a word does not already exist in the dictionary, we initialize the counter as 1.
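As a side note, the same counting could also be done with Python's collections.Counter, which initializes missing keys for us. A minimal sketch, assuming the texts Series and clean_turkish_chars() from above are available:

from collections import Counter
import re

word_counter = Counter()
for desc in texts:
    for word in re.split(r"\n|,|\s|\r|\t", desc):
        if word.isalpha():
            word_counter[clean_turkish_chars(word)] += 1
# word_counter.most_common(5) would return the five most frequent words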

sorted_word_list = sorted(word_list.items(), key=operator.itemgetter(1), reverse=True)
sorted_word_list[:5]

out: [('ve', 26096), ('cm', 12854), ('ile', 10263), ('icin', 9016), ('urun', 8627)]

In the end, we would like to keep only the words that are neither too frequent nor too rare. Frequent words appear almost everywhere, so they do not specifically change the meaning of a text piece (e.g., stop words such as "and", "or", "but"), and rare words do not appear in enough documents to be significant. Hence, we convert the dictionary of words into a list sorted by frequency. For instance, sorted_word_list[:5] gives us the five most frequent words in the text repository.
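The article does not fix concrete thresholds, but as an illustration, a simple frequency cut-off could look like the sketch below; min_count and max_count are hypothetical values that should be tuned to your corpus:

min_count = 5       # hypothetical lower bound: drop very rare words
max_count = 10000   # hypothetical upper bound: drop very frequent (stop) words

vocabulary = [word for word, count in sorted_word_list
              if min_count <= count <= max_count]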

I wanted this article to be an introduction to NLP. Bag of Words is an effective model for representing documents as numerical vectors that Machine Learning algorithms can then consume. In the next article we will continue with the TF-IDF (term frequency-inverse document frequency) vector representation of text repositories.
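To make the "documents as numerical vectors" idea concrete, here is a minimal sketch of mapping a single text onto a Bag of Words vector; the vocabulary list and clean_turkish_chars() are assumed from the earlier snippets, and the sample sentence is made up for illustration:

import re

def bag_of_words_vector(text, vocabulary):
    # count how often each vocabulary word occurs in the given text
    counts = dict.fromkeys(vocabulary, 0)
    for word in re.split(r"\n|,|\s|\r|\t", text):
        if word.isalpha():
            token = clean_turkish_chars(word)
            if token in counts:
                counts[token] += 1
    # keep vocabulary order so every document maps to the same dimensions
    return [counts[word] for word in vocabulary]

vector = bag_of_words_vector("Beyaz urun 30 cm genisliginde", vocabulary)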

edit: You can access my article about TF-IDF here: https://iyzico.engineering/how-to-calculate-tf-idf-term-frequency-inverse-document-frequency-from-the-beatles-biography-in-c4c3cd968296

To say "hi" or to ask me anything:

e-mail: toprakucar@gmail.com

linkedin: https://www.linkedin.com/in/ktoprakucar/

github: https://github.com/ktoprakucar


Loves research and programming. Besides technology, analog photography, books, alternative rock, and architecture are indispensable to him.