DEV Community

CertosinoLab

Basic Text Analysis with Python and MongoDB

Introduction

Have you ever needed to analyze unstructured text data? Python can help you do just that. In this article, we’ll look at a basic example of parsing unstructured text data located in MongoDB using Python.

Let’s Write the Code

First, let’s connect to a MongoDB server and select the collection in our database that holds the documents:

import pymongo
import re
from matplotlib import pyplot as plt

# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]

Next, we’ll concatenate the offer details from every document into a single string, split it into words, and define the technologies we want to count:

# Create an empty string that will hold the text of every document
all_details_string = ''

# Iterate over all documents in the collection
for doc in col.find():
    # A missing 'offer_details' field contributes an empty string; the trailing
    # space keeps words from adjacent documents from merging
    all_details_string += (doc.get('offer_details') or '').upper() + ' '

# Split on spaces, slashes, and newlines, dropping empty strings
doc_general = [word for word in re.split(r" |/|\n", all_details_string) if word]

# Get the total word count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)

# Set the technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
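
To see what the concatenate-and-split step produces without a running database, here is a minimal sketch that works on plain dictionaries. The helper name `collect_words` and the sample documents are made up for illustration; only the `offer_details` field name and the separators come from the code above.

```python
import re

def collect_words(docs, field="offer_details"):
    """Join the text field of every document and split it into upper-case words."""
    texts = []
    for doc in docs:
        value = doc.get(field)
        # Documents without a string in this field are skipped
        if isinstance(value, str):
            texts.append(value.upper())
    joined = " ".join(texts)
    # Same separators as in the article: space, slash, newline.
    # Consecutive separators would produce empty strings, which are dropped.
    return [word for word in re.split(r" |/|\n", joined) if word]

# Hypothetical sample data standing in for MongoDB documents
sample_docs = [
    {"offer_details": "Java/Angular developer"},
    {"offer_details": "React\nfrontend engineer"},
    {"title": "no details field"},
]
words = collect_words(sample_docs)
print(words)
```

Since a PyMongo cursor yields dictionaries, `col.find()` could be passed as `docs` in the same way. Joining with a space is a deliberate choice here: without it, the last word of one document would fuse with the first word of the next.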

Now we iterate over the technologies, count how many times each one occurs in the word list, and append that count to our list:

for tec in tec_list:
    # The words were upper-cased earlier, so compare against the upper-case name
    occurrences = doc_general.count(tec.upper())
    count_list.append(occurrences)
    print(tec, 'Total Occurrences:', occurrences)
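
One limitation of exact matching is that a word followed by punctuation, such as `JAVA,`, is not counted. As a possible variation (not the approach used in this article), the sketch below strips common punctuation from the edges of each word and counts everything in one pass with `collections.Counter`. The function name `count_technologies` and the sample words are assumptions for illustration.

```python
from collections import Counter

def count_technologies(words, technologies):
    """Count each technology, ignoring case and punctuation at the word edges."""
    # '#' is deliberately not stripped so that 'C#' stays intact
    cleaned = (word.strip(".,;:!?()").upper() for word in words)
    counts = Counter(cleaned)
    # A Counter returns 0 for words it has never seen
    return {tec: counts[tec.upper()] for tec in technologies}

# Hypothetical word list with mixed case and trailing punctuation
sample_words = ["Java,", "C#", "react.", "JAVA", "(Angular)", "JavaScript"]
result = count_technologies(sample_words, ["Java", "C#", "Angular", "React"])
print(result)
```

Because whole words are compared, `JavaScript` is not counted as `Java`. Building the `Counter` once also avoids scanning the full word list again for every technology.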

In my environment, I got the following result:

Finally, we’ll create a bar chart using the Matplotlib library, with the technologies on the x-axis and their respective counts on the y-axis:

# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)

# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')

# Show the chart
plt.show()

Complete Code and Final Thoughts

By analyzing unstructured text data, we can gain insights into the most common words and topics. This can be useful for a wide range of applications, such as sentiment analysis, topic modeling, and more.
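
As a small illustration of finding the most common words, here is a sketch that uses `Counter.most_common` from the standard library. The helper name `most_common_words`, the stop-word set, and the sample words are assumptions for this example, not part of the code in this article.

```python
from collections import Counter

def most_common_words(words, n=3, stop_words=frozenset()):
    """Return the n most frequent words, skipping empty strings and stop words."""
    counts = Counter(w for w in words if w and w not in stop_words)
    return counts.most_common(n)

# Hypothetical upper-cased word list, similar in shape to doc_general
sample = ["JAVA", "AND", "REACT", "AND", "JAVA", "JAVA", ""]
top = most_common_words(sample, n=2, stop_words={"AND"})
print(top)
```

Filtering out stop words such as `AND` keeps filler words from dominating the ranking. Which words belong in that set depends on the data being analyzed.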

Here is the full code:

import pymongo
import re
from matplotlib import pyplot as plt

# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]

# Create an empty string that will hold the text of every document
all_details_string = ''

# Iterate over all documents in the collection
for doc in col.find():
    # A missing 'offer_details' field contributes an empty string; the trailing
    # space keeps words from adjacent documents from merging
    all_details_string += (doc.get('offer_details') or '').upper() + ' '

# Split on spaces, slashes, and newlines, dropping empty strings
doc_general = [word for word in re.split(r" |/|\n", all_details_string) if word]

# Get the total word count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)

# Set the technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
for tec in tec_list:
    # The words were upper-cased earlier, so compare against the upper-case name
    occurrences = doc_general.count(tec.upper())
    count_list.append(occurrences)
    print(tec, 'Total Occurrences:', occurrences)

# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)

# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')

# Show the chart
plt.show()

Thank you!
