Introduction
Have you ever needed to analyze unstructured text data? Python can help you do just that. In this article, we’ll look at a basic example: counting how often selected keywords appear in unstructured text stored in MongoDB, using Python.
Let’s Write the Code
First, let’s connect to a MongoDB server and get a handle on the collection whose documents we want to analyze:
import pymongo
import re
from matplotlib import pyplot as plt
# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]
Next, we’ll concatenate the offer details from every document into one uppercase string, split that string into words, and define the technologies we want to look for:
# Create an empty string that contains all texts
all_details_string = ''
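# Iterate over all documents in the collection
for doc in col.find():
    # Skip a missing field and add a space so words from different documents do not merge
    all_details_string += (doc.get('offer_details') or '').upper() + ' '
# Create a list containing all words, dropping empty strings
doc_general = [word for word in re.split(r" |/|\n", all_details_string) if word]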
# Get all words count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)
# Set the technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
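To try the concatenate-and-split step without a running database, here is a minimal sketch that uses a plain list of dictionaries in place of `col.find()`. The helper name `split_words` and the sample documents are made up for illustration; they are not part of the original script.

```python
import re

def split_words(docs, field="offer_details"):
    """Join one text field from every document and split it into uppercase words."""
    # Skip documents where the field is missing or empty.
    texts = [doc[field].upper() for doc in docs if doc.get(field)]
    # Join with a space so the last word of one document does not
    # merge with the first word of the next.
    joined = " ".join(texts)
    # Same separators as in the article (space, slash, newline); drop empty strings.
    return [word for word in re.split(r" |/|\n", joined) if word]

# Hypothetical sample documents standing in for col.find()
sample_docs = [
    {"offer_details": "Java/Angular developer"},
    {"offer_details": "React and C# engineer\nJava a plus"},
    {"title": "document without offer_details"},
]

words = split_words(sample_docs)
print(words)
print("Total words:", len(words))
```

Because a PyMongo cursor yields dictionary-like documents, the same helper should also accept `col.find()` directly, although that is not tested here.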
Then we iterate over the list of technologies we want to analyze, counting how many times each one appears in the word list and appending that count to our list:
for tec in tec_list:
    # Count exact matches once and reuse the result
    count = doc_general.count(tec.upper())
    count_list.append(count)
    print(tec, 'Total Occurrences:', count)
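Counting on the word list rather than on the raw string makes a difference. The sketch below uses a made-up sample string (not data from the collection) to compare the two: `str.count` matches substrings, so JAVASCRIPT is counted as JAVA, while `list.count` only matches whole tokens. It also shows a limit of the simple split: a token with trailing punctuation, such as `JAVA,`, is missed unless the punctuation is stripped first.

```python
import re

# Made-up sample text, already uppercased like all_details_string
text = "JAVA DEVELOPER WITH JAVASCRIPT AND REACT. JAVA, SPRING"

# Same split as in the article: space, slash, newline
tokens = re.split(r" |/|\n", text)

substring_hits = text.count("JAVA")   # also matches inside JAVASCRIPT
token_hits = tokens.count("JAVA")     # whole tokens only; misses "JAVA,"

# Stripping common punctuation from each token recovers "JAVA," and "REACT."
# The '#' character is left alone so that "C#" stays intact.
cleaned = [token.strip(".,;:()") for token in tokens]
cleaned_hits = cleaned.count("JAVA")

print(substring_hits, token_hits, cleaned_hits)  # prints: 3 1 2
```

Whether stripping punctuation is worth it depends on how the text was written; it is shown here only as one possible refinement.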
In my environment, I got the following result:
Finally, we’ll create a bar chart using the Matplotlib library, with the technologies on the x-axis and their respective counts on the y-axis:
# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)
# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')
# Show the chart
plt.show()
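A bar chart is often easier to read when the bars are ordered by height. As a small sketch, assuming we want the largest count first, the labels and values can be sorted together before they are passed to `ax.bar`. The counts below are invented for illustration and are not results from the article.

```python
# Hypothetical counts standing in for the real tec_list and count_list
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = [12, 5, 9, 15]

# Pair each label with its count and sort by count, largest first
pairs = sorted(zip(tec_list, count_list), key=lambda pair: pair[1], reverse=True)

# Unzip back into two sequences that can be passed to ax.bar(labels, values)
labels, values = zip(*pairs)

print(labels)  # ('React', 'Java', 'Angular', 'C#')
print(values)  # (15, 12, 9, 5)
```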
Complete code and final thoughts
By analyzing unstructured text data, we can gain insights into the most common words and topics. This can be useful for a wide range of applications, such as sentiment analysis, topic modeling, and more.
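To look at the most common words in general, rather than a fixed list of technologies, one option is `collections.Counter` from the standard library. This is a sketch on a made-up sample string, not something the original script does.

```python
from collections import Counter
import re

# Made-up sample text standing in for all_details_string
text = "JAVA DEVELOPER/JAVA ARCHITECT\nREACT DEVELOPER JAVA"

# Same split as in the article, dropping empty strings
words = [word for word in re.split(r" |/|\n", text) if word]

# Count every distinct word in one pass
word_counts = Counter(words)

# The two most frequent words with their counts
top_two = word_counts.most_common(2)
print(top_two)  # [('JAVA', 3), ('DEVELOPER', 2)]
```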
Here is the full code:
import pymongo
import re
from matplotlib import pyplot as plt
# Connect to a MongoDB client
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["your_db"]
col = db["your_collection"]
# Create an empty string that contains all texts
all_details_string = ''
# Iterate over all documents in the collection
for doc in col.find():
    # Skip a missing field and add a space so words from different documents do not merge
    all_details_string += (doc.get('offer_details') or '').upper() + ' '
# Create a list containing all words, dropping empty strings
doc_general = [word for word in re.split(r" |/|\n", all_details_string) if word]
# Get all words count
all_words_count = len(doc_general)
print('Total Occurrences:', all_words_count)
# Set the technologies to analyze
tec_list = ['Java', 'C#', 'Angular', 'React']
count_list = []
for tec in tec_list:
    # Count exact matches once and reuse the result
    count = doc_general.count(tec.upper())
    count_list.append(count)
    print(tec, 'Total Occurrences:', count)
# Create a bar chart
labels = tec_list
values = count_list
fig, ax = plt.subplots()
ax.bar(labels, values)
# Add labels and title
ax.set_ylabel('Occurrence')
ax.set_xlabel('Technology')
ax.set_title('Occurrences of the chosen technologies')
# Show the chart
plt.show()
Thank you!