Parsing and processing documents can provide a lot of value for almost every department in a company. This is one of the many use cases where natural language processing (or NLP) can come in handy.
NLP is not just for chatbots and Gmail predicting what you are going to write in your email. NLP can also be used to help break down, categorize, and analyze documents automatically. For example, perhaps your company is looking to find relationships through all of your contracts or you're trying to categorize what blog posts or movie scripts are about.
This is where NLP can come in very handy. It can help pull subjects, recurring themes, pronouns, and more out of a document.
Now the question is, how do you do it?
Should you develop a custom neural network from scratch that will break down sentences, words, meaning, sentiment, etc?
This is probably not the best solution, at least not for your initial MVP. Instead, there are lots of libraries and cloud services that can be used to help break down documents.
In this article, we are going to look at three options and how you can implement these tools to analyze documents with Python. We are going to look into AWS Comprehend, GCP Natural Language, and TextBlob.
AWS Comprehend
AWS Comprehend is one of many cloud services that AWS provides that allows your team to take advantage of neural networks and other models without the complexity of building your own.
In this case, AWS Comprehend is an NLP API that can make it very easy to process text.
What is great about AWS Comprehend is that it will automatically break down concepts like the entities, key phrases, and syntax involved in a document. Entities are particularly helpful if you are trying to break down what events, organizations, persons, or products are referenced in a document.
There are plenty of Python libraries that make it easy to break down nouns, verbs, and other parts of speech. However, those libraries aren't built to label which category each of those nouns falls into.
Let's look at an example.
For all the code examples in this article, we will be using the text below:
If you have ever worked at a FAANG or even technology-driven start-up like Instacart, then you have probably realized that data drives everything.
To the point that analysts, PMs, and product managers are starting to understand SQL out of necessity. SQL is the language of data and if you want to interact with data, you need to know it.
Do you want to easily figure out the average amount of time a user spends on your product, but don’t want to wait for an analyst? You better figure out how to run a query.
This ability to run queries easily is also driven by the fact that SQL editors no longer need to be installed. With cloud-based data, warehouses come SaaS SQL editors. We will talk about a SaaS SQL editor more in the next section.
However, the importance here is you don’t have to wait 30 minutes to install and editor and deal with all the hassle of managing it.
Now you can just go to a URL and access your team’s data warehouse. This has allowed anyone in the company easy access to their data.
We know both from anecdotal experience as well as the fact that indeed.com’s tracking in 2019 has shown a steady requirement for SQL skill sets for the past 5 years.
We will take that text example and run it through the code below, where the variable plain_text holds that text:
import os

import boto3

# Read the AWS credentials from environment variables
S3_API_PUBLIC = os.environ.get("PUBLIC_KEY")
S3_API_SECRET = os.environ.get("SECRET_KEY")

client_comprehend = boto3.client(
    'comprehend',
    region_name='eu-west-1',
    aws_access_key_id=S3_API_PUBLIC,
    aws_secret_access_key=S3_API_SECRET
)

plain_text = "PUT TEST TEXT HERE"

# Detect the language and keep the candidate with the highest confidence score
dominant_language_response = client_comprehend.detect_dominant_language(
    Text=plain_text)
dominant_language = max(
    dominant_language_response['Languages'],
    key=lambda k: k['Score'])['LanguageCode']

# Fall back to English for anything other than English or Spanish
if dominant_language not in ['en', 'es']:
    dominant_language = 'en'

# Extract the entities (organizations, dates, quantities, ...) from the text
response = client_comprehend.detect_entities(
    Text=plain_text,
    LanguageCode=dominant_language)
print(response)
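The language-detection step is worth a closer look, because detect_dominant_language returns a list of candidates, each with a LanguageCode and a Score. Here is a minimal sketch of picking the most confident candidate. The sample response is hand-written with made-up scores, and pick_language is a hypothetical helper, not part of boto3:

```python
# Hand-written sample in the shape returned by detect_dominant_language.
# The scores are invented for illustration.
sample_response = {
    "Languages": [
        {"LanguageCode": "en", "Score": 0.10},
        {"LanguageCode": "es", "Score": 0.90},
    ]
}

# The two languages the example in this article allows.
SUPPORTED = {"en", "es"}


def pick_language(response, supported=SUPPORTED, default="en"):
    """Return the highest-scoring language code, or default if unsupported."""
    languages = response["Languages"]
    if not languages:
        return default
    # Compare by Score. Sorting by LanguageCode would simply return
    # whichever code comes first alphabetically.
    best = max(languages, key=lambda lang: lang["Score"])
    code = best["LanguageCode"]
    return code if code in supported else default


print(pick_language(sample_response))
```

With this sample, the helper returns the Spanish code because it has the higher score, even though "en" comes first alphabetically.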
AWS Comprehend output
Once you run the code above, you will get output like the one below. This is a shortened version, but you can still see how the labeling works. For example, both "30 minutes" and "5 years" were labeled QUANTITY, since both are quantities of time:
{
  "Entities": [
    {
      "Score": 0.9316830039024353,
      "Type": "ORGANIZATION",
      "Text": "FAANG",
      "BeginOffset": 30,
      "EndOffset": 35
    },
    {
      "Score": 0.7218282222747803,
      "Type": "TITLE",
      "Text": "Instacart",
      "BeginOffset": 76,
      "EndOffset": 85
    },
    {
      "Score": 0.9762992262840271,
      "Type": "TITLE",
      "Text": "SQL",
      "BeginOffset": 581,
      "EndOffset": 584
    },
    {
      "Score": 0.997804582118988,
      "Type": "QUANTITY",
      "Text": "30 minutes",
      "BeginOffset": 801,
      "EndOffset": 811
    },
    {
      "Score": 0.5189864635467529,
      "Type": "ORGANIZATION",
      "Text": "indeed.com",
      "BeginOffset": 1079,
      "EndOffset": 1089
    },
    {
      "Score": 0.9985176920890808,
      "Type": "DATE",
      "Text": "2019",
      "BeginOffset": 1104,
      "EndOffset": 1108
    },
    {
      "Score": 0.6815792322158813,
      "Type": "QUANTITY",
      "Text": "5 years",
      "BeginOffset": 1172,
      "EndOffset": 1179
    }
  ]
}
As you can see, AWS Comprehend does a great job of breaking down organizations and other entities. Again, it is not limited to only breaking down entities. However, this feature is one of the more useful ones when attempting to look for relationships between documents.
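Since the response is a plain dictionary, a few lines of Python are enough to turn it into something you can compare across documents. The sketch below groups entity texts by type and drops low-confidence hits. The group_entities helper and the 0.7 cutoff are our own assumptions for illustration, not part of the Comprehend API:

```python
from collections import defaultdict

# Trimmed copy of the detect_entities output shown above (offsets omitted).
response = {
    "Entities": [
        {"Score": 0.9316830039024353, "Type": "ORGANIZATION", "Text": "FAANG"},
        {"Score": 0.7218282222747803, "Type": "TITLE", "Text": "Instacart"},
        {"Score": 0.9762992262840271, "Type": "TITLE", "Text": "SQL"},
        {"Score": 0.997804582118988, "Type": "QUANTITY", "Text": "30 minutes"},
        {"Score": 0.5189864635467529, "Type": "ORGANIZATION", "Text": "indeed.com"},
        {"Score": 0.9985176920890808, "Type": "DATE", "Text": "2019"},
        {"Score": 0.6815792322158813, "Type": "QUANTITY", "Text": "5 years"},
    ]
}


def group_entities(response, min_score=0.7):
    """Group entity texts by type, keeping only scores >= min_score."""
    grouped = defaultdict(list)
    for entity in response["Entities"]:
        if entity["Score"] >= min_score:
            grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


entities_by_type = group_entities(response)
for entity_type, texts in entities_by_type.items():
    print(entity_type, texts)
```

With the 0.7 cutoff, "indeed.com" and "5 years" are dropped. If you run this over several documents, comparing the resulting lists (for example as sets) is one simple way to start looking for shared organizations or dates.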
GCP Natural Language
Google has created a very similar NLP cloud service called Cloud Natural Language.
It offers a lot of similar features, including entity detection, custom entity detection, content classification, and more.
Let's use GCP's version of natural language processing on a string. The code below shows an example of using GCP to detect entities:
import httplib2
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

DISCOVERY_URL = ('https://{api}.googleapis.com/'
                 '$discovery/rest?version={apiVersion}')


def gcp_nlp_example(text):
    # Authorize an HTTP client with the application default credentials
    credentials = GoogleCredentials.get_application_default().create_scoped(
        ['https://www.googleapis.com/auth/cloud-platform'])
    http = httplib2.Http()
    credentials.authorize(http)

    # 'v1' is the stable API version (this example originally used 'v1beta1')
    service = discovery.build('language', 'v1',
                              http=http, discoveryServiceUrl=DISCOVERY_URL)

    # Ask the API to detect the entities in a plain-text document
    service_request = service.documents().analyzeEntities(
        body={
            'document': {
                'type': 'PLAIN_TEXT',
                'content': text
            }
        })
    response = service_request.execute()
    print(response)
    return response


gcp_nlp_example("PUT TEXT HERE")
GCP Natural Language output
The GCP output is similar to that of AWS Comprehend. However, you will notice that GCP also groups the different mentions of an entity together and tries to attach metadata related to it, such as a Wikipedia URL:
//sample
{
  "entities": [
    {
      "name": "SaaS SQL",
      "type": "OTHER",
      "metadata": {
        "mid": "/m/075st",
        "wikipedia_url": "https://en.wikipedia.org/wiki/SQL"
      },
      "salience": 0.36921546,
      "mentions": [
        {
          "text": {
            "content": "SQL",
            "beginOffset": -1
          },
          "type": "COMMON"
        },
        {
          "text": {
            "content": "SQL",
            "beginOffset": -1
          },
          "type": "PROPER"
        },
        {
          "text": {
            "content": "language",
            "beginOffset": -1
          },
          "type": "COMMON"
        }
      ]
    }
  ]
}
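To make use of that extra metadata, you can flatten each entity into a small summary. The following is a rough sketch that works on a trimmed copy of the sample above. The summarize_entities helper is hypothetical (it is not part of the Google client library), and it assumes metadata or a Wikipedia URL may be missing for some entities:

```python
# Trimmed copy of the analyzeEntities sample shown above.
response = {
    "entities": [
        {
            "name": "SaaS SQL",
            "type": "OTHER",
            "metadata": {
                "mid": "/m/075st",
                "wikipedia_url": "https://en.wikipedia.org/wiki/SQL",
            },
            "salience": 0.36921546,
            "mentions": [
                {"text": {"content": "SQL", "beginOffset": -1}, "type": "COMMON"},
                {"text": {"content": "SQL", "beginOffset": -1}, "type": "PROPER"},
                {"text": {"content": "language", "beginOffset": -1}, "type": "COMMON"},
            ],
        }
    ]
}


def summarize_entities(response):
    """Return one summary dict per entity, most salient first."""
    summaries = []
    for entity in response.get("entities", []):
        # Collect the distinct mention strings for this entity.
        mentions = {m["text"]["content"] for m in entity.get("mentions", [])}
        summaries.append({
            "name": entity["name"],
            "salience": entity.get("salience", 0.0),
            # None when the entity has no Wikipedia link attached.
            "wikipedia_url": entity.get("metadata", {}).get("wikipedia_url"),
            "mentions": sorted(mentions),
        })
    return sorted(summaries, key=lambda s: s["salience"], reverse=True)


summaries = summarize_entities(response)
for summary in summaries:
    print(summary["name"], summary["salience"], summary["wikipedia_url"])
```

Sorting by salience puts the entities GCP considers most central to the text first, which is a reasonable starting point when you only want the top few topics of a document.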
TextBlob And Python
Besides using cloud service providers, there are libraries that can also extract information from documents. In particular, the TextBlob library in Python is very useful. Personally, it was the first library I learned to develop NLP pipelines with.
It is far from perfect. However, it does a great job of parsing through documents.
It offers part-of-speech tagging, like AWS Comprehend and GCP Natural Language, as well as sentiment analysis. However, on its own, it won't categorize what entities exist.
It is still a great tool to break down the basic word types.
Using this library, a developer can break down verbs, nouns, or other parts of speech and then look for patterns. What words are commonly used? Which specific phrases or words are attracting readers? Which words commonly appear alongside other nouns?
There are still a lot of questions you can answer and products you can develop depending on your end goal.
Implementing the TextBlob library is very simple.
No need to connect to an API in this case. All you will need to do is import the library and call a few classes.
This is shown in the code below:
from textblob import TextBlob

t = "PUT YOUR TEXT HERE"
blob = TextBlob(t)

# Print every noun phrase TextBlob finds in the text
for phrase in blob.noun_phrases:
    print(phrase)
TextBlob output
Here is the output of TextBlob. You will see many of the same words that AWS and GCP pulled out. However, there isn't all the extra labeling and metadata that come with the APIs. That's what you are paying for (amongst a few other helpful features) with both AWS and GCP:
faang
technology-driven start-up
instacart
pms
product managers
sql
sql
average amount
don ’ t
query.this ability
sql
saas sql
saas sql
don ’ t
url
team ’ s data warehouse
easy access
anecdotal experience
indeed.com ’ s
steady requirement
Sql
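Even without labels, a list like this can start to answer the "what words are commonly used?" question. Below is a small sketch that counts the phrases with the standard library's Counter. The list is copied by hand from the output above so the snippet runs without TextBlob installed. In practice you would pass in blob.noun_phrases instead:

```python
from collections import Counter

# Noun phrases copied from the TextBlob output above.
noun_phrases = [
    "faang", "technology-driven start-up", "instacart", "pms",
    "product managers", "sql", "sql", "average amount", "don ’ t",
    "query.this ability", "sql", "saas sql", "saas sql", "don ’ t",
    "url", "team ’ s data warehouse", "easy access",
    "anecdotal experience", "indeed.com ’ s", "steady requirement", "Sql",
]

# Lowercase first so "Sql" and "sql" are counted as the same phrase.
counts = Counter(phrase.lower() for phrase in noun_phrases)

# Keep only the phrases that show up more than once, most frequent first.
repeated = [(phrase, n) for phrase, n in counts.most_common() if n > 1]

for phrase, n in repeated:
    print(phrase, n)
```

Here "sql" comes out on top with four occurrences, which matches what the sample text is about. The tokenization leftovers such as "don ’ t" also show up, so some cleanup is usually needed before you rely on counts like these.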
And with that, we have covered three different ways you can use NLP on your documents or raw text.
NLP Doesn't Have to Be Hard --- Sort Of
NLP is a great tool to help process documents, categorize them, and look for relationships. Thanks to AWS and GCP, many less technical developers can take advantage of some NLP features.
That being said, there are a lot of hard aspects to NLP. For example, having to develop chatbots that are good at tracking conversations and context isn't an easy task. In fact, there is a great series here on Medium where Adam Geitgey covers just that. You can read more in the article Natural Language Processing Is Fun.
Good luck with whatever your next NLP project is.
If you would like to read more about data science and data engineering, check out the articles and videos below.
4 SQL Tips For Data Scientists
How To Analyze Data Without Complicated ETLs For Data Scientists
What Is A Data Warehouse And Why Use It