Welcome to yet another exciting installment in our series on the fundamentals of Text Analytics. In the last article, we saw the definition of Text Analytics, understood the other important related concepts, and looked at why it is useful.
In this post, we will continue to expand on that by covering some of the applications of text analytics, and then, most importantly, see what a typical text analytics pipeline looks like - one of the most frequently used terms in the AI and computational-linguistics community.
Analyzing customer emails, surveys, call center logs, and social media streams such as blogs, tweets, forum posts, and newsfeeds to understand customers better.
Analysis of customer reviews of products or services helps enterprises understand user sentiment or common issues customers are talking about.
Keyword analysis (comparing profiles with job descriptions) helps in short-listing suitable candidates.
Contextual mining of text that identifies and extracts subjective information from source material, helping a business understand the social sentiment around its brand, product, or service while monitoring online conversations.
Identifying and determining what is being said about a brand, individual, or product through different social and online channels.
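To make one of these applications concrete, the keyword analysis used for short-listing candidates can be sketched as a simple overlap score between a profile and a job description. The sample texts and the scoring function below are illustrative assumptions, not a production matching algorithm:

```python
def keyword_overlap(profile: str, job_description: str) -> float:
    """Fraction of job-description keywords that appear in the profile."""
    jd_terms = set(job_description.lower().split())
    profile_terms = set(profile.lower().split())
    if not jd_terms:
        return 0.0
    return len(jd_terms & profile_terms) / len(jd_terms)

# Hypothetical job description and candidate profiles.
job = "python machine learning nlp"
profiles = {
    "A": "experienced python developer with nlp background",
    "B": "java backend engineer",
}
scores = {name: keyword_overlap(text, job) for name, text in profiles.items()}
print(scores)  # candidate A matches "python" and "nlp": 2 of 4 keywords
```

Real recruiting tools would add stemming, synonyms, and weighting, but the core idea is the same keyword comparison.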
There are many ways text analytics can be implemented depending on the business needs, data types, and data sources. All share four key steps.
Text analytics begins with collecting the text to be analyzed -- defining, selecting, acquiring, and storing raw data. This data can include text documents and web pages (blogs, news, etc.), among many other sources.
Once data is acquired, the enterprise must prepare it for analysis. The data must be in the proper form to work with machine learning models that will be used for data analysis. There are four stages in data preparation:
Text cleansing (also known as "text normalization") removes any unnecessary or unwanted information. Text data is restructured to ensure it can be read the same way across the system and to improve data integrity.
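The cleansing/normalization step described above might look like the following minimal sketch, which unifies Unicode forms, lowercases, strips punctuation, and collapses whitespace. The exact normalization rules vary by application; these are illustrative choices:

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize raw text: unify Unicode form, lowercase,
    strip punctuation, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation/symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(clean_text("  Hello,   WORLD!!  "))  # -> "hello world"
```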
Tokenization breaks up a sequence of strings into pieces (such as words, keywords, phrases, symbols, and other elements) called tokens. Semantically meaningful pieces (such as words) will be used for analysis.
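A simple regex-based tokenizer illustrates the idea: split text into word tokens and punctuation tokens. Production systems typically use more sophisticated tokenizers (handling contractions, URLs, emoji, etc.); this is a minimal sketch:

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens with a simple regex."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Text analytics isn't hard."))
# ['Text', 'analytics', 'isn', "'", 't', 'hard', '.']
```

Note how even this toy tokenizer splits "isn't" into three tokens; deciding where token boundaries fall is a real design choice in any pipeline.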
Part-of-speech (PoS) tagging assigns a grammatical category to the identified tokens. Familiar grammatical categories include nouns, verbs, adjectives, and adverbs.
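As a toy illustration of tagging, the sketch below looks tokens up in a tiny hand-built lexicon. Real taggers are statistical or neural models (such as those shipped with NLTK or spaCy); the lexicon and the NOUN fallback here are illustrative assumptions:

```python
# Tiny illustrative lexicon; real taggers learn these mappings from data.
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "ran": "VERB",
    "on": "ADP",
    "quick": "ADJ", "lazy": "ADJ",
}

def pos_tag(tokens):
    """Tag each token with a grammatical category, defaulting to NOUN."""
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

print(pos_tag(["The", "cat", "sat", "on", "the", "mat"]))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'),
#  ('on', 'ADP'), ('the', 'DET'), ('mat', 'NOUN')]
```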
Parsing creates syntactic structures from the text based on the tokens and PoS tags. Parsing algorithms consider the text's grammar for syntactic structuring. Sentences with the same meaning but different grammatical structures will result in different syntactic structures.
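Full syntactic parsing is beyond a short example, but a simplified stand-in, noun-phrase chunking, shows how PoS tags drive structure. This sketch greedily groups tagged tokens matching the pattern "optional DET, any ADJs, then NOUN" into NP chunks; the pattern and tag set are illustrative assumptions:

```python
def chunk_noun_phrases(tagged):
    """Greedily group (word, tag) pairs into NP chunks (DET? ADJ* NOUN);
    anything that doesn't fit becomes a single-token chunk."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if j < len(tagged) and tagged[j][1] == "DET":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        if j < len(tagged) and tagged[j][1] == "NOUN":
            chunks.append(("NP", [w for w, _ in tagged[i:j + 1]]))
            i = j + 1
        else:
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks

tagged = [("the", "DET"), ("quick", "ADJ"), ("cat", "NOUN"), ("sat", "VERB")]
print(chunk_noun_phrases(tagged))
# [('NP', ['the', 'quick', 'cat']), ('VERB', ['sat'])]
```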
Data analysis is the process of analyzing the prepared text data. Machine learning models can be used to analyze huge volumes of data, and the results are typically delivered through an API as JSON or exported as a CSV/Excel file. There are many ways the data can be analyzed; two popular approaches are text extraction and text tagging.
Simply stated, text extraction is the process of identifying structured information from unstructured text. Text tagging is the process of assigning tags to text data based on its content and relevance.
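Both approaches can be sketched in a few lines. Below, extraction pulls a structured item (an email address) out of free text with a regex, and tagging assigns every tag whose keywords appear in the text. The regex, sample review, and tag vocabulary are illustrative assumptions:

```python
import re

def extract_emails(text: str) -> list[str]:
    """Text extraction: pull structured items (here, email addresses)
    out of unstructured text."""
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

def tag_text(text: str, tag_keywords: dict) -> list[str]:
    """Text tagging: assign every tag whose keywords occur in the text."""
    lowered = text.lower()
    return [tag for tag, words in tag_keywords.items()
            if any(w in lowered for w in words)]

review = "Contact support@example.com; the battery life is terrible."
print(extract_emails(review))  # ['support@example.com']
print(tag_text(review, {"battery": ["battery"],
                        "complaint": ["terrible", "broken"]}))
```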
Visualization is the process of transforming the analysis into actionable insights by representing the data in graphs, tables, and other easy-to-understand forms.
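Even a plain-text summary counts as a first step toward visualization. This sketch counts hypothetical tags produced by the analysis step and renders them as a simple text bar chart; a real dashboard would use a charting library instead:

```python
from collections import Counter

def ascii_bar_chart(labels_counts):
    """Render (label, count) pairs as a text bar chart, one '#' per count."""
    lines = []
    for label, count in labels_counts:
        lines.append(f"{label:<10} {'#' * count} ({count})")
    return "\n".join(lines)

# Hypothetical tags emitted by an upstream text-tagging step.
tags = ["complaint", "praise", "complaint", "question", "complaint"]
counts = Counter(tags).most_common()
print(ascii_bar_chart(counts))
```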
In the subsequent articles, we shall go through some of the most important and frequently used steps in the pipeline and see exactly how data flows through a typical Text Analytics application.