Hello people,
Another post, another plunge into the realm of AI and linguistics. If you’ve followed the series this far, you’ve built a good foundation in Text Analytics. In the last article we looked at some of the business use cases of this technology and, most importantly, took a brief, thousand-foot view of a typical Text Analytics pipeline.
That’s a good start. But what’s critical now is this article and the ones that follow, where we bridge the gap between theory and practical implementation, with its challenges and constraints. Beginning with this article, we shall look at some of the important stages of a typical pipeline, one at a time. Let's start!
Data Acquisition
These days, getting a dataset is not considered a major task in itself, is it?
It is true that in the good old days, when there was a major paucity of data sources across most domains, fetching data for a specific use case was considered a herculean task: one had to scrape tons of discrete sites for very little data. But today, with the abundance of data either consolidated in one place or available through an API (Tweepy - Twitter, ImgurPython - Imgur, praw - Reddit, Kaggle, Facebook, etc.), data acquisition can safely be struck off the impediment list before the project even starts. Most research papers also provide a link to the source they picked their data from, be it an archive, a governmental portal, a conference portal or database, etc. These are the major sources of data, among others.
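Just to make this concrete, here is a minimal sketch of pulling a few post titles from Reddit with praw. The client credentials and the subreddit are placeholders you would supply yourself; Tweepy, Kaggle downloads, and plain web scraping follow the same spirit.

```python
import praw

# Placeholder credentials -- register an app at reddit.com/prefs/apps to obtain real ones
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="text-analytics-demo",
)

# Grab the titles of the 10 hottest posts from a subreddit of interest
titles = [post.title for post in reddit.subreddit("MachineLearning").hot(limit=10)]
print(titles)
```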
In this stage, the data is consolidated using different scraping techniques and no data is discarded, not even on the basis of quality. Keeping everything becomes very useful for the preliminary analysis of the textual data and for fetching some early insights. One can then start to clean the data in the subsequent stages.
Data Preparation & Data Wrangling/Text Wrangling
"As a data scientist, one spends over 70% of his time cleaning and unifying messy data so that you can perform operations on them."
I'm sure at least a few of us are passionate about cooking. Well, what is the basic doctrine, as far as the experts are concerned, for cooking a special dish? The more attention you pay to preparing and cooking the ground spices (or, as they call it, the masala), the more palatable the dish gets. The same holds for most kinds of AI applications: the more time you spend maturing your data sets, the healthier the outcome they will yield. The relationship is as simple as that.
Although data preparation is a multi-faceted task, text wrangling is essentially the pre-processing work done to make raw text data fit for training and for efficient use in the subsequent analytics stages. In this stage, we convert and transform the information (textual data) at different levels of granularity, based on the requirements of the application. Text wrangling applies large-scale changes to the text by automating a number of low-level transformations, and the basic approach is to work with lines of text.
Strictly speaking, data preprocessing covers any kind of preprocessing of the textual data before you build an analytical model, whereas data wrangling happens during Exploratory Data Analysis (EDA) and model building, where the data sets are adjusted iteratively while analyzing the data and fitting the model. I'd put them in one bracket because they are semantically related, what we expect at the end of this stage is very much the same, and the choice of one affects the result of the other, so in most cases they are tightly bound. Together they deal with playing with the data, getting insights, and bringing it into a format considered suitable for feeding into the model.
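To give a flavour of what that iterative adjusting looks like in practice, here is a tiny, hypothetical sketch using pandas: load the raw texts, look at a basic statistic, and drop documents that are obviously too short to be useful. The example data and the length threshold are made up for illustration.

```python
import pandas as pd

# A handful of raw documents, as they might arrive from the acquisition stage
raw_texts = ["Great product, loved it!", "", "Visit http://example.com for more", "ok"]
df = pd.DataFrame({"text": raw_texts})

# Typical exploratory checks while wrangling: lengths, empties, duplicates
df["n_chars"] = df["text"].str.len()
print(df["n_chars"].describe())

# Iteratively adjust the data set, e.g. drop empty or near-empty documents
df = df[df["n_chars"] > 2].reset_index(drop=True)
print(df)
```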
Some of the essential, common steps in text wrangling are:
- Text Cleansing
This can be any simple preprocessing of the text. It can include dropping textual data that does not fit the requirement, shortening the length, removing emojis (depending on the application), and so on. The gist is this: we get rid of all the basic unwanted things in the raw data that we don't need.
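As a rough illustration (the exact rules always depend on the application), a few lines of standard-library Python can already do a lot of the cleansing:

```python
import re

text = "Check this out 😍 https://example.com !!!   So COOL"

text = re.sub(r"https?://\S+", "", text)          # drop URLs
text = text.encode("ascii", "ignore").decode()    # drop emojis and other non-ASCII symbols
text = re.sub(r"\s+", " ", text).strip().lower()  # collapse whitespace and lowercase

print(text)  # "check this out !!! so cool"
```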
- Sentence Splitting
Splitting the text into individual sentences, a task as simple as that. The splitting can be based on various criteria, which the APIs of the libraries available for this task let you configure, and there are many state-of-the-art libraries out there to achieve it.
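For example, NLTK (one popular choice; spaCy and others offer the equivalent) ships a pre-trained sentence splitter that copes with abbreviations like "Dr.":

```python
import nltk
nltk.download("punkt", quiet=True)  # one-time download of the sentence splitter models
from nltk.tokenize import sent_tokenize

text = "Dr. Smith wrote the report. It was published in 2020! Was it well received?"
print(sent_tokenize(text))
# ['Dr. Smith wrote the report.', 'It was published in 2020!', 'Was it well received?']
```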
- Tokenization
A token is the smallest text unit a machine can process, and to run natural language programs on the data it first needs to be tokenized. In most cases it therefore makes sense for the smallest unit to be a word; depending on the application, words can be tokenized further into letters.
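Continuing with NLTK as the example library, word-level and character-level tokenization might look like this:

```python
import nltk
nltk.download("punkt", quiet=True)  # needed once for the word tokenizer models
from nltk.tokenize import word_tokenize

sentence = "Text wrangling isn't glamorous, but it pays off."
print(word_tokenize(sentence))
# ['Text', 'wrangling', 'is', "n't", 'glamorous', ',', 'but', 'it', 'pays', 'off', '.']

# Character-level tokens, if the application calls for it
print(list("pays"))  # ['p', 'a', 'y', 's']
```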
- Stemming or Lemmatization
We know many English words in many different forms, and many times they carry the same semantic sense. Stemming is exactly what it sounds like: trimming an inflected word back to its stem, or root form. Let us understand this with an example: consider the word "writing". It is built on its root word "write", which can be expressed in many different forms based on tense and context, like "wrote", "writes", etc. But we, with our general understanding of the language, know that all these forms convey the same thing. It is therefore a good idea to reduce the word to its basic form.
Lemmatization is similar to stemming but a little more stringent. The difference is that the word obtained from stemming may not be an actual word found in the dictionary; stemming can generate arbitrary strings. In the case of lemmatization, the word produced necessarily has to be a real word, one from the dictionary. This makes lemmatization a little slower than stemming, as there is the added responsibility of validating the word against the dictionary.
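Here is a small sketch contrasting the two with NLTK's Porter stemmer and WordNet lemmatizer. Notice how the stemmer happily outputs "studi", which is not a dictionary word, while the lemmatizer maps even the irregular "wrote" back to "write":

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the lemmatizer's dictionary
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["writing", "wrote", "writes", "studies"]:
    # pos='v' tells the lemmatizer to treat the word as a verb
    print(f"{word:8s} stem: {stemmer.stem(word):8s} lemma: {lemmatizer.lemmatize(word, pos='v')}")
```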
- Stop Word Removal
Let's first understand what stop words are. They are nothing but a set of commonly used words in a language. One major part of preprocessing is filtering out useless data, and in Natural Language Processing this useless data is known as "stop words" (also called filler words): words which carry less importance than the other words in a sentence. The popular NLP libraries include a set of identified stop words for the English language, and for other languages too. Some of them are "a", "the", "have", "he", "i", and many more. It makes intuitive sense to drop these words and focus on the other, more important words in the sentence in order to do something useful.
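With NLTK again, filtering them out is essentially a one-liner once the stop-word list is downloaded:

```python
import nltk
nltk.download("stopwords", quiet=True)  # one-time download of the stop-word lists
nltk.download("punkt", quiet=True)      # needed by word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))

tokens = word_tokenize("I have read the book and he liked it a lot")
print([t for t in tokens if t.lower() not in stop_words])
# ['read', 'book', 'liked', 'lot']
```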
Hope you all liked it. In the next article, we are going to look at the other stages in the pipeline, so it will be for the best if you also follow the next part of the series to get a thorough understanding of the topic. Please let me know if you want me to work out a few more details of the topic; I will make sure to include them. Until next time.