Welcome back to another installment in our series on the fundamentals of Text Analytics. In the last article we saw what data wrangling is in the textual context. It was a comprehensive guide to preparing data so that it is ready to be fed to the algorithms. In most cases, as we've seen before, this is the most time-consuming step, and it requires a good understanding of the data.
For a data scientist, that is a good thing: you get a holistic view of the data before putting it to work. But what if you had to paint that understanding in someone else's mind? That is yet another skill essential to a data scientist. As they say, "A picture is worth a thousand words". In this section, we are going to understand what text visualization is, what the different ways of doing it are, and how exactly we do it, among other fundamentals. So use this post to satisfy your curiosity. Let's cut to the chase then...
Look at the graphic above. It looks so neat and is so full of information. And this is just one of the random images you find when browsing the net. It has so much of a story to tell, so much experience to narrate.
Text visualization, a form of information graphics, is a powerful tool that informs and educates readers. A typical Natural Language Processing project needs a lot of resources, digests a lot of data, involves a lot of communication and computing, and takes many days, if not months, to get through. But when it comes to explaining your work to people, be it your professors, your boss, your partner, or any layperson for that matter, data in visual form is what people tend to trust. One could tell long stories about the work they've put in, but without something substantial to show, how would they prove their claims? And it is not just about credibility: visualization has become an integral part of model building.
Visualization can be helpful when exploring and getting to know a dataset, and can help with identifying patterns, corrupt data, outliers, and much more. With a little domain knowledge, data visualizations can be used to harness the real value of the data. It is most commonly used during Exploratory Data Analysis (EDA), but I feel it is much more than that. It gives you the power to exploit data at any stage of the application: debugging the model, cross-validating claims, and presentations, among many others.
Let's have a look at some of the basic plots commonly used by Data Analysts and Data Scientists, irrespective of whether the data is numerical or textual. Since this is a language-agnostic tutorial, we shall not tie the discussion to any particular programming language.
- Line Chart
- Bar Chart
- Histogram Plot
- Box and Whisker Plot
- Scatter Plot
With knowledge of these plots, you can quickly get a qualitative understanding of most data that you come across.
Use: A line plot is generally used to present observations collected at regular intervals.
Scale: The x-axis represents the regular interval, such as time. The y-axis shows the observations, ordered by the x-axis and connected by a line.
Example: Line plots are useful for presenting time series data as well as any sequence data where there is an ordering between observations.
Use: A bar chart is generally used to present relative quantities for multiple categories.
Scale: The x-axis represents the categories and is spaced evenly. The y-axis represents the quantity for each category and is drawn as a bar from the baseline to the appropriate level on the y-axis.
Example: Bar charts can be useful for comparing multiple point quantities or estimations.
Use: A histogram plot is generally used to summarize the distribution of a data sample.
Scale: The x-axis represents discrete bins or intervals for the observations. For example, observations with values between 1 and 10 may be split into five bins: the values [1,2] would be allocated to the first bin, [3,4] to the second bin, and so on. The y-axis represents the frequency, that is, the number of observations in the dataset that belong to each bin.
Example: Histograms are valuable for summarizing the distribution of data samples.
Note: Often, a careful choice of the number of bins can help to better expose the shape of the data distribution. The number of bins is usually a parameter of the plotting function; in Matplotlib's hist function, for example, it is the “bins” argument.
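Although the discussion here is language-agnostic, a small sketch may make the binning step concrete. The snippet below uses Python purely for illustration; the function name `histogram_counts` and the sample values are made up for this example. It assumes equal-width bins and counts the upper edge of the range in the last bin.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if v < low or v > high:
            continue  # ignore values outside the plotted range
        # The top edge belongs to the last bin, so clamp the index.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical observations between 1 and 10, split into five bins.
observations = [1, 2, 2, 3, 5, 5, 6, 7, 9, 10]
five_bins = histogram_counts(observations, low=1, high=10, bins=5)
three_bins = histogram_counts(observations, low=1, high=10, bins=3)
print(five_bins)   # [3, 1, 3, 1, 2]
print(three_bins)  # [4, 3, 3]
```

The same data looks quite different with five bins and with three, which is why the number of bins is worth experimenting with.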
Use: A box and whisker plot, or boxplot for short, is generally used to summarize the distribution of a data sample.
Scale: The x-axis is used to represent the data sample, where multiple boxplots can be drawn side by side on the x-axis if desired. The y-axis represents the observation values. Lines called whiskers extend from both ends of the box, typically to the most extreme observations that lie within 1.5 x IQR (the interquartile range, that is, the height of the box) of the box edges, to show the expected range of sensible values in the distribution. Observations outside the whiskers might be outliers and are drawn with small circles.
Example: Boxplots are useful to summarize the distribution of a data sample as an alternative to the histogram. They can help to quickly get an idea of the range of common values (the box) and of sensible values (the whiskers).
Note: A box is drawn to summarize the middle 50% of the dataset starting at the observation at the 25th percentile and ending at the 75th percentile. This is called the interquartile range or IQR.
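The numbers behind a boxplot can be computed by hand. The following is a minimal Python sketch, not a reference implementation: the function name `boxplot_stats` and the sample data are hypothetical, and it assumes the common 1.5 x IQR convention for the whiskers. Quartiles can be computed in several slightly different ways; this sketch uses the "inclusive" method of the standard library's `statistics.quantiles` (Python 3.8 or later), so other tools may report slightly different values.

```python
import statistics


def boxplot_stats(values):
    """Return the quantities a box and whisker plot draws for one sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical sample, e.g. word counts of nine short reviews.
stats = boxplot_stats([2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

For this sample the box spans 4 to 8, the whiskers reach 2 and 9, and the value 30 is flagged as a possible outlier.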
Use: A scatter plot (or ‘scatterplot’) is generally used to summarize the relationship between two paired data samples.
Scale: The x-axis represents observation values for the first sample, and the y-axis represents the observed values for the second sample. Each point on the plot represents a single observation.
Example: Scatter plots are useful for showing the association or correlation between two variables. The relationship can be quantified, for instance with a line of best fit, which can be drawn as a line plot on the same chart to make the relationship clearer.
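As a rough illustration of how such a relationship might be quantified, the Python sketch below fits an ordinary least-squares line and computes the Pearson correlation coefficient for two paired samples. The function name `best_fit_and_correlation` and the numbers are invented for this example, and it assumes neither sample is constant (otherwise the divisions below would fail).

```python
import math


def best_fit_and_correlation(xs, ys):
    """Return (slope, intercept, r) for paired samples xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-deviations from the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                     # least-squares slope
    intercept = mean_y - slope * mean_x   # line passes through the means
    r = sxy / math.sqrt(sxx * syy)        # Pearson correlation coefficient
    return slope, intercept, r


# Made-up paired observations.
slope, intercept, r = best_fit_and_correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"y = {slope:.2f} * x + {intercept:.2f}, r = {r:.2f}")
```

The points (x, slope * x + intercept) are what would be drawn as the line on top of the scatter plot.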
All of these can be used at different stages of an Exploratory Data Analysis. Some examples are a word-frequency plot, a bar chart of the top 10 n-grams or parts of speech, a box plot of sentiment polarity for a particular class, and so on. They can be applied only if we have numerical data or are at least able to represent the textual data in some numerical form.
These are the general plots that are used very often, irrespective of whether the data is numerical or textual.
There are a few others used extensively for textual data. For example, a Word Cloud. Let's see it with an example.
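To show what representing text in a numerical format can look like, here is a small Python sketch that counts the most frequent n-grams and prints them as a crude text bar chart. The names `top_ngrams` and `text_bar_chart` and the sample sentence are hypothetical, and splitting on whitespace is a deliberate simplification; real text would normally go through the cleaning steps from the previous article first.

```python
from collections import Counter


def top_ngrams(text, n=2, k=10):
    """Return the k most frequent n-grams as (ngram, count) pairs."""
    tokens = text.lower().split()  # naive whitespace tokenization
    # Zip the token list with shifted copies of itself to form n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in ngrams).most_common(k)


def text_bar_chart(pairs):
    """Print one bar of '#' characters per (label, count) pair."""
    for label, count in pairs:
        print(f"{label:<12}{'#' * count} {count}")


sample = "the cat sat on the mat and the cat slept"
text_bar_chart(top_ngrams(sample, n=1, k=3))
print(top_ngrams(sample, n=2, k=1))  # [('the cat', 2)]
```

Once the text is reduced to (label, count) pairs like these, any of the plots above can be drawn from them with the plotting tool of your choice.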
This is arguably the most common visualization method used when dealing with data that is textual in nature. It conveys a lot of information to the reader. Think of it: how do you make out what kind of text you are reading when you have to be quick? You glance through it, look for the most common words in the passage, and work out its context. This gives you a fair idea of what you are reading or what it might be about. A word cloud gives you exactly this kind of idea, in a cleaner way.
It makes words stand out by means of font size or color according to how frequently they are used. A word cloud can show the theme of a text at a glance, provided we accept the presumption that more important words appear more often.
Consider this example:
The most important words are “literature”, “project”, “media”, “texts” and “data”. With just a look at this chart, one can quickly decide whether to clean the data further or choose a subsequent strategy. One can even draw a conclusion about the quality of the data. That is how handy it can turn out to be!
There are many other similar visualization techniques that could be used, such as word maps and network charts, but the crux remains the same.
I hope this was helpful and that I was able to put things down in a simple way. Please feel free to reach out to me on Twitter @AashishLChaubey in case you need more clarity or have any suggestions.
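The idea of sizing words by frequency can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `word_cloud_sizes`, the font-size range, and the linear scaling are choices made for this example, and actual word cloud tools may scale and lay out words differently.

```python
from collections import Counter


def word_cloud_sizes(text, min_size=10, max_size=48, stopwords=frozenset()):
    """Map each word to a font size that grows linearly with its frequency."""
    words = [w for w in text.lower().split() if w not in stopwords]
    counts = Counter(words)
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest or 1  # avoid dividing by zero when all counts match
    return {
        word: round(min_size + (count - lowest) / span * (max_size - min_size))
        for word, count in counts.items()
    }


sizes = word_cloud_sizes("data data data media media texts")
print(sizes)  # {'data': 48, 'media': 29, 'texts': 10}
```

Passing a set of common words as `stopwords` keeps words such as "the" or "and" from dominating the picture, which is usually what you want when the goal is to see the theme of the text.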
Thanks for being with me, until next time...