Anna Varzina for Lighthouse

Posted on Aug 10, 2023

Grammar of Graphics: how it helps us to create clear visualizations and tell stories with data

#datascience #dataviz #tableau

Introduction

Before we begin, a question for you, the reader of this blog post: have you ever heard of the grammar of graphics (GoG)? We assume that not many people are familiar with it, and that those who are probably learned about it through the tidyverse packages in R. That is precisely how the author of this blog post discovered it as well, and it was a remarkable revelation how effortlessly one can create complex visualizations with layered graphics. In essence, the grammar of graphics provides a level of flexibility in building visualizations that many other tools lack. The objective of this article is to introduce the grammar of graphics and present a concise recipe for producing clean visualizations tailored to our company projects.

The history of GoG officially begins with the book written by Leland Wilkinson (first published in 1999) [1], which focussed “on rules for constructing graphs mathematically and then representing them as graphics aesthetically”. While prior research had explored clean graphics, Wilkinson's approach established a robust foundation for numerous visualization tools and frameworks that are now widely used in Data Science projects and beyond. Shortly after the publication of Wilkinson's book, the Polaris system was introduced, further expanding his approach. This system later evolved into Tableau’s VizQL technology [2].

In 2005, the concept of layered graphics within the GoG framework emerged with the introduction of the ggplot2 package in R (where gg stands for the grammar of graphics) [3], which quickly became the most popular graphics package in the R ecosystem. Within our team, we use plotnine, a Python implementation of the ggplot2 grammar, along with other visualization modules that also build on GoG ideas. These include Bokeh, Seaborn and Plotly, all of which work directly with pandas data frames. Furthermore, the GoG concept has been implemented for visualizations in various languages and frameworks, such as Flutter, JavaScript, MATLAB, and Julia.

Layered grammar

The core idea of the GoG is to represent a plot as a set of layers, which Wilkinson originally called elements. Think of Photoshop layers, where each layer contains the instructions for one specific plot element. All these specifications can be organized within a single function. By contrast, the basic Matplotlib interface in Python takes an imperative approach: you create a figure and axes, scale the plot, and adjust the labels and title step by step. This default approach can be less convenient and less intuitive than the GoG, which offers greater flexibility for updating and modifying complex graphs.

The GoG framework provides a systematic and clear way of thinking about data visualization by breaking down a graph into its fundamental components (layers). By modifying these components separately, it becomes much easier and less error-prone to improve the visualization. Combining graph specifications within a function facilitates reusability and ensures consistent style when creating multiple visualizations, such as for a dashboard or presentation.
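To make the idea of composable layers concrete, here is a minimal sketch in plain Python. It is not how ggplot2, plotnine or Tableau are implemented; the `Layer` and `Plot` classes and the `house_style` function are hypothetical names invented for illustration, and nothing is actually drawn. The point is only that a plot specification can be treated as data plus an ordered list of independent layers, and that a function can bundle layers for reuse.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Layer:
    """One plot component, e.g. a geometry or a scale (illustrative only)."""
    kind: str  # "aes", "geom", "scale", "facet", "coord" or "theme"
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Plot:
    """A plot specification: data plus an ordered tuple of layers."""
    data: list
    layers: tuple = ()

    def __add__(self, layer: Layer) -> "Plot":
        # Adding a layer returns a new Plot, so the base spec stays reusable.
        return replace(self, layers=self.layers + (layer,))


def house_style(plot: Plot) -> Plot:
    """Hypothetical reusable style: the same layers applied to any plot."""
    return plot + Layer("theme", {"grid": "horizontal"}) + Layer("scale", {"y": "percent"})


rows = [{"x": 1, "y": 0.2}, {"x": 2, "y": 0.5}]  # made-up data
base = Plot(rows) + Layer("aes", {"x": "x", "y": "y"})
line_chart = house_style(base + Layer("geom", {"type": "line"}))
point_chart = house_style(base + Layer("geom", {"type": "point"}))

print([layer.kind for layer in line_chart.layers])
```

Both charts share the same data, aesthetics and style, and differ only in the geometry layer, which is the kind of targeted change the layered approach is meant to make easy.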

The basic GoG layers include data, aesthetics, geometry, scale, statistics, facets (subplots), coordinates and style. When presenting a visualization digitally, we can also add an interactivity layer to this list. These layers also form a hierarchy, which we examine in more detail below.

Data. Data is the essential part of any visualization. It can come from many sources, from simple CSV files to query connections. While we don’t aim to discuss data preparation here, it is crucial to emphasize that clean and preferably long-format data is very important for effective visualization.
Aesthetics. The aesthetics layer refers to visual features such as position on the x and y axes, colors, shapes, sizes, labels, and more. Usually we map variables to aesthetics to display information, provide additional context, or highlight specific regions of the plot.
Geometry. A geometry layer corresponds to a visual representation of data such as points, lines, bars, shaded regions, boxplots, histograms, tiles, text, and many more. To create a plot, at least one geometry is required. However, two or more geometries can be combined together, such as lines and points, to make more comprehensive plots.
Scale. It is often necessary to rescale or transform data to adjust how it is represented. This is achieved through a scale/transformation layer, for example a log transformation when dealing with log-linear relationships. Additionally, zooming in on the areas of the graph where most data points accumulate can help to gain more insights.
Statistics. Another type of transformation is the statistical transformation, which corresponds to a statistical layer. It allows applying statistical operations, such as calculating confidence intervals, to the data before it is mapped to the aesthetics of the plot.
Facets. Faceting is a way to divide data into subplots based on another factor present in the dataset. Rather than mapping that factor to an aesthetic such as size, it is often more effective to split the data into subplots arranged along the x or y axis.
Coordinates. Usually, we plot data on the familiar Cartesian x-y axes. However, you can also use polar coordinates for specific visualizations, or Mercator projection to plot geographic data, or draw other shapes such as trees or maps. Therefore, we need an additional coordinate layer to make a plot. Actually, if for some eccentric reason you want to use a pie chart, you just need to draw a bar chart in polar coordinates.
Style (theme). Finally, you need to style your plot with color schemes, line thicknesses, tick marks, legend and font style. The style can be defined by personal preferences, a journal theme, or the organization you are creating the plot for. However, it is sometimes essential to incorporate some small style improvements driven by aesthetic sensitivity as well.
Interactivity. Some features, such as interactivity, don’t fit the layers described above. If you present your plots digitally, libraries such as Plotly or Bokeh and tools like Tableau offer a range of interactive capabilities such as zooming, hover information, cross-filtering, and linking to other plots or even webpages.
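The data layer above recommends long-format data. As a rough illustration of what that means, the sketch below reshapes a wide table (one column per measure) into long format (one row per measurement) using only the Python standard library. The field names and numbers are made up, and `to_long` is a hypothetical helper, not a function from any of the libraries mentioned.

```python
# Wide format: one column per measure (made-up numbers).
wide_rows = [
    {"leadtime": 2, "otb": 0.80, "flight": 0.95, "hotel": 0.90},
    {"leadtime": 1, "otb": 0.88, "flight": 0.98, "hotel": 0.96},
]


def to_long(rows, id_field, value_fields):
    """Melt wide rows into one (id, category, value) row per measurement."""
    return [
        {id_field: row[id_field], "category": name, "value": row[name]}
        for row in rows
        for name in value_fields
    ]


long_rows = to_long(wide_rows, "leadtime", ["otb", "flight", "hotel"])
print(len(long_rows))  # 2 wide rows x 3 measures = 6
print(long_rows[0])
```

In long format a single `category` column can be mapped to an aesthetic such as color, or used as a facet, without restructuring the data again.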

The order of adding layers may vary depending on the dataset and type of analysis being performed. Nevertheless, this list of layers serves as an excellent reference to create visualizations and ensures that all essential graph components are in place.
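Two of the layers listed above, scale and statistics, are ordinary data transformations applied before anything is drawn. The following sketch shows one possible instance of each with made-up numbers: a log10 transform for the scale layer, and a mean with an approximate 95% interval for the statistics layer. The interval uses the normal approximation (1.96 standard errors), which is an assumption chosen for brevity rather than a recommendation.

```python
import math
import statistics

values = [10.0, 100.0, 1000.0, 10000.0]  # made-up measurements

# Scale layer: a log10 transform turns a multiplicative trend into a linear one.
log_values = [math.log10(v) for v in values]

# Statistics layer: summarise before drawing, here a mean and an approximate
# 95% interval based on the normal approximation (1.96 standard errors).
sample = [0.52, 0.48, 0.50, 0.55, 0.45]  # made-up observations
mean = statistics.mean(sample)
half_width = 1.96 * statistics.stdev(sample) / math.sqrt(len(sample))
interval = (mean - half_width, mean + half_width)

print(log_values)
print(round(mean, 3), tuple(round(bound, 3) for bound in interval))
```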

Grammar of graphics step by step with example

In this section, we will guide you through the process of creating a plot in Tableau while examining the GoG layers we modify along the way. As an example, we will replicate a graph displayed on page 9 of the recently published whitepaper by OTA Insight. This whitepaper demonstrates the effectiveness of forward-looking search data in predicting and identifying early signs of market demand.

Step 0: data layer

Connecting to data (1) is an important starting point in our project. Fortunately, we have already preprocessed the dataset in Python, and it is now stored in CSV format. It is worth noting that researching and preparing the data was an iterative process, requiring multiple attempts before reaching the optimal form. Once the data is properly formatted, it becomes much easier to work with it, understand it, explore it and add GoG layers, which will result in a beautiful visualization.

The dataset we connected to contains the following fields:

Column	Type	Description
Destination	String	City or region
Stay date	Date	Arrival date for which search or room reservation was done
Leadtime	Int	The difference between the arrival date and the reservation or search date
Category	String	This field can take 3 values: On-the-books (OTB), flight or hotel searches
Value	Float	Value of the categories above
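For readers who prefer code to tables, the fields above could be read into typed records roughly as follows. This is a sketch under assumptions: the comma delimiter, the ISO date format, the category labels and the sample rows are all made up, since the article does not show the actual file.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    destination: str
    stay_date: date
    leadtime: int
    category: str  # "OTB", "flight" or "hotel" (labels assumed)
    value: float


# Made-up sample standing in for the real CSV file.
SAMPLE_CSV = """Destination,Stay date,Leadtime,Category,Value
Amsterdam,2023-06-01,30,OTB,0.41
Amsterdam,2023-06-01,30,flight,0.77
"""


def load_records(text: str) -> list[Record]:
    """Parse CSV text into typed records, one row per measurement."""
    return [
        Record(
            destination=row["Destination"],
            stay_date=date.fromisoformat(row["Stay date"]),
            leadtime=int(row["Leadtime"]),
            category=row["Category"],
            value=float(row["Value"]),
        )
        for row in csv.DictReader(io.StringIO(text))
    ]


records = load_records(SAMPLE_CSV)
print(records[0])
```

Note that the table is already in long format: the `Category` column says what each `Value` measures, so one row holds exactly one measurement.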

Step 1: aesthetics and geometry layers

Once we have gained access to the dataset, we should select aesthetics for the x and y axes. In this example, our goal is to investigate the relationship between flight or hotel searches for a particular city or region (which we refer to as a destination from here on) and its on-the-books (OTB) value. OTB refers to the percentage of rooms that are already booked and confirmed for future dates [4]. Our focus lies in examining how these values change over the lead time, which is the difference between the arrival date and the reservation or search date.
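The lead time definition translates directly into date arithmetic. A minimal sketch, assuming lead time is measured in whole days; `lead_time_days` is a hypothetical helper and the dates are made up:

```python
from datetime import date


def lead_time_days(stay_date: date, event_date: date) -> int:
    """Days between the search or reservation date and the arrival date."""
    return (stay_date - event_date).days


# Made-up dates: a search on 1 March for an arrival on 1 June.
print(lead_time_days(date(2023, 6, 1), date(2023, 3, 1)))  # 92
```

A search made on the arrival date itself has a lead time of 0.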

To achieve this, we assign lead time to the x axis (Columns in Tableau) and use the search and OTB pickup values for the y axis (Rows in Tableau). Note that Tableau automatically aggregates measures when they are added to the graph, so we have to adjust the aggregations to get the desired representation. We also set lead time to be a continuous axis to obtain a line graph, which suits time series data well. At this stage, we have added two fundamental layers: aesthetics (2) and geometry (3).
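Why the choice of aggregation matters can be seen in a small sketch. Each mark on the line chart corresponds to one (lead time, category) pair, and several rows may fall into that pair. The rows below are made up, and `aggregate` is a hypothetical helper that mimics the idea of an aggregation, not Tableau's internals.

```python
from collections import defaultdict
from statistics import mean

# Made-up long-format rows: (leadtime, category, value).
rows = [
    (30, "OTB", 0.40), (30, "OTB", 0.50),
    (30, "hotel", 0.70), (30, "hotel", 0.90),
    (10, "OTB", 0.80),
]


def aggregate(rows, func):
    """Collapse rows to one value per (leadtime, category) mark."""
    groups = defaultdict(list)
    for leadtime, category, value in rows:
        groups[(leadtime, category)].append(value)
    return {key: func(values) for key, values in groups.items()}


summed = aggregate(rows, sum)     # a SUM-style aggregation
averaged = aggregate(rows, mean)  # an average keeps fractions within 0..1

print(summed[(30, "hotel")], averaged[(30, "hotel")])
```

Summing the two hotel rows gives a value above 1, which makes no sense for a percentage, while the average stays interpretable.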

Step 2: aesthetics and coordinates adjustments

We make our graph clearer by introducing color aesthetics to distinguish occupancy from flight/hotel searches. This allows for easy comparison and analysis per category.

To make it easier to understand, we reverse the lead time axis. The left side of the graph now shows the points furthest in the past relative to the arrival date (lead time = 0), while the right side shows more recent data. This arrangement aligns with our natural way of perceiving time as progressing from left to right, and results in a more intuitive representation of the information.
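Reversing an axis is a small change to the coordinate mapping: the data stays the same and only the position assigned to each value flips. A sketch of that mapping, where `x_position` is a hypothetical helper and the maximum lead time of 270 is used purely for illustration:

```python
def x_position(leadtime: int, max_leadtime: int) -> float:
    """Map lead time to a 0..1 horizontal position on a reversed axis.

    The largest lead time lands at the left edge (0.0) and the arrival
    date, lead time 0, at the right edge (1.0).
    """
    return (max_leadtime - leadtime) / max_leadtime


print([x_position(lt, 270) for lt in (270, 135, 0)])  # [0.0, 0.5, 1.0]
```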

Step 3: add filters

Filtering the destination and stay date actually means adding a data transformation layer (4) to our graph. It is important to note that we don’t modify the underlying dataset, but select different values interactively in Tableau, which can be further adjusted during data exploration. In ggplot2 or other static plotting modules, you would typically need to add this step while assigning the data layer. However, Tableau offers a separate functionality dedicated to filtering. This enables a more seamless and user-friendly interaction with data.
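In code, the same principle of filtering without touching the underlying dataset can be sketched as a function that returns a new view. The rows are made up and `apply_filters` is a hypothetical helper:

```python
from datetime import date

# Made-up rows: (destination, stay_date, leadtime, category, value).
dataset = (
    ("Amsterdam", date(2023, 6, 1), 30, "OTB", 0.41),
    ("Amsterdam", date(2023, 6, 2), 30, "OTB", 0.38),
    ("Lisbon", date(2023, 6, 1), 30, "OTB", 0.52),
)


def apply_filters(rows, destination, stay_date):
    """Return the matching rows as a new list; the source is left untouched."""
    return [r for r in rows if r[0] == destination and r[1] == stay_date]


view = apply_filters(dataset, "Amsterdam", date(2023, 6, 1))
print(len(view), len(dataset))  # 1 3
```

Changing the filter values produces a different view from the same data, which is what makes interactive exploration cheap.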

Step 4: scale and style layers

As we progress, we fine-tune the axes of the graph. Specifically, we limit the x axis to lead times from 0 to 270 days and change the y axis format from decimal to percentage by simply adjusting the data format in Tableau. Additionally, we refine the style by adjusting tick marks and removing vertical grid lines. Moreover, we increase the line thickness for better visibility.
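Both scale adjustments are simple to express in code. The sketch below applies axis limits and percentage formatting to a few made-up points; `within_limits` and `as_percent` are hypothetical helpers, and the assumption is that values are stored as fractions between 0 and 1.

```python
def within_limits(leadtime: int, low: int = 0, high: int = 270) -> bool:
    """Keep only points inside the x axis limits."""
    return low <= leadtime <= high


def as_percent(value: float) -> str:
    """Format a 0..1 fraction as a whole percentage for axis labels."""
    return f"{value:.0%}"


points = [(300, 0.02), (270, 0.04), (0, 0.94)]  # (leadtime, value), made up
visible = [(lt, as_percent(v)) for lt, v in points if within_limits(lt)]
print(visible)  # [(270, '4%'), (0, '94%')]
```

Neither step changes the stored values; only what is shown and how it is labelled.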

Step 5: further adjustments

In the final stage of polishing our graph, we make style adjustments to the axis names, title, and legends. Additionally, we include the filtered destination and stay date in the subtitle. Adding the final occupancy on the selected stay date to the subtitle requires an additional calculation, “market final occupancy”, which can be built with Level of Detail (LOD) expressions in Tableau. Showing this specific final occupancy value, rather than a simple average over all lead time points, gives the reader more insight.
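The difference between the final occupancy and a simple average can be sketched in Python. This is not the Tableau calculation itself, which the article does not show. It assumes that final occupancy is the OTB value at the smallest lead time for each destination and stay date; the rows are made up and `final_occupancy` is a hypothetical helper.

```python
from statistics import mean

# Made-up OTB rows: (destination, stay_date, leadtime, value).
otb_rows = [
    ("Amsterdam", "2023-06-01", 120, 0.10),
    ("Amsterdam", "2023-06-01", 60, 0.50),
    ("Amsterdam", "2023-06-01", 0, 0.94),
]


def final_occupancy(rows):
    """OTB value at the smallest lead time per (destination, stay date)."""
    latest = {}
    for destination, stay_date, leadtime, value in rows:
        key = (destination, stay_date)
        if key not in latest or leadtime < latest[key][0]:
            latest[key] = (leadtime, value)
    return {key: value for key, (_, value) in latest.items()}


final = final_occupancy(otb_rows)[("Amsterdam", "2023-06-01")]
average = mean(value for *_, value in otb_rows)
print(f"final {final:.0%}, average {average:.0%}")
```

The calculation is fixed to the destination and stay date rather than to the lead time shown on the axis, which is the kind of grouping an LOD expression lets you control.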

As we can see, we didn't use all GoG layers in this work because not all of them were needed. For example, we could add a facet layer to compare values for various destinations, or we could add confidence intervals to estimate value distributions for all arrival dates per destination. However, our primary goal was to investigate the relationship between the categories, and the visualization we created fits this purpose well.

Step 6: storytelling

Now, with our visualization ready, we can tell a story about the data we have. The primary purpose of creating graphs is to achieve effective communication with others. Understanding the GoG concept helps the presenter emphasize the most important details and incorporate them seamlessly into the graph. However, it is important to remember that the audience may read your graph a bit differently than you would expect, as highlighted in the book “Storytelling with Data” by C.N. Knaflic [5]. Nevertheless, we can still direct their attention with graph attributes such as color, size or element position, while eliminating non-informative elements and using appropriate visual displays.

Analyzing the graph we just created, we can clearly observe that market occupancy follows the same pattern as flight and hotel searches. About 170 days prior to the arrival date (lead time 0), searches started to surge, whereas actual market occupancy was still sitting at 4%. Occupancy started to increase at about 120 days before the arrival date and continued to rise until reaching its final value of 94%. We can also notice that flight searches rise faster than hotel searches. A likely explanation is that people tend to search for a flight earlier than for a hotel.

Prior exploration notes

The steps described above allowed us to produce a visually appealing graph for the whitepaper. However, during the exploration phase, we worked with more layers (especially transformation ones) than mentioned above. For example, we explicitly compared the same increase points (referred to as "pickup" points) for OTB, flight and hotel searches. This involved creating additional calculated fields to determine the lead times for these corresponding growths.
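The article does not say how the pickup points were calculated. One plausible way to do it, shown here purely as an assumption, is to take the largest lead time at which a series first rises above a chosen threshold and compare those lead times across categories. The curves, the threshold and the `pickup_lead_time` helper are all made up for illustration.

```python
def pickup_lead_time(series, threshold):
    """Largest lead time at which the value first exceeds `threshold`.

    `series` maps lead time (days before arrival) to a value. Lead times are
    scanned from far out towards the arrival date. Returns None if the
    threshold is never exceeded.
    """
    for leadtime in sorted(series, reverse=True):
        if series[leadtime] > threshold:
            return leadtime
    return None


# Made-up curves, sampled every 50 days.
searches = {200: 0.05, 150: 0.20, 100: 0.60, 50: 0.90, 0: 1.00}
otb = {200: 0.04, 150: 0.04, 100: 0.15, 50: 0.60, 0: 0.94}

lag = pickup_lead_time(searches, 0.10) - pickup_lead_time(otb, 0.10)
print(lag)  # 50
```

In this toy example searches pick up 50 days earlier than bookings, which is the kind of lead that makes search data useful as an early demand signal.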

In one of the plots, we even incorporated a statistical layer featuring an exponential fit of the data. However, this particular layer was omitted from the final version. In that plot, we compared exponential models of the values and included supplementary information in the title for better analysis. During the research phase, we focused mainly on the analytical content rather than on style formatting, which we finalized only at the data publication stage.
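An exponential fit of this kind can be sketched with the standard library alone. The version below fits y = a * exp(b * x) by ordinary least squares on ln(y), which is a common shortcut and not necessarily the method used in the original exploration. The `fit_exponential` helper and the data points are made up for illustration.

```python
import math


def fit_exponential(xs, ys):
    """Least-squares fit of y = a * exp(b * x) via a linear fit on ln(y).

    Requires every y to be positive.
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (lg - mean_log) for x, lg in zip(xs, logs))
    b = sxy / sxx
    a = math.exp(mean_log - b * mean_x)
    return a, b


# Made-up points generated from y = 2 * exp(0.5 * x), so the fit recovers them.
xs = [0, 1, 2, 3]
ys = [2 * math.exp(0.5 * x) for x in xs]
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))  # 2.0 0.5
```

Fitting in log space weights relative rather than absolute errors, so on noisy data the result can differ from a direct nonlinear fit.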

In conclusion

The concept of the grammar of graphics introduced in this article works as a valuable guide to creating complex and meaningful visualizations. By breaking down the graph into its fundamental components, it facilitates easy modification and reusability of visualizations. Within our Data Science team, we leverage the grammar of graphics approach during data exploration and when creating reports or research summaries. The main advantage for us lies in the time it saves when creating visualizations and the convenience of reusability.

Nevertheless, crafting a visualization can be considered a form of art. It often requires a personalized approach based on factors such as context, audience and intent. As underscored by the author of the layered GoG, Hadley Wickham, "The grammar is powerful and useful, but not all encompassing". There will be cases where we might need to go beyond the scope of the defined layers to create the most fitting and insightful graphs.

References

[1] Wilkinson, Leland. The grammar of graphics. Springer Berlin Heidelberg, 2012.
[2] Hanrahan, Pat. VizQL: a language for query, analysis and visualization. Proceedings of the 2006 ACM SIGMOD international conference on Management of data, 2006.
[3] Wickham, Hadley. A layered grammar of graphics. Journal of Computational and Graphical Statistics, 2010.
[4] The ultimate guide to on-the-books data for hoteliers. OTA Insight blog, 2023.
[5] Knaflic, Cole Nussbaumer. Storytelling with data: A data visualization guide for business professionals. John Wiley & Sons, 2015.
