Lisa Jung

Posted on Sep 7, 2022

Part 5: Plan for efficient data storage and search performance in Elasticsearch

#elasticsearch #beginners #node #database

Table of Contents | Read Next: Part 6 - Set up Elasticsearch for data transformation and data ingestion

Now that our server is connected to Elasticsearch hosted on Elastic Cloud, it is time to think about the data that we want to ingest into Elasticsearch.

Review from part 1

For our project, we will be retrieving data from the USGS API and ingesting it into Elasticsearch.

Before ingesting data into Elasticsearch, assessing the data structure and planning the desired mapping are essential to ensuring efficient storage and search of data.

In this blog, we will assess the data to:

determine what data we need
discover if we need to transform the data to fit our use case
decide on the desired mapping for efficient storage and search of data

Resources

Would you rather watch a video to learn this content? Click on the link below!

Episode 5: Plan for efficient data storage and search performance in Elasticsearch

The following USGS Earthquake API home page links to the API that contains all earthquake data from the past 30 days (red arrow). We will be retrieving data from this API!

USGS Earthquake API Home Page

The following documentation contains explanations of the terms(field names) included in the API. Refer to this if you need clarifications on the acronyms or want details about certain fields.

Earthquake Catalog Documentation

The current blog builds upon concepts covered in the blog and video above. Refer to these links if some of the jargon doesn't make sense to you or if you need a little refresher.

We will be referring to the following Elasticsearch documentation on field data types and numeric field types while coming up with the desired mapping for our data.

You will find these resources helpful when assigning field types to your data.

Now that you have access to all resources, let's get to work!

Step 1: Review the final outcome

Before we examine the data structure, let's review the final outcome of the app we are building.

Our app allows the user to search for earthquakes using the following criteria.

type
magnitude
location
date range

When the user clicks on the search button, the search results are displayed as cards.

Each card displays the following information about one earthquake.

The information outlined above is what we need from the USGS API.

We will store this information in Elasticsearch in the form of documents. Each document will contain information about one earthquake.

Step 2: Examine the data structure of earthquake API

Go to the USGS Earthquake Home Page and scroll down to the Output section(red box).

The output shows the data structure of a typical earthquake object found in the API.

Scroll down on the output to view the field features(green box).

The features field contains an array of objects. Each object contains information about one earthquake. It lists the name and data type of fields contained in a typical earthquake object.

In step 1, we determined what information we need to store in Elasticsearch.

You will see that the object properties(orange box) and geometry(blue box) contain the information we seek(underlined in pink).

If you need clarification on the acronyms or want details about certain fields, refer to the Earthquake Catalog Documentation.

Take a look at the image below.

You will see that the earthquake object from the API contains more information than we need.

To save storage, we will only index the fields mag, place, time, url, sig(significance), type, and coordinates array which includes longitude, latitude, and depth in that order.
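As a rough sketch of this selection, assuming each earthquake arrives as a GeoJSON feature with a properties object and a geometry.coordinates array as described above: the helper name pickEarthquakeFields and the sample values are made up for illustration and are not taken from the USGS API.

```javascript
// Hypothetical helper: keep only the fields we plan to index.
// Assumes feature.properties holds mag, place, time, url, sig, and type,
// and feature.geometry.coordinates holds [longitude, latitude, depth].
function pickEarthquakeFields(feature) {
  const { mag, place, time, url, sig, type } = feature.properties;
  const { coordinates } = feature.geometry;
  return { mag, place, time, url, sig, type, coordinates };
}

// Made-up sample feature with extra properties (felt, tsunami) that get dropped.
const sampleFeature = {
  type: 'Feature',
  properties: {
    mag: 1.5,
    place: '10km N of Example Town',
    time: 1651522073266,
    url: 'https://example.com/quake',
    sig: 35,
    type: 'earthquake',
    felt: null,
    tsunami: 0,
  },
  geometry: { type: 'Point', coordinates: [-116.5, 33.5, 2.1] },
};

console.log(pickEarthquakeFields(sampleFeature));
```

Everything not listed in the destructuring is simply left out of the returned object, which is what keeps the stored documents small.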

The API fields that correspond to the info on the card are highlighted in the same colors.

When you compare the two, for the majority of these fields, the info from the API is identical to the info displayed in the search results.

However, there are a few that are not the same.

Step 3: Determine whether we need to transform any data before ingestion

Data transformation task 1: time

Let's compare the field time in the API earthquake object with the field Time on our card.

You will see that the field time in the API earthquake object is in Unix epoch time in milliseconds (1651522073266). However, Time on the card displays a human-readable timestamp (2022-05-02T20:07:53.266Z).

To achieve this outcome, we will convert the Unix epoch time in the API field time to a human-readable timestamp. Then, we will store the transformed value in the field @timestamp in Elasticsearch (more on that later).
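To make the conversion concrete, here is a minimal sketch in plain JavaScript. It only illustrates what the transformation does to the value; how the series actually performs it is covered in Part 6. The function name epochMillisToIso is hypothetical.

```javascript
// Convert Unix epoch time in milliseconds to an ISO 8601 timestamp in UTC.
function epochMillisToIso(epochMillis) {
  return new Date(epochMillis).toISOString();
}

console.log(epochMillisToIso(1651522073266)); // 2022-05-02T20:07:53.266Z
```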

Data transformation task 2: coordinates

You will see that the search results card has fields called latitude, longitude, and depth(pink boxes).

In the API earthquake object, the values of these fields are contained in an array called coordinates(pink box) and are not labeled as such.

To make it easier to identify this information, we will create fields for lat(latitude), lon(longitude), and depth in Elasticsearch.

Then, we will store the corresponding info from the API's coordinates array into its respective fields.

Then, we will store lat and lon into an object called coordinates to keep this information together as a pair as shown below.

Note that in Elasticsearch, the abbreviation lat should be used for latitude and lon should be used for longitude.
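The reshaping described above could look roughly like this in JavaScript. The helper name labelCoordinates and the sample numbers are made up for illustration; the only thing taken from the API is the array order of longitude, latitude, depth.

```javascript
// Hypothetical helper: give names to the values in the API's coordinates array.
// The array order is [longitude, latitude, depth].
function labelCoordinates([lon, lat, depth]) {
  return {
    coordinates: { lat, lon }, // lat and lon kept together as a pair
    depth,
  };
}

console.log(labelCoordinates([-116.5, 33.5, 2.1]));
// { coordinates: { lat: 33.5, lon: -116.5 }, depth: 2.1 }
```

Note that longitude comes first in the API array, so the destructuring order matters: swapping the first two names would silently swap latitude and longitude.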

Step 4: Determine the desired mapping

We just figured out how we should transform the retrieved data from the API before we store it in Elasticsearch.

Next, we will figure out how to store this data using the smallest disk space while maximizing our search performance.

This is when customizing your mapping comes into play!

Mapping defines how a document and its fields are indexed and stored.

It does that by assigning types to the fields being indexed. Depending on the assigned field type, each field is indexed and primed for different types of requests (full text search, exact searches, aggregations, sorting, etc.).

This is why mapping plays an important role in how Elasticsearch stores and searches for data.

We will be glossing over a lot of the concepts we have covered in Understanding mapping with Elasticsearch and Kibana. Check out this resource if you need a more in-depth review of these concepts!

Take a look at the table below.

To make this process easier, I have created a table that displays(pink box) the name of the field, description of the field, typical values contained in the field, the purpose this field will serve, and the desired mapping I have chosen for the field.

Let's go through each field and see why I chose a certain field type for it.

When you take a look at the Typical values column in the table, you will see that these fields either contain numeric(green box) or string(orange box) values.

Let's take a look at the fields that contain numeric values first.

Numeric field types

There are various field types that can be assigned to numeric fields. The field type you choose will depend on the value type a field contains and for what purpose you will be using the field.

Coordinates

In step 3, we decided that we want to store the fields lat(latitude) and lon(longitude) within an object called coordinates.

We are planning on using the fields lat and lon for two tasks:

to display this information on a search results card
to use these coordinates to mark the location of earthquakes on a heat map(part 10).

The second task requires running geo-based queries.

Therefore, the field coordinates should be typed as geo_point in order for this to work.

If you want to learn more about the field type geo_point, check out this documentation.

depth and mag

The typical values for the fields depth and mag are decimal numbers.

As the values of these fields will only be displayed in the search result cards, we will assign the field type float to these fields.

sig

When you look at the typical values for the field sig, you will see that they are integers that range from 0 to 1000.

The value of this field will be displayed in the search results card.

We want to choose the field type that will store integers using the smallest disk space.

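If you take a look at the documentation for numeric field types, the smallest integer type, byte, only covers -128 to 127, which is too small for this range. The field type that will allow us to store this data using the smallest disk space is therefore short, which covers -32,768 to 32,767.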
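The reasoning behind picking short can be sketched as a small lookup over the value ranges of the integer field types. The function name smallestIntegerType is hypothetical, and as a simplification the sketch falls back to long for anything that does not fit in integer without checking the bounds of long.

```javascript
// Value ranges of the Elasticsearch integer field types, smallest first.
const INTEGER_TYPES = [
  { type: 'byte', min: -128, max: 127 },
  { type: 'short', min: -32768, max: 32767 },
  { type: 'integer', min: -2147483648, max: 2147483647 },
];

// Hypothetical helper: return the smallest type whose range covers [min, max].
function smallestIntegerType(min, max) {
  const match = INTEGER_TYPES.find((t) => min >= t.min && max <= t.max);
  return match ? match.type : 'long'; // simplification: long is not range-checked
}

console.log(smallestIntegerType(0, 1000)); // short
```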

time

The value of this field will be displayed in the search results card. It will also be used to search for earthquakes that have occurred within a chosen date range.

To do so, we will run range queries on this field, so we will assign it the field type date.
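As a sketch of what such a range query body might look like when built in JavaScript: this assumes the transformed time is stored in the field @timestamp, as planned in step 3. The function name buildDateRangeQuery and the example dates are made up for illustration.

```javascript
// Hypothetical helper: build a range query clause on the @timestamp field.
// gte and lte make both ends of the date range inclusive.
function buildDateRangeQuery(from, to) {
  return {
    range: {
      '@timestamp': { gte: from, lte: to },
    },
  };
}

console.log(JSON.stringify(buildDateRangeQuery('2022-05-01', '2022-05-07'), null, 2));
```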

Text field types

Let's go over the fields that contain string data types(place, type, and url).

By default, every field that contains a string value gets mapped twice: as a text field and as a keyword multi-field.

Each field type is primed for different types of requests.

The text field type is used for full text search.

The keyword field type is used for aggregations, sorting, and exact searches.

In scenarios where you do not need both field types, the default setting is wasteful. It will slow down indexing and use up more disk space.
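For comparison, here is the default dynamic mapping for a string field next to a keyword-only mapping, both written as JavaScript objects. The variable names are only for illustration.

```javascript
// What dynamic mapping generates for a new string field:
// a text field plus a keyword multi-field named "keyword".
const defaultStringMapping = {
  type: 'text',
  fields: {
    keyword: { type: 'keyword', ignore_above: 256 },
  },
};

// An explicit mapping for a field that only needs exact searches,
// aggregations, or sorting: one field type instead of two.
const keywordOnlyMapping = { type: 'keyword' };

console.log(JSON.stringify({ defaultStringMapping, keywordOnlyMapping }, null, 2));
```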

When deciding on a string field type, make sure you know what purpose the field will serve so you can choose the correct field type.

place

The field place will be used for three purposes.

The value of this field will be displayed on the search results card.
The field will be used for full text search(when a user types in the location, the user input will be searched against this field to retrieve relevant data)
Aggregation will be performed on this field to yield a table of 10 locations with the highest frequency of earthquakes (part 10)

Since we need to run both full text search and aggregations on the field place, we will assign both field types text and keyword.
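To see why both field types are needed, here is a rough sketch of the two kinds of requests. It assumes the keyword multi-field is named keyword, so that it is addressed as place.keyword; the function names, the aggregation name top_places, and the sample input are hypothetical.

```javascript
// Full text search runs against the text field "place".
function buildPlaceSearch(userInput) {
  return { query: { match: { place: userInput } } };
}

// The aggregation runs against the keyword multi-field "place.keyword"
// and returns the 10 most frequent locations. size: 0 skips the hits,
// since only the aggregation buckets are needed.
function buildTopPlacesAggregation() {
  return {
    size: 0,
    aggs: {
      top_places: { terms: { field: 'place.keyword', size: 10 } },
    },
  };
}

console.log(JSON.stringify(buildPlaceSearch('Example Town'), null, 2));
console.log(JSON.stringify(buildTopPlacesAggregation(), null, 2));
```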

type

The value of this field will be displayed on the search results card.

This field will also be used for exact searches.

When a user searches for a specific type of quake, the user input is searched against the field type to retrieve relevant search results.

The user is prompted to select a type from a list of options. Therefore, we can perform exact searches on this field and will map this field as keyword.
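An exact search on a keyword field is typically expressed as a term query. A minimal sketch, with a hypothetical function name and a made-up selected value:

```javascript
// Hypothetical helper: exact match on the keyword field "type".
// The value is compared as-is, without text analysis, which works
// because the user picks it from a fixed list of options.
function buildTypeQuery(selectedType) {
  return { term: { type: selectedType } };
}

console.log(JSON.stringify(buildTypeQuery('earthquake')));
// {"term":{"type":"earthquake"}}
```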

url

The value of the field url is only displayed on a card and it is not used for search.

Therefore, there is no need to create search data structures (inverted index or doc values) for this field, so we will disable it (enabled: false).
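Putting the choices from this step together, the desired mapping could be sketched as the following JavaScript object. This is an illustration under assumptions, not the final mapping from Part 6: the variable name is hypothetical, and url is written as a disabled object field because the enabled parameter applies to object fields (Elasticsearch skips parsing a disabled field, so it can still hold a plain string).

```javascript
// Sketch of the desired mapping, one entry per field discussed above.
const earthquakeMappings = {
  properties: {
    '@timestamp': { type: 'date' },        // range queries by date
    coordinates: { type: 'geo_point' },    // { lat, lon } pair for geo queries
    depth: { type: 'float' },              // decimal values
    mag: { type: 'float' },                // decimal values
    place: {
      type: 'text',                        // full text search
      fields: { keyword: { type: 'keyword' } }, // aggregations
    },
    sig: { type: 'short' },                // integers from 0 to 1000
    type: { type: 'keyword' },             // exact searches
    url: { type: 'object', enabled: false }, // display only, not searchable
  },
};

console.log(JSON.stringify(earthquakeMappings, null, 2));
```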

Summary

In this blog, we figured out:

how we want to transform the retrieved data before ingesting it into Elasticsearch
the desired mapping to efficiently store and search data in Elasticsearch

Move on to Part 6 to set up Elasticsearch for data transformation and data ingestion!