Syed Sirajul Islam Anik

Posted on Mar 24, 2021

Elasticsearch Sample Data Generator

#elasticsearch #bulk #data #dump

Recently, I have been trying to learn Elasticsearch once again. I say "once again" because I have wanted to learn it since late 2016. In the meantime I tried several times, and every time I gave up. Just like every other time, I am motivated this time as well 😉

Motive

To learn Elasticsearch, you need lots of data to query. I searched a few places for a usable dump, but I couldn't find one I could go with, and the datasets I did find online contained kinds of data I am not familiar with. So I decided to write a generator of my own. I had already used Artisan Console and fzaninotto/Faker, so I built a generator that anyone can run from their terminal to produce a dump shaped the way they wish.

The repository

This is the repository that you can use to generate the dump.

ssi-anik / elasticsearch-sample-data-generator

Sample data generator and writes in file to upload to Elasticsearch for bulk upload

elasticsearch-sample-data-generator

The purpose of the project is to generate a dump for Elasticsearch Bulk API.

Requirements

Your local machine needs either composer or docker installed to get it working. If you run it without docker, the local PHP version must be >=7.3 and <8.0.

Installation

Clone the repository.
If you have composer installed on your local machine and the machine satisfies the PHP requirement, run composer install to install the project dependencies.
If you don't know PHP or your machine doesn't satisfy the PHP requirement, uncomment the COPY . /app and RUN composer install lines in the Dockerfile so that they look like the following.

# It'll copy the project in the PHP container.
COPY . /app

# It'll install the project dependencies.
RUN composer install

Run cp docker-compose.yml.example docker-compose.yml.
Make changes in your docker-compose.yml file. If you don't need the elasticsearch & kibana, remove those services.
If you made the…

View on GitHub

Installation [without docker]

Clone the repository.
If your machine has PHP >=7.3 and <8.0 and composer installed, just run composer install from the root of the repository. It'll install the project dependencies.

That's all.
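If you are unsure whether your local PHP falls inside the supported range, a small check like the one below can help before running composer install. This is only a sketch: the function name php_version_ok is made up for this example, and it expects a version string in major.minor or major.minor.patch form.

```shell
# Hypothetical helper (not part of the repository).
# Succeeds when the given version string is >=7.3 and <8.0,
# which means: major version 7 and minor version 3 or higher.
php_version_ok() {
  major=${1%%.*}      # text before the first dot, e.g. "7"
  rest=${1#*.}        # text after the first dot, e.g. "4.33"
  minor=${rest%%.*}   # text before the next dot, e.g. "4"
  [ "$major" -eq 7 ] && [ "$minor" -ge 3 ]
}

# Possible usage on a machine with PHP installed:
#   php_version_ok "$(php -r 'echo PHP_VERSION;')" && composer install
```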

Installation [with docker]

Clone the repository.
Uncomment the line COPY . /app in the Dockerfile.
Uncomment the line RUN composer install in the Dockerfile.
Copy the docker-compose.yml.example to docker-compose.yml.
Comment out the line .:/app in your docker-compose.yml's services.php.volumes.
Uncomment the line ./dumps:/app/dumps in your docker-compose.yml's services.php.volumes.
If you don't need elasticsearch and kibana services, then just delete them.
Run docker-compose up -d --build to run your containers.
To exec into the PHP service, run docker-compose exec php bash.

That's all for the docker-based installation. If you're good at docker, you can tweak these things as well by going through the Dockerfile and the docker-compose.yml.

Usage

The repository contains a single executable, elasticsearch-dump, in its root. We'll use it to run commands and generate dumps.

./elasticsearch-dump generate is the base command. Let's have a look at the available arguments and options.

./elasticsearch-dump generate --help

Description:
  Generate dump for elasticsearch bulk API upload

Usage:
  generate [options] [--] <fields>

Arguments:
  fields               Enter the fields definition (required)

Options:
  --file[=FILE]        Enter the file name [default: "dumps/dump.json"]
  --entries[=ENTRIES]  Enter the number of entries [default: "1"]
  --action[=ACTION]    Enter the action name [index or create] [default: "index"]
  --index[=INDEX]      Enter the index name [default: "my-index"]
  --id[=ID]            Enter the sequence start value [default: "1"]
  --append             Append to existing file
  --force              Does not ask for confirmation
  --uuid               UUID based ID generation

Options

Before we look at the required argument, let's explore the options. A few of them expect values and a few are boolean flags. All of them are optional; pass them to override the default values.

--file - Default is dumps/dump.json. The file where you want to save the dump. You can pass a relative or an absolute path. If the path starts with /, it is used as an absolute path. Otherwise only the file name is considered and the dump is always written to the dumps directory.
--entries - Default is 1. The number of entries you want to generate.
--action - Default is index. The type of the action. It can be either index or create.
--index - Default is my-index. The name of the index where you'll put these values.
--id - Default is 1. The start position of the sequence. It can only generate a numeric sequence.
--append - A boolean flag. If present, the output is appended to the existing file. If the file doesn't exist, it is created and the contents are written to it.
--force - A boolean flag. By default, the command asks you for confirmation. By providing this flag, you can bypass the confirmation.
--uuid - A boolean flag. If passed, --id is ignored and UUID-based IDs are generated.
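To picture how --action, --index and --id end up in the dump, here is a hand-written sample in the Elasticsearch Bulk API layout, where every entry is an action line followed by the document itself. Treat it as a sketch: the sample lines are typed by hand rather than captured from the tool, so details such as whether the _id is quoted are assumptions. The helper names count_entries and upload_dump and the ES_URL variable are made up for this example.

```shell
# Hand-written sample (an assumption about the shape, not captured output) of
# what a command such as
#   ./elasticsearch-dump generate --entries=2 --action=create --index=people --id=5 "name"
# is expected to write: one action line plus one document line per entry.
sample_dump=$(mktemp)
cat > "$sample_dump" <<'EOF'
{"create":{"_index":"people","_id":"5"}}
{"name":"Annabelle Balistreri"}
{"create":{"_index":"people","_id":"6"}}
{"name":"Chandler Sample"}
EOF

# A bulk file holds two lines per entry, so entries = lines / 2.
count_entries() {
  lines=$(wc -l < "$1" | tr -d ' ')
  echo $((lines / 2))
}

# Upload helper for the _bulk endpoint. ES_URL is a hypothetical variable
# that defaults to a local Elasticsearch. The Bulk API expects the body to
# end with a newline, which the heredoc above already provides.
upload_dump() {
  curl -s -H 'Content-Type: application/x-ndjson' \
    -X POST "${ES_URL:-http://localhost:9200}/_bulk" \
    --data-binary "@$1"
}

count_entries "$sample_dump"   # prints 2
```

With a running Elasticsearch, upload_dump dumps/dump.json would send a generated file in a single request.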

Arguments

The command generates data using PHP's Faker library. We have to pass the fields that we want filled with fake data.

Suppose we want to generate name and address fields. When you pass the fields, you can use the pipe | to separate each field. So, the command looks like the following.

Example:

./elasticsearch-dump generate --entries 10 "name|address"

Here, the name and address fields resolve to Faker's name and address properties. If the key in the object should differ from the Faker property, separate the two with a colon :, key first. So, to get a firstName in the name field and a streetAddress in the address field, we can simply use the following.

Example:

./elasticsearch-dump generate --entries 10 \
  "name:firstName|address:streetAddress"

# Generates
# {"name":"Roosevelt","address":"45647 Judy Isle"}

Here, the object has a name key holding a firstName value and an address key holding a streetAddress value. Both firstName and streetAddress are resolved as Faker properties.

If a Faker formatter takes arguments, you can write it as a method call.

Example:

./elasticsearch-dump generate --entries 10 \
  "name:firstName|id:numerify('ID-####')|amount:numberBetween(1000, 9000)"

# Generates
# {"name":"Lourdes","id":"ID-4912","amount":1004}
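To make the key:formatter syntax concrete, the following shell sketch splits a fields definition the way the text above describes it: first on the pipe, then on the first colon. It is an illustration only. The function name parse_fields is made up, the real tool is written in PHP and may parse differently, and this naive split would break if a formatter argument itself contained a pipe.

```shell
# Illustrative only: splits a fields definition on "|" and then on the first ":".
parse_fields() {
  printf '%s\n' "$1" | tr '|' '\n' | while IFS= read -r field; do
    key=${field%%:*}                  # text before the first colon
    case $field in
      *:*) formatter=${field#*:} ;;   # text after the first colon
      *)   formatter=$key ;;          # no colon: the key doubles as the formatter
    esac
    printf '%s => %s\n' "$key" "$formatter"
  done
}

parse_fields "name|address:streetAddress|amount:numberBetween(1000, 9000)"
# name => name
# address => streetAddress
# amount => numberBetween(1000, 9000)
```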

Object nesting

When passing your fields to the command's argument, you can nest objects using dot notation.

Example:

./elasticsearch-dump generate --entries 10 \
"student.name:firstName|student.age:numberBetween(20, 27)|id:numerify('ID-####')"

# Generates
# {"student":{"name":"Chandler","age":20},"id":"ID-4386"}

Check the JSON. The student object contains name and age within it. The id field sits outside the student object.
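As a rough picture of what dot notation means, the sketch below expands one dotted key into nested JSON by wrapping the value from the innermost segment outwards. The function name nest is made up for this example and it is not how the tool is implemented. It handles a single key only and does not merge siblings such as student.name and student.age into one object, nor does it escape special characters in keys.

```shell
# Illustrative only: turns a dotted key plus a ready-made JSON value
# into a nested JSON object.
nest() {
  path=$1
  json=$2
  while [ -n "$path" ]; do
    last=${path##*.}              # rightmost segment of the path
    json="{\"$last\":$json}"      # wrap what we have so far
    case $path in
      *.*) path=${path%.*} ;;     # drop the rightmost segment
      *)   path= ;;               # last segment handled, stop
    esac
  done
  printf '%s\n' "$json"
}

nest "student.name" '"Chandler"'   # {"student":{"name":"Chandler"}}
```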

Extending the faker functionality

If Faker doesn't provide the type of data you want, you can extend it by providing an array of values in the project's config/source.php file. The file already contains designation as an example. You can call the custom provider using the custom('key') format.

Example:

./elasticsearch-dump generate "name|designation:custom('designation')"

# Generates
# {"name":"Annabelle Balistreri","designation":"HR Managers"}

In our case that is custom('designation'), where designation is the key in the config/source.php file.

Hope this helps you to generate lots of data.

Happy coding. ❤️
