Recently, I have been trying to learn Elasticsearch once again. I say "once again" because I have wanted to learn it since late 2016, and in that time I tried several times and failed every time. And just like every other time, I am motivated this time as well.
Motive
To learn Elasticsearch, you need lots of data to run queries against. I searched a few places for a usable dump, but I couldn't find one I could work with, and the data I did find online was of types I wasn't familiar with. So I decided to write a generator of my own. I had already used Artisan Console and fzaninotto/Faker, so I built a generator that anyone can run from their terminal to produce a dump the way they wish.
The repository
This is the repository that you can use to generate the dump.
ssi-anik / elasticsearch-sample-data-generator
Sample data generator and writes in file to upload to Elasticsearch for bulk upload
elasticsearch-sample-data-generator
The purpose of the project is to generate a dump for Elasticsearch Bulk API.
Requirements
- Either your local machine should have composer or docker installed to get it working. And the local PHP version should be >=7.3 and <8.0
Installation
- Clone the repository.
- If you have composer installed on your local machine and satisfies the requirement, then run composer install to install the project dependencies.
- If you don't know php or the local php requirement is not satisfied on your machine, then uncomment the COPY . /app and RUN composer install lines in Dockerfile. So, they'll look like the following.
# It'll copy the project in the PHP container.
COPY . /app
# It'll install the project dependencies.
RUN composer install
- Run cp docker-compose.yml.example docker-compose.yml.
- Make changes in your docker-compose.yml file. If you don't need the elasticsearch & kibana, remove those services.
- If you made the…
Installation [without docker]
- Clone the repository.
- If your machine has PHP version >=7.3 and <8.0 and composer installed, then just run composer install from the root of the repository. It'll install the project dependencies.
That's all.
Installation [with docker]
- Clone the repository.
- Uncomment the line COPY . /app in the Dockerfile.
- Uncomment the line RUN composer install in the Dockerfile.
- Copy docker-compose.yml.example to docker-compose.yml.
- Comment out the line .:/app in your docker-compose.yml's services.php.volumes.
- Uncomment the line ./dumps:/app/dumps in your docker-compose.yml's services.php.volumes.
- If you don't need the elasticsearch and kibana services, then just delete them.
- Run docker-compose up -d --build to run your containers.
- To exec into the PHP service, run docker-compose exec php bash.
That's all for the docker-based installation. If you're comfortable with docker, you can tweak the setup further by going through the Dockerfile and the docker-compose.yml.
Usage
The repository contains one executable, elasticsearch-dump, in its root. We'll use it to run commands and generate dumps. ./elasticsearch-dump generate is the base command. Let's have a look at the available arguments and options.
./elasticsearch-dump generate --help
Description:
Generate dump for elasticsearch bulk API upload
Usage:
generate [options] [--] <fields>
Arguments:
fields Enter the fields definition (required)
Options:
--file[=FILE] Enter the file name [default: "dumps/dump.json"]
--entries[=ENTRIES] Enter the number of entries [default: "1"]
--action[=ACTION] Enter the action name [index or create] [default: "index"]
--index[=INDEX] Enter the index name [default: "my-index"]
--id[=ID] Enter the sequence start value [default: "1"]
--append Append to existing file
--force Does not ask for confirmation
--uuid UUID based ID generation
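For orientation, here is a minimal sketch of the file layout the Elasticsearch Bulk API expects: newline-delimited JSON in which each document takes two lines, an action line followed by the document itself. The sample below is written by hand with placeholder values; it is not output captured from the generator, and the exact shape of the tool's action line (for instance, whether _id is written as a string) is an assumption.

```shell
# Hand-written sample in Bulk API layout; NOT output captured from the generator.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{"index":{"_index":"my-index","_id":"1"}}
{"name":"Example Name One","address":"1 Example Street"}
{"index":{"_index":"my-index","_id":"2"}}
{"name":"Example Name Two","address":"2 Example Street"}
EOF

# Every document takes two lines: an action line, then the document itself.
lines=$(grep -c '' "$sample")
documents=$((lines / 2))
echo "$documents documents in $lines lines"
rm -f "$sample"
```

In this layout, the action name, the index name, and the ID on the action line are presumably what --action, --index, and --id (or --uuid) control.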
Options
Before we check the required argument, let's explore the options first. A few options expect values and a few are boolean flags. All of them are optional; you pass them to override the default values.
- --file - Default is dumps/dump.json. You can pass the file name where you want to save the dump, as a relative or absolute path. If the path starts with /, it is used as an absolute path. Otherwise, the dump always goes in the dumps directory and only the file name is considered.
- --entries - Default is 1. The number of entries you want to generate.
- --action - Default is index. The type of the action. It can be either index or create.
- --index - Default is my-index. The name of the index where you'll put these values.
- --id - Default is 1. The start position of the sequence. It can only generate a numeric sequence.
- --append - A boolean flag. If present, the output is appended to the existing file. If the file doesn't exist, it is created and the contents are written to it.
- --force - A boolean flag. By default, the command asks you for confirmation. Providing this flag bypasses the confirmation.
- --uuid - A boolean flag. If passed, --id is ignored and UUID-based IDs are generated.
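The --file rule is the least obvious of these, so here is a small sketch of it as I read the description above. The function resolve_dump_path is hypothetical and only restates the rule in shell; the tool itself is written in PHP and may differ in details.

```shell
# Hypothetical restatement of the --file rule; not code from the tool.
resolve_dump_path() {
    case $1 in
        /*) printf '%s\n' "$1" ;;                      # absolute path: used as given
        *)  printf 'dumps/%s\n' "$(basename "$1")" ;;  # otherwise: file name only, inside dumps/
    esac
}

resolve_dump_path /tmp/students.json      # /tmp/students.json
resolve_dump_path exports/students.json   # dumps/students.json
resolve_dump_path students.json           # dumps/students.json
```

Under this reading, a relative directory such as exports/ is dropped, so use an absolute path when the dump has to land outside the dumps directory.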
Arguments
The command generates data using PHP's Faker library. We have to pass the fields that we want to fill with fake data.
Suppose we want to generate name and address fields. When you pass the fields, use the pipe | to separate each field. So, the command looks like the following.
Example:
./elasticsearch-dump generate --entries 10 "name|address"
Here, both the name and address fields are resolved to Faker's name and address properties. If we want a different key in the objects, we can use a colon : to separate the key from the Faker property. So, if we want firstName in our name field and streetAddress in our address field, we can simply use the following.
Example:
./elasticsearch-dump generate --entries 10 \
"name:firstName|address:streetAddress"
# Generates
# {"name":"Roosevelt","address":"45647 Judy Isle"}
Here, the name key in the object holds the firstName value, and the address key holds the streetAddress value. Both firstName and streetAddress are resolved to Faker properties.
If a Faker formatter takes arguments, you can also write it as a method call.
Example:
./elasticsearch-dump generate --entries 10 \
"name:firstName|id:numerify('ID-####')|amount:numberBetween(1000, 9000)"
# Generates
# {"name":"Lourdes","id":"ID-4912","amount":1004}
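To make the field syntax concrete, the sketch below splits a definition the way the examples above read: fields are separated by |, and within a field the text before the first colon is the key and the rest is the Faker property or method. The helper describe_fields is hypothetical and exists only for illustration; how the tool actually parses the argument is an assumption.

```shell
# Hypothetical helper, not part of the tool: prints "key -> formatter" per field.
describe_fields() {
    spec=$1
    old_ifs=$IFS
    IFS='|'
    set -f                          # no filename globbing while splitting
    for field in $spec; do
        key=${field%%:*}            # text before the first colon
        formatter=${field#*:}       # text after it (the whole field when there is no colon)
        printf '%s -> %s\n' "$key" "$formatter"
    done
    set +f
    IFS=$old_ifs
}

describe_fields "name|address:streetAddress|id:numerify('ID-####')"
# name -> name
# address -> streetAddress
# id -> numerify('ID-####')
```

Quoting the whole definition matters in a shell: an unquoted | would start a pipeline, and unquoted parentheses are shell syntax.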
Object nesting
When passing your fields to the command's argument, you can nest objects using dot notation.
Example:
./elasticsearch-dump generate --entries 10 \
"student.name:firstName|student.age:numberBetween(20, 27)|id:numerify('ID-####')"
# Generates
# {"student":{"name":"Chandler","age":20},"id":"ID-4386"}
Check the JSON. The student object contains the name and age within it. The id field is outside the student object.
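The dot notation can be pictured as wrapping the value in one object per segment of the key. Here is a rough sketch of that idea for a single key; nest_key is a hypothetical helper, not code from the tool, and unlike the tool it does not merge sibling keys such as student.name and student.age into one object.

```shell
# Hypothetical helper: wraps a JSON value in one object per dot-separated segment.
nest_key() {
    path=$1
    json=$2
    while [ -n "$path" ]; do
        segment=${path##*.}             # last segment of the path
        json="{\"$segment\":$json}"     # wrap what we have so far
        case $path in
            *.*) path=${path%.*} ;;     # drop the last segment
            *)   path= ;;               # nothing left to wrap
        esac
    done
    printf '%s\n' "$json"
}

nest_key student.age 20        # {"student":{"age":20}}
nest_key id '"ID-4386"'        # {"id":"ID-4386"}
```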
Extending the faker functionality
If Faker doesn't provide the type of data you want, you can extend it by providing an array of values in the project's config/source.php file. The file already contains designation as an example. You can call the custom provider using the custom('key') format.
Example:
./elasticsearch-dump generate "name|designation:custom('designation')"
# Generates
# {"name":"Annabelle Balistreri","designation":"HR Managers"}
So, in our case we use custom('designation'), where designation is the key in the config/source.php file.
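Once a dump exists, it can be sent to the Bulk API with curl. The sketch below assumes an unsecured Elasticsearch listening on http://localhost:9200, which may not match your setup (recent Elasticsearch versions enable TLS and authentication by default); upload_dump and the ES_URL variable are hypothetical names introduced for this example.

```shell
# Hypothetical helper; assumes an unsecured Elasticsearch at http://localhost:9200.
ES_URL=${ES_URL:-http://localhost:9200}

upload_dump() {
    file=${1:-dumps/dump.json}
    if [ ! -f "$file" ]; then
        echo "dump file not found: $file" >&2
        return 1
    fi
    # --data-binary keeps the newlines that the Bulk API relies on.
    curl -s -H 'Content-Type: application/x-ndjson' \
        -X POST "$ES_URL/_bulk" --data-binary "@$file"
}

# Usage (requires a running Elasticsearch):
# upload_dump dumps/dump.json
```

The response is a JSON document; its errors field tells you whether any of the individual actions failed.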
Hope this helps you to generate lots of data.
Happy coding. ❤️