ugo landini

JR, quality Random Data from the Command line, part I

What is JR?

JR is a CLI tool that helps you stream quality random data. We all know what streaming is and why it is so important nowadays. Now, let's try to define what "quality random data" is.
A simple - and not too scientific - definition: data that is good enough to look real.

Examples:

  • Is 1.2.3.4 a good IP? And 10.2.138.203?
  • Is
  {
    "ID": "ABCDEFG1234",
    "name": "Ugo Landini",
    "gender": "F",
    "company": "Confluent",
    "email": "john.wayne@ibm.com"
  }

a good random user?

What about this one instead?

  {
    "ID": "69167997-0253-4165-a17d-9ef896124426",
    "name": "Laura Kim",
    "gender": "F",
    "company": "Boston Static",
    "email": "laura.kim@bostonstatic.com"
  }

Defining quality data

There are essentially two different dimensions:

  1. things that must be realistic "in themselves", like an IP address, or a credit card number
  2. things that are realistic only if coherent with other data, like names, companies, emails, cities, zip codes, mobile phones, locales, etc.

Sometimes we need to generate random data of type 2 in different streams, so the "coherency" must also span different entities: think for example of referential integrity in databases. If I am generating users, products and orders to three different Kafka topics and I want to build a streaming application with Apache Flink, I definitely need the data to be coherent across topics.
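
To make the difference concrete, here is a sketch using inline templates (ip is a real JR function, shown later in this article; name and email are hypothetical stand-ins for the faker-style APIs JR implements - check jr man for the actual names):

# type 1: an IP address is realistic in itself, a CIDR is all the context it needs
jr run --template '{{ip "10.2.0.0/16"}}'

# type 2 (hypothetical function names): a name and an email generated
# independently will rarely match, so they are only realistic together
jr run --template '{{name}} {{email}}'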

Yet another faking library?

So, is JR yet another faking library written in Go? Yes and no. JR indeed implements most of the APIs in fakerjs and gofakeit, but it is also able to stream data directly to stdout, Kafka, Redis and more (Elastic and MongoDB are coming). JR can talk directly to Confluent Schema Registry, manage JSON Schema and Avro schemas, and easily maintain coherence and referential integrity. And if you need more than what comes out of the box, JR is flexible enough that you can easily pipe its data streams to other CLI tools like kcat.
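
For example, piping a stream of net_device documents to kcat takes one line. A minimal sketch, assuming a broker reachable at localhost:9092 and a topic named test:

# produce JR output to Kafka through kcat instead of JR's own Kafka output
jr run net_device -n 5 | kcat -b localhost:9092 -t test -P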

Why is it called JR?

Just a Random generator, JSON Random generator, or, better, just JR, after the famous 80s Dallas character: all are valid answers. JR can generate everything, not only JSON, so I definitely prefer the last one.

The use case that generated the generator

I work as a Staff Solutions Engineer at Confluent: some weeks ago I was talking with a prospective customer, who told me they needed to send JSON documents like this one (among others) to Confluent Cloud.

{
"VLAN": "DELTA",
"IPV4_SRC_ADDR": "10.1.41.98",
"IPV4_DST_ADDR": "10.1.137.141",
"IN_BYTES": 1220,
"FIRST_SWITCHED": 1681984281,
"LAST_SWITCHED": 1682975009,
"L4_SRC_PORT": 81,
"L4_DST_PORT": 80,
"TCP_FLAGS": 0,
"PROTOCOL": 1,
"SRC_TOS": 211,
"SRC_AS": 4,
"DST_AS": 1,
"L7_PROTO": 443,
"L7_PROTO_NAME": "ICMP",
"L7_PROTO_CATEGORY": "Application"
}

They needed to send many of these (and similar) documents to Kafka, so it was important to measure how well Kafka client compression would perform at their rate and with their data.

When you use fully managed services like Confluent Cloud it's very important to understand how much your data will be compressed: price is directly proportional to throughput, and Kafka batches messages in the producers, so producer compression can easily save you a lot of bandwidth and therefore a lot of money. Now, producing data in real time and analysing it in real time is pretty easy with Confluent Cloud. But answering the prospect's question in real time (i.e. during the conference call) wasn't as easy as it should have been. Which compression algorithm is better? Would it be fast enough? Is batch size important for the compression?

Datagen is the de facto standard for generating random data for Kafka. But customising what it generates is not something you can do in 30 seconds, and enabling compression is currently not an option with the managed connectors. So I decided to write a tool you could use to start streaming random data to Kafka in seconds, and that's why JR was born. With the help of some friends and colleagues we packed JR with a lot of features (and many more are coming!).

Basic JR usage

JR is very straightforward to use. Let's look at all the preinstalled templates:

jr template list

All the templates should be green: that means their syntax is correct and they compile.

Let's look at the net_device template, which is what I would have written to randomise the document they gave me, had I had JR during the conference call:

> jr template show net_device

{
"VLAN": "{{randoms "ALPHA|BETA|GAMMA|DELTA"}}",
"IPV4_SRC_ADDR": "{{ip "10.1.0.0/16"}}",
"IPV4_DST_ADDR": "{{ip "10.1.0.0/16"}}",
"IN_BYTES": {{integer 1000 2000}},
"FIRST_SWITCHED": {{unix_time_stamp 60}},
"LAST_SWITCHED": {{unix_time_stamp 10}},
"L4_SRC_PORT": {{ip_known_port}},
"L4_DST_PORT": {{ip_known_port}},
"TCP_FLAGS": 0,
"PROTOCOL": {{integer 0 5}},
"SRC_TOS": {{integer 128 255}},
"SRC_AS": {{integer 0 5}},
"DST_AS": {{integer 0 2}},
"L7_PROTO": {{ip_known_port}},
"L7_PROTO_NAME": "{{ip_known_protocol}}",
"L7_PROTO_CATEGORY": "{{randoms "Network|Application|Transport|Session"}}"
}

The net_device template is pretty easy to write: these are all "type 1" fields with no relations. You can easily generate a good IP starting from its CIDR with the ip function. The other functions used in this template (ip_known_port, integer, unix_time_stamp) are all pretty straightforward too. Running the template is just a matter of typing:

> jr template run net_device

{
"VLAN": "DELTA",
"IPV4_SRC_ADDR": "10.1.175.220",
"IPV4_DST_ADDR": "10.1.148.210",
"IN_BYTES": 1553,
"FIRST_SWITCHED": 1680183839,
"LAST_SWITCHED": 1682746947,
"L4_SRC_PORT": 443,
"L4_DST_PORT": 81,
"TCP_FLAGS": 0,
"PROTOCOL": 0,
"SRC_TOS": 195,
"SRC_AS": 0,
"DST_AS": 0,
"L7_PROTO": 22,
"L7_PROTO_NAME": "SFTP",
"L7_PROTO_CATEGORY": "Network"
}

When you write your own templates, you'll probably want to look at all the available functions. For example, let's ask JR which networking functions are available:

> jr man -c network

...

Name: ip_known_protocol
Category: network
Description: returns a random known protocol
Parameters:
Localizable: false
Return: string
Example: jr run --template '{{ip_known_protocol}}'
Output: tcp

Name: http_method
Category: network
Description: returns a random http method
Parameters:
Localizable: false
Return: string
Example: jr run --template '{{http_method}}'
Output: GET

Name: mac
Category: network
Description: returns a random mac Address
Parameters:
Localizable: false
Return: string
Example: jr run --template '{{mac}}'
Output: 7e:8e:75:a5:0a:85

You can also immediately test a function without writing a template, directly from jr man:

> jr man ip --run

Name: ip
Category: network
Description: returns a random Ip Address matching the given cidr
Parameters: cidr string
Localizable: false
Return: string
Example: jr run --template '{{ip "10.2.0.0/16"}}'
Output: 10.2.55.217

10.2.240.243

Elapsed time: 0s
Data Generated (Objects): 1
Data Generated (bytes): 12
Number of templates (Objects): 5
Throughput (bytes per second):       118
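
You can also combine several functions in a single inline template. A quick sketch using only functions we have already met:

jr run --template '{{ip "10.1.0.0/16"}}:{{ip_known_port}} {{http_method}}'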

Create more random data

Using the -n option you can create more objects in each pass. You can use jr run or jr template run; they are equivalent.
This example creates 3 net_device objects at once:

jr run net_device -n 3
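
And the same pass with the equivalent long form:

jr template run net_device -n 3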

Using the --frequency option you can repeat the whole creation pass at a given interval.

This example creates 2 net_device objects every second, forever:

jr run net_device -n 2 -f 1s 

Using the --duration option you can put a time bound on the entire generation.
This example creates 2 net_device objects every 100ms for 1 minute (1,200 objects in total):

jr run net_device -n 2 -f 100ms -d 1m 

Results are written to standard output by default (--output "stdout"), but streaming to Kafka is just as simple.

If you have Confluent Cloud, you can just download the client configuration, put the file in a kafka directory and start streaming. If you don't have Confluent Cloud, give it a try: no credit card is needed, a basic cluster to test JR is super cheap, and you'll also get $400 of free usage included.
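
As a sketch, the manual steps look roughly like this (the kafka/config.properties path and the downloaded file name are assumptions based on my setup; check jr run --help for where your JR version expects the client configuration):

# put the downloaded client configuration where JR can find it
mkdir -p kafka
cp ~/Downloads/client.properties kafka/config.properties
jr run net_device -o kafka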

Anyway, here is the configuration template if you need to configure it manually. It's just a standard librdkafka configuration:

# Kafka configuration
# https://github.com/confluentinc/librdkafka/blob/master/CONFIGURATION.md

bootstrap.servers=
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username=
sasl.password=
compression.type=gzip
compression.level=9
statistics.interval.ms=1000
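
Since librdkafka accepts none, gzip, snappy, lz4 and zstd as compression.type, a simple way to answer the "which algorithm is better?" question is to re-run the same JR workload after changing that one line and compare the ingress throughput you observe on the cluster (the statistics.interval.ms=1000 line above also tells librdkafka to emit its internal statistics every second). A sketch:

# in the configuration file, try each algorithm in turn, e.g. change
#   compression.type=gzip
# to
#   compression.type=zstd
# then re-run the same workload and compare what the cluster receives
jr run net_device -n 5 -f 500ms -d 5s -o kafka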

Streaming to Kafka

Once Kafka is configured, streaming to it with JR is straightforward:

jr run -n 5 -f 500ms -d 5s net_device -o kafka
2023/05/07 20:03:07         0 bytes produced to Kafka
2023/05/07 20:03:08      5250 bytes produced to Kafka
2023/05/07 20:03:09      8765 bytes produced to Kafka
2023/05/07 20:03:10     12260 bytes produced to Kafka
2023/05/07 20:03:11     15763 bytes produced to Kafka

Elapsed time: 5s
Data Generated (Objects): 50
Data Generated (bytes): 17364
Number of templates (Objects): 1
Throughput (bytes per second):      3172

By default, JR writes to a topic named test, but you can change that with the -t option.
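
For example, to stream the same workload to a topic named net_device instead:

jr run net_device -n 5 -f 500ms -d 5s -o kafka -t net_device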

Conclusions

We have seen how to use JR in simple use cases, streaming quality random data from predefined templates to standard output and to Kafka on Confluent Cloud.
In the second part of this series, we will see how to write your own templates and manage the integrity of the generated data.
In the meantime, happy streaming!
