Girish Talekar

How to download large volumes of Datadog logs in parallel

How it started

Our application logs were in Datadog, and we needed to analyze a large volume of them to investigate some delays in the application.

Issues we faced

1) Downloading logs directly from Datadog has a max limit of 100k records

2) There is no easy way to calculate the delay between two logs, or between the frontend and the backend

3) Downloading a large amount of logs, say one week's worth with around 1 million records, takes a long time with a simple application

The approach

When I started, I built a simple utility that downloads the logs synchronously, but it was slow: it took 15 to 30 minutes, pulling the logs down roughly 5,000 at a time. To reduce that time I wrote a shell script that runs the utility in parallel. That cut the download time significantly but created another problem: it produced multiple CSV files, and combining them was still a tedious task. A rough sketch of that workaround is shown below.
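For context, the old workaround looked roughly like this. It is only an illustration, not the actual script: sync-downloader, its flags, and the fixed ten slices are hypothetical placeholders.

#!/bin/bash
# Rough sketch of the earlier workaround: run the synchronous downloader once
# per time slice in the background, producing one CSV per slice.
# "sync-downloader" and its flags are hypothetical placeholders.
for i in $(seq 0 9); do
  sync-downloader --slice="$i" --out="part_$i.csv" &
done
wait
# Result: part_0.csv ... part_9.csv still have to be merged by hand.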

The solution

I decided to write a simple tool to solve the above issues, hence the dd-downloader CLI tool. I have tried to explain the approach in the diagram below.

[Diagram: datadog-downloader CLI tool]

Step 1: Clone the repo

$ git clone https://github.com/girishg4t/dd-downloader.git


Step 2: Build binary using make

$ make


Step 3: Run the command

$ dd-downloader generate config --name=config.yaml # generates a sample yaml file with a date range of 10 min
$ dd-downloader validate --config-file=./sample_templates/event_sent.yaml # just validates that the mapping and template are correct
$ dd-downloader run sync --config-file=templates/queued_event.yaml --file=output.csv # downloads logs one after the other in chunks of 5000
$ dd-downloader run parallel --config-file=templates/private_event.yaml --file=output.csv  # runs 10 parallel threads to reduce the download time


Prerequisite

You need to create a YAML config file as shown in the examples below.

Things to keep in mind

auth:
- dd_api_key => your Datadog API key
- dd_app_key => your Datadog application key


More details are available in the Datadog documentation.
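One way to avoid hard-coding the keys in a committed YAML file is to keep a template and substitute the values at run time. This is only a possible workflow sketch, not a feature of dd-downloader; config.tmpl.yaml and the environment variable names are assumptions for illustration, and it relies on the standard envsubst utility.

# Hypothetical workflow (not part of dd-downloader): keep real keys in
# environment variables and render the config from a template whose auth
# section contains ${DD_API_KEY} / ${DD_APP_KEY} placeholders.
$ export DD_API_KEY="xxxxxxxxxx" DD_APP_KEY="xxxxxxxxxx"
$ envsubst < config.tmpl.yaml > config.yaml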

datadog_filter:
- mode => can be `synchronous` or `parallel`; in `parallel` mode a time frame longer than 10 min is split into 10 parallel chunks to reduce the download time
- query => logs are filtered based on this query; verify it in Datadog before using it
- from => the date/time from which the logs should be downloaded
- to => the date/time up to which the logs should be downloaded


More details are available in the Datadog documentation.
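The from and to values in the sample config below appear to be Unix epoch timestamps in milliseconds (an inference from the sample values; 1686306900000 corresponds to 2023-06-09 10:35:00 UTC). If so, they can be generated like this:

# Convert a human-readable UTC time to epoch milliseconds with GNU date
$ date -u -d "2023-06-09 10:35:00" +%s%3N
1686306900000
# Or with Python, if GNU date is not available
$ python3 -c "from datetime import datetime, timezone; print(int(datetime(2023, 6, 9, 10, 35, tzinfo=timezone.utc).timestamp() * 1000))"
1686306900000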

mapping:
- field: used as the header in the CSV file
- dd_field: the Datadog log field mapped to the CSV header above (check the logs in Datadog to find the fields you want to map)
- inner_field: flat data can be mapped directly, but to map fields inside an array you need to use this field

e.g.
For the YAML mapping below, the field 'date' is taken from 'log.Attributes.Attributes', and the same goes for 'session_id'.
For an inner object we specify '.', and for an array we specify '-'.
In the log below from Datadog, we need to map reqId, which is inside the data array:
{
  "data": [
    {
      "event": {
        "snid": "dasgadsgasdgasd",
        "data": {
          "act": "ASDGASDGDDD",
          "srcId": "dsgdsgdgsdg",
          "dstid": "dasgdasdgdgdgg",
          "pid": "adgasdhsdhh",
          "quality": "DAG",
          "sid": "dddadahsdfhdfh"
        },
        "ets": 1686307199869,
        "etyp": "ASDGASD",
        "rqid": "AAAAA"
      },
      "reqId": "AAAAA"
    }
  ]
}


It is mapped like this:

- field: "-"
    dd_field: "data"
    inner_field:
    - field: "req_id"
        dd_field: "reqId"
    - field: "event_ts"
        dd_field: "event.ets"
    - field: "event_type"
        dd_field: "event.etyp"
    - field: "dest_id"
        dd_field: "event.data.dstid"
    - field: "source_id"
        dd_field: "event.data.srcId"

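Assuming each element of the data array becomes one row in the CSV (an assumption about how the tool flattens arrays), the log above would produce output along these lines:

req_id,event_ts,event_type,dest_id,source_id
AAAAA,1686307199869,ASDGASD,dasgdasdgdgdgg,dsgdsgdgsdg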

Sample YAML file:

apiVersion: datadog/v1
kind: DataDog
spec:
  auth:
    dd_site: "datadoghq.com"
    dd_api_key: "xxxxxxxxxx"
    dd_app_key: "xxxxxxxxxx"
  datadog_filter:
    mode: synchronous
    query: 'service:frontend "socket: not able to connect to server" @type:SERVER_EVENT '
    from: 1686306900000
    to: 1686306960000
  mapping:
    - field: "date"
      dd_field: "date"
    - field: "session_id"
      dd_field: "session_id"
    - field: "-"
      dd_field: "data"
      inner_field:
        - field: "req_id"
          dd_field: "reqId"
        - field: "event_ts"
          dd_field: "event.ets"
        - field: "event_type"
          dd_field: "event.etyp"
        - field: "dest_id"
          dd_field: "event.data.dstid"
        - field: "source_id"
          dd_field: "event.data.srcId"

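Assuming you save this file as config.yaml, the download is then a single command (the same run sync command shown earlier, with --file naming the output CSV):

$ dd-downloader run sync --config-file=config.yaml --file=output.csv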
