DEV Community

loading...

Parse Nginx access log with nimble_parsec

Daniel Kukula
・4 min read

Lately I played with some string parsing and found nimble_parsec a library which does the job perfect. I will show you how to convert nginx logs to something readable in elixir. A typical log line looks like:

127.0.0.1 - - [25/Dec/2020:08:15:53 +0000] 
"GET /img/key.svg HTTP/1.1" 200 8305 
"http://localhost/styles/css/app.css" 
"Mozilla/5.0 Firefox/84.0"
Enter fullscreen mode Exit fullscreen mode

this fields are:

remote_addr - remote_user [time_local] 
"request" status bytes_sent 
"referer" 
"user_agent"
Enter fullscreen mode Exit fullscreen mode

I found the description on stack overflow

To start matching on our string we need to define our parser in a new file:

defmodule NginxParsex do
  import NimbleParsec

  defparsec( :ngnix_parser, integer(1))
end
Enter fullscreen mode Exit fullscreen mode

this just matches a string starting with an integer

iex(1)> log = "127.0.0.1 - - "
iex(2)> NginxParsex.ngnix_parser(log)
{:ok, [1], "27.0.0.1 - - ", %{}, {1, 0}, 1}
Enter fullscreen mode Exit fullscreen mode

What happens here our parser matches the first integer and returns a tuple with :ok, a list of items that matched [1], and the rest of the string. The other items in the tuple are additional data used by the parser.
We can extract our full ip address creating a parser like this:

  ip = 
    integer(min: 1, max: 3)
    |> ignore(string("."))
    |> integer(min: 1, max: 3)
    |> ignore(string("."))
    |> integer(min: 1, max: 3)
    |> ignore(string("."))
    |> integer(min: 1, max: 3)

  defparsec( :ngnix_parser, ip)
Enter fullscreen mode Exit fullscreen mode

and after running it we got

iex(1)>   NginxParsex.ngnix_parser(log)
{:ok, [127, 0, 0, 1], " - - ", %{}, {1, 0}, 9}
Enter fullscreen mode Exit fullscreen mode

as you can see we are extracting all the numbers and omitting the dots, as we don't need all the integers a string is enough so we may also extract it as a string that consists of numbers and dots:

ip = ascii_string([?., ?0..?9], min: 7, max: 15) 
Enter fullscreen mode Exit fullscreen mode

now we get back:

iex(1)> NginxParsex.ngnix_parser(log) 
{:ok, ["127.0.0.1"], " - - ", %{}, {1, 0}, 9}
Enter fullscreen mode Exit fullscreen mode

the next part is not really useful for us " - - " we may either omit it using the ignore macro

  ip = 
    ascii_string([?., ?0..?9], min: 7, max: 15)
    |> ignore(string(" - - "))
Enter fullscreen mode Exit fullscreen mode

and our first test string is parsed

iex(1)>   NginxParsex.ngnix_parser(log)
{:ok, ["127.0.0.1"], "", %{}, {1, 0}, 14}
Enter fullscreen mode Exit fullscreen mode

the problem is the second dash may be a remote_user which we will skip and match on the opening [

  ip = 
    ascii_string([?., ?0..?9], min: 7, max: 15)
    |> ignore(eventually(ascii_char([?[])))
Enter fullscreen mode Exit fullscreen mode

as our current log was missing this part I'll add it:

log = ~s(127.0.0.1 - - [25/Dec/2020:08:15:53 +0000] )
iex(1)>   NginxParsex.ngnix_parser(log)
{:ok, ["127.0.0.1"], "25/Dec/2020:08:15:53 +0000]", %{}, {1, 0}, 15}
Enter fullscreen mode Exit fullscreen mode

Next part we need to match is the date string - this is very well explained in the documentation we just need to make some adjustments. also we will extract our date and time parsers. Now our module looks like this:

defmodule NginxParsex do
  import NimbleParsec

  ip = 
    ascii_string([?., ?0..?9], min: 7, max: 15)

  date =
    integer(2)
    |> ignore(string("/"))
    |> ascii_string([?a..?z, ?A..?Z], 3)
    |> ignore(string("/"))
    |> integer(4)

  time =
    integer(2)
    |> ignore(string(":"))
    |> integer(2)
    |> ignore(string(":"))
    |> integer(2)
    |> ignore(string(" "))
    |> ignore(ascii_char([?-, ?+]))
    |> ignore(integer(4))


  defparsec( :ngnix_parser,
    ip
    |> ignore(eventually(ascii_char([?[])))
    |> concat(date)
    |> ignore(string(":"))
    |> concat(time)
    |> ignore(string("] "))
  )
end
Enter fullscreen mode Exit fullscreen mode

When running the code we got:

iex(1)> NginxParsex.ngnix_parser(log)                        
{:ok, ["127.0.0.1", 25, "Dec", 2020, 8, 15, 53], "", %{}, {1, 0}, 43}
Enter fullscreen mode Exit fullscreen mode

Cool our date and time is parsed, we can add more stuff from the line

log = ~s(127.0.0.1 - - [25/Dec/2020:08:15:53 +0000] "GET /img/key.svg HTTP/1.1" 200 8305 "http://localhost/styles/css/app.css" "Mozilla/5.0 Firefox/84.0")
Enter fullscreen mode Exit fullscreen mode

Now we need to match the string inside quotes:

  string_in_quotes =
    ignore(ascii_char([?"]))
    |>  ascii_string([not: ?"], min: 1)
    |> ignore(ascii_char([?"]))

  defparsec( :ngnix_parser,
    ip
    |> ignore(eventually(ascii_char([?[])))
...
    |> concat(string_in_quotes)
  )
Enter fullscreen mode Exit fullscreen mode

our result:

NginxParsex.ngnix_parser(log)                                                                            
{:ok, ["127.0.0.1", 25, "Dec", 2020, 8, 15, 53, "GET /img/key.svg HTTP/1.1"],
 " 200 8305 \"http://localhost/styles/css/app.css\" \"Mozilla/5.0 Firefox/84.0\"",
 %{}, {1, 0}, 70}
Enter fullscreen mode Exit fullscreen mode

whats left are some spaces, numbers and two quoted string - we can reuse the parts we already have and our full parser looks now:

  defparsec( :ngnix_parser,
    ip
    |> ignore(eventually(ascii_char([?[])))
    |> concat(date)
    |> ignore(string(":"))
    |> concat(time)
    |> ignore(string("] "))
    |> concat(string_in_quotes)
    |> ignore(string(" "))
    |> integer(min: 1)
    |> ignore(string(" "))
    |> integer(min: 1)
    |> ignore(string(" "))
    |> concat(string_in_quotes)
    |> ignore(string(" "))
    |> concat(string_in_quotes)
Enter fullscreen mode Exit fullscreen mode

the result is now:

{:ok,
 ["127.0.0.1", 25, "Dec", 2020, 8, 15, 53, "GET /img/key.svg HTTP/1.1", 200,
  8305, "http://localhost/styles/css/app.css", "Mozilla/5.0 Firefox/84.0"], "",
 %{}, {1, 0}, 144}
Enter fullscreen mode Exit fullscreen mode

to retrieve the data we can pattern match on the result:

  {:ok, 
[ ip, day, month, year, hour, minute, seconds, request, code, size, referrer, user_agent ],
 _, _, _, _} = NginxParsex.ngnix_parser(log)
{:ok,
 ["127.0.0.1", 25, "Dec", 2020, 8, 15, 53, "GET /img/key.svg HTTP/1.1", 200,
  8305, "http://localhost/styles/css/app.css", "Mozilla/5.0 Firefox/84.0"], "",
 %{}, {1, 0}, 144}
iex(111)> user_agent
"Mozilla/5.0 Firefox/84.0"
Enter fullscreen mode Exit fullscreen mode

now when we got all variables we can process them and wrap it in a map

  @month_map %{
    "Jan" => 1,
    "Feb" => 2,
    "Mar" => 3,
    "Apr" => 4,
    "May" => 5,
    "Jun" => 6,
    "Jul" => 7,
    "Aug" => 8,
    "Oct" => 9,
    "Sep" => 10,
    "Nov" => 11,
    "Dec" => 12
  }
  %{
    ip: ip,
    date: Date.new!(year, @month_map[month], day),
    time: Time.new!(hour, minute, seconds),
    request: request,
    code: code,
    size: size,
    referrer: URI.decode(referrer),
    user_agent: user_agent
 }
%{
  code: 200,
  date: ~D[2020-12-25],
  ip: "127.0.0.1",
  referrer: "http://localhost/styles/css/app.css",
  request: "GET /img/key.svg HTTP/1.1",
  size: 8305,
  time: ~T[08:15:53],
  user_agent: "Mozilla/5.0 Firefox/84.0"
}

Enter fullscreen mode Exit fullscreen mode

Thanks for reading - I tested it on a 2MB file that I have on my local machine and it can parse it all to the end.
A full file we wrote today can be found below:

Discussion (0)