
Divyesh Aegis


How to Build Log Data Analytics using Apache Spark?

In this post, we will see how to process log data with Apache Spark. We first look at the structure of a log line, then write a regular expression that matches it, extract the individual fields, and finally run some simple analytics on the parsed data.

First, let us look at the log data we are going to process and understand its structure.

Sample log data:


83.149.9.216 - - [17/May/2015:10:05:03 +0000] "GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1" 200 203023 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"

Now let us break this log line down and see what each part represents:


83.149.9.216 – ipAddress
- – clientId (not available in this entry, so logged as "-")
- – userId (not available in this entry, so logged as "-")
[17/May/2015:10:05:03 +0000] – dateTime
GET – method
/presentations/logstash-monitorama-2013/images/kibana-search.png – endpoint
HTTP/1.1 – protocol
200 – responseCode
203023 – contentSize (in bytes)
"http://semicomplete.com/presentations/logstash-monitorama-2013/" – url (the referring page)
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36" – browser (the User-Agent string)

Now let us write code to process this log data using Apache Spark in a Databricks environment.

I have a log file stored in a location that the cluster can read.
Now let us read a few lines of this log data using Spark and see what it looks like.
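
Reading the first few lines could look roughly like the sketch below. It is only an illustration: `firstLines` is a hypothetical helper name, and the path is left as a parameter because the file's actual location is not given here.

```scala
import org.apache.spark.sql.SparkSession

// Return the first `n` raw lines of the text file at `path`.
// `firstLines` is a hypothetical helper name, not taken from the article.
def firstLines(spark: SparkSession, path: String, n: Int): Array[String] =
  spark.sparkContext.textFile(path).take(n)
```

In a Databricks notebook, where a `spark` session is already defined, this could be called as `firstLines(spark, logPath, 5).foreach(println)`, with `logPath` set to wherever the log file actually lives.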

Now let us write a case class that holds all the fields found in a log line. Later we will create an RDD of type AccessLog from it.
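
A minimal sketch of such a case class follows. The field names mirror the breakdown of the sample line; the types (`Int` for the response code, `Long` for the content size, `String` for everything else) are assumptions rather than the original definition.

```scala
// One parsed log line. Field names follow the breakdown of the sample entry;
// the field types are assumptions made for this sketch.
case class AccessLog(
  ipAddress: String,
  clientId: String,
  userId: String,
  dateTime: String,
  method: String,
  endpoint: String,
  protocol: String,
  responseCode: Int,
  contentSize: Long,
  url: String,
  browser: String
)

// The sample line from above, written out by hand as an AccessLog.
val entry = AccessLog(
  ipAddress = "83.149.9.216",
  clientId = "-",
  userId = "-",
  dateTime = "17/May/2015:10:05:03 +0000",
  method = "GET",
  endpoint = "/presentations/logstash-monitorama-2013/images/kibana-search.png",
  protocol = "HTTP/1.1",
  responseCode = 200,
  contentSize = 203023L,
  url = "http://semicomplete.com/presentations/logstash-monitorama-2013/",
  browser = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
)
```

Keeping `dateTime` as a plain string keeps the sketch simple; parsing it into a timestamp type would be a natural next step if time-based analysis is needed.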
Next, let us create a pattern: a regular expression that matches the log format.
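
One way to write such a pattern is sketched below. The expression is an assumption based on the sample line (eleven capture groups, one per field), not necessarily the exact pattern used originally, and `LogPattern` is a name chosen for illustration.

```scala
import scala.util.matching.Regex

// Eleven capture groups, in the order the fields appear in the log line:
// ip, clientId, userId, dateTime, method, endpoint, protocol,
// responseCode, contentSize (digits or "-"), url, browser.
val LogPattern: Regex =
  """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$""".r

// The sample line from above, built with escaped quotes.
val sample =
  "83.149.9.216 - - [17/May/2015:10:05:03 +0000] " +
  "\"GET /presentations/logstash-monitorama-2013/images/kibana-search.png HTTP/1.1\" " +
  "200 203023 " +
  "\"http://semicomplete.com/presentations/logstash-monitorama-2013/\" " +
  "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36\""

// Print a few of the captured groups to check the pattern against the sample.
LogPattern.findFirstMatchIn(sample) match {
  case Some(m) =>
    println(s"ip=${m.group(1)} method=${m.group(5)} status=${m.group(8)} bytes=${m.group(9)}")
  case None =>
    println("no match")
}
// prints: ip=83.149.9.216 method=GET status=200 bytes=203023
```

The content-size group accepts `-` as well as digits because web servers commonly log `-` when a response has no body.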
Now let us write a Scala function that parses a log line and creates an instance of the case class, which we can then use to build an RDD.
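
A sketch of such a parsing function is shown below. So that the snippet compiles on its own, it restates an assumed `AccessLog` case class and an assumed pattern. `parseLogLine` is a hypothetical name, and returning `Option[AccessLog]` instead of failing on bad input is a design choice made for this sketch.

```scala
import scala.util.matching.Regex

// Assumed shape of the case class; the field types are guesses.
case class AccessLog(ipAddress: String, clientId: String, userId: String,
                     dateTime: String, method: String, endpoint: String,
                     protocol: String, responseCode: Int, contentSize: Long,
                     url: String, browser: String)

// Assumed pattern with one capture group per field.
val LogPattern: Regex =
  """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$""".r

// Parse one line. Lines that do not match the pattern yield None.
def parseLogLine(line: String): Option[AccessLog] = line.trim match {
  case LogPattern(ip, client, user, dateTime, method, endpoint, protocol, code, size, url, browser) =>
    // A "-" content size means no body was sent; treat it as 0 bytes.
    val bytes = if (size == "-") 0L else size.toLong
    Some(AccessLog(ip, client, user, dateTime, method, endpoint, protocol,
                   code.toInt, bytes, url, browser))
  case _ => None
}
```

Returning `None` for lines that do not match lets a later `flatMap` drop malformed entries silently; throwing an exception would be the alternative if bad lines should stop the job.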
Now let us read this log data and parse it to create an RDD[AccessLog].
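
This step can be sketched as a small helper that reads the file and applies a parsing function to every line. To keep the snippet independent of any particular case class, the parser is passed in as a parameter. `parseLogFile` is a hypothetical name, and the approach assumes the parser returns an `Option`, so that unparseable lines are dropped.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.reflect.ClassTag

// Read the text file at `path` and keep only the lines that `parse` accepts.
// With a parser of type String => Option[AccessLog], the result is an RDD[AccessLog].
def parseLogFile[T: ClassTag](spark: SparkSession, path: String,
                              parse: String => Option[T]): RDD[T] =
  spark.sparkContext
    .textFile(path)
    .flatMap(line => parse(line).toList) // None becomes an empty list, so bad lines vanish
```

If the resulting RDD feeds several analyses, calling `.cache()` on it avoids re-reading and re-parsing the file for each one.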
Now let us find out the top 10 response codes with the counts in descending order.
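
A minimal sketch of this query is shown below. It assumes the response codes have already been extracted into an `RDD[Int]`, for example with something like `accessLogs.map(_.responseCode)` on the parsed RDD. `topResponseCodes` is a hypothetical helper name.

```scala
import org.apache.spark.rdd.RDD

// Count how often each response code occurs and return the `n` most frequent
// codes with their counts, highest count first.
def topResponseCodes(codes: RDD[Int], n: Int = 10): Array[(Int, Long)] =
  codes
    .map(code => (code, 1L))        // pair each code with a count of one
    .reduceByKey(_ + _)             // sum the counts per code
    .sortBy(_._2, ascending = false) // order by count, descending
    .take(n)
```

Since a log contains only a handful of distinct response codes, the sort is cheap; for high-cardinality keys, `top(n)` with an ordering on the count would avoid sorting the whole dataset.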
