DEV Community

ksn-developer

Parsing Nginx logs using Python

Introduction

Nginx is a popular web server software used to serve web pages and other content on the internet. Nginx produces logs that contain information about the requests it receives and the responses it sends. Parsing these logs can provide valuable insights into website traffic and usage patterns. In this article, we will explore how to parse Nginx logs using Python.

Step 1: Understanding Nginx Log Format

Nginx writes its access log to a file, usually /var/log/nginx/access.log. The format is set with the log_format directive in nginx.conf. When no format is specified, Nginx uses its predefined combined format, which includes the following fields:

The remote IP address
The remote user (- when the request is not authenticated)
The time of the request
The request method (GET, POST, etc.)
The requested URL
The HTTP version
The HTTP status code
The size of the response sent to the client
The referrer URL
The user agent string

The log format can be customized to include or exclude specific fields, or to add custom fields.
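
To make the field list concrete, here is a minimal sketch that matches one combined-format line with a regular expression. The sample line, the COMBINED pattern name, and the parse_combined helper are made up for illustration; the pattern assumes the unmodified combined format and would need adjusting for a custom log_format.

```python
import re

# Layout of nginx's predefined "combined" format:
# $remote_addr - $remote_user [$time_local] "$request" $status
# $body_bytes_sent "$http_referer" "$http_user_agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Made-up sample line, for illustration only.
SAMPLE = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"https://example.com/" "Mozilla/5.0"'
)


def parse_combined(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


fields = parse_combined(SAMPLE)
print(fields["ip"], fields["status"], fields["request"])
```

Every value comes back as a string, so numeric fields such as the status code still need converting before any arithmetic or range comparison.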

Step 2: Installing Required Libraries

To parse Nginx logs using Python, we need one third-party library:

pandas: used for data manipulation and analysis.

The re and shlex modules used below are part of the Python standard library. You can install pandas using the following command:
pip install pandas

Step 3: Parsing Nginx Logs Using Python

To parse Nginx logs using Python, we can use the pandas library. The pandas library provides a powerful data structure called a DataFrame that allows us to manipulate and analyze data easily.

Here's an example Python script that reads an Nginx log file and creates a DataFrame:

import re
import shlex
import pandas as pd


class Parser:
    # Token positions after the line is split with shlex.split().
    IP = 0
    TIME = 3
    TIME_ZONE = 4
    REQUESTED_URL = 5  # full request line, e.g. "GET /index.html HTTP/1.1"
    STATUS_CODE = 6
    USER_AGENT = 9

    def parse_line(self, line):
        # Drop the square brackets around the timestamp, then split on
        # whitespace while keeping double-quoted fields together.
        line = re.sub(r"[\[\]]", "", line)
        data = shlex.split(line)
        return {
            "ip": data[self.IP],
            "time": data[self.TIME],
            "status_code": data[self.STATUS_CODE],
            "requested_url": data[self.REQUESTED_URL],
            "user_agent": data[self.USER_AGENT],
        }


if __name__ == '__main__':
    parser = Parser()
    LOG_FILE = "access.log"
    with open(LOG_FILE, "r") as f:
        # Skip blank lines, which would otherwise raise an IndexError.
        log_entries = [parser.parse_line(line) for line in f if line.strip()]

    logs_df = pd.DataFrame(log_entries)
    print(logs_df.head())
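
The script above stops at the first line it cannot parse. Real access logs sometimes contain truncated or malformed lines, so one possible variation is to skip those lines and record their line numbers. The sketch below reuses the same re and shlex approach; the parse_lines helper and the sample lines are hypothetical.

```python
import re
import shlex


def parse_line(line):
    """Same tokenising approach as the script above: drop [ ], then shlex.split."""
    data = shlex.split(re.sub(r"[\[\]]", "", line))
    return {
        "ip": data[0],
        "time": data[3],
        "requested_url": data[5],
        "status_code": data[6],
        "user_agent": data[9],
    }


def parse_lines(lines):
    """Parse an iterable of lines; return (entries, skipped_line_numbers)."""
    entries, skipped = [], []
    for number, line in enumerate(lines, start=1):
        try:
            entries.append(parse_line(line))
        except (ValueError, IndexError):
            # ValueError: shlex found an unbalanced quote.
            # IndexError: the line has fewer fields than expected.
            skipped.append(number)
    return entries, skipped


# Made-up sample input: one valid line and two malformed ones.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    'not a log line',
    '198.51.100.2 - - [10/Oct/2023:13:55:40 +0000] "GET /x HTTP/1.1" 404 153 "-" "broken',
]
entries, skipped = parse_lines(sample)
print(len(entries), skipped)
```

The list of dictionaries returned in entries can be passed to pd.DataFrame exactly as log_entries is in the script, and skipped tells us which lines to inspect by hand.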

Step 4: Data Analysis

Once we have the Nginx log data in a DataFrame, we can perform various data analysis tasks. For example:

All requests with status code 404
logs_df.loc[(logs_df["status_code"] == "404")]
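
The comparison above uses the string "404" because the parser stores every field as text. To filter on a range of status codes, such as all 4xx or 5xx responses, one option is to convert the column to numbers first. The sketch below uses a small made-up DataFrame shaped like the parser output, so the rows and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the parser's output (all values are strings).
logs_df = pd.DataFrame({
    "ip": ["203.0.113.7", "203.0.113.7", "198.51.100.2", "198.51.100.9"],
    "status_code": ["200", "404", "500", "-"],
})

# Convert to numbers; anything non-numeric becomes NaN instead of raising.
status = pd.to_numeric(logs_df["status_code"], errors="coerce")

# NaN never falls inside a range, so unparseable rows are left out of both.
client_errors = logs_df[status.between(400, 499)]
server_errors = logs_df[status.between(500, 599)]
print(len(client_errors), len(server_errors))
```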

Get all unique IP addresses that sent requests
logs_df["ip"].unique()

Get all distinct user agents
logs_df["user_agent"].unique()

Get the most requested URLs (value_counts sorts by count, highest first)
logs_df["requested_url"].value_counts().to_dict()
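
Note that the requested_url column holds the whole request line (method, path, and protocol), so the counts above are per request line rather than per path. A possible refinement is to split that column into its three parts first. The sketch below assumes every request line has exactly those three space-separated parts and uses made-up rows; the method, path, and protocol column names are our own choice.

```python
import pandas as pd

# Hypothetical request lines, as stored in the "requested_url" column.
logs_df = pd.DataFrame({
    "requested_url": [
        "GET /index.html HTTP/1.1",
        "GET /index.html HTTP/1.1",
        "POST /login HTTP/1.1",
        "GET /about HTTP/2.0",
    ],
})

# Split "METHOD PATH PROTOCOL" into three new columns.
parts = logs_df["requested_url"].str.split(" ", n=2, expand=True)
parts.columns = ["method", "path", "protocol"]
logs_df = logs_df.join(parts)

# Most requested paths and the number of requests per HTTP method.
top_paths = logs_df["path"].value_counts().head(3)
methods = logs_df["method"].value_counts().to_dict()
print(top_paths.index[0], methods)
```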

Conclusion

Parsing Nginx logs using Python can provide valuable insights into website traffic and usage patterns. By using the pandas library, we can easily read and manipulate the log data. With the right analysis, we can gain insights into website performance, user behavior, and potential security threats.

GitHub link: https://gist.github.com/ksn-developer/4072a9e092bccf68559c21f1c5ac2de2

Top comments (2)

Chris Greening

Thank you for sharing! I love all the different ways we can leverage pandas, I'm constantly finding new applications for it

ksn-developer • Edited

You're welcome! pandas is a great tool for analysing requests. I am planning to write more about it.