Karma IDS: An Intrusion Detection System using eBPF and LSTM

#rebase01 #gdsciiitkalyani #ai #python

Introducing Team Marauders

Krutika Gundecha - @capture_flint on devfolio, @capture_flint on dev.to
Bhargavi Sudarsan - @glorp on devfolio, @glorp on dev.to
Vijayapreethi S R - @pree2111 on devfolio, @pree2111 on dev.to
Destiny Sharma - @destiny_123 on devfolio, @destiny_123 on dev.to

Intro

Have you ever downloaded an app from the Play Store and had your antivirus ask to perform a scan? That's basically your system's Intrusion Detection System at work!

The 21st century has brought a massive increase in network traffic, and with it a rise in cyber threats. Traditional intrusion detection systems struggle to keep up, leading to delayed responses and missed threats. Our aim is to address this challenge. We, Team Marauders, have built a specialised Intrusion Detection System that captures packets with eBPF and uses an LSTM model, a kind of recurrent neural network, to classify whether a particular instance is a threat to the safety of your system.

Why we chose this project

We observed the problems with traditional network intrusion detection systems. Instead of using user-space packet tracers, we figured that kernel-level eBPF technology would have less overhead and would help detect threats much earlier, keeping the system safer.

Understanding networking at a deeper kernel or OS level, along with deep learning, was a major part of this project, and we're very curious about and interested in both. So we thought, why not integrate all our interests!

How It Works

eBPF

eBPF, or extended Berkeley Packet Filter, is a revolutionary Linux kernel technology that runs sandboxed programs inside the operating system kernel.
What's a sandboxed program, you ask? Basically, it is a program that operates within a restricted environment, called a "sandbox," which limits its access to system resources and sensitive data. This prevents it from causing harm or reaching unauthorized resources on the system, like a safety net.

Using eBPF to capture incoming packets and extract data from them gives us a low-overhead, non-intrusive mechanism for capturing and filtering network packets at a granular level. Unlike traditional methods, which often incur significant performance overhead, eBPF lets us extract network data with little impact on system resources.
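To give a feel for what "extracting data from packets" means, here is a minimal user-space sketch in Python that pulls a few header fields out of a raw Ethernet frame. It is only an illustration: the real capture happens inside the kernel in an eBPF program written in C, and the function names (`parse_ipv4_tcp`, `build_sample_frame`) and the dictionary keys are hypothetical, not taken from our code.

```python
import socket
import struct


def parse_ipv4_tcp(frame: bytes) -> dict:
    """Extract a few header fields from a raw Ethernet frame (IPv4 + TCP only)."""
    # Ethernet header: 6-byte dst MAC, 6-byte src MAC, 2-byte EtherType.
    (eth_type,) = struct.unpack("!H", frame[12:14])
    if eth_type != 0x0800:
        raise ValueError("not an IPv4 frame")

    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
    (total_len,) = struct.unpack("!H", ip[2:4])
    ttl, proto = ip[8], ip[9]
    if proto != 6:
        raise ValueError("not a TCP segment")

    # TCP source and destination ports are the first 4 bytes after the IP header.
    sport, dport = struct.unpack("!HH", ip[ihl:ihl + 4])
    return {
        "src": socket.inet_ntoa(ip[12:16]),
        "dst": socket.inet_ntoa(ip[16:20]),
        "sport": sport,
        "dport": dport,
        "ttl": ttl,
        "ip_len": total_len,
    }


def build_sample_frame(src: str, dst: str, sport: int, dport: int) -> bytes:
    """Hypothetical helper: build a minimal frame so the parser can be tried offline."""
    eth = b"\x00" * 6 + b"\x11" * 6 + struct.pack("!H", 0x0800)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0,
                     socket.inet_aton(src), socket.inet_aton(dst))
    tcp = struct.pack("!HHLLBBHHH", sport, dport, 0, 0, 0x50, 0x02, 8192, 0, 0)
    return eth + ip + tcp


if __name__ == "__main__":
    print(parse_ipv4_tcp(build_sample_frame("10.0.0.1", "10.0.0.2", 44321, 443)))
```

An eBPF program performs the same kind of fixed-offset reads, but in C and with explicit bounds checks so that the kernel accepts it.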

The data we extracted using eBPF is shown below:

LSTM

An LSTM, or Long Short-Term Memory network, is a type of RNN (Recurrent Neural Network) that carries a memory from one step to the next, so it can learn long-range dependencies or patterns in the data we supply. Here, we use this powerful model to classify the eBPF data and identify whether it represents a threat.
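To make the "memory" part concrete, here is a minimal sketch of a single LSTM cell step using plain Python scalars. It is not our model: a real network (for example PyTorch's `nn.LSTM`) uses vectors and learned weight matrices, and the parameter names in the dictionary `p` are assumptions chosen for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float, p: dict):
    """One step of a scalar LSTM cell. `p` holds illustrative (not trained) weights."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add new
    h = o * math.tanh(c)     # hidden state exposed to the next layer
    return h, c


def run_sequence(xs, p):
    """Feed a sequence through the cell, carrying (h, c) from step to step."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, p)
    return h, c


if __name__ == "__main__":
    # Assumption: hand-picked weights so the forget gate stays open (f close to 1)
    # and the input gate stays shut (i close to 0), i.e. the cell "remembers".
    keys = ["w_i", "u_i", "b_i", "w_f", "u_f", "b_f",
            "w_o", "u_o", "b_o", "w_g", "u_g", "b_g"]
    remember = dict.fromkeys(keys, 0.0)
    remember.update(b_f=20.0, b_i=-20.0)
    print(lstm_step(5.0, 0.0, 0.7, remember))
```

The forget gate `f` decides how much of the old cell state survives and the input gate `i` decides how much new information is written, which is what lets the network hold on to patterns across many packets.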

We used the open-source UNSW-NB15 dataset, which contains network intrusion data. It has several groups of network-related features: basic features (which describe the connection), time features (such as packet arrival time), flow features (such as the transaction protocol), and so on. We pruned the dataset and kept only the features that are relevant to our model. The following are snippets from our LSTM code:

As for the loss function, we initially used cross-entropy (CE) loss. Because some data categories had very few samples, we got subpar accuracy values. The following is the accuracy we got with CE loss (note that the accuracy is not a percentage, but a value in the range 0 to 1):

To rectify this, we switched over to focal loss, and we got way better results! Focal loss is a loss function that scales down the contribution of examples the model already classifies confidently, so the harder, under-represented data categories carry more weight during training, which helps balance things out. (Here accuracy is in percent):
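For readers who haven't met focal loss before, here is a minimal sketch of the idea in plain Python. It is an illustration and not our training code: it uses the standard form FL = -alpha * (1 - p_t)^gamma * log(p_t), and the default `gamma=2.0` and the optional per-class weights `alpha` are assumptions for this example, not the values we trained with.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target: int) -> float:
    """Standard cross-entropy for one sample: -log(p_t)."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target: int, gamma: float = 2.0, alpha=None) -> float:
    """Focal loss for one sample: -alpha_t * (1 - p_t)^gamma * log(p_t).

    gamma=0 with no alpha reduces to plain cross-entropy.
    alpha is an optional list of per-class weights (assumed, not tuned).
    """
    p_t = softmax(logits)[target]
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = [4.0, 0.0, 0.0]   # model is confident and correct for class 0
    hard = [0.0, 2.0, 0.0]   # model prefers class 1, true class is 0
    for name, logits in (("easy", easy), ("hard", hard)):
        print(name, cross_entropy(logits, 0), focal_loss(logits, 0))
```

Running it shows that the loss on the confidently correct sample shrinks far more than the loss on the misclassified one, so training effort shifts toward the examples the model still gets wrong.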

Final Integration

We pass the packet data captured by eBPF through the pre-trained LSTM to classify different threats. This is a screenshot of the output we've designed:

Tech Stack

eBPF

Python (base code for eBPF), BCC (a toolkit that lets us embed BPF programs written in C inside Python), C (the BPF code itself is written in C), Linux (BCC and the eBPF features we use run on the Linux kernel) and Bash (for scripting).

LSTM

Python (the model is built in Python), PyTorch (library for building and training neural networks), NumPy (library for working with arrays), Pandas (library for reading the dataset's CSV files), scikit-learn (library for working with the dataset: splitting, pruning etc.).

Challenges We Faced

The challenges started with learning the low-level programming that eBPF requires. We ran into issues with probing and with implementing the count feature for the captured packets, and we had to figure out where exactly in the network stack to attach the BPF probe. We encountered many, many errors along the way. Some were solvable, like some segmentation faults and key errors, while others, like the "bit-field requested" errors in the C program, were something we had to work our way around. eBPF programs are sandboxed and must pass the in-kernel verifier before they are JIT-compiled, so there are limits on what we can access. Integrating eBPF with the AI model was also fraught with challenges, as we had to figure out the best way to connect the two so that security risks are minimized. We also had to work out how long each program should run and in what order.
There were challenges in the LSTM modelling part as well. The dataset we used was heavily imbalanced across attack categories, so we had to figure out which loss function would counteract this and bring the accuracy up to a good percentage.
There were, of course, a lot of dimension errors and runtime errors that we fixed along the way.
Kernel programs are extremely hard to debug, so it was challenging to identify each problem and debug at a kernel/OS level, especially with our limited prior experience.

Tracks

AI/ML

A major part of our project was building an LSTM model to classify threats and detect malicious network traffic. This model, with the accuracy it reached after we switched to focal loss, carries the detection part of the project. It predicts whether a network flow constitutes an attack, which has great potential for network security since it cuts down the overhead and resources that detection requires.
So, we feel it fits right in the AI/ML track.

Open Innovation

Our project builds upon a number of research papers as well as some problems that we noticed in traditional Network Intrusion Detection Systems. We feel that the Open Innovation track is appropriate for our project because it aims to solve a real-world problem that most of us face, even without being constantly aware of it.
Our project offers a more robust intrusion detection and prevention system, leading to better protection against cyberattacks. This translates to greater security for individuals and society, as cyberattacks can target critical infrastructure and personal data.