Sylvain Hellegouarch for Reliably

eBPF for SRE with Reliably

eBPF is a funny piece of technology: it is based on BPF, which is almost as old as Linux itself, and yet eBPF has been trending heavily for the past couple of years.

In my book, eBPF is a system event generator. By tapping into that event pool and listening at the right level, you can gain tons of insight into your system.

Funnily enough, SRE has also been trending for a couple of years, and it is precisely about the health of the system.

So, could these two be made for each other? Well, maybe not in such candid terms, but there is something very appealing about bringing them closer.

SRE introduced Service Level Objectives (SLOs) and Service Level Indicators (SLIs). An SLO encodes what good looks like for the aspects of your system that matter to you. An SLI encodes the metrics that are aggregated and compared against the SLO's target. eBPF is an interesting data source for indicators.

For instance, say you have this eBPF program (via BCC):

#include <uapi/linux/ptrace.h>
#include <net/sock.h>
#include <bcc/proto.h>

#define IP_TCP  6

int http_filter(struct __sk_buff *skb) {
    u8 *cursor = 0;

    // let's not care for anything that is not IPv4 (EtherType 0x0800) or not TCP
    struct ethernet_t *ethernet = cursor_advance(cursor, sizeof(*ethernet));
    if (!(ethernet->type == 0x0800)) {
        return 0;
    }

    struct ip_t *ip = cursor_advance(cursor, sizeof(*ip));
    if (ip->nextp != IP_TCP) {
        return 0;
    }

    return -1;
}

Simple socket filtering really. We only keep TCP packets to inspect them in user-land and ignore the rest.
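
To make the filter's decision easier to follow, here is a minimal sketch of the same two checks in plain Python over raw frame bytes. It is only an illustration and not part of the original code: `keep_frame` and `fake_frame` are hypothetical names, and the sketch assumes untagged Ethernet frames (no VLAN header) carrying IPv4.

```python
import struct

ETH_HEADER_LEN = 14      # destination MAC (6) + source MAC (6) + EtherType (2)
IPV4_MIN_HEADER_LEN = 20
ETH_TYPE_IPV4 = 0x0800
IP_PROTO_TCP = 6


def keep_frame(frame: bytes) -> bool:
    """Return True when the frame is IPv4 over Ethernet carrying TCP."""
    # too short to hold an Ethernet header plus a minimal IPv4 header
    if len(frame) < ETH_HEADER_LEN + IPV4_MIN_HEADER_LEN:
        return False

    # the EtherType is the last two bytes of the Ethernet header
    (eth_type,) = struct.unpack_from("!H", frame, 12)
    if eth_type != ETH_TYPE_IPV4:
        return False

    # the protocol number is the tenth byte of the IPv4 header
    protocol = frame[ETH_HEADER_LEN + 9]
    return protocol == IP_PROTO_TCP


def fake_frame(eth_type: int, protocol: int) -> bytes:
    """Build a zero-filled frame with only the two tested fields set."""
    frame = bytearray(ETH_HEADER_LEN + IPV4_MIN_HEADER_LEN)
    struct.pack_into("!H", frame, 12, eth_type)
    frame[ETH_HEADER_LEN + 9] = protocol
    return bytes(frame)
```

The eBPF version makes this decision in the kernel, so frames that fail either check are never copied to user-land.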

From user-land, we now have a Python program that injects this program into the kernel:

import socket
from contextlib import contextmanager
from typing import Optional

from bcc import BPF
from pypacker import psocket


class MySocketHndl(psocket.SocketHndl):
    def __init__(self, b: BPF, timeout: Optional[float] = None, iface: str = "lo"):
        """
        BCC gives us a socket to listen on. Bind to it.
        """
        function_http_filter = b.load_func("http_filter", BPF.SOCKET_FILTER)
        BPF.attach_raw_socket(function_http_filter, iface)
        socket_fd = function_http_filter.sock

        self._socket = socket.fromfd(
            socket_fd, socket.PF_PACKET, socket.SOCK_RAW,
            socket.IPPROTO_IP)

        # blocks forever when timeout is None
        self._socket.settimeout(timeout)

@contextmanager
def bpfsock(iface: str = "lo"):
    """
    Loads the BPF program and starts listening on the socket we attach
    to the interface. Cleanup when finished.
    """
    # create each resource before its own try block so that cleanup never
    # touches a name that was not assigned
    b = BPF(src_file="ebpf.c")
    try:
        psock = MySocketHndl(b=b, iface=iface)
        try:
            yield psock
        finally:
            psock.close()
    finally:
        b.cleanup()

We simply attach to the interface and add our filter to the socket used to listen on it. Now we can process packets as we see them. We dismiss any packet we don't care about, here anything that is not going to or coming from our HTTP server's port, and we parse the remaining packets as HTTP requests/responses, using the awesome pypacker.

from pypacker.layer12 import ethernet
from pypacker.layer3 import ip
from pypacker.layer4 import tcp
from pypacker.layer567 import http


def filter_pkt(eth: ethernet.Ethernet, target_port: int = 8000) -> bool:
    """
    Process only packets that are going to or from the target server.
    """
    if eth[ethernet.Ethernet, ip.IP, tcp.TCP] is not None:
        tcp_p = eth[tcp.TCP]
        if tcp_p.dport == target_port or tcp_p.sport == target_port:
            return True
    return False

def process():
    with bpfsock(iface="lo") as psock:
        for pkt in psock.recvp_iter(filter_match_recv=filter_pkt):
            # parse the TCP payload as an HTTP request or response
            h = http.HTTP(pkt[tcp.TCP].body_bytes)

Boom, we're gold!

From there on, all we have to do is collect some information about the requests/responses we see (duration, status code, path requested...) and aggregate ratios over the time window we are interested in tracking.

total_count = class_2xx = good_latency_count = 0
for req in requests:
    # only count requests that ended within the current window
    if not (window_start <= req["end"] < window_end):
        continue
    total_count += 1
    # our SLO latency is 150ms
    if req["duration"] <= 0.15:
        good_latency_count += 1
    # any status in the 2xx class counts as a good response
    if 200 <= req["status"] < 300:
        class_2xx += 1

# nothing to report when no request ended in this window
if total_count > 0:
    indicators.put(
        (
            "availability", last_push, next_push, path,
            100.0 * (class_2xx / total_count)
        )
    )
    indicators.put(
        (
            "latency", last_push, next_push, path,
            100.0 * (good_latency_count / total_count)
        )
    )

Nothing fancy here.
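
The aggregation above reads `end`, `duration` and `status` from each entry in `requests`. How those entries get built is not shown here, so the following is only a rough sketch under a few assumptions: timestamps are in seconds, each connection has at most one request in flight (no pipelining), and `RequestTracker`, `on_request`, `on_response` and the `start` field are hypothetical names introduced for illustration.

```python
import time
from typing import Dict, List, Optional, Tuple

# assumption: a connection is identified by (client address, client port)
ConnKey = Tuple[str, int]


class RequestTracker:
    """Pairs each response with the request seen earlier on the same connection."""

    def __init__(self) -> None:
        self._pending: Dict[ConnKey, dict] = {}
        self.requests: List[dict] = []

    def on_request(self, conn: ConnKey, path: str,
                   now: Optional[float] = None) -> None:
        # remember when the request was seen so we can time the response
        start = time.time() if now is None else now
        self._pending[conn] = {"path": path, "start": start}

    def on_response(self, conn: ConnKey, status: int,
                    now: Optional[float] = None) -> None:
        req = self._pending.pop(conn, None)
        if req is None:
            # a response for a request we never saw: ignore it
            return
        end = time.time() if now is None else now
        req["end"] = end
        req["duration"] = end - req["start"]
        req["status"] = status
        self.requests.append(req)
```

The `now` parameter exists only so the timing can be driven explicitly, for example from packet timestamps or in tests.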

Now that we have our indicators, we can send them to Reliably to generate our SLO results:

def send_indicators():
    indicator_type, from_ts, to_ts, path, value = indicators.get()

    headers = {"Authorization": f"Bearer {TOKEN}"}
    indicator = {
        "metadata": {
            "labels": {
                "category": indicator_type,
                "path": path
            }
        },
        "spec": {
            "from": f"{from_ts.isoformat()}Z",
            "to": f"{to_ts.isoformat()}Z",
            "percent": value
        }
    }

    if indicator_type == "latency":
        indicator["metadata"]["labels"]["percentile"] = "100"
        indicator["metadata"]["labels"]["latency_target"] = "150ms"

    httpx.put(reliably_url, headers=headers, json=indicator)

Again, nothing fancy.
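
One detail worth a note: appending `Z` to `isoformat()` only yields a well-formed UTC timestamp when the datetime is naive and already expressed in UTC. For a timezone-aware datetime, `isoformat()` already includes the offset, so the result would end in something like `+00:00Z`. A small helper along these lines would cover both cases. `to_utc_z` is a hypothetical name, and treating naive datetimes as UTC is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone


def to_utc_z(ts: datetime) -> str:
    """Format a datetime as an ISO 8601 UTC timestamp ending in 'Z'."""
    if ts.tzinfo is None:
        # assumption: naive datetimes are already expressed in UTC
        ts = ts.replace(tzinfo=timezone.utc)
    # convert to UTC, then drop the offset so isoformat() does not emit "+00:00"
    ts = ts.astimezone(timezone.utc).replace(tzinfo=None)
    return f"{ts.isoformat()}Z"


# a naive UTC value and an aware value two hours ahead of UTC
naive = datetime(2021, 5, 1, 12, 0, 0)
aware = datetime(2021, 5, 1, 14, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(to_utc_z(naive))  # 2021-05-01T12:00:00Z
print(to_utc_z(aware))  # 2021-05-01T12:00:00Z
```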

At this stage, Reliably can now generate SLO results you can start viewing using the Reliably CLI:

$ reliably slo report
Refreshing SLO report every 3 seconds. Press CTRL+C to quit.
                                                                  Current  Objective  / Time Window     Type             Trend      
  Service #1: ebpf-2021-demo                                 
  ❌ 99% of the responses are under 150ms                          98.99%      99.5%  / 10s             Latency          ✕ ✕ ✕ ✓ ✕  
  ❌ 99% of the responses to our users are in the 2xx class        98.66%        99%  / 10s             Availability     ✓ ✓ ✕ ✓ ✕  

Kaboom! We have now successfully mapped low-level eBPF events to high-level SLO constructs.
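
How Reliably computes these results internally is not covered here. As a rough mental model only, assuming an objective counts as met when the indicator is at or above it, the two rows of the report could be read like this (`slo_met` and `trend` are hypothetical helpers, not part of the Reliably CLI or API):

```python
from typing import List


def slo_met(current: float, objective: float) -> bool:
    """Assumption: an objective is met when the indicator is at or above it."""
    return current >= objective


def trend(values: List[float], objective: float) -> str:
    """Render one mark per window, oldest first."""
    return " ".join("✓" if slo_met(v, objective) else "✕" for v in values)


# values taken from the report above
print(slo_met(98.99, 99.5))  # False: the latency objective is missed
print(slo_met(98.66, 99.0))  # False: the availability objective is missed
```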

Obviously, this is a rather trivial showcase, but it's promising. Nevertheless, there is some road to cover before the whole process becomes more attractive, as eBPF's UX is perhaps not as transparent as one could hope for.

Still, so much fun!

The code can be found at https://github.com/Lawouach/ebpf-2021-talk.