DEV Community


How Stripe is actioning the osquery API at scale [osquery@scale]

Edoardo Tenani
Developer by day, curious by night. What I do care most are people. I started as Full Stack developer on a Ruby On Rails project. Now focusing on bridging development with operations and security.

OSquery is an operating system instrumentation framework for Windows, OS X (macOS), Linux, and FreeBSD. The tools make low-level operating system analytics and monitoring both performant and intuitive.
From osquery website

I recently watched this talk from osquery@scale.

My interest in OSquery started some years ago, as the premise is super cool: turn your system into a SQL database you can query with the usual SQL syntax.

This integrates very well with standard data analytics tooling and processes, so it is a powerful way to gather data at scale for review, analysis and alerting.

Want to know which processes are executing on a machine? Or how many users there are? Any intrusion detection rule you want to check?
✔ OSquery can do this, and a lot more
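
To make those two questions concrete, here is a minimal sketch of how they could be asked from Python. It assumes the `osqueryi` shell is installed and on the PATH, and it relies on the `--json` output flag and the standard `processes` and `users` tables. The helper names (`run_osquery`, `running_processes`, `user_count`) are my own and not something shown in the talk.

```python
import json
import subprocess


def run_osquery(sql, runner=subprocess.run):
    """Run one SQL statement through osqueryi and return the rows as dicts.

    `runner` is injectable so the helper can be exercised without osquery
    installed (hypothetical convenience, not part of osquery itself).
    """
    result = runner(
        ["osqueryi", "--json", sql],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def running_processes(runner=subprocess.run):
    # `processes` is a standard osquery table: one row per running process.
    return run_osquery("SELECT pid, name, path FROM processes;", runner)


def user_count(runner=subprocess.run):
    # `users` is a standard osquery table; JSON values come back as strings.
    rows = run_osquery("SELECT COUNT(*) AS total FROM users;", runner)
    return int(rows[0]["total"])
```

On a machine with osquery installed, `user_count()` returns the number of local users and `running_processes()` returns a list of dictionaries, one per process.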

As such, I was really interested in how OSquery is leveraged in the real world for security monitoring at scale.

The talk presents how OSquery fits into a global-scale security effort at both the laptop and server level. Highly rewarding!


Challenge at scale: creating a security ecosystem where OSquery is a critical component

  • Every environment has different tools, sensors, logs and storage
  • Most tools have their own GUI and query language
  • OSquery's use of SQL relates easily to other business flows that have analytics value

But it needs to play well with other elements: OSquery is only one piece of Security @ Stripe.

3 goals:

  1. eliminate the need to recreate detection logic for different tools
  2. enable other teams who need information but have no access to security tools
  3. centralize a detection methodology that everyone can use without access to all machines and with a reduced skill set (less to learn = more effective)

On laptops, OSquery is managed using Chef, where logs are collected.
On servers, OSquery is managed by Puppet: queries are pushed through Puppet, and logs are stored in AWS S3 and forwarded to Splunk.

Design criteria:

  • generalize security for our environment
  • extensible framework
  • detection logic independent of tooling
  • skills required: Python and SQL
  • easy to collect metrics

Borrowing a practice from Data Science, they started using Jupyter Notebooks, writing libraries that codify the logic. They have a centralized notebook server (so sensitive information can be protected) and use GitHub repos (for collaborative peer review).

OSquery is cool, but it gets better when you can correlate its results with other data.

The second part of the presentation is about how to detect meaningful events and reduce alert fatigue.

They started by building a baseline (average size of CLI commands) to detect anomalies: grab data over a timeframe and decorate it.
Once you have a baseline, detecting anomalies is easier.
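
The baseline idea can be sketched in a few lines of Python. This is only my illustration, not the code from the talk: the function names, the use of a population standard deviation, and the threshold of three standard deviations are all assumptions.

```python
from statistics import mean, pstdev


def build_baseline(commands):
    """Summarise command-line lengths seen during the baseline window."""
    lengths = [len(command) for command in commands]
    return {"mean": mean(lengths), "stdev": pstdev(lengths)}


def find_anomalies(commands, baseline, threshold=3.0):
    """Return commands longer than mean + threshold * stdev of the baseline.

    The threshold of 3.0 is an arbitrary starting point (assumption) and
    would need tuning against real data.
    """
    limit = baseline["mean"] + threshold * baseline["stdev"]
    return [command for command in commands if len(command) > limit]
```

For example, a baseline built from short everyday commands would flag a 200-character one-liner while leaving `ls` alone.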

They created a repo of detections classified by attack stage (loosely based on the MITRE ATT&CK framework) and extended it to include custom detections. This allows them to provide READMEs as mini-runbooks for specific detection groups. (example rule in the video)

They generalized the metadata for detections (independently of the underlying system). Each rule is automatically deployed to a detection engine, so all rules stay centralized in a single repo.
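
To give an idea of what system-independent detection metadata could look like, here is a rough sketch. Every field name here is hypothetical and chosen by me for illustration; the real schema is the one shown in the video. The example query uses the `on_disk` column of osquery's `processes` table to find processes whose binary is no longer on disk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    # All field names are hypothetical, not Stripe's actual schema.
    name: str
    attack_stage: str            # e.g. a stage loosely based on MITRE ATT&CK
    language: str                # which rule language the body is written in
    rule: str                    # the rule body itself
    runbook: str = "README.md"   # mini-runbook for the detection group


def group_by_stage(detections):
    """Index detection names by attack stage, as a repo layout might."""
    grouped = defaultdict(list)
    for detection in detections:
        grouped[detection.attack_stage].append(detection.name)
    return dict(grouped)


EXAMPLE = Detection(
    name="process_binary_missing_from_disk",
    attack_stage="defense-evasion",
    language="sql",
    rule="SELECT pid, name, path FROM processes WHERE on_disk = 0;",
)
```

Keeping the metadata separate from the rule body is what would let the same repo feed different engines without rewriting the detection logic.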

They prefer JS-based rules, falling back to SQL when that is not possible. (example engine in the video)
