What did you learn this week? July 12, 2019

#twil #todayilearned #discuss

What did y'all learn this week?

Oldest comments (2)

Benjamin Trent • Jul 12 '19

How Cuckoo Filters work

Cuckoo filters are a probabilistic data structure named after the cuckoo bird due to the propensity for some sub-species stealing other birds nests.

Cuckoo Filters provide the ability to REMOVE entries from their hash table. This is something that has been attempted through extensions to Bloom Filters. But the attempts usually end up with a complicated, or sub-optimal solution. In addition to being able to delete entries, Cuckoo Filters have a small memory foot print, comparative false positive rates, and are intuitively easy to understand.

You can read the paper here. Some important implementation highlights:

A cuckoo filter is a compact variant of a cuckoo hashtable [21] that stores only fingerprints—a bit string derived from the item using a hash function—for each item inserted

cuckoo filters use partial-key cuckoo hashing to find an item’s alternate location based on only its fingerprint

For an item x, our hashing scheme calculates the indexes of the two candidate buckets as follows: h1(x)=hash(x),h2(x)=h1(x)⊕hash(x’s fingerprint). The xor operation...ensures an important property: h1(x) can also be calculated from h2(x) and the fingerprint using the same formula. In other words, to displace a key originally in bucket i (no matter if i is h1(x) or h2(x)), we directly calculate its alternate bucket j from the current bucket index i...Hence, an insertion only uses information in the table, and never has to retrieve the original item x.

I recommend reading the whole paper. Its well written, pretty easy to grok, and throws some serious shade at Bloom filters.
P.S. Dig this nifty visualization

Better searching of Bash history

I have known about ctrl+r for a while as a way to reverse search your bash history. But, I did not know I could press ctrl+r again to scroll FURTHER back in history. I was just making my reverse search more specific all these years when I could just be pressing ctrl+r again! Also, ctrl+s allows you to search forward in history as well. No more history | grep <my search> for me :D.

How the Linux auditing system logs events

At my current company we have a project called auditbeat that tails the Linux Audit log and parses into a useful format for data shipping. Curiosity struck and I looked into how audit logs are written and formatted. As with the majority of interesting and useful things in Linux related tech; there is an amazing Digital Ocean article explaining the auditing system. I ❤️ Digital Ocean and their community tutorials.

How to calculate the Coefficient of determination (R²⁾

I was recently tasked to write up an efficient way to calculate the R² metric for evaluating regression type models. While trying to get this done with existing Elasticsearch aggregation, I found out that R^2 = 1 - (ResidualSumOfSquares/TotalSumOfSquares). Both of these variables can be calculated in two unique aggregations:

"aggs": {
    "sum_of_residuals": {
      "sum": {
        "script": "def diff = doc['field_actual'].value - doc['field_predicted'].value;return diff * diff;"
      }
    },
    "statistics": {
      "extended_stats": {
        "field": "field_predicted"
      }
    }
  }

Now, all you need is 1 - sum_of_residuals.value / statistics.variance * statistics.count since TotalSumOfSquares(X) = variance(X) * |X|