DEV Community

tmns
tmns

Posted on

Using Bitmapist for Super Fast Analytics

Background

Let's say we have built a CRM application, which keeps track of the various metadata logged by users when they use one of our other apps. For example this data could be that the user is using an iPhone, that they have logged in, or that they are within the age range of 30-40.

If we have millions of users and consequently millions of pieces of metadata - what is an efficient way for us to find groups of our users that meet certain criteria (also called cohorts)? For example, say we want to send a newsletter out to all of our users who are within the age range of 30-40 and who have logged in today - how do we determine this group of users?

One solution that tends to be very memory-efficient is to use Redis bitmaps. The concept of bitmaps, specifically in regards to Redis, has been explained rather thoroughly quite a few times already so I won't go into depth here. If you are hungry for all the juicy details - I suggest the following well-known articles as a good place to start:

The tl;dr though, is that every time a user registers and lets us know they are in the age range of 30-40, we can set a key-value pair in Redis like so: SETBIT('agerange#3040', user_id, 1).

Similarly, once the user logs in, we can set another key-value pair: SETBIT('login:today', user_id, 1).

Then, finding the cohort becomes a matter of performing a bitwise AND operation on the two keys, like so: BITOP('AND', '3040logintoday', 'agerange#3040', 'login:today')

And just like that we have determined all the users we want to send our newsletter to!

At this point, you may be saying something along the lines of, "Ya, that's nice and all but Redis.. Bitmaps.. keys.. values.. Bitwise operations.. my head hurts already!". This is a totally understandable reaction - with a totally awesome rebuttal. Enter Bitmapist.

Bitmapist

Bitmapist is a Python library developed by the folks at Doist that simplifies working with Redis bitmaps. Further, they have even gone as far as also developing their own optimized Bitmapist server that implements a subset of Redis operations used by Bitmapist while relying on compressed bitmaps representation and only keeping hot dataset in memory to improve memory efficiency by 443 times!

What does setting all this up look like? Let's give it a spin!

First, let's install the bitmapist server

$ git clone https://github.com/Doist/bitmapist-server.git
$ cd bitmapist-server
$ go build
go: downloading github.com/artyom/autoflags v1.1.0
go: downloading github.com/artyom/resp v1.0.0
go: downloading github.com/golang/snappy v0.0.1
[...]
Enter fullscreen mode Exit fullscreen mode

Once go finishes building the server, you should see a bitmapist-server executable in the same directory. You can run it as it is, or supply an optional -addr option if you want it to listen somewhere other than the default localhost:6379. Running the executable should result in some output like so:

$ ./bitmapist_server
2020/01/25 15:22:09 loading data from bitmapist.db
2020/01/25 15:22:09 loaded in 4.922263ms
2020/01/25 15:25:09 periodic saving...
2020/01/25 15:25:09 saved in 11ms
Enter fullscreen mode Exit fullscreen mode

At this point you have your bitmapist server all set up! You can even use redis-cli to connect to it:

$ redis-cli
127.0.0.1:6379> keys **
(empty list or set)
Enter fullscreen mode Exit fullscreen mode

As we can see, we don't have any data in our server yet - let's change that by installing and running Bitmapist!

$ pip install redis bitmapist4
$ python
>>> import bitmapist4
>>> b = bitmapist4.Bitmapist('redis://localhost:6379', key_prefix='bitmapist__')
Enter fullscreen mode Exit fullscreen mode

The key_prefix parameter above ensures that every key you set with this session will be prepended with the value you choose, in this case bitmapist__. We currently use this functionality as a means of compartmentalizing different applications. That is, we set the key_prefix to be the application's uuid so that way we only deal with that application's data (key-value pairs when we perform our Bitmapist routines.

Let's perform the keys ** equivalent in Bitmapist:

>>> b.get_event_names()
[]
Enter fullscreen mode Exit fullscreen mode

As expected, it's simply an empty list. Let's mark some events!

>>> b.mark_unique('agerange#3040', 1)
>>> b.mark_unique('agerange#3040', 2)
>>> b.mark_unique('agerange#2030', 3)
>>> b.mark_event('login', 1)
>>> b.get_event_names()
['_agerange#2030', '_agerange#3040', '_login']
Enter fullscreen mode Exit fullscreen mode

So now we've told our server that user 1 and 2 is within the age of 30 and 40, user 3 is within the age of 20 and 30, and user 1 has logged in today. On the redis-cli side, keys ** now returns:

127.0.0.1:6379> keys **
1) "bitmapist__agerange#2030"
2) "bitmapist__agerange#3040"
3) "bitmapist__login_2020-1"
4) "bitmapist__login_W2020-4"
5) "bitmapist__login_2020-1-25"
6) "bitmapist__login"
Enter fullscreen mode Exit fullscreen mode

You may be wondering why there are extra keys in our server and also why we only stored the key login and not login:today. The answer to both these questions is part of what makes Bitmapist so awesome. When using mark_event(), Bitmapist automatically stores a key with just the event name, a key with the event name and with appended date values for the current day, week, month, and year.

This in turn allows you to perform calls such as the following to find out if an event has taken place within a certain time period:

>>> b.YearEvents('login')
YearEvents("login", 2020)
>>> b.MonthEvents('login')
MonthEvents("login", 2020, 1)
>>> b.WeekEvents('login')
weekEvents("login", 2020, 4)
>>> b.DayEvents('login')
DayEvents("login", 2020, 1, 25)
>>> b.HourEvents('login')
HourEvents("login", 2020, 1, 25, 15)
Enter fullscreen mode Exit fullscreen mode

This is great, as it significantly decreases our mental load when creating and managing our keys. Now, going back to our original problem, let's find all our users who are in within the age range of 30 and 40 and who have logged in today:

>>> cohort = b.BitOpAnd(b.UniqueEvents('agerange#3040'), b.DayEvents('login'))
>>> for user in cohort:
...    print(user)
1
Enter fullscreen mode Exit fullscreen mode

Okay, so it turns out we will only be sending our newsletter to user 1 today, which is a bit anticlimactic.. BUT the cool thing to note here is that Bitmapist also provides bitwise operation helper functions, such as BitOpAnd used above. Alternatively, you can also use the operator itself like so:

>>> cohort = b.UniqueEvents('agerange#3040') & b.DayEvents('login')
Enter fullscreen mode Exit fullscreen mode

Pretty cool, eh? That's really just the tip of the iceberg though; check out the project's github page for more examples and features, like the cohort table!

Bonus

What if you've read the gitub page, scoured the source code, and still feel like Bitmapist is missing something? Luckily, the people behind it seem to be pretty into clean and organized code, because customizing it is a breeze.

For example, we have use cases where we need to calculate populations of users like so:

  • How many users were using an android device between the dates of 16-01-20 and 19-01-20?
  • How many users were using an ios device between February and May?

Bitmapist doesn't provide a class to handle this type of event search, so I added my own! I accomplished this by editing only two of the project's files: bitmapist4/events.py and bitmapist4/core.py. First, in events.py:

class RollingEvents(BaseEvents):
    """
    Events over a specified amount of time.

    Example:

        date = datetime.strptime('19-02-19', '%d-%m-%y')
        RollingEvents('event_name', 'day', date, 36)
    """

    def __init__(self, event_name, scale, dt=None, delta=0):
        self.event_name = event_name
        self.scale = scale
        self.dt = dt or datetime.datetime.utcnow()
        self.delta = delta

        events = []
        if scale == 'day':
            events.append(self.bitmapist.DayEvents.from_date(event_name, dt))
            for i in range(delta):
                events.append(self.bitmapist.DayEvents.from_date(event_name, dt).delta(-(i + 1)))
        elif scale == 'week':
            events.append(self.bitmapist.WeekEvents.from_date(event_name, dt))
            for i in range(delta):
                events.append(self.bitmapist.WeekEvents.from_date(event_name, dt).detla(-(i + 1)))
        elif scale == 'month':
            events.append(self.bitmapist.MonthEvents.from_date(event_name, dt))
            for i in range(delta):
                events.append(self.bitmapist.MonthEvents.from_date(event_name, dt).delta(-(i + 1)))
        elif scale == 'year':
            events.append(self.bitmapist.YearEvents.from_date(event_name, dt))
            for i in range(delta):
                events.append(self.bitmapist.YearEvents.from_date(event_name, dt).delta(-(i + 1)))
        if events:
            if len(events) == 1:
                self.redis_key = events[0].redis_key
            else:
                or_op = self.bitmapist.BitOpOr(*events)
                self.redis_key = or_op.redis_key
[...]
Enter fullscreen mode Exit fullscreen mode

Then, in core.py:

class Bitmapist(object):
[...]
    def __init__(self,
[...]
        kw = {'bitmapist': self}
        self.UniqueEvents = type('UniqueEvents', (ev.UniqueEvents, ),
                                  kw)  # type: Type[ev.UniqueEvents]
        self.RollingEvents = type('RollingEvents', (ev.RollingEvents, ),
                                  kw) # type; Type[ev.RollingEvents]
[...]
Enter fullscreen mode Exit fullscreen mode

With those changes, I could solve the first problem mentioned above by using the RollingEvents class like I do any other:

>>> len(b.DayEvents('os:android').delta(-6))
66
>>> len(b.DayEvents('os:android').delta(-7))
240
>>> len(b.DayEvents('os:android').delta(-8))
319
>>> len(b.DayEvents('os:android').delta(-9))
255
>>> six_days_ago = datetime.datetime.utcnow() - datetime.timedelta(6)
>>> len(b.RollingEvents('os:android', 'day', six_days_ago, 3))
880
Enter fullscreen mode Exit fullscreen mode

Great success!

Conclusion

Bitmapist and the Bitmapist server both make reasoning about and implementing bitmap-based datasets straight-forward and fun. They also lower the knowledge barrier required to start experimenting with such datasets, as they abstract away all the Redis commands that are typically quite daunting to newcomers.

Further, they are written well and comprise code easy to navigate, which provides for quick customization if needed.

Keep up the great work Doist!

Discussion (0)