Artyom Keydunov for Cube

Posted on Feb 21, 2019 • Edited on Jul 2, 2019 • Originally published at statsbot.co

Building a Serverless Mixpanel Alternative. Part 1: Collecting and Displaying Events

#serverless #node #opensource #javascript

This is the first part of a tutorial series on building an analytical web application with Cube.js. It expects the reader to be familiar with JavaScript, Node.js, and React, and to have basic knowledge of SQL. The final source code is available here and the live demo is here. The example app is serverless and runs on AWS Lambda. It displays data about its own usage.

There is a category of analytics tools, such as Mixpanel and Amplitude, that are good at working with event data. They are ideal for measuring product or engagement metrics, such as activation funnels or retention. They are also very useful for measuring A/B tests.

Although all these tools do the job, they are proprietary and cloud-based. That can be a problem when privacy is a concern, or when one wants to customize how funnels or retention work under the hood. Traditional BI tools, like Tableau or Power BI, could potentially run the same analysis, but they cannot offer the same level of user experience: they are designed as general business intelligence tools, not specifically for funnels, retention, A/B tests, and so on.

With recent advancements in frontend development, it has become possible to rapidly develop complex user interfaces. Things that took a week to build five years ago can now be built in an afternoon. On the backend and infrastructure side, cloud-based MPP databases, such as BigQuery and Athena, are dramatically changing the landscape. The ELT approach, in which data is transformed inside the database, is becoming increasingly popular and is replacing traditional ETL. Serverless architecture makes it easy to deploy and scale applications.

All of these made it possible to build internal alternatives to established services like Mixpanel, Amplitude, or Kissmetrics. In this series of tutorials, we’re going to build a full-featured open-source event analytics system.

It will include the following features:

Data collection;
Dashboarding;
Ad hoc analysis with query builder;
Funnel analysis;
Retention analysis;
Serverless deployment;
A/B tests;
Real-time event monitoring.

The diagram below shows the architecture of our application:

In the first part of our tutorial, we’ll focus on how to collect and store data, and briefly cover how to make a simple chart based on this data. The following parts focus more on querying data and building various analytics reporting features.

Collecting Events

We’re going to use the Snowplow CloudFront Collector and JavaScript Tracker. First, we need to upload a tracking pixel to the Amazon CloudFront CDN. The Snowplow Tracker sends data to the collector by making a GET request for the pixel and passing the data as query string parameters. The CloudFront Collector uses CloudFront logging to record the request (including the query string) to an S3 bucket.
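To make the mechanism concrete, here is a minimal sketch of the kind of GET request a tracker issues. The host name, the `/i` pixel path, and the `buildPixelUrl` helper are assumptions for illustration; the parameter names `e`, `eid`, and `page` are the ones the schema later in this article extracts. The real tracker sends many more fields.

```javascript
// Hypothetical helper: builds a pixel URL the way a tracker might.
// The collector host and the "/i" pixel path are assumptions, not values from this tutorial.
function buildPixelUrl(collectorHost, params) {
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join('&');
  return `https://${collectorHost}/i?${query}`;
}

const pixelUrl = buildPixelUrl('example.cloudfront.net', {
  e: 'pv',                       // event type: page view
  eid: 'hypothetical-event-id',  // unique event id
  page: 'Home Page'              // page title
});

console.log(pixelUrl);
```

Because everything travels in the query string, the collector itself needs no code: the CDN access log already contains the full event.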

Next, we need to install the JavaScript Tracker. Here is the full guide.

In short, it is similar to the Google Analytics or Mixpanel tracking code: we just need to embed it into our HTML page.

<script type="text/javascript">      
  ;(function(p,l,o,w,i,n,g){if(!p[i]){p.GlobalSnowplowNamespace=p.GlobalSnowplowNamespace||[];
   p.GlobalSnowplowNamespace.push(i);p[i]=function(){(p[i].q=p[i].q||[]).push(arguments)
   };p[i].q=p[i].q||[];n=l.createElement(o);g=l.getElementsByTagName(o)[0];n.async=1;
   n.src=w;g.parentNode.insertBefore(n,g)}}(window,document,"script","//d1fc8wv8zag5ca.cloudfront.net/2.10.2/sp.js","snowplow"));

  window.snowplow('newTracker', 'cf', '<YOUR_CLOUDFRONT_DISTRIBUTION_URL>', { post: false });
</script>

Here you can find how it is embedded into our example application.
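The loader snippet only defines a `snowplow` function and creates a tracker; events still have to be sent with calls such as `window.snowplow('trackPageView')`. The sketch below is a simplified stand-in for the loader's queueing behavior (not the loader itself). It shows that calls made before `sp.js` finishes loading are stored in a `q` array so they can be processed later. `createQueueStub` is a hypothetical name, and the host is a placeholder.

```javascript
// Simplified stand-in for the loader stub: every call is queued in `q`.
function createQueueStub() {
  const stub = function () {
    stub.q.push(Array.from(arguments));
  };
  stub.q = [];
  return stub;
}

const snowplow = createQueueStub();

// The same calls a page would make; the host is a placeholder.
snowplow('newTracker', 'cf', 'example.cloudfront.net', { post: false });
snowplow('trackPageView');

// Each queued entry is the argument list of one call.
console.log(snowplow.q.map((args) => args[0]));
```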

Once our data (the CloudFront logs) is in the S3 bucket, we can query it with Athena. All we need to do is create a table for the CloudFront logs.

Copy and paste the following DDL statement into the Athena console. Change LOCATION to point to the S3 bucket that stores your logs.

CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  requestip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT,
  referrer STRING,
  useragent STRING,
  querystring STRING,
  cookie STRING,
  resulttype STRING,
  requestid STRING,
  hostheader STRING,
  requestprotocol STRING,
  requestbytes BIGINT,
  timetaken FLOAT,
  xforwardedfor STRING,
  sslprotocol STRING,
  sslcipher STRING,
  responseresulttype STRING,
  httpversion STRING,
  filestatus STRING,
  encryptedfields INT
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
LOCATION 's3://CloudFront_bucket_name/AWSLogs/Account_ID/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
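To see how a raw log line maps onto these columns, the following sketch splits a tab-delimited line in the same order as the table definition. The sample data is made up, only the first twelve columns are mapped, and `parseLogLines` is a hypothetical helper. It drops the first two lines of the file, mirroring `skip.header.line.count`.

```javascript
// First columns of the table above, in order.
const COLUMNS = [
  'date', 'time', 'location', 'bytes', 'requestip', 'method',
  'host', 'uri', 'status', 'referrer', 'useragent', 'querystring'
];

// Hypothetical helper: skips the two header lines and maps fields to column names.
function parseLogLines(fileContent) {
  return fileContent
    .split('\n')
    .slice(2) // same effect as skip.header.line.count = 2
    .filter((line) => line.length > 0)
    .map((line) => {
      const fields = line.split('\t');
      const row = {};
      COLUMNS.forEach((name, index) => {
        row[name] = fields[index];
      });
      return row;
    });
}

// Made-up sample in the shape of a log file: two header lines, then one request.
const sample = [
  '#Version: 1.0',
  '#Fields: (column names)',
  ['2019-02-21', '10:15:00', 'SFO5', '480', '192.0.2.1', 'GET',
    'example.cloudfront.net', '/i', '200', '-', 'Mozilla/5.0', 'e=pv&page=Home'].join('\t')
].join('\n');

const rows = parseLogLines(sample);
console.log(rows[0].uri, rows[0].querystring);
```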

Now we are ready to connect Cube.js to Athena and start building our first dashboard.

Building Our First Chart

First, install Cube.js CLI. It is used for various Cube.js workflows.

$ npm install -g cubejs-cli

Next, create a new Cube.js service by running the following command. Note that we are specifying Athena as the database (-d athena) and serverless as the template (-t serverless). Cube.js supports different configurations, but for this tutorial, we will use the serverless one.

$ cubejs create event-analytics-backend -d athena -t serverless

Once run, the create command will create a new project directory that contains the scaffolding for your new Cube.js project. This includes all the files necessary to spin up the Cube.js backend, example frontend code for displaying the results of Cube.js queries in a React app, and some example schema files to highlight the format of the Cube.js Data Schema layer.

The .env file in this project directory contains placeholders for the relevant database credentials. For Athena, you'll need to specify the AWS access and secret keys with the access necessary to run Athena queries, and the target AWS region and S3 output location where query results are stored.

CUBEJS_DB_TYPE=athena
CUBEJS_AWS_KEY=<YOUR ATHENA AWS KEY HERE>
CUBEJS_AWS_SECRET=<YOUR ATHENA SECRET KEY HERE>
CUBEJS_AWS_REGION=<AWS REGION STRING, e.g. us-east-1>
# You can find the Athena S3 Output location here: https://docs.aws.amazon.com/athena/latest/ug/querying.html
CUBEJS_AWS_S3_OUTPUT_LOCATION=<S3 OUTPUT LOCATION>
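A quick way to catch a forgotten placeholder is to check the environment before starting the server. This is an optional sketch, not part of the generated project; `missingSettings` is a hypothetical helper, and the variable names are the ones from the .env file above.

```javascript
// Variable names taken from the .env file above.
const REQUIRED_SETTINGS = [
  'CUBEJS_DB_TYPE',
  'CUBEJS_AWS_KEY',
  'CUBEJS_AWS_SECRET',
  'CUBEJS_AWS_REGION',
  'CUBEJS_AWS_S3_OUTPUT_LOCATION'
];

// Hypothetical helper: returns the names of settings that are missing or empty.
function missingSettings(env) {
  return REQUIRED_SETTINGS.filter((name) => !env[name]);
}

const missing = missingSettings(process.env);
if (missing.length > 0) {
  console.warn(`Missing Athena settings: ${missing.join(', ')}`);
}
```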

Now, let’s create a basic Cube.js Schema for our events model. Cube.js uses Data Schema to generate and execute SQL; you can read more about it here.

Create a schema/Events.js file with the following content.

const regexp = (key) => `&${key}=([^&]+)`;
const parameters = {
  event: regexp('e'),
  event_id: regexp('eid'),
  page_title: regexp('page')
}

cube(`Events`, {
  sql:
    `SELECT
      from_iso8601_timestamp(to_iso8601(date) || 'T' || "time") as time,
      ${Object.keys(parameters).map((key) => ( `url_decode(url_decode(regexp_extract(querystring, '${parameters[key]}', 1))) as ${key}` )).join(", ")}
    FROM cloudfront_logs
    WHERE length(querystring) > 1
    `,

  measures: {
    pageView: {
      type: `count`,
      filters: [
        { sql: `${CUBE}.event = 'pv'` }
      ]
    },
  },

  dimensions: {
    pageTitle: {
      sql: `page_title`,
      type: `string`
    }
  }
});

In the schema file, we create an Events cube. It is going to contain all the information about our events. In the base SQL statement, we extract values from the query string sent by the tracker by using a regular expression function. Cube.js is good at running transformations such as this, and it can also materialize some of them to improve performance. We’ll talk about that in the next parts of our tutorial.
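To make the extraction step easier to follow, the sketch below prints the select list that the template string in the schema expands to, and then mimics the same extraction in plain JavaScript. The query string is made up, `extract` is a hypothetical helper, and decoding twice assumes the logged value is percent-encoded twice, which is what the nested `url_decode` calls imply.

```javascript
// Same pattern builder and parameter map as in schema/Events.js.
const regexp = (key) => `&${key}=([^&]+)`;
const parameters = {
  event: regexp('e'),
  event_id: regexp('eid'),
  page_title: regexp('page')
};

// The select list that the template string in the schema expands to.
const selectList = Object.keys(parameters)
  .map((key) => `url_decode(url_decode(regexp_extract(querystring, '${parameters[key]}', 1))) as ${key}`)
  .join(', ');
console.log(selectList);

// Rough JavaScript equivalent of the SQL extraction, for illustration only.
function extract(querystring, pattern) {
  const match = querystring.match(new RegExp(pattern));
  return match ? decodeURIComponent(decodeURIComponent(match[1])) : null;
}

// Made-up query string. Each pattern only matches a key preceded by "&".
const querystring = 'x=1&e=pv&eid=abc-123&page=Home%2520Page';
const extracted = Object.keys(parameters).reduce((row, key) => {
  row[key] = extract(querystring, parameters[key]);
  return row;
}, {});
console.log(extracted);
```

One consequence of the leading `&` in the pattern is that a key appearing as the very first parameter of the query string would not be matched.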

With this schema in place, we can run our dev server and build the first chart.

Spin up the development server by running the following command.

$ npm run dev

Visit http://localhost:4000. It should open a CodeSandbox with an example. Change the renderChart function and the query variable to the following.

const renderChart = resultSet => (
  <Chart height={400} data={resultSet.chartPivot()} forceFit>
    <Coord type="theta" radius={0.75} />
    <Axis name="Events.pageView" />
    <Legend position="right" name="category" />
    <Tooltip showTitle={false} />
    <Geom type="intervalStack" position="Events.pageView" color="x" />
  </Chart>
);

const query = {
  measures: ["Events.pageView"],
  dimensions: ["Events.pageTitle"]
};

Now you should see a pie chart; what it shows depends on the data in your S3 bucket.
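If the chart stays empty, it can help to picture the rows it consumes. The sketch below approximates that shape under an assumption drawn from the chart props above: each row carries the dimension value under `x` (used by `color="x"`) and the measure under its own name (used by `position="Events.pageView"`). The raw rows are made up and `toChartRows` is a hypothetical helper, not a Cube.js function.

```javascript
// Made-up raw rows, keyed by the member names used in the query above.
const rawRows = [
  { 'Events.pageTitle': 'Home', 'Events.pageView': '12' },
  { 'Events.pageTitle': 'Pricing', 'Events.pageView': '5' }
];

// Hypothetical helper approximating the row shape the pie chart reads.
function toChartRows(rows, dimension, measure) {
  return rows.map((row) => ({
    x: row[dimension],              // one pie slice per page title
    [measure]: Number(row[measure]) // slice size: number of page views
  }));
}

const chartRows = toChartRows(rawRows, 'Events.pageTitle', 'Events.pageView');
console.log(chartRows);
```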

In the next part, we’ll walk through how to build a dashboard and a dynamic query builder, like the one in Mixpanel or Amplitude. Part 3 will cover how to build funnels, and Part 4 will cover retention. In the final part, we will discuss how to deploy the whole application in serverless mode to AWS Lambda.

You can check out the full source code of the application here.

And the live demo is available here.