Things I Learned Building an Analytics Engine

Doug Black on October 20, 2018

Oh man, am I excited. This side project has been a while coming. I just released Engauge Analytics (https://engaugeanalytics.com/), a web analytics...

Recent versions of both MySQL and Postgres come with multi-database sharding support. I recommend looking into it sooner rather than later, unless you already have everything properly sharded, in which case carry on.

 
 

Any recommendations for a relational database that scales to large data sets and still gives quick queries?

You can migrate to MariaDB and scale your DB. It has multiple storage engines and supports master-to-master replication, sharding, or traditional master-slave replication. With its columnar storage engine, I think it can handle petabytes of data.

But all the big analytics players use simpler key-value (columnar) solutions so they can scale horizontally. Collecting raw events and running crunching jobs to aggregate and enrich them works better than squeezing performance out of a SQL query.
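To make the "collect events, then crunch them" idea concrete, here is a minimal sketch in Python. The event fields (`ts`, `site`, `path`), the sample data, and the hourly bucket size are all assumptions for illustration, not how any particular analytics product works.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical raw events, as an append-only event log might store them.
events = [
    {"ts": 1540022400, "site": "a.example", "path": "/"},
    {"ts": 1540022460, "site": "a.example", "path": "/pricing"},
    {"ts": 1540026000, "site": "a.example", "path": "/"},
    {"ts": 1540026060, "site": "b.example", "path": "/"},
]

def hour_bucket(ts):
    """Truncate a Unix timestamp to the start of its UTC hour (ISO string)."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return dt.replace(minute=0, second=0, microsecond=0).isoformat()

def rollup(events):
    """Batch job: count page views per (site, hour)."""
    counts = Counter()
    for event in events:
        counts[(event["site"], hour_bucket(event["ts"]))] += 1
    return dict(counts)

# Dashboards would read these small precomputed rows
# instead of scanning the raw events on every request.
hourly = rollup(events)
```

The point of the design is that writes stay cheap (just append an event), and reads hit a small aggregate keyed by `(site, hour)`, which maps naturally onto a key-value store.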

Side projects are fun for us devs; the problem arises when we want to make money from them. Then comes all the stuff we do not want to handle, from laws to marketing, from customer support to hosting bills.

 

Thank you! I went with Percona off the bat for this one...it's just soooooo fast.

Tell me more about the key-value solutions! This may be just what I'm looking for!

 

At an abstract level:

By getting rid of the relationships and using simple documents, you can shard better, using purpose-built stores like Cassandra.

Sharding a SQL database, most of the time, requires getting rid of the relationships and joins. Even if it does not, it adds overhead, because a query has to fetch and group data from different shards, in a cascading effect.
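A rough sketch of that overhead in Python: a single GROUP BY turns into one partial query per shard plus a merge step. The shards here are plain in-memory lists, and `shard_for`, the modulo routing, and the sample rows are hypothetical.

```python
from collections import Counter

NUM_SHARDS = 3  # hypothetical shard count

def shard_for(event_id):
    """Route a row to a shard by its integer id (simple modulo routing)."""
    return event_id % NUM_SHARDS

# Each shard is modeled as an in-memory list of rows.
shards = [[] for _ in range(NUM_SHARDS)]
rows = [(1, "/"), (2, "/pricing"), (3, "/"), (4, "/"), (5, "/pricing"), (6, "/docs")]
for event_id, path in rows:
    shards[shard_for(event_id)].append({"id": event_id, "path": path})

def local_group_by(shard):
    """What each shard can answer on its own: counts for its local rows only."""
    return Counter(row["path"] for row in shard)

def scatter_gather(shards):
    """Coordinator: query every shard, then merge the partial results."""
    total = Counter()
    for shard in shards:  # one round trip per shard in a real system
        total += local_group_by(shard)
    return total

totals = scatter_gather(shards)
```

A join across shards is worse than this, since the coordinator may need a second round of lookups for every row returned by the first.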

If the sharding algorithm has to take data relationships into consideration and tries to keep related data as local as possible, you will end up with hot spots and unbalanced shards.
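A small Python sketch of how those hot spots appear, assuming a made-up workload where one site produces most of the traffic. The site names, the shard count, and the MD5-based `stable_hash` helper are illustrative choices only.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count

def stable_hash(key):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

# Hypothetical workload: one busy site and ten small ones.
events = [("big.example", i) for i in range(900)]
events += [(f"small{n}.example", i) for n in range(10) for i in range(10)]

def by_site(site, event_no):
    """Locality-aware: all of a site's events live on the same shard."""
    return stable_hash(site) % NUM_SHARDS

def by_event(site, event_no):
    """Locality-blind: each event is placed independently."""
    return stable_hash(f"{site}:{event_no}") % NUM_SHARDS

def load(router):
    """Count how many events each shard receives under a routing function."""
    return Counter(router(site, n) for site, n in events)

local_load = load(by_site)    # one shard holds at least the 900 big-site events
spread_load = load(by_event)  # events spread roughly evenly across shards
```

Routing by site keeps per-site queries on a single shard but lets the busy site overload it; routing by event balances the load but forces every per-site query to touch all shards.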

I'm not saying it is impossible to scale SQL; I'm saying it will be harder and more expensive. If you can afford Spanner from Google, a big Vitess setup, or 5-8 big servers behind Galera, go ahead!

Bottom line: if you want to go beyond a few TB of data, I would suggest rethinking your structure in a columnar way, and making it less SQL-ish.
