Nicolas Lima

ClickHouse over PostgreSQL?

About this post

All right, fellows, I'm quite new to the data world; my entire career so far has been in development itself, but aren't we always trying to get better? That's why I'm taking my chances now, and the first challenge of my adventure is deciding between Postgres and ClickHouse (I'll explain in the problem section). You can use this article as a source of knowledge, but please don't skip the comments section, because the main point of this write-up is to help me discover all the conceptual mistakes in my head! So read this, but then go check what the good coders have to say about it :).

The problem

Now, the most important part of every task: understanding the problem.

Although I need to store data for transactions and for analysis, two different things, I still keep all my application's data in a single Postgres DB. For an app with a small amount of data that would be nothing to worry about; Postgres was not made for OLAP, but it can handle the job. In my scenario, though, I have a big data challenge: we're saving a considerable amount of historical data along with daily transactional info, and this is hurting our performance.

Another thing worth remembering about my problem: the end-user can create his own queries, so it's a little difficult to predict how our DB is going to be used.

The solution

Here is where I expose myself. I thought about some workarounds, and the point is for you guys to read them and tell me when and if I got it wrong; if you agree, have extra tips, or have your own experience stories, please tell me all about it! I'd appreciate it!

With Postgres

I need to at least try to keep Postgres in my system; after all, that seems easier than migrating my entire application to ClickHouse. Let's skip the obvious things, such as upgrading hardware, isolating the DB from the application, etc.

The only way I figured out to continue with Postgres is by changing our system architecture. Since it's practically impossible to predict every type of query our users will create, indexing and SQL tuning will not solve 100% of the problem. That's why I'm thinking about an approach where we store historical and transactional data in two different environments: this way, when our end-user creates his own query, it won't need to scan one enormous table, only a small piece of it, and if he needs to reach the oldest data we can redirect him to the other, slower environment (a sketch of this split is below).
But this adds more complexity to our application, and it goes against a hybrid OLAP/OLTP architecture, which I believe is a good bet for the future!
Any tips here?
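
For concreteness, here's roughly what I mean by that split, using Postgres declarative range partitioning to keep "hot" transactional rows and "cold" history apart inside one logical table. This is only a minimal sketch; the table, columns, date ranges, and connection string are all invented for illustration.

```python
import psycopg2

# Hypothetical "events" table, range-partitioned by date so that
# user-generated queries over recent data only scan the hot partition.
DDL = """
CREATE TABLE events (
    id         bigint      NOT NULL,
    payload    jsonb,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

-- Hot partition: recent transactional data, kept small and fast.
CREATE TABLE events_hot PARTITION OF events
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

-- Cold partition: historical data; it could even live on cheaper,
-- slower storage via a separate tablespace.
CREATE TABLE events_cold PARTITION OF events
    FOR VALUES FROM ('2000-01-01') TO ('2021-01-01');

-- On PG 11+ this cascades to every partition automatically.
CREATE INDEX ON events (created_at);
"""

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```

Queries filtered on created_at get pruned to the matching partition, so the "scan only a small piece" behavior comes for free; the cost is that someone has to create new partitions as time rolls forward.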

ClickHouse

If you don't know much about ClickHouse, please check this link. In short, ClickHouse is a columnar DBMS for OLAP; it is great at running queries over big amounts of data, and here you can see a benchmark comparison between Postgres and ClickHouse.
ClickHouse by itself would easily solve our performance problem (big data is definitely not an issue for this guy), but it's still not 100% good.
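
To make that concrete, here's a minimal sketch of the kind of table and query ClickHouse is built for, using the clickhouse-driver Python client; the table, columns, and server address are assumptions for illustration.

```python
from clickhouse_driver import Client

client = Client("localhost")  # assumes a ClickHouse server running locally

# MergeTree is ClickHouse's workhorse engine for large append-only data,
# sorted on disk by the ORDER BY key.
client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_date Date,
        user_id    UInt64,
        amount     Float64
    ) ENGINE = MergeTree()
    ORDER BY (event_date, user_id)
""")

# The sweet spot: an aggregation over huge row counts that only has to
# read the two columns it touches.
rows = client.execute("""
    SELECT event_date, count() AS txns, sum(amount) AS total
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
print(rows)
```

The flip side is exactly what the quote below lists: appends and scans like this are cheap, but row-level updates, deletes, and transactions are not what the engine is designed around.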

"Specialized tools are specialized: just remember the limitations! - Bad with heterogeneous hardware (Cloudflare experience) - Non-throttled recovery (source replicas flooded with replication load) - No real delete/update support, and no transactions - No secondary keys - Own protocol (no MySQL protocol support) - Limited SQL support, and the joins implementation is different. If you are migrating from MySQL or Spark, you will probably have to re-write all queries with joins" - hnmullany from Hackernews

All right, we kind of have a solution here. ClickHouse is for sure a good choice, but it won't solve everything, and that's the point where I started to think about combining the pros of Postgres and ClickHouse. Maybe a Foreign Data Wrapper could save me here. This way I can use ClickHouse to handle all the massive data queries, even the ones my end-user creates himself, and for operations like delete and update I can keep using Postgres features.

Using clickhousedb_fdw may be a good choice for that, but... it's not perfect either. I don't have much experience using an FDW, but I guess it would add complexity to our application, mainly when we're talking about keeping our foreign tables in sync (a rough sketch of the setup is below).
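
For reference, the FDW wiring looks roughly like this. A big caveat: the exact OPTIONS that clickhousedb_fdw accepts depend on its version, so treat the host/dbname/user values here as placeholders and check the wrapper's own docs; the table definition reuses the same invented "events" example as above.

```python
import psycopg2

# Rough sketch of exposing a ClickHouse table inside Postgres via an FDW.
# Option names below are assumptions; verify them against clickhousedb_fdw.
FDW_SETUP = """
CREATE EXTENSION IF NOT EXISTS clickhousedb_fdw;

CREATE SERVER clickhouse_svr
    FOREIGN DATA WRAPPER clickhousedb_fdw
    OPTIONS (host '127.0.0.1', dbname 'default');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER clickhouse_svr
    OPTIONS (user 'default', password '');

-- Foreign table mirroring the hypothetical ClickHouse "events" table;
-- heavy analytical queries against it are executed by ClickHouse.
CREATE FOREIGN TABLE events_ch (
    event_date date,
    user_id    bigint,
    amount     double precision
) SERVER clickhouse_svr;
"""

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(FDW_SETUP)
```

Note what this doesn't solve: nothing here decides when rows written to Postgres get shipped into ClickHouse, which is exactly the syncing complexity I'm worried about.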

That's what I have so far. What do you guys think?

Thanks for stopping by! Good Luck and Good Code.

Top comments (2)

Gustavo Gardusi

Just my 2 cents here.

I think that database cost is quite a bit cheaper than developer cost. Have you considered having multiple databases? Maybe you have a lot of queries over big data, but you could optimize them by pre-processing useful data.

Suppose that you can do some sort of search, trying to find the most-used queries. Then you can find out what the bottlenecks are, like: structure A is quite slow with query type X, but quite fast for query type Y. Then you think about a structure B that is quite fast for query type X, but slow for query type Y. Why not use two databases?

If storage is not a problem, you can have multiple structures, each focused on a specific type of query that occurs often. The remaining ones don't occur much, so your total complexity won't be affected too much.

Imagine something like Spotify. You can have a tree where the key is the artist name. If you want to query a song and you don't know the artist, it is going to be quite costly. But if you have another structure that can tell you the artist of a song, then it becomes much faster :D
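
Something like this toy sketch of the two structures, with made-up data:

```python
# Primary structure: songs keyed by artist (fast when you know the artist).
songs_by_artist = {
    "daft punk": ["around the world", "one more time"],
    "queen": ["bohemian rhapsody"],
}

# Secondary structure: a reverse index from song to artist, built once,
# so "who plays this song?" is a dictionary lookup instead of a full scan.
artist_by_song = {
    song: artist
    for artist, songs in songs_by_artist.items()
    for song in songs
}

print(artist_by_song["bohemian rhapsody"])  # -> queen
```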

Just create a good back-end that can talk with all the databases and check their sync from time to time, like on weekends, when there is not much usage.

Nicolas Lima

Hey dude, long time no see! Thanks for the reply. I guess the cheapest way is, indeed, to focus on the DB structures. And DB cost is way cheaper than developer cost; that's a valid point haha.