Choosing Your Data Warehouse

#data #nonprofit #devops #dataengineering

Top comments (4)

evantbyrne_38 • Jun 19 '17

Sena,

Nice writeup! Have you run into any issues with RDS's eventually consistent nature? I've been thinking about using it, but I'm not sure how to deal with the lack of a consistency guarantee.

– Evan

Sena Heydari • Jun 19 '17

Evan,

I might be misunderstanding your question, but let me provide 2 answers:

1. When you mention RDS, I'm guessing you mean MySQL, since RDS has PostgreSQL, MySQL, MariaDB, etc. as products you can use. For MySQL, we don't have consistency issues for the ETL work that writes to the master node, since MySQL (in most cases) errors out if a transaction can't be consistently completed.
2. If you're referring to the eventual consistency of slave DB replication which powers our analytics and dashboards, we've run into a few issues, but we have a few safeguards against it:

- Our data team knows that replication is not always real-time or close to real-time, especially when large volumes of records are being ingested or processed.
- As long as the replication delay falls within our data delivery SLA (i.e., our dashboards are no more than X hours out of date), we don't escalate to notifying the entire org.
- If our data delivery SLA is exceeded, we notify staff, and we notify them again when things are back to normal.
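
A minimal sketch of how the safeguards above could be automated, not a description of our actual tooling. It assumes the replica lag is available in seconds (for example, from the `Seconds_Behind_Master` column of MySQL's `SHOW SLAVE STATUS`, which is NULL when replication is not running). The names `sla_breached` and `LagNotifier`, and the message strings, are hypothetical.

```python
from typing import Optional


def sla_breached(seconds_behind: Optional[int], sla_hours: float) -> bool:
    """Return True when replica lag exceeds the data delivery SLA.

    A lag of None (assumed to mean replication is not running) counts as a breach.
    """
    if seconds_behind is None:
        return True
    return seconds_behind > sla_hours * 3600


class LagNotifier:
    """Remembers breach state so staff hear about it once on breach and once on recovery."""

    def __init__(self, sla_hours: float) -> None:
        self.sla_hours = sla_hours  # hypothetical SLA, the "X hours" above
        self.in_breach = False

    def check(self, seconds_behind: Optional[int]) -> Optional[str]:
        """Return a message to send, or None when no notification is needed."""
        breached = sla_breached(seconds_behind, self.sla_hours)
        if breached and not self.in_breach:
            self.in_breach = True
            return "notify: dashboards are out of date beyond the SLA"
        if not breached and self.in_breach:
            self.in_breach = False
            return "notify: dashboards are back within the SLA"
        return None  # within SLA, or already notified about the ongoing breach
```

Calling `check` on a schedule with the latest lag reading would send one message when the SLA is first exceeded and one more when lag drops back under it, which mirrors the "notify, then notify again" step.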

Granted, these are operational safeguards, and there could be scenarios when these might fail us. But this is our first pass. Let me know if I didn't answer your question fully, and I'm happy to jump back on this thread!

evantbyrne_38 • Jun 20 '17

Sena,

I was referring to the eventual consistency on the slave DB replication. This clears things up. Thanks for the detailed response!

– Evan

Noah ISO8601 • Jul 22 '17

"Our particular data schema didn't lend itself to easily use the lowest tier Redshift instances. This meant a 40x price increase for a cluster using the next instance size up."

That's... no. I've been using Redshift since it was released, and I'm here to help.

If you're able to run your reporting workload on MySQL at all, you could almost certainly run it on one (1) dense-storage Redshift node for $.85/hr. If for some reason your data set is smaller but your compute needs are more intense, a small cluster of 4-8 dense-compute nodes (at $.25/hr each) would work. There is no way you would ever need either of the XL node types.
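
To put those hourly rates in monthly terms, here is a quick back-of-the-envelope calculation. It assumes the hourly rates quoted above, nodes running around the clock, and roughly 730 hours in a month; `monthly_cost` is a hypothetical helper, not part of any AWS tooling.

```python
HOURS_PER_MONTH = 730  # assumption: 8760 hours per year / 12 months


def monthly_cost(nodes: int, hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Approximate monthly cost in dollars for a cluster that is always on."""
    return round(nodes * hourly_rate * hours, 2)


# One dense-storage node at $0.85/hr.
dense_storage = monthly_cost(1, 0.85)

# Small dense-compute clusters at $0.25/hr per node.
dense_compute_4 = monthly_cost(4, 0.25)
dense_compute_8 = monthly_cost(8, 0.25)

print(dense_storage, dense_compute_4, dense_compute_8)
```

Under those assumptions the single dense-storage node comes to roughly $620 a month, and the dense-compute options to roughly $730 to $1,460.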

What you say about the schema not fitting suggests you got the wrong idea about how hard it is to tune data distribution in Redshift. It's not that difficult. For a typical workload involving large facts and small dimensions, just set everything to EVEN and you'll be fine.

It may ultimately be fine to run this workload on MySQL. If that's what your team has capability with, great. And since MySQL finally added hash joins, it's not literally impossible to run serious analytic queries on it, like it used to be. Just recognize that it's not the right tool for the job, and you're giving up a ton of functionality (and potentially performance) compared to Postgres, Redshift, Greenplum, Vertica ($$), or (blech) Oracle.