TL;DR notes from articles I read today.
- Beware the schemaless nature of NoSQL systems, which can easily lead to sloppy data modeling at the outset. Start with an RDBMS instead, preferably one with a JSON data type and indices on expressions, so that a single database can hold both structured and unstructured data while maintaining ACID compliance.
- Bring ETL closer to the data and be wary of decentralized data cleaning and transformation. Push data cleaning to the database level wherever possible: use type definitions, set a timestamp-with-time-zone policy to enable ‘fail fast, fairly early’, use modern data types that support date or geo algebra instead of leaving that to Pandas and Lambda functions, and employ triggers and stored procedures.
- Create more features at the query level to gain flexibility with different feature vectors, so that model selection and evaluation are quicker.
- Distributed systems like MongoDB and Elasticsearch can be money-hungry (both in terms of technology and human resources), and deployment is harder to get right with NoSQL databases. Relational databases are cheaper, especially for transactional and read-heavy data, more stable, and perform better out of the box.
- Be meticulous, as debugging SQL is quite difficult given its declarative nature. Also, be mindful of clean code and maintainability.
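A minimal sketch of the points above, using SQLite only because it ships with Python (the article does not name a specific engine, and a server RDBMS would offer richer native types). The table, column names, sample values and the toy time zone rule are all assumptions made for illustration. It shows a JSON column next to typed columns, an index on an expression over that JSON, constraints that reject bad rows at write time, and a feature computed at the query level.

```python
import json
import sqlite3


def build_db():
    """Create an in-memory database with typed columns plus a JSON payload."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (
            id         INTEGER PRIMARY KEY,
            user_id    INTEGER NOT NULL,
            amount     REAL    NOT NULL CHECK (amount >= 0),
            -- Toy time zone policy: require an explicit +HH:MM or -HH:MM offset.
            created_at TEXT    NOT NULL CHECK (
                created_at LIKE '%+__:__' OR created_at LIKE '%-__:__'
            ),
            -- The unstructured part lives next to the structured columns.
            payload    TEXT    NOT NULL CHECK (json_valid(payload))
        );
        -- Index on an expression over the JSON document.
        CREATE INDEX idx_events_country
            ON events (json_extract(payload, '$.country'));
        """
    )
    return conn


def insert_event(conn, user_id, amount, created_at, payload):
    """Insert one row; the database itself rejects rows that break the rules."""
    conn.execute(
        "INSERT INTO events (user_id, amount, created_at, payload) "
        "VALUES (?, ?, ?, ?)",
        (user_id, amount, created_at, json.dumps(payload)),
    )


def events_in_country(conn, country):
    """Filter on the same JSON expression that the index is built on."""
    return conn.execute(
        "SELECT user_id, amount FROM events "
        "WHERE json_extract(payload, '$.country') = ? ORDER BY id",
        (country,),
    ).fetchall()


def user_features(conn):
    """Build per-user features in SQL rather than in application code."""
    return conn.execute(
        """
        SELECT user_id,
               COUNT(*)    AS n_events,
               SUM(amount) AS total_amount
        FROM events
        GROUP BY user_id
        ORDER BY user_id
        """
    ).fetchall()


if __name__ == "__main__":
    db = build_db()
    insert_event(db, 1, 10.0, "2024-01-05T10:00:00+00:00", {"country": "DE"})
    insert_event(db, 2, 7.5, "2024-01-05T11:30:00+01:00", {"country": "FR"})
    try:
        # No time zone offset: the CHECK constraint fails the write immediately.
        insert_event(db, 3, 1.0, "2024-01-05T12:00:00", {"country": "DE"})
    except sqlite3.IntegrityError as exc:
        print("rejected:", exc)
    print(events_in_country(db, "DE"))
    print(user_features(db))
```

Changing the feature set here means editing one query, which is the flexibility the fourth bullet is after; the trade-off from the last bullet still applies, since a failing constraint or a slow query is harder to step through than application code.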
Full post here, 13 mins read
- Replicate the data you store: for databases, redundancy brings reliability.
- For consistency across multiple database replicas, a write request to any node should trigger write requests for all replicas.
- In an ‘eventual consistency’ model, you can achieve low latency for read requests by delaying the updates to replicas, but you will risk returning stale data to read requests from some nodes if the update has not reached them yet.
- With a ‘strong consistency’ model, write requests to replicas are triggered immediately. However, subsequent read/write requests to any of the replicas are delayed until consistency is reached.
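The difference between the two models can be sketched with a toy in-memory cluster. This is a single-process illustration, not a real replication protocol: the `Cluster` class, its method names and the explicit `propagate()` step are hypothetical, and the blocking behaviour of strong consistency is modelled simply by updating every replica before the write returns.

```python
from collections import deque


class Cluster:
    """Toy cluster: each replica is a plain dict, updates travel via a queue."""

    def __init__(self, n_replicas=3):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.pending = deque()  # (replica index, key, value) not yet applied

    def write_strong(self, key, value):
        # Strong consistency: every replica is updated before the write
        # returns, so no later read can observe an old value.
        for replica in self.replicas:
            replica[key] = value

    def write_eventual(self, key, value, node=0):
        # Eventual consistency: only the receiving node is updated now;
        # updates for the other replicas are queued and applied later.
        self.replicas[node][key] = value
        for index in range(len(self.replicas)):
            if index != node:
                self.pending.append((index, key, value))

    def propagate(self):
        # Deliver all delayed updates, in the order they were queued.
        while self.pending:
            index, key, value = self.pending.popleft()
            self.replicas[index][key] = value

    def read(self, key, node):
        return self.replicas[node].get(key)


if __name__ == "__main__":
    cluster = Cluster()
    cluster.write_strong("x", 1)
    cluster.write_eventual("x", 2, node=0)
    print("node 0:", cluster.read("x", 0))  # already sees the new value
    print("node 1:", cluster.read("x", 1))  # may still see the old value
    cluster.propagate()
    print("node 1 after propagation:", cluster.read("x", 1))
```

The eventual write returns after touching one replica, which is where the low latency comes from, while a read from another node before `propagate()` runs returns the stale value described above.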
Full post here, 4 mins read