- The aim of this guide is to provide an easy-to-understand, yet comprehensive, introduction to distributed systems, which have become fundamental to modern data processing.
- The target audience is people who want to, or are professionally starting to, work with distributed systems. The guide may also be useful as a refresher for experienced professionals.
As the name implies, a distributed system is a system whose components are spread across multiple machines, yet work together as one cohesive unit. Compared to a non-distributed system, it can be made fault tolerant and horizontally scalable far more easily.
Broadly speaking, distributed systems allow us to achieve the following:
- Horizontal scalability
- High efficiency for a given infrastructure cost
- High availability
No. At least not without incurring heavy costs:
- Complexity increases manyfold
- Understanding the overall system takes expertise from multiple domains
- Data duplication generally occurs
- Data migrations become more difficult
- Networking costs increase
- Securing the system becomes more difficult
- Deployments and troubleshooting become more difficult
- Whenever a non-distributed system starts to incur heavy costs to meet your performance needs
- You effectively trade off simplicity for performance at a given cost
- Do this only as long as you can maintain an acceptable cost/performance ratio. The lower, the better
- Reliability: Ability of a system to perform its required functions under stated conditions for a specific period of time. It is a measure of continuity of correct service. Reliability is a safety guarantee stating that nothing bad will happen (under the stated conditions)
- Availability: The proportion of time for which a system can perform its function, as seen from a client's perspective. It is measured as a percentage of time, and is defined as a probability value. A 99.999% available system can be down for at most 5.26 minutes/year, or at most 26.30 seconds/month. This is also known as 5 nines of availability.
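As a quick sanity check, the downtime figures above follow directly from the availability percentage. A minimal helper to derive them (a sketch, not any standard library API):

```python
def max_downtime(availability_pct: float) -> dict:
    """Convert an availability percentage into maximum allowed downtime."""
    down_fraction = 1 - availability_pct / 100
    seconds_per_year = 365.25 * 24 * 3600  # include leap-year averaging
    return {
        "per_year_minutes": down_fraction * seconds_per_year / 60,
        "per_month_seconds": down_fraction * seconds_per_year / 12,
    }

print(max_downtime(99.999))
# five nines -> roughly 5.26 minutes/year, 26.3 seconds/month
```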
- Scalability: The property of a system to meet increased load by adding a proportional amount of resources. In a simple sense, this means the system can handle an increased load by simply consuming more resources, without its performance being negatively impacted.
- Fault Tolerance: The ability of a system to detect a fault and instantaneously switch to a redundant copy of the affected component with almost negligible downtime. Faults may include network, CPU, RAM, disk, and power failures. This mostly relates to the hardware side.
- High Availability: Similar to fault tolerance, but more cost-effective at the expense of comparatively more, yet acceptable, downtime. It works on the software side, using redundant systems and smart fault detection and correction strategies.
- Consistency: Ability of a system to maintain a single, up-to-date copy of the data, irrespective of how widely distributed it is.
- Atomicity: Ability of a system to either correctly apply all given operations, or none.
- Durability: Ability of a system to persist information correctly even across hardware failures, once the information has been committed to its persistence backend, whatever that may be.
- Latency: Time delay between cause and effect. In distributed system terms, it generally refers to delay in propagating any change in data across one part of the system to other.
- Replication: Act of making multiple copies of a subject. In our case, the data. Used to improve data availability and accessibility, and to improve system resilience and reliability.
- Transaction: A single logical unit of work. It either completes fully or not at all.
- Sharding (Partitioning): Partitioning of related data across more than one location (machines/nodes) to achieve higher concurrency, to hold more data than a single location allows, or both. It is also known as horizontal partitioning, as a database table is split horizontally (across rows) as opposed to across columns (as in vertical partitioning).
- Accuracy and Correctness: Accuracy means your clock advances at the same rate as a perfect clock. Correctness means your clock registers the same time as the perfect clock.
This is a well-known theorem that helps system designers make informed trade-offs between available system resources and desired functionality when designing networked shared-data systems (distributed systems).
- CAP refers to Consistency, Availability, and Partition tolerance. The theorem, in its most basic sense, states that in a distributed system, under a partitioned network state, one can have either a consistent system or an available system, but not both. A partitioned network simply means that a component is not reachable due to network failure.
- The theorem helps us decide what is more important for our current use case. For transactional data, the choice is consistency; for high-throughput data that can tolerate a little lag (like analytics data), consistency can be traded off to achieve better availability at the given associated costs.
- A system that is partition-tolerant can sustain any amount of network failure that doesn’t result in a failure of the entire network. Data records are sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.
- When dealing with modern distributed systems, Partition Tolerance is not an option. It’s a necessity. Hence, we have to trade between Consistency and Availability.
- A variant of this is the BASE model, which stands for Basically Available, Soft state, Eventually consistent. It effectively trades away strong consistency, settling for eventual consistency, while gaining far greater availability and scalability.
This is an extension to the CAP theorem that helps make better decisions for the case where the network is NOT partitioned.
- It states that in case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C) (as per the CAP theorem), but else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and consistency (C).
- In other words, even when the network is functioning as desired, one has to choose between latency and consistency; both cannot be at their best possible values simultaneously.
It is a set of properties of database transactions intended to guarantee data validity despite errors, power failures, and other mishaps. Relational databases are the primary providers of these properties, though other types may also support them (partially or fully).
- A: Atomicity: This property states that a transaction must be treated as an atomic unit, that is, either all of its operations are executed or none. There must be no state in a database where a transaction is left partially completed
- C: Consistency: The database must remain in a consistent state after any transaction. No transaction should have any adverse effect on the data residing in the database. If the database was in a consistent state before the execution of a transaction, it must remain consistent after the execution of the transaction as well
- I: Isolation: In a database system where more than one transaction is executed concurrently, the isolation property states that each transaction will be carried out as if it were the only transaction in the system. No transaction should affect the existence of any other transaction.
- D: Durability: The database should be durable enough to hold all its latest updates even if the system fails or restarts.
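Atomicity can be demonstrated with a minimal sketch using Python's built-in sqlite3 module. The account names and the simulated failure condition are illustrative; the point is that a failed transfer leaves no partial update behind:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Simulate a failure after the partial updates above
            if amount > 100:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the partial updates were rolled back

transfer(conn, "alice", "bob", 150)  # fails mid-way -> rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # alice still has 100, bob still has 0
```

Even though the debit ran before the simulated failure, the database never exposes that half-finished state.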
The dictionary definition of the term says:
Consensus is a group discussion where everyone's opinions are heard and understood, and a solution is created that respects those opinions. Consensus is not what everyone agrees to, nor is it the preference of the majority. Consensus results in the best solution that the group can achieve at the time.
In distributed systems, the same definition applies. The only difference is in the meaning of the word opinion. Restated for distributed systems:
The goal of a distributed consensus algorithm is to allow a set of computers to all agree on a single value that one of the nodes in the system proposed (as opposed to making up a random value). The challenge in doing this in a distributed system is that messages can be lost or machines can fail.
Note: The above algorithms are quite detailed in nature and require their own separate space to be properly explained and understood.
- Strong consistency: After an update completes, all further operations use the new value
- Weak consistency: Operations after an update completes may not see the new value
- Eventual consistency: If no new updates are made to an object, all reads will eventually return its latest value
- Causal consistency: Once an updated version is communicated to a process, that process is guaranteed not to use older versions
- A collision is said to occur when two activities, which may or may not be full-fledged transactions, attempt to change entities within a system of record.
- There are three fundamental ways (Celko 1999) that two activities can interfere with one another:
- 1. Dirty Read: Activity 1 (A1) reads an entity from the system of record and then updates the system of record but does not commit the change (for example, the change hasn’t been finalized). Activity 2 (A2) reads the entity, unknowingly making a copy of the uncommitted version. A1 rolls back (aborts) the changes, restoring the entity to the original state that A1 found it in. A2 now has a version of the entity that was never committed and therefore is not considered to have actually existed.
- 2. Non-repeatable Read: A1 reads an entity from the system of record, making a copy of it. A2 deletes the entity from the system of record. A1 now has a copy of an entity that does not officially exist.
- 3. Phantom Read: A1 retrieves a collection of entities from the system of record, making copies of them, based on some sort of search criteria such as “all customers with first name Bill.” A2 then creates new entities, which would have met the search criteria (for example, inserts “Bill Klassen” into the database), saving them to the system of record. If A1 reapplies the search criteria, it gets a different result set.
- It deals with the issues involved with allowing multiple people simultaneous access to shared entities, be they objects, data records, or some other representation.
- It provides concepts and technologies to synchronize multiple transactions in a way that their interleaved execution does not violate ACID properties.
These strategies use the concept of locking data items. A lock is a variable associated with a data item that determines whether read/write operations can be performed on that data item.
- 1. Pessimistic Locking:
- 1. Pessimistic locking is an approach where an entity is locked in the database for the entire time that it is in application memory (often in the form of an object). A lock either limits or prevents other users from working with the entity in the database.
- 2. A write lock indicates that the holder of the lock intends to update the entity and disallows anyone from reading, updating, or deleting the entity. A read lock indicates that the holder of the lock does not want the entity to change while they hold the lock, allowing others to read the entity but not update or delete it.
- 3. The scope of a lock might be the entire database, a table, a collection of rows, or a single row.
- 4. The advantages of pessimistic locking are that it is easy to implement and guarantees that your changes to the database are made consistently and safely.
- 5. The primary disadvantage is that this approach isn’t scalable: locks held for long periods block other work
- 2. Optimistic Locking:
- 1. The basic idea is that you accept the fact that collisions occur infrequently, and instead of trying to prevent them you simply choose to detect them and then resolve the collision when it does occur.
- 2. In this, a database typically marks rows in some way (unique id, timestamp, incremental counter, etc.) when loading them into memory for manipulation, and, before writing them back, checks whether the marked values are unchanged. A collision is detected if the marked values differ.
- 3. Overly Optimistic Locking:
- 1. In this, you neither try to avoid nor detect collisions, assuming that they will never occur.
- 2. This strategy is appropriate for single user systems, systems where the system of record is guaranteed to be accessed by only one user or system process at a time, or read-only tables.
- 3. This obviously gives the best performance.
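The optimistic locking check described above can be sketched as a toy in-memory store with a per-key version counter. All names here are illustrative, not any particular database's API:

```python
class StaleWriteError(Exception):
    pass

class VersionedStore:
    """Toy in-memory store using a per-key version counter (optimistic locking)."""
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if current != expected_version:  # someone wrote in between: collision
            raise StaleWriteError(f"{key}: expected v{expected_version}, found v{current}")
        self._data[key] = (value, current + 1)

store = VersionedStore()
store.write("user:1", {"name": "Ada"}, expected_version=0)

value, version = store.read("user:1")  # two clients both read version 1
store.write("user:1", {"name": "Ada L."}, expected_version=version)  # first writer wins
try:
    store.write("user:1", {"name": "A. Lovelace"}, expected_version=version)
except StaleWriteError:
    print("collision detected; re-read and retry")
```

The second writer's version check fails, so it must re-read the row and retry (or give up), rather than silently overwriting the first writer's change.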
- These strategies use a transaction’s timestamp to coordinate concurrent access to a data item to ensure serializability. A timestamp is a unique identifier given by DBMS to a transaction that represents the transaction’s start time.
- They ensure that transactions commit in the order dictated by their timestamps. An older transaction should commit before a younger transaction, since the older transaction enters the system before the younger one.
- 1. Basic timestamp ordering algorithm
- 2. Conservative timestamp ordering algorithm
- 3. Multiversion algorithm based upon timestamp ordering
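As a rough illustration, the basic timestamp-ordering rule (reject any operation that arrives "too late" relative to an item's recorded read/write timestamps) can be sketched as below. Real DBMS implementations are considerably more involved; this only shows the accept/abort decision:

```python
class TimestampOrdering:
    """Basic timestamp-ordering check: abort operations that arrive 'too late'."""
    def __init__(self):
        self.read_ts = {}   # item -> largest timestamp that has read it
        self.write_ts = {}  # item -> largest timestamp that has written it

    def read(self, item, ts):
        if ts < self.write_ts.get(item, 0):
            return False  # a younger transaction already wrote: abort the reader
        self.read_ts[item] = max(self.read_ts.get(item, 0), ts)
        return True

    def write(self, item, ts):
        if ts < self.read_ts.get(item, 0) or ts < self.write_ts.get(item, 0):
            return False  # conflicts with a younger transaction: abort the writer
        self.write_ts[item] = ts
        return True

sched = TimestampOrdering()
ok_w = sched.write("x", ts=2)   # accepted: first write to x
late_r = sched.read("x", ts=1)  # rejected: an older txn reading after a younger write
ok_r = sched.read("x", ts=3)    # accepted
print(ok_w, late_r, ok_r)
```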
A database storage engine is an internal software component that a database server uses to store, read, update, and delete data in the underlying memory and storage systems. Broadly speaking there are 2 types that are used extensively:
- B-Tree Based Engine: Uses the b-tree data structure to hold data. Performs better with reads. Example databases: MySQL, PostgreSQL, MongoDB, Couchbase, etc.
- Log-Structured Merge (LSM) Tree Based Engine: Uses the LSM-tree structure to organize data. Performs better with writes. Example databases: Cassandra, BigTable, Elasticsearch, RocksDB, etc.
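To make the write-optimized nature of LSM engines concrete, here is a deliberately tiny sketch of the memtable-plus-sorted-segments idea. It omits everything a real engine needs (compaction, bloom filters, a write-ahead log, on-disk storage):

```python
class TinyLSM:
    """Toy LSM sketch: in-memory memtable, flushed to immutable sorted segments."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []  # newest last; each is a sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # writes are cheap: memory only
        if len(self.memtable) >= self.limit:
            # flush: turn the memtable into an immutable sorted segment
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            for k, v in segment:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(6):
    db.put(f"k{i}", i)
db.put("k1", 100)  # the overwrite lands in the memtable, shadowing the old segment
print(db.get("k1"), db.get("k0"))
```

Reads may have to consult several segments (hence the read penalty), while every write touches only memory until a flush happens.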
Log-Based Data Replication
- 1. Involves making use of transaction log kept by databases. This log file keeps track of every change in the database, right from the very beginning, in chronological order.
- 2. It can be implemented in the following 2 ways:
- 1. Statement-Based Log Replication: Statement-based replication tracks and stores all commands, queries, or actions that modify the database. Replicas are generated by re-running all these statements in their order of occurrence
- 2. Row Based Log Replication: Row-based replication keeps track of all the new rows of the database and stores them in the form of a record in the log file. The unique position Id of a row acts as a bookmark, allowing the database to continue the replication process with ease.
Full Table Data Replication
- 1. It involves replicating the database and its tables completely.
- 2. It will carry out replication for all the data records, irrespective of whether they are old or new.
- 3. It can be implemented in the following ways:
- 1. Snapshot Replication: It generates a replica of your database by taking a “snapshot” of how your tables, data, relationships, etc. look like at a particular point in time and then replicates the same on the other database.
- 2. Transactional Replication: It achieves replication by first monitoring the updates as they occur on the master database and then carrying out sync to make all these changes in the replicas. It ensures transactional consistency by carrying out the updates in the same order as the original database.
- Key-Based Incremental Data Replication: This strategy leverages a replication key column to identify new and updated data. A replication key column is a column holding unique values across the entire dataset. Replication is then carried out only for records whose replication keys have been updated.
Note: See the section A few important terms above to learn about sharding, in case you need to.
- Algorithmic Sharding: This involves computing a hash of the key of a record and taking that hash modulo n, where n is the number of nodes. The advantage is the elimination of the need for a centralized coordinator. The disadvantage is the massive shuffling of data needed when the number of nodes changes.
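A minimal sketch of algorithmic sharding, including a quick measurement of how much data moves when a node is added (the key format and the choice of MD5 are illustrative):

```python
import hashlib

def shard_for(key: str, num_nodes: int) -> int:
    """Algorithmic sharding: node = hash(key) mod n."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Growing the cluster remaps most keys, illustrating the reshuffling cost.
keys = [f"user:{i}" for i in range(1000)]
moved = sum(shard_for(k, 4) != shard_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when going from 4 to 5 nodes")
```

With a uniform hash, a key stays put only when `hash % 4 == hash % 5`, i.e. roughly 1 in 5 keys, so about 80% of the data has to move.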
- Linear Hash Sharding: Almost identical to algorithmic sharding, differing only in the hash function used. Here, the hash function generates linear hashes, i.e., it maintains the relative ordering of input values while changing their distribution spacing. In other words, it reduces cross-node jumps to a minimum, thus reducing the data shuffling needed. The problem with this technique is the formation of hotspots, due to the impossibility of picking good shard-split boundaries ahead of time.
- Consistent Hash Sharding: This technique makes use of consistent hashing to distribute data across nodes. With this, there are many more shards than actual nodes, and a separate mapping table tracks the assignment of these shards to the nodes. When adding new nodes, a subset of shards from existing nodes can be efficiently moved to the new nodes without requiring a massive data reassignment. The disadvantage is that range queries become inefficient. This technique is widely used across many NoSQL databases.
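A minimal consistent-hash ring with virtual nodes can be sketched as below; the node names and vnode count are illustrative. Note how adding a node moves only a small fraction of keys, in contrast with hash-modulo-n sharding:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""
    def __init__(self, nodes, vnodes=100):
        # each node owns many points ("virtual nodes") on the ring
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # a key belongs to the first ring point clockwise from its hash
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._keys)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"order:{i}" for i in range(1000)]
moved = sum(ring.node_for(k) != bigger.node_for(k) for k in keys)
print(f"only {moved / len(keys):.0%} of keys move when node-d joins")
```

Only the keys that now fall into node-d's ring segments relocate, roughly 1/4 of them, instead of the ~80% seen with plain modulo hashing.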
- Range Sharding: This technique splits the rows of a table into contiguous ranges that respect the table's sort order, based on the primary-key column values. Range-sharded tables usually start out with a single shard; as data is inserted, the table is dynamically split into multiple shards, because it is not always possible to know the distribution of keys ahead of time. This technique suffers from the problem of initially warm nodes, which take in the bulk of the query load at the start. It can also lead to hotspots: some primary-key ranges are used much more than others, and since those keys most likely live on the same node, that node becomes a hotspot, degrading the overall performance of the system.
Taken directly from Wikipedia:
The fallacies of distributed computing are a set of assertions made by L Peter Deutsch and others at Sun Microsystems describing false assumptions that programmers new to distributed applications invariably make.
- The network is reliable: The fallacy of assuming the network is reliable, and thus putting no effort into error handling, leading to the application stalling in case of a network outage.
- Latency is zero: The fallacy of assuming the latency between coordinating parties to be zero leading to network congestion, and/or application logic not working as expected, in transactional workloads.
- Bandwidth is infinite: The fallacy of assuming the availability of infinite bandwidth between cooperating parties. This can lead to overestimating the load handling capacity of the cooperating systems, leading to failure under actual overloaded situations.
- The network is secure: This fallacy can lead to data being intercepted by hackers.
- Topology doesn't change: This fallacy can lead to unaccounted-for changes in bandwidth and latency, thus affecting SLAs.
- There is one administrator: This can lead to conflicting changes that make cooperating nodes altogether unreachable.
- Transport cost is zero: This is often the ignored one, but, at least with cloud vendors, transport does contribute significantly to overall cloud costs.
- The network is homogeneous: This is less relevant today, but it mattered when proprietary protocols were prominently used for communication across the variety of clients that may connect to your services. It may still be prominent in devices belonging to niche categories of services, for example, satellite communications.
- We all trust each other: This is becoming more and more relevant, with governments these days actively spoofing their citizens' communications. Falling for it means assuming the party you are talking to is who it claims to be, when it is in fact someone else impersonating the party you actually want to communicate with.
- Configuration Management: Configuration for a microservice application needs to be externalized from the code and be retrievable via a simple service call
- Service Discovery: Maintain a list of service instances that are available for work within a microservice domain
- Load Balancing: How to distribute load across a cluster of nodes to enable scalability of the system
- API Gateway: Provides additional services like proxying, protocol translation, and other management functions on top of APIs exposed by participating (micro)services
- Security Concerns: Access-policy definition and implementation in a consistent manner across all participating (micro)services
- Centralized Logging: Need to have a centralized system to gather application/DB logs, and allow querying on top of it to assist with troubleshooting of problems and help with overall optimization of the system
- Centralized Metrics: Same as centralized logging above, but with metrics. Allows monitoring the health and performance of the overall system from a central place. Alerts can then be set up and triggered whenever there is any deviation from the expected behavior of any component of the system
- Distributed Tracing: Allows tracing the paths taken by data as it traverses participating services. These paths can grow quite complex, hence the need
- Resilience and Fault Tolerance: Distributed systems must be capable of auto-routing around failures, and of routing requests to the service instance that will provide an optimum response
- Autoscaling and Self-Healing: Distributed systems respond to higher load by scaling horizontally: the platform must detect and auto-respond to such conditions. Furthermore, the system needs to detect failures and attempt auto-restarts without operator input
- Packaging, Deployment and Scheduling: Large-scale systems require robust package management, and deployment systems to manage rolling or blue-green deployments, and rollbacks if necessary. A scheduler helps determine which particular execution node a new set of services can be deployed to based on current conditions
- Job Management: Scheduled computations disconnected from any individual user requests. These basically form background jobs, for example, ETL jobs
- Singleton Applications: Limit a specific service to run as the only instance of that service within the entire system. This is needed, for example, when you are migrating data from one location to another and the process is not idempotent.
- Consistency, availability, and partition tolerance (CAP)
- Robustness and reliability
- Performance and speed
- Partitioning ability
- In-database analytics and monitoring
- Operational and querying capabilities
- Storage management
- Talent pool and availability of relevant skills
- Database integrity and constraints
- Data model flexibility
- Database security
- Database vendor/system funding, stability, community, and level of establishment
- A lot more...
- Load Balancer
- Caching Layers
- Globally Unique ID Generation Service(s)
- Horizontally Scalable Databases
- Schema Registry
- Authentication and Authorization Service
- Service Discovery Provider
- API Gateway
- Message Queues
- Reverse Proxies
- Object Stores
- Batch Jobs (Reports, ETL Pipelines, etc.)
- Logging and System Health Monitoring Dashboards
- Design for at least 100% extra scale.
- Avoid bleeding edge technologies. They generally have corner cases and bugs hard to discover in initial days.
- Optimize the design for most important features and/or needs.
- Avoid dependence on specialized software/hardware. Use generally available components as much as possible. Build abstractions if none are available.
- Use caches extensively. They generally provide highest performance/cost ratio.
- Use queues for transient data.
- Avoid transactions as much as possible. They are one of the costliest operations that can be performed at scale.
- Avoid I/O as much as possible. Do as much processing in memory as you can.
- Try to achieve idempotence while building distributed services. It makes self-healing much more simple.
- Figure out your scale numbers before you set out to design the system. Make sure to have numbers for the following:
- 1. Needed Bandwidth
- 2. Desired Throughput
- 3. Highest tolerable latency
- 4. Persistent storage needed
- 5. Request handling capacity (reads/write)
- 6. Rate of change of all above, measured over a defined period of time (quarter, half-yearly, yearly, etc.)
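As a worked example, a back-of-envelope estimate along these lines might look as follows; every input number here is purely illustrative:

```python
# Back-of-envelope capacity estimate; all input numbers are illustrative.
daily_active_users = 1_000_000
requests_per_user_per_day = 50
avg_request_bytes = 2_000          # request + response payload
write_ratio = 0.2                  # 20% of requests persist data
avg_record_bytes = 1_000
retention_days = 365
peak_factor = 3                    # peak traffic vs. the daily average

requests_per_day = daily_active_users * requests_per_user_per_day
avg_rps = requests_per_day / 86_400
peak_rps = avg_rps * peak_factor
bandwidth_mbps = peak_rps * avg_request_bytes * 8 / 1e6
storage_tb = requests_per_day * write_ratio * avg_record_bytes * retention_days / 1e12

print(f"average: {avg_rps:,.0f} req/s, peak: {peak_rps:,.0f} req/s")
print(f"peak bandwidth: {bandwidth_mbps:,.0f} Mbit/s")
print(f"storage for {retention_days} days: {storage_tb:.2f} TB")
```

Even rough numbers like these immediately tell you whether a single node could cope, how many shards or replicas you might need, and how fast those needs grow.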
Although this article gives a good introduction to distributed systems, there are many more concepts that are useful for building better distributed systems. I'm working on part 2 of this guide, which will cover those in a similarly easy-to-understand manner.
Also, do visit my blog https://nitinbansal.dev if you liked this. I have much more useful content planned.
That's all, folks ¯\\_(ツ)_/¯