Maneesh Chaturvedi

Posted on Sep 8, 2019

Time and Order in Distributed Systems

#distributedsystems

What is Order and Why is it important in distributed systems

Before we jump into order, lets make some assumptions about the nodes in a distributed system. Nodes(either physical or logical),

communicate by message passing
over an unreliable network
with no shared memory

What is Order

Order defines the sequence in which events occur. That is we are interested in a property which defines a relationship of happens before. Given two events A and B, we want to understand whether

A happens before B or vice versa
Or we cant say anything about the order in which A or B occur

Why do we care about order in distributed systems

A system which can do only thing at a time will have a well defined total order of operations. if people pass through a door which allows only one person at a time, then each person will have a well defined predecessor and successor. However in a distributed system since there are multiple nodes, there may be conflicts which may be introduced.

Assume a scenario where successive writes go to different nodes for the same shopping cart. Which node has the correct view of the cart?

Hence it becomes important in distributed systems to reason about ordering, to resolve conflicts.

Total and Partial Order

A total order is a binary relationship which defines an order for each element in a set. Two distinct elements are comparable if one of them is greater than the other. In a partially ordered set, there are some elements that cannot be compared, and hence a partial order does not specify the exact order for each element.

The natural state in a distributed system is about partial order. Neither the node nor the network makes any guarantee about relative order, but at each node you can observe local total order.

Mathematically, total and partial orders have the following properties

Total Order

Total Order is

Transitive -- if a <= b and b <= c then a <= c
Antisymmetric -- if a <= b and b <= a then a = b
Totality -- a <= b or b <= a for all a,b in X

Partial Order

Partial Order is

Transitive -- if a <= b and b <= c then a <= c
Antisymmetric -- if a <= b and b <= a then a = b
Reflexive -- a <= a for all a in X

Totality implies reflexivity, however for some elements in a partial order, the totality does not apply, so some elements of the set cannot be compared.

Time

Time is a source of order. It allows us to define the order of operations. In a sense time is like any other counter. Time has two useful interpretations when applied to programs.

Order
Duration

Time and Clocks

Global Clocks

The global clock is a globally available clock with perfect accuracy. A global clock provides a total order of events across nodes even if the nodes never communicate. However this is an idealized view of the world. In reality clock synchronization is only possible to a certain level of accuracy. Fundamentally, global clocks are restricted in accuracy by the nature of spacetime

Global clocks assume that clocks on a node all start at the same value and never drift. These assumptions are inherently weak and there are multiple cases, where the assumption breaks, like an out of sync node joining a cluster, or the clocks on nodes drifting apart. However there are systems which assume clock synchronization and apply semantics of last write wins, Cassandra being one of them. Another is Google's Spanner which assumes a worst case clock drift.

Local Clocks

Local clocks are clocks on each node. Local clocks cannot be used to determine whether a remote timestamp happens before a local timestamp. Hence with local clocks, you cannot compare timestamps from two nodes. Local clocks assign a partial order; there is total ordering of events on a node, however events cannot be ordered across nodes.

Causal/Logical Clocks

Causal or Logical clocks assume that there are no clocks and use other mechanisms of tracking causality. Logical clocks use counters and communication between nodes to determine whether something happened before, after or concurrently with something else. Counters can be used to order events locally on a node(timestamps of local clocks can be considered as a special case of counters). In the absence of communication, there would be partial order across nodes. However for ordering across nodes we require communication or message passing.

Lamport clocks and Vector clocks are logical clocks which are used as a replacement for physical clocks. they rely on counters and communication between nodes to determine the order of events in distributed systems. These clocks provide a counter that is usable across nodes. Voldemort and Riak are distributed systems that use vector clocks.

Lamport Clock

Lamport formulated the happens before construct for ordering events across nodes in his paper in 1978, Time, Clocks and Ordering of Events . To understand Lamport clocks lets first look at the how counters are maintained for a process/node
-- Whenever a process does work, increment the counter
-- Whenever a process sends a message, include the counter in the message
-- Whenever a process receives a message, set the counter to max(local_counter, received_counter) + 1

Lamport clocks allow counters to be compared across nodes using a partial order. For two events A and B, if counter(A) < counter(B)

-- Then A may have happened before B
-- Or A and B cannot be compared
In simplicity what this says is that if an event comes before another, then the logical clock of the event comes before the other.
Looking at it another way, if A < B IF

-- A and B both occur on the same node/process.
-- A and B are synchronization pairs. A synchronization pair is response to a previous event. So B is a response for A. As an example, A is a GET for performing an update(B).

Vector Clock

A Vector clock is an extension of Lamport clocks. A vector clock maintains an array of N logical clocks, one for each node. Rather than incrementing a common counter, each node updates its logical clock in the array. The rules defining Vector clocks are

-- Whenever a process does work, increment the logical clock on that node in the array.
-- Whenever a process sends a message, it includes the complete array in the message.
-- When a process receives a message, update each element in the array to max(local,received) and increment the logical clock value representing the current node in the array.

The way to determine whether A happens before B is based on
if A < B IFF VectorClock(A) < VectorClock(B). Assuming three nodes N1, N2 and N3 and the vector clock at A to be [1,0,0] and the vector clock at B being [2,1,0] implies that A happens before B.

DEV Community