
Pradeep Chhetri

Originally published at Medium

CSE138 Lecture 2 Notes

What is a Distributed System?

According to Leslie Lamport:

A system where I can’t get my work done because some computer which I never heard of crashed.

According to Martin Kleppmann:

A system running on several nodes and characterised by partial failures.

What is Partial Failure?

  • Machine crashing
  • Network failure
  • Messages being dropped
  • Software misbehaviour

Part of the computation succeeds while another part fails.

In the event of such a failure, you don’t want the whole system to degrade: it must keep working despite the partial failure.

Cloud Computing vs HPC Philosophy

HPC Philosophy:

Treat partial failure as total failure: if anything fails, the computation starts over completely from scratch. This relies on checkpointing (every so often the system saves its progress, and on failure it rolls back to the last checkpoint).

Cloud Computing Philosophy:

It involves expecting these kinds of partial failures and working around them.

Two Nodes Scenario

System of two machines

Possible Failure Scenarios:

  • Request from Machine A gets lost: maybe someone unplugged a cable.
  • Request from Machine A is slow and Machine B never receives it in time: network congestion, or a backed-up message queue on either Machine A’s or Machine B’s side. Machine A thinks it sent the request, but Machine B never received it.
  • Machine B crashed: Machine A sent the message, and Machine B received it and then crashed.
  • Machine B is slow in processing.
  • Response from Machine B is slow and Machine A never receives it.
  • Response from Machine B gets lost.

Does Machine A have any way to distinguish between these situations? No.

If you send a request to another machine and don’t receive a response, it is impossible to know WHY (without having global knowledge of the system).

Other possible failure scenarios:

  • Machine B is lying or refusing to answer.
  • Cosmic Rays flipping the bits.

These kinds of scenarios are called Byzantine Faults.

How do real systems deal with it?

  • When a machine sends a message, it needs some sort of timeout: if no response arrives within that time, give up and assume failure.
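The timeout rule above can be sketched in Python. This is a minimal illustration, not real networking: the `call_with_timeout` helper is a hypothetical stand-in for Machine A sending a request, and a slow function stands in for an unreachable Machine B.

```python
import queue
import threading
import time

def call_with_timeout(fn, timeout):
    """Run fn in a background thread; if no result arrives within
    `timeout` seconds, give up and assume the call failed."""
    result = queue.Queue()
    threading.Thread(target=lambda: result.put(fn()), daemon=True).start()
    try:
        return result.get(timeout=timeout)
    except queue.Empty:
        return None  # no response in time: assume failure

def slow_reply():
    time.sleep(1.0)   # simulates a slow or unreachable Machine B
    return "ok"

print(call_with_timeout(lambda: "ok", timeout=0.5))  # "ok"
print(call_with_timeout(slow_reply, timeout=0.1))    # None: timed out
```

Note that, as the lecture points out, `None` here only means “no response in time” — it cannot tell us which of the failure scenarios actually happened.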

Why might it be a mistake to assume failure due to a timeout?

System of two machines

In this case, it is possible that the value of x gets incremented twice: Machine A asked Machine B to increment x, never got an ok for the first message, timed out, and asked again.
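A toy simulation of that double-increment scenario (the `MachineB` class and the lost-ack flag are invented for illustration):

```python
class MachineB:
    """Hypothetical server holding a counter x."""
    def __init__(self):
        self.x = 0

    def increment(self):
        self.x += 1
        return "ok"

b = MachineB()
b.increment()      # first request: B increments x ...
ack_lost = True    # ... but the "ok" never reaches Machine A
if ack_lost:
    b.increment()  # A times out and retries
print(b.x)         # 2 — one intended increment became two
```

This is why retried operations are often designed to be idempotent, so that applying them twice has the same effect as applying them once.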

Let’s assume the following:

Maximum delay between Machine A and Machine B (and vice versa) is d.

Maximum time to process a request is r.

How long should Machine A wait? 2d + r

This rules out the uncertainty due to slowness, but it still leaves other kinds of uncertainty we need to deal with. Furthermore, most of the time we don’t have this sort of guarantee of a maximum delay; at best we make assumptions about it.
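A tiny worked example of that bound (the delay values here are made up):

```python
def timeout_bound(d, r):
    """Worst-case round trip: the request takes up to d to reach B,
    B takes up to r to process it, and the response takes up to d
    to come back — so Machine A should wait 2*d + r."""
    return 2 * d + r

# e.g. max one-way delay 200 ms, max processing time 500 ms:
print(timeout_bound(d=200, r=500))  # 900 ms
```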

According to Prof. Peter Alvaro:

In distributed systems, not only do we have to deal with partial failures but we also have to deal with unbounded latency.

Why do we want to have a distributed system?

  • Data too big to fit on a single machine.
  • You want things to be faster even though the data can fit on a single machine.

Time and Clocks

What are clocks for ?

  • Marking points in time: e.g. this item in my browser will expire on April 10, 2020 at 08:00 hours.
  • Measuring durations or intervals of time: e.g. this user spent 4 minutes and 55 seconds on our website.

Computers have two types of clocks:

  • time-of-day clocks: tell you what time it is. They are usually synchronised between machines using NTP (Network Time Protocol). They are bad for measuring durations or intervals because a time-of-day clock can jump backward, e.g. due to daylight saving time or leap seconds. They are okayish (but not good) for marking points in time, because clock synchronisation is only so good and preventing certain kinds of bugs needs finer-grained resolution. Hence we aren’t going to use them much in distributed systems.
  • monotonic clocks: they only ever go forward, i.e. they are a kind of counter, perhaps counting the milliseconds since the machine restarted. In Python, the time module exposes a monotonic counter (time.monotonic()). A monotonic clock is completely useless for marking points in time, but it is good for measuring durations or intervals, so we can use it to implement timeouts.
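The two clock types above map directly onto Python’s time module; a small sketch of which clock fits which job:

```python
import time

# Time-of-day clock: fine for marking points in time, but it can
# jump backward, so never use it to measure elapsed time.
expires_at = time.time() + 3600   # e.g. "this item expires in an hour"

# Monotonic clock: never goes backward, so it is safe for durations
# and for implementing timeouts.
start = time.monotonic()
time.sleep(0.1)                   # the work being timed
elapsed = time.monotonic() - start
print(f"took {elapsed:.3f}s")
```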

Check out the Cloudflare blog post on the leap second: they tried implementing timeouts using a time-of-day clock.

Both of these are physical clocks, but in distributed systems we need a different notion of clock: logical clocks. Logical clocks don’t measure time of day or elapsed time; they only capture the ordering of events (which event happened before another).
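One classic kind of logical clock is a Lamport clock, which captures ordering by counting events rather than reading wall time. Here is a minimal Python sketch (the class and method names are my own, not from the lecture):

```python
class LamportClock:
    """Minimal logical clock: counts events, not wall time."""
    def __init__(self):
        self.t = 0

    def local_event(self):
        self.t += 1
        return self.t

    def send(self):
        self.t += 1
        return self.t               # timestamp attached to the message

    def receive(self, msg_t):
        self.t = max(self.t, msg_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
a.local_event()   # a.t == 1
t = a.send()      # a.t == 2; the message carries timestamp 2
b.receive(t)      # b.t == max(0, 2) + 1 == 3
print(a.t, b.t)   # 2 3
```

The `max` in `receive` is what preserves the happened-before relation: the receive event always gets a larger timestamp than the send that caused it.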

Suppose A happened before B.

A → B

What does it tell us?

  • A could have caused B.
  • B could not have caused A.

This notion of potential causality is very important. Why ?

  • Debugging: Figuring out possible causes of bug.
  • Designing systems.

Resources:

Thank you Prof. Lindsey Kuper for keeping the lectures online.
