In my previous post, Live notetaking as I learn Spark, I learned some of the basics of Spark:
"Spark is a distributed programming model in which the user specifies transformations. Multiple transformations build up a directed acyclic graph of instructions. An action begins the process of executing that graph of instructions, as a single job, by breaking it down into stages and tasks to execute across the cluster. The logical structures that we manipulate with transformations and actions are DataFrames and Datasets. To create a new DataFrame or Dataset, you call a transformation. To start computation or convert to native language types, you call an action." Spark: The Definitive Guide
I liked the live notetaking format because it pushed me to write down what I was learning and kept me accountable. I have so many drafted posts that I haven't published because I haven't finished my thoughts. Live notetaking takes the pressure off of constantly thinking about a possible pitch as I'm learning. I am using this live learning pattern again here.
I am now up to chapter 4 of Spark: The Definitive Guide. In the time between my last live notetaking post and this one, I have done some reading on my own about the anatomy of database systems and git as a distributed system. Seeing Spark described as a distributed programming model got my attention since I haven't seen Spark described this way before, so I definitely want to be clear on what the term means. Julia Evans describes the importance of identifying what you don't understand in her post on how to teach yourself hard things.
In this post, I will put together an outline of concepts related to programming models for distributed computing. I have an additional goal as well: to recognize the patterns I use to learn a new concept. Is this the most effective way to learn? Can I optimize these patterns? What motivates me to use them?
The first thing I did was enter "distributed programming model" into Google. I chose the third result because it looked like it could be the syllabus for a class on this topic. I like syllabi. They usually contain readings and homework assignments I can complete for a topic. I also typically compare the textbooks that appear in different syllabi: if a textbook is used in several courses, I look into it to see if it is the book on the topic.
The syllabus is for a class titled "Special Topics in Computer Systems: Programming Models for Distributed Computing" at Northeastern University. I'd definitely take this course if I could:
"Topics we will cover include, promises, remote procedure calls, message-passing, conflict-free replicated datatypes, large-scale batch computation à la MapReduce/Hadoop and Spark, streaming computation, and where eventual consistency meets language design, amongst others."
I got some feedback from a colleague to focus on the Spark data structures next, and this course covers conflict-free replicated datatypes. I don't know what those are yet, but I want to know. Over the semester, the class authors a literature review on the landscape of programming models for distributed computation. So cool.
RPC
I have already read a bit on RPC in Distributed Systems by Tanenbaum. Tanenbaum describes the Remote Procedure Call proposal made by Birrell and Nelson in Implementing Remote Procedure Calls (1984):
"When a process on machine A calls a procedure on machine B, the calling process on A is suspended, and execution of the called procedure takes place on B. Information can be transported from the caller to the callee in the parameters and can come back in the procedure result. No message passing at all is visible to the programmer." p. 173
- Implementing Remote Procedure Calls (1984)
- A Distributed Object Model for the Java System (1996)
- A Note on Distributed Computing (1994)
- A Critique of the Remote Procedure Call Paradigm (1988)
- Convenience Over Correctness (2008)
Futures, promises
- Multilisp: A language for concurrent symbolic computation (1985)
- Promises: linguistic support for efficient asynchronous procedure calls in distributed systems (1988)
- Oz dataflow concurrency. Selected sections from the textbook Concepts, Techniques, and Models of Computer Programming. Sections to read: 1.11: Dataflow, 2.2: The single-assignment store, 4.93-4.95: Dataflow variables as communication channels ...etc.
- The F# asynchronous programming model (2011)
- Your Server as a Function (2013)
Message passing
- Concurrent Object-Oriented Programming (1990)
- Concurrency among strangers (2005)
- Scala actors: Unifying thread-based and event-based programming (2009)
- Erlang (2010)
- Orleans: cloud computing for everyone (2011)
Distributed Programming Languages
- Distributed Programming in Argus (1988)
- Distribution and Abstract Types in Emerald (1987)
- The Linda alternative to message-passing systems (1994)
- Orca: A Language For Parallel Programming of Distributed Systems (1992)
- Ambient-Oriented Programming in AmbientTalk (2006)
Consistency, CRDTs
- Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services (2002)
- Conflict-free Replicated Data Types (2011)
- A comprehensive study of Convergent and Commutative Replicated Data Types (2011)
- CAP Twelve Years Later: How the "Rules" Have Changed (2012)
- Cloud Types for Eventual Consistency (2012)
Languages & Consistency
- Consistency Analysis in Bloom: a CALM and Collected Approach (2011)
- Logic and Lattices for Distributed Programming (2012)
- Consistency Without Borders (2013)
- Lasp: A language for distributed, coordination-free programming (2015)
Languages Extended for Distribution
- Towards Haskell in the Cloud (2011)
- Alice Through the Looking Glass (2004)
- Concurrency Oriented Programming in Termite Scheme (2006)
- Type-safe distributed programming with ML5 (2007)
- MBrace
  - MBrace: cloud computing with monads (2013)
  - MBrace Programming Model (Tutorial)
Large-scale parallel processing (batch)
- MapReduce: simplified data processing on large clusters (2008)
- DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language (2008)
- Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing (2012)
- Spark SQL: Relational Data Processing in Spark (2015)
- FlumeJava: Easy, Efficient Data-Parallel Pipelines (2010)
- GraphX: A Resilient Distributed Graph System on Spark (2013)
- Dremel: Interactive Analysis of Web-Scale Datasets (2010)
- Pig latin: a not-so-foreign language for data processing (2008)
Large-scale parallel processing (streaming)
- TelegraphCQ: continuous dataflow processing (2003)
- Naiad: A Timely Dataflow System (2013)
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters (2012)
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing (2015)
From these tips on how to read a research paper, and some common sense, I know I won't be able to read all of these papers quickly, nor am I interested in doing that right now.
From https://blog.ably.io/what-is-a-distributed-systems-engineer-f6c1d921acf8
- Understanding of a hash ring: Cassandra, Riak, Dynamo, Couchbase Server
- Protocols for keeping track of changes in cluster topology, in response to network partitions, failures, and scaling events:
  "Various protocols exist to ensure that this can happen, with varying levels of consistency and complexity. This needs to be dynamic and real time because nodes come and go in elastic systems, failures need to be detected quickly, and load and state needs to be rebalanced in real time."
  - Gossip protocol
  - Paxos protocol
  - Raft consensus algorithm
  - Popular consensus-backed systems like etcd and ZooKeeper, and gossip-backed systems like Serf
- Eventually consistent data types and read/write consistencies: be familiar with CRDTs or Operational Transform, and with variable consistency levels for queries or writes to data in a distributed data store.
  "locks are impractical to implement and impossible to scale. As a result, trade-offs need to be made between the consistency and availability of data. In many cases, for example, availability can be prioritised, and consistency guarantees weakened to eventual consistency, with data structures such as CRDTs."
  - Operational Transform: implemented by Google originally in their Wave product and now in Google Docs. It has uses in collaboration apps, but OTs are complex and not widely implemented.
  - Conflict-free Replicated Data Types (CRDTs): provide an eventually consistent result so long as the available data types are used. Used by the Riak distributed database and by Presence in Phoenix (a minimal sketch follows these notes).
  - Consistency levels for both reads and writes in distributed databases like Cassandra
- Higher-level protocols such as HTTP, WebSockets, gRPC, and TCP sockets, and the full stack of protocols they rely on all the way down to the OS itself. At each layer, be confident in your understanding and ability to debug problems at a packet or frame level. A WebSockets example:
  - DNS protocol and UDP for address lookup
  - File descriptors (on *nix) and buffers used for connections, NAT tables, conntrack tables, etc.
  - IP to route packets between hosts
  - TCP to establish a connection
  - TLS handshakes, termination, and certificate authentication
  - HTTP/1.1 or, more recently, 2.0, used extensively by gRPC
  - WebSocket upgrades over HTTP
- Also be a solid systems engineer: have fundamentals such as programming languages, general design patterns, version control, infrastructure management, and continuous integration and deployment systems already in place.
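Since CRDTs keep coming up (in the syllabus, in the reading list above, and in these notes) and I flagged them as something I want to understand, here is a minimal sketch of one of the simplest ones, a grow-only counter (G-Counter). The replica names are made up for illustration: each replica increments only its own slot, and merging takes the element-wise maximum, so every replica converges to the same value no matter how often or in what order states are exchanged:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge = element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # replica id -> count contributed by that replica

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, so replicas can
        # exchange state in any order, any number of times, and still agree.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas increment independently, then exchange state in either order.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

This is the state-based flavor described in the comprehensive study paper listed above; the papers cover richer types (sets, registers, maps) and the operation-based variants.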