The Story Behind Replica_IO

#decentralizedcomputing #distributedsystems #faulttolerance #replication

This post tells how the Replica_IO project originated and explains the
motivation behind it.

The original post can be found here.

My Background

I'd like to start by tell you a bit about my professional background.
I'm a research engineer with quite some experience in software
engineering. I began working as a software engineer back in 2009.

First 7 years, I was mostly focused on developing low-level system
software: I worked with such things as Linux kernel, microcontrollers,
hardware emulation, and trusted execution. Back then, I particularly
enjoyed contributing to Qemu, a generic and
open-source machine emulator and virtualizer. My contribution included
enhancing emulation of the ARM platform and enabling multithreading
support in the generic binary translation engine

In 2016, I took a big leap and came into research and development in
the areas of blockchain, distributed and decentralized systems. Soon
enough, I became absolutely excited about this, and since then, I keep
expanding my knowledge and experience in that area, in particular,
designing and implementing distributed protocols. During that period,
apart from proprietary stuff, I worked on the following open-source
projects:

MinBFT Hyperledger Lab — an implementation of the MinBFT consensus protocol as a pluggable component. I was the main author, contributor, and maintainer of the project.
Mir — a framework for implementing, debugging, and analyzing distributed protocols. My main contribution was implementation of the checkpointing mechanism, protocol garbage collection, and reproducible testing with simulated time.
Interplanetary Consensus (IPC) — a framework to enable on-demand horizontal scalability of the Filecoin blockchain. My main contribution was redesign and implementation of the atomic cross-chain transaction execution protocol in Rust.

Implementing Distributed Protocols

So much was I excited about distributed systems, but, after a while, I
started feeling like there's something fundamentally wrong in how we
usually design and implement them.

Distributed protocols are notoriously complex, and it took academia
significant effort to develop a solid theoretical foundation for them.
Due to inherent concurrency, the reasoning about distributed systems
is quite tricky, and there are lots of pitfalls where one gets trapped
pretty quickly, unless being extremely careful. Though, I find this
really fascinating because I particularly love digging deep and
thinking thoroughly.

However, the way distributed protocols are conventionally described on
paper makes it hardly possible to implement them correctly with
confidence; it's simply too far from the realities of software
engineering. Not only academic papers often neglect some details of
practical importance but also the language and notation used there,
they require nontrivial translation to the languages and patterns
commonly used in programming. Add there typical issues that come up
inevitably when programming concurrent systems, time pressure, and we
end up with a great mess that one can hardly comprehend and maintain.

Moreover, it seems like those engineers who get their hands dirty and
implement distributed protocols for practical use tend to jump in and
try applying whatever approach they were used to or that was implied
by the surrounding system. Although one can certainly learn a lot from
such experiments (and I'm doing that), it's generally waste of efforts
when one simply needs to get the thing reliably working. More than
that, since this kind of code is quite hard to get right, inevitable
mistakes creep into such implementations and lurk there unnoticed.
Even when some of those mistakes get revealed, individual projects are
usually too busy and too specific to keep following and effectively
learning from each other.

Having implemented a couple of distributed protocols myself, I find
this status quo deeply unsatisfactory, especially when it comes to
distributed replication mechanisms such as consensus protocols. After
all, they are supposed to ensure consistency and availability in such
critical computing systems as distributed coordination services,
distributed databases, and blockchain. There is an opinion that the
main obstacle to wider adoption of distributed, decentralized systems,
particularly those capable of tolerating arbitrary (Byzantine) faults,
is their requirement for additional resources and reduced performance.
While it's certainly true that high reliability doesn't come for free,
I think the concerns regarding complexity do actually matter a lot in
the end; it's simply hard to get it right.

I think decentralized Byzantine-fault tolerant mechanisms should
prevail in future computing systems and we can do a much better job
working towards that. I believe such complex problems can have neat
solutions, not only efficient, but also easy to use. Clearly,
discovering and developing such solutions does take quite some effort.
There must have been attempts to solve this problem, apparently not
very successful. But since I like to think of myself as someone
discovering smart solutions to hard problems, I'm not too scared; I'm
stubborn enough 😄

Replica_IO

So I was thinking about this for years, but never managed to find room
for seriously working on it. Suddenly, in February 2023, I was
affected by a lay-off in Protocol Labs and had to
leave; by that time, I had worked with the company as a long-term
collaborator, a Research Engineer at the ConsensusLab
group, for almost a year. After a while, I realized that this is
actually a great chance to finally start working on what I was
dreaming of.

Initially, I thought I would just take a break and spend some time on
a hobby project. I already had a name for it — Replica_IO, which had
come to my mind a few months before, as I had been yet again thinking
about communication between replicas in a distributed replication
system. However, once I started asking myself about my real intention
behind this, I realized that it's much bigger than just playing with a
pet project: what I really want is to make a breakthrough in how
distributed systems are designed and implemented!

In March 2023, I decided to found the Replica_IO project and work on
it full time as an independent research engineer. Since I believe in
open source, open innovation and collaboration, I also wanted to make
it radically open and started developing it entirely in the open from
day one. I described the project's purpose, goals, and
approach, created its logo, defined the initial
roadmap, and started working on the first milestone.

At the time of this writing, I'm exploring some relevant
state of the art, summarizing the findings.
Approaching this in a systematic way lets me dive deeper into the
problem, form a more educated opinion, find some inspiration, and
ultimately come up with effective ideas for achieving the project's
key technical objectives.

I understand how ambitious the goals of this project are and that it
may take long time to get there, but I'm absolutely sure it is worth
the effort. I'm surprised how much attention the project has already
attracted and would like to see great experts from the relevant fields
become involved and help to make it real. I also count on getting enough
support for this initiative, and I'm grateful to those who have
already been helping 🙏

If you'd like to learn more about this project, please visit the
About page and watch this talk.