Introducing Datacake, the batteries included framework for building distributed systems in Rust
I'm super excited to share a library I've been working on for the last 6 months or so, its goal is to simplify the steps needed to be done in order to build an eventually consistent distributed system in Rust. You can find the GitHub repo at https://github.com/lnx-search/datacake
What it provides:
Here's a basic overview of what it provides if you want to keep the reading short:
- An ORSWOT (or-set without tombstones) CRDT (Conflict-free replicated data type) structure capable of managing multiple 'sources' which may apply events asynchronously from one another.
- An implementation of the hybrid logical clock (HLC) system in the form of the
HLCTimestampbased on [this paper] in essence, it is a drift-resistant, distributed wall clock.
- An actor-like RPC framework built upon rkyv supporting zero-copy payloads supporting runtime loading and unloading of services.
- An abstraction for building clusters built upon the chitchat crate, this can be used to extend the cluster system with a pre-established RPC system, cluster clock, etc...
- A pre-built extension for
datacake-nodebased on the above packages provides you with eventually consistent replication out of the box, only requiring one trait to be implemented for the storage of state.
- A pre-made implementation of the required traits as mentioned above using SQLite, giving you a complete batteries-included experience (works for both the eventually consistent store and Be Prepared)
Most of the code in the repo are tests or test utilities, and it still has a lot of things which could be tested, if you fancy a crack at breaking things or contributing adding more tests would be awesome! I fear my brain can no longer handle coming up with new ways of breaking this.
Why does Datacake exist?
Datacake is the result of my attempts at bringing high availability to lnx unlike languages like Erlang or Go, Rust currently has a fairly young ecosystem around distributed systems. This makes it very hard to build a replicated system in Rust without implementing a lot of things from scratch and without a lot of research into the area, to begin with.
Currently, the main algorithm available in Rust is Raft which is replication via consensus, overall it is a very good algorithm, and it's a very simple to understand algorithm however, I'm not currently satisfied that the current implementations are stable enough or are maintained in order to choose it. (Also for lnx's particular use case leader-less eventual consistency was preferable.)
Because of the above, I built Datacake with the aim of building a reliable, well-tested, eventually consistent system akin to how Cassandra or more specifically how ScyllaDB behaves with eventual consistent replication, but with a few core differences:
- Datacake does not require an external source or read repair to clear tombstones.
- The underlying CRDTs which are what actually power Datacake are kept purely in memory.
- Partitioning and sharding are not (currently) supported.
It's worth noting that Datacake itself does not implement the consensus and membership algorithms from scratch, instead, we use chitchat developed by Quickwit which is an implementation of the scuttlebutt algorithm.
Since this initial approach, I figured why not make Datacake an entire tool kit rather than just an eventually consistent store, so we have! And it makes building additional distributed algorithms a lot easier!
This package aims to provide some of the more core functionality needed in a distributed system, this essentially just includes the mentioned ORSWOT CRDT and the HLCTimestamp.
- Hybrid logical clock implementation, resilient to clock drift, and provides monotonic time.
- ORSWOT CRDT implementation with the concept of asynchronous 'sources'.
The core membership system used within Datacake.
This system allows you to build cluster extensions on top of this core functionality giving you access to
the live membership watchers, node selectors, cluster clock, etc...
A good example of this is the
datacake-eventual-consistency crate, it simply implements the
which lets it be added at runtime without issue.
Zero-copy RPC framework which allows for runtime adding and removing of services.
- Changeable node selector used for picking nodes out of a live membership to handle tasks.
- Pre-built data-center-aware node selector for prioritization of nodes in other availability zones.
- The distributed clock is used for keeping an effective wall clock that respects causality.
To get started we'll begin by creating our cluster:
Creating an extension:
Creating a cluster extension is really simple, it's one trait and it can do just about anything:
This came about due to some limitations with
tonic which we were previously using as our RPC system. The primary issue is that we cannot add and remove services at runtime once the server has been started, which the
datacake-node package requires to function in order to provide the entire cluster extension system.
While producing this framework the idea was along the lines of wanting to give it a familiar actor-like feel using the trait system, and also support any types which support
rkyv since it's our primary binary (de)serialization framework. All of this while using the same underlying
hyper http/2 system as tonic so performance should be about the same IO-wise.
As a side effect of this, we get the incredible performance of
rkyv and can support zero-copy deserialization of the messages in the form of the
Here's a basic demo:
Datacake Eventual Consistency
This was formerly known as
datacake-cluster (traces remain on crates.io) but has been renamed since the goal of Datacake changed from just being an embeddable store to being a distributed toolkit.
This provides you with an embeddable store that replicates data in an eventually consistent manner, built upon all over the previously mentioned crates it handles everything for you except the storage which you provide to it in the form of a single generic trait =D
- Cheap keyspaces, each keyspace can be operated concurrently, and it has the cost of a few Tokio tasks.
- Efficient IO, the system under working conditions only requires one round trip to nodes for syncing data.
- Adjustable consistency levels for controlling the state.
- Data center-aware replication prioritization, the system uses
datacake-nodes's node selector (which by default is data center aware) allows the service to prioritize replicating the data to nodes outside of its own availability zone, this can be really useful if you have a large number of nodes some of which are located in the same data center/zone.
This is the final crate that is currently developed, it implements the core storage trait required by the eventually consistent store.
It doesn't cover every requirement, but it does offer a reliable option for persisting state for testing or smaller workloads. If you're planning to use this for larger amounts of data then you may want to implement the trait yourself to better fit your needs.
Datacake so far has been a solo adventure more or less, and it's taken a long time to learn how to implement these things and test them. It would be awesome to have some more contributors help out and turn Datacake into the go-to distributed systems library.
- Writing more tests, particularly around the eventually consistent store.
- Investigate adding turmoil to tests for validating IO failures.
- System benchmarks.
- CASPaxos support (longer term)
- Multi-Raft support (very long-term)
Like what's going on here? Considering starting the project on GitHub 🌟