In recent months, we DaiLambda have worked on Plebeia, an implementation of sparse Merkle tree disk storage. It is immutable: persistent and functional both in memory and on disk. It is optimized for data with relatively small sizes under 40 bytes which makes it suitable for blockchains such as Tezos. Plebiea can reduce the size of the current state repository of Tezos down to 1/10. More importantly, since Plebeia is a straightforward implementation of sparse Merkle tree, it can provide Merkle proofs of values easily to lightweight wallets which cannot carry the whole states of blockchain.
Plebeia is currently written in OCaml, but we are also porting it to F* for correctness proofs.
Plebeia was originally designed and prototyped by Arthur Breitman (commit). Since we found it easily understandable and promising, we have forked it and filled missing features, fixed bugs, added optimizations, and added stress tests. We have tested it to Tezos node, and so far got pretty positive results.
We hope Plebeia one day replaces the current Tezos storage subsystem. The current storage subsystem of Tezos uses too much disk space: the disk storage of Tezos “context”, the file for the Tezos blockchain states, exceeds 180GB in about one year since its mainnet launch and still growing. Nomadic Labs have introduced snapshots and history modes for Tezos nodes which can reduce the disk usage for nodes in non-archive mode, but this is not the ultimate solution: the archive node with the full blockchain history still suffers of the huge data file size. Plebeia based storage subsystem can reduce it to 1/10.
Merkle tree and Merkle proof
Merkle tree is usually a binary tree (therefore each edge is labeled with
Right) where each node has its hash value. The hash value of the leaf node with value
V is the hash of
H(V). The hash of an internal node is
H(h||h'), the hash of the concatenation of the hashes
h' of its subnodes.
Merkle tree is a popular data structure in blockchain technology since it can provide Merkle proofs of values in it. Imagine that a user wants to know the value of the leaf he specifies in a tree, but he does not have the entire tree data at all. He asks a public server which carries all the data and he gets an answer. How can he be sure about the correctness of the answer? The server may be broken or malicious and the answer can be incorrect. Merkle proof provides hash values with which the user can reconstruct the root hash of the tree from the leaf value. The user only needs to know the top hash value of the tree, which also identifies the tree from the other trees. The size of a Merkle proof is proportional to the depth of the leaf, therefore it is only
O(log n) where
n is the size of the nodes.
This enables lightweight wallets. Thanks to the Merkle proofs, it needs not to carry the blockchain states at all but the history of the Merkle hashes.
Sparse Merkle tree
If an internal node has only one subnode, then its hash is only dependent on
its sole subnode's hash
H(0||h'). Having too many such internal nodes is a waste of space and hash calculations, and impacts the performance of the storage significantly. Therefore we reach to sparse Merkle trees, which extends the labels on the edges from a single alphabet
R to a segment of them, to get rid of the internal nodes with only one subnode. To distinguish the same node connected under edges with different alphabet segments, their hash computation must aware these edge alphabets. For example,
H(h||h'RR), if the right edge is labeled
If you are familiar with trie and patricia trees, then you can immediately understand another name of sparse Merkle trees, patricia hash trees. In addition, if you are familiar with the social class system in the ancient Roman empire, you can understand the name of the project, Plebeia.
The node sparsity is pretty common if the blockchain states are stored in the directory structured file system. Imagine that you have a directory to record the balances of the accounts. The directory should have files whose names are the public key hashes of the accounts. Such a directory must be very sparse, therefore we must use patricia tree (with hashes) rather than trie.
Plebeia is an implementation of this sparse Merkle tree. It is simple but carefully designed as a storage subsystem for blockchain states.
The tree manipulation is purely functional therefore it is a persistent data structure. Of course, we need disk I/O to record its nodes, but this I/O is separated from its purely functional core. Thus we can exclude the possibility of the typical bugs around pointer value mutations, which are usually very hard to fix (at least based on my programming experience).
The I/O itself is also persistent: it is “append-only”: only appending data at the end of the file and never overwrites existing data, except its 64 bytes header. This makes the I/O layer very simple and robust against system crashes: no “fsck” (data file integrity check) is required at crash recoveries.
Multiple depth directory structures
Plebeia implements multiple depth directory structures like Unix file system.
For this purpose, it introduces a special type of internal node called bud,
which corresponds with the Unix directory separator character
(This bud is omitted in the later pseudo OCaml code in this article for simplicity.)
Fixed cell size
The representations of all the nodes fit in 32-byte length “cells”.
On the data file, a node can be pointed via its index, a pointer of a 32-bit integer (4 bytes). The rest of 28 bytes in a cell is mostly used to record the hash of the node. This provides a bit unusual hash size: 224 bits at most. (The internal node has 222-bit hash.)
The sizes of the cell and the index restricts the maximum size of the data file:
32bytes * 0x100000000 = 138GB. Observing the storage size growth of Plebeia from the genesis to the current head in Tezos (18GB for 14months), it should take at least a few years to fill all this size. There is no technical restriction to change the size of the index (and the size of the cell subsequently) to accommodate more nodes.
Many cells point to their previous cells
In many cases, nodes refer to the data stored in their previous cells.
The 32-byte length cell design is hoping that 2 adjacent cells (64 bytes) can fit in the cache line of 64-bit CPU's. We expect it has positive performance impact but there is no benchmark yet around it.
Especially, one of the subnodes of an internal node is always stored in the previous cell in the persistent setting. Thanks to this property, we can omit recording 1 index of the subnodes of the internal nodes. (This partially breaks once if we introduce hash consing optimization explained later, but resolvable by introducing indirection.)
Internal node design
The naive definition of sparse Merkle tree is given as follows:
type t = | Leaf of hash * value | Internal of hash * (segment * node) option * (segment * node) option
This is perfectly OK in memory, but very bad on a file with the fixed cell size. The cell size is the maximum size of these 2 constructs, and the sizes for a
Leaf is much smaller than the size of an
Internal, we waste too much storage space for leaves.
Instead in Plebeia, we deconstruct the internal node of sparse Merkle tree into two, to give almost the same sizes to the constructs as possible:
- Branch: A node with two subnodes and both edges have only 1 alphabet.
- Extender: A node with only one subnode, which may have multiple alphabets on the edge.
In ML, it looks like:
type t = | Leaf of hash * value | Branch of hash * node * node | Extender of segment * node
Note that we do not need to record the hash for the extender since the hash is easily computable by appending segment string to the hash of the subnode:
H(Extender (seg,node)) = H(node) || seg.
When applied to Tezos blockchain data, this layout uses only the half of the disk space of the former naive definition. We have also estimated other 5 possible layouts in addition to the above 2, and this one has the best performance.
Tezos’ blockchain state is stored in a multi-depth directory structure in its storage subsystem, and the path names are very human-readable, such as
Tezos’ current storage is generic and independent from the protocol, therefore it does not know the path name structure of the blockchain state, such as
01/23/45/67/89/abcdef0123456789abcdef01234567 is a 20-byte length Blake2B hash and
123 is a natural number expressible in one byte. These directory names are recorded using their ASCII representation, which is much longer than their binary forms.
Plebeia is integrated into Tezos node with the knowledge of this directory structure: it converts path components to their optimal binary formats to reduce edge label length. Not only hashes nor integers, names like
index, etc. are also converted their shorter representations using 16-bit hashes. By compressing paths, we could reduce the number of
Extender nodes to 52% of
Branch nodes, which can be 200% in the theoretical worst case.
The directory structure information is crucial for storage subsystems to reduce the storage size. In the current experimental integration, this is given ad hoc, hardcoded in the wrapper layer between Plebeia and Tezos protocol. Ideally speaking, the protocol should have an interface to inform it to the storage subsystem in the shell.
Hash consing of small values
Most of the values in the blockchain state are very small. Here is the distribution of Tezos values by size at block level 547221. 99.994% of the values are under 35bytes:
Here, blue bars are the number of nodes of each byte size, and reds are the number of distinct values at each size. The distribution is not flat at all: for example, the high peak at 33 and 34 bytes are by public key hashes for blockchain accounts. You can also notice that red bars are very shorter than the blue and some are almost invisible at many places. This indicates that sharing leaves with these same values should be quite an effective optimization of the storage size.
Plebeia does this hash consing of small values up to 36 bytes, stores all the small values it has ever observed in memory and on the data file. At the block level 561124, 173 million distinct small values are observed and stored, which costs about 6GB of memory. Of course, this is too extreme. We hope we can reduce this table size in memory by restricting the target value sizes and forcing to forget rarely appearing values from it, without damaging cache performance significantly.
We have integrated Plebeia into Tezos node and ran it in the archive mode and fully bootstrapped from the genesis to the current block level, i.e. 561170. The data file size ratio of Plebeia/Original is 1/6 of the original at the beginning, then reached 1/10 around level 561170:
Since the node runs these 2 subsystems at the same time, we could also check the correctness of Plebeia storage comparing all the values from the both subsystems for each read.
We are pretty happy about getting this result even at the very first integration experiment. By analyzing the statistics, we hope to achieve further optimizations in the near future.
Porting to F*
Currently, Plebiea is written in pure OCaml code. The correctness of sparse Merkle tree manipulation is checked by runtime invariant checks at each tree construction. Quickcheck style random tests are also performed.
Writing tree manipulation is hard, even with a functional language which is sutable for such purposes and full of tests. Even with more than millions of random tests, some bugs around corner cases were not found until the integration experiment to Tezos node.
As well as the correctness of its protocol, the correctness of the storage subsystem is very important to blockchain technology. Tests are not enough. Formal verification is absolutely required to assure the correctness. And pure functional tree manipulation is a typical target of formal verification. No reason not to do it.
We have just started a joint research with Kyoto university team to port Plebeia to F*, where we can **prove** many invariants of tree manipulations which we only randomly test currently. Once proved its correctness, certified Plebeia in F* will be extracted down to OCaml and be easily linkable with Tezos node. More boldly in future, we may be able to extract it to C code using KreMLin, which should run much faster.
Current Tezos storage subsystem knows nothing about the directory structure of Tezos blockchain states. This restricts it very generic and makes any optimization using characteristics of the storage very hard. What we have done was a “cheat” in the integration experiment of Plebeia: we coded a path compression which converts human-readable but inefficient path names to short and space-efficient binaries. This information must be passed from the protocol to the shell in a clean way, across the protocol separation boundary. This requires the modification to the Tezos protocol layer, and in Tezos this must be approved by its on-chain governance.