Proposal for Micheline encoding optimization

Summary

The binary encoding of Micheline has several points that can be improved. We propose an extension that reduces the encoding size. Our quick survey shows a possible size reduction of about 22%.

The size reduction directly lowers the storage burn fee. It should also improve context disk access and reduce the storage access gas costs.

The new encoding is not yet implemented. The extension may add some runtime overhead to serialization, but we expect it to be minimal.

Introduction

Storage burn is one of the main costs of using Tezos. It is 250 mutez (microtez) per byte in the latest protocol, i.e. 250 tez per megabyte. If data can be written with fewer bytes on the Tezos blockchain, users save on storage burn.

Micheline is the data representation of contract values (code, storage, and big_maps). It is serialized with the Data_encoding library. While working on a compact encoding for Merkle proofs, we learnt that the current Data_encoding serialization is not optimal.

We propose an extension of the Micheline encoding to reduce the size of contract data in the Tezos context.

Length field

Data_encoding uses a 4-byte fixed-width integer for the length of strings, bytes, and sequences (lists). This is far larger than necessary for almost all data in the Tezos context.

  • The length field of a list records its size in bytes, not its number of elements; the byte size is much larger than the element count.
  • The sequence inside DIP { .. } usually contains fewer than 128 opcodes.
  • pair ty1 ty2 ty3 uses more bytes than pair ty1 (pair ty2 ty3), because a prim of arity 3 or more is encoded with a list and a string inside, each carrying a 4-byte fixed-width size field.

Instead of the fixed-width integer field, we can use a variable-length integer. For example:

  • 0 .. 2^7-1(=127): 1 byte
  • 2^7(=128) .. 2^14-1(=16383): 2 bytes
  • 2^14(=16384) .. 2^21-1(=2097151): 3 bytes
  • ..

[name=jun] We may use the encoding of Z.t, but it is slightly less efficient since it has a sign bit.
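
As an illustration, here is a minimal OCaml sketch of such a variable-length size field (unsigned LEB128 style: 7 payload bits per byte, with the high bit marking that another byte follows). The function `encode_varint` is hypothetical and is not part of Data_encoding:

```ocaml
(* Hypothetical sketch of a variable-length size field (unsigned LEB128
   style): 7 payload bits per byte, high bit = "more bytes follow". *)
let encode_varint (n : int) : bytes =
  assert (n >= 0);
  let buf = Buffer.create 4 in
  let rec go n =
    let low = n land 0x7f and rest = n lsr 7 in
    if rest = 0 then Buffer.add_char buf (Char.chr low)
    else begin
      Buffer.add_char buf (Char.chr (low lor 0x80));
      go rest
    end
  in
  go n;
  Buffer.to_bytes buf

(* 0..127 -> 1 byte, 128..16383 -> 2 bytes, 16384..2097151 -> 3 bytes *)
let () =
  assert (Bytes.length (encode_varint 127) = 1);
  assert (Bytes.length (encode_varint 128) = 2);
  assert (Bytes.length (encode_varint 2_097_151) = 3)
```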

Micheline tag extension

The Micheline encoding has a 1-byte tag field, but currently only 11 of the 256 possible tags are used. We can embed tags for Michelson opcodes, constructors of constants and types, and small integers in the unused range.

The current tags

  • 0 : Int
  • 1 : String with a size field
  • 2 : Seq with a size field
  • 3 : Nullary prim
  • 4 : Nullary prim with an annotation string
  • 5 : Unary prim
  • 6 : Unary prim with an annotation string
  • 7 : Binary prim
  • 8 : Binary prim with an annotation string and a size field
  • 9 : 3-ary or more prim with 2 size fields
  • 10 : Bytes with a size field

For backward compatibility, we keep these tags. When a value is newly serialized, tags 1, 2, 4, 6, 8, 9, and 10 will no longer be used; the following new tags will be used instead.

New tags

New tags with variable-length size field(s):

  • 11 : string with a variable-length size field
  • 12 : bytes with a variable-length size field
  • 13 : seq with a variable-length size field
  • 14 : nullary prim with an annotation string with a variable-length size field
  • 15 : unary prim with an annotation string with a variable-length size field
  • 16 : binary prim with an annotation string with a variable-length size field
  • 17 : 3-ary or more prim with a variable-length size field for the arity and an annotation string with a variable-length size field

Tags for small integers, constructors, and opcodes embedded directly in the Micheline tag byte:

  • +16 tags: small integers (0..15)
  • About +8 tags: constant constructors (Unit, binary Pair, etc)
  • About +32 tags: type constructors (int, list, binary pair, etc)
  • About +120 tags: Michelson opcodes (SWAP, unary DIP, binary DIP, etc)

The total number of tags used would be around 192, leaving about 64 tags for future extensions.

We may not want to embed all the constructors and opcodes. Rarely used constructors and opcodes can be skipped.
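
To make the proposed tag layout concrete, here is a hedged OCaml sketch of how a decoder could dispatch on the extended tag byte. The `node_kind` type and the exact ranges are illustrative assumptions; the final tag assignment would be fixed during implementation:

```ocaml
(* Hypothetical tag layout; the ranges below are illustrative only. *)
type node_kind =
  | Legacy of int       (* 0..10: current Micheline tags, kept for compatibility *)
  | Varlen of int       (* 11..17: same shapes, but with variable-length size fields *)
  | Small_int of int    (* 18..33: integers 0..15 folded into the tag byte *)
  | Known_prim of int   (* 34.. : frequent constructors and Michelson opcodes *)

let classify_tag (tag : int) : node_kind =
  if tag < 0 || tag > 255 then invalid_arg "classify_tag"
  else if tag <= 10 then Legacy tag
  else if tag <= 17 then Varlen tag
  else if tag <= 33 then Small_int (tag - 18)
  else Known_prim tag
```

Because tags 0..10 keep their current meaning, values serialized with the original encoding remain readable by the same decoder.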

Estimation

Estimation method

We first defined a function that computes the encoding size of a given Micheline value. Its correctness was tested with random inputs against the encoding function defined in Micheline_encoding.

A modified version of this function was then defined to compute the encoding size with the new tags. (We have not implemented an actual optimized encoder yet.)

We ran both size calculation functions over the same Micheline data set.
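
The following OCaml sketch illustrates the kind of size calculation involved, over a simplified node type that mirrors Micheline's (the real type uses Z.t integers and carries locations). The per-node costs are illustrative assumptions, not the exact rules of the surveyed encoder:

```ocaml
(* Simplified Micheline-like AST; the byte costs below are illustrative only. *)
type node =
  | Int of int
  | String of string
  | Bytes of bytes
  | Prim of string * node list * string option  (* name, arguments, annotation *)
  | Seq of node list

(* Size of a variable-length length field (7 payload bits per byte). *)
let varint_size n =
  let rec go n acc = if n < 128 then acc else go (n lsr 7) (acc + 1) in
  go n 1

(* Estimated size with the new tags: small integers and prim names are assumed
   to fit in the tag byte; strings, bytes, and sequences carry a variable-length
   size field. *)
let rec new_size = function
  | Int n when n >= 0 && n < 16 -> 1            (* folded into the tag *)
  | Int n -> 1 + varint_size (abs n)            (* tag + rough payload estimate *)
  | String s -> 1 + varint_size (String.length s) + String.length s
  | Bytes b -> 1 + varint_size (Bytes.length b) + Bytes.length b
  | Seq args ->
      let body = List.fold_left (fun acc a -> acc + new_size a) 0 args in
      1 + varint_size body + body
  | Prim (_name, args, annot) ->
      let body = List.fold_left (fun acc a -> acc + new_size a) 0 args in
      let annot_size =
        match annot with
        | None -> 0
        | Some a -> varint_size (String.length a) + String.length a
      in
      1 + body + annot_size
```

A sibling function using the current tags and 4-byte size fields gives the baseline; the ratios reported below come from comparing the two over the same data set.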

Data set

All the 106117 KT1 contracts found on Tezos mainnet at block level 2354630 (a block from 2022-05-11):

  • 2639 different Michelson codes
  • 106117 contract storages
  • 334478 big_maps

Contract code

Target: the 106117 contract codes, of which 2639 are unique.

Average result (new size as a percentage of the old size): 60.21%. Total size: 152.27MiB => 91.74MiB

For any Michelson code, the majority of the data consists of Michelson opcodes, so the reduction rate is quite similar across contracts.

Contract storage

Target: the storage of the 106117 contracts, excluding the big_maps, which are stored elsewhere.

Average result: 78.52%. Total size: 23.47MiB => 20.65MiB

Unlike Michelson code, the reduction rate for storage is less uniform. It depends heavily on the storage type: for example, we cannot expect a big size reduction for an int list when the list is long and its integers are larger than 15.

Big_maps

Target: the values stored in 334478 big_maps.

Average result: 82.98%. Total size: 644.17MiB => 526.19MiB

The following graph shows the encoding size change of the values of each big_map:

Total

Contract code + storage + big_maps:

77.9% (820.02MiB => 638.58MiB)

It frees 181.44 MiB, which corresponds to 45359.2 tez of storage burn at the current fee of 250 mutez per byte.

Reduction of context disk IO, context disk size, and snapshot sizes

Not available for the moment.

The new encoding should reduce them too, but they are not easy to estimate: we would need to implement the actual encoding and re-encode all the Micheline values in the Tezos context with it.

Who benefits from this optimization?

The optimization reduces used_bytes by about 22%. The first users of each contract after the optimization can use the saved bytes freely for a while. Once all the saved bytes are consumed, users must burn storage fees again, but storage should be about 22% cheaper.

This may sound unfair to past users of the contracts, who end up having paid storage burn that some future users benefit from, but I see no reasonable way to reimburse them.

How to actually implement it

Which parts of the Tezos implementation must we change?

Only the encoding changes; the Micheline type itself is not touched at all. The new encoding is backward compatible for reading: values written in the original encoding can still be loaded without any problem.

It will be a new version of the Micheline encoding, so we will need to add a new protocol environment (Proto_env). Apart from switching the Proto_env, I think no other protocol change is required: the new encoding will be applied automatically whenever someone updates the storage or a big_map binding of a contract.

If we want the reduction to take effect immediately, it is also possible to re-encode all the data at a protocol migration. The conversion would have to be spread gradually over multiple blocks, since it needs to traverse all the contract directories.

Incompatibility

Not exactly an incompatibility, but something we must be careful about.

Hash

Hashes are computed from the serialized value, so changing the encoding changes the hash of a value. Since big_maps use the hash of values as keys, we cannot use the new encoding when hashing their keys; otherwise, existing big_maps would be broken. We must keep the original encoding for big_map key hashing.

Not only big_map keys: any hashing of Michelson values must keep using the original encoding rather than the new one.

Pack

PACK serializes values using the Micheline encoding. If we used the new encoding, the resulting bytes would change, which is a problem if those bytes are then hashed. PACK must therefore keep using the original encoding; a new opcode such as PACK2 would be required for packing with the new encoding. No new opcode is needed for unpacking, since the new encoding is backward compatible for reading.