New IO benchmarks



DaiLambda, Inc.
Jun FURUSE/古瀨 淳
Gas team meeting, 2023-09-11

Summary

  • The current storage costs are seriously underestimated.
  • The existing IO benches are not suitable for updating the parameters.
  • We wrote new benchmarks.
  • We got some results.

Blocks with overtimes

Mainnet block 4043830:

  • 270 kGU: estimated application time <= 270 ms.
  • 800 ms to apply on one machine.
  • 3 s on another.
  • 6 s on yet another.

Some costs must be underestimated,
likely the storage costs.

Gas composition of block 4043830

Total: 270,739,726 milli GU

  Carbonated_map           76,800
  Apply                20,060,733
  Typecheck             3,448,908
  Storage             235,456,090
  Blake2B                 276,484
  Script_interpreter      929,352
  Cache                   272,064
  Script_decode         7,820,340
  Ticket                  222,336
  Script_ir_unparser    2,176,619

87% of the gas is consumed by storage_functor.ml

Analysis of the gas-vs-time

Block 4043830:

  • Gas estimation: 270,739,726 milli gas
  • Estimated gas for storage: 87% of the above
  • Gas for 6 seconds: 6,000,000,000 milli gas

If we explained the difference only by the storage cost underestimation, the storage cost would have to be at least x25 the current estimate.
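
A rough reconstruction of that arithmetic (an OCaml sketch; the 1 GU = 1 μs scale is taken from the 270 kGU ≈ 270 ms estimate above):

  let total_gas   = 270_739_726    (* milli GU, whole block *)
  let storage_gas = 235_456_090    (* milli GU, ~87% of the total *)
  let non_storage = total_gas - storage_gas
  let six_seconds = 6_000_000_000  (* milli GU, at 1 GU per microsecond *)

  (* factor by which the storage estimate must grow to account for 6 s *)
  let factor = float (six_seconds - non_storage) /. float storage_gas
  (* factor ≈ 25.3 *)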

Issues of the existing IO benches

  • The effect of the OS disk cache is ignored.
  • Test data are too small (≈ 2GB); all of it can be cached.
  • Inappropriate model for writing.
  • Slow to execute (>= 12h)

3 levels of caches

Irmin cache
Partially loaded Merkle tree in memory.
Purged by closing the index or calling flush.
OS disk cache
Linux uses its available memory for the disk cache.
Purging requires the root privilege.
Hardware cache
Disks also have a small memory cache inside.
We ignore hardware caches since they are relatively small.

The benches ignore the effect of the OS disk cache.

Too small context data

By default, benchmarks generate contexts of 2GB size.

The whole context can be cached even on an 8GB machine.

The benches do almost no disk access on reads,
since the entire context can be cached.

Inappropriate write model

The current write model ignores the path length
(= the levels of directories).

The writing cost must depend strongly on the path length.

  • Most of the storage cost comes from random disk accesses.
  • # of random disk accesses ∝ depth of the Merkle tree node
  • Depth of the Merkle tree node ∝ path length
  • Writing requires access to the directory of the destination,
    so its cost must be proportional to its path length.
  • The disk cache may help reduce the random disk accesses.

New model for reading/writing

cost(path_length, nbytes) =
    coeff_path_length * path_length
    + coeff_nbytes * nbytes
    + intercept

where

  • path_length of /hello is 0
  • path_length of /contracts/index is 1
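
A minimal OCaml sketch of this model (the function and parameter names are illustrative; the coefficients are the values to be inferred by the benchmarks):

  (* path_length counts the intermediate directories:
     "/hello" -> 0, "/contracts/index" -> 1 *)
  let path_length path =
    List.length (String.split_on_char '/' path) - 2

  let cost ~coeff_path_length ~coeff_nbytes ~intercept ~path ~nbytes =
    coeff_path_length *. float (path_length path)
    +. coeff_nbytes *. float nbytes
    +. intercept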

Slow to execute

Needs >= 12h.

Mainly due to generating 2GB of context data
just for one read/write.

Problems of the reference machine

  • RAID 5
  • 32GB memory

RAID

The reference machine uses RAID 5,
which is not typical for Tezos nodes.

  • Multiple SSD drives
  • Reads are performed in parallel.

Benchmarking on the reference machine may cause underestimation.

32GB memory

The ref. machine has 32GB memory.

25GB is left available when running a Tezos node.

  • Much bigger than the 6.0GB available on the minimal-spec machine (8GB of memory).

The size of the disk cache greatly affects the IO performance.

Need to emulate the memory environment of the minimal spec machine.

The worst scenario is unrealistic

If nothing is disk-cached:

  • 1 read takes 0.1 seconds = 100 kGU.

It must be assumed that the disk cache is filled with recent context accesses.

Need to write new IO benches

The existing benches have issues.

We wrote new benches.

New IO benchmarks

io/READ and io/WRITE:

  • Restrict the available memory for cache
  • Use the mainnet context
  • Prepare a disk cache filled with random context accesses
  • Randomly read/write context files and measure their times
  • Execute without RAID

Restrict the available memory

Keep the kernel’s available memory around 6.0GiB
by allocating dummy memory blocks.

This requires /proc/meminfo, so it is Linux-specific.
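
A minimal sketch of the idea, not the actual benchmark code (block size and names are illustrative):

  (* Read MemAvailable (in kB) from /proc/meminfo.  Linux specific. *)
  let mem_available_kb () =
    let ic = open_in "/proc/meminfo" in
    let rec loop () =
      match input_line ic with
      | exception End_of_file ->
          close_in ic; failwith "MemAvailable not found"
      | line ->
          (match Scanf.sscanf line "MemAvailable: %d kB" (fun kb -> kb) with
           | kb -> close_in ic; kb
           | exception _ -> loop ())
    in
    loop ()

  (* Allocate (and touch) dummy 256MB blocks until the kernel's available
     memory drops to the target, e.g. ~6.0 GiB for an 8GB machine. *)
  let ballast = ref []

  let restrict_available_memory ~target_kb =
    while mem_available_kb () > target_kb do
      ballast := Bytes.make (256 * 1024 * 1024) '\000' :: !ballast
    done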

Use the mainnet context

Not a fresh snapshot import:
it is too dense and causes underestimation.

Need to grow the context by running the node over it, until the size of the context reaches 40GB.

  • 40GB is today’s maximum size of the context
  • The context GC keeps the context around this size.

Prepare the disk cache

The disk cache of a Tezos node machine is filled with the context data of recent commits.

To emulate this state of the disk cache, we initialize it by accessing the context data.

  • Reset the disk cache
    • This requires the root privilege
    • Linux specific (/proc/)
  • Randomly read context files until the disk cache is filled.
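
A sketch of the reset step (dropping the Linux page cache via /proc/sys/vm/drop_caches is a standard root-only interface; the warm-up reads are elided since they depend on the context layout):

  (* Drop the Linux page cache; requires root.
     Equivalent to: echo 3 > /proc/sys/vm/drop_caches *)
  let reset_disk_cache () =
    let oc = open_out "/proc/sys/vm/drop_caches" in
    output_string oc "3\n";
    close_out oc

  (* Then read random context files until the available memory stops
     shrinking, i.e. the disk cache is filled with context data. *)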

Benchmark loop

Read/write random files from/to the head commit
under the prepared disk cache, and measure their times.
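
The measurement itself is plain wall-clock timing around one context access, roughly as below (a sketch; read_random_file stands for whatever context access is being benchmarked):

  (* Time a single read or write with wall-clock time. *)
  let time f =
    let t0 = Unix.gettimeofday () in
    let result = f () in
    (result, Unix.gettimeofday () -. t0)

  (* let (_, seconds) = time (fun () -> read_random_file context path) *)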

Read

Time of reading 1 random file

  • Call flush to reset the Irmin cache and avoid memory exhaustion.

Write

Time of writing 1 random file of random size + commit:

  • Overestimation: 1 commit for each write.
    • 1 commit per 10 writes reduces the gas 2/3
  • Keep using the same context for performance.

Benchmark time

32 minutes for io/READ and io/WRITE

Some preparation is required though:

  • Mainnet snapshot import
  • Run the node over it for a while

Result emulating 8GB machine

The results have high variance, but roughly:

Read: x1.3 of the current cost

  • coeff for the path length: 57_000
  • coeff for the file bytes: 1.5
  • intercept: 0

Write: x24 of the current cost

  • coeff for the path length: 930_000
  • coeff for the file bytes: 1
  • intercept: 0
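
As a usage example of the cost model sketched earlier, plugging these write coefficients in for a hypothetical 100-byte file at path length 5 (assuming the coefficients are expressed in milli GU, consistently with the other gas figures here):

  let write_cost_8gb =
    cost ~coeff_path_length:930_000. ~coeff_nbytes:1. ~intercept:0.
         ~path:"/a/b/c/d/e/f" ~nbytes:100
  (* = 4_650_100 milli GU, i.e. about 4.65 ms at 1 GU per microsecond *)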

Evaluation of the result

Do the suggested parameters match the actual block application time?

So far we have only 1 sample, block 4043830.

Storage accesses of block 4043830

Reads

  • 624 times
  • 3656 total path length
  • 183133 bytes

Writes

  • 428 times
  • 2376 total path length
  • 18456 bytes

Storage access gas of block 4043830

Reads

  • Nairobi: 125K gas
  • Newly inferred: 173K gas, about x1.4

Writes

  • Nairobi: 85K gas
  • Newly inferred: 1,812K gas, about x21.1

Newly inferred - Nairobi = 1,775K gas, i.e. 1.775 seconds

This does not yet explain the 6-second application time…

Disk cache matters

More memory for the disk cache improves the IO performance.

Result emulating 16GB machine

14GB available memory for disk cache

Read: x1 of the current cost

  • coeff for the path length: 41_000
  • coeff for the file bytes: 2
  • intercept: 0

Write: x15 of the current cost

  • coeff for the path length: 600_000
  • coeff for the file bytes: 1.5
  • intercept: 0

Result emulating 32GB machine

30GB available memory for disk cache

Read: x1 of the current cost

  • coeff for the path length: 40_000
  • coeff for the file bytes: 2
  • intercept: 0

Write: x13 of the current cost

  • coeff for the path length: 510_000
  • coeff for the file bytes: 1.8
  • intercept: 0

Context GC

Irmin’s context GC likely impacts the IO performance negatively, but we have no idea how serious it is.

The IO performance may be affected by the size of the latest layer:

  • Better performance after a new layer is created?
  • Worst performance just before a GC, since the latest layer reaches its maximum size?

We are afraid this is out of scope for our Q3 work.

Conclusions so far

The gas costs of storage (context) accesses are seriously underestimated.

  • Lots of variables make benchmarking hard
    • Available memory for disk cache
    • SSD performance
    • Context GC timing
  • New IO benches suggest x1.3 for reading and x24 for writing.
  • Too early to update the gas parameters.

Q: How are the read and write times measured?

The benchmark is executed outside the protocol.

It measures the time to read or write a random file by Tezos_protocol_environment.Environment_context.

Q: How does the size of the context affect the IO performance?

The benchmarks are done against a mainnet context of 40GB size, where several GCs have already been performed.

The IO performance is much better with a 7.1GB context just imported from a mainnet snapshot.

  • Read coeff: 58_000 (almost the same)
  • Write coeff: 580_000 (62%)

Q: How to visualize the benchmark and inference results?

The benchmark creates gnuplot files for visualization:

  • READ_io_validaiton.plot
  • WRITE_io_validaiton.plot