New IO benchmarks



DaiLambda, Inc.
Jun FURUSE/古瀨 淳
Gas team meeting, 2023-09-11

Summary

  • The current storage costs are seriously underestimated.
  • The existing IO benches are not suitable for updating the parameters.
  • We wrote new benchmarks.
  • We got some results.

Blocks with overtimes

Mainnet block 4043830:

  • 270 kGU: estimated application time <= 270 ms.
  • 800 ms to apply on one machine.
  • 3 s on another.
  • 6 s on yet another.

Some costs must be underestimated,
likely the storage costs.

Gas composition of block 4043830

Total: 270,739,726 milli GU

  Carbonated_map           76,800
  Apply                20,060,733
  Typecheck             3,448,908
  Storage             235,456,090
  Blake2B                 276,484
  Script_interpreter      929,352
  Cache                   272,064
  Script_decode         7,820,340
  Ticket                  222,336
  Script_ir_unparser    2,176,619

87% of the gas is consumed by storage_functor.ml

Analysis of the gas-vs-time

Block 4043830:

  • Gas estimation: 270,739,726 milli gas
  • Estimated gas for storage: 87% of the above
  • Gas for 6 seconds: 6,000,000,000 milli gas

If we explained the difference only by the storage cost underestimation, the storage cost would have to be at least x25 the current estimate.
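
A rough reconstruction of that arithmetic (an OCaml sketch; the 1 GU = 1 μs scale is taken from the 270 kGU ≈ 270 ms estimate above):

  let total_gas   = 270_739_726    (* milli GU, whole block *)
  let storage_gas = 235_456_090    (* milli GU, ~87% of the total *)
  let non_storage = total_gas - storage_gas
  let six_seconds = 6_000_000_000  (* milli GU, at 1 GU per microsecond *)

  (* factor by which the storage estimate must grow to account for 6 s *)
  let factor = float (six_seconds - non_storage) /. float storage_gas
  (* factor ≈ 25.3 *)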

Issues of the existing IO benches

  • The effect of the OS disk cache is ignored.
  • Test data are too small (≈ 2GB); all of it can be cached.
  • Inappropriate model for writing.
  • Slow to execute (>= 12h)

3 levels of caches

Irmin cache
Partially loaded Merkle tree in memory.
Purged by closing the index or calling flush.
OS disk cache
Linux uses its available memory for the disk cache.
Purging requires the root privilege.
Hardware cache
Disks also have a small memory cache inside.
We ignore hardware caches since they are relatively small.

The benches ignore the effect of the OS disk cache.

Too small context data

By default, benchmarks generate contexts of 2GB size.

The whole context can be cached even on an 8GB machine.

The benches do almost no disk access on reads,
since the entire context can be cached.

Inappropriate write model

The current write model ignores the path length
(= the levels of directories).

The writing cost must depend strongly on the path length.

  • Most of the storage cost comes from random disk accesses.
  • # of random disk accesses ∝ depth of the Merkle tree node
  • Depth of the Merkle tree node ∝ path length
  • Writing requires access to the directory of the destination,
    so its cost must be proportional to its path length.
  • The disk cache may help reduce the random disk accesses.

New model for reading/writing

cost(path_length, nbytes) =
    coeff_path_length * path_length
    + coeff_nbytes * nbytes
    + intercept

where

  • path_length of /hello is 0
  • path_length of /contracts/index is 1
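
A minimal OCaml sketch of this model (the function and parameter names are illustrative; the coefficients are the values to be inferred by the benchmarks):

  (* path_length counts the intermediate directories:
     "/hello" -> 0, "/contracts/index" -> 1 *)
  let path_length path =
    List.length (String.split_on_char '/' path) - 2

  let cost ~coeff_path_length ~coeff_nbytes ~intercept ~path ~nbytes =
    coeff_path_length *. float (path_length path)
    +. coeff_nbytes *. float nbytes
    +. intercept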

Slow to execute

Needs >= 12h.

Mainly due to generating 2GB of context data
just for one read/write.

Problems of the reference machine

  • RAID 5
  • 32GB memory

RAID

The reference machine uses RAID 5,
which is not typical for Tezos nodes.

  • Multiple SSD drives
  • Reads are performed in parallel.

Benchmarking on the reference machine may cause underestimation.

32GB memory

The ref. machine has 32GB memory.

25GB is left available when running a Tezos node.

  • Much bigger than the 6.0GB available on the minimal-spec machine (8GB of memory).

The size of the disk cache greatly affects the IO performance.

Need to emulate the memory environment of the minimal spec machine.

The worst scenario is unrealistic

If nothing is disk-cached:

  • 1 read takes 0.1 seconds = 100 kGU.

It must be assumed that the disk cache is filled with recent context accesses.

Need to write new IO benches

The existing benches have issues.

We wrote new benches.

New IO benchmarks

io/READ and io/WRITE:

  • Restrict the available memory for cache
  • Use the mainnet context
  • Prepare a disk cache filled with random context accesses
  • Randomly read/write context files and measure their times
  • Execute without RAID

Restrict the available memory

Keep the kernel’s available memory around 6.0GiB
by allocating dummy memory blocks.

This requires /proc/meminfo, so it is Linux-specific.
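
A minimal sketch of the idea, not the actual benchmark code (block size and names are illustrative):

  (* Read MemAvailable (in kB) from /proc/meminfo.  Linux specific. *)
  let mem_available_kb () =
    let ic = open_in "/proc/meminfo" in
    let rec loop () =
      match input_line ic with
      | exception End_of_file ->
          close_in ic; failwith "MemAvailable not found"
      | line ->
          (match Scanf.sscanf line "MemAvailable: %d kB" (fun kb -> kb) with
           | kb -> close_in ic; kb
           | exception _ -> loop ())
    in
    loop ()

  (* Allocate (and touch) dummy 256MB blocks until the kernel's available
     memory drops to the target, e.g. ~6.0 GiB for an 8GB machine. *)
  let ballast = ref []

  let restrict_available_memory ~target_kb =
    while mem_available_kb () > target_kb do
      ballast := Bytes.make (256 * 1024 * 1024) '\000' :: !ballast
    done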

Use the mainnet context

Not a fresh snapshot import:
it is too dense and causes underestimation.

Need to grow the context by running the node over it, until the size of the context reaches 40GB.

  • 40GB is today’s maximum size of the context
  • The context GC keeps the context around this size.

Prepare the disk cache

The disk cache of a Tezos node machine is filled with the context data of recent commits.

To emulate this state of the disk cache, we initialize it by accessing the context data.

  • Reset the disk cache
    • This requires the root privilege
    • Linux specific (/proc/)
  • Randomly read context files until the disk cache is filled.
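
A sketch of the reset step (dropping the Linux page cache via /proc/sys/vm/drop_caches is a standard root-only interface; the warm-up reads are elided since they depend on the context layout):

  (* Drop the Linux page cache; requires root.
     Equivalent to: echo 3 > /proc/sys/vm/drop_caches *)
  let reset_disk_cache () =
    let oc = open_out "/proc/sys/vm/drop_caches" in
    output_string oc "3\n";
    close_out oc

  (* Then read random context files until the available memory stops
     shrinking, i.e. the disk cache is filled with context data. *)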

Benchmark loop

Read/write random files from/to the head commit
under the prepared disk cache, and measure their times.
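
The measurement itself is plain wall-clock timing around one context access, roughly as below (a sketch; read_random_file stands for whatever context access is being benchmarked):

  (* Time a single read or write with wall-clock time. *)
  let time f =
    let t0 = Unix.gettimeofday () in
    let result = f () in
    (result, Unix.gettimeofday () -. t0)

  (* let (_, seconds) = time (fun () -> read_random_file context path) *)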

Read

Time of reading 1 random file

  • Call flush to reset the Irmin cache and avoid memory exhaustion.

Write

Time of writing 1 random file of random size + commit:

  • Overestimation: 1 commit for each write.
    • 1 commit per 10 writes reduces the gas 2/3
  • Keep using the same context for performance.

Benchmark time

32 minutes for io/READ and io/WRITE

Some preparation is required though:

  • Mainnet snapshot import
  • Run the node over it for a while

Result emulating 8GB machine

The results have high variance, but roughly:

Read: x1.3 of the current cost

  • coeff for the path length: 57_000
  • coeff for the file bytes: 1.5
  • intercept: 0

Write: x24 of the current cost

  • coeff for the path length: 930_000
  • coeff for the file bytes: 1
  • intercept: 0
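
As a usage example of the cost model sketched earlier, plugging these write coefficients in for a hypothetical 100-byte file at path length 5 (assuming the coefficients are expressed in milli GU, consistently with the other gas figures here):

  let write_cost_8gb =
    cost ~coeff_path_length:930_000. ~coeff_nbytes:1. ~intercept:0.
         ~path:"/a/b/c/d/e/f" ~nbytes:100
  (* = 4_650_100 milli GU, i.e. about 4.65 ms at 1 GU per microsecond *)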

Evaluation of the result

Do the suggested parameters match the actual block application time?

So far we have only 1 sample, block 4043830.

Storage accesses of block 4043830

Reads

  • 624 times
  • 3656 total path length
  • 183133 bytes

Writes

  • 428 times
  • 2376 total path length
  • 18456 bytes

Storage access gas of block 4043830

Reads

  • Nairobi: 125K gas
  • Newly inferred: 173K gas, about x1.4

Writes

  • Nairobi: 85K gas
  • Newly inferred: 1,812K gas, about x21.1

Newly inferred - Nairobi = 1,775K gas, i.e. 1.775 seconds

This does not yet explain the 6-second application time…

Disk cache matters

More memory for the disk cache improves the IO performance.

Result emulating 16GB machine

14GB available memory for disk cache

Read: x1 of the current cost

  • coeff for the path length: 41_000
  • coeff for the file bytes: 2
  • intercept: 0

Write: x15 of the current cost

  • coeff for the path length: 600_000
  • coeff for the file bytes: 1.5
  • intercept: 0

Result emulating 32GB machine

30GB available memory for disk cache

Read: x1 of the current cost

  • coeff for the path length: 40_000
  • coeff for the file bytes: 2
  • intercept: 0

Write: x13 of the current cost

  • coeff for the path length: 510_000
  • coeff for the file bytes: 1.8
  • intercept: 0

Context GC

Irmin’s context GC likely impacts the IO performance negatively, but we have no idea how serious it is.

The IO performance may be affected by the size of the latest layer:

  • Better performance after a new layer is created?
  • Worst performance just before a GC, since the latest layer reaches its maximum size?

We are afraid this is out of scope for our Q3 work.

Conclusions so far

The gas costs of storage (context) accesses are seriously underestimated.

  • Lots of variables make benchmarking hard
    • Available memory for disk cache
    • SSD performance
    • Context GC timing
  • New IO benches suggest x1.3 for reading and x24 for writing.
  • Too early to update the gas parameters.

Q: How are the read and write times measured?

The benchmark is executed outside the protocol.

It measures the time to read or write a random file by Tezos_protocol_environment.Environment_context.

Q: How does the size of the context affect the IO performance?

The benchmarks are done against a mainnet context of 40GB size, where several GCs have already been performed.

The IO performance is much better with a 7.1GB context just imported from a mainnet snapshot.

  • Read coeff: 58_000 (almost the same)
  • Write coeff: 580_000 (62%)

Q: How to visualize the benchmark and inference results?

The benchmark creates gnuplot files for visualization:

  • READ_io_validaiton.plot
  • WRITE_io_validaiton.plot