Summary
- The current storage costs are seriously underestimated.
- The existing IO benchmarks are not suitable for updating the cost parameters.
- We wrote new benchmarks and obtained preliminary results.
Blocks with overtime
- 270 kGU: estimated time to apply <= 270 ms.
- 800 ms to apply on one machine.
- 3s in another.
- 6s in yet another.
Some costs must be underestimated, most likely the storage costs.
Gas composition of block 4043830
Total: 270,739,726 milli GU
| Component          | milli GU    |
|--------------------|-------------|
| Carbonated_map     | 76,800      |
| Apply              | 20,060,733  |
| Typecheck          | 3,448,908   |
| Storage            | 235,456,090 |
| Blake2B            | 276,484     |
| Script_interpreter | 929,352     |
| Cache              | 272,064     |
| Script_decode      | 7,820,340   |
| Ticket             | 222,336     |
| Script_ir_unparser | 2,176,619   |
87% of the gas is consumed by storage_functor.ml
Analysis of the gas-vs-time
Block 4043830:
- Gas estimation: 270,739,726 milli gas
- Estimated gas for storage: 87% of the above
- Gas for 6 seconds: 6,000,000,000 milli gas
If the difference were explained only by storage cost underestimation, the storage costs would have to be at least x25 the current estimate.
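As a quick sanity check of the x25 figure, here is a minimal OCaml sketch (assuming, as above, that 1 kGU corresponds to roughly 1 ms, i.e. 6 s ≈ 6,000,000,000 milli GU):

```ocaml
(* Back-of-the-envelope check of the "at least x25" claim. *)
let total_gas = 270_739_726.           (* estimated milli GU for block 4043830 *)
let storage_gas = 0.87 *. total_gas    (* ~235M milli GU attributed to storage *)
let other_gas = total_gas -. storage_gas
let observed_gas = 6_000_000_000.      (* 6 s of application time in milli GU *)

(* Factor by which the storage estimate must grow to account for 6 s. *)
let required_factor = (observed_gas -. other_gas) /. storage_gas
(* required_factor is about 25.3 *)
```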
Issues of the existing IO benches
- The effect of the OS disk cache is ignored.
- Test data are too small (≒ 2GB). All data can be cached.
- Inappropriate model for writing.
- Slow to execute (>= 12h)
3 levels of caches
- Irmin cache
  - Partially loaded Merkle tree in memory.
  - Purged by closing the index or calling flush.
- OS disk cache
  - Linux uses its available memory as a disk cache.
  - Purging it requires root privileges.
- Hardware cache
  - Disks also have a small internal memory cache.
  - We ignore it since it is relatively small.
The benches ignore the effect of the OS disk cache.
Too small context data
By default, the benchmarks generate a context of about 2GB.
It can be cached entirely even on 8GB machines.
The benches therefore do almost no disk access on reads, since the whole context fits in the cache.
Inappropriate write model
The current write model ignores the path length (= the number of directory levels).
The writing cost must depend heavily on the path length:
- Most of the storage cost comes from random disk accesses.
- # of random disk accesses ∝ Depth of the Merkle tree node
- Depth of the Merkle tree node ∝ Path length
- Writing requires accessing the directory of the destination, so its cost must be proportional to the path length.
- The disk cache may help reduce the number of random disk accesses.
New model for reading/writing
cost(path_length, nbytes) =
    coeff_path_length * path_length
  + coeff_nbytes * nbytes
  + intercept

where the path_length of /hello is 0 and the path_length of /contracts/index is 1.
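A minimal OCaml sketch of this affine model; the function and argument names are illustrative, not the actual benchmark code:

```ocaml
(* Affine IO cost model: the cost grows with the path length
   (number of directory levels) and with the number of bytes. *)
let cost ~coeff_path_length ~coeff_nbytes ~intercept ~path_length ~nbytes =
  coeff_path_length *. float_of_int path_length
  +. coeff_nbytes *. float_of_int nbytes
  +. intercept

(* With the convention above, /hello has path_length 0
   and /contracts/index has path_length 1. *)
```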
Slow to execute
The existing benches need >= 12h to run.
This is mainly due to generating 2GB of context data just for one read/write measurement.
Problems of the reference machine
- RAID 5
- 32GB memory
RAID
The reference machine uses RAID-5, which is not typical for Tezos nodes:
- Multiple SSD drives
- Reads happen in parallel
Benchmarking on the reference machine may therefore cause underestimation.
32GB memory
The reference machine has 32GB of memory, of which about 25GB remains available when running a Tezos node.
- This is much bigger than the 6.0GB available on the minimal-spec machine (8GB of memory).
The size of the disk cache strongly affects the IO performance.
We need to emulate the memory environment of the minimal-spec machine.
The worst scenario is unrealistic
If nothing is disk-cached:
- 1 read takes 0.1 seconds = 100 kGU.
We must instead assume that the disk cache is filled by recent context accesses.
Need to write new IO benches
The existing benches have the issues listed above, so we wrote new ones.
New IO benchmarks
io/READ and io/WRITE:
- Restrict the available memory for cache
- Use the mainnet context
- Prepare a disk cache filled with random context accesses
- Randomly read/write context files and measure their times
- Execute without RAID
Restrict the available memory
Keep the kernel’s available memory around 6.0GiB by allocating dummy memory blocks.
This requires /proc/meminfo, so it is Linux specific.
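A minimal sketch of how this restriction could be implemented, assuming we poll MemAvailable from /proc/meminfo and hold ballast allocations; the real benchmark code may differ:

```ocaml
(* Keep the kernel's MemAvailable close to a 6.0 GiB target by holding
   dummy allocations. Linux specific: parses /proc/meminfo. *)
let mem_available_kib () =
  let ic = open_in "/proc/meminfo" in
  let rec find () =
    match input_line ic with
    | exception End_of_file ->
        close_in ic;
        failwith "MemAvailable not found"
    | line when String.length line > 13 && String.sub line 0 13 = "MemAvailable:" ->
        close_in ic;
        Scanf.sscanf line "MemAvailable: %d kB" (fun kib -> kib)
    | _ -> find ()
  in
  find ()

let target_kib = 6 * 1024 * 1024        (* 6.0 GiB expressed in KiB *)

(* Keep the blocks reachable so the allocations stay resident. *)
let ballast : Bytes.t list ref = ref []

let restrict_available_memory () =
  while mem_available_kib () > target_kib do
    ballast := Bytes.make (256 * 1024 * 1024) 'x' :: !ballast
  done
```

Calling restrict_available_memory () before the benchmark leaves roughly the same amount of memory for the page cache as an 8GB machine running a node.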
Use the mainnet context
We do not use a fresh snapshot import: it is too dense and causes underestimation.
We need to grow the context by running the node over it until its size reaches 40GB.
- 40GB is today’s maximum size of the context.
- The context GC keeps it around this size.
Prepare the disk cache
The disk cache of a Tezos node machine is filled with the context data of the recent commits.
To emulate this state, we initialize the disk cache by accessing the context data (see the sketch after this list):
- Reset the disk cache
  - This requires root privileges
  - Linux specific (/proc/)
- Randomly read context files until the disk cache is filled
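A rough sketch of this preparation, assuming the standard Linux drop_caches interface; read_random_context_file is a hypothetical placeholder for a random read through the context API:

```ocaml
(* Reset the OS disk cache (requires root), then warm it up again with
   random context reads so it looks like a normally running node. *)
let reset_disk_cache () =
  ignore (Sys.command "sync");                     (* flush dirty pages first *)
  let oc = open_out "/proc/sys/vm/drop_caches" in  (* Linux specific *)
  output_string oc "3\n";                          (* drop page cache, dentries, inodes *)
  close_out oc

(* [read_random_context_file] stands in for a random read of a context file. *)
let warm_up_disk_cache ~read_random_context_file ~iterations =
  for _ = 1 to iterations do
    ignore (read_random_context_file ())
  done
```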
Benchmark loop
Read/write random files from/to the head commit under the prepared disk cache, and measure their times (see the sketch below).
Read
Measure the time of reading 1 random file.
Call flush afterwards to reset the Irmin cache and avoid memory exhaustion.
Write
Measure the time of writing 1 random file of random size, plus a commit:
- Overestimating: 1 commit for each write; 1 commit per 10 writes reduces the gas by about 2/3.
- Keep using the same context for performance.
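A minimal sketch of the measurement loops; read_random_file, write_random_file, commit and flush_irmin_cache are hypothetical placeholders for the real context operations used by the benchmark:

```ocaml
(* Wall-clock timing of a single operation. *)
let time f =
  let t0 = Unix.gettimeofday () in
  let result = f () in
  (result, Unix.gettimeofday () -. t0)

(* Read benchmark: time one random read, then flush the Irmin cache so that
   repeated reads do not exhaust memory. *)
let bench_read ~read_random_file ~flush_irmin_cache ~samples =
  List.init samples (fun _ ->
      let _, dt = time read_random_file in
      flush_irmin_cache ();
      dt)

(* Write benchmark: time one random write of random size followed by a commit
   (overestimating, since a real block commits once for many writes). *)
let bench_write ~write_random_file ~commit ~samples =
  List.init samples (fun _ ->
      let _, dt = time (fun () -> write_random_file (); commit ()) in
      dt)
```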
Benchmark time
32 minutes for io/READ and io/WRITE.
Some preparation is still required though:
- Mainnet snapshot import
- Run the node over it for a while
Result emulating 8GB machine
The results have high variance, but are roughly as follows (see the sketch below for how these coefficients slot into the model):
Read: x1.3 of the current cost
- coeff for the path length: 57_000
- coeff for the file bytes: 1.5
- intercept: 0
Write: x24 of the current cost
- coeff for the path length: 930_000
- coeff for the file bytes: 1
- intercept: 0
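For illustration only, here are the 8GB-emulation coefficients plugged into the model above; the units and the exact per-access accounting of path_length follow the benchmark's own conventions, so treat this purely as a sketch:

```ocaml
(* Inferred coefficients for the 8GB emulation inserted into the affine model.
   The numbers are rounded and have high variance. *)
let read_cost ~path_length ~nbytes =
  57_000. *. float_of_int path_length +. 1.5 *. float_of_int nbytes

let write_cost ~path_length ~nbytes =
  930_000. *. float_of_int path_length +. 1.0 *. float_of_int nbytes
```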
Evaluation of the result
Do the suggested parameters match the actual block application time?
So far we have only 1 sample, block 4043830.
Storage accesses of block 4043830
Reads
- 624 times
- 3656 total path length
- 183133 bytes
Writes
- 428 times
- 2376 total path length
- 18456 bytes
Storage access gas of block 4043830
Reads
- Nairobi: 125K gas
- Newly inferred: 173K gas, about x1.4
Writes
- Nairobi: 85K gas
- Newly inferred: 1,812K gas, about x21.1
Newly inferred - Nairobi = 1,775K gas in total, i.e. about 1.775 extra seconds.
This does not yet explain the 6 seconds application time…
Disk cache matters
More memory for the disk cache improves the IO performance.
Result emulating 16GB machine
14GB available memory for disk cache
Read: x1 of the current cost
- coeff for the path length: 41_000
- coeff for the file bytes: 2
- intercept: 0
Write: x15 of the current cost
- coeff for the path length: 600_000
- coeff for the file bytes: 1.5
- intercept: 0
Result emulating 32GB machine
30GB available memory for disk cache
Read: x1 of the current cost
- coeff for the path length: 40_000
- coeff for the file bytes: 2
- intercept: 0
Write: x13 of the current cost
- coeff for the path length: 510_000
- coeff for the file bytes: 1.8
- intercept: 0
Context GC
Irmin’s context GC should impact the IO performance negatively, but we have no idea how serious it is.
The IO performance may be affected by the size of the latest layer:
- Better performance just after a new layer is created?
- Worst performance just before a GC, when the latest layer reaches its maximum size?
We are afraid this is out of scope of our work in Q3.
Conclusions so far
The gas cost of the storage (context) accesses is seriously underestimated.
- Lots of variables make benchmarking hard
- Available memory for disk cache
- SSD performance
- Context GC timing
- New IO benches suggest x1.3 for reading and x24 for writing.
- Too early to update the gas parameters.
Q: How are the times for read and write measured?
The benchmark is executed outside the protocol.
It measures the time to read or write a random file via Tezos_protocol_environment.Environment_context.
Q: How does the size of the context affect the IO performance?
The benchmarks are done against a mainnet context of 40GB size, where several GCs have already been performed.
The IO performance is much better under a 7.1GB context just imported from a mainnet snapshot:
- Read coeff: 58_000 (almost the same)
- Write coeff: 580_000 (62% of the 930_000 inferred on the 40GB context)
Q: How to visualize the benchmark and inference results?
The benchmark creates gnuplot files for visualization:
- READ_io_validaiton.plot
- WRITE_io_validaiton.plot