EIP-2926: Chunk-Based Code Merkleization Source

AuthorSina Mahmoodi, Alex Beregszaszi
Discussions-Tohttps://ethereum-magicians.org/t/eip-2926-chunk-based-code-merkleization/4555
StatusDraft
TypeStandards Track
CategoryCore
Created2020-08-25
Requires 161, 170, 2584

Abstract

Code merkleization, along with binarification of the trie and gas cost bump of state accessing opcodes, are considered as the main levers for decreasing block witness sizes in stateless or partial-stateless Eth1x roadmaps. Here we specify a fixed-sized chunk approach to code merkleization, outline how the transition of existing contracts to this model would look like, and pose some questions to be considered.

Motivation

Bytecode is currently the second contributor to block witness size, after all the proof hashes. Transitioning the trie from hexary to binary reduce the hash section of the witness by 3x, thereby making code the first contributor. By breaking contract code into chunks and committing to those chunks in a merkle tree, stateless clients would only need the chunks that were touched during a given transaction to execute it.

Specification

This specification assumes that EIP-2584 is deployed, and the merkleization rules and gas costs are proposed accordingly. What follows is structured to have three sections:

  1. How a given contract code is split into chunks and then merkleized
  2. How to merkleize all existing contract codes during a hardfork
  3. What EVM changes are required to account for the new way of representing code

Constants

  • CHUNK_SIZE: 32 (bytes)
  • KEY_LENGTH: 4 (bytes)
  • METADATA_KEY: 0xffffffff
  • METADATA_VERSION: 0
  • EMPTY_CODE_ROOT: 0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470 (==keccak256(''))
  • HF_BLOCK_NUMBER: to be defined

Code merkleization

The procedure for merkleizing a contract code starts with splitting it into chunks. To do this assume a function that takes the bytecode as input and returns a list of chunks as output, where each chunk is a tuple (startOffset, firstInstructionOffset, codeChunk). Note that startOffset corresponds to the absolute program counter, and firstInstructionOffset is the offset within the chunk to the first instruction, as opposed to data. This function runs in two passes:

  • First Pass: Divide the bytecode evenly into CHUNK_SIZE parts. The last part is allowed to be shorter. For each chunk, create the tuple, leaving firstInstructionOffset as 0, and add the tuple to an array of chunks. Also set chunkCount to the total number of chunks.

  • Second Pass: Scan the bytecode for instructions with immediates or multi-byte instructions (currently only PUSHN, opcodes 0x60 .. 0x7f). Let PC be the current opcode position and N the number of immediate bytes following, where N > 0. Then:

    • If (PC + N) % CHUNK_SIZE > PC % CHUNK_SIZE (i.e. instruction does not cross chunk boundary), continue to next instruction.
    • Else, compute the index of the chunk after the current instruction: chunkIdx = Math.floor((PC + N + 1) / CHUNK_SIZE).
      • If chunkIdx + 1 >= chunkCount then break the loop. This is to handle the following case: in a malformed bytecode, or the “data section” at the end of a Solidity contract, there could be a byte resembeling an instruction whose immediates exceed bytecode length.
      • Else, assuming nextChunk = chunks[chunkIdx], assign (PC + N + 1) - nextChunk.startOffset to nextChunk.firstInstructionOffset.

The firstInstructionOffset fields allows safe jumpdest analysis when a client doesn’t have all the chunks, e.g. a stateless clients receiving block witnesses.

To build a Merkle Patricia Tree we iterate chunks and process each entry, C, as follows:

  • Let the MPT key k be C.startOffset in big-endian extended with zero padding on the left to reach KEY_LENGTH. Note that the code trie is not vulnerable to grinding attacks, and therefore the key is not hashed as in the account or storage trie.
  • Let the MPT value v be RLP([firstInstructionOffset, C.code]).
  • Insert (k, v) into the codeTrie.

We further insert a metadata leaf into codeTrie, with the key being METADATA_KEY and value being RLP([METADATA_VERSION, codeHash, codeLength]). This metadata is added mostly to facilitate a cheaper implementation of EXTCODESIZE and EXTCODEHASH. It includes a version to allow for differentiating between bytecode types (e.g. EVM1.5/EIP-615, EIP-2315, etc.) or code merkleization schemes (or merkleization settings, such as larger CHUNK_SIZE) in the future.

The root of this codeTrie is computed according to the EIP-2584 rules (similar to the account trie) and stored in the account record, replacing the current codeHash field (henceforth called codeRoot). For accounts with empty code (including Externally-owned-Accounts aka EoAs) codeRoot will have the value EMPTY_CODE_ROOT (as opposed to empty codeTrie root hash) to maintain backwards compatibility.

Updating existing code (transition process)

The transition process involves reading all contracts in the state and apply the above procedure to them. A benchmark showing how long this process will take is still pending, but intuitively it should take longer than the time between two blocks (in the order of hours). Hence we recommend clients to pre-process the changes before the EIP is activated.

Code has the nice property that it is (mostly) static. Therefore clients can maintain a mapping of [accountAddress -> codeRoot] which stores the results for the contracts they have already merkleized. During this pre-computation phase, whenever a new contract is created its codeRoot is computed, and added to the mapping. Whenever a contract self-destructs, its corresponding entry is removed.

At block HF_BLOCK_NUMBER when the EIP gets activated, before executing any transaction the client must update the account record for all contracts with non-empty code in the state to replace the codeHash field with the pre-computed codeRoot. EoA accounts will keep their codeHash value as codeRoot. Accounts with empty code will keep their codeHash value as codeRoot.

Opcode and gas cost changes

It is not directly a part of consensus how clients store the merkleized code. However the gas cost analysis in this section assumes a specific way of storing it. In this approach clients only merkleize code once during creation to compute codeRoot, but then discard the chunks. They instead store the full bytecode as well as its hash and size in the database. Otherwise the gas scheme would have to be modified to charge per chunk accessed or conservatively charge for traversing the whole code trie. We believe per-chunk metering for calls would be more easily solvable by witness metering in the stateless model.

Contract creation cost

Contract creation –– whether it is initiated via CREATE, CREATE2, or an external transaction –– will execute an initcode, and the bytes returned by that initcode is what becomes the account code stored in the state. As of now there is a charge of 200 gas per byte of that code stored. We consider this cost to represent the long-term code storage cost, which could be reviewed under the “Stateless Ethereum” paradigm.

We introduce an additional charge to account for the chunking and merkleization cost: (ceil(len(code) / CHUNK_SIZE) + 1) * COST_PER_CHUNK, where COST_PER_CHUNK is TBD. We also add a new fixed charge for COST_OF_METADATA, which equals to the cost of rlpEncode([0, codeHash, codeSize]).

The exact value of COST_PER_CHUNK is to be determined, but we estimate it will be on the order of CHUNKING_COST + (2 * HASHING_COST) + (ceil(log2(NUM_CHUNKS)) * TREE_BUILDING_COST), where HASHING_COST is 36 (Gkeccak256 + Gkeccak256_word) and we have no estimate about CHUNKING_COST and TREE_BUILDING_COST. The hash cost could be reduced if the EIP-2666 proposal is accepted.

Preliminary cost comparison with CHUNKING_COST=60, TREE_BUILDING_COST=30 and COST_OF_METADATA=20 (note these values are not informed by benchmarks):

  Pre-EIP Post-EIP Increase
1 byte 200 353 76.50%
16000 bytes 3200000 3401422 6.29%
24576 bytes 4915200 5247428 6.75%

CALL, CALLCODE, DELEGATECALL, STATICCALL, EXTCODESIZE, EXTCODEHASH, and EXTCODECOPY

We expect no gas cost change. Clients have to perform the same lookup as before.

CODECOPY and CODESIZE

There should be no change to CODECOPY and CODESIZE, because the code is already in memory, and the loading cost is paid by the transaction/call which initiated it.

Other remarks

EIP-161 defines:

An account is considered empty when it has no code and zero nonce and zero balance.

We replace this statement with:

An account is considered empty when its codeRoot equals EMPTY_CODE_ROOT and has zero nonce and zero balance.

Rationale

Hexary vs binary trie

The Ethereum mainnet state is encoded as of now in a hexary Merkle Patricia Tree. As part of the Eth1x roadmap, a transition to a binary trie has been investigated with the goal of reducing witness sizes. Because code chunks are also stored in the trie, this EIP would benefit from the witness size reduction offered by the binarification. Therefore we have decided to explicitly state EIP-2584 to be a requirement of this change. Note that if code merkleization is included in a hard-fork beforehand, then all code must be re-merkleized after the binary transition.

Rationale behind chunk size

The current recommended chunk size of 32 bytes has been selected based on a few observations. Smaller chunks are more efficient (i.e. have higher chunk utilization), but incur a larger hash overhead (i.e. number of hashes as part of the proof) due to a higher trie depth. Larger chunks are less efficient, but incur less hash overhead. We plan to run a larger experiment comparing various chunk sizes to arrive at a final recommendation.

Different chunking logic

We have considered an alternative option to package chunks, where each chunk is prepended with its chunkLength and would only contain complete opcodes (i.e. any multi-byte opcode not fitting the CHUNK_SIZE would be deferred to the next chunk).

This approach has downsides compared to the one specified: 1) Requires a larger CHUNK_SIZE – at least 33 bytes to accommodate the PUSH32 instruction. 2) It is more wasteful. For example, DUP1 PUSH32 <32-byte payload> would be encoded as two chunks, the first chunk contains only DUP1, and the second contains only the PUSH32 instruction with its payload. 3) Calculating the number of chunks is not trivial and would have to be stored explicitly in the metadata.

Additionally we have reviewed many other options (basic block based, Solidity subroutines (requires determining the control flow), EIP-2315 subroutines). This EIP however only focuses on the chunk-based option.

The use of RLP

RLP was decided for, because it is used extensively in the data structures of Ethereum. Should there be a push for a different serialisation format, this proposal could be easily amended to use that. It must be noted that some have proposed (on the Stateless Ethereum Call #8) to consider a different format at the same time binarification is applied to the chain.

Alternative to the metadata

Instead of encoding codeHash and codeSize in the metadata, they could be made part of the account. In our opinion, the metadata is a more concise option, because EoAs do not need these fields, resulting in either additional logic (to omit those fields in the accounts) or calculation (to include them in merkleizing the account).

Versioning

This proposal includes a metadataVersion field in the metadata, which can be used to distinguish both various merklization rules, as well as bytecodes.

An alternative option would be to move this versioning to the account level: either following EIP-1702, or by adding an accountKind field (with potential options: eoa, merkleized_evm_chunk32, merkleized_eip2315_chunk64, etc.) as the first member of the account. One benefit this could provide is omitting codeHash for EoAs.

The keys in the code trie (and KEY_LENGTH)

As explained in the specification above, the keys in the code trie are bytecode offsets. Since valid bytecode offsets start at 0, there is no natural way to represent a “special key” needed for the metadata.

Even though the current bytecode limit is 0x6000 (as per EIP-170), we have decided to have a hard cap of 2^32-2 (influenced by EIP-1985), which makes the proposal compatible with proposals increasing, or “lifting” the code size limit.

Therefore it is safe to use 0xffffffff as the key for the metadata.

Index-based alternative for the keys in the code trie

Since the chunks are fixed-size, just by knowing the index of a chunk we can compute its startOffset = chunk_idx * CHUNK_SIZE. Therefore it’s not a necessity to encode this information in the key. We are reviewing this alternative both for simplicity and potentially lower tree building costs. By using indices as keys the structure of the trie is known only given the number of chunks, which means less analysis. Additionally, the metadata could be included at index 0, removing the need for a large, fixed-size and zero-padded, key.

Alternate values of codeRoot for EoAs

This proposal changes the meaning of the fourth field (codeHash) of the account. Prior to this change, that field represents the Keccak-256 hash of the bytecode, which is logically hash of an empty input for EoAs.

Since codeHash is replaced with codeRoot, the root hash of the code trie, the value would be different for EoAs under the new rules: the root of the codeTrie(metadata=[codeHash=keccak256(''), codeSize=0]). An alternative would be simply using the hash of an empty trie. Or to avoid introducing yet another constant (the result of the above), one could also consider using codeRoot = 0 for EoAs.

However, we wanted to maintain compatibility with EIP-1052 and decided to keep matters simple by using the hash of empty input (c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470) for EoAs.

Backwards Compatibility

From the perspective of contracts, the design aims to be transparent, with the exception of changes in gas costs.

Outside of the interface presented for contracts, this proposal introduces major changes to the way contract code is stored, and needs a substantial modification of the Ethereum state. Therefore it can only be introduced via a hard fork.

Test Cases

TBD

Show the codeRoot for the following cases:

  1. code=''
  2. code='PUSH1(0) DUP1 REVERT'
  3. code='PUSH32(-1)' (data passing through a chunk boundary)

Implementation

The implementation of the chunking and merkleization logic in Typescript can be found here, and in Python here. Please note neither of these examples currently use a binary tree.

Security Considerations

TBA

Copyright and related rights waived via CC0.

Citation

Please cite this document as:

Sina Mahmoodi, Alex Beregszaszi, "EIP-2926: Chunk-Based Code Merkleization [DRAFT]," Ethereum Improvement Proposals, no. 2926, August 2020. [Online serial]. Available: https://eips.ethereum.org/EIPS/eip-2926.