|Authors||Sina Mahmoodi (@s1na), Alex Beregszaszi (@axic)|
|Requires||EIP-161, EIP-170, EIP-2584|
Table of Contents
- Backwards Compatibility
- Test Cases
- Security Considerations
Code merkleization, along with binarification of the trie and gas cost bump of state accessing opcodes, are considered as the main levers for decreasing block witness sizes in stateless or partial-stateless Eth1x roadmaps. Here we specify a fixed-sized chunk approach to code merkleization and outline how the transition of existing contracts to this model would look like.
Bytecode is currently the second contributor to block witness size, after the proof hashes. Transitioning the trie from hexary to binary reduces the hash section of the witness by 3x, thereby making code the first contributor. By breaking contract code into chunks and committing to those chunks in a merkle tree, stateless clients would only need the chunks that were touched during a given transaction to execute it.
This specification assumes that EIP-2584 is deployed, and the merkleization rules and gas costs are proposed accordingly. What follows is structured to have two sections:
- How a given contract code is split into chunks and then merkleized
- How to merkleize all existing contract codes during a hardfork
CHUNK_SIZE: 32 (bytes)
KEY_LENGTH: 2 (bytes)
HF_BLOCK_NUMBER: to be defined
BE(x, N): casts
xto an unsigned integer of
Nbytes and returns its big-endian representation
For an account record
A with code
C, the field
A.codeHash is replaced with
C is empty. Otherwise it contains the root of
codeTrie, a binary trie with the following leaves:
In addition to the above,
codeTrie commits to a list of code
chunks = [(FIO_0, code_0), ..., (FIO_n, code_n)] which are derived from
C in a way that:
n < MAX_CHUNK_COUNT.
code_0 || ... || code_n == C.
length(code_i) == CHUNK_SIZEwhere
0 <= i < n.
length(code_n) <= CHUNK_SIZE.
FIO_iis the offset of the first instruction within the chunk. It should only be greater than zero if the last instruction in
code_i-1is a multi-byte instruction (like
PUSHN) crossing the chunk boundary. It is set to
CHUNK_SIZEin the case where all bytes of a chunk are data.
ith element of
chunks is stored in
BE(FIO_i, 1) || code_i, where
||stands for byte-wise concatenation
As of now there is a charge of 200 gas per byte of the code stored in state by contract creation operations, be it initiated via
CREATE2, or an external transaction. This per byte cost is to be increased from
TBD to account for the chunking and merkleization costs.
The transition process involves reading all contracts in the state and applying the above procedure to them. A benchmark showing how long this process will take is still pending, but intuitively it should take longer than the time between two blocks (in the order of hours). Hence we recommend clients to pre-process the changes before the EIP is activated.
Code has the nice property that it is (mostly) static. Therefore clients can maintain a mapping of
[accountAddress -> codeRoot] which stores the results for the contracts they have already merkleized. During this pre-computation phase, whenever a new contract is created its
codeRoot is computed, and added to the mapping. Whenever a contract self-destructs, its corresponding entry is removed.
HF_BLOCK_NUMBER when the EIP gets activated, before executing any transaction the client must update the account record for all contracts with non-empty code in the state to replace the
codeHash field with the pre-computed
codeRoot. EoA accounts will keep their
codeHash value as
codeRoot. Accounts with empty code will keep their
codeHash value as
The Ethereum mainnet state is encoded as of now in a hexary Merkle Patricia Tree. As part of the Eth1x roadmap, a transition to a binary trie has been investigated with the goal of reducing witness sizes. Because code chunks are also stored in the trie, this EIP would benefit from the witness size reduction offered by the binarification. Therefore we have decided to explicitly state EIP-2584 to be a requirement of this change. Note that if code merkleization is included in a hard-fork beforehand, then all code must be re-merkleized after the binary transition.
The current recommended chunk size of 32 bytes has been selected based on a few observations. Smaller chunks are more efficient (i.e. have higher chunk utilization), but incur a larger hash overhead (i.e. number of hashes as part of the proof) due to a higher trie depth. Larger chunks are less efficient, but incur less hash overhead. We plan to run a larger experiment comparing various chunk sizes to arrive at a final recommendation.
firstInstructionOffset fields allows safe jumpdest analysis when a client doesn’t have all the chunks, e.g. a stateless clients receiving block witnesses.
Note: there could be an edge case when computing FIO for the chunks where the data bytes at the end of a bytecode (last chunk) resemble a multi-byte instruction. This case can be safely ignored.
How merkleized code is stored in the client database affects the performance of code-accessing opcodes, i.e: CALL, CALLCODE, DELEGATECALL, STATICCALL, EXTCODESIZE, EXTCODEHASH, and EXTCODECOPY. Storing the code trie with all intermediate nodes in the database implies multiple look-ups to fetch the code of the callee, which is more than the current one look-up (excluding the trie traversal to get the account) required. Note CODECOPY and CODESIZE are not affected since the code for the current contract has already been loaded to memory.
The gas cost analysis in this section assumes a specific way of storing it. In this approach clients only merkleize code once during creation to compute
codeRoot, but then discard the chunks. They instead store the full bytecode as well as the metadata fields in the database. We believe per-chunk metering for calls would be more easily solvable by witness metering in the stateless model.
We have considered an alternative option to package chunks, where each chunk is prepended with its
chunkLength and would only contain complete opcodes (i.e. any multi-byte opcode not fitting the
CHUNK_SIZE would be deferred to the next chunk).
This approach has downsides compared to the one specified:
1) Requires a larger
CHUNK_SIZE – at least 33 bytes to accommodate the
2) It is more wasteful. For example,
DUP1 PUSH32 <32-byte payload> would be encoded as two chunks, the first chunk contains only
DUP1, and the second contains only the
PUSH32 instruction with its payload.
3) Calculating the number of chunks is not trivial and would have to be stored explicitly in the metadata.
Additionally we have reviewed many other options (basic block based, Solidity subroutines (requires determining the control flow), EIP-2315 subroutines). This EIP however only focuses on the chunk-based option.
To remain consistent with the binary transition proposal we avoid using RLP for serializing the leaf values. We have furthermore considered SSZ for both serializing data and merkleization and remain open to adopting it, but decided to use the binary trie format for simplicity.
The metadata fields
codeHash are added mostly to facilitate a cheaper implementation of
EXTCODEHASH in a stateless paradigm. The version field allows for differentiating between bytecode types (e.g. EVM1.5/EIP-615, EIP-2315, etc.) or code merkleization schemes (or merkleization settings, such as larger
CHUNK_SIZE) in future.
Instead of encoding
codeSize in the metadata, they could be made part of the account. In our opinion, the metadata is a more concise option, because EoAs do not need these fields, resulting in either additional logic (to omit those fields in the accounts) or calculation (to include them in merkleizing the account).
An alternative option to the version field would be to add an account-level field: either following EIP-1702, or by adding an
accountKind field (with potential options:
merkleized_eip2315_chunk64, etc.) as the first member of the account. One benefit this could provide is omitting
codeHash for EoAs.
As explained in the specification above, the keys in the code trie are indices of the
chunks array. Having a key length of 2 bytes means the trie can address 65536 - 3 (minus metadata fields) chunks, which corresponds to ~2Mb code size. That allows for roughly ~85x increase in the code size limit in future without requiring a change in merkleization.
This proposal changes the meaning of the fourth field (
codeHash) of the account. Prior to this change, that field represents the Keccak-256 hash of the bytecode, which is logically hash of an empty input for EoAs.
codeHash is replaced with
codeRoot, the root hash of the code trie, the value would be different for EoAs under the new rules: the root of the
codeTrie(metadata=[codeHash=keccak256(''), codeSize=0]). An alternative would be simply using the hash of an empty trie. Or to avoid introducing yet another constant (the result of the above), one could also consider using
codeRoot = 0 for EoAs.
However, we wanted to maintain compatibility with EIP-1052 and decided to keep matters simple by using the hash of empty input (
c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470) for EoAs.
From the perspective of contracts, the design aims to be transparent, with the exception of changes in gas costs.
Outside of the interface presented for contracts, this proposal introduces major changes to the way contract code is stored, and needs a substantial modification of the Ethereum state. Therefore it can only be introduced via a hard fork.
codeRoot for the following cases:
code='PUSH1(0) DUP1 REVERT'
code='PUSH32(-1)'(data passing through a chunk boundary)
Copyright and related rights waived via CC0.
Please cite this document as:
Sina Mahmoodi (@s1na), Alex Beregszaszi (@axic), "EIP-2926: Chunk-Based Code Merkleization [DRAFT]," Ethereum Improvement Proposals, no. 2926, August 2020. [Online serial]. Available: https://eips.ethereum.org/EIPS/eip-2926.