Introduce a new opcode, CLZ(x), which pops x from the stack and pushes the number of leading zero bits in x to the stack. If x is zero, pushes 256.
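For example, CLZ(1) pushes 255, CLZ(2**255) pushes 0, and CLZ(0) pushes 256.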
Motivation
Count leading zeros (CLZ) is a native opcode in many processor architectures (even in RISC architectures like ARM).
It is a basic building block used in math operations, byte operations, compression algorithms, and data structures, for example:
lnWad
powWad
lambertW0Wad
sqrt
cbrt
byte string comparisons
generalized calldata compression/decompression
bitmaps (for finding the next/previous set/unset bit)
post-quantum signature schemes
Adding a `clz` opcode will also lead to cheaper ZK proving costs. The fastest known Solidity implementation uses several dynamic bitwise right shifts (`shr`), which are very expensive to prove. In SP1 rv32im, a 32-bit right shift costs 1.6x more to prove than a 32-bit `mul`.
Specification
A new opcode is introduced: CLZ (0x1e).
Pops 1 value from the stack.
Pushes a value to the stack, according to the following code:
```solidity
/// @dev Count leading zeros.
/// Returns the number of zeros preceding the most significant one bit.
/// If `x` is zero, returns 256.
/// This is the fastest known `clz` implementation in Solidity and uses about 184 gas.
function clz(uint256 x) internal pure returns (uint256 r) {
    /// @solidity memory-safe-assembly
    assembly {
        r := shl(7, lt(0xffffffffffffffffffffffffffffffff, x))
        r := or(r, shl(6, lt(0xffffffffffffffff, shr(r, x))))
        r := or(r, shl(5, lt(0xffffffff, shr(r, x))))
        r := or(r, shl(4, lt(0xffff, shr(r, x))))
        r := or(r, shl(3, lt(0xff, shr(r, x))))
        // forgefmt: disable-next-item
        r := add(xor(r, byte(and(0x1f, shr(shr(r, x), 0x8421084210842108cc6318c6db6d54be)),
            0xf8f9f9faf9fdfafbf9fdfcfdfafbfcfef9fafdfafcfcfbfefafafcfbffffffff)), iszero(x))
    }
}
```
Or in Python,
```python
def clz(x):
    """Returns the number of zeros preceding the most significant one bit."""
    if x < 0:
        raise ValueError("clz is undefined for negative numbers")
    if x > 2**256 - 1:
        raise ValueError("clz is undefined for numbers larger than 2**256 - 1")
    if x == 0:
        return 256
    # Convert to binary string and remove any '0b' prefix.
    bin_str = bin(x).replace('0b', '')
    return 256 - len(bin_str)
```
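For example, a few sanity checks against the Python reference above illustrate the edge cases:

```python
assert clz(0) == 256          # zero input: all 256 bits are "leading zeros"
assert clz(1) == 255
assert clz(0xff) == 248
assert clz(2**255) == 0       # top bit set
assert clz(2**256 - 1) == 0
```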
Or in C++,
```cpp
inline uint32_t clz(uint32_t x) {
    uint32_t r = 0;
    if (!(x & 0xFFFF0000)) { r += 16; x <<= 16; }
    if (!(x & 0xFF000000)) { r += 8; x <<= 8; }
    if (!(x & 0xF0000000)) { r += 4; x <<= 4; }
    if (!(x & 0xC0000000)) { r += 2; x <<= 2; }
    if (!(x & 0x80000000)) { r += 1; }
    return r;
}

// `x` is a 256-bit number represented with 8 uint32 limbs, where `x[7]` is the
// most significant limb.
// This implementation is optimized for SP1 proving via rv32im.
// For regular compute, it performs similarly to `add`.
inline uint32_t clz(uint32_t x[8]) {
    if (x[7] != 0) return clz(x[7]);
    if (x[6] != 0) return 32 + clz(x[6]);
    if (x[5] != 0) return 64 + clz(x[5]);
    if (x[4] != 0) return 96 + clz(x[4]);
    if (x[3] != 0) return 128 + clz(x[3]);
    if (x[2] != 0) return 160 + clz(x[2]);
    if (x[1] != 0) return 192 + clz(x[1]);
    if (x[0] != 0) return 224 + clz(x[0]);
    return 256;
}
```
The gas cost of the opcode is 3, the same as `add`.
Rationale
The special 0 case
For non-zero inputs the result lies in the range [0, 255], so 256 is the smallest value that unambiguously signals a zero input. Returning a small number allows the result to be compared with minimal additional bytecode.
For byte scanning operations, one can get the number of bytes to be skipped for a zero word by simply computing `256 >> 3`, which gives 32.
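As an illustrative sketch (built on the Python reference `clz` above; the helper name is hypothetical), the byte offset of the first non-zero byte in a 32-byte word follows directly from shifting the result:

```python
def first_nonzero_byte_offset(word):
    # Byte offset of the first non-zero byte in a 256-bit word;
    # returns 32 for a zero word, i.e. all 32 bytes can be skipped.
    return clz(word) >> 3

assert first_nonzero_byte_offset(0) == 32        # zero word: skip the whole word
assert first_nonzero_byte_offset(1) == 31        # only the last byte is non-zero
assert first_nonzero_byte_offset(1 << 248) == 0  # the first byte is non-zero
```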
Opcode namespace conservation
Finding the least significant set bit can be easily implemented with CLZ by isolating the lowest set bit via `x & -x`, as the sketch below illustrates, so no separate opcode is needed for it.
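A minimal sketch of that claim, using the Python reference `clz` above (`ctz` is a hypothetical helper, not part of this proposal):

```python
def ctz(x):
    # Count trailing zeros: position of the least significant set bit.
    if x == 0:
        return 256
    lowest_bit = x & -x             # isolate the lowest set bit, e.g. 0b10100 -> 0b00100
    # On the EVM, -x is two's-complement negation mod 2**256, which isolates the same bit.
    return 255 - clz(lowest_bit)

assert ctz(1) == 0
assert ctz(0b10100) == 2
assert ctz(1 << 255) == 255
```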
Gas cost
We have benchmarked the `clz` implementation against the `add` implementation in the intx library. `clz` uses approximately the same number of compute cycles as `add`.
The SP1 rv32im optimized variant uses fewer compute cycles than `add` in both the average and worst cases.
In SP1 rv32im, a 256-bit `clz` is cheaper to prove than `add`.
`clz` is a stateless opcode with a low, constant worst-case cost in memory usage, compute, and proving. It is therefore safe from being exploited for denial-of-service attacks.