This document describes the performance engineering behind GEVM. The goal is to execute EVM bytecode at near-native speed in Go, closing the gap with Rust (revm) while maintaining correctness across 44,035 spec tests.
Result: ~2x faster than geth, within 10% of revm on real-world workloads (ERC-20, Snailtracer).
- Interpreter Dispatch
- Gas Accounting
- Opcode Inlining
- Stack
- Memory
- Bytecode & Jump Tables
- Object Pooling
- Arena Allocators
- Caches
- Pointer Parameters & Scratch Buffers
- Embedding & Heap Escape Prevention
- Precompile Set Caching
- Zero-Overhead Tracing
- Code Generation
Files: vm/gen/main.go, vm/table_gen.go
Traditional EVM interpreters use an instruction table: a [256]func(interp, host) array indexed by opcode. Each opcode dispatch requires a pointer dereference and indirect call. GEVM replaces this with a direct switch op statement over all 256 opcodes.
Why it matters:
- Indirect calls defeat CPU branch prediction and prevent inlining
- A switch statement lets the compiler emit a jump table with direct branches
- Hot opcodes (ADD, MLOAD, JUMP, PUSH1) stay in the L1 instruction cache
- The Go compiler can inline small case bodies directly into the switch
The switch is generated by vm/gen/main.go which parses opcode handler functions from inst_*.go files and emits them as switch cases into table_gen.go.
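A minimal sketch of the difference (types and handler names are illustrative, not GEVM's actual code):

type interp struct {
    pc   int
    code []byte
}

func (in *interp) opAdd()          { in.pc++ } // stand-in handler
func (in *interp) opOther(op byte) { in.pc++ }

// Table dispatch: every opcode pays an indirect call through the array,
// which blocks inlining and strains the branch-target predictor.
var table = [256]func(*interp){
    0x01: (*interp).opAdd, // remaining entries elided
}

func runTable(in *interp) {
    for in.pc < len(in.code) {
        table[in.code[in.pc]](in)
    }
}

// Switch dispatch: the compiler emits a jump table with direct branches,
// and small case bodies can be inlined straight into the loop.
func runSwitch(in *interp) {
    for in.pc < len(in.code) {
        switch op := in.code[in.pc]; op {
        case 0x00: // STOP
            return
        case 0x01: // ADD
            in.opAdd()
        default:
            in.opOther(op)
        }
    }
}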
Files: vm/gen/main.go (lines 29-34, 265-289), vm/table_gen.go
Most EVMs check and deduct gas on every opcode. This adds a branch and a memory write per instruction. GEVM uses a two-mode gas strategy:
Accumulate mode (static-gas opcodes like ADD, MUL, LT, AND):
gasCounter += GasVeryLow // just add, no check
Flush mode (dynamic-gas opcodes like SLOAD, CALL, KECCAK256):
if gas.remaining < gasCounter + dynamicCost { halt OOG }
gas.remaining -= gasCounter + dynamicCost
gasCounter = 0
The gas counter is a local variable (register-allocated). It only flushes to the Gas struct at control flow boundaries (JUMP, CALL, STOP) or when dynamic gas is needed. This eliminates one branch and one memory write per static-gas opcode in the hot loop.
Stack underflow/overflow checks also trigger a flush before halting, so the gas state is always consistent on error.
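A condensed sketch of the two modes inside the generated loop (the opcode values are real; the types and cost helpers are illustrative):

import "errors"

type Gas struct{ remaining uint64 }

const (
    opSTOP     = 0x00
    opADD      = 0x01
    opSLOAD    = 0x54
    gasVeryLow = 3
)

var errOutOfGas = errors.New("out of gas")

func sloadCost() uint64 { return 2100 } // placeholder for the real warm/cold logic

func run(code []byte, gas *Gas) error {
    var gasCounter uint64 // lives in a register; no memory write per opcode
    for pc := 0; pc < len(code); pc++ {
        switch code[pc] {
        case opADD: // accumulate mode: just add, no check
            gasCounter += gasVeryLow
        case opSLOAD: // flush mode: check counter + dynamic cost together
            dyn := sloadCost()
            if gas.remaining < gasCounter+dyn {
                return errOutOfGas
            }
            gas.remaining -= gasCounter + dyn
            gasCounter = 0
        case opSTOP: // control-flow boundary: flush before leaving the loop
            if gas.remaining < gasCounter {
                return errOutOfGas
            }
            gas.remaining -= gasCounter
            return nil
        }
    }
    return nil
}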
Files: vm/gen/main.go (lines 318-341), vm/inst_*.go
The code generator recognizes opcodes marked inline: true and copies their function body directly into the switch case, rather than emitting a function call.
Inlined opcodes (~40):
- Arithmetic: ADD, MUL, SUB, DIV, SDIV, MOD, SMOD, ADDMOD, MULMOD, EXP, SIGNEXTEND
- Comparison: LT, GT, SLT, SGT, EQ, ISZERO
- Bitwise: AND, OR, XOR, NOT, BYTE, SHL, SHR, SAR, CLZ
- Memory: MLOAD, MSTORE
- Control: JUMP, JUMPI, POP
- Push: PUSH0-4, PUSH20, PUSH32
- Context: ADDRESS, CALLER, CALLVALUE, CALLDATALOAD, CALLDATASIZE, CODESIZE, RETURNDATASIZE, PC, MSIZE
Not inlined (function calls): KECCAK256, SLOAD, SSTORE, CALL, CREATE, BALANCE, EXTCODE* — these have complex logic or host interactions.
The generator also applies "local variable substitution": replacing interp.Bytecode. with a local bc variable to avoid repeated pointer dereferences in inlined bodies.
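A minimal sketch of the effect (types hypothetical):

type bytecode struct{ code []byte }
type interpState struct{ Bytecode *bytecode }

// Without substitution: each access reloads interp.Bytecode from memory.
func readTwo(in *interpState, pc int) (byte, byte) {
    return in.Bytecode.code[pc+1], in.Bytecode.code[pc+2]
}

// With substitution: the generator textually hoists the pointer into a
// local, so the emitted case body indexes bc.code directly.
func readTwoSubst(in *interpState, pc int) (byte, byte) {
    bc := in.Bytecode
    return bc.code[pc+1], bc.code[pc+2]
}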
Shaped stack operations further reduce boilerplate. The generator recognizes patterns (binary op, unary op, push, pop) and emits optimized stack manipulation:
// Binary op (ADD, MUL, etc.): pop 2, push 1
s.top--
types.AddTo(&s.data[s.top-1], &s.data[s.top], &s.data[s.top-1])

File: vm/stack.go
type Stack struct {
data [1024]types.Uint256 // 32KB fixed array
top int
}

The stack is a fixed-size array, not a slice. This means:
- No bounds checking on growth (capacity is compile-time constant)
- No slice header manipulation
- Predictable memory layout for CPU cache
- Embedded in Interpreter struct (single allocation)
Uint256 is [4]uint64 (32 bytes, value type). Arithmetic helpers take pointer parameters (types.AddTo(&dst, &a, &b)) to modify values in place without copies.
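A minimal sketch of the in-place pattern (the real types package implementation may differ; limb order is assumed little-endian):

import "math/bits"

type Uint256 [4]uint64 // assumption: least-significant limb first

// AddTo computes *dst = *a + *b (mod 2^256) in place: no operand is
// copied, and dst may alias a or b.
func AddTo(dst, a, b *Uint256) {
    var c uint64
    dst[0], c = bits.Add64(a[0], b[0], 0)
    dst[1], c = bits.Add64(a[1], b[1], c)
    dst[2], c = bits.Add64(a[2], b[2], c)
    dst[3], _ = bits.Add64(a[3], b[3], c)
}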
File: vm/memory.go
All call frames in a transaction share a single underlying []byte buffer. Each frame owns a region starting at its checkpoint offset:
type Memory struct {
buffer *[]byte // shared pointer across call stack
checkpoint int // start offset for this frame
}

When a CALL creates a child frame, it records the current buffer length as the child's checkpoint. The child appends to the same buffer. On return, the buffer is truncated back to the checkpoint. No copying occurs.
Child Memory structs (16 bytes of metadata) are pooled via sync.Pool. The buffer itself is never copied or reallocated unless it needs to grow.
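A sketch of the checkpoint discipline (method names illustrative; pooling elided):

// NewChild hands the shared buffer to a child frame; the child's region
// begins where the parent's data currently ends.
func (m *Memory) NewChild() Memory {
    return Memory{buffer: m.buffer, checkpoint: len(*m.buffer)}
}

// Release truncates the shared buffer back to this frame's checkpoint.
// Parent data below the checkpoint is untouched; nothing is copied.
func (m *Memory) Release() {
    *m.buffer = (*m.buffer)[:m.checkpoint]
}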
if targetLen <= cap(buf) {
buf = buf[:targetLen]
clear(buf[oldLen:]) // compiles to SIMD memclr on arm64
}

Go's clear() builtin compiles to runtime.memclrNoHeapPointers, which uses NEON SIMD on ARM64. This is 2-3x faster than a byte-by-byte loop.
MemoryGas tracks the current word count and cumulative expansion cost. On resize, only the delta is charged:
func (m *MemoryGas) RecordNewLen(newNum uint64) uint64 {
if newNum <= m.WordsNum { return 0 }
oldCost := m.ExpansionCost
m.ExpansionCost = memoryGas(newNum) // quadratic formula
return m.ExpansionCost - oldCost // delta only
}

File: vm/bytecode.go
Bytecode is padded with 33 zero bytes (STOP opcodes) at the end:
padded := make([]byte, originalLen + 33) // 1 opcode + 32 data bytes
copy(padded, code)

This allows PUSH32 to read 32 bytes past any valid PC without bounds checking. Every PUSH operand read is a simple slice operation with no if pc+n > len(code) guard.
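For illustration, an operand read then reduces to a plain slice expression (helper hypothetical):

// readPushOperand returns the n-byte immediate of a PUSH opcode. The
// 33-byte STOP pad guarantees pc+1+n <= len(code), so no explicit
// length guard or error path is needed.
func readPushOperand(code []byte, pc int, op byte) []byte {
    n := int(op) - 0x5f // PUSH1 (0x60) -> 1 ... PUSH32 (0x7f) -> 32
    return code[pc+1 : pc+1+n]
}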
JUMPDEST positions are stored as a bitmap (1 bit per byte of code). Analysis scans the bytecode once, skipping PUSH operands. An 8-byte fast path skips zero regions:
if i+8 <= len(code) && binary.LittleEndian.Uint64(code[i:i+8]) == 0 {
i += 8 // skip 8 zero bytes at once
continue
}

When a Bytecode object is recycled from the pool with the same code hash, the jump table analysis is skipped entirely:
func (b *Bytecode) ResetWithHash(code []byte, hash B256) {
if b.hash == hash && b.originalLen == len(code) {
copy(b.code, code) // reuse jump table
return
}
b.Reset(code) // full re-analysis
}

The Evm struct maintains a JumpTableCache map[B256][]byte that persists across transactions. When the same contract is called multiple times in a block, the jump table is looked up by code hash and injected directly, skipping analysis entirely.
The jumpTableExternal flag prevents the pooled Bytecode from reusing the cached slice's capacity on recycle.
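For illustration, the JUMPDEST validation that JUMP/JUMPI performs against this bitmap is a shift and a mask (helper hypothetical):

// isJumpdest tests one bit per code byte; bit i is set iff code[i] is a
// JUMPDEST outside any PUSH operand.
func isJumpdest(bitmap []byte, dest uint64) bool {
    idx := dest >> 3
    return idx < uint64(len(bitmap)) && bitmap[idx]&(1<<(dest&7)) != 0
}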
Files: vm/pool.go, state/journal.go, host/evm.go
GEVM pools every major object via sync.Pool:
| Object | Size | Pool |
|---|---|---|
| Evm | ~33KB (includes rootStack) | evmPool |
| Journal | ~1KB + arenas | journalPool |
| Interpreter | ~600B | interpreterPool |
| Stack | ~32KB | embedded in Interpreter |
| Memory | ~32B metadata + 4KB buffer | memoryPool |
| Child Memory | ~32B metadata | childMemPool |
| Bytecode | ~200B + code slice | bytecodePool |
| Return buffers | 4KB default | returnBufPool |
Release functions only nil out pointer/slice fields (for GC). Large value fields like ActionData (~300B) and stack data are left dirty because Clear() will overwrite them on next acquire.
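A sketch of that release discipline (field names illustrative):

var interpreterPool = sync.Pool{New: func() any { return new(Interpreter) }} // import "sync"

func ReleaseInterpreter(in *Interpreter) {
    // Nil out pointers and slices so the GC can reclaim what they reference.
    in.Bytecode = nil   // illustrative field
    in.ReturnData = nil // illustrative field
    // Large value fields (the stack array, scratch buffers) stay dirty:
    // the next acquire overwrites them via Clear(), so zeroing here is wasted work.
    interpreterPool.Put(in)
}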
The Evm struct embeds root-level objects directly:
type Evm struct {
rootMemory vm.Memory
rootStack vm.Stack
rootInterp vm.Interpreter
rootBytecode vm.Bytecode
}

Depth-0 calls use these embedded objects directly, avoiding 3-4 pool round-trips per transaction. Only nested calls (depth > 0) use the pool.
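A sketch of the acquire path (helper hypothetical):

// frameInterpreter returns the embedded interpreter for the root frame
// and falls back to the pool only for nested calls.
func (e *Evm) frameInterpreter(depth int) *vm.Interpreter {
    if depth == 0 {
        return &e.rootInterp // no pool round-trip at depth 0
    }
    return interpreterPool.Get().(*vm.Interpreter)
}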
A specialized arena accumulates RETURN/REVERT output buffers during a transaction and releases them all at once in ReleaseEvm():
type ReturnDataArena struct {
bufs []*[]byte
}
func (a *ReturnDataArena) Alloc(size int) []byte { ... } // get from pool
func (a *ReturnDataArena) Reset() { ... }              // return all to pool

File: state/journal.go
Accounts are allocated from a slab allocator instead of sync.Pool:
type accountArena struct {
accounts []Account
idx int
}
func (a *accountArena) alloc() *Account {
if a.idx >= len(a.accounts) {
a.accounts = make([]Account, max(8, len(a.accounts)*2))
a.idx = 0
}
acc := &a.accounts[a.idx]
a.idx++
return acc
}

A typical transaction touches 3-6 accounts. The arena eliminates per-account sync.Pool Get/Put overhead and provides contiguous memory for cache locality.
On reset, storage maps are clear()ed but not freed, preserving their allocations for the next transaction.
Storage slots are allocated from 1024-slot slabs:
type slotArena struct {
slabs [][]EvmStorageSlot
idx int
}

A typical transaction accesses 10-100 storage slots. Batch allocation (1024 at a time) amortizes the cost and provides contiguous memory. On reset, only the first slab is retained.
cachedAddr types.Address
cachedAcc  *Account

Consecutive operations on the same address (common in contract execution) hit this one-entry cache instead of paying for a map lookup. This is safe because Journal.State never removes accounts during execution.
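A sketch of the lookup (method name hypothetical; assumes State is a map from address to *Account):

func (j *Journal) lookupAccount(addr types.Address) *Account {
    if j.cachedAcc != nil && addr == j.cachedAddr {
        return j.cachedAcc // hit: one comparison, no map hash/probe
    }
    acc := j.State[addr] // miss: ordinary map lookup
    if acc != nil {
        j.cachedAddr, j.cachedAcc = addr, acc
    }
    return acc
}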
slotCacheAddr types.Address
slotCacheKeys [2]types.Uint256
slotCacheVals [2]*EvmStorageSlot

The SLOAD-then-SSTORE pattern (e.g., an ERC-20 balance update) hits the cache on the second access. Round-robin eviction keeps the two most recent slots.
type ForkGas struct {
Balance, ExtCodeSize, ExtCodeHash uint64
Sload, Call, Selfdestruct uint64
}
var forkGasCache [20]ForkGas // pre-computed at init

Only 6 opcodes have fork-varying gas costs. A 48-byte struct replaces a 2048-byte [256]uint64 table, fitting in a single L1 cache line. Pre-computed at init for all 20 forks.
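A sketch of the precompute and the resulting hot-path read (constructor hypothetical):

func init() {
    for f := range forkGasCache {
        forkGasCache[f] = computeForkGas(ForkID(f)) // resolve all six costs once
    }
}

// Hot path: a single indexed load from a cache-resident 48-byte struct.
//   cost := forkGasCache[forkID].Sload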
stateAddrs []types.Address

Instead of iterating the entire State map on release (O(map capacity)), tracked addresses enable O(N) cleanup where N = accounts actually touched.
The SSTORE hot path avoids struct copies at every layer:
// Host interface: key, value, result all by pointer
SStore(addr Address, key *Uint256, value *Uint256, out *SStoreResult)

- Key: 32 bytes not copied (pointer)
- Value: 32 bytes not copied (pointer)
- Result: 97 bytes not copied (written to pointer)
- Total savings: ~160 bytes per SSTORE
The SStoreResult is written into Interpreter.SStoreScratch, a reusable scratch field embedded in the Interpreter struct.
Log(addr Address, topics *[4]B256, numTopics int, data Bytes)

Topics are passed as a pointer to a fixed array (Interpreter.TopicsScratch), avoiding a 128-byte array copy and a slice allocation.
Journal.SStoreInto() inlines Touch + SLoad logic to avoid redundant map lookups. A naive implementation does 5 map lookups; the inlined version does 1-2.
func (j *Journal) SLoadInto(addr Address, key *Uint256, out *Uint256) {
k := *key // local copy: safe if key==out (same stack slot)
...
}

The local copy lets the caller pass the same pointer for key and out, which is common when an opcode overwrites the top of the stack in place.
type Evm struct {
host EvmHost // embedded, not *EvmHost
}

The host is embedded by value in the pooled Evm struct. This prevents a separate heap allocation and avoids one pointer dereference on every host call.
type EvmHost struct {
Block *BlockEnv // ~256 bytes, stored by pointer
Cfg *CfgEnv // ~32 bytes, stored by pointer
Tx TxEnv // ~16 bytes, stored by value
}

BlockEnv and CfgEnv are stored by pointer to avoid ~288 bytes of duffcopy (Go's bulk memory copy routine) per transaction. TxEnv is small enough to store by value.
CallOutcome, CreateOutcome, and FrameResult are value types (not heap-allocated). Frame results are returned on the stack, avoiding heap escape for sub-frame returns.
executeCall(inputs *vm.CallInputs, ...) // ~170 bytes
executeCreate(inputs *vm.CreateInputs, ...) // ~120 bytes

Passed by pointer to avoid struct copies on every frame entry.
File: precompiles/precompile.go
PrecompileSets are built once per fork and cached globally:
var cachedSets [20]*PrecompileSet // one per fork, built at init
func ForSpec(forkID ForkID) *PrecompileSet { return cachedSets[forkID] }

Each set includes a pre-computed warm address map (map[Address]struct{}). During transaction setup, this map is shared by reference (not copied) with the Journal's warm address tracker:
journal.WarmAddresses.SetPrecompileAddresses(precompileSet.WarmAddressMap())

No Account objects are created for precompile warming. The coinbase is also warmed by value (not by loading an account):
type WarmAddresses struct {
precompiles map[Address]struct{} // shared immutable reference
coinbase Address // stored by value, not &addr
hasCoinbase bool
}

Files: vm/hooks.go, vm/runner.go, vm/gen/main.go
GEVM supports Erigon-style tracing hooks (OnTxStart, OnEnter, OnExit, OnOpcode, OnFault) with zero overhead when disabled.
Two separate generated functions:
- DefaultRunner.Run(): no tracing code at all
- TracingRunner.Run(): per-opcode gas deduction + hook calls
When Evm.Runner is nil (the default), DefaultRunner is used. The tracing code path is never entered. There is no if debug { branch in the fast path.
In the handler, OnEnter/OnExit hooks use simple nil checks:
if h.hooks != nil && h.hooks.OnEnter != nil {
h.hooks.OnEnter(...)
}

Since hooks is nil in production, the branch predictor quickly learns the pattern; the residual cost is ~0.5% (roughly one mispredicted branch per ~200 correctly predicted ones while the predictor warms up).
A per-fork DebugGasTable ([256]uint64) is built on demand for tracers, providing accurate static gas costs in OnOpcode callbacks without interpreting gas expressions at runtime.
File: vm/gen/main.go
The entire dispatch loop is generated from a single opcode table definition (~170 lines). The generator:
- Parses all inst_*.go files with Go's AST parser to extract handler function bodies
- Emits a switch op with ~256 cases, inlining hot opcodes and calling functions for complex ones
- Generates two variants: fast path (gas accumulation) and tracing path (per-opcode gas)
- Handles fork gating: fork-gated opcodes get if forkID < ForkX { halt NotActivated } checks
- Emits shaped stack boilerplate (binary/unary/push/pop patterns)
- Produces DebugGasTableForFork() for tracer support
This ensures the fast and tracing paths stay in sync from a single source of truth, and allows adding new opcodes by editing one table entry.
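For a sense of what "one table entry" means, an entry might look like this (field names illustrative, not the generator's actual schema):

// One entry per opcode drives both generated Run variants:
{Op: 0x01, Name: "ADD", Shape: shapeBinary, StaticGas: "GasVeryLow", Inline: true},
{Op: 0x54, Name: "SLOAD", DynamicGas: true, Handler: "instSload"},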
Run with: cd vm/gen && go run .