This document describes the performance engineering behind GEVM. The goal is to execute EVM bytecode at near-native speed in Go, closing the gap with Rust (revm) while maintaining correctness across 44,035 spec tests.
Result: ~2x faster than geth, within 10% of revm on real-world workloads (ERC-20, Snailtracer).
- Interpreter Dispatch
- Gas Accounting
- Opcode Inlining
- Stack
- Memory
- Bytecode & Jump Tables
- Object Pooling
- Arena Allocators
- Caches
- Pointer Parameters & Scratch Buffers
- Embedding & Heap Escape Prevention
- Precompile Set Caching
- Zero-Overhead Tracing
- Code Generation
Files: vm/gen/main.go, vm/table_gen.go
Traditional EVM interpreters use an instruction table: a [256]func(interp, host) array indexed by opcode. Each opcode dispatch requires a pointer dereference and indirect call. GEVM replaces this with a direct switch op statement over all 256 opcodes.
Why it matters:
- Indirect calls defeat CPU branch prediction and prevent inlining
- A switch statement lets the compiler emit a jump table with direct branches
- Hot opcodes (ADD, MLOAD, JUMP, PUSH1) stay in the L1 instruction cache
- The Go compiler can inline small case bodies directly into the switch
The switch is generated by vm/gen/main.go which parses opcode handler functions from inst_*.go files and emits them as switch cases into table_gen.go.
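A minimal sketch of the difference (types and handler names are illustrative, not GEVM's actual code):

type interp struct {
    pc   int
    code []byte
}

func (in *interp) opAdd()          { in.pc++ } // stand-in handler
func (in *interp) opOther(op byte) { in.pc++ }

// Table dispatch: every opcode pays an indirect call through the array,
// which blocks inlining and strains the branch-target predictor.
var table = [256]func(*interp){
    0x01: (*interp).opAdd, // remaining entries elided
}

func runTable(in *interp) {
    for in.pc < len(in.code) {
        table[in.code[in.pc]](in)
    }
}

// Switch dispatch: the compiler emits a jump table with direct branches,
// and small case bodies can be inlined straight into the loop.
func runSwitch(in *interp) {
    for in.pc < len(in.code) {
        switch op := in.code[in.pc]; op {
        case 0x00: // STOP
            return
        case 0x01: // ADD
            in.opAdd()
        default:
            in.opOther(op)
        }
    }
}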
Files: vm/gen/main.go (lines 29-34, 265-289), vm/table_gen.go
Most EVMs check and deduct gas on every opcode. This adds a branch and a memory write per instruction. GEVM uses a two-mode gas strategy:
Accumulate mode (static-gas opcodes like ADD, MUL, LT, AND):
gasCounter += GasVeryLow // just add, no check
Flush mode (dynamic-gas opcodes like SLOAD, CALL, KECCAK256):
if gas.remaining < gasCounter + dynamicCost { halt OOG }
gas.remaining -= gasCounter + dynamicCost
gasCounter = 0
The gas counter is a local variable (register-allocated). It only flushes to the Gas struct at control flow boundaries (JUMP, CALL, STOP) or when dynamic gas is needed. This eliminates one branch and one memory write per static-gas opcode in the hot loop.
Stack underflow/overflow checks also trigger a flush before halting, so the gas state is always consistent on error.
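A condensed sketch of the two modes inside the generated loop (the opcode values are real; the types and cost helpers are illustrative):

import "errors"

type Gas struct{ remaining uint64 }

const (
    opSTOP     = 0x00
    opADD      = 0x01
    opSLOAD    = 0x54
    gasVeryLow = 3
)

var errOutOfGas = errors.New("out of gas")

func sloadCost() uint64 { return 2100 } // placeholder for the real warm/cold logic

func run(code []byte, gas *Gas) error {
    var gasCounter uint64 // lives in a register; no memory write per opcode
    for pc := 0; pc < len(code); pc++ {
        switch code[pc] {
        case opADD: // accumulate mode: just add, no check
            gasCounter += gasVeryLow
        case opSLOAD: // flush mode: check counter + dynamic cost together
            dyn := sloadCost()
            if gas.remaining < gasCounter+dyn {
                return errOutOfGas
            }
            gas.remaining -= gasCounter + dyn
            gasCounter = 0
        case opSTOP: // control-flow boundary: flush before leaving the loop
            if gas.remaining < gasCounter {
                return errOutOfGas
            }
            gas.remaining -= gasCounter
            return nil
        }
    }
    return nil
}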
Files: vm/gen/main.go (lines 318-341), vm/inst_*.go
The code generator recognizes opcodes marked inline: true and copies their function body directly into the switch case, rather than emitting a function call.
Inlined opcodes (~40):
- Arithmetic: ADD, MUL, SUB, DIV, SDIV, MOD, SMOD, ADDMOD, MULMOD, EXP, SIGNEXTEND
- Comparison: LT, GT, SLT, SGT, EQ, ISZERO
- Bitwise: AND, OR, XOR, NOT, BYTE, SHL, SHR, SAR, CLZ
- Memory: MLOAD, MSTORE
- Control: JUMP, JUMPI, POP
- Push: PUSH0-4, PUSH20, PUSH32
- Context: ADDRESS, CALLER, CALLVALUE, CALLDATALOAD, CALLDATASIZE, CODESIZE, RETURNDATASIZE, PC, MSIZE
Not inlined (function calls): KECCAK256, SLOAD, SSTORE, CALL, CREATE, BALANCE, EXTCODE* — these have complex logic or host interactions.
The generator also applies "local variable substitution": replacing interp.Bytecode. with a local bc variable to avoid repeated pointer dereferences in inlined bodies.
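A minimal sketch of the effect (types hypothetical):

type bytecode struct{ code []byte }
type interpState struct{ Bytecode *bytecode }

// Without substitution: each access reloads interp.Bytecode from memory.
func readTwo(in *interpState, pc int) (byte, byte) {
    return in.Bytecode.code[pc+1], in.Bytecode.code[pc+2]
}

// With substitution: the generator textually hoists the pointer into a
// local, so the emitted case body indexes bc.code directly.
func readTwoSubst(in *interpState, pc int) (byte, byte) {
    bc := in.Bytecode
    return bc.code[pc+1], bc.code[pc+2]
}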
Shaped stack operations further reduce boilerplate. The generator recognizes patterns (binary op, unary op, push, pop) and emits optimized stack manipulation:
// Binary op (ADD, MUL, etc.): pop 2, push 1
s.top--
types.AddTo(&s.data[s.top-1], &s.data[s.top], &s.data[s.top-1])

File: vm/stack.go
type Stack struct {
data [1024]types.Uint256 // 32KB fixed array
top int
}

The stack is a fixed-size array, not a slice. This means:
- No bounds checking on growth (capacity is compile-time constant)
- No slice header manipulation
- Predictable memory layout for CPU cache
- Embedded in Interpreter struct (single allocation)
Uint256 is [4]uint64 (32 bytes, value type). Arithmetic helpers take pointer parameters (types.AddTo(&dst, &a, &b)) to modify values in place without copies.
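A minimal sketch of the in-place pattern (the real types package implementation may differ; limb order is assumed little-endian):

import "math/bits"

type Uint256 [4]uint64 // assumption: least-significant limb first

// AddTo computes *dst = *a + *b (mod 2^256) in place: no operand is
// copied, and dst may alias a or b.
func AddTo(dst, a, b *Uint256) {
    var c uint64
    dst[0], c = bits.Add64(a[0], b[0], 0)
    dst[1], c = bits.Add64(a[1], b[1], c)
    dst[2], c = bits.Add64(a[2], b[2], c)
    dst[3], _ = bits.Add64(a[3], b[3], c)
}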
File: vm/memory.go
All call frames in a transaction share a single underlying []byte buffer. Each frame owns a region starting at its checkpoint offset:
type Memory struct {
buffer *[]byte // shared pointer across call stack
checkpoint int // start offset for this frame
}

When a CALL creates a child frame, it records the current buffer length as the child's checkpoint. The child appends to the same buffer. On return, the buffer is truncated back to the checkpoint. No copying occurs.
Child Memory structs (16 bytes of metadata) are pooled via sync.Pool. The buffer itself is never copied or reallocated unless it needs to grow.
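A sketch of the checkpoint discipline (method names illustrative; pooling elided):

// NewChild hands the shared buffer to a child frame; the child's region
// begins where the parent's data currently ends.
func (m *Memory) NewChild() Memory {
    return Memory{buffer: m.buffer, checkpoint: len(*m.buffer)}
}

// Release truncates the shared buffer back to this frame's checkpoint.
// Parent data below the checkpoint is untouched; nothing is copied.
func (m *Memory) Release() {
    *m.buffer = (*m.buffer)[:m.checkpoint]
}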
if targetLen <= cap(buf) {
buf = buf[:targetLen]
clear(buf[oldLen:]) // compiles to SIMD memclr on arm64
}

Go's clear() builtin compiles to runtime.memclrNoHeapPointers, which uses NEON SIMD on ARM64. This is 2-3x faster than a byte-by-byte loop.
MemoryGas tracks the current word count and cumulative expansion cost. On resize, only the delta is charged:
func (m *MemoryGas) RecordNewLen(newNum uint64) uint64 {
if newNum <= m.WordsNum { return 0 }
oldCost := m.ExpansionCost
m.ExpansionCost = memoryGas(newNum) // quadratic formula
return m.ExpansionCost - oldCost // delta only
}

File: vm/bytecode.go
Bytecode is padded with 33 zero bytes (STOP opcodes) at the end:
padded := make([]byte, originalLen + 33) // 1 opcode + 32 data bytes
copy(padded, code)

This allows PUSH32 to read 32 bytes past any valid PC without bounds checking. Every PUSH operand read is a simple slice operation with no if pc+n > len(code) guard.
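For illustration, an operand read then reduces to a plain slice expression (helper hypothetical):

// readPushOperand returns the n-byte immediate of a PUSH opcode. The
// 33-byte STOP pad guarantees pc+1+n <= len(code), so no explicit
// length guard or error path is needed.
func readPushOperand(code []byte, pc int, op byte) []byte {
    n := int(op) - 0x5f // PUSH1 (0x60) -> 1 ... PUSH32 (0x7f) -> 32
    return code[pc+1 : pc+1+n]
}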
JUMPDEST positions are stored as a bitmap (1 bit per byte of code). Analysis scans the bytecode once, skipping PUSH operands. An 8-byte fast path skips zero regions:
if i+8 <= len(code) && binary.LittleEndian.Uint64(code[i:i+8]) == 0 {
i += 8 // skip 8 zero bytes at once
continue
}

When a Bytecode object is recycled from the pool with the same code hash, the jump table analysis is skipped entirely:
func (b *Bytecode) ResetWithHash(code []byte, hash B256) {
if b.hash == hash && b.originalLen == len(code) {
copy(b.code, code) // reuse jump table
return
}
b.Reset(code) // full re-analysis
}

The Evm struct maintains a JumpTableCache map[B256][]byte that persists across transactions. When the same contract is called multiple times in a block, the jump table is looked up by code hash and injected directly, skipping analysis entirely.
The jumpTableExternal flag prevents the pooled Bytecode from reusing the cached slice's capacity on recycle.
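For illustration, the JUMPDEST validation that JUMP/JUMPI performs against this bitmap is a shift and a mask (helper hypothetical):

// isJumpdest tests one bit per code byte; bit i is set iff code[i] is a
// JUMPDEST outside any PUSH operand.
func isJumpdest(bitmap []byte, dest uint64) bool {
    idx := dest >> 3
    return idx < uint64(len(bitmap)) && bitmap[idx]&(1<<(dest&7)) != 0
}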
Files: vm/pool.go, state/journal.go, host/evm.go
GEVM pools every major object via sync.Pool:
| Object | Size | Pool |
|---|---|---|
| Evm | ~33KB (includes rootStack) | evmPool |
| Journal | ~1KB + arenas | journalPool |
| Interpreter | ~600B | interpreterPool |
| Stack | ~32KB | embedded in Interpreter |
| Memory | ~32B metadata + 4KB buffer | memoryPool |
| Child Memory | ~32B metadata | childMemPool |
| Bytecode | ~200B + code slice | bytecodePool |
| Return buffers | 4KB default | returnBufPool |
Release functions only nil out pointer/slice fields (for GC). Large value fields like ActionData (~300B) and stack data are left dirty because Clear() will overwrite them on next acquire.
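A sketch of that release discipline (field names illustrative):

var interpreterPool = sync.Pool{New: func() any { return new(Interpreter) }} // import "sync"

func ReleaseInterpreter(in *Interpreter) {
    // Nil out pointers and slices so the GC can reclaim what they reference.
    in.Bytecode = nil   // illustrative field
    in.ReturnData = nil // illustrative field
    // Large value fields (the stack array, scratch buffers) stay dirty:
    // the next acquire overwrites them via Clear(), so zeroing here is wasted work.
    interpreterPool.Put(in)
}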
The Evm struct embeds root-level objects directly:
type Evm struct {
rootMemory vm.Memory
rootStack vm.Stack
rootInterp vm.Interpreter
rootBytecode vm.Bytecode
}

Depth-0 calls use these embedded objects directly, avoiding 3-4 pool round-trips per transaction. Only nested calls (depth > 0) use the pool.
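A sketch of the acquire path (helper hypothetical):

// frameInterpreter returns the embedded interpreter for the root frame
// and falls back to the pool only for nested calls.
func (e *Evm) frameInterpreter(depth int) *vm.Interpreter {
    if depth == 0 {
        return &e.rootInterp // no pool round-trip at depth 0
    }
    return interpreterPool.Get().(*vm.Interpreter)
}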
A specialized arena accumulates RETURN/REVERT output buffers during a transaction and releases them all at once in ReleaseEvm():
type ReturnDataArena struct {
bufs []*[]byte
}
func (a *ReturnDataArena) Alloc(size int) []byte { ... } // get from pool
func (a *ReturnDataArena) Reset() { ... }              // return all to pool

File: state/journal.go
Accounts are allocated from a slab allocator instead of sync.Pool:
type accountArena struct {
accounts []Account
idx int
}
func (a *accountArena) alloc() *Account {
if a.idx >= len(a.accounts) {
a.accounts = make([]Account, max(8, len(a.accounts)*2))
a.idx = 0
}
acc := &a.accounts[a.idx]
a.idx++
return acc
}

A typical transaction touches 3-6 accounts. The arena eliminates per-account sync.Pool Get/Put overhead and provides contiguous memory for cache locality.
On reset, storage maps are clear()ed but not freed, preserving their allocations for the next transaction.
Storage slots are allocated from 1024-slot slabs:
type slotArena struct {
slabs [][]EvmStorageSlot
idx int
}

A typical transaction accesses 10-100 storage slots. Batch allocation (1024 at a time) amortizes the cost and provides contiguous memory. On reset, only the first slab is retained.
cachedAddr types.Address
cachedAcc  *Account

Consecutive operations on the same address (common in contract execution) hit this one-entry cache instead of paying for a map lookup. This is safe because Journal.State never removes accounts during execution.
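A sketch of the lookup (method name hypothetical; assumes State is a map from address to *Account):

func (j *Journal) lookupAccount(addr types.Address) *Account {
    if j.cachedAcc != nil && addr == j.cachedAddr {
        return j.cachedAcc // hit: one comparison, no map hash/probe
    }
    acc := j.State[addr] // miss: ordinary map lookup
    if acc != nil {
        j.cachedAddr, j.cachedAcc = addr, acc
    }
    return acc
}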
slotCacheAddr types.Address
slotCacheKeys [2]types.Uint256
slotCacheVals [2]*EvmStorageSlot

The SLOAD-then-SSTORE pattern (e.g., an ERC-20 balance update) hits the cache on the second access. Round-robin eviction keeps the two most recent slots.
type ForkGas struct {
Balance, ExtCodeSize, ExtCodeHash uint64
Sload, Call, Selfdestruct uint64
}
var forkGasCache [20]ForkGas // pre-computed at init

Only 6 opcodes have fork-varying gas costs. A 48-byte struct replaces a 2048-byte [256]uint64 table, fitting in a single L1 cache line. Pre-computed at init for all 20 forks.
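A sketch of the precompute and the resulting hot-path read (constructor hypothetical):

func init() {
    for f := range forkGasCache {
        forkGasCache[f] = computeForkGas(ForkID(f)) // resolve all six costs once
    }
}

// Hot path: a single indexed load from a cache-resident 48-byte struct.
//   cost := forkGasCache[forkID].Sload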
stateAddrs []types.Address

Instead of iterating the entire State map on release (O(map capacity)), tracked addresses enable O(N) cleanup where N = accounts actually touched.
The SSTORE hot path avoids struct copies at every layer:
// Host interface: key, value, result all by pointer
SStore(addr Address, key *Uint256, value *Uint256, out *SStoreResult)

- Key: 32 bytes not copied (pointer)
- Value: 32 bytes not copied (pointer)
- Result: 97 bytes not copied (written to pointer)
- Total savings: ~160 bytes per SSTORE
The SStoreResult is written into Interpreter.SStoreScratch, a reusable scratch field embedded in the Interpreter struct.
Log(addr Address, topics *[4]B256, numTopics int, data Bytes)

Topics are passed as a pointer to a fixed array (Interpreter.TopicsScratch), avoiding a 128-byte array copy and a slice allocation.
Journal.SStoreInto() inlines Touch + SLoad logic to avoid redundant map lookups. A naive implementation does 5 map lookups; the inlined version does 1-2.
func (j *Journal) SLoadInto(addr Address, key *Uint256, out *Uint256) {
k := *key // local copy: safe if key==out (same stack slot)
...
}

The local copy lets the caller pass the same pointer for key and out, which is common when an opcode overwrites the top of the stack in place.
type Evm struct {
host EvmHost // embedded, not *EvmHost
}

The host is embedded by value in the pooled Evm struct. This prevents a separate heap allocation and avoids one pointer dereference on every host call.
type EvmHost struct {
Block *BlockEnv // ~256 bytes, stored by pointer
Cfg *CfgEnv // ~32 bytes, stored by pointer
Tx TxEnv // ~16 bytes, stored by value
}

BlockEnv and CfgEnv are stored by pointer to avoid ~288 bytes of duffcopy (Go's bulk memory copy routine) per transaction. TxEnv is small enough to store by value.
CallOutcome, CreateOutcome, and FrameResult are value types (not heap-allocated). Frame results are returned on the stack, avoiding heap escape for sub-frame returns.
executeCall(inputs *vm.CallInputs, ...) // ~170 bytes
executeCreate(inputs *vm.CreateInputs, ...) // ~120 bytes

Passed by pointer to avoid struct copies on every frame entry.
File: precompiles/precompile.go
PrecompileSets are built once per fork and cached globally:
var cachedSets [20]*PrecompileSet // one per fork, built at init
func ForSpec(forkID ForkID) *PrecompileSet { return cachedSets[forkID] }

Each set includes a pre-computed warm address map (map[Address]struct{}). During transaction setup, this map is shared by reference (not copied) with the Journal's warm address tracker:
journal.WarmAddresses.SetPrecompileAddresses(precompileSet.WarmAddressMap())

No Account objects are created for precompile warming. The coinbase is also warmed by value (not by loading an account):
type WarmAddresses struct {
precompiles map[Address]struct{} // shared immutable reference
coinbase Address // stored by value, not &addr
hasCoinbase bool
}

Files: vm/hooks.go, vm/runner.go, vm/gen/main.go
GEVM supports Erigon-style tracing hooks (OnTxStart, OnEnter, OnExit, OnOpcode, OnFault) with zero overhead when disabled.
Two separate generated functions:
- DefaultRunner.Run(): no tracing code at all
- TracingRunner.Run(): per-opcode gas deduction + hook calls
When Evm.Runner is nil (the default), DefaultRunner is used. The tracing code path is never entered. There is no if debug { branch in the fast path.
In the handler, OnEnter/OnExit hooks use simple nil checks:
if h.hooks != nil && h.hooks.OnEnter != nil {
h.hooks.OnEnter(...)
}

Since hooks is nil in production, the branch predictor quickly learns the pattern; the residual cost is ~0.5% (roughly one mispredicted branch per ~200 correctly predicted ones while the predictor warms up).
A per-fork DebugGasTable ([256]uint64) is built on demand for tracers, providing accurate static gas costs in OnOpcode callbacks without interpreting gas expressions at runtime.
File: vm/gen/main.go
The entire dispatch loop is generated from a single opcode table definition (~170 lines). The generator:
- Parses all inst_*.go files with Go's AST parser to extract handler function bodies
- Emits a switch op with ~256 cases, inlining hot opcodes and calling functions for complex ones
- Generates two variants: fast path (gas accumulation) and tracing path (per-opcode gas)
- Handles fork gating: fork-gated opcodes get if forkID < ForkX { halt NotActivated } checks
- Emits shaped stack boilerplate (binary/unary/push/pop patterns)
- Produces DebugGasTableForFork() for tracer support
This ensures the fast and tracing paths stay in sync from a single source of truth, and allows adding new opcodes by editing one table entry.
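For a sense of what "one table entry" means, an entry might look like this (field names illustrative, not the generator's actual schema):

// One entry per opcode drives both generated Run variants:
{Op: 0x01, Name: "ADD", Shape: shapeBinary, StaticGas: "GasVeryLow", Inline: true},
{Op: 0x54, Name: "SLOAD", DynamicGas: true, Handler: "instSload"},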
Run with: cd vm/gen && go run .