VSR Coding Standards

This document outlines the normative coding standards for the VSR (Viewstamped Replication) codebase.

Module Organization

Structure

Organize modules with clear sections using comments:

defmodule Vsr.ModuleName do
  @moduledoc """
  Clear, concise module documentation.
  
  Explain purpose, usage patterns, and important notes.
  Include examples where helpful.
  """
  
  # Section 0: Types and Constants
  @type t :: %__MODULE__{}
  @default_opts [key: :value]
  
  # Section 1: Public API
  def public_function(args), do: implementation
  
  # Section 2: GenServer callbacks (if applicable)
  def init(opts), do: {:ok, state}
  def handle_call(msg, from, state), do: {:reply, :ok, state}
  
  # Section 3: Private helpers
  defp helper_function(args), do: implementation
end

Documentation Standards

All public functions must have @doc strings
All modules must have @moduledoc with clear purpose explanation
Protocol callbacks should document expected behavior and return values
Complex algorithms require inline comments explaining the logic
Include examples in documentation where helpful

Naming Conventions

Modules

PascalCase: Vsr.StateMachine, VsrKv
Descriptive names: Module name should clearly indicate purpose
Namespace properly: Use Vsr. prefix for core protocol modules

Functions

snake_case: client_request_impl, increment_op_number
Descriptive names: Function name should clearly indicate what it does
Implementation suffix: Message handlers use _impl suffix (prepare_impl/2)
Protocol callbacks: Prefix with _ (_apply_operation, _get_state)

Variables and Atoms

snake_case: view_number, op_number
Descriptive: Avoid abbreviations unless they're domain-specific (VSR terms)
Consistent terminology: Use VSR specification terms exactly

Constants

snake_case, matching the Structure template: @default_opts, @filter
Module attributes: Use @ for compile-time constants

Module Aliases

Individual alias statements: Use separate alias statements for each module in source files
❌ Avoid multi-module aliases in source code: alias Module.{A, B, C}

✅ Use individual aliases in source files:

alias Module.A
alias Module.B
alias Module.C

✅ Exception: Bracketed aliases are acceptable for command-line usage (IEx, elixir -e "...")
Reason: Individual aliases are clearer, easier to grep, and avoid merge conflicts in source files

Function Patterns

Function Definitions

# Good - descriptive name, clear pattern matching
defp client_request_impl(%ClientRequest{operation: op, from: from}, state) do
  # implementation
end

# Bad - abbreviated name, unclear purpose  
defp handle_cr(msg, state) do
  # implementation
end

Error Handling

Let it crash: Follow Erlang/Elixir philosophy - don't code defensively with excessive try/catch
No defensive catch-all branches: Never use _ -> branches in case statements to "handle" unexpected input - let it crash so problems surface in logs
Explicit error tuples: {:error, :reason} not just :error for expected error conditions
Consistent patterns: Always use same error tuple format
Meaningful error reasons: Use descriptive atoms for error reasons
Crash on invalid input: Let functions crash on invalid input rather than handling every edge case
No defensive logging: Don't log warnings for "unhandled" cases - if it's truly unhandled, it should crash

# Good - let it crash on invalid input
def fetch_operation(log, op_number) when is_integer(op_number) and op_number > 0 do
  case Log.fetch(log, op_number) do
    {:ok, entry} -> {:ok, entry}
    :error -> {:error, :not_found}
  end
end

# Good - crash immediately on invalid JSON rather than defensive error handling
def stdin_impl(stdin, _from, state) do
  json_map = JSON.decode!(stdin)  # Let it crash if invalid JSON
  message = Message.from_json_map(json_map)  # Let it crash if invalid message
  # ... process message
end

# Bad - overly defensive
def fetch_operation(log, op_number) do
  try do
    case Log.fetch(log, op_number) do
      {:ok, entry} -> {:ok, entry}
      :error -> {:error, :not_found}
    end
  rescue
    _ -> {:error, :unknown}
  end
end

When to let it crash:

Invalid JSON input
Missing required fields in messages
Type mismatches (string passed where integer expected)
Programming errors (calling undefined functions)
Contract violations (guards should enforce contracts)

When to use error tuples:

Expected business logic failures (key not found, insufficient permissions)
Network failures that should be retried
Resource unavailability that can be handled gracefully

Never add catch-all clauses for operation handlers:

Protocol message handlers like handle_commit/2 should NOT have default _ -> ... clauses
Unknown operations indicate programming errors - let them crash to surface bugs immediately
Defensive catch-alls hide problems - if an unexpected operation arrives, it's better to crash noisily than silently ignore it
Example: handle_commit/2 should only handle known operations like ["read", key], ["write", key, value], ["cas", key, from, to] and crash on anything else
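
A minimal sketch of the rule above, assuming a state map with a :data field like the handle_commit example later in this document. The module name CommitSketch and the error atoms :not_found and :precondition_failed are hypothetical; only the three operation shapes come from the text.

```elixir
defmodule CommitSketch do
  # Hypothetical state shape: %{data: %{key => value}}.

  def handle_commit(["read", key], state) do
    result =
      case Map.fetch(state.data, key) do
        {:ok, value} -> {:ok, value}
        :error -> {:error, :not_found}
      end

    {state, result}
  end

  def handle_commit(["write", key, value], state) do
    {%{state | data: Map.put(state.data, key, value)}, :ok}
  end

  def handle_commit(["cas", key, from, to], state) do
    case Map.fetch(state.data, key) do
      {:ok, ^from} -> {%{state | data: Map.put(state.data, key, to)}, :ok}
      {:ok, _other} -> {state, {:error, :precondition_failed}}
      :error -> {state, {:error, :not_found}}
    end
  end

  # Deliberately no catch-all clause: an unknown operation raises
  # FunctionClauseError and shows up in the logs.
end
```

Expected failures (missing key, failed compare-and-set) come back as error tuples, while an operation such as ["delete", key] crashes the caller.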

Pattern Matching

Match in function heads when possible
Use guards for simple validations
Use with for complex validation chains

# Good - pattern matching in function head
defp prepare_impl(%Prepare{view: view} = prepare, %{view_number: current_view} = state) 
  when view >= current_view do
  # implementation
end

# Good - with statement for complex validation
defp validate_prepare(prepare, state) do
  # Guards cannot call remote functions, so compute the log length first
  log_length = Log.length(state.log)

  with {:view, view} when view >= state.view_number <- {:view, prepare.view},
       {:op, op} when op > log_length <- {:op, prepare.op_number} do
    {:ok, prepare}
  else
    {:view, _} -> {:error, :stale_view}
    {:op, _} -> {:error, :duplicate_operation}
  end
end

Code Style and Syntax

Keyword Lists

Remove unnecessary brackets: When keyword lists are the last argument to a function, omit the brackets
❌ Avoid: start_supervised!({Module, [option: value, other: data]})
✅ Use: start_supervised!({Module, option: value, other: data})
Exception: Brackets are required when the keyword list is not the last argument

# Good - no brackets around keyword list as last argument
start_supervised!({Vsr.ListKv, node_id: :replica1, cluster_size: 3})

# Good - brackets required when not last argument
some_function([option: value], other_arg)

# Bad - unnecessary brackets around keyword list
start_supervised!({Vsr.ListKv, [node_id: :replica1, cluster_size: 3]})

Module Aliases

Never use as: for aliases: Always use individual alias statements without renaming
❌ Avoid: alias Some.Long.Module, as: Module
✅ Use: alias Some.Long.Module
Reasoning: The as: pattern obscures the actual module name and makes code harder to follow

# Good - clear module aliasing
alias Maelstrom.Message
alias Maelstrom.Message.Init
alias Maelstrom.Message.Echo

# Bad - using as: for aliasing
alias Maelstrom.Message, as: Message
alias Some.Long.Module, as: Short

# Exception: Only use full module names when there would be naming conflicts
defmodule MyApp.Message do
  # Here we'd need to use full names to avoid conflicts with our own Message
  # But prefer restructuring modules to avoid this situation
end

Protocol Implementation with Protoss

Understanding Protoss Framework

CRITICAL: VSR uses the Protoss framework for colocated protocol implementations. This is NOT standard Elixir protocols or behaviours.

Protoss Protocol Definition

use Protoss  # MUST be at the top of protocol files

defprotocol Vsr.ProtocolName do
  @moduledoc """
  Clear protocol documentation explaining purpose.
  """
  
  @doc """
  Function documentation with expected behavior.
  """
  def callback_name(implementer, args)
after
  # Protoss callback specification (like @behaviour)
  @callback _new(vsr :: term, options :: keyword) :: t
end

Protoss Protocol Implementation

defmodule MyImplementation do
  use Vsr.ProtocolName  # Use the Protoss protocol (NOT defimpl!)
  
  # Implementation callbacks - these are called by Protoss
  def _new(vsr_instance, opts) do
    # Initialize the implementation
    %__MODULE__{...}
  end
  
  # Protocol function implementations
  def callback_name(implementer, args) do
    # implementation
  end
end

Key Protoss Patterns

State Machine Implementation

defmodule MyStateMachine do
  use Vsr.StateMachine  # Use Protoss protocol
  
  def _new(_vsr_pid, _opts), do: %__MODULE__{state: %{}}
  def _apply_operation(sm, op), do: {new_sm, result}  
  def _read_only?(_sm, _op), do: false
  def _require_linearized?(_sm, _op), do: true
  def _get_state(sm), do: sm.state
  def _set_state(sm, new_state), do: %{sm | state: new_state}
end

Log Implementation

defmodule MyLog do
  use Vsr.Log  # Use Protoss protocol
  
  def _new(node_id, opts), do: %__MODULE__{...}
  def append(log, entry), do: updated_log
  def fetch(log, op_number), do: {:ok, entry} | {:error, :not_found}
  def get_all(log), do: [entries...]
  def get_from(log, op_number), do: [entries...]  
  def length(log), do: integer()
  def replace(log, entries), do: updated_log
  def clear(log), do: cleared_log
end

Protoss vs Standard Elixir Protocols

WRONG - Standard Elixir Protocol

defimpl Vsr.Log, for: MyLog do
  def append(log, entry), do: ...
end

CORRECT - Protoss Protocol

defmodule MyLog do
  use Vsr.Log  # This is the Protoss way
  
  def append(log, entry), do: ...
end

VSR Initialization Patterns

Using Protoss Specs (Recommended)

# VSR will call Module._new(args...) via Protoss
vsr_opts = [
  log: {MyLog, [node_id, [dets_file: "log.dets"]]},
  state_machine: {MyStateMachine, []}
]

Direct Instance (When Needed)

# Only use this when you need to pre-configure the instance
log = MyLog._new(node_id, dets_file: "log.dets")
vsr_opts = [log: log, ...]

Common Protoss Mistakes to Avoid

Using defimpl instead of use
- ❌ defimpl Vsr.Log, for: MyLog
- ✅ defmodule MyLog do use Vsr.Log
Missing _new callback
- ❌ No _new function defined
- ✅ def _new(args...), do: %__MODULE__{...}
Wrong initialization pattern
- ❌ MyLog.new(args...)
- ✅ MyLog._new(args...) or {MyLog, [args...]}
Mixing standard protocols with Protoss
- ❌ Using both use and defimpl for same protocol
- ✅ Use only use for Protoss protocols

Behaviour Implementation (Non-Protoss)

defmodule MyBehaviour do
  @behaviour Vsr.BehaviourName
  
  @impl Vsr.BehaviourName
  def callback_function(args) do
    # implementation
  end
end

Type Specifications

Struct Types

defmodule Vsr.Message.Prepare do
  @type t :: %__MODULE__{
    view: non_neg_integer(),
    op_number: non_neg_integer(), 
    operation: term(),
    commit_number: non_neg_integer(),
    from: GenServer.from()
  }
  
  defstruct [:view, :op_number, :operation, :commit_number, :from]
end

Function Specs

@spec client_request(pid(), term()) :: term()
def client_request(pid, operation) do
  # implementation
end

GenServer Patterns

handle_call vs handle_cast Guidelines

PREFER handle_call: Always use handle_call by default for synchronous operations
AVOID handle_cast: Only use handle_cast when there is a specific performance or deadlock reason
Performance reasons: High-throughput scenarios where immediate response is not needed
Deadlock reasons: When synchronous calls would create circular dependencies
Code clarity: handle_call makes control flow explicit and easier to debug
Error handling: handle_call allows proper error propagation to callers
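
As a small illustration of the call-first guideline, the sketch below propagates an expected failure to the caller through handle_call. The module CallFirstSketch and its put/fetch API are hypothetical and are not part of the VSR codebase.

```elixir
defmodule CallFirstSketch do
  use GenServer

  # Client API: every operation is a synchronous call.
  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def put(pid, key, value), do: GenServer.call(pid, {:put, key, value})
  def fetch(pid, key), do: GenServer.call(pid, {:fetch, key})

  # Server callbacks
  @impl GenServer
  def init(:ok), do: {:ok, %{}}

  @impl GenServer
  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, Map.put(state, key, value)}
  end

  def handle_call({:fetch, key}, _from, state) do
    reply =
      case Map.fetch(state, key) do
        {:ok, value} -> {:ok, value}
        :error -> {:error, :not_found}
      end

    # The caller receives the error tuple directly; a cast would drop it.
    {:reply, reply, state}
  end
end
```

With handle_cast the caller would get :ok immediately and never learn that the key was missing.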

Client/Server Separation

# Client API
def start_link(opts), do: GenServer.start_link(__MODULE__, opts)
def client_request(pid, op), do: GenServer.call(pid, {:client_request, op})

# Server callbacks  
def handle_call({:client_request, op}, from, state) do
  client_request_impl(op, from, state)
end

# Implementation functions
defp client_request_impl(operation, from, state) do
  # actual logic here
  {:reply, result, new_state}
end

Router Pattern

The Router pattern (as exemplified in VsrServer) separates message routing from message processing:

# ROUTER - Only routes messages to handlers, no processing
def handle_call({:message, message}, from, state) do
  case message.body do
    %Init{} = init ->
      do_init(init, from, state)
      
    %Echo{} = echo ->
      do_echo(echo, from, state)
      
    # No defensive _ -> branch - let it crash on unexpected input
  end
end

# HANDLERS - Do the actual work and return proper GenServer tuples
defp do_init(init, from, state) do
  # Process the init message
  send_reply_if_needed()
  {:noreply, new_state}
end

defp do_echo(echo, from, state) do
  # Process the echo message
  send_reply_immediately()
  {:noreply, new_state}
end

Router Pattern Rules:

Router only routes: handle_call should only pattern match and delegate
Handlers return GenServer tuples: {:noreply, state}, {:noreply, state, {:client_request, op}}, etc.
No processing in router: All business logic goes in do_* functions
Let it crash: No defensive _ -> branches in routers
Consistent naming: Use do_* prefix for message handlers

State Management

Use structs for GenServer state with clear field definitions
Immutable updates: Always return new state, never mutate existing
Helper functions: Use private helpers for state transformations
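
One way these three rules could fit together is sketched below. The struct fields reuse VSR terms from this document, but the module name StateSketch, the default values, and the {op_number, operation} log entry shape are assumptions made for illustration.

```elixir
defmodule StateSketch do
  # Hypothetical GenServer state: defaults and log entry shape are assumptions.
  defstruct view_number: 0, op_number: 0, commit_number: 0, log: []

  @type t :: %__MODULE__{
          view_number: non_neg_integer(),
          op_number: non_neg_integer(),
          commit_number: non_neg_integer(),
          log: [{pos_integer(), term()}]
        }

  @spec new() :: t()
  def new, do: %__MODULE__{}

  # Public transformation composed from small private helpers.
  @spec append_operation(t(), term()) :: t()
  def append_operation(%__MODULE__{} = state, operation) do
    state
    |> increment_op_number()
    |> append_entry(operation)
  end

  # Each helper returns a new struct; the input is never modified.
  defp increment_op_number(%__MODULE__{op_number: op_number} = state) do
    %{state | op_number: op_number + 1}
  end

  defp append_entry(%__MODULE__{log: log, op_number: op_number} = state, operation) do
    %{state | log: log ++ [{op_number, operation}]}
  end
end
```

Because updates return new structs, a caller that keeps the old state still sees the old values.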

Testing Standards

Test Organization

defmodule ModuleTest do
  use ExUnit.Case
  
  setup do
    # Common setup
    {:ok, fixtures}
  end
  
  test "descriptive test name explaining scenario", %{fixtures: fixtures} do
    # Test implementation
  end
end

Test Patterns

Descriptive test names: Explain what scenario is being tested
Proper fixtures: Use setup blocks for common test data
Avoid hardcoded delays: Use proper synchronization mechanisms
Test error cases: Include tests for both success and failure paths
Diagnostic tests: Include tests that help debug complex distributed scenarios
NEVER use GenServer.* functions in tests: Tests should only use the public API functions provided by modules, not internal GenServer functions like GenServer.call, GenServer.cast, GenServer.reply, etc.
ALL TESTS MUST BE ASYNC: Never use async: false - all tests must run concurrently. Use unique node IDs and proper test isolation instead of serializing tests.
Do not check process alive status: Tests should not include Process.alive?/1 assertions - processes being alive is handled by the supervision tree and is not a meaningful test assertion
NEVER use :sys.get_state or :sys.replace_state: These functions break encapsulation and create brittle tests. Instead use dump_state/1 for state inspection or proper public APIs for state modification. Tests using :sys functions are fragile and can break with internal implementation changes.
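
The following sketch shows how an async test might avoid hardcoded delays and shared identifiers. It uses only ExUnit features (async: true, setup, assert_receive); the module name and the spawned worker are hypothetical stand-ins for a real replica.

```elixir
ExUnit.start()

defmodule AsyncSketchTest do
  use ExUnit.Case, async: true

  setup do
    # A unique id per test keeps concurrently running tests isolated.
    node_id = :"replica_#{System.unique_integer([:positive])}"
    {:ok, node_id: node_id}
  end

  test "worker reports completion without a fixed sleep", %{node_id: node_id} do
    parent = self()

    # Stand-in for a real replica: it notifies the test process when done.
    spawn_link(fn -> send(parent, {:done, node_id}) end)

    # assert_receive waits up to the timeout instead of sleeping a fixed time.
    assert_receive {:done, ^node_id}, 1_000
  end
end
```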

Test Configuration

# Explicit configuration in tests
{:ok, replica} = GenServer.start_link(Vsr, 
  log: [],  # Use List log for testing
  state_machine: TestStateMachine,
  comms_module: TestComms,  # Use test comms implementation
  cluster_size: 3
)

Code Quality

General Principles

Single responsibility: Each module should have one clear purpose
Functional programming: Prefer immutable data and pure functions
Explicit over implicit: Make dependencies and configurations explicit
Consistent terminology: Use VSR specification terms exactly

Common Patterns

# Pipeline operations for data transformation
new_state =
  state
  |> increment_op_number()
  |> append_new_log(from, operation)
  |> maybe_commit_operation(op_number)

# Pattern matching with guards
defp primary_for_view(view, replicas) when is_integer(view) and view >= 0 do
  # implementation
end

# Consistent error handling
case result do
  {:ok, value} -> handle_success(value)
  {:error, reason} -> handle_error(reason)
end

Avoiding Common Issues

No unused aliases or imports
No hardcoded values - use module attributes or configuration
No TODO comments in production code
Proper error propagation - don't swallow errors silently
Clear variable names - avoid single letter variables except for very short scopes

VSR-Specific Guidelines

Message Handling

Struct-based messages: All VSR messages should be defined as structs
Type-based routing: Use pattern matching on message types in handle_cast
Validation first: Always validate message fields before processing

State Transitions

Follow specification: Implement exactly as described in SPECIFICATION.md
Atomic updates: State changes should be atomic within a single function
Consistent ordering: Apply state changes in the order specified by the protocol

Protocol Compliance

Exact field names: Use field names exactly as specified in the VSR protocol
Proper sequencing: Implement message sequences as described in the specification
Safety properties: Ensure all safety properties are maintained

Build and Development

Compilation

Code must compile without warnings when using --warnings-as-errors
All tests must pass before committing changes
Use mix format for consistent code formatting

Dependencies

Minimize external dependencies
Use only well-maintained, stable libraries
Document any new dependency requirements

Maelstrom Integration Guidelines

JSON Message Protocol

CRITICAL: All VSR protocol messages (except Init) sent over Maelstrom are generated by VSR itself, not by external sources.

Message Generation and Parsing Rules

VSR generates all VSR protocol messages: Prepare, PrepareOk, Heartbeat, Commit, etc. are all created by our VSR code
Init is the only exception: This is the only message Maelstrom generates for us
All other messages are VSR-to-VSR: They must contain all required fields

JSON Decoding Standards

# CORRECT - Use Map.fetch!/2 for all VSR protocol messages
%Prepare{
  view: Map.fetch!(json_map, "view"),
  op_number: Map.fetch!(json_map, "op_number"),
  leader_id: Map.fetch!(json_map, "leader_id")  # Must be present
}

# WRONG - Don't use Map.get/3 with defaults for VSR messages
%Prepare{
  leader_id: Map.get(json_map, "leader_id", nil)  # Hides bugs in message generation
}

Rationale: If a required field is missing, it means we have a bug in our message generation code, not a compatibility issue. Let it crash to surface the bug immediately.

Test Message Generation

Tests should generate JSON by encoding structs: Don't hand-write JSON maps
Use proper struct creation: Ensure all required fields are present

Example:

# CORRECT - Generate JSON from struct
prepare = %Prepare{view: 1, op_number: 1, leader_id: "n0", ...}
json = Message.to_json_map(prepare)

# WRONG - Hand-written JSON that might miss fields
json = %{"type" => "prepare", "view" => 1}  # Missing leader_id!

Key Architectural Principles

Durability Strategy: Log is durable (DETS), state machine is in-memory cache
- ❌ Making state machine persistent with DETS
- ✅ DETS log + in-memory state machine that can be reconstructed from log
Node Communication: Maelstrom uses string node IDs, not Erlang PIDs
- ❌ Using PIDs for dest_pid in communications
- ✅ Using string node IDs like "n1", "n2", "n3"
JSON Protocol: Use built-in JSON module, not external libraries
- ❌ {:jason, "~> 1.0"} or other JSON libraries
- ✅ Built-in JSON.encode!/1 and JSON.decode!/1
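
The durability strategy can be pictured as a fold over the durable log. This is a sketch under assumptions: the entries are taken to be the operation lists used elsewhere in this document, and ReplaySketch is a hypothetical module, not the actual recovery code.

```elixir
defmodule ReplaySketch do
  # Rebuilds the in-memory key-value data by replaying logged operations in order.
  def rebuild(entries) do
    Enum.reduce(entries, %{}, &apply_operation/2)
  end

  defp apply_operation(["write", key, value], data), do: Map.put(data, key, value)

  # Reads do not change state, so replaying them is a no-op.
  defp apply_operation(["read", _key], data), do: data

  defp apply_operation(["cas", key, from, to], data) do
    case Map.fetch(data, key) do
      {:ok, ^from} -> Map.put(data, key, to)
      {:ok, _other} -> data
      :error -> data
    end
  end
end
```

Since the state is a pure function of the log, only the log needs to survive a restart.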

Environment Configuration

# mix.exs - Maelstrom environment setup
def elixirc_paths(:maelstrom), do: ["lib", "maelstrom", "test/_support"]
def elixirc_paths(:test), do: ["lib", "test/_support", "maelstrom"]

def application do
  case Mix.env() do
    :maelstrom -> [extra_applications: [:logger], mod: {Maelstrom.Application, []}]
    _ -> [extra_applications: [:logger]]
  end
end

Testing Command Patterns

# Use these exact commands for testing Maelstrom integration
MIX_ENV=maelstrom elixir simple_test.exs
MIX_ENV=maelstrom elixir debug_test.exs

# For regular tests, NEVER use MIX_ENV=test - it's the default
# ❌ MIX_ENV=test mix test
# ✅ mix test

Common Maelstrom Mistakes to Avoid

Wrapper Pattern Overuse
- ❌ Creating wrapper modules when not needed
- ✅ Use wrappers only when you need to adapt interfaces (e.g., PID vs string node_id)
Wrong Persistence Layer
- ❌ Persisting state machine state directly
- ✅ Persist operations in log, reconstruct state from log
Communication Assumptions
- ❌ Assuming Erlang distribution is available
- ✅ Use Maelstrom's JSON protocol over STDIN/STDOUT
Environment Mixing
- ❌ Loading wrong modules for different environments
- ✅ Proper elixirc_paths configuration for each environment

VSR + Maelstrom Integration Pattern

# Correct Maelstrom VSR setup
vsr_opts = [
  log: {Maelstrom.DetsLog, [node_id, [dets_file: "#{node_id}_log.dets"]]},
  state_machine: {Maelstrom.Kv, []},  # Simple in-memory state machine
  cluster_size: length(node_ids),
  comms_module: Maelstrom.Comms
]

Debugging and Error Analysis Guidelines

Post-Test Analysis Workflow

CRITICAL: After every Maelstrom test run, ALWAYS check these files in order:

First: Check jepsen.log - Contains Maelstrom's network-level errors
```
cat store/lin-kv/latest/jepsen.log
```
- Look for "Malformed network message" errors
- Look for "missing-required-key" errors
- Look for process startup/teardown issues
- This file reveals network-level problems that node logs might miss
Second: Check individual node logs - Contains our application errors
```
cat store/lin-kv/latest/node-logs/n0.log
cat store/lin-kv/latest/node-logs/n1.log  
cat store/lin-kv/latest/node-logs/n2.log
```
- Look for GenServer crashes and stack traces
- Look for "Sending Maelstrom message" vs "Decoded Maelstrom message" patterns
- Look for missing expected messages (e.g., if n1 sends prepare_ok to n0, does n0 receive it?)
Pattern Analysis: Compare what was sent vs received across nodes
- If node A sends a message to node B, check if node B received it
- Look for gaps in message delivery patterns
- Check if crashes occur during message processing

Understanding Error Messages

Language Context in Error Messages
- ❌ Assuming {:body {}} is Elixir tuple syntax
- ✅ Recognize {:body {}} is Clojure syntax from Maelstrom's Java process
- Key insight: Error messages may come from different languages in the stack
Maelstrom Error Interpretation
```
Malformed network message. Node n0 tried to send the following message via STDOUT:
{:body {}}
This is malformed because: {:src missing-required-key, :dest missing-required-key}
```
- Meaning: Our VSR node sent an empty/incomplete message to Maelstrom
- Not: Elixir code outputting tuples to stdout
- Cause: Message construction failure, empty maps, or process crashes during send
Stdout vs Stderr Confusion
- ❌ Debugging stdout issues by adding more stdout logging
- ✅ Use Logger (goes to stderr) to debug stdout JSON messages
- Key insight: Maelstrom reads structured JSON from stdout, logs go to stderr
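
To make the stdout/stderr split concrete, here is a minimal sketch that bypasses Logger and writes to the two devices directly. StdioSketch and its function names are hypothetical; the real code is expected to use Logger configured for stderr, as described above.

```elixir
defmodule StdioSketch do
  # Protocol output: exactly one already-encoded JSON message per line on stdout.
  def emit(json_line) when is_binary(json_line), do: IO.puts(:stdio, json_line)

  # Diagnostics go to stderr so they never mix into the protocol stream.
  def debug(text) when is_binary(text), do: IO.puts(:stderr, "[debug] " <> text)
end
```

Any debug text written to stdout would be read by Maelstrom as a malformed message.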

Systematic Debugging Approach

Start Simple, Build Complexity
- ❌ Jump to complex scenarios when basic communication fails
- ✅ Test basic echo/init messages first before VSR integration
- Rule: If init fails, don't test VSR operations
Isolate the Problem Layer
- ❌ Assume the problem is in the most recently changed code
- ✅ Consider all layers: JSON encoding, message construction, protocol handling
- Process:
  1. Test message construction in isolation
  2. Test JSON encoding/decoding
  3. Test protocol message flow
  4. Test VSR integration
Let it crash: Avoid try blocks and let failures crash instead of coding defensively.

Common Debugging Mistakes

Expensive Trial-and-Error Loops
- ❌ Making multiple changes without understanding the root cause
- ❌ Adding debug code that interferes with the actual protocol
- ✅ Make minimal, targeted changes to isolate the issue
- ✅ Use proper debugging tools (Logger to stderr, not stdout)
Misinterpreting Cross-Language Error Messages
- ❌ Thinking Clojure error syntax indicates Elixir code problems
- ✅ Understand which process/language generated the error message
- Tool: Check process boundaries - Java/Clojure (Maelstrom) vs Elixir (VSR)
Debugging the Wrong Layer
- ❌ Debugging VSR logic when the problem is in basic JSON communication
- ✅ Start with the simplest possible test case
- Rule: Fix lower layers before debugging higher layers

Running Maelstrom Tests

Primary Command: Use the maelstrom-kv script for all testing:

./maelstrom-kv

This script runs the lin-kv workload with the following configuration:

3 nodes (n0, n1, n2)
10 second time limit
6 concurrent operations
Tests linearizable key-value operations

Alternative manual command (for reference only):

java -jar maelstrom.jar test -w lin-kv --bin ./run-vsr-node --node-count 3 --time-limit 10 --concurrency 6

Post-Test Analysis Workflow

CRITICAL: After every Maelstrom test run, ALWAYS check these files in order:

First: Check jepsen.log - Contains Maelstrom's network-level errors

cat store/lin-kv/latest/jepsen.log | grep -i error
cat store/lin-kv/latest/jepsen.log | grep -i malformed

Second: Check individual node logs - Contains our application errors

cat store/lin-kv/latest/node-logs/n0.log
cat store/lin-kv/latest/node-logs/n1.log  
cat store/lin-kv/latest/node-logs/n2.log

Third: CRITICAL Message Flow Verification

ESSENTIAL: Always verify that messages are both SENT and RECEIVED correctly:

a) Check Message Sending:
```
# Look for "Sending Maelstrom message" in each node log
grep "Sending.*MESSAGE_TYPE" store/lin-kv/latest/node-logs/n*.log
```
b) Check Message Receiving:
```
# Look for "Received Maelstrom message" in each node log  
grep "Received.*MESSAGE_TYPE" store/lin-kv/latest/node-logs/n*.log
```
c) ❌ CRITICAL PATTERN: Messages sent but never received
- If node A shows "Sending" but node B never shows "Received"
- This indicates a message delivery failure in Maelstrom
- Check jepsen.log for malformed message errors
- Verify message format has proper src, dest, and body fields
d) VSR Protocol Flow Example:
```
# Check complete prepare/prepare_ok cycle
grep -E "(prepare|commit)" store/lin-kv/latest/node-logs/n*.log | grep -E "(Sending|Received)"
```
Fourth: Look for other patterns
- Malformed message errors: {:body {}}
- Missing required keys: :src missing-required-key
- Client timeouts: Client read timeout
- GenServer timeouts: time out

Maelstrom-Specific Debugging

Log File Analysis

# Check actual Maelstrom logs, not just our stderr
cat store/lin-kv/latest/jepsen.log | grep -i error
cat store/lin-kv/latest/node-logs/n0.log

Protocol Isolation Testing

# Test basic node communication without VSR
echo '{"src":"c0","dest":"n0","body":{"type":"init","msg_id":1,"node_id":"n0","node_ids":["n0"]}}' | ./run-vsr-node

Message Format Validation
- ✅ All Maelstrom messages must have src, dest, and body fields
- ✅ The body must contain type field
- ❌ Sending incomplete or empty maps
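
These format rules could be enforced with a single function head, in keeping with the let-it-crash policy. This is a sketch only: EnvelopeSketch and validate!/1 are hypothetical names, and the input is assumed to be an already decoded JSON map with string keys.

```elixir
defmodule EnvelopeSketch do
  # Matches only a complete envelope; anything else raises FunctionClauseError.
  def validate!(%{"src" => src, "dest" => dest, "body" => %{"type" => type} = body})
      when is_binary(src) and is_binary(dest) and is_binary(type) do
    %{src: src, dest: dest, type: type, body: body}
  end
end
```

An empty body such as the {:body {}} case described earlier fails to match here, so the bug surfaces in the node log before the message reaches Maelstrom.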

Learning from Mistakes Pattern

When debugging fails repeatedly:

Document the error pattern in CLAUDE.md
Identify the root misconception that led to the wrong approach
Create systematic debugging steps to avoid the same mistake
Update coding standards to prevent similar issues in future

VSR + Maelstrom Synchronous Operation Flow

CRITICAL: All Maelstrom operations (read, write, cas) must be synchronous, following the same pattern as ListKv operations.

Correct MaelstromKv Operation Flow

do_write/do_cas functions:

defp do_write(%Write{key: key, value: value}, message, genserver_from, state) do
  operation = ["write", key, value]
  
  # Store GenServer from in ETS and create encoded from for VSR
  from_hash = store_from(state, genserver_from)
  encoded_from = %{"node" => state.node_id, "from" => from_hash}
  
  # Store message for Maelstrom JSON reply later
  :ets.insert(state.from_table, {"message_#{from_hash}", message})
  
  {:noreply, state, {:client_request, encoded_from, operation}}
end

Store GenServer from tuple in ETS table with hash
Create JSON-encoded from as %{"node" => node_id, "from" => from_hash}
Store Maelstrom message separately for JSON reply generation
Use {:client_request, encoded_from, operation} pattern
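
The store_from and fetch_from helpers are not shown in this document. The sketch below is one possible shape for them, simplified to take the ETS table directly rather than the state; hashing with :erlang.phash2/1 and the {:error, :not_found} return are assumptions, not the actual implementation.

```elixir
defmodule FromTableSketch do
  # Creates an unnamed public ETS table for pending GenServer from tuples.
  def new_table, do: :ets.new(:from_table, [:set, :public])

  # Stores the from tuple under a hash that can be JSON-encoded and sent
  # to other nodes. The hash function is an assumption.
  def store_from(table, genserver_from) do
    from_hash = :erlang.phash2(genserver_from)
    true = :ets.insert(table, {from_hash, genserver_from})
    from_hash
  end

  # Looks the from tuple back up when the reply is ready.
  def fetch_from(table, from_hash) do
    case :ets.lookup(table, from_hash) do
      [{^from_hash, genserver_from}] -> {:ok, genserver_from}
      [] -> {:error, :not_found}
    end
  end
end
```

The integer hash stands in for the from tuple, which holds a pid and a reference and so cannot be encoded as JSON.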

handle_commit function:

def handle_commit(["write", key, value], state) do
  new_data = Map.put(state.data, key, value)
  new_state = %{state | data: new_data}
  {new_state, :ok}  # VSR handles reply via send_reply callback
end

Apply operation to state machine
Return {new_state, result}
VSR automatically calls send_reply(from, result, vsr_state)

send_reply function: send_reply(from, reply, vsr_state)

def send_reply(%{"node" => node_id, "from" => from_hash} = from, reply, vsr_state) do
  if node_id == vsr_state.inner.node_id do
    # Local: retrieve GenServer from and Maelstrom message from ETS
    {:ok, genserver_from} = fetch_from(vsr_state.inner, from_hash)
    [{_, message}] = :ets.lookup(vsr_state.inner.from_table, "message_#{from_hash}")
    
    # Send Maelstrom JSON reply
    Message.reply(message.body, to: message, ...)
    # Send GenServer reply to unblock {:message, message} call  
    GenServer.reply(genserver_from, :ok)
  else
    # Remote: send ForwardedReply across Maelstrom bus
    send_forwarded_reply(node_id, from_hash, reply)
  end
end

If local: Lookup both GenServer.from tuple and Maelstrom message from ETS
Send both replies: Maelstrom JSON response + GenServer reply to unblock call
If remote: Send ForwardedReply message across Maelstrom bus

Key Insights

Synchronous at GenServer level: The {:message, message} call blocks until VSR consensus completes
Maelstrom message is the from: Pass the entire Maelstrom message as from parameter to VSR
No separate request tracking: Don't use pending_requests - let VSR handle the flow
send_reply handles distribution: Translates between GenServer.from tuples and Maelstrom ForwardedReply messages
Signature: send_reply(from, reply, vsr_state) NOT send_reply(from, reply, inner_state)

Environment and Tool Restrictions

Never Use Lein

CRITICAL: Never attempt to use lein (Leiningen) commands in this environment. Lein is not installed and is not available. Always use the provided maelstrom.jar file directly with Java for running Maelstrom tests.

❌ Never use: lein run test -w lin-kv ...
✅ Always use: java -jar maelstrom.jar test -w lin-kv ...

Interpreting Maelstrom Test Results

CRITICAL: A Maelstrom test run is only valid if there are minimal network timeouts. High numbers of net-timeout errors in the jepsen logs indicate fundamental problems with the VSR implementation.

What to check after every test run:

Count network timeouts: grep -c "net-timeout" store/latest/jepsen.log

Without nemesis (normal operation):
- ✅ 0 timeouts: Ideal - system should respond promptly under normal conditions
- ❌ Any timeouts: Test run is invalid - indicates performance issues in VSR implementation
With nemesis (partition/failure testing):
- ✅ 0 timeouts: Ideal - system maintains availability during faults
- ⚠️ A few timeouts: May indicate issues, but could be expected. Correlate the timeouts with network partitions before accepting.
- ❌ Many timeouts: Test run likely invalid - system failing to handle partitions properly
Common causes of excessive timeouts:
- VSR nodes not forming proper quorum during partitions
- Primary election failures
- Message delivery issues between VSR nodes
- Incorrect view change handling
- Missing or broken heartbeat mechanisms
Response to high timeout counts:
- Do not ignore: High timeouts mean the test is meaningless
- Investigate logs: Check node logs for crashes, errors, view changes
- Fix underlying issues: Address VSR protocol problems before claiming success
- Re-test: Only accept results from runs with minimal timeouts

Remember: Network partitions should cause some unavailability (this is correct), but not massive timeout failures across the entire test duration.
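
The acceptance rules above can be sketched as a small classifier. The document does not define how many timeouts count as "a few", so the threshold of 5 is purely hypothetical, as are the module name and the result atoms.

```elixir
defmodule TimeoutCheckSketch do
  # Hypothetical cut-off between "a few" and "many" timeouts.
  @few_timeouts 5

  # Counts lines mentioning net-timeout, like `grep -c "net-timeout"`.
  def count_timeouts(log_text) do
    log_text
    |> String.split("\n")
    |> Enum.count(&String.contains?(&1, "net-timeout"))
  end

  # Second argument: whether the run used a nemesis.
  def classify(0, _nemesis?), do: :ok
  def classify(count, false) when count > 0, do: :invalid
  def classify(count, true) when count > 0 and count <= @few_timeouts, do: :review
  def classify(count, true) when count > @few_timeouts, do: :likely_invalid
end
```

A :review result still calls for correlating each timeout with a partition window before the run is accepted.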

This document serves as the normative standard for all code contributions to the VSR codebase.
