Bayesilisk

Beyond E2E Scripts: Using LLM-Proposed Scenarios Without Letting the LLM Be the Oracle.

Bayesilisk is a deterministic, local verification layer for permission, entitlement, route, and data-boundary logic. It sits over Playwright and combines Grassmann attention with LLM-generated scenario-proposal workflows, all gated by a finite-state verifier.

Bayesilisk is intentionally local-first. It uses static scenario fragments, caller-provided context, optional observation history, optional browser evidence, and optional local model proposals. It does not connect to production systems or inspect live customer data. It is built for testers and agents that need reproducible findings without granting a model authority over the final verdict.

What It Is

Bayesilisk is designed to find "bad spots" in authorization and data-boundary logic before those gaps become hard-to-debug application bugs.

It checks scenarios involving:

  • permission and role-route matrices;
  • customer module entitlements;
  • expense approval and receipt evidence;
  • billing export access;
  • HR document access boundaries;
  • support takeover sessions;
  • DMS tenant and process boundaries;
  • travel funding and travel-expense consistency.

The core verifier is deterministic:

scenario facts -> invariant checks -> pass/fail -> Bayesian ranking

No embedding, model output, issue text, or Playwright observation can directly declare a bug. Those layers can only steer where Bayesilisk looks next.
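
As a minimal sketch of that separation, the verdict can be modelled as a pure function of scenario facts. The `Invariant` class, the `verify` function, the invariant id, and the fact names below are hypothetical and are not taken from the Bayesilisk source; the ranking step is left out because its formula is not described here.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Facts = Mapping[str, object]


@dataclass(frozen=True)
class Invariant:
    invariant_id: str
    check: Callable[[Facts], bool]  # pure function: True means the invariant holds


def verify(facts: Facts, invariants: list[Invariant]) -> dict[str, str]:
    """Return a pass/fail verdict per invariant, derived from facts only."""
    return {
        inv.invariant_id: "pass" if inv.check(facts) else "fail"
        for inv in invariants
    }


# Hypothetical invariant: only the HR role may read an HR document.
hr_boundary = Invariant(
    "hr.document.role-boundary",
    lambda f: not f["canReadHrDocument"] or f["role"] == "hr",
)

facts = {"role": "support", "canReadHrDocument": True}
verdicts = verify(facts, [hr_boundary])

# An attention score can change what is checked next,
# but it is never an argument to verify().
attention_score = 0.93
```

Because `verify` takes no score, embedding, or model text as input, the same facts always produce the same verdict.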

See docs/architecture.md for the public architecture:

Playwright is the sensor.
Grassmann attention is the router.
The scenario proposer model is the proposer.
Bayesilisk is the judge.

Quick Start

Run the CLI from the repository root:

python3 -m bayesilisk --seed 150 --format json
python3 -m bayesilisk --seed 150 --format markdown --output /tmp/bayesilisk.md
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-context.json --issue-payloads

After installation, the same entry points are available as:

bayesilisk --seed 150 --format json
bayesilisk-mcp

Run the test suite:

python3 -m pytest

GitHub CI runs deterministic tests and the Sphinx docs build without Ollama, hosted models, browser services, or hidden local state:

python3 -m pytest -m "not live_playwright and not live_ollama"
sphinx-build -b html docs docs/_build/html

Live browser/model checks are local opt-in tests:

python3 -m pytest tests/test_live_integrations.py -m live_playwright -rs
BAYESILISK_LIVE_OLLAMA=1 python3 -m pytest tests/test_live_integrations.py -m live_ollama -rs

Reports

Reports include:

  • seed and tool version;
  • deterministic production-access boundary;
  • scenario fragments and generated sub-scenarios;
  • access patterns;
  • expected invariant and observed result;
  • stable fingerprint and dedupe key;
  • classification and issue readiness;
  • attention score and attention reasons when context is supplied;
  • posterior probability and risk score;
  • suggested issue title and body.
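
One plausible way to obtain a stable fingerprint like the one listed above is to hash a canonical JSON view of the fields that identify a finding. This is only a sketch: the key selection, the `scenarioId`, `invariantId`, and `attentionScore` field names, and the 16-character truncation are assumptions, not Bayesilisk's documented scheme.

```python
import hashlib
import json


def stable_fingerprint(finding: dict, keys: tuple[str, ...]) -> str:
    """Hash a canonical JSON view of the selected fields."""
    canonical = json.dumps(
        {k: finding[k] for k in keys},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical identifying fields; volatile values such as scores are excluded.
KEYS = ("scenarioId", "invariantId", "observedResult")

a = {"scenarioId": "s1", "invariantId": "i1", "observedResult": "fail", "attentionScore": 0.2}
b = {"observedResult": "fail", "invariantId": "i1", "scenarioId": "s1", "attentionScore": 0.9}

same = stable_fingerprint(a, KEYS) == stable_fingerprint(b, KEYS)
```

Sorting keys and excluding volatile fields means field order and attention scores do not change the fingerprint, which is what deduplication needs.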

Only findings with:

observedResult = fail
issueReadiness = ready-for-issue

should be opened automatically. probe-only, regression-watch, do-not-open-muted, and no-issue-control findings are intentionally not automatic issue material.

Proof Artifacts

The proof loop is deliberately split. Evidence and proposals can route attention, but only deterministic verification can produce automatic issue material:

Playwright evidence + local context
(browser trace, DOM state, fixture state, app facts)
        |
        v
Grassmann attention
(rank suspicious contexts; no verdict authority)
        |
        v
Candidate scenario
(catalog, rule, or model proposed; untrusted)
        |
        v
Bayesilisk verification
(deterministic invariants and controls decide pass/fail)
        |
        +--> ready issue payload
        |    stable fingerprint + evidence summary
        |
        +--> reject / watchlist
             no automatic issue

Example artifacts:

The Cal.com example uses the general Bayesilisk core with an app-specific connector that follows the connector docs. It records the Cal.com repository URL, the exact tested commit, connector source context, generated proposals, observed local execution context, reports, and upstream outcome references. In the clean current run, Bayesilisk generated 7 proposals: 6 route mutations from explicit connector rules and 1 bounded workflow sequence from a connector-declared action graph. All 7 local observations were verified as app findings. One reported finding already has an upstream human-authored fix PR with a human approval review, which is stronger validation than an issue closed without fix context.

For coding agents and LLM teams building connectors, use the ingestible contract at examples/connector-agent-contract.json. It spells out required source-context fields, observed-evidence fields, allowed agent steps, and boundaries that keep app-specific logic out of Bayesilisk core. For reusable workflow motifs, see the typed ABAG example at examples/abag-action-graph-context.json.

Why This Is Not a Black Box

Bayesilisk exposes separate ledgers for observedByPlaywright, selectedByGrassmannAttention, proposedByModel, and verifiedByBayesilisk. Only verifiedByBayesilisk contains deterministic invariant results that can feed issue payloads. Model output remains untrusted candidate input.
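
A rough sketch of that ledger separation follows. The four ledger names come from the paragraph above; the `Ledgers` class, the `issue_payload_sources` helper, and the entry fields are hypothetical illustrations rather than the real report schema.

```python
from dataclasses import dataclass, field


@dataclass
class Ledgers:
    observedByPlaywright: list[dict] = field(default_factory=list)
    selectedByGrassmannAttention: list[dict] = field(default_factory=list)
    proposedByModel: list[dict] = field(default_factory=list)
    verifiedByBayesilisk: list[dict] = field(default_factory=list)


def issue_payload_sources(ledgers: Ledgers) -> list[dict]:
    """Read only the verified ledger; the other three are never consulted."""
    return [
        entry
        for entry in ledgers.verifiedByBayesilisk
        if entry.get("observedResult") == "fail"
    ]


ledgers = Ledgers(
    # The model asserts a bug, but that claim lives in an untrusted ledger.
    proposedByModel=[{"scenario": "generated.model.x", "claim": "this is a bug"}],
    # The deterministic verifier evaluated the same scenario and it passed.
    verifiedByBayesilisk=[{"scenario": "generated.model.x", "observedResult": "pass"}],
)

payload_sources = issue_payload_sources(ledgers)
```

Here the model's claim produces no issue material, because the verified ledger records a pass for the same scenario.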

Model Unavailable? Still Works

The default verifier path requires no model provider. With no Ollama or hosted model configured, Bayesilisk still composes deterministic scenarios, evaluates finite-state invariants, ranks findings, validates report schemas, and emits issue payloads from verified failures.

Microsoft Playwright Bridge

Bayesilisk includes a local workflow pressure demo and an optional Microsoft Playwright probe. Playwright observes concrete browser behavior and writes Bayesilisk context; Bayesilisk still performs deterministic verification afterward.

Install the optional browser dependency:

python3 -m pip install -e '.[playwright]'
python3 -m playwright install chromium

Run the bundled demo from a repo checkout:

cd /path/to/bayesilisk
python3 -m pip install -e '.[playwright]'
python3 -m playwright install chromium

# Terminal transcript only; no browser window.
python3 -m bayesilisk.demo --no-playwright

# Full screen-recordable run with headed Chromium.
python3 -m bayesilisk.demo --recording

After editable install, the console script is also available from the active environment:

bayesilisk-demo
bayesilisk-demo --recording
bayesilisk-demo --no-playwright

The demo accepts a deterministic seed. Changing it changes the sweep order while keeping that run reproducible:

python3 -m bayesilisk.demo --seed 150 --recording
python3 -m bayesilisk.demo --seed 151 --no-playwright
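
The idea of a seed that changes the sweep order while keeping each run reproducible can be sketched with a private random generator. The `sweep_order` function and the probe ids are hypothetical; the demo's actual ordering logic may differ.

```python
import random


def sweep_order(probe_ids: list[str], seed: int) -> list[str]:
    """Return a seed-dependent permutation without touching global random state."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    order = list(probe_ids)    # copy so the caller's list is not mutated
    rng.shuffle(order)
    return order


probes = ["travel", "expenses", "billing", "hr", "support", "dms"]
first = sweep_order(probes, 150)
again = sweep_order(probes, 150)
```

Two calls with the same seed return the same order, and the input list is left unchanged.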

To run only the lower-level Playwright adapter against the bundled static probe target and then feed the captured context to Bayesilisk:

python3 tools/playwright_probe.py --demo --output /tmp/bayesilisk-playwright-context.json
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json --format markdown
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json --issue-payloads

bayesilisk-demo serves a synthetic local fixture defined in bayesilisk/demo.py::DEMO_PROBES. Those rows are not claims about an existing customer app; they are twelve deliberately brittle product-like workflows across Travel, Expenses, Billing, HR, Support, and DMS, with stale state, impossible ordering, duplicate submission, feature-flag exposure, tenant boundaries, two controls, and role lanes. Its output shows the chain:

Playwright evidence
  -> Grassmann plane
  -> generated catalog/attention scenarios
  -> optional model-style proposal
  -> deterministic verdict
  -> issue payload

It also includes a hard-to-find drill-down showing a route-matrix failure that appears only after connecting support takeover state, HR document access, route permissions, and module context. The drill-down includes a seeded sweep order, so changing --seed can make the same buried failure surface earlier or later while remaining reproducible for that seed.

Use bayesilisk-demo --recording to open headed Chromium, slow the probe clicks, and hold the browser long enough to screen-record the local workflow pressure. Use bayesilisk-demo --no-playwright to see the same local loop without launching a browser.

The transcript explains every finding class: breakage.easy, breakage.hard-to-find, finding.candidate-breakage, and control-confirmed. breakage.hard-to-find means the deterministic invariant failed only after context narrowed the search to a cross-role, cross-module, stale-state, or unusual workflow path; it does not mean the model guessed the verdict.

For a real app, serve a page that exposes data-bayesilisk-probe rows with actor, route, invariant, expected status, and actual click behavior, then run:

python3 tools/playwright_probe.py --url http://localhost:3000/probe-page \
  --output /tmp/bayesilisk-real-context.json
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-real-context.json --format markdown

Realistic App Integration Demo

The realistic demo is a small local permission app, not a static table. It has users, tenants, module flags, support takeover state, HR documents, DMS receipts, billing exports, and expense approvals. The page at /internal/bayesilisk-probes exposes data-bayesilisk-probe rows, and each button calls a local permission handler before writing the observed status back to the page. Bayesilisk then consumes the captured context exactly like it would for a caller-provided app.

Run it without launching a browser:

python3 -m bayesilisk.realistic_demo --no-playwright

Run the screen-recordable browser flow:

python3 -m bayesilisk.realistic_demo --recording

Write the captured context and inspect it through the normal verifier:

python3 -m bayesilisk.realistic_demo \
  --context-output /tmp/bayesilisk-realistic-context.json \
  --no-playwright
python3 -m bayesilisk \
  --seed 150 \
  --context /tmp/bayesilisk-realistic-context.json \
  --format markdown

After editable install, the console script is:

bayesilisk-realistic-demo --recording

To run it like a real app integration, keep the local app serving in one terminal:

python3 -m bayesilisk.realistic_demo --serve-only

Then copy the printed /internal/bayesilisk-probes URL into the normal Playwright bridge command from a second terminal.

Grassmann Attention

Contextual reports include a bounded Grassmann-style attention layer. It treats Playwright observations, repository facts, issue text, and invariant descriptions as local context planes, then scores which planes look bad or under-tested.

By default this uses a dependency-free anchor-plane proxy. Set BAYESILISK_USE_OLLAMA_EMBEDDINGS=1 to add Ollama /api/embed similarities with BAYESILISK_OLLAMA_MODEL, defaulting to nomic-embed-text.

The same behavior can be controlled explicitly from the CLI:

python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json \
  --enable-embeddings \
  --embedding-model nomic-embed-text \
  --attention-threshold 0.4 \
  --attention-selection-limit 3

Attention scores answer:

Where should Bayesilisk look next?

Risk scores answer:

Given this deterministic rule result, how important is this finding?

Those are deliberately separate.
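
To illustrate how --attention-threshold and --attention-selection-limit might interact, the sketch below keeps planes whose score meets the threshold and then takes the highest-scoring ones up to the limit. The inclusive threshold, the tie-breaking rule, the `select_planes` function, and the plane names are assumptions, not documented behavior.

```python
def select_planes(scores: dict[str, float], threshold: float, limit: int) -> list[str]:
    """Pick up to `limit` planes with score >= threshold, highest score first."""
    eligible = [(name, score) for name, score in scores.items() if score >= threshold]
    # Sort by score descending, then by name, so ties resolve deterministically.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in eligible[:limit]]


scores = {
    "support-takeover": 0.82,
    "hr-documents": 0.61,
    "billing-export": 0.40,
    "travel": 0.12,
    "dms": 0.55,
}

selected = select_planes(scores, threshold=0.4, limit=3)
```

The selection only narrows where verification runs next; it carries no pass/fail information.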

Scenario Proposer Model

Set BAYESILISK_USE_OLLAMA_SCENARIO_MODEL=1 to let a local scenario proposer model suggest extra scenario compositions through Ollama /api/chat. The provider is selected with BAYESILISK_SCENARIO_PROVIDER, defaulting to ollama. API-key backed providers read keys from BAYESILISK_SCENARIO_API_KEY or the env var named by BAYESILISK_SCENARIO_API_KEY_ENV; reports record only whether a key was configured, never the key itself. Runtime config precedence is explicit CLI/MCP arguments, then environment variables, then defaults.
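
The stated precedence (explicit argument, then environment variable, then default) can be sketched as a small resolver. The `resolve_setting` helper is hypothetical, and the "hosted" provider value is a placeholder; only the BAYESILISK_SCENARIO_PROVIDER name and its ollama default come from the paragraph above.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    explicit: Optional[str],
    env_name: str,
    default: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Explicit CLI/MCP value wins, then the environment, then the default."""
    if explicit is not None:
        return explicit
    if env_name in env:
        return env[env_name]
    return default


# Passing `env` explicitly keeps the example independent of the real environment.
from_default = resolve_setting(None, "BAYESILISK_SCENARIO_PROVIDER", "ollama", env={})
from_env = resolve_setting(
    None, "BAYESILISK_SCENARIO_PROVIDER", "ollama",
    env={"BAYESILISK_SCENARIO_PROVIDER": "hosted"},
)
from_cli = resolve_setting(
    "ollama", "BAYESILISK_SCENARIO_PROVIDER", "ollama",
    env={"BAYESILISK_SCENARIO_PROVIDER": "hosted"},
)
```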

The preferred local proposer is gemma4:e2b:

BAYESILISK_USE_OLLAMA_SCENARIO_MODEL=1 \
BAYESILISK_OLLAMA_SCENARIO_MODEL=gemma4:e2b \
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json --format json

Equivalent CLI controls avoid hidden environment-only behavior:

python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json \
  --enable-scenario-proposer \
  --scenario-provider ollama \
  --scenario-model gemma4:e2b \
  --scenario-proposal-limit 3 \
  --ollama-base-url http://localhost:11434

Model output is untrusted. Bayesilisk accepts a proposal only if it uses known fragment ids and invariant ids, targets a selected attention plane, and passes schema validation. Accepted proposals appear as generated.model.* scenarios with weak-model-proposal:* provenance for compatibility with the earlier report field name.
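
The acceptance rule above can be sketched as a gate that rejects anything not built from known ids. The proposal field names (`fragmentIds`, `invariantIds`, `targetPlane`), the `accept_proposal` function, and the example ids are hypothetical; the real schema is defined by Bayesilisk.

```python
def accept_proposal(
    proposal: object,
    known_fragments: set[str],
    known_invariants: set[str],
    selected_planes: set[str],
) -> bool:
    """Accept only well-formed proposals built from known ids on a selected plane."""
    if not isinstance(proposal, dict):
        return False
    fragments = proposal.get("fragmentIds")
    invariants = proposal.get("invariantIds")
    plane = proposal.get("targetPlane")
    # Minimal schema check: non-empty lists of strings plus a string plane.
    for ids in (fragments, invariants):
        if not isinstance(ids, list) or not ids:
            return False
        if not all(isinstance(item, str) for item in ids):
            return False
    if not isinstance(plane, str):
        return False
    return (
        set(fragments) <= known_fragments
        and set(invariants) <= known_invariants
        and plane in selected_planes
    )


known_fragments = {"support.takeover", "hr.document"}
known_invariants = {"hr.document.role-boundary"}
selected_planes = {"support-takeover"}

good = {
    "fragmentIds": ["support.takeover", "hr.document"],
    "invariantIds": ["hr.document.role-boundary"],
    "targetPlane": "support-takeover",
}
invented = dict(good, invariantIds=["made.up.invariant"])
```

A proposal that names an invariant the catalog does not contain is rejected before any verification runs, so a model cannot introduce its own pass/fail rule.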

Every JSON report includes effectiveConfiguration, recording the effective attention/model settings with the Ollama base URL reduced to a safe URL class.

MCP Server

Bayesilisk includes a small stdio MCP tool server:

bayesilisk-mcp

From a checkout, the module form is equivalent:

python3 -m bayesilisk.mcp_server

By default the server writes only MCP JSON-RPC frames on stdout and stays quiet on stderr. Set BAYESILISK_MCP_BANNER=1 when running it manually if you want the ASCII startup banner.

Verifier tools:

  • run;
  • rank_context;
  • issue_payloads;
  • propose_probes.

Codex orchestration tools:

  • interview_connector_need;
  • establish_provenance;
  • connector_prompt_packet;
  • scenario_plan;
  • verify_connector_outputs;
  • fix_packet.

The MCP tools accept the same control names as JSON arguments, including enableEmbeddings, embeddingModel, enableScenarioProposer, scenarioModel, scenarioProposalLimit, attentionThreshold, attentionSelectionLimit, and ollamaBaseUrl.
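
For orientation, an MCP tool invocation is a JSON-RPC 2.0 request with method tools/call. The sketch below builds such a frame for the run tool using argument names listed above; the `tools_call_request` helper is hypothetical, and a real client would first complete the MCP initialize handshake and would normally use an MCP client library rather than hand-built frames.

```python
import json


def tools_call_request(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize one MCP tools/call request as a single-line JSON-RPC frame."""
    frame = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(frame)


request = tools_call_request(
    1,
    "run",
    {
        "enableEmbeddings": True,
        "embeddingModel": "nomic-embed-text",
        "attentionThreshold": 0.4,
        "attentionSelectionLimit": 3,
    },
)
decoded = json.loads(request)
```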

Agents should pass current issue lists, open PRs, branch facts, local verifier notes, Playwright observations, and known Bayesilisk fingerprints as context. The MCP server still runs locally and does not mutate GitHub or production systems.

Codex Setup

Install Bayesilisk directly from GitHub:

python3 -m pip install 'git+https://github.com/sashakolpakov/bayesilisk.git'

Or clone and install editable:

git clone https://github.com/sashakolpakov/bayesilisk.git
cd bayesilisk
python3 -m pip install -e .

From an existing checkout:

python3 -m pip install -e .

Then add Bayesilisk to Codex config:

[mcp_servers.bayesilisk]
command = "bayesilisk-mcp"
args = []
startup_timeout_sec = 60
tool_timeout_sec = 120

For a project-local config inside a Bayesilisk checkout, use an explicit checkout path. An absolute Python path is safest if Codex does not inherit your interactive shell PATH.

[mcp_servers.bayesilisk]
command = "python3"
args = ["-m", "bayesilisk.mcp_server"]
cwd = "/absolute/path/to/bayesilisk"
startup_timeout_sec = 60
tool_timeout_sec = 120

Restart Codex, then ask:

Use Bayesilisk to build a connector for this repo. Start by interviewing me
about the connector need, then establish provenance, generate a connector prompt
packet, plan scenarios, and verify connector outputs.

The intended loop is:

interview_connector_need
  -> establish_provenance
  -> connector_prompt_packet
  -> Codex writes connector code in the target app/test repo
  -> scenario_plan
  -> connector executes local fixtures
  -> verify_connector_outputs
  -> fix_packet

The run tool can also call the local scenario proposer model/API when enableScenarioProposer=true. The model proposes; Bayesilisk validates and verifies. Codex remains responsible for app-specific connector execution, issue creation, and code changes, and should act only on verified Bayesilisk output.

The OpenAI Codex configuration reference documents mcp_servers.<id>.command, args, cwd, startup_timeout_sec, and tool_timeout_sec: https://developers.openai.com/codex/config-reference

Documentation

Sphinx documentation lives in docs/. The GitHub Pages workflow builds it with MyST Markdown support and publishes it from GitHub Actions.

Local docs build:

python3 -m pip install -r docs/requirements.txt
sphinx-build -b html docs docs/_build/html

Development Notes

The test suite includes scenario-matrix coverage:

  • every catalog scenario must reference valid fragments and invariants;
  • every invariant must have at least one passing control and one failing bad-spot case in the deterministic catalog;
  • Playwright, Grassmann attention, and model proposals must not override finite-state verifier results.
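
The second rule in the list above, that every invariant needs both a passing control and a failing bad-spot case, could be checked with a helper along these lines. The catalog layout, the `invariantId` and `expected` fields, and the `missing_cases` function are hypothetical and do not reflect the actual test suite.

```python
def missing_cases(invariant_ids: list[str], catalog: list[dict]) -> dict[str, set[str]]:
    """Map each under-covered invariant to the expected results it still lacks."""
    seen: dict[str, set[str]] = {inv: set() for inv in invariant_ids}
    for case in catalog:
        seen.setdefault(case["invariantId"], set()).add(case["expected"])
    required = {"pass", "fail"}
    return {
        inv: required - results
        for inv, results in seen.items()
        if required - results
    }


invariant_ids = ["route.matrix", "billing.export", "hr.document"]
catalog = [
    {"invariantId": "route.matrix", "expected": "pass"},
    {"invariantId": "route.matrix", "expected": "fail"},
    {"invariantId": "billing.export", "expected": "fail"},
]

gaps = missing_cases(invariant_ids, catalog)
```

A test can then assert that the returned mapping is empty, which fails loudly if an invariant has no control or no bad-spot case.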

Current public planning issues are tracked in GitHub Issues.

Boundaries

Bayesilisk is a verifier and prioritizer, not an authorization engine. It must not connect to production systems, inspect live customer data, create migrations, or emit internal platform claims as customer package claims.
