Beyond E2E Scripts: Using LLM-Proposed Scenarios Without Letting the LLM Be the Oracle.
Bayesilisk is a deterministic local verification layer for permission, entitlement, route, and data-boundary logic. It sits over Playwright and combines Grassmann attention with LLM-generated scenario-proposal workflows, all gated by a finite-state verifier.
Bayesilisk is intentionally local-first. It uses static scenario fragments, caller-provided context, optional observation history, optional browser evidence, and optional local model proposals. It does not connect to production systems or inspect live customer data. It is built for testers and agents that need reproducible findings without granting a model authority over the final verdict.
Bayesilisk is designed to find "bad spots" in authorization and data-boundary logic before those gaps become hard-to-debug application bugs.
It checks scenarios involving:
- permission and role-route matrices;
- customer module entitlements;
- expense approval and receipt evidence;
- billing export access;
- HR document access boundaries;
- support takeover sessions;
- DMS tenant and process boundaries;
- travel funding and travel-expense consistency.
The core verifier is deterministic:
scenario facts -> invariant checks -> pass/fail -> Bayesian ranking
No embedding, model output, issue text, or Playwright observation can directly declare a bug. Those layers can only steer where Bayesilisk looks next.
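As a rough illustration of that chain, the sketch below runs one invariant over scenario facts and then ranks the results with Bayes' rule. It is not Bayesilisk's implementation: the `Scenario` class, the `entitlement_required` invariant, the priors, and the likelihood values are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    scenario_id: str
    facts: dict    # caller-provided scenario facts
    prior: float   # hypothetical prior probability that this area hides a defect


def entitlement_required(facts: dict) -> bool:
    """Hypothetical invariant: a route is reachable only with the entitlement."""
    return facts["has_entitlement"] or not facts["route_reachable"]


def posterior(prior: float, failed: bool,
              p_fail_if_bug: float = 0.9, p_fail_if_ok: float = 0.05) -> float:
    """Bayes' rule for P(defect | check result); the likelihoods are invented."""
    like_bug = p_fail_if_bug if failed else 1 - p_fail_if_bug
    like_ok = p_fail_if_ok if failed else 1 - p_fail_if_ok
    evidence = like_bug * prior + like_ok * (1 - prior)
    return like_bug * prior / evidence


def verify(scenarios, invariant):
    results = []
    for s in scenarios:
        passed = invariant(s.facts)  # the only place a verdict is produced
        results.append({
            "scenario": s.scenario_id,
            "observedResult": "pass" if passed else "fail",
            "posterior": round(posterior(s.prior, failed=not passed), 4),
        })
    # Ranking only orders findings; it never rewrites observedResult.
    return sorted(results, key=lambda r: r["posterior"], reverse=True)


scenarios = [
    Scenario("billing-export-admin",
             {"has_entitlement": True, "route_reachable": True}, prior=0.2),
    Scenario("billing-export-viewer",
             {"has_entitlement": False, "route_reachable": True}, prior=0.2),
]
ranked = verify(scenarios, entitlement_required)
for row in ranked:
    print(row)
```

The point of the split is that the ranking step consumes the verdict but has no way to change it.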
See docs/architecture.md for the public architecture:
Playwright is the sensor.
Grassmann attention is the router.
The scenario proposer model is the proposer.
Bayesilisk is the judge.
Run the CLI from the repository root:
python3 -m bayesilisk --seed 150 --format json
python3 -m bayesilisk --seed 150 --format markdown --output /tmp/bayesilisk.md
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-context.json --issue-payloads
After installation, the same entry points are available as:
bayesilisk --seed 150 --format json
bayesilisk-mcp
Run the test suite:
python3 -m pytest
GitHub CI runs deterministic tests and the Sphinx docs build without Ollama, hosted models, browser services, or hidden local state:
python3 -m pytest -m "not live_playwright and not live_ollama"
sphinx-build -b html docs docs/_build/html
Live browser/model checks are local opt-in tests:
python3 -m pytest tests/test_live_integrations.py -m live_playwright -rs
BAYESILISK_LIVE_OLLAMA=1 python3 -m pytest tests/test_live_integrations.py -m live_ollama -rs
Reports include:
- seed and tool version;
- deterministic production-access boundary;
- scenario fragments and generated sub-scenarios;
- access patterns;
- expected invariant and observed result;
- stable fingerprint and dedupe key;
- classification and issue readiness;
- attention score and attention reasons when context is supplied;
- posterior probability and risk score;
- suggested issue title and body.
Only findings with:
observedResult = fail
issueReadiness = ready-for-issue
should be opened automatically. probe-only, regression-watch,
do-not-open-muted, and no-issue-control findings are intentionally not
automatic issue material.
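A consumer of the JSON report could apply that rule with a filter like the one below. This is a minimal sketch: `observedResult` and `issueReadiness` come from the text above, while the top-level `findings` key, the `fingerprint` field, and the sample values are assumptions about the report shape.

```python
import json


def auto_openable(findings):
    """Keep only findings that failed deterministically and are ready for an issue."""
    return [
        f for f in findings
        if f.get("observedResult") == "fail"
        and f.get("issueReadiness") == "ready-for-issue"
    ]


# Hypothetical report excerpt; a real report carries more fields.
report = json.loads("""
{
  "findings": [
    {"fingerprint": "a1", "observedResult": "fail", "issueReadiness": "ready-for-issue"},
    {"fingerprint": "b2", "observedResult": "fail", "issueReadiness": "probe-only"},
    {"fingerprint": "c3", "observedResult": "pass", "issueReadiness": "no-issue-control"},
    {"fingerprint": "d4", "observedResult": "fail", "issueReadiness": "regression-watch"}
  ]
}
""")

to_open = auto_openable(report["findings"])
print([f["fingerprint"] for f in to_open])
```

Both conditions are required: a failing probe-only finding stays out of the issue tracker just as a passing control does.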
The proof loop is deliberately split. Evidence and proposals can route attention, but only deterministic verification can produce automatic issue material:
Playwright evidence + local context
(browser trace, DOM state, fixture state, app facts)
|
v
Grassmann attention
(rank suspicious contexts; no verdict authority)
|
v
Candidate scenario
(catalog, rule, or model proposed; untrusted)
|
v
Bayesilisk verification
(deterministic invariants and controls decide pass/fail)
|
+--> ready issue payload
| stable fingerprint + evidence summary
|
+--> reject / watchlist
no automatic issue
Example artifacts:
The Cal.com example uses the general Bayesilisk core with an app-specific connector that follows the connector docs. It records the Cal.com repository URL, exact tested commit, connector source context, generated proposals, observed local execution context, reports, and upstream outcome references. In the clean current run, Bayesilisk generated 7 proposals: 6 route mutations from explicit connector rules plus 1 bounded workflow sequence from a connector-declared action graph. All 7 local observations were verified as app findings. One reported finding already has an upstream human-authored fix PR with a human approval review, which is stronger validation than an issue being closed without fix context.
For coding agents and LLM teams building connectors, use the ingestible contract at examples/connector-agent-contract.json. It spells out required source-context fields, observed-evidence fields, allowed agent steps, and boundaries that keep app-specific logic out of Bayesilisk core. For reusable workflow motifs, see the typed ABAG example at examples/abag-action-graph-context.json.
Bayesilisk exposes separate ledgers for observedByPlaywright,
selectedByGrassmannAttention, proposedByModel, and verifiedByBayesilisk.
Only verifiedByBayesilisk contains deterministic invariant results that can
feed issue payloads. Model output remains untrusted candidate input.
The default verifier path requires no model provider. With no Ollama or hosted model configured, Bayesilisk still composes deterministic scenarios, evaluates finite-state invariants, ranks findings, validates report schemas, and emits issue payloads from verified failures.
Bayesilisk includes a local workflow pressure demo and an optional Microsoft Playwright probe. Playwright observes concrete browser behavior and writes Bayesilisk context; Bayesilisk still performs deterministic verification afterward.
Install the optional browser dependency:
python3 -m pip install -e '.[playwright]'
python3 -m playwright install chromium
Run the bundled demo from a repo checkout:
cd /path/to/bayesilisk
python3 -m pip install -e '.[playwright]'
python3 -m playwright install chromium
# Terminal transcript only; no browser window.
python3 -m bayesilisk.demo --no-playwright
# Full screen-recordable run with headed Chromium.
python3 -m bayesilisk.demo --recording
After editable install, the console script is also available from the active environment:
bayesilisk-demo
bayesilisk-demo --recording
bayesilisk-demo --no-playwright
The demo accepts a deterministic seed. Changing it changes the sweep order while keeping that run reproducible:
python3 -m bayesilisk.demo --seed 150 --recording
python3 -m bayesilisk.demo --seed 151 --no-playwright
To run only the lower-level Playwright adapter against the bundled static probe target and then feed the captured context to Bayesilisk:
python3 tools/playwright_probe.py --demo --output /tmp/bayesilisk-playwright-context.json
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json --format markdown
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json --issue-payloads
bayesilisk-demo serves a synthetic local fixture defined in
bayesilisk/demo.py::DEMO_PROBES. Those rows are not claims about an existing
customer app; they are twelve deliberately brittle product-like workflows across
Travel, Expenses, Billing, HR, Support, and DMS, with stale state, impossible
ordering, duplicate submission, feature-flag exposure, tenant boundaries, two
controls, and role lanes. Its output shows the chain:
Playwright evidence
-> Grassmann plane
-> generated catalog/attention scenarios
-> optional model-style proposal
-> deterministic verdict
-> issue payload
It also includes a hard-to-find drill-down showing a route-matrix failure that
appears only after connecting support takeover state, HR document access, route
permissions, and module context. The drill-down includes a seeded sweep order,
so changing --seed can make the same buried failure surface earlier or later
while remaining reproducible for that seed. Use
bayesilisk-demo --recording to open headed Chromium, slow the probe clicks, and
hold the browser long enough to screen-record the local workflow pressure. Use
bayesilisk-demo --no-playwright to see the same local loop without launching a
browser. The transcript explains every finding class: breakage.easy,
breakage.hard-to-find, finding.candidate-breakage, and
control-confirmed. breakage.hard-to-find means the deterministic invariant
failed only after context narrowed the search to a cross-role, cross-module,
stale-state, or unusual workflow path; it does not mean the model guessed the
verdict.
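The seeded sweep order described above can be pictured with a private random generator. This is a sketch of the general technique only; the `sweep_order` function and the probe names are hypothetical, and the demo's actual ordering logic may differ.

```python
import random


def sweep_order(probe_ids, seed):
    """Return a reproducible visiting order for the given probes."""
    rng = random.Random(seed)   # private generator; global random state is untouched
    order = sorted(probe_ids)   # canonical start, so the input order does not matter
    rng.shuffle(order)
    return order


# Hypothetical probe names, loosely modeled on the demo's product areas.
probes = [
    "travel.funding", "expenses.approval", "billing.export",
    "hr.documents", "support.takeover", "dms.tenant",
]

first = sweep_order(probes, seed=150)
again = sweep_order(list(reversed(probes)), seed=150)
other = sweep_order(probes, seed=151)
print(first)
print(other)
```

The same seed always yields the same order, which is what lets a buried failure be replayed at the same position in the sweep.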
For a real app, serve a page that exposes data-bayesilisk-probe rows with
actor, route, invariant, expected status, and actual click behavior, then run:
python3 tools/playwright_probe.py --url http://localhost:3000/probe-page \
--output /tmp/bayesilisk-real-context.json
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-real-context.json --format markdown
The realistic demo is a small local permission app, not a static table. It has
users, tenants, module flags, support takeover state, HR documents, DMS
receipts, billing exports, and expense approvals. The page at
/internal/bayesilisk-probes exposes data-bayesilisk-probe rows, and each
button calls a local permission handler before writing the observed status back
to the page. Bayesilisk then consumes the captured context exactly like it would
for a caller-provided app.
Run it without launching a browser:
python3 -m bayesilisk.realistic_demo --no-playwright
Run the screen-recordable browser flow:
python3 -m bayesilisk.realistic_demo --recording
Write the captured context and inspect it through the normal verifier:
python3 -m bayesilisk.realistic_demo \
--context-output /tmp/bayesilisk-realistic-context.json \
--no-playwright
python3 -m bayesilisk \
--seed 150 \
--context /tmp/bayesilisk-realistic-context.json \
--format markdown
After editable install, the console script is:
bayesilisk-realistic-demo --recording
To run it like a real app integration, keep the local app serving in one terminal:
python3 -m bayesilisk.realistic_demo --serve-only
Then copy the printed /internal/bayesilisk-probes URL into the normal
Playwright bridge command from a second terminal.
Contextual reports include a bounded Grassmann-style attention layer. It treats Playwright observations, repository facts, issue text, and invariant descriptions as local context planes, then scores which planes look bad or under-tested.
By default this uses a dependency-free anchor-plane proxy. Set
BAYESILISK_USE_OLLAMA_EMBEDDINGS=1 to add Ollama /api/embed similarities with
BAYESILISK_OLLAMA_MODEL, defaulting to nomic-embed-text.
The same behavior can be controlled explicitly from the CLI:
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json \
--enable-embeddings \
--embedding-model nomic-embed-text \
--attention-threshold 0.4 \
--attention-selection-limit 3
Attention scores answer:
Where should Bayesilisk look next?
Risk scores answer:
Given this deterministic rule result, how important is this finding?
Those are deliberately separate.
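One way to picture how a threshold and a selection limit interact is the sketch below. It assumes that planes at or above the threshold are eligible and that the highest scores win; the `select_planes` function, the plane names, and the scores are illustrative, not Bayesilisk internals.

```python
def select_planes(attention_scores, threshold=0.4, limit=3):
    """Pick at most `limit` context planes whose attention score meets the threshold."""
    eligible = [
        (name, score)
        for name, score in attention_scores.items()
        if score >= threshold   # assumption: the threshold is inclusive
    ]
    # Highest score first; the name breaks ties so the result is deterministic.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in eligible[:limit]]


# Hypothetical attention scores for five context planes.
scores = {
    "support-takeover": 0.82,
    "hr-documents": 0.55,
    "billing-export": 0.41,
    "travel-funding": 0.40,
    "dms-tenant": 0.12,
}

selected = select_planes(scores, threshold=0.4, limit=3)
print(selected)
```

Nothing in this step produces a risk score or a verdict; it only narrows where the verifier looks.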
Set BAYESILISK_USE_OLLAMA_SCENARIO_MODEL=1 to let a local scenario proposer
model suggest extra scenario compositions through Ollama /api/chat.
The provider is selected with BAYESILISK_SCENARIO_PROVIDER, defaulting to
ollama. API-key backed providers read keys from BAYESILISK_SCENARIO_API_KEY
or the env var named by BAYESILISK_SCENARIO_API_KEY_ENV; reports record only
whether a key was configured, never the key itself.
Runtime config precedence is explicit CLI/MCP arguments, then environment
variables, then defaults.
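That precedence order can be sketched as a small resolver. The environment variable names and defaults below are the ones named in this section; the `resolve` function, its setting keys, and the returned source label are assumptions for illustration.

```python
import os

# Defaults and environment variable names as stated in this README.
DEFAULTS = {
    "scenario_provider": "ollama",
    "embedding_model": "nomic-embed-text",
}
ENV_NAMES = {
    "scenario_provider": "BAYESILISK_SCENARIO_PROVIDER",
    "embedding_model": "BAYESILISK_OLLAMA_MODEL",
}


def resolve(name, explicit=None, environ=None):
    """Return (value, source) using argument > environment > default precedence."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit, "argument"
    env_value = environ.get(ENV_NAMES[name])
    if env_value:
        return env_value, "environment"
    return DEFAULTS[name], "default"


fake_env = {"BAYESILISK_OLLAMA_MODEL": "custom-embedder"}
from_arg = resolve("embedding_model", explicit="nomic-embed-text", environ=fake_env)
from_env = resolve("embedding_model", environ=fake_env)
from_default = resolve("scenario_provider", environ=fake_env)
print(from_arg, from_env, from_default)
```

Returning the source alongside the value makes it easy to record which layer supplied each setting.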
The preferred local proposer is gemma4:e2b:
BAYESILISK_USE_OLLAMA_SCENARIO_MODEL=1 \
BAYESILISK_OLLAMA_SCENARIO_MODEL=gemma4:e2b \
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json --format json
Equivalent CLI controls avoid hidden environment-only behavior:
python3 -m bayesilisk --seed 150 --context /tmp/bayesilisk-playwright-context.json \
--enable-scenario-proposer \
--scenario-provider ollama \
--scenario-model gemma4:e2b \
--scenario-proposal-limit 3 \
--ollama-base-url http://localhost:11434
Model output is untrusted. Bayesilisk accepts a proposal only if it uses known
fragment ids and invariant ids, targets a selected attention plane, and passes
schema validation. Accepted proposals appear as generated.model.* scenarios
with weak-model-proposal:* provenance for compatibility with the earlier
report field name.
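The acceptance gate can be sketched as a pure function over the proposal and the known catalog. The three conditions mirror the paragraph above; the field names (`fragmentIds`, `invariantId`, `targetPlane`), the sample ids, and the shape of the schema check are assumptions, not the actual proposal schema.

```python
def accept_proposal(proposal, known_fragments, known_invariants, selected_planes):
    """Return (accepted, reasons) for one untrusted model proposal."""
    if not isinstance(proposal, dict):
        return False, ["schema validation failed"]
    fragment_ids = proposal.get("fragmentIds")
    invariant_id = proposal.get("invariantId")
    target_plane = proposal.get("targetPlane")

    # Schema check first, so the membership tests only see well-typed values.
    schema_ok = (
        isinstance(fragment_ids, list)
        and bool(fragment_ids)
        and all(isinstance(f, str) for f in fragment_ids)
        and isinstance(invariant_id, str)
        and isinstance(target_plane, str)
    )
    if not schema_ok:
        return False, ["schema validation failed"]

    reasons = []
    unknown = sorted(set(fragment_ids) - known_fragments)
    if unknown:
        reasons.append("unknown fragment ids: " + ", ".join(unknown))
    if invariant_id not in known_invariants:
        reasons.append("unknown invariant id: " + invariant_id)
    if target_plane not in selected_planes:
        reasons.append("plane not selected by attention: " + target_plane)
    return (not reasons), reasons


# Hypothetical catalog and attention selection.
FRAGMENTS = {"support.takeover", "hr.document.read"}
INVARIANTS = {"hr.boundary"}
PLANES = {"support-takeover"}

good = accept_proposal(
    {"fragmentIds": ["support.takeover", "hr.document.read"],
     "invariantId": "hr.boundary", "targetPlane": "support-takeover"},
    FRAGMENTS, INVARIANTS, PLANES)
bad = accept_proposal(
    {"fragmentIds": ["invented.fragment"],
     "invariantId": "hr.boundary", "targetPlane": "billing-export"},
    FRAGMENTS, INVARIANTS, PLANES)
print(good, bad)
```

An accepted proposal is still only a candidate scenario; the pass/fail verdict comes from the deterministic verifier afterward.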
Every JSON report includes effectiveConfiguration, recording the effective
attention/model settings with the Ollama base URL reduced to a safe URL class.
Bayesilisk includes a small stdio MCP tool server:
bayesilisk-mcp
From a checkout, the module form is equivalent:
python3 -m bayesilisk.mcp_server
By default the server writes only MCP JSON-RPC frames on stdout and stays
quiet on stderr. Set BAYESILISK_MCP_BANNER=1 when running it manually if
you want the ASCII startup banner.
Verifier tools:
- run;
- rank_context;
- issue_payloads;
- propose_probes.
Codex orchestration tools:
- interview_connector_need;
- establish_provenance;
- connector_prompt_packet;
- scenario_plan;
- verify_connector_outputs;
- fix_packet.
The MCP tools accept the same control names as JSON arguments, including
enableEmbeddings, embeddingModel, enableScenarioProposer,
scenarioModel, scenarioProposalLimit, attentionThreshold,
attentionSelectionLimit, and ollamaBaseUrl.
Agents should pass current issue lists, open PRs, branch facts, local verifier notes, Playwright observations, and known Bayesilisk fingerprints as context. The MCP server still runs locally and does not mutate GitHub or production systems.
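For orientation, a client request to one of these tools would be a JSON-RPC frame along the following lines. The sketch assumes the server follows the standard MCP `tools/call` method; the control names are the ones listed above, but the argument values are made up, and how context is passed is not shown because that argument's name is not documented here.

```python
import json


def tools_call_frame(request_id, tool_name, arguments):
    """Serialize one MCP tools/call request as a single-line JSON-RPC 2.0 frame."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool_name, "arguments": arguments},
        },
        separators=(",", ":"),  # compact output with no embedded newlines
    )


frame = tools_call_frame(
    1,
    "rank_context",
    {
        "attentionThreshold": 0.4,
        "attentionSelectionLimit": 3,
        "enableScenarioProposer": False,
    },
)
print(frame)
```

In practice an MCP client library performs the initialize handshake and writes frames like this one for you; the sketch only shows where the control names end up.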
Install Bayesilisk directly from GitHub:
python3 -m pip install 'git+https://github.com/sashakolpakov/bayesilisk.git'
Or clone and install editable:
git clone https://github.com/sashakolpakov/bayesilisk.git
cd bayesilisk
python3 -m pip install -e .
From an existing checkout:
python3 -m pip install -e .
Then add Bayesilisk to Codex config:
[mcp_servers.bayesilisk]
command = "bayesilisk-mcp"
args = []
startup_timeout_sec = 60
tool_timeout_sec = 120
For a project-local config inside a Bayesilisk checkout, use an explicit
checkout path. An absolute Python path is safest if Codex does not inherit your
interactive shell PATH.
[mcp_servers.bayesilisk]
command = "python3"
args = ["-m", "bayesilisk.mcp_server"]
cwd = "/absolute/path/to/bayesilisk"
startup_timeout_sec = 60
tool_timeout_sec = 120
Restart Codex, then ask:
Use Bayesilisk to build a connector for this repo. Start by interviewing me
about the connector need, then establish provenance, generate a connector prompt
packet, plan scenarios, and verify connector outputs.
The intended loop is:
interview_connector_need
-> establish_provenance
-> connector_prompt_packet
-> Codex writes connector code in the target app/test repo
-> scenario_plan
-> connector executes local fixtures
-> verify_connector_outputs
-> fix_packet
The run tool can also call the local scenario proposer model/API when
enableScenarioProposer=true. The model proposes; Bayesilisk validates and
verifies. Codex remains responsible for app-specific connector execution, issue
creation, and code changes, and should act only on verified Bayesilisk output.
The OpenAI Codex configuration reference documents mcp_servers.<id>.command,
args, cwd, startup_timeout_sec, and tool_timeout_sec:
https://developers.openai.com/codex/config-reference
Sphinx documentation lives in docs/. The GitHub Pages workflow builds it with MyST Markdown support and publishes it from GitHub Actions.
Local docs build:
python3 -m pip install -r docs/requirements.txt
sphinx-build -b html docs docs/_build/html
The test suite includes scenario-matrix coverage:
- every catalog scenario must reference valid fragments and invariants;
- every invariant must have at least one passing control and one failing bad-spot case in the deterministic catalog;
- Playwright, Grassmann attention, and model proposals must not override finite-state verifier results.
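The first two coverage rules can be expressed as a consistency check over a catalog. This is a sketch under assumed data shapes: the `catalog_errors` function, the scenario dictionaries, and the ids are hypothetical and do not reflect the layout of the real test suite.

```python
def catalog_errors(scenarios, fragments, invariants):
    """List every violation of the scenario-matrix coverage rules."""
    errors = []
    for s in scenarios:
        for frag in s["fragments"]:
            if frag not in fragments:
                errors.append(f"{s['id']}: unknown fragment {frag}")
        if s["invariant"] not in invariants:
            errors.append(f"{s['id']}: unknown invariant {s['invariant']}")
    for inv in sorted(invariants):
        outcomes = {s["expected"] for s in scenarios if s["invariant"] == inv}
        if "pass" not in outcomes:
            errors.append(f"{inv}: no passing control")
        if "fail" not in outcomes:
            errors.append(f"{inv}: no failing bad-spot case")
    return errors


# Hypothetical miniature catalog.
FRAGMENT_IDS = {"billing.export", "role.viewer", "role.admin"}
INVARIANT_IDS = {"billing.export.admin-only"}
CATALOG = [
    {"id": "control", "fragments": ["billing.export", "role.admin"],
     "invariant": "billing.export.admin-only", "expected": "pass"},
    {"id": "bad-spot", "fragments": ["billing.export", "role.viewer"],
     "invariant": "billing.export.admin-only", "expected": "fail"},
]

complete = catalog_errors(CATALOG, FRAGMENT_IDS, INVARIANT_IDS)
missing_control = catalog_errors(CATALOG[1:], FRAGMENT_IDS, INVARIANT_IDS)
print(complete, missing_control)
```

In a pytest suite, a check like this would typically be asserted to return an empty list.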
Current public planning issues are tracked in GitHub Issues.
Bayesilisk is a verifier and prioritizer, not an authorization engine. It must not connect to production systems, inspect live customer data, create migrations, or emit internal platform claims as customer package claims.
