This file captures the MiroFish pipeline in operator terms and ties each stage back to concrete engine behavior.
Unless stated otherwise, file paths in this document refer to the upstream MiroFish engine repository, not to this guide repository.
MiroFish accepts uploaded source files and uses them as the basis for graph extraction and simulation setup.
Code-grounded facts:
- accepted file types: `pdf`, `md`, `txt`, `markdown`
- upload size limit: 50 MB
- text is split into chunks before graph ingestion
- default chunk settings in the graph builder are `chunk_size=500` and `chunk_overlap=50`
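A minimal sketch of what those defaults imply for chunk boundaries. This is plain character-based splitting for illustration; whether the engine splits on characters, tokens, or sentences is its own implementation detail:

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap, mirroring the
    chunk_size=500 / chunk_overlap=50 defaults above (overlap < size assumed)."""
    step = chunk_size - chunk_overlap  # each chunk starts 450 chars after the previous one
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

The overlap means a fact sitting on a chunk boundary still appears whole in at least one chunk, which is why very dense, boundary-sensitive material benefits from explicit structure in the source text.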
Why it matters:
- the engine does not magically invent missing stakeholders;
- sparse or one-sided source material creates a sparse or biased graph;
- the quality of later personas is constrained by what survives extraction here.
Practical guidance:
- include named entities, relationships, dates, numbers, and competing viewpoints;
- use one focused scenario per source package;
- avoid giant mixed-context dumps when the simulation question is narrow;
- make temporal order explicit when the scenario depends on changing facts.
The graph builder creates a standalone Zep graph, sets ontology, chunks text, uploads episodes, and waits for Zep processing to complete.
Relevant engine areas:
- `backend/app/services/graph_builder.py`
- `backend/app/api/graph.py`
Operator implications:
- graph quality is constrained by both source text and ontology quality;
- if the graph is weak, later stages inherit that weakness;
- graph build is asynchronous, so operators should watch task state instead of assuming completion.
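Because the build is asynchronous, a polling loop against the build-status endpoint is the right operator habit. A sketch, assuming a hypothetical `GET /api/graph/status/<task_id>` route that returns JSON with a `status` field; check `backend/app/api/graph.py` for the real route and payload:

```python
import time

import requests


def wait_for_graph_build(base_url: str, task_id: str, timeout_s: int = 1800) -> dict:
    """Poll the graph-build task until it finishes. The route and the
    'status' field are assumptions; read backend/app/api/graph.py for
    the actual contract."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        state = requests.get(f"{base_url}/api/graph/status/{task_id}", timeout=30).json()
        if state.get("status") in ("completed", "failed"):
            return state
        time.sleep(10)  # Zep-side processing can take a while; poll gently
    raise TimeoutError(f"graph build {task_id} still running after {timeout_s}s")
```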
MiroFish does not simulate every possible node blindly. It reads graph entities, filters them, and enriches them with context before simulation preparation.
Relevant engine areas:
- `backend/app/services/zep_entity_reader.py`
- `backend/app/api/simulation.py`
Operator implications:
- agent count depends on filtered entities, not on a hardcoded persona list;
- if you want better agents, improve extraction quality and entity relevance first;
- inspect entity types before concluding the engine "made bad personas".
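As a mental model for the filtering step, a sketch of type-based entity filtering. The real criteria live in `backend/app/services/zep_entity_reader.py`; the field names here are assumptions:

```python
def filter_entities(entities: list[dict], allowed_types: set[str]) -> list[dict]:
    """Keep only named entities of simulation-relevant types.
    Illustrative only: the engine's actual filter conditions may differ."""
    return [
        e for e in entities
        if e.get("type") in allowed_types and e.get("name")
    ]

# e.g. filter_entities(raw_entities, {"Person", "Organization"})
```

The point of the model: if the graph extracted mostly `Document` or `Topic` nodes and few person-like entities, the agent roster will be thin no matter what happens downstream.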
Each retained entity can become an OASIS profile for Twitter and Reddit simulation.
Relevant engine areas:
- `backend/app/services/oasis_profile_generator.py`
- `backend/scripts/test_profile_format.py`
Code-grounded facts:
- profiles include fields such as `persona`, `bio`, `mbti`, `country`, `profession`, and platform-specific counters;
- MiroFish can enrich profile generation with additional Zep search context;
- generated files include `reddit_profiles.json` and `twitter_profiles.csv`.
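To make the field list concrete, a sketch of one profile record. The field names come from the facts above; the counter names, values, and exact schema are assumptions, so compare against `backend/scripts/test_profile_format.py` before relying on them:

```python
import json

# Illustrative record; exact keys and value formats are the engine's to define.
example_profile = {
    "persona": "Skeptical mid-career regulator who weighs evidence slowly",
    "bio": "Policy analyst focused on consumer-protection cases",
    "mbti": "ISTJ",
    "country": "Germany",
    "profession": "Regulatory analyst",
    # platform-specific counters (hypothetical names):
    "followers_count": 340,
    "karma": 1200,
}

with open("reddit_profiles.json", "w") as f:
    json.dump([example_profile], f, indent=2)
```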
Practical guidance:
- do not judge this stage only by the agent name;
- inspect persona richness and whether the entity source was relevant;
- if personas are generic, the first suspects are weak source material and weak extracted context.
The simulation config is generated by an LLM from the simulation requirement, source text, and filtered entities.
Relevant engine areas:
- `backend/app/services/simulation_config_generator.py`
- `backend/app/services/simulation_manager.py`
Code-grounded defaults worth knowing:
- time config defaults to 72 simulated hours;
- `minutes_per_round` defaults to 60;
- activity assumptions are centered on a China-style daily rhythm;
- peak hours default to 19-22;
- off-peak hours default to 0-5.
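Those defaults combine simply: 72 simulated hours at 60 minutes per round is 72 rounds. A sketch of the resulting time-config shape; only the values are code-grounded, and the key names other than `minutes_per_round` are assumptions about what `simulation_config_generator.py` emits:

```python
# Hypothetical layout; verify key names against a generated simulation_config.json.
time_config = {
    "total_hours": 72,
    "minutes_per_round": 60,
    "peak_hours": [19, 22],   # agents are most active in this window
    "offpeak_hours": [0, 5],  # agents are mostly quiet here
}

rounds = time_config["total_hours"] * 60 // time_config["minutes_per_round"]
assert rounds == 72
```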
Operator implications:
- a vague simulation requirement produces a vague config;
- if your scenario is not China-centric, note that in the requirement or adjust downstream expectations;
- generated config quality is part prompt quality, part entity quality, part model quality.
MiroFish can run Twitter, Reddit, or both in parallel and records run state continuously.
Relevant engine areas:
- `backend/app/services/simulation_runner.py`
- `backend/scripts/run_parallel_simulation.py`
- `backend/scripts/run_twitter_simulation.py`
- `backend/scripts/run_reddit_simulation.py`
Generated simulation artifacts usually include:
- `state.json`
- `simulation_config.json`
- `reddit_profiles.json`
- `twitter_profiles.csv`
- `run_state.json`
- `twitter/actions.jsonl`
- `reddit/actions.jsonl`
- `env_status.json`
- `twitter_simulation.db`
- `reddit_simulation.db`
Important interpretation detail:
- "number of rounds" is not a reliable proxy for "number of LLM calls";
- some runtime behavior is driven by pre-generated profiles and environment state, not by a fresh LLM completion every round.
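One consequence for forensics: count what actually happened from the action logs rather than inferring it from the round count. A sketch, assuming each `actions.jsonl` line is a JSON object with an `action` field; verify the real keys against a line from your own artifacts first:

```python
import json
from collections import Counter


def tally_actions(path: str) -> Counter:
    """Count action types in an actions.jsonl file. The 'action' key is
    an assumption about the log schema; inspect one real line first."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if line.strip():
                counts[json.loads(line).get("action", "unknown")] += 1
    return counts

# e.g. tally_actions("twitter/actions.jsonl")
```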
The report stage is where MiroFish performs structured reasoning over the simulation outputs using tool-backed analysis.
Relevant engine areas:
- `backend/app/services/report_agent.py`
- `backend/app/services/zep_tools.py`
- `backend/app/api/report.py`
Code-grounded facts:
- the report agent follows a ReAct-style loop;
- tool calls include `insight_forge`, `panorama_search`, `quick_search`, and `interview_agents`;
- the section-generation prompt requires at least 3 tool calls, with a hard cap of 5;
- report logs are stored separately from runtime logs.
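A minimal sketch of how a min-3 / max-5 tool budget can be enforced around a ReAct-style loop. The tool names are real per the list above; everything else here, including the three callables, is an illustrative stand-in, not the engine's code (see `backend/app/services/report_agent.py` for the real loop):

```python
from typing import Callable, Optional

MIN_CALLS, MAX_CALLS = 3, 5  # code-grounded bounds from the section prompt


def run_section(pick_tool: Callable[[int], Optional[str]],
                call_tool: Callable[[str], str],
                draft: Callable[[list[str]], str]) -> str:
    """Enforce the 3..5 tool-call budget. pick_tool returns a tool name
    (e.g. 'quick_search') or None once the agent thinks it has enough
    evidence; all three callables stand in for the real agent internals."""
    evidence: list[str] = []
    while len(evidence) < MAX_CALLS:
        tool = pick_tool(len(evidence))
        if tool is None and len(evidence) >= MIN_CALLS:
            break                        # minimum met and agent is satisfied
        tool = tool or "quick_search"    # below the minimum: force another call
        evidence.append(call_tool(tool))
    return draft(evidence)               # hard cap reached or agent stopped early
```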
Generated report artifacts include:
- `reports/<report_id>/agent_log.jsonl`
- `reports/<report_id>/console_log.txt`
Operator implications:
- report quality depends heavily on model quality;
- report debugging should start from these artifacts, not from guesswork about the final prose;
- final markdown is a summary layer, not the primary evidence layer.
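A starting point for that debugging: check the tool-call budget per section in `agent_log.jsonl` before reading the prose. The `tool` and `section` keys are assumptions about the log schema; adapt them after looking at one real line:

```python
import json
from collections import defaultdict


def sections_outside_budget(log_path: str, lo: int = 3, hi: int = 5) -> dict:
    """Group tool calls by section and flag sections whose count falls
    outside the 3..5 budget. Key names are assumptions; verify first."""
    per_section: dict[str, int] = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            if not line.strip():
                continue
            entry = json.loads(line)
            if "tool" in entry:
                per_section[entry.get("section", "?")] += 1
    return {s: n for s, n in per_section.items() if not lo <= n <= hi}
```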
When a MiroFish result is weak, inspect stages in this order:
- source material quality
- graph extraction quality
- entity relevance
- profile richness
- simulation requirement specificity
- runtime artifact health
- report logs
That order prevents you from trying to fix report quality at the very end when the actual problem started at the beginning.
For the operator loop around those stages, use references/operator-workflow.md.
For graph-build failures, runtime evidence, and report verification, also use:
- `references/graph-build-runbook.md`
- `references/runtime-forensics.md`
- `references/report-audit.md`