DecisioningAssistant is a local-first MLX project for macOS that ingests PDFs, Markdown files, and Webex threads, generates QA datasets, fine-tunes small instruction models, builds a hybrid local RAG index, and serves a Streamlit chat assistant with source-aware citations.
- Ingests PDF and Markdown documentation with structure-aware paragraph chunking.
- Fetches Webex room history directly from the Webex REST API using `rooms.json` plus a YAML config.
- Groups Webex data into thread-based chunks so each chunk starts with the thread root.
- Generates English QA pairs locally with an MLX-loaded model.
- Fine-tunes MLX-compatible models with LoRA.
- Builds or updates a local Qdrant index from source chunks and optional QA pairs.
- Runs a Streamlit RAG app with chat history, retrieval controls, reranking, answer selection, citations, and source popups.
- PDF and Markdown ingestion are structure-aware: PDFs are packed from whole paragraphs, while Markdown keeps H1 chapters whole by default and preserves section metadata.
- Webex ingestion is thread-aware: threads with fewer than 2 messages are skipped, and each thread chunk keeps room and thread metadata.
- Webex metadata now includes a `webexteams://...` deep link to the parent/root message in the thread (see the metadata sketch after this list).
- Webex QA generation uses the thread start to generate the question and uses child messages as the answer.
- Webex QA can be filtered to a specific user, keeping only threads where that user appears in child messages and only that user’s child messages in the answer.
- RAG retrieval supports vector search plus reranking with `cross_encoder`, `embedding_cosine`, or `none`.
- Answer generation supports Best-of-N answer selection with reranking. The default candidate count is `4`.
- The Streamlit source popup can show the retrieved text and the Webex parent-message link when available.
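As an illustration, the metadata kept on a Webex thread chunk could look like the sketch below; the field names and values are hypothetical, with only the room/thread metadata and the `webexteams://...` deep link confirmed by the notes above:

```yaml
# Hypothetical Webex thread-chunk metadata; field names are illustrative,
# not the project's actual schema.
source_type: webex
room_title: "Decisioning Support"         # room metadata kept on the chunk
thread_root_id: "ROOT_MESSAGE_ID"         # thread metadata kept on the chunk
parent_message_link: "webexteams://..."   # deep link to the parent/root message
```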
```
configs/
  sources.yaml
  models.yaml
  qa_generation.yaml
  finetune.yaml
  rag.yaml
  webex_fetch.yaml
data/
  raw/pdf/
  raw/markdown/
  raw/webex/
  staging/documents/
  staging/chunks/
  qa/
  rag/vectordb/
pipelines/
  01_ingest.sh
  02_generate_qa.sh
  03_finetune.sh
  04_build_rag.sh
  05_eval.sh
  06_export_rag.sh
  07_import_rag.sh
src/
  common/
  decisioning_assistant/
  ingestion/
  qa/
  rag/
  training/
wiki/
  Home.md
  Overview.md
  Technical-Details.md
  Usage-and-Configuration.md
```
- macOS with Apple Silicon for MLX workflows.
- Python `>=3.10`.
- English-only source material and QA generation.
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
```

The base install includes the test/lint tools, PDF/Markdown support, MLX-VLM model support with TorchVision image utilities, and TurboQuant conversion/runtime support used by the project.
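After installation, the `decisioning-assistant` console entry point used throughout this README should be on your `PATH`; a quick smoke test, assuming the CLI exposes the conventional `--help` flag:

```bash
# Verify the editable install registered the CLI entry point.
decisioning-assistant --help
```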
- `configs/sources.yaml`: PDF, Markdown, and Webex ingestion paths plus normalization/chunking settings.
- `configs/models.yaml`: QA generator, answer model, and embedding model settings.
- `configs/qa_generation.yaml`: QA generation, validation, split, and Webex-specific QA controls.
- `configs/finetune.yaml`: MLX LoRA fine-tuning settings.
- `configs/rag.yaml`: indexing, retrieval, reranking, answer selection, and prompt-budget settings.
- `configs/webex_fetch.yaml`: direct Webex API fetch settings.
Named machine profiles are available alongside the defaults:
- `configs/models.m3_24gb.yaml`: explicit M3 24 GB MLX-LM generation profile.
- `configs/rag.m3_24gb.yaml`: explicit M3 24 GB retrieval/context profile.
- `configs/qa_generation.m3_24gb.yaml`: explicit M3 24 GB QA generation profile.
- `configs/models.m5_pro_64gb.yaml`: larger MLX-LM generation profile.
- `configs/models.m5_pro_64gb.gemma4.yaml`: Gemma 4 MLX-VLM generation profile.
- `configs/rag.m5_pro_64gb.yaml`: larger retrieval/context profile.
- `configs/qa_generation.m5_pro_64gb.yaml`: denser QA generation profile.
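One confirmed consumer of these profiles is the `rag.chat_local` CLI shown later in this README; for example, to run it against the M3 24 GB profiles (the question string is illustrative):

```bash
# Point the local chat CLI at the M3 24 GB machine profiles.
PYTHONPATH=src python3 -m rag.chat_local \
  "How are Webex threads chunked?" \
  --rag-config configs/rag.m3_24gb.yaml \
  --models-config configs/models.m3_24gb.yaml
```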
1. Fetch raw Webex spaces if needed.
2. Put PDFs into `data/raw/pdf/` and Markdown files into `data/raw/markdown/`.
3. Run ingestion and chunking.
4. Generate QA (optional step).
5. Fine-tune a model using the QA dataset from step 4 (optional).
6. Build or update the RAG index.
7. Start the chat app.
Example:
```bash
# Required for Webex threads
decisioning-assistant webex-fetch \
  --rooms-json configs/rooms.json \
  --config configs/webex_fetch.yaml \
  --output-dir data/raw/webex

# Put any PDF into the pdf dir and any Markdown file into the markdown dir
decisioning-assistant ingest                                              # required step
decisioning-assistant qa                                                  # optional
decisioning-assistant finetune --finetune-config configs/finetune.yaml   # optional
decisioning-assistant rag-index --recreate
decisioning-assistant app --server-port 8501
```

```bash
# Ingest PDF + Markdown + Webex + normalize
decisioning-assistant ingest

# Override Markdown chapter defaults for smaller chunks if needed
decisioning-assistant ingest --markdown-target-chars 900 --markdown-split-level 6

# Generate, validate, and split QA
decisioning-assistant qa

# Fine-tune with MLX LoRA
decisioning-assistant finetune --finetune-config configs/finetune.yaml

# Build or update the hybrid RAG index
decisioning-assistant rag-index

# Recreate the RAG collection from scratch
decisioning-assistant rag-index --recreate

# Export the RAG index
decisioning-assistant rag-export --output-dir data/rag/export

# Export only selected source types
decisioning-assistant rag-export --output-dir data/rag/export --source pdf
decisioning-assistant rag-export --output-dir data/rag/export --source markdown
decisioning-assistant rag-export --output-dir data/rag/export --source webex

# Import an exported RAG bundle
decisioning-assistant rag-import --input-dir data/rag/export --recreate

# Start the Streamlit app
decisioning-assistant app --server-port 8501
```

Raw Webex exports can be created directly through the Webex API.
Example:
```bash
decisioning-assistant webex-fetch \
  --rooms-json /path/to/rooms.json \
  --config configs/webex_fetch.yaml \
  --output-dir data/raw/webex
```

Notes:
- `--room-type group` is the default.
- Output file names are derived from the room title and shortened to 80 characters.
- The fetch config only uses `token` and `max_total_messages`.
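Since only those two keys are read, a minimal `configs/webex_fetch.yaml` can be as small as the sketch below (both values are placeholders):

```yaml
# Minimal fetch config sketch; only these two keys are read.
token: "YOUR_WEBEX_API_TOKEN"   # placeholder; supply a real Webex bearer token
max_total_messages: 5000        # illustrative cap on fetched messages
```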
- QA generation is local-only.
- Short Webex chunks are skipped using `min_webex_chunk_chars`.
- Webex thread QA uses generated questions plus child-message answers.
- `webex_user_name` can restrict QA generation to replies from a specific user.
- `max_webex_thread_answer_chars` controls the separate answer cap for Webex thread answers.
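A sketch of the corresponding `configs/qa_generation.yaml` excerpt, using the key names from the notes above with illustrative values (the flat layout is an assumption):

```yaml
# Webex-specific QA controls; key names from the notes above, values illustrative.
min_webex_chunk_chars: 200            # skip Webex chunks shorter than this
webex_user_name: "Jane Doe"           # optional: keep only this user's replies
max_webex_thread_answer_chars: 1500   # separate answer cap for thread answers
```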
- Qdrant runs locally on disk.
- The index can include raw source chunks, QA pairs, or both.
- Retrieval reranking and answer reranking are separate stages.
- The default retrieval reranker is `cross_encoder`.
- The default answer-selection candidate count is `4`.
- The Streamlit app exposes the main retrieval, reranking, and prompt-budget controls in the sidebar.
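For orientation, a `configs/rag.yaml` excerpt might look like the sketch below; the key names and nesting are assumptions, but the values reflect the defaults stated above:

```yaml
# Illustrative retrieval and answer-selection settings; key names are assumed.
retrieval:
  reranker: cross_encoder   # alternatives noted above: embedding_cosine, none
answer_selection:
  candidates: 4             # Best-of-N default candidate count
```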
Run the app directly if needed:
```bash
PYTHONPATH=src streamlit run src/rag/assistant_app.py
```

The app provides:
- session chat history,
- configurable retrieval and prompt budgets,
- answer Best-of-N selection,
- source citations,
- source popups with retrieved text,
- Webex room timestamp display,
- Webex parent-message deep links when available.
Gemma 4 MLX checkpoints use mlx-vlm, so set `provider: mlx_vlm` in the relevant `qa_generator` or `answer_model` config block. The text-only MLX path remains `provider: mlx` or `provider: mlx_lm`.
Example:
```yaml
answer_model:
  provider: mlx_vlm
  model: mlx-community/gemma-4-26b-a4b-it-mxfp4
  max_tokens: 2048
  temperature: 0.15
  trust_remote_code: true
```

The `rag.chat_local` CLI can also pass media to VLM models:
```bash
PYTHONPATH=src python3 -m rag.chat_local \
  "What does this screenshot show?" \
  --rag-config configs/rag.m5_pro_64gb.yaml \
  --models-config configs/models.m5_pro_64gb.gemma4.yaml \
  --image /path/to/screenshot.png
```

TurboQuant-compressed MLX models can be converted and used by the same QA, evaluation, local chat, and Streamlit app paths as standard MLX-LM models.
Convert the configured `answer_model`:

```bash
decisioning-assistant turboquant-convert \
  --models-config configs/models.yaml \
  --model-key answer_model \
  --mlx-path data/models/answer-model-tq3 \
  --bits 3 \
  --group-size 64
```

Or convert an explicit HuggingFace/local model:
```bash
decisioning-assistant turboquant-convert \
  --hf-path openai/gpt-oss-20b \
  --mlx-path data/models/gpt-oss-20b-tq3 \
  --bits 3 \
  --group-size 64
```

Point the app at the converted model with `provider: turboquant_mlx`:
```yaml
answer_model:
  provider: turboquant_mlx
  model: data/models/answer-model-tq3
  max_tokens: 2048
  temperature: 0.15
  turboquant_kv_bits: 3
  turboquant_kv_group_size: 64
```

Set `turboquant_kv_bits` to 0 or omit it to use the TurboQuant weight-compressed model with the normal FP16 KV cache. Use `turboquant_fast: true` only for converted models that include QJL correction and where speed is preferred over the highest-quality decode.
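A sketch of that FP16-KV-cache variant, assuming the same block structure as the example above:

```yaml
# TurboQuant weights with the normal FP16 KV cache; values illustrative.
answer_model:
  provider: turboquant_mlx
  model: data/models/answer-model-tq3
  turboquant_kv_bits: 0   # 0 (or omitting the key) keeps the FP16 KV cache
  turboquant_fast: true   # only for models converted with QJL correction
```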
Portable RAG bundles can be moved to another machine.
Example:
```bash
# Export
decisioning-assistant rag-export --output-dir data/rag/export

# Import on another machine
decisioning-assistant rag-import --input-dir data/rag/export --recreate
```

- The defaults in `configs/` were tuned for a MacBook Pro M3 with 24 GB RAM, but the code is not hard-limited to that hardware.
- Larger future Apple Silicon systems can increase model size, retrieval depth, and prompt budgets through config.
- After changing Webex ingestion metadata, rerun ingestion, QA generation, and RAG indexing so new metadata reaches the app.
- PyMuPDF uses a dual AGPL/commercial license; check that it fits your usage.
See the wiki pages in `wiki/` for a fuller walkthrough:

- `wiki/Overview.md`
- `wiki/Technical-Details.md`
- `wiki/Usage-and-Configuration.md`