Building AI apps today feels less like engineering and more like "voodoo." When an AI fails to answer a question, we spend hours manually tweaking prompts—changing adjectives, adding "please," or asking it to "think step-by-step."
- It is unscientific.
- It is unscalable.
- It breaks whenever the model changes.
We are building a Comparative RAG System that pits a human against a machine.
- Contestant A (Human): Uses standard, manually written prompts (e.g., "Search for X").
- Contestant B (DSPy): Uses an automated optimizer that "compiles" prompts by learning from data, maximizing a retrieval-accuracy metric on training examples.
The Goal: A live dashboard proving that automated prompt engineering outperforms human intuition on complex, multi-hop questions.
We are keeping this lean to move fast.
- The Engine: DSPy (a Python framework for programming LLMs).
- The Data: HotPotQA (a pre-existing dataset of "hard" multi-step questions) + Wikipedia.
  - Note: We are using a local ColBERTv2 server (hosted via colbert-server) to ensure reliability and speed, replacing the unstable public index.
- The Backend: FastAPI. Serves the logic and runs the comparison.
- The Frontend: React. Visualizes the battle.
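A rough sketch of how the engine wires together (the model name and the local server URL/port are placeholders, and this assumes a recent DSPy release where `dspy.LM` takes a "provider/model" string; older releases use `dspy.OpenAI(model=...)` instead):

```python
import dspy

# Language model: any OpenAI chat model works; the name here is a placeholder.
lm = dspy.LM("openai/gpt-4o-mini")

# Retrieval model: point DSPy at the local ColBERTv2 server instead of the public index.
# The URL/port are whatever colbert-server is configured to expose locally.
rm = dspy.ColBERTv2(url="http://localhost:8893/api/search")

dspy.settings.configure(lm=lm, rm=rm)
```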
To ensure we finish on time, we are adhering to these constraints:
- Local Indexing: We use a local ColBERTv2/Wikipedia index (cached).
- Small Training Set: We will train the optimizer on just 20–50 examples from the HotPotQA dataset.
- Optimization Target: We are strictly optimizing Query Rephrasing.
  - Human approach: Searches for the user's question literally.
  - Machine approach: Breaks the question down into sub-queries or hypothetical answers to find better documents.
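A minimal sketch of the machine side, patterned on DSPy's multi-hop examples: load a small HotPotQA slice for training and scoring, then define a module whose signature rewrites the question into one search query per hop. The split sizes, hop count, and field names are our own choices, not requirements:

```python
import dspy
from dspy.datasets import HotPotQA

# A small slice of HotPotQA: 20 examples to train the optimizer, 50 dev examples for scoring.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)
trainset = [ex.with_inputs("question") for ex in dataset.train]
devset = [ex.with_inputs("question") for ex in dataset.dev]


class GenerateSearchQuery(dspy.Signature):
    """Rewrite the question into a focused search query, given what has been found so far."""

    context = dspy.InputField(desc="passages retrieved on earlier hops")
    question = dspy.InputField()
    query = dspy.OutputField(desc="a search query targeting the missing fact")


class MachineRAG(dspy.Module):
    """Multi-hop pipeline: rewrite -> retrieve, repeat, then answer over the combined context."""

    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.max_hops = max_hops

    def forward(self, question):
        context, queries = [], []
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            queries.append(query)
            context += self.retrieve(query).passages
        answer = self.generate_answer(context=context, question=question).answer
        # Return the rewritten queries too, so the UI can show the "Prompt Evolution".
        return dspy.Prediction(context=context, answer=answer, queries=queries)
```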
Goal: Visualize the "Man vs. Machine" comparison.
- The View: A split-screen interface. Left side = Human results. Right side = Machine results.
- The Data: Display the Input Question, the Retrieved Context (the text chunks found), and the Final Answer.
- The "Cool Factor": Show the internal thought process (the "Prompt Evolution") of the Machine side. Show us how it rewrote the query.
- The Scoreboard: A simple metric bar showing success/fail for the current query.
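One possible shape for the payload behind this view, as a pair of Pydantic models (the field names are ours to agree on in Hour 1, not dictated by any library):

```python
from pydantic import BaseModel


class ContestantResult(BaseModel):
    answer: str                        # the Final Answer shown on that side of the split screen
    context: list[str]                 # the Retrieved Context (text chunks found)
    rewritten_queries: list[str] = []  # empty for Human; the "Prompt Evolution" for Machine
    correct: bool = False              # drives the Scoreboard when the gold answer is known


class CompareResponse(BaseModel):
    question: str
    human: ContestantResult
    machine: ContestantResult
```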
Goal: Build the logic pipeline.
- Integration: Connect DSPy to the OpenAI API and the local ColBERTv2 retriever.
- The "Human" Route: Build a standard RAG chain (Retrieve -> Generate).
- The "Machine" Route: Build the DSPy module and run the BootstrapFewShot optimizer to "compile" the prompt using the training set (both routes are sketched after this list).
- Endpoints: Expose a simple API that takes a question and returns both Human and Machine answers + metadata.
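A sketch of both routes, continuing from the earlier snippets (`MachineRAG` and `trainset` come from the constraints section; the metric and demo count are assumptions we can tune):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot


class HumanRAG(dspy.Module):
    """Baseline: search for the user's question literally, then generate an answer."""

    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        answer = self.generate_answer(context=context, question=question).answer
        return dspy.Prediction(context=context, answer=answer)


# "Compile" the machine route: bootstrap few-shot demos that score well on the training set.
optimizer = BootstrapFewShot(metric=dspy.evaluate.answer_exact_match, max_bootstrapped_demos=4)
compiled_machine = optimizer.compile(MachineRAG(), trainset=trainset)

# Persist the compiled program so the API can load it at startup instead of re-optimizing.
compiled_machine.save("compiled_machine.json")
```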
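And a minimal endpoint sketch reusing the classes above (route name, query-parameter style, and file path are placeholders):

```python
from fastapi import FastAPI

app = FastAPI()

human_rag = HumanRAG()
machine_rag = MachineRAG()
machine_rag.load("compiled_machine.json")  # the optimized demos/prompts saved during compilation


@app.get("/compare")
def compare(question: str) -> CompareResponse:
    human_pred = human_rag(question=question)
    machine_pred = machine_rag(question=question)
    return CompareResponse(
        question=question,
        human=ContestantResult(answer=human_pred.answer, context=list(human_pred.context)),
        machine=ContestantResult(
            answer=machine_pred.answer,
            context=list(machine_pred.context),
            rewritten_queries=list(machine_pred.queries),
        ),
    )
```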
Goal: Ensure the demo doesn't fail.
- Curate the "Evil" Questions: Select 5-10 specific questions from the dataset where we know the simple search fails but the optimized search succeeds.
- The Narrative: Prepare the script explaining why the machine won (e.g., "Notice how the machine broke the question into two parts, while the human prompt got stuck on keywords").
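A throwaway script along these lines can shortlist candidates, reusing the pipelines and dev slice sketched above (exact match is an assumed scoring rule; eyeball the survivors before the demo):

```python
import dspy

# Run both pipelines over the dev slice; keep questions only the machine gets right.
evil_questions = []
for example in devset:
    human_pred = human_rag(question=example.question)
    machine_pred = compiled_machine(question=example.question)
    human_ok = dspy.evaluate.answer_exact_match(example, human_pred)
    machine_ok = dspy.evaluate.answer_exact_match(example, machine_pred)
    if machine_ok and not human_ok:
        evil_questions.append(example.question)

print(f"{len(evil_questions)} candidate 'evil' questions:")
for q in evil_questions[:10]:
    print("-", q)
```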
- Hour 1: Everyone agrees on the API JSON structure (Input/Output).
- Hour 2-3:
  - Backend: Get the DSPy optimizer running and save the "compiled" program.
  - Frontend: Skeleton of the split-screen UI.
- Hour 4: Integration. Connect React to FastAPI.
- Hour 5: Polish the "Evil Questions" list and style the UI to make the "Winner" obvious.