- Problem Statement
- Project Overview
- Architecture Overview
- Tech Stack
- LangGraph Pipeline
- State Schema
- Context Window Management
- Output Specification
- AI-Readiness Scoring System
- Agent File Generation
- Frontend Specification
- Additional Features
- Key Architectural Decisions
- Build Order
- Evaluation Strategy
Every developer has joined a team and stared at a codebase they don't understand. Existing tools like GitHub Copilot help you write code, but nothing helps you understand an unfamiliar codebase systematically — what the architecture is, how modules connect, where the critical business logic lives, what the common patterns are.
The onboarding process at most companies looks like this:
- First 10 minutes: "What even is this?" — no high-level overview exists, or the README is outdated.
- First hour: "How do I run this?" — setup instructions are missing, incomplete, or wrong.
- First day: "Where does X happen?" — tracing a feature through the codebase is trial-and-error.
- First week: "What are the unwritten rules?" — conventions exist but are learned through osmosis over weeks.
Simultaneously, the rise of AI coding assistants (Claude Code, GitHub Copilot, Cline, Aider) has created a second problem: these tools perform dramatically better when they have context files describing the project's conventions, structure, and constraints. Almost nobody writes these files well, or at all.
This project solves both problems from a single automated analysis.
The Codebase Onboarding Agent is a LangGraph-powered system that takes a GitHub repository URL, performs a deep automated analysis, and produces three categories of output:
- Human-readable onboarding documentation — a structured, multi-section interactive report that answers every question a new developer has on day one.
- Machine-readable AI assistant context files — optimized context files for Claude Code, GitHub Copilot, Cline, and Aider that make AI tools understand the codebase from the first interaction.
- AI-readiness diagnostic — a scored assessment across six dimensions with a prioritized improvement roadmap.
The value proposition: "Paste a GitHub URL. Get a complete onboarding document for humans AND optimized context files for your AI coding tools. Your team onboards faster, and your AI assistants write better code from day one."
The system consists of four major components:
- URL input, configuration options, live progress streaming, interactive report viewer, Q&A chat interface
- Communicates with backend via REST (for submissions) and WebSocket (for live progress streaming)
- API server handling analysis requests and serving results
- WebSocket endpoint for streaming node-by-node progress updates to the frontend
- Job queue for managing analysis requests
- Serves cached results when the same repo+commit has already been analyzed
- The core analysis system — a stateful graph with 8 nodes, conditional edges, cyclic processing, and fan-out/fan-in patterns
- Streams progress events back to the backend via LangGraph's `astream_events` method
- Uses LangGraph checkpointing for state persistence across cycles
- GitHub API: Clone repos, fetch file contents, read metadata
- OpenAI API: LLM calls for semantic analysis (module summarization, pattern detection, scoring)
- LangSmith: Tracing and observability for every graph execution
- SQLite: Analysis result cache keyed by `repo_url + commit_hash`, session data
- File system: Cloned repository storage, generated output files
Frontend ──REST──> FastAPI ──triggers──> LangGraph Pipeline
Frontend <──WebSocket── FastAPI <──astream_events── LangGraph Pipeline
LangGraph Pipeline ──> GitHub API (clone/fetch)
LangGraph Pipeline ──> OpenAI API (LLM analysis)
LangGraph Pipeline ──> LangSmith (tracing)
FastAPI ──> SQLite (cache results)
| Component | Technology | Justification |
|---|---|---|
| Pipeline orchestration | LangGraph | Cyclic, conditional, stateful workflows with checkpointing |
| Backend framework | FastAPI | Native async, WebSocket support, direct LangGraph integration (both Python) |
| LLM provider | OpenAI API | Structured outputs, reliable function calling |
| Schema enforcement | Pydantic | Typed state models, structured output validation at every node |
| AST parsing | tree-sitter | Multi-language AST parsing for static analysis |
| Database | SQLite | Simple deployment, sufficient for single-user/portfolio use |
| Frontend | React | Interactive report viewer, dependency graph visualization |
| Graph visualization | D3.js or react-flow | Interactive module dependency graph |
| Observability | LangSmith | Full trace logging for every graph execution |
| Deployment | Vercel (frontend) + Railway/Render (backend) | Free-tier friendly |
The pipeline consists of 8 nodes with conditional edges, one self-loop (cycle), and one fan-out/fan-in pattern.
START
│
▼
[Node 1: Structure Scanner]
│
▼
[Node 2: Dependency Analyzer]
│
▼
[Node 3: Module Deep-Diver] ◄──── cycle (while pending modules exist)
│
▼ (all modules done)
[Node 4: Pattern Detector]
│
▼
[Node 5: AI-Readiness Scorer]
│
▼
[Node 6: Output Router]
│
├──────────────┬──────────────┐ (parallel fan-out)
▼ ▼ ▼
[7a: Doc Gen] [7b: Agent Gen] [7c: Readiness Report]
│ │ │
└──────────────┴──────────────┘ (fan-in)
│
▼
[Node 8: Final Assembler]
│
▼
END
Purpose: Build the initial structural skeleton of the repository. This is the foundation that all subsequent nodes build upon.
Input: Repository URL or local path.
Processing:
- Clone the repository (or access local path)
- Walk the directory tree (2 levels deep for the overview, full depth for analysis)
- Identify entry points: `main.py`, `index.ts`, `app.py`, `server.js`, route directories, etc.
- Identify configuration files: `package.json`, `pyproject.toml`, `tsconfig.json`, `Dockerfile`, `.env.example`, CI configs, etc.
- Identify build/test/lint scripts from package manifests and Makefiles
- Classify file types and count lines per directory
LLM Usage: None. This node is entirely deterministic — file system traversal and pattern matching.
Writes to State:
- `metadata.repo_url`
- `metadata.commit_hash`
- `metadata.directory_tree`
- `metadata.entry_points`
- `metadata.config_files`
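The deterministic scan described for this node can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the name sets are taken from the examples in the text, while `SKIP_DIRS` and the function name `scan_structure` are assumptions made for the sketch.

```python
import os
from collections import Counter

# Name sets taken from the examples in the text; a real scanner would use more.
ENTRY_POINT_NAMES = {"main.py", "index.ts", "app.py", "server.js"}
CONFIG_FILE_NAMES = {"package.json", "pyproject.toml", "tsconfig.json",
                     "Dockerfile", ".env.example"}
SKIP_DIRS = {".git", "node_modules", "__pycache__"}  # assumed exclusions


def scan_structure(root: str) -> dict:
    """Walk the tree once and collect entry points, config files, and line counts."""
    entry_points, config_files = [], []
    lines_per_dir = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        rel_dir = os.path.relpath(dirpath, root)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            if name in ENTRY_POINT_NAMES:
                entry_points.append(rel)
            if name in CONFIG_FILE_NAMES:
                config_files.append(rel)
            try:
                with open(path, encoding="utf-8") as fh:
                    lines_per_dir[rel_dir] += sum(1 for _ in fh)
            except (UnicodeDecodeError, OSError):
                continue  # binary or unreadable file: skip the line count
    return {"entry_points": entry_points,
            "config_files": config_files,
            "lines_per_dir": dict(lines_per_dir)}
```

No LLM call is involved at any point, which matches the node's "entirely deterministic" description.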
Purpose: Build a complete picture of the project's technology profile and internal dependency structure.
Processing:
- Parse package manifests (`package.json`, `requirements.txt`, `pyproject.toml`, `Cargo.toml`, `go.mod`) for external dependencies with versions
- Build an import graph: for every source file, extract import/require statements and map them to either external packages or internal modules
- Scan all source files for environment variable references (`process.env.*`, `os.environ.*`, `os.getenv()`, `.env` file patterns)
- For each env var: record the variable name, which files reference it, what format it appears to expect (URL, API key, boolean, number), and whether a default value exists
LLM Usage: One call to synthesize the technology profile from the raw data — "Given these dependencies and configs, describe this project's tech stack in structured format."
Writes to State:
- `dependencies.tech_stack`: structured TechProfile (framework, language version, key libraries, deployment target)
- `dependencies.packages`: list of Package objects (name, version, category)
- `dependencies.import_graph`: directed graph of file-to-file imports
- `dependencies.env_vars`: list of EnvVar objects (name, files_used_in, expected_format, has_default)
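As an illustration of the environment variable scan, the sketch below maps each variable name to the files that reference it. The regular expressions and the function name `find_env_vars` are assumptions chosen to cover the reference styles named above; they are not exhaustive, and format inference and default detection are left out.

```python
import re

# Assumed patterns for the reference styles named in the text.
ENV_PATTERNS = [
    re.compile(r"process\.env\.([A-Z_][A-Z0-9_]*)"),
    re.compile(r"os\.environ\[\s*['\"]([A-Z_][A-Z0-9_]*)['\"]\s*\]"),
    re.compile(r"os\.environ\.get\(\s*['\"]([A-Z_][A-Z0-9_]*)['\"]"),
    re.compile(r"os\.getenv\(\s*['\"]([A-Z_][A-Z0-9_]*)['\"]"),
]


def find_env_vars(files: dict[str, str]) -> dict[str, list[str]]:
    """Map each env var name to the sorted list of files that reference it."""
    usage: dict[str, set[str]] = {}
    for path, source in files.items():
        for pattern in ENV_PATTERNS:
            for name in pattern.findall(source):
                usage.setdefault(name, set()).add(path)
    return {name: sorted(paths) for name, paths in sorted(usage.items())}


sources = {
    "server.js": "const url = process.env.DATABASE_URL;",
    "settings.py": "import os\nkey = os.getenv('API_KEY')\nurl = os.environ['DATABASE_URL']",
}
env_usage = find_env_vars(sources)
```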
Purpose: The core analysis engine. Iteratively analyzes individual modules/files, building understanding incrementally. This is where LangGraph's cyclic execution is essential.
Processing (per cycle):
1. Check the `modules.pending` list. If empty, exit the cycle (conditional edge routes to Node 4).
2. Pop the highest-priority file from `modules.pending`.
3. Read the file's full content.
4. Send to LLM with context:
   - The file's content (full text)
   - Compressed summaries of all previously analyzed modules (100-200 tokens each)
   - The directory tree skeleton
   - The import graph relevant to this file
5. LLM returns a structured `ModuleSummary`:
   - Purpose (one-line description of what this module does)
   - Public interfaces (exported functions, classes, components with signatures)
   - Internal dependencies (what other modules it imports from and why)
   - External dependencies (what libraries it uses)
   - Key patterns observed (error handling style, data access pattern, etc.)
6. If the file is an API route handler, also extract endpoint details: method, path, middleware, request/response shapes, downstream calls.
7. Compress the full file content into a summary and replace it in state (the raw content is no longer needed).
8. Update `modules.analyzed` (append) and `modules.pending` (may add newly discovered important files).
9. Route back to step 1 (cycle continues).
Priority Ranking Logic (determines order of analysis):
- Files with the highest hub score in the import graph (most imported by others) go first
- Entry points are analyzed early
- Config/utility files are deprioritized unless heavily imported
- Test files are skipped during deep-dive (analyzed separately for coverage metrics)
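The ranking rules above can be sketched as a sort over the import graph. This is a simplified illustration: the graph is assumed to be a dict from each file to the internal files it imports, the "test" substring check is a placeholder heuristic, and the deprioritization of config/utility files is omitted.

```python
from collections import Counter


def rank_files(import_graph: dict[str, list[str]], entry_points: set[str]) -> list[str]:
    """Order files for the deep-dive: hubs first, entry points early, tests skipped."""
    # Hub score = number of distinct files that import a given file.
    hub_score = Counter()
    for importer, imported in import_graph.items():
        for target in set(imported):
            hub_score[target] += 1

    candidates = set(import_graph) | set(hub_score)
    # Assumed test-file heuristic; real projects need something more careful.
    candidates = {f for f in candidates if "test" not in f.lower()}
    # Sort by hub score (descending), then entry points, then path for stability.
    return sorted(candidates,
                  key=lambda f: (-hub_score[f], f not in entry_points, f))


graph = {
    "app.py": ["db.py", "auth.py"],
    "auth.py": ["db.py"],
    "db.py": [],
    "test_db.py": ["db.py"],
}
ranking = rank_files(graph, entry_points={"app.py"})
```

Here `db.py` is imported by three files, so it ranks first even though `app.py` is the entry point.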
Cycle Termination:
- All files in `modules.pending` have been processed, OR
- A configurable maximum number of files has been analyzed (default: 50), OR
- The user-selected depth setting caps analysis ("quick" = 15 files, "standard" = 30, "deep" = 50)
LLM Usage: One call per file per cycle. This is the most LLM-intensive node.
Writes to State:
- `modules.analyzed`: appended each cycle with a new ModuleSummary
- `modules.pending`: updated each cycle (items removed, possibly new items added)
- `modules.module_connections`: updated graph of how analyzed modules connect
- `modules.api_endpoints`: appended when route handlers are found
- `modules.feature_flows`: appended when traceable feature paths are identified
Key LangGraph Pattern: A conditional edge after this node checks `len(state.modules.pending) > 0` and that the file cap from the termination rules has not been reached. If both hold, it routes back to Node 3 (cycle). Otherwise, it routes to Node 4.
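The routing decision itself is a small pure function. The sketch below simulates the cycle without LangGraph so the termination behaviour is easy to see. The node names `module_deep_diver` and `pattern_detector`, the `ModulesState` class, and `deep_dive_once` are hypothetical stand-ins. In a real graph, a function like `route_after_deep_dive` would be registered through `StateGraph.add_conditional_edges`.

```python
from dataclasses import dataclass, field

DEPTH_CAPS = {"quick": 15, "standard": 30, "deep": 50}  # from the depth settings above


@dataclass
class ModulesState:
    pending: list[str] = field(default_factory=list)
    analyzed: list[str] = field(default_factory=list)


def route_after_deep_dive(modules: ModulesState, depth: str = "standard") -> str:
    """Return the next node name: loop back, or move on to pattern detection."""
    cap = DEPTH_CAPS[depth]
    if modules.pending and len(modules.analyzed) < cap:
        return "module_deep_diver"
    return "pattern_detector"


def deep_dive_once(modules: ModulesState) -> None:
    """Stand-in for one cycle: pop the next file and record it as analyzed."""
    modules.analyzed.append(modules.pending.pop(0))


# Simulate the cycle: 20 candidate files, "quick" depth caps analysis at 15.
state = ModulesState(pending=[f"file_{i}.py" for i in range(20)])
while route_after_deep_dive(state, depth="quick") == "module_deep_diver":
    deep_dive_once(state)
```

Checking the cap in the router, not only the pending list, keeps the loop bounded even when the deep-diver keeps discovering new files.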
Purpose: Analyze the accumulated module summaries to identify cross-cutting patterns, conventions, inconsistencies, and potential issues.
Processing:
- Feed all module summaries (compressed) to the LLM in one call
- Ask for identification of:
- Recurring conventions: error handling pattern, data access pattern, naming conventions, file organization conventions, testing conventions, import style
- Inconsistencies: places where the codebase deviates from its own conventions (potential bugs or tech debt)
- Dead code indicators: exported functions/components that nothing imports, route handlers not reachable from routing config
- Complexity hotspots: files or functions flagged during deep-dive as unusually complex, deeply nested, or excessively long
- "How to add a..." patterns: infer the standard workflow for common tasks (add a new API endpoint, add a new database model, add a new page/route) based on detected conventions
LLM Usage: One comprehensive call with all summaries. Optionally a second call for the "how to add" guides.
Writes to State:
- `patterns.conventions`: list of Convention objects (name, description, example_files, pattern_type)
- `patterns.inconsistencies`: list of Issue objects (description, files_involved, severity, possible_explanation)
- `patterns.dead_code`: list of FilePath objects (file, export, reason_flagged)
- `patterns.complexity_hotspots`: list of Hotspot objects (file, function, line_count, nesting_depth, description)
Purpose: Evaluate the codebase across six dimensions that determine how well AI coding assistants will perform on it. Produce scores, a radar chart data structure, and a prioritized action plan.
Processing:
- For each of the six dimensions, compute a score (0-10) using a mix of deterministic measurement and LLM assessment (see the AI-Readiness Scoring System section for full dimension details)
- Compute weighted overall score
- Generate prioritized recommendation list ordered by effort-to-impact ratio
LLM Usage: One call for the qualitative assessment dimensions (Consistency, Discoverability narrative). The quantitative dimensions (Type Safety, Modularity, Test Coverage, Dependency Hygiene) are computed deterministically from state data.
Writes to State:
- `scores.dimension_scores`: dict mapping dimension name to float score
- `scores.overall_score`: weighted average float
- `scores.recommendations`: list of Action objects (description, impact, effort, score_improvement_estimate)
- `scores.agent_file_config`: configuration data needed by the Agent File Generator
Purpose: Decision node that determines which output generators to invoke based on user configuration.
Processing:
- Read user's original request configuration (which agent files were selected, what export formats were requested)
- Prepare Send() calls for each required output generator
- Fan out to 7a, 7b, and 7c in parallel
LLM Usage: None. Pure routing logic.
LangGraph Pattern: Uses LangGraph's Send API to invoke multiple nodes in parallel. Each Send includes the relevant slice of state plus the generator-specific configuration.
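A dependency-free sketch of the fan-out planning is shown below. It builds one `(node_name, payload)` pair per generator. In LangGraph each pair would presumably be wrapped as `Send(node_name, payload)` and returned from a conditional edge function. The node names, the state slices assigned to each generator, and the rule that the agent file branch is skipped when no tools are selected are all assumptions for illustration.

```python
# Hypothetical node names and state slices; the real graph may differ.
GENERATOR_SLICES = {
    "doc_generator": ["metadata", "dependencies", "modules", "patterns"],
    "agent_file_generator": ["metadata", "dependencies", "patterns", "scores"],
    "readiness_report": ["scores"],
}


def plan_fan_out(state: dict, config: dict) -> list[tuple[str, dict]]:
    """Build one (node_name, payload) pair per generator that should run."""
    dispatches = []
    for node, keys in GENERATOR_SLICES.items():
        if node == "agent_file_generator" and not config.get("agent_tools"):
            continue  # assumed rule: no agent tools selected, so skip that branch
        payload = {key: state[key] for key in keys}
        payload["config"] = config
        dispatches.append((node, payload))
    return dispatches


example_state = {key: {} for key in
                 ["metadata", "dependencies", "modules", "patterns", "scores"]}
plan = plan_fan_out(example_state, {"agent_tools": ["claude", "copilot"]})
```

Passing only the relevant slice keeps each generator's prompt inputs small and makes the three branches independent of one another.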
These three nodes execute in parallel and are independent of each other.
Purpose: Produce the 7-section onboarding document from accumulated state.
Processing:
- Transform state data into the seven report sections (see Output Specification)
- Generate human-readable prose for each section from the relevant state slice, batching related sections into a shared LLM call where quality allows
- Format output as structured JSON (for the interactive report) and Markdown (for export)
LLM Usage: 3-4 calls (some sections can be batched, others need dedicated calls for quality).
Output: Structured report data in both JSON and Markdown formats.
Purpose: Generate AI assistant context files for the user's selected tools.
Processing:
- Run the Common Knowledge Extractor: pull universal facts from state (stack, commands, conventions, patterns, structure)
- For each selected agent tool, run the tool-specific formatter (see Agent File Generation)
- Flag uncertain fields with `# VERIFY` comments
LLM Usage: One call per agent file (4 maximum).
Output: Generated file contents for each selected agent tool.
Purpose: Produce the visual readiness report from scoring data.
Processing:
- Transform dimension scores into radar chart data structure
- Format the action plan with specific file references and improvement estimates
- Generate the one-line verdict summary
LLM Usage: One call for the narrative verdict and action plan descriptions.
Output: Structured scoring report data.
Purpose: Collect outputs from all three generators and package them into the final deliverable.
Processing:
- Receive outputs from 7a, 7b, and 7c
- Combine into a unified response structure
- Generate export-ready files:
- Interactive report JSON (consumed by React frontend)
- Markdown document (downloadable, can be pasted into repo wiki)
- Structured JSON (for programmatic consumption)
- Individual agent files (CLAUDE.md, copilot-instructions.md, .clinerules, .aider.conf.yml)
- Store results in SQLite cache keyed by `repo_url + commit_hash`
LLM Usage: None. Pure assembly and formatting.
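The cache write in the last step can be sketched with Python's built-in `sqlite3` module. The table name `analysis_cache`, its columns, and the helper names are assumptions. The only detail taken from the text is the composite key of `repo_url` and `commit_hash`.

```python
import json
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open the cache database and create the table on first use."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS analysis_cache (
               repo_url    TEXT NOT NULL,
               commit_hash TEXT NOT NULL,
               result_json TEXT NOT NULL,
               PRIMARY KEY (repo_url, commit_hash)
           )"""
    )
    return conn


def store_result(conn: sqlite3.Connection, repo_url: str,
                 commit_hash: str, result: dict) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO analysis_cache VALUES (?, ?, ?)",
            (repo_url, commit_hash, json.dumps(result)),
        )


def load_result(conn: sqlite3.Connection, repo_url: str, commit_hash: str):
    """Return the cached result, or None when this repo+commit was never analyzed."""
    row = conn.execute(
        "SELECT result_json FROM analysis_cache "
        "WHERE repo_url = ? AND commit_hash = ?",
        (repo_url, commit_hash),
    ).fetchone()
    return json.loads(row[0]) if row else None


cache = open_cache()
store_result(cache, "https://github.com/example/repo", "abc123", {"overall_score": 6.4})
```

Keying on the commit hash means a new push to the same repository is treated as a cache miss, while repeat requests for an unchanged commit are served without re-running the pipeline.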
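The CodebaseState is the central data structure that flows through the entire graph. Every node reads from and writes to specific sections of this state. Sections are append-only with respect to one another: a node never overwrites a section written by an earlier node, although the Module Deep-Diver does update its own `modules.pending` list on every cycle.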
CodebaseState:
│
├── metadata (written by: Structure Scanner)
│ ├── repo_url: str
│ ├── commit_hash: str
│ ├── directory_tree: Tree
│ ├── entry_points: list[FilePath]
│ └── config_files: list[ConfigFile]
│
├── dependencies (written by: Dependency Analyzer)
│ ├── tech_stack: TechProfile
│ ├── packages: list[Package]
│ ├── import_graph: Graph
│ └── env_vars: list[EnvVar]
│
├── modules (written by: Module Deep-Diver, appended each cycle)
│ ├── analyzed: list[ModuleSummary]
│ ├── pending: list[FilePath]
│ ├── module_connections: Graph
│ ├── api_endpoints: list[Endpoint]
│ └── feature_flows: list[FlowTrace]
│
├── patterns (written by: Pattern Detector)
│ ├── conventions: list[Convention]
│ ├── inconsistencies: list[Issue]
│ ├── dead_code: list[FilePath]
│ └── complexity_hotspots: list[Hotspot]
│
├── scores (written by: AI-Readiness Scorer)
│ ├── dimension_scores: dict[str, float]
│ ├── overall_score: float
│ ├── recommendations: list[Action]
│ └── agent_file_config: AgentConfig
│
└── outputs (written by: Output Generators + Final Assembler)
├── report_json: dict
├── report_markdown: str
├── agent_files: dict[str, str]
└── readiness_report: dict
ModuleSummary:
- file_path: str
- purpose: str (one-line description)
- public_interfaces: list[Interface] (name, signature, description)
- internal_deps: list[str] (file paths this module imports from)
- external_deps: list[str] (libraries used)
- patterns_observed: list[str]
- compressed_summary: str (100-200 token summary for downstream use)
Endpoint:
- method: str (GET, POST, PUT, DELETE, PATCH)
- path: str
- handler_file: str
- handler_function: str
- middleware: list[str]
- request_shape: dict (inferred from validation/types)
- response_shape: dict (inferred from return statements/types)
- downstream_calls: list[str] (services/functions called)
Convention:
- name: str
- description: str
- example_files: list[str] (2-3 files that demonstrate this pattern)
- pattern_type: str (error_handling, data_access, naming, file_org, testing, imports)
TechProfile:
- primary_language: str
- language_version: str
- framework: str
- framework_version: str
- key_libraries: list[str]
- deployment_target: str (inferred from configs)
- build_tool: str
- test_framework: str
- linter: str | None
- formatter: str | None
Action:
- description: str
- impact: str (high, medium, low)
- effort: str (high, medium, low)
- affected_dimension: str
- score_improvement_estimate: str (e.g., "Type Safety: 5 → 7.5")
- specific_files: list[str] (files to modify)
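To make the schema concrete, here is a sketch of `ModuleSummary` and the `modules` section as typed Python classes. The tech stack names Pydantic for schema enforcement. Standard-library dataclasses are used here only to keep the example dependency-free, and the `ModulesSection.record` helper is a hypothetical addition.

```python
from dataclasses import dataclass, field


@dataclass
class Interface:
    name: str
    signature: str
    description: str


@dataclass
class ModuleSummary:
    file_path: str
    purpose: str  # one-line description
    public_interfaces: list[Interface] = field(default_factory=list)
    internal_deps: list[str] = field(default_factory=list)
    external_deps: list[str] = field(default_factory=list)
    patterns_observed: list[str] = field(default_factory=list)
    compressed_summary: str = ""  # 100-200 token summary for downstream use


@dataclass
class ModulesSection:
    analyzed: list[ModuleSummary] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)

    def record(self, summary: ModuleSummary) -> None:
        """Append a finished summary and drop its file from the pending list."""
        self.analyzed.append(summary)
        if summary.file_path in self.pending:
            self.pending.remove(summary.file_path)


modules = ModulesSection(pending=["lib/auth.py", "lib/db.py"])
modules.record(ModuleSummary(file_path="lib/auth.py",
                             purpose="Session and token handling"))
```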
Large codebases won't fit in a single LLM context. This is the most important engineering challenge in the project.
Step 1: Static Pre-processing (Zero LLM tokens)
Most structural analysis doesn't need an LLM at all. These operations are purely deterministic:
- Parse AST using `tree-sitter` (multi-language support)
- Extract import/require/include statements via AST or regex
- Map the directory tree via `os.walk`
- Identify file types by extension and content
- Count lines per file/directory
- Parse package manifests for dependencies
- Extract environment variable references via regex
This eliminates the majority of context window pressure before any LLM call is made.
Step 2: Priority Ranking (1 LLM call)
Not all files need deep analysis. Rank files by:
- Hub score: files imported by the most other files (from the import graph). High-hub files are architectural anchors.
- Entry point proximity: files that are direct entry points or one import away from entry points.
- File size: very large files are likely important but need careful handling.
- File type relevance: source code > configuration > documentation > test files > generated files.
A single LLM call then reviews the ranked list and adjusts priorities based on file names and directory context ("this looks like the main business logic directory").
Top N files (based on user-selected depth) go into modules.pending.
Step 3: File-by-File Deep Dive (1 LLM call per file)
Each cycle of the Module Deep-Diver sends the LLM:
- Full content of the current file (the one being analyzed)
- Compressed summaries of all previously analyzed files (100-200 tokens each)
- Directory tree skeleton (truncated, just top 2 levels)
- Relevant import graph slice (what this file imports, what imports this file)
This means the LLM always has maximum context for the file under analysis, plus enough surrounding knowledge to understand connections.
Step 4: Summary Compression (after each cycle)
After analyzing a file, its full content is replaced in state with a compressed summary (100-200 tokens). This summary includes:
- One-line purpose
- Public interface signatures
- Key dependencies
- Patterns observed
This keeps the cumulative state within budget even for large repos. After analyzing 50 files, the summaries consume roughly 50 × 150 = 7,500 tokens — well within limits.
Step 5: Cross-Module Reasoning (uses summaries only)
The Pattern Detector and AI-Readiness Scorer operate on compressed summaries, never on raw file contents. This is sufficient for identifying patterns, inconsistencies, and computing scores — the fine-grained detail was already captured during the deep-dive phase.
| Operation | Tokens consumed |
|---|---|
| Directory tree skeleton | ~500 |
| Import graph for current file | ~200 |
| Current file content (average) | ~2,000 |
| 30 compressed summaries | ~4,500 |
| System prompt + instructions | ~1,000 |
| Total per deep-dive cycle | ~8,200 |
| Response (structured ModuleSummary) | ~500 |
With GPT-4o's 128K context window, a typical deep-dive call of ~8,200 input tokens uses under 7% of the available context. Even at 50 analyzed files, the cumulative summaries consume only ~7,500 tokens.
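The budget arithmetic can be reproduced with a few lines of Python. The constants are the rough estimates from the table, with 150 tokens taken as the midpoint of the 100-200 token summary range. They are planning figures, not measurements.

```python
# Figures from the table above; all are rough estimates.
TREE_TOKENS = 500
IMPORT_SLICE_TOKENS = 200
AVG_FILE_TOKENS = 2_000
SUMMARY_TOKENS = 150  # midpoint of the 100-200 token range
SYSTEM_PROMPT_TOKENS = 1_000
CONTEXT_WINDOW = 128_000


def deep_dive_input_tokens(analyzed_so_far: int,
                           file_tokens: int = AVG_FILE_TOKENS) -> int:
    """Estimate input tokens for one deep-dive call."""
    return (TREE_TOKENS + IMPORT_SLICE_TOKENS + file_tokens
            + analyzed_so_far * SUMMARY_TOKENS + SYSTEM_PROMPT_TOKENS)


for n in (0, 30, 50):
    used = deep_dive_input_tokens(n)
    print(f"{n:>2} summaries: {used:>6,} input tokens "
          f"({used / CONTEXT_WINDOW:.1%} of the window)")
```

The `file_tokens` parameter makes it easy to check the worst case: the estimate grows linearly with the size of the file under analysis, so a very large file is the main way a single call could approach the limit.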
The onboarding document is NOT a single monolithic file. It is a structured, multi-section interactive report — something between a wiki and a dashboard.
A single-screen summary — the "README that should have existed."
Contents:
- Project name and one-line description (inferred from package.json, README, or top-level comments)
- Tech stack with exact versions (not just "React" but "React 18.2 with Next.js 14 App Router")
- Repository structure as a visual tree — 2 levels deep, with annotations on what each top-level directory contains
- Entry points clearly marked (where does the app start? where do API requests enter? where's the main config?)
- External service dependencies (databases, APIs, auth providers, queues) inferred from env vars, configs, and imports
Format: Visual card layout — scannable, not scrollable.
Generated by analyzing the project, not guessing.
Contents:
- Complete list of required environment variables with:
- Variable name
- Where it's used (which files reference it)
- What it appears to expect (URL format? API key? boolean?)
- Whether a default exists
- Step-by-step setup commands, inferred from package managers, Dockerfiles, Makefiles, and scripts
- Not generic "run npm install" but the actual sequence specific to this project
- Both Docker and local development paths if both exist
- Common pitfalls and prerequisites (Node version requirements, database setup, seed scripts)
Format: Numbered step list with copy-paste commands. Each step has a "why" annotation.
Complete inventory of all route definitions.
Contents per endpoint:
- HTTP method and path
- Handler function name and file location
- Middleware/auth guards applied
- Request shape (inferred from validation schemas, TypeScript types, or parameter usage)
- Response shape (inferred from return statements and types)
- Database queries or external service calls it makes
Format: Interactive table — sortable by path, filterable by method, clickable to expand details. Essentially a locally-generated Swagger doc that also shows implementation paths.
Architectural visualization of how modules connect.
Contents:
- Which modules import from which other modules
- High-connectivity hubs identified (the files everything depends on)
- Circular dependencies flagged
- Distinction between internal module imports and external library usage
- Files grouped into logical clusters (even if directory structure doesn't reflect this cleanly)
Format: Interactive graph visualization (D3.js or react-flow). Nodes = modules/files, edges = import relationships. Click a node to see its summary, public exports, and dependents. Color-coded by directory or logical function.
"Where does X happen?" traces for each major feature.
Contents per feature flow:
- Starting point (user action or entry point)
- Each step in the flow: file, function, what it does
- External service calls at each step
- Database interactions at each step
Example: "User Checkout Flow — starts at /pages/checkout.tsx, calls useCart() hook from /hooks/useCart.ts, which reads from CartContext, on submit calls /api/checkout which invokes processOrder() in /lib/orders.ts, which calls Stripe via /lib/stripe.ts and writes to DB via Prisma model Order."
Format: Vertical flow diagram per feature — simplified sequence diagram. Each step shows file, function, and action. Clickable to jump to code context.
The "unwritten rules" section.
Contents:
- Detected coding patterns with code examples:
- Error handling pattern
- Data access pattern
- Component structure pattern
- Testing convention
- Import style convention
- Naming convention
- Inconsistencies flagged with file references:
- Where conventions are violated (potential bugs or intentional exceptions)
- "How to add a..." guides:
- Step-by-step instructions for common tasks, following the codebase's own conventions
- "How to add a new API endpoint"
- "How to add a new database model"
- "How to add a new page/route"
- Complexity hotspots:
- Files/functions with unusually high complexity, with summaries
- Dead code:
- Exported functions nothing imports
- Route handlers not reachable from routing config
Format: Structured list with code examples from the repo. Each pattern has "where to see it" references.
"What should I look at first?" guide.
Contents:
- Ordered reading list of 5-7 key files with brief annotations explaining why each is important and what to focus on
- Priority-ranked by architectural importance
- Areas of concern flagged (tech debt, disorganized directories, deprecated modules)
Format: Numbered reading list with file paths and one-line annotations.
A framework for measuring how well AI coding assistants will perform on a codebase. Scored across six dimensions, each independently, with a weighted overall score.
Dimension: Discoverability. Weight: 15%
What it measures: Can the AI tool understand what the project is and how it's structured without guessing?
Scoring criteria:
- README exists and is substantive (not a template with placeholder text)
- AI context files exist (CLAUDE.md, copilot-instructions, .clinerules)
- Directory structure is self-documenting (clear names, logical grouping)
- Entry points are identifiable (obvious main files, clear routing structure)
Measurement: Check for presence and substantiveness of each artifact. Empty/template README = 0. README with setup, architecture, and contribution guidelines = high score.
Dimension: Type Safety. Weight: 25%
What it measures: Can the AI infer types when generating code?
Scoring criteria:
- Percentage of functions with explicit return types
- Percentage of function parameters that are typed
- TypeScript strict mode enabled (or Python type hints, or equivalent)
- API boundaries have schema definitions (Zod, Pydantic, JSON Schema)
- Database models are explicitly defined (Prisma, SQLAlchemy, TypeORM)
Measurement: Deterministic — parse AST, count typed vs untyped functions, check tsconfig strict settings.
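For Python sources, the typed-versus-untyped count can be sketched with the standard `ast` module. This is a stand-in for the multi-language tree-sitter parsing the project describes. The function name `type_coverage` is hypothetical, and `*args`/`**kwargs` are ignored for brevity.

```python
import ast


def type_coverage(source: str) -> dict[str, float]:
    """Return the share of functions with return annotations and of typed parameters."""
    tree = ast.parse(source)
    funcs = [node for node in ast.walk(tree)
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    typed_returns = sum(1 for f in funcs if f.returns is not None)
    # Collect positional-only, regular, and keyword-only parameters.
    params = [arg for f in funcs
              for arg in f.args.posonlyargs + f.args.args + f.args.kwonlyargs
              if arg.arg not in ("self", "cls")]
    typed_params = sum(1 for arg in params if arg.annotation is not None)
    return {
        "return_coverage": typed_returns / len(funcs) if funcs else 1.0,
        "param_coverage": typed_params / len(params) if params else 1.0,
    }


sample = (
    "def add(x: int, y) -> int:\n"
    "    return x + y\n"
    "\n"
    "def echo(z):\n"
    "    return z\n"
)
coverage = type_coverage(sample)
```

In the sample, one of two functions has a return annotation and one of three parameters is typed.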
Dimension: Consistency. Weight: 25%
What it measures: Are patterns consistent enough for AI to learn from them?
Scoring criteria:
- Naming conventions are uniform (no camelCase/snake_case mixing)
- Error handling pattern is consistent across the codebase
- File organization follows a detectable convention
- Import styles are consistent (relative vs absolute, barrel exports vs direct)
- Linter/formatter config exists and is enforced
Measurement: Pattern analysis from Node 4's output — ratio of consistent to inconsistent pattern applications.
Dimension: Modularity. Weight: 15%
What it measures: Is the code modular with clear interfaces?
Scoring criteria:
- Number of circular dependency chains
- Modules have clear public interfaces (index files, barrel exports, explicit `__init__.py`)
- File size distribution (flag files over 500 lines)
- Function complexity distribution (flag functions over 50 lines or deeply nested)
Measurement: Deterministic — import graph analysis for circular deps, line counts from scanner, AST analysis for nesting depth.
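Circular dependency chains can be found with a depth-first search over the import graph. The sketch below reports one cycle per back edge. The graph representation (a dict from file to imported files) is an assumption, and the recursive form is only suitable for graphs shallower than Python's recursion limit.

```python
def find_cycles(import_graph: dict[str, list[str]]) -> list[list[str]]:
    """Return one representative import cycle per back edge found by DFS."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in import_graph}
    stack: list[str] = []
    cycles: list[list[str]] = []

    def visit(node: str) -> None:
        color[node] = GRAY
        stack.append(node)
        for dep in import_graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # Back edge: the cycle is the current path from dep to node.
                cycles.append(stack[stack.index(dep):] + [dep])
            elif state == WHITE:
                visit(dep)
        stack.pop()
        color[node] = BLACK

    for node in sorted(import_graph):
        if color[node] == WHITE:
            visit(node)
    return cycles


cyclic = find_cycles({"a.py": ["b.py"], "b.py": ["c.py"],
                      "c.py": ["a.py"], "d.py": ["a.py"]})
```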
Dimension: Test Coverage. Weight: 10%
What it measures: Can AI-generated code be verified?
Scoring criteria:
- Test suite exists and is runnable
- Approximate coverage (test file count relative to source files, or from coverage config)
- CI pipeline runs tests automatically
- Test commands are documented and discoverable
Measurement: File counting, CI config detection, test command extraction from scripts.
Dimension: Dependency Hygiene. Weight: 10%
What it measures: Will AI tools suggest sensible imports?
Scoring criteria:
- Number of unused dependencies
- Duplicate libraries serving the same purpose (both axios and got, both moment and dayjs)
- Dependencies with known vulnerabilities (audit results)
- Dependency versions pinned vs loose ranges
Measurement: Deterministic — dependency analysis from Node 2, plus audit command execution.
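The weighted overall score is a plain weighted average of the six dimension scores. In the sketch below, the pairing of weights with dimension names is inferred from the order and descriptions of the dimensions above, and the snake_case keys are an assumed naming choice.

```python
# Weights as listed above; the name-to-weight pairing is inferred.
WEIGHTS = {
    "discoverability": 0.15,
    "type_safety": 0.25,
    "consistency": 0.25,
    "modularity": 0.15,
    "test_coverage": 0.10,
    "dependency_hygiene": 0.10,
}


def overall_score(dimension_scores: dict[str, float]) -> float:
    """Combine six 0-10 dimension scores into one weighted 0-10 score."""
    if set(dimension_scores) != set(WEIGHTS):
        raise ValueError("expected exactly the six known dimensions")
    total = sum(WEIGHTS[name] * score for name, score in dimension_scores.items())
    return round(total, 2)


example = overall_score({
    "discoverability": 6, "type_safety": 5, "consistency": 7,
    "modularity": 8, "test_coverage": 4, "dependency_hygiene": 9,
})
```

With these example inputs the weighted terms are 0.9 + 1.25 + 1.75 + 1.2 + 0.4 + 0.9, giving 6.4.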
The report has three layers:
Layer 1 — Dashboard view: Radar chart across six dimensions + overall score + one-line verdict.
Layer 2 — Detailed breakdown: Per-dimension score, what was measured, specific findings with file references, comparison to "fully optimized" baseline.
Layer 3 — Action plan: Prioritized improvements ordered by effort-to-impact ratio. Each item includes:
- Specific description ("Add return types to the 23 untyped functions in /lib/")
- Estimated effort ("~1 hour")
- Score improvement ("Type Safety: 5 → 7.5")
- Files to modify
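Ordering by effort-to-impact ratio needs a numeric reading of the high/medium/low labels. The 1-3 mapping below is an assumption made for illustration, and the sample actions are invented.

```python
LEVEL = {"low": 1, "medium": 2, "high": 3}  # assumed numeric mapping


def prioritize(actions: list[dict]) -> list[dict]:
    """Sort so that low-effort, high-impact items come first."""
    return sorted(actions, key=lambda a: LEVEL[a["effort"]] / LEVEL[a["impact"]])


plan_items = prioritize([
    {"description": "Rewrite module boundaries", "effort": "high", "impact": "low"},
    {"description": "Add return types in /lib/", "effort": "low", "impact": "high"},
    {"description": "Pin dependency versions", "effort": "medium", "impact": "medium"},
])
```

Because Python's sort is stable, items with equal ratios keep their original relative order.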
Location: Repository root (CLAUDE.md)
Purpose: Automatically read by Claude Code when entering the project. Dense briefing document.
Contents:
- Project overview: what it is, what it does, who it's for
- Exact build/test/lint commands (extracted from package.json scripts, Makefile targets, CI configs — not guessed)
- Code style conventions actually followed (inferred from analysis): "use named exports, not default exports", "error handling uses Result types, not try-catch"
- Architecture summary: where things live and why
- Key constraints: "never import from /internal directly, always go through barrel exports in /lib"
- Common gotchas: circular dependency risks, similar-looking files that serve different purposes, deprecated modules
Style: Dense, precise, actionable. Bullet points and commands, not prose. Instructions to a capable developer who's never seen the repo.
Location: .github/copilot-instructions.md
Purpose: Workspace-level instruction file influencing code completions and chat in VS Code.
Contents:
- Coding conventions and style rules (influences completions)
- Preferred libraries: "use dayjs not moment, use zod for validation, use custom fetcher in /lib/api.ts instead of raw fetch"
- Patterns for common tasks: "new API routes should follow the pattern in /api/users/index.ts"
- Things to avoid: "don't suggest `any` types, don't use `var`, don't import from node_modules paths directly"
Style: Short, directive. Rules not explanations. Optimized for influencing autocomplete behavior.
Location: Repository root (.clinerules)
Purpose: Project-specific rules guiding Cline's autonomous coding behavior.
Contents:
- Project structure overview (where to create new files)
- Testing requirements: "every new function in /lib must have a corresponding test"
- Deployment constraints: "deploys to Vercel, no server-side file system access"
- Approval gates: "changes to /lib/auth or /lib/billing should be flagged for review"
- Architecture boundaries: "/client should never import from /server directly"
Style: Constraint-oriented. What not to do, what requires caution. Guardrails for autonomous operation.
Location: Repository root (.aider.conf.yml + optional convention files)
Contents:
- Repo map configuration (which files to always include in context, which to exclude)
- Convention notes guiding code generation
- Test command (so Aider can verify its changes)
- Lint command (same reason)
- File patterns to ignore (build artifacts, generated files, vendor directories)
Style: YAML structured. Explicit file include/exclude patterns. Configuration-oriented.
The generation flow is:

    Accumulated CodebaseState
              │
              ▼
    Common Knowledge Extractor
    (stack, commands, conventions, patterns, structure)
              │
              ├──> Claude Formatter  ──> CLAUDE.md
              ├──> Copilot Formatter ──> copilot-instructions.md
              ├──> Cline Formatter   ──> .clinerules
              └──> Aider Formatter   ──> .aider.conf.yml
              │
              ▼
    Human Review & Edit (preview in UI, tweak, then export)
              │
              ▼
    Export (download files or auto-commit via GitHub API)

Design principle: One extraction, four formatters. When a tool changes its instruction format, only one formatter is updated — nothing else in the pipeline changes.
Uncertain fields: Any extracted data the system isn't confident about is marked with # VERIFY: [reason for uncertainty] comments in the generated files.
User selection: Users choose which agent tools they use via multi-select in the UI. Only selected formatters run. Don't generate files for tools the user doesn't use.
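The "one extraction, four formatters" idea, together with user selection and VERIFY markers, can be sketched as a small registry. This is a minimal illustration, not the project's actual code: CommonKnowledge, verify_note, FORMATTERS, and generate_agent_files are hypothetical names, only two of the four formatters are shown, and the output formats are simplified.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CommonKnowledge:
    """Hypothetical tool-agnostic output of the Common Knowledge Extractor."""
    test_command: str
    conventions: list[str]
    # Field name -> reason the extractor was unsure about that field.
    uncertain: dict[str, str] = field(default_factory=dict)


def verify_note(k: CommonKnowledge, field_name: str) -> str:
    """Return a trailing '# VERIFY: ...' comment for low-confidence fields."""
    reason = k.uncertain.get(field_name)
    return f"  # VERIFY: {reason}" if reason else ""


def format_cline(k: CommonKnowledge) -> str:
    """Render constraint-style rules (simplified .clinerules content)."""
    rules = "\n".join(f"- {rule}" for rule in k.conventions)
    note = verify_note(k, "test_command")
    return f"{rules}\n- Run tests with: {k.test_command}{note}\n"


def format_aider(k: CommonKnowledge) -> str:
    """Render a one-line YAML fragment for .aider.conf.yml."""
    return f"test-cmd: {k.test_command}{verify_note(k, 'test_command')}\n"


# Tool key -> (output filename, formatter). Supporting a format change
# means touching exactly one entry here.
FORMATTERS: dict[str, tuple[str, Callable[[CommonKnowledge], str]]] = {
    "cline": (".clinerules", format_cline),
    "aider": (".aider.conf.yml", format_aider),
}


def generate_agent_files(k: CommonKnowledge, selected: list[str]) -> dict[str, str]:
    """Run only the formatters the user selected in the UI."""
    files = {}
    for tool in selected:
        filename, formatter = FORMATTERS[tool]
        files[filename] = formatter(k)
    return files
```

Because every formatter reads the same CommonKnowledge object, the extractor never needs to know which tools exist.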
- Text input for GitHub URL
- Multi-select checkboxes for agent files to generate (Claude, Copilot, Cline, Aider)
- Analysis depth selector: Quick (15 files), Standard (30 files), Deep (50 files)
- Submit button triggers POST /api/analyze
- WebSocket connection receives node-by-node updates
- Visual progress indicator showing:
- Which node is currently executing
- What it's currently analyzing ("Deep-diving into /lib/auth.ts...")
- Completed nodes with brief result summaries
- Estimated time remaining
- Tabbed interface with 7 tabs corresponding to the 7 output sections
- Section 4 (Dependency Graph) renders as an interactive D3.js/react-flow visualization
- Section 3 (API Map) renders as a sortable/filterable table
- All sections are searchable
- Export button: download as Markdown or JSON
- Side-by-side editor showing generated content for each selected agent file
- Syntax-highlighted Markdown/YAML editing
- # VERIFY comments highlighted in yellow for easy identification
- Save/export button per file
- "Copy to clipboard" functionality
- Radar chart visualization (6 axes, one per dimension)
- Overall score prominently displayed
- One-line verdict
- Expandable detail cards per dimension
- Action plan as a prioritized list with effort/impact tags
- Chat interface for follow-up questions about the analyzed codebase
- Uses the cached CodebaseState, so no re-analysis is needed
- Questions like "How does authentication work?" or "What happens when a user creates a project?" are answered by referencing accumulated knowledge in state
- Responses grounded in actual codebase analysis, not generic knowledge
After generating the initial report, the user can ask follow-up questions about the codebase. The LangGraph state from the analysis is persisted (checkpointed), so the Q&A agent has full context without re-analyzing. This is where LangGraph's state persistence adds clear value.
If the user runs the agent again after the codebase has changed:
- Compare current commit hash to cached analysis
- Only re-analyze modified files (git diff)
- Update affected sections of the report
- Recalculate AI-readiness scores only for changed dimensions
- Previous run's state serves as baseline (LangGraph checkpointing)
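The update steps above can be sketched as a pure planning function. This is a simplified sketch under assumptions: UpdatePlan, plan_update, and the sections_by_file mapping (which report sections each file contributed to in the previous run) are hypothetical, and the list of changed files is assumed to come from something like `git diff --name-only <cached_commit> <current_commit>`.

```python
from dataclasses import dataclass


@dataclass
class UpdatePlan:
    files_to_reanalyze: list[str]
    sections_to_refresh: list[str]


def plan_update(
    cached_commit: str,
    current_commit: str,
    changed_files: list[str],
    sections_by_file: dict[str, list[str]],
) -> UpdatePlan:
    """Decide what an incremental run needs to redo."""
    # Same commit as the cached analysis: nothing to do.
    if cached_commit == current_commit:
        return UpdatePlan([], [])

    # Every modified file is re-analyzed, including files the previous
    # run never saw (they simply have no known sections yet).
    files = sorted(set(changed_files))

    # Only sections that drew on a modified file need regenerating.
    sections = sorted(
        {s for path in files for s in sections_by_file.get(path, [])}
    )
    return UpdatePlan(files, sections)
```

The same idea would extend to readiness dimensions by keeping a second mapping from files to the dimensions they affect.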
Based on detected patterns, generate step-by-step guides for common tasks:
- "How to add a new API endpoint" — following the codebase's own conventions
- "How to add a new database model"
- "How to add a new page/route"
- "How to add a new test"
Each guide follows the patterns the agent detected in the existing codebase, not generic advice.
Flag files or functions with unusually high cyclomatic complexity, deep nesting, or excessive length. Framed as orientation, not linting: "This file is the most complex in the codebase — here's a summary of what it does so you don't have to parse 800 lines yourself."
Identify:
- Exported functions or components that nothing imports
- Route handlers not reachable from the app's routing config
- Useful for onboarding: "ignore these, they're probably deprecated"
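Given the import map from static pre-processing, the unused-export check reduces to a set difference. The sketch below assumes hypothetical inputs: exports maps each module path to its exported names, and imports is a list of (module, name) pairs observed anywhere in the codebase.

```python
def find_unused_exports(
    exports: dict[str, set[str]],
    imports: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return (module, name) pairs that are exported but never imported."""
    used = set(imports)
    return sorted(
        (module, name)
        for module, names in exports.items()
        for name in names
        if (module, name) not in used
    )
```

Entry points, framework-convention files, and dynamically imported modules would show up as false positives here, so results are better presented as "probably unused" than as definite dead code.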
LangGraph is Python-native. The graph definition, state schema (Pydantic models), and node functions are all Python. FastAPI's async endpoints can stream over WebSockets and consume events from LangGraph's astream_events directly, so no cross-language bridging is needed.
This is a single-user tool for portfolio purposes. SQLite stores analysis results keyed by repo_url + commit_hash. Re-running on the same repo at the same commit returns cached results instantly. Swapping to Postgres later is a config change, not an architectural one. SQLite keeps deployment simple — one process, one file.
The temptation is to send everything to the LLM. But AST parsing, import extraction, line counting, file type detection, and directory traversal are deterministic operations that Python libraries handle perfectly (tree-sitter, ast module, regex, os.walk). By doing this before any LLM call, the majority of context window pressure is eliminated and the pipeline is significantly cheaper to run.
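As one concrete case of deterministic pre-processing, import extraction for Python files needs no LLM at all. The sketch below uses the standard ast module; extract_imports is a hypothetical helper name, and other languages would need their own parser (for example tree-sitter grammars).

```python
import ast


def extract_imports(source: str) -> list[str]:
    """Return the sorted, de-duplicated module names a Python file imports."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a, b.c  ->  "a", "b.c"
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # from pkg.mod import x  ->  "pkg.mod"
            # (bare relative imports like "from . import x" have no module name)
            modules.add(node.module)
    return sorted(modules)
```

Running this over every file yields the edges of the dependency graph before a single token is spent.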
Each deep-dive cycle sends the LLM one file's content plus compressed summaries of everything analyzed so far. This maximizes context for the current file while maintaining enough surrounding knowledge for cross-module reasoning. Batching multiple files would reduce per-file context and make structured output harder to enforce.
The three generators (docs, agent files, readiness report) are independent — they read from the same completed state but don't depend on each other. LangGraph's Send API runs them simultaneously, cutting final generation latency to the slowest generator instead of the sum of all three.
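The latency argument can be demonstrated with plain asyncio. Note that this sketch uses asyncio.gather as a stand-in for LangGraph's fan-out rather than the LangGraph API itself, and the generator names and sleep durations are placeholders, not measurements.

```python
import asyncio
import time


async def run_generator(name: str, seconds: float) -> str:
    """Placeholder for a generator node; sleeps instead of calling an LLM."""
    await asyncio.sleep(seconds)
    return name


async def generate_all() -> tuple[list[str], float]:
    """Run the three independent generators concurrently and time them."""
    start = time.perf_counter()
    results = await asyncio.gather(
        run_generator("docs", 0.2),
        run_generator("agent_files", 0.1),
        run_generator("readiness_report", 0.05),
    )
    elapsed = time.perf_counter() - start
    return list(results), elapsed
```

Run concurrently, the elapsed time tracks the slowest task (about 0.2 s here) rather than the 0.35 s sum, which is the same effect the fan-out/fan-in step relies on.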
When Copilot changes its instruction format (and it will — these formats are all evolving), only one formatter needs updating. The Common Knowledge Extractor and all other formatters remain unchanged. This separation of concerns is the standard adapter pattern.
After analyzing a file, its full content (~2,000 tokens average) is replaced with a compressed summary (~150 tokens). For a 50-file analysis, this is the difference between 100,000 tokens of raw content and 7,500 tokens of summaries in state. The fine-grained detail was already captured in the structured ModuleSummary during the deep-dive; downstream nodes only need the compressed version for cross-module reasoning.
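The arithmetic behind this decision is easy to check. The sketch below uses the averages quoted above (about 2,000 tokens per file, about 150 per summary); the function names are hypothetical and real token counts will vary per file.

```python
AVG_FILE_TOKENS = 2_000     # average raw file size quoted in this spec
AVG_SUMMARY_TOKENS = 150    # average compressed summary size quoted in this spec


def state_tokens(files_analyzed: int, compress: bool = True) -> int:
    """Tokens held in state after analyzing the given number of files."""
    per_file = AVG_SUMMARY_TOKENS if compress else AVG_FILE_TOKENS
    return files_analyzed * per_file


def deep_dive_prompt_tokens(files_already_analyzed: int) -> int:
    """Tokens for one deep-dive call: the current file plus prior summaries."""
    return AVG_FILE_TOKENS + state_tokens(files_already_analyzed)
```

For the 50-file Deep setting, state_tokens(50, compress=False) gives 100,000 and state_tokens(50) gives 7,500, and even the final deep-dive call (49 prior summaries plus one full file) stays under 10,000 tokens of codebase content.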
- Structure Scanner node (deterministic, no LLM)
- Dependency Analyzer node (one LLM call)
- Static pre-processing layer (tree-sitter AST parsing, import mapping)
- State schema (Pydantic models)
- Basic CLI interface for testing
- Milestone: Feed a repo, get structural skeleton + tech profile
- Module Deep-Diver node with cyclic execution
- Summary compression strategy
- Priority ranking logic
- Cycle termination conditions
- Conditional edge implementation
- Milestone: Feed a repo, get module-by-module analysis with summaries
- Test on: 3-4 real repos of varying sizes (small, medium, large)
- Pattern Detector node
- AI-Readiness Scorer node (6 dimensions)
- Deterministic metric computation
- Scoring calibration against known repos
- Milestone: Full analysis pipeline produces patterns, scores, and recommendations
- Documentation Generator (7 sections)
- Agent File Generator (4 formatters)
- AI-Readiness Report Generator
- Fan-out/fan-in pattern implementation
- Final Assembler with export formats (Markdown, JSON, agent files)
- Milestone: Complete pipeline produces all three output categories
- React app setup
- Submit screen with configuration options
- WebSocket streaming for live progress
- Tabbed report viewer
- Interactive dependency graph visualization (D3.js/react-flow)
- Agent file preview and edit interface
- AI-readiness dashboard with radar chart
- Q&A chat interface
- Milestone: Full end-to-end user experience
- Run against 10-15 diverse repos
- Document accuracy, edge cases, and failure modes
- Add diff-aware update support
- LangSmith integration for observability
- Write comprehensive README (technical blog post style)
- Deploy to free tier (Vercel + Railway/Render)
- Record demo video
- Milestone: Portfolio-ready project with evaluation metrics
The project's resume value depends on demonstrable quality. Evaluation should cover:
- Run against 10-15 repos of varying stacks and sizes
- Manually verify:
- Are extracted endpoints correct? (precision/recall)
- Are detected patterns real? (spot-check against manual review)
- Are setup instructions accurate? (actually follow them and see if the project runs)
- Are environment variables complete? (compare against actual .env.example)
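For the endpoint check, precision and recall can be computed from two sets: what the agent extracted and what a manual review found. A minimal sketch, with precision_recall as a hypothetical helper and endpoints represented as plain strings:

```python
def precision_recall(extracted: set[str], actual: set[str]) -> tuple[float, float]:
    """Compare extracted items against a manually verified ground truth."""
    true_positives = len(extracted & actual)
    # Precision: what fraction of reported items are real.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of real items were reported.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

The same function applies unchanged to the environment-variable check by passing variable names instead of endpoints.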
- Score 5 repos manually using the 6-dimension rubric
- Compare manual scores to automated scores
- Report correlation and deviation
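The comparison of manual and automated scores can be reported with a Pearson correlation and a mean absolute deviation. The sketch below implements both with the standard library only; the helper names are hypothetical, and with only five repos the correlation should be read as a rough signal rather than a statistically robust result.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mean_absolute_deviation(xs: list[float], ys: list[float]) -> float:
    """Average absolute gap between manual and automated scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)
```

Correlation shows whether the scorer ranks repos the way a human would, while the deviation shows how far off the absolute numbers are; a scorer can do well on one and poorly on the other.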
- Generate CLAUDE.md for 5 repos
- Use Claude Code with the generated CLAUDE.md on actual tasks
- Compare quality of Claude Code's output with vs without the generated file
- Analysis time by repo size (files, lines of code)
- Token usage per analysis
- Cost per analysis (OpenAI API spend)
- Problem statement and value proposition
- Architecture diagram
- Design decisions with tradeoffs (why LangGraph, why file-by-file, why static pre-processing)
- Evaluation methodology and concrete metrics
- LangSmith trace screenshot showing a full analysis run
- Demo video or GIF
- Known limitations and future improvements
This specification represents the complete architectural plan for the Codebase Onboarding Agent. Each section maps directly to implementation work. The build order provides an incremental path where each sprint produces a testable milestone.