Codebase Onboarding Agent

An intelligent, LangGraph-powered pipeline that analyzes any GitHub repository and generates comprehensive onboarding documentation, AI-assistant context files, and an AI-readiness score — so new developers can ramp up in hours, not weeks.

Problem Statement

Joining a new codebase is one of the most time-consuming parts of software development. Developers spend days reading scattered documentation, tracing code paths, and asking colleagues for tribal knowledge. AI coding assistants (Claude, Copilot, Cline, Aider) need manually curated context files to be effective on a new repo.

Codebase Onboarding Agent automates all of this. Point it at a GitHub repo and it produces everything a new developer — or an AI assistant — needs to be productive from day one.

What It Does

Given a GitHub repository URL, the agent:

Clones and scans the repository structure, entry points, and configuration
Analyzes dependencies — builds an import graph, extracts the tech stack, identifies env vars and external APIs
Deep-dives into modules — extracts interfaces, patterns, and relationships file-by-file using AST parsing and LLM analysis
Detects patterns — identifies coding conventions, inconsistencies, dead code, and complexity hotspots
Scores AI-readiness — evaluates the codebase across 6 dimensions (type safety, consistency, modularity, discoverability, test coverage, dependency hygiene)
Generates three outputs:
- A 12-section onboarding report (markdown) covering everything from quick start to suggested first tasks
- AI context files — CLAUDE.md, copilot-instructions.md, .clinerules, .aider.conf.yml — ready to drop into the repo
- An AI-readiness report with radar chart data, a verdict, and a prioritized action plan

Demo

codebase-onboarding-agent-demo.mp4

Architecture

Pipeline Overview

The system is built around a stateful, multi-node LangGraph pipeline with cyclic processing, conditional routing, and fan-out/fan-in output generation. A FastAPI backend exposes the pipeline over REST and WebSocket, and a React frontend provides an interactive report viewer with live progress streaming.

┌─────────────────────────────────────────────────────────────────────────┐
│                          FRONTEND (React)                               │
│   SubmitPage  ──►  ProgressPage (WebSocket)  ──►  ResultsPage           │
│                                                   ├─ ReportTab          │
│                                                   ├─ DashboardTab       │
│                                                   ├─ AgentFilesTab      │
│                                                   └─ QATab              │
└──────────────────────────────┬──────────────────────────────────────────┘
                               │ REST + WebSocket
┌──────────────────────────────▼──────────────────────────────────────────┐
│                        BACKEND (FastAPI)                                │
│   /api/analyze (POST)  ──►  Job Queue  ──►  LangGraph Pipeline          │
│   /api/progress/{id} (WS)   In-Memory       astream_events              │
│   /api/qa (POST)            Job Manager      Pipeline → state           │
└──────────────────────────────┬──────────────────────────────────────────┘
                               │
        ┌──────────────────────┼───────────────────────┐
        ▼                      ▼                       ▼
   GitHub API            Gemini API              SQLite Cache
   (clone repos)         (via LiteLLM)           (result caching)

The 10-Node LangGraph Pipeline

N1: Structure Scanner ─► N2: Dependency Analyzer ─► N3: Module Deep-Diver ─┐
         (no LLM)              (1 LLM call)           (1 call/file)        │
                                                                     ┌─────┘
                                                                     │ loops while
                                                                     │ pending modules
                                                                     └─────┐
                                                                           ▼
N5: AI-Readiness Scorer ◄─ N4: Pattern Detector ◄─────────────────── (exit cycle)
      (1 LLM call)              (1-2 LLM calls)
         │
         ▼
N6: Output Router ──┬──► N7a: Doc Generator ──────────┐
     (no LLM)       │         (2 LLM calls)           │
                    │                                  │
                    ├──► N7b: Agent File Generator ────┤  sequential
                    │         (multiple LLM calls)     │  fan-out
                    │                                  │
                    └──► N7c: Readiness Report ────────┘
                              (structured output)      │
                                                       ▼
                                              N8: Final Assembler
                                                   (no LLM)
                                                      │
                                                      ▼
                                                    END

Node	Purpose	LLM Calls	Key Output
N1: Structure Scanner	Clones repo, walks filesystem, finds entry points & config files	0	`MetadataState`
N2: Dependency Analyzer	Parses manifests, builds import graph, extracts env vars	1	`DependencyState`, `TechProfile`
N3: Module Deep-Diver	Analyzes each source file — interfaces, dependencies, patterns. Cycles while modules remain in the pending queue	1 per file	`ModuleState`
N4: Pattern Detector	Identifies conventions, inconsistencies, dead code, complexity hotspots	1-2	`PatternState`
N5: AI-Readiness Scorer	Scores across 6 weighted dimensions (0-10 scale)	1 + metrics	`ScoreState`
N6: Output Router	Routes to fan-out generators	0	Control flow
N7a: Doc Generator	Produces a 12-section markdown onboarding report	2	`report_markdown`
N7b: Agent File Generator	Generates `CLAUDE.md`, `copilot-instructions.md`, `.clinerules`, `.aider.conf.yml`	Multiple	`agent_files`
N7c: Readiness Report	Creates radar chart data, verdict, and prioritized action plan	Structured	`readiness_report`
N8: Final Assembler	Validates completeness, caches result in SQLite	0	Final state

State Management

The pipeline uses a central CodebaseState (Pydantic model) with an append-only design — nodes never overwrite data written by previous nodes. Key sections:

MetadataState — repo URL, commit hash, directory tree, entry points
DependencyState — tech stack, packages, import graph, env vars, external APIs
ModuleState — analyzed modules, pending queue, module connections, API endpoints
PatternState — conventions, inconsistencies, dead code, complexity hotspots
ScoreState — dimension scores, overall score, recommendations
OutputState — report markdown, agent files, readiness report (field-level merging across generators)

System Communication Flow

Frontend ──REST──► FastAPI ──triggers──► LangGraph Pipeline
Frontend ◄──WS─── FastAPI ◄──astream──  LangGraph Pipeline
                                            │
                              ┌──────────────┼──────────────┐
                              ▼              ▼              ▼
                         GitHub API     Gemini API     LangSmith
                         (clone)       (via LiteLLM)   (tracing)

REST for analysis submission and result retrieval
WebSocket for real-time, node-by-node progress streaming
SQLite for caching results keyed by (repo_url, commit_hash, depth)

Tech Stack

Backend

Technology	Version	Purpose
Python	3.12+	Runtime
LangGraph	0.4.1+	Pipeline orchestration — stateful graph with cycles, conditional edges, fan-out
FastAPI	0.115+	Async REST API + WebSocket server
Pydantic	2.11+	State schema, validation, settings management
LiteLLM	1.60+	Model-agnostic LLM wrapper (currently Gemini, swappable to any provider)
langchain-google-genai	2.1+	Gemini API integration
tree-sitter	0.24+	Multi-language AST parsing (Python, JavaScript, TypeScript)
GitPython	3.1+	Repository cloning and Git operations
aiosqlite	0.21+	Async SQLite for analysis caching
httpx	0.28+	Async HTTP client
uvicorn	0.34+	ASGI server
structlog	25.1+	Structured JSON logging
LangSmith	0.3+	Pipeline observability and tracing (optional)

Frontend

Technology	Version	Purpose
React	19	UI framework
TypeScript	5.9	Type-safe frontend development
Vite	5.4	Build tool and dev server
React Router	7.13	Client-side routing
Recharts	3.8	AI-readiness radar chart visualization
react-syntax-highlighter	16.1	Code block rendering in reports

Dev Tooling

Tool	Purpose
uv	Python package management (replaces pip/pipenv)
Ruff	Linting + formatting (replaces black, flake8, isort)
mypy	Static type checking with Pydantic plugin (strict mode)
pytest	Testing framework with async, coverage, and mock support
pre-commit	Git hooks for automated quality checks
ESLint	Frontend linting

Project Structure

codebase-onboarding-agent/
├── src/onboarding_agent/
│   ├── cli.py                          # CLI entry point
│   ├── config/
│   │   └── settings.py                 # Pydantic Settings — single env var entry point
│   ├── models/
│   │   ├── state.py                    # CodebaseState — central LangGraph state schema
│   │   └── types.py                    # Pydantic domain models
│   ├── services/
│   │   ├── llm.py                      # LiteLLM wrapper (model-agnostic, rate-limited)
│   │   ├── github.py                   # Repo cloning + URL validation
│   │   └── cache.py                    # SQLite analysis cache
│   ├── parsers/
│   │   ├── base.py                     # Abstract parser interface
│   │   ├── python_parser.py            # Python AST parser
│   │   └── typescript_parser.py        # JS/TS parser
│   ├── pipeline/
│   │   ├── graph.py                    # LangGraph graph definition
│   │   └── nodes/                      # One file per pipeline node
│   │       ├── structure_scanner.py    # N1
│   │       ├── dependency_analyzer.py  # N2
│   │       ├── module_deep_diver.py    # N3 (cyclic)
│   │       ├── pattern_detector.py     # N4
│   │       ├── ai_readiness_scorer.py  # N5
│   │       ├── output_router.py        # N6
│   │       ├── doc_generator.py        # N7a
│   │       ├── agent_file_generator.py # N7b
│   │       ├── readiness_report.py     # N7c
│   │       └── final_assembler.py      # N8
│   ├── api/
│   │   ├── app.py                      # FastAPI app with CORS, lifespan, middleware
│   │   ├── job_manager.py              # In-memory job queue with WebSocket subscriptions
│   │   └── routes/
│   │       ├── health.py               # Health check endpoint
│   │       ├── analyze.py              # Analysis submission + job status
│   │       └── qa.py                   # Follow-up Q&A endpoint
│   └── utils/
│       └── logging.py                  # structlog configuration
│
├── frontend/                           # React + TypeScript frontend
│   └── src/
│       ├── pages/                      # SubmitPage, ProgressPage, ResultsPage
│       ├── components/                 # ReportTab, DashboardTab, AgentFilesTab, QATab
│       ├── api/client.ts               # Backend HTTP + WebSocket client
│       └── types/                      # TypeScript type definitions
│
├── tests/
│   ├── conftest.py                     # Shared pytest fixtures
│   ├── unit/                           # Unit tests (no external deps)
│   ├── integration/                    # Integration tests (pipeline)
│   └── fixtures/sample_repos/          # Sample repos for parser tests
│
├── pyproject.toml                      # Single source of truth for all config
├── Makefile                            # Task runner
├── .env.example                        # Environment variable template
├── .pre-commit-config.yaml             # Git hooks (ruff, mypy, etc.)
└── .python-version                     # Python 3.12

Getting Started

Prerequisites

Python 3.12+
Node.js 18+ (for the frontend)
uv — Python package manager
A Gemini API key — get one at Google AI Studio
Git — for repository cloning

Installation

# Clone the repository
git clone https://github.com/your-username/codebase-onboarding-agent.git
cd codebase-onboarding-agent

# Install Python dependencies
make dev    # installs all deps + dev tools + pre-commit hooks

# Install frontend dependencies
cd frontend && npm install && cd ..

Environment Variables

Copy the example file and fill in your keys:

cp .env.example .env

Variable	Required	Default	Description
`GEMINI_API_KEY`	Yes	—	Gemini API key for LLM analysis
`GITHUB_TOKEN`	No	—	GitHub personal access token (required for private repos)
`LANGSMITH_API_KEY`	No	—	LangSmith API key for pipeline tracing
`LANGSMITH_PROJECT`	No	`codebase-onboarding-agent`	LangSmith project name
`LANGSMITH_TRACING`	No	`false`	Enable/disable LangSmith tracing
`LOG_LEVEL`	No	`INFO`	Logging level
`ANALYSIS_MAX_FILES`	No	`75`	Maximum files to analyze
`CLONE_DIR`	No	`/tmp/onboarding-agent-repos`	Temp directory for cloned repos

Running the Application

Option 1: API Server + Frontend (full web experience)

# Terminal 1 — Backend
make serve
# Runs on http://localhost:8000
# API docs at http://localhost:8000/docs

# Terminal 2 — Frontend
cd frontend && npm run dev
# Runs on http://localhost:5173

Option 2: CLI (quick one-off analysis)

onboarding-agent https://github.com/user/repo --depth standard --output-dir ./output

Usage

CLI

# Basic usage
onboarding-agent <repo_url>

# With options
onboarding-agent https://github.com/user/repo \
  --depth standard \              # quick (15 files) | standard (30) | deep (75)
  --output-dir ./results \        # where to write output files
  --agents claude copilot cline aider   # which AI context files to generate

Analysis Depths:

Depth	Max Files	Best For
`quick`	15	Fast overview, small repos
`standard`	30	Most repos (default)
`deep`	75	Large repos, thorough analysis

Web Interface

Submit — Enter a GitHub repo URL, choose analysis depth and agent files
Progress — Watch the pipeline execute node-by-node with live WebSocket updates
Results — Browse the interactive report with four tabs:
- Report — Full 12-section onboarding guide with syntax-highlighted code
- Dashboard — AI-readiness radar chart, overall score, and action items
- Agent Files — Preview and download generated context files
- Q&A — Ask follow-up questions about the analyzed codebase

API Endpoints

Method	Endpoint	Description
`GET`	`/health`	Health check
`POST`	`/api/analyze`	Submit a repo for analysis (returns `job_id`)
`GET`	`/api/job/{job_id}`	Poll job status and results
`WebSocket`	`/api/progress/{job_id}`	Stream real-time progress updates
`POST`	`/api/qa`	Ask follow-up questions about an analysis

Auto-generated API documentation is available at /docs (Swagger) and /redoc.

Output

The pipeline produces three artifacts:

1. Onboarding Report (`onboarding_report.md`)

A 12-section markdown document:

Project Identity Card — name, language, framework, key libraries
Quick Start — time to running, commands
Directory Structure — tree with purpose annotations
Configuration & Secrets — env vars, config files, examples
Architecture Overview — module dependency graph, entry points
Feature Flow Maps — critical user journeys traced through code
External Services / APIs — consumed services, auth, rate limits
Testing — framework, commands, coverage info
Development Workflow — CI/CD, linting, formatting, commit hooks
Patterns & Conventions — naming, error handling, data access patterns
Known Issues & Gotchas — pitfalls, debugging tips
Suggested First Tasks — orientation paths for new developers

2. AI Context Files

Ready-to-use configuration files for AI coding assistants:

File	AI Tool
`CLAUDE.md`	Claude Code
`copilot-instructions.md`	GitHub Copilot
`.clinerules`	Cline
`.aider.conf.yml`	Aider

3. AI-Readiness Report

A quantitative assessment across 6 weighted dimensions:

Dimension	Weight	What It Measures
Type Safety	25%	Type annotations, strict configs
Consistency	25%	Naming conventions, code patterns
Discoverability	15%	Documentation, file organization
Modularity	15%	Separation of concerns, coupling
Test Coverage	10%	Test presence, framework setup
Dependency Hygiene	10%	Pinned versions, minimal unused deps

Includes a radar chart visualization, an overall verdict, and a prioritized action plan.

Development

Code Quality

All tooling is configured in pyproject.toml — no separate config files:

make lint          # Ruff linting
make format        # Ruff formatting
make typecheck     # mypy with Pydantic plugin (strict mode)
make check         # lint + format-check + typecheck

Pre-commit hooks run automatically on every commit:

Ruff check (with auto-fix) + format
mypy type checking
Trailing whitespace, YAML/TOML validation, merge conflict detection

Testing

make test              # Full test suite with coverage
make test-unit         # Unit tests only (-m unit)
make test-integration  # Integration tests only (-m integration)
make test-fast         # Quick run without coverage

Tests use pytest-asyncio (auto mode), pytest-mock, and pytest-cov with branch coverage. Sample repos in tests/fixtures/ provide deterministic parser test inputs.

Makefile Commands

Command	Description
`make install`	Install production dependencies
`make dev`	Install all deps + dev tools + pre-commit hooks
`make lint`	Run Ruff linter
`make lint-fix`	Run Ruff linter with auto-fix
`make format`	Format code with Ruff
`make typecheck`	Run mypy (strict mode)
`make test`	Run pytest with coverage
`make test-fast`	Run pytest without coverage
`make all`	lint + format-check + typecheck + test
`make serve`	Start FastAPI dev server (port 8000)
`make clean`	Remove build artifacts and caches

Design Decisions

Decision	Rationale
LangGraph over raw LangChain	Needed cyclic processing (module deep-diver loop), conditional edges, and stateful fan-out — LangGraph's graph primitives model this naturally
Append-only state	Nodes write to their own sections of `CodebaseState` and never overwrite previous data, preventing data loss across iterations
LiteLLM as LLM wrapper	Keeps the system model-agnostic — swap Gemini for GPT-4, Claude, or any provider without changing pipeline code
tree-sitter for parsing	Language-agnostic AST parsing that scales to new languages via grammar packages, avoiding fragile regex-based approaches
Pydantic everywhere	Type-safe state schema, validated LLM outputs, settings management — one modeling library for the entire stack
SQLite caching	Zero-infrastructure caching keyed by `(repo_url, commit_hash, depth)` — avoids redundant analysis of unchanged repos
WebSocket progress streaming	Users see node-by-node pipeline progress in real time instead of waiting for a spinner
Single `pyproject.toml`	All tool configuration (Ruff, mypy, pytest, coverage) in one file — no config sprawl

Roadmap

Additional language support (Java, Go, Rust) via new tree-sitter grammars
Docker containerization for one-command deployment
Parallel fan-out execution for output generators (N7a/b/c)
Incremental re-analysis (only analyze changed files)
GitHub App integration for automatic analysis on PR events
Export reports to PDF / Notion

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/your-feature)
Set up the dev environment (make dev)
Make your changes and ensure all checks pass (make all)
Commit with a descriptive message
Open a pull request

Please ensure your code passes all linting, formatting, type checking, and tests before submitting.

License

This project is licensed under the MIT License — see the LICENSE file for details.

Built by abixdev with LangGraph, FastAPI, and React.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
frontend		frontend
media		media
src/onboarding_agent		src/onboarding_agent
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
project_spec.md		project_spec.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Codebase Onboarding Agent

Table of Contents

Problem Statement

What It Does

Demo

Architecture

Pipeline Overview

The 10-Node LangGraph Pipeline

State Management

System Communication Flow

Tech Stack

Backend

Frontend

Dev Tooling

Project Structure

Getting Started

Prerequisites

Installation

Environment Variables

Running the Application

Usage

CLI

Web Interface

API Endpoints

Output

1. Onboarding Report (onboarding_report.md)

2. AI Context Files

3. AI-Readiness Report

Development

Code Quality

Testing

Makefile Commands

Design Decisions

Roadmap

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. Onboarding Report (`onboarding_report.md`)

Packages