nripankadas07/README.md

Nripanka Das

I build local-first evaluation infrastructure for coding agents.

Most AI coding demos answer a soft question: can the agent produce something that looks plausible? My work asks the harder engineering question:

Can an agent solve a real task mined from Git history, under hidden tests, with a trace we can inspect and a run ledger we can verify later?

That is the center of this GitHub profile: small, inspectable systems for agentic AI evaluation, RAG stress testing, reproducibility, context engineering, spec-driven workflow design, and correctness-focused developer tools.

Start Here

| If you want to see... | Open this first | What to look for |
| --- | --- | --- |
| A real coding-agent benchmark | PatchGym | Git-history task mining, hidden tests, oracle patches, reproducible runs. |
| A visual project demo | SpecForge live / source | Trend evidence, spec workflow graph, guardrails, and blueprint export. |
| Verifiable run evidence | ProofDeck | Static evidence bundles, audit scorecards, attestations, and Merkle roots. |
| Agent trace debugging | TraceWeave | Loop detection, causal edges, context drift, and failure-risk reports. |

Best Projects To Star

If one of these solves a real problem for you, starring that repository helps other developers find the work. Each one is built to stand alone, with local setup, tests, CI, docs, and a clear inspection path.

| Repository | Star It If You Care About... | Best First Action |
| --- | --- | --- |
| PatchGym | Local coding-agent benchmarks from real Git history | Run the demo task and inspect the generated manifest. |
| SpecForge | Turning high-signal project research into spec-driven build workflows | Open the live demo, select inspirations, and export a blueprint. |
| ProofDeck | Static, reviewable evidence bundles for agent runs | Build the demo deck and verify the bundle. |
| TraceWeave | Debugging agent traces, loops, tool churn, and context drift | Run it on a PatchGym trace and read the risk report. |
| SandboxLedger | Tamper-evident local run ledgers | Ingest a PatchGym run and verify the previous-hash chain. |
| RAGNeedle | Deterministic RAG retrieval stress tests | Generate a needle corpus and compare citation quality. |
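
The "previous-hash chain" mentioned for SandboxLedger can be illustrated with a minimal sketch. This is not SandboxLedger's actual format or API: the entry fields, the `GENESIS` sentinel, and the function names are all assumptions chosen to show the general technique, where each entry commits to the hash of the one before it.

```python
import hashlib
import json

# Hypothetical sentinel used as the "previous hash" of the first entry.
GENESIS = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: list, payload: dict) -> dict:
    """Append a new entry that points at the hash of the current last entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry = {
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    }
    ledger.append(entry)
    return entry


def verify_chain(ledger: list) -> bool:
    """Recompute every hash in order; any edited or reordered entry fails."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash covers the previous one, changing an early entry invalidates that entry and every link after it, which is what makes such a ledger tamper-evident rather than tamper-proof.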

The Flagship Stack

flowchart LR
  A["Git history"] --> B["PatchGym<br/>mine real coding-agent tasks"]
  B --> C["Hidden tests<br/>oracle patches<br/>validation command"]
  C --> D["Agent run"]
  D --> E["manifest.json<br/>trace.jsonl<br/>report.json"]
  E --> F["TraceWeave<br/>failure forensics"]
  E --> G["SandboxLedger<br/>tamper-evident ledger"]
  F --> L["ProofDeck<br/>static evidence deck"]
  G --> L
  H["Context Crucible"] --> D
  I["SpecMutate"] --> C
  J["RAGNeedle"] --> K["retrieval stress tests"]
  M["SpecForge<br/>trend evidence to spec workflow"] --> N["profile-grade project blueprint"]
  M --> H
| System | Role | Why It Is Worth Reading |
| --- | --- | --- |
| PatchGym | Local SWE-bench-style task miner and runner | Mines real Git history into hidden-test coding-agent tasks with auditable oracle patches. |
| TraceWeave | Agent trajectory forensics | Reads local traces and finds loops, tool churn, context drift, causal handoffs, and risk signals. |
| SandboxLedger | Reproducibility ledger | Hashes PatchGym run artifacts into an append-only ledger with previous-hash chaining and a Merkle root. |
| ProofDeck | Static evidence deck | Packages PatchGym, TraceWeave, and SandboxLedger artifacts into a verifiable HTML, JSON, and attestation bundle. |
| SpecForge | Spec-driven workflow studio | Ranks high-star GitHub/project signals, simulates guarded build workflows, and exports README-ready project blueprints. |
| Context Crucible | Coding-agent context packer | Scores repository files, budgets context, and guards against hidden-test or oracle leakage. |
| RAGNeedle | Adversarial RAG benchmark generator | Creates deterministic needle-in-corpus retrieval tasks with distractor pressure and citation metrics. |
| SpecMutate | Metamorphic test generator | Turns behavior specs into deterministic test vectors for parsers, CLIs, normalizers, and small tools. |
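
To make "finds loops" concrete, here is a rough sketch of one simple loop signal: a run of identical consecutive events in a trace. It is not TraceWeave's algorithm. The event shape (a dict with `tool` and `args` keys), the `min_repeats` threshold, and the function name are assumptions for illustration only.

```python
import json
from typing import Iterable, List, Tuple


def find_loops(events: Iterable[dict], min_repeats: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start_index, repeat_count, signature) for each run of
    identical consecutive events that is at least min_repeats long."""
    loops = []
    run_start, run_sig, run_len = 0, None, 0
    for i, event in enumerate(events):
        # Canonical JSON gives a stable signature regardless of key order.
        sig = json.dumps(event, sort_keys=True)
        if sig == run_sig:
            run_len += 1
            continue
        if run_sig is not None and run_len >= min_repeats:
            loops.append((run_start, run_len, run_sig))
        run_start, run_sig, run_len = i, sig, 1
    # Flush the final run, which the loop body never closes.
    if run_sig is not None and run_len >= min_repeats:
        loops.append((run_start, run_len, run_sig))
    return loops
```

In practice a trace would be read line by line from a JSONL file and each line parsed with `json.loads` before being passed in. A real analyzer would likely also look for longer repeating cycles, not only immediate repeats.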

One Run, Four Proof Layers

# Install PatchGym from source, then the three companion tools
git clone https://github.com/nripankadas07/patchgym
cd patchgym
python -m pip install -e ".[dev]"
python -m pip install git+https://github.com/nripankadas07/traceweave
python -m pip install git+https://github.com/nripankadas07/sandboxledger
python -m pip install git+https://github.com/nripankadas07/proofdeck
# Layer 1: run the demo task and keep its output directory
patchgym demo --keep-dir /tmp/patchgym-proof
# Layer 2: analyze the run trace
traceweave patchgym /tmp/patchgym-proof/runs/oracle --json
# Layer 3: record the run in a ledger and verify it
sandboxledger ingest-patchgym /tmp/patchgym-proof-ledger.jsonl /tmp/patchgym-proof/runs/oracle
sandboxledger verify /tmp/patchgym-proof-ledger.jsonl
# Layer 4: build the static evidence site and verify its bundle
proofdeck build /tmp/patchgym-proof/runs/oracle --ledger /tmp/patchgym-proof-ledger.jsonl --out /tmp/proofdeck-site
proofdeck verify /tmp/proofdeck-site/bundle.json

That flow produces:

  • a real mined coding-agent task;
  • hidden-test validation;
  • manifest.json with commit ids, patch hashes, artifact hashes, return codes, changed files, and totals;
  • trace.jsonl for forensic analysis;
  • a verifiable SandboxLedger record for the run;
  • a static ProofDeck site with a canonical bundle, audit scorecard, attestation file, and artifact Merkle root.
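
For readers unfamiliar with the "artifact Merkle root" in the last item, the following is a minimal sketch of how one root hash can summarize a set of artifact hashes. It does not claim to match ProofDeck's or SandboxLedger's construction: the leaf ordering, the odd-leaf duplication rule, and the function names are assumptions.

```python
import hashlib
from typing import List


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaf_hashes: List[str]) -> str:
    """Fold hex leaf hashes pairwise into a single root hash.

    Assumptions in this sketch: leaves are sorted so the root does not
    depend on input order, and an odd leaf is paired with itself.
    """
    if not leaf_hashes:
        return sha256_hex(b"")
    level = sorted(leaf_hashes)  # sorted() copies, so the caller's list is untouched
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

With this shape, changing any single artifact changes its leaf hash and therefore the root, so a reviewer can compare one short value instead of every file.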

This is the profile thesis in executable form: agent evaluation should leave evidence, not just screenshots.

Deep Inspection Path

| Time | Read / Run |
| --- | --- |
| 2 minutes | PatchGym README and bash scripts/demo.sh |
| 3 minutes | SpecForge live demo and source |
| 5 minutes | PatchGym reproducible runs |
| 7 minutes | TraceWeave PatchGym traces |
| 10 minutes | SandboxLedger PatchGym ingestion |
| 12 minutes | ProofDeck and proofdeck demo --out /tmp/proofdeck-demo |
| 15 minutes | Visible Agent Evaluation |

Why This Portfolio Exists

I use AI heavily, but I do not want AI-assisted software to be judged by vibes. The systems here are built around harder boundaries:

  • hidden tests instead of self-reported success;
  • traces instead of opaque agent transcripts;
  • manifests instead of loose claims;
  • hash ledgers instead of mutable screenshots;
  • local-first demos instead of hosted black boxes;
  • small parsers and utilities with adversarial tests instead of broad, untestable abstractions.

The result is a portfolio with one technical identity:

local-first infrastructure for evaluating, debugging, and hardening coding agents.

Supporting Systems

These repositories support the flagship stack without competing with it.

| Area | Projects |
| --- | --- |
| Agent and eval infrastructure | agent-framework, rag-pipeline, prompt-eval, token-counter, ai-toolkit |
| Correctness substrate | safejson, tomlmini, bencode, csvinfer, urlnorm, jsonptr, jsonpatch-lite |
| TypeScript systems primitives | decimal-ts, lru-ts, task-queue, tokenring-ts, eventbus-ts, decoder-ts |
| Local-first product labs | SpecForge, lanbeam, rssdeck, passhouse, syncplan, readmine, photoflow, dnswarden, medialoom, chatmux, uptimelog |

Every active repository is expected to have tests, CI, license metadata, issue templates, a pull request template, security notes, contribution notes, and a clear docs or examples surface.

Community Surface

The flagship repositories have Discussions enabled for design questions, evaluation ideas, benchmark comparisons, and integration notes.

Issues are kept for reproducible bugs, docs gaps, and scoped feature requests. For launch planning, copy, channel strategy, and star-growth operating notes, see the star growth playbook.

Public Audit

Last audited on June 13, 2026 across the live public GitHub profile.

| Signal | Current State |
| --- | --- |
| Public repositories | 118 total: 117 active, 1 archived scratchpad |
| Active repo hygiene | 117/117 have README, license metadata, license file, CI, issue templates, and PR templates |
| Latest completed CI | 117/117 active repos passing or queued at audit time |
| Docs/examples surface | 117/117 active repos |
| Research launch | 5 new local-first agent/eval projects shipped on May 28, 2026 |
| Evidence launch | ProofDeck shipped on June 6, 2026 as the static review layer for the flagship stack |
| Spec workflow launch | SpecForge shipped on June 13, 2026 as the profile-grade project selection and workflow studio |
| Flagship integration | PatchGym emits run manifests and traces; TraceWeave analyzes them; SandboxLedger records them; ProofDeck packages them |
| Open issue load | 0 open issues across active repositories at audit time |

Audit notes:

How I Work

I use AI for scaffolding, test generation, edge-case brainstorming, and first-pass documentation. The architecture, project boundaries, quality bar, final review, and public positioning are mine.

AI-assisted output has to survive source-checkout setup, local tests, CI, security notes, limitation notes, and manual review before it becomes part of the public portfolio. That is why the profile emphasizes reproducible demos and auditable artifacts instead of fake adoption badges or inflated benchmark claims.

Essays

Professional Context

This GitHub profile is intentionally code-first. Career credentials, product leadership context, and publication context live on LinkedIn.

For bugs, design questions, or focused collaboration, open an issue on the relevant repository. For profile-level context, use nripankadas07/nripankadas07.

Pinned

  1. patchgym (Public)

    Turn any Git repository into a local SWE-bench-style coding-agent benchmark.

    Python

  2. agent-framework (Public)

    Tiny inspectable agent runtime with tools, memory, tracing, and safe no-key examples.

    Python

  3. rag-pipeline (Public)

    Local-first RAG pipeline with chunking, retrieval, evaluation, and reports.

    Python

  4. decimal-ts (Public)

    Exact fixed-point decimal arithmetic for money-style calculations in TypeScript.

    TypeScript

  5. prompt-eval (Public)

    Prompt regression testing for CI with deterministic judges and no-key demos.

    Python

  6. safejson (Public)

    Security-conscious JSON parsing with duplicate-key detection and depth/size limits.

    Python
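
As a small illustration of the duplicate-key detection and size limits described for safejson, the sketch below uses only the standard library's `json.loads` with its `object_pairs_hook` parameter. It is not safejson's API: `strict_loads`, `DuplicateKeyError`, and the `max_bytes` default are hypothetical names and values, and depth limiting is left out.

```python
import json


class DuplicateKeyError(ValueError):
    """Raised when a JSON object repeats a key (hypothetical error type)."""


def _reject_duplicates(pairs):
    """object_pairs_hook: build a dict, failing on the first repeated key."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise DuplicateKeyError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


def strict_loads(text: str, max_bytes: int = 1_000_000):
    """Parse JSON text, rejecting oversized input and duplicate object keys."""
    if len(text.encode("utf-8")) > max_bytes:
        raise ValueError("input exceeds size limit")
    return json.loads(text, object_pairs_hook=_reject_duplicates)
```

By default `json.loads` silently keeps the last value for a repeated key, which is why an explicit hook is needed when two parsers must agree on what a document says.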