feat: YAML/code ingestion pipeline + search_kubeflow_code MCP tool#143

Closed
kmr-rohit wants to merge 7 commits into
kubeflow:mainfrom
kmr-rohit:feat/code-pipeline-clean
@kmr-rohit kmr-rohit commented Mar 18, 2026

Summary

  • Add YAML/code ingestion KFP pipeline (code-pipeline.py) with YAML-aware chunking (splits at --- boundaries, extracts K8s metadata: kind, name, namespace) and Python AST-aware chunking (splits at function/class boundaries)
  • Add search_kubeflow_code MCP tool with optional resource_kind filter to disambiguate similar resource types (e.g., Service vs ServiceAccount)
  • Update Agent CRD with multi-tool strategy: call both search_kubeflow_docs and search_kubeflow_code when query spans docs and code, or when confidence is low
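For readers unfamiliar with the YAML-aware chunking described above, here is a minimal sketch of the idea, not the code in code_utils.py. It assumes PyYAML is installed; the function name `chunk_yaml_documents` and the sample manifest (including the ServiceAccount name) are illustrative assumptions.

```python
import yaml  # PyYAML, the dependency this PR adds to requirements.txt


def chunk_yaml_documents(text: str) -> list[dict]:
    """Split a multi-document YAML string at --- boundaries.

    Each document becomes one chunk carrying the K8s metadata
    (kind, name, namespace) alongside its serialized content.
    """
    chunks = []
    for doc in yaml.safe_load_all(text):
        if not isinstance(doc, dict):
            continue  # skip empty documents and non-mapping content
        metadata = doc.get("metadata")
        if not isinstance(metadata, dict):
            metadata = {}
        chunks.append({
            "resource_kind": doc.get("kind", ""),
            "resource_name": metadata.get("name", ""),
            "resource_namespace": metadata.get("namespace", ""),
            "content": yaml.safe_dump(doc, sort_keys=False).strip(),
        })
    return chunks


# Hypothetical sample input, made up for illustration only.
SAMPLE = """\
apiVersion: v1
kind: Service
metadata:
  name: metadata-grpc-service
  namespace: kubeflow
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metadata-grpc-server
  namespace: kubeflow
"""

if __name__ == "__main__":
    for chunk in chunk_yaml_documents(SAMPLE):
        print(chunk["resource_kind"], chunk["resource_name"], chunk["resource_namespace"])
```

Keeping one resource per chunk is what lets the kind, name, and namespace travel with the embedded text as filterable fields.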

New files

  • pipelines/code-pipeline.py — 3-step KFP pipeline: download → chunk+embed → store in Milvus code_rag collection
  • pipelines/code_utils.py — YAML/Python/JSON parsing utilities with RecursiveCharacterTextSplitter fallback for oversized chunks
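The Python side of the chunking can be pictured with the standard library `ast` module. This is only a sketch under assumptions: the helper name `chunk_python_source` and the returned dictionary keys are hypothetical, it only looks at top-level definitions, and it leaves out the RecursiveCharacterTextSplitter fallback for oversized chunks.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class in the source."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "type": "class" if isinstance(node, ast.ClassDef) else "function",
                # Exact source text of the definition (decorators are not included).
                "content": ast.get_source_segment(source, node),
            })
    return chunks


# Hypothetical sample module, made up for illustration only.
SAMPLE_SOURCE = '''\
def download(url):
    return url


class Embedder:
    def embed(self, text):
        return [0.0]
'''

if __name__ == "__main__":
    for chunk in chunk_python_source(SAMPLE_SOURCE):
        print(chunk["type"], chunk["name"])
```

Splitting at definition boundaries keeps a function body together in one chunk instead of cutting it at an arbitrary character offset.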

Modified files

  • kagent-feast-mcp/mcp-server/server.py — Added search_kubeflow_code tool, extracted _search_collection() helper, added resource_kind filter support
  • kagent-feast-mcp/manifests/kagent/setup.yaml — Updated Agent CRD with 3-tool routing and multi-tool strategy
  • pipelines/requirements.txt — Added pyyaml

Milvus code_rag schema

Same as docs_rag plus: resource_kind (VARCHAR 128), resource_name (VARCHAR 256), resource_namespace (VARCHAR 256), file_type (VARCHAR 64)
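To illustrate how the resource_kind field can disambiguate Service from ServiceAccount, here is a sketch of building a Milvus boolean filter expression with exact equality. The helper name `build_kind_filter` is an assumption; the filter code in server.py is not shown in this PR description and may differ.

```python
from typing import Optional

MAX_KIND_LENGTH = 128  # matches the VARCHAR 128 limit in the schema above


def build_kind_filter(resource_kind: Optional[str]) -> str:
    """Build an exact-match filter expression, or "" when no filter is requested."""
    if not resource_kind:
        return ""
    if len(resource_kind) > MAX_KIND_LENGTH:
        raise ValueError("resource_kind exceeds the schema's VARCHAR limit")
    # Escape backslashes and double quotes so the value cannot break out of the literal.
    escaped = resource_kind.replace("\\", "\\\\").replace('"', '\\"')
    return f'resource_kind == "{escaped}"'


if __name__ == "__main__":
    print(build_kind_filter("Service"))
    print(repr(build_kind_filter(None)))
```

Because the comparison is equality rather than a prefix or substring match, a filter on Service would not match rows whose kind is ServiceAccount.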

Tested on OCI cluster

  • Ingested applications/pipeline/upstream from kubeflow/manifests: 279 files → 477 chunks
  • Semantic search verified: "metadata grpc service" → Service/metadata-grpc-service at 0.73 cosine similarity
  • E2E tested through kagent UI with code, docs, and cross-tool queries
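For context on the 0.73 score above, cosine similarity between a query embedding and a chunk embedding can be computed as below. This is a plain-Python sketch for illustration; in the pipeline the score would come from Milvus, and the function name and the toy vectors are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.
    print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```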

Pipeline bugs fixed during testing

  • CRI-O requires fully qualified Docker image names (docker.io/...)
  • Pinned sentence-transformers==3.3.1 + transformers==4.44.2 for PyTorch 2.3 compatibility
  • Replaced langchain with langchain-text-splitters (module moved in newer versions)

Test plan

  • Compile pipeline: cd pipelines && python code-pipeline.py — produces code_rag_pipeline.yaml
  • Run pipeline on KFP — ingested 279 files from applications/pipeline/upstream in kubeflow/manifests
  • Verify code_rag collection — 477 chunks with correct resource_kind, resource_name, resource_namespace metadata
  • Test search_kubeflow_code with resource_kind filter — filtering by Service returns only Service resources, not ServiceAccounts
  • Test agent routing — code queries route to search_kubeflow_code, docs queries route to search_kubeflow_docs
  • Test multi-tool strategy — ambiguous queries (e.g., "How does Prometheus scrape pipeline metrics?") invoke both tools
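The multi-tool strategy is configured through the Agent CRD, so the agent itself decides which tools to call. Purely to illustrate the decision rule ("call the second tool when confidence is low"), here is a sketch in code. Everything in it is hypothetical: the function names, the stub search functions, their results (apart from the 0.73 hit reported above), and the 0.5 threshold.

```python
from typing import Callable

Hit = tuple[str, float]  # (result label, similarity score)
SearchFn = Callable[[str], list[Hit]]


def search_with_fallback(
    query: str,
    primary: SearchFn,
    secondary: SearchFn,
    min_score: float = 0.5,  # assumed threshold, not taken from the PR
) -> list[Hit]:
    """Query the primary tool; also query the secondary one if the best hit is weak."""
    results = list(primary(query))
    top_score = max((score for _, score in results), default=0.0)
    if top_score < min_score:
        results.extend(secondary(query))
    return sorted(results, key=lambda hit: hit[1], reverse=True)


# Stub tools standing in for search_kubeflow_code and search_kubeflow_docs.
def fake_code_search(query: str) -> list[Hit]:
    return [("Service/metadata-grpc-service", 0.73)]


def fake_docs_search(query: str) -> list[Hit]:
    return [("docs: pipelines overview", 0.40)]


if __name__ == "__main__":
    print(search_with_fallback("metadata grpc service", fake_code_search, fake_docs_search))
    print(search_with_fallback("prometheus metrics", fake_docs_search, fake_code_search))
```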

kmr-rohit and others added 7 commits March 3, 2026 15:54
…mage

Builds a multi-arch (amd64 + arm64) Docker image from server-https/
and publishes it to ghcr.io on push to main, version tags, or manual dispatch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows testing against any OpenAI-compatible API (OpenAI, vLLM, LiteLLM)
by passing LLM_API_KEY env var. Falls back to no auth header when unset,
preserving existing in-cluster KServe behaviour.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements Phase 2 of the GSoC plan:
- pipelines/code_utils.py: YAML-aware (split at --- boundaries, extract K8s
  metadata) and Python AST-aware (split at function/class boundaries) chunking
- pipelines/code-pipeline.py: KFP pipeline that downloads code from GitHub repos,
  chunks with structure-aware parsing, embeds, and stores in code_rag Milvus collection
- MCP server: add search_kubeflow_code tool with resource metadata display,
  refactor shared _search_collection helper
- Agent CRD: register search_kubeflow_code tool, update system message with
  3-tool routing (docs, issues, code)

https://claude.ai/code/session_01VJvZ2bJirKM5eU9GoYeMho
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kmr-rohit added a commit to kmr-rohit/docs-agent that referenced this pull request Mar 18, 2026
Mark Phase 3 as DONE — code ingestion pipeline + search_kubeflow_code
MCP tool shipped in PR kubeflow#143. Update PR table, success criteria,
KEP requirements, and next steps.
@SanthoshToorpu
Contributor

Hey @kmr-rohit, what does this PR do exactly? Is it like a repo code searcher or something else?

@kmr-rohit
Author

kmr-rohit commented Mar 19, 2026

Hey @kmr-rohit, what does this PR do exactly? Is it like a repo code searcher or something else?

Hi @SanthoshToorpu, this PR adds an initial draft of a code/YAML file ingestion pipeline. It accepts a repo base URL and fetches all .yaml, .yml, .py, and .json files (Deployments, Services, ConfigMaps, etc.). It also introduces a new MCP tool that can query manifest file chunks with a resource kind filter, so the agent can look up manifest files and variables used in pipelines from the ingested code chunks. It also introduces multi-tool calls when confidence is low, for better understanding.

So a query like "How does Prometheus scrape pipeline metrics?" will probably invoke both the code search tool and the documentation search tool.

There are still some improvements on my list: chunking support for more code file types, and more robust tool calling with a ReAct loop. This will also help in building the developer integration support mentioned here:
https://github.com/kubeflow/docs-agent/blob/main/gsoc2026_agentic_rag.md#6-direct-ide-integration-the-byo-agent-experience

@kmr-rohit
Author

Closing in favor of #205, which consolidates this PR along with #140 into a single clean PR.

#205 includes all the work from here (code pipeline, search_kubeflow_code tool) plus:

Test coverage for all of this is in #206.

@kmr-rohit kmr-rohit closed this May 6, 2026
@cursor cursor Bot deleted the feat/code-pipeline-clean branch May 28, 2026 18:41