feat: YAML/code ingestion pipeline + search_kubeflow_code MCP tool #143
kmr-rohit wants to merge 7 commits into
…mage Builds a multi-arch (amd64 + arm64) Docker image from server-https/ and publishes it to ghcr.io on push to main, version tags, or manual dispatch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows testing against any OpenAI-compatible API (OpenAI, vLLM, LiteLLM) by passing LLM_API_KEY env var. Falls back to no auth header when unset, preserving existing in-cluster KServe behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
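The optional-auth behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `build_llm_headers` and the `env` parameter are hypothetical, and the `Bearer` scheme is assumed because it is what OpenAI-compatible APIs generally expect.

```python
import os
from typing import Dict, Mapping, Optional


def build_llm_headers(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return request headers, adding a bearer token only when LLM_API_KEY is set.

    `env` is a hypothetical injection point for testing; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    headers = {"Content-Type": "application/json"}
    api_key = env.get("LLM_API_KEY", "").strip()
    if api_key:
        # Hosted OpenAI-compatible endpoints typically expect a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    # With no key set, no Authorization header is sent (the in-cluster case).
    return headers
```

Calling `build_llm_headers()` with no `LLM_API_KEY` in the environment leaves the request unauthenticated, which matches the existing in-cluster KServe path described above.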
Implements Phase 2 of the GSoC plan:
- pipelines/code_utils.py: YAML-aware (split at --- boundaries, extract K8s metadata) and Python AST-aware (split at function/class boundaries) chunking
- pipelines/code-pipeline.py: KFP pipeline that downloads code from GitHub repos, chunks with structure-aware parsing, embeds, and stores in code_rag Milvus collection
- MCP server: add search_kubeflow_code tool with resource metadata display, refactor shared _search_collection helper
- Agent CRD: register search_kubeflow_code tool, update system message with 3-tool routing (docs, issues, code)
https://claude.ai/code/session_01VJvZ2bJirKM5eU9GoYeMho
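The two chunking strategies named in this commit can be illustrated with a small, dependency-free sketch. It is not the contents of `code_utils.py`: the function names are hypothetical, and the YAML side uses simple line matching instead of a real YAML parser (the PR lists pyyaml as a dependency), so it only handles plain block-style manifests.

```python
import ast
import re
from typing import Dict, List

# A line consisting only of the YAML document separator.
_DOC_SEPARATOR = re.compile(r"^---[ \t]*$", re.MULTILINE)


def _top_level_value(doc: str, key: str) -> str:
    """Read an unindented `key: value` pair such as `kind: Service`."""
    match = re.search(rf"^{re.escape(key)}:[ \t]*(\S+)", doc, re.MULTILINE)
    return match.group(1).strip("'\"") if match else ""


def _metadata_value(doc: str, key: str) -> str:
    """Read `key` from the direct children of the top-level `metadata:` block."""
    lines = doc.splitlines()
    start = next((i for i, l in enumerate(lines) if l.rstrip() == "metadata:"), None)
    if start is None:
        return ""
    child_indent = None
    for line in lines[start + 1:]:
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if indent == 0:
            break  # back at top level: the metadata block has ended
        if child_indent is None:
            child_indent = indent
        # Only direct children count, so `labels.name` is not mistaken for `name`.
        if indent == child_indent and line.lstrip().startswith(key + ":"):
            return line.split(":", 1)[1].strip().strip("'\"")
    return ""


def chunk_yaml(text: str) -> List[Dict[str, str]]:
    """Split a multi-document manifest at `---` and attach K8s metadata."""
    chunks = []
    for doc in _DOC_SEPARATOR.split(text):
        doc = doc.strip()
        if doc:
            chunks.append({
                "text": doc,
                "resource_kind": _top_level_value(doc, "kind"),
                "resource_name": _metadata_value(doc, "name"),
                "resource_namespace": _metadata_value(doc, "namespace"),
            })
    return chunks


def chunk_python(source: str) -> List[Dict[str, str]]:
    """Emit one chunk per top-level function or class definition."""
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment starts at the def/class line (decorators excluded).
            segment = ast.get_source_segment(source, node) or ""
            chunks.append({"name": node.name, "type": type(node).__name__, "text": segment})
    return chunks
```

In this sketch, module-level statements outside any function or class are dropped, and oversized chunks are not split further; the PR description mentions a `RecursiveCharacterTextSplitter` fallback for that case.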
[APPROVALNOTIFIER] This PR is NOT APPROVED
Mark Phase 3 as DONE — code ingestion pipeline + search_kubeflow_code MCP tool shipped in PR kubeflow#143. Update PR table, success criteria, KEP requirements, and next steps.
Hey @kmr-rohit, what does this PR do exactly? Is it like a repo code searcher or something else?
Hi @SanthoshToorpu, this PR adds an initial draft of the code/YAML file ingestion pipeline. It accepts a repo base URL and fetches all .yaml, .yml, .py, and .json files (Deployments, Services, ConfigMaps, etc.). It introduces a new MCP tool that can query manifest file chunks with a resource kind filter, so the agent can look up manifest files and variables used in pipelines from the ingested code chunks. It also introduces multi-tool calls when confidence is low, for better understanding. So a query like "How does Prometheus scrape pipeline metrics?" will probably invoke both the code search tool and the documentation search tool. There are still some improvements on my list: chunking support for more code file types, and more robust tool calling with a ReAct loop. This will also help in building the Developer Integration support mentioned here:
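The low-confidence fallback described in this comment can be pictured with a short sketch. In the PR the routing appears to live in the agent's system message, so this is only the same idea expressed in plain Python: the function name, the `score` field, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Hit = Dict[str, object]            # e.g. {"text": "...", "score": 0.73}
SearchFn = Callable[[str], List[Hit]]


def search_with_fallback(
    query: str,
    tools: Dict[str, SearchFn],
    order: List[str],
    min_score: float = 0.5,        # hypothetical confidence threshold
) -> List[Hit]:
    """Query tools in order and stop once one returns a confident hit."""
    merged: List[Hit] = []
    for name in order:
        hits = tools[name](query)
        # Tag each hit with the tool that produced it.
        merged.extend({**hit, "tool": name} for hit in hits)
        best = max((float(h["score"]) for h in hits), default=0.0)
        if best >= min_score:
            break                  # confident enough, skip the remaining tools
    # Highest-scoring hits first, regardless of which tool produced them.
    return sorted(merged, key=lambda h: float(h["score"]), reverse=True)
```

With `order=["docs", "code"]`, a weak docs result causes the code tool to be consulted as well, which is roughly what the Prometheus example above would trigger.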
Closing in favor of #205, which consolidates this PR along with #140 into a single clean PR. #205 includes all the work from here (code pipeline,
Test coverage for all of this is in #206.
Summary
- Code ingestion pipeline (code-pipeline.py) with YAML-aware chunking (splits at --- boundaries, extracts K8s metadata: kind, name, namespace) and Python AST-aware chunking (splits at function/class boundaries)
- search_kubeflow_code MCP tool with optional resource_kind filter to disambiguate similar resource types (e.g., Service vs ServiceAccount)
- Multi-tool strategy: calls both search_kubeflow_docs and search_kubeflow_code when the query spans docs and code, or when confidence is low

New files
- pipelines/code-pipeline.py - 3-step KFP pipeline: download → chunk+embed → store in Milvus code_rag collection
- pipelines/code_utils.py - YAML/Python/JSON parsing utilities with RecursiveCharacterTextSplitter fallback for oversized chunks

Modified files
- kagent-feast-mcp/mcp-server/server.py - Added search_kubeflow_code tool, extracted _search_collection() helper, added resource_kind filter support
- kagent-feast-mcp/manifests/kagent/setup.yaml - Updated Agent CRD with 3-tool routing and multi-tool strategy
- pipelines/requirements.txt - Added pyyaml

Milvus code_rag schema
Same as docs_rag plus: resource_kind (VARCHAR 128), resource_name (VARCHAR 256), resource_namespace (VARCHAR 256), file_type (VARCHAR 64)

Tested on OCI cluster
- kubeflow/manifests → applications/pipeline/upstream: 279 files → 477 chunks
- … Service/metadata-grpc-service at 0.73 cosine similarity

Pipeline bugs fixed during testing
- … docker.io/...)
- … sentence-transformers==3.3.1 + transformers==4.44.2 for PyTorch 2.3 compatibility
- … langchain with langchain-text-splitters (module moved in newer versions)

Test plan
- cd pipelines && python code-pipeline.py - produces code_rag_pipeline.yaml
- … kubeflow/manifests → applications/pipeline/upstream
- … code_rag collection - 477 chunks with correct resource_kind, resource_name, resource_namespace metadata
- search_kubeflow_code with resource_kind filter - filtering by Service returns only Service resources, not ServiceAccounts
- … search_kubeflow_code, docs queries route to search_kubeflow_docs
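The Service vs ServiceAccount check in the test plan comes down to exact matching on resource_kind, not substring matching. A minimal sketch of that idea follows; the helper names are hypothetical and the expression string is written in the style of a Milvus boolean filter, so treat it as an assumption about how server.py might build it, not as the PR's code.

```python
import json
from typing import Dict, List, Optional


def build_kind_filter(resource_kind: Optional[str]) -> str:
    """Build an exact-match filter expression; an empty string means no filter."""
    if not resource_kind:
        return ""
    # json.dumps adds the double quotes and escapes any embedded quote characters.
    return f"resource_kind == {json.dumps(resource_kind)}"


def filter_chunks(
    chunks: List[Dict[str, str]], resource_kind: Optional[str] = None
) -> List[Dict[str, str]]:
    """In-memory equivalent of the filter, handy for unit tests without Milvus."""
    if not resource_kind:
        return list(chunks)
    # Equality, not `in` or startswith, so "Service" never matches "ServiceAccount".
    return [c for c in chunks if c.get("resource_kind") == resource_kind]
```

For example, `build_kind_filter("Service")` yields the string `resource_kind == "Service"`, and `filter_chunks` applied to a mixed list keeps only the entries whose kind is exactly `Service`.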