95 changes: 95 additions & 0 deletions programs/lfx-mentorship/2026/02-Jun-Aug/README.md
@@ -49,6 +49,11 @@ Mentee application instructions can be found on the [Program Guidelines](https:/
- [Integrating Kmesh into Headlamp UI](#integrating-kmesh-into-headlamp-ui)
- [Knative Functions](#knative-functions)
- [End-to-End Agentic Workflow for Serverless Functions](#end-to-end-agentic-workflow-for-serverless-functions)
- [Koordinator](#koordinator)
- [End-to-End Performance and Scalability Test Harness for Koordinator Scheduler](#end-to-end-performance-and-scalability-test-harness-for-koordinator-scheduler)
- [Fragmentation-Aware Rescheduling and Scale-Down Binpack for Koordinator](#fragmentation-aware-rescheduling-and-scale-down-binpack-for-koordinator)
- [Native Support for AI Agent Orchestration and High-Concurrency Sandbox Scheduling](#native-support-for-ai-agent-orchestration-and-high-concurrency-sandbox-scheduling)
- [Plugin-level Cache and Nominator Shared Across Scheduler Profiles](#plugin-level-cache-and-nominator-shared-across-scheduler-profiles)
- [krkn - Chaos](#krkn---chaos)
- [Automated Documentation Sync Bot for Krkn-Chaos Projects](#automated-documentation-sync-bot-for-krkn-chaos-projects)
- [Dynamic Cluster-Aware Configuration Generation for Krkn-AI](#dynamic-cluster-aware-configuration-generation-for-krkn-ai)
@@ -372,6 +377,96 @@ CNCF - Knative Functions: End-to-End Agentic Workflow for Serverless Functions (
- Upstream Issue: https://github.com/knative/func/issues/3646
- LFX URL: https://mentorship.lfx.linuxfoundation.org/project/26086dfb-1c88-487c-a2de-e11cfc857c1a

### Koordinator

#### Native Support for AI Agent Orchestration and High-Concurrency Sandbox Scheduling

CNCF - Koordinator: Native Support for AI Agent Orchestration and Sandbox Scheduling (2026 Term 2)

- Description: AI agentic workflows (multi-agent collaboration frameworks, automated evaluation systems such as SWE-bench) frequently spawn and tear down large numbers of short-lived sandboxes (gVisor, Wasm, Kata Containers) to execute skills or tools. Current Kubernetes scheduling models are too heavy for these transient workloads, causing high latency, weak isolation, and APIServer pressure. This project will extend Koordinator to provide first-class support for AI agent scenarios by integrating sandbox runtimes with Koordinator QoS classes, implementing equivalence-class scheduling for near-identical sandbox requests, and introducing a dedicated Sandbox Pipeline that handles pre-warming, capacity reservation, and actionable scheduling feedback to enable agent self-healing.
- Expected Outcome:
  - Native integration of agent sandbox runtimes (gVisor/Wasm/Kata) with Koordinator QoS classes (LSR, LS, BE) and resource hardening templates
  - Equivalence-class scheduling logic in the Koordinator scheduler that skips redundant Predicate/Priority calculations for batches of similar sandbox pods
  - A Sandbox Pipeline mechanism with pre-warming, automated environment provisioning, and reservation hooks
  - Annotation/Status-based scheduling diagnostics that surface actionable next-step suggestions to upstream agents
  - End-to-end tests and benchmarks demonstrating throughput and latency improvements for high-concurrency sandbox workloads
- Recommended Skills:
  - Go programming
  - Kubernetes scheduler-framework, CRDs, controllers
  - Familiarity with sandbox runtimes (gVisor/Kata/Wasm)
  - Performance profiling and benchmarking
- Mentor(s):
  - Rougang Han (@saintube, saintube@foxmail.com)
  - Jianyu Wang (@ZiMengSheng, zmsjianyu@gmail.com)
- Upstream Issue: https://github.com/koordinator-sh/koordinator/issues/2879
- LFX URL: https://mentorship.lfx.linuxfoundation.org/project/ad7652bf-cf72-4f6b-b26c-93ea4e8fb48a
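
The equivalence-class idea in the description above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `SandboxRequest`, `EquivalenceKey`, `ClassCache`, and `FilterFunc` are hypothetical names, not Koordinator or Kubernetes APIs, and the choice of fields that define a class is a guess. The sketch only shows the general technique: hash the scheduling-relevant fields of a request, and reuse the filtering result for later requests with the same hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// SandboxRequest is a hypothetical, simplified stand-in for the
// scheduling-relevant fields of a sandbox pod. It is not a Koordinator type.
type SandboxRequest struct {
	RuntimeClass string // e.g. "gvisor" or "kata"
	QoSClass     string // e.g. "LSR", "LS", "BE"
	MilliCPU     int64
	MemoryMiB    int64
	NodeSelector map[string]string
}

// EquivalenceKey hashes the fields that affect filtering, so two requests
// with the same key can share one filtering result.
func EquivalenceKey(r SandboxRequest) string {
	keys := make([]string, 0, len(r.NodeSelector))
	for k := range r.NodeSelector {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random; sort for a stable key

	var b strings.Builder
	fmt.Fprintf(&b, "%s|%s|%d|%d", r.RuntimeClass, r.QoSClass, r.MilliCPU, r.MemoryMiB)
	for _, k := range keys {
		fmt.Fprintf(&b, "|%s=%s", k, r.NodeSelector[k])
	}
	sum := sha256.Sum256([]byte(b.String()))
	return hex.EncodeToString(sum[:8])
}

// FilterFunc stands in for the expensive per-pod filtering pass.
type FilterFunc func(r SandboxRequest) []string

// ClassCache remembers the feasible nodes computed for each equivalence class.
type ClassCache struct {
	feasible map[string][]string
	Hits     int
	Misses   int
}

// NewClassCache returns an empty cache.
func NewClassCache() *ClassCache {
	return &ClassCache{feasible: make(map[string][]string)}
}

// FeasibleNodes runs filter only for the first request of each class and
// returns the cached result for every later request of the same class.
func (c *ClassCache) FeasibleNodes(r SandboxRequest, filter FilterFunc) []string {
	key := EquivalenceKey(r)
	if nodes, ok := c.feasible[key]; ok {
		c.Hits++
		return nodes
	}
	c.Misses++
	nodes := filter(r)
	c.feasible[key] = nodes
	return nodes
}

// Invalidate drops all cached results. A real scheduler would need an
// equivalent step whenever node state changes; that part is not modelled here.
func (c *ClassCache) Invalidate() {
	c.feasible = make(map[string][]string)
}

func main() {
	cache := NewClassCache()
	filter := func(r SandboxRequest) []string { return []string{"node-a", "node-b"} }
	req := SandboxRequest{RuntimeClass: "gvisor", QoSClass: "BE", MilliCPU: 250, MemoryMiB: 256}
	for i := 0; i < 100; i++ {
		cache.FeasibleNodes(req, filter)
	}
	fmt.Printf("misses=%d hits=%d\n", cache.Misses, cache.Hits)
}
```

The sketch ignores the hard part, which is deciding when a cached result is stale. Any real design would have to tie invalidation to node and pod events rather than a manual `Invalidate` call.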

#### End-to-End Performance and Scalability Test Harness for Koordinator Scheduler

CNCF - Koordinator: End-to-End Performance and Scalability Test Harness for Scheduler (2026 Term 2)

- Description: As the Koordinator scheduler grows additional plugins (NUMA, Device Sharing, Reservation, Coscheduling, ElasticQuota), the project needs a continuous, lightweight performance harness to detect regressions and validate scaling targets defined in the 2026 roadmap. This project will build a Kwok-based simulation framework that drives realistic workload mixes (LSR/LS/BE, gang, reservations, GPU sharing) at scale (target 10k-100k pods), captures throughput, queueing latency, and per-plugin profiling data, and publishes results from a scheduled GitHub Actions workflow.
- Expected Outcome:
  - Kwok-based simulation harness with reusable workload generators for the major Koordinator plugins
  - Scheduled GitHub Actions workflow that runs the harness on each release branch and publishes Markdown reports
  - Baseline performance metrics and regression-detection thresholds for the 2026 release line
  - Profiling integration (pprof, scheduler_perf) and dashboards for triage
  - Documentation for contributors on adding new scenarios
- Recommended Skills:
  - Go programming
  - Kwok, kube-scheduler internals, scheduler_perf
  - GitHub Actions and shell scripting
- Mentor(s):
  - Jianyu Wang (@ZiMengSheng, zmsjianyu@gmail.com)
  - Rougang Han (@saintube, saintube@foxmail.com)
- Upstream Issue: https://github.com/koordinator-sh/koordinator/milestone/19
- LFX URL: https://mentorship.lfx.linuxfoundation.org/project/ffaf5547-3e90-42e5-a136-d9bc2e80c001

#### Plugin-level Cache and Nominator Shared Across Scheduler Profiles

CNCF - Koordinator: Plugin-level Cache and Nominator Shared Across Scheduler Profiles (2026 Term 2)

- Description: Today most Koordinator scheduler plugins (DeviceShare, NodeNUMAResource, Reservation, etc.) keep their own private cache per scheduling framework/profile, so configuring multiple profiles in a single `KubeSchedulerConfiguration` produces duplicated plugin caches even though the upstream `SchedulerCache` and `PodNominator` are singletons per scheduler instance. This duplication wastes memory, desynchronizes state across profiles, and exposes correctness bugs: in `RunFilterPluginsWithNominatedPods`, `AddPod` is invoked for pods that are still being scheduled and are therefore missing from the used map of `nodeDeviceCache`, which causes DeviceShare and NodeNUMAResource to fail to reserve resources for high-priority pods in RDMA / VF / NV-Switch scenarios. This project will introduce a unified plugin-level Cache and Nominator abstraction that is shared across all profiles in the same scheduler instance, integrates correctly with the `Reserve` / `Unreserve` lifecycle hooks, and fixes the `AddPod` / `RemovePod` handling for high-priority device and NUMA workloads.
- Expected Outcome:
  - A shared plugin-level `Cache` and `Nominator` framework exposed to Koordinator scheduler plugins, with a lifecycle tied to the scheduler instance rather than individual profiles
  - Migration of DeviceShare, NodeNUMAResource, and Reservation plugin caches to the shared implementation, removing per-profile duplication
  - Correct `AddPod` / `RemovePod` handling via reused `Reserve` logic that records assignment placements in the nominator, cleared symmetrically on `Reserve` / `Unreserve`
  - Regression and stress tests covering multi-profile scheduling, preemption with nominated pods, and high-priority RDMA / VF / NV-Switch device assignment
  - Developer documentation describing the shared Cache / Nominator API and migration guidance for other plugins
- Recommended Skills:
  - Go programming
  - Deep understanding of the Kubernetes scheduler-framework (Cache, PodNominator, Reserve/Unreserve)
  - Familiarity with device plugins, NUMA topology, and preemption semantics
  - Concurrency and cache-invalidation reasoning
- Mentor(s):
  - Rougang Han (@saintube, saintube@foxmail.com)
  - Jianyu Wang (@ZiMengSheng, zmsjianyu@gmail.com)
- Upstream Issue: https://github.com/koordinator-sh/koordinator/issues/2749, https://github.com/koordinator-sh/koordinator/issues/1959
- LFX URL: https://mentorship.lfx.linuxfoundation.org/project/f4a8321c-4135-49df-bbfc-e8b297bb4b73
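
As a rough illustration of what "a lifecycle tied to the scheduler instance rather than individual profiles" could mean, the sketch below has every profile ask a single registry for its plugin cache, so all profiles see the same state. All names (`DeviceCache`, `SharedCaches`, `Profile`, `NewProfile`, and the profile names in `main`) are hypothetical and do not reflect the actual Koordinator code or the design the project will end up with.

```go
package main

import (
	"fmt"
	"sync"
)

// DeviceCache is a hypothetical, heavily simplified plugin cache that tracks
// how many devices are in use per node.
type DeviceCache struct {
	mu   sync.Mutex
	used map[string]int
}

// Reserve records n devices as used on node.
func (c *DeviceCache) Reserve(node string, n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.used[node] += n
}

// Unreserve undoes a previous Reserve of n devices on node.
func (c *DeviceCache) Unreserve(node string, n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.used[node] -= n
	if c.used[node] <= 0 {
		delete(c.used, node)
	}
}

// Used returns the number of devices currently recorded as used on node.
func (c *DeviceCache) Used(node string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.used[node]
}

// SharedCaches hands out exactly one cache per plugin name for the whole
// scheduler instance, regardless of how many profiles request it.
type SharedCaches struct {
	mu     sync.Mutex
	caches map[string]*DeviceCache
}

// NewSharedCaches returns an empty registry.
func NewSharedCaches() *SharedCaches {
	return &SharedCaches{caches: make(map[string]*DeviceCache)}
}

// GetOrCreate returns the existing cache for plugin, creating it on first use.
func (s *SharedCaches) GetOrCreate(plugin string) *DeviceCache {
	s.mu.Lock()
	defer s.mu.Unlock()
	if c, ok := s.caches[plugin]; ok {
		return c
	}
	c := &DeviceCache{used: make(map[string]int)}
	s.caches[plugin] = c
	return c
}

// Profile models one scheduler profile that needs the plugin's cache.
type Profile struct {
	Name    string
	Devices *DeviceCache
}

// NewProfile wires a profile to the shared cache instead of a private one.
func NewProfile(name string, shared *SharedCaches) *Profile {
	return &Profile{Name: name, Devices: shared.GetOrCreate("DeviceShare")}
}

func main() {
	shared := NewSharedCaches()
	p1 := NewProfile("profile-a", shared)
	p2 := NewProfile("profile-b", shared)

	// A reservation made through one profile is visible through the other.
	p1.Devices.Reserve("node-a", 2)
	fmt.Println(p2.Devices.Used("node-a"))
}
```

With private per-profile caches, the second profile would report zero devices in use on `node-a`, which is the desynchronization the description refers to.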

#### Fragmentation-Aware Rescheduling and Scale-Down Binpack for Koordinator

CNCF - Koordinator: Fragmentation-Aware Rescheduling and Scale-Down Binpack (2026 Term 2)

- Description: Resource fragmentation in a cluster has two common sources: (1) intra-node imbalance between resource types (for example, a node reaches 90% CPU allocation while memory allocation stays at 50%, leaving memory stranded); and (2) scale-down that evicts the "wrong" pods, leaving many nodes partially filled instead of fully drained. Koordinator currently has no dedicated strategy for either case. This project will add two coordinated capabilities: a descheduler plugin that detects nodes whose per-resource allocation imbalance (measured via standard deviation, consistent with the kube-scheduler `NodeResourcesBalancedAllocation` plugin) exceeds a configurable threshold and evicts the pods whose removal most reduces that standard deviation; and a scale-down binpack strategy that, when upper-layer controllers scale workloads down, picks pod victims so that the remaining allocations are concentrated on fewer nodes (inspired by Volcano's scale-down binpack). Both capabilities will reuse Koordinator's `ReservationFirst` eviction mode to preserve workload stability, and cooperate with the scheduler's `NodeResourcesBalancedAllocation` plugin so that rebalanced pods land on better-balanced nodes.
- Expected Outcome:
  - New descheduler plugin that detects intra-node resource-type imbalance above a configurable threshold using standard deviation across CPU/memory (and optionally other resources)
  - Pod eviction prioritization that picks the pods whose eviction most reduces the per-node standard deviation
  - Scale-down binpack strategy exposed to upper-layer controllers so scale-down victims are chosen to reduce cluster-level fragmentation
  - Integration with the `ReservationFirst` eviction mode to preserve workload stability
  - Coordination with the scheduler `NodeResourcesBalancedAllocation` plugin so evicted pods land on better-balanced nodes
  - Unit tests, e2e tests, and a documented benchmark showing reduced resource fragmentation for both rescheduling and scale-down scenarios
  - User-facing documentation and example descheduler / scale-down configurations
- Recommended Skills:
  - Go programming
  - Kubernetes descheduler and scheduler-framework internals
  - Understanding of resource allocation metrics, binpack algorithms, and eviction safety
- Mentor(s):
  - Tao Song (@songtao98, songtao2603060@gmail.com)
  - Jianyu Wang (@ZiMengSheng, zmsjianyu@gmail.com)
- Upstream Issue: https://github.com/koordinator-sh/koordinator/issues/2332, https://github.com/koordinator-sh/koordinator/issues/2790
- LFX URL: https://mentorship.lfx.linuxfoundation.org/project/448a259f-07a9-4e1e-94c2-d9994396498d
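
The imbalance metric in the description above can be made concrete with a small sketch. It assumes the population standard deviation over per-resource allocation fractions, which for the 90% CPU / 50% memory example gives 0.2. The types and functions (`Resources`, `Pod`, `ImbalanceStd`, `NodeStd`, `BestVictim`) are hypothetical names chosen for illustration, and the greedy single-pod selection is only one possible reading of "evicts pods whose removal most reduces the standard deviation".

```go
package main

import (
	"fmt"
	"math"
)

// Resources is a hypothetical two-resource vector (CPU and memory only).
type Resources struct {
	MilliCPU  int64
	MemoryMiB int64
}

// Pod is a hypothetical pod with its resource requests.
type Pod struct {
	Name     string
	Requests Resources
}

// ImbalanceStd returns the population standard deviation of the given
// allocation fractions. For two fractions a and b this equals |a-b|/2.
func ImbalanceStd(fractions ...float64) float64 {
	if len(fractions) == 0 {
		return 0
	}
	var mean float64
	for _, f := range fractions {
		mean += f
	}
	mean /= float64(len(fractions))

	var variance float64
	for _, f := range fractions {
		variance += (f - mean) * (f - mean)
	}
	return math.Sqrt(variance / float64(len(fractions)))
}

// NodeStd computes the imbalance of a node from its allocatable and allocated
// resources. It assumes allocatable values are non-zero.
func NodeStd(allocatable, allocated Resources) float64 {
	cpu := float64(allocated.MilliCPU) / float64(allocatable.MilliCPU)
	mem := float64(allocated.MemoryMiB) / float64(allocatable.MemoryMiB)
	return ImbalanceStd(cpu, mem)
}

// BestVictim returns the single pod whose eviction lowers the node's
// imbalance the most. ok is false when no eviction improves the imbalance.
func BestVictim(allocatable Resources, pods []Pod) (victim Pod, stdAfter float64, ok bool) {
	var allocated Resources
	for _, p := range pods {
		allocated.MilliCPU += p.Requests.MilliCPU
		allocated.MemoryMiB += p.Requests.MemoryMiB
	}

	best := NodeStd(allocatable, allocated)
	for _, p := range pods {
		remaining := Resources{
			MilliCPU:  allocated.MilliCPU - p.Requests.MilliCPU,
			MemoryMiB: allocated.MemoryMiB - p.Requests.MemoryMiB,
		}
		if s := NodeStd(allocatable, remaining); s < best {
			victim, best, ok = p, s, true
		}
	}
	return victim, best, ok
}

func main() {
	allocatable := Resources{MilliCPU: 10000, MemoryMiB: 10000}
	pods := []Pod{
		{Name: "cpu-heavy", Requests: Resources{MilliCPU: 5000, MemoryMiB: 1000}},
		{Name: "balanced-1", Requests: Resources{MilliCPU: 2000, MemoryMiB: 2000}},
		{Name: "balanced-2", Requests: Resources{MilliCPU: 2000, MemoryMiB: 2000}},
	}
	// The pods sum to 90% CPU and 50% memory, matching the example above.
	if victim, after, ok := BestVictim(allocatable, pods); ok {
		fmt.Printf("evict %s, std after eviction: %.2f\n", victim.Name, after)
	}
}
```

In this example evicting the CPU-heavy pod leaves the node at 40% CPU and 40% memory, so the imbalance drops to zero, whereas evicting either balanced pod leaves it unchanged at 0.2. A real plugin would also need to weigh eviction safety and where the evicted pod can land, which this sketch does not model.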

### krkn - Chaos

#### Automated Documentation Sync Bot for Krkn-Chaos Projects
77 changes: 0 additions & 77 deletions programs/lfx-mentorship/2026/02-Jun-Aug/project_ideas.md
@@ -18,80 +18,3 @@

## Proposed Project ideas

### Koordinator

#### Native Support for AI Agent Orchestration and High-Concurrency Sandbox Scheduling

- Description: AI agentic workflows (multi-agent collaboration frameworks, automated evaluation systems such as SWE-bench) frequently spawn and tear down large numbers of short-lived sandboxes (gVisor, Wasm, Kata Containers) to execute skills or tools. Current Kubernetes scheduling models are too heavy for these transient workloads, causing high latency, weak isolation, and APIServer pressure. This project will extend Koordinator to provide first-class support for AI agent scenarios by integrating sandbox runtimes with Koordinator QoS classes, implementing equivalence-class scheduling for near-identical sandbox requests, and introducing a dedicated Sandbox Pipeline that handles pre-warming, capacity reservation, and actionable scheduling feedback to enable agent self-healing.
- Expected Outcome:
  - Native integration of agent sandbox runtimes (gVisor/Wasm/Kata) with Koordinator QoS classes (LSR, LS, BE) and resource hardening templates
  - Equivalence-class scheduling logic in the Koordinator scheduler that skips redundant Predicate/Priority calculations for batches of similar sandbox pods
  - A Sandbox Pipeline mechanism with pre-warming, automated environment provisioning, and reservation hooks
  - Annotation/Status-based scheduling diagnostics that surface actionable next-step suggestions to upstream agents
  - End-to-end tests and benchmarks demonstrating throughput and latency improvements for high-concurrency sandbox workloads
- Recommended Skills:
  - Go programming
  - Kubernetes scheduler-framework, CRDs, controllers
  - Familiarity with sandbox runtimes (gVisor/Kata/Wasm)
  - Performance profiling and benchmarking
- Mentor(s):
  - Rougang Han (@saintube, saintube@foxmail.com)
  - Jianyu Wang (@ZiMengSheng, zmsjianyu@gmail.com)
- Upstream Issue: https://github.com/koordinator-sh/koordinator/issues/2879

#### End-to-End Performance and Scalability Test Harness for Koordinator Scheduler

- Description: As the Koordinator scheduler grows additional plugins (NUMA, Device Sharing, Reservation, Coscheduling, ElasticQuota), the project needs a continuous, lightweight performance harness to detect regressions and validate scaling targets defined in the 2026 roadmap. This project will build a Kwok-based simulation framework that drives realistic workload mixes (LSR/LS/BE, gang, reservations, GPU sharing) at scale (target 10k-100k pods), captures throughput, queueing latency, and per-plugin profiling data, and publishes results from a scheduled GitHub Actions workflow.
- Expected Outcome:
  - Kwok-based simulation harness with reusable workload generators for the major Koordinator plugins
  - Scheduled GitHub Actions workflow that runs the harness on each release branch and publishes Markdown reports
  - Baseline performance metrics and regression-detection thresholds for the 2026 release line
  - Profiling integration (pprof, scheduler_perf) and dashboards for triage
  - Documentation for contributors on adding new scenarios
- Recommended Skills:
  - Go programming
  - Kwok, kube-scheduler internals, scheduler_perf
  - GitHub Actions and shell scripting
- Mentor(s):
  - Jianyu Wang (@ZiMengSheng, zmsjianyu@gmail.com)
  - Rougang Han (@saintube, saintube@foxmail.com)
- Upstream Issue: https://github.com/koordinator-sh/koordinator/milestone/19

#### Plugin-level Cache and Nominator Shared Across Scheduler Profiles

- Description: Today most Koordinator scheduler plugins (DeviceShare, NodeNUMAResource, Reservation, etc.) keep their own private cache per scheduling framework/profile, so configuring multiple profiles in a single `KubeSchedulerConfiguration` produces duplicated plugin caches even though the upstream `SchedulerCache` and `PodNominator` are singletons per scheduler instance. This duplication wastes memory, desynchronizes state across profiles, and exposes correctness bugs: in `RunFilterPluginsWithNominatedPods`, `AddPod` is invoked for pods that are still in scheduling and are therefore missing from `nodeDeviceCache`'s used map, which causes DeviceShare and NodeNUMAResource to fail to reserve resources for high-priority pods in RDMA / VF / NV-Switch scenarios. This project will introduce a unified plugin-level Cache and Nominator abstraction that is shared across all profiles in the same scheduler instance, correctly integrates with `Reserve` / `UnReserve` lifecycle hooks, and fixes the `AddPod` / `RemovePod` handling for high-priority device and NUMA workloads.
- Expected Outcome:
  - A shared plugin-level `Cache` and `Nominator` framework exposed to Koordinator scheduler plugins, with a lifecycle tied to the scheduler instance rather than individual profiles
  - Migration of DeviceShare, NodeNUMAResource, and Reservation plugin caches to the shared implementation, removing per-profile duplication
  - Correct `AddPod` / `RemovePod` handling via reused `Reserve` logic that records assignment placements in the nominator, cleared symmetrically on `Reserve` / `UnReserve`
  - Regression and stress tests covering multi-profile scheduling, preemption with nominated pods, and high-priority RDMA / VF / NV-Switch device assignment
  - Developer documentation describing the shared Cache / Nominator API and migration guidance for other plugins
- Recommended Skills:
  - Go programming
  - Deep understanding of the Kubernetes scheduler-framework (Cache, PodNominator, Reserve/UnReserve)
  - Familiarity with device plugins, NUMA topology, and preemption semantics
  - Concurrency and cache-invalidation reasoning
- Mentor(s):
  - Rougang Han (@saintube, saintube@foxmail.com)
  - Jianyu Wang (@ZiMengSheng, zmsjianyu@gmail.com)
- Upstream Issue: https://github.com/koordinator-sh/koordinator/issues/2749 , https://github.com/koordinator-sh/koordinator/issues/1959

#### Fragmentation-Aware Rescheduling and Scale-Down Binpack for Koordinator

- Description: Resource fragmentation in a cluster has two common sources: (1) intra-node imbalance between resource types — e.g. a node reaches 90% CPU allocation while memory allocation stays at 50%, leaving memory stranded; and (2) scale-down that evicts the "wrong" pods, leaving many nodes partially filled instead of fully drained. Koordinator currently has no dedicated strategy for either case. This project will add two coordinated capabilities: a descheduler plugin that detects nodes whose per-resource allocation imbalance (measured via standard deviation, consistent with the kube-scheduler `NodeResourcesBalancedAllocation` plugin) exceeds a configurable threshold and evicts pods whose removal most reduces the std; and a scale-down binpack strategy that, when upper-layer controllers scale workloads down, picks pod victims so that remaining allocations are concentrated on fewer nodes (inspired by Volcano's scale-down binpack). Both capabilities will reuse Koordinator's `ReservationFirst` eviction mode to preserve workload stability, and cooperate with the scheduler's `NodeResourcesBalancedAllocation` so rebalanced pods land on better-balanced nodes.
- Expected Outcome:
  - New descheduler plugin that detects intra-node resource-type imbalance above a configurable threshold using standard deviation across CPU/memory (and optionally other resources)
  - Pod eviction prioritization that picks pods maximally reducing the per-node std after eviction
  - Scale-down binpack strategy exposed to upper-layer controllers so scale-down victims are chosen to reduce cluster-level fragmentation
  - Integration with the `ReservationFirst` eviction mode to preserve workload stability
  - Coordination with the scheduler `NodeResourcesBalancedAllocation` plugin so evicted pods land on better-balanced nodes
  - Unit tests, e2e tests, and a documented benchmark showing reduced resource fragmentation for both rescheduling and scale-down scenarios
  - User-facing documentation and example descheduler / scale-down configurations
- Recommended Skills:
  - Go programming
  - Kubernetes descheduler and scheduler-framework internals
  - Understanding of resource allocation metrics, binpack algorithms, and eviction safety
- Mentor(s):
  - Tao Song (@songtao98, songtao2603060@gmail.com)
  - Jianyu Wang (@ZiMengSheng, zmsjianyu@gmail.com)
- Upstream Issue: https://github.com/koordinator-sh/koordinator/issues/2332 , https://github.com/koordinator-sh/koordinator/issues/2790