Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 3 additions & 7 deletions docs/design/custom-engine.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,8 @@
# Custom Engine Design

> **Implementation status**: This document describes the full Custom Engine design
> (both the `local` and `http` transports). The current phase (Phase 1) implements
> `transport: local`; `transport: http` is fully designed and its config schema is
> parsed and validated, but the implementation lands in a later PR. Selecting `http`
> today is rejected by validation with a clear "not yet implemented" error.
> **Implementation status**: Both the `local` and `http` transports are implemented.
> `http` covers the JSON request/response core, multipart `http.files` upload, and
> `artifacts.files[].url` download into the report directory.

This document defines the Custom Engine configuration interface and result contract
for `skill-up`. A Custom Engine is used to integrate agent executors that are not
Expand Down Expand Up @@ -521,8 +519,6 @@ need to enter the report directory, the agent must explicitly declare them in th

## HTTP transport

> Not yet implemented in Phase 1; this section is the full design.

The `http` transport is used for a remote agent service or a local HTTP agent service.
It receives the standard `SessionInput`, and after execution returns text, transcript,
and artifact declarations through `SessionResult`. `skill-up` only downloads or
Expand Down
37 changes: 34 additions & 3 deletions docs/guide/writing-evals.md
Original file line number Diff line number Diff line change
Expand Up @@ -157,7 +157,7 @@ The matched file's path **relative to the workspace root is preserved**, so `rep

### Custom Engine

When `engine.name` is not one of the built-ins (`claude_code`, `codex`, `qodercli`), declare an `engine.custom` block so skill-up knows how to invoke your agent. Only `transport: local` is implemented today; `transport: http` is reserved and currently fails validation with "not yet implemented".
When `engine.name` is not one of the built-ins (`claude_code`, `codex`, `qodercli`), declare an `engine.custom` block so skill-up knows how to invoke your agent. Both `transport: local` (run a command inside the runtime) and `transport: http` (call an HTTP agent service) are supported.

```yaml
engine:
Expand All @@ -166,7 +166,7 @@ engine:
provider: anthropic
name: claude-sonnet-4-6
custom:
transport: local # local (implemented) | http (planned)
transport: local # local | http
response_format: session_result # session_result (default) | text
timeout_seconds: 300
env: # credentials and secrets — NEVER reference these in command/args
Expand All @@ -189,7 +189,7 @@ Key fields (full contract in [docs/design/custom-engine.md](../design/custom-eng

- **`transport`** (required) — how skill-up invokes your agent.
- `local`: run `local.command` inside the current runtime via `runtime.Exec`. The agent process can read the runtime workspace, installed skills, fixtures, MCP config, and process environment variables.
- `http`: call a remote (or local) HTTP agent service. Designed in Phase 2 and rejected by validation today with an explicit "not yet implemented".
- `http`: POST the `SessionInput` to a remote (or local) HTTP agent service and read its `SessionResult` from the response body. Configure it under `custom.http` (see below).
- **`response_format`** (optional, default `session_result`) — how skill-up parses the agent's output.
- `session_result`: read a full `SessionResult` JSON from `local.output_file` (when configured) or stdout. Carries `exit_code` / `final_message` / `transcript` / `turns` / `input_tokens` / `output_tokens` / `artifacts`. **Recommended**: keeps the full context for judges and reports.
- `text`: take stdout verbatim as `final_message`. skill-up synthesises a minimal transcript (input messages + the assistant reply) so judges still receive a conversation. Use only for simple scripts that do not produce structured output.
Expand All @@ -205,6 +205,37 @@ Secret-handling rules (enforced at config load):
- `${api_key}` and any kwarg whose key looks like a credential (`token`, `secret`, `api_key`, `apiKey`, `bearerToken`, …) cannot be referenced from `command` / `args` / `cwd` / `input_file` / `output_file`. Pass them through `engine.custom.env`, where they reach your agent as process environment variables instead of leaking into process listings.
- `${SOMEVAR:-...}` defaults that contain recognizable credential shapes (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWTs) are likewise rejected in command-line contexts.

For `transport: http`, configure the call under `custom.http` instead of `custom.local`:

```yaml
engine:
name: remote-review-agent
custom:
transport: http
response_format: session_result
timeout_seconds: 300
kwargs:
profile: production
http:
url: ${CUSTOM_AGENT_ENDPOINT}/v1/run # required
method: POST # only POST is supported
headers:
Authorization: Bearer ${api_key} # reference secrets here, not in the URL
files: # optional: upload workspace files as multipart
- path: diff.patch
required: true
- path: "src/**/*.go"
required: false
request_body: ${session_input} # optional; defaults to ${session_input}
```

HTTP specifics:

- The request body defaults to the `SessionInput` JSON. A `request_body` field whose value is exactly `${session_input}`, `${messages}`, or `${kwargs}` is injected as a JSON structure (not a string).
- With `http.files`, the request becomes `multipart/form-data`: the JSON body moves to the `payload` field and each matched file is uploaded as a separate part under the fixed form field name `files`, with its workspace-relative path carried in that part's `filename` (so the server reads each `files` part's `filename`, not a per-path form key). `path` is a workspace-relative file or glob; `required: false` skips a missing/empty match.
- Credentials must be referenced from `headers` (or `request_body`), never from `http.url` — a URL that renders `${api_key}` is rejected to keep the key out of request logs.
- A non-2xx response is treated as an invocation error. Artifacts the agent returns under `artifacts.files[].url` are GET-downloaded (http/https only, no redirects, size/time bounded) into the report directory.

See `docs/design/custom-engine.md` for the full SessionInput / SessionResult schema your agent must conform to.

### MCP configuration
Expand Down
37 changes: 34 additions & 3 deletions docs/zh/guide/writing-evals.md
Original file line number Diff line number Diff line change
Expand Up @@ -140,7 +140,7 @@ report:

### 自定义 Engine(Custom Engine)

当 `engine.name` 不是内置引擎(`claude_code` / `codex` / `qodercli`)时,必须再写一个 `engine.custom` 段来告诉 skill-up 怎么调用你的 Agent。当前只实现了 `transport: local``transport: http` 已设计但尚未实现,validation 会直接报 "not yet implemented"
当 `engine.name` 不是内置引擎(`claude_code` / `codex` / `qodercli`)时,必须再写一个 `engine.custom` 段来告诉 skill-up 怎么调用你的 Agent。`transport: local`(在 runtime 内执行命令)和 `transport: http`(调用 HTTP agent 服务)均已支持

```yaml
engine:
Expand All @@ -149,7 +149,7 @@ engine:
provider: anthropic
name: claude-sonnet-4-6
custom:
transport: local # local(已实现)| http(规划中)
transport: local # local | http
response_format: session_result # session_result(默认)| text
timeout_seconds: 300
env: # 凭据 / 敏感参数 —— 不要在 command/args 里引用这些
Expand All @@ -172,7 +172,7 @@ engine:

- **`transport`**(必填)—— skill-up 调用 agent 的方式。
- `local`:通过 `runtime.Exec` 在当前 runtime 内执行 `local.command`,agent 进程可访问 runtime workspace、已安装的 skill、fixture、MCP 配置以及进程环境变量。
- `http`:调用远程(或本地)HTTP agent 服务。Phase 2 已完成设计,本版本 validate 阶段直接拒绝并提示 "not yet implemented"
- `http`:把 `SessionInput` POST 到远程(或本地)HTTP agent 服务,从响应体读取 `SessionResult`。相关字段配置在 `custom.http` 下(见下文)
- **`response_format`**(可选,默认 `session_result`)—— skill-up 如何解析 agent 输出。
- `session_result`:从 `local.output_file`(若配置)或 stdout 读出完整的 `SessionResult` JSON,包含 `exit_code` / `final_message` / `transcript` / `turns` / `input_tokens` / `output_tokens` / `artifacts`。**推荐保留默认**,可以让 judge 和报告拿到完整上下文。
- `text`:把 stdout 整体当作 `final_message`,skill-up 自动按输入消息 + 助手回复合成 minimal transcript,使 judge 仍能拿到一段对话。仅适合不输出结构化结果的简易脚本。
Expand All @@ -188,6 +188,37 @@ engine:
- `${api_key}` 以及任何看起来像凭据的 kwarg key(`token` / `secret` / `api_key` / `apiKey` / `bearerToken` 等)都不允许出现在 `command` / `args` / `cwd` / `input_file` / `output_file` 中,必须经由 `engine.custom.env` 注入到子进程环境变量里。
- `${SOMEVAR:-...}` 默认值如果匹配常见凭据特征(`sk-...`、`sk-ant-...`、`ghp_...`、`AIza...`、`AKIA...`、JWT 等),同样会在命令行场景被拒。

`transport: http` 时,把调用配置写在 `custom.http` 下(替代 `custom.local`):

```yaml
engine:
name: remote-review-agent
custom:
transport: http
response_format: session_result
timeout_seconds: 300
kwargs:
profile: production
http:
url: ${CUSTOM_AGENT_ENDPOINT}/v1/run # 必填
method: POST # 仅支持 POST
headers:
Authorization: Bearer ${api_key} # 凭据写在 header,不要写进 URL
files: # 可选:以 multipart 上传 workspace 文件
- path: diff.patch
required: true
- path: "src/**/*.go"
required: false
request_body: ${session_input} # 可选;默认即 ${session_input}
```

HTTP 要点:

- 请求体默认是 `SessionInput` JSON。`request_body` 中某个值若恰好是 `${session_input}` / `${messages}` / `${kwargs}`,会以 JSON 结构注入(而非字符串)。
- 配置了 `http.files` 后请求变为 `multipart/form-data`:JSON 体移到 `payload` 字段,每个命中的文件作为独立 part 上传到固定表单字段 `files`,其 workspace 相对路径放在该 part 的 `filename` 里(服务端应读取每个 `files` part 的 `filename`,而不是按路径找表单 key)。`path` 为 workspace 相对文件或 glob;`required: false` 时缺失/未命中会跳过。
- 凭据必须从 `headers`(或 `request_body`)引用,不能写进 `http.url` —— URL 渲染出 `${api_key}` 会被拒绝,避免泄漏到请求日志。
- 非 2xx 响应按调用错误处理。Agent 在 `artifacts.files[].url` 返回的产物会被 GET 下载(仅 http/https、不跟随重定向、有大小与时间上限)到报告目录。

SessionInput / SessionResult 的完整 JSON 契约见 `docs/design/custom-engine.md`。

### MCP 配置说明
Expand Down
116 changes: 116 additions & 0 deletions e2e/custom_engine_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,14 @@ package e2e

import (
"encoding/json"
"io"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"runtime"
"strings"
"sync"
"testing"
"time"

Expand Down Expand Up @@ -119,6 +123,118 @@ func TestPipeline_CustomEngine_LocalTransport(t *testing.T) {
}
}

// TestPipeline_CustomEngine_HTTPTransport runs the entire evaluation pipeline
// against a Custom Engine using transport: http. An in-process httptest server
// stands in for the remote agent: the framework POSTs the SessionInput JSON to
// it and parses the SessionResult JSON it returns. The test asserts that:
// 1. The eval loads and the http custom engine block is env-resolved/validated.
// 2. The framework POSTs a SessionInput body carrying the case prompt.
// 3. The server's SessionResult final_message reaches the report and the
// expect.must_contain rule passes.
//
// Unlike the local transport test, this needs no POSIX shell fixture, so it
// also runs on Windows.
func TestPipeline_CustomEngine_HTTPTransport(t *testing.T) {
t.Parallel()

var (
mu sync.Mutex
gotBody []byte
gotPath string
gotCType string
)
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
body, _ := io.ReadAll(r.Body)
mu.Lock()
gotBody = body
gotPath = r.URL.Path
gotCType = r.Header.Get("Content-Type")
mu.Unlock()

w.Header().Set("Content-Type", "application/json")
_, _ = io.WriteString(w, `{
"exit_code": 0,
"final_message": "custom-engine-handled",
"turns": 1,
"transcript": [
{"role": "user", "content": "ping"},
{"role": "assistant", "content": "custom-engine-handled"}
]
}`)
}))
defer srv.Close()

dir := getCustomEngineTestdataDir()
evalPath := filepath.Join(dir, "evals", "eval-http.yaml")

outputDir := t.TempDir()
result := Run(t, RunConfig{
Env: []string{
"PATH=" + os.Getenv("PATH"),
"HOME=" + os.Getenv("HOME"),
"CUSTOM_AGENT_ENDPOINT=" + srv.URL,
},
WorkDir: dir,
Timeout: 60 * time.Second,
}, "run", evalPath, "--no-delete", "--output-dir", outputDir)

if result.ExitCode != 0 {
t.Fatalf("custom engine http run failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
result.ExitCode, result.Stdout, result.Stderr)
}
if !strings.Contains(result.Stdout, "1 passed") {
t.Errorf("expected the summary line to report 1 passed, got:\n%s", result.Stdout)
}

// --- Assert the server actually received a SessionInput POST ---
mu.Lock()
body, reqPath, cType := gotBody, gotPath, gotCType
mu.Unlock()

if reqPath != "/v1/run" {
t.Errorf("expected request path /v1/run, got %q", reqPath)
}
if !strings.Contains(cType, "application/json") {
t.Errorf("expected JSON content-type, got %q", cType)
}
var sess struct {
Messages []struct {
Role string `json:"role"`
Content string `json:"content"`
} `json:"messages"`
}
if err := json.Unmarshal(body, &sess); err != nil {
t.Fatalf("failed to parse posted SessionInput body: %v\nbody:\n%s", err, body)
}
if len(sess.Messages) == 0 || !strings.Contains(sess.Messages[len(sess.Messages)-1].Content, "ping the custom engine") {
t.Errorf("expected posted SessionInput to carry the case prompt, got messages: %#v", sess.Messages)
}

// --- Assert generated report.json content ---
reportPath := filepath.Join(outputDir, "iteration-1", "report.json")
reportData, err := os.ReadFile(reportPath)
if err != nil {
t.Fatalf("failed to read report.json at %s: %v", reportPath, err)
}
var rpt report.Input
if err := json.Unmarshal(reportData, &rpt); err != nil {
t.Fatalf("failed to parse report.json: %v", err)
}
if len(rpt.CaseResults) != 1 {
t.Fatalf("expected exactly 1 case result, got %d", len(rpt.CaseResults))
}
cr := rpt.CaseResults[0]
if cr.Response != "custom-engine-handled" {
t.Errorf("expected response (final_message) = %q, got %q", "custom-engine-handled", cr.Response)
}
if cr.Grading == nil {
t.Fatalf("expected grading to be non-nil")
}
if cr.Grading.Status != judge.StatusPass {
t.Errorf("expected grading status = %q, got %q", judge.StatusPass, cr.Grading.Status)
}
}

// TestPipeline_CustomEngine_Validate runs `skill-up validate` against the
// custom engine eval.yaml. This exercises the post-load
// config.ResolveCustomEngineConfig path that validate.go invokes, including
Expand Down
19 changes: 14 additions & 5 deletions e2e/testdata/custom-engine/SKILL.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,18 @@
# Custom Engine e2e fixture

This directory is a minimal skill-up fixture that exercises the **Custom
Engine** local transport end-to-end. It is consumed by `e2e/custom_engine_test.go`.
Engine** local and http transports end-to-end. It is consumed by
`e2e/custom_engine_test.go`.

`agent.sh` is a deterministic stand-in for a real custom agent: it reads the
`SessionInput` JSON the framework writes to `${input_file}` and emits a fixed
`SessionResult` on stdout. The case asserts that `final_message` flows from
the agent's stdout into the report.
`agent.sh` + `evals/eval.yaml` cover the **local** transport: `agent.sh` is a
deterministic stand-in for a real custom agent that reads the `SessionInput`
JSON the framework writes to `${input_file}` and emits a fixed `SessionResult`
on stdout. The case asserts that `final_message` flows from the agent's stdout
into the report.

`evals/eval-http.yaml` exercises the **http** transport against the same
`hello.yaml` case. `TestPipeline_CustomEngine_HTTPTransport` starts an
in-process `httptest` server, points `${CUSTOM_AGENT_ENDPOINT}` at it, and
asserts the posted `SessionInput` reaches the server and the returned
`SessionResult` flows into the report — so the http stand-in agent lives in the
test rather than in `agent.sh`.
29 changes: 29 additions & 0 deletions e2e/testdata/custom-engine/evals/eval-http.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
schema_version: v1alpha1

environment:
type: none

mcp:
servers: []

engine:
name: e2e-custom-http-agent
custom:
transport: http
response_format: session_result
http:
url: ${CUSTOM_AGENT_ENDPOINT}/v1/run
method: POST
headers:
Content-Type: application/json

cases:
files:
- evals/cases/hello.yaml
defaults:
timeout_seconds: 60
max_turns: 3
parallelism: 1

report:
formats: [json]
2 changes: 1 addition & 1 deletion internal/agent/custom.go
Original file line number Diff line number Diff line change
Expand Up @@ -982,7 +982,7 @@ func (a *CustomAgent) errorResult(exitCode int) *SessionResult {

// SessionInput is the documented contract handed to a Custom Engine for each
// case (see docs/design/custom-engine.md). Both the local transport (written
// to ${input_file}) and the planned http transport (sent as the request body
// to ${input_file}) and the http transport (sent as the request body
// or multipart payload field) use the same shape, so it is exported here as
// a public type rather than buried in a transport implementation.
type SessionInput struct {
Expand Down
Loading