diff --git a/docs/design/custom-engine.md b/docs/design/custom-engine.md index 77b3b00a..c759ea62 100644 --- a/docs/design/custom-engine.md +++ b/docs/design/custom-engine.md @@ -1,10 +1,8 @@ # Custom Engine Design -> **Implementation status**: This document describes the full Custom Engine design -> (both the `local` and `http` transports). The current phase (Phase 1) implements -> `transport: local`; `transport: http` is fully designed and its config schema is -> parsed and validated, but the implementation lands in a later PR. Selecting `http` -> today is rejected by validation with a clear "not yet implemented" error. +> **Implementation status**: Both the `local` and `http` transports are implemented. +> `http` covers the JSON request/response core, multipart `http.files` upload, and +> `artifacts.files[].url` download into the report directory. This document defines the Custom Engine configuration interface and result contract for `skill-up`. A Custom Engine is used to integrate agent executors that are not @@ -521,8 +519,6 @@ need to enter the report directory, the agent must explicitly declare them in th ## HTTP transport -> Not yet implemented in Phase 1; this section is the full design. - The `http` transport is used for a remote agent service or a local HTTP agent service. It receives the standard `SessionInput`, and after execution returns text, transcript, and artifact declarations through `SessionResult`. `skill-up` only downloads or diff --git a/docs/guide/writing-evals.md b/docs/guide/writing-evals.md index df9ed9ba..6559ebe9 100644 --- a/docs/guide/writing-evals.md +++ b/docs/guide/writing-evals.md @@ -157,7 +157,7 @@ The matched file's path **relative to the workspace root is preserved**, so `rep ### Custom Engine -When `engine.name` is not one of the built-ins (`claude_code`, `codex`, `qodercli`), declare an `engine.custom` block so skill-up knows how to invoke your agent. Only `transport: local` is implemented today; `transport: http` is reserved and currently fails validation with "not yet implemented". +When `engine.name` is not one of the built-ins (`claude_code`, `codex`, `qodercli`), declare an `engine.custom` block so skill-up knows how to invoke your agent. Both `transport: local` (run a command inside the runtime) and `transport: http` (call an HTTP agent service) are supported. ```yaml engine: @@ -166,7 +166,7 @@ engine: provider: anthropic name: claude-sonnet-4-6 custom: - transport: local # local (implemented) | http (planned) + transport: local # local | http response_format: session_result # session_result (default) | text timeout_seconds: 300 env: # credentials and secrets — NEVER reference these in command/args @@ -189,7 +189,7 @@ Key fields (full contract in [docs/design/custom-engine.md](../design/custom-eng - **`transport`** (required) — how skill-up invokes your agent. - `local`: run `local.command` inside the current runtime via `runtime.Exec`. The agent process can read the runtime workspace, installed skills, fixtures, MCP config, and process environment variables. - - `http`: call a remote (or local) HTTP agent service. Designed in Phase 2 and rejected by validation today with an explicit "not yet implemented". + - `http`: POST the `SessionInput` to a remote (or local) HTTP agent service and read its `SessionResult` from the response body. Configure it under `custom.http` (see below). - **`response_format`** (optional, default `session_result`) — how skill-up parses the agent's output. - `session_result`: read a full `SessionResult` JSON from `local.output_file` (when configured) or stdout. Carries `exit_code` / `final_message` / `transcript` / `turns` / `input_tokens` / `output_tokens` / `artifacts`. **Recommended**: keeps the full context for judges and reports. - `text`: take stdout verbatim as `final_message`. skill-up synthesises a minimal transcript (input messages + the assistant reply) so judges still receive a conversation. Use only for simple scripts that do not produce structured output. @@ -205,6 +205,37 @@ Secret-handling rules (enforced at config load): - `${api_key}` and any kwarg whose key looks like a credential (`token`, `secret`, `api_key`, `apiKey`, `bearerToken`, …) cannot be referenced from `command` / `args` / `cwd` / `input_file` / `output_file`. Pass them through `engine.custom.env`, where they reach your agent as process environment variables instead of leaking into process listings. - `${SOMEVAR:-...}` defaults that contain recognizable credential shapes (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWTs) are likewise rejected in command-line contexts. +For `transport: http`, configure the call under `custom.http` instead of `custom.local`: + +```yaml +engine: + name: remote-review-agent + custom: + transport: http + response_format: session_result + timeout_seconds: 300 + kwargs: + profile: production + http: + url: ${CUSTOM_AGENT_ENDPOINT}/v1/run # required + method: POST # only POST is supported + headers: + Authorization: Bearer ${api_key} # reference secrets here, not in the URL + files: # optional: upload workspace files as multipart + - path: diff.patch + required: true + - path: "src/**/*.go" + required: false + request_body: ${session_input} # optional; defaults to ${session_input} +``` + +HTTP specifics: + +- The request body defaults to the `SessionInput` JSON. A `request_body` field whose value is exactly `${session_input}`, `${messages}`, or `${kwargs}` is injected as a JSON structure (not a string). +- With `http.files`, the request becomes `multipart/form-data`: the JSON body moves to the `payload` field and each matched file is uploaded as a separate part under the fixed form field name `files`, with its workspace-relative path carried in that part's `filename` (so the server reads each `files` part's `filename`, not a per-path form key). `path` is a workspace-relative file or glob; `required: false` skips a missing/empty match. +- Credentials must be referenced from `headers` (or `request_body`), never from `http.url` — a URL that renders `${api_key}` is rejected to keep the key out of request logs. +- A non-2xx response is treated as an invocation error. Artifacts the agent returns under `artifacts.files[].url` are GET-downloaded (http/https only, no redirects, size/time bounded) into the report directory. + See `docs/design/custom-engine.md` for the full SessionInput / SessionResult schema your agent must conform to. ### MCP configuration diff --git a/docs/zh/guide/writing-evals.md b/docs/zh/guide/writing-evals.md index def70817..a0bdc402 100644 --- a/docs/zh/guide/writing-evals.md +++ b/docs/zh/guide/writing-evals.md @@ -140,7 +140,7 @@ report: ### 自定义 Engine(Custom Engine) -当 `engine.name` 不是内置引擎(`claude_code` / `codex` / `qodercli`)时,必须再写一个 `engine.custom` 段来告诉 skill-up 怎么调用你的 Agent。当前只实现了 `transport: local`;`transport: http` 已设计但尚未实现,validation 会直接报 "not yet implemented"。 +当 `engine.name` 不是内置引擎(`claude_code` / `codex` / `qodercli`)时,必须再写一个 `engine.custom` 段来告诉 skill-up 怎么调用你的 Agent。`transport: local`(在 runtime 内执行命令)和 `transport: http`(调用 HTTP agent 服务)均已支持。 ```yaml engine: @@ -149,7 +149,7 @@ engine: provider: anthropic name: claude-sonnet-4-6 custom: - transport: local # local(已实现)| http(规划中) + transport: local # local | http response_format: session_result # session_result(默认)| text timeout_seconds: 300 env: # 凭据 / 敏感参数 —— 不要在 command/args 里引用这些 @@ -172,7 +172,7 @@ engine: - **`transport`**(必填)—— skill-up 调用 agent 的方式。 - `local`:通过 `runtime.Exec` 在当前 runtime 内执行 `local.command`,agent 进程可访问 runtime workspace、已安装的 skill、fixture、MCP 配置以及进程环境变量。 - - `http`:调用远程(或本地)HTTP agent 服务。Phase 2 已完成设计,本版本 validate 阶段直接拒绝并提示 "not yet implemented"。 + - `http`:把 `SessionInput` POST 到远程(或本地)HTTP agent 服务,从响应体读取 `SessionResult`。相关字段配置在 `custom.http` 下(见下文)。 - **`response_format`**(可选,默认 `session_result`)—— skill-up 如何解析 agent 输出。 - `session_result`:从 `local.output_file`(若配置)或 stdout 读出完整的 `SessionResult` JSON,包含 `exit_code` / `final_message` / `transcript` / `turns` / `input_tokens` / `output_tokens` / `artifacts`。**推荐保留默认**,可以让 judge 和报告拿到完整上下文。 - `text`:把 stdout 整体当作 `final_message`,skill-up 自动按输入消息 + 助手回复合成 minimal transcript,使 judge 仍能拿到一段对话。仅适合不输出结构化结果的简易脚本。 @@ -188,6 +188,37 @@ engine: - `${api_key}` 以及任何看起来像凭据的 kwarg key(`token` / `secret` / `api_key` / `apiKey` / `bearerToken` 等)都不允许出现在 `command` / `args` / `cwd` / `input_file` / `output_file` 中,必须经由 `engine.custom.env` 注入到子进程环境变量里。 - `${SOMEVAR:-...}` 默认值如果匹配常见凭据特征(`sk-...`、`sk-ant-...`、`ghp_...`、`AIza...`、`AKIA...`、JWT 等),同样会在命令行场景被拒。 +`transport: http` 时,把调用配置写在 `custom.http` 下(替代 `custom.local`): + +```yaml +engine: + name: remote-review-agent + custom: + transport: http + response_format: session_result + timeout_seconds: 300 + kwargs: + profile: production + http: + url: ${CUSTOM_AGENT_ENDPOINT}/v1/run # 必填 + method: POST # 仅支持 POST + headers: + Authorization: Bearer ${api_key} # 凭据写在 header,不要写进 URL + files: # 可选:以 multipart 上传 workspace 文件 + - path: diff.patch + required: true + - path: "src/**/*.go" + required: false + request_body: ${session_input} # 可选;默认即 ${session_input} +``` + +HTTP 要点: + +- 请求体默认是 `SessionInput` JSON。`request_body` 中某个值若恰好是 `${session_input}` / `${messages}` / `${kwargs}`,会以 JSON 结构注入(而非字符串)。 +- 配置了 `http.files` 后请求变为 `multipart/form-data`:JSON 体移到 `payload` 字段,每个命中的文件作为独立 part 上传到固定表单字段 `files`,其 workspace 相对路径放在该 part 的 `filename` 里(服务端应读取每个 `files` part 的 `filename`,而不是按路径找表单 key)。`path` 为 workspace 相对文件或 glob;`required: false` 时缺失/未命中会跳过。 +- 凭据必须从 `headers`(或 `request_body`)引用,不能写进 `http.url` —— URL 渲染出 `${api_key}` 会被拒绝,避免泄漏到请求日志。 +- 非 2xx 响应按调用错误处理。Agent 在 `artifacts.files[].url` 返回的产物会被 GET 下载(仅 http/https、不跟随重定向、有大小与时间上限)到报告目录。 + SessionInput / SessionResult 的完整 JSON 契约见 `docs/design/custom-engine.md`。 ### MCP 配置说明 diff --git a/e2e/custom_engine_test.go b/e2e/custom_engine_test.go index 757894e2..d9017270 100644 --- a/e2e/custom_engine_test.go +++ b/e2e/custom_engine_test.go @@ -4,10 +4,14 @@ package e2e import ( "encoding/json" + "io" + "net/http" + "net/http/httptest" "os" "path/filepath" "runtime" "strings" + "sync" "testing" "time" @@ -119,6 +123,118 @@ func TestPipeline_CustomEngine_LocalTransport(t *testing.T) { } } +// TestPipeline_CustomEngine_HTTPTransport runs the entire evaluation pipeline +// against a Custom Engine using transport: http. An in-process httptest server +// stands in for the remote agent: the framework POSTs the SessionInput JSON to +// it and parses the SessionResult JSON it returns. The test asserts that: +// 1. The eval loads and the http custom engine block is env-resolved/validated. +// 2. The framework POSTs a SessionInput body carrying the case prompt. +// 3. The server's SessionResult final_message reaches the report and the +// expect.must_contain rule passes. +// +// Unlike the local transport test, this needs no POSIX shell fixture, so it +// also runs on Windows. +func TestPipeline_CustomEngine_HTTPTransport(t *testing.T) { + t.Parallel() + + var ( + mu sync.Mutex + gotBody []byte + gotPath string + gotCType string + ) + srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + body, _ := io.ReadAll(r.Body) + mu.Lock() + gotBody = body + gotPath = r.URL.Path + gotCType = r.Header.Get("Content-Type") + mu.Unlock() + + w.Header().Set("Content-Type", "application/json") + _, _ = io.WriteString(w, `{ + "exit_code": 0, + "final_message": "custom-engine-handled", + "turns": 1, + "transcript": [ + {"role": "user", "content": "ping"}, + {"role": "assistant", "content": "custom-engine-handled"} + ] +}`) + })) + defer srv.Close() + + dir := getCustomEngineTestdataDir() + evalPath := filepath.Join(dir, "evals", "eval-http.yaml") + + outputDir := t.TempDir() + result := Run(t, RunConfig{ + Env: []string{ + "PATH=" + os.Getenv("PATH"), + "HOME=" + os.Getenv("HOME"), + "CUSTOM_AGENT_ENDPOINT=" + srv.URL, + }, + WorkDir: dir, + Timeout: 60 * time.Second, + }, "run", evalPath, "--no-delete", "--output-dir", outputDir) + + if result.ExitCode != 0 { + t.Fatalf("custom engine http run failed: exit=%d\nstdout:\n%s\nstderr:\n%s", + result.ExitCode, result.Stdout, result.Stderr) + } + if !strings.Contains(result.Stdout, "1 passed") { + t.Errorf("expected the summary line to report 1 passed, got:\n%s", result.Stdout) + } + + // --- Assert the server actually received a SessionInput POST --- + mu.Lock() + body, reqPath, cType := gotBody, gotPath, gotCType + mu.Unlock() + + if reqPath != "/v1/run" { + t.Errorf("expected request path /v1/run, got %q", reqPath) + } + if !strings.Contains(cType, "application/json") { + t.Errorf("expected JSON content-type, got %q", cType) + } + var sess struct { + Messages []struct { + Role string `json:"role"` + Content string `json:"content"` + } `json:"messages"` + } + if err := json.Unmarshal(body, &sess); err != nil { + t.Fatalf("failed to parse posted SessionInput body: %v\nbody:\n%s", err, body) + } + if len(sess.Messages) == 0 || !strings.Contains(sess.Messages[len(sess.Messages)-1].Content, "ping the custom engine") { + t.Errorf("expected posted SessionInput to carry the case prompt, got messages: %#v", sess.Messages) + } + + // --- Assert generated report.json content --- + reportPath := filepath.Join(outputDir, "iteration-1", "report.json") + reportData, err := os.ReadFile(reportPath) + if err != nil { + t.Fatalf("failed to read report.json at %s: %v", reportPath, err) + } + var rpt report.Input + if err := json.Unmarshal(reportData, &rpt); err != nil { + t.Fatalf("failed to parse report.json: %v", err) + } + if len(rpt.CaseResults) != 1 { + t.Fatalf("expected exactly 1 case result, got %d", len(rpt.CaseResults)) + } + cr := rpt.CaseResults[0] + if cr.Response != "custom-engine-handled" { + t.Errorf("expected response (final_message) = %q, got %q", "custom-engine-handled", cr.Response) + } + if cr.Grading == nil { + t.Fatalf("expected grading to be non-nil") + } + if cr.Grading.Status != judge.StatusPass { + t.Errorf("expected grading status = %q, got %q", judge.StatusPass, cr.Grading.Status) + } +} + // TestPipeline_CustomEngine_Validate runs `skill-up validate` against the // custom engine eval.yaml. This exercises the post-load // config.ResolveCustomEngineConfig path that validate.go invokes, including diff --git a/e2e/testdata/custom-engine/SKILL.md b/e2e/testdata/custom-engine/SKILL.md index 98672ac1..bb857300 100644 --- a/e2e/testdata/custom-engine/SKILL.md +++ b/e2e/testdata/custom-engine/SKILL.md @@ -1,9 +1,18 @@ # Custom Engine e2e fixture This directory is a minimal skill-up fixture that exercises the **Custom -Engine** local transport end-to-end. It is consumed by `e2e/custom_engine_test.go`. +Engine** local and http transports end-to-end. It is consumed by +`e2e/custom_engine_test.go`. -`agent.sh` is a deterministic stand-in for a real custom agent: it reads the -`SessionInput` JSON the framework writes to `${input_file}` and emits a fixed -`SessionResult` on stdout. The case asserts that `final_message` flows from -the agent's stdout into the report. +`agent.sh` + `evals/eval.yaml` cover the **local** transport: `agent.sh` is a +deterministic stand-in for a real custom agent that reads the `SessionInput` +JSON the framework writes to `${input_file}` and emits a fixed `SessionResult` +on stdout. The case asserts that `final_message` flows from the agent's stdout +into the report. + +`evals/eval-http.yaml` exercises the **http** transport against the same +`hello.yaml` case. `TestPipeline_CustomEngine_HTTPTransport` starts an +in-process `httptest` server, points `${CUSTOM_AGENT_ENDPOINT}` at it, and +asserts the posted `SessionInput` reaches the server and the returned +`SessionResult` flows into the report — so the http stand-in agent lives in the +test rather than in `agent.sh`. diff --git a/e2e/testdata/custom-engine/evals/eval-http.yaml b/e2e/testdata/custom-engine/evals/eval-http.yaml new file mode 100644 index 00000000..bd1b43db --- /dev/null +++ b/e2e/testdata/custom-engine/evals/eval-http.yaml @@ -0,0 +1,29 @@ +schema_version: v1alpha1 + +environment: + type: none + +mcp: + servers: [] + +engine: + name: e2e-custom-http-agent + custom: + transport: http + response_format: session_result + http: + url: ${CUSTOM_AGENT_ENDPOINT}/v1/run + method: POST + headers: + Content-Type: application/json + +cases: + files: + - evals/cases/hello.yaml + defaults: + timeout_seconds: 60 + max_turns: 3 + parallelism: 1 + +report: + formats: [json] diff --git a/internal/agent/custom.go b/internal/agent/custom.go index 4481feb2..389fd815 100644 --- a/internal/agent/custom.go +++ b/internal/agent/custom.go @@ -982,7 +982,7 @@ func (a *CustomAgent) errorResult(exitCode int) *SessionResult { // SessionInput is the documented contract handed to a Custom Engine for each // case (see docs/design/custom-engine.md). Both the local transport (written -// to ${input_file}) and the planned http transport (sent as the request body +// to ${input_file}) and the http transport (sent as the request body // or multipart payload field) use the same shape, so it is exported here as // a public type rather than buried in a transport implementation. type SessionInput struct {