Remote eval server by delner · Pull Request #48 · braintrustdata/braintrust-sdk-go

David Elner (delner) · 2026-03-25T04:18:46Z

Remote Eval Server

Adds a new server package that lets users run evaluations from the Braintrust playground against code on their own infrastructure. The server exposes locally-registered evals over HTTP, allowing the Braintrust UI to trigger evals, stream progress in real time, and display results — without the evaluation code ever leaving the user's environment.

This is useful when evaluations need access to internal APIs, proprietary data, custom runtimes, or complex multi-step workflows that can't run in Braintrust's cloud.

Setup

1. Define your eval and start the server:

package main

import (
    "context"
    "log"
    "strings"

    "github.com/braintrustdata/braintrust-sdk-go/eval"
    "github.com/braintrustdata/braintrust-sdk-go/server"
)

func main() {
    // Define an eval: the task to run and the scorers to apply.
    classify := &eval.Eval[string, string]{
        Name:        "classify",
        ProjectName: "my-project",
        Task: eval.T(func(ctx context.Context, input string) (string, error) {
            if strings.Contains(strings.ToLower(input), "apple") {
                return "fruit", nil
            }
            return "unknown", nil
        }),
        Scorers: []eval.Scorer[string, string]{
            eval.NewScorer("exact_match", func(ctx context.Context, r eval.TaskResult[string, string]) (eval.Scores, error) {
                if r.Output == r.Expected {
                    return eval.S(1.0), nil
                }
                return eval.S(0.0), nil
            }),
        },
    }

    // Start a server and register the eval.
    srv := server.New(
        server.WithAddress("localhost:8300"),
        server.WithNoAuth(), // remove for production
    )
    server.RegisterEval(srv, classify, server.RegisterEvalOpts{})

    log.Fatal(srv.Start())
}

The same eval definition can also be run locally with braintrust.NewEval:

client, err := braintrust.New(tp)
if err != nil {
    log.Fatal(err)
}
e := braintrust.NewEval(client, classify)
result, err := e.Run(ctx, eval.RunOpts[string, string]{Dataset: dataset})

2. Configure in Braintrust: Go to your project's Settings > Remote evals and add http://localhost:8300.

3. Use from the playground: Select "Remote eval" as the task type, pick your evaluator, choose a dataset, and run.

See the full working example at examples/internal/eval-server/main.go.

Design

The server implements the same HTTP protocol used by the Ruby and Java SDKs, ensuring compatibility with the Braintrust backend.

┌──────────────────┐         POST /eval           ┌───────────────────┐
│                  │  ──────────────────────────> │                   │
│  Braintrust UI   │                              │  Go Eval Server   │
│  (playground)    │  <─── SSE: progress ──────── │                   │
│                  │  <─── SSE: progress ──────── │  ┌─────────────┐  │
│                  │  <─── SSE: summary ───────── │  │ Eval        │  │
│                  │  <─── SSE: done ──────────── │  │ (task +     │  │
└──────────────────┘                              │  │  scorers)   │  │
                                                  │  └─────────────┘  │
       Clerk token validated via                  │         │         │
       POST /api/apikey/login ────────────────>   │         ▼         │
                                                  │  OTel spans ──>   │
                                                  │  Braintrust API   │
                                                  └───────────────────┘

Endpoints:

GET / — Health check
GET|POST /list — Returns registered evals with their scorer names and parameter schemas
POST /eval — Runs an evaluation and streams results as Server-Sent Events (progress per case, summary with aggregated scores, done)
OPTIONS * — CORS preflight for braintrust.dev origins

Components:

Eval definitions — eval.Eval[I, R] captures the task, scorers, and project name. braintrust.NewEval(client, &eval.Eval{...}) attaches infrastructure (session, tracer, API client) so it can be run locally via e.Run(). The same definition can also be registered with a server via server.RegisterEval().
Registration — server.RegisterEval[I, R]() is a generic package-level function that wraps typed eval definitions behind a non-generic interface using JSON-based type erasure at the HTTP boundary. This lets the server store evals with different type signatures in a single map.
Authentication — Each request's Clerk token is validated against Braintrust's /api/apikey/login endpoint. Validated sessions are cached in an LRU (max 64 entries) to avoid re-authenticating on every request. Failed sessions are evicted immediately. A WithNoAuth() option is available for local development.
Per-request isolation — Each eval request creates its own Evaluator (with per-request TracerProvider and auth.Session from the caller's token), ensuring traces are attributed to the correct user and organization. The default evaluator on the Eval is not used for remote requests.
SSE streaming — A thread-safe SSE writer streams progress events as each case completes. If the client disconnects mid-eval, the write failure cancels the eval context, stopping remaining work.
CORS — Allows requests from braintrust.dev and braintrustdata.dev origins (HTTP and HTTPS for local dev) with Private Network Access support. Sets Vary: Origin to prevent cache poisoning.

Key design decisions:

net/http stdlib only — No new HTTP framework dependencies. Go 1.22+ http.ServeMux provides method-based routing ("POST /eval"), which is sufficient.
server.RegisterEval[I,R]() as a package-level function — Go does not allow generic methods on non-generic types. A package-level function is the idiomatic workaround and keeps the API clean.
Handler() method for embedding — Users can mount the eval server's http.Handler inside their own router alongside other endpoints, rather than requiring a standalone process.
OnCaseComplete callback on eval.Opts — A minimal, backward-compatible addition (nil defaults to no-op) that enables SSE streaming without reimplementing the eval engine. The callback is general-purpose and useful beyond the server (e.g., CLI progress bars).
eval.Eval holds an optional default Evaluator — Set by braintrust.NewEval for local runs. The remote eval server ignores it and creates a per-request Evaluator with the caller's session, so traces are attributed correctly.

Abhijeet Prasad (AbhiPrasad)

Ergonomics overall seem good! This review just catches a few things I found after chatting with the LLM.

server/auth.go

Abhijeet Prasad (AbhiPrasad) · 2026-03-26T17:00:01Z

server/auth.go

+	if len(c.entries) >= c.maxSize && len(c.order) > 0 {
+		oldest := c.order[0]
+		if evicted, ok := c.entries[oldest]; ok {
+			evicted.session.Close()


should we wait for inflight requests before evicting the session?
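One possible way to address this concern is reference counting: eviction removes the entry from the cache but defers Close until the last in-flight request releases it. The sketch below is an assumption about how that could look, not what the PR does; the types and method names are hypothetical, and eviction is by insertion order for brevity (a true LRU would also move entries to the back on each hit).

```go
package main

import (
	"fmt"
	"sync"
)

// session stands in for an authenticated session that owns resources.
type session struct {
	token  string
	closed bool
}

func (s *session) Close() { s.closed = true }

type entry struct {
	sess    *session
	refs    int  // in-flight requests currently using the session
	evicted bool // removed from the cache; close when refs reaches zero
}

// cache is a small bounded cache; order holds tokens from oldest to newest.
type cache struct {
	mu      sync.Mutex
	maxSize int
	entries map[string]*entry
	order   []string
}

func newCache(maxSize int) *cache {
	return &cache{maxSize: maxSize, entries: map[string]*entry{}}
}

// acquire returns the session for token, creating it if needed, plus a
// release func the caller must invoke exactly once when its request ends.
func (c *cache) acquire(token string) (*session, func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[token]
	if !ok {
		if len(c.entries) >= c.maxSize && len(c.order) > 0 {
			c.evictOldestLocked()
		}
		e = &entry{sess: &session{token: token}}
		c.entries[token] = e
		c.order = append(c.order, token)
	}
	e.refs++
	return e.sess, func() { c.release(e) }
}

// evictOldestLocked drops the oldest entry; the caller must hold c.mu.
func (c *cache) evictOldestLocked() {
	oldest := c.order[0]
	c.order = c.order[1:]
	e := c.entries[oldest]
	delete(c.entries, oldest)
	e.evicted = true
	if e.refs == 0 {
		e.sess.Close() // nobody is using it, safe to close now
	}
}

func (c *cache) release(e *entry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e.refs--
	if e.evicted && e.refs == 0 {
		e.sess.Close() // last user of an evicted session closes it
	}
}

func main() {
	c := newCache(1)
	a, releaseA := c.acquire("token-a") // request A is in flight
	_, releaseB := c.acquire("token-b") // evicts token-a from the cache
	fmt.Println(a.closed)               // false: A is still using it
	releaseA()
	fmt.Println(a.closed) // true: closed once the last user is done
	releaseB()
}
```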

server/server.go

Matt Perpick (clutchski) · 2026-03-28T01:29:11Z

server/server.go

+	}
+
+	httpClient := https.NewClient(cfg.APIKey, appURL, s.logger)
+	session, err := auth.NewSession(context.Background(), auth.Options{


Why would we create another global session? Don't we already have one, or can't we get this from braintrust.Client()?

Sessions with remote evals are tricky: we're not supposed to re-use the application's auth/session. Each eval request is supposed to be scoped to the user that triggers it, so we have to use their sessions to access their resources. The user may be working with a resource that is not accessible to the eval server which would otherwise error.

Similarly, there can be multiple concurrent user sessions, so we have to be careful not to use the wrong one. Maybe creating another "global" session is not correct here, but I think we may have to create one or re-use an existing matching one (that's what the LRU is for.)

I feel like whatever the approach is, we can have less code duplication and just create a client here

Exploring this a bit more, to try to address your feedback, IIUC, we could allow the user to pass in a client e.g. WithClient(*braintrust.Client).

Claude makes the point that "the duplication is small (15 lines), and tying the server to braintrust.Client creates a coupling that doesn't exist today. The server is intentionally standalone." Decoupling is a principle I was trying to lead with, but if it's worth the deduplication, I'm cool with that.

What's your call?

I would vote for not coupling this for now. I guess we can extract the common code into a helper? But I can't see it being a blocker for users that this isn't part of the public API.

Matt Perpick (clutchski) · 2026-03-28T01:30:38Z

server/server.go

+}
+
+// Start starts the HTTP server and blocks until it is shut down.
+func (s *Server) Start() error {


Why have Start vs. just running the server on creation? This seems unnecessary.

Maybe that makes more sense. However, the intent here was you can create a server, register evaluators, then start it. I think if you make it auto-start, you can't modify it before you start. In which case I think you have to pass all your evaluators on new(): would you prefer that?

I feel like this pattern matches idiomatic go, like http.Server also makes you register handlers, construct the server, and then call ListenAndServe.

Matt Perpick (clutchski)

Made a bunch of comments. I'm really not sure about a per-request tracer provider because I don't think user spans will make it.

David Elner (delner) · 2026-03-29T20:34:27Z

Most feedback is addressed; just a couple of outstanding questions for Matt Perpick (@clutchski) and some failing tests to fix.

Matt Perpick (clutchski) · 2026-03-30T14:43:57Z

eval/eval.go

+
+	// Generation is injected into braintrust.span_attributes on every span
+	// when set. Used by the remote eval server to link spans in a trace hierarchy.
+	Generation any


I need to understand this.

Generation is basically a "version"

server/server.go

Abhijeet Prasad (AbhiPrasad) · 2026-03-30T14:30:23Z

server/server.go

+}
+
+// Start starts the HTTP server and blocks until it is shut down.
+func (s *Server) Start() error {


I feel like this pattern matches idiomatic go, like http.Server also makes you register handlers, construct the server, and then call ListenAndServe.

Abhijeet Prasad (AbhiPrasad) · 2026-03-30T14:45:31Z

server/server.go

+	}
+
+	httpClient := https.NewClient(cfg.APIKey, appURL, s.logger)
+	session, err := auth.NewSession(context.Background(), auth.Options{


I would vote for not coupling this for now. I guess we can extract the common code into a helper? But I can't see it being a blocker for users that this isn't part of the public API.

eval/eval.go

Matt Perpick (clutchski) · 2026-04-02T13:44:03Z

eval/eval.go

 		scorers,
 		parallelism,
-		true, // quiet=true for tests
+		true,             // quiet=true for tests


I might be inclined to do this in code:

quiet := true, callback := noop(), and then the code self-documents.
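A minimal sketch of that suggestion, using a hypothetical newRunner function in place of the real call site: named local variables replace the trailing comments on positional arguments.

```go
package main

import "fmt"

// caseCallback is a hypothetical per-case hook.
type caseCallback func(name string)

// noop returns a callback that does nothing.
func noop() caseCallback { return func(string) {} }

// newRunner is a stand-in for a constructor with several positional arguments.
func newRunner(parallelism int, quiet bool, onCase caseCallback) string {
	onCase("setup")
	return fmt.Sprintf("parallelism=%d quiet=%t", parallelism, quiet)
}

func main() {
	// Before: newRunner(4, true /* quiet=true for tests */, func(string) {})
	// After: named locals document each positional argument.
	parallelism := 4
	quiet := true
	callback := noop()
	fmt.Println(newRunner(parallelism, quiet, callback)) // parallelism=4 quiet=true
}
```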

eval/eval_definition.go

Matt Perpick (clutchski) · 2026-04-02T13:51:15Z

eval/evaluator.go

 }
+
+// RunEval executes an evaluation from a reusable [Eval] definition.
+func (e *Evaluator[I, R]) RunEval(ctx context.Context, ev *Eval[I, R], opts RunOpts[I, R]) (*Result, error) {


Run() I think is fine.

There's already a Run() so I think we'd have to make a potentially breaking change to the existing Run() to support. What's your call?

David Elner (delner) requested a review from Matt Perpick (clutchski) March 25, 2026 04:18

David Elner (delner) self-assigned this Mar 25, 2026

David Elner (delner) added the "enhancement" label (New feature or request) Mar 25, 2026

Abhijeet Prasad (AbhiPrasad) reviewed Mar 26, 2026
