Remote eval server #48

Open
David Elner (delner) wants to merge 17 commits into main from feature/remote_eval_server

@delner David Elner (delner) commented Mar 25, 2026

Remote Eval Server

Adds a new server package that lets users run evaluations from the Braintrust playground against code on their own infrastructure. The server exposes locally-registered evals over HTTP, allowing the Braintrust UI to trigger evals, stream progress in real time, and display results — without the evaluation code ever leaving the user's environment.

This is useful when evaluations need access to internal APIs, proprietary data, custom runtimes, or complex multi-step workflows that can't run in Braintrust's cloud.

Setup

1. Define your eval and start the server:

package main

import (
    "context"
    "log"
    "strings"

    "github.com/braintrustdata/braintrust-sdk-go/eval"
    "github.com/braintrustdata/braintrust-sdk-go/server"
)

func main() {
    // Define an eval: the task to run and the scorers to apply.
    classify := &eval.Eval[string, string]{
        Name:        "classify",
        ProjectName: "my-project",
        Task: eval.T(func(ctx context.Context, input string) (string, error) {
            if strings.Contains(strings.ToLower(input), "apple") {
                return "fruit", nil
            }
            return "unknown", nil
        }),
        Scorers: []eval.Scorer[string, string]{
            eval.NewScorer("exact_match", func(ctx context.Context, r eval.TaskResult[string, string]) (eval.Scores, error) {
                if r.Output == r.Expected {
                    return eval.S(1.0), nil
                }
                return eval.S(0.0), nil
            }),
        },
    }

    // Start a server and register the eval.
    srv := server.New(
        server.WithAddress("localhost:8300"),
        server.WithNoAuth(), // remove for production
    )
    server.RegisterEval(srv, classify, server.RegisterEvalOpts{})

    log.Fatal(srv.Start())
}

The same eval definition can also be run locally with braintrust.NewEval:

client, err := braintrust.New(tp) // tp, ctx, and dataset are set up elsewhere
if err != nil {
    log.Fatal(err)
}
e := braintrust.NewEval(client, classify)
result, err := e.Run(ctx, eval.RunOpts[string, string]{Dataset: dataset})

2. Configure in Braintrust: Go to your project's Settings > Remote evals and add http://localhost:8300.

3. Use from the playground: Select "Remote eval" as the task type, pick your evaluator, choose a dataset, and run.

See the full working example at examples/internal/eval-server/main.go.


Design

The server implements the same HTTP protocol used by the Ruby and Java SDKs, ensuring compatibility with the Braintrust backend.

┌──────────────────┐         POST /eval           ┌───────────────────┐
│                  │  ──────────────────────────> │                   │
│  Braintrust UI   │                              │  Go Eval Server   │
│  (playground)    │  <─── SSE: progress ──────── │                   │
│                  │  <─── SSE: progress ──────── │  ┌─────────────┐  │
│                  │  <─── SSE: summary ───────── │  │ Eval        │  │
│                  │  <─── SSE: done ──────────── │  │ (task +     │  │
└──────────────────┘                              │  │  scorers)   │  │
                                                  │  └─────────────┘  │
       Clerk token validated via                  │         │         │
       POST /api/apikey/login ────────────────>   │         ▼         │
                                                  │  OTel spans ──>   │
                                                  │  Braintrust API   │
                                                  └───────────────────┘

Endpoints:

  • GET / — Health check
  • GET|POST /list — Returns registered evals with their scorer names and parameter schemas
  • POST /eval — Runs an evaluation and streams results as Server-Sent Events (progress per case, summary with aggregated scores, done)
  • OPTIONS * — CORS preflight for braintrust.dev origins

Components:

  • Eval definitions: eval.Eval[I, R] captures the task, scorers, and project name. braintrust.NewEval(client, &eval.Eval{...}) attaches infrastructure (session, tracer, API client) so it can be run locally via e.Run(). The same definition can also be registered with a server via server.RegisterEval().
  • Registration: server.RegisterEval[I, R]() is a generic package-level function that wraps typed eval definitions behind a non-generic interface using JSON-based type erasure at the HTTP boundary. This lets the server store evals with different type signatures in a single map.
  • Authentication — Each request's Clerk token is validated against Braintrust's /api/apikey/login endpoint. Validated sessions are cached in an LRU (max 64 entries) to avoid re-authenticating on every request. Failed sessions are evicted immediately. A WithNoAuth() option is available for local development.
  • Per-request isolation — Each eval request creates its own Evaluator (with per-request TracerProvider and auth.Session from the caller's token), ensuring traces are attributed to the correct user and organization. The default evaluator on the Eval is not used for remote requests.
  • SSE streaming — A thread-safe SSE writer streams progress events as each case completes. If the client disconnects mid-eval, the write failure cancels the eval context, stopping remaining work.
  • CORS — Allows requests from braintrust.dev and braintrustdata.dev origins (HTTP and HTTPS for local dev) with Private Network Access support. Sets Vary: Origin to prevent cache poisoning.

Key design decisions:

  • net/http stdlib only — No new HTTP framework dependencies. Go 1.22+ http.ServeMux provides method-based routing ("POST /eval"), which is sufficient.
  • server.RegisterEval[I,R]() as a package-level function — Go does not allow generic methods on non-generic types. A package-level function is the idiomatic workaround and keeps the API clean.
  • Handler() method for embedding — Users can mount the eval server's http.Handler inside their own router alongside other endpoints, rather than requiring a standalone process.
  • OnCaseComplete callback on eval.Opts — A minimal, backward-compatible addition (nil defaults to no-op) that enables SSE streaming without reimplementing the eval engine. The callback is general-purpose and useful beyond the server (e.g., CLI progress bars).
  • eval.Eval holds an optional default Evaluator — Set by braintrust.NewEval for local runs. The remote eval server ignores it and creates a per-request Evaluator with the caller's session, so traces are attributed correctly.

Member

Ergonomics overall seem good! This review just flags some things I found after chatting with the LLM.

if len(c.entries) >= c.maxSize && len(c.order) > 0 {
oldest := c.order[0]
if evicted, ok := c.entries[oldest]; ok {
evicted.session.Close()
Member

should we wait for inflight requests before evicting the session?
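
One possible way to address this, sketched under assumptions: count in-flight users per cache entry and defer Close until the last one releases an evicted session. The session, entry, and sessionCache types below are hypothetical, and for brevity the sketch evicts in insertion order rather than true LRU order.

```go
package main

import (
	"fmt"
	"sync"
)

// session stands in for an authenticated session with a Close method.
type session struct {
	token  string
	closed bool
}

func (s *session) Close() { s.closed = true }

type entry struct {
	sess    *session
	refs    int  // in-flight requests currently using sess
	evicted bool // removed from the cache; close once refs reaches zero
}

// sessionCache evicts the oldest entry when full, but defers Close
// until no request still holds the evicted session.
type sessionCache struct {
	mu      sync.Mutex
	maxSize int
	order   []string
	entries map[string]*entry
}

func newSessionCache(maxSize int) *sessionCache {
	return &sessionCache{maxSize: maxSize, entries: map[string]*entry{}}
}

// Acquire returns the session for token, creating it if needed, plus a
// release function the caller must invoke when its request finishes.
func (c *sessionCache) Acquire(token string) (*session, func()) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[token]
	if !ok {
		if len(c.entries) >= c.maxSize && len(c.order) > 0 {
			oldest := c.order[0]
			c.order = c.order[1:]
			old := c.entries[oldest]
			delete(c.entries, oldest)
			old.evicted = true
			if old.refs == 0 {
				old.sess.Close() // nobody is using it; close right away
			}
		}
		e = &entry{sess: &session{token: token}}
		c.entries[token] = e
		c.order = append(c.order, token)
	}
	e.refs++
	var once sync.Once
	release := func() {
		once.Do(func() {
			c.mu.Lock()
			defer c.mu.Unlock()
			e.refs--
			if e.evicted && e.refs == 0 {
				e.sess.Close() // last in-flight user of an evicted session
			}
		})
	}
	return e.sess, release
}

func main() {
	c := newSessionCache(1)
	a, releaseA := c.Acquire("user-a")
	_, releaseB := c.Acquire("user-b") // evicts user-a while it is still in flight
	fmt.Println(a.closed)              // false: user-a's request has not finished
	releaseA()
	fmt.Println(a.closed) // true: closed on last release
	releaseB()
}
```

The trade-off is that an evicted session can outlive its cache slot for as long as a slow eval runs, so the cache size bounds cached sessions, not open ones.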

server/server.go Outdated
}

httpClient := https.NewClient(cfg.APIKey, appURL, s.logger)
session, err := auth.NewSession(context.Background(), auth.Options{
Contributor

why would we create another global session ... don't we already have one, or can't we get this from braintrust.Client()?

Contributor Author

Sessions with remote evals are tricky: we're not supposed to re-use the application's auth/session. Each eval request is supposed to be scoped to the user that triggers it, so we have to use their sessions to access their resources. The user may be working with a resource that is not accessible to the eval server which would otherwise error.

Similarly, there can be multiple concurrent user sessions, so we have to be careful not to use the wrong one. Maybe creating another "global" session is not correct here, but I think we may have to create one or re-use an existing matching one (that's what the LRU is for.)

Contributor

I feel like whatever the approach is, we can have less code duplication and just create a client here

Contributor Author

Exploring this a bit more, to try to address your feedback, IIUC, we could allow the user to pass in a client e.g. WithClient(*braintrust.Client).

Claude makes the point that "the duplication is small (15 lines), and tying the server to braintrust.Client creates a coupling that doesn't exist today. The server is intentionally standalone." Decoupling is a principle I was trying to lead with, but if it's worth the deduplication, I'm cool with that.

What's your call?

Member

I would vote for not coupling this for now. I guess we can extract the common code into a helper? But I can't see it being a blocker for users that this isn't part of the public API.

}

// Start starts the HTTP server and blocks until it is shut down.
func (s *Server) Start() error {
Contributor

why have Start() vs just running the server on creation? this doesn't seem needed

Contributor Author

@delner David Elner (delner) Mar 29, 2026

Maybe that makes more sense. However, the intent here was you can create a server, register evaluators, then start it. I think if you make it auto-start, you can't modify it before you start. In which case I think you have to pass all your evaluators on new(): would you prefer that?

Member

I feel like this pattern matches idiomatic go, like http.Server also makes you register handlers, construct the server, and then call ListenAndServe.

Contributor

@clutchski Matt Perpick (clutchski) left a comment

Made a bunch of comments. I'm really not sure about a per-request tracer provider because I don't think user spans will make it.

@delner David Elner (delner) force-pushed the feature/remote_eval_server branch from bddc044 to d47c8c8 Compare March 29, 2026 19:21
@delner
Contributor Author

Most feedback is addressed; just a couple of outstanding questions for Matt Perpick (@clutchski) and some failing tests to fix.

@delner David Elner (delner) marked this pull request as ready for review March 29, 2026 20:43

// Generation is injected into braintrust.span_attributes on every span
// when set. Used by the remote eval server to link spans in a trace hierarchy.
Generation any
Contributor

I need to understand this.

Contributor Author

@delner David Elner (delner) Mar 31, 2026

Generation is basically a "version"

@delner David Elner (delner) force-pushed the feature/remote_eval_server branch from da426cb to ea0960f Compare March 31, 2026 16:18
scorers,
parallelism,
true, // quiet=true for tests
true, // quiet=true for tests
Contributor

i might be inclined to do this in code

quiet := true, callback := noop(), and then the code self-documents
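
A small sketch of what that suggestion could look like. The runCases helper and its signature are hypothetical stand-ins for the test helper under review; the point is only that named locals can replace trailing comments on positional arguments.

```go
package main

import "fmt"

// runCases is a hypothetical stand-in for a helper that takes positional
// flags; its signature is an assumption made for this sketch.
func runCases(parallelism int, quiet bool, onCaseComplete func(i int)) int {
	done := 0
	for i := 0; i < 3; i++ {
		if !quiet {
			fmt.Println("case", i)
		}
		onCaseComplete(i)
		done++
	}
	return done
}

// noop returns a callback that ignores every completed case.
func noop() func(int) { return func(int) {} }

func main() {
	// Named locals take the place of comments like "true, // quiet=true for tests".
	parallelism := 1
	quiet := true
	callback := noop()
	fmt.Println(runCases(parallelism, quiet, callback))
}
```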

}

// RunEval executes an evaluation from a reusable [Eval] definition.
func (e *Evaluator[I, R]) RunEval(ctx context.Context, ev *Eval[I, R], opts RunOpts[I, R]) (*Result, error) {
Contributor

Run() i think is fine.

Contributor Author

There's already a Run() so I think we'd have to make a potentially breaking change to the existing Run() to support. What's your call?

@delner David Elner (delner) force-pushed the feature/remote_eval_server branch from 5ff574d to 320dcd4 Compare April 2, 2026 19:21
Labels

enhancement New feature or request

3 participants