Conversation
Abhijeet Prasad (AbhiPrasad)
left a comment
There was a problem hiding this comment.
ergonomics overall seems good! Review just catches some stuff I found after chatting with the llm.
| if len(c.entries) >= c.maxSize && len(c.order) > 0 { | ||
| oldest := c.order[0] | ||
| if evicted, ok := c.entries[oldest]; ok { | ||
| evicted.session.Close() |
There was a problem hiding this comment.
should we wait for inflight requests before evicting the session?
server/server.go
Outdated
| } | ||
|
|
||
| httpClient := https.NewClient(cfg.APIKey, appURL, s.logger) | ||
| session, err := auth.NewSession(context.Background(), auth.Options{ |
There was a problem hiding this comment.
why would we create another global session ... don't we already have one or can't we get this frmo braintrust.Client()
There was a problem hiding this comment.
Sessions with remote evals are tricky: we're not supposed to re-use the application's auth/session. Each eval request is supposed to be scoped to the user that triggers it, so we have to use their sessions to access their resources. The user may be working with a resource that is not accessible to the eval server which would otherwise error.
Similarly, there can be multiple concurrent user sessions, so we have to be careful not to use the wrong one. Maybe creating another "global" session is not correct here, but I think we may have to create one or re-use an existing matching one (that's what the LRU is for.)
There was a problem hiding this comment.
I feel like whatever the approach is, we can have less code duplication and just create a client here
There was a problem hiding this comment.
Exploring this a bit more, to try to address your feedback, IIUC, we could allow the user to pass in a client e.g. WithClient(*braintrust.Client).
Claude makes the point that "the duplication is small (15 lines), and tying the server to braintrust.Client creates a coupling that doesn't exist today. The server is intentionally standalone." Decoupling is a principle I was trying to lead with, but if its worth the deduplication, I'm cool with that.
What's your call?
There was a problem hiding this comment.
I would vote for not coupling this for now. I guess we can extract the common code into a helper? But I can't see it being a blocker for users that this isn't part of the public API.
| } | ||
|
|
||
| // Start starts the HTTP server and blocks until it is shut down. | ||
| func (s *Server) Start() error { |
There was a problem hiding this comment.
why have start vs just running the server on creation? these needs not needed
There was a problem hiding this comment.
Maybe that makes more sense. However, the intent here was you can create a server, register evaluators, then start it. I think if you make it auto-start, you can't modify it before you start. In which case I think you have to pass all your evaluators on new(): would you prefer that?
There was a problem hiding this comment.
I feel like this pattern matches idiomatic go, like http.Server also makes you register handlers, construct the server, and then call ListenAndServe.
Matt Perpick (clutchski)
left a comment
There was a problem hiding this comment.
Made a bunch of comments. I'm very not sure about a per request tracer provider because I don't think user spans will make it.
bddc044 to
d47c8c8
Compare
|
Most feedback is addressed; just a couple of outstanding questions for Matt Perpick (@clutchski) and some failing tests to fix. |
|
|
||
| // Generation is injected into braintrust.span_attributes on every span | ||
| // when set. Used by the remote eval server to link spans in a trace hierarchy. | ||
| Generation any |
There was a problem hiding this comment.
I need to understand this.
There was a problem hiding this comment.
Generation is basically a "version"
| } | ||
|
|
||
| // Start starts the HTTP server and blocks until it is shut down. | ||
| func (s *Server) Start() error { |
There was a problem hiding this comment.
I feel like this pattern matches idiomatic go, like http.Server also makes you register handlers, construct the server, and then call ListenAndServe.
server/server.go
Outdated
| } | ||
|
|
||
| httpClient := https.NewClient(cfg.APIKey, appURL, s.logger) | ||
| session, err := auth.NewSession(context.Background(), auth.Options{ |
There was a problem hiding this comment.
I would vote for not coupling this for now. I guess we can extract the common code into a helper? But I can't see it being a blocker for users that this isn't part of the public API.
da426cb to
ea0960f
Compare
| scorers, | ||
| parallelism, | ||
| true, // quiet=true for tests | ||
| true, // quiet=true for tests |
There was a problem hiding this comment.
i might be inclined to do this in code
quiet := true, callback:=noop() and then the code self documents
eval/evaluator.go
Outdated
| } | ||
|
|
||
| // RunEval executes an evaluation from a reusable [Eval] definition. | ||
| func (e *Evaluator[I, R]) RunEval(ctx context.Context, ev *Eval[I, R], opts RunOpts[I, R]) (*Result, error) { |
There was a problem hiding this comment.
Run() i think is fine.
There was a problem hiding this comment.
There's already a Run() so I think we'd have to make a potentially breaking change to the existing Run() to support. What's your call?
5ff574d to
320dcd4
Compare
Remote Eval Server
Adds a new
serverpackage that lets users run evaluations from the Braintrust playground against code on their own infrastructure. The server exposes locally-registered evals over HTTP, allowing the Braintrust UI to trigger evals, stream progress in real time, and display results — without the evaluation code ever leaving the user's environment.This is useful when evaluations need access to internal APIs, proprietary data, custom runtimes, or complex multi-step workflows that can't run in Braintrust's cloud.
Setup
1. Define your eval and start the server:
The same eval definition can also be run locally with
braintrust.NewEval:2. Configure in Braintrust: Go to your project's Settings > Remote evals and add
http://localhost:8300.3. Use from the playground: Select "Remote eval" as the task type, pick your evaluator, choose a dataset, and run.
See the full working example at
examples/internal/eval-server/main.go.Design
The server implements the same HTTP protocol used by the Ruby and Java SDKs, ensuring compatibility with the Braintrust backend.
Endpoints:
GET /— Health checkGET|POST /list— Returns registered evals with their scorer names and parameter schemasPOST /eval— Runs an evaluation and streams results as Server-Sent Events (progress per case, summary with aggregated scores, done)OPTIONS *— CORS preflight forbraintrust.devoriginsComponents:
eval.Eval[I, R]captures the task, scorers, and project name.braintrust.NewEval(client, &eval.Eval{...})attaches infrastructure (session, tracer, API client) so it can be run locally viae.Run(). The same definition can also be registered with a server viaserver.RegisterEval().server.RegisterEval[I, R]()is a generic package-level function that wraps typed eval definitions behind a non-generic interface using JSON-based type erasure at the HTTP boundary. This lets the server store evals with different type signatures in a single map./api/apikey/loginendpoint. Validated sessions are cached in an LRU (max 64 entries) to avoid re-authenticating on every request. Failed sessions are evicted immediately. AWithNoAuth()option is available for local development.Evaluator(with per-requestTracerProviderandauth.Sessionfrom the caller's token), ensuring traces are attributed to the correct user and organization. The default evaluator on theEvalis not used for remote requests.braintrust.devandbraintrustdata.devorigins (HTTP and HTTPS for local dev) with Private Network Access support. SetsVary: Originto prevent cache poisoning.Key design decisions:
net/httpstdlib only — No new HTTP framework dependencies. Go 1.22+http.ServeMuxprovides method-based routing ("POST /eval"), which is sufficient.server.RegisterEval[I,R]()as a package-level function — Go does not allow generic methods on non-generic types. A package-level function is the idiomatic workaround and keeps the API clean.Handler()method for embedding — Users can mount the eval server'shttp.Handlerinside their own router alongside other endpoints, rather than requiring a standalone process.OnCaseCompletecallback oneval.Opts— A minimal, backward-compatible addition (nil defaults to no-op) that enables SSE streaming without reimplementing the eval engine. The callback is general-purpose and useful beyond the server (e.g., CLI progress bars).eval.Evalholds an optional defaultEvaluator— Set bybraintrust.NewEvalfor local runs. The remote eval server ignores it and creates a per-requestEvaluatorwith the caller's session, so traces are attributed correctly.