10 changes: 8 additions & 2 deletions examples/Realtime/realtime-rag/utils/docs-sample/eval-ui.mdx
@@ -18,7 +18,9 @@ The following steps require access to a Braintrust organization, which represent
Navigate to the [AI providers](/app/settings?subroute=secrets) page in your settings and configure at least one API key. For this quickstart, be sure to add your OpenAI API key. After completing this initial setup, you can access models from many providers through a single, unified API.

<Callout>
For more advanced use cases where you want to use custom models or avoid
plugging your API key into Braintrust, you may want to check out the
[SDK](/docs/start/eval-sdk) quickstart.
</Callout>

</Step>
@@ -27,12 +29,13 @@ For more advanced use cases where you want to use custom models or avoid plugging
### Create a new project

For every AI feature your organization is building, the first thing you'll do is create a project.

</Step>

<Step>
### Create a new prompt

Navigate to **Library** in the top menu bar, then select **Prompts**. Create a new prompt in your project called "movie matcher". A prompt is the input you provide to the model to generate a response. Choose `GPT 4o` for your model, and type this for your system prompt:
Navigate to **Prompts**. Create a new prompt in your project called "movie matcher". A prompt is the input you provide to the model to generate a response. Choose `GPT 4o` for your model, and type this for your system prompt:

```
Based on the following description, identify the movie title. In your response, simply provide the name of the movie.
```

@@ -49,6 +52,7 @@ Prompts can use [mustache](https://mustache.github.io/mustache.5.html) templating
![First prompt](./movie-matcher-prompt.png)

Select **Save as custom prompt** to save your prompt.
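Mustache templating, which these prompts support, substitutes `{{variable}}` placeholders at render time. A minimal hand-rolled sketch of that substitution, for illustration only (not the actual mustache library, which also supports sections and escaping):

```typescript
// Minimal mustache-style substitution: replaces {{name}} with values
// from a view object. Missing keys render as empty strings.
function renderTemplate(
  template: string,
  view: Record<string, string>,
): string {
  return template.replace(
    /\{\{\s*([\w.]+)\s*\}\}/g,
    (_match: string, key: string) => (key in view ? view[key] : ""),
  );
}

const systemPrompt =
  "Based on the following description, identify the movie title: {{input}}";
// renderTemplate(systemPrompt, { input: "..." }) fills the {{input}}
// slot before the prompt is sent to the model.
```

When you later attach a dataset, each row's fields are what get substituted into these placeholders.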

</Step>

<Step>
@@ -57,6 +61,7 @@ Select **Save as custom prompt** to save your prompt.
Scroll to the bottom of the prompt viewer, and select **Create playground with prompt**. This will open the prompt you just created in the [prompt playground](https://www.braintrust.dev/docs/guides/playground), a tool for exploring, comparing, and evaluating prompts. In the prompt playground, you can evaluate prompts with data from your [datasets](https://www.braintrust.dev/docs/guides/datasets).

![Prompt playground](./prompt-playground.png)

</Step>

<Step>
@@ -89,6 +94,7 @@ In this example, the Data is the dataset you uploaded, the Task is the prompt you
![Create experiment](./create-experiment.png)

Creating an experiment from the playground will automatically log your results to Braintrust.

</Step>

<Step>
120 changes: 63 additions & 57 deletions examples/Realtime/realtime.mdx
@@ -2,8 +2,9 @@

The OpenAI [Realtime API](https://platform.openai.com/docs/guides/realtime), designed for building advanced multimodal conversational experiences, unlocks even more use cases in AI applications. However, evaluating this and other audio models' outputs in practice is an unsolved problem. In this cookbook, we'll build a robust application with the Realtime API, incorporating tool-calling and user input. Then, we'll evaluate the results. Let's get started!

## Getting started

In this cookbook, we're going to build a speech-to-speech RAG agent that answers questions about the Braintrust documentation.

To get started, you'll need a few accounts:

@@ -37,7 +38,7 @@ of your account, and set the `PINECONE_API_KEY` environment variable in the [Env

<Callout type="info">
We'll use the local environment variables to embed and upload the vectors, and
the Braintrust variables to run the RAG tool and LLM calls remotely.
</Callout>

## Upload the vectors
@@ -50,7 +51,7 @@ npx tsx upload-vectors.ts

This script reads all the files from the `docs-sample` directory, breaks them into sections based on headings, and creates vector embeddings for each section using OpenAI's API. It then stores those embeddings along with the section's title and content in Pinecone.
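The heading-based chunking can be sketched roughly as follows. This is an illustrative simplification rather than the actual contents of `upload-vectors.ts`; the `splitIntoSections` name and the embedding/upsert calls in the trailing comment are assumptions:

```typescript
// Split a markdown document into sections, one per heading.
// Each section keeps its heading text as the title and the
// lines beneath it as the content.
interface DocSection {
  title: string;
  content: string;
}

function splitIntoSections(markdown: string): DocSection[] {
  const sections: DocSection[] = [];
  let current: DocSection | null = null;
  for (const line of markdown.split("\n")) {
    const heading = line.match(/^#+\s+(.*)/);
    if (heading) {
      if (current) sections.push(current);
      current = { title: heading[1].trim(), content: "" };
    } else if (current) {
      current.content += line + "\n";
    }
  }
  if (current) sections.push(current);
  return sections;
}

// Each section would then be embedded and upserted, e.g.:
//   const embedding = await openai.embeddings.create({ model: "...", input: section.content });
//   await index.upsert([{ id, values: embedding.data[0].embedding, metadata: section }]);
```

Chunking on headings keeps each vector focused on a single topic, which tends to improve retrieval precision compared to embedding whole files.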

That's it for setup! Now let's dig into the code.

## Accessing the Realtime API

@@ -114,7 +115,8 @@ export default async function Home() {
```

<Callout>
You can also use our proxy with an AI provider’s API key, but you will not
have access to other Braintrust features, like logging.
</Callout>

## Creating a RAG tool
@@ -123,37 +125,44 @@ The retrieval logic also happens on the server side. We set up the helper function

```typescript
client.addTool(
  {
    name: "pinecone_retrieval",
    description:
      "Retrieves relevant information from Braintrust documentation.",
    parameters: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "The search query to find relevant documentation.",
        },
      },
      required: ["query"],
    },
  },
  async ({ query }: { query: string }) => {
    try {
      setLastQuery(query);
      const results = await fetchFromPinecone(query);
      setRetrievalResults(results);
      return results
        .map(
          (result) =>
            `[Score: ${result.score.toFixed(2)}] ${result.metadata.title}\n${
              result.metadata.content
            }`,
        )
        .join("\n\n");
    } catch (error) {
      throw error;
    }
  },
);
```

<Callout type="info">
Currently, because of the way the Realtime API works, we have to use OpenAI
tool calling here instead of Braintrust tool functions.
</Callout>
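The `fetchFromPinecone` helper called by the tool isn't shown in this diff. Below is one plausible server-side shape, with the index query injected so the retrieval logic stays testable; the `Match` type, the top-k default, and the score cutoff are assumptions rather than the app's actual code:

```typescript
interface Match {
  score: number;
  metadata: { title: string; content: string };
}

// Anything that turns a text query into scored matches; in the app
// this would wrap an embedding call plus a Pinecone index.query().
type QueryFn = (query: string, topK: number) => Promise<Match[]>;

async function fetchFromPinecone(
  query: string,
  runQuery: QueryFn,
  topK = 3,
): Promise<Match[]> {
  const matches = await runQuery(query, topK);
  // Drop low-confidence matches so the model isn't fed
  // irrelevant context.
  return matches.filter((m) => m.score >= 0.5);
}
```

Injecting the query function also makes it easy to stub out the network call when iterating on the formatting and filtering logic.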

## Setting up the system prompt
@@ -183,13 +192,13 @@ Personality:
`;
```

Feel free to play around with the system prompt at any point, and see how it impacts the LLM's responses in the app.

## Running the app

To run the app, navigate to `/web` and run `npm run dev`. The app should load on `localhost:3000`.

Start a new conversation, and ask a few questions about Braintrust. Feel free to interrupt the bot, or ask unrelated questions, and see what happens. When you're finished, end the conversation. Have a couple of conversations to get a feel for some of the limitations and nuances of the bot - each conversation will come in handy in the next step.

## Logging in Braintrust

@@ -199,24 +208,24 @@ In addition to client-side authentication, you’ll also get the other benefits

## Online evaluations

In Braintrust, you can run server-side online evaluations that are automatically run asynchronously as you upload logs. This makes it easier to evaluate your app in situations like this, where the prompt and tool might not be synced to Braintrust.

Audio evals are complex because there are multiple aspects of your application you can focus on. In this cookbook, we'll use the vector search query as a proxy for the quality of the Realtime API's interpretation of the user's input.

### Setting up your scorer

We'll need to create a scorer that captures the criteria we want to evaluate. Since we're dealing with complex RAG outputs, we'll use a custom LLM-as-a-judge scorer.
For an LLM-as-a-judge scorer, you define a prompt that evaluates the output and maps its choices to specific scores.

Navigate to **Library** > **Scorers** and create a new scorer. Call your scorer **BraintrustRAG** and add the following prompt:
Navigate to **Scorers** and create a new scorer. Call your scorer **BraintrustRAG** and add the following prompt:

```javascript
Consider the following question:

{{input.arguments.query}}

and answer:

{{output}}

How well does the answer answer the question?
@@ -225,49 +234,50 @@ b) Reasonably well
c) Not well
```

The prompt uses mustache syntax to map the input to the query that gets sent to Pinecone and to capture the output. We'll also assign a choice score to each option we included in the prompt.

![RAG scorer](./assets/rag-scorer.png)
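Conceptually, the choice scores amount to a mapping like the sketch below. The weights (a = 1, b = 0.5, c = 0) are an assumption; you set the real values on the scorer in the Braintrust UI rather than in code:

```typescript
// Map the judge's multiple-choice answer to a numeric score.
// These weights are illustrative; configure the actual values
// on the scorer in the UI.
const choiceScores: Record<string, number> = {
  a: 1, // answers the question very well
  b: 0.5, // reasonably well
  c: 0, // not well
};

function scoreFromChoice(choice: string): number {
  const normalized = choice.trim().toLowerCase();
  return choiceScores[normalized] ?? 0;
}
```

Keeping the "best" choice at 1 and the "worst" at 0 makes the scorer's output directly comparable with Braintrust's other 0-to-1 scores.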

### Configuring your online eval

Navigate to **Configuration** and scroll down to **Online scoring**. Select **Add rule** to configure your online scoring rule. Select the scorer we just created from the menu, and deselect **Apply to root span**. We'll filter to the **function** span since that's where our tool is called.

![Configure score](./assets/configure-score.png)

The score will now automatically run at the specified sampling rate for all logs in the project.

### Viewing your evaluations

Now that you've set up your online evaluations, you can view the scores from within your logs. Underneath each function span that was included in the sampling rate, you'll have an additional span with the score.

![Scoring span](./assets/scoring-span.png)

This particular function call was scored a 0. But if we take a closer look at the logs, we can see that the question was actually answered pretty well.
You may notice this pattern for other logs as well - so is our function actually not performing well?

## Improving your evals

There are three main ways to improve your evals:

- Refine the scoring function to ensure it accurately reflects the success criteria.
- Add new scoring functions to capture different performance aspects (for example, correctness or efficiency).
- Expand your dataset with more diverse or challenging test cases.

In this case, we need to be more precise about what we're testing for in our scoring function. In our application, we're asking for answers within the specific context of Braintrust, but our current scoring function is attempting to judge the responses to our questions objectively.

Let's edit our scoring function to test for that as precisely as possible.

### Improving our existing scorer

Let's change the prompt for our scoring function to:

```javascript
Consider the following question from an existing Braintrust user:

{{input.arguments.query}}

and answer:

{{output}}

How helpful is the answer, assuming the question is always in the context of Braintrust?
@@ -276,7 +286,7 @@ b) Reasonably helpful
c) Not helpful
```

As you continue to iterate on your scoring function and generate more logs, you should aim to see your scores go up.

![Logs over time](./assets/logs-over-time.png)

@@ -286,7 +296,3 @@ As you continue to build more AI applications with complex function calls and ne

- [I ran an eval. Now what?](/blog/after-evals)
- [What to do when a new AI model comes out](/blog/new-model)




22 changes: 12 additions & 10 deletions examples/ToolOCR/ToolOCR.mdx
@@ -1,8 +1,8 @@
# Using Python functions to extract text from images

From digitizing and archiving images of your handwritten notes, to automating invoice processing, there are a multitude of reasons you’d want to extract text from an image. You could use an LLM for image processing, but doing so can sometimes be inaccurate, expensive, and slow. Optical character recognition, or OCR, is a great pre-processing step that allows you to convert raw image data into text that can then be processed or summarized by an LLM.

Maybe you find the perfect recipe on the internet, but it’s surrounded by ads and people’s life stories, or you want to digitize an old recipe written by your grandmother.

![100 good cookies](assets/recipe.png)

@@ -43,10 +43,12 @@ of your account.
Optical character recognition, or OCR, is any type of technology that converts images of typed, handwritten or printed text into machine-encoded text. There are many well known libraries for OCR — in this cookbook, we’ll use [OCR.Space](https://ocr.space/), a free API you can use for testing without creating an account.

<Callout type="info">
For this cookbook, we're using the free version of OCR.Space that limits the
number of requests. You may exceed rate limits and need to upgrade your
account to experiment further with this application.
</Callout>

In Braintrust, you can create tools and then run them in the UI, API, and, of course, via prompts. This will make it easier to iterate on your prompt without having to worry about the OCR logic.

The OCR tool is defined in `ocr.py`:

@@ -81,7 +83,7 @@ def ocr_image(**kwargs) -> str:
raise ValueError(f"Failed to perform OCR: {e}")
```

In just a few lines of code, it takes an image URL, parses and extracts the text, and returns the text contained in the image.

To push the tool to Braintrust along with all its dependencies, run:

@@ -91,15 +93,15 @@ braintrust push ocr.py --requirements requirements.txt

### Try out the tool

To try out the tool, visit the **toolOCR** project in Braintrust, and navigate to the **Tools** section of your **Library**. Here, you can test different images and see what kinds of outputs you're getting from the tool.
To try out the tool, visit the **toolOCR** project in Braintrust, and navigate **Tools**. Here, you can test different images and see what kinds of outputs you're getting from the tool.

![Try gif](assets/try-tool.gif)

This is helpful information for deciding if you'd like to do any additional post processing to the text output. For example, you may notice that your output contains `/n` to indicate new lines in the parsed text. You could include additional processing in your tool to handle these. If you change your code, just run `braintrust push ocr.py --requirements requirements.txt` again to sync the tool with Braintrust.
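That post-processing could look like the sketch below, written in TypeScript for illustration (in `ocr.py` itself it would be the equivalent couple of lines of Python). It assumes the OCR output marks line breaks with literal `/n`, as noted above:

```typescript
// Normalize OCR output: turn literal "/n" markers into real
// newlines, collapse runs of whitespace, and drop empty lines.
function cleanOcrText(raw: string): string {
  return raw
    .split("/n")
    .map((line) => line.trim().replace(/\s+/g, " "))
    .filter((line) => line.length > 0)
    .join("\n");
}
```

Doing this inside the tool keeps the prompt simpler, since the model sees clean text instead of raw OCR markers.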

## Try out the prompt

When we pushed the tool to Braintrust, we also included an initial definition of the prompt:

```python #skip-compile
prompt = project.prompts.create(
```

@@ -144,7 +146,7 @@ Your playground is now set up with a prompt, model choice, dataset, and the tool

## Iterating on the prompt

Now that we have an interactive environment to test out our prompt and tool call, we can tweak the prompt and model until we get the desired results.

Hit the copy icon to duplicate your prompt and start tweaking. You can also tweak the original prompt and save your changes there if you'd like. For example, you can try instructing the model to always list the quantity of each ingredient you need to purchase.

2 changes: 1 addition & 1 deletion examples/ToolRAG/ToolRAG.mdx
@@ -115,7 +115,7 @@ The output should be:

### Try out the tool

To try out the tool, visit the project in Braintrust, and navigate to the **Tools** section of your **Library**.
To try out the tool, visit the project in Braintrust, and navigate to **Tools**.

![Test tool](./assets/Test-tool.gif)
