88 changes: 88 additions & 0 deletions site/tips/AI/ManagedMCPServers.qmd
@@ -0,0 +1,88 @@
---
title: "Getting Started with Managed MCP Servers on Databricks"
description: "Learn how Managed MCP Servers let your AI agents securely connect to Databricks resources and external APIs using the Model Context Protocol."
date-modified: "07/02/2026"
date-format: "DD/MM/YYYY"
categories: [AI, Agents, Unity Catalog]
toc: true
toc-title: Navigation
tags:
- databricks
- mcp
- ai-agents
- unity-catalog
- generative-ai
- tips
draft: false
---

## Summary

- Managed MCP Servers provide ready-to-use connections between your AI agents and Databricks resources like Unity Catalog, Vector Search, and Genie
- Four server types are available out of the box: Unity Catalog Functions, Vector Search, Genie Space, and DBSQL
- All access is governed by Unity Catalog permissions, so agents can only reach data they are authorised to use

## Introduction

Building AI agents that can interact with your data platform has historically meant writing and maintaining custom tool integrations for every resource your agent needs to reach. With Managed MCP Servers, now in Public Preview as of January 2026, Databricks provides a standardised way for agents to connect to platform resources without custom plumbing.

The [Model Context Protocol](https://modelcontextprotocol.io/) (MCP) is an open-source standard that connects AI agents to tools, resources, and contextual information. The key benefit is standardisation: you build a tool once and any MCP-compatible agent can use it, whether it is something you have built yourself or a third-party agent like Claude Code, Cursor, or Codex.

## What Managed MCP Servers Are Available?

Databricks provides four ready-to-use server types:

| Server Type | What It Does | Access Mode |
|---|---|---|
| **Unity Catalog Functions** | Run predefined SQL queries as agent tools | Read |
| **Vector Search** | Query Vector Search indexes to retrieve relevant documents | Read |
| **Genie Space** | Analyse structured data using natural language via Genie | Read |
| **DBSQL** | Run AI-generated SQL to author data pipelines | Read & Write |

Review comment on lines +35 to +37 (Copilot AI, Feb 9, 2026): The table header rows start with `||`, which likely produces an unintended empty first column in rendered output. Use standard Markdown/Quarto table syntax with a single leading `|` per row.

Each server enforces Unity Catalog permissions at every call. If a user does not have access to a table, neither does their agent.

::: {.callout-tip title="Pro Tip" appearance="simple"}
You can connect an agent to multiple servers simultaneously. For example, a customer support agent could use Vector Search for ticket retrieval, Genie for billing queries, and UC Functions for account operations — all in a single conversation.
:::
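
As a rough illustration of connecting one agent to several servers, the sketch below builds one endpoint URL per managed server. The `/api/2.0/mcp/...` path layout, the helper name `managed_mcp_urls`, and the catalog, schema, and Genie space values are assumptions made for illustration and do not come from this article; check the managed MCP documentation for the paths your workspace actually exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class McpTargets:
    """Resources a hypothetical support agent should reach (illustrative values)."""
    vector_search_schema: tuple[str, str]  # (catalog, schema) holding the indexes
    functions_schema: tuple[str, str]      # (catalog, schema) holding UC functions
    genie_space_id: str


def managed_mcp_urls(host: str, targets: McpTargets) -> dict[str, str]:
    """Build one URL per managed server. The path layout is an assumption."""
    base = host.rstrip("/") + "/api/2.0/mcp"
    vs_catalog, vs_schema = targets.vector_search_schema
    fn_catalog, fn_schema = targets.functions_schema
    return {
        "vector_search": f"{base}/vector-search/{vs_catalog}/{vs_schema}",
        "uc_functions": f"{base}/functions/{fn_catalog}/{fn_schema}",
        "genie": f"{base}/genie/{targets.genie_space_id}",
    }


urls = managed_mcp_urls(
    "https://your-workspace.databricks.com/",
    McpTargets(("support", "tickets"), ("support", "account_ops"), "hypothetical-space-id"),
)
for name, url in urls.items():
    print(f"{name}: {url}")
```

Keeping the URL construction in one place means adding or removing a server for an agent is a one-line change, while permissions remain enforced on the server side.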

## Connecting to a Managed MCP Server

To get started locally, you need Python 3.12+ and the `databricks-mcp` package. Authentication uses OAuth via the Databricks SDK.

``` python
pip install databricks-mcp mcp>=1.9 databricks-sdk[openai] mlflow>=3.1.0
```

Review comment on lines +52 to +53 (Copilot AI, Feb 9, 2026): This code fence is labeled as python but contains a shell command. Also, unquoted version specifiers like `mcp>=1.9` / `mlflow>=3.1.0` will be interpreted by most shells as output redirection, so the install command will fail. Use a shell fence (bash/sh) and quote (or escape) the requirement specifiers.

Suggested change:

``` bash
pip install "databricks-mcp" "mcp>=1.9" "databricks-sdk[openai]" "mlflow>=3.1.0"
```

Once installed, your agent can dynamically discover available tools at runtime by listing what the MCP server exposes. Databricks recommends against hardcoding tool names since the set of available tools may change as new capabilities are added.

``` python
from databricks_mcp import DatabricksMCPClient

# The client authenticates via Databricks SDK OAuth
client = DatabricksMCPClient(workspace_url="https://your-workspace.databricks.com")

# Dynamically discover available tools
tools = await client.list_tools()
for tool in tools:
    print(f"Tool: {tool.name} - {tool.description}")
```

Review comment on lines +59 to +67 (Copilot AI, Feb 9, 2026): `await client.list_tools()` is used at top-level; this will raise a `SyntaxError` in standard Python (outside an async function). Wrap the example in an `async def main()` and run it via `asyncio.run(...)`, or note that it must be executed in a notebook/REPL that supports top-level await.

Suggested change:

``` python
import asyncio

from databricks_mcp import DatabricksMCPClient


async def main() -> None:
    # The client authenticates via Databricks SDK OAuth
    client = DatabricksMCPClient(workspace_url="https://your-workspace.databricks.com")

    # Dynamically discover available tools
    tools = await client.list_tools()
    for tool in tools:
        print(f"Tool: {tool.name} - {tool.description}")


if __name__ == "__main__":
    asyncio.run(main())
```

::: {.callout-warning title="Best Practices" appearance="simple"}
- **Do not hardcode tool names** — let your agent discover tools dynamically at runtime
- **Do not parse tool output programmatically** — output formats may change, so let your LLM interpret responses
- **Let the LLM decide** which tools to call based on the user's request and tool descriptions
:::
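
One way to follow these practices is to turn whatever the server lists into tool specifications for the LLM, without naming any tool in code. The sketch below is self-contained and makes several assumptions: `DiscoveredTool` is a hypothetical stand-in for the entries a tool listing returns (assumed to carry a name, a description, and a JSON input schema), and the output uses the common function-calling layout rather than any Databricks-specific format.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DiscoveredTool:
    """Hypothetical stand-in for one entry returned by a tool listing."""
    name: str
    description: str
    input_schema: dict[str, Any] = field(default_factory=dict)


def to_llm_tool_specs(tools: list[DiscoveredTool]) -> list[dict[str, Any]]:
    """Convert discovered tools into function-calling specs, whatever their names."""
    return [
        {
            "type": "function",
            "function": {
                "name": tool.name,
                "description": tool.description,
                # Fall back to an empty object schema when the tool declares none
                "parameters": tool.input_schema or {"type": "object", "properties": {}},
            },
        }
        for tool in tools
    ]


# Illustrative listing; in practice this comes from the server at runtime.
discovered = [
    DiscoveredTool(
        "lookup_account",
        "Fetch account details by ID",
        {"type": "object", "properties": {"account_id": {"type": "string"}}},
    ),
    DiscoveredTool("search_tickets", "Find similar support tickets"),
]
specs = to_llm_tool_specs(discovered)
print([spec["function"]["name"] for spec in specs])
```

Because the code never mentions a specific tool name, a tool added to or removed from the server changes only what the LLM is offered, not the agent code.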

## Beyond Managed: Other MCP Options

If the four built-in servers do not cover your use case, Databricks also supports:

- **External MCP Servers** — connect to MCP servers hosted outside Databricks using managed connections
- **Custom MCP Servers** — host your own MCP server as a Databricks App, giving you full control over the tools exposed

## Further Reading

- [Model Context Protocol (MCP) on Databricks](https://docs.databricks.com/aws/en/generative-ai/mcp/)
- [Use Databricks Managed MCP Servers](https://docs.databricks.com/aws/en/generative-ai/mcp/managed-mcp)
- [MCP Specification](https://modelcontextprotocol.io/)
- [January 2026 Release Notes](https://docs.databricks.com/aws/en/release-notes/product/2026/january)
132 changes: 132 additions & 0 deletions site/tips/MLflow/TracesInUnityCatalog.qmd
@@ -0,0 +1,132 @@
---
title: "Storing MLflow Traces in Unity Catalog"
description: "Learn how to store MLflow traces in Unity Catalog tables using OpenTelemetry format for governed, queryable observability of your AI applications."
date-modified: "07/02/2026"
date-format: "DD/MM/YYYY"
categories: [MLflow, Unity Catalog, Observability]
toc: true
toc-title: Navigation
tags:
- databricks
- mlflow
- unity-catalog
- opentelemetry
- observability
- generative-ai
- tips
draft: false
---

## Summary

- Store MLflow traces as Delta tables in Unity Catalog for unlimited retention, SQL querying, and governed access control
- Traces use OpenTelemetry-compatible format, making them interoperable with your existing observability stack
- Access control is managed through Unity Catalog permissions rather than experiment-level ACLs

## Introduction

If you are building generative AI applications on Databricks, observability matters. You need to understand what your models and agents are doing, how they are performing, and where things go wrong. MLflow has long provided tracing capabilities, but traces were historically tied to MLflow experiments with limited querying and access control options.

As of January 2026, you can store MLflow traces directly in Unity Catalog tables using an OpenTelemetry-compatible (OTEL) format. Your trace data then lives alongside the rest of your governed data assets: queryable with SQL, secured with UC permissions, and stored in Delta tables with unlimited retention. This feature is currently in Beta.

## Why Store Traces in Unity Catalog?

Compared to the default experiment-based storage, Unity Catalog trace storage gives you several advantages:

- **Governed access control** — permissions are managed through UC schema and table-level grants, not experiment ACLs
- **SQL queryability** — query trace data directly from any Databricks SQL warehouse
- **Unlimited storage** — Delta tables handle long-term retention without the constraints of experiment storage
- **Broad visibility** — anyone with table access can view traces regardless of which experiment produced them
- **OTEL compatibility** — trace IDs use URI format, improving interoperability with external observability tools

## Getting Started

You will need MLflow 3.9.0 or later and a Unity Catalog-enabled workspace.

``` python
pip install "mlflow[databricks]>=3.9.0" --upgrade
```

Review comment (Copilot AI, Feb 9, 2026): This code fence is labeled python but contains a shell command (`pip install ...`). Consider switching the fence language to bash/sh so syntax highlighting and copy/paste expectations match the content.

Suggested change: label the fence `bash` instead of `python`.

### Link an Experiment to a UC Schema

First, create or select an experiment and link it to a Unity Catalog schema. Three Delta tables are created automatically to store spans, metrics, and logs.

``` python
import mlflow
from mlflow.entities import UCSchemaLocation
from mlflow.tracing.enablement import set_experiment_trace_location

mlflow.set_tracking_uri("databricks")

experiment = mlflow.get_experiment_by_name("/my-genai-experiment")
if not experiment:
    experiment_id = mlflow.create_experiment(name="/my-genai-experiment")
else:
    experiment_id = experiment.experiment_id

# Link to Unity Catalog schema; tables are created automatically
set_experiment_trace_location(
    location=UCSchemaLocation(catalog_name="ml_catalog", schema_name="traces"),
    experiment_id=experiment_id,
)
```

### Log Traces

Once linked, point your tracing destination at the UC schema and traces flow into Delta tables automatically.

``` python
import mlflow
from mlflow.entities import UCSchemaLocation

mlflow.set_tracking_uri("databricks")

mlflow.tracing.set_destination(
    destination=UCSchemaLocation(
        catalog_name="ml_catalog",
        schema_name="traces",
    )
)

@mlflow.trace
def classify_ticket(text):
    # Your model inference logic here
    return {"category": "billing", "confidence": 0.94}

classify_ticket("I was charged twice for my subscription")
```
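
To make concrete what a tracing decorator records for each call, here is a toy, dependency-free stand-in. It is not MLflow's implementation: `toy_trace`, the in-memory `RECORDED_SPANS` list, and the `OK`/`ERROR` status values are assumptions for illustration, with field names chosen to echo the span columns queried later in this article.

```python
import functools
import time
from typing import Any, Callable

# In-memory stand-in for a spans table; purely illustrative.
RECORDED_SPANS: list[dict[str, Any]] = []


def toy_trace(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record name, inputs, output, status, and duration of each call."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        span: dict[str, Any] = {
            "span_name": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
        }
        try:
            span["outputs"] = func(*args, **kwargs)
            span["status_code"] = "OK"
            return span["outputs"]
        except Exception:
            span["status_code"] = "ERROR"
            raise
        finally:
            # Runs on both success and failure, so every call leaves a span
            span["duration_ms"] = (time.perf_counter() - start) * 1000
            RECORDED_SPANS.append(span)

    return wrapper


@toy_trace
def classify_ticket(text: str) -> dict[str, Any]:
    return {"category": "billing", "confidence": 0.94}


classify_ticket("I was charged twice for my subscription")
print(RECORDED_SPANS[0]["span_name"], RECORDED_SPANS[0]["status_code"])
```

The real decorator sends this kind of record to the configured destination instead of a Python list, which is why no explicit logging call appears in the traced function.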

### Query Traces with SQL

Because traces are stored in Delta tables, you can run standard SQL against them for analysis and monitoring.

``` sql
-- Find slow spans in the last 24 hours
SELECT
    trace_id,
    span_name,
    duration_ms,
    status_code
FROM ml_catalog.traces.mlflow_experiment_trace_otel_spans
WHERE start_time > current_timestamp() - INTERVAL 24 HOURS
    AND duration_ms > 5000
ORDER BY duration_ms DESC;
```

::: {.callout-warning title="Limitations to Know" appearance="simple"}
- Ingestion is limited to 100 traces/second per workspace and 100MB/second per table
- UI performance may degrade beyond 2TB of stored trace data
- Individual trace deletion is not supported through the UI — use SQL `DELETE` statements directly on the UC tables
- Currently in Beta with regional availability limited to `eastus`, `eastus2`, and `westeurope`
:::

::: {.callout-tip title="Pro Tip" appearance="simple"}
You can export traces to both Unity Catalog and an external OpenTelemetry service simultaneously using MLflow's dual export configuration. This lets you keep your existing Datadog, Grafana, or other OTEL-compatible tooling while also getting the governance and SQL queryability of UC.
:::

## Further Reading

- [Store MLflow Traces in Unity Catalog](https://learn.microsoft.com/en-us/azure/databricks/mlflow3/genai/tracing/trace-unity-catalog)
- [MLflow Tracing Concepts](https://docs.databricks.com/aws/en/mlflow3/genai/tracing/tracing-101)
- [OpenTelemetry Export from MLflow](https://docs.databricks.com/aws/en/mlflow3/genai/tracing/integrations/open-telemetry)
- [January 2026 Release Notes](https://docs.databricks.com/aws/en/release-notes/product/2026/january)
136 changes: 136 additions & 0 deletions site/tips/Observability/LakeflowJobsSystemTables.qmd
@@ -0,0 +1,136 @@
---
title: "Monitoring Your Jobs with Lakeflow System Tables"
description: "Learn how to use the now GA Lakeflow system tables to monitor job runs, track task performance, and analyse compute costs across your Databricks account."
date-modified: "07/02/2026"
date-format: "DD/MM/YYYY"
categories: [Lakeflow, Observability, Data Engineering]
toc: true
toc-title: Navigation
tags:
- databricks
- lakeflow
- jobs
- system-tables
- monitoring
- observability
- tips
draft: false
---

## Summary

- The Lakeflow system tables `jobs`, `job_tasks`, `job_run_timeline`, and `job_task_run_timeline` are now Generally Available as of January 2026
- These tables provide account-wide visibility into all job definitions, run history, and task-level execution metrics
- Join with `billing.usage` to calculate cost per job run for precise spend attribution

Review comment (Copilot AI, Feb 9, 2026): The summary references `billing.usage`, but the example query below joins `system.billing.usage`. Update the summary to match the actual system table path to avoid confusion for readers.

Suggested change: replace `billing.usage` with `system.billing.usage` in the bullet above, so that it reads "Join with `system.billing.usage` to calculate cost per job run for precise spend attribution".

## Introduction

Understanding what your Databricks jobs are doing — and what they are costing you — has always required stitching together information from the Jobs UI, cluster logs, and billing exports. With the GA release of the Lakeflow system tables in January 2026, Databricks now provides a unified, SQL-queryable record of every job and job run across your entire account.

These tables live in the `system.lakeflow` schema (previously called `system.workflow` — the content is identical) and cover all workspaces deployed in the same cloud region. They retain 365 days of data at no additional cost, and they support streaming reads so you can build real-time monitoring pipelines on top of them.

## What Tables Are Available?

The `system.lakeflow` schema contains four GA tables and two in Public Preview:

| Table | Description | Type |
|---|---|---|
| **jobs** | All job definitions in your account | SCD2 (slowly changing) |
| **job_tasks** | Task definitions within each job | SCD2 |
| **job_run_timeline** | Run-level execution history and metrics | Immutable |
| **job_task_run_timeline** | Task-level execution history and metrics | Immutable |
| *pipelines* (Preview) | Pipeline definitions | SCD2 |
| *pipeline_update_timeline* (Preview) | Pipeline update history | Immutable |

Review comment on lines +36 to +38 (Copilot AI, Feb 9, 2026): The table header rows start with `||`, which will render an extra empty column (or break table parsing) in most Markdown/Quarto renderers. Use standard table syntax with a single leading `|` for each row.

::: {.callout-tip title="Beginner Tip" appearance="simple"}
The `jobs` table is a slowly changing dimension (SCD2) table. When a job definition changes, a new row is emitted rather than updating the existing one. This means you get a full audit trail of every configuration change.
:::
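
Because an SCD2 table holds one row per change, joining run history to it on `workspace_id` and `job_id` alone can match several rows for the same job. The sketch below shows the usual remedy in plain Python: keep only the most recent row per job before joining. The rows are invented, and the `change_time` field name is an assumption that this article does not confirm.

```python
from datetime import datetime

# Illustrative rows shaped like an SCD2 table: one row per change to a job.
job_rows = [
    {"workspace_id": 1, "job_id": "10", "name": "nightly_load", "change_time": datetime(2026, 1, 5)},
    {"workspace_id": 1, "job_id": "10", "name": "nightly_load_v2", "change_time": datetime(2026, 1, 20)},
    {"workspace_id": 1, "job_id": "11", "name": "hourly_sync", "change_time": datetime(2026, 1, 8)},
]


def latest_definitions(rows: list[dict]) -> dict[tuple, dict]:
    """Keep only the most recent row for each (workspace_id, job_id)."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        key = (row["workspace_id"], row["job_id"])
        if key not in latest or row["change_time"] > latest[key]["change_time"]:
            latest[key] = row
    return latest


current = latest_definitions(job_rows)
print({key: row["name"] for key, row in current.items()})
```

In SQL the same idea is typically expressed with a window function that ranks rows per job by change time and keeps the first.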

## Practical Examples

### Find Failed Jobs in the Last 24 Hours

A quick query to surface recent failures across your account:

``` sql
SELECT
    j.name AS job_name,
    r.run_id,
    r.result_state,
    r.termination_code,
    r.period_start_time,
    r.period_end_time
FROM system.lakeflow.job_run_timeline r
JOIN system.lakeflow.jobs j
    ON r.workspace_id = j.workspace_id
    AND r.job_id = j.job_id
WHERE r.result_state = 'FAILED'
    AND r.period_start_time > current_timestamp() - INTERVAL 24 HOURS
ORDER BY r.period_start_time DESC;
```

### Calculate Cost Per Job Run

One of the most valuable patterns is joining the run timeline with the billing system table to get actual cost per execution:

``` sql
SELECT
    j.name AS job_name,
    r.run_id,
    r.result_state,
    SUM(b.usage_quantity * lp.pricing.default) AS estimated_cost
FROM system.lakeflow.job_run_timeline r
JOIN system.lakeflow.jobs j
    ON r.workspace_id = j.workspace_id
    AND r.job_id = j.job_id
JOIN system.billing.usage b
    ON r.workspace_id = b.workspace_id
    AND r.job_id = b.usage_metadata.job_id
    AND r.run_id = b.usage_metadata.job_run_id
JOIN system.billing.list_prices lp
    ON b.sku_name = lp.sku_name
    -- A list price applies over an interval, so match each usage record to the price in effect
    AND b.usage_start_time >= lp.price_start_time
    AND (lp.price_end_time IS NULL OR b.usage_start_time < lp.price_end_time)
GROUP BY j.name, r.run_id, r.result_state
ORDER BY estimated_cost DESC;
```

::: {.callout-warning title="Important" appearance="simple"}
Jobs running on all-purpose (interactive) compute share resources with other workloads, so cost attribution will not be precise. For accurate per-job costing, use dedicated job compute or serverless compute.
:::
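
A list price is valid over an interval rather than at a single instant, so each usage record has to be matched to the price whose interval contains it. The following sketch shows that matching and the resulting per-run sum in plain Python. The SKU name, prices, and record fields are invented for illustration; only the interval logic is the point.

```python
from datetime import datetime
from typing import Optional

# Illustrative price history for one SKU: (start, end, unit price).
# end=None means the price is still current.
PRICES: dict[str, list[tuple[datetime, Optional[datetime], float]]] = {
    "JOBS_COMPUTE": [
        (datetime(2025, 1, 1), datetime(2026, 1, 1), 0.15),
        (datetime(2026, 1, 1), None, 0.20),
    ],
}


def price_at(sku: str, when: datetime) -> float:
    """Return the unit price whose validity interval contains `when`."""
    for start, end, price in PRICES[sku]:
        if start <= when and (end is None or when < end):
            return price
    raise LookupError(f"no price for {sku} at {when}")


def run_cost(usage_records: list[dict]) -> float:
    """Sum quantity x price over the usage records of one job run."""
    return sum(r["quantity"] * price_at(r["sku"], r["start"]) for r in usage_records)


# Two hypothetical usage records that straddle a price change.
usage = [
    {"sku": "JOBS_COMPUTE", "quantity": 10.0, "start": datetime(2025, 12, 31, 23)},
    {"sku": "JOBS_COMPUTE", "quantity": 10.0, "start": datetime(2026, 1, 1, 0)},
]
print(round(run_cost(usage), 2))
```

Matching on an exact start timestamp instead of an interval would silently drop every usage record that does not begin at the moment a price took effect.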

### Monitor Job Duration Trends

Track whether your jobs are getting slower over time:

``` sql
SELECT
    j.name AS job_name,
    DATE(r.period_start_time) AS run_date,
    AVG(r.run_duration_ms) / 1000 AS avg_duration_seconds,
    COUNT(*) AS run_count
FROM system.lakeflow.job_run_timeline r
JOIN system.lakeflow.jobs j
    ON r.workspace_id = j.workspace_id
    AND r.job_id = j.job_id
WHERE r.period_start_time > current_timestamp() - INTERVAL 30 DAYS
    -- Successful runs are recorded with the result state SUCCEEDED
    AND r.result_state = 'SUCCEEDED'
GROUP BY j.name, DATE(r.period_start_time)
ORDER BY j.name, run_date;
```
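
Once daily averages like these are available, a simple way to decide whether a job is "getting slower" is to compare a recent window against the earlier baseline. The sketch below is one possible heuristic. The function name, the seven-day window, the 1.2 alert threshold, and the sample data are all assumptions for illustration.

```python
from statistics import mean


def slowdown_ratio(daily_avg_seconds: list[float], recent_days: int = 7) -> float:
    """Ratio of the recent average duration to the earlier baseline average."""
    if len(daily_avg_seconds) <= recent_days:
        raise ValueError("need more history than the recent window")
    baseline = mean(daily_avg_seconds[:-recent_days])
    recent = mean(daily_avg_seconds[-recent_days:])
    return recent / baseline


# Illustrative daily averages in seconds, oldest first: 23 steady days, then 7 slower ones.
history = [600.0] * 23 + [780.0] * 7
ratio = slowdown_ratio(history)
if ratio > 1.2:  # hypothetical alert threshold
    print(f"job is running at {ratio:.0%} of its baseline duration")
```

A ratio-based check is insensitive to the absolute duration of the job, so the same threshold can be applied across short and long jobs.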

## Access Requirements

To query these tables, you need one of:

- **Metastore admin** and **account admin** roles, or
- Explicit `USE` and `SELECT` grants on the `system.lakeflow` schema

::: {.callout-note title="Regional Scope" appearance="simple"}
System tables contain records from all workspaces in the same cloud region. To see jobs from another region, query from a workspace deployed in that region.
:::

## Further Reading

- [Jobs System Table Reference](https://docs.databricks.com/aws/en/admin/system-tables/jobs)
- [Monitoring and Observability for Lakeflow Jobs](https://docs.databricks.com/aws/en/jobs/monitor)
- [Lakeflow Jobs Overview](https://docs.databricks.com/aws/en/jobs/)
- [January 2026 Release Notes](https://docs.databricks.com/aws/en/release-notes/product/2026/january)