diff --git a/site/tips/AI/ManagedMCPServers.qmd b/site/tips/AI/ManagedMCPServers.qmd
new file mode 100644
index 0000000..d78956d
--- /dev/null
+++ b/site/tips/AI/ManagedMCPServers.qmd
@@ -0,0 +1,88 @@
+---
+title: "Getting Started with Managed MCP Servers on Databricks"
+description: "Learn how Managed MCP Servers let your AI agents securely connect to Databricks resources and external APIs using the Model Context Protocol."
+date-modified: "07/02/2026"
+date-format: "DD/MM/YYYY"
+categories: [AI, Agents, Unity Catalog]
+toc: true
+toc-title: Navigation
+tags:
+  - databricks
+  - mcp
+  - ai-agents
+  - unity-catalog
+  - generative-ai
+  - tips
+draft: false
+---
+
+## Summary
+
+- Managed MCP Servers provide ready-to-use connections between your AI agents and Databricks resources like Unity Catalog, Vector Search, and Genie
+- Four server types are available out of the box: Unity Catalog Functions, Vector Search, Genie Space, and DBSQL
+- All access is governed by Unity Catalog permissions, so agents can only reach data they are authorised to use
+
+## Introduction
+
+Building AI agents that can interact with your data platform has historically meant writing and maintaining custom tool integrations for every resource your agent needs to reach. With Managed MCP Servers, now in Public Preview as of January 2026, Databricks provides a standardised way for agents to connect to platform resources without custom plumbing.
+
+The [Model Context Protocol](https://modelcontextprotocol.io/) (MCP) is an open-source standard that connects AI agents to tools, resources, and contextual information. The key benefit is standardisation: you build a tool once and any MCP-compatible agent can use it, whether it is something you have built yourself or a third-party agent like Claude Code, Cursor, or Codex.
+
+## What Managed MCP Servers Are Available?
+
+Databricks provides four ready-to-use server types:
+
+| Server Type | What It Does | Access Mode |
+|---|---|---|
+| **Unity Catalog Functions** | Run predefined SQL queries as agent tools | Read |
+| **Vector Search** | Query Vector Search indexes to retrieve relevant documents | Read |
+| **Genie Space** | Analyse structured data using natural language via Genie | Read |
+| **DBSQL** | Run AI-generated SQL to author data pipelines | Read & Write |
+
+Each server enforces Unity Catalog permissions at every call. If a user does not have access to a table, neither does their agent.
+
+::: {.callout-tip title="Pro Tip" appearance="simple"}
+You can connect an agent to multiple servers simultaneously. For example, a customer support agent could use Vector Search for ticket retrieval, Genie for billing queries, and UC Functions for account operations — all in a single conversation.
+:::
+
+## Connecting to a Managed MCP Server
+
+To get started locally, you need Python 3.12+ and the `databricks-mcp` package. Authentication uses OAuth via the Databricks SDK.
+
+``` bash
+pip install databricks-mcp "mcp>=1.9" "databricks-sdk[openai]" "mlflow>=3.1.0"
+```
+
+Once installed, your agent can dynamically discover available tools at runtime by listing what the MCP server exposes. Databricks recommends against hardcoding tool names since the set of available tools may change as new capabilities are added.
+
+``` python
+from databricks.sdk import WorkspaceClient
+from databricks_mcp import DatabricksMCPClient
+
+# The client authenticates via Databricks SDK OAuth; this URL targets the built-in system.ai UC functions server
+client = DatabricksMCPClient(server_url="https://your-workspace.databricks.com/api/2.0/mcp/functions/system/ai", workspace_client=WorkspaceClient())
+
+tools = client.list_tools()  # synchronous call: dynamically discover available tools
+for tool in tools:
+    print(f"Tool: {tool.name} - {tool.description}")
+```
+
+::: {.callout-warning title="Best Practices" appearance="simple"}
+- **Do not hardcode tool names** — let your agent discover tools dynamically at runtime
+- **Do not parse tool output programmatically** — output formats may change, so let your LLM interpret responses
+- **Let the LLM decide** which tools to call based on the user's request and tool descriptions
+:::
+
+## Beyond Managed: Other MCP Options
+
+If the four built-in servers do not cover your use case, Databricks also supports:
+
+- **External MCP Servers** — connect to MCP servers hosted outside Databricks using managed connections
+- **Custom MCP Servers** — host your own MCP server as a Databricks App, giving you full control over the tools exposed
+
+## Further Reading
+
+- [Model Context Protocol (MCP) on Databricks](https://docs.databricks.com/aws/en/generative-ai/mcp/)
+- [Use Databricks Managed MCP Servers](https://docs.databricks.com/aws/en/generative-ai/mcp/managed-mcp)
+- [MCP Specification](https://modelcontextprotocol.io/)
+- [January 2026 Release Notes](https://docs.databricks.com/aws/en/release-notes/product/2026/january)
diff --git a/site/tips/MLflow/TracesInUnityCatalog.qmd b/site/tips/MLflow/TracesInUnityCatalog.qmd
new file mode 100644
index 0000000..ab7fae1
--- /dev/null
+++ b/site/tips/MLflow/TracesInUnityCatalog.qmd
@@ -0,0 +1,132 @@
+---
+title: "Storing MLflow Traces in Unity Catalog"
+description: "Learn how to store MLflow traces in Unity Catalog tables using OpenTelemetry format for governed, queryable observability of your AI applications."
+date-modified: "07/02/2026"
+date-format: "DD/MM/YYYY"
+categories: [MLflow, Unity Catalog, Observability]
+toc: true
+toc-title: Navigation
+tags:
+  - databricks
+  - mlflow
+  - unity-catalog
+  - opentelemetry
+  - observability
+  - generative-ai
+  - tips
+draft: false
+---
+
+## Summary
+
+- Store MLflow traces as Delta tables in Unity Catalog for unlimited retention, SQL querying, and governed access control
+- Traces use OpenTelemetry-compatible format, making them interoperable with your existing observability stack
+- Access control is managed through Unity Catalog permissions rather than experiment-level ACLs
+
+## Introduction
+
+If you are building generative AI applications on Databricks, observability matters. You need to understand what your models and agents are doing, how they are performing, and where things go wrong. MLflow has long provided tracing capabilities, but traces were historically tied to MLflow experiments with limited querying and access control options.
+
+As of January 2026, you can now store MLflow traces directly in Unity Catalog tables using an OpenTelemetry-compatible (OTEL) format. This means your trace data lives alongside the rest of your governed data assets — queryable with SQL, secured with UC permissions, and stored in Delta tables with unlimited retention. This feature is currently in Beta.
+
+## Why Store Traces in Unity Catalog?
+
+Compared to the default experiment-based storage, Unity Catalog trace storage gives you several advantages:
+
+- **Governed access control** — permissions are managed through UC schema and table-level grants, not experiment ACLs
+- **SQL queryability** — query trace data directly from any Databricks SQL warehouse
+- **Unlimited storage** — Delta tables handle long-term retention without the constraints of experiment storage
+- **Broad visibility** — anyone with table access can view traces regardless of which experiment produced them
+- **OTEL compatibility** — trace IDs use URI format, improving interoperability with external observability tools
+
+## Getting Started
+
+You will need MLflow 3.9.0 or later and a Unity Catalog-enabled workspace.
+
+``` bash
+pip install --upgrade "mlflow[databricks]>=3.9.0"
+```
+
+### Link an Experiment to a UC Schema
+
+First, create or select an experiment and link it to a Unity Catalog schema. Three Delta tables are created automatically to store spans, metrics, and logs.
+
+``` python
+import mlflow
+from mlflow.entities import UCSchemaLocation
+from mlflow.tracing.enablement import set_experiment_trace_location
+
+mlflow.set_tracking_uri("databricks")
+
+experiment = mlflow.get_experiment_by_name("/my-genai-experiment")
+if experiment is None:
+    experiment_id = mlflow.create_experiment(name="/my-genai-experiment")
+else:
+    experiment_id = experiment.experiment_id
+
+# Link to Unity Catalog schema — tables are created automatically
+set_experiment_trace_location(
+    location=UCSchemaLocation(catalog_name="ml_catalog", schema_name="traces"),
+    experiment_id=experiment_id,
+)
+```
+
+### Log Traces
+
+Once linked, point your tracing destination at the UC schema and traces flow into Delta tables automatically.
+
+``` python
+import mlflow
+from mlflow.entities import UCSchemaLocation
+
+mlflow.set_tracking_uri("databricks")
+
+mlflow.tracing.set_destination(
+    destination=UCSchemaLocation(
+        catalog_name="ml_catalog",
+        schema_name="traces",
+    )
+)
+
+@mlflow.trace
+def classify_ticket(text):
+    # Your model inference logic here
+    return {"category": "billing", "confidence": 0.94}
+
+classify_ticket("I was charged twice for my subscription")
+```
+
+### Query Traces with SQL
+
+Because traces are stored in Delta tables, you can run standard SQL against them for analysis and monitoring.
+
+``` sql
+-- Find slow spans in the last 24 hours
+SELECT
+  trace_id,
+  span_name,
+  duration_ms,
+  status_code
+FROM ml_catalog.traces.mlflow_experiment_trace_otel_spans
+WHERE start_time > current_timestamp() - INTERVAL 24 HOURS
+  AND duration_ms > 5000
+ORDER BY duration_ms DESC;
+```
+
+::: {.callout-warning title="Limitations to Know" appearance="simple"}
+- Ingestion is limited to 100 traces/second per workspace and 100MB/second per table
+- UI performance may degrade beyond 2TB of stored trace data
+- Individual trace deletion is not supported through the UI — use SQL `DELETE` statements directly on the UC tables
+- Currently in Beta with regional availability limited to `eastus`, `eastus2`, and `westeurope`
+:::
+
+::: {.callout-tip title="Pro Tip" appearance="simple"}
+You can export traces to both Unity Catalog and an external OpenTelemetry service simultaneously using MLflow's dual export configuration. This lets you keep your existing Datadog, Grafana, or other OTEL-compatible tooling while also getting the governance and SQL queryability of UC.
+:::
+
+## Further Reading
+
+- [Store MLflow Traces in Unity Catalog](https://learn.microsoft.com/en-us/azure/databricks/mlflow3/genai/tracing/trace-unity-catalog)
+- [MLflow Tracing Concepts](https://docs.databricks.com/aws/en/mlflow3/genai/tracing/tracing-101)
+- [OpenTelemetry Export from MLflow](https://docs.databricks.com/aws/en/mlflow3/genai/tracing/integrations/open-telemetry)
+- [January 2026 Release Notes](https://docs.databricks.com/aws/en/release-notes/product/2026/january)
diff --git a/site/tips/Observability/LakeflowJobsSystemTables.qmd b/site/tips/Observability/LakeflowJobsSystemTables.qmd
new file mode 100644
index 0000000..d6dd4ed
--- /dev/null
+++ b/site/tips/Observability/LakeflowJobsSystemTables.qmd
@@ -0,0 +1,136 @@
+---
+title: "Monitoring Your Jobs with Lakeflow System Tables"
+description: "Learn how to use the now GA Lakeflow system tables to monitor job runs, track task performance, and analyse compute costs across your Databricks account."
+date-modified: "07/02/2026"
+date-format: "DD/MM/YYYY"
+categories: [Lakeflow, Observability, Data Engineering]
+toc: true
+toc-title: Navigation
+tags:
+  - databricks
+  - lakeflow
+  - jobs
+  - system-tables
+  - monitoring
+  - observability
+  - tips
+draft: false
+---
+
+## Summary
+
+- The Lakeflow system tables `jobs`, `job_tasks`, `job_run_timeline`, and `job_task_run_timeline` are now Generally Available as of January 2026
+- These tables provide account-wide visibility into all job definitions, run history, and task-level execution metrics
+- Join with `billing.usage` to calculate cost per job run for precise spend attribution
+
+## Introduction
+
+Understanding what your Databricks jobs are doing — and what they are costing you — has always required stitching together information from the Jobs UI, cluster logs, and billing exports. With the GA release of the Lakeflow system tables in January 2026, Databricks now provides a unified, SQL-queryable record of every job and job run across your entire account.
+
+These tables live in the `system.lakeflow` schema (previously called `system.workflow` — the content is identical) and cover all workspaces deployed in the same cloud region. They retain 365 days of data at no additional cost, and they support streaming reads so you can build real-time monitoring pipelines on top of them.
+
+## What Tables Are Available?
+
+The `system.lakeflow` schema contains four GA tables and two in Public Preview:
+
+| Table | Description | Type |
+|---|---|---|
+| **jobs** | All job definitions in your account | SCD2 (slowly changing) |
+| **job_tasks** | Task definitions within each job | SCD2 |
+| **job_run_timeline** | Run-level execution history and metrics | Immutable |
+| **job_task_run_timeline** | Task-level execution history and metrics | Immutable |
+| *pipelines* (Preview) | Pipeline definitions | SCD2 |
+| *pipeline_update_timeline* (Preview) | Pipeline update history | Immutable |
+
+::: {.callout-tip title="Beginner Tip" appearance="simple"}
+The `jobs` table is a slowly changing dimension (SCD2) table. When a job definition changes, a new row is emitted rather than updating the existing one. This means you get a full audit trail of every configuration change.
+:::
+
+## Practical Examples
+
+### Find Failed Jobs in the Last 24 Hours
+
+A quick query to surface recent failures across your account:
+
+``` sql
+SELECT
+  j.name AS job_name,
+  r.run_id,
+  r.result_state,
+  r.termination_code,
+  r.period_start_time,
+  r.period_end_time
+FROM system.lakeflow.job_run_timeline r
+JOIN (SELECT * FROM system.lakeflow.jobs QUALIFY ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) = 1) j  -- latest SCD2 row per job
+  ON r.workspace_id = j.workspace_id
+  AND r.job_id = j.job_id
+WHERE r.result_state = 'FAILED'
+  AND r.period_start_time > current_timestamp() - INTERVAL 24 HOURS
+ORDER BY r.period_start_time DESC;
+```
+
+### Calculate Cost Per Job Run
+
+One of the most valuable patterns is joining the run timeline with the billing system table to get actual cost per execution:
+
+``` sql
+SELECT
+  j.name AS job_name,
+  r.run_id,
+  r.result_state,
+  SUM(b.usage_quantity * lp.pricing.default) AS estimated_cost
+FROM (SELECT * FROM system.lakeflow.job_run_timeline WHERE result_state IS NOT NULL) r  -- final slice of each run, so one row per run
+JOIN (SELECT * FROM system.lakeflow.jobs QUALIFY ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) = 1) j  -- latest SCD2 row per job
+  ON r.workspace_id = j.workspace_id
+  AND r.job_id = j.job_id
+JOIN system.billing.usage b
+  ON r.workspace_id = b.workspace_id AND r.job_id = b.usage_metadata.job_id
+  AND r.run_id = b.usage_metadata.job_run_id
+JOIN system.billing.list_prices lp
+  ON b.sku_name = lp.sku_name AND b.cloud = lp.cloud
+  AND b.usage_start_time >= lp.price_start_time AND (lp.price_end_time IS NULL OR b.usage_end_time <= lp.price_end_time)  -- price in effect at usage time
+GROUP BY j.name, r.run_id, r.result_state
+ORDER BY estimated_cost DESC;
+```
+
+::: {.callout-warning title="Important" appearance="simple"}
+Jobs running on all-purpose (interactive) compute share resources with other workloads, so cost attribution will not be precise. For accurate per-job costing, use dedicated job compute or serverless compute.
+:::
+
+### Monitor Job Duration Trends
+
+Track whether your jobs are getting slower over time:
+
+``` sql
+SELECT
+  j.name AS job_name,
+  r.run_date,
+  AVG(r.duration_seconds) AS avg_duration_seconds,
+  COUNT(*) AS run_count
+FROM (SELECT workspace_id, job_id, run_id, DATE(MIN(period_start_time)) AS run_date, MAX(result_state) AS result_state,
+        SUM(unix_timestamp(period_end_time) - unix_timestamp(period_start_time)) AS duration_seconds  -- long runs are split into slices, so sum them
+      FROM system.lakeflow.job_run_timeline GROUP BY workspace_id, job_id, run_id) r  -- one row per run
+JOIN (SELECT * FROM system.lakeflow.jobs QUALIFY ROW_NUMBER() OVER (PARTITION BY workspace_id, job_id ORDER BY change_time DESC) = 1) j  -- latest SCD2 row per job
+  ON r.workspace_id = j.workspace_id AND r.job_id = j.job_id
+WHERE r.run_date > date_sub(current_date(), 30) AND r.result_state = 'SUCCEEDED'
+GROUP BY j.name, r.run_date
+ORDER BY j.name, r.run_date;
+```
+
+## Access Requirements
+
+To query these tables, you need one of:
+
+- **Metastore admin** and **account admin** roles, or
+- Explicit `USE` and `SELECT` grants on the `system.lakeflow` schema
+
+::: {.callout-note title="Regional Scope" appearance="simple"}
+System tables contain records from all workspaces in the same cloud region. To see jobs from another region, query from a workspace deployed in that region.
+:::
+
+## Further Reading
+
+- [Jobs System Table Reference](https://docs.databricks.com/aws/en/admin/system-tables/jobs)
+- [Monitoring and Observability for Lakeflow Jobs](https://docs.databricks.com/aws/en/jobs/monitor)
+- [Lakeflow Jobs Overview](https://docs.databricks.com/aws/en/jobs/)
+- [January 2026 Release Notes](https://docs.databricks.com/aws/en/release-notes/product/2026/january)