📚 MLV‑Lakehouse Demo – End‑to‑End Materialised Lake Views Author: Your Name / Team | Created: 2026‑04‑07
Table of Contents Project Overview What is a Materialised Lake View (MLV)? 2.1 Explanation for a Year 10 student 2.2 Explanation for an IT / Data‑Engineering professional Architecture & Data Flow (block diagram) Prerequisites Setup & Execution Notebook Walk‑through Testing & Validation Security, Governance & Compliance License & Contributing References & Further Reading 1️⃣ Project Overview Layer Object Purpose Bronze bronze.raw_patient (Delta table) Raw ingest of NHS‑style patient records. Change‑Data‑Feed (CDF) enabled so downstream layers receive only the delta. Silver silver.cleaned_patient (Materialised Lake View) Cleans the raw data – hashes the NHS number, capitalises names, normalises gender, and keeps clinical columns. Gold gold.patient_summary (Materialised Lake View) Aggregates the clean data into per‑condition KPIs (patient count, average systolic/diastolic BP, first/last encounter). Automation Fabric automatically tracks Bronze → Silver → Gold lineage and triggers incremental refreshes whenever new rows land in Bronze. Compliance • NHS number is pseudonymised (sha2(...,256)). • No clear‑text identifiers ever appear in a view → satisfies GDPR Art 4(1) and NHS UK data‑minimisation guidance. Data‑Management Design follows DAMA‑DM‑Core (metadata, lineage, quality, governance). 2️⃣ What is a Materialised Lake View (MLV)? 2.1 📚 Explanation for a Year 10 student Think of the raw patient sheet as a messy notebook. Bronze is a copy of that notebook, unchanged. Silver is a tidy version – names are written correctly, the NHS number is turned into a secret code, and genders are spelled out. Gold is a summary page that says “For condition C001 we treated 2 people, average blood pressure was 135/90.”
An MLV is the tidy and summary page stored inside a data lake. It refreshes itself automatically when new raw rows appear, so the summary is always up‑to‑date.
2️⃣ Explanation for an IT / Data‑Engineering professional Feature Description Read‑only, persisted SELECT CREATE MATERIALIZED LAKE VIEW … AS SELECT … materialises the result set as Delta files; DML must happen in the source tables. Automatic lineage & refresh Fabric records the dependency graph (bronze → silver → gold). When a source table with Change‑Data‑Feed receives new commits, an incremental refresh runs automatically. Refresh policies Stored in view metadata (RefreshPolicy = FullRefresh / IncrementalRefresh / Skip). Metadata & governance Exposed via DESCRIBE DETAIL (size, row‑count, properties). Integrated with Fabric Governance (formerly Azure Purview) for DAMA‑DM‑Core compliance. Security model Inherits the lake‑house RBAC and column‑level security. The view contains only a SHA‑256 hash of the NHS number → fulfills GDPR pseudonymisation and NHS UK Data‑Protection Toolkit requirements. Performance Because the view is materialised, downstream BI (Power BI, Tableau, etc.) reads the cached Delta files directly; refreshes happen asynchronously on the Spark pool. 3️⃣ Architecture & Data Flow The diagram below is written in Mermaid – it renders automatically on GitHub, in VS Code Markdown preview, and in most Markdown viewers.
flowchart TD subgraph Raw["Bronze (Delta Table)"] B[bronze.raw_patient] end
subgraph Clean["Silver (MLV)"]
S[silver.cleaned_patient]
end
subgraph Agg["Gold (MLV)"]
G[gold.patient_summary]
end
B -->|Change‑Data‑Feed| S
S -->|Incremental Refresh| G
classDef gov fill:#F9F9F9,stroke:#2C3E50,stroke-width:2px;
class B,S,G gov;
click B "https://learn.microsoft.com/azure/databricks/delta/delta-change-data-feed" "Delta Change‑Data‑Feed"
click S "https://learn.microsoft.com/fabric/data-warehouse/materialized-lake-views" "Materialised Lake View (Silver)"
click G "https://learn.microsoft.com/fabric/data-warehouse/materialized-lake-views" "Materialised Lake View (Gold)"
Arrows represent automatic lineage and incremental refresh driven by Fabric.
4️⃣ Prerequisites Item Minimum requirement Azure subscription with Microsoft Fabric (or Synapse) workspace – Lakehouse (named mlv_lakehouse in the demo) – Delta format enabled – Spark pool (runtime 3.5 or newer) – required for CREATE MATERIALIZED LAKE VIEW – Permissions – Lakehouse Contributor + Governance Reader (or equivalent AD‑based RBAC) – Python 3.9+ (only for local execution) – Compliance – If you replace the synthetic data with real NHS data, run a Data‑Protection Impact Assessment (DPIA) and register the dataset as PHI in Fabric Governance. – Optional – Power BI, Tableau, or any downstream analytics tool – 5️⃣ Setup & Execution Clone the repo
git clone https://github.com/<your‑org>/mlv‑lakehouse‑demo.git cd mlv‑lakehouse‑demo Open the notebook
In the Fabric portal → Lakehouse → Notebooks → MLV Demo.ipynb Or, from Synapse, attach the notebook to the Spark pool trident/default. Run the cells in order
Bronze: create schema, table, enable CDF, insert sample rows. Silver: materialised view that hashes NHS numbers and normalises fields. Gold: aggregation view with KPIs. Insert new patient (simulates streaming) → observe automatic refresh. Run validation tests (assertions, DESCRIBE DETAIL, row‑counts). Optional – Force a full refresh
REFRESH MATERIALIZED LAKE VIEW silver.cleaned_patient; REFRESH MATERIALIZED LAKE VIEW gold.patient_summary; Consume the Gold view
Connect Power BI (or any SQL client) to the lakehouse and query:
SELECT * FROM gold.patient_summary; 6️⃣ Notebook Walk‑through Section Key code (abridged) What it does Bronze schema & table CREATE SCHEMA IF NOT EXISTS bronze; CREATE TABLE bronze.raw_patient ( … ) Defines raw landing zone. Enable Change‑Data‑Feed ALTER TABLE bronze.raw_patient SET TBLPROPERTIES (delta.enableChangeDataFeed = true); Makes downstream incremental refresh possible. Insert sample data INSERT INTO bronze.raw_patient VALUES (1, '9990001111', …); Populates a deterministic test set. Silver MLV CREATE MATERIALIZED LAKE VIEW silver.cleaned_patient AS SELECT patient_id, sha2(nhs_number,256) AS patient_hash, initcap(first_name) … FROM bronze.raw_patient; Cleanses & pseudonymises data. Gold MLV CREATE MATERIALIZED LAKE VIEW gold.patient_summary AS SELECT condition_code, COUNT() AS total_patients, AVG(CAST(split(blood_pressure,'/')[0] AS INT)) AS avg_systolic, … FROM silver.cleaned_patient GROUP BY condition_code; Produces KPI summary. Automatic lineage No explicit code – Fabric records it when the MLVs are created. Tests Python asserts (df.count() > 0, "patient_hash" in df.columns), Spark SQL SHOW TABLES, DESCRIBE DETAIL. Guarantees view existence, correct columns, CDF enabled, etc. Refresh (optional) REFRESH MATERIALIZED LAKE VIEW … Force a full recompute for debugging. 7️⃣ Testing & Validation Test Description Pass criteria Bronze row‑count SELECT COUNT() FROM bronze.raw_patient ≥ 1 Silver column integrity Verify patient_hash exists; gender never equals Unknown True Gold KPI correctness Manual calculation of averages matches view output (tolerance 0.01) True CDF propagation Insert a new Bronze row → Silver & Gold row‑counts increase without manual refresh Incremental refresh fires GDPR check No column named nhs_number appears in any Materialised view True Lineage metadata DESCRIBE DETAIL silver.cleaned_patient contains fabric.source.entities with upstream table reference Non‑empty array All tests are implemented as assertions in the notebook; they will raise an exception if any condition fails.
8️⃣ Security, Governance & Compliance Requirement How the demo satisfies it NHS Data Security & Protection Toolkit • Privacy by design: NHS numbers are hashed (SHA‑256). • Access control: Fabric RBAC limits view access to read‑only roles. GDPR (Art 4, 5, 25, 32) • Pseudonymisation – original identifiers never stored in a view. • Data minimisation: Only needed columns are kept. • Data‑subject rights: Because identifiers are hashed, erasure requests are simplified. DAMA‑DM‑Core (Metadata, Lineage, Quality) • Metadata – DESCRIBE DETAIL returns schema, size, refresh timestamps. • Lineage – Fabric automatically records upstream‑downstream relationships (visible in Governance hub). • Quality – Notebook asserts gender values, row‑counts, CDF enabled. Auditing All Spark‑SQL statements are logged to the workspace Activity Log (exportable to Azure Monitor). Encryption Underlying OneLake storage is encrypted at rest with Microsoft‑managed keys; TLS is enforced in‑flight. Data‑Protection Impact Assessment (DPIA) Required only when loading real patient data; the demo uses synthetic data, but the architecture is DPIA‑ready. Tip: When you replace the synthetic data with a live NHS feed, register the dataset in Fabric Governance with the classification PHI and enable Data‑Retention policies that match the NHS UK data‑retention schedule.
9️⃣ License & Contributing License – MIT (see LICENSE file). Contributing – Fork → create a feature branch → PR. Code of Conduct – Contributor Covenant v2.0. Open an Issue for bugs, feature requests, or security concerns.
🔟 References & Further Reading Topic Link Materialised Lake Views (MLV) – official docs https://learn.microsoft.com/fabric/data-warehouse/materialized-lake-views Delta Change‑Data‑Feed https://learn.microsoft.com/azure/databricks/delta/delta-change-data-feed GDPR – Pseudonymisation (Art 4(1)) https://gdpr-info.eu/art-4-gdpr/ NHS Digital – Data Security and Protection Toolkit https://digital.nhs.uk/data-and-information/looking-after-information/data-security-and-protection-toolkit DAMA‑DM‑Core – Data Management Body of Knowledge https://www.dama.org/content/body-knowledge Microsoft Fabric Governance (Purview) https://learn.microsoft.com/fabric/governance/ Best practice for hashing NHS numbers https://digital.nhs.uk/services/nhs-number (see Security section) Power BI + Lakehouse (DirectQuery) https://learn.microsoft.com/power-bi/connect-data/azure-synapse-analytics