
Data Warehouse Optimization and FinOps

A comprehensive guide to Snowflake data warehouse optimization and financial operations (FinOps) practices. This repository demonstrates practical techniques for optimizing query performance, managing costs, and monitoring resource consumption in Snowflake.

🎯 Overview

This project provides hands-on examples of:

  • Query Performance Optimization - Techniques to improve query execution time
  • Partition Pruning - Reducing data scanned by leveraging filtering
  • Materialized Views & Summary Tables - Pre-aggregating data for faster queries
  • FinOps Monitoring - Tracking credit consumption and budget management
  • Cost Attribution - Identifying expensive queries and resource usage patterns

Platform: Snowflake Data Warehouse


📦 Prerequisites

Before running these scripts, ensure you have:

  • Active Snowflake account with appropriate privileges
  • Access to SNOWFLAKE.ACCOUNT_USAGE schema for monitoring queries
  • Permissions to create warehouses, databases, schemas, and tables
  • Basic understanding of SQL and data warehousing concepts

📁 Project Structure

├── create_data_set.sql              # Initial setup and sample data generation
├── query_performance.sql            # Query performance monitoring
├── query_pruning.sql                # Partition pruning optimization
├── materialized_optimization.sql    # Aggregation optimization techniques
├── finops_and_monitoring.sql        # Cost tracking and budget monitoring
└── images/                          # FinOps dashboard screenshots
    ├── budget.png
    ├── consumption.png
    └── cost_management_overview.png

🚀 Implementation Guide

1. Setup & Data Generation (create_data_set.sql)

Purpose: Initialize the Snowflake environment and create a sample sales dataset for testing optimization techniques.

What it does:

  • Creates a dedicated warehouse (demo_wh) with XSMALL size
  • Configures auto-suspend (60 seconds) and auto-resume for cost efficiency
  • Creates database (demo_db) and schema (sales_schema)
  • Generates 1 million random sales records with realistic data distribution

Key Concepts:

-- Warehouse configuration for cost optimization
CREATE WAREHOUSE IF NOT EXISTS demo_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60           -- Suspends after 60 seconds of inactivity
    AUTO_RESUME = TRUE          -- Automatically resumes when queries arrive
    INITIALLY_SUSPENDED = TRUE; -- Starts in suspended state (no charges)

Sample Data Structure:

| Field      | Type   | Description                                          |
|------------|--------|------------------------------------------------------|
| order_id   | INT    | Unique identifier for each order                     |
| order_date | DATE   | Random date within a ~2-year window from 2023-01-01  |
| region     | STRING | One of: North, South, East, West, Central            |
| product    | STRING | One of: Laptop, Phone, Tablet, Monitor               |
| amount     | FLOAT  | Random sales amount between 100 and 5000             |
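A dataset matching this schema can be produced with Snowflake's GENERATOR table function; a minimal sketch (the repository's actual script may generate the data differently):

```sql
-- Sketch: generate 1M random sales rows matching the schema above
CREATE OR REPLACE TABLE sales AS
SELECT
    SEQ4() AS order_id,
    DATEADD('day', UNIFORM(0, 730, RANDOM()), '2023-01-01'::DATE) AS order_date,
    ARRAY_CONSTRUCT('North','South','East','West','Central')[UNIFORM(0, 4, RANDOM())]::STRING AS region,
    ARRAY_CONSTRUCT('Laptop','Phone','Tablet','Monitor')[UNIFORM(0, 3, RANDOM())]::STRING AS product,
    UNIFORM(100, 5000, RANDOM())::FLOAT AS amount
FROM TABLE(GENERATOR(ROWCOUNT => 1000000));
```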

Run Order: Execute this file first to set up your environment.


2. Query Performance Monitoring (query_performance.sql)

Purpose: Analyze query execution metrics to identify performance bottlenecks and optimization opportunities.

Key Metrics Tracked:

  • Execution Time - Total time to complete query (converted to seconds)
  • Bytes Scanned - Amount of data read from storage
  • Rows Produced - Number of rows returned
  • Warehouse Size - Compute resources used

Queries Included:

A. Basic Aggregation Performance

-- Aggregates sales by region
SELECT region, SUM(amount), COUNT(*), AVG(amount)
FROM sales
GROUP BY region;

B. Query History Analysis

Uses INFORMATION_SCHEMA.QUERY_HISTORY() to inspect execution metrics:

  • Filter queries containing SUM(amount)
  • Exclude monitoring queries themselves
  • Display bytes scanned in MB for easier interpretation
  • Sort by most recent executions
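Such an inspection query might look like the following sketch (column names are from Snowflake's QUERY_HISTORY table function; the repository script may differ):

```sql
-- Sketch: recent executions of the aggregation, with metrics in friendlier units
SELECT
    query_text,
    total_elapsed_time / 1000     AS execution_seconds,  -- reported in milliseconds
    bytes_scanned / (1024 * 1024) AS mb_scanned,
    rows_produced,
    warehouse_size
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_text ILIKE '%SUM(amount)%'
  AND query_text NOT ILIKE '%QUERY_HISTORY%'  -- exclude the monitoring query itself
ORDER BY start_time DESC;
```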

When to use this: After running any query to measure its performance impact and identify optimization opportunities.


3. Partition Pruning Optimization (query_pruning.sql)

Purpose: Demonstrate how filtering reduces data scanning and improves performance.

Core Concept: Snowflake uses micro-partitions and metadata to skip reading irrelevant data when filters are applied.

Comparison Examples:

❌ Bad Practice - Full Table Scan

-- Scans all 1 million rows
SELECT COUNT(*), SUM(amount) FROM sales;

✅ Good Practice - Filtered Query

-- Only scans relevant partitions (6 months of data)
SELECT COUNT(*), SUM(amount)
FROM sales
WHERE order_date >= '2024-01-01'
    AND order_date < '2024-07-01';

Performance Impact Measurement: The script includes a query that compares mb_scanned and execution_time between filtered and unfiltered queries.

Expected Results:

  • Filtered queries typically scan 30-70% less data
  • Execution time reduces proportionally
  • Lower data scanning = lower compute costs

Best Practice: Always use WHERE clauses with date or categorical columns when possible.
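Pruning effectiveness can also be checked without executing the query at all: Snowflake's EXPLAIN reports how many micro-partitions a query would touch.

```sql
-- Sketch: inspect pruning without running the query
EXPLAIN
SELECT COUNT(*), SUM(amount)
FROM sales
WHERE order_date >= '2024-01-01'
  AND order_date < '2024-07-01';
-- Compare partitionsAssigned vs. partitionsTotal in the plan output
```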


4. Materialized Optimization (materialized_optimization.sql)

Purpose: Pre-aggregate commonly queried data to avoid repeated expensive calculations.

Strategy: Create summary tables for frequently accessed aggregations.

Implementation Approach:

Option A - Materialized View (requires Snowflake Enterprise Edition or higher):

CREATE MATERIALIZED VIEW sales_by_region AS
SELECT region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count,
       AVG(amount) AS avg_amount
FROM sales
GROUP BY region;

Option B - Physical Summary Table (Fallback):

CREATE OR REPLACE TABLE sales_by_region_summary AS
SELECT region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count,
       AVG(amount) AS avg_amount
FROM sales
GROUP BY region;

Performance Comparison:

The script includes a query that compares:

  • Original aggregation - Scans 1M rows every time
  • Summary table - Scans only 5 pre-aggregated rows

Use Cases:

  • Dashboards with repeated aggregations
  • Reports that don't need real-time data (can be refreshed periodically)
  • Executive summaries with high-level metrics

Trade-offs:

  • Pros: 50-90% faster query performance, reduced compute costs
  • ⚠️ Cons: Data may be stale, requires refresh strategy, additional storage cost

Refresh Strategy:

-- Re-run the CTAS periodically to refresh the summary
CREATE OR REPLACE TABLE sales_by_region_summary AS
SELECT region,
       SUM(amount) AS total_amount,
       COUNT(*)    AS order_count,
       AVG(amount) AS avg_amount
FROM sales
GROUP BY region;
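The refresh can be automated with a Snowflake task; a sketch (the task name and CRON schedule are assumptions, not part of the repository scripts):

```sql
CREATE OR REPLACE TASK refresh_sales_summary
    WAREHOUSE = demo_wh
    SCHEDULE  = 'USING CRON 0 2 * * * UTC'  -- nightly at 02:00 UTC
AS
    CREATE OR REPLACE TABLE sales_by_region_summary AS
    SELECT region,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count,
           AVG(amount) AS avg_amount
    FROM sales
    GROUP BY region;

ALTER TASK refresh_sales_summary RESUME;  -- newly created tasks start suspended
```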

5. FinOps & Monitoring (finops_and_monitoring.sql)

Purpose: Track credit consumption, identify expensive queries, and monitor budget compliance.

A. Warehouse Credit Consumption

Monitors daily credit usage across all warehouses:

-- 7-day credit consumption trend (assumes $2/credit; adjust to your rate)
SELECT
    warehouse_name,
    DATE(start_time) AS usage_date,
    SUM(credits_used) AS total_credits,
    ROUND(SUM(credits_used) * 2, 2) AS estimated_cost_usd
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name, usage_date
ORDER BY usage_date DESC, total_credits DESC;

Key Insights:

  • Identify which warehouses consume the most credits
  • Track daily spending trends
  • Estimate costs (assuming $2 per credit - adjust to your rate)

FinOps Actions:

  • Identify idle warehouses to suspend
  • Right-size over-provisioned warehouses
  • Consolidate underutilized warehouses

B. Top 10 Most Expensive Queries

Identifies queries that consume the most credits:

Credit Calculation Formula:

Credits = (execution_seconds / 3600) × warehouse_credit_rate

Warehouse Credit Rates:
- XSMALL: 1 credit/hour
- SMALL:  2 credits/hour
- MEDIUM: 4 credits/hour
- LARGE:  8 credits/hour
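Applying this formula to SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY gives a ranking query; a sketch (the size-to-credit mapping is hard-coded from the table above, and the 'X-Small'/'Small' labels follow Snowflake's reported warehouse_size values):

```sql
-- Sketch: top 10 queries of the past week by estimated credit cost
SELECT
    query_id,
    user_name,
    warehouse_name,
    total_elapsed_time / 1000 AS execution_seconds,
    (total_elapsed_time / 1000 / 3600)
        * DECODE(warehouse_size, 'X-Small', 1, 'Small', 2, 'Medium', 4, 'Large', 8, 1)
        AS estimated_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY estimated_credits DESC
LIMIT 10;
```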

Use This To:

  • Find queries that should be optimized first (highest ROI)
  • Identify inefficient user queries
  • Detect long-running ETL jobs

Optimization Priority:

  1. Queries with high execution time + large warehouse size
  2. Frequently executed expensive queries
  3. Queries scanning excessive data

C. Monthly Budget Tracking Dashboard

Provides real-time budget compliance monitoring:

Status Categories:

| Credits Used | Percentage | Status          | Action Required                    |
|--------------|------------|-----------------|------------------------------------|
| < 25         | < 50%      | On Track        | Continue monitoring                |
| 25-40        | 50-80%     | Monitor Closely | Review optimization opportunities  |
| 40-50        | 80-100%    | Near Limit      | Immediate cost reduction needed    |
| > 50         | > 100%     | Over Budget     | Emergency cost controls            |

Budget Dashboard Features:

  • Real-time credit consumption by warehouse
  • Percentage of monthly budget consumed
  • Credits remaining until budget limit
  • Alert status based on thresholds

Example Output:

warehouse_name  | credits_used | monthly_budget | pct_consumed | status          | credits_remaining
----------------|--------------|----------------|--------------|-----------------|------------------
prod_etl_wh     | 42.5         | 50             | 85.0%        | Near Limit      | 7.5
analytics_wh    | 18.2         | 50             | 36.4%        | Monitor Closely | 31.8
dev_wh          | 8.5          | 50             | 17.0%        | On Track        | 41.5
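Output like this can come from a single query over ACCOUNT_USAGE; a sketch using the thresholds from the status table (the flat 50-credit monthly budget per warehouse is an assumption):

```sql
-- Sketch: month-to-date budget tracking per warehouse
WITH usage AS (
    SELECT warehouse_name, SUM(credits_used) AS credits_used
    FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
    WHERE start_time >= DATE_TRUNC('month', CURRENT_DATE())
    GROUP BY warehouse_name
)
SELECT
    warehouse_name,
    credits_used,
    50 AS monthly_budget,
    ROUND(credits_used / 50 * 100, 1) AS pct_consumed,
    CASE
        WHEN credits_used > 50  THEN 'Over Budget'
        WHEN credits_used >= 40 THEN 'Near Limit'
        WHEN credits_used >= 25 THEN 'Monitor Closely'
        ELSE 'On Track'
    END AS status,
    50 - credits_used AS credits_remaining
FROM usage
ORDER BY pct_consumed DESC;
```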

📊 FinOps Dashboards

Visual monitoring of Snowflake costs and consumption patterns:

1. Budget Overview

![Budget Dashboard](images/budget.png)

Monthly budget tracking showing credit consumption by warehouse and alert status.

2. Consumption Analysis

![Consumption Dashboard](images/consumption.png)

Detailed view of credit consumption trends, peak usage periods, and cost drivers.

3. Cost Management Overview

![Cost Management Overview](images/cost_management_overview.png)

Comprehensive cost management dashboard with budget alerts and optimization recommendations.


🎯 Best Practices

Query Optimization

  1. Use WHERE Clauses - Always filter on date or clustering-key columns (Snowflake has no traditional indexes)
  2. Limit Result Sets - Use LIMIT when testing queries
  3. Avoid SELECT * - Only select columns you need
  4. Use Clustering Keys - For large tables with common filter columns
  5. Monitor Query Profiles - Review execution plans regularly

Warehouse Management

  1. Right-Size Warehouses - Start small, scale up if needed
  2. Enable Auto-Suspend - Set to 60-300 seconds based on usage patterns
  3. Use Multi-Cluster - Only for concurrent workload spikes
  4. Separate Workloads - Different warehouses for ETL vs analytics
  5. Query Acceleration - Enable for unpredictable workloads

Cost Optimization

  1. Set Budget Alerts - Configure email notifications at 75%, 90%, 100%
  2. Review Query History - Weekly analysis of expensive queries
  3. Use Result Caching - Identical queries return cached results (free)
  4. Schedule ETL Wisely - Avoid overlapping heavy workloads
  5. Archive Old Data - Move to cheaper storage tiers

Data Management

  1. Prune Stale Data - Snowflake has no manual VACUUM; drop or archive unused tables and rows to cut storage and scan costs
  2. Cluster on Time - Micro-partitioning is automatic; add clustering keys on date columns for very large tables
  3. Compression - Snowflake auto-compresses but consider data types
  4. Time Travel Settings - Reduce retention period for non-critical tables
  5. Materialized Views - For expensive aggregations used in dashboards

💰 Cost Optimization Strategies

Immediate Actions (Quick Wins)

  • ✅ Enable auto-suspend on all warehouses
  • ✅ Reduce time travel retention from 90 to 7 days for dev tables
  • ✅ Suspend unused warehouses
  • ✅ Scale down over-provisioned warehouses
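Most of these quick wins are single ALTER statements; hedged examples (object names are placeholders):

```sql
ALTER WAREHOUSE demo_wh SET AUTO_SUSPEND = 60;          -- enable auto-suspend
ALTER TABLE sales SET DATA_RETENTION_TIME_IN_DAYS = 7;  -- shorten Time Travel retention
ALTER WAREHOUSE demo_wh SUSPEND;                        -- suspend an unused warehouse
ALTER WAREHOUSE demo_wh SET WAREHOUSE_SIZE = 'XSMALL';  -- scale down
```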

Short-Term (1-4 Weeks)

  • 📊 Analyze and optimize top 10 expensive queries
  • 📊 Create summary tables for dashboard aggregations
  • 📊 Implement budget alerts for all warehouses
  • 📊 Review and consolidate similar warehouses

Long-Term (1-3 Months)

  • 🎯 Implement clustering keys on large tables
  • 🎯 Migrate infrequently accessed data to cheaper storage
  • 🎯 Establish query governance policies
  • 🎯 Automate cost reporting and anomaly detection
  • 🎯 Train team on optimization best practices

ROI Calculation

Monthly Savings = (Before Credits - After Credits) × Credit Rate
ROI % = (Monthly Savings / Implementation Cost) × 100

Example:

  • Before: 500 credits/month at $2/credit = $1,000
  • After optimization: 350 credits/month = $700
  • Monthly savings: $300
  • Annual savings: $3,600

🔍 Monitoring & Alerts

Set Up Budget Alerts

-- Create resource monitor
CREATE RESOURCE MONITOR monthly_limit WITH
    CREDIT_QUOTA = 500
    FREQUENCY = MONTHLY
    START_TIMESTAMP = IMMEDIATELY
    TRIGGERS
        ON 75 PERCENT DO NOTIFY
        ON 90 PERCENT DO NOTIFY
        ON 100 PERCENT DO SUSPEND
        ON 110 PERCENT DO SUSPEND_IMMEDIATE;

-- A monitor has no effect until it is assigned to a warehouse (or the account)
ALTER WAREHOUSE demo_wh SET RESOURCE_MONITOR = monthly_limit;

Weekly Review Checklist

  • Review top 10 expensive queries
  • Check warehouse credit consumption trends
  • Identify queries with declining performance
  • Verify auto-suspend settings are working
  • Review failed query patterns
  • Update summary tables/materialized views


🎓 Learning Path

Beginner:

  1. Start with create_data_set.sql to set up your environment
  2. Run queries and observe execution times
  3. Review query_performance.sql to understand metrics

Intermediate:

  4. Experiment with query_pruning.sql to see filtering impact
  5. Implement summary tables using materialized_optimization.sql
  6. Set up basic monitoring with finops_and_monitoring.sql

Advanced:

  7. Create custom FinOps dashboards
  8. Implement automated optimization workflows
  9. Establish organization-wide cost governance policies

