diff --git a/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/README.md b/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/README.md
new file mode 100644
index 00000000..966de34d
--- /dev/null
+++ b/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/README.md
@@ -0,0 +1,105 @@
# πŸ’° Cost/Performance Benchmark

## 🌟 Overview

This template provides a practical framework for **FinOps (Financial Operations)** by running a **Cost/Performance Benchmark** on deep learning tasks. It measures the trade-off between speed and cost, providing data to answer the core question: *Which hardware configuration delivers the best performance per dollar?*

It uses a **custom Python logger** to record key metrics, generating a structured report that can be used to compare different machine types (e.g., A100 vs. V100, or CPU vs. GPU).

### Key Metrics Tracked

  * **Cost/Epoch:** Estimated cost per epoch, based on the configured hourly rate.
  * **Tokens/sec:** Measures the raw speed/throughput of the hardware.
  * **Job Summary:** Provides total estimated cost and total execution time.
  * **Hardware:** Tracks the CPU vs. GPU execution path.

-----

## πŸ› οΈ Implementation Details

### 1\. Project Setup (Bash Script)

Save the following as `setup_benchmark_env.sh`. This script creates a virtual environment and installs PyTorch plus the supporting packages.

```bash
#!/bin/bash

ENV_NAME="cost_benchmark_env"
PYTHON_VERSION="3.12"

echo "================================================="
echo "πŸš€ Setting up Cost/Performance Benchmark Environment"
echo "================================================="

# 1. Create and Activate Stable VENV
rm -rf $ENV_NAME
python$PYTHON_VERSION -m venv $ENV_NAME
source $ENV_NAME/bin/activate

# 2. Install PyTorch (Required for accurate CUDA event timing)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 3. Install Helpers
pip install numpy pandas psutil

echo "--- Installation Complete ---"
```

#### Execution

1. **Grant Permission:** `chmod +x setup_benchmark_env.sh`
2. **Run Setup:** `./setup_benchmark_env.sh`

-----

### 2\. Procedures (Job Execution)

#### Step A: Activate the Environment

```bash
source cost_benchmark_env/bin/activate
```

#### Step B: Configure Pricing (CRITICAL)

Before running the script, you **must** update the `GPU_HOURLY_RATE` constant in `cost_benchmark.py` to reflect the actual hourly cost of the machine you are testing on Saturn Cloud.

```python
# --- Configuration & Constants in cost_benchmark.py ---
# UPDATE THIS VALUE MANUALLY based on your Saturn Cloud instance type
GPU_HOURLY_RATE = 3.20  # Example $/hour for a high-end GPU (must be updated manually)
```

#### Step C: Run the Benchmark

Execute the Python script (`cost_benchmark.py`).

```bash
python cost_benchmark.py
```

### Verification and Reporting

The script writes structured output to the console and to a persistent file named **`benchmark_results.log`**.

| Log Entry Example | Metric Significance |
| :--- | :--- |
| `Time: 0.0500s` | Raw speed (lower is better). |
| `Cost: $0.00004` | **Cost/Epoch** (lower is better for efficiency). |
| `Tokens/s: 6400` | **Throughput/Speed** (higher is better for performance). |

This log file serves as the definitive source for generating a comparative chart (Cost/Epoch vs. Tokens/sec) for optimal rightsizing.

-----

## πŸ”— Conclusion and Scaling on Saturn Cloud

The **Cost/Performance Benchmark** template is fundamental to the **Optimize** phase of the FinOps lifecycle. By quantifying the true cost of your training speed, you can make data-driven decisions to reduce cloud waste.
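As a companion to the table above, the structured log can be post-processed into a single perf-per-dollar number. The sketch below assumes the `EPOCH: ... | Time: ...s | Cost: $... | Tokens/s: ...` line format produced by `cost_benchmark.py`; the helper names (`parse_benchmark_log`, `cost_per_million_tokens`) are illustrative, not part of the template.

```python
import re

# Matches the per-epoch lines emitted by cost_benchmark.py, e.g.:
# 2024-01-01 12:00:00,000 | INFO | EPOCH: 1/5 | Time: 0.0500s | Cost: $0.00004 | Tokens/s: 6400
EPOCH_RE = re.compile(
    r"EPOCH: (?P<epoch>\d+)/\d+ \| "
    r"Time: (?P<time_s>[\d.]+)s \| "
    r"Cost: \$(?P<cost>[\d.]+) \| "
    r"Tokens/s: (?P<tokens_s>\d+)"
)

def parse_benchmark_log(lines):
    """Extract per-epoch metrics from benchmark_results.log lines."""
    rows = []
    for line in lines:
        match = EPOCH_RE.search(line)
        if match:
            rows.append({
                "epoch": int(match.group("epoch")),
                "time_s": float(match.group("time_s")),
                "cost": float(match.group("cost")),
                "tokens_per_s": int(match.group("tokens_s")),
            })
    return rows

def cost_per_million_tokens(rows):
    """Dollars spent per one million tokens processed (lower is better)."""
    total_cost = sum(r["cost"] for r in rows)
    total_tokens = sum(r["tokens_per_s"] * r["time_s"] for r in rows)
    return 1e6 * total_cost / total_tokens
```

Running this over the logs from two machine types yields one comparable number per configuration, which is exactly what a rightsizing chart needs.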
To operationalize this benchmarking practice, **Saturn Cloud** offers the ideal platform:

  * **FinOps Integration:** Saturn Cloud is an all-in-one solution for data science and MLOps, essential for implementing robust FinOps practices.
  * **Rightsizing and Optimization:** Easily run this job on different GPU types within Saturn Cloud to determine the most cost-effective solution before deploying models to production. [Saturn Cloud MLOps Documentation](https://www.saturncloud.io/docs/design-principles/concepts/mlops/)
  * **Building a Cost-Conscious Culture:** Integrate cost awareness directly into your MLOps pipeline, aligning technical performance with financial goals. [Saturn Cloud Homepage](https://saturncloud.io/)

**Optimize your cloud spend by deploying this template on Saturn Cloud\!**
\ No newline at end of file
diff --git a/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/cost_benchmark.py b/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/cost_benchmark.py
new file mode 100644
index 00000000..3e7a3a86
--- /dev/null
+++ b/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/cost_benchmark.py
@@ -0,0 +1,155 @@
import os
import time
import torch
import torch.nn as nn
import torch.optim as optim
import logging
import sys
import numpy as np

# --- Configuration & Constants ---
# Use the correct GPU pricing for your cloud provider (e.g., Saturn Cloud, AWS, GCP)
# Example: NVIDIA A100 pricing (approximate, for demonstration)
GPU_HOURLY_RATE = 3.20  # $/hour for a high-end GPU (Must be updated manually)
LOG_FILE = "benchmark_results.log"

# Hyperparameters for the simulated job
EPOCHS = 5
BATCH_SIZE = 32
TOTAL_SAMPLES = 50000
TOTAL_TOKENS_PER_SAMPLE = 100  # Represents tokens in an NLP task or features in an image
TOTAL_TOKENS = TOTAL_SAMPLES * TOTAL_TOKENS_PER_SAMPLE

# --- Custom Logger Setup ---

def setup_logger():
    """Configures the logger to write structured output to a file."""
    # Create the logger object
    logger = logging.getLogger('BenchmarkLogger')
    logger.setLevel(logging.INFO)

    # Define a custom format that includes time and specific placeholders
    # We use a custom format to easily parse the final report later
    formatter = logging.Formatter(
        '%(asctime)s | %(levelname)s | %(message)s'
    )

    # File Handler
    file_handler = logging.FileHandler(LOG_FILE, mode='w')
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    # Console Handler (for real-time feedback)
    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

    return logger

# --- Model & Timing Functions ---

class SimpleModel(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, x):
        return self.linear(x)

def run_training_benchmark(logger, device):

    logger.info(f"--- STARTING BENCHMARK ON {device.type.upper()} ---")

    # Configuration based on device
    INPUT_SIZE = 512
    OUTPUT_SIZE = 1

    # Model and Data Setup (on the target device)
    model = SimpleModel(INPUT_SIZE, OUTPUT_SIZE).to(device)
    dummy_input = torch.randn(BATCH_SIZE, INPUT_SIZE, device=device)
    dummy_target = torch.randn(BATCH_SIZE, OUTPUT_SIZE, device=device)
    optimizer = optim.Adam(model.parameters())
    criterion = nn.MSELoss()

    # Total estimated cost
    total_estimated_cost = 0.0

    # Synchronization is crucial for accurate GPU timing
    if device.type == 'cuda':
        # Warm-up run is necessary to avoid compilation time bias
        logger.info("Performing CUDA warm-up run...")
        _ = model(dummy_input)
        torch.cuda.synchronize()

    # Start timing the entire job
    job_start_time = time.time()

    for epoch in range(1, EPOCHS + 1):

        if device.type == 'cuda':
            # Use synchronized CUDA events for precise timing
            start_event = torch.cuda.Event(enable_timing=True)
            end_event = torch.cuda.Event(enable_timing=True)
start_event.record() + else: + start_event = time.time() + + # --- Simulated Training Step --- + optimizer.zero_grad() + output = model(dummy_input) + loss = criterion(output, dummy_target) + loss.backward() + optimizer.step() + # --- End Simulated Training Step --- + + if device.type == 'cuda': + end_event.record() + torch.cuda.synchronize() # Wait for GPU to finish + # elapsed_time returns milliseconds, convert to seconds + epoch_time_s = start_event.elapsed_time(end_event) / 1000.0 + else: + epoch_time_s = time.time() - start_event + + # --- COST AND PERFORMANCE CALCULATION --- + + # 1. Cost Calculation + cost_per_epoch = (epoch_time_s / 3600.0) * GPU_HOURLY_RATE + total_estimated_cost += cost_per_epoch + + # 2. Performance Calculation (Throughput) + throughput_samples_sec = BATCH_SIZE / epoch_time_s + throughput_tokens_sec = (BATCH_SIZE * TOTAL_TOKENS_PER_SAMPLE) / epoch_time_s + + # --- LOGGING THE RESULTS --- + logger.info( + f"EPOCH: {epoch}/{EPOCHS} | " + f"Time: {epoch_time_s:.4f}s | " + f"Cost: ${cost_per_epoch:.5f} | " + f"Tokens/s: {throughput_tokens_sec:.0f}" + ) + + job_total_time = time.time() - job_start_time + + # --- FINAL REPORT --- + logger.info("--- JOB SUMMARY ---") + logger.info(f"FINAL_COST: ${total_estimated_cost:.4f}") + logger.info(f"TOTAL_TIME: {job_total_time:.2f}s") + logger.info(f"TOTAL_TOKENS_PROCESSED: {TOTAL_TOKENS * EPOCHS}") + logger.info(f"-------------------") + + +def main(): + logger = setup_logger() + logger.info(f"Configuration: GPU Hourly Rate = ${GPU_HOURLY_RATE}/hr") + + # 1. Check for GPU availability + if torch.cuda.is_available(): + device = torch.device("cuda") + logger.info("GPU detected. Running GPU Benchmark.") + else: + device = torch.device("cpu") + logger.warning("GPU not detected. 
Running CPU Benchmark.") + + run_training_benchmark(logger, device) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/setup_benchmark_env.sh b/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/setup_benchmark_env.sh new file mode 100644 index 00000000..c2bedaef --- /dev/null +++ b/examples/MLOps_Ops_and_Enterprise_Features/cost_benchmark/setup_benchmark_env.sh @@ -0,0 +1,20 @@ +#!/bin/bash + +ENV_NAME="cost_benchmark_env" +PYTHON_VERSION="3.12" + +echo "--- Setting up Cost/Performance Benchmark Environment ---" + +# 1. Create and Activate Stable VENV +rm -rf $ENV_NAME +python$PYTHON_VERSION -m venv $ENV_NAME +source $ENV_NAME/bin/activate + +# 2. Install PyTorch (GPU version for CUDA 12) +# We need PyTorch for accurate CUDA timing events. +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 + +# 3. Install Helpers +pip install numpy pandas psutil + +echo "βœ… Environment setup complete." \ No newline at end of file diff --git a/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/README.md b/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/README.md new file mode 100644 index 00000000..4cef68bf --- /dev/null +++ b/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/README.md @@ -0,0 +1,92 @@ +# πŸ“ˆ MLflow Experiment Tracking Template (GPU Ready) + +## 🌟 Overview + +This template provides a robust, reproducible framework for **tracking Deep Learning experiments** on GPU-accelerated hardware. It leverages **MLflow Tracking** to automatically log hyperparameters, model artifacts, and vital **GPU system utilization metrics** (memory, temperature, and usage) during the training process. + +This system is essential for comparing model performance and hardware efficiency across different runsβ€”a key capability for MLOps on platforms like **Saturn Cloud**. 
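A detail worth surfacing up front: the script resolves its tracking destination from the `MLFLOW_TRACKING_URI` environment variable, falling back to a local `./mlruns` file store, exactly as `train_and_track.py` configures it. A minimal sketch of that lookup (the function name and server hostname are illustrative, not part of the template):

```python
import os

def resolve_tracking_uri(env=None):
    """Mirror of the template's configuration line:
    MLFLOW_TRACKING_URI = os.environ.get("MLFLOW_TRACKING_URI", "file:./mlruns")
    Returns a remote server URI when the variable is set, else a local file store."""
    env = os.environ if env is None else env
    return env.get("MLFLOW_TRACKING_URI", "file:./mlruns")

# Local development: nothing exported, so runs land in ./mlruns
local_uri = resolve_tracking_uri({})

# Production: export MLFLOW_TRACKING_URI so every job logs to a shared server
remote_uri = resolve_tracking_uri({"MLFLOW_TRACKING_URI": "http://mlflow.example.internal:5000"})
```

Because the fallback is a file store, the same script runs unchanged in a notebook or in a centrally tracked production job.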
### Key Features

  * **GPU Readiness:** Dynamically detects and utilizes available CUDA devices.
  * **Automatic Tracking:** Uses `mlflow.pytorch.autolog()` to capture hyperparameters and model architecture.
  * **System Metrics:** Logs GPU/CPU usage and memory over time using `log_system_metrics=True`.
  * **Centralized UI:** Easy verification and comparison of runs via the **MLflow UI table**.

-----

## πŸ› οΈ How to Run the Template

### 1\. Project Setup (Bash Script)

The setup script, `setup_mlflow_env.sh`, creates a stable Python environment and installs PyTorch, MLflow, and the necessary GPU monitoring package (`nvidia-ml-py`).

#### Step A: Grant Execution Permission

In your terminal, grant executable permission to the setup script.

```bash
chmod +x setup_mlflow_env.sh
```

#### Step B: Execute the Setup

Run the script to install all dependencies.

```bash
./setup_mlflow_env.sh
```

-----

### 2\. Procedures (Execution & Monitoring)

#### Step C: Activate the Environment

You must do this every time you open a new terminal session.

```bash
source mlflow_gpu_env_stable/bin/activate
```

#### Step D: Configure Tracking Location

The template uses the environment variable `MLFLOW_TRACKING_URI` to determine where to log data.

| Mode | Configuration (Terminal Command) | Use Case |
| :--- | :--- | :--- |
| **Local (Default)** | (No command needed) | Development and testing where logs are written to the local `mlruns/` folder. |
| **Remote (Server)** | `export MLFLOW_TRACKING_URI="http://<SERVER-IP>:5000"` | Production jobs requiring centralized, shared tracking (e.g., **Saturn Cloud Managed MLflow**). |

#### Step E: Run the Tracking Sample

Execute the main pipeline script (`train_and_track.py`).
```bash
python train_and_track.py
```

#### Step F: Verification (Checking Tracked Data)

  * **Local UI Access:** If running locally, start the UI server:
    ```bash
    mlflow ui --host 0.0.0.0 --port 5000
    ```
    Then access the exposed IP and port in your browser.
  * **Remote UI Access:** Navigate to the host address of your remote tracking server. The **MLflow UI Table** will display the run, confirming successful logging of all parameters, metrics, and **GPU utilization**.

-----

## πŸ”— Conclusion and Scaling on Saturn Cloud

This template creates a fully observable training environment, fulfilling the core requirements of MLOps for GPU-accelerated workloads. All run detailsβ€”from hyperparameters to **GPU utilization metrics**β€”are now centralized and ready for comparison.

To maximize performance, streamline infrastructure management, and integrate MLOps practices, deploy this template on **Saturn Cloud**:

  * **Official Saturn Cloud Website:** [Saturn Cloud](https://saturncloud.io/)
  * **MLOps Guide:** Saturn Cloud enables a robust MLOps lifecycle by simplifying infrastructure, scaling, and experiment tracking. [A Practical Guide to MLOps](https://saturncloud.io/docs/design-principles/concepts/mlops/)
  * **GPU Clusters:** Easily provision and manage GPU-equipped compute resources, including high-performance NVIDIA A100/H100 GPUs, directly within **Saturn Cloud**.
[Saturn Cloud Documentation](https://saturncloud.io/docs/user-guide/) + +**Start building your scalable MLOps pipeline today on Saturn Cloud\!** \ No newline at end of file diff --git a/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/setup_mlflow_env.sh b/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/setup_mlflow_env.sh new file mode 100644 index 00000000..e04726bf --- /dev/null +++ b/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/setup_mlflow_env.sh @@ -0,0 +1,36 @@ +#!/bin/bash + +ENV_NAME="mlflow_gpu_env_stable" +PYTHON_VERSION="3.12" +CUDA_VERSION="12" + +echo "=================================================" +echo "πŸš€ Setting up MLflow GPU Tracking Environment (Python $PYTHON_VERSION)" +echo "=================================================" + +# --- 1. Create and Activate Stable VENV --- +rm -rf $ENV_NAME +python$PYTHON_VERSION -m venv $ENV_NAME +source $ENV_NAME/bin/activate +echo "βœ… Virtual Environment created and activated." + +# --- 2. Install Core Libraries --- +echo "--- Installing Core MLflow and PyTorch Libraries ---" + +# Install PyTorch (GPU version for CUDA 12.1) +pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 + +# Install MLflow and helper libraries +pip install mlflow==2.11.3 numpy scikit-learn pandas + +# --- 3. Replace Deprecated PYNVML for System Metrics --- +echo "--- Replacing deprecated pynvml with nvidia-ml-py ---" + +# Uninstall old package (if it exists) +pip uninstall -y pynvml + +# Install the correct GPU monitoring package and prerequisites +pip install psutil nvidia-ml-py + +echo "--- Installation Complete ---" +echo "βœ… Environment is ready. Run 'source $ENV_NAME/bin/activate' before executing the Python script." 
\ No newline at end of file
diff --git a/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/train_and_track.py b/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/train_and_track.py
new file mode 100644
index 00000000..1e2b7628
--- /dev/null
+++ b/examples/MLOps_Ops_and_Enterprise_Features/mlflow-tracking/train_and_track.py
@@ -0,0 +1,109 @@
import os
import time
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

import mlflow
import mlflow.pytorch
import numpy as np

# --- Configuration ---
# 1. MLflow Tracking URI (MLflow server or local './mlruns')
MLFLOW_TRACKING_URI = os.environ.get("MLFLOW_TRACKING_URI", "file:./mlruns")
MLFLOW_EXPERIMENT_NAME = "GPU_DeepLearning_Tracking"

# 2. Hyperparameters (logged explicitly via mlflow.log_params below)
PARAMS = {
    "learning_rate": 0.001,
    "epochs": 5,
    "batch_size": 32,
    "model_type": "SimpleConvNet",
    "optimizer": "Adam"
}

# --- PyTorch Model Definition ---
class SimpleConvNet(nn.Module):
    def __init__(self):
        super(SimpleConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(10 * 24 * 24, 1)

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = x.view(-1, 10 * 24 * 24)
        x = self.fc(x)
        return x

def train_and_log(device):

    # --- 1. MLflow Setup ---
    mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
    mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)

    # 2. ENABLE AUTOLOGGING: captures model and params automatically.
    #    Note: mlflow.pytorch.autolog() only triggers for PyTorch Lightning
    #    trainers; this hand-written loop therefore logs manually below.
    mlflow.pytorch.autolog(log_models=True, log_datasets=False)

    # 3. START RUN: Enable system metrics logging inside the run context
    with mlflow.start_run(run_name="GPU_Train_Run", log_system_metrics=True) as run:

        # Log system information manually (GPU type and custom params are not auto-logged)
        if device.type == 'cuda':
            mlflow.log_param("gpu_device", torch.cuda.get_device_name(0))
        mlflow.log_params(PARAMS)

        # --- Training Execution ---
        print(f"Starting training on device: {device} with LR={PARAMS['learning_rate']}")

        # Simulate Data Setup
        data = torch.randn(100, 1, 28, 28, device=device)
        labels = torch.randint(0, 2, (100, 1), dtype=torch.float32, device=device)
        dataloader = DataLoader(TensorDataset(data, labels), batch_size=PARAMS['batch_size'])

        model = SimpleConvNet().to(device)
        optimizer = optim.Adam(model.parameters(), lr=PARAMS['learning_rate'])
        criterion = nn.BCEWithLogitsLoss()

        # Training Loop
        for epoch in range(PARAMS['epochs']):
            total_loss = 0.0

            for inputs, targets in dataloader:
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()

            avg_loss = total_loss / len(dataloader)

            # Manually log the primary metric (autolog does not capture metrics from this custom loop)
            mlflow.log_metric("avg_loss_manual", avg_loss, step=epoch)

            print(f"Epoch {epoch+1} - Loss: {avg_loss:.4f}")

        # 4. Final Logging
        mlflow.log_metric("final_loss", avg_loss)

        print("\nβœ… Training complete.")
        print(f"MLflow Run ID: {run.info.run_id}")


def main():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("πŸ’‘ GPU detected and available.")
    else:
        device = torch.device("cpu")
        print("⚠️ GPU not detected. 
Running on CPU.") + + train_and_log(device) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/examples/MLOps_Ops_and_Enterprise_Features/monitoring_grafana/cpu_monitoring.ipynb b/examples/MLOps_Ops_and_Enterprise_Features/monitoring_grafana/cpu_monitoring.ipynb new file mode 100644 index 00000000..b1cdccef --- /dev/null +++ b/examples/MLOps_Ops_and_Enterprise_Features/monitoring_grafana/cpu_monitoring.ipynb @@ -0,0 +1,228 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# πŸ“Š Prometheus + Grafana Monitoring\n", + "\n", + "**Role:** Installs and runs a system monitoring stack inside this notebook.\n", + "**Focus:** CPU, Memory, Disk, and Network metrics.\n", + "\n", + "**Components:**\n", + "* **Prometheus:** Time-series database (Port 9090)\n", + "* **Grafana:** Dashboard UI (Port 3000)\n", + "* **Node Exporter:** System metrics agent (Port 9100)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Installation (Run Once)\n", + "This cell downloads the standalone binaries. We use a robust script that checks if files exist to prevent errors." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "\n", + "# Define Versions\n", + "PROM_VER=\"2.45.0\"\n", + "GRAFANA_VER=\"10.0.3\"\n", + "NODE_EXP_VER=\"1.6.1\"\n", + "\n", + "# Create directories if they don't exist\n", + "mkdir -p prometheus_bin grafana_bin\n", + "\n", + "# --- Download Prometheus ---\n", + "if [ ! 
-f prometheus_bin/prometheus ]; then\n", + " echo \"⬇️ Downloading Prometheus...\"\n", + " wget -q -nc https://github.com/prometheus/prometheus/releases/download/v${PROM_VER}/prometheus-${PROM_VER}.linux-amd64.tar.gz\n", + " tar xf prometheus-${PROM_VER}.linux-amd64.tar.gz --strip-components=1 -C prometheus_bin\n", + " echo \"βœ… Prometheus Installed\"\n", + "else\n", + " echo \"⏩ Prometheus already installed.\"\n", + "fi\n", + "\n", + "# --- Download Grafana ---\n", + "if [ ! -f grafana_bin/bin/grafana-server ]; then\n", + " echo \"⬇️ Downloading Grafana...\"\n", + " wget -q -nc https://dl.grafana.com/oss/release/grafana-${GRAFANA_VER}.linux-amd64.tar.gz\n", + " tar xf grafana-${GRAFANA_VER}.linux-amd64.tar.gz --strip-components=1 -C grafana_bin\n", + " echo \"βœ… Grafana Installed\"\n", + "else\n", + " echo \"⏩ Grafana already installed.\"\n", + "fi\n", + "\n", + "# --- Download Node Exporter ---\n", + "if [ ! -f prometheus_bin/node_exporter ]; then\n", + " echo \"⬇️ Downloading Node Exporter...\"\n", + " wget -q -nc https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXP_VER}/node_exporter-${NODE_EXP_VER}.linux-amd64.tar.gz\n", + " tar xf node_exporter-${NODE_EXP_VER}.linux-amd64.tar.gz --strip-components=1 -C prometheus_bin\n", + " echo \"βœ… Node Exporter Installed\"\n", + "else\n", + " echo \"⏩ Node Exporter already installed.\"\n", + "fi\n", + "\n", + "# --- Cleanup ---\n", + "rm *.tar.gz 2>/dev/null || true\n", + "echo \"πŸŽ‰ All binaries ready!\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Generate Configuration\n", + "We generate a `prometheus.yml` that configures Prometheus to scrape the Node Exporter every 5 seconds." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prometheus_config = \"\"\"\n", + "global:\n", + " scrape_interval: 5s\n", + "\n", + "scrape_configs:\n", + " - job_name: 'jupyter_cpu_node'\n", + " static_configs:\n", + " - targets: ['localhost:9100'] # Node Exporter\n", + "\"\"\"\n", + "\n", + "with open(\"prometheus.yml\", \"w\") as f:\n", + " f.write(prometheus_config)\n", + "print(\"βœ… 'prometheus.yml' configuration created.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Start Services\n", + "Launch the background processes. Logs are redirected to `.log` files in your directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import subprocess\n", + "import os\n", + "\n", + "BASE_DIR = os.getcwd()\n", + "PROM_BIN = f\"{BASE_DIR}/prometheus_bin/prometheus\"\n", + "GRAFANA_BIN = f\"{BASE_DIR}/grafana_bin/bin/grafana-server\"\n", + "NODE_EXP_BIN = f\"{BASE_DIR}/prometheus_bin/node_exporter\"\n", + "\n", + "def start_process(cmd, log_file):\n", + " with open(log_file, \"w\") as f:\n", + " return subprocess.Popen(cmd.split(), stdout=f, stderr=subprocess.STDOUT)\n", + "\n", + "# 1. Start Node Exporter\n", + "p_node = start_process(f\"{NODE_EXP_BIN}\", \"node_exp.log\")\n", + "print(f\"πŸš€ Node Exporter started (PID: {p_node.pid})\")\n", + "\n", + "# 2. Start Prometheus\n", + "p_prom = start_process(f\"{PROM_BIN} --config.file=prometheus.yml\", \"prometheus.log\")\n", + "print(f\"πŸš€ Prometheus started (PID: {p_prom.pid})\")\n", + "\n", + "# 3. Start Grafana\n", + "grafana_cmd = f\"{GRAFANA_BIN} --homepath {BASE_DIR}/grafana_bin\"\n", + "p_graf = start_process(grafana_cmd, \"grafana.log\")\n", + "print(f\"πŸš€ Grafana started (PID: {p_graf.pid})\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. 
Access Dashboard\n",
    "Grafana is now running on **Port 3000**.\n",
    "\n",
    "**1. Login:**\n",
    "* Open the link below.\n",
    "* User: `admin` / Password: `admin`.\n",
    "\n",
    "**2. Add Data Source:**\n",
    "* Go to **Connections** -> **Data Sources** -> **Add data source**.\n",
    "* Select **Prometheus**.\n",
    "* **IMPORTANT:** Set URL to `http://127.0.0.1:9090` (Do not use localhost).\n",
    "* Click **Save & Test**.\n",
    "\n",
    "**3. Import Dashboard:**\n",
    "* Go to **Dashboards** -> **New** -> **Import**.\n",
    "* Enter ID **1860** (Node Exporter Full) and click **Load**.\n",
    "* Select your Prometheus source and click **Import**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"πŸ“Š Access Grafana here:\")\n",
    "print(\"https://<YOUR-DOMAIN>/proxy/3000/\")\n",
    "print(\"(Note: Replace <YOUR-DOMAIN> with your browser URL domain, or use localhost:3000 if local)\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# πŸ›‘ STOP ALL SERVICES\n",
    "# Run this cell to kill the background processes\n",
    "\n",
    "p_node.terminate()\n",
    "p_prom.terminate()\n",
    "p_graf.terminate()\n",
    "print(\"πŸ›‘ All services stopped.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 🏁 Conclusion\n",
    "\n",
    "This template provides immediate visibility into your compute resources without complex installation steps. It helps you identify bottlenecksβ€”such as single-core saturation or memory limitsβ€”during your data processing jobs.\n",
    "\n",
    "To persist these metrics long-term or monitor multiple nodes, consider deploying this stack as a dedicated service on [Saturn Cloud](https://saturncloud.io/)."
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/AWS_Setup.md b/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/AWS_Setup.md new file mode 100644 index 00000000..f1a10e85 --- /dev/null +++ b/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/AWS_Setup.md @@ -0,0 +1,260 @@ + +### 1. The Network (Click-by-Click Guide) + +To guarantee complete isolation, we will build a VPC that physically cannot route traffic to the public internet. + +#### Step 1.1: Create the Isolated VPC + +1. Log into the AWS Management Console and navigate to the **VPC Dashboard**. +2. Click the orange **Create VPC** button. +3. Under *Resources to create*, select **VPC and more**. +4. **Name tag auto-generation:** Enter `airgapped-vpc`. +5. **IPv4 CIDR block:** Leave as default (usually `10.0.0.0/16`). +6. **Number of Availability Zones:** Select **1** (to keep costs down). +7. **Number of public subnets:** Select **0** *(CRITICAL: This ensures no Internet Gateway is created).* +8. **Number of private subnets:** Select **1**. +9. **NAT gateways ($):** Select **None** *(CRITICAL: This ensures the private subnet cannot leak to the internet).* +10. **VPC endpoints:** Select **None** (We will configure these explicitly in the next steps). +11. Click **Create VPC**. + +#### Step 1.2: Create the Internal Security Group + +Before we create our endpoints, we need a firewall rule that allows our VPC to talk to itself internally. + +1. On the left sidebar of the VPC Dashboard, click **Security Groups**. +2. Click **Create security group**. +3. 
**Security group name:** `airgapped-internal-sg`.
4. **VPC:** Select your new `airgapped-vpc` from the dropdown.
5. **Inbound rules:** Click **Add rule**.
   * **Type:** Select `HTTPS` (Port 443 is required for SSM Interface Endpoints).
   * **Source:** Select `Custom` and type in your VPC's CIDR block (e.g., `10.0.0.0/16`).
6. Click **Create security group**.

#### Step 1.3: Create the S3 Gateway Endpoint

This endpoint creates a private route between your VPC and S3, allowing your EC2 instance to reach the bucket without traversing the public internet.

1. On the left sidebar of the VPC Dashboard, click **Endpoints**.
2. Click **Create endpoint**.
3. **Name tag:** `s3-gateway-endpoint`.
4. **Service category:** Select **AWS services**.
5. **Services:** In the search bar, type `s3` and press Enter.
6. Look at the *Type* column. You must select the one that says **Gateway** (e.g., `com.amazonaws.eu-north-1.s3`).
7. **VPC:** Select `airgapped-vpc`.
8. **Route tables:** Check the box next to the route table associated with your private subnet. *(This automatically injects the route so your server knows how to find S3).*
9. Click **Create endpoint**.

#### Step 1.4: Create the SSM Interface Endpoints

Because your EC2 instance has no internet, you cannot SSH into it normally. AWS Systems Manager (SSM) allows secure terminal access, but it requires three specific "Interface" endpoints to function offline.

You will repeat this exact process **three times**, once for each required service:

1. Click **Create endpoint**.
2. **Name tag:** `ssm-endpoint-1` (then 2, then 3).
3. **Service category:** **AWS services**.
4. **Services:** Search for and select the following services one by one (Ensure the *Type* says **Interface**):
   * First endpoint: `com.amazonaws.[your-region].ssm`
   * Second endpoint: `com.amazonaws.[your-region].ssmmessages`
   * Third endpoint: `com.amazonaws.[your-region].ec2messages`
5. **VPC:** Select `airgapped-vpc`.
6.
**Subnets:** Check the box for your one Availability Zone, then select your private subnet. +7. **Security groups:** Check the box for the `airgapped-internal-sg` you created in Step 1.2. +8. Click **Create endpoint**. *(Repeat until all 3 are created).* + +--- + +### 2. Identity & Security (IAM Role & S3 Policy) + +Before we launch the server, we need to give it an "ID Badge" (IAM Role) and lock down the S3 bucket so it strictly trusts that badge and your internal network. + +#### Step 2.1: Create the EC2 IAM Role + +1. Navigate to the **IAM Dashboard** in the AWS Console. +2. On the left sidebar, click **Roles**, then click the orange **Create role** button. +3. **Trusted entity type:** Select **AWS service**. +4. **Use case:** Select **EC2** and click **Next**. +5. **Add permissions:** Use the search bar to find and check the boxes next to these two exact policies: +* `AmazonSSMManagedInstanceCore` *(Allows the SSM tunnel to connect)* +* `AmazonS3FullAccess` *(Allows the server to read/write to your bucket)* + + +6. Click **Next**. +7. **Role name:** Type `Airgapped-EC2-Role`. +8. Click **Create role**. + +#### Step 2.2: Create and Lock Down the S3 Bucket + +1. Navigate to the **S3 Dashboard**. +2. Click **Create bucket**. +3. **Bucket name:** Choose a globally unique name (e.g., `airgapped-vpc-ssm`). +4. Leave all other settings as default (Ensure "Block all public access" remains **checked**) and click **Create bucket**. +5. Open your new bucket by clicking its name, then navigate to the **Permissions** tab. +6. Scroll down to **Bucket policy** and click **Edit**. +7. Paste the strict Zero-Trust policy below. **CRITICAL:** You must replace the `YOUR_BUCKET_NAME` with your actual bucket name, and replace `vpce-XXXXXXXXXXXXXXXXX` with the actual ID of the S3 Gateway Endpoint you created in Step 1.3! 
+ +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Sid": "VPCOnlyAccess", + "Effect": "Deny", + "Principal": "*", + "Action": "s3:*", + "Resource": [ + "arn:aws:s3:::YOUR_BUCKET_NAME", + "arn:aws:s3:::YOUR_BUCKET_NAME/*" + ], + "Condition": { + "StringNotEquals": { + "aws:sourceVpce": "vpce-XXXXXXXXXXXXXXXXX" + } + } + } + ] +} + +``` + +8. Click **Save changes**. *(Note: You will immediately see an "Access Denied" error in your browser. This proves the airgap is working, as your browser is not inside the VPC!)* + +--- + +## πŸ’Ύ Phase 2: EC2 Deployment & S3 Mountpoint (Click-by-Click) + +Now we drop the server straight into the airgapped vault. + +#### Step 2.1: Launch the Airgapped EC2 Instance + +1. Navigate to the **EC2 Dashboard** and click **Launch instance**. +2. **Name:** `Airgapped-Dashboard-Server`. +3. **Application and OS Images (AMI):** Select **Amazon Linux 2023 AMI**. +4. **Instance type:** Select **t2.micro** or **t3.micro** (CPU is all we need). +5. **Key pair (login):** Select **Proceed without a key pair** from the dropdown. *(We do not need SSH keys because we are using SSM!)* +6. **Network settings:** Click the **Edit** button in this box. +* **VPC:** Select your `airgapped-vpc`. +* **Subnet:** Select your private subnet. +* **Auto-assign public IP:** Select **Disable**. *(Crucial for the airgap).* +* **Firewall (security groups):** Click **Select existing security group** and choose your `airgapped-internal-sg`. + + +7. **Advanced details:** Expand this bottom section, scroll down to **IAM instance profile**, and select your `Airgapped-EC2-Role`. +8. Click the orange **Launch instance** button. + +#### Step 2.2: Connect and Mount S3 + +1. Wait a few minutes for the instance state to show "Running". +2. Select the instance, click the **Connect** button at the top of the screen. +3. Select the **Session Manager** tab and click **Connect**. A black terminal will open in your browser. +4. 
Run these exact commands to attach your S3 bucket as a local folder: + +```bash +sudo mkdir -p /mnt/s3-data +sudo chmod 777 /mnt/s3-data +sudo dnf install mount-s3 -y +mount-s3 YOUR_BUCKET_NAME /mnt/s3-data + +``` + +--- + +## πŸ“¦ Phase 3: The "Airgap Bypass" (Dependency Smuggling) + +Because the server has zero internet access, standard Python package installations will fail. We must smuggle them in. + +#### Step 3.1: Package the Files Locally + +Open a terminal on your **local, internet-connected laptop** (e.g., your Kali Linux terminal) and run: + +```bash +mkdir sm_packages_linux +pip download --only-binary=:all: --platform manylinux2014_x86_64 --python-version 39 streamlit boto3 -d sm_packages_linux +zip -r sm_packages_linux.zip sm_packages_linux + +``` + +#### Step 3.2: Smuggle via S3 + +1. Go to your **EC2 Session Manager Terminal** and temporarily delete the strict bucket policy so you can upload from your browser: +```bash +aws s3api delete-bucket-policy --bucket YOUR_BUCKET_NAME + +``` + + +2. Go back to your **AWS S3 Console** in your web browser. Refresh the page. +3. Click **Upload** and drop your `sm_packages_linux.zip` file into the bucket. +4. **CRITICAL:** Once uploaded, immediately go back to the S3 **Permissions** tab and paste your strict Bucket Policy back in to lock the vault! + +#### Step 3.3: Offline Installation + +Go back to your **EC2 Session Manager Terminal** and unpack the smuggled files: + +```bash +# Install pip securely via Amazon's internal S3 repos +sudo dnf install python3-pip -y + +# Move the zip file out of the S3 mount to allow unzipping +cd ~ +cp /mnt/s3-data/sm_packages_linux.zip . 
+unzip sm_packages_linux.zip + +# Install strictly offline, bypassing any OS-level conflicts +python3 -m pip install --user --no-index --find-links=sm_packages_linux/ streamlit boto3 + +``` + +--- + +## πŸ“Š Phase 4: Application Deployment & Tunneling + +#### Step 4.1: Create and Run the Dashboard + +In your **EC2 Session Manager Terminal**, create the file: + +```bash +nano app.py + +``` + +*(Paste the Python code from the Overview section here, save with `Ctrl+O` -> `Enter`, and exit with `Ctrl+X`)*. + +Run the dashboard using the explicit user path: + +```bash +/home/ssm-user/.local/bin/streamlit run app.py + +``` + +*(Leave this terminal running!)* + +#### Step 4.2: Build the Secure Tunnel (Local Laptop) + +Open a new terminal on your **local laptop**. Ensure your AWS CLI is configured (`aws configure`) with an IAM user that has Administrative or SSM access. + +Run the tunnel command: + +```bash +aws ssm start-session \ + --target i-XXXXXXXXXXXXXXXXX \ + --document-name AWS-StartPortForwardingSession \ + --parameters "portNumber"=["8501"],"localPortNumber"=["8501"] \ + --region eu-north-1 + +``` + +*(Replace `i-XXXXXXXXXXXXXXXXX` with your actual EC2 Instance ID, found on the EC2 Dashboard).* + +#### Step 4.3: View the Dashboard + +Once your local terminal says `Waiting for connections...`, open your local web browser and explicitly type **HTTP** (not HTTPS): + +```text +http://localhost:8501 + +``` \ No newline at end of file diff --git a/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/README.md b/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/README.md new file mode 100644 index 00000000..c068819c --- /dev/null +++ b/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/README.md @@ -0,0 +1,192 @@ +# πŸ”’ Private Data Access (VPC Only) + +**Hardware:** CPU (Amazon Linux 2023) | **Resource:** Dashboard | **Tech Stack:** AWS SDK (Python/Boto3), Streamlit | **Architecture:** VPC-Only S3 Mounts + +![vpc--padlock](vpc-padlock.png) + +## πŸ“– 
Overview

This template deploys a secure, airgapped S3 Data Gateway inside an isolated AWS environment. It reads data strictly from a local Mountpoint for Amazon S3 and queries AWS health metrics via VPC Endpoints. It is designed to be accessed purely through an encrypted AWS Systems Manager (SSM) tunnel, bypassing the public internet entirely.

## βœ… Prerequisites

Ensure the AWS environment is already configured with the following:

* **The Network:** A private VPC subnet containing an Amazon Linux 2023 EC2 instance, with **no** Internet Gateway (IGW) or NAT Gateway attached.
* **VPC Endpoints:** An S3 Gateway Endpoint and SSM Interface Endpoints (`ssm`, `ssmmessages`, `ec2messages`) attached to your VPC.
* **IAM & Security:** The EC2 instance has an IAM Role attached with `AmazonSSMManagedInstanceCore` and S3 permissions.
* **Local Machine:** Your local terminal has the AWS CLI configured and the **AWS Session Manager Plugin** installed.

Note: **AWS_Setup.md** contains a step-by-step procedure for setting up this environment.

---

## πŸ“¦ Step 1: The "Airgap Bypass" (Dependency Smuggling)

Because your target EC2 instance has no internet access, standard `pip install` commands will fail. You must package the Python dependencies on your local machine and transfer them to the server via your S3 bucket.

**1. Download the Linux Packages Locally**
Open a terminal on your **local, internet-connected machine** and run these commands to force `pip` to download the specific Amazon Linux (`manylinux`) offline installers:

```bash
mkdir sm_packages_linux
pip download --only-binary=:all: --platform manylinux2014_x86_64 --python-version 39 streamlit boto3 -d sm_packages_linux
zip -r sm_packages_linux.zip sm_packages_linux
```

**2. Upload to S3**

* Navigate to the **AWS S3 Console** in your web browser.
* Upload the `sm_packages_linux.zip` file directly to the root of your target S3 bucket.
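(Optional) Before the upload in step 2, you can sanity-check the download from step 1. Every wheel must be either pure-Python (platform tag `any`) or a `manylinux` build; anything else will fail the offline install on the instance. A small local sketch (it only assumes the `sm_packages_linux` directory created above):

```python
from pathlib import Path

def wheel_platform_tag(filename):
    # The platform tag is the last dash-separated field of a wheel filename,
    # e.g. "boto3-1.34.0-py3-none-any.whl" -> "any".
    return filename.rsplit(".whl", 1)[0].split("-")[-1]

def find_incompatible_wheels(directory):
    # Flag wheels that are neither pure-Python nor manylinux builds.
    bad = []
    for wheel in sorted(Path(directory).glob("*.whl")):
        tag = wheel_platform_tag(wheel.name)
        if tag != "any" and not tag.startswith("manylinux"):
            bad.append(wheel.name)
    return bad

if __name__ == "__main__":
    problems = find_incompatible_wheels("sm_packages_linux")
    print("All wheels look installable offline." if not problems else problems)
```

If the script flags anything, re-run the `pip download` command above and confirm the `--platform` and `--python-version` flags were not dropped.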
---

## πŸš€ Step 2: Automated Server Deployment

Connect to your airgapped EC2 instance via **AWS Systems Manager (SSM) Session Manager** in the AWS Console.

Instead of running commands one by one, we will use an automated deployment script.

**1. Create the Deployment Script**
In your EC2 terminal, create a new script file:

```bash
nano deploy_dashboard.sh
```

**2. Paste the Automation Code**
Paste the code below into the file. **CRITICAL: Update the `BUCKET_NAME` and `REGION` variables at the very top of the script to match your environment!**

```bash
#!/bin/bash

# ==========================================
# CONFIGURATION (UPDATE THESE VARIABLES!)
# ==========================================
BUCKET_NAME="YOUR_BUCKET_NAME"
REGION="eu-north-1"
# ==========================================

MOUNT_DIR="/mnt/s3-data"
ZIP_FILE="sm_packages_linux.zip"

echo "πŸš€ Starting Airgapped Dashboard Deployment..."

echo "πŸ“ Creating mount directory and installing mount-s3..."
sudo mkdir -p $MOUNT_DIR
sudo chmod 777 $MOUNT_DIR
sudo dnf install mount-s3 -y

echo "πŸ”— Mounting S3 bucket..."
mount-s3 $BUCKET_NAME $MOUNT_DIR

echo "🐍 Installing Python pip (and unzip, in case the AMI lacks it)..."
sudo dnf install python3-pip unzip -y

echo "πŸ“¦ Extracting smuggled dependencies..."
cd ~
cp $MOUNT_DIR/$ZIP_FILE .
unzip -o $ZIP_FILE

echo "βš™οΈ Installing Python packages offline..."
python3 -m pip install --user --no-index --find-links=sm_packages_linux/ streamlit boto3

echo "πŸ“ Generating app.py..."
+cat << EOF > app.py +import streamlit as st +import os +import boto3 + +MOUNT_PATH = "${MOUNT_DIR}" +BUCKET_NAME = "${BUCKET_NAME}" +REGION = "${REGION}" + +st.set_page_config(page_title="Secure VPC Dashboard", layout="wide") +st.title("πŸ”’ Airgapped S3 Data Gateway") +st.markdown("This dashboard runs securely inside a private VPC, completely isolated from the public internet.") + +col1, col2 = st.columns(2) + +with col1: + st.header("πŸ“ Local S3 Mount Viewer") + try: + files = os.listdir(MOUNT_PATH) + st.success(f"Connected to mount: \`{MOUNT_PATH}\`") + if files: + for file in files: + st.write(f"πŸ“„ {file}") + else: + st.info("The bucket is currently empty.") + except Exception as e: + st.error(f"Mount read error. Error: {e}") + +with col2: + st.header("πŸ“‘ AWS SDK Health Monitor") + try: + s3_client = boto3.client('s3', region_name=REGION) + response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, MaxKeys=5) + st.success("βœ… VPC Endpoint Connection Active") + + if 'Contents' in response: + st.write("**Recent Objects via SDK:**") + for obj in response['Contents']: + st.write(f"☁️ \`{obj['Key']}\` ({round(obj['Size'] / 1024, 2)} KB)") + else: + st.info("No objects found via SDK.") + except Exception as e: + st.error(f"VPC Endpoint connection failed. Error: {e}") +EOF + +echo "πŸŽ‰ Deployment complete! Starting Streamlit server..." +/home/ssm-user/.local/bin/streamlit run app.py + +``` + +*(Save the file: Press `Ctrl+O`, `Enter`, then `Ctrl+X`).* + +**3. Run the Script** +Make the script executable and run it. It will automatically handle the S3 mounts, offline installations, Python code generation, and will launch the server directly! + +```bash +chmod +x deploy_dashboard.sh +./deploy_dashboard.sh + +``` + +*(Leave this terminal open and running once it says it is waiting on port 8501!)* + +--- + +## πŸš‡ Step 3: Access the Dashboard (Local Tunnel) + +Your dashboard is now running, but it is trapped inside the airgapped VPC. + +**1. 
Open the Encrypted Tunnel**
Open a terminal on your **local machine** and run the following command to securely forward the traffic. *(Replace `TARGET_INSTANCE_ID` and `YOUR_REGION` with your actual EC2 details)*:

```bash
aws ssm start-session \
  --target TARGET_INSTANCE_ID \
  --document-name AWS-StartPortForwardingSession \
  --parameters "portNumber"=["8501"],"localPortNumber"=["8501"] \
  --region YOUR_REGION
```

**2. View the Interface**
Once your local terminal says `Port 8501 opened for sessionId... Waiting for connections...`, open a web browser on your local machine and strictly navigate to the **HTTP** address (do not use HTTPS):

```text
http://localhost:8501
```

πŸŽ‰ **Your Secure Airgapped Dashboard is now live!**

---

*Built with [Saturn](https://saturncloud.io) β€” Explore more templates and secure deployments.*

---
diff --git a/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/app.py b/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/app.py
new file mode 100644
index 00000000..45f84e84
--- /dev/null
+++ b/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/app.py
@@ -0,0 +1,42 @@
import streamlit as st
import os
import boto3

MOUNT_PATH = "/mnt/s3-data"       # must match the Mountpoint directory from the deploy script
BUCKET_NAME = "YOUR_BUCKET_NAME"  # update to your bucket name
REGION = "eu-north-1"             # update to your AWS region

st.set_page_config(page_title="Secure VPC Dashboard", layout="wide")
st.title("πŸ”’ Airgapped S3 Data Gateway")
st.markdown("This dashboard runs securely inside a private VPC, completely isolated from the public internet.")

col1, col2 = st.columns(2)

with col1:
    st.header("πŸ“ Local S3 Mount Viewer")
    try:
        files = os.listdir(MOUNT_PATH)
        st.success(f"Connected to mount: `{MOUNT_PATH}`")
        if files:
            for file in files:
                st.write(f"πŸ“„ {file}")
        else:
            st.info("The bucket is currently empty.")
    except Exception as e:
        st.error(f"Mount read error. Error: {e}")

with col2:
    st.header("πŸ“‘ AWS SDK Health Monitor")
    try:
        s3_client = boto3.client('s3', region_name=REGION)
        response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, MaxKeys=5)
        st.success("βœ… VPC Endpoint Connection Active")

        if 'Contents' in response:
            st.write("**Recent Objects via SDK:**")
            for obj in response['Contents']:
                st.write(f"☁️ `{obj['Key']}` ({round(obj['Size'] / 1024, 2)} KB)")
        else:
            st.info("No objects found via SDK.")
    except Exception as e:
        st.error(f"VPC Endpoint connection failed. Error: {e}")
\ No newline at end of file
diff --git a/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/vpc-padlock.png b/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/vpc-padlock.png
new file mode 100644
index 00000000..825067fb
Binary files /dev/null and b/examples/MLOps_Ops_and_Enterprise_Features/private-vpc/vpc-padlock.png differ
diff --git a/examples/MLOps_Ops_and_Enterprise_Features/secrets-iam/README.md b/examples/MLOps_Ops_and_Enterprise_Features/secrets-iam/README.md
new file mode 100644
index 00000000..3e989c4e
--- /dev/null
+++ b/examples/MLOps_Ops_and_Enterprise_Features/secrets-iam/README.md
@@ -0,0 +1,239 @@
# πŸ” Template: Secrets & IAM Patterns

*Powered by [Saturn](https://saturncloud.io) β€” Secure Cloud Architectures.*

**Hardware:** CPU (Amazon Linux 2023) | **Resource:** Jupyter Notebook | **Tech Stack:** Vault (AWS SSM), Env (`.env`), Python (Boto3) | **Goal:** Secure S3 Access

## πŸ“– Overview

Hardcoded credentials (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`) are among the most common causes of cloud security breaches. This template provides a gold-standard architecture for securely accessing Amazon S3 from a Jupyter Notebook.

Instead of typing secret keys into your code, this architecture uses:

1. **IAM Role Assumption:** Grants the notebook temporary, native permissions to AWS resources without static keys.
2.
**Vault Integration (AWS SSM Parameter Store):** Securely fetches sensitive configuration (like target bucket names or database connection strings) at runtime.
3. **Env (`.env`):** Loads non-sensitive local environment state seamlessly, keeping code and configuration completely separate.

---

## πŸ—οΈ Phase 1: AWS Cloud Setup

### Step 1.1: Create the Target S3 Bucket

1. Go to the **S3 Dashboard** in the AWS Console and click **Create bucket**.
2. **Name:** Give it a unique name (e.g., `jupyter-vault-data-12345`).
3. Leave all default settings (ensure **Block all public access** remains checked).
4. Click **Create bucket**.
5. Upload a simple test file (like a `.txt` or `.csv`) into the bucket so you have data to read later.

### Step 1.2: Store the Bucket Name in the Vault

We will store the bucket name in AWS Systems Manager (SSM) Parameter Store as an encrypted `SecureString` so it is never hardcoded in our Python script.

1. Go to the **Systems Manager Dashboard**.
2. On the left sidebar, click **Parameter Store**, then click **Create parameter**.
3. **Name:** Type `/jupyter/config/s3_bucket_name`
4. **Tier:** Standard (free).
5. **Type:** Select **SecureString** *(this encrypts the data using AWS KMS).*
6. **Value:** Type the exact name of your new S3 bucket (e.g., `jupyter-vault-data-12345`).
7. Click **Create parameter**.

### Step 1.3: Create the EC2 IAM Role (Zero-Trust)

1. Go to the **IAM Dashboard**, click **Roles** on the left, and click **Create role**.
2. **Trusted entity type:** Select **AWS service** -> **EC2**. Click Next.
3. **Permissions:** Search for and check exactly these three policies:
   * `AmazonS3ReadOnlyAccess` *(allows reading the S3 data)*
   * `AmazonSSMReadOnlyAccess` *(allows reading the Vault)*
   * `AmazonSSMManagedInstanceCore` *(allows us to connect a secure terminal to the server)*
4. Click Next, name the role `Jupyter-Vault-Role`, and click **Create role**.
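The steps above are console-driven. If you prefer to script Step 1.2, the same `SecureString` can be written through boto3's `put_parameter` API. A minimal sketch — the SSM client is passed in (so the function is easy to test without AWS credentials), and the bucket name shown is the example one from Step 1.1:

```python
def store_bucket_name(ssm_client, bucket_name):
    """Mirror Step 1.2: store the bucket name as a KMS-encrypted SecureString."""
    ssm_client.put_parameter(
        Name="/jupyter/config/s3_bucket_name",
        Value=bucket_name,
        Type="SecureString",  # encrypted with your account's default KMS key
        Overwrite=True,       # allow re-runs to update the value
    )

# Usage, with admin credentials configured locally:
#   import boto3
#   store_bucket_name(boto3.client("ssm", region_name="eu-north-1"),
#                     "jupyter-vault-data-12345")
```

Either way, the notebook in Phase 4 reads this parameter back with `get_parameter(..., WithDecryption=True)`, so the name must match exactly.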
+ +--- + +## πŸ’» Phase 2: Compute Deployment + +### Step 2.1: Launch the EC2 Instance + +1. Go to the **EC2 Dashboard** and click **Launch instance**. +2. **Name:** `Jupyter-Vault-Server` +3. **AMI:** Amazon Linux 2023. +4. **Instance Type:** `t2.micro` or `t3.micro`. +5. **Key pair:** Select **Proceed without a key pair** (We will use SSM for secure access). +6. **Advanced Details:** Scroll down to **IAM instance profile** and select the `Jupyter-Vault-Role`. +7. Click **Launch instance**. + +### Step 2.2: Install the Tech Stack & Fix Directory Permissions + +AWS SSM connections can sometimes drop your terminal into the root system directory (which triggers "Permission Denied" errors in Jupyter). We will explicitly create a user workspace to prevent this. + +1. Once running, select the instance and click **Connect** -> **Session Manager** -> **Connect**. +2. Run these commands to install Jupyter, Boto3, and python-dotenv: +```bash +sudo dnf install python3-pip -y +python3 -m pip install jupyterlab boto3 python-dotenv + +``` + + +3. Create the workspace and the `.env` file explicitly: +```bash +# Create and enter the exact user directory +mkdir -p /home/ssm-user/notebooks +cd /home/ssm-user/notebooks + +# Write environment variables +echo "APP_ENV=production" > .env +echo "REGION=eu-north-1" >> .env + +``` + + +*(Note: Change `eu-north-1` to the region where you built your bucket!)* +4. Start the Jupyter Notebook server securely in the background, enforcing the correct directory: +```bash +/home/ssm-user/.local/bin/jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --notebook-dir=/home/ssm-user/notebooks & + +``` + + +5. 
Press `Enter`, then grab your unique login token: +```bash +/home/ssm-user/.local/bin/jupyter server list + +``` + + +*(Copy the long string of characters after `token=`)* + +--- + +## πŸš‡ Phase 3: Access the Notebook (Local Tunnel Automation) + +Instead of typing out a complex AWS CLI command every time you want to access your notebook, we will create a local shell script to automate the secure tunnel. + +**1. Create the Tunnel Script** +On your **local machine** (ensure the AWS CLI is configured and the Session Manager Plugin is installed), open a terminal and create a new script: + +```bash +nano start_tunnel.sh + +``` + +**2. Paste the Automation Code** +Paste the following code into the file. **CRITICAL: Update the `TARGET_INSTANCE_ID` and `REGION` variables to match your EC2 instance!** + +```bash +#!/bin/bash + +# ========================================== +# CONFIGURATION (UPDATE THESE VARIABLES!) +# ========================================== +TARGET_INSTANCE_ID="i-xxxxxxxxxxxxxxxxx" +REGION="eu-north-1" +PORT="8888" +# ========================================== + +echo "πŸš€ Starting Secure SSM Tunnel to EC2 Instance: $TARGET_INSTANCE_ID..." +echo "🌐 Region: $REGION" +echo "πŸ”Œ Port Forwarding: Local $PORT -> Remote $PORT" +echo "⚠️ Keep this terminal open! Press Ctrl+C to close the tunnel." +echo "" + +aws ssm start-session \ + --target $TARGET_INSTANCE_ID \ + --document-name AWS-StartPortForwardingSession \ + --parameters "portNumber"=["$PORT"],"localPortNumber"=["$PORT"] \ + --region $REGION + +``` + +*(Save and exit: `Ctrl+O`, `Enter`, `Ctrl+X`)* + +**3. Run the Tunnel** +Make the script executable and run it: + +```bash +chmod +x start_tunnel.sh +./start_tunnel.sh + +``` + +**4. 
Log In**
Once your terminal says "Port 8888 opened," open your local web browser and explicitly navigate to: `http://localhost:8888`
*(Paste your token from Phase 2 into the password box to log in.)*

---

## πŸ““ Phase 4: The Secure Code Pattern

Create a new Python 3 notebook in Jupyter. Paste and run this exact code block to see the architecture in action.

```python
import os
import boto3
from dotenv import load_dotenv

# ==========================================
# 1. ENV INTEGRATION
# Load local environment configuration
# ==========================================
load_dotenv()
environment = os.getenv("APP_ENV", "development")
region = os.getenv("REGION", "eu-north-1")

print(f"πŸš€ Initializing Notebook in [{environment}] mode...")

# ==========================================
# 2. VAULT INTEGRATION
# Fetch secure data from AWS Parameter Store
# ==========================================
ssm_client = boto3.client('ssm', region_name=region)

try:
    vault_response = ssm_client.get_parameter(
        Name='/jupyter/config/s3_bucket_name',
        WithDecryption=True
    )
    SECURE_BUCKET = vault_response['Parameter']['Value']
    print("βœ… Vault query successful. Secure configuration loaded.")
except Exception as e:
    print(f"❌ Vault Error: {e}")
    raise  # stop here: the S3 step below needs the bucket name

# ==========================================
# 3. SECURE S3 ACCESS
# Query the bucket using the decrypted Vault data
# ==========================================
s3_client = boto3.client('s3', region_name=region)

print(f"\nπŸ“ Accessing S3 Bucket: {SECURE_BUCKET}")
try:
    s3_response = s3_client.list_objects_v2(Bucket=SECURE_BUCKET, MaxKeys=5)

    if 'Contents' in s3_response:
        for obj in s3_response['Contents']:
            size_kb = round(obj['Size'] / 1024, 2)
            print(f" - πŸ“„ {obj['Key']} ({size_kb} KB)")
    else:
        print(" - Bucket is empty.
(Upload a test file in the AWS Console to see it here!)") +except Exception as e: + print(f"❌ S3 Access Error: {e}") + +``` + +--- + +## 🏁 Conclusion + +This template successfully demonstrates a **Zero-Trust Architecture** for data science workloads. By entirely removing static access keys and hardcoded secrets from the codebase, we eliminate the risk of accidental credential leaks via GitHub or shared scripts. + +The integration of `python-dotenv` ensures local environment flexibility, while **AWS Systems Manager Parameter Store** acts as a robust, encrypted vault for sensitive infrastructure details. Finally, binding permissions directly to the EC2 compute instance via **IAM Roles** guarantees that access rights remain strictly confined to authorized AWS resources, setting a gold standard for secure cloud development. + +### πŸ”— Verifiable Resources + +* **AWS Parameter Store Documentation:** [AWS Systems Manager User Guide](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html) +* **Boto3 SSM Integration:** [Boto3 Docs - SSM Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ssm.html) +* **Explore More Secure Architectures:** [Saturn Cloud Templates](https://saturncloud.io) + +--- \ No newline at end of file diff --git a/examples/MLOps_Ops_and_Enterprise_Features/secrets-iam/start_tunnel.sh b/examples/MLOps_Ops_and_Enterprise_Features/secrets-iam/start_tunnel.sh new file mode 100644 index 00000000..0e24550f --- /dev/null +++ b/examples/MLOps_Ops_and_Enterprise_Features/secrets-iam/start_tunnel.sh @@ -0,0 +1,21 @@ +#!/bin/bash + +# ========================================== +# CONFIGURATION (UPDATE THESE VARIABLES!) +# ========================================== +TARGET_INSTANCE_ID="i-xxxxxxxxxxxxxxxxx" +REGION="eu-north-1" +PORT="8888" +# ========================================== + +echo "πŸš€ Starting Secure SSM Tunnel to EC2 Instance: $TARGET_INSTANCE_ID..." 
+echo "🌐 Region: $REGION" +echo "πŸ”Œ Port Forwarding: Local $PORT -> Remote $PORT" +echo "⚠️ Keep this terminal open! Press Ctrl+C to close the tunnel." +echo "" + +aws ssm start-session \ + --target $TARGET_INSTANCE_ID \ + --document-name AWS-StartPortForwardingSession \ + --parameters "portNumber"=["$PORT"],"localPortNumber"=["$PORT"] \ + --region $REGION \ No newline at end of file