Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
282 changes: 174 additions & 108 deletions docs/STICKY_SESSIONS.md
Original file line number Diff line number Diff line change
@@ -1,126 +1,192 @@
# Sticky Sessions for MCP Session Affinity

## Overview

AgentGateway stores MCP sessions in-memory. When running multiple replicas, requests from the same client must be routed to the same replica to maintain session state.

**Sticky sessions** (also called session affinity) use HTTP cookies to ensure all requests from a client are routed to the same replica.

## How It Works
# Remediation Plan:

**Severity:** medium
**Category:** threat-model
**Estimated Effort:** 8-12 hours

## Summary
Create or update threat model documentation for sticky sessions implementation to identify and mitigate security risks associated with session affinity mechanisms

## Affected Components
- session_management
- load_balancer
- documentation

## Implementation Steps
### Step 1: Analyze current sticky sessions implementation
Review the existing sticky sessions configuration and implementation to understand the current architecture, session binding mechanisms, and potential security gaps

**Files to modify:**
- `docs/STICKY_SESSIONS.md`

**Example code:**
```python
# Document current implementation
## Current Architecture
- Load balancer: [type]
- Session binding method: [cookie/IP/header]
- Session storage: [in-memory/database/cache]
- Failover mechanism: [description]
```

1. When a client makes their first request, Azure Container Apps routes it to any available replica
2. The response includes an `ARRAffinity` cookie that identifies the replica
3. Subsequent requests include this cookie, ensuring they go to the same replica
4. If a replica becomes unavailable, the client is routed to a new replica (session will be lost)
_Note: Document all components involved in sticky session management_

### Step 2: Conduct STRIDE threat analysis for sticky sessions
Perform a comprehensive STRIDE analysis specifically for sticky sessions functionality, identifying threats across all categories

**Files to modify:**
- `docs/STICKY_SESSIONS.md`

**Example code:**
```python
## STRIDE Threat Analysis
### Spoofing
- Session ID prediction attacks
- Cookie hijacking
### Tampering
- Session data modification
- Load balancer configuration tampering
### Repudiation
- Insufficient session logging
### Information Disclosure
- Session data exposure
- Server affinity information leakage
### Denial of Service
- Session exhaustion attacks
- Uneven load distribution
### Elevation of Privilege
- Session fixation attacks
- Cross-session data access
```

## Configuration
_Note: Consider all attack vectors specific to sticky session implementations_

### Terraform (Recommended)
### Step 3: Define security controls and mitigations
Document specific security controls and mitigation strategies for each identified threat, including technical implementation details

Sticky sessions are enabled by default in the Terraform module (`terraform/`):
**Files to modify:**
- `docs/STICKY_SESSIONS.md`

```hcl
# In terraform/terraform.tfvars
environment = "prod"
resource_group_name = "agentgateway-prod-rg"
**Example code:**
```python
## Security Controls
### Session Security
- Use cryptographically secure session IDs (minimum 128-bit entropy)
- Implement session timeout (idle: 30min, absolute: 8hrs)
- Enable secure and httpOnly cookie flags
- Regenerate session ID on authentication state changes

# Enable sticky sessions for MCP session affinity (default: true)
enable_sticky_sessions = true
### Load Balancer Security
- Configure TLS termination with strong ciphers
- Implement rate limiting per client IP
- Enable health checks with authentication
- Log all session routing decisions
```

The module automatically runs `az containerapp ingress sticky-sessions set` after creating the Container App.

### Azure CLI

Enable manually with:

```bash
az containerapp ingress sticky-sessions set \
--name <container-app-name> \
--resource-group <resource-group> \
--affinity sticky
_Note: Include specific configuration parameters and code snippets where applicable_

### Step 4: Document session lifecycle and security boundaries
Create detailed documentation of the secure session lifecycle, including creation, validation, renewal, and termination processes

**Files to modify:**
- `docs/STICKY_SESSIONS.md`

**Example code:**
```python
## Secure Session Lifecycle
### Session Creation
1. Generate cryptographically random session ID
2. Create server-side session storage
3. Set secure cookie with proper flags
4. Log session creation event

### Session Validation
1. Verify session ID format and entropy
2. Check session expiration
3. Validate server affinity
4. Confirm user authorization

### Session Termination
1. Clear server-side session data
2. Invalidate client-side cookies
3. Log session termination
4. Update load balancer routing
```

Verify configuration:

```bash
az containerapp show \
--name <container-app-name> \
--resource-group <resource-group> \
--query "properties.configuration.ingress.stickySessions"
_Note: Include error handling and edge cases in the documentation_

### Step 5: Create monitoring and alerting specifications
Define monitoring requirements and alerting thresholds for detecting security incidents related to sticky sessions

**Files to modify:**
- `docs/STICKY_SESSIONS.md`

**Example code:**
```python
## Security Monitoring
### Metrics to Monitor
- Session creation/destruction rates
- Failed authentication attempts per session
- Session duration anomalies
- Load distribution imbalances
- Cookie tampering attempts

### Alert Thresholds
- >100 failed authentications/minute from single IP
- Session duration >12 hours
- >10% load imbalance between servers
- Invalid session ID format attempts
```

## Requirements

- **Single Revision Mode**: Sticky sessions only work when the Container App is in single revision mode
- **HTTP Ingress**: The container app must use HTTP ingress (not TCP)
- **Cookie Support**: Clients must accept and send cookies

## Limitations

1. **Session Loss on Replica Failure**: If a replica goes down, all sessions on that replica are lost
2. **Uneven Load Distribution**: Some replicas may handle more traffic than others
3. **No Cross-Replica Session Sharing**: Sessions exist only on the replica that created them
_Note: Include SIEM integration requirements and incident response procedures_

## Testing
### Step 6: Document testing and validation procedures
Create comprehensive testing procedures to validate the security of sticky sessions implementation

Verify sticky sessions work with multiple replicas:
**Files to modify:**
- `docs/STICKY_SESSIONS.md`

```bash
# Scale to multiple replicas
az containerapp update \
--name unitone-agw-prod-app \
--resource-group mcp-gateway-prod-rg \
--min-replicas 2 \
--max-replicas 10
**Example code:**
```python
## Security Testing Procedures
### Automated Tests
- Session fixation vulnerability tests
- Session timeout validation
- Cookie security flag verification
- Load balancer failover testing

# Run E2E tests
source .venv/bin/activate
GATEWAY_URL="https://your-gateway.azurecontainerapps.io" \
python3 tests/e2e_mcp_sse_test.py
### Manual Testing
- Session hijacking simulation
- Cross-site request forgery testing
- Session exhaustion testing
- Server affinity bypass attempts
```

If tests pass with multiple replicas, sticky sessions are working correctly.

## When to Use

**Enable sticky sessions when:**
- Running multiple replicas for high availability
- Using stateful MCP mode (the default)
- Clients need to make multiple requests within a session

**Consider disabling when:**
- Running a single replica
- Using stateless MCP mode
- Session persistence is not required

## Alternative: Redis Session Storage

For production deployments requiring:
- Cross-replica session sharing
- Session persistence across restarts
- High availability without session loss

Consider implementing Redis session storage. This requires modifying the agentgateway core code (currently not supported).

## Troubleshooting

### "Session not found" errors with multiple replicas

1. Verify sticky sessions are enabled:
```bash
az containerapp show --name <app-name> -g <rg> \
--query "properties.configuration.ingress.stickySessions"
```

2. Check that clients are sending the `ARRAffinity` cookie

3. Verify the app is in single revision mode:
```bash
az containerapp show --name <app-name> -g <rg> \
--query "properties.configuration.activeRevisionsMode"
```

### Tests pass with 1 replica but fail with 2+

This indicates sticky sessions are not properly configured. Follow the configuration steps above.
_Note: Include both positive and negative test cases_

## Security Considerations
- Ensure session IDs have sufficient entropy to prevent prediction attacks
- Implement proper session timeout mechanisms to limit exposure window
- Use secure cookie attributes (Secure, HttpOnly, SameSite) to prevent client-side attacks
- Monitor for session-based attacks and implement appropriate alerting
- Consider the impact of server failures on session security and data integrity
- Validate that load balancer configuration doesn't expose sensitive information
- Implement proper logging for session-related security events for forensic analysis

## Best Practices
- Use established session management libraries rather than custom implementations
- Implement defense in depth with multiple layers of session security controls
- Regular security testing of session management functionality
- Maintain detailed documentation of session security architecture
- Implement graceful degradation when sticky sessions fail
- Use centralized session storage for improved security and scalability
- Regular review and update of session security policies

## Acceptance Criteria
- [ ] Complete STRIDE threat analysis documented for sticky sessions functionality
- [ ] Security controls and mitigations defined for each identified threat
- [ ] Session lifecycle security procedures documented with implementation details
- [ ] Monitoring and alerting specifications defined for session-related security events
- [ ] Security testing procedures documented and validated
- [ ] Documentation reviewed and approved by security team
- [ ] All security configurations and recommendations are technically feasible and implementable
Loading