ADR-006: Routing and Sharding System Architecture
Status
Accepted
2025-09-02
Context
We needed to implement a robust routing and sharding system for WorkerSQL that could:
- Route queries to appropriate database shards based on tenant or hash-based logic
- Support dynamic configuration through YAML policies
- Provide versioning and rollback capabilities for routing changes
- Enable horizontal scaling through shard discovery and health checking
- Maintain high availability and fault tolerance
Key requirements:
- Multi-tenant data isolation with tenant-based routing
- Hash-based sharding for global tables
- Dynamic routing policy updates without downtime
- Version control for routing configurations
- Automatic shard discovery and health monitoring
- Circuit breaker patterns for fault tolerance
Alternative approaches considered:
- Static Routing Tables: Hardcoded routing logic
- External Service Discovery: Centralized routing service
- Client-side Sharding: Application-level routing
- Dynamic YAML-based Configuration: Versioned routing policies defined in YAML and updated at runtime (the chosen approach)
Decision
We implemented a comprehensive routing and sharding system with the following components:
- TablePolicyParser: YAML-based table policy configuration parser
- RoutingVersionManager: Versioned routing policy management
- RouterService: Core routing logic with tenant and hash-based strategies
- CircuitBreakerService: Health checking and fault tolerance
- ConnectionManager: Shard connection pooling and session management
Rationale
Architecture Components:
TablePolicyParser:
- YAML-based configuration for flexibility and human readability
- Dynamic import with JSON fallback for edge environment compatibility
- Environment variable substitution for dynamic configuration
- Comprehensive validation with detailed error messages
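The substitution and validation steps could look roughly like the sketch below. It assumes a `${VAR}` placeholder syntax and a policy with `version` and `tenants` fields, mirroring the YAML example in this ADR; the function names `substituteEnv` and `validatePolicy` are hypothetical and not taken from the WorkerSQL codebase.

```typescript
// Replace ${NAME} placeholders with values from `env` (assumed syntax).
// Missing variables are collected so the caller gets one detailed error.
function substituteEnv(text: string, env: Record<string, string | undefined>): string {
  const missing: string[] = [];
  const result = text.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (match: string, varName: string): string => {
      const value = env[varName];
      if (value === undefined) {
        missing.push(varName);
        return match;
      }
      return value;
    },
  );
  if (missing.length > 0) {
    throw new Error("Undefined environment variables: " + missing.join(", "));
  }
  return result;
}

// Validate an already-parsed policy and return every problem found,
// rather than stopping at the first one.
function validatePolicy(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return ["policy must be a mapping"];
  }
  const errors: string[] = [];
  const policy = raw as Record<string, unknown>;
  const version = policy["version"];
  if (typeof version !== "number" || !Number.isInteger(version)) {
    errors.push("version must be an integer");
  }
  const tenants = policy["tenants"];
  if (typeof tenants !== "object" || tenants === null || Array.isArray(tenants)) {
    errors.push("tenants must be a mapping of tenant ID to shard");
  } else {
    const tenantMap = tenants as Record<string, unknown>;
    for (const tenant of Object.keys(tenantMap)) {
      const shard = tenantMap[tenant];
      if (typeof shard !== "string" || shard.length === 0) {
        errors.push("tenants." + tenant + " must be a non-empty shard name");
      }
    }
  }
  return errors;
}
```

Returning a list of messages instead of throwing on the first problem is one way to get the detailed error reporting described above.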
RoutingVersionManager:
- Versioned policy storage using Cloudflare KV
- Checksum-based integrity validation
- Compatibility checking for policy updates
- Rollback capabilities for safe deployments
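A minimal sketch of versioned storage with checksum validation follows. A `Map` stands in for Cloudflare KV, the `policy:v<N>` key layout is an assumption, and FNV-1a is used only as a simple stand-in checksum; the ADR does not say which checksum algorithm or key scheme the real RoutingVersionManager uses.

```typescript
// FNV-1a (32-bit) over UTF-16 code units; a stand-in, not a cryptographic hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

interface StoredPolicy {
  version: number;
  createdAt: string;
  body: string; // serialized policy text
  checksum: string;
}

class VersionStore {
  // The Map plays the role of the KV namespace in this sketch.
  private readonly kv: Map<string, string>;

  constructor(kv: Map<string, string> = new Map<string, string>()) {
    this.kv = kv;
  }

  save(version: number, body: string): void {
    const record: StoredPolicy = {
      version,
      createdAt: new Date().toISOString(),
      body,
      checksum: fnv1a(body),
    };
    this.kv.set("policy:v" + version, JSON.stringify(record));
  }

  // Load a version and refuse to return it if the stored body was altered.
  load(version: number): StoredPolicy {
    const raw = this.kv.get("policy:v" + version);
    if (raw === undefined) {
      throw new Error("version " + version + " not found");
    }
    const record = JSON.parse(raw) as StoredPolicy;
    if (fnv1a(record.body) !== record.checksum) {
      throw new Error("checksum mismatch for version " + version);
    }
    return record;
  }
}
```

Storing the checksum alongside the body lets a reader detect a truncated or corrupted entry before the policy is applied.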
RouterService:
- Tenant-based routing for multi-tenant isolation
- Hash-based routing for global data distribution
- Shard discovery from environment configuration
- Integration with circuit breaker for health monitoring
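The sketch below illustrates how these responsibilities might combine in one routing call. The `SHARD_IDS` variable name, its comma-separated format, and the function names are assumptions for illustration; plain modulo hashing is used here for brevity in place of the consistent hashing mentioned elsewhere in this ADR.

```typescript
// Parse shard IDs from an environment binding (hypothetical variable name).
function discoverShards(env: Record<string, string | undefined>): string[] {
  const raw = env["SHARD_IDS"] ?? "";
  return raw
    .split(",")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// FNV-1a (32-bit); any stable hash would serve the same purpose.
function hashKey(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Pick the shard for a query: tenant mapping first, hash of the key otherwise.
// `isHealthy` stands in for the circuit breaker check.
function routeQuery(
  tenantId: string | undefined,
  key: string,
  tenants: Record<string, string>,
  shards: string[],
  isHealthy: (shard: string) => boolean,
): string {
  const mapped =
    tenantId !== undefined && Object.prototype.hasOwnProperty.call(tenants, tenantId)
      ? tenants[tenantId]
      : undefined;
  const target = mapped ?? shards[hashKey(key) % shards.length];
  if (target === undefined) {
    throw new Error("no shards configured");
  }
  // The data lives on exactly one shard, so an unhealthy target is an error
  // rather than a reason to reroute.
  if (!isHealthy(target)) {
    throw new Error("shard " + target + " is unavailable");
  }
  return target;
}
```

Failing fast on an unhealthy target, instead of silently picking another shard, keeps tenant data from being read from or written to the wrong place.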
CircuitBreakerService:
- Open/closed/half-open states for fault tolerance
- Configurable failure thresholds and recovery timeouts
- Automatic health checking and recovery
- Integration with routing decisions
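The three states and their transitions can be sketched as a small state machine. The class name, constructor parameters, and the injected clock are illustrative assumptions; the actual CircuitBreakerService may track failures differently, for example per time window.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly recoveryTimeoutMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    recoveryTimeoutMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.recoveryTimeoutMs = recoveryTimeoutMs;
    this.now = now;
  }

  // An open breaker becomes half-open once the recovery timeout has elapsed.
  currentState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.recoveryTimeoutMs) {
      this.state = "half-open";
    }
    return this.state;
  }

  // Requests pass when closed, and as trial requests when half-open.
  allowRequest(): boolean {
    return this.currentState() !== "open";
  }

  // Any success closes the breaker and clears the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // A failure in half-open reopens immediately; in closed it counts
  // toward the threshold.
  recordFailure(): void {
    const state = this.currentState();
    if (state === "open") {
      return;
    }
    this.failures += 1;
    if (state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

Injecting the clock keeps the recovery timeout testable without real delays; a router would call `allowRequest()` before sending a query to a shard and report the outcome afterwards.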
ConnectionManager:
- WebSocket-based sticky sessions for transactions
- Connection pooling per shard
- TTL-based cleanup and session management
- Shard affinity for performance optimization
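Sticky sessions with TTL-based cleanup might be tracked as in the sketch below. It models only the session-to-shard bookkeeping; the WebSocket connections and per-shard pools are left out, and the `SessionTable` name and its methods are hypothetical.

```typescript
interface ShardSession {
  shardId: string;
  lastUsedAt: number; // milliseconds, from the injected clock
}

class SessionTable {
  private readonly sessions = new Map<string, ShardSession>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the shard a live session is pinned to, refreshing its TTL.
  // A new or expired session is pinned to `preferredShard`.
  acquire(sessionId: string, preferredShard: string): string {
    const current = this.now();
    const existing = this.sessions.get(sessionId);
    if (existing !== undefined && current - existing.lastUsedAt < this.ttlMs) {
      existing.lastUsedAt = current;
      return existing.shardId;
    }
    this.sessions.set(sessionId, { shardId: preferredShard, lastUsedAt: current });
    return preferredShard;
  }

  // Drop a session explicitly, for example when a transaction ends.
  release(sessionId: string): void {
    this.sessions.delete(sessionId);
  }

  // Remove expired sessions and report how many were dropped.
  sweep(): number {
    const current = this.now();
    let removed = 0;
    this.sessions.forEach((session, id) => {
      if (current - session.lastUsedAt >= this.ttlMs) {
        this.sessions.delete(id);
        removed += 1;
      }
    });
    return removed;
  }
}
```

Keeping a transaction's session pinned to one shard is what lets multi-statement transactions see a consistent connection, while the sweep bounds memory used by abandoned sessions.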
Technical Implementation:
YAML Configuration Example:
version: 1
tenants:
  tenant_a: shard_0
  tenant_b: shard_1
ranges:
  - prefix: 'user_'
    shard: shard_0
  - prefix: 'order_'
    shard: shard_1
Routing Strategies:
- Tenant-based: Direct mapping from tenant ID to shard
- Hash-based: Consistent hashing for global data distribution
- Range-based: Prefix-based routing for specific data patterns
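The three strategies can be sketched together as follows. The precedence order (tenant, then prefix, then hash), the number of virtual nodes, and all function names are assumptions made for this example; the ADR does not state how the real RouterService orders or implements them.

```typescript
interface RoutingPolicy {
  tenants: Record<string, string>;
  ranges: { prefix: string; shard: string }[];
}

interface RingNode {
  point: number;
  shard: string;
}

// FNV-1a (32-bit); any stable hash would do.
function hash32(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Consistent-hash ring with `replicas` virtual nodes per shard.
function buildRing(shards: string[], replicas: number): RingNode[] {
  const ring: RingNode[] = [];
  for (const shard of shards) {
    for (let i = 0; i < replicas; i++) {
      ring.push({ point: hash32(shard + "#" + i), shard });
    }
  }
  return ring.sort((a, b) => a.point - b.point);
}

// The first node clockwise from the key's hash owns the key.
function ringLookup(ring: RingNode[], key: string): string {
  const first = ring[0];
  if (first === undefined) {
    throw new Error("ring is empty");
  }
  const h = hash32(key);
  for (const node of ring) {
    if (node.point >= h) {
      return node.shard;
    }
  }
  return first.shard; // wrap around
}

// Assumed precedence: tenant mapping, then prefix range, then hash ring.
function resolveShard(
  policy: RoutingPolicy,
  ring: RingNode[],
  tenantId: string | undefined,
  key: string,
): string {
  if (tenantId !== undefined && Object.prototype.hasOwnProperty.call(policy.tenants, tenantId)) {
    const mapped = policy.tenants[tenantId];
    if (mapped !== undefined) {
      return mapped;
    }
  }
  for (const range of policy.ranges) {
    if (key.indexOf(range.prefix) === 0) {
      return range.shard;
    }
  }
  return ringLookup(ring, key);
}
```

With a ring, removing a shard only moves the keys that shard owned; keys on the remaining shards keep their placement, which is the property that distinguishes consistent hashing from plain modulo hashing.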
Version Management:
- Policies stored with version numbers and timestamps
- Compatibility validation before updates
- Diff generation for change tracking
- Rollback to previous versions
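Diff generation and rollback could be sketched as below, using an in-memory history in place of persisted storage. The compatibility rule shown (an update must not drop an existing tenant mapping) and the choice to implement rollback by republishing the old content as a new version are both assumptions for illustration, not rules stated in this ADR.

```typescript
type TenantMap = Record<string, string>;

interface PolicyVersion {
  version: number;
  createdAt: string;
  tenants: TenantMap;
}

interface PolicyDiff {
  added: string[];
  removed: string[];
  changed: string[];
}

function hasTenant(map: TenantMap, tenant: string): boolean {
  return Object.prototype.hasOwnProperty.call(map, tenant);
}

// Compare two tenant mappings key by key.
function diffTenants(before: TenantMap, after: TenantMap): PolicyDiff {
  const diff: PolicyDiff = { added: [], removed: [], changed: [] };
  for (const tenant of Object.keys(after)) {
    if (!hasTenant(before, tenant)) {
      diff.added.push(tenant);
    } else if (before[tenant] !== after[tenant]) {
      diff.changed.push(tenant);
    }
  }
  for (const tenant of Object.keys(before)) {
    if (!hasTenant(after, tenant)) {
      diff.removed.push(tenant);
    }
  }
  return diff;
}

// Hypothetical compatibility rule: never drop an existing tenant mapping.
function isCompatible(diff: PolicyDiff): boolean {
  return diff.removed.length === 0;
}

class PolicyHistory {
  private readonly versions: PolicyVersion[] = [];

  // Append a new version and return its number (1-based).
  publish(tenants: TenantMap): number {
    const version = this.versions.length + 1;
    this.versions.push({
      version,
      createdAt: new Date().toISOString(),
      tenants: { ...tenants },
    });
    return version;
  }

  get(version: number): PolicyVersion {
    const found = this.versions[version - 1];
    if (found === undefined) {
      throw new Error("unknown version " + version);
    }
    return found;
  }

  current(): PolicyVersion {
    return this.get(this.versions.length);
  }

  // Roll back by republishing old content, so history stays append-only.
  rollback(toVersion: number): number {
    return this.publish(this.get(toVersion).tenants);
  }
}
```

Republishing instead of deleting newer versions keeps a full audit trail, at the cost of version numbers that no longer map one-to-one to distinct contents.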
Advantages of This Approach:
- Flexibility: YAML configuration allows easy policy updates
- Reliability: Versioning and rollback limit the impact of configuration errors
- Scalability: Hash-based sharding supports horizontal scaling
- Fault Tolerance: Circuit breaker patterns handle shard failures
- Performance: Connection pooling and session stickiness optimize throughput
Comparison with Alternatives:
Static Routing Tables:
- ❌ Inflexible, requires code changes for routing updates
- ❌ No versioning or rollback capabilities
- ✅ Simple implementation
- ✅ Predictable performance
External Service Discovery:
- ❌ Additional network latency for routing decisions
- ❌ Single point of failure
- ❌ Increased complexity and operational overhead
- ✅ Centralized control and monitoring
Client-side Sharding:
- ❌ Routing logic duplicated across clients
- ❌ Inconsistent routing decisions
- ❌ Harder to maintain and update
- ✅ Reduced server-side complexity
Dynamic YAML-based Configuration:
- ✅ Flexible and human-readable
- ✅ Version controllable
- ✅ Easy to update and rollback
- ✅ Supports complex routing rules
Consequences
Positive:
- Highly flexible routing configuration through YAML
- Safe policy updates with versioning and rollback
- Horizontal scaling through hash-based sharding
- Fault-tolerant operations with circuit breaker patterns
- Optimized performance through connection pooling
- Comprehensive test coverage (100% target achieved)
Negative:
- Increased complexity in routing logic
- YAML parsing overhead on configuration updates
- Dependency on Cloudflare KV for policy storage
- Learning curve for YAML configuration syntax
- Additional operational complexity for version management
Mitigation Strategies:
- Comprehensive documentation and examples for YAML configuration
- Automated validation and testing of routing policies
- Monitoring and alerting for routing performance
- Gradual rollout strategies for policy updates
- Fallback mechanisms for configuration failures
Technical Implications:
- Must handle YAML parsing failures gracefully
- Need robust error handling for shard communication
- Configuration updates require careful coordination
- Testing complexity increases with dynamic routing
- Performance monitoring critical for routing efficiency
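One way to handle parsing failures gracefully is to keep serving the last known good policy when a new one fails to parse. The sketch below assumes that strategy; `loadWithFallback` is a hypothetical helper, and the parser is passed in as a function so that any YAML or JSON parser can be used.

```typescript
interface LoadResult<T> {
  policy: T;
  usedFallback: boolean;
  error?: string;
}

// Try to parse new policy text; on failure, keep the last known good policy
// and report the error instead of throwing.
function loadWithFallback<T>(
  text: string,
  parse: (text: string) => T,
  lastKnownGood: T,
): LoadResult<T> {
  try {
    return { policy: parse(text), usedFallback: false };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { policy: lastKnownGood, usedFallback: true, error: message };
  }
}
```

Surfacing `usedFallback` and the error message gives monitoring and alerting something concrete to act on while routing continues with the previous configuration.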