# SPYDER Architecture Overview
SPYDER (System for Probing and Yielding DNS-based Entity Relations) is designed as a scalable, distributed system for mapping inter-domain relationships across the internet.
## High-Level Architecture
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Domain List   │      │   Redis Queue   │      │   Ingest API    │
│  (File/Queue)   │─────▶│   (Optional)    │─────▶│  (HTTP/HTTPS)   │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                        │                        │
         ▼                        ▼                        ▼
┌───────────────────────────────────────────────────────────────────┐
│                        SPYDER Probe Engine                        │
│                                                                   │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐         │
│   │   Worker    │     │   Worker    │     │   Worker    │ ... (N) │
│   │    Pool     │     │    Pool     │     │    Pool     │         │
│   └─────────────┘     └─────────────┘     └─────────────┘         │
│          │                   │                   │                │
│          ▼                   ▼                   ▼                │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐         │
│   │ DNS Resolver│     │ HTTP Client │     │ TLS Analyzer│         │
│   └─────────────┘     └─────────────┘     └─────────────┘         │
└───────────────────────────────────────────────────────────────────┘
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Prometheus    │      │   Redis Dedup   │      │  Batch Emitter  │
│     Metrics     │      │   (Optional)    │      │  + Spooling     │
└─────────────────┘      └─────────────────┘      └─────────────────┘
```
## Core Components
### 1. Probe Engine (`internal/probe`)
The central orchestrator that coordinates domain probing:
- Worker Pool Management: Manages concurrent workers for parallel processing
- Task Distribution: Distributes domain probing tasks across workers
- Policy Enforcement: Applies robots.txt and TLD exclusion policies
- Rate Limiting: Enforces per-host rate limits to avoid overloading target hosts
Key Features:
- Configurable concurrency (default: 256 workers)
- Graceful shutdown with proper resource cleanup
- OpenTelemetry tracing integration
- Prometheus metrics collection
### 2. DNS Resolution (`internal/dns`)
Performs comprehensive DNS lookups:
- A/AAAA Records: IPv4 and IPv6 address resolution
- NS Records: Nameserver discovery
- CNAME Records: Canonical name mapping
- MX Records: Mail exchanger identification
- TXT Records: Text record collection (future use)
Implementation:
- Uses Go's `net.DefaultResolver`
- Context-aware with timeout handling
- Concurrent resolution for multiple record types
### 3. HTTP Client (`internal/httpclient`)
Optimized HTTP client for web content fetching:
- Connection Pooling: Reuses connections for efficiency
- Timeout Management: Configurable response timeouts
- TLS Configuration: Secure HTTPS connections
- Content Limiting: Restricts response body size (512KB max)
Configuration:
- Max idle connections: 1024
- Max connections per host: 128
- Response timeout: 10 seconds
- Overall timeout: 15 seconds
### 4. Link Extraction (`internal/extract`)
Parses HTML content to discover external relationships:
- HTML Parsing: Uses the `golang.org/x/net/html` tokenizer
- Link Discovery: Extracts `href` and `src` attributes
- External Filtering: Identifies cross-domain relationships
- Apex Domain Calculation: Uses the public suffix list
Extraction Sources:
- `<a href="...">`: Hyperlinks
- `<link href="...">`: Stylesheets and resources
- `<script src="...">`: JavaScript resources
- `<img src="...">`: Images
- `<iframe src="...">`: Embedded content
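The external-filtering step can be sketched in miniature. SPYDER's real extractor uses the `golang.org/x/net/html` tokenizer; to keep this sketch dependency-free it substitutes a deliberately crude regexp, which is adequate only for illustration, never for arbitrary real-world HTML:

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

// attrRe is a crude stand-in for a tokenizer: it captures href/src
// attribute values enclosed in double quotes.
var attrRe = regexp.MustCompile(`(?:href|src)\s*=\s*"([^"]+)"`)

// externalLinks returns absolute http(s) URLs whose host differs from baseHost.
func externalLinks(htmlSrc, baseHost string) []string {
	var out []string
	for _, m := range attrRe.FindAllStringSubmatch(htmlSrc, -1) {
		u, err := url.Parse(m[1])
		if err != nil || (u.Scheme != "http" && u.Scheme != "https") {
			continue // relative or non-web link: not an external relationship
		}
		if !strings.EqualFold(u.Hostname(), baseHost) {
			out = append(out, m[1])
		}
	}
	return out
}

func main() {
	page := `<a href="https://other.example/x">t</a><img src="/local.png">`
	fmt.Println(externalLinks(page, "example.com")) // [https://other.example/x]
}
```

In the real pipeline the surviving hosts would then be reduced to apex domains via the public suffix list before edges are emitted.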
### 5. TLS Analysis (`internal/tlsinfo`)
Extracts TLS certificate metadata:
- Certificate Chain: Analyzes server certificates
- SPKI Fingerprinting: SHA-256 hash of Subject Public Key Info
- Validity Periods: Not-before and not-after timestamps
- Subject/Issuer: Common name extraction
Security:
- Proper certificate chain validation
- Timeout protection (8 seconds)
- Secure TLS configuration
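The SPKI fingerprint above is the SHA-256 of the certificate's raw Subject Public Key Info, which `crypto/x509` exposes directly. A self-contained sketch (the self-signed certificate exists only so the example runs without a network):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/hex"
	"fmt"
	"math/big"
	"time"
)

// spkiFingerprint returns the hex SHA-256 of the certificate's
// Subject Public Key Info.
func spkiFingerprint(cert *x509.Certificate) string {
	sum := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	return hex.EncodeToString(sum[:])
}

// selfSigned builds a throwaway certificate so the example is self-contained.
func selfSigned() *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "example.com"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(24 * time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	cert := selfSigned()
	fmt.Println(len(spkiFingerprint(cert))) // 64 hex characters
}
```

Hashing the SPKI rather than the whole certificate keeps the fingerprint stable across certificate renewals that reuse the same key pair.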
### 6. Rate Limiting (`internal/rate`)
Per-host token bucket rate limiting:
- Token Bucket Algorithm: Uses `golang.org/x/time/rate`
- Per-Host Limiting: Independent limits for each domain
- Configurable Rates: Adjustable requests per second
- Burst Support: Allows burst traffic within limits
Default Configuration:
- 1.0 requests per second per host
- Burst size: 1 request
### 7. Robots.txt Handling (`internal/robots`)
Respectful crawling with robots.txt compliance:
- LRU Cache: 4096 entries with 24-hour TTL
- Fallback Logic: HTTPS first, then HTTP
- User-Agent Matching: Respects specific and wildcard rules
- Default Allow: Assumes allowed if robots.txt unavailable
### 8. Deduplication (`internal/dedup`)
Prevents duplicate data collection:
- Memory Backend: In-process hash set (default)
- Redis Backend: Distributed deduplication across probes
- Key Generation: Consistent hashing for nodes and edges
- TTL Support: Automatic expiration for Redis backend
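A minimal sketch of key generation plus the in-memory backend; the field order and separator in `edgeKey` are illustrative, not SPYDER's actual scheme, and a Redis backend would replace the map with a `SET key "" NX EX <ttl>` round trip:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// edgeKey hashes an edge into a fixed-size, deterministic key.
func edgeKey(src, dst, relation string) string {
	sum := sha256.Sum256([]byte(src + "|" + dst + "|" + relation))
	return hex.EncodeToString(sum[:])
}

// memDedup is the in-process backend: a plain hash set.
type memDedup struct{ seen map[string]struct{} }

func newMemDedup() *memDedup { return &memDedup{seen: map[string]struct{}{}} }

// firstSeen records the key and reports whether it was new.
func (d *memDedup) firstSeen(key string) bool {
	if _, ok := d.seen[key]; ok {
		return false
	}
	d.seen[key] = struct{}{}
	return true
}

func main() {
	d := newMemDedup()
	k := edgeKey("example.com", "cdn.example.net", "script_src")
	fmt.Println(d.firstSeen(k)) // true: first sighting
	fmt.Println(d.firstSeen(k)) // false: duplicate suppressed
}
```

Deterministic hashing is what lets independent probes agree on keys when the Redis backend is shared.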
### 9. Batch Emitter (`internal/emit`)
Efficient data output with reliability:
- Batch Aggregation: Combines multiple discoveries
- Configurable Limits: Max edges per batch (10,000)
- Timed Flushing: Regular batch emission (2 seconds)
- Retry Logic: Exponential backoff for failed requests
- Spooling: On-disk storage for failed batches
- mTLS Support: Client certificate authentication
### 10. Queue System (`internal/queue`)
Distributed task distribution:
- Redis-based: Uses Redis lists for work distribution
- Atomic Operations: `BRPOPLPUSH` for atomic task leasing
- TTL Management: Lease timeout handling
- Processing Tracking: Separate processing queue
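The lease cycle can be traced with `redis-cli`; the key names below are illustrative, not SPYDER's actual ones:

```shell
# Seed the work queue with domains.
redis-cli LPUSH spyder:domains example.com example.org

# A worker's lease: atomically move one task from the pending list to a
# per-worker processing list, blocking up to 5 seconds for work.
redis-cli BRPOPLPUSH spyder:domains spyder:processing:worker-1 5

# On success the worker removes the task from its processing list;
# on lease timeout a reaper would push it back onto spyder:domains.
redis-cli LREM spyder:processing:worker-1 1 example.com
```

Because `BRPOPLPUSH` is a single atomic operation, a task is never simultaneously invisible to the pending and processing lists, so a crashed worker's leases remain recoverable.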
## Data Flow
### Single-Node Operation
1. Input: Read domains from a file
2. Distribution: Send domains to the worker pool
3. Processing: Each worker performs:
   - DNS resolution
   - Robots.txt check
   - Rate limiting
   - HTTP content fetch
   - TLS certificate analysis
   - Link extraction
4. Deduplication: Check for previously seen data
5. Batch Formation: Aggregate results into batches
6. Output: Emit to stdout or an HTTP endpoint
### Distributed Operation
1. Queue Population: Seed the Redis queue with domains
2. Lease Management: Workers lease tasks from the queue
3. Processing: Same as single-node
4. Distributed Dedup: Use Redis for cross-probe deduplication
5. Result Aggregation: All probes send to a common ingest API
## Scalability Considerations
### Vertical Scaling
- Worker Concurrency: Increase the `-concurrency` parameter
- Memory: More workers require more memory
- CPU: Parsing operations are CPU-bound
- Network: Higher concurrency increases network usage
### Horizontal Scaling
- Multiple Probes: Run multiple SPYDER instances
- Redis Queue: Shared work distribution
- Redis Dedup: Prevent duplicate work across probes
- Load Balancing: Distribute probe workload
### Performance Tuning

```bash
# High-throughput configuration
./bin/spyder \
  -domains=large-list.txt \
  -concurrency=512 \
  -batch_max_edges=50000 \
  -batch_flush_sec=1 \
  -ingest=https://fast-ingest.example.com/v1/batch
```
## Security Architecture
### Input Validation
- Domain name sanitization
- URL parsing validation
- Content-Type verification
### Network Security
- mTLS for ingest API communication
- Secure TLS for all HTTPS requests
- DNS over HTTPS support (configurable)
### Resource Protection
- Memory limits for HTTP responses
- Connection timeouts
- Rate limiting to prevent abuse
### Data Privacy
- No sensitive content storage
- Configurable TLD exclusions
- Robots.txt compliance
## Monitoring Integration
### Metrics Collection
- Prometheus metrics endpoint
- Counter and gauge metrics
- Custom labels for filtering
### Distributed Tracing
- OpenTelemetry integration
- Span creation for major operations
- Context propagation
### Structured Logging
- JSON-formatted logs
- Configurable log levels
- Error context preservation
## Error Handling
### Graceful Degradation
- Continue processing on individual failures
- Skip unreachable hosts
- Partial result collection
### Retry Logic
- Exponential backoff for transient failures
- Configurable retry attempts
- Circuit breaker patterns
### Recovery Mechanisms
- Batch spooling for failed ingestion
- Automatic spool replay on restart
- Graceful shutdown with data preservation
## Configuration Management
### Environment Variables
- Redis connection strings
- Feature toggles
- External service endpoints
### Command-Line Flags
- Runtime parameters
- Performance tuning
- Output configuration
### Policy Configuration
- Excluded TLD lists
- User-Agent customization
- Rate limiting parameters
This architecture enables SPYDER to scale from single-machine development environments to large-scale distributed deployments while maintaining reliability, security, and observability.