SPYDER Architecture
Overview
SPYDER (System for Probing and Yielding DNS-based Entity Relations) is a distributed, production-ready probe designed to map inter-domain relationships across the internet while respecting robots.txt policies and implementing comprehensive rate limiting.
Core Components
1. Probe Engine (internal/probe/probe.go
)
The central crawler that orchestrates domain discovery and relationship mapping:
- DNS Resolution: Discovers A records, NS records, CNAME chains, and MX records
- HTTP Crawling: Fetches root pages to extract external domain links
- TLS Certificate Analysis: Captures certificate metadata and SPKI fingerprints
- Policy Enforcement: Respects robots.txt and excludes sensitive TLDs (gov, mil, int)
2. Data Models (internal/emit/emit.go
)
Node Types
- NodeDomain: Domain entities with apex domain mapping
- NodeIP: IP address entities with first/last seen timestamps
- NodeCert: TLS certificate entities with SPKI SHA256 fingerprints
Relationship Types
RESOLVES_TO
: Domain → IP address mappingsUSES_NS
: Domain → nameserver relationshipsALIAS_OF
: CNAME chain mappingsUSES_MX
: Mail exchange relationshipsLINKS_TO
: External domain links found in HTMLUSES_CERT
: Domain → certificate associations
3. Deduplication System (internal/dedup/
)
Prevents duplicate data collection across probe runs:
- Memory Backend: Fast LRU cache for single-process deployments
- Redis Backend: Distributed deduplication for multi-node deployments
- Key Format:
{type}|{identifier}
for nodes,edge|{source}|{relation}|{target}
for relationships
4. Batch Emitter (internal/emit/emit.go
)
Handles reliable data delivery with resilience features:
- Batching: Accumulates up to 10,000 edges or 5,000 nodes before flushing
- Timer-based Flushing: Flushes batches every 2 seconds by default
- Retry Logic: Exponential backoff with 30-second max elapsed time
- Spool Directory: On-disk backup for failed HTTP deliveries
- mTLS Support: Client certificate authentication for secure ingestion
5. Rate Limiting (internal/rate/limiter.go
)
Multi-layer traffic control:
- Per-Host Token Bucket: 1 request/second per domain by default
- Global Concurrency: 256 concurrent workers by default
- Respectful Crawling: Prevents overwhelming target servers
6. Policy Engine (internal/robots/cache.go
)
Implements robots.txt compliance:
- LRU Cache: Efficient robots.txt caching with configurable TTL
- User-Agent Matching: Respects specific bot directives
- Default Allow: Fails open when robots.txt is unavailable
Architecture Patterns
Worker Pool
- Fixed-size worker pool processes domains concurrently
- Each worker handles complete domain analysis pipeline
- Graceful shutdown with context cancellation
Publisher-Subscriber
- Probe workers publish batches to emission channel
- Single emitter goroutine consumes and delivers batches
- Decouples discovery from data delivery
Circuit Breaker Pattern
- Failed HTTP deliveries trigger spool-to-disk
- Automatic retry of spooled data on next startup
- Prevents data loss during downstream outages
Observability
Structured Logging (Zap)
- JSON-formatted logs for machine processing
- Configurable log levels and output destinations
- Request tracing with correlation IDs
Prometheus Metrics (internal/metrics/metrics.go
)
- Task completion counters by status
- Edge type distribution counters
- Robots.txt block counters
- HTTP response time histograms
OpenTelemetry Tracing (internal/telemetry/otel.go
)
- Distributed tracing support via OTLP
- Span creation for major operations
- Custom attributes for probe and run identification
Data Flow
- Input: Domain list from file or Redis queue
- DNS Resolution: Parallel A/AAAA, NS, CNAME, MX lookups
- Policy Check: robots.txt validation and TLD exclusion
- HTTP Fetch: Root page retrieval with User-Agent headers
- Link Extraction: HTML parsing for external domain discovery
- TLS Analysis: Certificate fingerprint collection
- Deduplication: Key-based duplicate detection
- Batching: Accumulation until size or time threshold
- Delivery: HTTP POST to ingestion endpoint with retry
- Monitoring: Metrics emission and structured logging
Production Features
- Graceful Shutdown: SIGINT/SIGTERM handling with resource cleanup
- Configuration: Flag-based and environment variable configuration
- Docker Support: Distroless container images for security
- systemd Integration: Service unit files for Linux deployments
- CI/CD: Automated linting, testing, and building