SPYDER Architecture

Overview

SPYDER (System for Probing and Yielding DNS-based Entity Relations) is a distributed, production-ready probe designed to map inter-domain relationships across the internet while respecting robots.txt policies and implementing comprehensive rate limiting.

Core Components

1. Probe Engine (`internal/probe/probe.go`)

The central crawler that orchestrates domain discovery and relationship mapping:

DNS Resolution: Discovers A records, NS records, CNAME chains, and MX records
HTTP Crawling: Fetches root pages to extract external domain links
TLS Certificate Analysis: Captures certificate metadata and SPKI fingerprints
Policy Enforcement: Respects robots.txt and excludes sensitive TLDs (gov, mil, int)

2. Data Models (`internal/emit/emit.go`)

Node Types

NodeDomain: Domain entities with apex domain mapping
NodeIP: IP address entities with first/last seen timestamps
NodeCert: TLS certificate entities with SPKI SHA256 fingerprints

Relationship Types

RESOLVES_TO: Domain → IP address mappings
USES_NS: Domain → nameserver relationships
ALIAS_OF: CNAME chain mappings
USES_MX: Mail exchange relationships
LINKS_TO: External domain links found in HTML
USES_CERT: Domain → certificate associations

3. Deduplication System (`internal/dedup/`)

Prevents duplicate data collection across probe runs:

Memory Backend: Fast LRU cache for single-process deployments
Redis Backend: Distributed deduplication for multi-node deployments
Key Format: {type}|{identifier} for nodes, edge|{source}|{relation}|{target} for relationships

4. Batch Emitter (`internal/emit/emit.go`)

Handles reliable data delivery with resilience features:

Batching: Accumulates up to 10,000 edges or 5,000 nodes before flushing
Timer-based Flushing: Flushes batches every 2 seconds by default
Retry Logic: Exponential backoff with 30-second max elapsed time
Spool Directory: On-disk backup for failed HTTP deliveries
mTLS Support: Client certificate authentication for secure ingestion

5. Rate Limiting (`internal/rate/limiter.go`)

Multi-layer traffic control:

Per-Host Token Bucket: 1 request/second per domain by default
Global Concurrency: 256 concurrent workers by default
Respectful Crawling: Prevents overwhelming target servers

6. Policy Engine (`internal/robots/cache.go`)

Implements robots.txt compliance:

LRU Cache: Efficient robots.txt caching with configurable TTL
User-Agent Matching: Respects specific bot directives
Default Allow: Fails open when robots.txt is unavailable

Architecture Patterns

Worker Pool

Fixed-size worker pool processes domains concurrently
Each worker handles complete domain analysis pipeline
Graceful shutdown with context cancellation

Publisher-Subscriber

Probe workers publish batches to emission channel
Single emitter goroutine consumes and delivers batches
Decouples discovery from data delivery

Circuit Breaker Pattern

Failed HTTP deliveries trigger spool-to-disk
Automatic retry of spooled data on next startup
Prevents data loss during downstream outages

Observability

Structured Logging (Zap)

JSON-formatted logs for machine processing
Configurable log levels and output destinations
Request tracing with correlation IDs

Prometheus Metrics (`internal/metrics/metrics.go`)

Task completion counters by status
Edge type distribution counters
Robots.txt block counters
HTTP response time histograms

OpenTelemetry Tracing (`internal/telemetry/otel.go`)

Distributed tracing support via OTLP
Span creation for major operations
Custom attributes for probe and run identification

Data Flow

Input: Domain list from file or Redis queue
DNS Resolution: Parallel A/AAAA, NS, CNAME, MX lookups
Policy Check: robots.txt validation and TLD exclusion
HTTP Fetch: Root page retrieval with User-Agent headers
Link Extraction: HTML parsing for external domain discovery
TLS Analysis: Certificate fingerprint collection
Deduplication: Key-based duplicate detection
Batching: Accumulation until size or time threshold
Delivery: HTTP POST to ingestion endpoint with retry
Monitoring: Metrics emission and structured logging

Production Features

Graceful Shutdown: SIGINT/SIGTERM handling with resource cleanup
Configuration: Flag-based and environment variable configuration
Docker Support: Distroless container images for security
systemd Integration: Service unit files for Linux deployments
CI/CD: Automated linting, testing, and building

SPYDER Architecture ​

Overview ​

Core Components ​

1. Probe Engine (internal/probe/probe.go) ​

2. Data Models (internal/emit/emit.go) ​

Node Types ​

Relationship Types ​

3. Deduplication System (internal/dedup/) ​

4. Batch Emitter (internal/emit/emit.go) ​

5. Rate Limiting (internal/rate/limiter.go) ​

6. Policy Engine (internal/robots/cache.go) ​

Architecture Patterns ​

Worker Pool ​

Publisher-Subscriber ​

Circuit Breaker Pattern ​

Observability ​

Structured Logging (Zap) ​

Prometheus Metrics (internal/metrics/metrics.go) ​

OpenTelemetry Tracing (internal/telemetry/otel.go) ​

Data Flow ​

Production Features ​