Code Structure
This document describes the directory layout, package responsibilities, data flow, and key abstractions in the SPYDER codebase.
Directory Layout
spyder/
├── cmd/
│ ├── spyder/ # Main probe binary
│ │ └── main.go # CLI flags, config loading, orchestration
│ └── seed/ # Queue seeder binary
│ └── main.go # Reads domains file, pushes to Redis queue
├── internal/ # Private packages (not importable by external code)
│ ├── circuitbreaker/ # Per-host circuit breaker
│ ├── config/ # Configuration loading and validation
│ ├── dedup/ # Deduplication (memory + Redis backends)
│ ├── discover/ # Discovery sink system for recursive crawling
│ ├── dns/ # DNS resolution (A, AAAA, NS, CNAME, MX, TXT)
│ ├── emit/ # Batch emitter with retry, spooling, and mTLS
│ ├── extract/ # HTML link extraction and apex domain calculation
│ ├── health/ # Health check HTTP handler
│ ├── httpclient/ # Resilient HTTP client with connection pooling
│ ├── logging/ # Structured logging (zap)
│ ├── metrics/ # Prometheus metrics registration and server
│ ├── output/ # Output formatting (JSON, JSONL, CSV)
│ ├── probe/ # Core probe engine: worker pool and CrawlOne logic
│ ├── queue/ # Redis-based distributed work queue
│ ├── rate/ # Per-host token bucket rate limiter
│ ├── robots/ # robots.txt fetching, caching, and TLD filtering
│ ├── telemetry/ # OpenTelemetry initialization
│ ├── tlsinfo/ # TLS certificate metadata extraction
│ └── ui/ # Terminal UI: progress indicators and log output
├── configs/ # Example configuration files and domain lists
├── docs/ # VitePress documentation site
├── scripts/ # Build and deployment scripts
├── .github/workflows/ # CI/CD pipeline definitions
├── Makefile # Build, test, lint, docker targets
├── Dockerfile # Multi-stage container build
├── go.mod # Go module definition (go 1.23)
└── go.sum # Dependency checksums
Entry Points
cmd/spyder -- Main Probe
The primary binary. Reads a domain list (from file or Redis queue), runs concurrent workers to probe each domain, and outputs structured JSON batches.
Startup sequence:
- Parse CLI flags
- Load config file (if -config is set)
- Apply environment variables (REDIS_ADDR, REDIS_QUEUE_ADDR, REDIS_QUEUE_KEY)
- Merge CLI flags over config (flags take precedence)
- Validate final configuration
- Initialize telemetry, metrics server, health handler
- Initialize dedup backend (memory or Redis)
- Initialize emitter (stdout or HTTP ingest endpoint)
- Initialize task channel and discovery sink
- Start probe worker pool
- Drain emitter on shutdown
# Minimal usage
./bin/spyder -domains=configs/domains.txt
# Full-featured
./bin/spyder \
-config=configs/spyder.yaml \
-concurrency=512 \
-ingest=https://ingest.example.com/v1/batch \
-continuous \
-max_domains=100000
cmd/seed -- Queue Seeder
A utility that reads a domains file and pushes each domain into a Redis queue. Used in distributed deployments where multiple probes consume from a shared queue.
./bin/seed -domains=configs/domains.txt -redis=127.0.0.1:6379 -key=spyder:queue
Package Responsibilities
internal/probe
The central orchestrator. Probe.Run() starts N worker goroutines that read domain names from a channel and call CrawlOne() for each.
CrawlOne() performs the complete probing pipeline for a single domain:
- DNS resolution (A/AAAA, NS, CNAME, MX)
- Robots.txt policy check
- TLD exclusion check (skip .gov, .mil, .int)
- Per-host rate limiting
- HTTP GET of the root page (with 512KB body limit)
- HTML link extraction for external domains
- TLS certificate metadata extraction
- Deduplication of all nodes and edges
- Submission of discovered domains to the discovery sink
- Flushing the batch to the emitter channel
Dependencies: dns, httpclient, robots, rate, extract, tlsinfo, dedup, discover, emit, metrics
internal/dns
Performs concurrent DNS lookups using Go's net.DefaultResolver. The ResolveAll() function resolves a domain and returns:
- IP addresses (A/AAAA combined)
- Nameserver hostnames (NS)
- CNAME target
- Mail exchanger hostnames (MX)
- TXT records
All hostnames are normalized by stripping trailing dots. Lookup failures for individual record types are silently ignored -- a domain that has A records but no MX records simply returns an empty MX slice.
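The concurrent-lookups-with-silent-failure behavior described above can be sketched as follows. This is a stdlib-only illustration, not the package's actual code: the `Result` field names, the `normalizeHost` helper, and the error-swallowing structure are assumptions based on the description (only the A and NS lookups are shown; CNAME, MX, and TXT follow the same pattern).

```go
package main

import (
	"context"
	"fmt"
	"net"
	"strings"
	"sync"
)

// Result groups the record types ResolveAll returns (names assumed).
type Result struct {
	IPs, NS []string
	CNAME   string
}

// normalizeHost lowercases a hostname and strips the trailing dot
// that DNS answers often carry.
func normalizeHost(h string) string {
	return strings.TrimSuffix(strings.ToLower(h), ".")
}

// resolveAll runs each record-type lookup concurrently; a failed
// lookup simply leaves its slice empty, as the document describes.
func resolveAll(ctx context.Context, host string) Result {
	var (
		res Result
		mu  sync.Mutex
		wg  sync.WaitGroup
	)
	run := func(f func()) { wg.Add(1); go func() { defer wg.Done(); f() }() }

	run(func() {
		if ips, err := net.DefaultResolver.LookupHost(ctx, host); err == nil {
			mu.Lock()
			res.IPs = ips
			mu.Unlock()
		}
	})
	run(func() {
		if nss, err := net.DefaultResolver.LookupNS(ctx, host); err == nil {
			mu.Lock()
			for _, ns := range nss {
				res.NS = append(res.NS, normalizeHost(ns.Host))
			}
			mu.Unlock()
		}
	})
	wg.Wait()
	return res
}

func main() {
	fmt.Println(normalizeHost("NS1.Example.COM.")) // ns1.example.com
}
```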
internal/httpclient
Provides httpclient.Default() which returns an *http.Client configured for high-throughput crawling:
- Max idle connections: 1024
- Max connections per host: 128
- Response timeout: 10 seconds
- Overall timeout: 15 seconds
ResilientClient wraps the base client with circuit breaker integration.
internal/extract
Three key functions:
- ParseLinks(base *url.URL, body io.Reader) ([]string, error) -- parses HTML with the golang.org/x/net/html tokenizer, extracts href/src attributes from <a>, <link>, <script>, <img>, and <iframe> tags, and resolves them against the base URL.
- ExternalDomains(host string, links []string) []string -- filters parsed links to only those pointing to domains outside the current host's apex domain.
- Apex(host string) string -- computes the apex (registrable) domain using the public suffix list.
internal/emit
Defines the data model (Batch, Edge, NodeDomain, NodeIP, NodeCert) and the Emitter that handles output.
The Emitter:
- Accumulates batches from the probe workers
- Flushes when edge count exceeds batch_max_edges (default: 10,000) or a timer fires (default: 2 seconds)
- Outputs to stdout (JSON) when no ingest URL is configured
- POSTs to an HTTP endpoint with exponential backoff retry (via cenkalti/backoff)
- Spools failed batches to disk as timestamped JSON files
- Replays spooled files on Drain() (called during shutdown)
- Supports mTLS client certificates for authenticated ingest endpoints
Edge type constants: RESOLVES_TO, USES_NS, ALIAS_OF, USES_MX, LINKS_TO, USES_CERT
internal/dedup
Prevents duplicate nodes and edges from being emitted.
Interface:
type Interface interface {
Seen(key string) bool
}
Seen() returns true if the key was already recorded (duplicate), false if it is new (first occurrence). Implementations must be safe for concurrent use.
Backends:
- NewMemory() -- in-process sync.Map-based dedup. Fast, zero configuration, but not shared across probe instances.
- NewRedis(addr string, ttl time.Duration, log *zap.SugaredLogger) -- Redis SET with TTL. Enables distributed dedup across multiple probe instances. Activated by setting REDIS_ADDR.
Keys follow a prefix convention: nodeip|<ip>, domain|<host>, cert|<spki>, edge|<src>|<type>|<target>.
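The memory backend's atomic check-and-record can be expressed in a few lines with sync.Map (a sketch of the idea; the actual implementation may differ):

```go
package main

import (
	"fmt"
	"sync"
)

// memDedup implements the Seen semantics with sync.Map.LoadOrStore,
// which atomically records the key and reports whether it existed.
type memDedup struct{ m sync.Map }

func (d *memDedup) Seen(key string) bool {
	_, loaded := d.m.LoadOrStore(key, struct{}{})
	return loaded // true only if the key was already recorded
}

func main() {
	d := &memDedup{}
	fmt.Println(d.Seen("domain|example.com")) // false: first occurrence
	fmt.Println(d.Seen("domain|example.com")) // true: duplicate
}
```

LoadOrStore is the whole trick: a single call both records the key and reports prior existence, so there is no check-then-set race between concurrent workers.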
internal/discover
Manages the discovery sink system for recursive/continuous crawling. When a probe discovers a new domain (via DNS NS/CNAME/MX records or HTML links), it submits it to a Sink for potential future crawling.
See Discovery Sink System below for full details.
internal/config
Loads and validates configuration. Supports YAML and JSON config files.
Config struct holds all configurable fields with yaml and json struct tags. Methods:
- SetDefaults() -- fills in defaults for unset fields
- Validate() -- checks required fields and value ranges
- LoadFromEnv() -- reads REDIS_ADDR, REDIS_QUEUE_ADDR, REDIS_QUEUE_KEY from the environment
- MergeWithFlags(map[string]interface{}) -- overlays CLI flag values onto the config
- LoadFromFile(path string) (*Config, error) -- loads from a YAML or JSON file, applies defaults, and validates
internal/queue
Redis-based distributed work queue using Redis lists:
- Seed(ctx, host) -- pushes a domain onto the queue (LPUSH)
- Lease(ctx) (host, ack, err) -- atomically pops a domain and moves it to a processing list (BRPOPLPUSH)
- Lease timeout: 120 seconds
Used by cmd/seed to populate the queue and by cmd/spyder to consume from it.
internal/rate
Per-host token bucket rate limiter using golang.org/x/time/rate:
- New(rps float64, burst int) *PerHost -- creates a limiter
- Allow(host string) bool -- non-blocking check
- Wait(host string) -- blocking wait until a token is available
Each host gets an independent limiter, created on first access. Default configuration: 1.0 request per second per host, burst of 1.
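A per-host token bucket with lazy creation can be sketched in the standard library alone (the real package wraps golang.org/x/time/rate; this hand-rolled bucket just makes the mechanics visible):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type bucket struct {
	tokens float64
	last   time.Time
}

// perHost holds one independent bucket per host, created on first
// access, refilled at rps tokens per second up to burst.
type perHost struct {
	mu      sync.Mutex
	buckets map[string]*bucket
	rps     float64
	burst   float64
}

func newPerHost(rps float64, burst int) *perHost {
	return &perHost{buckets: map[string]*bucket{}, rps: rps, burst: float64(burst)}
}

// Allow refills the host's bucket from elapsed time, then takes a
// token if one is available (non-blocking).
func (p *perHost) Allow(host string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	now := time.Now()
	b, ok := p.buckets[host]
	if !ok {
		b = &bucket{tokens: p.burst, last: now}
		p.buckets[host] = b
	}
	b.tokens += now.Sub(b.last).Seconds() * p.rps
	if b.tokens > p.burst {
		b.tokens = p.burst
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	rl := newPerHost(1.0, 1)             // default: 1 req/sec, burst 1
	fmt.Println(rl.Allow("example.com")) // true: burst token available
	fmt.Println(rl.Allow("example.com")) // false: bucket empty
	fmt.Println(rl.Allow("other.com"))   // true: independent bucket
}
```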
internal/robots
Fetches and caches robots.txt files:
- NewCache(client *http.Client, ua string) *Cache -- creates a cache backed by an LRU (4096 entries, 24-hour TTL)
- Get(ctx, host) (*robotstxt.RobotsData, error) -- fetches robots.txt (HTTPS first, falls back to HTTP) and caches the result
- Allowed(rd *robotstxt.RobotsData, ua, path string) bool -- checks whether the path is allowed for the user-agent
- ShouldSkipByTLD(host string, excluded []string) bool -- returns true if the host's TLD is in the exclusion list
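The TLD check is a simple last-label comparison; a plausible sketch (the real implementation may normalize differently):

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSkipByTLD compares the host's final DNS label against the
// exclusion list, case-insensitively and ignoring a trailing dot.
func shouldSkipByTLD(host string, excluded []string) bool {
	host = strings.TrimSuffix(strings.ToLower(host), ".")
	labels := strings.Split(host, ".")
	tld := labels[len(labels)-1]
	for _, e := range excluded {
		if tld == strings.ToLower(strings.TrimPrefix(e, ".")) {
			return true
		}
	}
	return false
}

func main() {
	excluded := []string{"gov", "mil", "int"}
	fmt.Println(shouldSkipByTLD("nasa.gov", excluded))    // true
	fmt.Println(shouldSkipByTLD("example.com", excluded)) // false
}
```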
internal/circuitbreaker
Three-state circuit breaker (Closed, Open, Half-Open):
- Configurable failure threshold, failure ratio, timeout, and interval
- Execute(fn func() error) error -- runs a function through the circuit breaker
- ExecuteWithRetry(cb, fn, maxRetries, delay) -- retries with exponential delay, aborting when the circuit is open
HostBreaker wraps per-host circuit breakers with independent state:
- Execute(host string, fn func() error) error
- State(host string) State
- Reset(host string)
- Stats() map[string]State
internal/tlsinfo
Connects to a host on port 443, performs a TLS handshake, and extracts certificate metadata:
- SPKI SHA-256 fingerprint
- Subject and issuer common names
- Validity period (NotBefore, NotAfter)
Returns a NodeCert struct. Timeout: 8 seconds.
internal/metrics
Registers Prometheus metrics and starts the metrics HTTP server:
- TasksTotal -- counter with status label (ok/error)
- EdgesTotal -- counter with type label (RESOLVES_TO, USES_NS, etc.)
- RobotsBlocks -- counter for robots.txt denials
- ServeWithHealth(addr, healthHandler, log) -- serves /metrics and the health endpoints
internal/health
HTTP health check handler with:
- Liveness endpoint (/healthz)
- Readiness endpoint (/readyz)
- Configurable health checkers (e.g., Redis connectivity)
- Metadata (probe ID, run ID, version)
internal/logging
Initializes a zap.SugaredLogger with structured JSON output. Respects the LOG_LEVEL environment variable (debug, info, warn, error).
internal/telemetry
Initializes the OpenTelemetry SDK:
- OTLP HTTP exporter pointing at otel_endpoint
- Service name: otel_service (default: spyder-probe)
- Returns a shutdown function for graceful cleanup
internal/output
Formats batch data for output. Supports three formats:
- json -- pretty-printed JSON (default)
- jsonl / ndjson -- newline-delimited JSON (one batch per line)
- csv -- comma-separated with header row (edges only)
internal/ui
Terminal UI components:
- Progress indicators showing domains processed, edges discovered, and throughput
- Logger integration for clean output alongside progress bars
Data Flow
The following diagram shows how data moves through SPYDER during a probe run.
┌──────────────────┐
│ domains.txt │
│ (or Redis queue) │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Task Channel │
│ (chan string) │
└────────┬─────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Worker 1 │ │ Worker 2 │ │ Worker N │
│ CrawlOne │ │ CrawlOne │ │ CrawlOne │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
┌────────┴──────┬──────┴────┬─────────┘
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ DNS Resolve│ │ HTTP GET / │ │ TLS Cert │
│ A/NS/MX/CN│ │ + Extract │ │ Fetch │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
│ │ │
└───────────┬────┴───────────────┘
▼
┌──────────────┐
│ Dedup Check │──── seen? ──→ skip
│ (Seen(key)) │
└──────┬───────┘
│ new
▼
┌──────────────┐
│ Discovery │──→ sink.Submit(host) ──→ back to task channel
│ Sink │ (if continuous mode)
└──────┬───────┘
│
▼
┌──────────────┐
│ Batch Channel│
│ (chan Batch) │
└──────┬───────┘
│
▼
┌──────────────┐
│ Emitter │
│ accumulate │
│ + flush │
└──────┬───────┘
│
┌──────┴──────┐
▼ ▼
┌───────────┐ ┌───────────┐
│ stdout │ │ HTTP │
│ (JSON) │ │ ingest │
└───────────┘ │ endpoint │
└─────┬─────┘
│ on failure
▼
┌───────────┐
│ Spool │
│ (disk) │
└───────────┘
Step-by-Step
Input: Domains are read from a text file (one per line) or leased from a Redis queue. Blank lines and lines starting with # are skipped. Domains are lowercased and trailing dots are stripped.
Task distribution: Domains are sent to a buffered channel (capacity 8192). N worker goroutines consume from this channel concurrently (default N=256).
CrawlOne: Each worker probes a single domain through the full pipeline:
- DNS resolution returns IPs, nameservers, CNAME, and MX records
- Policy checks skip excluded TLDs and domains blocked by robots.txt
- Rate limiting enforces per-host request spacing (1 req/sec default)
- HTTP fetch GETs the root page over HTTPS (512KB body limit, 15s timeout)
- Link extraction parses HTML for external domain references
- TLS analysis extracts certificate metadata and SPKI fingerprint
Deduplication: Every node and edge is checked against the dedup backend before being added to the batch. Keys use a prefix convention (domain|, nodeip|, cert|, edge|) to namespace different entity types.
Discovery: Newly discovered domains (from NS, CNAME, MX, and link extraction) are submitted to the discovery sink. In continuous mode, these domains feed back into the task channel for recursive crawling.
Batch emission: Worker batches flow through a channel (capacity 1024) to the emitter, which accumulates them and flushes when the edge count reaches batch_max_edges or a timer fires. Output goes to stdout (JSON) or an HTTP ingest endpoint with retry.
Spooling and replay: If the ingest endpoint is unreachable, batches are spooled to disk as JSON files. On shutdown, Drain() replays spooled files.
Key Interfaces
dedup.Interface
type Interface interface {
Seen(key string) bool
}
Central abstraction for deduplication. Seen() atomically checks and records a key. It returns false on the first call (new item) and true on subsequent calls (duplicate).
Implementations:
- dedup.NewMemory() -- in-process sync.Map
- dedup.NewRedis(addr, ttl, log) -- Redis SET with configurable TTL
Used by: probe.Probe to deduplicate nodes and edges before emission.
discover.Sink
type Sink interface {
Submit(ctx context.Context, host string) bool
Discovered() int64
}
Receives newly discovered domains and optionally feeds them back for future crawling. Submit() returns true if the domain was new and successfully enqueued.
Implementations:
- discover.NoopSink{} -- discards all discoveries (non-recursive mode)
- discover.NewChannelSink(ch, dedup, maxDomains) -- sends to an in-memory channel
- discover.NewRedisSink(queue, dedup, maxDomains) -- pushes to a Redis queue
Used by: probe.Probe when it encounters new domains in DNS records or HTML links.
Configuration Loading Order
Configuration values are resolved in a layered system where later sources override earlier ones:
1. Defaults (SetDefaults)
↓
2. Config file (-config=spyder.yaml)
↓
3. Environment variables (REDIS_ADDR, REDIS_QUEUE_ADDR, REDIS_QUEUE_KEY)
↓
4. CLI flags (-concurrency=512, -ingest=..., etc.)
1. Defaults
Config.SetDefaults() fills in values for any field that is zero-valued:
| Field | Default |
|---|---|
Probe | local-1 |
Run | run-<unix-timestamp> |
UA | SPYDERProbe/1.0 (+https://github.com/gustycube/spyder) |
ExcludeTLDs | [gov, mil, int] |
Concurrency | 256 |
BatchMaxEdges | 10000 |
BatchFlushSec | 2 |
SpoolDir | spool |
OutputFormat | json |
MetricsAddr | :9090 |
OTELService | spyder-probe |
RedisQueueKey | spyder:queue |
2. Config File
If -config=path.yaml is provided, the file is parsed (YAML or JSON based on extension) and its values replace the defaults. SetDefaults() runs after loading to fill any fields the file did not specify.
# configs/spyder.yaml
probe: us-east-1a
domains: configs/top-1m.txt
concurrency: 512
batch_max_edges: 50000
exclude_tlds:
- gov
- mil
- int
- edu
ingest: https://ingest.example.com/v1/batch
3. Environment Variables
Config.LoadFromEnv() reads Redis-related environment variables. These override any values from the config file:
export REDIS_ADDR=redis.prod:6379 # dedup backend
export REDIS_QUEUE_ADDR=redis.prod:6379 # work queue
export REDIS_QUEUE_KEY=spyder:prod:queue # queue key name
This layer exists so Redis addresses can be injected by container orchestrators (Docker, Kubernetes) without modifying config files.
4. CLI Flags
Command-line flags have the highest precedence. Config.MergeWithFlags() only overwrites a field if the flag was explicitly provided (non-zero value):
./bin/spyder -config=configs/spyder.yaml -concurrency=1024 -ingest=https://other.example.com
In this example, concurrency becomes 1024 and ingest is overridden, while all other values come from the config file.
Discovery Sink System
The discovery sink system controls what happens when SPYDER encounters a domain it has not seen before. This is the mechanism behind recursive/continuous crawling.
NoopSink
type NoopSink struct{}
Discards all discoveries. Used in the default (non-recursive) mode, where SPYDER probes only the domains from the input file and then stops.
Selected when -continuous is not set.
ChannelSink
type ChannelSink struct {
ch chan<- string
dedup dedup.Interface
maxDomains int64
count atomic.Int64
}
Feeds discovered domains into an in-memory Go channel that loops back into the task channel. The probe reads seed domains from the file first, then continuously consumes discovered domains until the context is cancelled or maxDomains is reached.
Selected when -continuous is set and no Redis queue is configured.
Flow: probe.CrawlOne -> sink.Submit(host) -> channel -> task channel -> probe.CrawlOne
Guards:
- Dedup check: dedup.Seen("discovered|" + host) prevents re-submitting the same domain
- Max domains: maxDomains caps total discoveries (0 = unlimited)
- Context cancellation: Submit() respects the context deadline
RedisSink
type RedisSink struct {
q *queue.RedisQueue
dedup dedup.Interface
maxDomains int64
count atomic.Int64
}
Pushes discovered domains into the Redis queue via queue.Seed(). Multiple probe instances can share the same queue, enabling distributed recursive crawling where any probe instance can discover domains that other instances will pick up.
Selected when -continuous is set and REDIS_QUEUE_ADDR is configured.
Flow: probe.CrawlOne -> sink.Submit(host) -> Redis LPUSH -> any probe instance leases it
Selection Logic
The sink is selected in cmd/spyder/main.go based on two flags:
| -continuous | REDIS_QUEUE_ADDR | Sink |
|---|---|---|
| false | any | NoopSink |
| true | empty | ChannelSink |
| true | set | RedisSink |
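The table above reduces to a two-condition switch; a sketch of the selection logic (the actual code in cmd/spyder/main.go will differ in detail):

```go
package main

import "fmt"

// selectSink names the sink chosen for a given -continuous flag and
// Redis queue address, mirroring the selection table.
func selectSink(continuous bool, redisQueueAddr string) string {
	switch {
	case !continuous:
		return "NoopSink" // never recurse
	case redisQueueAddr == "":
		return "ChannelSink" // recurse in-process
	default:
		return "RedisSink" // recurse via shared queue
	}
}

func main() {
	fmt.Println(selectSink(false, "redis:6379")) // NoopSink
	fmt.Println(selectSink(true, ""))            // ChannelSink
	fmt.Println(selectSink(true, "redis:6379"))  // RedisSink
}
```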
Example: Distributed Recursive Crawl
# Terminal 1: Seed the queue
./bin/seed -domains=configs/seeds.txt -redis=redis:6379
# Terminal 2: Probe instance A (continuous, Redis queue + dedup)
REDIS_ADDR=redis:6379 REDIS_QUEUE_ADDR=redis:6379 \
./bin/spyder -continuous -max_domains=500000 \
-domains=/dev/null -ingest=https://ingest.example.com/v1/batch
# Terminal 3: Probe instance B (same configuration)
REDIS_ADDR=redis:6379 REDIS_QUEUE_ADDR=redis:6379 \
./bin/spyder -continuous -max_domains=500000 \
-domains=/dev/null -ingest=https://ingest.example.com/v1/batch
Both probes consume from the same Redis queue. When either probe discovers a new domain, it pushes it back into the queue. The shared Redis dedup backend ensures no domain is probed twice across the cluster.