Deduplication Component
The deduplication component (internal/dedup) prevents redundant processing by tracking previously seen domains across probe instances and over time.
Overview
The deduplication component provides efficient domain tracking to prevent redundant work in distributed SPYDER deployments. It offers both memory-based and Redis-based implementations for different deployment scenarios.
Core Interface
Interface
Common interface for all deduplication implementations:
type Interface interface {
	// Seen reports whether key was previously seen and records it if it was not.
	Seen(key string) bool
}
Implementations
Memory-Based Deduplication
Memory Structure
type Memory struct {
	m sync.Map // Thread-safe map for concurrent access
}
NewMemory() *Memory
Creates a new memory-based deduplicator.
Returns:
- *Memory: Memory-based deduplicator instance
Characteristics:
- Thread-Safe: Uses sync.Map for concurrent access
- Process-Local: Deduplication scope is limited to a single process
- Persistent: Entries are never evicted and last for the process lifetime; they are lost on restart
- Zero Configuration: No external dependencies
Seen(key string) bool
Checks if a key has been seen before, marking it as seen if new.
Parameters:
- key: The string key to check (typically a domain name)
Returns:
- bool: true if key was previously seen, false if new
Behavior:
- Atomic Operation: Uses LoadOrStore for atomic check-and-set
- Memory Efficient: Stores empty struct{} as values
- Immediate Response: No network latency or timeouts
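The check-and-set behavior listed above can be sketched with sync.Map.LoadOrStore. This is a minimal illustration that mirrors the documented names; it is not necessarily the code in internal/dedup.

```go
package main

import "sync"

// Memory is a sketch of a process-local deduplicator.
type Memory struct {
	m sync.Map // keys are strings, values are empty structs
}

// NewMemory returns an empty deduplicator.
func NewMemory() *Memory { return &Memory{} }

// Seen reports whether key was already recorded and records it if not.
// LoadOrStore performs both steps atomically, so two goroutines racing on
// the same key cannot both observe "not seen".
func (d *Memory) Seen(key string) bool {
	_, loaded := d.m.LoadOrStore(key, struct{}{})
	return loaded
}
```

The first call for a given key returns false and every later call returns true, which is why callers typically guard work with `if !deduper.Seen(key)`.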
Redis-Based Deduplication
Redis Structure
type Redis struct {
	cli        *redis.Client // Redis client connection
	ttl        time.Duration // Time-to-live for entries
	errorCount int           // Error counter for monitoring
}
NewRedis(addr string, ttl time.Duration) (*Redis, error)
Creates a new Redis-based deduplicator with TTL support.
Parameters:
- addr: Redis server address (e.g., "127.0.0.1:6379")
- ttl: Time-to-live for deduplication entries
Returns:
- *Redis: Redis-based deduplicator instance
- error: Connection error if Redis is unreachable
Features:
- Connection Validation: Pings Redis server during initialization
- TTL Support: Automatic expiration of old entries
- Distributed: Shared state across multiple probe instances
Seen(key string) bool
Checks if a key has been seen using Redis SET NX operation.
Parameters:
- key: The string key to check (the implementation prefixes it with "seen:")
Returns:
- bool: true if key was previously seen, false if new
Implementation Details:
- Key Prefix: Adds "seen:" prefix to all keys
- Atomic Operation: Uses SETNX for atomic check-and-set
- Timeout Protection: 2-second context timeout for Redis operations
- Error Tolerance: Returns false (not seen) on Redis errors
- Error Throttling: Logs every 100th error to prevent spam
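The implementation details above (key prefix, 2-second timeout, fail-open on errors, throttled logging) can be sketched without a live Redis server. In the sketch below, `setNXer`, `tolerantDedup`, and `fakeStore` are hypothetical names introduced for illustration; `setNXer` stands in for the Redis client, and the atomic counter is an assumption of the sketch rather than a description of the actual `errorCount` field.

```go
package main

import (
	"context"
	"errors"
	"log"
	"sync/atomic"
	"time"
)

// setNXer is a hypothetical stand-in for the Redis client: it stores key
// only if absent and reports whether the key was newly stored.
type setNXer interface {
	SetNX(ctx context.Context, key string, ttl time.Duration) (bool, error)
}

// tolerantDedup sketches the documented Seen behavior on top of setNXer.
type tolerantDedup struct {
	store    setNXer
	ttl      time.Duration
	errCount atomic.Int64 // atomic so concurrent callers can count safely
}

// Seen reports whether key was already stored, failing open on errors.
func (d *tolerantDedup) Seen(key string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	stored, err := d.store.SetNX(ctx, "seen:"+key, d.ttl)
	if err != nil {
		// Fail open: report "not seen" so the caller still processes the key.
		if n := d.errCount.Add(1); n%100 == 0 {
			log.Printf("dedup: %d store errors, latest: %v", n, err)
		}
		return false
	}
	return !stored // newly stored means the key had not been seen before
}

// fakeStore is an in-memory setNXer used only to exercise the sketch.
type fakeStore struct {
	keys map[string]bool
	fail bool
}

func (f *fakeStore) SetNX(_ context.Context, key string, _ time.Duration) (bool, error) {
	if f.fail {
		return false, errors.New("store unavailable")
	}
	if f.keys[key] {
		return false, nil
	}
	f.keys[key] = true
	return true, nil
}
```

Failing open trades correctness for availability: during a Redis outage every key looks new, so probes may repeat work but never stall on deduplication.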
Deployment Strategies
Single-Node Deployment
// Memory-based deduplication for single probe instances
deduper := dedup.NewMemory()
if !deduper.Seen("example.com") {
	// Process domain (first time seen)
}
Distributed Deployment
// Redis-based deduplication for distributed probes
deduper, err := dedup.NewRedis("127.0.0.1:6379", 24*time.Hour)
if err != nil {
	log.Fatalf("Redis connection failed: %v", err)
}
if !deduper.Seen("example.com") {
	// Process domain (first time seen across cluster)
}
Key Generation Strategies
Domain-Based Keys
// Simple domain deduplication
key := "domain:" + domain
Content-Based Keys
// URL-specific deduplication
key := "url:" + url
Time-Window Keys
// Daily deduplication windows
key := "daily:" + time.Now().Format("2006-01-02") + ":" + domain
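A time-window key can be wrapped in a small helper so that every probe builds it the same way. The sketch below is illustrative only: the helper name `dailyKey`, the fixed `exampleDay` value, and the choice of UTC (so that probes in different time zones agree on the window boundary) are assumptions, not part of the documented API.

```go
package main

import "time"

// dailyKey builds a key that changes once per UTC calendar day, so a domain
// is deduplicated within a day and becomes eligible again the next day.
func dailyKey(domain string, now time.Time) string {
	return "daily:" + now.UTC().Format("2006-01-02") + ":" + domain
}

// exampleDay is a fixed instant used to show the key format.
// dailyKey("example.com", exampleDay) returns "daily:2024-03-05:example.com".
var exampleDay = time.Date(2024, time.March, 5, 23, 30, 0, 0, time.UTC)
```

In production code the caller would pass `time.Now()`; taking the time as a parameter keeps the helper easy to test.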
Performance Characteristics
Memory Implementation
- Latency: Sub-microsecond operation latency
- Throughput: Millions of operations per second
- Memory: Linear growth with unique keys
- Concurrency: Excellent concurrent performance
Redis Implementation
- Latency: ~1-2ms operation latency (network dependent)
- Throughput: Thousands of operations per second
- Memory: Bounded by TTL and Redis memory
- Concurrency: Shared state across processes
Error Handling
Memory Implementation
- No Errors: Memory operations cannot fail
- Guaranteed Consistency: Thread-safe operations
- No Network Dependencies: Pure in-memory operation
Redis Implementation
- Graceful Degradation: Treats Redis errors as "not seen"
- Error Monitoring: Counts and logs errors for observability
- Timeout Protection: 2-second timeout prevents hanging
- Connection Recovery: Automatic reconnection on network issues
TTL and Expiration
Redis TTL Benefits
- Memory Management: Automatic cleanup of old entries
- Freshness Control: Ensures recent data processing
- Storage Efficiency: Prevents unbounded Redis growth
TTL Configuration Examples
// Alternatives: choose one TTL per deployment and check err before use.
// Hourly deduplication
hourly, err := dedup.NewRedis("redis:6379", 1*time.Hour)
// Daily deduplication
daily, err := dedup.NewRedis("redis:6379", 24*time.Hour)
// Weekly deduplication
weekly, err := dedup.NewRedis("redis:6379", 7*24*time.Hour)
Integration Patterns
Probe Pipeline Integration
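type ProbeConfig struct {
	Deduper dedup.Interface
}

// Probe is assumed here to carry the Deduper taken from its ProbeConfig.
type Probe struct {
	Deduper dedup.Interface
}

func (p *Probe) ProcessDomain(domain string) {
	if p.Deduper.Seen(domain) {
		return // Skip already processed domain
	}
	// Continue with domain processing
}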
Configuration-Driven Selection
func NewDeduper(redisAddr string, ttl time.Duration) dedup.Interface {
	if redisAddr != "" {
		r, err := dedup.NewRedis(redisAddr, ttl)
		if err == nil {
			return r
		}
		log.Printf("Redis unavailable (%v), falling back to memory", err)
	}
	return dedup.NewMemory()
}
Monitoring and Observability
Memory Implementation Metrics
- Total Keys: Number of unique keys stored
- Memory Usage: Approximate memory consumption
- Hit Rate: Percentage of keys that were already seen
Redis Implementation Metrics
- Connection Status: Redis server connectivity
- Error Rate: Redis operation failure rate
- Response Latency: Redis operation timing
- Key Count: Number of active keys in Redis
- Memory Usage: Redis server memory consumption
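Metrics such as hit rate can be collected from outside either implementation by wrapping it. The following is a minimal sketch under stated assumptions: `counting` and `mapDedup` are hypothetical names, `Interface` is redeclared locally to mirror the documented dedup.Interface, and nothing here implies that the package exposes these counters itself.

```go
package main

import "sync/atomic"

// Interface mirrors the documented dedup.Interface.
type Interface interface {
	Seen(key string) bool
}

// counting wraps any Interface and counts checks and hits.
type counting struct {
	next   Interface
	checks atomic.Int64
	hits   atomic.Int64
}

// Seen forwards to the wrapped deduplicator and updates the counters.
func (c *counting) Seen(key string) bool {
	c.checks.Add(1)
	seen := c.next.Seen(key)
	if seen {
		c.hits.Add(1)
	}
	return seen
}

// HitRate returns hits divided by checks, or 0 before the first check.
func (c *counting) HitRate() float64 {
	checks := c.checks.Load()
	if checks == 0 {
		return 0
	}
	return float64(c.hits.Load()) / float64(checks)
}

// mapDedup is a tiny single-goroutine Interface used only to exercise the sketch.
type mapDedup map[string]struct{}

func (m mapDedup) Seen(key string) bool {
	_, ok := m[key]
	m[key] = struct{}{}
	return ok
}
```

Because the wrapper itself satisfies Interface, it can be placed around a memory-based or Redis-based deduplicator without changing the calling code.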
Common Use Cases
Domain Deduplication
// Prevent processing same domain multiple times
if !deduper.Seen("domain:" + domain) {
	processDomain(domain)
}
URL Deduplication
// Prevent fetching same URL multiple times
urlKey := "url:" + url
if !deduper.Seen(urlKey) {
	fetchAndProcessURL(url)
}
Batch Deduplication
// Process only new domains from a batch
var newDomains []string
for _, domain := range domains {
	if !deduper.Seen("batch:" + domain) {
		newDomains = append(newDomains, domain)
	}
}
processBatch(newDomains)
Best Practices
Key Naming Conventions
- Use Prefixes: Namespace keys to avoid collisions
- Consistent Format: Use consistent key formatting
- Descriptive Names: Include context in key names
Error Handling
- Graceful Degradation: Never block on deduplication failures
- Monitoring: Track error rates and patterns
- Fallback Strategy: Consider memory fallback for Redis failures
Performance Optimization
- Batch Operations: Group multiple checks when possible
- Key Length: Keep keys reasonably short for memory efficiency
- TTL Selection: Balance freshness needs with performance
Security Considerations
Redis Security
- Authentication: Use Redis AUTH for production deployments
- Network Security: Secure Redis network communications
- Access Control: Limit Redis access to probe processes only
Data Privacy
- Key Content: Be mindful of sensitive data in keys
- TTL Compliance: Ensure TTL meets data retention policies
- Access Logging: Monitor deduplication key access patterns
Troubleshooting
Common Issues
- Redis Connectivity: Network or authentication failures
- Memory Growth: Unbounded memory usage in Memory implementation
- Key Collisions: Different operations using same keys
Debugging Steps
- Check Redis Connection: Verify Redis server availability
- Monitor Error Logs: Review Redis operation error messages
- Analyze Key Patterns: Examine key naming and distribution
- Performance Analysis: Monitor operation latency and throughput