Deduplication Component

The deduplication component (internal/dedup) prevents redundant processing by tracking previously seen domains across probe instances and time.

Overview

The deduplication component provides efficient domain tracking to prevent redundant work in distributed SPYDER deployments. It offers both memory-based and Redis-based implementations for different deployment scenarios.

Core Interface

`Interface`

Common interface for all deduplication implementations:

type Interface interface {
    Seen(key string) bool  // Reports whether key was already seen; a new key is recorded as seen
}

Implementations

Memory-Based Deduplication

`Memory` Structure

type Memory struct {
    m sync.Map  // Thread-safe map for concurrent access
}

`NewMemory() *Memory`

Creates a new memory-based deduplicator.

Returns:

*Memory: Memory-based deduplicator instance

Characteristics:

Thread-Safe: Uses sync.Map for concurrent access
Process-Local: Deduplication scope limited to single process
No Expiry: Entries are kept in memory for the process lifetime and are lost on restart
Zero Configuration: No external dependencies

`Seen(key string) bool`

Checks if a key has been seen before, marking it as seen if new.

Parameters:

key: The string key to check (typically domain name)

Returns:

bool: true if key was previously seen, false if new

Behavior:

Atomic Operation: Uses LoadOrStore for atomic check-and-set
Memory Efficient: Stores empty struct{} as values
Immediate Response: No network latency or timeouts
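
The behavior listed above can be sketched in a few lines. This is a minimal illustration built only from the struct and method signatures shown in this section, not the package's actual source; the local Interface and Memory types are redeclared here so the example compiles on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// Interface mirrors the deduplication interface documented above.
type Interface interface {
	Seen(key string) bool
}

// Memory is a sketch of a sync.Map-backed deduplicator.
type Memory struct {
	m sync.Map
}

// Compile-time check that *Memory satisfies Interface.
var _ Interface = (*Memory)(nil)

// NewMemory returns an empty deduplicator.
func NewMemory() *Memory { return &Memory{} }

// Seen records key and reports whether it was already present.
// LoadOrStore performs the lookup and the insert as one atomic step,
// so two goroutines racing on the same key cannot both get false.
func (d *Memory) Seen(key string) bool {
	_, loaded := d.m.LoadOrStore(key, struct{}{})
	return loaded
}

func main() {
	d := NewMemory()
	fmt.Println(d.Seen("example.com")) // false: first sighting
	fmt.Println(d.Seen("example.com")) // true: already recorded
}
```

Because the check and the store happen in a single call, callers never need a separate "mark as seen" step.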

Redis-Based Deduplication

`Redis` Structure

type Redis struct {
    cli        *redis.Client  // Redis client connection
    ttl        time.Duration  // Time-to-live for entries
    errorCount int           // Error counter for monitoring
}

`NewRedis(addr string, ttl time.Duration) (*Redis, error)`

Creates a new Redis-based deduplicator with TTL support.

Parameters:

addr: Redis server address (e.g., "127.0.0.1:6379")
ttl: Time-to-live for deduplication entries

Returns:

*Redis: Redis-based deduplicator instance
error: Connection error if Redis is unreachable

Features:

Connection Validation: Pings Redis server during initialization
TTL Support: Automatic expiration of old entries
Distributed: Shared state across multiple probe instances

`Seen(key string) bool`

Checks whether a key has been seen using the Redis SET NX operation. A new key is stored with the configured TTL.

Parameters:

key: The string key to check (the implementation prepends "seen:" before writing to Redis; callers pass the bare key)

Returns:

bool: true if key was previously seen, false if new

Implementation Details:

Key Prefix: Adds "seen:" prefix to all keys
Atomic Operation: Uses SETNX for atomic check-and-set
Timeout Protection: 2-second context timeout for Redis operations
Error Tolerance: Returns false (not seen) on Redis errors
Error Throttling: Logs every 100th error to prevent spam
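
The implementation details above can be sketched without a running Redis server. In the example below, setNXer and fakeStore are hypothetical names introduced only for illustration: setNXer stands in for the single "set if absent, with TTL" call made on the Redis client, and fakeStore is an in-memory substitute that ignores the TTL. The Redis type here is a simplified approximation of the documented behavior, not the package's source.

```go
package main

import (
	"context"
	"errors"
	"log"
	"sync"
	"sync/atomic"
	"time"
)

// setNXer is a hypothetical stand-in for the Redis client: it sets key only
// if absent, with a TTL, and reports whether the key was created.
type setNXer interface {
	SetNX(ctx context.Context, key string, ttl time.Duration) (bool, error)
}

// Redis sketches the documented Seen behavior on top of setNXer.
type Redis struct {
	cli        setNXer
	ttl        time.Duration
	errorCount int64 // updated atomically
}

// Seen reports whether key was already recorded, recording it if new.
func (r *Redis) Seen(key string) bool {
	// Bound every call so a stalled server cannot hang the probe.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	created, err := r.cli.SetNX(ctx, "seen:"+key, r.ttl)
	if err != nil {
		// Fail open: report "not seen" so the caller still processes the key.
		// Only every 100th error is logged to keep the log readable.
		if n := atomic.AddInt64(&r.errorCount, 1); n%100 == 0 {
			log.Printf("dedup: %d redis errors so far, latest: %v", n, err)
		}
		return false
	}
	return !created // the key already existed if SetNX did not create it
}

// fakeStore is an in-memory setNXer used for illustration; it ignores the TTL.
type fakeStore struct {
	mu   sync.Mutex
	keys map[string]bool
	down bool // when true, every call fails
}

func (f *fakeStore) SetNX(_ context.Context, key string, _ time.Duration) (bool, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.down {
		return false, errors.New("connection refused")
	}
	if f.keys[key] {
		return false, nil
	}
	f.keys[key] = true
	return true, nil
}

func main() {
	store := &fakeStore{keys: map[string]bool{}}
	d := &Redis{cli: store, ttl: 24 * time.Hour}
	log.Println(d.Seen("example.com"), d.Seen("example.com")) // false true
	store.down = true
	log.Println(d.Seen("example.com")) // false: errors are treated as "not seen"
}
```

Failing open favors availability over strict deduplication: during an outage a domain may be processed more than once, but the probe never stalls on the deduplicator.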

Deployment Strategies

Single-Node Deployment

// Memory-based deduplication for single probe instances
deduper := dedup.NewMemory()
if !deduper.Seen("example.com") {
    // Process domain (first time seen)
}
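
Within a single process, one deduplicator can be shared by all worker goroutines. The sketch below assumes a sync.Map-backed Memory type like the one described earlier (redeclared locally so the example is self-contained); processUnique is a hypothetical helper, and the counter increment stands in for real probe work.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Memory is a local stand-in for the memory-based deduplicator.
type Memory struct{ m sync.Map }

// Seen records key and reports whether it was already present.
func (d *Memory) Seen(key string) bool {
	_, loaded := d.m.LoadOrStore(key, struct{}{})
	return loaded
}

// processUnique fans domains out to the given number of workers, all sharing
// one deduplicator, and returns how many domains were actually processed.
func processUnique(domains []string, workers int) int {
	d := &Memory{}
	jobs := make(chan string)
	var processed int64
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for domain := range jobs {
				if d.Seen(domain) {
					continue // another worker already claimed this domain
				}
				atomic.AddInt64(&processed, 1) // stand-in for real work
			}
		}()
	}

	for _, domain := range domains {
		jobs <- domain
	}
	close(jobs)
	wg.Wait()
	return int(atomic.LoadInt64(&processed))
}

func main() {
	domains := []string{"a.com", "b.com", "a.com", "a.com", "c.com", "b.com"}
	fmt.Println(processUnique(domains, 4)) // 3
}
```

However the duplicates are distributed across workers, each distinct domain is processed exactly once.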

Distributed Deployment

// Redis-based deduplication for distributed probes
deduper, err := dedup.NewRedis("127.0.0.1:6379", 24*time.Hour)
if err != nil {
    log.Fatalf("Redis connection failed: %v", err)
}
if !deduper.Seen("example.com") {
    // Process domain (first time seen across cluster)
}

Key Generation Strategies

Domain-Based Keys

// Simple domain deduplication
key := "domain:" + domain

Content-Based Keys

// URL-specific deduplication
key := "url:" + url

Time-Window Keys

// Daily deduplication windows (UTC, so all probes share the same day boundary)
key := "daily:" + time.Now().UTC().Format("2006-01-02") + ":" + domain
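
When probes run in different time zones, a key built from the local date can place the same instant in two different windows. A small helper that takes the timestamp as a parameter makes the window explicit and easy to test. The names dailyKey and sampleTime are illustrative and not part of the package.

```go
package main

import (
	"fmt"
	"time"
)

// dailyKey builds a time-window key from t's UTC calendar date, so probes in
// different time zones agree on where one window ends and the next begins.
func dailyKey(t time.Time, domain string) string {
	return "daily:" + t.UTC().Format("2006-01-02") + ":" + domain
}

// sampleTime is 23:30 on 5 March 2024 in a UTC-5 zone,
// which is already 04:30 on 6 March in UTC.
var sampleTime = time.Date(2024, time.March, 5, 23, 30, 0, 0,
	time.FixedZone("UTC-5", -5*60*60))

func main() {
	fmt.Println(dailyKey(sampleTime, "example.com")) // daily:2024-03-06:example.com
}
```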

Performance Characteristics

Memory Implementation

Latency: Sub-microsecond operation latency
Throughput: Millions of operations per second
Memory: Linear growth with unique keys
Concurrency: Excellent concurrent performance

Redis Implementation

Latency: ~1-2ms operation latency (network dependent)
Throughput: Thousands of operations per second
Memory: Bounded by TTL and Redis memory
Concurrency: Shared state across processes

Error Handling

Memory Implementation

No Errors: Memory operations cannot fail
Guaranteed Consistency: Thread-safe operations
No Network Dependencies: Pure in-memory operation

Redis Implementation

Graceful Degradation: Treats Redis errors as "not seen", so domains may be processed again during an outage
Error Monitoring: Counts and logs errors for observability
Timeout Protection: 2-second timeout prevents hanging
Connection Recovery: Automatic reconnection on network issues

TTL and Expiration

Redis TTL Benefits

Memory Management: Automatic cleanup of old entries
Freshness Control: Ensures recent data processing
Storage Efficiency: Prevents unbounded Redis growth

TTL Configuration Examples

// Each call below is an alternative configuration; check the returned error in real code.

// Hourly deduplication
hourly, _ := dedup.NewRedis("redis:6379", 1*time.Hour)

// Daily deduplication
daily, _ := dedup.NewRedis("redis:6379", 24*time.Hour)

// Weekly deduplication
weekly, _ := dedup.NewRedis("redis:6379", 7*24*time.Hour)

Integration Patterns

Probe Pipeline Integration

// Probe holds the dependencies used while processing domains.
type Probe struct {
    Deduper dedup.Interface
}

func (p *Probe) ProcessDomain(domain string) {
    if p.Deduper.Seen(domain) {
        return // Skip already processed domain
    }
    // Continue with domain processing
}

Configuration-Driven Selection

func NewDeduper(redisAddr string, ttl time.Duration) dedup.Interface {
    if redisAddr != "" {
        r, err := dedup.NewRedis(redisAddr, ttl)
        if err == nil {
            return r
        }
        log.Printf("Redis unavailable (%v), falling back to memory", err)
    }
    return dedup.NewMemory()
}

Monitoring and Observability

Memory Implementation Metrics

Total Keys: Number of unique keys stored
Memory Usage: Approximate memory consumption
Hit Rate: Percentage of keys that were already seen
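
The interface exposes only Seen, so one way to collect these metrics is to wrap a deduplicator and count at the call site. The sketch below is an assumption about how that could be done; Counting and Stats are hypothetical names, and Memory is redeclared locally to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Interface mirrors the deduplication interface documented above.
type Interface interface {
	Seen(key string) bool
}

// Memory is a local stand-in for the memory-based deduplicator.
type Memory struct{ m sync.Map }

func (d *Memory) Seen(key string) bool {
	_, loaded := d.m.LoadOrStore(key, struct{}{})
	return loaded
}

// Counting wraps any Interface and tracks how many checks were hits.
type Counting struct {
	checks int64 // total Seen calls, updated atomically
	hits   int64 // calls that returned true, updated atomically
	next   Interface
}

// Seen delegates to the wrapped deduplicator and updates the counters.
func (c *Counting) Seen(key string) bool {
	seen := c.next.Seen(key)
	atomic.AddInt64(&c.checks, 1)
	if seen {
		atomic.AddInt64(&c.hits, 1)
	}
	return seen
}

// Stats returns the total checks, the number of misses, and the hit rate in
// [0, 1]. For a deduplicator that never expires or drops keys, the number of
// misses equals the number of unique keys stored.
func (c *Counting) Stats() (checks, misses int64, hitRate float64) {
	checks = atomic.LoadInt64(&c.checks)
	hits := atomic.LoadInt64(&c.hits)
	if checks > 0 {
		hitRate = float64(hits) / float64(checks)
	}
	return checks, checks - hits, hitRate
}

func main() {
	c := &Counting{next: &Memory{}}
	for _, k := range []string{"a.com", "b.com", "a.com", "a.com"} {
		c.Seen(k)
	}
	fmt.Println(c.Stats()) // 4 2 0.5
}
```

Because Counting itself satisfies Interface, it can be dropped in wherever a deduplicator is expected.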

Redis Implementation Metrics

Connection Status: Redis server connectivity
Error Rate: Redis operation failure rate
Response Latency: Redis operation timing
Key Count: Number of active keys in Redis
Memory Usage: Redis server memory consumption

Common Use Cases

Domain Deduplication

// Prevent processing same domain multiple times
if !deduper.Seen("domain:"+domain) {
    processDomain(domain)
}

URL Deduplication

// Prevent fetching same URL multiple times
urlKey := "url:" + url
if !deduper.Seen(urlKey) {
    fetchAndProcessURL(url)
}

Batch Deduplication

// Process only new domains from a batch
var newDomains []string
for _, domain := range domains {
    if !deduper.Seen("batch:"+domain) {
        newDomains = append(newDomains, domain)
    }
}
processBatch(newDomains)

Best Practices

Key Naming Conventions

Use Prefixes: Namespace keys to avoid collisions
Consistent Format: Use consistent key formatting
Descriptive Names: Include context in key names
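
One way to apply these conventions is to build every key through a single helper, so the prefix and formatting cannot drift between call sites. The sketch below also lowercases the domain and drops a trailing dot; that normalization is an assumption about what "consistent format" could mean here, and domainKey is an illustrative name.

```go
package main

import (
	"fmt"
	"strings"
)

// domainKey normalizes a domain and namespaces it, so that
// "Example.COM." and "example.com" deduplicate to the same key.
func domainKey(domain string) string {
	d := strings.ToLower(strings.TrimSpace(domain))
	d = strings.TrimSuffix(d, ".") // drop the trailing dot of a fully qualified name
	return "domain:" + d
}

func main() {
	fmt.Println(domainKey(" Example.COM. ")) // domain:example.com
	fmt.Println(domainKey("example.com"))    // domain:example.com
}
```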

Error Handling

Graceful Degradation: Never block on deduplication failures
Monitoring: Track error rates and patterns
Fallback Strategy: Consider memory fallback for Redis failures

Performance Optimization

Batch Operations: Group multiple checks when possible
Key Length: Keep keys reasonably short for memory efficiency
TTL Selection: Balance freshness needs with performance
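
For long inputs such as full URLs, one option for keeping keys short is to store a fixed-length digest instead of the raw string. The sketch below uses SHA-256 from the standard library; urlKey is a hypothetical helper and the choice of hash is an assumption, not something the package prescribes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// urlKey returns a fixed-length key for a URL of any length:
// the "url:" prefix plus the hex-encoded SHA-256 digest (64 characters).
func urlKey(rawURL string) string {
	sum := sha256.Sum256([]byte(rawURL))
	return "url:" + hex.EncodeToString(sum[:])
}

func main() {
	k := urlKey("https://example.com/a/very/long/path?with=many&query=parameters")
	fmt.Println(len(k)) // 68
}
```

The trade-off is readability: hashed keys cannot be inspected by eye in Redis, and for inputs shorter than the digest the key gets longer rather than shorter.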

Security Considerations

Redis Security

Authentication: Use Redis AUTH for production deployments
Network Security: Secure Redis network communications
Access Control: Limit Redis access to probe processes only

Data Privacy

Key Content: Be mindful of sensitive data in keys
TTL Compliance: Ensure TTL meets data retention policies
Access Logging: Monitor deduplication key access patterns

Troubleshooting

Common Issues

Redis Connectivity: Network or authentication failures
Memory Growth: Unbounded memory usage in Memory implementation
Key Collisions: Different operations using same keys

Debugging Steps

Check Redis Connection: Verify Redis server availability
Monitor Error Logs: Review Redis operation error messages
Analyze Key Patterns: Examine key naming and distribution
Performance Analysis: Monitor operation latency and throughput
