Monitoring and Alerting

This guide covers monitoring SPYDER Probe deployments using Prometheus, Grafana, and alerting systems.

Metrics Overview

SPYDER Probe exposes Prometheus metrics at the /metrics endpoint (default port 9090).
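
For reference, the snippet below is a minimal sketch of how a Go service typically exposes such an endpoint with the official Prometheus client library (promhttp); it is illustrative only, not the actual SPYDER Probe source.

go
// Minimal sketch: expose Prometheus metrics at /metrics on port 9090.
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // promhttp.Handler() serves every metric registered with the default registry.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}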

Core Metrics

Task Metrics

  • spyder_tasks_total{status="ok|error"} - Counter of processed tasks
  • spyder_task_duration_seconds - Histogram of task processing time
  • spyder_active_workers - Gauge of currently active workers

Edge Discovery Metrics

  • spyder_edges_total{type="RESOLVES_TO|HAS_CERT|LINKS_TO"} - Counter of discovered edges by type
  • spyder_nodes_discovered_total{type="domain|ip|cert"} - Counter of discovered nodes by type

System Metrics

  • spyder_redis_operations_total{operation="get|set|exists"} - Redis operation counters
  • spyder_http_requests_total{status_code} - HTTP request counters
  • spyder_batch_emissions_total{status="success|failure"} - Batch emission results
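
As a rough illustration of how the task metrics listed above are maintained, the sketch below times one unit of work and records the outcome under the ok/error status label. The metric names match the ones documented here; the package name and the processTask helper are assumptions, not the probe's actual code.

go
package probe

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    tasksTotal = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "spyder_tasks_total",
            Help: "Counter of processed tasks",
        },
        []string{"status"},
    )
    taskDuration = promauto.NewHistogram(
        prometheus.HistogramOpts{
            Name: "spyder_task_duration_seconds",
            Help: "Histogram of task processing time",
        },
    )
)

// processTask is a hypothetical stand-in for the probe's real task handler.
func processTask(domain string) error { return nil }

// handleTask times the work, then counts the result by status.
func handleTask(domain string) {
    start := time.Now()
    err := processTask(domain)
    taskDuration.Observe(time.Since(start).Seconds())

    status := "ok"
    if err != nil {
        status = "error"
    }
    tasksTotal.WithLabelValues(status).Inc()
}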

Prometheus Configuration

Basic Setup

Create /etc/prometheus/prometheus.yml:

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "spyder_alerts.yml"

scrape_configs:
  - job_name: 'spyder-probe'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 10s
    metrics_path: '/metrics'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

Multi-Node Setup

For distributed deployments:

yaml
scrape_configs:
  - job_name: 'spyder-probe-cluster'
    static_configs:
      - targets:
          - 'probe-1.internal:9090'
          - 'probe-2.internal:9090'
          - 'probe-3.internal:9090'
        labels:
          environment: 'production'
      
  - job_name: 'spyder-redis'
    static_configs:
      # Redis itself does not expose Prometheus metrics on 6379;
      # scrape a redis_exporter instance instead (default port 9121).
      - targets: ['redis.internal:9121']

Grafana Dashboards

Installation and Setup

  1. Install Grafana

    bash
    # Requires the Grafana APT repository to be configured first
    sudo apt-get install -y grafana
    sudo systemctl enable grafana-server
    sudo systemctl start grafana-server
  2. Add Prometheus data source

    • URL: http://localhost:9090
    • Access: Server (default)

SPYDER Probe Dashboard

Key panels to include:

Processing Rate Panel

promql
# Tasks processed per second
rate(spyder_tasks_total[5m])

# Success rate (%)
sum(rate(spyder_tasks_total{status="ok"}[5m])) / sum(rate(spyder_tasks_total[5m])) * 100

Edge Discovery Panel

promql
# Edges discovered by type
rate(spyder_edges_total[5m])

# Edge types with the most discoveries over the last hour
topk(10, increase(spyder_edges_total[1h]))

System Health Panel

promql
# Active workers
spyder_active_workers

# Share of Redis operations that are reads (GET)
sum(rate(spyder_redis_operations_total{operation="get"}[5m])) /
sum(rate(spyder_redis_operations_total[5m])) * 100

# HTTP error rate
sum(rate(spyder_http_requests_total{status_code!~"2.."}[5m])) /
sum(rate(spyder_http_requests_total[5m])) * 100

Alerting Rules

Create /etc/prometheus/spyder_alerts.yml:

yaml
groups:
  - name: spyder-probe
    rules:
      # High error rate
      - alert: SpyderHighErrorRate
        expr: sum by (instance) (rate(spyder_tasks_total{status="error"}[5m])) / sum by (instance) (rate(spyder_tasks_total[5m])) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "SPYDER Probe high error rate"
          description: "Error rate is {{ $value | humanizePercentage }} for instance {{ $labels.instance }}"

      # Low processing rate
      - alert: SpyderLowProcessingRate  
        expr: sum by (instance) (rate(spyder_tasks_total[5m])) < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "SPYDER Probe low processing rate"
          description: "Processing only {{ $value }} tasks/sec on {{ $labels.instance }}"

      # Redis connection issues
      - alert: SpyderRedisDown
        expr: sum by (instance) (rate(spyder_redis_operations_total[5m])) == 0
        for: 1m  
        labels:
          severity: critical
        annotations:
          summary: "SPYDER Probe Redis connection lost"
          description: "No Redis operations detected on {{ $labels.instance }}"

      # Batch emission failures
      - alert: SpyderBatchEmissionFailures
        expr: rate(spyder_batch_emissions_total{status="failure"}[5m]) > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "SPYDER Probe batch emission failures"
          description: "{{ $value }} batch emission failures/sec on {{ $labels.instance }}"

      # Worker pool exhaustion
      - alert: SpyderWorkerPoolExhausted
        expr: spyder_active_workers == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "SPYDER Probe worker pool exhausted"  
          description: "No active workers on {{ $labels.instance }}"

Log Monitoring

Structured Log Analysis

SPYDER uses structured logging with zap. Common log queries:

bash
# Error analysis (-o cat strips the journald prefix so jq sees raw JSON)
sudo journalctl -u spyder -o cat | jq 'select(.level == "error")'

# Performance analysis
sudo journalctl -u spyder -o cat | jq 'select(.msg == "batch emitted") | .duration'

# Redis connectivity issues
sudo journalctl -u spyder -o cat | jq 'select(.msg | contains("redis"))'
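
The jq filters above assume each journal entry's message is a single JSON object. A zap production logger emits exactly that shape (ts, level, msg plus typed fields); the sketch below is an assumption about the setup, not the probe's actual logger code, and the field values are illustrative.

go
package main

import "go.uber.org/zap"

func main() {
    // zap.NewProduction() uses the JSON encoder and writes to stderr,
    // which journald captures for the spyder unit.
    logger, err := zap.NewProduction()
    if err != nil {
        panic(err)
    }
    defer logger.Sync()

    // Produces roughly:
    // {"level":"info","ts":...,"msg":"batch emitted","edges":128,"duration":0.42}
    logger.Info("batch emitted",
        zap.Int("edges", 128),
        zap.Float64("duration", 0.42),
    )
}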

Log Aggregation with ELK Stack

Filebeat configuration (/etc/filebeat/filebeat.yml):

yaml
filebeat.inputs:
- type: journald
  id: spyder-logs
  include_matches:
    - "_SYSTEMD_UNIT=spyder.service"

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  
processors:
  - decode_json_fields:
      fields: ["message"]
      target: ""

Logstash filter:

ruby
filter {
  if [fields][service] == "spyder" {
    json {
      source => "message"
    }
    
    date {
      match => [ "ts", "UNIX" ]
    }
    
    mutate {
      remove_field => [ "message" ]
    }
  }
}

Performance Monitoring

Key Performance Indicators

  1. Throughput Metrics

    • Domains processed per second
    • Edges discovered per hour
    • Data volume processed
  2. Latency Metrics

    • Average task processing time
    • DNS resolution latency
    • HTTP request latency
  3. Resource Utilization

    • CPU usage per worker
    • Memory consumption
    • Network I/O rates

Custom Metrics

Add application-specific metrics:

go
// Custom metrics example
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Domains processed, labelled by TLD (a bounded label set).
    domainsProcessed = promauto.NewCounterVec(
        prometheus.CounterOpts{
            Name: "spyder_domains_processed_total",
            Help: "Total domains processed by TLD",
        },
        []string{"tld"},
    )

    // Crawl depth distribution, labelled by probe instance.
    crawlDepth = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "spyder_crawl_depth",
            Help: "Distribution of crawl depths",
            Buckets: prometheus.LinearBuckets(1, 1, 10),
        },
        []string{"probe_id"},
    )
)
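
A hypothetical call site for these metrics might look like the following; labelling by TLD rather than full domain keeps the label set bounded. The function and argument names are illustrative.

go
// recordDomain assumes the TLD was already extracted upstream.
func recordDomain(tld string, depth float64, probeID string) {
    domainsProcessed.WithLabelValues(tld).Inc()
    crawlDepth.WithLabelValues(probeID).Observe(depth)
}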

Health Checks

Kubernetes Probes

yaml
apiVersion: v1
kind: Pod
metadata:
  name: spyder-probe
spec:
  containers:
  - name: spyder-probe
    image: spyder-probe:latest
    ports:
    - containerPort: 9090
    livenessProbe:
      httpGet:
        path: /metrics
        port: 9090
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /metrics  
        port: 9090
      initialDelaySeconds: 5
      periodSeconds: 5
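
Probing /metrics works, but the response grows with metric count. If a dedicated health endpoint is preferred, a minimal handler might look like the sketch below; it assumes a go-redis client and is not an existing SPYDER endpoint.

go
package health

import (
    "context"
    "net/http"
    "time"

    "github.com/redis/go-redis/v9"
)

// healthzHandler returns 200 when Redis is reachable and 503 otherwise,
// so liveness/readiness probes can target a small, dependency-aware check.
func healthzHandler(rdb *redis.Client) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()
        if err := rdb.Ping(ctx).Err(); err != nil {
            http.Error(w, "redis unreachable", http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("ok"))
    }
}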

External Health Monitoring

Use tools like:

  • Uptime Kuma - Simple uptime monitoring
  • Pingdom - External service monitoring
  • DataDog - Comprehensive monitoring platform

Troubleshooting Monitoring Issues

Common Problems

  1. Metrics not appearing

    bash
    # Check metrics endpoint
    curl http://localhost:9090/metrics | grep spyder
    
    # Verify Prometheus scraping
    curl http://prometheus:9090/api/v1/targets
  2. High cardinality metrics (see the sketch after this list)

    bash
    # Check metric cardinality
    curl -s http://localhost:9090/metrics | grep spyder | wc -l
    
    # Look for high-cardinality labels
    curl -s http://localhost:9090/metrics | grep spyder | cut -d'{' -f2 | sort | uniq -c | sort -nr
  3. Dashboard not loading

    • Check Grafana datasource configuration
    • Verify PromQL queries in Grafana query inspector
    • Check Grafana logs: sudo journalctl -u grafana-server
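
A frequent cause of the high-cardinality symptom checked in item 2 above is passing unbounded values (full domain names, URLs, raw upstream error strings) straight into label values. A small mitigation sketch, with an illustrative helper name:

go
package metrics

// boundLabel maps an unbounded input onto a fixed set of label values so a
// single metric cannot explode into millions of series.
func boundLabel(value string, allowed map[string]bool) string {
    if allowed[value] {
        return value
    }
    return "other"
}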

Performance Impact

Monitor the monitoring overhead:

promql
# Prometheus sample ingestion rate
rate(prometheus_tsdb_head_samples_appended_total[5m])

# Grafana HTTP request latency (95th percentile)
histogram_quantile(0.95, sum(rate(grafana_http_request_duration_seconds_bucket[5m])) by (le))

Integration Examples

Slack Alerting

Configure Alertmanager for Slack notifications:

yaml
# alertmanager.yml
global:
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

route:
  group_by: ['alertname']
  receiver: 'spyder-alerts'

receivers:
  - name: 'spyder-alerts'
    slack_configs:
      - channel: '#ops-alerts'
        title: 'SPYDER Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

PagerDuty Integration

yaml
receivers:
  - name: 'spyder-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: 'SPYDER: {{ .GroupLabels.alertname }}'