BGP Observability Plan

Goal

Build a global routing observability capability on top of:

  • RIS Live
  • BGPStream

The target is to support:

  • real-time routing event ingestion
  • historical replay and baseline analysis
  • anomaly detection
  • Earth big-screen visualization

Important Scope Note

These data sources expose the BGP control plane, not user traffic itself.

That means the system can infer:

  • route propagation direction
  • prefix reachability changes
  • AS path changes
  • visibility changes across collectors

But it cannot directly measure:

  • exact application traffic volume
  • exact user packet path
  • real bandwidth consumption between countries or operators

Product wording should therefore use phrases like:

  • global routing propagation
  • route visibility
  • control-plane anomalies
  • suspected path diversion

Avoid wording that claims direct traffic measurement.

Data Source Roles

RIS Live

Use RIS Live as the real-time feed.

Recommended usage:

  • subscribe to update streams over WebSocket
  • ingest announcements and withdrawals continuously
  • trigger low-latency alerts

Best suited for:

  • hijack suspicion
  • withdrawal bursts
  • real-time path changes
  • live Earth event overlay
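
A minimal sketch of the subscription step, assuming the public RIS Live endpoint and its `ris_subscribe` JSON frame. The helper name `build_subscribe_message` and the `client` string are hypothetical, and the WebSocket client itself is left out so the snippet stays dependency-free.

```python
import json
from typing import Optional

# Assumed public endpoint; verify against the RIS Live documentation.
RIS_LIVE_URL = "wss://ris-live.ripe.net/v1/ws/?client=bgp-observability-sketch"


def build_subscribe_message(collector: Optional[str] = None,
                            prefix: Optional[str] = None) -> str:
    """Build a ris_subscribe frame that asks for BGP UPDATE messages."""
    data = {"type": "UPDATE"}
    if collector:
        data["host"] = collector          # e.g. "rrc00"
    if prefix:
        data["prefix"] = prefix
        data["moreSpecific"] = True       # also match more-specific prefixes
    return json.dumps({"type": "ris_subscribe", "data": data})
```

The returned string would be sent once after the WebSocket connects; announcements and withdrawals then arrive as individual JSON messages.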

BGPStream

Use BGPStream as the historical and replay layer.

Recommended usage:

  • backfill time windows
  • build normal baselines
  • compare current events against history
  • support investigations and playback

Best suited for:

  • historical anomaly confirmation
  • baseline path frequency
  • visibility baselines
  • postmortem analysis

Architecture

flowchart LR
  A["RIS Live WebSocket"] --> B["Realtime Collector"]
  C["BGPStream Historical Access"] --> D["Backfill Collector"]
  B --> E["Normalization Layer"]
  D --> E
  E --> F["data_snapshots"]
  E --> G["collected_data"]
  E --> H["bgp_anomalies"]
  H --> I["Alerts API"]
  G --> J["Visualization API"]
  H --> J
  J --> K["Earth Big Screen"]

Storage Design

The current project already has:

  • data_snapshots
  • collected_data

So the lowest-risk path is:

  1. keep raw and normalized BGP events in collected_data
  2. use data_snapshots to group each ingest window
  3. add a dedicated anomaly table for higher-value derived events

Proposed Data Types

collected_data

Use these source values:

  • ris_live_bgp
  • bgpstream_bgp

Use these data_type values:

  • bgp_update
  • bgp_rib
  • bgp_visibility
  • bgp_path_change

Recommended stable fields:

  • source
  • source_id
  • entity_key
  • data_type
  • name
  • reference_date
  • metadata

Recommended entity_key strategy:

  • event entity: collector|peer|prefix|event_time
  • prefix state entity: collector|peer|prefix
  • origin state entity: prefix|origin_asn
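
The three key shapes above could be built as follows. This is a sketch: the function names are hypothetical, and it assumes `event_time` is a timezone-aware datetime that is rendered in UTC at one-second resolution.

```python
from datetime import datetime, timezone


def event_key(collector: str, peer: str, prefix: str, event_time: datetime) -> str:
    """collector|peer|prefix|event_time, with the time rendered in UTC."""
    ts = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "|".join([collector, peer, prefix, ts])


def prefix_state_key(collector: str, peer: str, prefix: str) -> str:
    """collector|peer|prefix"""
    return "|".join([collector, peer, prefix])


def origin_state_key(prefix: str, origin_asn: int) -> str:
    """prefix|origin_asn"""
    return f"{prefix}|{origin_asn}"
```

Because `|` appears in neither IP addresses nor CIDR prefixes, the keys can be split back into their parts without escaping.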

metadata schema for raw events

Store the normalized event payload in metadata:

{
  "project": "ris-live",
  "collector": "rrc00",
  "peer_asn": 3333,
  "peer_ip": "2001:db8::1",
  "event_type": "announcement",
  "prefix": "203.0.113.0/24",
  "origin_asn": 64496,
  "as_path": [3333, 64500, 64496],
  "communities": ["3333:100", "64500:1"],
  "next_hop": "192.0.2.1",
  "med": 0,
  "local_pref": null,
  "timestamp": "2026-03-26T08:00:00Z",
  "raw_message": {}
}

New anomaly table

Add a new table; the recommended name is bgp_anomalies.

Suggested columns:

  • id
  • snapshot_id
  • task_id
  • source
  • anomaly_type
  • severity
  • status
  • entity_key
  • prefix
  • origin_asn
  • new_origin_asn
  • peer_scope
  • started_at
  • ended_at
  • confidence
  • summary
  • evidence
  • created_at

This table should represent derived intelligence, not raw updates.
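
As an illustration only, the suggested columns could map to a table like the one below. The sketch uses SQLite from the standard library so it runs anywhere; the column types and NOT NULL constraints are assumptions, and the real project would use its own database and migration tooling.

```python
import sqlite3

# Column names follow the list above; types and constraints are assumed.
BGP_ANOMALIES_DDL = """
CREATE TABLE IF NOT EXISTS bgp_anomalies (
    id             INTEGER PRIMARY KEY,
    snapshot_id    INTEGER,
    task_id        INTEGER,
    source         TEXT NOT NULL,
    anomaly_type   TEXT NOT NULL,
    severity       TEXT NOT NULL,
    status         TEXT NOT NULL,
    entity_key     TEXT NOT NULL,
    prefix         TEXT,
    origin_asn     INTEGER,
    new_origin_asn INTEGER,
    peer_scope     TEXT,
    started_at     TEXT NOT NULL,
    ended_at       TEXT,
    confidence     REAL,
    summary        TEXT,
    evidence       TEXT,   -- JSON-encoded supporting events
    created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the anomaly table if it does not exist yet."""
    conn.execute(BGP_ANOMALIES_DDL)
    conn.commit()
```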

Collector Design

1. RISLiveCollector

Responsibility:

  • maintain WebSocket connection
  • subscribe to relevant message types
  • normalize messages
  • write event batches into snapshots
  • optionally emit derived anomalies in near real time

Suggested runtime mode:

  • long-running background task

Suggested snapshot strategy:

  • one snapshot per rolling time window
  • for example every 1 minute or every 5 minutes
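
One way to assign events to rolling snapshots is to floor each event timestamp to the start of its window. The helper name `window_start` is hypothetical, and the sketch assumes timezone-aware timestamps and windows aligned to the Unix epoch.

```python
from datetime import datetime, timezone


def window_start(ts: datetime, window_seconds: int = 60) -> datetime:
    """Return the UTC start of the window that contains ts."""
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch - epoch % window_seconds, tz=timezone.utc)
```

All events that share a `window_start` value would then be written into the same data_snapshots row.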

2. BGPStreamBackfillCollector

Responsibility:

  • fetch historical data windows
  • normalize to the same schema as real-time data
  • build baselines
  • re-run anomaly rules on past windows if needed

Suggested runtime mode:

  • scheduled task
  • or ad hoc task for investigations

Suggested snapshot strategy:

  • one snapshot per historical query window

Normalization Rules

Normalize both sources into the same internal event model.

Required normalized fields:

  • collector
  • peer_asn
  • peer_ip
  • event_type
  • prefix
  • origin_asn
  • as_path
  • timestamp

Derived normalized fields:

  • as_path_length
  • country_guess
  • prefix_length
  • is_more_specific
  • visibility_weight
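
A sketch of the normalization step for one RIS Live message. The input field names (`host`, `peer`, `peer_asn`, `path`, `announcements`, `withdrawals`) reflect the RIS Live message format as assumed here and should be checked against real traffic; `normalize_ris_message` is a hypothetical name. Only two derived fields are computed, and an AS_SET at the end of the path leaves `origin_asn` unset.

```python
import ipaddress
from datetime import datetime, timezone
from typing import Any, Dict, List


def normalize_ris_message(data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Expand one RIS Live UPDATE into one normalized event per prefix."""
    ts = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    path = data.get("path") or []
    last = path[-1] if path else None
    origin = last if isinstance(last, int) else None  # AS_SET origin stays unresolved
    base = {
        "collector": data.get("host"),
        "peer_asn": int(data["peer_asn"]),
        "peer_ip": data.get("peer"),
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    events = []
    for ann in data.get("announcements", []):
        for prefix in ann.get("prefixes", []):
            events.append({
                **base, "event_type": "announcement", "prefix": prefix,
                "origin_asn": origin, "as_path": path,
                "as_path_length": len(path),
                "prefix_length": ipaddress.ip_network(prefix, strict=False).prefixlen,
            })
    for prefix in data.get("withdrawals", []):
        events.append({
            **base, "event_type": "withdrawal", "prefix": prefix,
            "origin_asn": None, "as_path": [], "as_path_length": 0,
            "prefix_length": ipaddress.ip_network(prefix, strict=False).prefixlen,
        })
    return events
```

A BGPStream normalizer would produce dictionaries with the same keys, so downstream rules never need to know which source an event came from.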

Anomaly Detection Rules

Start with these five rules.

1. Origin ASN Change

Trigger when:

  • the same prefix is announced by a new origin ASN not seen in the baseline window

Use for:

  • hijack suspicion
  • origin drift detection
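
A minimal sketch of this rule, assuming the baseline is available as a mapping from prefix to the set of origin ASNs seen in the baseline window. The function name and the returned dictionary shape are hypothetical.

```python
from typing import Any, Dict, Optional, Set


def detect_origin_change(event: Dict[str, Any],
                         baseline_origins: Dict[str, Set[int]]) -> Optional[Dict[str, Any]]:
    """Flag an announcement whose origin ASN is absent from the prefix baseline."""
    if event.get("event_type") != "announcement":
        return None
    origin = event.get("origin_asn")
    known = baseline_origins.get(event["prefix"])
    # Prefixes without any baseline are skipped: they are new, not changed.
    if origin is None or not known or origin in known:
        return None
    return {
        "anomaly_type": "origin_change",
        "prefix": event["prefix"],
        "new_origin_asn": origin,
        "expected_origins": sorted(known),
    }
```

Skipping prefixes with no baseline keeps newly announced address space from flooding the anomaly table; those cases are better handled by the more-specific rule.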

2. More-Specific Burst

Trigger when:

  • a more-specific prefix appears suddenly
  • especially from an unexpected origin ASN

Use for:

  • subprefix hijack suspicion
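
The more-specific check itself can be sketched with the standard `ipaddress` module. The function name `find_covering_prefix` is hypothetical, and the linear scan is only suitable for small baselines; a radix trie would be the usual choice at full-table scale.

```python
import ipaddress
from typing import Iterable, Optional


def find_covering_prefix(prefix: str, baseline_prefixes: Iterable[str]) -> Optional[str]:
    """Return the longest baseline prefix that strictly covers `prefix`, if any."""
    net = ipaddress.ip_network(prefix, strict=False)
    best = None
    for candidate in baseline_prefixes:
        cover = ipaddress.ip_network(candidate, strict=False)
        # subnet_of raises TypeError across IP versions, so compare versions first.
        if cover.version != net.version or cover == net:
            continue
        if net.subnet_of(cover) and (best is None or cover.prefixlen > best.prefixlen):
            best = cover
    return str(best) if best else None
```

A non-`None` result means the announced prefix is a more-specific of something already in the baseline; comparing its origin ASN with the covering prefix's usual origin then separates traffic engineering from a suspected subprefix hijack.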

3. Mass Withdrawal

Trigger when:

  • the same prefix or ASN sees many withdrawals across collectors within a short window

Use for:

  • outage suspicion
  • regional incident detection
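
A sliding-window sketch for the per-prefix case. The class name, the 5-minute window, and the collector threshold are placeholder assumptions to be tuned against real data, and events are assumed to arrive in roughly increasing time order.

```python
from collections import defaultdict, deque


class WithdrawalBurstDetector:
    """Track withdrawals per prefix and flag bursts seen by many collectors."""

    def __init__(self, window_seconds: float = 300, min_collectors: int = 5):
        self.window = window_seconds
        self.min_collectors = min_collectors
        self._seen = defaultdict(deque)   # prefix -> deque of (timestamp, collector)

    def observe(self, prefix: str, collector: str, ts: float) -> bool:
        """Record one withdrawal; return True when the burst threshold is met."""
        queue = self._seen[prefix]
        queue.append((ts, collector))
        # Drop observations that have fallen out of the window.
        while queue and queue[0][0] < ts - self.window:
            queue.popleft()
        return len({c for _, c in queue}) >= self.min_collectors
```

Counting distinct collectors rather than raw withdrawals keeps a single flapping peer from triggering the rule on its own.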

4. Path Deviation

Trigger when:

  • AS path length jumps sharply
  • or a rarely seen transit ASN appears
  • or path frequency drops below baseline norms

Use for:

  • route leak suspicion
  • unusual path diversion
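
The first two triggers could be sketched like this, assuming the baseline is a list of AS paths previously observed for the same prefix. The function name and both thresholds are hypothetical, and the path-frequency trigger is not covered.

```python
from collections import Counter
from statistics import median
from typing import List, Sequence


def path_deviation_reasons(as_path: Sequence[int],
                           baseline_paths: Sequence[Sequence[int]],
                           length_jump: int = 3,
                           rare_share: float = 0.01) -> List[str]:
    """Return the reasons, if any, why as_path deviates from the baseline."""
    reasons: List[str] = []
    if not baseline_paths:
        return reasons
    typical_length = median(len(p) for p in baseline_paths)
    if len(as_path) - typical_length >= length_jump:
        reasons.append("path_length_jump")
    # Share of baseline paths in which each non-origin ASN appears.
    seen_in = Counter(asn for p in baseline_paths for asn in set(p[:-1]))
    for asn in dict.fromkeys(as_path[:-1]):   # non-origin hops, order kept, no repeats
        if seen_in[asn] / len(baseline_paths) < rare_share:
            reasons.append(f"rare_transit_asn:{asn}")
    return reasons
```

An empty list means the path looks normal; a non-empty list can be stored in the anomaly's evidence field.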

5. Visibility Drop

Trigger when:

  • a prefix is visible from far fewer collectors/peers than its baseline

Use for:

  • regional reachability degradation
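
A sketch of the comparison, assuming visibility is measured as the number of distinct (collector, peer) pairs that currently carry the prefix. The function name and the 50% threshold are placeholders.

```python
from typing import Any, Dict, Optional, Set, Tuple


def visibility_drop(current_peers: Set[Tuple[str, str]],
                    baseline_peer_count: float,
                    max_ratio: float = 0.5) -> Optional[Dict[str, Any]]:
    """Flag a prefix seen by fewer than max_ratio of its baseline peers."""
    if baseline_peer_count <= 0:
        return None                       # no baseline, nothing to compare
    ratio = len(current_peers) / baseline_peer_count
    if ratio < max_ratio:
        return {"anomaly_type": "visibility_drop",
                "visibility_ratio": round(ratio, 3)}
    return None
```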

Baseline Strategy

Use BGPStream historical data to build:

  • common origin ASN per prefix
  • common AS path patterns
  • collector visibility distribution
  • normal withdrawal frequency

Recommended baseline windows:

  • short baseline: last 24 hours
  • medium baseline: last 7 days
  • long baseline: last 30 days

The first implementation can start with only the 7-day baseline.
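
As an example of the first baseline item, common origin ASNs per prefix can be derived by counting announcements over the window. The function name and the 5% minimum share are assumptions; the share filter keeps a rare, possibly malicious origin from becoming part of the baseline.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Set


def build_origin_baseline(events: Iterable[Dict[str, Any]],
                          min_share: float = 0.05) -> Dict[str, Set[int]]:
    """Map each prefix to the origin ASNs that announce it often enough."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for event in events:
        if event.get("event_type") == "announcement" and event.get("origin_asn") is not None:
            counts[event["prefix"]][event["origin_asn"]] += 1
    baseline: Dict[str, Set[int]] = {}
    for prefix, per_origin in counts.items():
        total = sum(per_origin.values())
        baseline[prefix] = {asn for asn, n in per_origin.items() if n / total >= min_share}
    return baseline
```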

API Design

Raw event API

Add endpoints like:

  • GET /api/v1/bgp/events
  • GET /api/v1/bgp/events/{id}

Suggested filters:

  • prefix
  • origin_asn
  • peer_asn
  • collector
  • event_type
  • time_from
  • time_to
  • source

Anomaly API

Add endpoints like:

  • GET /api/v1/bgp/anomalies
  • GET /api/v1/bgp/anomalies/{id}
  • GET /api/v1/bgp/anomalies/summary

Suggested filters:

  • severity
  • anomaly_type
  • status
  • prefix
  • origin_asn
  • time_from
  • time_to

Visualization API

Add an Earth-oriented endpoint like:

  • GET /api/v1/visualization/geo/bgp-anomalies

Recommended feature shapes:

  • point: collector locations
  • arc: inferred propagation or suspicious path edge
  • pulse point: active anomaly hotspot
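
If the endpoint returns GeoJSON (an assumption; the response format is not fixed above), one anomaly could be rendered as a pulse point plus an arc. The function name, the `shape` property, and the coordinate inputs are hypothetical; GeoJSON coordinates are ordered longitude, latitude.

```python
from typing import Any, Dict, Tuple

LonLat = Tuple[float, float]


def anomaly_to_features(anomaly: Dict[str, Any],
                        origin_lonlat: LonLat,
                        collector_lonlat: LonLat) -> Dict[str, Any]:
    """Build a FeatureCollection with a pulse point and an arc for one anomaly."""
    props = {
        "anomaly_id": anomaly.get("id"),
        "anomaly_type": anomaly.get("anomaly_type"),
        "severity": anomaly.get("severity"),
    }
    pulse = {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": list(origin_lonlat)},
        "properties": {**props, "shape": "pulse_point"},
    }
    arc = {
        "type": "Feature",
        "geometry": {"type": "LineString",
                     "coordinates": [list(origin_lonlat), list(collector_lonlat)]},
        "properties": {**props, "shape": "arc"},
    }
    return {"type": "FeatureCollection", "features": [pulse, arc]}
```

The front end would draw the LineString as a great-circle arc; the geometry only carries the two endpoints.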

Earth Big-Screen Design

Recommended layers:

Layer 1: Collector layer

Show known collector locations and current activity intensity.

Layer 2: Route propagation arcs

Use arcs for:

  • origin ASN country to collector country
  • or collector-to-collector visibility edges

Important note:

This is an inferred propagation view, not real packet flow.

Layer 3: Active anomaly overlay

Show:

  • hijack suspicion in red
  • mass withdrawal in orange
  • visibility drop in yellow
  • path deviation in blue

Layer 4: Time playback

Use data_snapshots to replay:

  • minute-by-minute route changes
  • anomaly expansion
  • recovery timeline

Alerting Strategy

Map anomaly severity to the current alert system.

Recommended severity mapping:

  • critical
    • likely hijack
    • very large withdrawal burst
  • high
    • clear origin change
    • large visibility drop
  • medium
    • unusual path change
    • moderate more-specific burst
  • low
    • weak or localized anomalies
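
One possible encoding of this mapping is an ordered rule table. The anomaly type names and the confidence thresholds standing in for "likely", "clear", and "moderate" are placeholder assumptions, not agreed values.

```python
# (anomaly_type, minimum confidence, severity); the first matching rule wins.
SEVERITY_RULES = [
    ("origin_change", 0.9, "critical"),        # likely hijack
    ("mass_withdrawal", 0.9, "critical"),      # very large withdrawal burst
    ("origin_change", 0.6, "high"),            # clear origin change
    ("visibility_drop", 0.6, "high"),          # large visibility drop
    ("path_deviation", 0.4, "medium"),         # unusual path change
    ("more_specific_burst", 0.4, "medium"),    # moderate more-specific burst
]


def severity_for(anomaly_type: str, confidence: float) -> str:
    """Map an anomaly to a severity label; anything unmatched is 'low'."""
    for rule_type, min_confidence, severity in SEVERITY_RULES:
        if anomaly_type == rule_type and confidence >= min_confidence:
            return severity
    return "low"
```

Keeping the rules in data rather than in branching code makes the thresholds easy to review and adjust as false-positive rates become known.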

Delivery Plan

Phase 1

  • add RISLiveCollector
  • normalize updates into collected_data
  • create bgp_anomalies
  • implement 3 rules:
    • origin change
    • more-specific burst
    • mass withdrawal

Phase 2

  • add BGPStreamBackfillCollector
  • build 7-day baseline
  • implement:
    • path deviation
    • visibility drop

Phase 3

  • add Earth visualization layer
  • add time playback
  • add anomaly filtering and drilldown

Practical Implementation Notes

  • Start with IPv4, then add IPv6 once the event schema is stable.
  • Store the original raw payload in metadata.raw_message for traceability.
  • Deduplicate events by a stable hash of collector, peer, prefix, type, and timestamp.
  • Keep anomaly generation idempotent so replay and backfill do not create duplicate alerts.
  • Expect noisy data and partial views; confidence scoring matters.
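
The deduplication note could be implemented along these lines. SHA-256 is one reasonable choice of stable hash, the function names are hypothetical, and the peer is identified by `peer_ip` here as an assumption.

```python
import hashlib
from typing import Any, Dict, Iterable, List


def event_fingerprint(collector: str, peer: str, prefix: str,
                      event_type: str, timestamp: str) -> str:
    """Stable hash over the fields that identify one BGP event."""
    raw = "|".join([collector, peer, prefix, event_type, timestamp])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def deduplicate(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each event, preserving input order."""
    seen = set()
    unique = []
    for event in events:
        fingerprint = event_fingerprint(event["collector"], event["peer_ip"],
                                        event["prefix"], event["event_type"],
                                        event["timestamp"])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(event)
    return unique
```

Storing the fingerprint alongside each row, with a unique constraint on it, would extend the same guarantee across replay and backfill runs.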

The first code milestone should include:

  1. backend/app/services/collectors/ris_live.py
  2. backend/app/services/collectors/bgpstream.py
  3. backend/app/models/bgp_anomaly.py
  4. backend/app/api/v1/bgp.py
  5. backend/app/api/v1/visualization.py: add a BGP anomaly geo endpoint
  6. frontend/src/pages: add a BGP anomaly list or summary page
  7. frontend/public/earth/js: add a BGP anomaly rendering layer
