# BGP Observability Plan

## Goal

Build a global routing observability capability on top of:

- [RIPE RIS Live](https://ris-live.ripe.net/)
- [CAIDA BGPStream data access overview](https://bgpstream.caida.org/docs/overview/data-access)

The target is to support:

- real-time routing event ingestion
- historical replay and baseline analysis
- anomaly detection
- Earth big-screen visualization

## Important Scope Note

These data sources expose the BGP control plane, not user traffic itself.

That means the system can infer:

- route propagation direction
- prefix reachability changes
- AS path changes
- visibility changes across collectors

But it cannot directly measure:

- exact application traffic volume
- exact user packet path
- real bandwidth consumption between countries or operators

Product wording should therefore use phrases like:

- global routing propagation
- route visibility
- control-plane anomalies
- suspected path diversion

Instead of claiming direct traffic measurement.

## Data Source Roles

### RIS Live

Use RIS Live as the real-time feed.

Recommended usage:

- subscribe to update streams over WebSocket
- ingest announcements and withdrawals continuously
- trigger low-latency alerts

Best suited for:

- hijack suspicion
- withdrawal bursts
- real-time path changes
- live Earth event overlay

### BGPStream

Use BGPStream as the historical and replay layer.
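
As a rough sketch of the replay layer, the snippet below reads a historical window through the `pybgpstream` Python bindings and maps each element onto a normalized event dictionary whose keys follow the `metadata` schema proposed later in this plan. The function names `normalize_elem` and `replay_window`, the `"other"` fallback label, and the sample values are illustrative assumptions, not existing project code. `pybgpstream` is imported lazily so the normalization step can be exercised without the library installed.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# BGPStream elem types mapped to event labels (labels are an assumption).
EVENT_TYPES = {"A": "announcement", "W": "withdrawal", "R": "rib"}


def normalize_elem(elem):
    """Map a BGPStream elem onto a normalized event dict (hypothetical helper)."""
    fields = elem.fields
    tokens = fields.get("as-path", "").split()
    # Plain ASNs become ints; AS sets such as "{64500,64501}" stay as strings.
    as_path = [int(t) if t.isdigit() else t for t in tokens]
    # The origin is the last hop, left as None when it is an AS set or missing.
    origin = as_path[-1] if as_path and isinstance(as_path[-1], int) else None
    ts = datetime.fromtimestamp(elem.time, tz=timezone.utc)
    return {
        "project": elem.project,
        "collector": elem.collector,
        "peer_asn": elem.peer_asn,
        "peer_ip": elem.peer_address,
        "event_type": EVENT_TYPES.get(elem.type, "other"),
        "prefix": fields.get("prefix"),
        "origin_asn": origin,
        "as_path": as_path,
        "communities": sorted(fields.get("communities", [])),
        "next_hop": fields.get("next-hop"),
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def replay_window(from_time, until_time, collectors):
    """Yield normalized announcements and withdrawals for one historical window."""
    import pybgpstream  # lazy import: only needed for real replays

    stream = pybgpstream.BGPStream(
        from_time=from_time,
        until_time=until_time,
        collectors=collectors,
        record_type="updates",
    )
    for elem in stream:
        if elem.type in ("A", "W"):
            yield normalize_elem(elem)


# Stand-in object with the same attributes as a BGPStream elem (sample values).
sample = SimpleNamespace(
    project="ris", collector="rrc00", peer_asn=3333, peer_address="2001:db8::1",
    type="A", time=0.0,
    fields={
        "prefix": "203.0.113.0/24",
        "as-path": "3333 64500 64496",
        "next-hop": "192.0.2.1",
        "communities": {"3333:100", "64500:1"},
    },
)
event = normalize_elem(sample)
```

With the library installed, something like `replay_window("2026-03-26 08:00:00", "2026-03-26 08:05:00", ["rrc00"])` would yield events for a five-minute window; keeping normalization separate from the stream makes it easier to reuse the same mapping for other sources.
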

Recommended usage:

- backfill time windows
- build normal baselines
- compare current events against history
- support investigations and playback

Best suited for:

- historical anomaly confirmation
- baseline path frequency
- visibility baselines
- postmortem analysis

## Recommended Architecture

```mermaid
flowchart LR
    A["RIS Live WebSocket"] --> B["Realtime Collector"]
    C["BGPStream Historical Access"] --> D["Backfill Collector"]
    B --> E["Normalization Layer"]
    D --> E
    E --> F["data_snapshots"]
    E --> G["collected_data"]
    E --> H["bgp_anomalies"]
    H --> I["Alerts API"]
    G --> J["Visualization API"]
    H --> J
    J --> K["Earth Big Screen"]
```

## Storage Design

The current project already has:

- [data_snapshot.py](/home/ray/dev/linkong/planet/backend/app/models/data_snapshot.py)
- [collected_data.py](/home/ray/dev/linkong/planet/backend/app/models/collected_data.py)

So the lowest-risk path is:

1. keep raw and normalized BGP events in `collected_data`
2. use `data_snapshots` to group each ingest window
3. add a dedicated anomaly table for higher-value derived events

## Proposed Data Types

### `collected_data`

Use these `source` values:

- `ris_live_bgp`
- `bgpstream_bgp`

Use these `data_type` values:

- `bgp_update`
- `bgp_rib`
- `bgp_visibility`
- `bgp_path_change`

Recommended stable fields:

- `source`
- `source_id`
- `entity_key`
- `data_type`
- `name`
- `reference_date`
- `metadata`

Recommended `entity_key` strategy:

- event entity: `collector|peer|prefix|event_time`
- prefix state entity: `collector|peer|prefix`
- origin state entity: `prefix|origin_asn`

### `metadata` schema for raw events

Store the normalized event payload in `metadata`:

```json
{
  "project": "ris-live",
  "collector": "rrc00",
  "peer_asn": 3333,
  "peer_ip": "2001:db8::1",
  "event_type": "announcement",
  "prefix": "203.0.113.0/24",
  "origin_asn": 64496,
  "as_path": [3333, 64500, 64496],
  "communities": ["3333:100", "64500:1"],
  "next_hop": "192.0.2.1",
  "med": 0,
  "local_pref": null,
  "timestamp": "2026-03-26T08:00:00Z",
  "raw_message": {}
}
```

### New anomaly table

Add a new table, recommended name: `bgp_anomalies`

Suggested columns:

- `id`
- `snapshot_id`
- `task_id`
- `source`
- `anomaly_type`
- `severity`
- `status`
- `entity_key`
- `prefix`
- `origin_asn`
- `new_origin_asn`
- `peer_scope`
- `started_at`
- `ended_at`
- `confidence`
- `summary`
- `evidence`
- `created_at`

This table should represent derived intelligence, not raw updates.

## Collector Design

### 1. `RISLiveCollector`

Responsibility:

- maintain WebSocket connection
- subscribe to relevant message types
- normalize messages
- write event batches into snapshots
- optionally emit derived anomalies in near real time

Suggested runtime mode:

- long-running background task

Suggested snapshot strategy:

- one snapshot per rolling time window
- for example every 1 minute or every 5 minutes

### 2. `BGPStreamBackfillCollector`

Responsibility:

- fetch historical data windows
- normalize to the same schema as real-time data
- build baselines
- re-run anomaly rules on past windows if needed

Suggested runtime mode:

- scheduled task
- or ad hoc task for investigations

Suggested snapshot strategy:

- one snapshot per historical query window

## Normalization Rules

Normalize both sources into the same internal event model.

Required normalized fields:

- `collector`
- `peer_asn`
- `peer_ip`
- `event_type`
- `prefix`
- `origin_asn`
- `as_path`
- `timestamp`

Derived normalized fields:

- `as_path_length`
- `country_guess`
- `prefix_length`
- `is_more_specific`
- `visibility_weight`

## Anomaly Detection Rules

Start with these five rules first.

### 1. Origin ASN Change

Trigger when:

- the same prefix is announced by a new origin ASN not seen in the baseline window

Use for:

- hijack suspicion
- origin drift detection

### 2. More-Specific Burst

Trigger when:

- a more-specific prefix appears suddenly
- especially from an unexpected origin ASN

Use for:

- subprefix hijack suspicion

### 3. Mass Withdrawal

Trigger when:

- the same prefix or ASN sees many withdrawals across collectors within a short window

Use for:

- outage suspicion
- regional incident detection

### 4. Path Deviation

Trigger when:

- AS path length jumps sharply
- or a rarely seen transit ASN appears
- or path frequency drops below baseline norms

Use for:

- route leak suspicion
- unusual path diversion

### 5. Visibility Drop

Trigger when:

- a prefix is visible from far fewer collectors/peers than its baseline

Use for:

- regional reachability degradation

## Baseline Strategy

Use BGPStream historical data to build:

- common origin ASN per prefix
- common AS path patterns
- collector visibility distribution
- normal withdrawal frequency

Recommended baseline windows:

- short baseline: last 24 hours
- medium baseline: last 7 days
- long baseline: last 30 days

The first implementation can start with only the 7-day baseline.

## API Design

### Raw event API

Add endpoints like:

- `GET /api/v1/bgp/events`
- `GET /api/v1/bgp/events/{id}`

Suggested filters:

- `prefix`
- `origin_asn`
- `peer_asn`
- `collector`
- `event_type`
- `time_from`
- `time_to`
- `source`

### Anomaly API

Add endpoints like:

- `GET /api/v1/bgp/anomalies`
- `GET /api/v1/bgp/anomalies/{id}`
- `GET /api/v1/bgp/anomalies/summary`

Suggested filters:

- `severity`
- `anomaly_type`
- `status`
- `prefix`
- `origin_asn`
- `time_from`
- `time_to`

### Visualization API

Add an Earth-oriented endpoint like:

- `GET /api/v1/visualization/geo/bgp-anomalies`

Recommended feature shapes:

- point: collector locations
- arc: inferred propagation or suspicious path edge
- pulse point: active anomaly hotspot

## Earth Big-Screen Design

Recommended layers:

### Layer 1: Collector layer

Show known collector locations and current activity intensity.

### Layer 2: Route propagation arcs

Use arcs for:

- origin ASN country to collector country
- or collector-to-collector visibility edges

Important note:

This is an inferred propagation view, not real packet flow.

### Layer 3: Active anomaly overlay

Show:

- hijack suspicion in red
- mass withdrawal in orange
- visibility drop in yellow
- path deviation in blue

### Layer 4: Time playback

Use `data_snapshots` to replay:

- minute-by-minute route changes
- anomaly expansion
- recovery timeline

## Alerting Strategy

Map anomaly severity to the current alert system.
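
One way to express the severity mapping in this section is an ordered rule table that is checked from most to least severe. This is only a sketch: the `anomaly_type` identifiers, the `map_severity` helper, and every threshold below are placeholder assumptions that would need tuning against real baselines.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    anomaly_type: str      # hypothetical identifier stored in bgp_anomalies.anomaly_type
    min_confidence: float  # placeholder threshold, not a tuned value
    min_peers: int         # placeholder count of peers/collectors that saw the anomaly
    severity: str


# Ordered from most to least severe; the first matching rule wins.
RULES = [
    Rule("origin_change", 0.9, 1, "critical"),       # likely hijack
    Rule("mass_withdrawal", 0.0, 50, "critical"),    # very large withdrawal burst
    Rule("origin_change", 0.6, 1, "high"),           # clear origin change
    Rule("visibility_drop", 0.0, 20, "high"),        # large visibility drop
    Rule("path_deviation", 0.5, 1, "medium"),        # unusual path change
    Rule("more_specific_burst", 0.5, 1, "medium"),   # moderate more-specific burst
]


def map_severity(anomaly_type: str, confidence: float, affected_peers: int) -> str:
    """Return the severity of the first matching rule, or "low" if none match."""
    for rule in RULES:
        if (
            rule.anomaly_type == anomaly_type
            and confidence >= rule.min_confidence
            and affected_peers >= rule.min_peers
        ):
            return rule.severity
    # Weak or localized anomalies fall through to the lowest level.
    return "low"
```

Keeping the rules as data rather than nested conditionals makes it straightforward to adjust thresholds or load them from configuration later.
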

Recommended severity mapping:

- `critical`
  - likely hijack
  - very large withdrawal burst
- `high`
  - clear origin change
  - large visibility drop
- `medium`
  - unusual path change
  - moderate more-specific burst
- `low`
  - weak or localized anomalies

## Delivery Plan

### Phase 1

- add `RISLiveCollector`
- normalize updates into `collected_data`
- create `bgp_anomalies`
- implement 3 rules:
  - origin change
  - more-specific burst
  - mass withdrawal

### Phase 2

- add `BGPStreamBackfillCollector`
- build 7-day baseline
- implement:
  - path deviation
  - visibility drop

### Phase 3

- add Earth visualization layer
- add time playback
- add anomaly filtering and drilldown

## Practical Implementation Notes

- Start with IPv4 first, then add IPv6 after the event schema is stable.
- Store the original raw payload in `metadata.raw_message` for traceability.
- Deduplicate events by a stable hash of collector, peer, prefix, type, and timestamp.
- Keep anomaly generation idempotent so replay and backfill do not create duplicate alerts.
- Expect noisy data and partial views; confidence scoring matters.

## Recommended First Patch Set

The first code milestone should include:

1. `backend/app/services/collectors/ris_live.py`
2. `backend/app/services/collectors/bgpstream.py`
3. `backend/app/models/bgp_anomaly.py`
4. `backend/app/api/v1/bgp.py`
5. `backend/app/api/v1/visualization.py` add BGP anomaly geo endpoint
6. `frontend/src/pages` add a BGP anomaly list or summary page
7. `frontend/public/earth/js` add BGP anomaly rendering layer

## Sources

- [RIPE RIS Live](https://ris-live.ripe.net/)
- [CAIDA BGPStream Data Access Overview](https://bgpstream.caida.org/docs/overview/data-access)