# BGP Observability Plan

## Goal

Build a global routing observability capability on top of:

- [RIPE RIS Live](https://ris-live.ripe.net/)
- [CAIDA BGPStream data access overview](https://bgpstream.caida.org/docs/overview/data-access)

The target is to support:

- real-time routing event ingestion
- historical replay and baseline analysis
- anomaly detection
- Earth big-screen visualization

## Important Scope Note

These data sources expose the BGP control plane, not user traffic itself.

That means the system can infer:

- route propagation direction
- prefix reachability changes
- AS path changes
- visibility changes across collectors

But it cannot directly measure:

- exact application traffic volume
- exact user packet path
- real bandwidth consumption between countries or operators

Product wording should therefore avoid claiming direct traffic measurement and instead use phrases like:

- global routing propagation
- route visibility
- control-plane anomalies
- suspected path diversion

## Data Source Roles

### RIS Live

Use RIS Live as the real-time feed.

Recommended usage:

- subscribe to update streams over WebSocket
- ingest announcements and withdrawals continuously
- trigger low-latency alerts

Best suited for:

- hijack suspicion
- withdrawal bursts
- real-time path changes
- live Earth event overlay

### BGPStream

Use BGPStream as the historical and replay layer.

Recommended usage:

- backfill time windows
- build normal baselines
- compare current events against history
- support investigations and playback

Best suited for:

- historical anomaly confirmation
- baseline path frequency
- visibility baselines
- postmortem analysis

## Recommended Architecture

```mermaid
flowchart LR
    A["RIS Live WebSocket"] --> B["Realtime Collector"]
    C["BGPStream Historical Access"] --> D["Backfill Collector"]
    B --> E["Normalization Layer"]
    D --> E
    E --> F["data_snapshots"]
    E --> G["collected_data"]
    E --> H["bgp_anomalies"]
    H --> I["Alerts API"]
    G --> J["Visualization API"]
    H --> J
    J --> K["Earth Big Screen"]
```

## Storage Design

The current project already has:

- [data_snapshot.py](/home/ray/dev/linkong/planet/backend/app/models/data_snapshot.py)
- [collected_data.py](/home/ray/dev/linkong/planet/backend/app/models/collected_data.py)

So the lowest-risk path is:

1. keep raw and normalized BGP events in `collected_data`
2. use `data_snapshots` to group each ingest window
3. add a dedicated anomaly table for higher-value derived events

## Proposed Data Types

### `collected_data`

Use these `source` values:

- `ris_live_bgp`
- `bgpstream_bgp`

Use these `data_type` values:

- `bgp_update`
- `bgp_rib`
- `bgp_visibility`
- `bgp_path_change`

Recommended stable fields:

- `source`
- `source_id`
- `entity_key`
- `data_type`
- `name`
- `reference_date`
- `metadata`

Recommended `entity_key` strategy:

- event entity: `collector|peer|prefix|event_time`
- prefix state entity: `collector|peer|prefix`
- origin state entity: `prefix|origin_asn`

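The three `entity_key` shapes listed above could be built with small helpers like the following. This is a minimal sketch, not the project's implementation: the function names are hypothetical, `peer` is assumed to be the peer IP, and `event_time` is assumed to be a timezone-aware datetime rendered at second resolution in UTC.

```python
from datetime import datetime, timezone


def event_entity_key(collector: str, peer: str, prefix: str, event_time: datetime) -> str:
    """Key for one event: collector|peer|prefix|event_time."""
    # Render in UTC so the same instant always produces the same key.
    # Assumption: event_time is timezone-aware; second resolution is enough.
    ts = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "|".join([collector, peer, prefix, ts])


def prefix_state_entity_key(collector: str, peer: str, prefix: str) -> str:
    """Key for the latest state of a prefix as seen by one peer."""
    return "|".join([collector, peer, prefix])


def origin_state_entity_key(prefix: str, origin_asn: int) -> str:
    """Key for a prefix/origin pair, independent of collector and peer."""
    return f"{prefix}|{origin_asn}"
```

Neither IPv4 nor IPv6 prefixes contain `|`, so the separator stays unambiguous. If sub-second updates for the same peer and prefix need to be kept apart, the timestamp format would need more precision.
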
### `metadata` schema for raw events

Store the normalized event payload in `metadata`:

```json
{
  "project": "ris-live",
  "collector": "rrc00",
  "peer_asn": 3333,
  "peer_ip": "2001:db8::1",
  "event_type": "announcement",
  "prefix": "203.0.113.0/24",
  "origin_asn": 64496,
  "as_path": [3333, 64500, 64496],
  "communities": ["3333:100", "64500:1"],
  "next_hop": "192.0.2.1",
  "med": 0,
  "local_pref": null,
  "timestamp": "2026-03-26T08:00:00Z",
  "raw_message": {}
}
```

### New anomaly table

Add a new table, recommended name: `bgp_anomalies`

Suggested columns:

- `id`
- `snapshot_id`
- `task_id`
- `source`
- `anomaly_type`
- `severity`
- `status`
- `entity_key`
- `prefix`
- `origin_asn`
- `new_origin_asn`
- `peer_scope`
- `started_at`
- `ended_at`
- `confidence`
- `summary`
- `evidence`
- `created_at`

This table should represent derived intelligence, not raw updates.

## Collector Design

### 1. `RISLiveCollector`

Responsibility:

- maintain WebSocket connection
- subscribe to relevant message types
- normalize messages
- write event batches into snapshots
- optionally emit derived anomalies in near real time

Suggested runtime mode:

- long-running background task

Suggested snapshot strategy:

- one snapshot per rolling time window
- for example every 1 minute or every 5 minutes

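One way to assign a live event to its snapshot window is to align its timestamp to fixed, non-overlapping buckets. The sketch below assumes that reading of "rolling window"; the function name and the `window_seconds` parameter are illustrative and not part of the existing codebase.

```python
from datetime import datetime, timedelta, timezone


def snapshot_window(event_time: datetime, window_seconds: int = 60):
    """Return the (start, end) of the fixed window containing event_time.

    Windows are aligned to the Unix epoch, so every collector process
    computes the same boundaries for the same event.
    """
    epoch = int(event_time.timestamp())
    start = epoch - (epoch % window_seconds)
    start_dt = datetime.fromtimestamp(start, tz=timezone.utc)
    return start_dt, start_dt + timedelta(seconds=window_seconds)
```

For example, with `window_seconds=300` an event at 08:03:42 UTC falls into the window 08:00:00 to 08:05:00, and all events in that range would be written under the same snapshot.
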
### 2. `BGPStreamBackfillCollector`

Responsibility:

- fetch historical data windows
- normalize to the same schema as real-time data
- build baselines
- re-run anomaly rules on past windows if needed

Suggested runtime mode:

- scheduled task
- or ad hoc task for investigations

Suggested snapshot strategy:

- one snapshot per historical query window

## Normalization Rules

Normalize both sources into the same internal event model.

Required normalized fields:

- `collector`
- `peer_asn`
- `peer_ip`
- `event_type`
- `prefix`
- `origin_asn`
- `as_path`
- `timestamp`

Derived normalized fields:

- `as_path_length`
- `country_guess`
- `prefix_length`
- `is_more_specific`
- `visibility_weight`

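As an illustration of the normalization step, the sketch below flattens one RIS Live update payload into per-prefix events carrying the required fields plus `as_path_length` and `prefix_length`. The input field names (`host`, `peer`, `peer_asn`, `timestamp`, `path`, `announcements`, `withdrawals`) reflect an assumed reading of the RIS Live message format and should be checked against the RIS Live manual; the function name is hypothetical.

```python
import ipaddress
from datetime import datetime, timezone


def normalize_ris_message(data: dict) -> list:
    """Flatten one RIS Live UPDATE payload into normalized per-prefix events."""
    path = list(data.get("path") or [])
    # The origin is the last path element. An AS_SET (nested list) has no
    # single origin, so origin_asn is left as None in that case.
    last = path[-1] if path else None
    origin_asn = last if isinstance(last, int) else None
    base = {
        "project": "ris-live",
        # Assumption: host may be "rrc00" or "rrc00.ripe.net".
        "collector": str(data.get("host", "")).split(".")[0],
        "peer_asn": int(data["peer_asn"]),
        "peer_ip": data["peer"],
        "timestamp": datetime.fromtimestamp(
            data["timestamp"], tz=timezone.utc
        ).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }

    def make(event_type, prefix, **extra):
        net = ipaddress.ip_network(prefix, strict=False)
        return {**base, "event_type": event_type, "prefix": str(net),
                "prefix_length": net.prefixlen, **extra}

    events = []
    for ann in data.get("announcements") or []:
        for prefix in ann.get("prefixes") or []:
            events.append(make("announcement", prefix, origin_asn=origin_asn,
                               as_path=path, as_path_length=len(path),
                               next_hop=ann.get("next_hop")))
    for prefix in data.get("withdrawals") or []:
        # Withdrawals carry no path attributes.
        events.append(make("withdrawal", prefix, origin_asn=None,
                           as_path=[], as_path_length=0, next_hop=None))
    return events
```

Note that `as_path_length` here counts prepended ASNs; whether to collapse prepends before measuring length is a design choice the plan leaves open.
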
## Anomaly Detection Rules

Start with these five rules.

### 1. Origin ASN Change

Trigger when:

- the same prefix is announced by a new origin ASN not seen in the baseline window

Use for:

- hijack suspicion
- origin drift detection

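A minimal sketch of this rule, assuming the baseline is available as a mapping from prefix to the set of origin ASNs seen in the baseline window. The function name, the `baseline_origins` structure, and the shape of the returned dictionary are illustrative assumptions.

```python
from typing import Optional


def detect_origin_change(event: dict, baseline_origins: dict) -> Optional[dict]:
    """Return an anomaly candidate if the event's origin is new for its prefix."""
    if event.get("event_type") != "announcement":
        return None
    origin = event.get("origin_asn")
    if origin is None:
        return None
    known = baseline_origins.get(event["prefix"])
    if not known:
        # Prefix absent from the baseline: that is a new prefix, not an
        # origin change, and is left to other rules.
        return None
    if origin in known:
        return None
    return {
        "anomaly_type": "origin_change",
        "prefix": event["prefix"],
        "new_origin_asn": origin,
        # Keep all baseline origins as evidence; MOAS prefixes have several.
        "evidence": {"baseline_origins": sorted(known)},
    }
```
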
### 2. More-Specific Burst

Trigger when:

- a more-specific prefix appears suddenly
- especially from an unexpected origin ASN

Use for:

- subprefix hijack suspicion

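The per-event part of this rule can be sketched with the standard `ipaddress` module: find the longest baseline prefix that covers a newly announced prefix, then check whether the origin is expected. The function names and the `baseline_origins` mapping (prefix to set of origin ASNs) are assumptions for illustration; counting such events over time to decide what constitutes a "burst" is left out.

```python
import ipaddress
from typing import Optional


def find_covering_prefix(prefix: str, baseline_prefixes) -> Optional[str]:
    """Return the longest baseline prefix that strictly covers `prefix`."""
    net = ipaddress.ip_network(prefix, strict=False)
    best = None
    for candidate in baseline_prefixes:
        cand = ipaddress.ip_network(candidate, strict=False)
        # Same address family, strictly shorter mask, and containing net.
        if (cand.version == net.version
                and cand.prefixlen < net.prefixlen
                and net.subnet_of(cand)):
            if best is None or cand.prefixlen > best.prefixlen:
                best = cand
    return str(best) if best is not None else None


def detect_more_specific(event: dict, baseline_origins: dict) -> Optional[dict]:
    """Flag an announcement of an unseen more-specific of a baseline prefix."""
    if event.get("event_type") != "announcement":
        return None
    if event["prefix"] in baseline_origins:
        return None  # already part of the baseline
    covering = find_covering_prefix(event["prefix"], baseline_origins)
    if covering is None:
        return None
    return {
        "anomaly_type": "more_specific",
        "prefix": event["prefix"],
        "covering_prefix": covering,
        "unexpected_origin": event.get("origin_asn") not in baseline_origins[covering],
    }
```

The linear scan is only suitable for small baselines; a full routing table would likely need a radix trie or similar longest-prefix-match structure.
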
### 3. Mass Withdrawal

Trigger when:

- the same prefix or ASN sees many withdrawals across collectors within a short window

Use for:

- outage suspicion
- regional incident detection

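A sliding-window counter is one way to express "many withdrawals across collectors within a short window". The sketch below tracks withdrawals per prefix and fires once enough distinct collectors have reported one inside the window. The class name and both thresholds are placeholder assumptions, and events are assumed to arrive in roughly increasing timestamp order.

```python
from collections import defaultdict, deque


class WithdrawalWindow:
    """Track recent withdrawals per prefix across collectors."""

    def __init__(self, window_seconds: float = 300, min_collectors: int = 3):
        self.window_seconds = window_seconds
        self.min_collectors = min_collectors
        # prefix -> deque of (timestamp, collector), oldest first
        self._events = defaultdict(deque)

    def observe(self, prefix: str, collector: str, ts: float) -> bool:
        """Record one withdrawal; return True if the rule now triggers."""
        q = self._events[prefix]
        q.append((ts, collector))
        # Drop entries that have fallen out of the window.
        while q and q[0][0] < ts - self.window_seconds:
            q.popleft()
        distinct = {c for _, c in q}
        return len(distinct) >= self.min_collectors
```

Keying the same structure by origin ASN instead of prefix would cover the "or ASN" variant of the rule.
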
### 4. Path Deviation

Trigger when:

- AS path length jumps sharply
- or a rarely seen transit ASN appears
- or path frequency drops below baseline norms

Use for:

- route leak suspicion
- unusual path diversion

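The first two triggers could be checked per announcement roughly as follows. This is a sketch under stated assumptions: `baseline_lengths` is a list of path lengths previously seen for the prefix, `transit_counts` maps ASN to how often it appeared as a transit hop in the baseline, and the thresholds `length_jump` and `min_seen` are placeholders to be tuned. The path-frequency trigger is not covered here.

```python
from statistics import median


def path_deviation_reasons(as_path, baseline_lengths, transit_counts,
                           length_jump: int = 3, min_seen: int = 2) -> list:
    """Return the reasons (possibly none) why a path deviates from baseline."""
    reasons = []
    # Trigger 1: path is much longer than the typical baseline length.
    if baseline_lengths and len(as_path) - median(baseline_lengths) >= length_jump:
        reasons.append("length_jump")
    # Trigger 2: a transit ASN (neither first hop nor origin) is rare.
    # dict.fromkeys de-duplicates prepended ASNs while keeping order;
    # AS_SET entries (nested lists) are skipped.
    transit = [a for a in as_path[1:-1] if isinstance(a, int)]
    for asn in dict.fromkeys(transit):
        if transit_counts.get(asn, 0) < min_seen:
            reasons.append(f"rare_transit:{asn}")
    return reasons
```
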
### 5. Visibility Drop

Trigger when:

- a prefix is visible from far fewer collectors/peers than its baseline

Use for:

- regional reachability degradation

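Expressed as a ratio check, the rule might look like the sketch below. It assumes current visibility is tracked as a set of `(collector, peer_ip)` pairs that still carry the prefix and that the baseline is the typical size of that set; the function name and the 50% default threshold are illustrative only.

```python
def visibility_drop(current_peers: set, baseline_peer_count: float,
                    drop_ratio: float = 0.5) -> bool:
    """Return True if current visibility is below drop_ratio of baseline."""
    if baseline_peer_count <= 0:
        # No baseline visibility means there is nothing to compare against.
        return False
    return len(current_peers) / baseline_peer_count < drop_ratio
```
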
## Baseline Strategy

Use BGPStream historical data to build:

- common origin ASN per prefix
- common AS path patterns
- collector visibility distribution
- normal withdrawal frequency

Recommended baseline windows:

- short baseline: last 24 hours
- medium baseline: last 7 days
- long baseline: last 30 days

The first implementation can start with only the 7-day baseline.

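As an example of the first baseline item, the sketch below derives the common origin ASNs per prefix from a stream of normalized events covering the baseline window. The function name and the `min_share` cutoff (which filters out origins seen only in a tiny fraction of announcements) are assumptions, not part of the plan.

```python
from collections import Counter, defaultdict


def build_origin_baseline(events, min_share: float = 0.05) -> dict:
    """Map each prefix to the set of origin ASNs commonly seen for it."""
    counts = defaultdict(Counter)
    for ev in events:
        if ev.get("event_type") == "announcement" and ev.get("origin_asn") is not None:
            counts[ev["prefix"]][ev["origin_asn"]] += 1
    baseline = {}
    for prefix, per_origin in counts.items():
        total = sum(per_origin.values())
        # Keep origins that account for at least min_share of announcements.
        baseline[prefix] = {asn for asn, n in per_origin.items()
                            if n / total >= min_share}
    return baseline
```

Counting raw announcements weights chatty peers more heavily; counting distinct peers per origin would be a reasonable alternative.
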
## API Design

### Raw event API

Add endpoints like:

- `GET /api/v1/bgp/events`
- `GET /api/v1/bgp/events/{id}`

Suggested filters:

- `prefix`
- `origin_asn`
- `peer_asn`
- `collector`
- `event_type`
- `time_from`
- `time_to`
- `source`

### Anomaly API

Add endpoints like:

- `GET /api/v1/bgp/anomalies`
- `GET /api/v1/bgp/anomalies/{id}`
- `GET /api/v1/bgp/anomalies/summary`

Suggested filters:

- `severity`
- `anomaly_type`
- `status`
- `prefix`
- `origin_asn`
- `time_from`
- `time_to`

### Visualization API

Add an Earth-oriented endpoint like:

- `GET /api/v1/visualization/geo/bgp-anomalies`

Recommended feature shapes:

- point: collector locations
- arc: inferred propagation or suspicious path edge
- pulse point: active anomaly hotspot

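If the visualization endpoint returns GeoJSON (an assumption; the plan does not fix a response format), the three feature shapes could map to `Point` and `LineString` features as in this sketch. The helper names and the `kind` property are hypothetical.

```python
def point_feature(lon: float, lat: float, properties: dict) -> dict:
    """A collector location or anomaly hotspot. GeoJSON order is [lon, lat]."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": properties,
    }


def arc_feature(src, dst, properties: dict) -> dict:
    """An inferred propagation edge between two (lon, lat) positions."""
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": [list(src), list(dst)]},
        "properties": properties,
    }


def feature_collection(features) -> dict:
    """Wrap features in the top-level object returned by the endpoint."""
    return {"type": "FeatureCollection", "features": list(features)}
```

A pulse point can then be an ordinary point feature with, say, `{"kind": "pulse", "severity": "critical"}` in its properties, leaving the animation to the Earth frontend.
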
## Earth Big-Screen Design

Recommended layers:

### Layer 1: Collector layer

Show known collector locations and current activity intensity.

### Layer 2: Route propagation arcs

Use arcs for:

- origin ASN country to collector country
- or collector-to-collector visibility edges

Important note:

This is an inferred propagation view, not real packet flow.

### Layer 3: Active anomaly overlay

Show:

- hijack suspicion in red
- mass withdrawal in orange
- visibility drop in yellow
- path deviation in blue

### Layer 4: Time playback

Use `data_snapshots` to replay:

- minute-by-minute route changes
- anomaly expansion
- recovery timeline

## Alerting Strategy

Map anomaly severity to the current alert system.

Recommended severity mapping:

- `critical`
  - likely hijack
  - very large withdrawal burst
- `high`
  - clear origin change
  - large visibility drop
- `medium`
  - unusual path change
  - moderate more-specific burst
- `low`
  - weak or localized anomalies

## Delivery Plan

### Phase 1

- add `RISLiveCollector`
- normalize updates into `collected_data`
- create `bgp_anomalies`
- implement 3 rules:
  - origin change
  - more-specific burst
  - mass withdrawal

### Phase 2

- add `BGPStreamBackfillCollector`
- build 7-day baseline
- implement:
  - path deviation
  - visibility drop

### Phase 3

- add Earth visualization layer
- add time playback
- add anomaly filtering and drilldown

## Practical Implementation Notes

- Start with IPv4, then add IPv6 after the event schema is stable.
- Store the original raw payload in `metadata.raw_message` for traceability.
- Deduplicate events by a stable hash of collector, peer, prefix, type, and timestamp.
- Keep anomaly generation idempotent so replay and backfill do not create duplicate alerts.
- Expect noisy data and partial views; confidence scoring matters.

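The deduplication note above can be sketched with a SHA-256 digest over the five named fields. The function names are hypothetical, the event is assumed to use the normalized field names from the `metadata` schema, and `peer_ip` is assumed to be the "peer" component of the hash.

```python
import hashlib


def event_dedup_key(event: dict) -> str:
    """Stable hash of collector, peer, prefix, type, and timestamp."""
    parts = [
        event["collector"],
        event["peer_ip"],
        event["prefix"],
        event["event_type"],
        event["timestamp"],
    ]
    joined = "|".join(str(p) for p in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(events) -> list:
    """Keep the first occurrence of each event, preserving order."""
    seen = set()
    unique = []
    for ev in events:
        key = event_dedup_key(ev)
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique
```

Storing the same key in `source_id` (or under a unique constraint) would let the database reject duplicates during replay and backfill, which also helps keep anomaly generation idempotent.
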
## Recommended First Patch Set

The first code milestone should include:

1. `backend/app/services/collectors/ris_live.py`
2. `backend/app/services/collectors/bgpstream.py`
3. `backend/app/models/bgp_anomaly.py`
4. `backend/app/api/v1/bgp.py`
5. `backend/app/api/v1/visualization.py`
   add BGP anomaly geo endpoint
6. `frontend/src/pages`
   add a BGP anomaly list or summary page
7. `frontend/public/earth/js`
   add BGP anomaly rendering layer

## Sources

- [RIPE RIS Live](https://ris-live.ripe.net/)
- [CAIDA BGPStream Data Access Overview](https://bgpstream.caida.org/docs/overview/data-access)