feat: add bgp observability and admin ui improvements
docs/bgp-observability-plan.md (new file, 487 lines)
# BGP Observability Plan

## Goal

Build a global routing observability capability on top of:

- [RIPE RIS Live](https://ris-live.ripe.net/)
- [CAIDA BGPStream data access overview](https://bgpstream.caida.org/docs/overview/data-access)

The target is to support:

- real-time routing event ingestion
- historical replay and baseline analysis
- anomaly detection
- Earth big-screen visualization

## Important Scope Note

These data sources expose the BGP control plane, not user traffic itself.

That means the system can infer:

- route propagation direction
- prefix reachability changes
- AS path changes
- visibility changes across collectors

But it cannot directly measure:

- exact application traffic volume
- exact user packet path
- real bandwidth consumption between countries or operators

Product wording should therefore use phrases like the following instead of claiming direct traffic measurement:

- global routing propagation
- route visibility
- control-plane anomalies
- suspected path diversion

## Data Source Roles

### RIS Live

Use RIS Live as the real-time feed.

Recommended usage:

- subscribe to update streams over WebSocket
- ingest announcements and withdrawals continuously
- trigger low-latency alerts

Best suited for:

- hijack suspicion
- withdrawal bursts
- real-time path changes
- live Earth event overlay
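
As a rough illustration of the subscription step, the sketch below builds the JSON frame a client would send after opening the WebSocket and filters incoming frames. The endpoint URL, the `ris_subscribe` / `ris_message` frame types, and the `host`, `prefix`, and `moreSpecific` parameters follow the public RIS Live manual as recalled here and should be verified against it. The function names and the client name are hypothetical, and the transport itself (for example a WebSocket client library) is deliberately left out.

```python
import json

# Assumed endpoint; the "client" value is a placeholder identifying this consumer.
RIS_LIVE_URL = "wss://ris-live.ripe.net/v1/ws/?client=bgp-observability-sketch"


def build_subscribe_message(prefix=None, collector=None, more_specific=True):
    """Build a ris_subscribe frame asking for BGP UPDATE messages (hypothetical helper)."""
    params = {"type": "UPDATE"}
    if prefix is not None:
        params["prefix"] = prefix
        # Also receive more-specific prefixes, which the subprefix rules need.
        params["moreSpecific"] = more_specific
    if collector is not None:
        params["host"] = collector  # e.g. "rrc00"
    return json.dumps({"type": "ris_subscribe", "data": params})


def parse_frame(frame):
    """Return the payload of a ris_message frame, or None for any other frame type."""
    message = json.loads(frame)
    if message.get("type") != "ris_message":
        return None
    return message.get("data")
```

A collector would send `build_subscribe_message(...)` once per connection and pass every received text frame through `parse_frame` before normalization.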

### BGPStream

Use BGPStream as the historical and replay layer.

Recommended usage:

- backfill time windows
- build normal baselines
- compare current events against history
- support investigations and playback

Best suited for:

- historical anomaly confirmation
- baseline path frequency
- visibility baselines
- postmortem analysis

## Recommended Architecture

```mermaid
flowchart LR
    A["RIS Live WebSocket"] --> B["Realtime Collector"]
    C["BGPStream Historical Access"] --> D["Backfill Collector"]
    B --> E["Normalization Layer"]
    D --> E
    E --> F["data_snapshots"]
    E --> G["collected_data"]
    E --> H["bgp_anomalies"]
    H --> I["Alerts API"]
    G --> J["Visualization API"]
    H --> J
    J --> K["Earth Big Screen"]
```

## Storage Design

The current project already has:

- [data_snapshot.py](/home/ray/dev/linkong/planet/backend/app/models/data_snapshot.py)
- [collected_data.py](/home/ray/dev/linkong/planet/backend/app/models/collected_data.py)

So the lowest-risk path is:

1. keep raw and normalized BGP events in `collected_data`
2. use `data_snapshots` to group each ingest window
3. add a dedicated anomaly table for higher-value derived events

## Proposed Data Types

### `collected_data`

Use these `source` values:

- `ris_live_bgp`
- `bgpstream_bgp`

Use these `data_type` values:

- `bgp_update`
- `bgp_rib`
- `bgp_visibility`
- `bgp_path_change`

Recommended stable fields:

- `source`
- `source_id`
- `entity_key`
- `data_type`
- `name`
- `reference_date`
- `metadata`

Recommended `entity_key` strategy:

- event entity: `collector|peer|prefix|event_time`
- prefix state entity: `collector|peer|prefix`
- origin state entity: `prefix|origin_asn`
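
The three key shapes above could be produced by small helpers like the following. This is a minimal sketch: the function names are hypothetical, `peer` is assumed to be the peer IP address, and `event_time` is assumed to be a timezone-aware `datetime` rendered in UTC at second precision.

```python
from datetime import datetime, timezone


def event_entity_key(collector, peer, prefix, event_time):
    """Key for a single event: collector|peer|prefix|event_time."""
    # event_time must be timezone-aware; it is converted to UTC before formatting.
    stamp = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "|".join([collector, peer, prefix, stamp])


def prefix_state_key(collector, peer, prefix):
    """Key for the latest state of a prefix as seen by one peer of one collector."""
    return "|".join([collector, peer, prefix])


def origin_state_key(prefix, origin_asn):
    """Key for a prefix/origin pair, independent of collector and peer."""
    return f"{prefix}|{origin_asn}"
```

The `|` separator is safe here because it cannot occur in collector names, IP addresses, or CIDR prefixes.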

### `metadata` schema for raw events

Store the normalized event payload in `metadata`:

```json
{
  "project": "ris-live",
  "collector": "rrc00",
  "peer_asn": 3333,
  "peer_ip": "2001:db8::1",
  "event_type": "announcement",
  "prefix": "203.0.113.0/24",
  "origin_asn": 64496,
  "as_path": [3333, 64500, 64496],
  "communities": ["3333:100", "64500:1"],
  "next_hop": "192.0.2.1",
  "med": 0,
  "local_pref": null,
  "timestamp": "2026-03-26T08:00:00Z",
  "raw_message": {}
}
```

### New anomaly table

Add a new table, recommended name: `bgp_anomalies`

Suggested columns:

- `id`
- `snapshot_id`
- `task_id`
- `source`
- `anomaly_type`
- `severity`
- `status`
- `entity_key`
- `prefix`
- `origin_asn`
- `new_origin_asn`
- `peer_scope`
- `started_at`
- `ended_at`
- `confidence`
- `summary`
- `evidence`
- `created_at`

This table should represent derived intelligence, not raw updates.

## Collector Design

### 1. `RISLiveCollector`

Responsibility:

- maintain WebSocket connection
- subscribe to relevant message types
- normalize messages
- write event batches into snapshots
- optionally emit derived anomalies in near real time

Suggested runtime mode:

- long-running background task

Suggested snapshot strategy:

- one snapshot per rolling time window, for example every 1 minute or every 5 minutes

### 2. `BGPStreamBackfillCollector`

Responsibility:

- fetch historical data windows
- normalize to the same schema as real-time data
- build baselines
- re-run anomaly rules on past windows if needed

Suggested runtime mode:

- scheduled task
- or ad hoc task for investigations

Suggested snapshot strategy:

- one snapshot per historical query window

## Normalization Rules

Normalize both sources into the same internal event model.

Required normalized fields:

- `collector`
- `peer_asn`
- `peer_ip`
- `event_type`
- `prefix`
- `origin_asn`
- `as_path`
- `timestamp`

Derived normalized fields:

- `as_path_length`
- `country_guess`
- `prefix_length`
- `is_more_specific`
- `visibility_weight`
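
To make the normalization step concrete, here is a hedged sketch that expands one RIS Live UPDATE payload into events carrying the required fields plus two of the derived ones (`as_path_length`, `prefix_length`). The input field names (`host`, `peer`, `peer_asn`, `path`, `announcements`, `withdrawals`, `timestamp`) are assumptions based on the RIS Live message format and should be checked against the live feed. The function name is hypothetical, and `country_guess`, `is_more_specific`, and `visibility_weight` are left out because they need external data.

```python
import ipaddress
from datetime import datetime, timezone


def normalize_ris_update(data):
    """Expand one RIS Live UPDATE payload into a list of normalized event dicts."""
    path = data.get("path") or []
    # The last hop is the origin; an AS set (nested list) has no single origin ASN.
    origin = path[-1] if path and isinstance(path[-1], int) else None
    seen_at = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    base = {
        "collector": data.get("host"),
        "peer_asn": int(data["peer_asn"]),  # may arrive as a string
        "peer_ip": data.get("peer"),
        "timestamp": seen_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    events = []
    for announcement in data.get("announcements", []):
        for prefix in announcement.get("prefixes", []):
            events.append({
                **base,
                "event_type": "announcement",
                "prefix": prefix,
                "origin_asn": origin,
                "as_path": path,
                "as_path_length": len(path),
                "prefix_length": ipaddress.ip_network(prefix, strict=False).prefixlen,
            })
    for prefix in data.get("withdrawals", []):
        # Withdrawals carry no path attributes, so path fields are empty.
        events.append({
            **base,
            "event_type": "withdrawal",
            "prefix": prefix,
            "origin_asn": None,
            "as_path": [],
            "as_path_length": 0,
            "prefix_length": ipaddress.ip_network(prefix, strict=False).prefixlen,
        })
    return events
```

A BGPStream normalizer would need to emit the same dict shape so that the anomaly rules stay source-agnostic.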

## Anomaly Detection Rules

Start with these five rules.

### 1. Origin ASN Change

Trigger when:

- the same prefix is announced by a new origin ASN not seen in the baseline window

Use for:

- hijack suspicion
- origin drift detection
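
One possible shape for this rule is sketched below, assuming normalized events as dicts and a baseline that maps each prefix to the set of origin ASNs seen in the baseline window. The function name, the `origin_change` type label, and the layout of `evidence` are illustrative assumptions, not settled schema.

```python
def detect_origin_change(event, baseline_origins):
    """Return an anomaly dict if the event's origin ASN is new for its prefix, else None.

    baseline_origins maps prefix -> set of origin ASNs seen in the baseline window.
    """
    if event["event_type"] != "announcement" or event.get("origin_asn") is None:
        return None
    known = baseline_origins.get(event["prefix"])
    if not known:
        return None  # no baseline for this prefix yet, so nothing to compare against
    if event["origin_asn"] in known:
        return None
    return {
        "anomaly_type": "origin_change",
        "prefix": event["prefix"],
        # Only fill origin_asn when the baseline is unambiguous.
        "origin_asn": next(iter(known)) if len(known) == 1 else None,
        "new_origin_asn": event["origin_asn"],
        "evidence": {"baseline_origins": sorted(known)},
    }
```

Prefixes without a baseline are skipped here rather than flagged, which keeps newly announced prefixes from flooding the anomaly table.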

### 2. More-Specific Burst

Trigger when:

- a more-specific prefix appears suddenly
- especially from an unexpected origin ASN

Use for:

- subprefix hijack suspicion
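
The "more-specific" half of this rule can be checked with the standard `ipaddress` module, as in the sketch below. It covers only the containment test, not burst counting. The function name is hypothetical, and the linear scan is a simplification; a prefix trie would likely be needed at full-table scale.

```python
import ipaddress


def find_covering_prefix(prefix, baseline_prefixes):
    """Return the longest baseline prefix that strictly covers `prefix`, or None."""
    net = ipaddress.ip_network(prefix, strict=False)
    best = None
    for candidate in baseline_prefixes:
        cand = ipaddress.ip_network(candidate, strict=False)
        # Skip other address families and anything that is not strictly shorter.
        if cand.version != net.version or cand.prefixlen >= net.prefixlen:
            continue
        if net.subnet_of(cand) and (best is None or cand.prefixlen > best.prefixlen):
            best = cand
    return str(best) if best is not None else None
```

An announcement whose prefix is absent from the baseline but has a covering baseline prefix would set `is_more_specific`, and the covering prefix's baseline origins can then be compared with the new origin ASN.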

### 3. Mass Withdrawal

Trigger when:

- the same prefix or ASN sees many withdrawals across collectors within a short window

Use for:

- outage suspicion
- regional incident detection
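
A sliding-window counter is one way to express "many withdrawals across collectors within a short window". The sketch below tracks withdrawals per prefix and fires once enough distinct collectors have reported one. The class name and both thresholds are placeholder assumptions to be tuned against real data, and timestamps are assumed to arrive in non-decreasing order per prefix.

```python
from collections import defaultdict, deque


class WithdrawalWindow:
    """Track recent withdrawals per prefix and flag bursts across collectors."""

    def __init__(self, window_seconds=300, min_collectors=5):
        self.window_seconds = window_seconds
        self.min_collectors = min_collectors
        self._seen = defaultdict(deque)  # prefix -> deque of (epoch_seconds, collector)

    def observe(self, prefix, collector, ts):
        """Record one withdrawal; return True when the burst threshold is reached."""
        recent = self._seen[prefix]
        recent.append((ts, collector))
        # Evict entries that have fallen out of the window.
        while recent and recent[0][0] < ts - self.window_seconds:
            recent.popleft()
        return len({name for _, name in recent}) >= self.min_collectors
```

Keying the same structure by origin ASN instead of prefix would give the ASN-level variant of the rule.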

### 4. Path Deviation

Trigger when:

- AS path length jumps sharply
- or a rarely seen transit ASN appears
- or path frequency drops below baseline norms

Use for:

- route leak suspicion
- unusual path diversion
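
The first two triggers can be approximated as below, given the baseline AS paths observed for a prefix. This is a sketch under assumptions: the function name, reason labels, and thresholds are hypothetical, paths are assumed to be flat lists of ASNs (no AS sets), and the path-frequency trigger is not covered.

```python
from collections import Counter


def path_deviation_reasons(as_path, baseline_paths, length_jump=3, rare_share=0.01):
    """Return a list of reasons why `as_path` deviates from the baseline paths."""
    if not baseline_paths:
        return []
    reasons = []
    # Upper median of baseline path lengths as the "typical" length.
    lengths = sorted(len(path) for path in baseline_paths)
    typical = lengths[len(lengths) // 2]
    if len(as_path) - typical >= length_jump:
        reasons.append("path_length_jump")
    # Share of baseline paths in which each ASN appears at least once.
    asn_counts = Counter(asn for path in baseline_paths for asn in set(path))
    transit = as_path[:-1]  # every hop except the origin
    if any(asn_counts[asn] / len(baseline_paths) < rare_share for asn in transit):
        reasons.append("rare_transit_asn")
    return reasons
```

Because prepending repeats an ASN that is already known, it shows up only as a length jump, while a leak through an unfamiliar network shows up as a rare transit ASN.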

### 5. Visibility Drop

Trigger when:

- a prefix is visible from far fewer collectors/peers than its baseline

Use for:

- regional reachability degradation
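
One way to quantify "far fewer" is the fraction of baseline vantage points that still see the prefix. In the sketch below, vantage points are `(collector, peer_ip)` tuples held in sets; the function name and both thresholds are illustrative assumptions.

```python
def visibility_drop(current_peers, baseline_peers, max_ratio=0.5, min_baseline=10):
    """Return the current/baseline visibility ratio if it signals a drop, else None.

    Both arguments are sets of (collector, peer_ip) tuples that see the prefix.
    """
    baseline_count = len(baseline_peers)
    if baseline_count < min_baseline:
        return None  # too few vantage points for a meaningful ratio
    ratio = len(current_peers & baseline_peers) / baseline_count
    return ratio if ratio <= max_ratio else None
```

The returned ratio can feed the `confidence` column: the lower the ratio and the larger the baseline, the stronger the signal.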

## Baseline Strategy

Use BGPStream historical data to build:

- common origin ASN per prefix
- common AS path patterns
- collector visibility distribution
- normal withdrawal frequency

Recommended baseline windows:

- short baseline: last 24 hours
- medium baseline: last 7 days
- long baseline: last 30 days

The first implementation can start with only the 7-day baseline.
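
As an illustration of the 7-day origin baseline, the following sketch aggregates normalized announcement events into a prefix-to-origins map. The function name and the `min_share` cut-off (which drops origins that account for only a tiny share of announcements) are assumptions; timestamps are expected in the `YYYY-MM-DDTHH:MM:SSZ` form used by the `metadata` schema.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone


def build_origin_baseline(events, now, window=timedelta(days=7), min_share=0.05):
    """Map each prefix to the origin ASNs that dominate its announcements in the window."""
    cutoff = now - window
    counts = defaultdict(Counter)
    for event in events:
        if event["event_type"] != "announcement" or event.get("origin_asn") is None:
            continue
        seen_at = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        seen_at = seen_at.replace(tzinfo=timezone.utc)
        if cutoff <= seen_at <= now:
            counts[event["prefix"]][event["origin_asn"]] += 1
    baseline = {}
    for prefix, origins in counts.items():
        total = sum(origins.values())
        # Keep only origins that account for a meaningful share of announcements.
        baseline[prefix] = {asn for asn, n in origins.items() if n / total >= min_share}
    return baseline
```

The 24-hour and 30-day baselines would reuse the same function with a different `window` value.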

## API Design

### Raw event API

Add endpoints like:

- `GET /api/v1/bgp/events`
- `GET /api/v1/bgp/events/{id}`

Suggested filters:

- `prefix`
- `origin_asn`
- `peer_asn`
- `collector`
- `event_type`
- `time_from`
- `time_to`
- `source`

### Anomaly API

Add endpoints like:

- `GET /api/v1/bgp/anomalies`
- `GET /api/v1/bgp/anomalies/{id}`
- `GET /api/v1/bgp/anomalies/summary`

Suggested filters:

- `severity`
- `anomaly_type`
- `status`
- `prefix`
- `origin_asn`
- `time_from`
- `time_to`

### Visualization API

Add an Earth-oriented endpoint like:

- `GET /api/v1/visualization/geo/bgp-anomalies`

Recommended feature shapes:

- point: collector locations
- arc: inferred propagation or suspicious path edge
- pulse point: active anomaly hotspot

## Earth Big-Screen Design

Recommended layers:

### Layer 1: Collector layer

Show known collector locations and current activity intensity.

### Layer 2: Route propagation arcs

Use arcs for:

- origin ASN country to collector country
- or collector-to-collector visibility edges

Important note:

This is an inferred propagation view, not real packet flow.

### Layer 3: Active anomaly overlay

Show:

- hijack suspicion in red
- mass withdrawal in orange
- visibility drop in yellow
- path deviation in blue

### Layer 4: Time playback

Use `data_snapshots` to replay:

- minute-by-minute route changes
- anomaly expansion
- recovery timeline

## Alerting Strategy

Map anomaly severity to the current alert system.

Recommended severity mapping:

- `critical`
  - likely hijack
  - very large withdrawal burst
- `high`
  - clear origin change
  - large visibility drop
- `medium`
  - unusual path change
  - moderate more-specific burst
- `low`
  - weak or localized anomalies

## Delivery Plan

### Phase 1

- add `RISLiveCollector`
- normalize updates into `collected_data`
- create `bgp_anomalies`
- implement 3 rules:
  - origin change
  - more-specific burst
  - mass withdrawal

### Phase 2

- add `BGPStreamBackfillCollector`
- build 7-day baseline
- implement:
  - path deviation
  - visibility drop

### Phase 3

- add Earth visualization layer
- add time playback
- add anomaly filtering and drilldown

## Practical Implementation Notes

- Start with IPv4, then add IPv6 after the event schema is stable.
- Store the original raw payload in `metadata.raw_message` for traceability.
- Deduplicate events by a stable hash of collector, peer, prefix, type, and timestamp.
- Keep anomaly generation idempotent so replay and backfill do not create duplicate alerts.
- Expect noisy data and partial views; confidence scoring matters.
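
The deduplication note above could be implemented roughly as follows. This is a minimal sketch: the function names are hypothetical, the field names follow the `metadata` schema, "peer" is assumed to mean `peer_ip`, and SHA-256 is just one reasonable choice of stable hash.

```python
import hashlib


def event_fingerprint(event):
    """Stable hash over collector, peer, prefix, type, and timestamp."""
    parts = [
        event["collector"],
        event["peer_ip"],
        event["prefix"],
        event["event_type"],
        event["timestamp"],
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def deduplicate(events):
    """Drop events whose fingerprint was already seen, preserving input order."""
    seen = set()
    unique = []
    for event in events:
        fingerprint = event_fingerprint(event)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(event)
    return unique
```

Storing the fingerprint in `source_id` with a uniqueness constraint would let the database enforce the same guarantee across replay and backfill runs.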

## Recommended First Patch Set

The first code milestone should include:

1. `backend/app/services/collectors/ris_live.py`
2. `backend/app/services/collectors/bgpstream.py`
3. `backend/app/models/bgp_anomaly.py`
4. `backend/app/api/v1/bgp.py`
5. `backend/app/api/v1/visualization.py`
   add BGP anomaly geo endpoint
6. `frontend/src/pages`
   add a BGP anomaly list or summary page
7. `frontend/public/earth/js`
   add BGP anomaly rendering layer

## Sources

- [RIPE RIS Live](https://ris-live.ripe.net/)
- [CAIDA BGPStream Data Access Overview](https://bgpstream.caida.org/docs/overview/data-access)