feat: add bgp observability and admin ui improvements

linkong
2026-03-27 14:27:07 +08:00
parent bf2c4a172d
commit b0058edf17
51 changed files with 2473 additions and 245 deletions

@@ -0,0 +1,487 @@
# BGP Observability Plan
## Goal
Build a global routing observability capability on top of:
- [RIPE RIS Live](https://ris-live.ripe.net/)
- [CAIDA BGPStream data access overview](https://bgpstream.caida.org/docs/overview/data-access)
The target is to support:
- real-time routing event ingestion
- historical replay and baseline analysis
- anomaly detection
- Earth big-screen visualization
## Important Scope Note
These data sources expose the BGP control plane, not user traffic itself.
That means the system can infer:
- route propagation direction
- prefix reachability changes
- AS path changes
- visibility changes across collectors
But it cannot directly measure:
- exact application traffic volume
- exact user packet path
- real bandwidth consumption between countries or operators
Product wording should therefore use phrases like:
- global routing propagation
- route visibility
- control-plane anomalies
- suspected path diversion
Avoid wording that claims direct traffic measurement.
## Data Source Roles
### RIS Live
Use RIS Live as the real-time feed.
Recommended usage:
- subscribe to update streams over WebSocket
- ingest announcements and withdrawals continuously
- trigger low-latency alerts
Best suited for:
- hijack suspicion
- withdrawal bursts
- real-time path changes
- live Earth event overlay
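
As a rough illustration of the subscription step, the sketch below builds the JSON message a client would send after opening the WebSocket. The endpoint URL, the `ris_subscribe` message type, and the filter field names (`type`, `prefix`, `moreSpecific`, `host`) reflect the RIS Live manual as understood at the time of writing and should be checked against the current documentation; `build_subscribe_message` is a hypothetical helper, not existing project code.

```python
import json

# Assumed endpoint; RIPE also asks clients to identify themselves with a ?client=... parameter.
RIS_LIVE_URL = "wss://ris-live.ripe.net/v1/ws/"


def build_subscribe_message(prefix=None, host=None, more_specific=True):
    """Build a subscription message restricted to BGP UPDATE messages."""
    data = {"type": "UPDATE"}
    if prefix is not None:
        data["prefix"] = prefix
        # Also match more-specific prefixes, which subprefix-hijack detection needs.
        data["moreSpecific"] = more_specific
    if host is not None:
        data["host"] = host  # limit the stream to a single collector, e.g. "rrc00"
    return json.dumps({"type": "ris_subscribe", "data": data})
```

Sending an unfiltered subscription yields the full global feed, so a first deployment may prefer to start with a small watchlist of prefixes or collectors.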
### BGPStream
Use BGPStream as the historical and replay layer.
Recommended usage:
- backfill time windows
- build normal baselines
- compare current events against history
- support investigations and playback
Best suited for:
- historical anomaly confirmation
- baseline path frequency
- visibility baselines
- postmortem analysis
## Recommended Architecture
```mermaid
flowchart LR
A["RIS Live WebSocket"] --> B["Realtime Collector"]
C["BGPStream Historical Access"] --> D["Backfill Collector"]
B --> E["Normalization Layer"]
D --> E
E --> F["data_snapshots"]
E --> G["collected_data"]
E --> H["bgp_anomalies"]
H --> I["Alerts API"]
G --> J["Visualization API"]
H --> J
J --> K["Earth Big Screen"]
```
## Storage Design
The current project already has:
- [data_snapshot.py](/home/ray/dev/linkong/planet/backend/app/models/data_snapshot.py)
- [collected_data.py](/home/ray/dev/linkong/planet/backend/app/models/collected_data.py)
So the lowest-risk path is:
1. keep raw and normalized BGP events in `collected_data`
2. use `data_snapshots` to group each ingest window
3. add a dedicated anomaly table for higher-value derived events
## Proposed Data Types
### `collected_data`
Use these `source` values:
- `ris_live_bgp`
- `bgpstream_bgp`
Use these `data_type` values:
- `bgp_update`
- `bgp_rib`
- `bgp_visibility`
- `bgp_path_change`
Recommended stable fields:
- `source`
- `source_id`
- `entity_key`
- `data_type`
- `name`
- `reference_date`
- `metadata`
Recommended `entity_key` strategy:
- event entity: `collector|peer|prefix|event_time`
- prefix state entity: `collector|peer|prefix`
- origin state entity: `prefix|origin_asn`
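
The three key shapes above could be produced by helpers along these lines. This is a minimal sketch: the function names are hypothetical, `peer` is assumed to be the peer IP address, and the event time is assumed to be a timezone-aware `datetime`.

```python
from datetime import datetime, timezone


def event_entity_key(collector, peer, prefix, event_time):
    """collector|peer|prefix|event_time, with the time rendered in UTC."""
    ts = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{collector}|{peer}|{prefix}|{ts}"


def prefix_state_key(collector, peer, prefix):
    """collector|peer|prefix, one key per prefix as seen by one peer."""
    return f"{collector}|{peer}|{prefix}"


def origin_state_key(prefix, origin_asn):
    """prefix|origin_asn, one key per (prefix, origin) pairing."""
    return f"{prefix}|{origin_asn}"
```

The `|` separator is safe here because it appears in neither IPv4/IPv6 addresses nor CIDR prefixes.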
### `metadata` schema for raw events
Store the normalized event payload in `metadata`:
```json
{
"project": "ris-live",
"collector": "rrc00",
"peer_asn": 3333,
"peer_ip": "2001:db8::1",
"event_type": "announcement",
"prefix": "203.0.113.0/24",
"origin_asn": 64496,
"as_path": [3333, 64500, 64496],
"communities": ["3333:100", "64500:1"],
"next_hop": "192.0.2.1",
"med": 0,
"local_pref": null,
"timestamp": "2026-03-26T08:00:00Z",
"raw_message": {}
}
```
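
One way to produce this payload from a RIS Live message is sketched below. The input field names (`host`, `peer`, `peer_asn`, `path`, `community`, `announcements`, `withdrawals`) are assumptions about the RIS Live `ris_message` data object and should be verified against real messages; `normalize_ris_message` is a hypothetical helper.

```python
from datetime import datetime, timezone


def normalize_ris_message(data):
    """Yield one metadata dict per announced or withdrawn prefix."""
    path = data.get("path") or []
    # The last hop can be an AS set (a list); only a plain integer is treated as the origin.
    origin = path[-1] if path and isinstance(path[-1], int) else None
    when = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    base = {
        "project": "ris-live",
        "collector": data.get("host"),
        "peer_asn": int(data["peer_asn"]),  # may arrive as a string
        "peer_ip": data["peer"],
        "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "raw_message": data,
    }
    communities = [f"{a}:{b}" for a, b in data.get("community") or []]
    for ann in data.get("announcements") or []:
        for prefix in ann.get("prefixes") or []:
            yield {**base, "event_type": "announcement", "prefix": prefix,
                   "origin_asn": origin, "as_path": path,
                   "communities": communities, "next_hop": ann.get("next_hop"),
                   "med": data.get("med"), "local_pref": None}
    for prefix in data.get("withdrawals") or []:
        # Withdrawals carry no path attributes.
        yield {**base, "event_type": "withdrawal", "prefix": prefix,
               "origin_asn": None, "as_path": [], "communities": [],
               "next_hop": None, "med": None, "local_pref": None}
```

Emitting one event per prefix keeps the `entity_key` strategy simple, at the cost of repeating shared attributes when a single UPDATE carries many prefixes.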
### New anomaly table
Add a new table, recommended name: `bgp_anomalies`
Suggested columns:
- `id`
- `snapshot_id`
- `task_id`
- `source`
- `anomaly_type`
- `severity`
- `status`
- `entity_key`
- `prefix`
- `origin_asn`
- `new_origin_asn`
- `peer_scope`
- `started_at`
- `ended_at`
- `confidence`
- `summary`
- `evidence`
- `created_at`
This table should represent derived intelligence, not raw updates.
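
To make the column list concrete, here is a sketch of the table as plain DDL, executed against an in-memory SQLite database purely for illustration. The column types, the `NOT NULL` constraints, and the `'active'` default are assumptions; the real model would follow the project's existing ORM and database conventions.

```python
import sqlite3

BGP_ANOMALIES_DDL = """
CREATE TABLE bgp_anomalies (
    id INTEGER PRIMARY KEY,
    snapshot_id INTEGER,
    task_id INTEGER,
    source TEXT NOT NULL,
    anomaly_type TEXT NOT NULL,
    severity TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active',
    entity_key TEXT NOT NULL,
    prefix TEXT,
    origin_asn INTEGER,
    new_origin_asn INTEGER,
    peer_scope TEXT,
    started_at TEXT,
    ended_at TEXT,
    confidence REAL,
    summary TEXT,
    evidence TEXT,  -- JSON-encoded supporting events
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def create_schema(conn):
    """Create the anomaly table on the given connection."""
    conn.execute(BGP_ANOMALIES_DDL)


def column_names(conn):
    """Return the table's column names in declaration order."""
    return [row[1] for row in conn.execute("PRAGMA table_info(bgp_anomalies)")]
```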
## Collector Design
### 1. `RISLiveCollector`
Responsibility:
- maintain WebSocket connection
- subscribe to relevant message types
- normalize messages
- write event batches into snapshots
- optionally emit derived anomalies in near real time
Suggested runtime mode:
- long-running background task
Suggested snapshot strategy:
- one snapshot per rolling time window
- for example every 1 minute or every 5 minutes
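
A rolling window can be implemented by flooring each event time to the start of its window, so that all events in the same window share one snapshot. The helper name below is hypothetical, and windows are assumed to be aligned to the Unix epoch.

```python
from datetime import datetime, timezone


def snapshot_window_start(event_time, window_seconds=60):
    """Return the UTC start of the fixed window containing event_time."""
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % window_seconds, tz=timezone.utc)
```

With `window_seconds=300`, an event at 08:03:42 UTC lands in the snapshot that starts at 08:00:00 UTC.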
### 2. `BGPStreamBackfillCollector`
Responsibility:
- fetch historical data windows
- normalize to the same schema as real-time data
- build baselines
- re-run anomaly rules on past windows if needed
Suggested runtime mode:
- scheduled task
- or ad hoc task for investigations
Suggested snapshot strategy:
- one snapshot per historical query window
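
Normalizing to the same schema might look like the sketch below. It assumes the element objects yielded by `pybgpstream` expose `type`, `time`, `collector`, `project`, `peer_asn`, `peer_address`, and a `fields` dict with `prefix`, `as-path`, `next-hop`, and `communities`; confirm these names against the installed library version. The function itself is hypothetical and only relies on duck typing, so it does not import `pybgpstream`.

```python
from datetime import datetime, timezone

# Assumed element type codes: "A" announcement, "W" withdrawal.
EVENT_TYPES = {"A": "announcement", "W": "withdrawal"}


def normalize_bgpstream_elem(elem):
    """Map one BGPStream element to the shared event schema, or None if skipped."""
    event_type = EVENT_TYPES.get(elem.type)
    if event_type is None:
        return None  # RIB entries and peer-state messages are not update events
    fields = elem.fields
    hops = fields.get("as-path", "").split()
    # AS sets such as "{64500,64501}" are not plain integers and are dropped here.
    as_path = [int(h) for h in hops if h.isdigit()]
    origin = int(hops[-1]) if hops and hops[-1].isdigit() else None
    when = datetime.fromtimestamp(elem.time, tz=timezone.utc)
    return {
        "project": elem.project,
        "collector": elem.collector,
        "peer_asn": int(elem.peer_asn),
        "peer_ip": elem.peer_address,
        "event_type": event_type,
        "prefix": fields.get("prefix"),
        "origin_asn": origin,
        "as_path": as_path,
        "communities": sorted(fields.get("communities", [])),
        "next_hop": fields.get("next-hop"),
        "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```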
## Normalization Rules
Normalize both sources into the same internal event model.
Required normalized fields:
- `collector`
- `peer_asn`
- `peer_ip`
- `event_type`
- `prefix`
- `origin_asn`
- `as_path`
- `timestamp`
Derived normalized fields:
- `as_path_length`
- `country_guess`
- `prefix_length`
- `is_more_specific`
- `visibility_weight`
## Anomaly Detection Rules
Start with these five rules.
### 1. Origin ASN Change
Trigger when:
- the same prefix is announced by a new origin ASN not seen in the baseline window
Use for:
- hijack suspicion
- origin drift detection
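
A minimal sketch of this rule, assuming the baseline is a dict mapping each prefix to the set of origin ASNs seen in the baseline window. The function name and the returned dict shape are illustrative only.

```python
def detect_origin_change(event, baseline_origins):
    """Return an anomaly dict if the event's origin ASN is new for its prefix."""
    if event.get("event_type") != "announcement":
        return None
    origin = event.get("origin_asn")
    known = baseline_origins.get(event["prefix"])
    # No baseline for this prefix means there is nothing to compare against.
    if origin is None or not known or origin in known:
        return None
    return {
        "anomaly_type": "origin_change",
        "prefix": event["prefix"],
        "known_origins": sorted(known),
        "new_origin_asn": origin,
    }
```

Legitimate multi-origin prefixes (for example anycast) stay quiet as long as every origin appears in the baseline.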
### 2. More-Specific Burst
Trigger when:
- a more-specific prefix appears suddenly
- especially from an unexpected origin ASN
Use for:
- subprefix hijack suspicion
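
The per-event part of this rule can lean on the standard `ipaddress` module. The sketch below only answers "is this prefix a new more-specific of something in the baseline?"; counting such events over a window to detect a burst is left out, and the function name is hypothetical.

```python
import ipaddress


def find_covering_prefix(prefix, baseline_prefixes):
    """Return the longest baseline prefix that strictly covers prefix, or None."""
    if prefix in baseline_prefixes:
        return None  # already known, so not a new more-specific
    net = ipaddress.ip_network(prefix)
    covering = []
    for candidate in baseline_prefixes:
        cand = ipaddress.ip_network(candidate)
        # subnet_of raises TypeError across IP versions, so check the version first.
        if cand.version == net.version and cand != net and net.subnet_of(cand):
            covering.append(cand)
    if not covering:
        return None
    return str(max(covering, key=lambda c: c.prefixlen))
```

Comparing the origin of the new more-specific against the baseline origin of the covering prefix then separates routine traffic engineering from a suspicious announcement.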
### 3. Mass Withdrawal
Trigger when:
- the same prefix or ASN sees many withdrawals across collectors within a short window
Use for:
- outage suspicion
- regional incident detection
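
A simple version of this rule groups withdrawals into fixed time buckets per prefix and counts distinct collectors. This is a sketch under assumptions: events use the normalized schema with ISO 8601 `Z` timestamps, and the default window and collector threshold are placeholders to be tuned against real data.

```python
from collections import defaultdict
from datetime import datetime


def _epoch(ts):
    """Parse a timestamp such as 2026-03-26T08:00:00Z into epoch seconds."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp()


def detect_mass_withdrawal(events, window_seconds=300, min_collectors=3):
    """Return prefixes withdrawn at min_collectors or more collectors in one window."""
    collectors_seen = defaultdict(set)
    for event in events:
        if event["event_type"] != "withdrawal":
            continue
        bucket = int(_epoch(event["timestamp"]) // window_seconds)
        collectors_seen[(event["prefix"], bucket)].add(event["collector"])
    return sorted({prefix for (prefix, _), seen in collectors_seen.items()
                   if len(seen) >= min_collectors})
```

Fixed buckets can split a burst that straddles a boundary; a sliding window would catch that case at the cost of more state.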
### 4. Path Deviation
Trigger when:
- AS path length jumps sharply
- or a rarely seen transit ASN appears
- or path frequency drops below baseline norms
Use for:
- route leak suspicion
- unusual path diversion
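
The first two triggers can be sketched as below; the path-frequency trigger needs a per-path baseline and is omitted. The thresholds, the function name, and the shape of the baseline inputs (a list of historical path lengths and a dict of transit ASN counts for the prefix) are all assumptions.

```python
from statistics import median


def path_deviation_reasons(as_path, baseline_lengths, transit_counts,
                           length_jump=3, rare_threshold=2):
    """Return the list of reasons this path deviates from the prefix baseline."""
    reasons = []
    if baseline_lengths and len(as_path) - median(baseline_lengths) >= length_jump:
        reasons.append("path_length_jump")
    # Transit hops exclude the first hop (the peer) and the last hop (the origin).
    rare = [asn for asn in as_path[1:-1] if transit_counts.get(asn, 0) < rare_threshold]
    if rare:
        reasons.append("rare_transit_asn")
    return reasons
```

AS path prepending inflates raw length, so a production rule would likely collapse consecutive duplicate ASNs before comparing.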
### 5. Visibility Drop
Trigger when:
- a prefix is visible from far fewer collectors/peers than its baseline
Use for:
- regional reachability degradation
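
One possible formulation compares the set of peers that currently see the prefix with the set that saw it during the baseline window. The `min_ratio` threshold and the function name are placeholders.

```python
def visibility_drop(current_peers, baseline_peers, min_ratio=0.5):
    """Return the visibility ratio if it fell below min_ratio, else None."""
    if not baseline_peers:
        return None  # no baseline, so no meaningful ratio
    # Only peers present in the baseline count, so new peers cannot mask a drop.
    ratio = len(set(current_peers) & set(baseline_peers)) / len(baseline_peers)
    return ratio if ratio < min_ratio else None
```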
## Baseline Strategy
Use BGPStream historical data to build:
- common origin ASN per prefix
- common AS path patterns
- collector visibility distribution
- normal withdrawal frequency
Recommended baseline windows:
- short baseline: last 24 hours
- medium baseline: last 7 days
- long baseline: last 30 days
The first implementation can start with only the 7-day baseline.
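
As an example of the first baseline item, common origin ASNs per prefix can be derived by counting announcements over the window and keeping origins above a minimum share. The `min_share` cut-off is an assumption meant to keep one-off noise out of the baseline, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def build_origin_baseline(events, min_share=0.05):
    """Map each prefix to the set of origin ASNs seen in at least min_share of announcements."""
    counts = defaultdict(Counter)
    for event in events:
        if event["event_type"] == "announcement" and event.get("origin_asn") is not None:
            counts[event["prefix"]][event["origin_asn"]] += 1
    baseline = {}
    for prefix, per_origin in counts.items():
        total = sum(per_origin.values())
        baseline[prefix] = {asn for asn, n in per_origin.items() if n / total >= min_share}
    return baseline
```

The resulting dict has the shape the origin-change rule expects: prefix to set of known origin ASNs.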
## API Design
### Raw event API
Add endpoints like:
- `GET /api/v1/bgp/events`
- `GET /api/v1/bgp/events/{id}`
Suggested filters:
- `prefix`
- `origin_asn`
- `peer_asn`
- `collector`
- `event_type`
- `time_from`
- `time_to`
- `source`
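
The filter semantics can be sketched independently of the web framework. The function below applies the suggested filters to in-memory event dicts; in practice they would translate to database query conditions. It assumes each event dict carries `source` alongside the normalized fields and that timestamps are ISO 8601 strings in UTC with a `Z` suffix, which compare correctly as plain strings.

```python
def filter_events(events, prefix=None, origin_asn=None, peer_asn=None,
                  collector=None, event_type=None, source=None,
                  time_from=None, time_to=None):
    """Return events matching every filter that is not None."""
    exact = {"prefix": prefix, "origin_asn": origin_asn, "peer_asn": peer_asn,
             "collector": collector, "event_type": event_type, "source": source}
    wanted = {key: value for key, value in exact.items() if value is not None}
    matched = []
    for event in events:
        if any(event.get(key) != value for key, value in wanted.items()):
            continue
        ts = event["timestamp"]
        if time_from is not None and ts < time_from:
            continue
        if time_to is not None and ts > time_to:
            continue
        matched.append(event)
    return matched
```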
### Anomaly API
Add endpoints like:
- `GET /api/v1/bgp/anomalies`
- `GET /api/v1/bgp/anomalies/{id}`
- `GET /api/v1/bgp/anomalies/summary`
Suggested filters:
- `severity`
- `anomaly_type`
- `status`
- `prefix`
- `origin_asn`
- `time_from`
- `time_to`
### Visualization API
Add an Earth-oriented endpoint like:
- `GET /api/v1/visualization/geo/bgp-anomalies`
Recommended feature shapes:
- point: collector locations
- arc: inferred propagation or suspicious path edge
- pulse point: active anomaly hotspot
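
If the endpoint returns GeoJSON, the shapes above map naturally onto `Point` and `LineString` features. This is a sketch: the `kind` property and the helper names are assumptions, and a real arc would be rendered as a curve by the frontend rather than encoded in the geometry.

```python
def point_feature(lon, lat, kind, properties=None):
    """Build a GeoJSON Point feature; kind is e.g. 'collector' or 'pulse'."""
    return {
        "type": "Feature",
        # GeoJSON orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"kind": kind, **(properties or {})},
    }


def arc_feature(src, dst, properties=None):
    """Build a two-point LineString between (lon, lat) pairs src and dst."""
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": [list(src), list(dst)]},
        "properties": {"kind": "arc", **(properties or {})},
    }


def feature_collection(features):
    """Wrap features in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}
```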
## Earth Big-Screen Design
Recommended layers:
### Layer 1: Collector layer
Show known collector locations and current activity intensity.
### Layer 2: Route propagation arcs
Use arcs for:
- origin ASN country to collector country
- or collector-to-collector visibility edges
Important note:
This is an inferred propagation view, not real packet flow.
### Layer 3: Active anomaly overlay
Show:
- hijack suspicion in red
- mass withdrawal in orange
- visibility drop in yellow
- path deviation in blue
### Layer 4: Time playback
Use `data_snapshots` to replay:
- minute-by-minute route changes
- anomaly expansion
- recovery timeline
## Alerting Strategy
Map anomaly severity to the current alert system.
Recommended severity mapping:
- `critical`
  - likely hijack
  - very large withdrawal burst
- `high`
  - clear origin change
  - large visibility drop
- `medium`
  - unusual path change
  - moderate more-specific burst
- `low`
  - weak or localized anomalies
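
A possible encoding of this mapping is sketched below. The numeric thresholds, the `scope` input (fraction of collectors affected, between 0 and 1), and the anomaly type names are all assumptions for illustration; the plan itself only fixes the qualitative levels.

```python
def classify_severity(anomaly_type, confidence, scope):
    """Map an anomaly to critical, high, medium, or low."""
    if confidence < 0.3 or scope < 0.1:
        return "low"  # weak or localized anomalies
    if anomaly_type in ("origin_change", "more_specific_burst") and confidence >= 0.9:
        return "critical"  # likely hijack
    if anomaly_type == "mass_withdrawal" and scope >= 0.8:
        return "critical"  # very large withdrawal burst
    if anomaly_type == "origin_change":
        return "high"  # clear origin change
    if anomaly_type == "visibility_drop" and scope >= 0.5:
        return "high"  # large visibility drop
    return "medium"  # unusual path change, moderate more-specific burst
```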
## Delivery Plan
### Phase 1
- add `RISLiveCollector`
- normalize updates into `collected_data`
- create `bgp_anomalies`
- implement 3 rules:
  - origin change
  - more-specific burst
  - mass withdrawal
### Phase 2
- add `BGPStreamBackfillCollector`
- build 7-day baseline
- implement:
  - path deviation
  - visibility drop
### Phase 3
- add Earth visualization layer
- add time playback
- add anomaly filtering and drilldown
## Practical Implementation Notes
- Start with IPv4, then add IPv6 once the event schema is stable.
- Store the original raw payload in `metadata.raw_message` for traceability.
- Deduplicate events by a stable hash of collector, peer, prefix, type, and timestamp.
- Keep anomaly generation idempotent so replay and backfill do not create duplicate alerts.
- Expect noisy data and partial views; confidence scoring matters.
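
The deduplication note can be sketched with a SHA-256 fingerprint over the listed fields. Using `peer_ip` as the peer identifier and joining with `|` are assumptions, and both function names are hypothetical.

```python
import hashlib

FINGERPRINT_FIELDS = ("collector", "peer_ip", "prefix", "event_type", "timestamp")


def event_fingerprint(event):
    """Return a stable hex digest identifying one normalized event."""
    joined = "|".join(str(event[field]) for field in FINGERPRINT_FIELDS)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def deduplicate(events):
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen = set()
    unique = []
    for event in events:
        fingerprint = event_fingerprint(event)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(event)
    return unique
```

Storing the fingerprint in `source_id` under a unique constraint would let the database reject duplicates during replay and backfill, which also helps keep anomaly generation idempotent.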
## Recommended First Patch Set
The first code milestone should include:
1. `backend/app/services/collectors/ris_live.py`
2. `backend/app/services/collectors/bgpstream.py`
3. `backend/app/models/bgp_anomaly.py`
4. `backend/app/api/v1/bgp.py`
5. `backend/app/api/v1/visualization.py`: add a BGP anomaly geo endpoint
6. `frontend/src/pages`: add a BGP anomaly list or summary page
7. `frontend/public/earth/js`: add a BGP anomaly rendering layer
## Sources
- [RIPE RIS Live](https://ris-live.ripe.net/)
- [CAIDA BGPStream Data Access Overview](https://bgpstream.caida.org/docs/overview/data-access)