linkong/planet

linkong b0058edf17 feat: add bgp observability and admin ui improvements

2026-03-27 14:27:07 +08:00

BGP Observability Plan

Goal

Build a global routing observability capability on top of:

RIS Live
BGPStream

The target is to support:

real-time routing event ingestion
historical replay and baseline analysis
anomaly detection
Earth big-screen visualization

Important Scope Note

These data sources expose the BGP control plane, not user traffic itself.

That means the system can infer:

route propagation direction
prefix reachability changes
AS path changes
visibility changes across collectors

But it cannot directly measure:

exact application traffic volume
exact user packet path
real bandwidth consumption between countries or operators

Product wording should therefore use phrases like:

global routing propagation
route visibility
control-plane anomalies
suspected path diversion

Avoid wording that claims direct traffic measurement.

Data Source Roles

RIS Live

Use RIS Live as the real-time feed.

Recommended usage:

subscribe to update streams over WebSocket
ingest announcements and withdrawals continuously
trigger low-latency alerts

Best suited for:

hijack suspicion
withdrawal bursts
real-time path changes
live Earth event overlay
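A minimal sketch of the subscription step described above. It assumes the third-party `websockets` package and the public RIS Live WebSocket endpoint; the `client` label and the function names `build_subscription` and `consume` are placeholders, not part of the existing project.

```python
import json

# Public RIS Live endpoint; the client label is a placeholder for this app.
RIS_LIVE_URL = "wss://ris-live.ripe.net/v1/ws/?client=planet-bgp-sketch"


def build_subscription(prefix=None, host=None):
    """Build a ris_subscribe message limited to BGP UPDATE messages."""
    data = {"type": "UPDATE"}
    if prefix is not None:
        data["prefix"] = prefix
        data["moreSpecific"] = True  # also match more-specific prefixes
    if host is not None:
        data["host"] = host  # restrict to one collector, e.g. "rrc00"
    return json.dumps({"type": "ris_subscribe", "data": data})


async def consume(handle, prefix=None, host=None):
    """Pass each ris_message payload to handle() until the socket closes."""
    import websockets  # third-party dependency, imported lazily

    async with websockets.connect(RIS_LIVE_URL) as ws:
        await ws.send(build_subscription(prefix, host))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "ris_message":
                handle(msg["data"])


# Example (needs network access):
#   import asyncio
#   asyncio.run(consume(print, prefix="203.0.113.0/24"))
```

A long-running collector would also need reconnect and backoff handling, which this sketch leaves out.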

BGPStream

Use BGPStream as the historical and replay layer.

Recommended usage:

backfill time windows
build normal baselines
compare current events against history
support investigations and playback

Best suited for:

historical anomaly confirmation
baseline path frequency
visibility baselines
postmortem analysis

Recommended Architecture

flowchart LR
  A["RIS Live WebSocket"] --> B["Realtime Collector"]
  C["BGPStream Historical Access"] --> D["Backfill Collector"]
  B --> E["Normalization Layer"]
  D --> E
  E --> F["data_snapshots"]
  E --> G["collected_data"]
  E --> H["bgp_anomalies"]
  H --> I["Alerts API"]
  G --> J["Visualization API"]
  H --> J
  J --> K["Earth Big Screen"]

Storage Design

The current project already has:

collected_data
data_snapshots

So the lowest-risk path is:

keep raw and normalized BGP events in collected_data
use data_snapshots to group each ingest window
add a dedicated anomaly table for higher-value derived events

Proposed Data Types

`collected_data`

Use these source values:

ris_live_bgp
bgpstream_bgp

Use these data_type values:

bgp_update
bgp_rib
bgp_visibility
bgp_path_change

Recommended stable fields:

source
source_id
entity_key
data_type
name
reference_date
metadata

Recommended entity_key strategy:

event entity: collector|peer|prefix|event_time
prefix state entity: collector|peer|prefix
origin state entity: prefix|origin_asn
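The three key shapes above could be built with small helpers like these. This is a sketch: the function names are hypothetical, and it assumes that `peer` means the peer IP and that `event_time` is a timezone-aware datetime rendered as UTC ISO 8601.

```python
from datetime import datetime, timezone


def event_entity_key(collector, peer, prefix, event_time):
    """collector|peer|prefix|event_time, with the time rendered in UTC."""
    ts = event_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{collector}|{peer}|{prefix}|{ts}"


def prefix_state_key(collector, peer, prefix):
    """collector|peer|prefix: latest state of a prefix as seen by one peer."""
    return f"{collector}|{peer}|{prefix}"


def origin_state_key(prefix, origin_asn):
    """prefix|origin_asn: one record per (prefix, origin) pairing."""
    return f"{prefix}|{origin_asn}"
```

Using the same separator everywhere keeps the keys easy to split again, since `|` does not occur in collector names, IP addresses, or prefixes.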

`metadata` schema for raw events

Store the normalized event payload in metadata:

{
  "project": "ris-live",
  "collector": "rrc00",
  "peer_asn": 3333,
  "peer_ip": "2001:db8::1",
  "event_type": "announcement",
  "prefix": "203.0.113.0/24",
  "origin_asn": 64496,
  "as_path": [3333, 64500, 64496],
  "communities": ["3333:100", "64500:1"],
  "next_hop": "192.0.2.1",
  "med": 0,
  "local_pref": null,
  "timestamp": "2026-03-26T08:00:00Z",
  "raw_message": {}
}
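One way to produce this payload from a RIS Live UPDATE is sketched below. The input field names (`host`, `peer`, `peer_asn`, `path`, `community`, `announcements`, `withdrawals`) follow the RIS Live message format as I understand it and should be checked against the current RIS Live manual; `normalize_ris_update` is a hypothetical name.

```python
from datetime import datetime, timezone


def normalize_ris_update(data):
    """Turn one RIS Live UPDATE payload into a list of metadata dicts."""
    path = data.get("path") or []
    # The last hop is the origin; an AS set (nested list) has no single origin.
    origin = path[-1] if path and isinstance(path[-1], int) else None
    when = datetime.fromtimestamp(data["timestamp"], tz=timezone.utc)
    base = {
        "project": "ris-live",
        "collector": str(data.get("host", "")).split(".")[0],
        "peer_asn": int(data["peer_asn"]),
        "peer_ip": data["peer"],
        "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "raw_message": data,
    }
    communities = [f"{a}:{b}" for a, b in data.get("community", [])]
    events = []
    for ann in data.get("announcements", []):
        for prefix in ann.get("prefixes", []):
            events.append({
                **base,
                "event_type": "announcement",
                "prefix": prefix,
                "origin_asn": origin,
                "as_path": path,
                "communities": communities,
                "next_hop": ann.get("next_hop"),
                "med": data.get("med"),
                "local_pref": None,  # not assumed to be present in the feed
            })
    for prefix in data.get("withdrawals", []):
        events.append({
            **base,
            "event_type": "withdrawal",
            "prefix": prefix,
            "origin_asn": None,
            "as_path": [],
            "communities": [],
            "next_hop": None,
            "med": None,
            "local_pref": None,
        })
    return events
```

A single UPDATE can carry several prefixes, so the sketch emits one event per prefix rather than one per message.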

New anomaly table

Add a new table, recommended name: bgp_anomalies

Suggested columns:

id
snapshot_id
task_id
source
anomaly_type
severity
status
entity_key
prefix
origin_asn
new_origin_asn
peer_scope
started_at
ended_at
confidence
summary
evidence
created_at

This table should represent derived intelligence, not raw updates.
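To make the column list concrete, here is a sketch using the standard-library `sqlite3` module. The column types, the default `status`, and the uniqueness key on `(anomaly_type, entity_key, started_at)` are assumptions for illustration; the project's real database and ORM may differ.

```python
import json
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS bgp_anomalies (
    id INTEGER PRIMARY KEY,
    snapshot_id INTEGER,
    task_id INTEGER,
    source TEXT NOT NULL,
    anomaly_type TEXT NOT NULL,
    severity TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active',
    entity_key TEXT NOT NULL,
    prefix TEXT,
    origin_asn INTEGER,
    new_origin_asn INTEGER,
    peer_scope TEXT,
    started_at TEXT NOT NULL,
    ended_at TEXT,
    confidence REAL,
    summary TEXT,
    evidence TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (anomaly_type, entity_key, started_at)
)
"""

INSERT = """
INSERT OR IGNORE INTO bgp_anomalies
    (source, anomaly_type, severity, entity_key, prefix, origin_asn,
     new_origin_asn, started_at, confidence, summary, evidence)
VALUES
    (:source, :anomaly_type, :severity, :entity_key, :prefix, :origin_asn,
     :new_origin_asn, :started_at, :confidence, :summary, :evidence)
"""


def record_anomaly(conn, row):
    """Insert an anomaly; return False if the same one is already stored."""
    params = {**row, "evidence": json.dumps(row.get("evidence", {}))}
    cur = conn.execute(INSERT, params)
    return cur.rowcount == 1
```

The uniqueness key means that replaying the same window yields the same row instead of a duplicate, which is one way to keep anomaly generation idempotent.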

Collector Design

1. `RISLiveCollector`

Responsibility:

maintain WebSocket connection
subscribe to relevant message types
normalize messages
write event batches into snapshots
optionally emit derived anomalies in near real time

Suggested runtime mode:

long-running background task

Suggested snapshot strategy:

one snapshot per rolling time window
for example every 1 minute or every 5 minutes
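A rolling window can be derived by flooring the event time to the window size. The sketch below assumes timezone-aware datetimes and windows aligned to the Unix epoch; `snapshot_window` is a hypothetical helper.

```python
from datetime import datetime, timezone


def snapshot_window(event_time, window_seconds=60):
    """Return the (start, end) of the window that contains event_time."""
    epoch = int(event_time.timestamp())
    start = epoch - epoch % window_seconds  # floor to the window boundary
    return (
        datetime.fromtimestamp(start, tz=timezone.utc),
        datetime.fromtimestamp(start + window_seconds, tz=timezone.utc),
    )
```

Events whose windows share the same start would then be written into the same snapshot.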

2. `BGPStreamBackfillCollector`

Responsibility:

fetch historical data windows
normalize to the same schema as real-time data
build baselines
re-run anomaly rules on past windows if needed

Suggested runtime mode:

scheduled task
or ad hoc task for investigations

Suggested snapshot strategy:

one snapshot per historical query window
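A sketch of the backfill path, assuming the `pybgpstream` package (v2 API) is installed. It maps each update elem to the same fields as the real-time events; `normalize_bgpstream_elem` and `backfill` are hypothetical names, and the `project` value is passed in by the caller rather than read from the elem.

```python
from datetime import datetime, timezone

EVENT_TYPES = {"A": "announcement", "W": "withdrawal"}


def normalize_bgpstream_elem(elem, project="bgpstream"):
    """Map one pybgpstream elem to the shared schema; None for other types."""
    event_type = EVENT_TYPES.get(elem.type)
    if event_type is None:
        return None  # skips RIB ("R") and peer-state ("S") elems
    hops = elem.fields.get("as-path", "").split()
    as_path = [int(h) for h in hops if h.isdigit()]  # drops AS sets like {1,2}
    origin = int(hops[-1]) if hops and hops[-1].isdigit() else None
    when = datetime.fromtimestamp(elem.time, tz=timezone.utc)
    return {
        "project": project,
        "collector": elem.collector,
        "peer_asn": elem.peer_asn,
        "peer_ip": elem.peer_address,
        "event_type": event_type,
        "prefix": elem.fields.get("prefix"),
        "origin_asn": origin,
        "as_path": as_path,
        "communities": sorted(elem.fields.get("communities", [])),
        "next_hop": elem.fields.get("next-hop"),
        "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


def backfill(from_time, until_time, collectors):
    """Yield normalized update events for one historical query window."""
    import pybgpstream  # third-party; needs libBGPStream installed

    stream = pybgpstream.BGPStream(
        from_time=from_time,
        until_time=until_time,
        collectors=collectors,
        record_type="updates",
    )
    for elem in stream:
        event = normalize_bgpstream_elem(elem)
        if event is not None:
            yield event


# Example (needs network access):
#   for ev in backfill("2026-03-26 08:00:00", "2026-03-26 08:05:00", ["rrc00"]):
#       print(ev["prefix"], ev["event_type"])
```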

Normalization Rules

Normalize both sources into the same internal event model.

Required normalized fields:

collector
peer_asn
peer_ip
event_type
prefix
origin_asn
as_path
timestamp

Derived normalized fields:

as_path_length
country_guess
prefix_length
is_more_specific
visibility_weight
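Three of the derived fields can be computed from the event alone plus a list of known covering prefixes. The sketch below uses the standard-library `ipaddress` module. Collapsing consecutive duplicate ASNs (prepending) before counting is a design assumption, and `country_guess` and `visibility_weight` are left out because they need external data.

```python
import ipaddress


def derive_fields(event, baseline_prefixes=()):
    """Compute as_path_length, prefix_length, and is_more_specific."""
    net = ipaddress.ip_network(event["prefix"], strict=False)
    more_specific = False
    for known in baseline_prefixes:
        parent = ipaddress.ip_network(known, strict=False)
        # subnet_of() needs matching IP versions, so check that first.
        if parent.version == net.version and net != parent and net.subnet_of(parent):
            more_specific = True
            break
    path = event.get("as_path") or []
    # Drop consecutive repeats so AS prepending does not inflate the length.
    collapsed = [asn for i, asn in enumerate(path) if i == 0 or asn != path[i - 1]]
    return {
        "as_path_length": len(collapsed),
        "prefix_length": net.prefixlen,
        "is_more_specific": more_specific,
    }
```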

Anomaly Detection Rules

Start with these five rules.

1. Origin ASN Change

Trigger when:

the same prefix is announced by a new origin ASN not seen in the baseline window

Use for:

hijack suspicion
origin drift detection
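The trigger condition can be sketched as a lookup against a per-prefix set of baseline origins. The `baseline_origins` mapping, the function name, and the choice to skip prefixes that have no baseline are assumptions.

```python
def detect_origin_change(event, baseline_origins):
    """Return an anomaly dict if the origin is new for this prefix, else None.

    baseline_origins maps prefix -> set of origin ASNs seen in the baseline.
    """
    if event["event_type"] != "announcement" or event.get("origin_asn") is None:
        return None
    known = baseline_origins.get(event["prefix"])
    if not known or event["origin_asn"] in known:
        return None  # no baseline yet, or origin already seen
    return {
        "anomaly_type": "origin_change",
        "prefix": event["prefix"],
        # Only fill origin_asn when the baseline has a single origin.
        "origin_asn": next(iter(known)) if len(known) == 1 else None,
        "new_origin_asn": event["origin_asn"],
        "started_at": event["timestamp"],
        "evidence": {
            "baseline_origins": sorted(known),
            "as_path": event.get("as_path", []),
            "collector": event.get("collector"),
        },
    }
```

Prefixes that legitimately have several origins keep all of them in the baseline set, so they are only flagged when a further, unseen origin appears.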

2. More-Specific Burst

Trigger when:

a more-specific prefix appears suddenly
especially from an unexpected origin ASN

Use for:

subprefix hijack suspicion

3. Mass Withdrawal

Trigger when:

the same prefix or ASN sees many withdrawals across collectors within a short window

Use for:

outage suspicion
regional incident detection
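A sliding-window counter is one way to express "many withdrawals across collectors within a short window". In this sketch the window length and the collector threshold are placeholder values, events are assumed to arrive in time order, and `WithdrawalTracker` is a hypothetical class.

```python
from collections import defaultdict, deque


class WithdrawalTracker:
    """Track withdrawals per prefix and flag bursts across collectors."""

    def __init__(self, window_seconds=300, min_collectors=3):
        self.window_seconds = window_seconds
        self.min_collectors = min_collectors
        self._seen = defaultdict(deque)  # prefix -> deque of (epoch, collector)

    def observe(self, prefix, collector, epoch):
        """Record one withdrawal; return True if the burst threshold is met."""
        queue = self._seen[prefix]
        queue.append((epoch, collector))
        # Evict entries that fell out of the sliding window.
        while queue and queue[0][0] < epoch - self.window_seconds:
            queue.popleft()
        distinct = {name for _, name in queue}
        return len(distinct) >= self.min_collectors
```

Counting distinct collectors instead of raw withdrawals keeps a single flapping peer from triggering the rule on its own.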

4. Path Deviation

Trigger when:

AS path length jumps sharply
or a rarely seen transit ASN appears
or path frequency drops below baseline norms

Use for:

route leak suspicion
unusual path diversion
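The first two triggers can be sketched as follows. The thresholds are placeholders, `transit_share` is an assumed baseline statistic (the share of baseline paths for the prefix that contain each ASN), and the path-frequency trigger is not covered here.

```python
from statistics import median


def path_deviation(as_path, baseline_lengths, transit_share,
                   length_jump=3, rare_share=0.01):
    """Return the reasons a path looks unusual (empty list if none apply)."""
    reasons = []
    # Trigger 1: the path is much longer than the baseline median length.
    if baseline_lengths and len(as_path) - median(baseline_lengths) >= length_jump:
        reasons.append("length_jump")
    # Trigger 2: a transit hop (not first or last) is rare in the baseline.
    transit = as_path[1:-1]
    if any(transit_share.get(asn, 0.0) < rare_share for asn in transit):
        reasons.append("rare_transit")
    return reasons
```

Returning the list of reasons rather than a single flag lets the caller store them in the anomaly's `evidence` field.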

5. Visibility Drop

Trigger when:

a prefix is visible from far fewer collectors/peers than its baseline

Use for:

regional reachability degradation
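A sketch of the comparison, where both inputs are collections of peer identifiers (for example `collector|peer_ip` strings). The 50% ratio and the minimum baseline size are placeholder thresholds, and `visibility_drop` is a hypothetical name.

```python
def visibility_drop(current_peers, baseline_peers, max_ratio=0.5, min_baseline=10):
    """Return (flagged, ratio) comparing current visibility to the baseline.

    ratio is the share of baseline peers that still see the prefix.
    """
    baseline = set(baseline_peers)
    if len(baseline) < min_baseline:
        return (False, None)  # baseline too small to judge
    ratio = len(set(current_peers) & baseline) / len(baseline)
    return (ratio <= max_ratio, ratio)
```

Requiring a minimum baseline size avoids flagging prefixes that were only ever seen by a handful of peers.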

Baseline Strategy

Use BGPStream historical data to build:

common origin ASN per prefix
common AS path patterns
collector visibility distribution
normal withdrawal frequency

Recommended baseline windows:

short baseline: last 24 hours
medium baseline: last 7 days
long baseline: last 30 days

The first implementation can start with only the 7-day baseline.
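As an illustration of the first baseline item, the sketch below builds the set of common origin ASNs per prefix from normalized historical events. The 5% minimum share used to discard rare origins is an assumption, as is the function name.

```python
from collections import Counter, defaultdict


def build_origin_baseline(events, min_share=0.05):
    """Map prefix -> set of origin ASNs with at least min_share of announcements."""
    counts = defaultdict(Counter)
    for event in events:
        if event["event_type"] == "announcement" and event.get("origin_asn") is not None:
            counts[event["prefix"]][event["origin_asn"]] += 1
    baseline = {}
    for prefix, per_origin in counts.items():
        total = sum(per_origin.values())
        baseline[prefix] = {
            asn for asn, seen in per_origin.items() if seen / total >= min_share
        }
    return baseline
```

Feeding it the events of a 7-day window gives the 7-day baseline; the same function can be rerun over other windows later.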

API Design

Raw event API

Add endpoints like:

GET /api/v1/bgp/events
GET /api/v1/bgp/events/{id}

Suggested filters:

prefix
origin_asn
peer_asn
collector
event_type
time_from
time_to
source

Anomaly API

Add endpoints like:

GET /api/v1/bgp/anomalies
GET /api/v1/bgp/anomalies/{id}
GET /api/v1/bgp/anomalies/summary

Suggested filters:

severity
anomaly_type
status
prefix
origin_asn
time_from
time_to

Visualization API

Add an Earth-oriented endpoint like:

GET /api/v1/visualization/geo/bgp-anomalies

Recommended feature shapes:

point: collector locations
arc: inferred propagation or suspicious path edge
pulse point: active anomaly hotspot
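The three shapes could be returned as GeoJSON features. GeoJSON itself is an assumption here, since the existing visualization API's response format is not described; the property names and coordinates are illustrative only.

```python
def point_feature(collector, lon, lat, activity):
    """Collector location with its current activity intensity."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"shape": "point", "collector": collector, "activity": activity},
    }


def arc_feature(src, dst, anomaly_id):
    """Inferred propagation edge between two (lon, lat) positions."""
    return {
        "type": "Feature",
        "geometry": {"type": "LineString", "coordinates": [list(src), list(dst)]},
        "properties": {"shape": "arc", "anomaly_id": anomaly_id, "inferred": True},
    }


def pulse_feature(lon, lat, anomaly_type, severity):
    """Active anomaly hotspot."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"shape": "pulse", "anomaly_type": anomaly_type, "severity": severity},
    }


def feature_collection(features):
    return {"type": "FeatureCollection", "features": list(features)}
```

Marking arcs with `inferred: True` lets the frontend label them as inferred propagation rather than real packet flow.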

Earth Big-Screen Design

Recommended layers:

Layer 1: Collector layer

Show known collector locations and current activity intensity.

Layer 2: Route propagation arcs

Use arcs for:

origin ASN country to collector country
or collector-to-collector visibility edges

Important note:

This is an inferred propagation view, not real packet flow.

Layer 3: Active anomaly overlay

Show:

hijack suspicion in red
mass withdrawal in orange
visibility drop in yellow
path deviation in blue

Layer 4: Time playback

Use data_snapshots to replay:

minute-by-minute route changes
anomaly expansion
recovery timeline

Alerting Strategy

Map anomaly severity to the current alert system.

Recommended severity mapping:

critical
- likely hijack
- very large withdrawal burst
high
- clear origin change
- large visibility drop
medium
- unusual path change
- moderate more-specific burst
low
- weak or localized anomalies
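The mapping could be expressed as a small function. Every numeric threshold below is a placeholder, `magnitude` (the share of collectors or peers affected, 0 to 1) is an assumed metric, and the fallback levels for smaller withdrawal bursts and visibility drops are my assumption rather than part of the mapping above.

```python
def severity_for(anomaly_type, confidence, magnitude):
    """Map an anomaly to low / medium / high / critical."""
    # Weak or localized anomalies stay low regardless of type.
    if confidence < 0.3 or magnitude < 0.1:
        return "low"
    if anomaly_type == "origin_change":
        # A very confident origin change is treated as a likely hijack.
        return "critical" if confidence >= 0.9 else "high"
    if anomaly_type == "mass_withdrawal":
        return "critical" if magnitude >= 0.8 else "medium"
    if anomaly_type == "visibility_drop":
        return "high" if magnitude >= 0.5 else "medium"
    # path_deviation, more_specific_burst, and anything unrecognized.
    return "medium"
```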

Delivery Plan

Phase 1

add RISLiveCollector
normalize updates into collected_data
create bgp_anomalies
implement 3 rules:
- origin change
- more-specific burst
- mass withdrawal

Phase 2

add BGPStreamBackfillCollector
build 7-day baseline
implement:
- path deviation
- visibility drop

Phase 3

add Earth visualization layer
add time playback
add anomaly filtering and drilldown

Practical Implementation Notes

Start with IPv4, then add IPv6 once the event schema is stable.
Store the original raw payload in metadata.raw_message for traceability.
Deduplicate events by a stable hash of collector, peer, prefix, type, and timestamp.
Keep anomaly generation idempotent so replay and backfill do not create duplicate alerts.
Expect noisy data and partial views; confidence scoring matters.
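The deduplication note can be sketched with a SHA-256 fingerprint over the five named fields. The field names follow the metadata example earlier in this plan; `event_fingerprint` and `dedupe` are hypothetical helpers, and `peer_ip` is assumed to be what "peer" refers to.

```python
import hashlib


def event_fingerprint(event):
    """Stable hash of collector, peer, prefix, type, and timestamp."""
    parts = [
        event["collector"],
        event["peer_ip"],
        event["prefix"],
        event["event_type"],
        event["timestamp"],
    ]
    joined = "|".join(str(part) for part in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(events):
    """Keep the first event for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for event in events:
        key = event_fingerprint(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

Because the hash depends only on event content, the same update seen through RIS Live and again through a later backfill maps to the same fingerprint, provided both sources normalize timestamps to the same precision.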

Recommended First Patch Set

The first code milestone should include:

backend/app/services/collectors/ris_live.py
backend/app/services/collectors/bgpstream.py
backend/app/models/bgp_anomaly.py
backend/app/api/v1/bgp.py
backend/app/api/v1/visualization.py (add BGP anomaly geo endpoint)
frontend/src/pages (add a BGP anomaly list or summary page)
frontend/public/earth/js (add BGP anomaly rendering layer)
