Cluster health
—
of — nodes online
Avg latency
—
last poll
DB replicas
—
streaming WAL
Patroni leader
—
timeline —
Poll requests
—
— success rate
Next poll
—
never polled
Connecting to cluster...
NODE LATENCY — 5 MIN WINDOW
EVENT LOG
—
>>SYSTEM READY. AWAITING PROXY CONNECTION.
Nginx access log: last 30 app requests through LB
RELIABILITY TARGETS
| METRIC | TARGET | RESULT | TIME |
|---|---|---|---|
| System Uptime | ≥99% | — | — |
| RTO | <5s | — | — |
| RPO | 0s | — | — |
| DB Failover | <30s (Patroni) | — | — |
| Failure Detection | ≤5s | — | — |
| Auto-Recovery | 100% | — | — |
| DB Replicas | ≥2 | — | — |
PERFORMANCE TARGETS
| METRIC | TARGET | RESULT | TIME |
|---|---|---|---|
| Latency p50 | ≤180ms | — | — |
| Latency p95 | ≤250ms | — | — |
| Throughput | ≥1000 req/s | — | — |
| Replication Lag | <50ms | — | — |
| Lag (peak) | <100ms | — | — |
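The dashboard does not say how the latency percentiles are derived from raw samples. As a minimal sketch, assuming the nearest-rank method and a plain list of per-request latencies in milliseconds, the p50 and p95 targets above could be checked as follows. The names `percentile`, `evaluate`, and `TARGETS_MS` are hypothetical and not part of the application.

```python
from typing import Dict, Sequence, Tuple

# Targets copied from the PERFORMANCE TARGETS table (milliseconds).
TARGETS_MS: Dict[int, float] = {50: 180.0, 95: 250.0}


def percentile(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile (assumed method) for an integer p in 1..100."""
    if not samples_ms:
        raise ValueError("no latency samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def evaluate(samples_ms: Sequence[float]) -> Dict[str, Tuple[float, bool]]:
    """Return {"p50": (value, meets_target), "p95": (value, meets_target)}."""
    report = {}
    for p, limit in TARGETS_MS.items():
        value = percentile(samples_ms, p)
        report[f"p{p}"] = (value, value <= limit)
    return report
```

With 18 requests at 100 ms and 2 at 400 ms, this sketch reports p50 as passing and p95 as failing, which shows why a benchmark with a good average can still miss the p95 target.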
PHASE 6 — FAILOVER & VALIDATION
TESTS PENDING
EVIDENCE LOG
NO EVIDENCE — RUN TESTS TO POPULATE
MANUAL PROBE
AWAITING INPUT
Configuration
Runs server-side via proxy — accurate results.
Results
Total
—
Success
—
Failed
—
Rate
—
Avg ms
—
Req/sec
—
Benchmark history
| Time | n | c | Path | Success% | Avg ms | P50 | P95 | P99 | Req/sec |
|---|---|---|---|---|---|---|---|---|---|
| No runs yet. | | | | | | | | | |
WAL STREAMING TOPOLOGY
—
Connect to load topology
STREAMING REPLICAS
No data — click VERIFY
REPLICATION LAG
—
target <50ms
RPO
0s
synchronous WAL
REPLICAS ONLINE
—
of — configured
PATRONI CLUSTER
—
Click REFRESH to load Patroni status
FAILURE INJECTION
⚠ APP KILL — SIGKILL ha-app process
Kills the C++ HTTP server. systemd auto-restarts within ~2s. Tests Nginx failover and service recovery. Does NOT affect PostgreSQL.
Connect to load nodes
⚠ DB KILL — Stop Patroni on leader (triggers election)
Stops Patroni on the selected node. If it is the leader, Patroni elects a new primary within ~15s. A RESTORE button appears next to any unreachable node — click it to bring Patroni back and rejoin the cluster.
Connect to load nodes
RECOVERY LOG
No chaos events yet.
MANUAL SERVICE COMMANDS
Node details
Navigate to this page while connected to load node details.
ENLISTMENT GUIDE
HOW TO ADD A NEW NODE
01
CREATE VPS
Create a new VPS on Hetzner (or any provider).
Recommended: CX22 — Ubuntu 22.04, 2 vCPU, 4GB RAM.
Note the public IP and root password from the provider panel.
02
FILL THE FORM
Click ENLIST NODE below and fill in:
• Node ID: unique name (e.g. hetzner4)
• IP: public IP from step 01
• SSH user: root
• SSH password: root password from provider
• DB role: Master or Replica
• DB host: master's Tailscale IP (replicas only)
03
PROVISION & AUTHORISE
Click PROVISION & ENLIST and watch the progress.
When the Tailscale step appears, a URL will be shown. Open it in your browser to authorise the node on your Tailscale network.
Provisioning resumes automatically. All 11 steps complete in ~3 minutes.
After enlisting, configure etcd + Patroni for automatic DB failover via the Replication page.
WHAT IS AUTOMATED
✓ SSH key deployment
✓ Build dependencies (g++, libpqxx)
✓ Tailscale installation
✓ C++ app compilation
✓ systemd service setup
✓ PostgreSQL install & configuration
✓ Replication setup (master or replica)
✓ Nginx upstream update
✓ Keepalived LB failover (Hetzner 1 ↔ 2)
✓ Health check 503 on DB failure
✓ Health check verification
✓ Node registration
DB ROLE: MASTER
Sets up PostgreSQL with WAL replication enabled.
Creates appuser and appdb.
Opens port 5432 on Tailscale interface.
DB host: leave blank
DB ROLE: REPLICA
Takes a base backup from master.
Streams WAL automatically.
Read-only; serves /data reads.
DB host: master's Tailscale IP
DB ROLE: NONE
App-only node — no PostgreSQL.
Only runs the C++ HTTP server.
Useful for pure compute/LB nodes.
DB host: not required
NODE ROSTER
—
DOUBLE CHECK METHOD — FAILURE DETECTION
Outpost monitors external websites with a two-stage verification pipeline before declaring a failure —
based on Naim, M.H. et al. (2025),
"Double Check Method: An Enhancement of Heartbeat Failure Detection by Fog Devices Through Socket and Port Engagement" (SSRN 5099955).
This reduces false-positive failure detection compared to single-shot heartbeat monitoring.
01
HEARTBEAT
Periodic HTTP request to the target URL. If it succeeds — healthy, no further action. If it fails, verification begins.
02
TIME CHECK
Wait a debounce window, then retry the heartbeat. If it recovers within the threshold — transient blip, false positive filtered. No alert raised.
03
QUORUM SOCKET CHECK
If still failing, every cluster node independently opens a raw TCP socket to host:port — not just one device.
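The three stages above can be sketched from the point of view of a single node. This is an illustration under stated assumptions, not Outpost's actual code: the function names, timeouts, the 10-second debounce default, and the `HEALTHY` label are all hypothetical. The probes are passed in as callables so the control flow can be exercised without network access.

```python
import socket
import time
import urllib.request
from typing import Callable


def http_heartbeat(url: str, timeout: float = 5.0) -> bool:
    """Stage 01: one HTTP GET; any error or timeout is a missed heartbeat."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Stage 03, one node's view: can a raw TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_target(
    heartbeat: Callable[[], bool],
    socket_check: Callable[[], bool],
    debounce_s: float = 10.0,  # assumed value; the source gives no number
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run heartbeat, then time check, then socket check for one node."""
    if heartbeat():
        return "HEALTHY"
    # Stage 02: wait out the debounce window and retry once.
    sleep(debounce_s)
    if heartbeat():
        return "HEALTHY"  # transient blip, no alert raised
    # Stage 03: the port answers but HTTP does not -> app-layer problem.
    return "DEGRADED" if socket_check() else "DOWN"
```

A caller would bind the probes to a concrete target, for example `check_target(lambda: http_heartbeat("https://example.com/"), lambda: tcp_reachable("example.com", 443))`. The result is only this node's local opinion; the quorum step combines the opinions of all nodes.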
ENHANCEMENT — DISTRIBUTED QUORUM VERIFICATION
The original method performs the socket check from a single fog device — leaving it vulnerable to a false DOWN
caused by that device's own localised network path, not the target. This implementation extends the method:
the socket check runs from every cluster node in parallel (each acting as an
independent fog device), and a verdict requires quorum agreement.
• All nodes agree reachable → DEGRADED (confirmed app-layer issue, network fine everywhere)
• All nodes agree unreachable → DOWN (confirmed outage, high confidence)
• Split result → ambiguous; the debounce window doubles and the quorum re-checks once before a majority-vote decision is made
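The verdict rules can be sketched as a small decision function. This is a hedged illustration, not the shipped implementation: `tally`, `quorum_verdict`, and the `poll_nodes` callable (which returns one reachable/unreachable result per cluster node) are hypothetical names. The text does not say how an exact tie in the final majority vote is resolved; this sketch assumes a tie falls back to DEGRADED, the less severe state.

```python
import time
from typing import Callable, Sequence


def tally(results: Sequence[bool]) -> str:
    """Classify one round of per-node socket checks (True = reachable)."""
    if not results:
        raise ValueError("no node results to tally")
    if all(results):
        return "DEGRADED"  # port reachable everywhere, so the app layer is at fault
    if not any(results):
        return "DOWN"      # unreachable from every node
    return "SPLIT"


def quorum_verdict(
    poll_nodes: Callable[[], Sequence[bool]],
    debounce_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Apply the unanimous rules, with one re-check after a doubled debounce."""
    verdict = tally(poll_nodes())
    if verdict != "SPLIT":
        return verdict
    # Ambiguous: double the debounce window, then re-check exactly once.
    sleep(2 * debounce_s)
    second = list(poll_nodes())
    verdict = tally(second)
    if verdict != "SPLIT":
        return verdict
    # Still split: majority vote. ASSUMPTION: a tie resolves to DEGRADED.
    unreachable = second.count(False)
    return "DOWN" if unreachable > len(second) - unreachable else "DEGRADED"
```

In practice `poll_nodes` would fan the TCP check out to every node in parallel and collect the answers; here it is left abstract so the decision logic stays separate from transport details.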
ADD OUTPOST
0 MONITORED
MONITORED TARGETS
No outposts yet — add one above.