Ops - what to alert on, how to fail over, how to recover.
Everything below is what we run against managed instances. Most of it is automatic - the engine self-heals on common failure modes - but when something needs hands on it, this is the playbook.
Health checks
| endpoint | what it tells you | alert when |
|---|---|---|
| /health | Liveness - process is up and the log is mountable. Returns 200 + JSON build/version. Use for load-balancer target-group health checks. | any 5xx or sustained timeout |
| /ready | Readiness - engine has applied its log up to the latest durable checkpoint, lease is held, follower (if any) is connected. | non-200 for >30s after boot |
| /v1/tenants/:t/usage | Per-tenant usage snapshot: row count, vector count, in-flight queries, subscription state. | watch via dashboard - poll for trends |
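
The /ready rule in the table ("non-200 for >30s") can be expressed as a small piece of state in whatever prober you already run. The following is a minimal sketch, not the service's own alerting code: the class name, the grace_s parameter, and the idea of passing in a monotonic timestamp are all assumptions made for illustration, and it treats any continuous non-200 run the same way rather than special-casing boot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadyAlert:
    """Fires once /ready has been non-200 for longer than grace_s seconds.

    Hypothetical helper: the name and fields are illustrative only.
    """
    grace_s: float = 30.0
    failing_since: Optional[float] = None

    def observe(self, status_code: int, now: float) -> bool:
        """Record one probe result; return True when the alert should fire."""
        if status_code == 200:
            # Any healthy probe resets the failure window.
            self.failing_since = None
            return False
        if self.failing_since is None:
            self.failing_since = now
        return (now - self.failing_since) > self.grace_s
```

Feed it the status code of each probe together with a clock reading such as time.monotonic(); a single 200 clears the window, so brief blips do not page.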
Per-tenant usage signal
GET /v1/tenants/:t/usage returns the current snapshot: row count per schema, vector count per table, in-flight Ask queries, subscription status. Poll it from your monitoring system; the dashboard renders the same data live.
A first-class Prometheus exporter is on the roadmap. Until then, hit /usage on a poll interval and graph the deltas.
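
As a rough illustration of "poll and graph the deltas", the sketch below flattens two usage snapshots and subtracts them. The JSON shape of the /usage response is not specified here, so the helper makes no assumption about field names: it simply walks nested objects and keeps numeric leaves. Both function names are hypothetical.

```python
from typing import Any, Dict

Number = float  # ints are accepted wherever floats are


def numeric_leaves(snapshot: Dict[str, Any], prefix: str = "") -> Dict[str, Number]:
    """Flatten a nested snapshot into dotted keys, keeping only numeric values."""
    out: Dict[str, Number] = {}
    for key, value in snapshot.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(numeric_leaves(value, path + "."))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            out[path] = value
    return out


def usage_deltas(prev: Dict[str, Number], curr: Dict[str, Number]) -> Dict[str, Number]:
    """Per-counter change between two polls; counters new in curr count from 0."""
    return {key: value - prev.get(key, 0) for key, value in curr.items()}
```

Non-numeric fields such as subscription state are skipped by the flattening step, so they would need a separate equality check if you want to alert on changes.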
Backups
Three layers, all archived in the same region as the tenant.
- Continuous archive shipping - default. Each rolled log segment ships to the archive automatically. Worst-case data loss is the unflushed tail of the active segment, typically minutes-to-hours on idle tenants.
- Checkpoint shipping - hourly auto-compaction snapshots. Restore picks the latest snapshot at-or-before the target and replays the archive forward from that snapshot to the target.
- Restore-to-timestamp - oc-pitr restore --target <rfc3339> rebuilds a fresh data dir from the chosen snapshot + archive. The result opens cleanly through the existing validating loaders.
Recovery points are retained 30 days; manual purges on data-subject requests are documented in the runbook.
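
The snapshot-selection rule used by restore ("latest snapshot at-or-before the target") is a standard sorted-list lookup. The sketch below shows that rule in isolation; it is not the oc-pitr implementation, and pick_snapshot is a hypothetical name. It assumes snapshot timestamps are already sorted in ascending order.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Sequence


def pick_snapshot(snapshots: Sequence[datetime], target: datetime) -> datetime:
    """Return the latest snapshot taken at or before target.

    snapshots must be sorted ascending. Raises ValueError when the target
    predates every snapshot, since there is nothing to replay from.
    """
    idx = bisect_right(snapshots, target)
    if idx == 0:
        raise ValueError("no snapshot at or before the restore target")
    return snapshots[idx - 1]
```

Archive replay would then start at the returned snapshot and stop at the target timestamp.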
Continuous backup streaming
The default ship-on-roll flow gives PITR granularity at the archive
cadence - fine for most tenants, too coarse for compliance-heavy ones.
Tail-shipping uploads the cumulative bytes of the active log
every window ms (default 500), driving worst-case data loss to ~0.5–1.5 s.
On the managed service the continuous backup stream runs as a built-in task on
oc-server alongside restore-side replay (auto-replay of the latest tail at-or-before the target).
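
To make the tail-shipping idea concrete, here is a minimal sketch of a loop that re-uploads the cumulative bytes of the active log once per window. It is an illustration under stated assumptions, not the built-in oc-server task: TailShipper, read_active, and upload are hypothetical names, the log is assumed to be append-only (so a length check detects new data), and segment rolls are not modelled.

```python
import time
from typing import Callable


class TailShipper:
    """Uploads the whole active-log tail whenever it has grown."""

    def __init__(self, read_active: Callable[[], bytes], upload: Callable[[bytes], None]):
        self.read_active = read_active  # returns all bytes of the active segment
        self.upload = upload            # overwrites the tail object in the archive
        self.shipped_len = 0            # length of the last tail we shipped

    def tick(self) -> bool:
        """Ship the cumulative tail if it grew; return True when an upload happened."""
        data = self.read_active()
        if len(data) == self.shipped_len:
            return False
        self.upload(data)
        self.shipped_len = len(data)
        return True


def run(shipper: TailShipper, window_ms: int = 500,
        stop: Callable[[], bool] = lambda: False) -> None:
    """Call tick() once per window until stop() returns True."""
    while not stop():
        shipper.tick()
        time.sleep(window_ms / 1000)
```

Shipping the cumulative bytes rather than a diff keeps the archive side simple: the latest tail object is always a complete prefix of the active segment.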
Failover
When the writer is wedged or the underlying compute is silent, we promote the
in-region follower via scripts/promote-follower.sh. RTO is ~25s, drilled twice on
live-test-1 as of 2026-04-30.
The flow:
- Fence the old writer - SSM systemctl stop oc-http.service. The old writer's lease-heartbeat would self-fence on next tick anyway; stopping the service is faster + deterministic.
- Read the writer's ExecStart - strip --mode follower + --leader-addr, preserve sync-rep flags, TLS paths, LLM config.
- Bump replication-epoch - fences any zombie writer that comes back later.
- Restart the follower in writer mode and smoke /health.
- Update DNS at the new writer's public IP, TTL 60s.
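
The ExecStart rewrite in the flow above can be sketched as a small argument filter. This is not the contents of scripts/promote-follower.sh; it is an illustrative Python version that assumes the two follower-only flags take a value either as a separate argument or in --flag=value form. The other flag names in the usage note are hypothetical.

```python
import shlex
from typing import List

# Flags that only make sense on a follower and are dropped on promotion.
FOLLOWER_ONLY_FLAGS = ("--mode", "--leader-addr")


def to_writer_exec_start(exec_start: str) -> str:
    """Drop follower-only flags (and their values); keep everything else in order."""
    kept: List[str] = []
    skip_value = False
    for arg in shlex.split(exec_start):
        if skip_value:
            # This argument is the value of a flag we just dropped.
            skip_value = False
            continue
        name = arg.split("=", 1)[0]
        if name in FOLLOWER_ONLY_FLAGS:
            # "--flag value" form: also skip the next argument.
            skip_value = "=" not in arg
            continue
        kept.append(arg)
    return shlex.join(kept)
```

For example, a command line such as "oc-server --mode follower --leader-addr 10.0.0.5:7000 --tls-cert /etc/oc/cert.pem" would come back with only the binary and the TLS flag, which is the "preserve everything else" behaviour the step describes.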
Manual intervention looks like: page on oc-controlplane-instance-silent, confirm the writer is genuinely
dead (not a transient network blip), run the script with
--writer-instance + --follower-instance, watch the four step-prints, verify
/health on the new endpoint.
When not to use it: the writer is responding but slow
(load-shed, don't fail over); the follower's applied_lsn is far behind (you'll lose data - investigate replication first); or an
online schema migration is active in Backfilling state.
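
The three "don't fail over" conditions lend themselves to a pre-flight check before running the script. The sketch below is one way to encode them; the function name, its parameters, and the max_lag_lsn threshold are assumptions, since no numeric definition of "far behind" is given here.

```python
from typing import List, Optional


def failover_blockers(writer_responding: bool,
                      follower_lag_lsn: int,
                      max_lag_lsn: int,
                      migration_state: Optional[str]) -> List[str]:
    """Return the reasons not to promote; an empty list means go ahead."""
    blockers: List[str] = []
    if writer_responding:
        blockers.append("writer is responding: load-shed, don't fail over")
    if follower_lag_lsn > max_lag_lsn:
        blockers.append("follower applied_lsn is far behind: investigate replication first")
    if migration_state == "Backfilling":
        blockers.append("online schema migration is in Backfilling state")
    return blockers
```

Returning every blocker, rather than stopping at the first, gives the on-call engineer the full picture in one pass.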
Schema migrations
Online migrations are first-class - no read-only window, no service bounce. The model:
- BackfillRateLimiter - capped at 10% of writer throughput so live traffic stays prioritised.
- Dual-read transform - readers see the v1 shape during backfill via an in-memory transform applied to v0 rows.
- Atomic cutover - version bump is a single commit; reads switch to v1-native on the next read.
- Abort-only-pre-cutover - once cutover lands, the only path forward is an inverse-rewrite migration. Aborts in Backfilling state are safe and reversible.
Use online migrations any time the manifest version bumps. Stuck-in-Backfilling troubleshooting is in incident response → RUNBOOK.
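
The dual-read and abort rules in the model above can be illustrated with a toy state machine. This is a sketch of the idea only, not the engine's migration code: the class, the _version row field, and every state name other than Backfilling are hypothetical.

```python
from enum import Enum
from typing import Any, Callable, Dict

Row = Dict[str, Any]


class State(Enum):
    BACKFILLING = "Backfilling"
    CUT_OVER = "CutOver"   # hypothetical name
    ABORTED = "Aborted"    # hypothetical name


class OnlineMigration:
    def __init__(self, transform: Callable[[Row], Row]):
        self.transform = transform  # maps a v0 row to its v1 shape
        self.state = State.BACKFILLING

    def read(self, row: Row) -> Row:
        """Dual-read: v0 rows are transformed in memory, v1 rows pass through."""
        if row.get("_version", 0) == 0:
            return self.transform(row)
        return row

    def cutover(self) -> None:
        """Single state change standing in for the atomic version bump."""
        if self.state is not State.BACKFILLING:
            raise RuntimeError("cutover is only valid from Backfilling")
        self.state = State.CUT_OVER

    def abort(self) -> None:
        """Aborts are allowed only before cutover."""
        if self.state is not State.BACKFILLING:
            raise RuntimeError("abort is only allowed before cutover")
        self.state = State.ABORTED
```

Readers always receive the v1 shape from read(), whether or not the backfill has reached a given row, which is why no read-only window is needed.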
Observability
- EXPLAIN - prefix any SELECT with EXPLAIN to return the plan tree without running the query. Useful for verifying that your indexes are being used. See SQL reference → EXPLAIN.
- Per-tenant /usage - GET /v1/tenants/:t/usage returns row counts, vector counts, in-flight queries, and subscription state. Poll it for monitoring.
- OTLP push tracing - opt-in. Configure the collector endpoint via the dashboard; spans are pushed for every /v1 request with tail-based sampling that retains slow + error traces in full.
- Dashboard - live metrics tiles for every instance: query latency, replication lag, recent writes, error rate. Visit app.originchain.ai.
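
Tail-based sampling means the keep-or-drop decision is made after a trace has finished, when its duration and error status are known. The sketch below shows that decision for the "retain slow + error traces in full" rule; the Trace type, the function name, and the 1000 ms slow threshold are all assumptions, and how other traces are sampled is not covered.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Summary of a completed trace (hypothetical shape)."""
    duration_ms: float
    has_error: bool


def keep_in_full(trace: Trace, slow_ms: float = 1000.0) -> bool:
    """Decide, once the trace is complete, whether to retain all of its spans."""
    return trace.has_error or trace.duration_ms >= slow_ms
```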
Incident response
The full playbook is in docs/RUNBOOK.md. Highlights:
- Pager severity: sev 1 (tenant down) → 5 min ack, 1 hr fix-or-mitigate. Sev 2 (one alarm tripped) → 15 min / 4 hr. Sev 3 (drift) → next business day.
- Status page: publishes per-region health and incident timelines. Subscribers get email + webhook on any sev 1.
- Stuck-in-Backfilling: abort the migration via POST /v1/tenants/:t/migrations/:id/abort, then resubmit.
- Order of operations: stop the bleeding, find root cause, write a postmortem, fix the underlying problem. Quiet incidents become loud ones.
Incidents today are handled by core engineering during extended business hours with best-effort overnight coverage; the pager-severity SLAs above are the targets we hold ourselves to. 24/7 named-engineer coverage is available on Enterprise - contact sales.
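
For tooling that tracks the pager targets above, the sev 1 and sev 2 windows translate directly into deadlines. The following is a small sketch with hypothetical names; sev 3 ("next business day") depends on a business calendar and is deliberately left out.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple

# (ack window, fix-or-mitigate window) per severity, taken from the targets above.
SLA: Dict[int, Tuple[timedelta, timedelta]] = {
    1: (timedelta(minutes=5), timedelta(hours=1)),
    2: (timedelta(minutes=15), timedelta(hours=4)),
}


def deadlines(sev: int, paged_at: datetime) -> Tuple[datetime, datetime]:
    """Return (ack_by, fix_or_mitigate_by) for a sev 1 or sev 2 page."""
    if sev not in SLA:
        raise ValueError(f"no fixed-duration target for sev {sev}")
    ack, fix = SLA[sev]
    return paged_at + ack, paged_at + fix
```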
Compliance posture
- SOC 2 Type 1: underway with Vanta/Drata and an external CPA. Contact us for the audit timeline; the in-flight gap analysis is available to procurement under NDA.
- HIPAA BAA: available on Enterprise. PHI workloads must run in a region the BAA covers and on a dedicated-capacity instance.
- GDPR DPA: available on Enterprise. EU-region (Frankfurt) instances support the DPA out of the box; data-subject deletion is documented in RUNBOOK.md.