07 · ops & runbook

Ops - what to alert on, how to fail over, how to recover.

Everything below is what we run against managed instances. Most of it is automatic - the engine self-heals on common failure modes - but when something needs hands on it, this is the playbook.

Health checks

endpoint	what it tells you	alert when
/health	Liveness - process is up and the log is mountable. Returns 200 + JSON build/version. Use for load-balancer target-group health checks.	any 5xx or sustained timeout
/ready	Readiness - engine has applied its log up to the latest durable checkpoint, lease is held, follower (if any) is connected.	non-200 for >30s after boot
/v1/tenants/:t/usage	Per-tenant usage snapshot: row count, vector count, in-flight queries, subscription state.	watch via dashboard - poll for trends
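
The alert rules in the table can be sketched as two small predicates. This is a minimal illustration, not the managed service's alerting code: the function names, the 2 s probe timeout, and the "three timeouts in a row counts as sustained" threshold are all assumptions, and ready_alert takes one reading of "non-200 for >30s after boot" (alert on any non-200 once 30 s have passed since boot).

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout_s: float = 2.0) -> Optional[int]:
    """Return the HTTP status code, or None on timeout / connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx responses still carry a status code
    except OSError:
        return None  # URLError and socket timeouts are OSError subclasses


def health_alert(status: Optional[int], consecutive_timeouts: int,
                 timeout_limit: int = 3) -> bool:
    """/health rule: any 5xx, or a sustained run of timeouts.

    consecutive_timeouts includes the current probe; timeout_limit is an
    assumed definition of "sustained".
    """
    if status is None:
        return consecutive_timeouts >= timeout_limit
    return 500 <= status <= 599


def ready_alert(status: Optional[int], seconds_since_boot: float,
                grace_s: float = 30.0) -> bool:
    """/ready rule: non-200 once the 30 s post-boot grace period is over."""
    return status != 200 and seconds_since_boot > grace_s
```

Call probe() on a schedule from your monitoring system and feed the result into the matching predicate; a failed probe (None) counts as non-200 for /ready.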

Per-tenant usage signal

GET /v1/tenants/:t/usage returns the current snapshot: row count per schema, vector count per table, in-flight Ask queries, subscription status. Poll it from your monitoring system; the dashboard renders the same data live.

A first-class Prometheus exporter is on the roadmap. Until then, hit /usage on a poll interval and graph the deltas.
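
Graphing the deltas comes down to subtracting consecutive snapshots. A minimal sketch, assuming you have already flattened each /usage response into a dict of numeric counters; the response schema is not documented here, so keys such as row_count are hypothetical.

```python
from typing import Dict


def usage_deltas(prev: Dict[str, float], curr: Dict[str, float],
                 interval_s: float) -> Dict[str, float]:
    """Per-second rate of change for every counter present in both snapshots."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {key: (curr[key] - prev[key]) / interval_s
            for key in curr if key in prev}
```

Counters that appear only in the newer snapshot are skipped, since there is nothing to diff them against yet.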

Backups

Three layers, all archived in the same region as the tenant.

Continuous archive shipping - default. Each rolled log segment ships to the archive automatically. Worst-case data loss is the unflushed tail of the active segment, typically minutes-to-hours on idle tenants.
Checkpoint shipping - hourly auto-compaction snapshots. Restore picks the latest snapshot at or before the target and replays the archive from that snapshot up to the target.
Restore-to-timestamp - oc-pitr restore --target <rfc3339> rebuilds a fresh data dir from the chosen snapshot + archive. The result opens cleanly through the existing validating loaders.

Recovery points are retained 30 days; manual purges on data-subject requests are documented in the runbook.
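
The restore-to-timestamp selection described above (latest snapshot at or before the target, then archive replay up to the target) can be sketched as follows. This illustrates the selection rule only; it is not how oc-pitr is implemented, and the Snapshot and Segment types and their fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    taken_at: datetime


@dataclass(frozen=True)
class Segment:
    first_ts: datetime  # timestamp of the first record in the segment
    last_ts: datetime   # timestamp of the last record in the segment


def plan_restore(snapshots: List[Snapshot], segments: List[Segment],
                 target: datetime) -> Tuple[Snapshot, List[Segment]]:
    """Pick the base snapshot and the archive segments to replay on top of it."""
    candidates = [s for s in snapshots if s.taken_at <= target]
    if not candidates:
        raise ValueError("no snapshot at or before the target")
    base = max(candidates, key=lambda s: s.taken_at)
    # Keep segments that hold records after the snapshot and start no later
    # than the target. The last one may run past the target; replay would
    # stop at the target inside it.
    replay = sorted(
        (seg for seg in segments
         if seg.last_ts > base.taken_at and seg.first_ts <= target),
        key=lambda seg: seg.first_ts,
    )
    return base, replay
```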

Continuous backup streaming

Opt-in via the Sub-Second PITR add-on

The default ship-on-roll flow gives PITR granularity at the archive cadence: fine for most tenants, too coarse for compliance-heavy ones. Tail-shipping instead uploads the cumulative bytes of the active log once per window (default 500 ms), driving worst-case data loss down to ~0.5–1.5 s.

On the managed service, the continuous backup stream runs as a built-in task on oc-server, alongside restore-side replay (automatic replay of the latest tail at or before the target).
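
The tail-shipping loop can be pictured roughly like this. It is a sketch of the idea, not the oc-server task: read_active_log and upload are hypothetical callables you would supply, and skipping the upload when the log has not grown is an assumption added to keep idle tenants quiet.

```python
import time
from typing import Callable


def ship_tail(read_active_log: Callable[[], bytes],
              upload: Callable[[bytes], None],
              window_ms: int = 500,
              ticks: int = 1,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Upload the cumulative bytes of the active log once per window.

    Runs for a fixed number of ticks and returns how many uploads happened.
    """
    uploads = 0
    last_len = -1
    for _ in range(ticks):
        data = read_active_log()
        if len(data) != last_len:  # assumption: skip when nothing was appended
            upload(data)           # whole active log so far, not just the diff
            last_len = len(data)
            uploads += 1
        sleep(window_ms / 1000.0)
    return uploads
```

Because each upload carries the full active log, the newest upload alone is enough to restore the tail. A write can wait up to one window before the next upload starts, which is one plausible reading of why the worst case lands slightly above the window length.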

Failover

When the writer is wedged or the underlying compute is silent, we promote the in-region follower via scripts/promote-follower.sh. RTO is ~25s, drilled twice on live-test-1 as of 2026-04-30.

The flow:

Fence the old writer - run systemctl stop oc-http.service over SSM. The old writer's lease heartbeat would self-fence on the next tick anyway; stopping the service is faster and deterministic.
Read the writer's ExecStart - strip --mode follower + --leader-addr, preserve sync-rep flags, TLS paths, LLM config.
Bump replication-epoch - fences any zombie writer that comes back later.
Restart the follower in writer mode and smoke /health.
Update DNS to point at the new writer's public IP, TTL 60s.
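
The ExecStart rewrite in the flow above is the step most worth illustrating. The sketch below is not the contents of scripts/promote-follower.sh; it only shows the "strip follower flags, keep everything else" idea. It assumes flags arrive either as "--flag value" or "--flag=value", and it does not add a writer-mode flag because the flow does not say one is needed. The binary path and the preserved flag names in the usage example are hypothetical.

```python
import shlex
from typing import List


def promote_exec_start(exec_start: str) -> str:
    """Drop --mode follower and --leader-addr; keep all other arguments."""
    tokens = shlex.split(exec_start)
    kept: List[str] = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        name, has_eq, value = tok.partition("=")
        if name == "--mode":
            if not has_eq:
                value = tokens[i + 1] if i + 1 < len(tokens) else ""
            if value == "follower":
                i += 1 if has_eq else 2  # skip the flag (and its value)
                continue
        elif name == "--leader-addr":
            i += 1 if has_eq else 2
            continue
        kept.append(tok)
        i += 1
    return shlex.join(kept)
```

For example, promote_exec_start("/usr/bin/oc-server --mode follower --leader-addr 10.0.0.5:7000 --sync-rep --tls-cert /etc/oc/tls.pem") returns "/usr/bin/oc-server --sync-rep --tls-cert /etc/oc/tls.pem".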

Manual intervention looks like this: get paged on oc-controlplane-instance-silent, confirm the writer is genuinely dead (not a transient network blip), run the script with --writer-instance and --follower-instance, watch the four step-prints, then verify /health on the new endpoint.

When not to use it: the writer is responding but slow (load-shed, don't fail over); the follower's applied_lsn is far behind (you'll lose data, so investigate replication first); or an online schema migration is in Backfilling state.
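
Those three "don't" conditions make a natural preflight check before running the script. A minimal sketch under stated assumptions: the FailoverContext fields are hypothetical inputs you would gather yourself, and max_lsn_gap is a threshold you choose, since "far behind" is not quantified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailoverContext:
    writer_responding: bool
    writer_durable_lsn: int    # last LSN the writer is known to have made durable
    follower_applied_lsn: int
    migration_state: str       # e.g. "Backfilling", or "" when no migration runs


def failover_blockers(ctx: FailoverContext, max_lsn_gap: int) -> List[str]:
    """Return the reasons not to fail over; an empty list means go ahead."""
    blockers: List[str] = []
    if ctx.writer_responding:
        blockers.append("writer is responding: load-shed instead of failing over")
    if ctx.writer_durable_lsn - ctx.follower_applied_lsn > max_lsn_gap:
        blockers.append("follower applied_lsn is too far behind: "
                        "investigate replication first")
    if ctx.migration_state == "Backfilling":
        blockers.append("online schema migration is in Backfilling state")
    return blockers
```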

Schema migrations

Online migrations are first-class - no read-only window, no service bounce. The model:

BackfillRateLimiter - capped at 10% of writer throughput so live traffic stays prioritised.
Dual-read transform - readers see the v1 shape during backfill via an in-memory transform applied to v0 rows.
Atomic cutover - version bump is a single commit; reads switch to v1-native on the next read.
Abort-only-pre-cutover - once cutover lands, the only path forward is an inverse-rewrite migration. Aborts in Backfilling state are safe and reversible.
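
A toy model of the dual-read transform and the abort-only-pre-cutover rule, to make the state transitions concrete. It is a sketch, not the engine's code: the "_v" version marker on rows and the "CutOver" and "Aborted" state names are hypothetical; only "Backfilling" comes from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Row = Dict[str, object]


@dataclass
class Migration:
    upgrade: Callable[[Row], Row]  # in-memory v0 -> v1 transform
    state: str = "Backfilling"     # "Backfilling" | "CutOver" | "Aborted"

    def read(self, stored: Row) -> Row:
        """Readers see the v1 shape even for rows the backfill has not reached."""
        if self.state == "Aborted":
            return stored              # migration rolled back: v0 as stored
        if stored.get("_v", 0) == 0:
            return self.upgrade(stored)
        return stored

    def cutover(self) -> None:
        if self.state != "Backfilling":
            raise RuntimeError(f"cannot cut over from {self.state}")
        self.state = "CutOver"         # models the single version-bump commit

    def abort(self) -> None:
        if self.state != "Backfilling":
            raise RuntimeError("abort is only allowed before cutover")
        self.state = "Aborted"
```

After cutover(), abort() raises, which mirrors the rule that the only way back is a new inverse-rewrite migration.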

Use online migrations any time the manifest version bumps. Stuck-in-Backfilling troubleshooting is in incident response → RUNBOOK.

Observability

EXPLAIN - prefix any SELECT with EXPLAIN to return the plan tree without running the query. Useful for verifying that your indexes are being used. See SQL reference → EXPLAIN.
Per-tenant /usage - GET /v1/tenants/:t/usage returns row counts, vector counts, in-flight queries, and subscription state. Poll it for monitoring.
OTLP push tracing - opt-in. Configure the collector endpoint via the dashboard; spans are pushed for every /v1 request with tail-based sampling that retains slow + error traces in full.
Dashboard - live metrics tiles for every instance: query latency, replication lag, recent writes, error rate. Visit app.originchain.ai.

Incident response

The full playbook is in docs/RUNBOOK.md. Highlights:

Pager severity: sev 1 (tenant down) → 5 min ack, 1 hr fix-or-mitigate. Sev 2 (one alarm tripped) → 15 min / 4 hr. Sev 3 (drift) → next business day.
Status page: publishes per-region health and incident timelines. Subscribers get email + webhook on any sev 1.
Stuck-in-Backfilling: abort the migration via POST /v1/tenants/:t/migrations/:id/abort, then resubmit.
Order of operations: stop the bleeding, find root cause, write a postmortem, fix the underlying problem. Quiet incidents become loud ones.
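
For the stuck-in-Backfilling case, the abort call can be scripted. A minimal sketch that only builds the request: the path comes from the highlight above, while the base URL, the bearer-token Authorization header, and the function name are assumptions; check your instance's auth scheme before relying on it.

```python
import urllib.request
from urllib.parse import quote


def build_abort_request(base_url: str, tenant: str, migration_id: str,
                        token: str) -> urllib.request.Request:
    """Build POST /v1/tenants/:t/migrations/:id/abort without sending it."""
    url = (f"{base_url.rstrip('/')}/v1/tenants/{quote(tenant, safe='')}"
           f"/migrations/{quote(migration_id, safe='')}/abort")
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )
```

Send it with urllib.request.urlopen(req, timeout=10) and resubmit the migration once the abort is confirmed.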

Incidents today are handled by core engineering during extended business hours with best-effort overnight coverage; the pager-severity SLAs above are the targets we hold ourselves to. 24/7 named-engineer coverage is available on Enterprise - contact sales.

Compliance posture

SOC 2 Type 1: underway with Vanta/Drata and an external CPA. Contact us for the audit timeline; the in-flight gap analysis is available to procurement under NDA.
HIPAA BAA: available on Enterprise. PHI workloads must run in a region the BAA covers and on a dedicated-capacity instance.
GDPR DPA: available on Enterprise. EU-region (Frankfurt) instances support the DPA out of the box; data-subject deletion is documented in RUNBOOK.md.