One database. Five query modes. Same data. Same bearer. Same instance.
OriginChain answers vector search, BM25 full-text, hybrid retrieval, graph traversal, and natural-language questions through a single endpoint. This page is the concept guide - what each algorithm actually does, where it wins, where it loses, and what the tradeoffs look like. If you want syntax and copy-paste examples, the docs cover that; this page is for the engineer deciding which mode to reach for.
Embeddings turn meaning into geometry.
An embedding model takes a piece of text - or an image, or audio - and returns a list of a few hundred to a few thousand floating-point numbers. That list is a coordinate in a high-dimensional space. The model places semantically similar inputs near each other: "running shoes" lands near "athletic footwear", far from "fruit smoothie". Search becomes geometry. Finding the documents most relevant to a query means finding the document vectors closest to the query vector.
At small scale you can brute-force this - compute the distance between the query and every stored vector and sort. That stops scaling somewhere around a few hundred thousand vectors. Past that you need an approximate nearest-neighbour algorithm. OriginChain uses HNSW (Hierarchical Navigable Small World, Malkov & Yashunin 2016) - the same algorithm that backs most modern production vector engines. HNSW builds a layered graph where each node is a vector and each edge connects nearby vectors. Queries enter at the top layer, greedily walk toward the query point, and descend through layers as they get closer. The walk visits a tiny fraction of the corpus and still finds most of the true top-k; how many it finds depends on how wide the search is.
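The brute-force baseline described above can be sketched in a few lines. This is an illustration only, not OriginChain's implementation: the function names and the three-dimensional toy vectors are made up, and cosine similarity is assumed as the closeness measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_topk(query, corpus, k):
    """Score every stored vector against the query and keep the k best.

    corpus maps a document id to its vector. Cost is O(N * dim) per query,
    which is why this only works while N stays small.
    """
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy 3-dimensional "embeddings", invented for illustration.
corpus = {
    "running shoes": [0.9, 0.1, 0.0],
    "athletic footwear": [0.8, 0.2, 0.1],
    "fruit smoothie": [0.0, 0.1, 0.9],
}
print(brute_force_topk([1.0, 0.0, 0.0], corpus, k=2))
```

An approximate index such as HNSW replaces the full scan in `brute_force_topk` with a graph walk; the scoring function stays the same.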
The two knobs that matter at query time are recall (what fraction of the true nearest neighbours you actually return) and latency. They trade off - search wider, get better recall, pay more time. OriginChain exposes that tradeoff as two named modes:
| Mode | Recall | Latency |
| --- | --- | --- |
| fast | 0.69 | 37 ms |
| high_recall | 0.96 | 109 ms |

fast - When a downstream reranker rescues recall, or a tight latency SLO makes 37 ms the headline number.
high_recall - Default. When first-pass retrieval correctness matters - product search, RAG, citation lookup.
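Recall, as defined above, is straightforward to measure yourself if you have exact results to compare against. A minimal sketch, assuming you can obtain the exact top-k (for example from a brute-force scan on a sample); the function name and document ids are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    found = truth.intersection(approx_ids[:k])
    return len(found) / len(truth)

exact = ["d1", "d2", "d3", "d4", "d5"]    # ground truth from an exact scan
approx = ["d1", "d3", "d9", "d2", "d7"]   # what the approximate index returned
print(recall_at_k(approx, exact, k=5))    # 3 of the 5 true neighbours were found
```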
Four distance metrics.
"Closest" needs a definition. OriginChain supports four distance metrics, which between them cover what embedding models are typically trained against. Pick the one your model recommends - the wrong choice silently degrades retrieval quality.
Cosine - The default. Measures angular closeness - direction matters, magnitude does not. Right for embeddings from text models (OpenAI, Cohere, Anthropic), where the model already encodes semantics in the angle of the vector.
Dot product - Faster than cosine when your embeddings are already unit-normalised. Many production models (sentence-transformers, BGE) output unit vectors by default - dot product gives the same ranking as cosine at lower cost.
Euclidean (L2) - Measures absolute distance - magnitude carries signal. Right for embeddings where the model encodes intensity (image embeddings, some multi-modal models).
Manhattan (L1) - Sum of per-dimension absolute differences. More robust to outlier dimensions than L2 - useful when your embedding space has a few features that dominate the geometry and you want every dimension to weigh in equally.
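The four metrics described above (cosine, dot product, Euclidean/L2 and Manhattan/L1) fit in a few lines each. The sketch below uses plain Python rather than any engine API, with invented vectors, and also checks the claim that dot product and cosine rank unit-normalised vectors identically.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle): 0 means same direction, magnitude is ignored.
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    # L2: straight-line distance, so magnitude matters.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # L1: sum of per-dimension absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def normalise(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

query = normalise([1.0, 2.0, 2.0])
docs = {
    "a": normalise([2.0, 4.0, 4.1]),
    "b": normalise([3.0, 0.0, 1.0]),
    "c": normalise([0.0, 1.0, 5.0]),
}

# On unit vectors, ascending cosine distance and descending dot product agree.
by_cosine = sorted(docs, key=lambda d: cosine_distance(query, docs[d]))
by_dot = sorted(docs, key=lambda d: -dot(query, docs[d]))
print(by_cosine, by_dot)
```

The dot-product path skips the two square roots and the division, which is where the saving on pre-normalised embeddings comes from.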
Four index variants.
Dense HNSW is the default. The other three are opt-in when corpus shape demands it - sparse for learned-sparse models, PQ (product quantization) for storage-constrained workloads, adaptive over-fetch for selective metadata filters. All four live behind the same /v1/tenants/:t/vector/:table/topk endpoint - the index choice is a per-table declaration, not a separate API.
Dense HNSW - The default. 32-bit floating-point (f32) embeddings indexed by Hierarchical Navigable Small World, with SIMD kernels for the distance computation. The v3 inline-embeddings build measured 11,700 QPS at recall@10 = 0.94 on SIFT-1M.
Sparse - For learned-sparse models (SPLADE, uniCOIL) that emit high-dimensional vectors where most entries are zero. Stored and queried as (index, weight) pairs - 10-100x smaller on disk than dense, and just as fast on selective queries.
PQ - Optional quantization layer that compresses a 768-dim f32 embedding (3,072 bytes) to ~96 bytes. Recall drops a few points; storage drops ~30x. Right when the corpus is large enough that f32 storage is the constraint.
Adaptive over-fetch - When a metadata filter is selective (`region = 'EU'`), the engine widens the HNSW search width (ef) so that enough candidates survive the filter. You don't tune anything - the planner picks ef based on the filter's observed selectivity.
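The storage figures quoted for product quantization can be checked with simple arithmetic. The PQ layout below (96 sub-vectors of 8 dimensions, one byte per code) is an assumption chosen because it yields ~96 bytes; the text does not state the actual layout, and the 100-million-vector corpus is a made-up example.

```python
DIMS = 768
BYTES_PER_F32 = 4
raw_bytes = DIMS * BYTES_PER_F32           # 3072 bytes per dense f32 vector

# Assumed layout (not from the text): split the vector into 96 sub-vectors
# and replace each one with a 1-byte index into a 256-entry codebook.
SUBVECTORS = 96
BYTES_PER_CODE = 1
dims_per_subvector = DIMS // SUBVECTORS    # 8 dimensions per sub-vector
pq_bytes = SUBVECTORS * BYTES_PER_CODE     # 96 bytes per compressed vector

compression = raw_bytes / pq_bytes         # 32.0, i.e. roughly 30x

def corpus_gib(n_vectors, bytes_per_vector):
    """Raw vector storage in GiB, ignoring index and codebook overhead."""
    return n_vectors * bytes_per_vector / 2**30

n = 100_000_000
print(f"dense: {corpus_gib(n, raw_bytes):.1f} GiB, PQ: {corpus_gib(n, pq_bytes):.1f} GiB")
```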
The ranking formula that beat vector search for fifty years.
BM25 (Best Match 25, Robertson & Sparck Jones, late 1970s; productionised in Okapi by Robertson, Walker, et al. in the 1990s) is the ranking function that powers Lucene, Elasticsearch, OpenSearch, and almost every full-text engine you have ever used. Vector embeddings are newer and feel more sophisticated, but BM25 still wins on exact-phrase queries, on product codes and SKUs, on acronyms, and on long-tail terms that the embedding model has never seen. A production retrieval system needs both.
The formula has three intuitions stacked together:
- Rare terms count more. A query word that appears in 1% of documents is more informative than a word that appears in 90% of them. This is the IDF (inverse document frequency) component.
- Term frequency helps, but with diminishing returns. A document that contains the query word ten times is more relevant than one that contains it once - but not ten times more relevant. BM25 saturates the term-frequency contribution.
- Long documents get a penalty. Otherwise a thousand-word article would always score higher than a hundred-word article on the same topic just because it contains more words.
Put together, the score for a query q against a document d is:
score(q, d) = Σ_{t ∈ q} IDF(t) · [ tf(t,d) · (k1 + 1) ] / [ tf(t,d) + k1 · (1 - b + b · |d| / avgdl) ]
tf(t,d) is the count of term t in document d. |d| is the length of d in tokens. avgdl is the average document length across the corpus. The two tuneable parameters are k1 (term-frequency saturation) and b (length normalisation).
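The formula above translates directly into code. The sketch below is a minimal in-memory scorer, not the engine's implementation. The text does not spell out IDF(t), so the Lucene-style form ln(1 + (N - n + 0.5) / (n + 0.5)) is assumed here, and whitespace tokenisation stands in for a real analyser.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in docs against query_terms."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Assumed IDF variant (Lucene-style, never negative).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Length-normalised saturation term from the formula above.
            norm = k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores

docs = [
    "the quick brown fox".split(),
    "the lazy dog sleeps all day in the sun".split(),
    "fox fox fox".split(),
]
print(bm25_scores(["fox"], docs))
```

The third document scores highest for "fox" (three occurrences in a short document), the first scores lower, and the second scores zero because the term is absent.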
The two knobs.
Defaults are k1 = 1.2 and b = 0.75 - the values Robertson and Zaragoza arrived at after thousands of TREC experiments. They work well for most corpora. The two cases worth tuning:
k1 - Term-frequency saturation. As a term appears more times in a document, its contribution to the score plateaus. k1 controls how quickly it plateaus. Lower k1 → repetition stops helping sooner; useful for short, info-dense documents. Higher k1 → repetition keeps mattering; useful for long-form content where the topic legitimately comes up many times.
b - Length normalisation. Long documents naturally contain more terms; without normalisation they'd always win. b interpolates between no normalisation (b=0) and full normalisation (b=1). The 0.75 default works well for prose. Reduce to ~0.5 for documents of similar length; raise toward 1.0 when length varies wildly.
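To see what the two parameters do, it helps to isolate the term-frequency factor of the BM25 formula and vary one knob at a time. A small sketch; the function name and the sample lengths are illustrative.

```python
def tf_component(tf, k1, b=0.75, doc_len=100, avgdl=100):
    """The bracketed BM25 factor: tf * (k1 + 1) / (tf + k1 * (1 - b + b * |d| / avgdl))."""
    norm = k1 * (1 - b + b * doc_len / avgdl)
    return tf * (k1 + 1) / (tf + norm)

# Saturation: for an average-length document the factor climbs toward k1 + 1
# and never exceeds it, so a lower k1 plateaus sooner.
for k1 in (0.5, 1.2, 2.0):
    row = [round(tf_component(tf, k1), 2) for tf in (1, 2, 5, 10)]
    print(f"k1={k1}: {row} (ceiling {k1 + 1})")

# Length normalisation: the same tf in a document four times the average length.
for b in (0.0, 0.75, 1.0):
    print(f"b={b}: {tf_component(3, 1.2, b=b, doc_len=400):.3f}")
```

With b = 0 the document length has no effect at all; as b rises, the long document is penalised more heavily for the same term count.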
Vector and BM25 in parallel. Then fuse.
Public retrieval benchmarks - BEIR, MTEB, MS MARCO - repeatedly show that combining vector and BM25 tends to beat either alone. The two methods make different mistakes. Vector search catches semantic matches the keyword search misses; BM25 catches exact terms the embedding model never saw. Run them in parallel, fuse the two ranked lists, and you keep what each is good at.
The simplest fusion that works well in production is Reciprocal Rank Fusion (Cormack, Clarke, Büttcher, 2009). It ignores the raw scores from each retriever - which are on incomparable scales - and uses only the rank position:
RRF(d) = Σ_{i ∈ retrievers} 1 / (k + rank_i(d)), with k = 60 (the Cormack default)
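rank_i(d) is the position of document d in the result list from retriever i, starting at 1. Documents that appear in both lists rise to the top; documents that appear in only one list still receive a score from that list. The constant k flattens the rank curve: it shrinks the gap between adjacent top positions, so one retriever's first-place hit does not automatically dominate the fused list.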
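Reciprocal Rank Fusion as defined above needs only the ranked id lists. A minimal sketch; the function name, the document ids and the tie-break by id are illustrative choices, not something the text specifies.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document ids using RRF."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, matching the definition of rank_i(d).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Here d1 and d3 appear in both lists and lead the fused ranking; d9 and d7 appear in one list each and follow in order of their single rank.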
When you want to bias toward one retriever or the other - legal documents typically reward keyword matches more than semantic similarity, conceptual knowledge bases the opposite - a weighted linear combination of normalised scores is the alternative:
α · cos_norm(q, d) + (1 − α) · bm25_norm(q, d).
Tune α per domain. There is no universally right number.
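The weighted combination can be sketched as follows. The text says the scores are normalised but not how, so min-max scaling per retriever is assumed here; the scores and document names are invented to show how α shifts the ranking.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1] (assumed normalisation)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fusion(cos_scores, bm25_scores, alpha):
    """alpha * cos_norm + (1 - alpha) * bm25_norm, highest fused score first."""
    cos_n, bm25_n = min_max(cos_scores), min_max(bm25_scores)
    fused = {}
    for d in set(cos_n) | set(bm25_n):
        # A document missing from one retriever contributes 0 from that side.
        fused[d] = alpha * cos_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
    return sorted(fused, key=lambda d: (-fused[d], d))

cos_scores = {"contract": 0.62, "memo": 0.91, "faq": 0.40}
bm25_scores = {"contract": 14.2, "memo": 3.1, "faq": 6.0}
print(weighted_fusion(cos_scores, bm25_scores, alpha=0.2))  # keyword-leaning
print(weighted_fusion(cos_scores, bm25_scores, alpha=0.8))  # semantics-leaning
```

Unlike RRF, this approach is sensitive to the normalisation step, which is one reason α has to be tuned per domain.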
Relations are edges. Traversal is a query.
A row in OriginChain can have typed relations to other rows - an order belongs to a customer, a paper cites another paper, an employee reports to a manager. Once those relations are declared in the schema, you can walk them as a graph: neighbours of a node, reverse neighbours, breadth-first search to a depth, shortest path between two nodes, weighted shortest path with Dijkstra.
Both directions of every relation are first-class. "Customer of order O" and "orders for customer C" are the same cost. You don't pay for a separate reverse-index schema, and you don't write the reverse edge twice on insert - the engine maintains both sides atomically when the row is written.
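The idea of keeping both directions in step can be illustrated with a toy in-memory structure. This is a conceptual sketch only, with a hypothetical class name; it says nothing about how the engine stores relations or achieves atomicity.

```python
from collections import defaultdict

class RelationStore:
    """Toy relation index that keeps forward and reverse edges in step."""

    def __init__(self):
        self.forward = defaultdict(set)   # source -> targets
        self.reverse = defaultdict(set)   # target -> sources

    def add(self, source, target):
        # One logical write updates both sides; a real engine would do this
        # inside a single atomic commit.
        self.forward[source].add(target)
        self.reverse[target].add(source)

    def neighbours(self, node):
        return sorted(self.forward.get(node, ()))

    def reverse_neighbours(self, node):
        return sorted(self.reverse.get(node, ()))

belongs_to = RelationStore()
belongs_to.add("order-1", "customer-A")
belongs_to.add("order-2", "customer-A")
belongs_to.add("order-3", "customer-B")

print(belongs_to.neighbours("order-1"))             # customer of order-1
print(belongs_to.reverse_neighbours("customer-A"))  # orders for customer-A
```

Both lookups are a single dictionary access, which is the "same cost in either direction" property in miniature.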
The graph operations OriginChain exposes cover the multi-hop queries most applications actually need - traversal, reachability, paths, classic algorithms - without forcing you to bolt on a dedicated graph engine. Every operation runs against the same row store the SQL and vector layers use; there is no separate graph cluster to provision or sync.
Neighbours - One-hop, both directions. The reverse side is maintained automatically when relations are declared bidirectional - no double-write on insert.
BFS and reachability - Breadth-first traversal up to a max depth. The reachability variant runs from both endpoints toward the middle - cuts work from O(d^k) to O(d^(k/2)) on power-law graphs, where d is the average degree and k is the hop distance between the endpoints.
Multi-hop traversal - Single-statement multi-hop traversal with a depth range. Per-hop WHERE predicates filter intermediate rows before they fan out into the next hop.
Shortest path - Unweighted BFS shortest path plus Dijkstra over caller-supplied edge weights. Negative weights are rejected.
All paths - Every node-disjoint route between two endpoints up to a depth cap, with a max_paths hard cap to bound output on cyclic graphs. The bidirectional variant traverses forward AND reverse edges in the same walk - useful when 'is there any connection' matters more than direction.
PageRank - Iterative power-method with damping; dangling-node mass redistributed uniformly. Returns nodes sorted by score, ties broken by primary key in lexicographic order for deterministic output.
Components and triangles - Classic graph-algorithm primitives. Components run on Union-Find; triangle enumeration uses the same forward-prefix scan the traversal helpers do.
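As one worked example from the list above, the power-method ranking with damping and uniform dangling-node redistribution can be sketched like this. The damping factor of 0.85 and the fixed iteration count are assumptions (the text gives neither), and node names stand in for primary keys.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-method PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Mass held by nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out[node])
        nxt = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for node in nodes:
            for dst in out[node]:
                nxt[dst] += damping * rank[node] / len(out[node])
        rank = nxt
    # Highest score first; ties broken lexicographically by node key.
    return sorted(rank.items(), key=lambda item: (-item[1], item[0]))

edges = [("a", "c"), ("b", "c"), ("c", "d")]   # "d" has no outgoing edges
for node, score in pagerank(edges):
    print(node, round(score, 4))
```

Nodes a and b receive identical scores by symmetry, so the lexicographic tie-break is what makes their order stable from run to run.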
Every traversal endpoint accepts ?explain=true and returns per-hop stats - rows visited, edges expanded, time per hop. Useful when a path query gets slow and you need to know whether the cost is in fan-out or in the per-hop predicate.
Ask in English. Get rows back.
POST a question - "top ten customers by revenue this quarter, excluding refunds" - to the /ask endpoint and the engine returns the rows. Underneath, a foundation model reads your schema and your question, drafts a structured query, executes it against your data, and streams the result back. You can ask for the plan along with the answer (?explain=true) and see exactly what SQL was produced - so it stays auditable.
Two things make this useful in production rather than a demo. First, the question is compiled to a structured plan and executed by the engine itself - the LLM is the compiler, not the runtime. That means the rows you get back come from your actual data, not from model-generated text. Second, the plan goes through the same cost model and the same security boundary as any other query, so an /ask call cannot exfiltrate data the bearer wasn't entitled to see.
Bring your own LLM key.
The LLM bill is yours, not ours. Configure your API key for OpenAI, Anthropic, Google Gemini, or Groq once in the dashboard; every /ask call routes through your account. We never see the key in plaintext - it's encrypted at rest under your tenant's KMS key and decrypted only in-memory when a request needs it. You keep procurement-side control over which models you pay for; we keep the planner, the cache, and the security boundary.
One write. Every mode sees it.
When you insert a row that has a vector embedding and a full-text field and a typed relation, all four updates land together. A query running in another connection cannot see a state where the row exists but the vector index hasn't caught up, or the full-text posting is missing, or the reverse edge isn't there yet. This is the property that makes hybrid retrieval honest. Your Stage-7 faithfulness check, your reranker, your audit log all see a corpus that is internally consistent at every instant.
This is also why OriginChain replaces a stack rather than augments one. A Pinecone-plus-Elasticsearch-plus-Postgres deployment cannot guarantee this without a saga, a reconciler, and a worker that detects skew. The cross-system version drift is what produces the "sometimes the LLM hallucinates" reports that postmortems struggle to explain. On a single-engine design it is structurally impossible.
Your plan, enforced at the engine.
OriginChain plans are not honour-system. Throughput limits, concurrent-call limits, and vector-corpus size limits are enforced inside the engine itself - not bolted on at an API gateway you could route around. The plan's caps live in a config the engine reads at boot; when a request would exceed one, the engine returns a structured rejection with a machine-readable body, plus the cap, the current usage, and an upgrade URL.
Three kinds of cap, three kinds of rejection:
- 429 Too Many Requests + Retry-After for rate-style breaches (sustained req/s, concurrent /ask calls, active reactive subscriptions). Retry the same request after the header's window and it lands.
- 402 Payment Required + upgrade URL for state-style breaches (lifetime vector embeddings indexed). Retrying won't help; the plan needs more headroom. The body carries the upgrade link so client SDKs render a billing prompt directly.
- Soft warnings emitted as Prometheus gauges (oc_tier_*) and surfaced on the dashboard at 80% utilisation. Most customers see their headroom shrinking before any hard rejection fires.
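On the client side, the two hard rejections call for different handling. The sketch below shows one way to branch on them. The body field name `upgrade_url` is hypothetical (the text says only that the body carries an upgrade link), and Retry-After is assumed to be a number of seconds rather than an HTTP date.

```python
def plan_action(status, headers, body):
    """Decide what a client should do with a response from the engine."""
    if status == 429:
        # Rate-style breach: wait out the Retry-After window, then retry.
        delay_seconds = float(headers.get("Retry-After", "1"))
        return ("retry", delay_seconds)
    if status == 402:
        # State-style breach: retrying cannot succeed, so surface the link.
        # "upgrade_url" is an assumed field name, not a documented one.
        return ("upgrade", body.get("upgrade_url"))
    return ("ok", None)

print(plan_action(429, {"Retry-After": "2"}, {}))
print(plan_action(402, {}, {"upgrade_url": "https://example.invalid/upgrade"}))
print(plan_action(200, {}, {}))
```

Keeping the decision in a pure function like this makes it easy to unit-test retry behaviour without issuing real requests.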
Live visibility from two angles.
The same numbers render in two formats so the customer's dashboard and the operator's monitoring see the same truth.
JSON response: current plan, the full caps table, live counters (rows stored, vector embeddings indexed, concurrent /ask calls, active subscriptions), plus a per-schema breakdown showing which table is consuming what share of the budget. One call powers the dashboard's usage panel without scraping each shape separately.
Standard Prometheus text exposition. Cap gauges (oc_tier_*_cap) and headroom gauges side-by-side so an alert rule of the shape used / cap > 0.8 fires at the right threshold. Constant cardinality - the cap rows render at every scrape regardless of plan.
The decision table.
The features above all exist on the same engine. When you build, you still have to pick which one answers a given user question. Here is the short version.
| Mode | Where it wins | Where it loses |
| --- | --- | --- |
| Vector search | Conceptual queries - "running shoes for marathons" matches "endurance trainers" because the embedding model captures intent. Cross-language queries. Semantic deduplication. Recommender systems. | Exact identifiers, SKUs, error codes, names of new products that weren't in the embedding model's training data. |
| BM25 full-text | Exact phrase matching. Acronyms and product codes. Long-tail queries with unusual terms. Recall on documents that contain the literal keyword. | Conceptual queries where the user's wording doesn't match the document's wording. |
| Hybrid (vector + BM25) | Production retrieval. Catches both the semantic match and the keyword match, fuses them, and tends to beat either alone on public retrieval benchmarks. | Nothing - this is the right default for any production RAG or search application. |
| Graph traversal | Multi-hop questions - "customers who bought from suppliers I haven't reviewed yet", "shortest path through citations", "who reports to whom". | Single-table queries - those belong in SQL. |
| Natural language | Letting non-technical users ask questions of structured data without learning SQL. Internal dashboards. Customer-support agents. | Cases where the question is ambiguous enough that a typed query would be more honest. |

One database for your whole AI stack.
Vector, BM25, hybrid, graph, and natural language against the same managed database - one bearer, one endpoint, one instance per tenant. A managed instance comes online in about ninety seconds, and the quickstart walks you from signup to your first English query in under ten minutes.