OriginChain docs

5. BM25 ranked retrieval

what this does

Returns the top k documents ranked by BM25 relevance to the query. The response carries hits rather than doc_ids: each entry is { doc_id, score }, and the list is sorted by descending score. A higher score means a more relevant document.

when to use it
  • Search bars - users expect the best match at the top.
  • Recommendations: rank a candidate pool by textual similarity.
  • Any time you want "best k" rather than "all that match".
the request
GET /v1/tenants/:t/fts/:schema/:field?q=...&mode=bm25&k=10
curl -G "https://$OC_HOST/v1/tenants/$OC_TENANT/fts/shop.products/description" \
  -H "Authorization: Bearer $OC_TOKEN" \
  --data-urlencode "q=wireless headphones" \
  --data-urlencode "mode=bm25" \
  --data-urlencode "k=10"
what you get back
{
  "mode": "bm25",
  "hits": [
    { "doc_id": "p001", "score": 9.42 },
    { "doc_id": "p027", "score": 7.18 },
    { "doc_id": "p014", "score": 4.55 }
  ]
}
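A minimal client-side sketch of consuming this response, using the sample body shown above. The helper name ranked_doc_ids is hypothetical, not part of any OriginChain client. It checks the documented ordering instead of re-sorting.

```python
import json

# Sample body copied from the response shown above.
body = """
{
  "mode": "bm25",
  "hits": [
    { "doc_id": "p001", "score": 9.42 },
    { "doc_id": "p027", "score": 7.18 },
    { "doc_id": "p014", "score": 4.55 }
  ]
}
"""

def ranked_doc_ids(response_text):
    """Return doc_ids in the order the server ranked them (best first)."""
    payload = json.loads(response_text)
    if payload.get("mode") != "bm25":
        raise ValueError("expected a bm25 response")
    hits = payload["hits"]
    # Hits are documented as sorted by descending score; verify, don't re-sort.
    scores = [hit["score"] for hit in hits]
    if scores != sorted(scores, reverse=True):
        raise ValueError("hits are not in descending score order")
    return [hit["doc_id"] for hit in hits]

print(ranked_doc_ids(body))  # ['p001', 'p027', 'p014']
```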
how it works
  • The query is tokenised, then for each token the engine fetches the posting list along with term frequencies and document lengths.
  • Each candidate document gets a BM25 score using the Lucene defaults: k1 = 1.2, b = 0.75. A score grows with how often the query's rarer terms appear in the document and shrinks when the document is much longer than average.
  • A top-k heap keeps the highest scorers; everything else is discarded.
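
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's implementation: the tokeniser (lowercase, whitespace split), the in-memory docs dictionary, and the function names are all hypothetical, and the formula is the textbook BM25 form with a Lucene-style idf, which may differ in detail from what the server computes.

```python
import heapq
import math
from collections import Counter

K1, B = 1.2, 0.75  # the defaults quoted above

def tokenise(text):
    # Assumption: lowercase + whitespace split; the real tokeniser is not described here.
    return text.lower().split()

def bm25_top_k(query, docs, k):
    """docs maps doc_id -> text. Returns [(doc_id, score)], best first."""
    tokens = {doc_id: tokenise(text) for doc_id, text in docs.items()}
    term_freqs = {doc_id: Counter(toks) for doc_id, toks in tokens.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in tokens.values()) / n_docs

    scores = {}
    for term in tokenise(query):
        # Stand-in for a posting list: every document containing the term.
        postings = [d for d in docs if term_freqs[d][term] > 0]
        if not postings:
            continue
        df = len(postings)
        # Rare terms (small df) get a larger idf weight.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for d in postings:
            f = term_freqs[d][term]
            # Length normalisation: longer-than-average documents are penalised.
            norm = 1 - B + B * len(tokens[d]) / avg_len
            scores[d] = scores.get(d, 0.0) + idf * f * (K1 + 1) / (f + K1 * norm)

    # Heap-based selection keeps only the k highest scorers.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

docs = {
    "p001": "wireless headphones with wireless charging case",
    "p027": "wired headphones",
    "p014": "usb cable",
}
top = bm25_top_k("wireless headphones", docs, k=2)
print([doc_id for doc_id, _ in top])  # p001 first, then p027; p014 never matches
```

In this toy corpus p001 wins because it contains the rarer term "wireless" twice, even though it is the longest document.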

Optional query params: fuzzy=1 for single-character typo tolerance, highlight=true to return matched-snippet text, facets=col,col for grouped counts alongside hits.
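
As a rough sketch, the optional parameters can be appended to the same query string as q, mode and k. The host, tenant, and facet column names below are placeholders, and bm25_url is a hypothetical helper, not part of any OriginChain client; it only builds the URL and does not send a request.

```python
from urllib.parse import quote, urlencode

def bm25_url(host, tenant, schema, field, q, k=10, **optional):
    """Build the GET URL for a BM25 query; extra keyword args become query params."""
    path = "/v1/tenants/{}/fts/{}/{}".format(
        quote(tenant, safe=""), quote(schema, safe=""), quote(field, safe="")
    )
    params = {"q": q, "mode": "bm25", "k": k}
    params.update(optional)  # e.g. fuzzy=1, highlight="true", facets="col,col"
    return "https://{}{}?{}".format(host, path, urlencode(params))

url = bm25_url(
    "oc.example.com",          # placeholder host
    "acme",                    # placeholder tenant
    "shop.products",
    "description",
    "wireless headphones",
    k=10,
    fuzzy=1,
    highlight="true",
    facets="brand,colour",     # placeholder column names
)
print(url)
```

urlencode percent-encodes the comma in the facets value and turns the space in q into +, both of which are standard query-string encodings.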

common mistakes
  • Comparing scores across queries. A BM25 score of 9.4 means nothing on its own and can't be compared to a 9.4 from a different query, because the scale depends on the query's terms and how rare they are in the corpus. Use scores to rank within one result set only.
  • Forgetting k. If you omit it, you get a default cap. Set k to what you actually need; ranking the entire corpus is wasted work.
  • Reaching for BM25 when boolean would do. If you only need "does it match", boolean is cheaper and the answer is the same set.