5. BM25 ranked retrieval

what this does

Returns the top k documents ranked by BM25 relevance to the query. The response shape changes from doc_ids to hits: each entry is { doc_id, score }, and entries are sorted by descending score. A higher score means a more relevant document.

when to use it

Search bars: users expect the best match at the top.
Recommendations: rank a candidate pool by textual similarity.
Any time you want "best k" rather than "all that match".

the request

GET /v1/tenants/:t/fts/:schema/:field?q=...&mode=bm25&k=10

curl -G "https://$OC_HOST/v1/tenants/$OC_TENANT/fts/shop.products/description" \
  -H "Authorization: Bearer $OC_TOKEN" \
  --data-urlencode "q=wireless headphones" \
  --data-urlencode "mode=bm25" \
  --data-urlencode "k=10"

result = db.fts.search(
    "shop.products",
    "description",
    q="wireless headphones",
    mode="bm25",
    k=10,
)
# result.hits is already sorted by descending score.
for hit in result.hits:
    print(hit["doc_id"], hit["score"])

const result = await db.ftsSearch("shop.products", "description", {
  q: "wireless headphones",
  mode: "bm25",
  k: 10,
});
for (const hit of result.hits) {
  console.log(hit.doc_id, hit.score);
}

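result, err := db.FTSSearch(ctx, "shop.products", "description", originchain.FTSSearchRequest{
    Q:    "wireless headphones",
    Mode: "bm25",
    K:    10,
})
if err != nil {
    log.Fatal(err) // handle the error instead of discarding it
}
// Hits arrive sorted by descending score.
for _, hit := range result.Hits {
    fmt.Println(hit.DocID, hit.Score)
}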

what you get back

{
  "mode": "bm25",
  "hits": [
    { "doc_id": "p001", "score": 9.42 },
    { "doc_id": "p027", "score": 7.18 },
    { "doc_id": "p014", "score": 4.55 }
  ]
}

how it works

The query is tokenised, then for each token the engine fetches the posting list along with term frequencies and document lengths.
Each candidate document gets a BM25 score using the Lucene defaults: k1 = 1.2, b = 0.75. A score grows with how often the query's rarer terms appear in the document and shrinks when the document is much longer than average.
A top-k heap keeps the highest scorers; everything else is discarded.
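
The three steps above can be sketched in a few lines of Python. This is an illustration only, not the engine's implementation: the whitespace tokeniser, the in-memory posting lists, and the function name bm25_top_k are assumptions, and the scoring uses the textbook BM25 formula with a Lucene-style IDF, which may differ in detail from what the server computes.

```python
import heapq
import math
from collections import Counter

K1 = 1.2   # term-frequency saturation (the default quoted above)
B = 0.75   # length normalisation (the default quoted above)


def tokenise(text):
    # Assumption: lowercase + whitespace split; the real analyser is not documented here.
    return text.lower().split()


def bm25_top_k(docs, query, k):
    """Return up to k (doc_id, score) pairs, highest score first."""
    if not docs:
        return []
    tokens = {doc_id: tokenise(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(toks) for toks in tokens.values()) / n_docs

    # Posting lists: term -> {doc_id: term frequency}.
    postings = {}
    for doc_id, toks in tokens.items():
        for term, tf in Counter(toks).items():
            postings.setdefault(term, {})[doc_id] = tf

    scores = {}
    for term in set(tokenise(query)):
        plist = postings.get(term)
        if not plist:
            continue
        df = len(plist)
        # Rare terms (small df) get a larger IDF weight.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in plist.items():
            # Documents longer than average are penalised through norm.
            norm = 1 - B + B * len(tokens[doc_id]) / avg_len
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * tf * (K1 + 1) / (tf + K1 * norm)

    # heapq.nlargest keeps only the k best candidates, sorted by descending score.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

Calling bm25_top_k({"p1": "wireless headphones with wireless charging", "p2": "wired headphones", "p3": "garden hose"}, "wireless headphones", 2) ranks p1 ahead of p2 and drops p3, which matches neither token.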

Optional query params: fuzzy=1 for single-character typo tolerance, highlight=true to return matched-snippet text, facets=col,col for grouped counts alongside hits.
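
As a rough sketch of how these optional parameters combine with the required ones, the helper below builds the request URL with Python's standard library. The function name build_fts_url, its keyword arguments, and the facet column names in the usage note are hypothetical; only the path layout and the parameter names come from this page.

```python
from urllib.parse import quote, urlencode


def build_fts_url(host, tenant, schema, field, q, k=10,
                  fuzzy=False, highlight=False, facets=()):
    """Build a BM25 search URL; optional params are added only when requested."""
    params = {"q": q, "mode": "bm25", "k": k}
    if fuzzy:
        params["fuzzy"] = 1             # single-character typo tolerance
    if highlight:
        params["highlight"] = "true"    # return matched-snippet text
    if facets:
        params["facets"] = ",".join(facets)  # grouped counts alongside hits
    # Percent-encode each path segment so unusual names cannot break the URL.
    path = "/".join(quote(part, safe="") for part in (tenant, "fts", schema, field))
    return f"https://{host}/v1/tenants/{path}?{urlencode(params)}"
```

For example, build_fts_url(host, tenant, "shop.products", "description", "wireless headphones", fuzzy=True, facets=["brand", "color"]) adds fuzzy and facets to the query string and leaves highlight out. The column names brand and color are placeholders.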

common mistakes

Comparing scores across queries. A BM25 score of 9.4 means nothing on its own and can't be compared to the 9.4 from a different query. Use scores to rank within one result set only.
Forgetting k. If you omit it, the engine applies a default cap, which may not be the number you want. Set k explicitly to what you actually need; ranking far more documents than you will use is wasted work.
Reaching for BM25 when boolean would do. If you only need "does it match", boolean is cheaper and the answer is the same set.
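
On the first mistake above: if you need some kind of cutoff, one option is to express each score relative to the top hit of the same result set instead of using an absolute threshold. The sketch below is an illustration under that assumption; the helper relative_scores and the relative field are hypothetical and not part of the API response.

```python
def relative_scores(hits):
    """Attach each hit's score as a fraction of the best score in the same result set."""
    if not hits:
        return []
    top = max(hit["score"] for hit in hits)
    if top <= 0:
        # Degenerate case: nothing to scale against.
        return [{**hit, "relative": 0.0} for hit in hits]
    return [{**hit, "relative": hit["score"] / top} for hit in hits]
```

The resulting fractions are only meaningful inside the result set they were computed from; they do not make scores from different queries comparable.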