Skip to content

Evaluation Overview

Evret evaluates a retriever against a dataset of queries and gold supporting chunks.

Evaluation Flow

  1. Load dataset from JSON or CSV
  2. Choose metric list
  3. Evaluator asks retriever for top results
  4. Metrics are computed per query
  5. Final score is mean across queries
  6. Export results to JSON or CSV

Minimal Example

from evret import EvaluationDataset, Evaluator, HitRate, MRR

retriever = my_retriever

dataset = EvaluationDataset.from_json("examples/eval_data.json")
evaluator = Evaluator(
    retriever=retriever,
    metrics=[HitRate(k=4), MRR(k=4)],
)

results = evaluator.evaluate(dataset)
print(results.summary())

Output Object

EvaluationResults contains:

  • metric_scores: dictionary of metric names and scores
  • query_count: total evaluated queries
  • generated_at: UTC timestamp

Export methods:

  • results.to_json("results.json")
  • results.to_csv("results.csv")

Important Notes

  • Evaluator validates that metric names are unique
  • Evaluator validates dataset has at least one query
  • Evaluator asks retriever using max(k) across selected metrics
  • Use expected_doc_ids when gold relevant document IDs are known
  • Use expected_answers when a judge should match retrieved text against gold text snippets
  • For LLM-generated datasets, prefer direct expected_doc_ids matching; if using generated expected_answers for text matching, use LLMJudge rather than TokenOverlapJudge
  • For top-4 context evaluation, set metrics with k=4
  • Pass judge= to customize relevance matching logic (defaults to TokenOverlapJudge)