Evaluation Overview
Evret evaluates a retriever against a dataset of queries and gold supporting chunks.
Evaluation Flow
- Load dataset from JSON or CSV
- Choose metric list
- Evaluator asks retriever for top results
- Metrics are computed per query
- Final score is mean across queries
- Export results to JSON or CSV
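The "computed per query, then averaged" steps can be sketched in plain Python. This is a minimal illustration using the standard definitions of hit rate and reciprocal rank, not Evret's own implementation; the helper names, the sample data, and the `hit_rate@4` / `mrr@4` score keys are all assumptions made for the example.

```python
from statistics import mean


def hit_rate_at_k(retrieved_ids, gold_ids, k):
    """Return 1.0 if any gold ID appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in gold_ids for doc_id in retrieved_ids[:k]) else 0.0


def reciprocal_rank_at_k(retrieved_ids, gold_ids, k):
    """Return 1 / rank of the first gold ID within the top-k results, else 0.0."""
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical data: ranked retriever output and gold IDs for each query.
queries = [
    {"retrieved": ["d3", "d1", "d7", "d9"], "gold": {"d1"}},
    {"retrieved": ["d2", "d5", "d8", "d4"], "gold": {"d6"}},
    {"retrieved": ["d4", "d0", "d2", "d1"], "gold": {"d4", "d2"}},
]

k = 4

# Step 1: score every query independently.
per_query_hits = [hit_rate_at_k(q["retrieved"], q["gold"], k) for q in queries]
per_query_rr = [reciprocal_rank_at_k(q["retrieved"], q["gold"], k) for q in queries]

# Step 2: the final score for each metric is the mean across queries.
scores = {
    "hit_rate@4": mean(per_query_hits),
    "mrr@4": mean(per_query_rr),
}
print(scores)
```

Keeping the per-query values around before averaging makes it easy to inspect which individual queries failed.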
Minimal Example
```python
from evret import EvaluationDataset, Evaluator, HitRate, MRR

# my_retriever is a placeholder for a retriever object you have already built.
retriever = my_retriever
dataset = EvaluationDataset.from_json("examples/eval_data.json")

evaluator = Evaluator(
    retriever=retriever,
    metrics=[HitRate(k=4), MRR(k=4)],
)
results = evaluator.evaluate(dataset)
print(results.summary())
```
Output Object
EvaluationResults contains:
- `metric_scores`: dictionary of metric names and scores
- `query_count`: total evaluated queries
- `generated_at`: UTC timestamp
Export methods:
- `results.to_json("results.json")`
- `results.to_csv("results.csv")`
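To make the shape of the results concrete, here is a small stand-in class that carries the same three fields and writes them out with the standard library. It is an illustrative sketch only, not Evret's `EvaluationResults`: the class name `ResultsSketch`, the `to_dict` helper, the metric names, and the CSV column layout are all assumptions, and the files Evret actually writes may be structured differently.

```python
import csv
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ResultsSketch:
    """Illustrative stand-in only; not Evret's EvaluationResults class."""

    metric_scores: dict
    query_count: int
    # Timestamp is taken in UTC at construction time.
    generated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def to_dict(self):
        """Return a JSON-serializable view of the three fields."""
        return {
            "metric_scores": self.metric_scores,
            "query_count": self.query_count,
            "generated_at": self.generated_at.isoformat(),
        }

    def to_json(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.to_dict(), f, indent=2)

    def to_csv(self, path):
        # One row per metric; this column layout is an assumption.
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["metric", "score", "query_count", "generated_at"])
            for name, score in self.metric_scores.items():
                writer.writerow(
                    [name, score, self.query_count, self.generated_at.isoformat()]
                )


# Hypothetical scores, used only to exercise the sketch.
results = ResultsSketch(
    metric_scores={"hit_rate@4": 0.75, "mrr@4": 0.5},
    query_count=8,
)
print(results.to_dict())
```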
Important Notes
- Evaluator validates that metric names are unique
- Evaluator validates that the dataset has at least one query
- Evaluator asks retriever using `max(k)` across selected metrics
- Use `expected_doc_ids` when gold relevant document IDs are known
- Use `expected_answers` when a judge should match retrieved text against gold text snippets
- For LLM-generated datasets, prefer direct `expected_doc_ids` matching; if using generated `expected_answers` for text matching, use `LLMJudge` rather than `TokenOverlapJudge`
- For top-4 context evaluation, set metrics with `k=4`
- Pass `judge=` to customize relevance matching logic (defaults to `TokenOverlapJudge`)
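Two of the notes above can be illustrated together: retrieving once at `max(k)` and letting each metric read its own prefix, and deciding relevance by token overlap against gold text snippets. The sketch below is a generic illustration under stated assumptions, not the logic of Evret's `TokenOverlapJudge`: the tokenizer, the 0.6 threshold, the helper names, and the sample chunks are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(retrieved_text, gold_text):
    """Fraction of gold tokens that also appear in the retrieved text."""
    gold = tokenize(gold_text)
    if not gold:
        return 0.0
    return len(gold & tokenize(retrieved_text)) / len(gold)


def is_relevant(retrieved_text, gold_texts, threshold=0.6):
    """Hypothetical rule: relevant if any gold snippet is covered at >= threshold."""
    return any(token_overlap(retrieved_text, g) >= threshold for g in gold_texts)


# Hypothetical metric cutoffs; retrieve once at the largest one.
metric_ks = {"hit_rate@2": 2, "hit_rate@4": 4}
top_k = max(metric_ks.values())

# The slice stands in for a single retriever call with k = top_k.
ranked_chunks = [
    "Paris has many museums.",
    "The Eiffel Tower is in Paris.",
    "The capital of France is Paris.",
    "Berlin hosts a film festival every year.",
    "Rome is in Italy.",
][:top_k]

gold_answers = ["Paris is the capital of France"]
relevance = [is_relevant(chunk, gold_answers) for chunk in ranked_chunks]

# Each metric reads only its own prefix of the shared result list.
hits = {name: float(any(relevance[:k])) for name, k in metric_ks.items()}
print(top_k, relevance, hits)
```

Retrieving once at the largest cutoff avoids calling the retriever separately for every metric, since a top-2 list is simply the first two entries of the top-4 list.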