Evaluation Overview
Evret evaluates a retriever against a dataset of queries and gold supporting chunks.
Evaluation Flow
- Load the dataset from JSON or CSV
- Choose the list of metrics
- The evaluator asks the retriever for the top results of each query
- Each metric is computed per query
- The final score for each metric is the mean across queries (sketched below)
- Export the results to JSON or CSV
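The per-query scoring and mean aggregation can be pictured with a short, purely illustrative sketch. This is not Evret's implementation; hit rate is used as the example metric, and retrieve_ids and the (query_text, gold_ids) pairs are hypothetical stand-ins for the retriever call and the dataset.

# Purely illustrative: how a metric like hit rate @ k could be aggregated.
# `queries` is a list of (query_text, gold_ids) pairs and `retrieve_ids(query, k)`
# is a hypothetical helper standing in for the retriever call.
def hit_rate_at_k(queries, retrieve_ids, k=4):
    per_query = []
    for query_text, gold_ids in queries:
        retrieved = retrieve_ids(query_text, k)        # top-k results for this query
        hit = any(doc_id in gold_ids for doc_id in retrieved)
        per_query.append(1.0 if hit else 0.0)          # metric computed per query
    return sum(per_query) / len(per_query)             # final score = mean across queries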
Minimal Example
from evret import EvaluationDataset, Evaluator, HitRate, MRR

retriever = my_retriever  # any retriever object Evret can query for top results

dataset = EvaluationDataset.from_json("examples/eval_data.json")

evaluator = Evaluator(
    retriever=retriever,
    metrics=[HitRate(k=4), MRR(k=4)],
)

results = evaluator.evaluate(dataset)
print(results.summary())
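Metrics do not have to share the same cutoff. As noted under Important Notes below, the evaluator queries the retriever with the largest k among the selected metrics; each metric is then expected to apply its own cutoff. A small variation of the example above:

# Mixing cutoffs: the retriever is asked for max(k) = 10 results per query;
# HitRate scores against the top 4 and MRR against the top 10.
evaluator = Evaluator(
    retriever=retriever,
    metrics=[HitRate(k=4), MRR(k=10)],
)
results = evaluator.evaluate(dataset)
print(results.summary())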
Output Object
EvaluationResults contains:
- metric_scores: dictionary of metric names and scores
- query_count: total evaluated queries
- generated_at: UTC timestamp
Export methods:
- results.to_json("results.json")
- results.to_csv("results.csv")
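For programmatic post-processing, the documented fields can be read before exporting. A brief sketch, assuming the fields are exposed as plain attributes; the exact keys in metric_scores depend on the metric classes you chose.

results = evaluator.evaluate(dataset)

print(results.metric_scores)   # metric name -> score; key names depend on your metrics
print(results.query_count)     # total evaluated queries
print(results.generated_at)    # UTC timestamp of the run

results.to_json("results.json")
results.to_csv("results.csv")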
Important Notes
- The Evaluator validates that metric names are unique
- The Evaluator validates that the dataset has at least one query
- The Evaluator queries the retriever with max(k) across the selected metrics
- Use expected_answers for judge-based evaluation (text snippets the judge matches against)
- Use relevant_doc_ids for classic IR evaluation (pre-labeled document IDs); a sketch of both labeling styles follows this list
- For top-4 context evaluation, set metrics with k=4
- Pass judge= to customize the relevance matching logic (defaults to TokenOverlapJudge)
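To make the two labeling styles concrete, here is a hypothetical dataset file written out to disk, intended to be loaded with EvaluationDataset.from_json. Only the expected_answers and relevant_doc_ids field names come from this page; every other key, the example queries, and the overall JSON layout are assumptions, so check Evret's dataset schema for the real format.

import json

# Hypothetical dataset content; verify the actual schema before using this.
dataset_content = {
    "queries": [
        {
            "query": "When was the warranty period extended?",
            # judge-based evaluation: snippets the judge matches against
            "expected_answers": ["The warranty was extended to 24 months."],
        },
        {
            "query": "What is the default retry limit?",
            # classic IR evaluation: pre-labeled document IDs
            "relevant_doc_ids": ["doc-17", "doc-42"],
        },
    ]
}

with open("eval_data.json", "w") as f:
    json.dump(dataset_content, f, indent=2)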