# Evaluation API

## evret.evaluation.dataset

Evaluation dataset models and loaders.
### DocumentExample (dataclass)

One document entry in an evaluation dataset.
### EvaluationDataset (dataclass)

Evaluation dataset containing query examples and optional documents.
### QueryExample (dataclass)

One query item in an evaluation dataset.

Supports two evaluation patterns:

1. Classic IR: provide `relevant_doc_ids` (pre-labeled document identifiers).
2. Judge-based: provide `expected_answers` (answer text snippets for the judge to match).

Use `relevant_doc_ids` when you have pre-labeled ground-truth document IDs. Use `expected_answers` when you want a judge to determine relevance by comparing retrieved text against the expected answers.
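The two patterns can be illustrated with a minimal stand-in dataclass. The field names below follow the description above, but the exact definition in `evret` may differ:

```python
from dataclasses import dataclass, field

@dataclass
class QueryExample:
    """Stand-in mirroring the two labeling patterns described above."""
    query: str
    # Classic IR: pre-labeled ground-truth document identifiers.
    relevant_doc_ids: list[str] = field(default_factory=list)
    # Judge-based: answer snippets for a judge to match against retrieved text.
    expected_answers: list[str] = field(default_factory=list)

# Classic IR pattern: ground truth is known up front.
ir_example = QueryExample(
    query="What year was the company founded?",
    relevant_doc_ids=["doc-12", "doc-47"],
)

# Judge-based pattern: relevance is decided at evaluation time.
judged_example = QueryExample(
    query="What year was the company founded?",
    expected_answers=["founded in 1998"],
)
```

A dataset would typically contain a mix of only one pattern per query, since the two labeling schemes answer the relevance question in different ways.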
## evret.evaluation.evaluator

Evaluation orchestrator for retriever metrics.

### Evaluator

Runs a list of metrics over a retriever and dataset. Uses the pluggable Judge system for text-based relevance matching.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `retriever` | `BaseRetriever` | Retriever to evaluate | required |
| `metrics` | `Sequence[Metric]` | List of metrics to compute | required |
| `judge` | `Judge \| None` | Relevance judge (defaults to `TokenOverlapJudge` if `None`) | `None` |
Examples:
>>> from evret import Evaluator, HitRate, Recall
>>> from evret.judges import TokenOverlapJudge, SemanticJudge, LLMJudge
>>>
>>> # Default: TokenOverlapJudge
>>> evaluator = Evaluator(retriever, [HitRate(k=4), Recall(k=4)])
>>>
>>> # Custom judge
>>> evaluator = Evaluator(
... retriever,
... [Recall(k=4)],
... judge=SemanticJudge(threshold=0.8)
... )
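To make the metric semantics concrete, here is a self-contained sketch of how hit-rate@k can be computed over pre-labeled document IDs. This is illustrative arithmetic only, not `evret`'s implementation:

```python
def hit_rate_at_k(retrieved_ids_per_query, relevant_ids_per_query, k=4):
    """Fraction of queries whose top-k retrieved IDs contain a relevant ID."""
    hits = 0
    for retrieved, relevant in zip(retrieved_ids_per_query, relevant_ids_per_query):
        # A query counts as a hit if any relevant ID appears in the top k.
        if set(retrieved[:k]) & set(relevant):
            hits += 1
    return hits / len(retrieved_ids_per_query)

# Two queries: the first hits within the top 4, the second does not.
score = hit_rate_at_k(
    [["d1", "d2", "d3", "d4", "d9"], ["d5", "d6", "d7", "d8"]],
    [["d3"], ["d9"]],
    k=4,
)
# score == 0.5
```

The `Evaluator` performs this kind of loop for each configured metric, delegating the "is this candidate relevant?" decision to the judge when `expected_answers` are used instead of IDs.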
## evret.evaluation.judges

Relevance judge helpers for text-based evaluation matching.
### `default_relevance_judge(query_text, relevant_label, candidate_text)`

Return whether one candidate matches a relevance label.
### `make_token_overlap_judge(*, min_shared_tokens=2, min_overlap_ratio=0.6)`

Build a token-overlap relevance judge with custom thresholds.
### `token_overlap_relevance_judge(query_text, relevant_label, candidate_text, *, min_shared_tokens=2, min_overlap_ratio=0.6)`

Return True when token overlap passes configured thresholds.
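A plausible implementation of such a token-overlap check looks like the following. This is a sketch, not `evret`'s actual code; it assumes whitespace tokenization, case-insensitive matching, and an overlap ratio measured against the label's tokens:

```python
def token_overlap_relevance_judge(query_text, relevant_label, candidate_text,
                                  *, min_shared_tokens=2, min_overlap_ratio=0.6):
    """Judge relevance by token overlap between the label and candidate text.

    Returns True only when both thresholds pass: the number of shared tokens
    and the fraction of label tokens found in the candidate.
    """
    label_tokens = set(relevant_label.lower().split())
    candidate_tokens = set(candidate_text.lower().split())
    if not label_tokens:
        return False
    shared = label_tokens & candidate_tokens
    ratio = len(shared) / len(label_tokens)
    return len(shared) >= min_shared_tokens and ratio >= min_overlap_ratio

# All three label tokens appear in the candidate, so both thresholds pass.
matched = token_overlap_relevance_judge(
    "founding year", "founded in 1998", "the company was founded in 1998")
```

`make_token_overlap_judge` would then just bind `min_shared_tokens` and `min_overlap_ratio` and return a three-argument callable with the same shape as `default_relevance_judge`.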
## evret.evaluation.results

Evaluation result container and exporters.