Evaluation API
evret.evaluation.dataset
Evaluation dataset models and loaders.
DocumentExample
dataclass
One document entry in an evaluation dataset.
EvaluationDataset
dataclass
Evaluation dataset containing query examples and optional documents.
QueryExample
dataclass
One query item in an evaluation dataset.
Provide expected_doc_ids when the gold relevant document IDs are known. When they are not available, provide expected_answers instead: gold answer text snippets that the judge matches against the retrieved content.
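The two labelling modes can be sketched with a stand-in dataclass. This is a minimal illustration, not evret's actual class definition: only the field names expected_doc_ids and expected_answers come from the description above, while `QueryExampleSketch`, `query_id`, `query_text`, the field types, the defaults, and the `labelling_mode` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class QueryExampleSketch:
    """Hypothetical stand-in for QueryExample; not the library's definition."""

    query_id: str  # assumed field name
    query_text: str  # assumed field name
    expected_doc_ids: list[str] = field(default_factory=list)
    expected_answers: list[str] = field(default_factory=list)

    def labelling_mode(self) -> str:
        """Report which kind of gold label this query carries."""
        if self.expected_doc_ids:
            return "doc_ids"
        if self.expected_answers:
            return "answers"
        return "unlabelled"


# Gold document IDs are known: relevance can be checked by ID.
by_id = QueryExampleSketch(
    "q1", "capital of France?", expected_doc_ids=["doc-17"]
)

# No document IDs: supply answer snippets for a judge to match against text.
by_text = QueryExampleSketch(
    "q2", "capital of France?",
    expected_answers=["Paris is the capital of France"],
)
```

Keeping both fields optional in the sketch mirrors the idea that a dataset may mix ID-labelled and answer-labelled queries; whether evret allows that mix is not stated here.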
evret.evaluation.evaluator
Evaluation orchestrator for retriever metrics.
Evaluator
Run a list of metrics over a retriever and dataset.
Uses a pluggable Judge system for text-based relevance matching.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| retriever | BaseRetriever | Retriever to evaluate | required |
| metrics | Sequence[Metric] | List of metrics to compute | required |
| judge | Judge \| None | Relevance judge (defaults to TokenOverlapJudge if None) | None |
Examples:
>>> from evret import Evaluator, HitRate, Recall
>>> from evret.judges import TokenOverlapJudge, SemanticJudge, LLMJudge
>>>
>>> # Default: TokenOverlapJudge
>>> evaluator = Evaluator(retriever, [HitRate(k=4), Recall(k=4)])
>>>
>>> # Custom judge
>>> evaluator = Evaluator(
...     retriever,
...     [Recall(k=4)],
...     judge=SemanticJudge(threshold=0.8)
... )
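For orientation, the conventional definitions of hit rate and recall at a cutoff k can be sketched on plain document IDs. These are the standard textbook formulas, not evret's HitRate or Recall implementation, which may differ (for example in how it handles queries without gold labels or how it involves the judge). The helper names `hit_at_k` and `recall_at_k` are hypothetical.

```python
def hit_at_k(retrieved_ids, relevant_ids, k):
    """Return 1.0 if any of the top-k retrieved IDs is relevant, else 0.0."""
    return 1.0 if set(retrieved_ids[:k]) & set(relevant_ids) else 0.0


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Return the fraction of relevant IDs that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        # Assumption: a query with no gold labels scores 0.0.
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


retrieved = ["d3", "d7", "d1", "d9", "d2"]  # ranked retriever output
relevant = ["d1", "d2"]                     # gold document IDs

hit = hit_at_k(retrieved, relevant, k=4)     # d1 is in the top 4
rec = recall_at_k(retrieved, relevant, k=4)  # 1 of 2 relevant docs found
```

An evaluator would typically average such per-query scores over the whole dataset to report one number per metric.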
evret.evaluation.judges
Relevance judge helpers for text-based evaluation matching.
default_relevance_judge(query_text, relevant_label, candidate_text)
Return whether one candidate matches a relevance label.
make_token_overlap_judge(*, min_shared_tokens=15, min_overlap_ratio=0.6)
Build a token-overlap relevance judge with custom thresholds.
token_overlap_relevance_judge(query_text, relevant_label, candidate_text, *, min_shared_tokens=15, min_overlap_ratio=0.6)
Return True when token overlap passes configured thresholds.
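A token-overlap check with two thresholds, plus a factory that binds them, might look like the sketch below. The parameter names and defaults are taken from the signatures above; everything else is an assumption, not evret's code: the lowercase `\w+` tokenization, computing the ratio against the label's tokens, requiring both thresholds to pass, and the helper names `tokenize`, `token_overlap_sketch`, and `make_sketch_judge`.

```python
import re


def tokenize(text):
    """Assumed tokenization: lowercase word characters, de-duplicated."""
    return set(re.findall(r"\w+", text.lower()))


def token_overlap_sketch(relevant_label, candidate_text, *,
                         min_shared_tokens=15, min_overlap_ratio=0.6):
    """Return True when shared tokens meet both thresholds (assumed rule)."""
    label_tokens = tokenize(relevant_label)
    if not label_tokens:
        return False
    shared = label_tokens & tokenize(candidate_text)
    # Assumption: the ratio is measured against the label's token count.
    ratio = len(shared) / len(label_tokens)
    return len(shared) >= min_shared_tokens and ratio >= min_overlap_ratio


def make_sketch_judge(*, min_shared_tokens=15, min_overlap_ratio=0.6):
    """Bind thresholds into a judge with the three-argument call shape."""
    def judge(query_text, relevant_label, candidate_text):
        # query_text is accepted for signature parity but unused here.
        return token_overlap_sketch(
            relevant_label,
            candidate_text,
            min_shared_tokens=min_shared_tokens,
            min_overlap_ratio=min_overlap_ratio,
        )
    return judge


# Looser thresholds suit short answer snippets.
lenient = make_sketch_judge(min_shared_tokens=2, min_overlap_ratio=0.5)
label = "Paris is the capital of France"
match = lenient("capital of France?", label, "The capital of France is Paris.")
miss = lenient("capital of France?", label, "Berlin is in Germany.")
```

Under these assumptions, the default of 15 shared tokens would reject any label shorter than 15 distinct tokens, which is why the usage lowers both thresholds; the real defaults may interact differently.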
evret.evaluation.results
Evaluation result container and exporters.