Evaluation API

evret.evaluation.dataset

Evaluation dataset models and loaders.

DocumentExample dataclass

One document entry in an evaluation dataset.

EvaluationDataset dataclass

Evaluation dataset containing query examples and optional documents.

QueryExample dataclass

One query item in an evaluation dataset.

Supports two evaluation patterns:

1. Classic IR: provide relevant_doc_ids (pre-labeled document identifiers).
2. Judge-based: provide expected_answers (answer text snippets for a judge to match).

Use relevant_doc_ids when you have pre-labeled ground truth document IDs. Use expected_answers when you want a judge to determine relevance by comparing against expected answer text.
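
The sketch below shows both patterns end to end. Apart from relevant_doc_ids and expected_answers, the field names used here (query, doc_id, text, queries, documents) are assumptions about the dataclass constructors, not documented signatures.

>>> from evret.evaluation.dataset import DocumentExample, EvaluationDataset, QueryExample
>>>
>>> # Field names other than relevant_doc_ids and expected_answers are assumed.
>>> docs = [
...     DocumentExample(doc_id="d1", text="Paris is the capital of France."),
...     DocumentExample(doc_id="d2", text="Madrid is the capital of Spain."),
... ]
>>>
>>> # Pattern 1: classic IR with pre-labeled document IDs
>>> q_ir = QueryExample(
...     query="What is the capital of France?",
...     relevant_doc_ids=["d1"],
... )
>>>
>>> # Pattern 2: judge-based matching against expected answer text
>>> q_judge = QueryExample(
...     query="What is the capital of France?",
...     expected_answers=["Paris"],
... )
>>>
>>> dataset = EvaluationDataset(queries=[q_ir, q_judge], documents=docs)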

evret.evaluation.evaluator

Evaluation orchestrator for retriever metrics.

Evaluator

Run a list of metrics over a retriever and dataset.

Uses a pluggable Judge system for text-based relevance matching.

Parameters:

    retriever (BaseRetriever): Retriever to evaluate. Required.
    metrics (Sequence[Metric]): List of metrics to compute. Required.
    judge (Judge | None): Relevance judge; defaults to TokenOverlapJudge if None. Default: None.

Examples:

>>> from evret import Evaluator, HitRate, Recall
>>> from evret.judges import TokenOverlapJudge, SemanticJudge, LLMJudge
>>>
>>> # Default: TokenOverlapJudge
>>> evaluator = Evaluator(retriever, [HitRate(k=4), Recall(k=4)])
>>>
>>> # Custom judge
>>> evaluator = Evaluator(
...     retriever,
...     [Recall(k=4)],
...     judge=SemanticJudge(threshold=0.8)
... )
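
Once constructed, the evaluator is run against a dataset. The lines below are a sketch only: the evaluate(dataset) method name and its return of an EvaluationResults object are assumptions, since only the constructor parameters are documented here.

>>> # Hypothetical run: the evaluate(dataset) method name is assumed, as is
>>> # its return of an EvaluationResults object (see evret.evaluation.results).
>>> results = evaluator.evaluate(dataset)
>>> scores = results.summary()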

evret.evaluation.judges

Relevance judge helpers for text-based evaluation matching.

default_relevance_judge(query_text, relevant_label, candidate_text)

Return whether one candidate matches a relevance label.

make_token_overlap_judge(*, min_shared_tokens=2, min_overlap_ratio=0.6)

Build a token-overlap relevance judge with custom thresholds.

token_overlap_relevance_judge(query_text, relevant_label, candidate_text, *, min_shared_tokens=2, min_overlap_ratio=0.6)

Return True when token overlap passes configured thresholds.
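
A minimal sketch of calling these helpers directly, using the signatures listed above. The import path mirrors the module name, and the judge returned by make_token_overlap_judge is assumed to take the same (query_text, relevant_label, candidate_text) arguments; the example strings are illustrative.

>>> from evret.evaluation.judges import (
...     make_token_overlap_judge,
...     token_overlap_relevance_judge,
... )
>>>
>>> # One-off check with the default thresholds (min_shared_tokens=2, min_overlap_ratio=0.6)
>>> matched = token_overlap_relevance_judge(
...     "capital of France",
...     "Paris is the capital of France",
...     "The capital of France is Paris",
... )
>>>
>>> # Build a stricter, reusable judge with custom thresholds
>>> strict_judge = make_token_overlap_judge(min_shared_tokens=3, min_overlap_ratio=0.8)
>>> matched = strict_judge(
...     "capital of France",
...     "Paris is the capital of France",
...     "The capital of France is Paris",
... )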

evret.evaluation.results

Evaluation result container and exporters.

EvaluationResults dataclass

Aggregated metric results for an evaluation run.

summary()

Return metric summary map.

to_csv(path)

Write metric rows as CSV (metric, score).

to_dict()

Return serializable representation of this result object.

to_json(path)

Write results as JSON.
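
The exporters above can be chained after an evaluation run, as in the sketch below. Only summary(), to_dict(), to_csv(path), and to_json(path) are documented here; how the EvaluationResults instance is obtained (via an assumed evaluator.evaluate call) is an assumption.

>>> # results is an EvaluationResults instance (obtained via an assumed evaluate() call)
>>> results.summary()                 # metric-name -> score mapping
>>> results.to_dict()                 # serializable representation
>>> results.to_csv("metrics.csv")     # rows of (metric, score)
>>> results.to_json("metrics.json")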