Package API
evret
Evret package.
BaseRetriever
ChromaRetriever
DocumentExample dataclass
One document entry in an evaluation dataset.
EvaluationDataset dataclass
Evaluation dataset containing query examples and optional documents.
EvaluationResults dataclass
Evaluator
Run a list of metrics over a retriever and dataset.
Uses a pluggable Judge system for text-based relevance matching.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| retriever | BaseRetriever | Retriever to evaluate | required |
| metrics | Sequence[Metric] | List of metrics to compute | required |
| judge | Judge \| None | Relevance judge (defaults to TokenOverlapJudge if None) | None |
Examples:
>>> from evret import Evaluator, HitRate, Recall
>>> from evret.judges import TokenOverlapJudge, SemanticJudge, LLMJudge
>>>
>>> # Default: TokenOverlapJudge
>>> evaluator = Evaluator(retriever, [HitRate(k=4), Recall(k=4)])
>>>
>>> # Custom judge
>>> evaluator = Evaluator(
... retriever,
... [Recall(k=4)],
... judge=SemanticJudge(threshold=0.8)
... )
EvretError
Bases: Exception
Base exception for Evret.
EvretValidationError
Bases: ValueError, EvretError
Raised when user input or data format is invalid.
HitRate
Bases: Metric
Binary top-k relevance presence metric.
Formula:
HitRate@k = (1 / |Q|) * sum(1[relevant_i ∩ retrieved_i[:k] != ∅])
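For intuition, a minimal standalone sketch of this computation over plain document-ID sets (not the package's internal implementation):

```python
def hit_rate_at_k(queries: list[tuple[set[str], list[str]]], k: int) -> float:
    """queries: (relevant_ids, retrieved_ids_in_rank_order) pairs, one per query."""
    # A query counts as a hit if any relevant ID appears in its top-k results.
    hits = sum(1 for relevant, retrieved in queries if relevant & set(retrieved[:k]))
    return hits / len(queries)
```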
Judge
Bases: ABC
Base interface for relevance judges.
All judges implement a simple contract:
- judge(context) → bool (is relevant?)
- batch_judge(contexts) → list[bool] (batch evaluation)
Subclasses should override judge() and optionally batch_judge() for optimized batch processing.
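As an illustration, a custom judge could look like the following sketch (the import path is assumed from this page's listing and may differ in practice):

```python
from evret import Judge, JudgmentContext  # import path assumed from this API listing


class ExactMatchJudge(Judge):
    """Hypothetical judge: relevant only on a case-insensitive exact text match."""

    @property
    def name(self) -> str:
        return "exact_match"

    def judge(self, context: JudgmentContext) -> bool:
        # Compare the retrieved candidate against the expected ground-truth text.
        return (
            context.retrieved_text.strip().lower()
            == context.expected_text.strip().lower()
        )
```

batch_judge() is inherited from the base class and falls back to calling judge() once per context.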
name abstractmethod property
Judge display name for logging/debugging.
batch_judge(contexts)
Batch evaluation of multiple contexts.
Default implementation calls judge() for each context sequentially. Override this method for optimized batch processing (e.g., vectorized operations, async API calls, etc.).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| contexts | list[JudgmentContext] | List of judgment contexts | required |

Returns:

| Type | Description |
|---|---|
| list[bool] | List of boolean judgments (same order as input) |
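Inside the base class, the default behavior described above is roughly equivalent to this sketch:

```python
def batch_judge(self, contexts: list[JudgmentContext]) -> list[bool]:
    # Sequential fallback; override to vectorize, batch embeddings, or issue async API calls.
    return [self.judge(ctx) for ctx in contexts]
```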
judge(context) abstractmethod
Return True if retrieved_text is relevant to expected_text given query.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | JudgmentContext | Judgment context with query and texts | required |

Returns:

| Type | Description |
|---|---|
| bool | True if retrieved text is relevant, False otherwise |
JudgmentContext dataclass
Context passed to judge for relevance decision.
Attributes:

| Name | Type | Description |
|---|---|---|
| query | str | User query text |
| expected_text | str | Expected/ground-truth relevant text |
| retrieved_text | str | Retrieved candidate text to judge |
LangChainRetrieverAdapter
Bases: LangChainBaseRetriever
Bridge Evret and LangChain retrievers.
LlamaIndexRetrieverAdapter
Bases: BaseRetriever
Wrap an Evret retriever as a LlamaIndex-compatible retriever.
MRR
Bases: Metric
Mean Reciprocal Rank at top-k, averaged over queries.
Formula:
RR@k = 1 / rank_first_relevant if a hit exists in top-k, else 0
MRR@k = (1 / |Q|) * sum(RR@k_i)
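A minimal sketch of the per-query reciprocal rank, again over plain document IDs:

```python
def reciprocal_rank_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank  # rank of the first relevant hit
    return 0.0  # no relevant document in the top-k
```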
MilvusRetriever
OptionalDependencyError
Bases: ImportError, EvretError
Raised when an optional dependency required by a feature is missing.
Precision
Bases: Metric
Purity metric over retrieved top-k documents.
Formula:
Precision@k = |relevant ∩ retrieved[:k]| / k
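Equivalent sketch over plain document IDs (not the package's implementation):

```python
def precision_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    # Fraction of the k retrieved documents that are relevant.
    return len(relevant & set(retrieved[:k])) / k
```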
QdrantRetriever
QueryExample dataclass
One query item in an evaluation dataset.
Supports two evaluation patterns:
1. Classic IR: provide relevant_doc_ids (pre-labeled document identifiers)
2. Judge-based: provide expected_answers (answer text snippets for the judge to match)
Use relevant_doc_ids when you have pre-labeled ground truth document IDs. Use expected_answers when you want a judge to determine relevance by comparing against expected answer text.
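A sketch of both patterns; only relevant_doc_ids and expected_answers are documented here, so the query field name and constructor signature are assumptions:

```python
from evret import QueryExample  # import path assumed from this API listing

# Pattern 1: classic IR with pre-labeled ground-truth document IDs.
labeled = QueryExample(
    query="What is the capital of France?",  # field name assumed
    relevant_doc_ids=["doc_12", "doc_47"],
)

# Pattern 2: judge-based relevance against expected answer text.
judged = QueryExample(
    query="What is the capital of France?",
    expected_answers=["Paris is the capital of France."],
)
```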
Recall
Bases: Metric
Coverage metric over relevant documents.
Formula:
Recall@k = |relevant ∩ retrieved[:k]| / |relevant|
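Equivalent sketch over plain document IDs (not the package's implementation):

```python
def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    # Fraction of the relevant documents that appear in the top-k.
    return len(relevant & set(retrieved[:k])) / len(relevant)
```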
RetrievalResult dataclass
Standard retrieval output for all retriever backends.
TokenOverlapJudge
Bases: Judge
Fast keyword/token-based relevance matching.
Suitable for exact/fuzzy text matching without semantic understanding. Uses token overlap with configurable thresholds to determine relevance.
Algorithm
- Try exact match
- Try substring containment
- Check token overlap with minimum token and ratio thresholds
- Optionally boost with query token overlap
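A rough sketch of how these steps can combine; the tokenization and the way query_boost relaxes the ratio threshold are assumptions, not the package's actual implementation:

```python
def token_overlap_relevant(
    expected: str,
    retrieved: str,
    query: str = "",
    min_tokens: int = 2,
    overlap_ratio: float = 0.6,
    query_boost: bool = True,
) -> bool:
    e, r = expected.lower().strip(), retrieved.lower().strip()
    if e == r or e in r or r in e:
        return True  # exact match or substring containment
    e_tok, r_tok = set(e.split()), set(r.split())
    if not e_tok:
        return False
    shared = e_tok & r_tok
    threshold = overlap_ratio
    if query_boost and shared & set(query.lower().split()):
        threshold *= 0.5  # query-token overlap relaxes the ratio (factor assumed)
    # Token overlap with minimum-token and ratio thresholds.
    return len(shared) >= min_tokens and len(shared) / len(e_tok) >= threshold
```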
Examples:
>>> judge = TokenOverlapJudge() # Default settings
>>> judge = TokenOverlapJudge(min_tokens=3, overlap_ratio=0.7)
>>> judge = TokenOverlapJudge(min_tokens=2, overlap_ratio=0.6, query_boost=False)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| min_tokens | int | Minimum shared tokens required (default: 2) | 2 |
| overlap_ratio | float | Minimum overlap ratio 0-1 (default: 0.6) | 0.6 |
| query_boost | bool | Allow query tokens to relax threshold (default: True) | True |
name property
Judge display name.
__init__(min_tokens=2, overlap_ratio=0.6, query_boost=True)
Initialize token overlap judge with configurable thresholds.
judge(context)
Judge using token overlap algorithm.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| context | JudgmentContext | Judgment context with query and texts | required |

Returns:

| Type | Description |
|---|---|
| bool | True if retrieved text matches expected text |