Package API¶
evret
¶
Evret package.
BaseRetriever
¶
ChromaRetriever
¶
ChunkingConfig
dataclass
¶
Settings for structure-aware text chunking.
DatasetGenerator
¶
Generate diverse retrieval-evaluation examples from documents.
DocumentExample
dataclass
¶
One document entry in an evaluation dataset.
ERR
¶
Bases: Metric
Expected Reciprocal Rank with cascade model for graded relevance.
Formula:
ERR@k = Σ(i=1 to k) [ (1/i) × R(i) × Π(j=1 to i-1)(1 - R(j)) ]
where R(i) = (2^grade - 1) / 2^max_grade
name
property
¶
Metric display name including cutoff and max_grade.
__init__(k, max_grade=4)
¶
Initialize ERR metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
k
|
int
|
Rank cutoff position. |
required |
max_grade
|
int
|
Maximum relevance grade (default: 4). Grades should be in range [0, max_grade]. |
4
|
score_query(retrieved_doc_ids, expected_answers)
¶
Score a single query using ERR.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
retrieved_doc_ids
|
Sequence[str]
|
Ordered list of retrieved document IDs. |
required |
expected_answers
|
Collection[str] | dict[str, int]
|
Either a set/list of expected answer IDs (binary relevance) or a dict mapping doc_id → relevance grade (0 to max_grade). |
required |
Returns:
| Type | Description |
|---|---|
float
|
ERR score in range [0, 1]. |
ElasticsearchRetriever
¶
Bases: BaseRetriever
Retrieve documents from Elasticsearch kNN search with a unified Evret interface.
EvaluationDataset
dataclass
¶
Evaluation dataset containing query examples and optional documents.
EvaluationResults
dataclass
¶
Evaluator
¶
Run a list of metrics over a retriever and dataset.
Uses pluggable Judge system for text-based relevance matching.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
retriever
|
BaseRetriever
|
Retriever to evaluate |
required |
metrics
|
Sequence[Metric]
|
List of metrics to compute |
required |
judge
|
Judge | None
|
Relevance judge (defaults to TokenOverlapJudge if None) |
None
|
Examples:
>>> from evret import Evaluator, HitRate, Recall
>>> from evret.judges import TokenOverlapJudge, SemanticJudge, LLMJudge
>>>
>>> # Default: TokenOverlapJudge
>>> evaluator = Evaluator(retriever, [HitRate(k=4), Recall(k=4)])
>>>
>>> # Custom judge
>>> evaluator = Evaluator(
... retriever,
... [Recall(k=4)],
... judge=SemanticJudge(threshold=0.8)
... )
EvretError
¶
Bases: Exception
Base exception for Evret.
EvretValidationError
¶
Bases: ValueError, EvretError
Raised when user input or data format is invalid.
GeneratedChunk
dataclass
¶
Chunk emitted by the dataset-generation chunker.
to_document_example()
¶
Convert to Evret's evaluation document shape.
GeneratedDataset
dataclass
¶
GeneratedExample
dataclass
¶
HaystackRetrieverAdapter
¶
Bridge Evret retrievers and Haystack 2.x retriever components.
HitRate
¶
Bases: Metric
Binary top-k relevance presence metric.
Formula:
HitRate@k = (1 / |Q|) * sum(1[relevant_i ∩ retrieved_i[:k] != ∅])
Judge
¶
Bases: ABC
Base interface for relevance judges.
All judges implement a simple contract: - judge(context) → bool (is relevant?) - batch_judge(contexts) → list[bool] (batch evaluation)
Subclasses should override judge() and optionally batch_judge() for optimized batch processing.
name
abstractmethod
property
¶
Judge display name for logging/debugging.
batch_judge(contexts)
¶
Batch evaluation of multiple contexts.
Default implementation calls judge() for each context sequentially. Override this method for optimized batch processing (e.g., vectorized operations, async API calls, etc.).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
contexts
|
list[JudgmentContext]
|
List of judgment contexts |
required |
Returns:
| Type | Description |
|---|---|
list[bool]
|
List of boolean judgments (same order as input) |
judge(context)
abstractmethod
¶
Return True if retrieved_text is relevant to expected_text given query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
context
|
JudgmentContext
|
Judgment context with query and texts |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if retrieved text is relevant, False otherwise |
JudgmentContext
dataclass
¶
Context passed to judge for relevance decision.
Attributes:
| Name | Type | Description |
|---|---|---|
query |
str
|
User query text |
expected_text |
str
|
Expected/ground-truth relevant text |
retrieved_text |
str
|
Retrieved candidate text to judge |
LangChainRetrieverAdapter
¶
Bases: LangChainBaseRetriever
Bridge Evret and LangChain retrievers.
LlamaIndexRetrieverAdapter
¶
Bases: BaseRetriever
Wrap an Evret retriever as a LlamaIndex-compatible retriever.
MRR
¶
Bases: Metric
Mean Reciprocal Rank query metric at top-k.
Formula:
RR@k = 1 / rank_first_relevant if a hit exists in top-k, else 0.
MilvusRetriever
¶
OptionalDependencyError
¶
Bases: ImportError, EvretError
Raised when an optional dependency required by a feature is missing.
Precision
¶
Bases: Metric
Purity metric over retrieved top-k documents.
Formula:
Precision@k = |relevant ∩ retrieved[:k]| / k
QdrantRetriever
¶
QueryExample
dataclass
¶
One query item in an evaluation dataset.
Provide expected_doc_ids when gold relevant document IDs are known. Provide expected_answers as gold answer text snippets for the judge to match against retrieved content when document IDs are not available.
RBP
¶
Bases: Metric
Rank-Biased Precision with geometric persistence weighting.
Formula:
RBP(p) = (1 - p) × Σ(i=1 to k) [ p^(i-1) × rel(i) ]
expected_search_depth
property
¶
Expected number of positions a user examines.
Expected depth = 1 / (1 - p)
name
property
¶
Metric display name including cutoff and persistence.
__init__(k, p=0.8)
¶
Initialize RBP metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
k
|
int
|
Rank cutoff position. |
required |
p
|
float
|
Persistence parameter (0 < p < 1). Default is 0.8. Higher p = more patient user, examines deeper. Lower p = impatient user, focuses on top ranks. |
0.8
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If p is not in the valid range (0, 1). |
compute_residual(num_retrieved)
¶
Compute residual for incomplete rankings.
The residual represents the upper bound contribution from unseen ranks (k+1, k+2, ...) if all were relevant.
Residual = p^k
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
num_retrieved
|
int
|
Number of documents actually retrieved. |
required |
Returns:
| Type | Description |
|---|---|
float
|
Residual value (upper bound on unseen contribution). |
score_query(retrieved_doc_ids, expected_answers)
¶
Score a single query using RBP.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
retrieved_doc_ids
|
Sequence[str]
|
Ordered list of retrieved document IDs. |
required |
expected_answers
|
Collection[str] | dict[str, int]
|
Either a set/list of expected answer IDs (binary relevance) or a dict mapping doc_id → relevance grade. For graded relevance, grades are normalized to [0, 1]. |
required |
Returns:
| Type | Description |
|---|---|
float
|
RBP score in range [0, 1]. |
Recall
¶
Bases: Metric
Coverage metric over expected answeruments.
Formula:
Recall@k = |relevant ∩ retrieved[:k]| / |relevant|
RetrievalResult
dataclass
¶
Standard retrieval output for all retriever backends.
SourceDocument
dataclass
¶
Input document for dataset generation.
TokenOverlapJudge
¶
Bases: Judge
Fast keyword/token-based relevance matching.
Suitable for exact/fuzzy text matching without semantic understanding. Uses weighted token overlap with configurable thresholds to determine relevance.
Algorithm
- Try exact match
- Try substring containment
- Remove common stopwords
- Score weighted token overlap
- Add small phrase and query overlap bonuses
- Penalize negation mismatches
Examples:
>>> judge = TokenOverlapJudge() # Default settings
>>> judge = TokenOverlapJudge(min_tokens=50, overlap_ratio=0.7)
>>> judge = TokenOverlapJudge(min_tokens=15, overlap_ratio=0.6, query_boost=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
min_tokens
|
int
|
Minimum shared tokens required (default: 15, suitable for RAG chunks) |
15
|
overlap_ratio
|
float
|
Minimum overlap ratio 0-1 (default: 0.6) |
0.6
|
query_boost
|
bool
|
Allow query tokens to relax threshold (default: True) |
True
|
stopwords
|
Iterable[str] | None
|
Tokens ignored during overlap scoring |
None
|
name
property
¶
Judge display name.
__init__(min_tokens=15, overlap_ratio=0.6, query_boost=True, stopwords=None)
¶
Initialize token overlap judge with configurable thresholds.
judge(context)
¶
Judge using token overlap algorithm.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
context
|
JudgmentContext
|
Judgment context with query and texts |
required |
Returns:
| Type | Description |
|---|---|
bool
|
True if retrieved text matches expected text |
score(context)
¶
Return a relevance score in the range [0, 1].
WeaviateRetriever
¶
build_generation_prompt(chunk, *, num_examples=6)
¶
Build the single diverse generation prompt for one chunk.
chunk_documents(documents, *, config=None)
¶
Split documents into structure-aware chunks.
configure_logging(*, level='INFO', structured=False, force=False)
¶
Configure Evret logger handlers and formatter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
level
|
str
|
Log level string (e.g. DEBUG, INFO, WARNING, ERROR). |
'INFO'
|
structured
|
bool
|
When True, emit JSON logs. |
False
|
force
|
bool
|
When True, replace existing handlers on the evret root logger. |
False
|
get_logger(name)
¶
Return a package-scoped logger.
If name does not start with evret, it is namespaced under evret.