Skip to content

Package API

evret

Evret package.

AveragePrecision

Bases: Metric

Average Precision at top-k cutoff.

BaseRetriever

Bases: ABC

Abstract retriever interface used by evaluation pipelines.

batch_retrieve(queries, k)

Retrieve for each query using the same cutoff k.

retrieve(query, k) abstractmethod

Return top-k results for a single query.

ChromaRetriever

Bases: BaseRetriever

Retrieve documents from ChromaDB with a unified Evret interface.

ChunkingConfig dataclass

Settings for structure-aware text chunking.

DatasetGenerator

Generate diverse retrieval-evaluation examples from documents.

from_provider(provider, *, model=None, api_key=None, temperature=0.2, max_retries=3, chunking_config=None, examples_per_chunk=6, show_progress=True) classmethod

Create a generator using Evret's configured LLM providers.

generate(documents)

Chunk documents and generate a rich evaluation dataset.

DocumentExample dataclass

One document entry in an evaluation dataset.

ERR

Bases: Metric

Expected Reciprocal Rank with cascade model for graded relevance.

Formula: ERR@k = Σ(i=1 to k) [ (1/i) × R(i) × Π(j=1 to i-1)(1 - R(j)) ] where R(i) = (2^grade - 1) / 2^max_grade

name property

Metric display name including cutoff and max_grade.

__init__(k, max_grade=4)

Initialize ERR metric.

Parameters:

Name Type Description Default
k int

Rank cutoff position.

required
max_grade int

Maximum relevance grade (default: 4). Grades should be in range [0, max_grade].

4

score_query(retrieved_doc_ids, expected_answers)

Score a single query using ERR.

Parameters:

Name Type Description Default
retrieved_doc_ids Sequence[str]

Ordered list of retrieved document IDs.

required
expected_answers Collection[str] | dict[str, int]

Either a set/list of expected answer IDs (binary relevance) or a dict mapping doc_id → relevance grade (0 to max_grade).

required

Returns:

Type Description
float

ERR score in range [0, 1].

ElasticsearchRetriever

Bases: BaseRetriever

Retrieve documents from Elasticsearch kNN search with a unified Evret interface.

EvaluationDataset dataclass

Evaluation dataset containing query examples and optional documents.

EvaluationResults dataclass

Aggregated metric results for an evaluation run.

summary()

Return metric summary map.

to_csv(path)

Write metric rows as CSV (metric, score).

to_dict()

Return serializable representation of this result object.

to_json(path)

Write results as JSON.

Evaluator

Run a list of metrics over a retriever and dataset.

Uses pluggable Judge system for text-based relevance matching.

Parameters:

Name Type Description Default
retriever BaseRetriever

Retriever to evaluate

required
metrics Sequence[Metric]

List of metrics to compute

required
judge Judge | None

Relevance judge (defaults to TokenOverlapJudge if None)

None

Examples:

>>> from evret import Evaluator, HitRate, Recall
>>> from evret.judges import TokenOverlapJudge, SemanticJudge, LLMJudge
>>>
>>> # Default: TokenOverlapJudge
>>> evaluator = Evaluator(retriever, [HitRate(k=4), Recall(k=4)])
>>>
>>> # Custom judge
>>> evaluator = Evaluator(
...     retriever,
...     [Recall(k=4)],
...     judge=SemanticJudge(threshold=0.8)
... )

EvretError

Bases: Exception

Base exception for Evret.

EvretValidationError

Bases: ValueError, EvretError

Raised when user input or data format is invalid.

GeneratedChunk dataclass

Chunk emitted by the dataset-generation chunker.

to_document_example()

Convert to Evret's evaluation document shape.

GeneratedDataset dataclass

Generated dataset with rich examples and Evret-compatible export.

to_dict()

Return a rich JSON-serializable dataset.

to_evaluation_dataset()

Convert to Evret's existing evaluation dataset model.

GeneratedExample dataclass

Rich generated query example before conversion to EvaluationDataset.

to_dict()

Return the rich JSON-serializable example.

to_query_example()

Convert to Evret's evaluation query shape.

HaystackRetrieverAdapter

Bridge Evret retrievers and Haystack 2.x retriever components.

HitRate

Bases: Metric

Binary top-k relevance presence metric.

Formula: HitRate@k = (1 / |Q|) * sum(1[relevant_i ∩ retrieved_i[:k] != ∅])

Judge

Bases: ABC

Base interface for relevance judges.

All judges implement a simple contract: - judge(context) → bool (is relevant?) - batch_judge(contexts) → list[bool] (batch evaluation)

Subclasses should override judge() and optionally batch_judge() for optimized batch processing.

name abstractmethod property

Judge display name for logging/debugging.

batch_judge(contexts)

Batch evaluation of multiple contexts.

Default implementation calls judge() for each context sequentially. Override this method for optimized batch processing (e.g., vectorized operations, async API calls, etc.).

Parameters:

Name Type Description Default
contexts list[JudgmentContext]

List of judgment contexts

required

Returns:

Type Description
list[bool]

List of boolean judgments (same order as input)

judge(context) abstractmethod

Return True if retrieved_text is relevant to expected_text given query.

Parameters:

Name Type Description Default
context JudgmentContext

Judgment context with query and texts

required

Returns:

Type Description
bool

True if retrieved text is relevant, False otherwise

JudgmentContext dataclass

Context passed to judge for relevance decision.

Attributes:

Name Type Description
query str

User query text

expected_text str

Expected/ground-truth relevant text

retrieved_text str

Retrieved candidate text to judge

LangChainRetrieverAdapter

Bases: LangChainBaseRetriever

Bridge Evret and LangChain retrievers.

LlamaIndexRetrieverAdapter

Bases: BaseRetriever

Wrap an Evret retriever as a LlamaIndex-compatible retriever.

MRR

Bases: Metric

Mean Reciprocal Rank query metric at top-k.

Formula: RR@k = 1 / rank_first_relevant if a hit exists in top-k, else 0.

MilvusRetriever

Bases: BaseRetriever

Retrieve documents from Milvus with a unified Evret interface.

NDCG

Bases: Metric

Normalized Discounted Cumulative Gain at top-k.

OptionalDependencyError

Bases: ImportError, EvretError

Raised when an optional dependency required by a feature is missing.

Precision

Bases: Metric

Purity metric over retrieved top-k documents.

Formula: Precision@k = |relevant ∩ retrieved[:k]| / k

QdrantRetriever

Bases: BaseRetriever

Retrieve documents from Qdrant with a unified Evret interface.

QueryExample dataclass

One query item in an evaluation dataset.

Provide expected_doc_ids when gold relevant document IDs are known. Provide expected_answers as gold answer text snippets for the judge to match against retrieved content when document IDs are not available.

RBP

Bases: Metric

Rank-Biased Precision with geometric persistence weighting.

Formula: RBP(p) = (1 - p) × Σ(i=1 to k) [ p^(i-1) × rel(i) ]

expected_search_depth property

Expected number of positions a user examines.

Expected depth = 1 / (1 - p)

name property

Metric display name including cutoff and persistence.

__init__(k, p=0.8)

Initialize RBP metric.

Parameters:

Name Type Description Default
k int

Rank cutoff position.

required
p float

Persistence parameter (0 < p < 1). Default is 0.8. Higher p = more patient user, examines deeper. Lower p = impatient user, focuses on top ranks.

0.8

Raises:

Type Description
ValueError

If p is not in the valid range (0, 1).

compute_residual(num_retrieved)

Compute residual for incomplete rankings.

The residual represents the upper bound contribution from unseen ranks (k+1, k+2, ...) if all were relevant.

Residual = p^k

Parameters:

Name Type Description Default
num_retrieved int

Number of documents actually retrieved.

required

Returns:

Type Description
float

Residual value (upper bound on unseen contribution).

score_query(retrieved_doc_ids, expected_answers)

Score a single query using RBP.

Parameters:

Name Type Description Default
retrieved_doc_ids Sequence[str]

Ordered list of retrieved document IDs.

required
expected_answers Collection[str] | dict[str, int]

Either a set/list of expected answer IDs (binary relevance) or a dict mapping doc_id → relevance grade. For graded relevance, grades are normalized to [0, 1].

required

Returns:

Type Description
float

RBP score in range [0, 1].

Recall

Bases: Metric

Coverage metric over expected answeruments.

Formula: Recall@k = |relevant ∩ retrieved[:k]| / |relevant|

RetrievalResult dataclass

Standard retrieval output for all retriever backends.

SourceDocument dataclass

Input document for dataset generation.

TokenOverlapJudge

Bases: Judge

Fast keyword/token-based relevance matching.

Suitable for exact/fuzzy text matching without semantic understanding. Uses weighted token overlap with configurable thresholds to determine relevance.

Algorithm
  1. Try exact match
  2. Try substring containment
  3. Remove common stopwords
  4. Score weighted token overlap
  5. Add small phrase and query overlap bonuses
  6. Penalize negation mismatches

Examples:

>>> judge = TokenOverlapJudge()  # Default settings
>>> judge = TokenOverlapJudge(min_tokens=50, overlap_ratio=0.7)
>>> judge = TokenOverlapJudge(min_tokens=15, overlap_ratio=0.6, query_boost=False)

Parameters:

Name Type Description Default
min_tokens int

Minimum shared tokens required (default: 15, suitable for RAG chunks)

15
overlap_ratio float

Minimum overlap ratio 0-1 (default: 0.6)

0.6
query_boost bool

Allow query tokens to relax threshold (default: True)

True
stopwords Iterable[str] | None

Tokens ignored during overlap scoring

None

name property

Judge display name.

__init__(min_tokens=15, overlap_ratio=0.6, query_boost=True, stopwords=None)

Initialize token overlap judge with configurable thresholds.

judge(context)

Judge using token overlap algorithm.

Parameters:

Name Type Description Default
context JudgmentContext

Judgment context with query and texts

required

Returns:

Type Description
bool

True if retrieved text matches expected text

score(context)

Return a relevance score in the range [0, 1].

WeaviateRetriever

Bases: BaseRetriever

Retrieve documents from Weaviate with a unified Evret interface.

build_generation_prompt(chunk, *, num_examples=6)

Build the single diverse generation prompt for one chunk.

chunk_documents(documents, *, config=None)

Split documents into structure-aware chunks.

configure_logging(*, level='INFO', structured=False, force=False)

Configure Evret logger handlers and formatter.

Parameters:

Name Type Description Default
level str

Log level string (e.g. DEBUG, INFO, WARNING, ERROR).

'INFO'
structured bool

When True, emit JSON logs.

False
force bool

When True, replace existing handlers on the evret root logger.

False

get_logger(name)

Return a package-scoped logger.

If name does not start with evret, it is namespaced under evret.