
Judges API

evret.judges.base

Base interface for relevance judges.

Judge

Bases: ABC

Base interface for relevance judges.

All judges implement a simple contract:

- judge(context) → bool (is relevant?)
- batch_judge(contexts) → list[bool] (batch evaluation)

Subclasses should override judge() and optionally batch_judge() for optimized batch processing.
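As an illustrative sketch of this contract (a self-contained reimplementation, not the library's actual code; `ExactMatchJudge` is a hypothetical subclass added for demonstration):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class JudgmentContext:
    query: str
    expected_text: str
    retrieved_text: str


class Judge(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        """Judge display name for logging/debugging."""

    @abstractmethod
    def judge(self, context: JudgmentContext) -> bool:
        """Return True if retrieved_text is relevant to expected_text."""

    def batch_judge(self, contexts: list[JudgmentContext]) -> list[bool]:
        # Default implementation: sequential per-context evaluation.
        # Subclasses override this for vectorized or async batching.
        return [self.judge(c) for c in contexts]


class ExactMatchJudge(Judge):
    """Hypothetical example judge: relevant only on exact text match."""

    @property
    def name(self) -> str:
        return "exact-match"

    def judge(self, context: JudgmentContext) -> bool:
        return context.retrieved_text == context.expected_text
```

Only `name` and `judge()` are required; `batch_judge()` is inherited for free.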

name abstractmethod property

Judge display name for logging/debugging.

batch_judge(contexts)

Batch evaluation of multiple contexts.

Default implementation calls judge() for each context sequentially. Override this method for optimized batch processing (e.g., vectorized operations, async API calls, etc.).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `contexts` | `list[JudgmentContext]` | List of judgment contexts | *required* |

Returns:

| Type | Description |
| --- | --- |
| `list[bool]` | List of boolean judgments (same order as input) |

judge(context) abstractmethod

Return True if retrieved_text is relevant to expected_text given query.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `context` | `JudgmentContext` | Judgment context with query and texts | *required* |

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if retrieved text is relevant, False otherwise |

JudgmentContext dataclass

Context passed to judge for relevance decision.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `query` | `str` | User query text |
| `expected_text` | `str` | Expected/ground-truth relevant text |
| `retrieved_text` | `str` | Retrieved candidate text to judge |
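Constructing a context is a plain dataclass call. The definition below mirrors the attribute table above and is a self-contained stand-in for the library class:

```python
from dataclasses import dataclass


@dataclass
class JudgmentContext:
    query: str           # user query text
    expected_text: str   # ground-truth relevant text
    retrieved_text: str  # retrieved candidate to judge


ctx = JudgmentContext(
    query="What is the capital of France?",
    expected_text="Paris is the capital of France.",
    retrieved_text="The capital city of France is Paris.",
)
```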

evret.judges.token_overlap

Token overlap judge for keyword-based relevance matching.

TokenOverlapJudge

Bases: Judge

Fast keyword/token-based relevance matching.

Suitable for exact/fuzzy text matching without semantic understanding. Uses token overlap with configurable thresholds to determine relevance.

Algorithm
  1. Try exact match
  2. Try substring containment
  3. Check token overlap with minimum token and ratio thresholds
  4. Optionally boost with query token overlap
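The four steps above can be sketched roughly as follows. This is an illustrative reimplementation, not the library's actual code: whitespace tokenization, the containment check, and the exact query-boost rule (relaxing the ratio threshold when shared tokens also appear in the query) are all assumptions:

```python
def token_overlap_judge(query: str, expected: str, retrieved: str,
                        min_tokens: int = 2, overlap_ratio: float = 0.6,
                        query_boost: bool = True) -> bool:
    # 1. Exact match
    if retrieved == expected:
        return True
    # 2. Substring containment (either direction)
    if expected in retrieved or retrieved in expected:
        return True
    # 3. Token overlap against minimum-token and ratio thresholds
    exp_tokens = set(expected.lower().split())
    ret_tokens = set(retrieved.lower().split())
    if not exp_tokens:
        return False
    shared = exp_tokens & ret_tokens
    ratio = len(shared) / len(exp_tokens)
    if len(shared) >= min_tokens and ratio >= overlap_ratio:
        return True
    # 4. Optional boost: tokens shared with the query relax the ratio
    #    threshold (the 0.8 relaxation factor is an assumption)
    if query_boost:
        query_tokens = set(query.lower().split())
        if shared & query_tokens and len(shared) >= min_tokens \
                and ratio >= overlap_ratio * 0.8:
            return True
    return False
```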

Examples:

>>> judge = TokenOverlapJudge()  # Default settings
>>> judge = TokenOverlapJudge(min_tokens=3, overlap_ratio=0.7)
>>> judge = TokenOverlapJudge(min_tokens=2, overlap_ratio=0.6, query_boost=False)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `min_tokens` | `int` | Minimum shared tokens required | `2` |
| `overlap_ratio` | `float` | Minimum overlap ratio (0-1) | `0.6` |
| `query_boost` | `bool` | Allow query tokens to relax threshold | `True` |

name property

Judge display name.

__init__(min_tokens=2, overlap_ratio=0.6, query_boost=True)

Initialize token overlap judge with configurable thresholds.

judge(context)

Judge using token overlap algorithm.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `context` | `JudgmentContext` | Judgment context with query and texts | *required* |

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if retrieved text matches expected text |

evret.judges.semantic

Semantic similarity judge using sentence embeddings.

SemanticJudge

Bases: Judge

Embedding-based semantic similarity matching.

Uses sentence-transformers to compute dense embeddings and cosine similarity for relevance judgment. More accurate than token overlap but requires additional dependencies and computation.

Examples:

>>> judge = SemanticJudge()  # Default model
>>> judge = SemanticJudge(model="all-MiniLM-L6-v2", threshold=0.8)
>>> judge = SemanticJudge(threshold=0.7, device="cuda")
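The decision ultimately reduces to a cosine-similarity threshold check over embeddings. A dependency-free sketch of that check, where the bag-of-words `embed` below is only a stand-in for the dense sentence-transformers model the real class loads:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in embedding: sparse bag-of-words counts. The real judge
    # encodes texts with a sentence-transformers model instead.
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(v * b[t] for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_judge(expected: str, retrieved: str,
                   threshold: float = 0.75) -> bool:
    # Relevant iff embedding similarity clears the threshold.
    return cosine_similarity(embed(expected), embed(retrieved)) >= threshold
```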

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `str` | HuggingFace model name | `'sentence-transformers/all-MiniLM-L6-v2'` |
| `threshold` | `float` | Cosine similarity threshold (0-1) | `0.75` |
| `device` | `str` | Device for computation: "cpu" or "cuda" | `'cpu'` |
Requires

pip install sentence-transformers

name property

Judge display name.

__init__(model='sentence-transformers/all-MiniLM-L6-v2', threshold=0.75, device='cpu')

Initialize semantic judge with embedding model.

batch_judge(contexts)

Optimized batch evaluation using vectorized embeddings.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `contexts` | `list[JudgmentContext]` | List of judgment contexts | *required* |

Returns:

| Type | Description |
| --- | --- |
| `list[bool]` | List of boolean judgments |

judge(context)

Judge using embedding cosine similarity.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `context` | `JudgmentContext` | Judgment context with query and texts | *required* |

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if cosine similarity >= threshold |

evret.judges.llm.base

LLM-powered semantic relevance judge.

LLMJudge

Bases: Judge

LLM-powered semantic relevance judgment.

Uses GPT/Claude/other LLMs to determine if retrieved text matches expected content semantically. Most accurate but slowest judge option.

Examples:

>>> judge = LLMJudge(provider="openai")  # Uses OPENAI_API_KEY env
>>> judge = LLMJudge(provider="openai", api_key="sk-...")
>>> judge = LLMJudge(provider="anthropic", model="claude-3-5-sonnet-20241022")
>>> judge = LLMJudge(provider="google", model="gemini-2.5-flash")
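An LLM judge is essentially a prompt-and-parse loop. The sketch below shows the shape of that loop; the prompt wording, the `call_llm` stand-in, and the YES/NO parsing are all assumptions for illustration, not the library's actual prompt or client code:

```python
from typing import Callable


def build_prompt(query: str, expected: str, retrieved: str) -> str:
    # Hypothetical prompt; the library's actual wording is not shown here.
    return (
        "You are judging retrieval relevance.\n"
        f"Query: {query}\n"
        f"Expected (ground truth): {expected}\n"
        f"Retrieved candidate: {retrieved}\n"
        "Is the retrieved text semantically relevant to the expected text? "
        "Answer with exactly YES or NO."
    )


def parse_judgment(response: str) -> bool:
    # Conservative parsing: anything other than an explicit YES is False.
    return response.strip().upper().startswith("YES")


def llm_judge(query: str, expected: str, retrieved: str,
              call_llm: Callable[[str], str]) -> bool:
    # call_llm stands in for the provider client (OpenAI/Anthropic/Google).
    return parse_judgment(call_llm(build_prompt(query, expected, retrieved)))
```

Temperature 0.0 (the default) keeps the YES/NO answer as deterministic as the provider allows.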

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `provider` | `str` | LLM provider ("openai", "anthropic", or "google") | `'openai'` |
| `model` | `str \| None` | Model name (uses provider default if None) | `None` |
| `api_key` | `str \| None` | API key (reads from env if None) | `None` |
| `temperature` | `float` | Sampling temperature (0.0 for deterministic) | `0.0` |
| `max_retries` | `int` | Max retry attempts for failed API calls | `3` |
Requires

pip install openai        # for OpenAI
pip install anthropic     # for Anthropic
pip install google-genai  # for Google Gen AI

name property

Judge display name.

__init__(provider='openai', model=None, api_key=None, temperature=0.0, max_retries=3)

Initialize LLM judge with specified provider.

ajudge(context) async

Async judge using LLM.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `context` | `JudgmentContext` | Judgment context | *required* |

Returns:

| Type | Description |
| --- | --- |
| `bool` | Boolean relevance judgment |

batch_judge(contexts)

Batch evaluation with concurrent async API calls.
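Concurrency here typically means fanning the per-context calls out with `asyncio.gather`, which runs them concurrently and returns results in input order. A self-contained sketch, with a trivial stub standing in for the real async LLM call:

```python
import asyncio


async def ajudge(context: dict) -> bool:
    # Stand-in for the real async LLM call: yields control the way a
    # network request would, then applies a trivial relevance check.
    await asyncio.sleep(0)
    return context["retrieved_text"] == context["expected_text"]


async def batch_judge(contexts: list[dict]) -> list[bool]:
    # gather() schedules all calls concurrently; results keep input order.
    return list(await asyncio.gather(*(ajudge(c) for c in contexts)))


results = asyncio.run(batch_judge([
    {"expected_text": "a", "retrieved_text": "a"},
    {"expected_text": "a", "retrieved_text": "b"},
]))
```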

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `contexts` | `list[JudgmentContext]` | List of judgment contexts | *required* |

Returns:

| Type | Description |
| --- | --- |
| `list[bool]` | List of boolean judgments |

judge(context)

Judge using LLM prompt.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `context` | `JudgmentContext` | Judgment context with query and texts | *required* |

Returns:

| Type | Description |
| --- | --- |
| `bool` | True if LLM determines texts are semantically relevant |