Judges API
evret.judges.base
Base interface for relevance judges.
Judge
Bases: ABC
Base interface for relevance judges.
All judges implement a simple contract:
- judge(context) → bool (is relevant?)
- batch_judge(contexts) → list[bool] (batch evaluation)
Subclasses should override judge() and optionally batch_judge() for optimized batch processing.
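As an illustration of this contract, the sketch below re-declares minimal stand-ins for `Judge` and `JudgmentContext` so that it runs without evret installed. The stand-ins mirror only what this page documents, so the real classes may differ in detail. `ExactMatchJudge` is a hypothetical subclass and is not part of the library.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class JudgmentContext:
    """Stand-in mirroring the documented attributes."""
    query: str
    expected_text: str
    retrieved_text: str


class Judge(ABC):
    """Stand-in mirroring the documented interface."""

    @property
    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def judge(self, context: JudgmentContext) -> bool:
        ...

    def batch_judge(self, contexts: list[JudgmentContext]) -> list[bool]:
        # Documented default: call judge() sequentially, preserving input order.
        return [self.judge(c) for c in contexts]


class ExactMatchJudge(Judge):
    """Hypothetical judge: relevant only if the texts match after normalization."""

    @property
    def name(self) -> str:
        return "exact_match"

    def judge(self, context: JudgmentContext) -> bool:
        expected = context.expected_text.strip().lower()
        retrieved = context.retrieved_text.strip().lower()
        return expected == retrieved


contexts = [
    JudgmentContext("capital of France?", "Paris", " paris "),
    JudgmentContext("capital of France?", "Paris", "Lyon"),
]
results = ExactMatchJudge().batch_judge(contexts)
print(results)
```

Only `name` and `judge()` have to be implemented here; `batch_judge()` is inherited and can be overridden later if per-item calls become a bottleneck.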
name
abstractmethod
property
Judge display name for logging/debugging.
batch_judge(contexts)
Batch evaluation of multiple contexts.
The default implementation calls judge() for each context sequentially. Override this method for optimized batch processing (e.g., vectorized operations or async API calls).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| contexts | list[JudgmentContext] | List of judgment contexts | required |
Returns:
| Type | Description |
|---|---|
| list[bool] | List of boolean judgments (same order as input) |
judge(context)
abstractmethod
Return True if retrieved_text is relevant to expected_text given query.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | JudgmentContext | Judgment context with query and texts | required |
Returns:
| Type | Description |
|---|---|
| bool | True if retrieved text is relevant, False otherwise |
JudgmentContext
dataclass
Context passed to judge for relevance decision.
Attributes:
| Name | Type | Description |
|---|---|---|
| query | str | User query text |
| expected_text | str | Expected/ground-truth relevant text |
| retrieved_text | str | Retrieved candidate text to judge |
evret.judges.token_overlap
Token overlap judge for keyword-based relevance matching.
TokenOverlapJudge
Bases: Judge
Fast keyword/token-based relevance matching.
Suitable for exact/fuzzy text matching without semantic understanding. Uses token overlap with configurable thresholds to determine relevance.
Algorithm
- Try exact match
- Try substring containment
- Check token overlap with minimum token and ratio thresholds
- Optionally boost with query token overlap
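The steps above can be sketched as a standalone function. This is not the library's implementation. The tokenizer, the direction of the substring check, the ratio denominator (expected-text tokens), and the way the query boost relaxes the threshold (a hypothetical 25% reduction) are all assumptions chosen for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    # Assumption: lowercase word tokens; the library's tokenizer may differ.
    return set(re.findall(r"\w+", text.lower()))


def is_relevant(query, expected, retrieved,
                min_tokens=2, overlap_ratio=0.6, query_boost=True):
    exp, ret = expected.strip().lower(), retrieved.strip().lower()
    if not exp or not ret:
        return False

    # Step 1: exact match.
    if exp == ret:
        return True

    # Step 2: substring containment (assumed direction: expected inside retrieved).
    if exp in ret:
        return True

    # Step 3: token overlap against the minimum-token and ratio thresholds.
    exp_tokens, ret_tokens = tokenize(exp), tokenize(ret)
    if not exp_tokens:
        return False
    shared = exp_tokens & ret_tokens
    if len(shared) < min_tokens:
        return False
    ratio = len(shared) / len(exp_tokens)  # assumed denominator
    if ratio >= overlap_ratio:
        return True

    # Step 4: optional query boost. Assumption: if the retrieved text shares
    # any token with the query, the ratio threshold is relaxed by 25%.
    if query_boost and tokenize(query) & ret_tokens:
        return ratio >= overlap_ratio * 0.75
    return False


query = "how do solar panels work"
expected = "solar panels convert sunlight into electricity"
retrieved = "photovoltaic panels turn sunlight into power"
print(is_relevant(query, expected, retrieved, query_boost=True))
print(is_relevant(query, expected, retrieved, query_boost=False))
```

In this example the texts share three of the six expected tokens (ratio 0.5), so the pair passes only when the assumed query boost lowers the threshold from 0.6 to 0.45.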
Examples:
>>> from evret.judges.token_overlap import TokenOverlapJudge
>>> judge = TokenOverlapJudge()  # Default settings
>>> judge = TokenOverlapJudge(min_tokens=3, overlap_ratio=0.7)
>>> judge = TokenOverlapJudge(min_tokens=2, overlap_ratio=0.6, query_boost=False)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| min_tokens | int | Minimum shared tokens required (default: 2) | 2 |
| overlap_ratio | float | Minimum overlap ratio 0-1 (default: 0.6) | 0.6 |
| query_boost | bool | Allow query tokens to relax threshold (default: True) | True |
name
property
Judge display name.
__init__(min_tokens=2, overlap_ratio=0.6, query_boost=True)
Initialize token overlap judge with configurable thresholds.
judge(context)
Judge using token overlap algorithm.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | JudgmentContext | Judgment context with query and texts | required |
Returns:
| Type | Description |
|---|---|
| bool | True if retrieved text matches expected text |
evret.judges.semantic
Semantic similarity judge using sentence embeddings.
SemanticJudge
Bases: Judge
Embedding-based semantic similarity matching.
Uses sentence-transformers to compute dense embeddings and cosine similarity for relevance judgment. More accurate than token overlap but requires additional dependencies and computation.
Examples:
>>> from evret.judges.semantic import SemanticJudge
>>> judge = SemanticJudge()  # Default model
>>> judge = SemanticJudge(model="all-MiniLM-L6-v2", threshold=0.8)
>>> judge = SemanticJudge(threshold=0.7, device="cuda")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | str | HuggingFace model name (default: sentence-transformers/all-MiniLM-L6-v2) | 'sentence-transformers/all-MiniLM-L6-v2' |
| threshold | float | Cosine similarity threshold 0-1 (default: 0.75) | 0.75 |
| device | str | Device for computation: "cpu" or "cuda" (default: "cpu") | 'cpu' |
Requires
pip install sentence-transformers
name
property
Judge display name.
__init__(model='sentence-transformers/all-MiniLM-L6-v2', threshold=0.75, device='cpu')
Initialize semantic judge with embedding model.
batch_judge(contexts)
Optimized batch evaluation using vectorized embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| contexts | list[JudgmentContext] | List of judgment contexts | required |
Returns:
| Type | Description |
|---|---|
| list[bool] | List of boolean judgments |
judge(context)
Judge using embedding cosine similarity.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | JudgmentContext | Judgment context with query and texts | required |
Returns:
| Type | Description |
|---|---|
| bool | True if cosine similarity >= threshold |
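The threshold rule can be illustrated without loading an embedding model. In the sketch below, toy 3-dimensional vectors stand in for the embeddings that sentence-transformers would produce, and the comparison is assumed to be between the expected and retrieved texts. `cosine_similarity` and `passes_threshold` are illustrative helpers, not library functions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def passes_threshold(expected_vec, retrieved_vec, threshold=0.75):
    # Mirrors the documented rule: relevant if cosine similarity >= threshold.
    return cosine_similarity(expected_vec, retrieved_vec) >= threshold


# Toy vectors standing in for real sentence embeddings.
expected_vec = [0.9, 0.1, 0.0]
close_vec = [0.8, 0.2, 0.1]   # points in nearly the same direction
far_vec = [0.0, 0.2, 0.9]     # nearly orthogonal

print(passes_threshold(expected_vec, close_vec))
print(passes_threshold(expected_vec, far_vec))
```

Raising the threshold makes the judge stricter: the near-parallel pair above scores about 0.98 and passes at the default of 0.75, while the near-orthogonal pair scores close to 0 and fails.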
evret.judges.llm.base
LLM-powered semantic relevance judge.
LLMJudge
Bases: Judge
LLM-powered semantic relevance judgment.
Uses an LLM (an OpenAI, Anthropic, or Google model, depending on the provider) to determine whether the retrieved text semantically matches the expected content. This is the most accurate judge option, but also the slowest.
Examples:
>>> from evret.judges.llm.base import LLMJudge
>>> judge = LLMJudge(provider="openai")  # Uses OPENAI_API_KEY env
>>> judge = LLMJudge(provider="openai", api_key="sk-...")
>>> judge = LLMJudge(provider="anthropic", model="claude-3-5-sonnet-20241022")
>>> judge = LLMJudge(provider="google", model="gemini-2.5-flash")
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| provider | str | LLM provider ("openai", "anthropic", or "google") | 'openai' |
| model | str \| None | Model name (uses provider default if None) | None |
| api_key | str \| None | API key (reads from env if None) | None |
| temperature | float | Sampling temperature (default: 0.0 for deterministic) | 0.0 |
| max_retries | int | Max retry attempts for failed API calls | 3 |
Requires
pip install openai        # for OpenAI
pip install anthropic     # for Anthropic
pip install google-genai  # for Google Gen AI
name
property
Judge display name.
__init__(provider='openai', model=None, api_key=None, temperature=0.0, max_retries=3)
Initialize LLM judge with specified provider.
ajudge(context)
async
Async judge using LLM.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | JudgmentContext | Judgment context | required |
Returns:
| Type | Description |
|---|---|
| bool | Boolean relevance judgment |
batch_judge(contexts)
Batch evaluation with concurrent async API calls.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| contexts | list[JudgmentContext] | List of judgment contexts | required |
Returns:
| Type | Description |
|---|---|
| list[bool] | List of boolean judgments |
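One common way to run a batch with concurrent async calls is `asyncio.gather`, sketched below. How `LLMJudge` does this internally is not documented on this page, so treat this as an assumption. `fake_ajudge` is a hypothetical stand-in for a provider API call, the `JudgmentContext` dataclass is a local stand-in, and the `max_concurrency` limit is an assumed safeguard rather than a documented parameter.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class JudgmentContext:
    # Stand-in mirroring the documented attributes.
    query: str
    expected_text: str
    retrieved_text: str


async def fake_ajudge(context: JudgmentContext) -> bool:
    # Hypothetical stand-in: a real judge would await a provider API here.
    await asyncio.sleep(0.01)  # simulate network latency
    return context.expected_text.lower() in context.retrieved_text.lower()


async def judge_all(contexts, max_concurrency=5):
    # The semaphore bounds the number of in-flight calls.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(ctx):
        async with semaphore:
            return await fake_ajudge(ctx)

    # gather() returns results in input order, whatever the completion order.
    return await asyncio.gather(*(bounded(c) for c in contexts))


def batch_judge(contexts):
    # Synchronous entry point wrapping the async fan-out.
    return list(asyncio.run(judge_all(contexts)))


contexts = [
    JudgmentContext("capital of France?", "Paris", "Paris is the capital of France."),
    JudgmentContext("capital of France?", "Paris", "Berlin is the capital of Germany."),
]
results = batch_judge(contexts)
print(results)
```

Note that `asyncio.run` cannot be called from inside an already running event loop, so code that is itself async would await `judge_all` directly instead.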
judge(context)
Judge using LLM prompt.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | JudgmentContext | Judgment context with query and texts | required |
Returns:
| Type | Description |
|---|---|
| bool | True if LLM determines texts are semantically relevant |