Evret Architecture: Judge System Design¶
Overview¶
Evret implements a judge-based architecture for text-based relevance matching in RAG evaluation. This document explains the design decisions, data flow, and why we chose boolean judgments over continuous scores.
System Architecture¶
High-Level Flow¶
┌─────────────────────────────────────────────────────────────────┐
│ USER INPUT │
│ Dataset with text-based relevance labels │
│ { │
│ "query": "What is RAG?", │
│ "expected_answers": [ │
│ "RAG combines retrieval with generation...", │
│ "Retrieval-augmented generation improves accuracy..." │
│ ] │
│ } │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVER │
│ Returns: Retrieved documents with text/metadata │
│ [ │
│ RetrievalResult( │
│ doc_id="doc_123", │
│ score=0.95, │
│ metadata={"text": "RAG is retrieval-augmented..."} │
│ ) │
│ ] │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ JUDGE SYSTEM │
│ │
│ Input: JudgmentContext( │
│ query="What is RAG?", │
│ expected_text="RAG combines retrieval...", │
│ retrieved_text="RAG is retrieval-augmented..." │
│ ) │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ TokenOverlapJudge / SemanticJudge / LLMJudge │ │
│ │ Determines: Is retrieved text relevant? │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ Output: Boolean (True/False) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ EVALUATOR │
│ Maps boolean judgments → ID sets │
│ │
│ retrieved_ids = ["relevant_doc_1", "relevant_doc_2", ...] │
│ relevant_ids = {"relevant_doc_1", "relevant_doc_2"} │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ METRICS │
│ Compute standard IR metrics on ID sets │
│ │
│ - Precision@k = |relevant ∩ retrieved[:k]| / k │
│ - Recall@k = |relevant ∩ retrieved[:k]| / |relevant| │
│ - MRR@k = 1 / rank_of_first_relevant │
│ - NDCG@k, HitRate@k, Average Precision@k │
│ │
│ Output: {"recall@4": 0.75, "precision@4": 0.5, ...} │
└─────────────────────────────────────────────────────────────────┘
Detailed Component Flow¶
1. Judge System Architecture¶
┌────────────────────────────────────────────────────────────────┐
│ Judge Interface │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ abstract class Judge: │ │
│ │ def judge(context: JudgmentContext) -> bool │ │
│ │ def batch_judge(contexts: List[Context]) -> List[bool]│ │
│ └──────────────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────────────┘
│
┌────────────────┼────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ TokenOverlap │ │ SemanticJudge │ │ LLMJudge │
│ Judge │ │ │ │ │
│ │ │ │ │ │
│ • Fast │ │ • Embeddings │ │ • LLM API │
│ • Token-based │ │ • Cosine sim │ │ • Highest acc │
│ • No deps │ │ • Batched │ │ • Async batch │
└─────────────────┘ └─────────────────┘ └─────────────────┘
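In Python terms, the contract above might look like the following minimal sketch. It is based on the signatures shown in the diagram; the actual Evret class definitions may differ in naming and defaults.

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

@dataclass
class JudgmentContext:
    """One (query, expected_text, retrieved_text) triple to be judged."""
    query: str
    expected_text: str
    retrieved_text: str

class Judge(ABC):
    """Boolean relevance judge: answers 'Is this retrieved text relevant?'."""

    @abstractmethod
    def judge(self, context: JudgmentContext) -> bool:
        ...

    def batch_judge(self, contexts: List[JudgmentContext]) -> List[bool]:
        # Default: one context at a time; subclasses such as SemanticJudge and
        # LLMJudge can override this with vectorized or async implementations.
        return [self.judge(context) for context in contexts]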
2. Evaluator Processing Pipeline¶
┌─────────────────────────────────────────────────────────────────┐
│ Evaluator.evaluate() │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 1: Retrieve Documents │
│ ──────────────────────────── │
│ retrieved_results = retriever.batch_retrieve(queries, k=max_k) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 2: Build Judgment Contexts │
│ ──────────────────────────────── │
│ For each query: │
│ For each retrieved result: │
│ For each expected relevant text: │
│ Create JudgmentContext( │
│ query=query_text, │
│ expected_text=relevant_label, │
│ retrieved_text=result.metadata["text"] │
│ ) │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 3: Batch Judge All Contexts │
│ ────────────────────────────────── │
│ all_judgments = judge.batch_judge(all_contexts) │
│ │
│ Returns: [True, False, True, True, False, ...] │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 4: Map Judgments to IDs │
│ ───────────────────────────── │
│ For each query: │
│ For each retrieved result: │
│ Find first matching expected text (judgment=True) │
│ Assign matched ID to retrieved result │
│ │
│ retrieved_ids = ["relevant_1", "relevant_2", "irrelevant_0"] │
│ relevant_ids = {"relevant_1", "relevant_2"} │
└────────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Step 5: Compute Metrics │
│ ──────────────────────── │
│ For each metric: │
│ score = metric.score(retrieved_ids, relevant_ids) │
│ │
│ Results: {"recall@4": 0.75, "precision@4": 0.5, ...} │
└─────────────────────────────────────────────────────────────────┘
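The pipeline above, condensed into a runnable sketch of Steps 4 and 5. The helper names and the relevant_*/irrelevant_* ID scheme are assumptions for illustration, not Evret's exact API:

from typing import List, Set, Tuple

def judgments_to_id_sets(
    num_expected: int,
    judgments: List[List[bool]],  # judgments[i][j]: retrieved doc i vs. expected text j
) -> Tuple[List[str], Set[str]]:
    """Step 4 in miniature: map boolean judgments to the ID lists/sets the metrics expect."""
    relevant_ids = {f"relevant_{j}" for j in range(num_expected)}
    retrieved_ids = []
    for i, doc_judgments in enumerate(judgments):
        # First expected text judged relevant wins; unmatched docs get a placeholder ID.
        match = next((j for j, ok in enumerate(doc_judgments) if ok), None)
        retrieved_ids.append(f"relevant_{match}" if match is not None else f"irrelevant_{i}")
    return retrieved_ids, relevant_ids

def recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Step 5: standard set-based recall on the mapped IDs."""
    return len(relevant_ids & set(retrieved_ids[:k])) / len(relevant_ids)

# One query, two expected texts, two retrieved docs (doc 0 matches, doc 1 does not):
retrieved_ids, relevant_ids = judgments_to_id_sets(2, [[True, True], [False, False]])
print(retrieved_ids)                                  # ['relevant_0', 'irrelevant_1']
print(recall_at_k(retrieved_ids, relevant_ids, k=2))  # 0.5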
Why Boolean Judgments?¶
Design Decision: Boolean vs. Continuous Scores¶
We chose boolean judgments (True/False) over continuous relevance scores (e.g., 0.0 to 1.0) for several critical reasons:
1. Alignment with IR Metrics Semantics¶
Traditional Information Retrieval metrics are set-based, not score-based:
# Recall: What fraction of relevant documents were retrieved?
Recall@k = |relevant ∩ retrieved[:k]| / |relevant|
# Precision: What fraction of retrieved documents are relevant?
Precision@k = |relevant ∩ retrieved[:k]| / k
These metrics operate on binary relevance: a document is either relevant or not. There's no "50% relevant" in the classic IR formulation.
Example:
# Boolean approach (what we use)
relevant = {"doc_1", "doc_2", "doc_3"}
retrieved = ["doc_1", "doc_5", "doc_2", "doc_9"]
recall = len(relevant & set(retrieved)) / len(relevant)  # 2 / 3 ≈ 0.667
# Continuous approach (problematic)
# How do you compute recall with scores?
# Do you sum? Average? Threshold?
2. Clean Separation of Concerns¶
┌─────────────────────────────────────────────────────────────┐
│ Judge: Answers "Is this relevant?" │
│ • Input: (query, expected_text, retrieved_text) │
│ • Output: Boolean decision │
│ • Responsibility: Relevance matching logic │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Metrics: Answers "How good is the retrieval?" │
│ • Input: (retrieved_ids, relevant_ids) │
│ • Output: Metric score (0.0 to 1.0) │
│ • Responsibility: Ranking quality measurement │
└─────────────────────────────────────────────────────────────┘
Why this matters:
- Judges focus on: "Does this match?" (domain-specific)
- Metrics focus on: "How good is the ranking?" (domain-agnostic)
- No confusion between "relevance score" and "metric score"
3. Ambiguity of Continuous Scores¶
Continuous scores introduce ambiguity:
# What does a score of 0.7 mean?
judge.judge_with_score(context) → 0.7
# Is this:
# - 70% confident it's relevant? → Use threshold (boolean anyway)
# - 70% semantically similar? → Metric already measures this
# - 70% of tokens overlap? → Implementation detail, not user concern
With boolean:
# Clear decision boundary
judge.judge(context) → True # It's relevant
judge.judge(context) → False # It's not relevant
Users can configure the threshold internally:
SemanticJudge(threshold=0.7) # Internally uses 0.7, returns boolean
4. Simplicity for Custom Judges¶
Boolean approach (what we use):
class MyCustomJudge(Judge):
def judge(self, context: JudgmentContext) -> bool:
# Simple: return True or False
return my_matching_logic(context.expected_text, context.retrieved_text)
Continuous approach (problematic):
class MyCustomJudge(Judge):
def judge(self, context: JudgmentContext) -> float:
# Complex: What scale? How to interpret?
# How does this interact with metrics?
return ??? # 0.0 to 1.0? Unbounded? Normalized?
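Returning to the boolean contract: a fully self-contained custom judge can be only a few lines. The class below is purely illustrative (a naive substring rule, not part of Evret); in practice it would subclass Judge and be passed to the evaluator like any built-in judge.

from dataclasses import dataclass

@dataclass
class JudgmentContext:
    query: str
    expected_text: str
    retrieved_text: str

class SubstringJudge:
    """Illustrative custom judge: relevant if the expected text appears in the retrieved text."""

    def judge(self, context: JudgmentContext) -> bool:
        return context.expected_text.lower() in context.retrieved_text.lower()

judge = SubstringJudge()
print(judge.judge(JudgmentContext(
    query="What is RAG?",
    expected_text="combines retrieval with generation",
    retrieved_text="RAG is a technique that combines retrieval with generation",
)))  # True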
5. Consistency with NDCG Binary Relevance¶
NDCG (Normalized Discounted Cumulative Gain) in Evret uses binary relevance:
# Document is either relevant (1) or not (0)
relevance_scores = [1, 0, 1, 1, 0] # Binary
# Not continuous:
relevance_scores = [0.8, 0.3, 0.9, 0.7, 0.1] # Would require graded relevance
Our boolean approach naturally maps:
- judge() → True → relevance = 1
- judge() → False → relevance = 0
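For reference, a minimal sketch of binary-relevance NDCG (standard textbook form, not necessarily Evret's exact code path) shows how the boolean judgments feed directly into the gain values:

import math
from typing import List, Set

def ndcg_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance NDCG: each retrieved doc contributes gain 1 if relevant, else 0."""
    gains = [1 if doc_id in relevant_ids else 0 for doc_id in retrieved_ids[:k]]
    dcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))
    # Ideal DCG: all relevant documents ranked at the top.
    ideal_dcg = sum(1 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), k)))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg_at_k(["relevant_1", "irrelevant_0", "relevant_2"], {"relevant_1", "relevant_2"}, k=3))  # ≈ 0.92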
6. Extensibility Without Breaking Changes¶
Boolean is a strict interface. If we need graded relevance in the future:
# Future: Graded relevance (backward compatible)
class GradedJudge(Judge):
def judge(self, context: JudgmentContext) -> bool:
# Still implements boolean interface
return self.grade(context) >= self.threshold
def grade(self, context: JudgmentContext) -> float:
# Optional: Additional method for continuous scores
return self._compute_relevance_score(context)
Boolean is the minimal contract: we can always add methods later, but we cannot remove them without breaking existing judges.
Comparison: Boolean vs. Continuous¶
| Aspect | Boolean (Our Choice) | Continuous |
|---|---|---|
| Simplicity | ✅ Clear True/False | ❌ Ambiguous scale |
| IR Metrics Alignment | ✅ Natural set operations | ❌ Requires thresholding |
| User Experience | ✅ Easy to understand | ❌ "What does 0.7 mean?" |
| Custom Judges | ✅ Simple to implement | ❌ Complex interface |
| Performance | ✅ Fast comparisons | ⚠️ Needs normalization |
| Extensibility | ✅ Can add grading later | ❌ Breaking change to simplify |
Judge Implementation Details¶
TokenOverlapJudge Algorithm¶
Input: JudgmentContext(query, expected_text, retrieved_text)
Step 1: Exact Match Check
├─ IF normalized(expected_text) == normalized(retrieved_text)
│ └─ RETURN True
│
Step 2: Substring Match
├─ IF expected_text in retrieved_text OR retrieved_text in expected_text
│ └─ RETURN True
│
Step 3: Token Overlap Computation
├─ expected_tokens = tokenize(expected_text)
├─ retrieved_tokens = tokenize(retrieved_text)
├─ shared_tokens = expected_tokens ∩ retrieved_tokens
│
├─ IF len(shared_tokens) < min_tokens
│ └─ RETURN False
│
├─ overlap_ratio = len(shared_tokens) / len(expected_tokens)
│
├─ IF overlap_ratio >= threshold
│ └─ RETURN True
│
└─ IF query_boost AND query shares tokens
├─ relaxed_threshold = threshold × 0.75
└─ RETURN overlap_ratio >= relaxed_threshold
Otherwise: RETURN False
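A compact Python sketch of the procedure above. The tokenization, the default threshold and min_tokens values, and the 0.75 relaxation factor mirror the pseudocode but should be treated as illustrative, not Evret's exact defaults:

import re

def _tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def token_overlap_judge(query, expected_text, retrieved_text,
                        threshold=0.5, min_tokens=2, query_boost=True) -> bool:
    """Sketch of the TokenOverlapJudge decision procedure; defaults are illustrative."""
    expected, retrieved = expected_text.lower().strip(), retrieved_text.lower().strip()

    # Steps 1-2: exact and substring matches are always relevant.
    if expected == retrieved or expected in retrieved or retrieved in expected:
        return True

    # Step 3: token overlap measured against the expected text.
    expected_tokens, retrieved_tokens = _tokens(expected), _tokens(retrieved)
    shared = expected_tokens & retrieved_tokens
    if len(shared) < min_tokens:
        return False
    overlap_ratio = len(shared) / len(expected_tokens)
    if overlap_ratio >= threshold:
        return True

    # Query boost: relax the threshold when the retrieved text also shares query tokens.
    if query_boost and _tokens(query) & retrieved_tokens:
        return overlap_ratio >= threshold * 0.75
    return False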
SemanticJudge Algorithm¶
Input: JudgmentContext(query, expected_text, retrieved_text)
Step 1: Encode Texts
├─ emb_expected = model.encode(expected_text)
├─ emb_retrieved = model.encode(retrieved_text)
Step 2: Compute Cosine Similarity
├─ similarity = dot(emb_expected, emb_retrieved) /
│ (norm(emb_expected) × norm(emb_retrieved))
Step 3: Apply Threshold
└─ RETURN similarity >= threshold
Batch Optimization:
├─ Encode all texts at once (vectorized)
├─ Compute all similarities (matrix operation)
└─ Apply threshold to all results
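A sketch of the batched path using sentence-transformers and NumPy; the model name and threshold here are assumptions, and Evret's actual implementation may differ:

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_batch_judge(pairs, threshold=0.7, model_name="all-MiniLM-L6-v2"):
    """pairs: list of (expected_text, retrieved_text) tuples. Returns one boolean per pair."""
    model = SentenceTransformer(model_name)
    expected = model.encode([p[0] for p in pairs], normalize_embeddings=True)
    retrieved = model.encode([p[1] for p in pairs], normalize_embeddings=True)
    # With L2-normalized embeddings, the row-wise dot product is the cosine similarity.
    similarities = np.sum(expected * retrieved, axis=1)
    return (similarities >= threshold).tolist()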
LLMJudge Algorithm¶
Input: JudgmentContext(query, expected_text, retrieved_text)
Step 1: Build Prompt
├─ prompt = TEMPLATE.format(
│ query=query,
│ expected_text=expected_text,
│ retrieved_text=retrieved_text
│ )
Step 2: Call LLM API
├─ response = provider.complete(prompt)
Step 3: Parse Response
├─ IF response starts with "YES" → RETURN True
├─ IF response starts with "NO" → RETURN False
├─ IF response contains positive keywords → RETURN True
├─ IF response contains negative keywords → RETURN False
└─ OTHERWISE → RETURN False (conservative)
Batch Optimization:
├─ Build prompts for all contexts
├─ Call API concurrently (asyncio.gather)
└─ Parse all responses
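The prompt template and provider call depend on configuration, but the response-parsing step can be sketched on its own. The keyword lists below are illustrative assumptions, not the exact lists used by LLMJudge:

NEGATIVE_KEYWORDS = ("not relevant", "irrelevant", "does not match")  # illustrative
POSITIVE_KEYWORDS = ("relevant", "matches", "answers the query")      # illustrative

def parse_llm_judgment(response: str) -> bool:
    """Map a free-form LLM reply to a boolean judgment, defaulting to False when unclear."""
    text = response.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    # Check negative phrases first so "not relevant" is not caught by the keyword "relevant".
    if any(keyword in text for keyword in NEGATIVE_KEYWORDS):
        return False
    if any(keyword in text for keyword in POSITIVE_KEYWORDS):
        return True
    return False  # Conservative: ambiguous responses count as not relevant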
Performance Characteristics¶
Time Complexity¶
| Judge | Single | Batch (n contexts) |
|---|---|---|
| TokenOverlap | O(t) | O(n·t) |
| Semantic | O(m) | O(n·m) |
| LLM | O(a) | O(a) with async |
Where:
- t = average text length (token operations)
- m = model inference time
- a = API latency
- n = number of contexts
Space Complexity¶
| Judge | Memory Usage |
|---|---|
| TokenOverlap | O(t) - temporary token sets |
| Semantic | O(d·n) - embeddings (d=dimensions) |
| LLM | O(1) - no local storage |
Example: Complete Evaluation Flow¶
Input Data¶
dataset = EvaluationDataset(
queries=[
QueryExample(
query_id="q1",
query_text="What is RAG?",
expected_answers=[
"RAG combines retrieval with generation for better accuracy",
"Retrieval-augmented generation improves LLM responses"
]
)
]
)
Retrieval Results¶
retrieved = [
RetrievalResult(
doc_id="doc_123",
score=0.95,
metadata={"text": "RAG is a technique that combines retrieval with generation"}
),
RetrievalResult(
doc_id="doc_456",
score=0.87,
metadata={"text": "Vector databases store embeddings"}
)
]
Judgment Phase¶
# Context 1: Expected text 1 vs. Retrieved doc 1
context_1_1 = JudgmentContext(
query="What is RAG?",
expected_text="rag combines retrieval with generation for better accuracy",
retrieved_text="rag is a technique that combines retrieval with generation"
)
judge.judge(context_1_1) → True # Match!
# Context 1: Expected text 2 vs. Retrieved doc 1
context_1_2 = JudgmentContext(
query="What is RAG?",
expected_text="retrieval augmented generation improves llm responses",
retrieved_text="rag is a technique that combines retrieval with generation"
)
judge.judge(context_1_2) → True # Match!
# Context 2: Expected text 1 vs. Retrieved doc 2
context_2_1 = JudgmentContext(
query="What is RAG?",
expected_text="rag combines retrieval with generation for better accuracy",
retrieved_text="vector databases store embeddings"
)
judge.judge(context_2_1) → False # No match
# Context 2: Expected text 2 vs. Retrieved doc 2
context_2_2 = JudgmentContext(
query="What is RAG?",
expected_text="retrieval augmented generation improves llm responses",
retrieved_text="vector databases store embeddings"
)
judge.judge(context_2_2) → False # No match
ID Mapping¶
# Retrieved doc 1 matches expected text 1; retrieved doc 2 matches nothing,
# so it keeps a non-relevant placeholder ID
retrieved_ids = ["relevant_doc_1", "retrieved_1:vector databases store embeddings"]
relevant_ids = {"relevant_doc_1", "relevant_doc_2"}
Metric Computation¶
# Recall@2 = |relevant ∩ retrieved[:2]| / |relevant|
recall = len(relevant_ids & set(retrieved_ids[:2])) / len(relevant_ids)  # 1 / 2 = 0.5
# Precision@2 = |relevant ∩ retrieved[:2]| / k
precision = len(relevant_ids & set(retrieved_ids[:2])) / 2  # 1 / 2 = 0.5
# HitRate@2
hit_rate = 1.0  # At least one relevant doc found in the top 2
Summary¶
The boolean judge design provides:
✅ Clarity - Binary decisions are unambiguous
✅ Correctness - Aligns with IR metric semantics
✅ Simplicity - Easy to implement and understand
✅ Performance - Fast comparisons, efficient batching
✅ Extensibility - Can add graded relevance later
✅ Separation - Clean boundary between matching and measurement
This architecture enables production-ready RAG evaluation with text-based matching while maintaining mathematical rigor and user-friendly interfaces.