Judges

Judges decide whether a retrieved text matches the expected text for a query. The evaluator maps those decisions to the relevance labels that the standard IR metrics expect.

TokenOverlapJudge

Use TokenOverlapJudge for fast local checks with no external dependency. It removes common stopwords, scores weighted token overlap, adds small phrase and query-match bonuses, and rejects negation mismatches.

Do not use TokenOverlapJudge as the text judge for LLM-generated evaluation datasets. Generated expected answers may be paraphrased or summarized from the source chunk, so use LLMJudge for text-based judgment. If the generated dataset keeps expected_doc_ids, the evaluator scores those document IDs directly and does not need a judge for metric matching.

from evret.judges import TokenOverlapJudge

judge = TokenOverlapJudge(
    min_tokens=15,
    overlap_ratio=0.6,
    query_boost=True,
)

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| min_tokens | int | 15 | Minimum number of shared tokens required before a match can pass. |
| overlap_ratio | float | 0.6 | Minimum score required for a match. |
| query_boost | bool | True | Adds a small score boost when matched tokens also appear in the query. |
| stopwords | Iterable[str] \| None | None | Optional custom stopword list. Defaults to Evret's built-in English stopwords. |
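
To make the roles of min_tokens, overlap_ratio, and query_boost more concrete, here is a minimal sketch of a token-overlap check. It is not Evret's implementation: the stopword list, the unweighted score, the size of the query boost (0.05), and the negation handling are all simplifying assumptions, and the phrase bonus is left out. The function name overlap_match is hypothetical.

```python
import re

# Tiny illustrative stopword list (assumption); Evret ships its own built-in list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "or", "it"}
# Words treated as negation markers in this sketch (assumption).
NEGATIONS = {"not", "no", "never"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def content_tokens(text: str) -> set[str]:
    """Return tokens that are neither stopwords nor negation markers."""
    return {t for t in tokenize(text) if t not in STOPWORDS and t not in NEGATIONS}


def has_negation(text: str) -> bool:
    """Return True when the text contains a negation marker."""
    return any(t in NEGATIONS for t in tokenize(text))


def overlap_match(
    expected: str,
    retrieved: str,
    query: str = "",
    min_tokens: int = 3,  # far below Evret's default of 15, to suit short demo strings
    overlap_ratio: float = 0.6,
    query_boost: bool = True,
) -> bool:
    """Hypothetical, simplified stand-in for a token-overlap judge."""
    expected_tokens = content_tokens(expected)
    shared = expected_tokens & content_tokens(retrieved)

    # Gate 1: enough shared content tokens must exist before scoring at all.
    if not expected_tokens or len(shared) < min_tokens:
        return False
    # Gate 2: reject when only one side is negated.
    if has_negation(expected) != has_negation(retrieved):
        return False

    # Unweighted share of expected tokens that were found (Evret uses weights).
    score = len(shared) / len(expected_tokens)
    # Small boost when a shared token also appears in the query (size is assumed).
    if query_boost and shared & content_tokens(query):
        score = min(1.0, score + 0.05)
    return score >= overlap_ratio


expected = "The capital of France is Paris"
print(overlap_match(expected, "Paris is the capital city of France"))
print(overlap_match(expected, "Paris is not the capital of France"))
```

The two gates run before the ratio check, which is why a short retrieved text with only a few shared words fails even when its ratio would be high.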

RAG Use Case: Configuring for Different Chunk Sizes

The default min_tokens=15 is suitable for medium-sized chunks. Adjust based on your chunk size:

# For ~1000 token chunks
judge = TokenOverlapJudge(
    min_tokens=50,      # 5% of chunk size
    overlap_ratio=0.6,
)

# For ~500 token chunks (the default works well)
judge = TokenOverlapJudge(
    min_tokens=15,      # default, about 3% of chunk size
    overlap_ratio=0.6,
)

# For smaller chunks (~200 tokens)
judge = TokenOverlapJudge(
    min_tokens=10,      # 5% of chunk size
    overlap_ratio=0.6,
)

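Rule of thumb: set min_tokens to 5-10% of your average chunk size, measured in tokens, so that a match reflects meaningful overlap rather than a few incidental shared words. The default of 15 is slightly more lenient for ~500-token chunks (about 3%).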
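
The rule of thumb can be turned into a small helper. This is only a sketch: suggest_min_tokens is a hypothetical function, not part of Evret, and the floor of 1 is an assumption to avoid a zero threshold on very small chunks.

```python
def suggest_min_tokens(avg_chunk_tokens: int, fraction: float = 0.05, floor: int = 1) -> int:
    """Suggest a min_tokens value as a fraction of the average chunk size.

    Hypothetical helper: `fraction` follows the 5-10% rule of thumb,
    and `floor` keeps the result from dropping to zero.
    """
    if avg_chunk_tokens <= 0:
        raise ValueError("avg_chunk_tokens must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return max(floor, round(avg_chunk_tokens * fraction))


# Chunk sizes from the examples above, at the 5% end of the range.
for size in (1000, 200):
    print(size, suggest_min_tokens(size))
```

The result can then be passed as min_tokens when constructing TokenOverlapJudge.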

SemanticJudge

Use SemanticJudge when exact words differ but the meaning should match.

from evret.judges import SemanticJudge

judge = SemanticJudge(
    model="sentence-transformers/all-MiniLM-L6-v2",
    threshold=0.75,
    device="cpu",
)

Install the optional dependency first:

pip install "evret[semantic]"

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| model | str | sentence-transformers/all-MiniLM-L6-v2 | SentenceTransformers model used for embeddings. |
| threshold | float | 0.75 | Cosine similarity cutoff. Values must be in [0, 1]. |
| device | str | cpu | Device passed to SentenceTransformers, such as cpu or cuda. |
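
The threshold parameter is easier to reason about with the cosine comparison written out. The sketch below uses hand-made two-dimensional vectors in place of real embeddings and plain Python in place of SentenceTransformers, so it only illustrates the cutoff logic. The function names are hypothetical and this is not how Evret computes similarity internally.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def is_semantic_match(
    expected_vec: Sequence[float],
    retrieved_vec: Sequence[float],
    threshold: float = 0.75,
) -> bool:
    """Return True when the similarity reaches the cutoff."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return cosine_similarity(expected_vec, retrieved_vec) >= threshold


# Toy vectors (assumption): real embeddings have hundreds of dimensions.
print(is_semantic_match([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(is_semantic_match([1.0, 0.0], [1.0, 1.0]))  # 45 degrees apart, cosine about 0.707
```

With the default cutoff of 0.75, vectors 45 degrees apart fall just short, which gives a feel for how strict the default is.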

LLMJudge

Use LLMJudge when the expected and retrieved text need semantic judgment from an LLM.

Use LLMJudge for text-based evaluation of LLM-generated datasets. It is a better fit than token overlap when expected answers were generated from chunks and may not share exact wording with retrieved content.

from evret.judges import LLMJudge

judge = LLMJudge(
    provider="openai",
    model="gpt-5.4-nano",
    api_key=None,
    temperature=0.0,
    max_retries=3,
)

Install the dependency for the provider you use:

pip install "evret[llm-openai]"
pip install "evret[llm-anthropic]"
pip install "evret[llm-google]"

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| provider | str | openai | LLM provider. Supported values are openai, anthropic, and google. |
| model | str \| None | None | Model name. When omitted, Evret uses the provider default. |
| api_key | str \| None | None | API key. When omitted, Evret reads the provider environment variable. |
| temperature | float | 0.0 | Sampling temperature for deterministic relevance decisions. |
| max_retries | int | 3 | Retry attempts for failed API calls. |

Provider defaults:

| Provider | Default Model | Environment Variable |
| --- | --- | --- |
| openai | gpt-5.4-nano | OPENAI_API_KEY |
| anthropic | claude-haiku-4-5-20251001 | ANTHROPIC_API_KEY |
| google | gemini-3-flash-preview | GEMINI_API_KEY or GOOGLE_API_KEY |
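
The api_key fallback can be pictured as a lookup against the table above. The following is a sketch under assumptions, not Evret's code: resolve_api_key is a hypothetical helper, and checking GEMINI_API_KEY before GOOGLE_API_KEY simply follows the order in which the table lists them.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Environment variables per provider, taken from the table above.
# The lookup order for google is an assumption.
PROVIDER_ENV_VARS = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "google": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
}


def resolve_api_key(
    provider: str,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else the first matching env variable."""
    if provider not in PROVIDER_ENV_VARS:
        raise ValueError(f"unsupported provider: {provider!r}")
    if api_key:
        return api_key
    env = os.environ if env is None else env
    for name in PROVIDER_ENV_VARS[provider]:
        value = env.get(name)
        if value:
            return value
    names = " or ".join(PROVIDER_ENV_VARS[provider])
    raise RuntimeError(f"no API key found; set {names}")


# A plain dict stands in for the real environment in this demo.
fake_env = {"GOOGLE_API_KEY": "demo-key"}
print(resolve_api_key("google", env=fake_env))
```

Passing api_key explicitly always wins in this sketch, which matches the "when omitted" wording in the parameter table.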

Use A Judge With Evaluator

from evret import EvaluationDataset, Evaluator, HitRate, Recall
from evret.judges import TokenOverlapJudge

dataset = EvaluationDataset.from_json("examples/eval_data.json")
judge = TokenOverlapJudge(min_tokens=15, overlap_ratio=0.6)

evaluator = Evaluator(
    retriever=my_retriever,  # your retriever instance, defined elsewhere
    metrics=[HitRate(k=4), Recall(k=4)],
    judge=judge,
)

results = evaluator.evaluate(dataset)
print(results.summary())
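
For intuition about how judge decisions turn into metric values, here is a rough sketch of hit rate and recall at k computed from a match function. It is not Evret's Evaluator: the Judge callable signature, the function names, and the substring-based toy judge are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Assumed judge shape: (expected_text, retrieved_text) -> does it match?
Judge = Callable[[str, str], bool]


def hit_at_k(expected: Sequence[str], retrieved: Sequence[str], judge: Judge, k: int) -> float:
    """1.0 if any top-k result matches any expected text, else 0.0."""
    top_k = retrieved[:k]
    return 1.0 if any(judge(e, r) for e in expected for r in top_k) else 0.0


def recall_at_k(expected: Sequence[str], retrieved: Sequence[str], judge: Judge, k: int) -> float:
    """Share of expected texts matched by at least one top-k result."""
    if not expected:
        return 0.0
    top_k = retrieved[:k]
    found = sum(1 for e in expected if any(judge(e, r) for r in top_k))
    return found / len(expected)


def toy_judge(expected: str, retrieved: str) -> bool:
    """Stand-in judge: case-insensitive substring check."""
    return expected.lower() in retrieved.lower()


expected_texts = ["paris", "seine"]
retrieved_texts = [
    "Paris is the capital of France.",
    "Berlin is in Germany.",
    "The Seine flows through Paris.",
]
print(hit_at_k(expected_texts, retrieved_texts, toy_judge, k=2))
print(recall_at_k(expected_texts, retrieved_texts, toy_judge, k=2))
```

Swapping toy_judge for a stricter or looser match function changes the labels and therefore the metric values, which is the reason the choice of judge matters.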