# Evret Documentation
Evret is a lightweight Python framework for evaluating retrievers in information retrieval and search systems.
It provides standard Information Retrieval metrics, a judge system for text-based relevance matching, and retriever adapters for common vector databases.
## What Evret Does
Evret turns retrieved results into metric scores:
- Load an evaluation dataset.
- Ask your retriever for top-k results (a rough retriever sketch follows this list).
- Match retrieved content against gold labels with a judge.
- Compute IR metrics over the matched relevant ids.
- Export a summary to JSON or CSV.
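The exact retriever contract is documented in the Retrievers guide. As a rough sketch of the shape a custom retriever might take (the method name, arguments, and return format here are assumptions for illustration, not Evret's actual interface):

```python
# Hypothetical sketch only: see the Retrievers guide for the real contract.
# The method name, arguments, and return shape below are assumptions.
class MyRetriever:
    def __init__(self, client):
        self.client = client

    def retrieve(self, query: str, k: int = 4) -> list[dict]:
        # Return the top-k results with both an id and the text content,
        # so id-based metrics and text-based judges each have what they need.
        hits = self.client.search(query, limit=k)
        return [{"id": hit.id, "text": hit.text} for hit in hits]
```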
The evaluator supports two relevance styles:
- `relevant_doc_ids` for classic IR evaluation with known document ids
- `expected_answers` for judge-based evaluation with gold text snippets
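A dataset entry can carry either or both fields. A minimal sketch of what `eval_data.json` might contain (the `query` field name and overall schema are assumptions; see the Quickstart for the actual format):

```json
[
  {
    "query": "What is the capital of France?",
    "relevant_doc_ids": ["doc_12"],
    "expected_answers": ["Paris is the capital of France."]
  }
]
```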
## Key Features
- Standard IR metrics: Hit Rate, Recall, Precision, MRR, nDCG, and Average Precision
- Judge-based matching: token overlap by default, with optional semantic and LLM judges
- Vector database support: Qdrant, Chroma, Weaviate, and Milvus retrievers
- Framework integrations: LangChain and LlamaIndex adapters
- Simple result exports: `summary()`, `to_json()`, and `to_csv()`
## Quick Example
```python
from evret import EvaluationDataset, Evaluator, HitRate, MRR, NDCG, TokenOverlapJudge

dataset = EvaluationDataset.from_json("eval_data.json")

evaluator = Evaluator(
    retriever=my_retriever,
    metrics=[HitRate(k=4), MRR(k=4), NDCG(k=4)],
    judge=TokenOverlapJudge(min_tokens=2, overlap_ratio=0.6),
)

results = evaluator.evaluate(dataset)
print(results.summary())
```
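The same results object also supports the exports listed above; a short sketch (the path arguments are illustrative placeholders, check the Evaluation page for the exact signatures):

```python
# Persist the run for later comparison; file names are placeholders.
results.to_json("eval_run.json")
results.to_csv("eval_run.csv")
```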
## Metrics
| Metric | What It Answers | Best Use |
|---|---|---|
| Hit Rate@k | Did at least one relevant result appear? | Basic retrieval sanity checks |
| Recall@k | How much relevant content was found? | Coverage and completeness |
| Precision@k | How clean are the top-k results? | LLM context quality |
| MRR@k | How early is the first relevant result? | Single-answer QA |
| nDCG@k | Are relevant results ranked high? | Ranking and reranker quality |
| Average Precision@k | How good is precision across relevant ranks? | Benchmark-style ranking comparison |
See Metrics Overview for formulas and examples.
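As a quick intuition for how the scores differ, here is a hand computation on a toy ranked list in plain Python, independent of Evret's own metric classes:

```python
# One query at k=4: two of the four retrieved ids are relevant, one relevant id was missed.
retrieved = ["d7", "d2", "d9", "d4"]
relevant = {"d2", "d4", "d5"}   # d5 was never retrieved
k = 4

hits = [doc_id in relevant for doc_id in retrieved[:k]]

hit_rate = 1.0 if any(hits) else 0.0          # 1.0 -> at least one hit in the top-4
precision = sum(hits) / k                     # 2/4 = 0.50
recall = sum(hits) / len(relevant)            # 2/3 ≈ 0.67
reciprocal_rank = 1.0 / next(i for i, h in enumerate(hits, 1) if h)  # first hit at rank 2 -> 0.50
print(hit_rate, precision, recall, round(reciprocal_rank, 2))
```

Averaging the reciprocal rank over all queries gives MRR@k; nDCG@k and Average Precision@k additionally weight how high each relevant result is ranked.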
## Judges
Judges decide whether retrieved text matches the expected text in your dataset. The `Evaluator` uses `TokenOverlapJudge()` by default.
| Judge | Best For | Dependency |
|---|---|---|
| `TokenOverlapJudge` | Fast local checks | none |
| `SemanticJudge` | Embedding similarity | `evret[semantic]` |
| `LLMJudge` | Complex paraphrase matching | LLM provider extra |
See Judges for parameters and provider defaults.
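To use a different judge, pass it to the `Evaluator`. A sketch with `SemanticJudge` (the constructor argument shown is an assumption; the Judges page lists the real parameters and provider defaults):

```python
from evret import Evaluator, HitRate, SemanticJudge

# Requires the optional extra: pip install "evret[semantic]"
evaluator = Evaluator(
    retriever=my_retriever,
    metrics=[HitRate(k=4)],
    judge=SemanticJudge(threshold=0.75),  # threshold is illustrative, not the documented signature
)
```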
## Installation
```bash
pip install evret
```
Optional extras:
pip install "evret[qdrant]"
pip install "evret[langchain]"
pip install "evret[semantic]"
pip install "evret[all]"
## Docs Map
- Quickstart: end-to-end first evaluation
- Metrics: metric formulas and examples
- Evaluation: evaluator flow and result object
- Retrievers: retriever contract and backend guides
- Integrations: framework adapters
- API Reference: generated API docs