Metrics API
evret.metrics.base
Base interface for retrieval evaluation metrics.
Metric
Bases: ABC
Base class for metrics evaluated at a top-k cutoff.
For query i with retrieved labels R_i and expected labels G_i,
each metric computes a per-query score at k and then averages:
score = (1 / |Q|) * sum(metric_i(R_i[:k], G_i))
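The averaging rule above can be illustrated with a small standalone sketch. It does not import evret and is not the library's implementation: `SketchMetric` and `SketchHitRate` are hypothetical names, and returning 0.0 for an empty batch is an assumption made only for this example.

```python
from abc import ABC, abstractmethod
from typing import Collection, Sequence


class SketchMetric(ABC):
    """Hypothetical stand-in for a top-k metric base class (not evret code)."""

    def __init__(self, k: int) -> None:
        self.k = k

    def top_k(self, retrieved_doc_ids: Sequence[str]) -> Sequence[str]:
        # Trim the ranked list to the cutoff k.
        return retrieved_doc_ids[: self.k]

    @abstractmethod
    def score_query(
        self, retrieved_doc_ids: Sequence[str], expected_answers: Collection[str]
    ) -> float:
        """Score a single query."""

    def score(
        self,
        retrieved_by_query: Sequence[Sequence[str]],
        expected_by_query: Sequence[Collection[str]],
    ) -> float:
        # score = (1 / |Q|) * sum of per-query scores.
        per_query = [
            self.score_query(retrieved, expected)
            for retrieved, expected in zip(retrieved_by_query, expected_by_query)
        ]
        # Assumption: an empty batch scores 0.0 instead of raising.
        return sum(per_query) / len(per_query) if per_query else 0.0


class SketchHitRate(SketchMetric):
    """Illustrative subclass: 1.0 if any expected ID appears in the top k."""

    def score_query(self, retrieved_doc_ids, expected_answers):
        expected = set(expected_answers)
        hit = any(doc_id in expected for doc_id in self.top_k(retrieved_doc_ids))
        return 1.0 if hit else 0.0


metric = SketchHitRate(k=2)
batch_score = metric.score(
    [["a", "b", "c"], ["x", "y", "z"]],
    [{"c"}, {"x"}],
)
```

In this batch the first query misses (its only expected ID sits at rank 3, past the cutoff) and the second hits, so the averaged score is 0.5.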
name
property
Metric display name including cutoff.
score(retrieved_by_query, expected_by_query)
Score a batch of queries by averaging per-query metric values.
score_query(retrieved_doc_ids, expected_answers)
abstractmethod
Score a single query.
top_k(retrieved_doc_ids)
Return the retrieved list trimmed to the metric cutoff k.
evret.metrics.hit_rate
evret.metrics.recall
evret.metrics.precision
evret.metrics.mrr
evret.metrics.ndcg
evret.metrics.err
ERR@K metric implementation.
ERR
Bases: Metric
Expected Reciprocal Rank with cascade model for graded relevance.
Formula:
ERR@k = Σ(i=1 to k) [ (1/i) × R(i) × Π(j=1 to i-1)(1 - R(j)) ]
where R(i) = (2^grade_i - 1) / 2^max_grade and grade_i is the relevance grade of the document at rank i
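A minimal sketch of the cascade formula, written as a standalone function rather than as the library's `ERR` class. The function name `err_at_k` and its list-of-grades input are assumptions for illustration; the input is the grade of each retrieved document in rank order.

```python
from typing import Sequence


def err_at_k(grades: Sequence[int], k: int, max_grade: int = 4) -> float:
    """Hypothetical helper: ERR@k over per-rank relevance grades."""
    score = 0.0
    p_reach = 1.0  # probability the user reaches the current rank unsatisfied
    for rank, grade in enumerate(grades[:k], start=1):
        # R(i): probability that the document at this rank satisfies the user.
        r = (2 ** grade - 1) / 2 ** max_grade
        score += p_reach * r / rank
        # The user continues only if this document did not satisfy them.
        p_reach *= 1.0 - r
    return score


example = err_at_k([4, 0, 2], k=3, max_grade=4)
```

For grades `[4, 0, 2]` the first rank contributes 15/16, the second contributes nothing, and the third contributes (1/3) × (3/16) × (1/16), for a total of about 0.9414. The running product `p_reach` avoids recomputing Π(1 - R(j)) at every rank.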
name
property
Metric display name including cutoff and max_grade.
__init__(k, max_grade=4)
Initialize ERR metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `k` | `int` | Rank cutoff position. | required |
| `max_grade` | `int` | Maximum relevance grade (default: 4). Grades should be in the range [0, max_grade]. | `4` |
score_query(retrieved_doc_ids, expected_answers)
Score a single query using ERR.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `retrieved_doc_ids` | `Sequence[str]` | Ordered list of retrieved document IDs. | required |
| `expected_answers` | `Collection[str] \| dict[str, int]` | Either a set/list of expected answer IDs (binary relevance) or a dict mapping doc_id → relevance grade (0 to max_grade). | required |
Returns:
| Type | Description |
|---|---|
| `float` | ERR score in range [0, 1]. |
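The two accepted shapes of `expected_answers` can be reduced to one list of per-rank grades before scoring. The sketch below shows one way to do that. The helper `grades_for_ranking` is hypothetical, and two of its behaviors are assumptions not stated in this reference: documents missing from the dict get grade 0, and a binary match is assigned the caller-supplied `binary_grade`.

```python
from typing import Collection, List, Mapping, Sequence, Union


def grades_for_ranking(
    retrieved_doc_ids: Sequence[str],
    expected_answers: Union[Collection[str], Mapping[str, int]],
    binary_grade: int,
) -> List[int]:
    """Hypothetical helper: map a ranking to relevance grades in rank order."""
    if isinstance(expected_answers, Mapping):
        # Graded relevance. Assumption: unlisted documents count as grade 0.
        return [expected_answers.get(doc_id, 0) for doc_id in retrieved_doc_ids]
    # Binary relevance. Assumption: a match receives `binary_grade`.
    expected = set(expected_answers)
    return [binary_grade if doc_id in expected else 0 for doc_id in retrieved_doc_ids]


graded = grades_for_ranking(["a", "b", "c"], {"a": 3, "c": 1}, binary_grade=4)
binary = grades_for_ranking(["a", "b"], ["b"], binary_grade=4)
```

Here `graded` is `[3, 0, 1]` and `binary` is `[0, 4]`. Which grade the library itself assigns to a binary match is not documented on this page, so `binary_grade` is left as an explicit argument.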
evret.metrics.rbp
RBP@K metric implementation.
RBP
Bases: Metric
Rank-Biased Precision with geometric persistence weighting.
Formula:
RBP(p) = (1 - p) × Σ(i=1 to k) [ p^(i-1) × rel(i) ]
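The formula can be sketched as a standalone function. The name `rbp_at_k` is an assumption for illustration, not the library's API. It takes the relevance of each retrieved document in rank order, already expressed as a value in [0, 1].

```python
from typing import Sequence


def rbp_at_k(rels: Sequence[float], k: int, p: float = 0.8) -> float:
    """Hypothetical helper: RBP over per-rank relevance values in [0, 1]."""
    # enumerate() starts at 0, so p ** i equals p^(rank - 1) for 1-based ranks.
    weighted = sum(p ** i * rel for i, rel in enumerate(rels[:k]))
    return (1.0 - p) * weighted


example = rbp_at_k([1.0, 0.0, 1.0], k=3, p=0.5)
```

With `p = 0.5` and relevant documents at ranks 1 and 3, the weighted sum is 1 + 0.25 = 1.25 and the score is 0.5 × 1.25 = 0.625. Because the sum stops at k, a ranking in which every document is relevant scores 1 - p^k, not 1.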
expected_search_depth
property
Expected number of positions a user examines.
Expected depth = 1 / (1 - p)
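The closed form follows from the RBP user model, in which the user always examines rank 1 and moves on to each next rank with probability p. A short sketch, with hypothetical function names, compares the closed form against a direct summation over stopping positions:

```python
def expected_search_depth(p: float) -> float:
    """Closed form: expected number of ranks examined."""
    return 1.0 / (1.0 - p)


def expected_depth_by_summation(p: float, horizon: int = 2000) -> float:
    """Numerical check: sum of i * P(user stops exactly at rank i)."""
    # The user stops exactly at rank i with probability p^(i-1) * (1 - p).
    return sum(i * p ** (i - 1) * (1.0 - p) for i in range(1, horizon + 1))


closed_form = expected_search_depth(0.8)
summed = expected_depth_by_summation(0.8)
```

Both values come out at about 5 for `p = 0.8`, so the default persistence models a user who looks at roughly five results on average.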
name
property
Metric display name including cutoff and persistence.
__init__(k, p=0.8)
Initialize RBP metric.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `k` | `int` | Rank cutoff position. | required |
| `p` | `float` | Persistence parameter, with 0 < p < 1. Default is 0.8. A higher p models a more patient user who examines deeper ranks; a lower p models an impatient user who focuses on the top ranks. | `0.8` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If p is not in the valid range (0, 1). |
compute_residual(num_retrieved)
Compute residual for incomplete rankings.
The residual is an upper bound on the contribution from unseen ranks (k+1, k+2, ...), reached only if every one of them were relevant.
Residual = p^k
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `num_retrieved` | `int` | Number of documents actually retrieved. | required |
Returns:
| Type | Description |
|---|---|
| `float` | Residual value (upper bound on unseen contribution). |
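The p^k bound can be checked numerically: it equals the total RBP weight carried by every rank past the evaluated depth. The sketch below uses hypothetical function names and a generic `depth` argument, since this page does not say whether the library derives the exponent from `k`, from `num_retrieved`, or from the smaller of the two.

```python
def rbp_residual(depth: int, p: float = 0.8) -> float:
    """Hypothetical helper: upper bound on the RBP mass beyond `depth` ranks."""
    return p ** depth


def tail_weight(depth: int, p: float, horizon: int = 2000) -> float:
    """Numerical check: (1 - p) * sum of p^(i-1) for ranks depth+1 .. horizon."""
    return (1.0 - p) * sum(p ** (i - 1) for i in range(depth + 1, horizon + 1))


residual = rbp_residual(10, p=0.8)
tail = tail_weight(10, p=0.8)
```

For `p = 0.8` and a depth of 10, both values are about 0.107, so a reported RBP score at that depth could understate the full-ranking score by at most that amount.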
score_query(retrieved_doc_ids, expected_answers)
Score a single query using RBP.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `retrieved_doc_ids` | `Sequence[str]` | Ordered list of retrieved document IDs. | required |
| `expected_answers` | `Collection[str] \| dict[str, int]` | Either a set/list of expected answer IDs (binary relevance) or a dict mapping doc_id → relevance grade. For graded relevance, grades are normalized to [0, 1]. | required |
Returns:
| Type | Description |
|---|---|
| `float` | RBP score in range [0, 1]. |