Evaluation Metrics
This section explains both the formulas and the exact calculation logic used in SearchTweak.
Why This Matters
The same metric name can be implemented differently across tools (especially around missing grades and multi-feedback aggregation). This page documents SearchTweak's behavior explicitly.
Notation
- `k`: cutoff depth (`@k`)
- `i`: rank position (`1..k`)
- `q`: keyword index
- `Q`: number of keywords with a computable value
- `r_i`: binary relevance at position `i` (`0` or `1`)
- `g_i`: gain at position `i` (graded/detail numeric value)
Two-Stage Aggregation in SearchTweak
When strategy is Multiple (3), each snapshot can have up to 3 feedback slots.
Stage 1, per-snapshot aggregation:
- Binary scale: majority vote (`1` if relevant votes > irrelevant votes, `0` if the opposite); a tie becomes `null` (for example `1,0` or `1,0,null`)
- Graded/detail scale: arithmetic mean of non-null grades
Stage 2, metric computation:
- Metrics are computed over aggregated snapshot values at ranks `1..k`.
- For full evaluation, SearchTweak stores the mean across keywords with non-null metric values.
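The two stages above can be sketched as follows. This is a minimal illustration, assuming `None` represents a null feedback slot; the function names are my own and not SearchTweak's API:

```python
from statistics import mean

def aggregate_binary(votes):
    """Stage 1, binary scale: majority vote over up to 3 feedback slots.

    Returns 1 if relevant votes outnumber irrelevant votes, 0 if the
    opposite, and None on a tie (or if no slot is graded).
    """
    graded = [v for v in votes if v is not None]
    relevant = sum(1 for v in graded if v == 1)
    irrelevant = len(graded) - relevant
    if not graded or relevant == irrelevant:
        return None  # tie becomes null, e.g. [1, 0] or [1, 0, None]
    return 1 if relevant > irrelevant else 0

def aggregate_graded(grades):
    """Stage 1, graded/detail scale: arithmetic mean of non-null grades."""
    graded = [g for g in grades if g is not None]
    return mean(graded) if graded else None
```

Stage 2 then computes the metric over these per-snapshot values at ranks `1..k`.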
Missing Grades and null Handling
- If all positions used by a metric are ungraded, the metric value is `null`.
- For `P@k`, the denominator is the count of graded positions (not always `k`).
- For `CG`/`DCG`/`nDCG`, ungraded positions contribute `0` to the summation, but if everything is ungraded the result is `null`.
- For binary metrics with ties in strategy `3`, ties produce `null` and affect the metric as above.
Binary Metrics
Precision@k (P@k)
In simple words: What percentage of the top results are actually relevant? (e.g. if 4 out of 10 items are good, P@10 is 40% or 0.4).
`P@k = relevant_graded / graded_count`

Where:
- `relevant_graded` is the number of positions with aggregated binary value `1`
- `graded_count` is the number of positions with a non-null aggregated value
Use P@k when you need a simple "share of relevant among judged".
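A minimal sketch of this calculation, assuming `None` marks an ungraded position (the function name is illustrative):

```python
def precision_at_k(values, k):
    """P@k over aggregated binary snapshot values (None = ungraded).

    The denominator is the count of graded positions, not k; if no
    position in the top-k is graded, the metric itself is None (null).
    """
    top = values[:k]
    graded = [v for v in top if v is not None]
    if not graded:
        return None
    return sum(graded) / len(graded)
```

Note how `precision_at_k([1, 0, None, 1, 0], 5)` divides by 4 graded positions, not by 5.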
Average Precision@k (AP@k)
In simple words: A smarter Precision. It rewards a search system much more if it places the relevant items at the very top of the list rather than at the bottom. The multi-keyword version of this is called MAP (Mean Average Precision).
`AP@k = (1 / R) * Σ_{i=1..k} r_i * P(i)`

Where `R` is the number of relevant positions (`r_i = 1`) in the top-`k`, and `P(i)` is the precision at rank `i` calculated using the rank position as the denominator (`P(i) = (Σ_{j=1..i} r_j) / i`).
Note on Missing Grades for AP@k (IMPORTANT): Unlike the standalone `P@k` metric, which ignores ungraded items in its denominator, `P(i)` inside the AP formula always uses the strict rank `i` as its denominator. Therefore, an ungraded item acts effectively as an irrelevant item (`0`) during the calculation.
SearchTweak behavior:
- returns `null` if nothing is graded at all
- returns `0` if graded data exists but no relevant result exists
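A sketch of the behavior described above, under the same `None`-for-ungraded convention (illustrative function name). Note the strict rank `i` in the inner precision, so `None` contributes like an irrelevant item:

```python
def average_precision_at_k(values, k):
    """AP@k: sum P(i) at each relevant rank i, divided by R.

    P(i) uses the strict rank i as denominator, so ungraded (None)
    positions count as irrelevant inside the sum. Returns None if
    nothing is graded, 0.0 if graded data exists but nothing is relevant.
    """
    top = values[:k]
    if all(v is None for v in top):
        return None
    hits = 0      # R so far: relevant positions seen
    score = 0.0   # running sum of P(i) at relevant ranks
    for i, v in enumerate(top, start=1):
        if v == 1:
            hits += 1
            score += hits / i  # P(i) with strict rank denominator
    return score / hits if hits else 0.0
```

For example, `[1, None, 1, 0]` at `k=4` gives `(1/1 + 2/3) / 2 = 5/6`; the ungraded second position drags `P(3)` down to `2/3` exactly as an irrelevant item would.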
Reciprocal Rank@k (RR@k)
In simple words: How deep do you have to scroll to find the first good result? If the 1st result is relevant, you get 1. If the 2nd is the first relevant one, you get 1/2. If the 3rd, 1/3, and so on. The multi-keyword version is called MRR.
SearchTweak behavior:
- returns `null` if nothing is graded
- returns `0` if graded data exists but no relevant result exists
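A minimal sketch, assuming ungraded positions (`None`) still consume a rank but cannot be the first relevant hit (illustrative name):

```python
def reciprocal_rank_at_k(values, k):
    """RR@k: 1 / rank of the first relevant result in the top-k.

    Returns None if nothing is graded, 0.0 if graded data exists
    but no relevant result is found.
    """
    top = values[:k]
    if all(v is None for v in top):
        return None
    for i, v in enumerate(top, start=1):
        if v == 1:
            return 1.0 / i
    return 0.0
```

So `[0, None, 1]` yields `1/3`: the first relevant result sits at rank 3.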
Graded Metrics
Cumulative Gain@k (CG@k)
In simple words: Just adds up all the grades in the top results. It tells you the total amount of "goodness" returned, but it doesn't care whether the best results are at the very top or at the very bottom.
No positional discount: `CG@k = Σ_{i=1..k} g_i`.
Discounted Cumulative Gain@k (DCG@k)
In simple words: Similar to CG, but features a "discount" that severely punishes the system for putting good items at the bottom. A perfect match at rank 1 is worth much more than a perfect match at rank 10.
Earlier relevant documents contribute more, because each gain is divided by a rank-based discount (commonly `DCG@k = Σ_{i=1..k} g_i / log2(i + 1)`).
Normalized DCG@k (nDCG@k)
In simple words: Since some queries naturally have many good answers and others have only one, DCG scores can be hard to compare. nDCG solves this by taking the DCG score and dividing it by the "perfect" possible score for that exact query. The result is always a percentage from 0 to 1, making it easy to compare quality across completely different queries.
`nDCG@k = DCG@k / IDCG@k`, where `IDCG@k` is the DCG for the same gains sorted in descending order.
SearchTweak behavior:
- returns `null` if nothing is graded
- returns `0` if `IDCG@k = 0`
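The graded metrics above can be sketched as follows. This assumes the common `log2(i + 1)` discount (the exact discount SearchTweak uses may differ) and treats `None` gains as contributing `0` to the sum; function names are illustrative:

```python
import math

def dcg_at_k(gains, k):
    """DCG@k with a log2(i + 1) discount (a common convention).

    Ungraded (None) positions contribute 0 to the summation.
    """
    return sum((g or 0) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """nDCG@k = DCG@k / IDCG@k, where IDCG sorts the gains descending.

    Returns None if nothing is graded, 0.0 if IDCG@k == 0.
    """
    top = gains[:k]
    if all(g is None for g in top):
        return None
    ideal = sorted((g or 0 for g in top), reverse=True)
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0
    return dcg_at_k(top, k) / idcg
```

A list already in descending gain order scores `nDCG@k = 1.0`, since its DCG equals its own IDCG.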
Detail-Scale Variants
For detail scale (1..10), formulas are identical:
`CG(d)@k`, `DCG(d)@k`, `nDCG(d)@k`
Only gain range differs.
Multi-Keyword Metrics
For evaluations with multiple keywords, SearchTweak uses mean over keyword metrics with non-null values:
- `MP@k` for precision
- `MAP@k` for average precision
- `MRR@k` for reciprocal rank
This avoids contaminating averages with keywords that still have no computable signal.
Transformers and Mixed Scale Metrics
If an evaluation uses one grading scale but selected metrics require another scale, transformer rules are applied before metric calculation.
Example:
- evaluation scale: `detail`
- metric: `P@10` (binary)
- detail grades are first mapped to binary via a transformer, then `P@10` is computed
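A sketch of such a transformer rule. The threshold of `6` is purely an assumption for illustration; SearchTweak's actual transformer rules are not specified here:

```python
def detail_to_binary(grade, threshold=6):
    """Hypothetical transformer: map a detail grade (1..10) to binary
    relevance before computing a binary metric such as P@10.

    The threshold is an illustrative assumption, not SearchTweak's rule.
    Null (None) grades stay null through the transformation.
    """
    if grade is None:
        return None
    return 1 if grade >= threshold else 0
```

The binary metric is then computed over the transformed values exactly as on the native binary scale.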
Metric Selection Guide
- `P@k`: quick relevance ratio
- `AP@k` / `MAP@k`: ranking quality sensitivity across the list
- `RR@k` / `MRR@k`: first-hit experience
- `nDCG@k`: graded ranking quality with position awareness
- detail variants: finer gain control when `0..3` is too coarse
Interpretation Checklist
- Keep the same keyword set and the same cutoff (`@k`) for comparisons.
- Check grading coverage before trusting small changes.
- Keep grading guidelines stable over time.
- When using strategy `3`, monitor the tie rate (binary), because ties reduce the computable signal.
Recommended Baseline Metric Set
For most teams:
- `MAP@10`
- `MRR@10`
- `nDCG@10`
Add domain-specific metrics only after baseline process is stable.