Search Evaluations
Search Evaluation is the execution unit that turns keywords and search results into measurable relevance metrics.
Evaluation Lifecycle
Status flow:

```text
Pending -> Active -> Finished
```

Key actions:
- Start: creates/runs keyword snapshot jobs.
- Stop/Pause: pauses active execution.
- Finish: closes the evaluation and finalizes its state.
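The lifecycle above can be sketched as a small state machine. This is an illustrative model only: the action and status names mirror the docs, and the assumption that Stop/Pause returns an evaluation to a resumable `Pending`-like state is mine, not confirmed by the source.

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    FINISHED = "finished"

# Allowed (action, current status) -> next status transitions.
# Assumption: "stop" pauses back into a resumable state.
TRANSITIONS = {
    ("start", Status.PENDING): Status.ACTIVE,
    ("stop", Status.ACTIVE): Status.PENDING,
    ("finish", Status.ACTIVE): Status.FINISHED,
}

def apply(action: str, status: Status) -> Status:
    """Return the next status, or fail on an illegal transition."""
    try:
        return TRANSITIONS[(action, status)]
    except KeyError:
        raise ValueError(f"cannot {action} from {status.name}")
```

Modeling transitions as an explicit table makes illegal moves (e.g. finishing a pending evaluation) fail loudly rather than silently.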
Required Configuration
- Model
- Name
- Scale type: `binary`, `graded`, or `detail`
- Metric set (one or more)
- Keywords
- Feedback strategy (`1` or `3`)
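A minimal sketch of what this configuration might look like as a typed object, with the validation rules the list implies. The class and field names are assumptions for illustration, not the product's real schema.

```python
from dataclasses import dataclass

# Hypothetical evaluation config; field names are illustrative.
@dataclass
class EvaluationConfig:
    name: str
    model: str
    scale_type: str          # "binary", "graded", or "detail"
    metrics: list[str]       # one or more metric names
    keywords: list[str]
    feedback_strategy: int = 1  # 1 (single slot) or 3 (up to three slots)

    def validate(self) -> None:
        if self.scale_type not in {"binary", "graded", "detail"}:
            raise ValueError("scale_type must be binary, graded, or detail")
        if not self.metrics:
            raise ValueError("metric set must contain at least one metric")
        if self.feedback_strategy not in {1, 3}:
            raise ValueError("feedback_strategy must be 1 or 3")
```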
Scale Types
- Binary: `0` (irrelevant), `1` (relevant)
- Graded: `0..3`
- Detail: `1..10`
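The three ranges above translate directly into a small validation helper, sketched here for illustration:

```python
# Valid grade ranges per scale type, taken from the list above.
SCALE_RANGES = {
    "binary": range(0, 2),   # 0 or 1
    "graded": range(0, 4),   # 0..3
    "detail": range(1, 11),  # 1..10
}

def is_valid_grade(scale_type: str, grade: int) -> bool:
    """Check whether a grade falls inside its scale's range."""
    return grade in SCALE_RANGES[scale_type]
```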
Metrics and Transformers
You can mix metrics that require different scales. In that case, transformers are required to map grades from the evaluation scale to each destination metric's scale.
Important:
- The transformer source scale must match the evaluation scale.
- Transformer rules must cover all required destination scales.
- After the evaluation starts, the scale and transformers are effectively locked by business rules.
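As a concrete sketch, a transformer from a graded (`0..3`) evaluation scale to a binary (`0`/`1`) metric scale is just a rule table, and an uncovered grade is a configuration error. The exact mapping below (grades 2-3 count as relevant) is an illustrative choice, not a documented default.

```python
# Illustrative rule table: graded evaluation scale -> binary metric scale.
GRADED_TO_BINARY = {0: 0, 1: 0, 2: 1, 3: 1}

def transform(rules: dict[int, int], grade: int) -> int:
    """Map a grade to the destination scale; fail if the rules have a gap."""
    if grade not in rules:
        raise ValueError(f"transformer rules do not cover grade {grade}")
    return rules[grade]
```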
Advanced Settings
Feedback Strategy
- Single (`1`): one grade slot per snapshot
- Multiple (`3`): up to three slots per snapshot
Practical trade-off:
- Single: fastest and cheapest collection cycle
- Multiple: higher agreement quality for ambiguous queries, but more effort
Show Position
If enabled, evaluators see original rank position while grading.
Reuse Strategy
- `0`: none
- `1`: reuse by `(query, doc)`
- `2`: reuse by `(query, doc, position)`
Notes:
- Reuse can include both human and AI judge grades.
- Tag constraints are respected.
- Reuse should not be combined with auto-restart.
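The three reuse strategies above amount to different lookup keys for previously collected grades. A minimal sketch (function and argument names are illustrative):

```python
# Hypothetical reuse-key builder for strategies 0, 1, and 2.
def reuse_key(strategy: int, query: str, doc: str, position: int):
    if strategy == 0:
        return None                    # no reuse at all
    if strategy == 1:
        return (query, doc)            # reuse regardless of rank position
    if strategy == 2:
        return (query, doc, position)  # position-sensitive reuse
    raise ValueError(f"unknown reuse strategy {strategy}")
```

Strategy `1` lets a grade collected at rank 3 be reused at rank 7 for the same query/document pair; strategy `2` only reuses an exact positional match.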
Human and AI Collaboration Model
- Human and AI judges use the same feedback-slot mechanism.
- A slot is owned by either `user_id` (human) or `judge_id` (AI).
- When a slot is overwritten by the opposite side, the previous owner field is cleared.
- Under strategy `3`, one AI judge can fill at most one slot per snapshot.
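The ownership rule can be sketched as follows. This is a minimal model of the behavior described above, assuming the slot carries exactly one grade plus the two mutually exclusive owner fields; it is not the real schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FeedbackSlot:
    grade: Optional[int] = None
    user_id: Optional[str] = None   # set when a human owns the slot
    judge_id: Optional[str] = None  # set when an AI judge owns the slot

    def set_human(self, user_id: str, grade: int) -> None:
        # Human overwrite clears any AI ownership.
        self.grade, self.user_id, self.judge_id = grade, user_id, None

    def set_ai(self, judge_id: str, grade: int) -> None:
        # AI overwrite clears any human ownership.
        self.grade, self.judge_id, self.user_id = grade, judge_id, None
```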
Auto-Restart
Automatically spawns a new evaluation with the same configuration after the current one completes.
Evaluation Outputs
- Snapshot-level feedback records
- Metric values per scorer
- Keyword-level metric breakdowns
- Exportable judgements
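To make "metric values per scorer" concrete, here is one common relevance metric computed from binary snapshot judgements: precision@k. It is shown purely as an example of the kind of metric produced; the source does not state which metrics are in the built-in set.

```python
# Illustrative metric: precision@k over binary grades ordered by rank.
def precision_at_k(grades: list[int], k: int) -> float:
    """Fraction of the top-k results graded relevant (1)."""
    top = grades[:k]
    return sum(top) / len(top) if top else 0.0
```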
Operational Best Practices
- Start with focused keyword sets (20-100) per domain slice.
- Keep one stable baseline evaluation for comparison.
- Use Single early; use Multiple for high-stakes decisions.
- Review unfinished snapshots before finalizing conclusions.