Overview
SearchTweak is a platform for measuring and improving search relevance with human feedback and AI judges.
Who This Documentation Is For
- Product managers who need reliable relevance KPIs.
- Search engineers who tune ranking and query logic.
- QA/relevance teams who provide graded feedback.
- Team admins who manage roles, tags, and AI judges.
Core Concept
SearchTweak helps you run repeatable relevance experiments for traditional search, recommendation systems, and the "Retrieval" phase of RAG (Retrieval-Augmented Generation) applications:
- Define how to call your search API (Search Endpoint).
- Define request templates (Search Model).
- Create an evaluation with keywords, scale, and metrics (Search Evaluation).
- Collect grades from humans and/or AI judges.
- Compare metric trends, baselines, and productivity.
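The first three steps above can be sketched in code. This is a minimal illustration, not SearchTweak's actual API: the endpoint URL, the `q`/`size` parameter names, and the `{keyword}` placeholder convention are all hypothetical stand-ins for whatever your search API expects.

```python
from urllib.parse import urlencode

# Hypothetical Search Endpoint and Search Model (request template).
# Replace with your own API's URL and parameters.
SEARCH_ENDPOINT = "https://search.example.com/api/v1/search"
SEARCH_MODEL = {"q": "{keyword}", "size": "10"}

def build_request(keyword: str) -> str:
    """Fill the request template with an evaluation keyword and
    return the full query URL for one search call."""
    params = {k: v.format(keyword=keyword) for k, v in SEARCH_MODEL.items()}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"
```

An evaluation then runs `build_request` once per keyword, and the returned results become the snapshots that humans or AI judges grade.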
System Workflow
╭─────────────────╮ ╭───────────────╮ ╭───────────────────╮
│ Search Endpoint │──▶│ Search Model │──▶│ Search Evaluation │
╰─────────────────╯ ╰───────────────╯ ╰───────────────────╯
│
▼
╭─────────────────────────────────╮
│ Keywords + Results (Snapshots) │
╰─────────────────────────────────╯
│
▼
╭─────────────────────────────────╮
│ Human/AI Grades (Feedback) │
╰─────────────────────────────────╯
│
▼
╭─────────────────────────────────╮
│ Metrics + Exports │
╰─────────────────────────────────╯
Main Sections
- Search Endpoints: connect SearchTweak to your API.
- Mapper Code: extract id, name, and optional fields from API responses.
- Search Models: define query params/body templates.
- Search Evaluations: configure strategy, scale, and metric set.
- Judges (AI): configure LLM judges, prompts, monitoring, and JSONL log export.
- Leaderboard: compare throughput of users and AI judges.
- Evaluation Metrics: formulas, interpretation, and caveats.
- Team Management: roles, membership, and permissions.
- Tags: routing and access segmentation.
- API Reference: automate evaluations via HTTP API.
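To make the Mapper Code section concrete, here is a hedged sketch of what a mapper might do: pull id, name, and an optional field out of an API response. The response shape (a top-level `hits` array with `id`, `title`, and `image_url` keys) is invented for illustration; a real mapper must follow your own API's schema.

```python
def map_response(raw: dict) -> list[dict]:
    """Map a raw search response (hypothetical shape) to the
    id/name/optional-field records an evaluation consumes."""
    results = []
    for hit in raw.get("hits", []):
        results.append({
            "id": hit["id"],            # required
            "name": hit["title"],       # required, mapped from our API's "title"
            "image": hit.get("image_url"),  # optional field, may be absent
        })
    return results
```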
Quick Start (Non-Technical)
- Create one endpoint using your production/staging search API URL.
- Create one model and test with 2-3 sample queries.
- Create one evaluation with 20-50 representative keywords.
- Start with Single strategy for speed.
- Add AI judges after baseline human validation.
- Review metric trends before changing ranking logic.
Terminology
- Snapshot: one query-document pair at a fixed rank position.
- Feedback slot: one grade cell for a snapshot.
- Strategy Single (1): one slot per snapshot.
- Strategy Multiple (3): up to three slots per snapshot.
- Reuse: copy compatible grades from earlier evaluations.
- Baseline: reference evaluation for metric comparison.
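The relationship between snapshots, feedback slots, and strategies can be sketched as follows. This is an illustrative model of the terminology above, not SearchTweak's internal data structures: a snapshot is a (keyword, document, rank) triple, and the chosen strategy decides how many empty grade cells it gets.

```python
# Slots per snapshot, per the strategy definitions above.
SLOTS_PER_SNAPSHOT = {"single": 1, "multiple": 3}

def make_slots(snapshots: list[tuple[str, str, int]], strategy: str) -> dict:
    """Create empty feedback slots: one grade cell per snapshot
    and slot index. Single yields 1 cell, Multiple up to 3."""
    n = SLOTS_PER_SNAPSHOT[strategy]
    return {(keyword, doc_id, rank, slot): None   # None = not yet graded
            for (keyword, doc_id, rank) in snapshots
            for slot in range(n)}
```

Reuse, in these terms, would mean pre-filling some of these cells with compatible grades from an earlier evaluation instead of leaving them as None.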