Overview

SearchTweak is a platform for measuring and improving search relevance with human feedback and AI judges.

Who This Documentation Is For

  • Product managers who need reliable relevance KPIs.
  • Search engineers who tune ranking and query logic.
  • QA/relevance teams who provide graded feedback.
  • Team admins who manage roles, tags, and AI judges.

Core Concept

SearchTweak helps you run repeatable relevance experiments for traditional search, recommendation systems, and the "Retrieval" phase of RAG (Retrieval-Augmented Generation) applications:

  1. Define how to call your search API (Search Endpoint).
  2. Define request templates (Search Model).
  3. Create an evaluation with keywords, scale, and metrics (Search Evaluation).
  4. Collect grades from humans and/or AI judges.
  5. Compare metric trends, baselines, and productivity.
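The five steps above can be sketched as a minimal loop. This is a hypothetical illustration only: none of the names below (`search_endpoint`, `request_template`, the snapshot dict shape) come from SearchTweak's actual API.

```python
# A hypothetical sketch of the five-step workflow; every name and
# structure here is a stand-in, not SearchTweak's real API.

def search_endpoint(query: str) -> list[str]:
    """Step 1 stand-in: a real deployment would call your search API."""
    fake_index = {
        "red shoes": ["doc-12", "doc-7", "doc-3"],
        "blue hat": ["doc-5", "doc-9"],
    }
    return fake_index.get(query, [])

request_template = "/search?q={q}"      # step 2: a request template (Search Model)

keywords = ["red shoes", "blue hat"]    # step 3: the evaluation's keyword set

# Running each keyword produces snapshots: (query, document, rank) triples.
snapshots = [
    {"keyword": kw, "doc": doc, "rank": rank,
     "request": request_template.format(q=kw)}
    for kw in keywords
    for rank, doc in enumerate(search_endpoint(kw), start=1)
]

# Step 4: collect one grade per snapshot (a fixed fake grade here).
grades = {(s["keyword"], s["doc"]): 2 for s in snapshots}

# Step 5 would compare metrics computed from these grades across runs.
print(len(snapshots))  # → 5 (three "red shoes" docs + two "blue hat" docs)
```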

System Workflow

╭─────────────────╮   ╭───────────────╮   ╭───────────────────╮
│ Search Endpoint │──▶│  Search Model │──▶│ Search Evaluation │
╰─────────────────╯   ╰───────────────╯   ╰───────────────────╯
                              │
                              ▼
             ╭─────────────────────────────────╮
             │  Keywords + Results (Snapshots) │
             ╰─────────────────────────────────╯
                              │
                              ▼
             ╭─────────────────────────────────╮
             │   Human/AI Grades (Feedback)    │
             ╰─────────────────────────────────╯
                              │
                              ▼
             ╭─────────────────────────────────╮
             │        Metrics + Exports        │
             ╰─────────────────────────────────╯

Main Sections

  1. Search Endpoints
  2. Mapper Code
  3. Search Models
  4. Search Evaluations
  5. Evaluation Metrics
  6. Judges (AI)

Quick Start (Non-Technical)

  1. Create one endpoint using your production/staging search API URL.
  2. Create one model and test it with 2-3 sample queries.
  3. Create one evaluation with 20-50 representative keywords.
  4. Start with the Single strategy for speed.
  5. Add AI judges after validating against human-graded baselines.
  6. Review metric trends before changing ranking logic.

Terminology

  • Snapshot: one query-document pair at a fixed rank position.
  • Feedback slot: one grade cell for a snapshot.
  • Strategy Single (1): one slot per snapshot.
  • Strategy Multiple (3): up to three slots per snapshot.
  • Reuse: copy compatible grades from earlier evaluations.
  • Baseline: reference evaluation for metric comparison.
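The terms above can be related in a small sketch using hypothetical Python dataclasses; the names and fields are illustrative assumptions, not SearchTweak's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    """One query-document pair at a fixed rank position."""
    keyword: str
    doc_id: str
    rank: int
    slots: list = field(default_factory=list)  # feedback slots: one grade cell each

def allocate_slots(snapshot: Snapshot, strategy: str) -> None:
    """Single → one slot per snapshot; Multiple → up to three slots."""
    count = 1 if strategy == "single" else 3
    snapshot.slots = [None] * count  # None marks an ungraded slot

snap = Snapshot(keyword="red shoes", doc_id="doc-12", rank=1)
allocate_slots(snap, "multiple")
snap.slots[0] = 3  # first grader fills one slot
print(len(snap.slots))  # → 3 slots under the Multiple strategy
```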