✨ New: Scale evaluations with LLM-as-a-Judge

Ship Better Search
with Confidence.

The all-in-one evaluation platform for search and ML teams. Mix human raters and AI Judges to track NDCG, compare ranking models, and build datasets for Learning-to-Rank. Use our cloud platform or self-host the open-source edition.

Step 1: Connect & Configure

Stop hardcoding test scripts.

SearchTweak securely connects to your API infrastructure. Fully proxy requests, parse JSON responses with custom code, and test ranking changes in seconds.

No more fragile config files. Define your endpoints and mapper code right in the UI to rapidly experiment with new search models.


Collaborate like a pro

Bring your ML engineers, Product Managers, and human raters into a single workspace. Organize tasks with tags and roles to keep everyone aligned.

Stop sharing evaluation spreadsheets. Manage teams, distribute tasks, and analyze evaluation results in one centralized platform.


Built for any search engine

Bring your own stack. We handle the integration complexity so you can focus entirely on search relevance tuning.

  • Avoid CORS issues with fully proxyable requests
  • Extract nested payload attributes via mapper code
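A mapper can be as simple as a small function that flattens the engine's nested payload into the fields an evaluation needs. The response shape below is purely illustrative — real payloads vary by search engine, and SearchTweak's actual mapper API may differ:

```python
import json

# Hypothetical search response; real payload shapes vary by engine.
raw = json.dumps({
    "hits": {
        "total": 2,
        "items": [
            {"doc": {"id": "a1", "title": "Red shoes"}, "score": 12.3},
            {"doc": {"id": "b2", "title": "Blue shoes"}, "score": 9.7},
        ],
    }
})

def map_response(body: str) -> list[dict]:
    """Mapper code: pull nested payload attributes into flat id/title rows."""
    payload = json.loads(body)
    return [
        {"id": item["doc"]["id"], "title": item["doc"]["title"]}
        for item in payload["hits"]["items"]
    ]

results = map_response(raw)
```

Because the mapper lives with the endpoint definition rather than in a test script, swapping engines means changing one function, not a pipeline.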

Step 2: Evaluate at Scale

Combine human precision with AI scale

Scale labeling with LLMs. Run large batches and merge AI judgments with human feedback in a single workflow. Evaluate new ranking models instantly to slash time-to-production.

  • Multi-provider support (OpenAI-compatible, Anthropic, Gemini, DeepSeek, Groq, xAI, Ollama)
  • Prompt templates per scale (Binary, Graded, Detail) and advanced model params
  • Full auditability: run status, logs, token usage, and JSONL export
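Under the hood, an LLM judge boils down to a graded prompt plus strict parsing of the model's reply. This sketch uses an invented prompt template and a canned reply in place of a real provider call; the actual templates and scales are configured in the product:

```python
import json

def build_prompt(query: str, doc_title: str) -> str:
    # Graded 0-3 scale; the exact wording is an illustrative assumption.
    return (
        "Rate how well the document answers the query on a 0-3 scale.\n"
        f"Query: {query}\nDocument: {doc_title}\n"
        'Reply as JSON: {"score": <0-3>, "reason": "<one sentence>"}'
    )

def parse_judgment(reply: str) -> tuple[int, str]:
    """Validate the model reply so bad outputs fail loudly, not silently."""
    data = json.loads(reply)
    score = int(data["score"])
    assert 0 <= score <= 3, "score outside the graded scale"
    return score, data["reason"]

# A canned reply stands in for a real LLM call in this offline sketch.
reply = '{"score": 2, "reason": "Relevant brand but wrong color."}'
score, reason = parse_judgment(reply)
```

Validating the score range at parse time is what makes large batch runs auditable: malformed judgments surface in the logs instead of polluting the dataset.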

Use AI for throughput and humans for quality control. This hybrid loop lets you validate changes faster, before they ever reach production.
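The hybrid loop amounts to a simple precedence rule: AI judgments fill the long tail, and human labels override them wherever raters have weighed in. A minimal sketch, with an assumed `(query, doc_id)` key:

```python
def merge_judgments(ai: dict, human: dict) -> dict:
    """Merge AI and human labels; human judgments take precedence."""
    merged = dict(ai)
    merged.update(human)
    return merged

ai = {("red shoes", "a1"): 3, ("red shoes", "b2"): 1}
human = {("red shoes", "b2"): 2}    # a rater disagrees with the AI judge
merged = merge_judgments(ai, human)  # b2 keeps the human label
```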

AI Judges list with providers, models and statuses AI Judge logs with latency and token usage

Reasoned AI scoring

AI scoring shouldn't be a black box. Each judge returns a concise reason alongside its label, making evaluations easy to review and discuss.

Audit score changes instantly, resolve disagreements with human raters, and refine your grading prompts over time.

  • Score + reason per query-document pair
  • Faster error analysis and calibration

Step 3: Measure & Analyze

Track metrics that actually matter

Stop flying blind. Compare your production model against a new experiment before running costly A/B tests. Track NDCG, MAP, and Precision across multiple test endpoints in real time.

  • P@10 (Precision)
  • MP@10 (Mean Precision)
  • AP@10 (Average Precision)
  • MAP@10 (Mean Average Precision)
  • RR@10 (Reciprocal Rank)
  • MRR@10 (Mean Reciprocal Rank)
  • CG@10 (Cumulative Gain)
  • DCG@10 (Discounted Cumulative Gain)
  • NDCG@10 (Normalized Discounted Cumulative Gain)
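These metrics are standard and easy to verify by hand. A compact reference implementation over one ranked list of graded judgments (higher grade = more relevant):

```python
import math

def precision_at_k(rels: list[int], k: int) -> float:
    """Fraction of the top-k results judged relevant (grade > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def dcg_at_k(rels: list[int], k: int) -> float:
    """Discounted Cumulative Gain: gains decay by log2 of the rank."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels: list[int], k: int) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0

def rr_at_k(rels: list[int], k: int) -> float:
    """Reciprocal rank of the first relevant result in the top k."""
    for i, r in enumerate(rels[:k]):
        if r > 0:
            return 1 / (i + 1)
    return 0.0

rels = [3, 2, 0, 1]  # graded judgments for one ranked result list
```

The "M" variants (MAP, MRR, Mean Precision) are simply these per-query values averaged across your whole keyword set.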

Track and improve search performance with a wide range of metrics.

Build LTR datasets in days, not months. Export clean judgment lists to train your own Learning-to-Rank models or run deep analytics.
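An exported judgment list maps naturally onto the LibSVM-style `label qid:n feature:value` format that Learning-to-Rank toolkits (e.g. LightGBM, XGBoost rankers) consume. The feature values here are placeholders for whatever signals your pipeline extracts:

```python
def to_ltr_line(label: int, qid: int, features: list[float]) -> str:
    """Serialize one judged query-document pair in LibSVM qid format."""
    feats = " ".join(f"{i + 1}:{v}" for i, v in enumerate(features))
    return f"{label} qid:{qid} {feats}"

line = to_ltr_line(3, 42, [12.3, 0.87])
# → "3 qid:42 1:12.3 2:0.87"
```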


Everything your search team needs to scale

Keep your raters engaged

Eliminate spreadsheet fatigue. Manage human raters with intuitive keyboard shortcuts, progress bars, and localized leaderboards.


Search evaluator leaderboards

Motivate and recognize top evaluators with leaderboards that showcase the best performing team members based on their evaluation contributions.

Intuitive search evaluator interface

Enjoy an easy-to-use, attractive interface designed for efficient and effective search evaluation.

Customizable dashboard widgets

Personalize your dashboard with widgets that display the most relevant data and insights for your needs.

Progress and evaluation graphs

Visualize your search evaluation progress. Monitor improvements and identify areas for further optimization.

Embed it in your data pipeline

A REST API lets you embed SearchTweak in your data pipeline and integrate it with your existing tools and services.
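Integration is ordinary authenticated HTTP. The endpoint path, payload fields, and token below are hypothetical stand-ins — consult the API documentation for the real routes — but the shape of the call is representative:

```python
import json
import urllib.request

# Hypothetical endpoint and payload; real routes live in the API docs.
req = urllib.request.Request(
    "https://searchtweak.example/api/v1/evaluations",
    data=json.dumps({"model_id": 7, "keywords": ["red shoes"]}).encode(),
    headers={
        "Authorization": "Bearer <API_TOKEN>",
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would submit it; skipped to keep this offline.
```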

10x
Faster ranking iterations
Zero
Scripts required
100%
Data ownership (Open-Source)

"SearchTweak cut our ranking experimentation cycle from weeks to days. Having AI judges validate changes before we send them to human raters saved us countless hours and significantly improved our NDCG."

Emily Davis
Senior Search Analyst at DataSolutions

Designed for search teams like yours

At SearchTweak, we specialize in leveraging technology, innovation, and expertise to enhance search relevance and drive user satisfaction.

Free

Best option to get started with SearchTweak™

0 /month
  • Fully functional
  • No setup or hidden fees
  • Endpoints: unlimited
  • Models: unlimited
  • Evaluations: unlimited
  • Team size: 10 members
  • Keywords: unlimited
Get Started

Enterprise

Best for large teams and organizations with advanced needs

Enquire
  • Fully functional
  • No setup or hidden fees
  • Endpoints: unlimited
  • Models: unlimited
  • Evaluations: unlimited
  • Team size: unlimited
  • Keywords: unlimited
  • Programmatic access (API)
  • Priority support
  • Dedicated account manager
Enquire

Frequently asked questions

SearchTweak is a relevance evaluation platform for search and recommendation systems. It combines human judgments, LLM-as-a-Judge (AI Judges), experiment tracking, and quality metrics in one workflow, so teams can ship ranking improvements faster and with more confidence.

Create your account today

Try SearchTweak™ for free. No credit card required.

Get Started