Ship Better Search
with Confidence.
The all-in-one evaluation platform for search and ML teams. Mix human raters and AI Judges to track NDCG, compare ranking models, and build datasets for Learning-to-Rank. Use our cloud platform or self-host the open-source edition.
Works seamlessly with your stack
Step 1: Connect & Configure
Stop hardcoding test scripts.
SearchTweak securely connects to your API infrastructure. Fully proxy requests, parse JSON responses with custom code, and test ranking changes in seconds.
No more fragile config files. Define your endpoints and mapper code right in the UI to rapidly experiment with new search models.
Collaborate like a pro
Bring your ML engineers, product managers, and human raters into a single workspace. Organize tasks with tags and roles to keep everyone aligned.
Stop sharing evaluation spreadsheets. Manage teams, distribute tasks, and analyze evaluation results in one centralized platform.
Built for any search engine
Bring your own stack. We handle the integration complexity so you can focus entirely on search relevance tuning.
- Avoid CORS issues with fully proxyable requests
- Extract nested payload attributes via mapper code
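For illustration, a mapper for an Elasticsearch-style payload might look like the sketch below. The payload shape (`hits.hits`, `_source`) and the output fields are assumptions for this example, not a required SearchTweak schema:

```python
# Hypothetical mapper: extract a ranked result list from a raw search response.
# The Elasticsearch-style "hits" shape is an assumption for this sketch.
def map_response(payload: dict) -> list[dict]:
    results = []
    for hit in payload.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        results.append({
            "id": hit.get("_id"),
            "title": source.get("title"),
            "snippet": source.get("description"),
        })
    return results
```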
Step 2: Evaluate at Scale
Combine human precision with AI scale
Scale labeling with LLMs. Run large batches and merge AI judgments with human feedback in a single workflow. Evaluate new ranking models instantly to slash time-to-production.
- Multi-provider support (OpenAI-compatible, Anthropic, Gemini, DeepSeek, Groq, xAI, Ollama)
- Prompt templates per scale (Binary, Graded, Detail) and advanced model params
- Full auditability: run status, logs, token usage, and JSONL export
Use AI for throughput and humans for quality control. This hybrid loop lets you validate changes faster with far less risk.
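As a sketch of the LLM-as-a-Judge pattern, here is a minimal graded judge against an OpenAI-compatible endpoint. The model name, prompt wording, and JSON contract are illustrative assumptions, not SearchTweak internals:

```python
import json
from openai import OpenAI  # any OpenAI-compatible endpoint works

client = OpenAI()  # point base_url/api_key at your provider of choice

JUDGE_PROMPT = """Rate how relevant the document is to the query on a 0-3 scale.
Respond as JSON: {{"score": <0-3>, "reason": "<one sentence>"}}

Query: {query}
Document: {document}"""

def judge(query: str, document: str) -> dict:
    # Hypothetical graded judge; prompt and model are placeholders.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(query=query, document=document)}],
    )
    return json.loads(response.choices[0].message.content)
```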
Reasoned AI scoring
AI scoring shouldn't be a black box. Each judge returns a concise reason alongside its label, making evaluations easy to review and discuss.
Audit score changes instantly, resolve disagreements with human raters, and refine your grading prompts over time.
- Score + reason per query-document pair
- Faster error analysis and calibration
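A reasoned judgment in the JSONL export might look like the following record (field names are illustrative; check the actual export schema):

```json
{"query": "wireless headphones", "doc_id": "sku-8841", "score": 2, "reason": "Relevant product, but over-ear rather than the in-ear style the query implies."}
```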
Step 3: Measure & Analyze
Track metrics that actually matter
Stop flying blind. Compare your production model against a new experiment before running costly A/B tests. Track NDCG, MAP, and Precision across multiple test endpoints in real time.
- P@10 (Precision)
- MP@10 (Mean Precision)
- AP@10 (Average Precision)
- MAP@10 (Mean Average Precision)
- RR@10 (Reciprocal Rank)
- MRR@10 (Mean Reciprocal Rank)
- CG@10 (Cumulative Gain)
- DCG@10 (Discounted Cumulative Gain)
- NDCG@10 (Normalized Discounted Cumulative Gain)
Track and improve search performance with a wide range of metrics.
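For reference, the gain-based metrics above reduce to a few lines of standard textbook code. This is the conventional formulation, not SearchTweak's internal implementation:

```python
import math

def dcg_at_k(gains: list[float], k: int = 10) -> float:
    # Discounted Cumulative Gain: each grade is discounted by log2(rank + 1).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains: list[float], k: int = 10) -> float:
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Example: graded judgments (0-3) in the order the engine returned them.
print(round(ndcg_at_k([3, 2, 0, 1], k=10), 3))  # ~0.985
```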
Build LTR datasets in days, not months. Export clean judgment lists to train your own Learning-to-Rank models or run deep analytics.
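As one possibility, exported judgments can be flattened into the LETOR/SVMrank text format that most LTR toolkits accept. The record fields and feature values below are placeholders; in practice you would compute real query-document features:

```python
# Sketch: turn exported (query, doc, grade) judgments into LETOR-style lines
# for LTR toolkits such as XGBoost or LightGBM rankers.
judgments = [
    {"qid": 1, "doc_id": "sku-8841", "grade": 2, "features": [0.71, 0.12]},
    {"qid": 1, "doc_id": "sku-1203", "grade": 0, "features": [0.33, 0.90]},
]

for j in judgments:
    feats = " ".join(f"{i + 1}:{v}" for i, v in enumerate(j["features"]))
    print(f'{j["grade"]} qid:{j["qid"]} {feats} # {j["doc_id"]}')
# -> 2 qid:1 1:0.71 2:0.12 # sku-8841
```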
Everything your search team needs to scale
Keep your raters engaged
Eliminate spreadsheet fatigue. Manage human raters with intuitive keyboard shortcuts, progress bars, and localized leaderboards.
Search evaluator leaderboards
Motivate and recognize top evaluators with leaderboards that showcase the best-performing team members based on their evaluation contributions.
Intuitive search evaluator interface
Enjoy an easy-to-use, attractive interface designed for efficient and effective search evaluation.
Customizable dashboard widgets
Personalize your dashboard with widgets that display the most relevant data and insights for your needs.
Progress and evaluation graphs
Visualize your search evaluation progress. Monitor improvements and identify areas for further optimization.
Embed it in your data pipeline
A REST API makes it easy to embed SearchTweak in your data pipeline and integrate it with your existing tools and services.
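For example, a CI/CD job could trigger an evaluation run over HTTP. The base URL, route, and payload here are hypothetical; consult the API docs for the actual endpoints:

```python
import os
import requests

# Hypothetical base URL and route; the real API may differ.
API = "https://app.searchtweak.example/api"
HEADERS = {"Authorization": f"Bearer {os.environ['SEARCHTWEAK_TOKEN']}"}

# e.g., kick off an evaluation run from a CI/CD pipeline
resp = requests.post(
    f"{API}/evaluations",
    headers=HEADERS,
    json={"model": "experiment-42", "keywords": ["running shoes"]},
)
resp.raise_for_status()
print(resp.json())
```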
- 10x faster ranking iterations
- Zero scripts required
- 100% data ownership (open-source edition)
"SearchTweak cut our ranking experimentation cycle from weeks to days. Having AI judges validate changes before we send them to human raters saved us countless hours and significantly improved our NDCG."
Designed for search teams like yours
At SearchTweak, we combine technology, innovation, and search expertise to improve relevance and drive user satisfaction.
Free
Best option to get started with SearchTweak™
- Fully functional
- No setup or hidden fees
- Endpoints: unlimited
- Models: unlimited
- Evaluations: unlimited
- Team size: 10 members
- Keywords: unlimited
Enterprise
Best for large teams and organizations with advanced needs
- Fully functional
- No setup or hidden fees
- Endpoints: unlimited
- Models: unlimited
- Evaluations: unlimited
- Team size: unlimited
- Keywords: unlimited
- Programmatic access (API)
- Priority support
- Dedicated account manager
Frequently asked questions
What is SearchTweak?
SearchTweak is a relevance evaluation platform for search and recommendation systems. It combines human judgments, LLM-as-a-Judge (AI Judges), experiment tracking, and quality metrics in one workflow, so teams can ship ranking improvements faster and with more confidence.
Can I use SearchTweak to evaluate RAG pipelines?
Yes! Finding the right context in RAG is essentially a search problem. SearchTweak is perfect for evaluating the "Retrieval" phase of your GenAI applications using metrics like NDCG and AI Judges.
Are my LLM API keys secure?
Yes. All API keys for AI Judges are encrypted at rest in the database and are never sent to the client side. Plus, with the open-source edition, you can self-host and keep your data 100% under your control.
Can I connect my own search engine?
Yes. You can integrate Elasticsearch-compatible APIs or any custom search service by configuring the URL, method, headers, request payload, and result-mapping logic in the UI.
SearchTweak also provides a REST API, allowing you to automate evaluations, push judgments, and trigger AI Judges directly from your CI/CD pipelines.
Can I export judgments for training or analytics?
Yes. You can export human judgments and AI Judge results (including JSONL format) for training pipelines (Learning to Rank), offline evaluation, internal analytics, or audit workflows.
How do AI Judges work?
You create an AI Judge with a scoring scale and prompt template, select an LLM provider, and run it on your evaluation queries. SearchTweak stores labels and reasoning so you can compare AI and human judgments.
When should I use AI Judges versus human raters?
Best practice: use AI Judges for large-scale, fast-iteration, cost-efficient first-pass labeling, and keep humans in the loop for calibration, edge cases, and final quality gates. Monitoring AI vs. human agreement rates helps you continuously refine your prompts.
Which LLM providers are supported?
SearchTweak supports OpenAI-compatible endpoints and major providers including Anthropic, Gemini, DeepSeek, Groq, xAI, and Ollama. You can configure model parameters, prompts, and output validation per judge.
Can I use local LLMs as judges?
Absolutely. SearchTweak supports Ollama and any custom OpenAI-compatible endpoint, which means your private search data doesn't have to leave your internal network.
How much do AI Judges cost to run?
You bring your own API keys (BYOK), which means you have complete transparency over costs and pay the LLM provider directly for the tokens you use. SearchTweak displays token usage per evaluation so you can easily forecast expenses.
Can I self-host SearchTweak?
Yes. You can choose the hosted cloud version or self-host the open-source edition.
The cloud version is the fastest way to start: account, team workspace, billing, and updates are managed for you. The open-source edition is ideal when you need full infrastructure control or internal-only deployments.
What license is the open-source edition under?
The open-source repository is available on GitHub under the Functional Source License 1.1 (FSL-1.1-Apache-2.0).
Why the Functional Source License?
We chose FSL-1.1-Apache-2.0 to balance openness with long-term product sustainability. It allows teams to use, modify, and self-host SearchTweak while protecting against direct commercial clones.
After two years, each release converts to Apache 2.0, giving the community a clear path to a permissive license while keeping the project commercially viable today.
Learn more about our licensing choice on the Functional Source License website.