How it works

LLM-as-a-judge

BotMetrica automatically analyzes an AI agent’s conversations with users, detects issues, and classifies them using predefined tags. Our LLM judge attaches each tag to a specific message and explains why the tag was applied.

The judge sees the whole conversation as well as the AI agent’s context (e.g. information retrieved from a database).
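As an illustration of the description above, the judge's input and output could be modeled as follows. This is a minimal sketch, not BotMetrica's actual schema: the names `JudgeInput`, `TagAnnotation`, `validate`, and the example tag `hallucination` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class JudgeInput:
    """What the judge sees: the whole conversation plus the agent's context (hypothetical schema)."""
    messages: list[str]
    agent_context: list[str] = field(default_factory=list)  # e.g. records retrieved from a database


@dataclass
class TagAnnotation:
    """One tag attached to one message, with the judge's explanation (hypothetical schema)."""
    message_index: int  # position of the tagged message in the conversation
    tag: str            # one of the predefined tags
    reasoning: str      # why the judge applied the tag
    quote: str          # the part of the message the tag refers to


def validate(annotation: TagAnnotation, judge_input: JudgeInput, allowed_tags: set[str]) -> bool:
    """Accept an annotation only if it targets a real message, uses a predefined tag,
    and quotes text that actually appears in that message."""
    if not 0 <= annotation.message_index < len(judge_input.messages):
        return False
    if annotation.tag not in allowed_tags:
        return False
    return annotation.quote in judge_input.messages[annotation.message_index]


# Example: the agent claims the order shipped although its context says otherwise.
example_input = JudgeInput(
    messages=["Where is my order?", "Your order shipped on Monday."],
    agent_context=["order status: processing"],
)
example_annotation = TagAnnotation(
    message_index=1,
    tag="hallucination",  # hypothetical tag name
    reasoning="The retrieved context says the order is still processing.",
    quote="shipped on Monday",
)
is_valid = validate(example_annotation, example_input, {"hallucination", "off_topic"})
```

Checking that the quote occurs verbatim in the tagged message is one simple way to keep each annotation traceable to the text it refers to.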

Labeling reliability

Key advantage: automated analysis identifies more issues with higher accuracy than manual review. At the same time, the distribution of error types closely matches human labeling, so the automated results stay consistent with how human reviewers classify issues.

For each automatically detected issue, you can see the judge’s reasoning and the specific parts of the message it quoted.
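One way to check that the distribution of error types matches human labeling is to compare the share of each tag in both label sets. The sketch below uses total variation distance, which is an assumption chosen for illustration and not necessarily the measure BotMetrica uses. The function names and sample tags are hypothetical.

```python
from collections import Counter


def tag_shares(tags: list[str]) -> dict[str, float]:
    """Convert a list of applied tags into the share of each tag (shares sum to 1)."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def total_variation_distance(human_tags: list[str], judge_tags: list[str]) -> float:
    """Return 0.0 for identical tag distributions and 1.0 for completely disjoint ones."""
    human = tag_shares(human_tags)
    judge = tag_shares(judge_tags)
    all_tags = set(human) | set(judge)
    return 0.5 * sum(abs(human.get(t, 0.0) - judge.get(t, 0.0)) for t in all_tags)


# Hypothetical labels: the judge finds more issues, but in the same proportions.
human_labels = ["hallucination", "off_topic"]
judge_labels = ["hallucination", "hallucination", "off_topic", "off_topic"]
distance = total_variation_distance(human_labels, judge_labels)
```

Because the comparison uses shares rather than raw counts, the judge can detect more issues than a human reviewer and still score as having the same distribution.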

Labeling speed comparison

Example from one of our clients: 467 conversations (one month of traffic)

  • Manual human review: ~11 hours

  • AI review: ~10 minutes (more than 65 times faster), with these additional benefits:

    • Finer-grained tagging

    • More detected events and issues

    • Structure comparable to manual labeling

    • Stable, repeatable results across different samples
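The speed figures above can be reproduced with a few lines of arithmetic. The inputs are the approximate values from the example, so the results are estimates. The variable names are illustrative.

```python
conversations = 467        # one month of traffic
manual_minutes = 11 * 60   # ~11 hours of manual human review
ai_minutes = 10            # ~10 minutes of AI review

# How many times faster the AI review is than manual review.
speedup = manual_minutes / ai_minutes

# Average review time per conversation, in seconds.
manual_seconds_per_conversation = manual_minutes * 60 / conversations
ai_seconds_per_conversation = ai_minutes * 60 / conversations

print(f"Speedup: {speedup:.0f}x")
print(f"Manual: {manual_seconds_per_conversation:.1f} s per conversation")
print(f"AI: {ai_seconds_per_conversation:.1f} s per conversation")
```

With these inputs the speedup is 66x, which works out to roughly 85 seconds per conversation for manual review and a little over 1 second for AI review.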
