BotMetrica automatically analyzes an AI agent’s conversations with users, detects issues, and classifies them using predefined tags.
Our LLM judge attaches tags to specific messages and explains why each tag was applied.
The judge sees the AI agent’s context (e.g., information retrieved from a database) as well as the whole conversation.
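As a rough illustration of what a single judgment could look like as data, here is a minimal Python sketch. The class name `TagAnnotation`, its fields, the helper `tag_counts`, and the tag name `ignored_context` are all hypothetical and chosen for this example; they are not BotMetrica's actual schema or API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TagAnnotation:
    """One tag attached by the judge to one message (hypothetical schema)."""
    message_index: int  # position of the tagged message in the conversation
    tag: str            # one of the predefined tags
    quote: str          # the part of the message the tag refers to
    reasoning: str      # the judge's explanation for applying the tag


def tag_counts(annotations):
    """Count how often each tag appears in a list of annotations."""
    return Counter(a.tag for a in annotations)


# Illustrative values only, not real judge output.
example = [
    TagAnnotation(
        message_index=3,
        tag="ignored_context",  # hypothetical tag name
        quote="We have no record of your order.",
        reasoning="The retrieved context contained the order details.",
    ),
]
counts = tag_counts(example)
```

Keeping the quote and the reasoning next to the tag means each detected issue can be reviewed without rereading the whole conversation.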
Labeling Reliability
Key advantage: automated analysis identifies more issues with higher accuracy than manual review.
At the same time, the distribution of error types closely matches human labeling, ensuring consistency and trustworthiness.
For each automatically detected issue, you can see the judge’s reasoning and the specific quoted parts of the message.
Labeling Speed Comparison
Example from one of our clients: 467 conversations — one month of traffic
Manual human review: ~11 hours
AI review: ~10 minutes, more than 65 times faster
Finer-grained tagging
More detected events and issues
Structure comparable to manual labeling
Stable, repeatable results across different samples
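The speed figures above can be checked with a few lines of arithmetic. This is a minimal sketch that treats the approximate values (~11 hours, ~10 minutes, 467 conversations) as exact; the function name `speedup` is hypothetical and used only for illustration.

```python
def speedup(manual_minutes: float, automated_minutes: float) -> float:
    """Return how many times faster the automated review is."""
    if automated_minutes <= 0:
        raise ValueError("automated_minutes must be positive")
    return manual_minutes / automated_minutes


conversations = 467
manual_minutes = 11 * 60  # ~11 hours of manual review
ai_minutes = 10           # ~10 minutes of AI review

ratio = speedup(manual_minutes, ai_minutes)

# Average time spent per conversation under each approach.
manual_per_conversation_min = manual_minutes / conversations
ai_per_conversation_sec = ai_minutes * 60 / conversations

print(f"Speedup: {ratio:.0f}x")
print(f"Manual: {manual_per_conversation_min:.2f} min per conversation")
print(f"AI: {ai_per_conversation_sec:.2f} s per conversation")
```

Under these assumptions the ratio is 66, which is consistent with "more than 65 times faster". Because the inputs are rounded, the result should be read as an estimate.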