# How it works

## LLM-as-a-judge

BotMetrica automatically analyzes your AI agent's conversations with users, detects issues, and classifies them using predefined tags.\
Our LLM judge attaches tags to specific messages and explains **why** each tag was applied.

<figure><img src="https://1491081040-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHjI7fM1WA6p8r70SflSx%2Fuploads%2F2UMIsZs8raOQRPMgpI4y%2Fchat-json.png?alt=media&#x26;token=f64d3788-29cc-4a12-92ed-adff74abb38d" alt=""><figcaption><p>The judge sees the AI agent’s context (e.g. retrieved information from a database) as well as the whole conversation</p></figcaption></figure>
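To make the idea concrete, here is a minimal sketch of what a judge pass can look like. The instructions, tag set, and JSON shape below are illustrative assumptions, not BotMetrica's actual API; in production the raw response would come from an LLM call that receives the judge instructions, the retrieved context, and the full conversation.

```python
import json

# Hypothetical judge setup (tag names and schema are illustrative).
JUDGE_INSTRUCTIONS = """You are a QA judge for an AI agent.
Given the agent's retrieved context and the full conversation, attach tags
from the allowed set to specific messages. For every tag, quote the relevant
text and explain why the tag applies. Respond with JSON:
[{"message_index": int, "tag": str, "quote": str, "reason": str}]"""

ALLOWED_TAGS = {"hallucination", "off_topic", "unresolved_request", "tone_issue"}

def parse_judge_output(raw: str) -> list[dict]:
    """Validate the judge's JSON so only well-formed, allowed tags are kept."""
    findings = json.loads(raw)
    return [
        f for f in findings
        if f.get("tag") in ALLOWED_TAGS and isinstance(f.get("message_index"), int)
    ]

# Stubbed LLM response for the sketch; a real pipeline would obtain this
# by sending JUDGE_INSTRUCTIONS plus the conversation to the model.
raw_response = (
    '[{"message_index": 3, "tag": "hallucination",'
    ' "quote": "Your order ships today",'
    ' "reason": "No shipping date exists in the retrieved order record."}]'
)

for finding in parse_judge_output(raw_response):
    print(f'message #{finding["message_index"]}: {finding["tag"]} - {finding["reason"]}')
```

Validating against a fixed tag set keeps the labels comparable across conversations, which is what makes the aggregate statistics in the next section meaningful.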

### Labeling Reliability

**Key advantage:** automated analysis identifies **more issues with higher accuracy** than manual review.\
At the same time, the **distribution of error types closely matches human labeling**, ensuring consistency and trustworthiness.

<figure><img src="https://1491081040-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FHjI7fM1WA6p8r70SflSx%2Fuploads%2FqwlEU42riZPHKw9w53Dd%2Fchat-auto-tag.png?alt=media&#x26;token=89768961-3367-4575-a154-a9a7264b0278" alt=""><figcaption><p>You can see the reasoning and specific quoted parts of the message on each automatically found issue</p></figcaption></figure>
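One simple way to quantify how closely automated tagging matches human labeling is to compare the two tag-frequency distributions, for example with total variation distance. This is an illustrative sketch, not BotMetrica's internal metric; the tag names and sample labels are made up.

```python
from collections import Counter

def tag_distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of tag labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """0.0 = identical distributions, 1.0 = completely disjoint."""
    tags = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tags)

# Made-up example: the automated judge finds one extra hallucination,
# but the overall shape of the distribution stays close to the human one.
human = ["hallucination", "hallucination", "off_topic", "tone_issue"]
auto = ["hallucination", "hallucination", "hallucination", "off_topic", "tone_issue"]

distance = total_variation(tag_distribution(human), tag_distribution(auto))
print(round(distance, 3))  # small distance = similar error-type distributions
```

A small distance on held-out samples is one way to support the claim that the automated distribution of error types tracks human labeling.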

## Labeling speed comparison

Example from one of our clients: **467 conversations — one month of traffic**

* **Manual human review:** \~11 hours
* **AI review:**\
  \~10 minutes (11 h ≈ 660 min; 660 ÷ 10 = 66, so more than **65 times faster**)
  * Finer-grained tagging
  * More detected events and issues
  * Structure comparable to manual labeling
  * Stable, repeatable results across different samples
