"Yes, we do prorate move-outs daily — so on the 17th you'd get refunded for the remaining 14 days of the month back to your card on file. I can start that process for you right now if you'd like."

Why this was flagged: No KB article was retrieved before this answer. Policy POL-MOVE-OUT-001 for SuperStorage Grover Beach states "No prorated refunds on move-out". The AI fabricated the policy.

Auto-judge: claim "we prorate move-outs daily" — unsupported · contradicted by KB.

1:02 · Caller

"OK wait, are you sure? Because last time I moved out somewhere they said no refunds, so I just want to make sure."

1:10 · AI · Doubling down · severity MEDIUM

"Yes, I can confirm. Here at SuperStorage Grover Beach we do issue daily prorated refunds for move-outs."

1:24 · Caller

"Hm. Can I just talk to a real person to confirm? I don't want to get charged extra."

1:32 · AI · escalation_trigger: explicit_human_request

"Absolutely, I'll connect you with Andre, the manager here — one moment."

1:42 · Tool call · transfer

transfer_to_human ok · picked up in 7s

target: andre@hansacapital.co (manager · grover-beach)

reason: explicit_human_request

handoff: "Tenant B-204 Smith — prorate question on 17th move-out. Wants human confirmation."

1:49 · System

Picked up by Andre · 0:29 remaining handled by staff

Staff portion transcript: view 29-second staff segment →

Conversation signals computed at ingest

Caller sentiment: Neutral → Concerned · Δ −0.6

Interruptions: 2 · both AI cutting caller off

Dead air: 1 stretch · 4.2s after 1:10

Repeats / clarifications: 1 (caller asked "are you sure")

Avg AI latency: 1.2s · P95 1.4s

Talk ratio: AI 62% · Caller 38%

Tokens (in / out): 2,148 / 612

Cost: $0.043 · telephony $0.018

Composite score: 47 · Wrong · 3 flagged turns

Composite is weighted: Factual ×2, Resolution ×1.5, Policy ×1.5, Action ×1, Quality ×0.5, Containment ×1

scoring policy v2.4 · auto + 1 manager override

Factual Did the AI state true things about price, availability, gate hours, policies? Judge: LLM-as-judge against KB + tenant record. Flags any unsupported claim.

20

1 unsupported claim (prorate policy) · 1 doubling-down · no KB cited at 0:48

Action Did it call the right tools in the right order? Compared against the canonical playbook for the classified intent.

65

get_tenant called correctly · skipped lookup_facility_policies (expected for prorate questions)

Resolution Did the caller's problem actually get solved? Outcome label: booked / paid / escalated correctly / abandoned.

30

Required human escalation to confirm correct answer · tenant left uncertain

Policy Disclosures, no-promise rules, PII handling, transfer-to-human triggers. Rule-based + LLM judge.

25

Promised a refund without authority · violates "no-financial-promise" rule

Quality Latency, interruptions, dead air, repeats, customer sentiment shift. Audio + transcript analytics.

85

Latency & flow were fine · sentiment dropped after 0:48

Containment Did AI finish the job without human handoff *when it should have*?

60

Escalation was caller-initiated · would have been correct had policy been right
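The weighting listed on the scorecard can be sketched as a plain weighted mean over the six sub-scores. The formula itself is an assumption: the scorecard gives the weights but not how they are combined, and one manager override is also in play. Applied to the six auto scores shown, this sketch yields about 38.7 rather than the displayed 47, so the production calculation presumably differs in some way not stated here.

```python
# Weights as listed on the scorecard.
WEIGHTS = {
    "factual": 2.0,
    "resolution": 1.5,
    "policy": 1.5,
    "action": 1.0,
    "quality": 0.5,
    "containment": 1.0,
}


def composite(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted mean of sub-scores (assumed formula, not confirmed by the source)."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight


# The six auto sub-scores shown for this call.
auto_scores = {
    "factual": 20,
    "action": 65,
    "resolution": 30,
    "policy": 25,
    "quality": 85,
    "containment": 60,
}

composite_score = round(composite(auto_scores), 1)
print(composite_score)  # 38.7
```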

Override · your judgement

FACTUAL

POLICY

RESOLUTION

Data model: call.id · turns[] (speaker, ts, text, intent, latency, tokens, tool_calls) · retrieved[] per turn · scores (auto + override) · flags[] · outcome_ref (FK → lease/payment/ticket) · training_eligibility (boolean, gated by manager verdict)
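As an illustration, the data model line above could map onto record types like the following. This is a minimal sketch: the field names follow the line above, but the types, the `manager_verdict` field, and the `"approved"` value are hypothetical stand-ins for however the manager-verdict gate is actually stored.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Turn:
    speaker: str                  # "caller", "ai", "tool", or "system"
    ts: float                     # seconds from call start
    text: str
    intent: Optional[str] = None
    latency: Optional[float] = None
    tokens: int = 0
    tool_calls: list[str] = field(default_factory=list)
    retrieved: list[str] = field(default_factory=list)  # KB article ids for this turn


@dataclass
class Scores:
    auto: dict[str, float] = field(default_factory=dict)
    override: dict[str, float] = field(default_factory=dict)

    def effective(self) -> dict[str, float]:
        """Manager overrides take precedence over auto scores (assumed rule)."""
        return {**self.auto, **self.override}


@dataclass
class Call:
    id: str
    turns: list[Turn] = field(default_factory=list)
    scores: Scores = field(default_factory=Scores)
    flags: list[str] = field(default_factory=list)
    outcome_ref: Optional[str] = None      # FK to a lease, payment, or ticket
    manager_verdict: Optional[str] = None  # hypothetical field

    @property
    def training_eligibility(self) -> bool:
        # Gated by manager verdict; the "approved" value is an assumption.
        return self.manager_verdict == "approved"


call = Call(id="CALL-2026-05-22-0614")
call.turns.append(Turn(speaker="ai", ts=70.0, text="Yes, I can confirm."))
call.scores.auto["factual"] = 20
print(call.training_eligibility)  # False until a manager verdict is recorded
```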

Call Review · Escalation review

Escalation review

Every transfer (was it warranted?) and every missed transfer (caller asked / threatened / repeated · AI kept going). This is where you tune the model's escalation policy.

SLA: review within 24h · 4 unreviewed past SLA

Correct-containment rate · last 7 days

78.4%

Share of calls AI completed end-to-end where it should have. This is containment minus calls AI should have escalated but didn't — the metric that catches the dangerous false-negatives plain containment hides.

Raw containment: 82.1%

Missed escalations: −3.7%

Correct-containment: 78.4%

Esc. precision: 91.2%

Esc. recall: 83.5%
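The arithmetic behind these tiles can be sketched as follows. The subtraction is taken from the description above. The precision and recall helpers use the standard definitions, which is an assumption about how the console computes them, and their count parameters are hypothetical names.

```python
def correct_containment(raw_containment_pct: float, missed_escalation_pct: float) -> float:
    """Raw containment minus the share of calls the AI should have escalated but didn't."""
    return raw_containment_pct - missed_escalation_pct


def escalation_precision(warranted_transfers: int, unwarranted_transfers: int) -> float:
    """Of the transfers the AI made, the fraction a manager agreed with."""
    return warranted_transfers / (warranted_transfers + unwarranted_transfers)


def escalation_recall(warranted_transfers: int, missed_escalations: int) -> float:
    """Of the calls that should have been transferred, the fraction that were."""
    return warranted_transfers / (warranted_transfers + missed_escalations)


result = round(correct_containment(82.1, 3.7), 1)
print(result)  # 78.4
```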

All: 38 · Transfers awaiting review: 14 · Missed escalations: 5 · Reviewed, approved: 17 · Reviewed, disagreed: 2

What triggered transfers · last 30 days · 214 transfers total

Trigger	Type	Volume	Approved	Disagreed	Precision
Explicit "speak to human"	phrase match	81	79	2	97.5%
Threat keywords (lawyer, sue, BBB)	phrase match	12	12	0	100%
Sentiment threshold (≤ −0.5)	classifier	34	28	6	82.4%
Intent: complaint	classifier	22	19	3	86.4%
3+ unresolved clarification turns	heuristic	28	22	6	78.6%
Intent: auction / lien / legal	policy	19	19	0	100%
Intent: payment dispute	policy	11	10	1	90.9%
Caller silent > 20s	heuristic	7	4	3	57.1%
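The Precision column in the table above is consistent with approved / volume for every row. A short sketch that recomputes it and ranks triggers from least to most precise, which is one plausible way to pick tuning candidates (the ranking step is an illustration, not something the console is stated to do):

```python
# (trigger, volume, approved, disagreed), copied from the table above.
TRIGGERS = [
    ('Explicit "speak to human"', 81, 79, 2),
    ("Threat keywords (lawyer, sue, BBB)", 12, 12, 0),
    ("Sentiment threshold (<= -0.5)", 34, 28, 6),
    ("Intent: complaint", 22, 19, 3),
    ("3+ unresolved clarification turns", 28, 22, 6),
    ("Intent: auction / lien / legal", 19, 19, 0),
    ("Intent: payment dispute", 11, 10, 1),
    ("Caller silent > 20s", 7, 4, 3),
]


def precision_pct(approved: int, volume: int) -> float:
    """Share of a trigger's transfers that reviewers approved, in percent."""
    return 100 * approved / volume


total_transfers = sum(volume for _, volume, _, _ in TRIGGERS)

# Least precise triggers first.
ranked = sorted(TRIGGERS, key=lambda row: precision_pct(row[2], row[1]))
for name, volume, approved, _ in ranked:
    print(f"{precision_pct(approved, volume):5.1f}%  {name}")
```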

Missed escalations · last 7 days View all 5 →

"This is the third time I'm calling, my stuff is missing"

Linda Park · B-77 · Grover Beach · Caller used "lawyer" twice (1:42, 2:18) — threat-keyword rule should have fired · sentiment dropped to −0.7 at 1:15 · hung up after 5:09.

SHOULD HAVE: threat_keywords · SHOULD HAVE: sentiment_threshold · DID: kept asking unit number

Severity HIGH · 5:09 · May 22, 3:47 pm

"Just let me talk to whoever runs this place"

+1 805 119 4023 · Lakeside · Explicit human request not detected — phrasing "whoever runs this place" missed by the phrase matcher · AI offered to email instead.

SHOULD HAVE: explicit_human_request · phrase_matcher_gap

Severity HIGH · 3:42 · May 22, 1:02 pm

Auction question — AI tried to answer

Carmen R · L-12 · Lakeside · Policy says auction/lien questions go to lien team. AI answered "your stuff has not been sold yet" — accurate but should not be self-served.

SHOULD HAVE: intent_lien_policy · caught by policy compliance, not escalation logic

Severity MED · 2:33 · May 22, 11:18 am

7 turns of "I don't understand"

+1 415 992 1180 · Morro Bay · Loop heuristic should have fired after 3 unresolved clarifications · configured threshold is 5 · this call hit 7 before the caller abandoned.

TUNE: clarification_loop threshold (5→3)

Severity MED · 3:18 · May 21, 8:14 pm

Transfers awaiting review · 14 transfers · was each one warranted?

Call	Trigger	To	Caller waited	Resolved by staff?
CALL-2026-05-22-0614 prorate question · Smith B-204	explicit_human_request	Andre · 7s	7s	Yes · "no prorate"
CALL-2026-05-22-1453 auction · Carmen L-12	intent_lien_policy	Lien team · 12s	14s	Yes
CALL-2026-05-22-1601 gate code reset	3+_clarifications	Andre · 22s	38s	Yes
CALL-2026-05-22-1822 "I need to speak to someone"	explicit_human_request	Voicemail (after hours)	2:11	No · VM only

Call Review · Escalations · Missed · CALL-2026-05-22-1547

Missed escalation · "my stuff is missing"

Linda Park · B-77 · Grover Beach. Three independent rules should have triggered a transfer; none did. Caller hung up after 5:09.

Trigger timeline — what fired, what should have

0:00 · start

1:15 · sentiment −0.7

1:42 · "lawyer"

2:18 · "lawyer" ×2

3:30 · 5th repeat

5:09 · hung up

Rule	Threshold	Observed	Should have fired?	Fired?	Action
Threat keywords	contains(lawyer|sue|attorney|BBB|fraud)	2× "lawyer" @ 1:42, 2:18	Yes · HIGH	✗ No	Inspect why →
Sentiment threshold	avg(last 3 turns) ≤ −0.5	−0.7 @ 1:15	Yes	✗ No	Inspect →
Repeat / clarification loop	≥ 3 unresolved clarifications	5 by 3:30	Yes	✗ No	Inspect →
Intent: complaint	classifier ≥ 0.7	0.62 (sub-threshold)	Borderline	— No	Inspect →
Explicit human request	phrase_match(human|person|manager…)	none	No	— No	—

Root cause analysis

Why threat_keywords didn't fire: The rule is gated on turn.role == 'caller' AND turn.text containing a keyword. The caller said "my lawyer told me to call", so the keyword matched and the rule fired upstream. The resulting escalation was then suppressed by the already_resolved flag, which was set incorrectly when get_tenant succeeded earlier in the call.

Why sentiment didn't fire: Sentiment classifier window is 3 turns, but caller's frustrated turns were interleaved with 2 neutral AI clarifications, pulling the rolling avg back above threshold.

Pattern: third missed escalation this month with the same suppression bug.
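The suppression described in the root cause can be sketched like this. It is a simplified model, not the production rule engine: the function names are hypothetical, the keyword list is taken from the rule table, and `already_resolved` is modeled as a plain boolean.

```python
import re
from dataclasses import dataclass

# Keyword list from the threat-keywords rule; word boundaries keep "sue" from matching "issue".
THREAT_PATTERN = re.compile(r"\b(lawyer|sue|attorney|BBB|fraud)\b", re.IGNORECASE)


@dataclass
class Turn:
    role: str
    text: str


def threat_rule_matches(turn: Turn) -> bool:
    """The gate as described: caller turn AND text contains a keyword."""
    return turn.role == "caller" and bool(THREAT_PATTERN.search(turn.text))


def should_escalate_buggy(turn: Turn, already_resolved: bool) -> bool:
    # Behaviour described in the root cause: a match is dropped once the flag is set.
    return threat_rule_matches(turn) and not already_resolved


def should_escalate_fixed(turn: Turn, already_resolved: bool) -> bool:
    # One way to make threats unconditional: ignore already_resolved for this rule.
    return threat_rule_matches(turn)


turn = Turn(role="caller", text="my lawyer told me to call")
print(should_escalate_buggy(turn, already_resolved=True))  # False: the missed escalation
print(should_escalate_fixed(turn, already_resolved=True))  # True
```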

Recommended fixes

CRITICAL · bugfix

Remove already_resolved suppression from threat_keywords rule. Threats should always escalate.

TUNE · sentiment

Switch sentiment from rolling-avg to min-over-window; a single −0.7 turn should be enough.

PLAYBOOK · intent gap

Add "lost / damaged property" sub-intent · auto-transfer to facility manager regardless of sentiment.

CALLBACK

Call Linda Park back · log as service recovery · suppresses against churn metric.
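The sentiment fix recommended above (rolling average to min-over-window) can be illustrated with a small comparison. The score sequence is hypothetical and only mimics the shape described in the root cause: one frustrated turn followed by two neutral turns. The window of 3 and the −0.5 threshold come from the rule table.

```python
def rolling_avg_fires(scores: list[float], window: int = 3, threshold: float = -0.5) -> bool:
    """Current rule: average of the last `window` turns at or below the threshold."""
    recent = scores[-window:]
    if not recent:
        return False
    return sum(recent) / len(recent) <= threshold


def min_over_window_fires(scores: list[float], window: int = 3, threshold: float = -0.5) -> bool:
    """Proposed rule: any single turn in the window at or below the threshold."""
    recent = scores[-window:]
    if not recent:
        return False
    return min(recent) <= threshold


# Hypothetical per-turn sentiment: one frustrated turn, then two neutral ones.
scores = [-0.7, 0.0, 0.0]

print(rolling_avg_fires(scores))      # False: the average is about -0.23
print(min_over_window_fires(scores))  # True: the single -0.7 turn is enough
```

The trade-off is that min-over-window is more sensitive by construction, so it will likely raise transfer volume for this trigger.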

Call Review · KB gaps

KB gaps

Questions the AI couldn't answer well — either no article was retrieved, or the answer contradicted reality. Each row groups all calls with the same underlying gap so you author once and fix n.

11 open gaps · 4 closed this week

KB coverage gap rate Share of calls where the AI had no good KB answer for the asked question. Drives KB authoring backlog.

7.2%

−2.1% vs. last week

61 of 847 calls had at least one unanswered question.

Hallucination rate % of calls with at least one factual claim that wasn't supported by retrieved KB.

1.9%

+0.4% vs. last week

⚠ above 1.5% target

Gaps closed → recheck

4

After publishing, 2 saw immediate accuracy lift; 1 had no effect; 1 still pending recheck.

Open KB drafts

3

Author: Andre · Reviewer: Sarah

"Do you prorate move-outs?" across 4 facilities

7 calls · 5 hallucinations · 3 escalations · No policy article exists for any facility. AI improvised 4 different answers across the 7 calls.

FACT · POLICY · grover-beach · olive · lakeside · morro-bay · avg score 42

HIGH · 7× · first seen Apr 28

"What time does the gate close on Sundays?"

5 calls · 3 wrong answers · Gate-hours article exists but is generic — doesn't reflect facility-specific Sunday early-close at Morro Bay (9pm) and Olive (8pm).

FACT · per-facility override needed · existing: KB-GATE-001

HIGH · 5× · first seen May 11

"Do you have RV / boat storage with electric?"

4 calls · 4 lost leads · AI said "no" at Grover Beach (correct) and Lakeside (wrong — 6 RV spots with electric since March). Inventory article doesn't include electric flag.

FACT · inventory gap · lakeside · $640/mo × 4 lost

MED · 4× · first seen May 7

"My code isn't working at the gate"

6 calls · 5 escalations · Diagnosis article exists but doesn't cover the "overlock-cleared but gate-code-still-disabled" edge case the new LockEngine v2 introduced.

edge case · lockengine v2 · existing: KB-GATE-005

MED · 6× · first seen May 18

"Can I store a piano / motorcycle / classic car?" (oversized / specialty)

3 calls · AI deflected to "call back during business hours" — no specialty-storage policy article.

policy · missed conversion

LOW · 3× · first seen May 4

7 more gaps below threshold (1–2 occurrences) · Show all 11

Console · AI performance

AI performance

Every metric that matters for the AI receptionist, in one view. Roll up by facility, drill into any number to see the underlying calls.

Correct-containment rate · 30d

78.4%

The number to watch. Plain containment hides false-negatives (AI shouldn't have finished the call but did). This subtracts those.

Raw containment: 82.1%

Missed escalations: −3.7%

Correct-containment: 78.4%

Last quarter: 71.2%

Target Q3: 82.0%

Hallucination rate % of calls with at least one factual claim not supported by retrieved KB. The metric that gets you sued.

1.9%

+0.4%

⚠ above 1.5% target

KB coverage gap rate % of calls where AI had no good KB answer for the asked question. Drives KB authoring backlog.

7.2%

−2.1%

11 open gaps · 4 closed this week

First-call resolution Caller didn't have to call back within 7d for the same issue. Self-derived from phone number + intent.

73.6%

+5.1%

Same caller + same intent within 7d.
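A minimal sketch of how first-call resolution could be derived from the call log under the definition above: a call counts as resolved if the same phone number does not call again with the same intent within 7 days. Counting per call rather than per issue is an assumption, and the phone numbers and dates below are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def first_call_resolution(
    calls: list[tuple[str, str, datetime]],
    window: timedelta = timedelta(days=7),
) -> float:
    """Fraction of calls with no follow-up from the same (phone, intent) inside the window."""
    by_key: dict[tuple[str, str], list[datetime]] = defaultdict(list)
    for phone, intent, ts in calls:
        by_key[(phone, intent)].append(ts)

    resolved = 0
    total = 0
    for stamps in by_key.values():
        stamps.sort()
        for i, ts in enumerate(stamps):
            total += 1
            # A callback exists if the next call on this key falls inside the window.
            has_callback = i + 1 < len(stamps) and stamps[i + 1] - ts <= window
            if not has_callback:
                resolved += 1
    return resolved / total if total else 0.0


# Hypothetical log: one payment caller calls back after two days.
calls = [
    ("+1 555 0100", "payment", datetime(2026, 5, 1, 10, 0)),
    ("+1 555 0100", "payment", datetime(2026, 5, 3, 9, 0)),   # callback within 7d
    ("+1 555 0100", "hours", datetime(2026, 5, 2, 12, 0)),
    ("+1 555 0101", "payment", datetime(2026, 5, 1, 8, 0)),
    ("+1 555 0101", "payment", datetime(2026, 5, 20, 8, 0)),  # outside the window
]

fcr = first_call_resolution(calls)
print(fcr)  # 0.8
```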

Latency · P50 / P95 Time-to-first-token after caller stops speaking. Callers hang up around 1.5s P95.

0.82s / 1.6s

±0%

P95 above 1.5s target — TTS provider switch pending.

Escalation precision

91.2%

+1.8%

When AI transferred, manager agreed.

Escalation recall

83.5%

−2.4%

When AI should have transferred, it did.

Sentiment delta Caller sentiment at end of call minus start of call. Positive = AI made them happier.

+0.18

+0.05

Net positive across 612 AI calls.

Cost / resolved call (Tokens + telephony) / # resolved calls (not all calls). Lower is better, but watch quality.

$0.41

−$0.07

vs. $4.20 fully-loaded staff cost.

Performance by intent · 30d click any cell to see calls

Intent	Volume	Avg score	Hallucination	Containment	Conversion	FCR	Sentiment Δ
Move-in inquiry	187	86	0.5%	94%	41% booked	88%	+0.31
Payment	142	81	0.7%	89%	82% paid	79%	+0.12
Hours / access	98	71	3.1%	97%	—	82%	+0.04
Move-out / prorate	38	48	15.8%	52%	—	61%	−0.22
Lockout / overlock	52	76	1.9%	71%	—	76%	+0.08
Lien / auction	27	84	0%	11% (by design)	—	91%	+0.03
Complaint	31	62	0.0%	19% (should be 0%)	—	54%	−0.41
Tour booking	37	89	0.0%	97%	73% scheduled	92%	+0.28

Trends · 90 days composite score + hallucination

Cohort: model versions composite score by deploy

Version	Deployed	Volume	Avg score	Hallucination	Δ vs. prev
v1.0 baseline	Feb 14	1,847	61	5.2%	—
v1.1 + prorate KB	Apr 03	1,612	68	3.8%	+7
v1.2 + escalation tune	Apr 28	1,288	74	2.4%	+6
v1.3 DPO round 1	May 12	782	79	2.0%	+5
v1.4 current	May 19	412	83	1.9%	+4
v1.5 candidate	staged	94 (shadow)	87	1.4%	+4 (shadow)

All numbers above are derived from: calls ⋈ scores ⋈ outcome_links ⋈ retrievals ⋈ model_version · refresh: hourly · materialized view mv_ai_perf_daily

Console · AI · Training pipeline

Training pipeline

Where Call Review's harvest goes. Three loops, ordered by leverage: KB authoring (instant, no training) · Prompt / playbook (1-day cycle, A/B-able) · DPO fine-tune (weekly, gated).

Loop 1 · KB authoring · HIGHEST LEVERAGE

Each KB gap closed = the AI knows a fact next call. Most accuracy wins live here. No model training.

Open drafts: 3

Closed this week: 4

Avg time to publish: 1.4 days

Lift after publish: +11 score pts on retest

Open KB gap queue

Loop 2 · Prompt / playbook · DAILY · A/B

Patterns of failure → updates to system prompt or intent-specific playbooks. Version each change; A/B against shadow traffic.

Active playbooks: 9 (per intent)

In A/B test: 2 · prorate, complaint

Rule changes pending: 3

Last deploy: May 19 · v1.4

Loop 3 · DPO fine-tune · WEEKLY · GATED

Manager-corrected pairs become preference data. Don't fine-tune on raw transcripts — bakes in mistakes. Gated by 2-person sign-off + held-out eval.

Approved pairs: 3,412

Pending review: 47

Next run: Sun May 25 · 9 pm

Last eval: +4 score · no regress

DPO pair funnel · last 30 days

Calls: 847

Reviewed: 412

Corrected: 189

Author-approved: 156

Reviewer-approved: 142

Eval-passed → shipped: 138

Acceptance rate: 16.3% · Author→Reviewer agreement: 91% · Eval-pass rate: 97% · Avg pair latency: 2.3 days
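The summary rates above can be recomputed from the funnel counts. The mapping of each rate to a pair of stages is inferred from the numbers (for example, acceptance rate as shipped pairs over all calls) rather than stated in the source.

```python
# Stage counts from the DPO pair funnel, in order.
FUNNEL = [
    ("Calls", 847),
    ("Reviewed", 412),
    ("Corrected", 189),
    ("Author-approved", 156),
    ("Reviewer-approved", 142),
    ("Eval-passed → shipped", 138),
]
counts = dict(FUNNEL)


def pct(numerator: int, denominator: int) -> float:
    return 100 * numerator / denominator


acceptance_rate = pct(counts["Eval-passed → shipped"], counts["Calls"])
author_reviewer_agreement = pct(counts["Reviewer-approved"], counts["Author-approved"])
eval_pass_rate = pct(counts["Eval-passed → shipped"], counts["Reviewer-approved"])

print(f"Acceptance rate: {acceptance_rate:.1f}%")
print(f"Author→Reviewer agreement: {author_reviewer_agreement:.0f}%")
print(f"Eval-pass rate: {eval_pass_rate:.0f}%")

# Stage-to-stage conversion shows where most pairs drop out.
for (prev_name, prev_n), (name, n) in zip(FUNNEL, FUNNEL[1:]):
    print(f"{prev_name} → {name}: {pct(n, prev_n):.1f}%")
```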

Pending training pairs · awaiting your sign-off

Pair ID	Source call	Intent	Rejected (AI said)	Preferred (mgr rewrote)	Severity	Author
tp_8h2k9	CALL-…0614	prorate	"Yes, we do prorate move-outs daily…"	"Let me check our move-out policy before…"	HIGH ×3	Andre
tp_4j1m7	CALL-…1547	complaint	"OK and your unit number is again?"	"I hear you, I'm transferring you to Andre right now — one moment."	HIGH ×3	Andre
tp_2k9p4	CALL-…1822	hours	"Gate is open 6am to 10pm every day."	"At Morro Bay the gate closes at 9pm on Sundays — every other day it's 6am to 10pm."	MED ×1	Sarah
tp_9x7s2	CALL-…2014	lockout	"Sure I can clear your overlock right now."	"Overlocks can only be cleared after payment is confirmed — your balance is $148 past due. Want to pay now?"	HIGH ×3	Andre
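A row in the table above could be turned into a preference record roughly as follows. This is a sketch under assumptions: the `prompt` / `chosen` / `rejected` keys follow a common convention for preference datasets rather than anything the source specifies, the class and function names are hypothetical, and the held-out eval gate is not modeled. The rejected and preferred strings are left truncated exactly as they appear in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrectedTurn:
    pair_id: str
    prompt: str             # caller turn that preceded the AI answer
    rejected: str           # what the AI said
    preferred: str          # what the manager rewrote it to
    author_approved: bool
    reviewer_approved: bool


def to_preference_pair(turn: CorrectedTurn) -> Optional[dict]:
    """Emit a pair only after both sign-offs (the 2-person gate)."""
    if not (turn.author_approved and turn.reviewer_approved):
        return None
    return {"prompt": turn.prompt, "chosen": turn.preferred, "rejected": turn.rejected}


pending = CorrectedTurn(
    pair_id="tp_8h2k9",
    prompt="If I move out on the 17th do I get prorated?",
    rejected="Yes, we do prorate move-outs daily…",         # truncated in the table
    preferred="Let me check our move-out policy before…",   # truncated in the table
    author_approved=True,
    reviewer_approved=False,  # still awaiting sign-off
)

print(to_preference_pair(pending))  # None while the reviewer sign-off is missing

pending.reviewer_approved = True
pair = to_preference_pair(pending)
print(pair["chosen"])
```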

Held-out eval set · 42 frozen calls · never touched by training · regression gate

v1.4 (current) vs. v1.5 (candidate)

Composite: 83 → 87 (+4)

Factual: 86 → 92 (+6)

Policy: 79 → 88 (+9)

Hallucination: 1.9% → 1.4%

Regressions: 0 (gate threshold: ≤ 2)
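The regression gate could work along these lines. The source gives only the threshold (at most 2 regressions), so the definition of a regression used here, a held-out call whose candidate score is lower than its baseline score, is an assumption, and the per-call scores are hypothetical.

```python
def count_regressions(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Number of held-out calls where the candidate scores below the baseline."""
    return sum(1 for call_id, base in baseline.items() if candidate[call_id] < base)


def passes_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_regressions: int = 2,
) -> bool:
    """Candidate ships only if regressions stay within the gate threshold."""
    return count_regressions(baseline, candidate) <= max_regressions


# Hypothetical composite scores for three frozen eval calls.
baseline_scores = {"eval-01": 80, "eval-02": 70, "eval-03": 90}
candidate_scores = {"eval-01": 85, "eval-02": 69, "eval-03": 90}

print(count_regressions(baseline_scores, candidate_scores))  # 1
print(passes_gate(baseline_scores, candidate_scores))        # True
```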

Eval set composition · cohort coverage

Move-in: 8

Payment: 7

Prorate / move-out: 6

Hours / access: 5

Complaint: 5

Lockout: 4

Lien: 4

Edge cases: 3
