# Worker Token/Latency Optimization TODO

Date: 2026-03-17
Owner: backend runtime
Status: pending

## Background

- Router cost and latency are acceptable.
- The worker stage (`deepseek-chat`) uses significantly more input tokens and has significantly higher latency than the router.
- Optimization work is deferred for now because of other priorities.

## Observations (from `public.messages`)

- Worker average input tokens are much higher than the router's (about 12k+ vs. 3k).
- Worker average latency is much higher than the router's (about 41 s vs. 4 s).
- Worker cost dominates total cost.

## Root Cause Hypothesis

- The worker ReAct path resends the full tool schemas with every model call.
- The `calendar_write` tool schema is large and accounts for a major share of the prompt overhead.
- The finalize JSON step makes an additional model call after the ReAct loop.
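
The first two hypotheses can be sanity-checked with a back-of-the-envelope estimate before any refactoring. The sketch below is illustrative only: the abbreviated schemas, the `calendar_read` tool, the `estimate_tokens` helper, and the 4-characters-per-token heuristic are all assumptions, not the real worker schemas or the model's tokenizer.

```python
import json

# Hypothetical, heavily abbreviated tool schemas; the real ones are larger.
TOOL_SCHEMAS = {
    "calendar_write": {
        "name": "calendar_write",
        "description": "Create, update, or delete calendar events.",
        "parameters": {
            "type": "object",
            "properties": {
                "action": {"type": "string", "enum": ["create", "update", "delete"]},
                "event_id": {"type": "string", "description": "Target event for update/delete."},
                "title": {"type": "string", "description": "Event title."},
                "start": {"type": "string", "description": "ISO 8601 start time."},
                "end": {"type": "string", "description": "ISO 8601 end time."},
            },
            "required": ["action"],
        },
    },
    # Hypothetical second tool, included only to show the per-call sum.
    "calendar_read": {
        "name": "calendar_read",
        "description": "List events in a time range.",
        "parameters": {"type": "object", "properties": {}},
    },
}


def estimate_tokens(text: str) -> int:
    """Crude estimate: assume about 4 characters per token (heuristic only)."""
    return max(1, len(text) // 4)


def schema_overhead(schemas: dict, model_calls: int) -> int:
    """Tokens spent on tool schemas when every model call resends all of them."""
    per_call = sum(estimate_tokens(json.dumps(s)) for s in schemas.values())
    return per_call * model_calls


if __name__ == "__main__":
    # Assumed run shape: 5 model calls that each carry the full tool set.
    print(schema_overhead(TOOL_SCHEMAS, model_calls=5))
```

Because the overhead scales linearly with the number of model calls, trimming a schema once pays off on every ReAct step. For real numbers, the estimate should be replaced with the `input_tokens` values already recorded per call.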

## Deferred Optimization Items

1. Tool schema slimming for the calendar write path.
   - Split `calendar_write` into focused tools (`calendar_create`, `calendar_update`, `calendar_delete`).
   - Reduce redundant or verbose field descriptions where possible.
2. Dynamic tool set exposure by routed intent.
   - Expose only the tools needed for the current task.
3. Evaluate finalize overhead.
   - Verify whether the finalize call can be reduced or replaced in specific flows.
4. Add a before/after benchmark script.
   - Compare worker `input_tokens`, `latency_ms`, and `cost` for the same scripted multi-turn scenario.
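
Items 1 and 2 fit together: once `calendar_write` is split, the worker can pass only the tools that match the routed intent. The following is a minimal sketch under stated assumptions. The intent labels, the `TOOLS_BY_INTENT` mapping, the `select_tools` helper, and the placeholder schemas are hypothetical and do not reflect the current router output or worker code.

```python
from typing import Dict, List

# Focused tools proposed above, with placeholder schemas (real schemas omitted).
TOOL_REGISTRY: Dict[str, dict] = {
    name: {"name": name, "parameters": {"type": "object", "properties": {}}}
    for name in ("calendar_create", "calendar_update", "calendar_delete")
}

# Hypothetical routed-intent labels mapped to the tools each one needs.
TOOLS_BY_INTENT: Dict[str, List[str]] = {
    "create_event": ["calendar_create"],
    "update_event": ["calendar_update"],
    "delete_event": ["calendar_delete"],
}


def select_tools(intent: str) -> List[dict]:
    """Return only the schemas needed for the routed intent.

    Unknown intents fall back to the full tool set, so behavior stays
    unchanged whenever the router emits a label this mapping does not know.
    """
    names = TOOLS_BY_INTENT.get(intent, list(TOOL_REGISTRY))
    return [TOOL_REGISTRY[name] for name in names]


if __name__ == "__main__":
    print([t["name"] for t in select_tools("update_event")])  # ['calendar_update']
    print(len(select_tools("unknown_intent")))  # 3
```

The fallback to the full tool set is a deliberately conservative choice: it trades some token savings for a lower risk of the worker missing a tool it needs.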

## Acceptance Metrics (target)

- Reduce worker input tokens by >= 30% in the multi-turn calendar CRUD scenario.
- Reduce worker p95 latency by >= 25%.
- Keep functional behavior unchanged for agent runs.
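
The two numeric targets can be checked mechanically from benchmark output. The sketch below is one possible shape for that check, not an existing script. The dict layout, the helper names, and the use of summed input tokens over the scenario are assumptions, and the sample numbers are made up. The third target (unchanged functional behavior) needs separate regression tests and is not covered here.

```python
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n, 1-based
    return ordered[rank - 1]


def reduction(before: float, after: float) -> float:
    """Fractional reduction from before to after (0.30 means 30% lower)."""
    return (before - after) / before


def meets_targets(before: dict, after: dict) -> bool:
    """Check the token and latency targets for one scripted scenario.

    Each dict holds per-call lists under "input_tokens" and "latency_ms"
    (key names follow the benchmark item; the layout itself is an assumption).
    """
    token_cut = reduction(sum(before["input_tokens"]), sum(after["input_tokens"]))
    latency_cut = reduction(p95(before["latency_ms"]), p95(after["latency_ms"]))
    return token_cut >= 0.30 and latency_cut >= 0.25


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measured results.
    before = {"input_tokens": [12000, 12500, 13000], "latency_ms": [38000, 41000, 45000]}
    after = {"input_tokens": [8000, 8200, 8500], "latency_ms": [29000, 30000, 32000]}
    print(meets_targets(before, after))  # True
```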