docs/todo/2026-03-17-worker-token-latency-optimization.md

# Worker Token/Latency Optimization TODO

Date: 2026-03-17
Owner: backend runtime
Status: pending

## Background

- Router cost/latency is acceptable.
- The worker stage (`deepseek-chat`) uses far more input tokens and has much higher latency than the router.
- Optimization work is currently deferred in favor of higher-priority items.

## Observations (from `public.messages`)

- Worker average input tokens are roughly 4x the router's (12k+ vs. about 3k).
- Worker average latency is roughly 10x the router's (about 41s vs. about 4s).
- Worker cost dominates total cost.

## Root Cause Hypothesis

- The worker ReAct path resends the full tool schemas on every model call.
- The `calendar_write` tool schema is large and accounts for a major share of the prompt overhead.
- The finalize JSON step makes one additional model call after the ReAct loop.
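
To check this hypothesis before touching the worker, a rough estimate of the schema overhead may be enough. The sketch below is illustrative only: the schema contents are hypothetical placeholders in an assumed OpenAI-style function format, and the 4-characters-per-token ratio is a coarse heuristic, not the real tokenizer.

```python
import json

# Hypothetical schemas for illustration; the real ones live in the worker code.
TOOL_SCHEMAS = {
    "calendar_write": {
        "type": "function",
        "function": {
            "name": "calendar_write",
            "description": "Create, update, or delete a calendar event.",
            "parameters": {
                "type": "object",
                "properties": {
                    "action": {"type": "string", "enum": ["create", "update", "delete"]},
                    "event_id": {"type": "string", "description": "Required for update and delete."},
                    "title": {"type": "string", "description": "Event title."},
                    "start": {"type": "string", "description": "ISO 8601 start time."},
                    "end": {"type": "string", "description": "ISO 8601 end time."},
                },
                "required": ["action"],
            },
        },
    },
}


def approx_tokens(obj) -> int:
    """Rough token estimate: about 4 characters of compact JSON per token."""
    return len(json.dumps(obj, separators=(",", ":"))) // 4


def schema_overhead(schemas: dict, model_calls: int) -> int:
    """Estimated input tokens spent on tool schemas across one worker run.

    Assumes every model call in the ReAct loop carries the full schema set.
    """
    per_call = sum(approx_tokens(schema) for schema in schemas.values())
    return per_call * model_calls


if __name__ == "__main__":
    for calls in (1, 3, 5):
        print(calls, schema_overhead(TOOL_SCHEMAS, calls))
```

If the estimate for the real schemas is a large fraction of the observed 12k+ input tokens, that would support prioritizing schema slimming. For real numbers, swap the heuristic for the provider's reported token counts.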

## Deferred Optimization Items

1. Tool schema slimming for calendar write path.
   - Split `calendar_write` into focused tools (`calendar_create`, `calendar_update`, `calendar_delete`).
   - Reduce redundant/verbose field descriptions where possible.
2. Dynamic tool set exposure by routed intent.
   - Expose only the tools needed for the current task.
3. Evaluate finalize overhead.
   - Verify whether the finalize call can be skipped or merged into the last ReAct step in specific flows.
4. Add before/after benchmark script.
   - Compare worker `input_tokens`, `latency_ms`, and `cost` for the same scripted multi-turn scenario.
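
Items 1 and 2 could combine roughly as follows. This is a minimal sketch, not the current worker code: the intent names, the `calendar_read` tool, the schema bodies, and the `tools_for_intent` helper are all assumptions made for illustration. Only the three split tool names come from the list above.

```python
from typing import Dict, List

# Hypothetical registry of slimmed, single-purpose tool schemas.
TOOL_REGISTRY: Dict[str, dict] = {
    "calendar_create": {"name": "calendar_create", "parameters": {"required": ["title", "start", "end"]}},
    "calendar_update": {"name": "calendar_update", "parameters": {"required": ["event_id"]}},
    "calendar_delete": {"name": "calendar_delete", "parameters": {"required": ["event_id"]}},
    "calendar_read": {"name": "calendar_read", "parameters": {"required": []}},  # assumed to exist
}

# Hypothetical mapping from the router's intent label to the tools the worker needs.
INTENT_TOOLS: Dict[str, List[str]] = {
    "create_event": ["calendar_create"],
    "update_event": ["calendar_read", "calendar_update"],
    "delete_event": ["calendar_read", "calendar_delete"],
    "query_events": ["calendar_read"],
}


def tools_for_intent(intent: str) -> List[dict]:
    """Return only the tool schemas needed for the routed intent.

    Unknown intents fall back to the full tool set so behavior stays
    unchanged when the router emits a label this table does not cover.
    """
    names = INTENT_TOOLS.get(intent, list(TOOL_REGISTRY))
    return [TOOL_REGISTRY[name] for name in names]


if __name__ == "__main__":
    print([tool["name"] for tool in tools_for_intent("create_event")])
    print([tool["name"] for tool in tools_for_intent("something_else")])
```

The fallback to the full set is a deliberate design choice: it trades some token savings for safety, which fits the requirement to keep functional behavior unchanged.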

## Acceptance Metrics (target)

- Reduce worker input tokens by >= 30% in multi-turn calendar CRUD scenario.
- Reduce worker p95 latency by >= 25%.
- Keep functional behavior unchanged for agent runs.
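
The first two targets can be checked mechanically once the benchmark script produces per-call records. The sketch below assumes each record is a dict with `input_tokens` and `latency_ms` keys, compares mean input tokens (the note does not say whether mean or total is intended), and uses a nearest-rank p95. The sample numbers in the usage section are made up.

```python
from statistics import mean
from typing import Dict, List

Run = Dict[str, float]


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile (integer math avoids float rounding)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def reduction(before: float, after: float) -> float:
    """Fractional reduction relative to the baseline, e.g. 0.30 for 30%."""
    return (before - after) / before


def check_targets(before: List[Run], after: List[Run]) -> dict:
    """Compare baseline and optimized worker runs against the targets above."""
    token_cut = reduction(
        mean(r["input_tokens"] for r in before),
        mean(r["input_tokens"] for r in after),
    )
    latency_cut = reduction(
        p95([r["latency_ms"] for r in before]),
        p95([r["latency_ms"] for r in after]),
    )
    return {
        "token_reduction": token_cut,
        "p95_latency_reduction": latency_cut,
        "passed": token_cut >= 0.30 and latency_cut >= 0.25,
    }


if __name__ == "__main__":
    # Made-up sample data, not measurements.
    baseline = [{"input_tokens": 12000, "latency_ms": 40000}] * 4
    optimized = [{"input_tokens": 8000, "latency_ms": 28000}] * 4
    print(check_targets(baseline, optimized))
```

The third target (unchanged functional behavior) is not covered here and would need separate regression checks on agent run outputs.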