19981964fb
- Remove backend/scripts/build_litellm_proxy_config.py - Simplify LiteLLMService and remove the run_completion_with_cost method - AgentScopeRunner now gets api_base and api_key from LlmFactory - Remove the litellm/litellm-config-job services from the deployment config - Flutter: add the AuthBootScreen boot screen - Android: add notification permissions (POST_NOTIFICATIONS, RECEIVE_BOOT_COMPLETED, SCHEDULE_EXACT_ALARM) - Improve the LocalNotificationService fallback when scheduling fails - Update manifest.json (version 3)
1.5 KiB
Worker Token/Latency Optimization TODO
Date: 2026-03-17
Owner: backend runtime
Status: pending
Background
- Router cost/latency is acceptable.
- Worker stage (deepseek-chat) has significantly higher input tokens and latency.
- Current optimization work is deferred due to prioritization.
Observations (from public.messages)
- Worker avg input tokens are much higher than router (about 12k+ vs 3k).
- Worker avg latency is much higher than router (about 41s vs 4s).
- Worker cost dominates total cost.
Root Cause Hypothesis
- Worker ReAct path repeatedly includes full tool schemas per model call.
- `calendar_write` tool schema is large and contributes major prompt overhead.
- Finalize JSON step performs an additional model call after ReAct.
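To make the hypothesis concrete, here is a rough sketch of how schema overhead accumulates when the same tool schemas are resent on every model call. The schema contents, the 4-characters-per-token heuristic, and the call count are illustrative assumptions, not measurements from this system.

```python
import json

# Hypothetical stand-in for the calendar_write schema; the real one is larger.
CALENDAR_WRITE_SCHEMA = {
    "name": "calendar_write",
    "description": "Create, update, or delete a calendar event.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["create", "update", "delete"]},
            "event_id": {"type": "string", "description": "Target event for update/delete."},
            "title": {"type": "string", "description": "Event title."},
            "start": {"type": "string", "description": "ISO 8601 start time."},
            "end": {"type": "string", "description": "ISO 8601 end time."},
        },
        "required": ["action"],
    },
}


def estimate_tokens(obj) -> int:
    """Rough estimate, assuming about 4 characters of JSON per token."""
    return len(json.dumps(obj)) // 4


def react_schema_overhead(schemas, model_calls: int) -> int:
    """Schemas are resent on every model call, so overhead grows linearly."""
    per_call = sum(estimate_tokens(s) for s in schemas)
    return per_call * model_calls


if __name__ == "__main__":
    # Example: one tool schema resent across five model calls.
    print(react_schema_overhead([CALENDAR_WRITE_SCHEMA], model_calls=5))
```

For real numbers, the provider's reported `input_tokens` per call would be a better source than a character-count heuristic.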
Deferred Optimization Items
- Tool schema slimming for calendar write path.
  - Split `calendar_write` into focused tools (`calendar_create`, `calendar_update`, `calendar_delete`).
  - Reduce redundant/verbose field descriptions where possible.
- Dynamic tool set exposure by routed intent.
  - Only expose tools needed for current task.
- Evaluate finalize overhead.
  - Verify whether finalize call can be reduced or replaced in specific flows.
- Add before/after benchmark script.
  - Compare worker `input_tokens`, `latency_ms`, and `cost` for the same scripted multi-turn scenario.
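The tool split and intent-based exposure could look roughly like the sketch below. The intent labels, the schema fields, and the `tools_for_intent` helper are hypothetical; only the three tool names come from the list above.

```python
from typing import Dict, List

# Hypothetical slim schemas for the split tools; field names are placeholders.
TOOL_SCHEMAS: Dict[str, dict] = {
    "calendar_create": {
        "name": "calendar_create",
        "parameters": {"type": "object", "required": ["title", "start", "end"]},
    },
    "calendar_update": {
        "name": "calendar_update",
        "parameters": {"type": "object", "required": ["event_id"]},
    },
    "calendar_delete": {
        "name": "calendar_delete",
        "parameters": {"type": "object", "required": ["event_id"]},
    },
}

# Hypothetical mapping from the router's intent label to the tools it needs.
INTENT_TOOLS: Dict[str, List[str]] = {
    "create_event": ["calendar_create"],
    "update_event": ["calendar_update"],
    "delete_event": ["calendar_delete"],
}


def tools_for_intent(intent: str) -> List[dict]:
    """Return only the schemas needed for this intent.

    Unknown intents fall back to the full tool set so behavior is unchanged.
    """
    names = INTENT_TOOLS.get(intent, list(TOOL_SCHEMAS))
    return [TOOL_SCHEMAS[name] for name in names]
```

Falling back to the full tool set for unrecognized intents trades some token savings for safety, which fits the requirement that functional behavior stay unchanged.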
Acceptance Metrics (target)
- Reduce worker input tokens by >= 30% in multi-turn calendar CRUD scenario.
- Reduce worker p95 latency by >= 25%.
- Keep functional behavior unchanged for agent runs.
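A benchmark script could check the first two targets along these lines. This is a sketch under assumptions: `RunSample`, the nearest-rank p95 definition, and measuring the token target on the mean are all choices made here for illustration, not something the plan specifies.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RunSample:
    """One worker call from the scripted scenario (hypothetical record shape)."""
    input_tokens: int
    latency_ms: float
    cost: float


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def reduction(before: float, after: float) -> float:
    """Fractional reduction relative to the baseline."""
    return (before - after) / before


def meets_targets(before: List[RunSample], after: List[RunSample]) -> bool:
    """Check >= 30% mean input-token reduction and >= 25% p95 latency reduction."""
    token_cut = reduction(
        mean([s.input_tokens for s in before]),
        mean([s.input_tokens for s in after]),
    )
    latency_cut = reduction(
        p95([s.latency_ms for s in before]),
        p95([s.latency_ms for s in after]),
    )
    return token_cut >= 0.30 and latency_cut >= 0.25
```

The third target, unchanged functional behavior, needs a separate regression check on agent outputs and is not covered by this comparison.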