social-app/docs/todo/2026-03-17-worker-token-latency-optimization.md
qzl 19981964fb refactor: remove the LiteLLM proxy architecture; the backend now connects directly to the Provider API
- Remove backend/scripts/build_litellm_proxy_config.py
- Simplify LiteLLMService and remove the run_completion_with_cost method
- AgentScopeRunner now obtains api_base and api_key from LlmFactory
- Remove the litellm/litellm-config-job services from the deployment config
- Flutter: add the AuthBootScreen boot screen
- Android: add notification permissions (POST_NOTIFICATIONS, RECEIVE_BOOT_COMPLETED, SCHEDULE_EXACT_ALARM)
- Improve the LocalNotificationService fallback when scheduling fails
- Update manifest.json (version 3)
2026-03-17 18:05:49 +08:00


Worker Token/Latency Optimization TODO

Date: 2026-03-17
Owner: backend runtime
Status: pending

Background

  • Router cost/latency is acceptable.
  • Worker stage (deepseek-chat) has significantly higher input tokens and latency.
  • Optimization work is currently deferred in favor of higher-priority items.

Observations (from public.messages)

  • Average worker input tokens are roughly 4x the router's (about 12k+ vs. 3k).
  • Average worker latency is roughly 10x the router's (about 41s vs. 4s).
  • Worker cost dominates total cost.

Root Cause Hypothesis

  • Worker ReAct path repeatedly includes full tool schemas per model call.
  • calendar_write tool schema is large and contributes major prompt overhead.
  • Finalize JSON step performs an additional model call after ReAct.
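
The hypothesis above can be sanity-checked with a rough estimate of how many input tokens the tool schemas alone add when every ReAct model call resends them. The sketch below is illustrative only: the schemas are hypothetical stand-ins (the real calendar_write schema is not reproduced in this note), and the 4-characters-per-token ratio is a common heuristic, not the deepseek-chat tokenizer.

```python
import json

# Hypothetical tool schemas for illustration; the real ones are larger.
TOOL_SCHEMAS = {
    "calendar_write": {
        "name": "calendar_write",
        "description": "Create, update, or delete calendar events.",
        "parameters": {
            "type": "object",
            "properties": {
                "action": {"type": "string", "enum": ["create", "update", "delete"]},
                "event_id": {"type": "string"},
                "title": {"type": "string"},
                "start": {"type": "string"},
                "end": {"type": "string"},
            },
            "required": ["action"],
        },
    },
    "calendar_read": {
        "name": "calendar_read",
        "description": "List calendar events in a time range.",
        "parameters": {
            "type": "object",
            "properties": {
                "start": {"type": "string"},
                "end": {"type": "string"},
            },
        },
    },
}


def estimate_tokens(obj) -> int:
    """Rough estimate assuming ~4 characters per token (heuristic, not a tokenizer)."""
    text = json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
    return max(1, len(text) // 4)


def schema_overhead(schemas: dict, model_calls: int) -> int:
    """Estimated tokens spent on tool schemas when every call resends all of them."""
    per_call = sum(estimate_tokens(schema) for schema in schemas.values())
    return per_call * model_calls


if __name__ == "__main__":
    # Example: a ReAct run with 4 model calls.
    print(schema_overhead(TOOL_SCHEMAS, model_calls=4))
```

Because the overhead scales linearly with the number of model calls, trimming a large schema pays off once per call, which is why the finalize step's extra call is also worth measuring.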

Deferred Optimization Items

  1. Tool schema slimming for calendar write path.
    • Split calendar_write into focused tools (calendar_create, calendar_update, calendar_delete).
    • Reduce redundant/verbose field descriptions where possible.
  2. Dynamic tool set exposure by routed intent.
    • Only expose tools needed for current task.
  3. Evaluate finalize overhead.
    • Verify whether finalize call can be reduced or replaced in specific flows.
  4. Add before/after benchmark script.
    • Compare worker input_tokens, latency_ms, and cost for the same scripted multi-turn scenario.
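
Items 1 and 2 could be combined roughly as follows: once calendar_write is split into focused tools, the worker exposes only the tools mapped to the routed intent. This is a minimal sketch under stated assumptions. The intent labels, the calendar_read tool, and the select_tools helper are hypothetical; the router's real output format is not described in this note.

```python
# Hypothetical mapping from routed intent to the tools the worker needs.
TOOLS_BY_INTENT = {
    "calendar_create": ["calendar_read", "calendar_create"],
    "calendar_update": ["calendar_read", "calendar_update"],
    "calendar_delete": ["calendar_read", "calendar_delete"],
    "calendar_query": ["calendar_read"],
}

# Full tool set, used as the fallback for intents that are not mapped.
ALL_TOOLS = sorted({name for names in TOOLS_BY_INTENT.values() for name in names})


def select_tools(intent: str) -> list[str]:
    """Return the tool names to expose for a routed intent.

    Unknown intents fall back to the full tool set so that existing
    behavior is preserved when the router emits an unmapped label.
    """
    return list(TOOLS_BY_INTENT.get(intent, ALL_TOOLS))


if __name__ == "__main__":
    print(select_tools("calendar_delete"))  # only read + delete are exposed
    print(select_tools("smalltalk"))        # unmapped intent: full tool set
```

Falling back to the full set for unmapped intents trades some token savings for safety, which fits the requirement that functional behavior stay unchanged.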

Acceptance Metrics (target)

  • Reduce worker input tokens by >= 30% in multi-turn calendar CRUD scenario.
  • Reduce worker p95 latency by >= 25%.
  • Keep functional behavior unchanged for agent runs.
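
The two numeric targets above could be checked by the benchmark script along these lines. This is a sketch, not the planned script: the run records are assumed to be dicts with input_tokens and latency_ms keys (matching the metric names used in this note), p95 uses the nearest-rank method, and the token target is assumed to apply to the mean across runs. The sample values are illustrative only.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def reduction(before: float, after: float) -> float:
    """Fractional reduction relative to the baseline (0.30 means 30% lower)."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before


def meets_targets(before_runs, after_runs) -> bool:
    """Check the >= 30% input-token and >= 25% p95-latency reduction targets."""
    mean_before = sum(r["input_tokens"] for r in before_runs) / len(before_runs)
    mean_after = sum(r["input_tokens"] for r in after_runs) / len(after_runs)
    token_cut = reduction(mean_before, mean_after)
    latency_cut = reduction(
        p95([r["latency_ms"] for r in before_runs]),
        p95([r["latency_ms"] for r in after_runs]),
    )
    return token_cut >= 0.30 and latency_cut >= 0.25


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    before = [{"input_tokens": 12000, "latency_ms": 41000}]
    after = [{"input_tokens": 8000, "latency_ms": 30000}]
    print(meets_targets(before, after))
```

The functional-behavior criterion is not covered here; it would need a separate comparison of agent outputs for the same scripted scenario.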