qzl 19981964fb refactor: remove the LiteLLM proxy architecture; the backend now connects directly to Provider APIs

- Remove backend/scripts/build_litellm_proxy_config.py
- Simplify LiteLLMService; remove the run_completion_with_cost method
- AgentScopeRunner now obtains api_base and api_key from LlmFactory
- Remove the litellm/litellm-config-job services from the deployment config
- Flutter: add the AuthBootScreen boot screen
- Android: add notification permissions (POST_NOTIFICATIONS, RECEIVE_BOOT_COMPLETED, SCHEDULE_EXACT_ALARM)
- Improve the LocalNotificationService fallback for scheduling failures
- Update manifest.json (version 3)

2026-03-17 18:05:49 +08:00


Worker Token/Latency Optimization TODO

Date: 2026-03-17
Owner: backend runtime
Status: pending

Background

Router cost and latency are acceptable.
The worker stage (deepseek-chat) consumes far more input tokens and has much higher latency than the router.
Optimization work is deferred for now in favor of higher-priority tasks.

Observations (from `public.messages`)

Average worker input tokens are roughly four times the router's (about 12k+ vs 3k).
Average worker latency is roughly ten times the router's (about 41s vs 4s).
Worker cost dominates total cost.
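
A minimal sketch of how the per-stage comparison above could be reproduced from an export of `public.messages`. The `stage` column name and every value in `rows` are illustrative assumptions, not real data; `input_tokens`, `latency_ms`, and `cost` follow the field names used in this note.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows shaped like a `public.messages` export.
# The `stage` column and all numbers are assumptions for illustration only.
rows = [
    {"stage": "router", "input_tokens": 3000, "latency_ms": 4000, "cost": 0.001},
    {"stage": "router", "input_tokens": 3200, "latency_ms": 4200, "cost": 0.001},
    {"stage": "worker", "input_tokens": 12000, "latency_ms": 40000, "cost": 0.010},
    {"stage": "worker", "input_tokens": 13000, "latency_ms": 42000, "cost": 0.012},
]


def summarize_by_stage(rows):
    """Group rows by stage and compute average tokens/latency and total cost."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["stage"]].append(row)

    summary = {}
    for stage, items in grouped.items():
        summary[stage] = {
            "avg_input_tokens": mean(r["input_tokens"] for r in items),
            "avg_latency_ms": mean(r["latency_ms"] for r in items),
            "total_cost": sum(r["cost"] for r in items),
        }
    return summary


summary = summarize_by_stage(rows)
```

Running the same aggregation before and after any change keeps the comparison on identical terms.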

Root Cause Hypothesis

Worker ReAct path repeatedly includes full tool schemas per model call.
calendar_write tool schema is large and contributes major prompt overhead.
Finalize JSON step performs an additional model call after ReAct.
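
To check the hypothesis, the prompt overhead from tool schemas can be estimated roughly. The sketch below assumes schemas are sent as JSON on every model call and uses a crude characters-per-token heuristic instead of the model's real tokenizer; the example schema is a hypothetical stand-in, not the actual calendar_write definition.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; an assumption, not the provider's tokenizer."""
    return int(len(text) / chars_per_token)


def schema_overhead(tool_schemas: list[dict], model_calls: int) -> int:
    """Estimated input tokens spent on tool schemas across a ReAct run."""
    per_call = sum(estimate_tokens(json.dumps(schema)) for schema in tool_schemas)
    return per_call * model_calls


# Hypothetical schema used only to exercise the functions.
example_schema = {
    "name": "calendar_write",
    "description": "Create, update, or delete a calendar event.",
    "parameters": {
        "type": "object",
        "properties": {
            "action": {"type": "string", "enum": ["create", "update", "delete"]},
            "title": {"type": "string", "description": "Event title."},
        },
    },
}

single_call = schema_overhead([example_schema], model_calls=1)
five_calls = schema_overhead([example_schema], model_calls=5)
```

Because the schemas are resent on each call, the estimated overhead grows linearly with the number of ReAct iterations, plus one more call if the finalize step also carries them.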

Deferred Optimization Items

- Tool schema slimming for calendar write path.
  - Split calendar_write into focused tools (calendar_create, calendar_update, calendar_delete).
  - Reduce redundant/verbose field descriptions where possible.
- Dynamic tool set exposure by routed intent.
  - Only expose tools needed for current task.
- Evaluate finalize overhead.
  - Verify whether finalize call can be reduced or replaced in specific flows.
- Add before/after benchmark script.
  - Compare worker input_tokens, latency_ms, and cost for the same scripted multi-turn scenario.
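
Dynamic tool exposure could look something like the sketch below. The intent names and the `tools_for_intent` helper are hypothetical, and the schema bodies are left as placeholders; only the three split tool names come from the item above. Unknown intents fall back to the full tool set so behavior does not change for unrouted requests.

```python
# Placeholder schemas keyed by tool name; real definitions would go here.
ALL_TOOLS = {
    "calendar_create": {"name": "calendar_create"},
    "calendar_update": {"name": "calendar_update"},
    "calendar_delete": {"name": "calendar_delete"},
}

# Hypothetical mapping from routed intent to the tools that intent needs.
INTENT_TOOLS = {
    "create_event": ["calendar_create"],
    "modify_event": ["calendar_update"],
    "remove_event": ["calendar_delete"],
}


def tools_for_intent(intent: str) -> list[dict]:
    """Return only the tool schemas needed for the routed intent.

    Falls back to every tool when the intent is not recognized.
    """
    names = INTENT_TOOLS.get(intent)
    if names is None:
        return list(ALL_TOOLS.values())
    return [ALL_TOOLS[name] for name in names]


create_tools = tools_for_intent("create_event")
fallback_tools = tools_for_intent("unrecognized_intent")
```

The fallback keeps the change low-risk: a routing miss costs tokens, not functionality.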

Acceptance Metrics (target)

Reduce worker input tokens by >= 30% in multi-turn calendar CRUD scenario.
Reduce worker p95 latency by >= 25%.
Keep functional behavior unchanged for agent runs.
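
The first two targets can be checked mechanically once before/after runs of the same scenario are recorded. This is a sketch under assumptions: the input shape (lists of per-call worker `input_tokens` and `latency_ms`) is hypothetical, and p95 uses the nearest-rank definition. Functional equivalence still needs separate verification.

```python
def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def reduction(before, after):
    """Fractional reduction from before to after (0.30 means 30% lower)."""
    return (before - after) / before


def meets_targets(before, after):
    """Check the token (>= 30%) and p95 latency (>= 25%) reduction targets."""
    token_cut = reduction(sum(before["input_tokens"]), sum(after["input_tokens"]))
    latency_cut = reduction(p95(before["latency_ms"]), p95(after["latency_ms"]))
    return token_cut >= 0.30 and latency_cut >= 0.25


# Illustrative numbers only, not measured results.
before_run = {"input_tokens": [12000, 12000], "latency_ms": [40000, 42000]}
after_run = {"input_tokens": [8000, 8000], "latency_ms": [30000, 31000]}
passed = meets_targets(before_run, after_run)
```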
