docs: update automation memory design doc and protocol routes

- Restructure automation-memory-design.md as v2, adding an Execution Profile abstraction layer
- Delete auth-global-rewrite-design.md and auth-global-rewrite-plan.md
- Update the agent/api-endpoints.md protocol doc
- Update the ASR and worker token latency optimization TODO docs
2026-03-18 17:03:33 +08:00
parent 8539f05a66
commit 257cb0f5d5
6 changed files with 303 additions and 439 deletions
@@ -1 +1 @@
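- Speech recognition billing
+The project has a speech recognition feature, but cost calculation for it is not implemented. The model currently in use is fun-asr-realtime-2026-02-28, priced at 0.00033 yuan per second. I want to add it to backend/src/core/config/static/database/llm_catalog.yaml under a new asr field and introduce model_code to replace the hard-coded value in the existing agent router, so that model info and pricing are loaded from config. The price is then estimated from the audio duration received by the backend route; alternatively, check whether the dashscope SDK returns the consumed token amount and, if it does, work out how to audit that amount.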
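The duration-based estimate described in this TODO can be sketched as follows. This is a minimal illustration, not the project's implementation: `ASR_CATALOG` is a hypothetical in-memory stand-in for an `asr` section of llm_catalog.yaml, and the key names `price_per_second` and `currency` as well as the function `estimate_asr_cost` are assumptions. Only the model name and the 0.00033 yuan per second price come from the note above.

```python
from decimal import Decimal

# Hypothetical stand-in for an `asr` section loaded from llm_catalog.yaml.
# The key names below are assumptions made for illustration.
ASR_CATALOG = {
    "fun-asr-realtime-2026-02-28": {
        "model_code": "fun-asr-realtime-2026-02-28",
        "price_per_second": Decimal("0.00033"),  # yuan per second of audio
        "currency": "CNY",
    }
}


def estimate_asr_cost(model_code: str, audio_seconds: float) -> Decimal:
    """Estimate ASR cost from audio duration and the configured per-second price."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    entry = ASR_CATALOG[model_code]  # raises KeyError if the model is not configured
    # Go through str() so the float's binary representation does not leak into the Decimal.
    return entry["price_per_second"] * Decimal(str(audio_seconds))
```

Using `Decimal` rather than `float` keeps sub-cent amounts exact when many short requests are summed for auditing.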
@@ -1,41 +1,78 @@
-# Worker Token/Latency Optimization TODO
+# Worker Token/Latency Optimization TODO

-Date: 2026-03-17
+Date: 2026-03-17
 Owner: backend runtime
-Status: pending
+Status: pending

-## Background
+## Background

- Router cost/latency is acceptable.
- Worker stage (deepseek-chat) has significantly higher input tokens and latency.
- Current optimization work is deferred due to prioritization.
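+- Router-stage cost and latency are broadly acceptable.
+- The Worker stage (deepseek-chat) has markedly high `input_tokens` and `latency` and is the main source of total cost.
+- The goal is to compress Worker input tokens first, without degrading result quality or stability.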

-## Observations (from `public.messages`)
+## Current Observations

- Worker avg input tokens are much higher than router (about 12k+ vs 3k).
- Worker avg latency is much higher than router (about 41s vs 4s).
- Worker cost dominates total cost.
+- Worker average `input_tokens` is noticeably higher than Router's.
+- Worker average latency is noticeably higher than Router's.
+- Cost comes mainly from the Worker stage.

-## Root Cause Hypothesis
+## Core Optimization Directions (by priority)

- Worker ReAct path repeatedly includes full tool schemas per model call.
- `calendar_write` tool schema is large and contributes major prompt overhead.
- Finalize JSON step performs an additional model call after ReAct.
+### P0 (do first; low risk, high payoff)

-## Deferred Optimization Items
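+1. Slim the routing prompt: move from a "full route list" to "route_id constraint + server-side mapping".
+   - The model outputs only `route_id` and the required parameters.
+   - The backend maps it to the final `path` using a static route catalog.
+   - Goal: reduce the fixed token overhead of the system prompt on every call.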

-1. Tool schema slimming for calendar write path.
-   - Split `calendar_write` into focused tools (`calendar_create`, `calendar_update`, `calendar_delete`).
-   - Reduce redundant/verbose field descriptions where possible.
-2. Dynamic tool set exposure by routed intent.
-   - Only expose tools needed for current task.
-3. Evaluate finalize overhead.
-   - Verify whether finalize call can be reduced or replaced in specific flows.
-4. Add before/after benchmark script.
-   - Compare worker `input_tokens`, `latency_ms`, and `cost` for the same scripted multi-turn scenario.
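+2. Minimal-context finalize: stop finalize from replaying the full memory.
+   - The finalize stage takes only: the last-round candidate answer + a summary of the necessary tool results + the schema instructions.
+   - The full conversation history is no longer injected.
+   - Goal: reduce the extra input cost of the two-stage structured output.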

-## Acceptance Metrics (target)
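+3. On-demand tool exposure (dynamic tool allowlist).
+   - Send only the tools the current task needs, based on the router's task/result typing.
+   - Avoid carrying the full set of tool schemas on every ReAct round.
+   - Goal: reduce the tool-description burden of each reasoning call.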

- Reduce worker input tokens by >= 30% in multi-turn calendar CRUD scenario.
- Reduce worker p95 latency by >= 25%.
- Keep functional behavior unchanged for agent runs.
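+### P1 (next priority; steady payoff)
+
+4. Layered trimming of the system prompt.
+   - Assemble the minimal prompt set by `agent_type` and `ui_mode`.
+   - Router does not carry Worker-only rules; `ui_mode=none` does not carry rich UI details.
+
+5. Output size limits.
+   - Cap the count and text length of `key_points`, `suggested_actions`, and `ui_hints.actions`.
+   - Lowers `output_tokens` and also reduces frontend rendering load.
+
+6. Context strategy optimization (summary + a few recent raw turns).
+   - Move from "a fixed last N turns of raw text" to "structured summary + the last 1~2 turns of raw text".
+   - Keeps token growth in long conversations under control.
+
+### P2 (optional enhancement)
+
+7. Prompt cache hit optimization.
+   - Keep a fixed cacheable prefix and place dynamic segments after it.
+   - Use the provider prompt cache to reduce billed tokens (if the model side supports it).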
+
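+## Not Recommended as the Current Main Line
+
+- Switching directly to ReAct's native `structured_model` as the primary approach (in current tests it has no advantage in stability or cost).
+- Prioritizing a complex rewrite of the ReAct core before the P0 optimizations are done.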
+
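+## Acceptance Metrics (updated)
+
+- In a typical multi-turn scenario, Worker `input_tokens` drops by >= 30%.
+- Worker p95 `latency_ms` drops by >= 20%.
+- The structured output validation success rate is no lower than the current baseline.
+- Critical-path functional behavior stays unchanged (no regression in agent run results or frontend interaction).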
+
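+## Verification
+
+1. Fixed-scenario script comparison (same input before and after optimization):
+   - Metrics: `input_tokens`, `output_tokens`, `latency_ms`, `cost`, structured success rate.
+2. Production observation (`public.messages`):
+   - Aggregate by stage (router/worker) and compare the daily average and p95.
+3. Regression checks:
+   - consistency of tool call results;
+   - renderability of `ui_hints`/`ui_schema` and correctness of navigation actions.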
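The two numeric acceptance thresholds in the updated TODO (Worker `input_tokens` down by at least 30%, p95 `latency_ms` down by at least 20%) could be checked with a small comparison script. The sketch below is illustrative only: the row shape (a list of dicts with `input_tokens` and `latency_ms` keys), the function names, and the nearest-rank percentile method are assumptions, not something the commit specifies.

```python
def p95(values: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # ceil(0.95 * n) in integer arithmetic, avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def reduction(before: float, after: float) -> float:
    """Fractional reduction from before to after (0.30 means 30% lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


def check_acceptance(before: list[dict], after: list[dict]) -> dict:
    """Compare Worker rows from the same scripted scenario, before vs. after."""
    avg_before = sum(r["input_tokens"] for r in before) / len(before)
    avg_after = sum(r["input_tokens"] for r in after) / len(after)
    lat_before = p95([r["latency_ms"] for r in before])
    lat_after = p95([r["latency_ms"] for r in after])
    return {
        # Thresholds taken from the acceptance metrics: >= 30% and >= 20%.
        "input_tokens_ok": reduction(avg_before, avg_after) >= 0.30,
        "latency_p95_ok": reduction(lat_before, lat_after) >= 0.20,
    }
```

The structured-output success rate and the functional regression checks are not covered here; they need scenario-specific assertions rather than aggregate numbers.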