feat: AG-UI protocol alignment and route navigation

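- Frontend: add SSE streaming support, the stateSnapshot event, and a route navigation tool
- Frontend: implement the tool-call approval flow, including display of the pending state
- Backend: refactor Agent state management and session persistence
- Docs: add the agent-agui-full-alignance design document
- Tests: add related unit and integration tests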
2026-03-07 17:30:20 +08:00
parent ec33bb0cee
commit 120df903d2
52 changed files with 4305 additions and 1672 deletions
@@ -0,0 +1,188 @@
+# Agent Module Review Report
+
+**Date**: 2026-03-07
+**Scope**: `backend/src/core/agent`
+**Status**: Pending fixes
+
+---
+
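+## 🔴 HIGH - Blocking Issues
+
+### 1. Synchronous LLM call blocks the async event loop
+
+**File**: `infrastructure/crewai/runtime.py:126`
+
+**Problem**:
+```python
+response = run_completion(...)  # synchronous call
+```
+
+`run_completion` calls `litellm.completion()`, which is synchronous, while `RunService.run()` is an async method. The call therefore blocks the entire event loop until the LLM responds, so no other coroutine can make progress. Under high concurrency this severely degrades performance.
+
+**Recommendation**: Use `litellm.acompletion()`, or offload the call with `asyncio.to_thread()`.
+
+**Affected files**:
+- `infrastructure/litellm/client.py` - needs an async variant
+- `infrastructure/crewai/runtime.py` - `_run_stage()` must become async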
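A minimal sketch of the `asyncio.to_thread()` option. `blocking_completion`, `run_stage`, and `heartbeat` are hypothetical stand-ins and not the project's actual code. The blocking call is simulated with `time.sleep()` so the example runs without any LLM dependency.

```python
import asyncio
import time


def blocking_completion(prompt: str) -> str:
    """Hypothetical stand-in for the synchronous run_completion() call."""
    time.sleep(0.2)  # simulates a slow, blocking network request
    return f"echo: {prompt}"


async def run_stage(prompt: str) -> str:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(blocking_completion, prompt)


async def heartbeat(ticks: list[float]) -> None:
    """Record timestamps to show that the loop keeps scheduling other tasks."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def main() -> tuple[str, int]:
    ticks: list[float] = []
    # Both coroutines make progress concurrently; the heartbeat is not starved.
    result, _ = await asyncio.gather(run_stage("hi"), heartbeat(ticks))
    return result, len(ticks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

`asyncio.to_thread()` is the smaller change because the synchronous client can stay as it is. Switching to `litellm.acompletion()` avoids occupying a worker thread per request, at the cost of touching the client module as well.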
+
+---
+
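+## 🟡 MEDIUM - Should Fix
+
+### 2. Missing input length validation
+
+**File**: `application/run_service.py:63`
+
+**Problem**:
+```python
+async def run(self, *, session_id: str, user_input: str) -> dict[str, object]:
+```
+
+`user_input` has no length limit, so a malicious user can submit an oversized input that consumes tokens and resources.
+
+**Recommendation**: Enforce a maximum length (for example, 10,000 characters).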
+
+```python
+MAX_USER_INPUT_LENGTH = 10000
+
+if len(user_input) > MAX_USER_INPUT_LENGTH:
+    raise ValueError(f"user_input exceeds maximum length of {MAX_USER_INPUT_LENGTH}")
+```
+
+---
+
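+### 3. LLM call has no timeout
+
+**File**: `infrastructure/crewai/runtime.py:126`
+
+**Problem**: `run_completion` sets no timeout. If the LLM API hangs, the request blocks indefinitely.
+
+**Recommendation**: Add a `timeout` parameter.
+
+```python
+def run_completion(
+    *,
+    model: str,
+    api_key: str,
+    messages: list[dict[str, Any]],
+    temperature: float | None = None,
+    max_tokens: int | None = None,
+    timeout: float | None = None,  # new: seconds to wait before giving up
+) -> Any:
+    ...  # existing code that builds `kwargs`
+    if timeout is not None:
+        kwargs["timeout"] = timeout  # forwarded to litellm.completion()
+    ...
+```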
+
+---
+
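+### 4. Hardcoded tool result
+
+**File**: `application/resume_service.py:52`
+
+**Problem**:
+```python
+content='{"status":"ok"}',
+```
+
+The tool execution result is hardcoded as `{"status":"ok"}`. This looks like placeholder code, and the actual result of running the tool is never used.
+
+**Recommendation**: Implement real tool execution, or explicitly mark this as not yet implemented.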
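One possible shape for real tool execution is a registry that maps tool names to callables and serializes whatever the tool returns. This is a sketch under assumptions: `TOOL_REGISTRY`, `execute_tool`, and the `navigate` tool are hypothetical names and do not come from the reviewed code.

```python
import json
from typing import Any, Callable

ToolFn = Callable[[dict[str, Any]], dict[str, Any]]

# Hypothetical registry; a real project would register its own tools here.
TOOL_REGISTRY: dict[str, ToolFn] = {
    "navigate": lambda args: {"status": "ok", "route": args["route"]},
}


def execute_tool(name: str, arguments: dict[str, Any]) -> str:
    """Run the named tool and return its result as a JSON string."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return json.dumps({"status": "error", "reason": f"unknown tool: {name}"})
    try:
        return json.dumps(tool(arguments))
    except Exception as exc:  # noqa: BLE001 - report the failure to the model
        return json.dumps({"status": "error", "reason": type(exc).__name__})
```

The returned string would take the place of the hardcoded `content` value, so the model sees the tool's actual outcome, including failures.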
+
+---
+
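+### 5. Cache write failures are silently swallowed
+
+**File**: `infrastructure/persistence/user_context_cache.py:95-96`
+
+**Problem**:
+```python
+async def set(self, *, session_id: UUID, context: UserAgentContext) -> None:
+    ...
+    except Exception:
+        return None
+```
+
+When `set()` fails it silently returns `None`, so the caller cannot tell whether the write succeeded. This can make cache-miss problems hard to diagnose.
+
+**Recommendation**: Log the failure or raise the exception.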
+
+```python
+except Exception as exc:
+    logger.warning("Failed to cache user context", session_id=str(session_id), error=str(exc))
+    return None
+```
+
+---
+
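+## 🟢 LOW - Suggested Improvements
+
+### 6. Redis Stream response format is not validated
+
+**File**: `infrastructure/events/redis_stream.py:62`
+
+**Problem**:
+```python
+_, entries = response[0]
+```
+
+The code assumes the response is well-formed. An empty response raises `IndexError`, and a first element that is not a two-item pair fails during unpacking.
+
+**Recommendation**: Add defensive checks.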
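A sketch of such a check, assuming the reply has the shape implied by the snippet above: a sequence whose first element is a `(stream_name, entries)` pair, where each entry is an `(entry_id, fields)` pair. `extract_entries` is a hypothetical helper name.

```python
from typing import Any


def extract_entries(response: Any) -> list[tuple[Any, Any]]:
    """Return the entries of the first stream, or [] if the shape is unexpected."""
    if not isinstance(response, (list, tuple)) or not response:
        return []
    first = response[0]
    if not isinstance(first, (list, tuple)) or len(first) != 2:
        return []
    _, entries = first
    if not isinstance(entries, (list, tuple)):
        return []
    # Keep only well-formed (entry_id, fields) pairs.
    return [
        (entry[0], entry[1])
        for entry in entries
        if isinstance(entry, (list, tuple)) and len(entry) == 2
    ]
```

Returning an empty list treats a malformed reply like "no new entries". Logging the unexpected shape before returning would make such cases easier to notice.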
+
+---
+
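+### 7. Path restriction does not allow subdirectories
+
+**File**: `infrastructure/crewai/loader.py:47`
+
+**Problem**:
+```python
+if resolved.parent != base_dir:
+```
+
+Only files located directly in `base_dir` are allowed, which may get in the way if templates are later organized into subdirectories.
+
+**Recommendation**: Check instead that the path lies anywhere under `base_dir`, so that subdirectories are allowed.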
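A minimal sketch of a containment check that accepts subdirectories, using `Path.is_relative_to()` (Python 3.9+). `resolve_within_base` is a hypothetical name and is not the project's `_resolve_allowed_path()`.

```python
from pathlib import Path


def resolve_within_base(base_dir: Path, candidate: str) -> Path:
    """Resolve candidate under base_dir and reject anything that escapes it."""
    base = base_dir.resolve()
    # Resolving first collapses ".." segments and follows symlinks,
    # so traversal attempts are caught by the containment check below.
    resolved = (base / candidate).resolve()
    if not resolved.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {candidate}")
    return resolved
```

Both paths are resolved before the comparison. Comparing unresolved paths would let `..` segments or symlinks slip past the check.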
+
+---
+
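+### 8. Exception details are lost
+
+**File**: `infrastructure/queue/tasks.py:112`
+
+**Problem**:
+```python
+except Exception:  # noqa: BLE001
+    error_id = "agent_runtime_failed"
+    logger.exception(...)
+```
+
+All exceptions are caught but identified only by a fixed `error_id`, so the specific exception type is lost and troubleshooting becomes harder.
+
+**Recommendation**: Record the exception type in the log.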
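A sketch of recording the exception type as a structured field, written against the standard-library `logging` module. The project's logger may use a different API, and `run_task` and the field names are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("agent.tasks")


def run_task(task: Callable[[], None]) -> dict[str, str]:
    """Run a task; on failure, log the exception type alongside the error_id."""
    try:
        task()
    except Exception as exc:  # noqa: BLE001
        error_id = "agent_runtime_failed"
        error_type = type(exc).__name__
        # logger.exception() attaches the traceback; `extra` adds fields
        # that structured log handlers can index and filter on.
        logger.exception(
            "agent task failed",
            extra={"error_id": error_id, "error_type": error_type},
        )
        return {"status": "failed", "error_id": error_id, "error_type": error_type}
    return {"status": "ok"}
```

Keeping `error_id` stable while adding `error_type` preserves any existing alerting keyed on the id and still makes failures distinguishable.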
+
+---
+
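+## ✅ Good Practices
+
+The following design choices deserve recognition:
+
+- **Clear DDD layering**: domain / application / infrastructure have distinct responsibilities
+- **Repositories do not commit**: the Service controls transaction boundaries
+- **Concurrency control**: `FOR UPDATE` locks are used to prevent concurrency issues
+- **Sensitive field redaction**: `agui/bridge.py` implements `_redact_sensitive()`
+- **Path traversal protection**: `loader.py` uses `_resolve_allowed_path()`
+- **Protocol abstraction**: dependencies are decoupled via `Protocol`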
+
+---
+
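+## Suggested Fix Priority
+
+| Priority | Issue | Estimated effort |
+|----------|-------|------------------|
+| P0 | Synchronous LLM call blocks the event loop | 2h |
+| P1 | Input length validation | 0.5h |
+| P1 | LLM timeout control | 1h |
+| P2 | Hardcoded tool result | TBD |
+| P2 | Cache exception handling | 0.5h |
+| P3 | Remaining LOW issues | 1h |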