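zl-q 120df903d2 feat: AG-UI protocol alignment and route navigation

- Frontend: add SSE streaming support, the stateSnapshot event, and a route navigation tool
- Frontend: implement the tool-call approval flow, including display of the pending state
- Backend: refactor Agent state management and session persistence
- Docs: add the agent-agui-full-alignance design document
- Tests: add the related unit and integration tests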

2026-03-07 17:30:20 +08:00

4.4 KiB


Agent Module Review Report

Date: 2026-03-07 Scope: backend/src/core/agent Status: pending fixes

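🔴 HIGH - Blocking issues

1. Synchronous LLM call blocks the async event loop

File: infrastructure/crewai/runtime.py:126

Problem:

response = run_completion(...)  # synchronous call

run_completion calls litellm.completion(), which is synchronous, while RunService.run() is an async method. The call therefore blocks the entire event loop and severely degrades throughput under high concurrency.

Recommendation: use litellm.acompletion() or asyncio.to_thread().

Affected files:

infrastructure/litellm/client.py - needs an async variant
infrastructure/crewai/runtime.py - _run_stage() needs to become async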
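The asyncio.to_thread() option can be sketched as follows. This is a minimal illustration, not the project's code: blocking_completion is a hypothetical stand-in for the synchronous run_completion(), and run_stage only mimics the shape of _run_stage().

```python
import asyncio
import time


def blocking_completion(prompt: str) -> str:
    """Hypothetical stand-in for a synchronous LLM call such as run_completion()."""
    time.sleep(0.2)  # simulates network latency
    return f"echo: {prompt}"


async def run_stage(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_completion, prompt)


async def main() -> tuple[list[str], float]:
    started = time.perf_counter()
    # Three calls overlap instead of running back to back.
    results = await asyncio.gather(*(run_stage(f"q{i}") for i in range(3)))
    return list(results), time.perf_counter() - started


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

With the blocking call moved to a thread, the three requests take roughly as long as one call rather than three. A native async client call would avoid the thread pool entirely, so to_thread is probably best seen as the smaller of the two changes.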

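🟡 MEDIUM - Should be fixed

2. Missing input length validation

File: application/run_service.py:63

Problem:

async def run(self, *, session_id: str, user_input: str) -> dict[str, object]:

user_input has no length limit, so a malicious user can send oversized input that consumes tokens and resources.

Recommendation: enforce a maximum length (for example, 10000 characters).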

MAX_USER_INPUT_LENGTH = 10000

if len(user_input) > MAX_USER_INPUT_LENGTH:
    raise ValueError(f"user_input exceeds maximum length of {MAX_USER_INPUT_LENGTH}")

3. LLM call has no timeout

File: infrastructure/crewai/runtime.py:126

Problem: run_completion sets no timeout, so if the LLM API hangs, the request blocks indefinitely.

Recommendation: add a timeout parameter.

def run_completion(
    *,
    model: str,
    api_key: str,
    messages: list[dict[str, Any]],
    temperature: float | None = None,
    max_tokens: int | None = None,
    timeout: float | None = None,  # new parameter
) -> Any:
    kwargs["timeout"] = timeout
    ...
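Once the call path is async, a caller-side deadline can complement the client-level timeout. The sketch below uses asyncio.wait_for, which is a different mechanism from the timeout parameter recommended above; slow_completion and complete_with_timeout are hypothetical names used only for illustration.

```python
import asyncio


async def slow_completion(delay: float) -> str:
    """Hypothetical stand-in for an async LLM call that may hang."""
    await asyncio.sleep(delay)
    return "done"


async def complete_with_timeout(delay: float, timeout: float) -> str:
    try:
        # Cancel the awaited call if it does not finish within `timeout` seconds.
        return await asyncio.wait_for(slow_completion(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # Surface a clear error instead of letting the request hang.
        raise TimeoutError(f"LLM call exceeded {timeout}s") from None
```

Note that wait_for only cancels awaitables. If the synchronous call is run in a worker thread, the thread keeps running after the deadline, so the client-level timeout is still needed in that case.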

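4. Hardcoded tool result

File: application/resume_service.py:52

Problem:

content='{"status":"ok"}',

The tool result is hardcoded as {"status":"ok"}. This looks like placeholder code, and the actual tool execution result is never used.

Recommendation: implement real tool execution, or mark the code explicitly as not yet implemented.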
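One possible shape for real tool execution is a small dispatch function that produces the content string from an actual handler result. Everything here is an assumption for illustration: the TOOL_HANDLERS registry, the "navigate" tool, and the result envelope are hypothetical and do not come from the reviewed code.

```python
import json
from typing import Any, Callable

# Hypothetical registry mapping tool names to handlers; not from the reviewed code.
TOOL_HANDLERS: dict[str, Callable[[dict[str, Any]], Any]] = {
    "navigate": lambda args: {"navigated_to": args["path"]},
}


def execute_tool(name: str, arguments: dict[str, Any]) -> str:
    """Return the JSON content to store as the tool message."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"status": "error", "error": f"unknown tool: {name}"})
    try:
        result = handler(arguments)
    except Exception as exc:  # report the failure instead of crashing the resume flow
        return json.dumps({"status": "error", "error": type(exc).__name__})
    return json.dumps({"status": "ok", "result": result})
```

If tool results are instead supplied by the frontend after approval, the same envelope could wrap the submitted payload rather than a locally computed result.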

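5. Cache write errors fail silently

File: infrastructure/persistence/user_context_cache.py:95-96

Problem:

async def set(self, *, session_id: UUID, context: UserAgentContext) -> None:
    ...
    except Exception:
        return None

When set() fails it silently returns None, so callers cannot tell whether the cache write succeeded. Stale or missing cache entries then become hard to diagnose.

Recommendation: log the failure or raise an exception.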

except Exception as exc:
    logger.warning("Failed to cache user context", session_id=str(session_id), error=str(exc))
    return None

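🟢 LOW - Suggested improvements

6. Redis Stream response format is not validated

File: infrastructure/events/redis_stream.py:62

Problem:

_, entries = response[0]

The code assumes response is well formed; an empty or malformed response raises an exception such as IndexError.

Recommendation: add defensive checks.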
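A defensive check might look like the sketch below. It assumes the response has the usual stream-read shape, a list of [stream_name, entries] pairs where each entry is an (id, fields) pair; extract_entries is a hypothetical helper name.

```python
from typing import Any


def extract_entries(response: Any) -> list[tuple[Any, Any]]:
    """Return the entries of the first stream, or [] if the shape is unexpected."""
    if not isinstance(response, (list, tuple)) or not response:
        return []  # None or empty, for example after a blocking read timed out
    first = response[0]
    if not isinstance(first, (list, tuple)) or len(first) != 2:
        return []  # not a (stream_name, entries) pair
    _, entries = first
    if not isinstance(entries, (list, tuple)):
        return []
    # Keep only well-formed (entry_id, fields) pairs.
    return [tuple(e) for e in entries if isinstance(e, (list, tuple)) and len(e) == 2]
```

Returning an empty list treats a malformed response like "no new events". Logging the unexpected shape before returning would make such cases visible.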

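7. Path restriction does not allow subdirectories

File: infrastructure/crewai/loader.py:47

Problem:

if resolved.parent != base_dir:

Only files located directly in base_dir are accepted, which may get in the way if templates are later organized into subdirectories.

Recommendation: check instead that the path lies under base_dir (allowing subdirectories).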
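A containment check that permits subdirectories can be sketched with pathlib. The helper name is_within is hypothetical, and Path.is_relative_to requires Python 3.9 or later.

```python
from pathlib import Path


def is_within(base_dir: Path, candidate: Path) -> bool:
    """True if candidate resolves to a location inside base_dir, at any depth."""
    base = base_dir.resolve()
    # Resolve first so that ".." segments and symlinks cannot escape base_dir.
    resolved = (base / candidate).resolve()
    return resolved != base and resolved.is_relative_to(base)


base = Path("/srv/templates")
print(is_within(base, Path("sub/a.yaml")))       # inside a subdirectory
print(is_within(base, Path("../secrets.yaml")))  # escapes base_dir
```

Resolving both paths before comparing is what keeps the traversal protection intact while relaxing the "direct child only" rule.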

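8. Exception details are lost

File: infrastructure/queue/tasks.py:112

Problem:

except Exception:  # noqa: BLE001
    error_id = "agent_runtime_failed"
    logger.exception(...)

All exceptions are caught but identified only by error_id, so the concrete exception type is lost and failures are hard to diagnose.

Recommendation: record the exception type in the log.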
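Recording the type could look like the sketch below, which uses the standard logging module; the project's logger may be a different library, and run_task is a hypothetical wrapper around the failing code.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("agent.tasks")


def run_task(task: Callable[[], None]) -> Optional[str]:
    """Run task(); on failure, log the exception type and return an error_id."""
    try:
        task()
    except Exception as exc:  # noqa: BLE001
        error_id = "agent_runtime_failed"
        # logger.exception appends the traceback; the explicit error_type
        # value makes the failure class easy to search and aggregate.
        logger.exception(
            "agent task failed: error_id=%s error_type=%s",
            error_id,
            type(exc).__name__,
        )
        return error_id
    return None
```

Keeping a single error_id for clients while logging the type separately preserves the stable external identifier and still gives operators the detail they need.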

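✅ Good practices

The following design choices deserve credit:

Clear DDD layering: domain / application / infrastructure have well-separated responsibilities
Repositories do not commit: the Service layer controls transaction boundaries
Concurrency control: FOR UPDATE locks prevent concurrent-update problems
Sensitive field redaction: agui/bridge.py implements _redact_sensitive()
Path traversal protection: loader.py uses _resolve_allowed_path()
Protocol abstraction: Protocol is used to decouple dependencies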

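Suggested fix priority

Priority	Issue	Estimated effort
P0	Synchronous LLM call blocks the event loop	2h
P1	Input length validation	0.5h
P1	LLM timeout control	1h
P2	Hardcoded tool result	TBD
P2	Cache exception handling	0.5h
P3	Other LOW issues	1h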
