# Agent Interrupt/Resume: Design for Fixing Outstanding Issues

## 1. Goals

This fix resolves the following three outstanding issues in a single change:

1. `state_snapshot` concurrency consistency (races between concurrent resume requests)
2. `expires_at` expiry is not strictly enforced
3. `state_snapshot` lacks strong typing and versioning

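## 2. Design Decisions

We adopt Option 2 (strict refactor):

- `state_snapshot` accepts only the new structure; the legacy structure is no longer supported
- The snapshot version is unified as `version = 2`
- A strongly typed model constrains the snapshot structure and its state transitions
- The resume entry point introduces row-level lock semantics to prevent concurrent double writes
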
## 3. State Snapshot Model

Top-level structure of `state_snapshot`:

```json
{
  "version": 2,
  "pending_tool_call": {
    "interrupt_id": "int-1",
    "tool_name": "srv.transfer_funds",
    "tool_args": {"to": "u2", "amount": 100},
    "status": "PENDING_APPROVAL",
    "expires_at": "2026-03-03T12:00:00Z",
    "decision": null,
    "result": null,
    "updated_at": "2026-03-03T11:59:00Z"
  },
  "run_context": {
    "thread_id": "t1",
    "run_id": "r1"
  }
}
```

Notes:

- `version` must be 2; any other value is rejected
- If the `pending_tool_call` field is missing or has the wrong type, the snapshot is treated as invalid
- `run_context` keeps only the fields required for interrupt/resume

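As an illustration of what "strongly typed and versioned" could look like, here is a minimal sketch using only the standard library. The names `parse_snapshot`, `InvalidSnapshotError`, `StateSnapshot`, `PendingToolCall`, and `RunContext` are hypothetical and not taken from the codebase; the design does not specify the shape of `decision` and `result`, so they are kept opaque here. A project that already uses a validation library could express the same constraints with it.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any

SNAPSHOT_VERSION = 2  # the only version accepted by the new design


class InvalidSnapshotError(ValueError):
    """Hypothetical error type; a route handler could map it to HTTP 422."""


class ToolCallStatus(str, Enum):
    PENDING_APPROVAL = "PENDING_APPROVAL"
    APPROVED_EXECUTING = "APPROVED_EXECUTING"
    EXECUTED = "EXECUTED"
    REJECTED = "REJECTED"
    EXPIRED = "EXPIRED"


@dataclass(frozen=True)
class PendingToolCall:
    interrupt_id: str
    tool_name: str
    tool_args: dict[str, Any]
    status: ToolCallStatus
    expires_at: datetime
    updated_at: datetime
    decision: Any = None  # shape not specified in the design; kept opaque
    result: Any = None  # shape not specified in the design; kept opaque


@dataclass(frozen=True)
class RunContext:
    thread_id: str
    run_id: str


@dataclass(frozen=True)
class StateSnapshot:
    version: int
    pending_tool_call: PendingToolCall
    run_context: RunContext


def _text(obj: dict[str, Any], key: str) -> str:
    value = obj.get(key)
    if not isinstance(value, str):
        raise InvalidSnapshotError(f"{key} must be a string")
    return value


def _timestamp(obj: dict[str, Any], key: str) -> datetime:
    text = _text(obj, key)
    try:
        # Replace "Z" because datetime.fromisoformat rejects it before Python 3.11.
        parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    except ValueError as exc:
        raise InvalidSnapshotError(f"{key} is not an ISO 8601 timestamp") from exc
    if parsed.tzinfo is None:
        raise InvalidSnapshotError(f"{key} must include a UTC offset")
    return parsed


def parse_snapshot(raw: object) -> StateSnapshot:
    """Validate an untrusted snapshot and return the typed model."""
    if not isinstance(raw, dict):
        raise InvalidSnapshotError("snapshot must be an object")
    version = raw.get("version")
    if type(version) is not int or version != SNAPSHOT_VERSION:
        raise InvalidSnapshotError(f"unsupported snapshot version: {version!r}")
    call, ctx = raw.get("pending_tool_call"), raw.get("run_context")
    if not isinstance(call, dict) or not isinstance(ctx, dict):
        raise InvalidSnapshotError("pending_tool_call and run_context must be objects")
    tool_args = call.get("tool_args")
    if not isinstance(tool_args, dict):
        raise InvalidSnapshotError("tool_args must be an object")
    status = _text(call, "status")
    if status not in {member.value for member in ToolCallStatus}:
        raise InvalidSnapshotError(f"unknown status: {status}")
    return StateSnapshot(
        version=version,
        pending_tool_call=PendingToolCall(
            interrupt_id=_text(call, "interrupt_id"),
            tool_name=_text(call, "tool_name"),
            tool_args=tool_args,
            status=ToolCallStatus(status),
            expires_at=_timestamp(call, "expires_at"),
            updated_at=_timestamp(call, "updated_at"),
            decision=call.get("decision"),
            result=call.get("result"),
        ),
        run_context=RunContext(_text(ctx, "thread_id"), _text(ctx, "run_id")),
    )
```

Raising a single error type for every structural problem keeps the caller's mapping to `422` trivial, and there is deliberately no fallback path for legacy snapshots.
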
## 4. State Machine Constraints

Only the following transitions are allowed:

- `PENDING_APPROVAL -> APPROVED_EXECUTING -> EXECUTED`
- `PENDING_APPROVAL -> REJECTED`
- `PENDING_APPROVAL -> EXPIRED`

An illegal state transition must return an error; no implicit repair is performed.

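The allowed transitions can be encoded as a lookup table so that anything not listed fails loudly. The following is a sketch under that assumption; `ALLOWED_TRANSITIONS`, `transition`, and `IllegalTransitionError` are illustrative names, not identifiers from the codebase.

```python
from __future__ import annotations

from enum import Enum


class ToolCallStatus(str, Enum):
    PENDING_APPROVAL = "PENDING_APPROVAL"
    APPROVED_EXECUTING = "APPROVED_EXECUTING"
    EXECUTED = "EXECUTED"
    REJECTED = "REJECTED"
    EXPIRED = "EXPIRED"


class IllegalTransitionError(Exception):
    """Hypothetical error raised for any transition outside the table."""


S = ToolCallStatus

# Every status appears as a key; terminal states map to an empty set.
ALLOWED_TRANSITIONS: dict[ToolCallStatus, frozenset[ToolCallStatus]] = {
    S.PENDING_APPROVAL: frozenset({S.APPROVED_EXECUTING, S.REJECTED, S.EXPIRED}),
    S.APPROVED_EXECUTING: frozenset({S.EXECUTED}),
    S.EXECUTED: frozenset(),
    S.REJECTED: frozenset(),
    S.EXPIRED: frozenset(),
}


def transition(current: ToolCallStatus, target: ToolCallStatus) -> ToolCallStatus:
    """Return the new status, or raise; never silently repair a bad state."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise IllegalTransitionError(f"{current.value} -> {target.value} is not allowed")
    return target
```

Note that `PENDING_APPROVAL -> EXECUTED` is rejected by this table: execution must pass through `APPROVED_EXECUTING`.
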
## 5. Concurrency and Expiry Semantics

- Before resuming, lock the target session first, then read the snapshot
- For concurrent resume requests with the same `interrupt_id`, only one may succeed
- If `expires_at < now(UTC)`, first transition the call to `EXPIRED`, then return 410

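The ordering of these rules (lock, read, check, write) can be sketched in memory as follows. This is only a model: a `threading.Lock` stands in for the row-level lock, which in a relational database would typically be taken with something like `SELECT ... FOR UPDATE` inside a transaction. `Session`, `resume`, and `ResumeError` are hypothetical names, the function handles only the approve path, and answering 410 again for an already expired call is an assumption not stated in the design.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass, field
from datetime import datetime, timezone


class ResumeError(Exception):
    """Hypothetical error carrying the HTTP status the route would return."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


@dataclass
class Session:
    interrupt_id: str
    status: str
    expires_at: datetime
    # In-memory stand-in for a row-level database lock.
    lock: threading.Lock = field(default_factory=threading.Lock)


def resume(session: Session, interrupt_id: str, now: datetime | None = None) -> str:
    """Approve the pending call; at most one concurrent caller can succeed."""
    now = now or datetime.now(timezone.utc)
    with session.lock:  # lock first, only then read the snapshot
        if session.interrupt_id != interrupt_id:
            raise ResumeError(409, "interrupt does not match the pending call")
        if session.status == "EXPIRED":
            # Assumption: repeated requests after expiry keep answering 410.
            raise ResumeError(410, "pending call has expired")
        if session.status != "PENDING_APPROVAL":
            raise ResumeError(409, "pending call was already consumed")
        if session.expires_at < now:
            session.status = "EXPIRED"  # record the transition before answering
            raise ResumeError(410, "pending call has expired")
        session.status = "APPROVED_EXECUTING"
        return session.status
```

Because the status check and the status write happen under the same lock, a second request always observes `APPROVED_EXECUTING` and receives 409 instead of executing the tool twice.
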
## 6. Error Semantics (RFC 7807)

- `409 Conflict`: the run/interrupt does not match, or the state was already consumed because of a concurrent conflict
- `410 Gone`: the pending call has expired
- `422 Unprocessable Entity`: `state_snapshot` is invalid or its version does not match
- `404 Not Found`: the target session/run does not exist

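For reference, an RFC 7807 response body is a JSON object with the members `type`, `title`, `status`, `detail`, and optionally `instance`, served as `application/problem+json`. The helper below is a sketch of how the four cases above might be rendered; the function name `problem` and the example `detail` strings are assumptions, not the project's actual API.

```python
from __future__ import annotations

import json

PROBLEM_CONTENT_TYPE = "application/problem+json"

# Titles for the four statuses used by the resume endpoint.
TITLES = {
    404: "Not Found",
    409: "Conflict",
    410: "Gone",
    422: "Unprocessable Entity",
}


def problem(status: int, detail: str, instance: str | None = None) -> dict[str, object]:
    """Build an RFC 7807 problem-details body for one of the known statuses."""
    if status not in TITLES:
        raise ValueError(f"unsupported status for this endpoint: {status}")
    body: dict[str, object] = {
        "type": "about:blank",  # RFC 7807 default when no specific type URI exists
        "title": TITLES[status],
        "status": status,
        "detail": detail,
    }
    if instance is not None:
        body["instance"] = instance
    return body


if __name__ == "__main__":
    print(json.dumps(problem(410, "pending tool call has expired"), indent=2))
```

With `type` left as `about:blank`, the `title` should simply repeat the HTTP status phrase, which is why a fixed table is enough here.
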
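## 7. Testing Strategy

Follow TDD: write failing tests first, then implement:

- Snapshot version validation (`version != 2`)
- Snapshot structure validation (required fields and types)
- Concurrent resume idempotency race (only one request succeeds)
- Expiry validation (returns 410 and sets the status to `EXPIRED`)
- Coverage of every legal state transition path
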
## 8. Verification Commands

- `uv run pytest backend/tests/unit/v1/agent -v`
- `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py -v`
- `uv run pytest backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
- `cd backend && uv run ruff check src/v1/agent`
- `cd backend && uv run basedpyright src/v1/agent`

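## 9. Risks and Rollback

- Risk: legacy snapshots are no longer compatible and may be rejected at runtime
- Mitigation: surface non-compliant data through an explicit 422 error, use the logs to locate it, and repair the data manually
- Rollback: revert this change and restore the legacy snapshot parsing logic (only in case of an emergency failure)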