fix(agent): polish interrupt-resume flow for merge readiness

qzl
2026-03-03 17:26:04 +08:00
parent 7be8669144
commit 30a4a1af5d
16 changed files with 1179 additions and 85 deletions
@@ -0,0 +1,95 @@
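# Agent Interrupt/Resume: Design for Fixing Outstanding Issues
## 1. Goals
This change fixes the following three outstanding issues in one pass:
1. `state_snapshot` concurrency consistency (races between concurrent resume requests)
2. `expires_at` expiry is not strictly enforced
3. `state_snapshot` lacks strong typing and versioning
## 2. Design Decisions
Adopt option 2 (strict refactor):
- `state_snapshot` accepts only the new structure; the legacy structure is no longer supported
- The snapshot version is unified as `version = 2`
- Strongly typed models constrain the snapshot structure and state transitions
- The resume entry point introduces row-level lock semantics to prevent concurrent double writes
## 3. State Snapshot Model
Top-level structure of `state_snapshot`: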
```json
{
  "version": 2,
  "pending_tool_call": {
    "interrupt_id": "int-1",
    "tool_name": "srv.transfer_funds",
    "tool_args": {"to": "u2", "amount": 100},
    "status": "PENDING_APPROVAL",
    "expires_at": "2026-03-03T12:00:00Z",
    "decision": null,
    "result": null,
    "updated_at": "2026-03-03T11:59:00Z"
  },
  "run_context": {
    "thread_id": "t1",
    "run_id": "r1"
  }
}
```
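Notes:
- `version` must be 2; any other value is rejected
- If `pending_tool_call` is missing fields or has a type error, the snapshot is treated as invalid
- `run_context` keeps only the fields required for interrupt/resume
## 4. State Machine Constraints
Only the following transitions are allowed:
- `PENDING_APPROVAL -> APPROVED_EXECUTING -> EXECUTED`
- `PENDING_APPROVAL -> REJECTED`
- `PENDING_APPROVAL -> EXPIRED`
An illegal state transition must return an error; no implicit repair is performed.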
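The transition rules above can be sketched as a lookup table. This is a minimal illustration, not the project's implementation: `ALLOWED_TRANSITIONS`, `IllegalTransitionError`, and `transition` are hypothetical names, and only the status values come from this design.

```python
from enum import Enum


class PendingToolStatus(str, Enum):
    PENDING_APPROVAL = "PENDING_APPROVAL"
    APPROVED_EXECUTING = "APPROVED_EXECUTING"
    EXECUTED = "EXECUTED"
    REJECTED = "REJECTED"
    EXPIRED = "EXPIRED"


S = PendingToolStatus

# Allowed transitions, copied from the list above.
# Terminal states map to an empty set, so nothing can leave them.
ALLOWED_TRANSITIONS = {
    S.PENDING_APPROVAL: frozenset({S.APPROVED_EXECUTING, S.REJECTED, S.EXPIRED}),
    S.APPROVED_EXECUTING: frozenset({S.EXECUTED}),
    S.EXECUTED: frozenset(),
    S.REJECTED: frozenset(),
    S.EXPIRED: frozenset(),
}


class IllegalTransitionError(ValueError):
    """Hypothetical error type; the design does not name one."""


def transition(current: PendingToolStatus, target: PendingToolStatus) -> PendingToolStatus:
    """Return the target status, or raise instead of silently repairing."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise IllegalTransitionError(f"{current.value} -> {target.value} is not allowed")
    return target
```

Keeping the table as data makes each legal path easy to enumerate in a parametrized test.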
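## 5. Concurrency and Expiry Semantics
- Before resume, lock the target session first, then read the snapshot
- For concurrent resume requests on the same `interrupt_id`, only one may succeed
- If `expires_at < now(UTC)`, first transition to `EXPIRED`, then return 410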
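The expiry rule above could look like the following. This is a minimal sketch under assumptions: `Pending` and `expire_if_due` are hypothetical stand-ins for the real snapshot model and service method, and the caller is assumed to translate a `True` result into a 410 response.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Pending:
    """Hypothetical, trimmed-down view of a pending tool call."""
    status: str
    expires_at: datetime  # assumed to be timezone-aware (UTC)


def expire_if_due(pending: Pending, now: Optional[datetime] = None) -> bool:
    """Mark the call EXPIRED and return True when its deadline has passed."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Strict comparison, matching `expires_at < now(UTC)` in the rule above.
    if pending.status == "PENDING_APPROVAL" and pending.expires_at < now:
        pending.status = "EXPIRED"
        return True
    return False
```

Passing `now` explicitly keeps the boundary case (deadline equal to the current time) deterministic in tests.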
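## 6. Error Semantics (RFC 7807)
- `409 Conflict`: run/interrupt mismatch, or the state was already consumed because of a concurrent conflict
- `410 Gone`: the pending call has expired
- `422 Unprocessable Entity`: `state_snapshot` is invalid or its version does not match
- `404 Not Found`: the target session/run does not exist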
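One way to render these statuses as RFC 7807 problem details is sketched below. The error codes (`session_not_found` and the others) and the `to_problem` helper are assumptions made for illustration; this design only fixes the status codes and their meaning.

```python
from http import HTTPStatus

# Hypothetical domain error codes mapped to the statuses listed above.
ERROR_STATUS = {
    "session_not_found": 404,
    "interrupt_mismatch": 409,
    "already_consumed": 409,
    "pending_expired": 410,
    "invalid_snapshot": 422,
}


def to_problem(code: str, detail: str) -> dict:
    """Build an RFC 7807 body, meant to be served as application/problem+json."""
    status = ERROR_STATUS[code]
    return {
        "type": "about:blank",  # RFC 7807 default when no specific type URI exists
        "title": HTTPStatus(status).phrase,  # with about:blank, title is the status phrase
        "status": status,
        "detail": detail,
    }
```

For example, `to_problem("pending_expired", "interrupt int-1 expired")` yields a body with `"status": 410`.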
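## 7. Test Strategy
Follow TDD: write failing tests first, then implement:
- Snapshot version validation (`version != 2`)
- Snapshot structure validation (required fields/types)
- Concurrent resume idempotency race (only one succeeds)
- Expiry check (returns 410 and sets the status to EXPIRED)
- Coverage of the legal state transition paths
## 8. Verification Commands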
- `uv run pytest backend/tests/unit/v1/agent -v`
- `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py -v`
- `uv run pytest backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
- `cd backend && uv run ruff check src/v1/agent`
- `cd backend && uv run basedpyright src/v1/agent`
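## 9. Risks and Rollback
- Risk: legacy snapshots are no longer compatible and may be rejected at runtime
- Mitigation: surface non-compliant data through explicit 422 errors, then use logs to locate it and repair the data manually
- Rollback: revert this change and restore the legacy snapshot parsing logic (only for an emergency failure)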
@@ -0,0 +1,377 @@
# Agent Interrupt/Resume Strict Refactor Implementation Plan
> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
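**Goal:** Fix interrupt/resume concurrency safety, expiry enforcement, and strongly typed versioning of state_snapshot in one pass through a strict refactor.
**Architecture:** `state_snapshot v2` is the only valid structure. The service layer uses strongly typed models for parsing and state transitions, and the resume path takes a row lock when reading the session to guarantee consistency under concurrency. The router layer keeps the existing run/resume entry points and reports errors through HTTPException. Tests cover version validation, expiry semantics, concurrent idempotency, and state machine transitions.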
**Tech Stack:** FastAPI, SQLAlchemy AsyncSession, Pydantic v2, pytest
---
### Task 1: Add the strongly typed state_snapshot v2 model
**Files:**
- Modify: `backend/src/v1/agent/schemas.py`
- Test: `backend/tests/unit/v1/agent/test_schemas.py`
**Step 1: Write the failing test**
```python
def test_state_snapshot_v2_model_accepts_valid_payload():
    payload = {
        "version": 2,
        "pending_tool_call": {
            "interrupt_id": "int-1",
            "tool_name": "srv.transfer_funds",
            "tool_args": {"to": "u2", "amount": 100},
            "status": "PENDING_APPROVAL",
            "expires_at": "2026-03-03T12:00:00Z",
            "decision": None,
            "result": None,
            "updated_at": "2026-03-03T11:59:00Z",
        },
        "run_context": {"thread_id": "t1", "run_id": "r1"},
    }
    model = AgentSessionSnapshot.model_validate(payload)
    assert model.version == 2


def test_state_snapshot_v2_rejects_wrong_version():
    payload = {
        "version": 1,
        "pending_tool_call": None,
        "run_context": {"thread_id": "t1", "run_id": "r1"},
    }
    with pytest.raises(ValueError):
        AgentSessionSnapshot.model_validate(payload)
```
**Step 2: Run test to verify it fails**
Run: `uv run pytest backend/tests/unit/v1/agent/test_schemas.py -v`
Expected: FAIL (`AgentSessionSnapshot` is undefined, or validation does not behave as expected)
**Step 3: Write minimal implementation**
```python
from datetime import datetime
from enum import Enum
from typing import Any, Literal

from pydantic import BaseModel


class PendingToolStatus(str, Enum):
    PENDING_APPROVAL = "PENDING_APPROVAL"
    APPROVED_EXECUTING = "APPROVED_EXECUTING"
    EXECUTED = "EXECUTED"
    REJECTED = "REJECTED"
    EXPIRED = "EXPIRED"


class PendingToolCall(BaseModel):
    interrupt_id: str
    tool_name: str
    tool_args: dict[str, Any]
    status: PendingToolStatus
    expires_at: datetime
    decision: dict[str, Any] | None = None
    result: dict[str, Any] | None = None
    updated_at: datetime


class SnapshotRunContext(BaseModel):
    thread_id: str
    run_id: str


class AgentSessionSnapshot(BaseModel):
    # Literal[2] makes any other version fail validation.
    version: Literal[2]
    pending_tool_call: PendingToolCall | None = None
    run_context: SnapshotRunContext
```
**Step 4: Run test to verify it passes**
Run: `uv run pytest backend/tests/unit/v1/agent/test_schemas.py -v`
Expected: PASS
**Step 5: Commit**
```bash
git add backend/src/v1/agent/schemas.py backend/tests/unit/v1/agent/test_schemas.py
git commit -m "refactor(agent): add strict v2 session snapshot schema"
```
---
### Task 2: Switch the service layer to v2 snapshot reads and writes (strictly reject the legacy structure)
**Files:**
- Modify: `backend/src/v1/agent/service.py`
- Test: `backend/tests/unit/v1/agent/test_service_pending_tool_call.py`
**Step 1: Write the failing test**
```python
@pytest.mark.asyncio
async def test_set_pending_tool_call_writes_v2_snapshot(service, session):
    await service.set_pending_tool_call(
        session_id=session.id,
        interrupt_id="int-1",
        tool_name="srv.transfer_funds",
        tool_args={"to": "u2", "amount": 100},
        expires_at=datetime.now(timezone.utc) + timedelta(minutes=5),
        thread_id="t1",
        run_id="r1",
    )
    snapshot = await service.get_state_snapshot(session.id)
    assert snapshot["version"] == 2
    assert snapshot["run_context"]["run_id"] == "r1"


@pytest.mark.asyncio
async def test_invalid_legacy_snapshot_is_rejected(service, session):
    session.state_snapshot = {"pending_tool_call": {"status": "PENDING_APPROVAL"}}
    with pytest.raises(ValueError):
        await service.apply_resume_decision(
            session_id=session.id,
            interrupt_id="int-1",
            decision={"decision": "approved"},
        )
```
**Step 2: Run test to verify it fails**
Run: `uv run pytest backend/tests/unit/v1/agent/test_service_pending_tool_call.py -v`
Expected: FAIL
**Step 3: Write minimal implementation**
```python
def _build_snapshot_v2(...):
    return AgentSessionSnapshot(...).model_dump(mode="json")


def _load_snapshot_v2(raw: dict[str, Any] | None) -> AgentSessionSnapshot:
    if raw is None:
        raise ValueError("state_snapshot missing")
    return AgentSessionSnapshot.model_validate(raw)
```
Also switch `set_pending_tool_call/get_state_snapshot/update_pending_tool_call_status` to read and write through the v2 model.
**Step 4: Run test to verify it passes**
Run: `uv run pytest backend/tests/unit/v1/agent/test_service_pending_tool_call.py -v`
Expected: PASS
**Step 5: Commit**
```bash
git add backend/src/v1/agent/service.py backend/tests/unit/v1/agent/test_service_pending_tool_call.py
git commit -m "refactor(agent): enforce v2 snapshot read write in service"
```
---
### Task 3: Add a row lock and concurrent idempotency to resume
**Files:**
- Modify: `backend/src/v1/agent/service.py`
- Test: `backend/tests/unit/v1/agent/test_resume_idempotency.py`
**Step 1: Write the failing test**
```python
@pytest.mark.asyncio
async def test_apply_resume_decision_uses_locked_session_fetch(service, fake_db, session):
    await service.apply_resume_decision(
        session_id=session.id,
        interrupt_id="int-1",
        decision={"decision": "approved"},
    )
    assert fake_db.last_fetch_with_lock is True


@pytest.mark.asyncio
async def test_resume_is_idempotent(service, session):
    first = await service.apply_resume_decision(...)
    second = await service.apply_resume_decision(...)
    assert first.applied is True
    assert second.applied is False
```
**Step 2: Run test to verify it fails**
Run: `uv run pytest backend/tests/unit/v1/agent/test_resume_idempotency.py -v`
Expected: FAIL
**Step 3: Write minimal implementation**
```python
async def _get_session_for_update(self, session_id: UUID) -> AgentChatSession | None:
    stmt = (
        select(AgentChatSession)
        .where(AgentChatSession.id == session_id)
        .with_for_update()
        .limit(1)
    )
    result = await self._session.execute(stmt)
    return result.scalar_one_or_none()
```
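Change `apply_resume_decision` to read, validate, and transition state inside the lock, so that a decision takes effect exactly once under concurrency.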
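The lock-then-check pattern can be illustrated without a database. In this sketch, which is a simplified stand-in rather than the service code, an `asyncio.Lock` plays the role of the `FOR UPDATE` row lock; `FakeSessionRow`, `race`, and the reduced `apply_resume_decision` signature are hypothetical.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeSessionRow:
    """In-memory stand-in for a session row; the lock emulates SELECT ... FOR UPDATE."""
    status: str = "PENDING_APPROVAL"
    lock: asyncio.Lock = field(default_factory=asyncio.Lock)


async def apply_resume_decision(row: FakeSessionRow, approved: bool) -> bool:
    """Return True only for the request that consumes the pending call."""
    async with row.lock:  # read, validate, and transition all happen under the lock
        if row.status != "PENDING_APPROVAL":
            return False  # already consumed by a competing request
        await asyncio.sleep(0)  # yield so competing tasks get a chance to run
        row.status = "APPROVED_EXECUTING" if approved else "REJECTED"
        return True


async def race(n: int = 5) -> list:
    """Fire n concurrent resume attempts against one row."""
    row = FakeSessionRow()  # created inside the running event loop
    return await asyncio.gather(*(apply_resume_decision(row, True) for _ in range(n)))
```

Running `asyncio.run(race())` produces one `True` and the rest `False`. Without the lock, the `sleep(0)` yield would let several tasks pass the status check before any of them writes.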
**Step 4: Run test to verify it passes**
Run: `uv run pytest backend/tests/unit/v1/agent/test_resume_idempotency.py -v`
Expected: PASS
**Step 5: Commit**
```bash
git add backend/src/v1/agent/service.py backend/tests/unit/v1/agent/test_resume_idempotency.py
git commit -m "fix(agent): add row lock for resume state transition"
```
---
### Task 4: Add expires_at expiry enforcement (including the EXPIRED transition)
**Files:**
- Modify: `backend/src/v1/agent/service.py`
- Test: `backend/tests/unit/v1/agent/test_resume_idempotency.py`
**Step 1: Write the failing test**
```python
@pytest.mark.asyncio
async def test_resume_expired_pending_returns_not_applied_and_marks_expired(service, session):
    await service.set_pending_tool_call(..., expires_at=datetime.now(timezone.utc) - timedelta(seconds=1), thread_id="t1", run_id="r1")
    result = await service.apply_resume_decision(
        session_id=session.id,
        interrupt_id="int-1",
        decision={"decision": "approved"},
    )
    assert result.applied is False
    snapshot = await service.get_state_snapshot(session.id)
    assert snapshot["pending_tool_call"]["status"] == "EXPIRED"
```
**Step 2: Run test to verify it fails**
Run: `uv run pytest backend/tests/unit/v1/agent/test_resume_idempotency.py -v`
Expected: FAIL
**Step 3: Write minimal implementation**
```python
if pending.expires_at < datetime.now(timezone.utc):
    pending.status = PendingToolStatus.EXPIRED
    pending.updated_at = datetime.now(timezone.utc)
    session.state_snapshot = snapshot.model_dump(mode="json")
    return ResumeDecisionResult(applied=False, expired=True)
```
**Step 4: Run test to verify it passes**
Run: `uv run pytest backend/tests/unit/v1/agent/test_resume_idempotency.py -v`
Expected: PASS
**Step 5: Commit**
```bash
git add backend/src/v1/agent/service.py backend/tests/unit/v1/agent/test_resume_idempotency.py
git commit -m "fix(agent): enforce expires_at when applying resume decision"
```
---
### Task 5: Add v2 snapshot handling and expiry/conflict error mapping to the router layer
**Files:**
- Modify: `backend/src/v1/agent/router.py`
- Modify: `backend/src/v1/agent/service.py`
- Test: `backend/tests/integration/v1/agent/test_chat_routes.py`
- Test: `backend/tests/integration/v1/agent/test_interrupt_resume_flow.py`
**Step 1: Write the failing test**
```python
def test_resume_route_returns_409_on_run_id_mismatch(client):
    ...


def test_resume_route_returns_410_when_pending_expired(client):
    ...


def test_resume_route_returns_422_for_legacy_snapshot(client):
    ...
```
**Step 2: Run test to verify it fails**
Run: `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
Expected: FAIL
**Step 3: Write minimal implementation**
In `stream_resume` or the route call chain, map domain errors as follows:
- Expired -> `HTTPException(410)`
- Legacy snapshot/structure error -> `HTTPException(422)`
- State conflict/repeated consumption -> `HTTPException(409)`
**Step 4: Run test to verify it passes**
Run: `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
Expected: PASS
**Step 5: Commit**
```bash
git add backend/src/v1/agent/router.py backend/src/v1/agent/service.py backend/tests/integration/v1/agent/test_chat_routes.py backend/tests/integration/v1/agent/test_interrupt_resume_flow.py
git commit -m "fix(agent): map resume snapshot errors to 409 410 422"
```
---
### Task 6: Update the docs and complete verification
**Files:**
- Modify: `docs/plans/2026-03-03-agent-chat-design.md`
- Modify: `docs/runtime/runtime-route.md`
**Step 1: Update docs**
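- State that `state_snapshot version=2` is the only supported structure
- State the resume expiry and concurrency conflict semantics (410/409)
- State the policy of rejecting legacy snapshots (422)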
**Step 2: Run unit tests**
Run: `uv run pytest backend/tests/unit/v1/agent -v`
Expected: PASS
**Step 3: Run integration tests**
Run: `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
Expected: PASS
**Step 4: Run static checks**
Run: `cd backend && uv run ruff check src/v1/agent`
Expected: PASS
Run: `cd backend && uv run basedpyright src/v1/agent`
Expected: PASS
**Step 5: Commit**
```bash
git add docs/plans/2026-03-03-agent-chat-design.md docs/runtime/runtime-route.md
git commit -m "docs(agent): document strict snapshot v2 and resume error semantics"
```
---
Plan complete and saved to `docs/plans/2026-03-03-interrupt-resume-fixes-implementation-plan.md`.
Execution mode selected by user request: Subagent-Driven (this session), proceed task-by-task immediately.
+50 -24
@@ -788,42 +788,68 @@
## Agent
### POST /agent
### POST /agent/runs
Run an Agent conversation (authentication required).
Create an Agent run (authentication required, SSE response).
**Request:**
**Request (RunAgentInput):**
```json
{
  "message": "string (1-8000 chars)",
  "session_id": "string? (UUID)"
  "threadId": "string",
  "runId": "string",
  "parentRunId": "string?",
  "state": {},
  "messages": [],
  "tools": [],
  "context": [],
  "forwardedProps": {},
  "resume": null
}
```
**Response:** 200 OK
```json
{
  "session_id": "string (UUID)",
  "output": "string",
  "events": [
    {
      "type": "string",
      "run_id": "string?",
      "message_id": "string?",
      "delta": "string?",
      "tool_name": "string?",
      "result": "string?",
      "output": "string?",
      "error": "string?"
    }
  ]
}
```
**Response:** 200 OK (`text/event-stream`)
**Errors:**
- 401: Not authenticated
- 422: Invalid request parameters
### POST /agent/runs/{run_id}/resume
Resume an interrupted run (authentication required, SSE response).
**Request (RunAgentInput):**
```json
{
  "threadId": "string",
  "runId": "string",
  "state": {},
  "messages": [],
  "tools": [],
  "context": [],
  "forwardedProps": {},
  "resume": {
    "interruptId": "string",
    "payload": {}
  }
}
```
**State Snapshot Contract:**
- `state_snapshot` supports only `version = 2`
- The top level must contain `run_context` and `pending_tool_call`
- Legacy formats or missing fields are rejected
**Resume Semantics:**
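- For concurrent resumes on the same `interrupt_id`, only one request is allowed to succeed
- Once `expires_at` has passed, the call is marked `EXPIRED` and resume requests no longer take effect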
**Errors:**
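- 401: Not authenticated
- 404: Session not found
- 409: `run_id`/`interrupt_id` conflict, or the state has already been consumed
- 410: The pending call has expired
- 422: `state_snapshot` is invalid or its version does not match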
---
## Infra