fix(agent): polish interrupt-resume flow for merge readiness

2026-03-03 17:26:04 +08:00
parent 7be8669144
commit 30a4a1af5d
16 changed files with 1179 additions and 85 deletions
@@ -0,0 +1,95 @@
+# Agent Interrupt/Resume 遗留问题修复设计
+
+## 1. 目标
+
+本次修复一次性完成以下三项遗留问题：
+
+1. `state_snapshot` 并发一致性问题（并发 resume 竞争）
+2. `expires_at` 过期未强校验问题
+3. `state_snapshot` 缺少强类型与版本化问题
+
+## 2. 设计决策
+
+采用方案 2（严格重构）：
+
+- `state_snapshot` 仅接受新结构，不再兼容旧结构
+- 统一快照版本为 `version = 2`
+- 使用强类型模型约束快照结构与状态迁移
+- resume 入口引入行级锁语义，避免并发双写
+
+## 3. 状态快照模型
+
+`state_snapshot` 顶层结构：
+
+```json
+{
+  "version": 2,
+  "pending_tool_call": {
+    "interrupt_id": "int-1",
+    "tool_name": "srv.transfer_funds",
+    "tool_args": {"to": "u2", "amount": 100},
+    "status": "PENDING_APPROVAL",
+    "expires_at": "2026-03-03T12:00:00Z",
+    "decision": null,
+    "result": null,
+    "updated_at": "2026-03-03T11:59:00Z"
+  },
+  "run_context": {
+    "thread_id": "t1",
+    "run_id": "r1"
+  }
+}
+```
+
+说明：
+
+- `version` 必须为 2，否则拒绝处理
+- `pending_tool_call` 字段缺失或类型错误，按无效快照处理
+- `run_context` 仅保留 interrupt/resume 必需字段
+
+## 4. 状态机约束
+
+仅允许以下迁移：
+
+- `PENDING_APPROVAL -> APPROVED_EXECUTING -> EXECUTED`
+- `PENDING_APPROVAL -> REJECTED`
+- `PENDING_APPROVAL -> EXPIRED`
+
+非法状态迁移必须返回错误，不做隐式修复。
+
+## 5. 并发与过期语义
+
+- resume 前先对目标 session 加锁再读取快照
+- 同一 `interrupt_id` 并发 resume 只能有一个请求成功
+- 若 `expires_at < now(UTC)`，先迁移为 `EXPIRED`，再返回 410
+
+## 6. 错误语义（RFC7807）
+
+- `409 Conflict`: run/interrupt 不匹配，或并发冲突导致状态已消费
+- `410 Gone`: 挂起调用已过期
+- `422 Unprocessable Entity`: `state_snapshot` 非法或版本不匹配
+- `404 Not Found`: 目标 session/run 不存在
+
+## 7. 测试策略
+
+采用 TDD，先写失败测试后实现：
+
+- 快照版本校验（`version != 2`）
+- 快照结构校验（必填字段/类型）
+- 并发 resume 幂等竞争（仅一个成功）
+- 过期校验（返回 410 + 状态置 EXPIRED）
+- 合法状态迁移路径覆盖
+
+## 8. 验证命令
+
+- `uv run pytest backend/tests/unit/v1/agent -v`
+- `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py -v`
+- `uv run pytest backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
+- `cd backend && uv run ruff check src/v1/agent`
+- `cd backend && uv run basedpyright src/v1/agent`
+
+## 9. 风险与回滚
+
+- 风险：旧快照不再兼容，可能触发运行时拒绝
+- 处置：通过明确 422 错误暴露不合规数据，结合日志定位并人工修复数据
+- 回滚：回退本次变更并恢复旧快照解析逻辑（仅在紧急故障时）
@@ -0,0 +1,377 @@
+# Agent Interrupt/Resume Strict Refactor Implementation Plan
+
+> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
+
+**Goal:** 通过严格重构一次性修复 interrupt/resume 的并发安全、过期校验和 state_snapshot 强类型版本化问题。
+
+**Architecture:** 以 `state_snapshot v2` 为唯一合法结构，服务层使用强类型模型解析与状态迁移，resume 路径在读取会话时加行锁保证并发一致性。路由层维持现有 run/resume 入口，错误通过 HTTPException 输出，测试覆盖版本校验、过期语义、并发幂等和状态机迁移。
+
+**Tech Stack:** FastAPI, SQLAlchemy AsyncSession, Pydantic v2, pytest
+
+---
+
+### Task 1: 新增 state_snapshot v2 强类型模型
+
+**Files:**
+- Modify: `backend/src/v1/agent/schemas.py`
+- Test: `backend/tests/unit/v1/agent/test_schemas.py`
+
+**Step 1: Write the failing test**
+
+```python
+def test_state_snapshot_v2_model_accepts_valid_payload():
+    payload = {
+        "version": 2,
+        "pending_tool_call": {
+            "interrupt_id": "int-1",
+            "tool_name": "srv.transfer_funds",
+            "tool_args": {"to": "u2", "amount": 100},
+            "status": "PENDING_APPROVAL",
+            "expires_at": "2026-03-03T12:00:00Z",
+            "decision": None,
+            "result": None,
+            "updated_at": "2026-03-03T11:59:00Z",
+        },
+        "run_context": {"thread_id": "t1", "run_id": "r1"},
+    }
+    model = AgentSessionSnapshot.model_validate(payload)
+    assert model.version == 2
+
+
+def test_state_snapshot_v2_rejects_wrong_version():
+    payload = {
+        "version": 1,
+        "pending_tool_call": None,
+        "run_context": {"thread_id": "t1", "run_id": "r1"},
+    }
+    with pytest.raises(ValueError):
+        AgentSessionSnapshot.model_validate(payload)
+```
+
+**Step 2: Run test to verify it fails**
+
+Run: `uv run pytest backend/tests/unit/v1/agent/test_schemas.py -v`
+Expected: FAIL（`AgentSessionSnapshot` 未定义或校验不符合预期）
+
+**Step 3: Write minimal implementation**
+
+```python
+class PendingToolStatus(str, Enum):
+    PENDING_APPROVAL = "PENDING_APPROVAL"
+    APPROVED_EXECUTING = "APPROVED_EXECUTING"
+    EXECUTED = "EXECUTED"
+    REJECTED = "REJECTED"
+    EXPIRED = "EXPIRED"
+
+
+class PendingToolCall(BaseModel):
+    interrupt_id: str
+    tool_name: str
+    tool_args: dict[str, Any]
+    status: PendingToolStatus
+    expires_at: datetime
+    decision: dict[str, Any] | None = None
+    result: dict[str, Any] | None = None
+    updated_at: datetime
+
+
+class SnapshotRunContext(BaseModel):
+    thread_id: str
+    run_id: str
+
+
+class AgentSessionSnapshot(BaseModel):
+    version: Literal[2]
+    pending_tool_call: PendingToolCall | None = None
+    run_context: SnapshotRunContext
+```
+
+**Step 4: Run test to verify it passes**
+
+Run: `uv run pytest backend/tests/unit/v1/agent/test_schemas.py -v`
+Expected: PASS
+
+**Step 5: Commit**
+
+```bash
+git add backend/src/v1/agent/schemas.py backend/tests/unit/v1/agent/test_schemas.py
+git commit -m "refactor(agent): add strict v2 session snapshot schema"
+```
+
+---
+
+### Task 2: service 层改为 v2 快照读写（严格拒绝旧结构）
+
+**Files:**
+- Modify: `backend/src/v1/agent/service.py`
+- Test: `backend/tests/unit/v1/agent/test_service_pending_tool_call.py`
+
+**Step 1: Write the failing test**
+
+```python
+@pytest.mark.asyncio
+async def test_set_pending_tool_call_writes_v2_snapshot(service, session):
+    await service.set_pending_tool_call(
+        session_id=session.id,
+        interrupt_id="int-1",
+        tool_name="srv.transfer_funds",
+        tool_args={"to": "u2", "amount": 100},
+        expires_at=datetime.now(timezone.utc) + timedelta(minutes=5),
+        thread_id="t1",
+        run_id="r1",
+    )
+    snapshot = await service.get_state_snapshot(session.id)
+    assert snapshot["version"] == 2
+    assert snapshot["run_context"]["run_id"] == "r1"
+
+
+@pytest.mark.asyncio
+async def test_invalid_legacy_snapshot_is_rejected(service, session):
+    session.state_snapshot = {"pending_tool_call": {"status": "PENDING_APPROVAL"}}
+    with pytest.raises(ValueError):
+        await service.apply_resume_decision(
+            session_id=session.id,
+            interrupt_id="int-1",
+            decision={"decision": "approved"},
+        )
+```
+
+**Step 2: Run test to verify it fails**
+
+Run: `uv run pytest backend/tests/unit/v1/agent/test_service_pending_tool_call.py -v`
+Expected: FAIL
+
+**Step 3: Write minimal implementation**
+
+```python
+def _build_snapshot_v2(...):
+    return AgentSessionSnapshot(...).model_dump(mode="json")
+
+
+def _load_snapshot_v2(raw: dict[str, Any] | None) -> AgentSessionSnapshot:
+    if raw is None:
+        raise ValueError("state_snapshot missing")
+    return AgentSessionSnapshot.model_validate(raw)
+```
+
+并将 `set_pending_tool_call/get_state_snapshot/update_pending_tool_call_status` 全部改成 v2 模型读写。
+
+**Step 4: Run test to verify it passes**
+
+Run: `uv run pytest backend/tests/unit/v1/agent/test_service_pending_tool_call.py -v`
+Expected: PASS
+
+**Step 5: Commit**
+
+```bash
+git add backend/src/v1/agent/service.py backend/tests/unit/v1/agent/test_service_pending_tool_call.py
+git commit -m "refactor(agent): enforce v2 snapshot read write in service"
+```
+
+---
+
+### Task 3: 增加 resume 行锁与并发幂等
+
+**Files:**
+- Modify: `backend/src/v1/agent/service.py`
+- Test: `backend/tests/unit/v1/agent/test_resume_idempotency.py`
+
+**Step 1: Write the failing test**
+
+```python
+@pytest.mark.asyncio
+async def test_apply_resume_decision_uses_locked_session_fetch(service, fake_db, session):
+    await service.apply_resume_decision(
+        session_id=session.id,
+        interrupt_id="int-1",
+        decision={"decision": "approved"},
+    )
+    assert fake_db.last_fetch_with_lock is True
+
+
+@pytest.mark.asyncio
+async def test_resume_is_idempotent(service, session):
+    first = await service.apply_resume_decision(...)
+    second = await service.apply_resume_decision(...)
+    assert first.applied is True
+    assert second.applied is False
+```
+
+**Step 2: Run test to verify it fails**
+
+Run: `uv run pytest backend/tests/unit/v1/agent/test_resume_idempotency.py -v`
+Expected: FAIL
+
+**Step 3: Write minimal implementation**
+
+```python
+async def _get_session_for_update(self, session_id: UUID) -> AgentChatSession | None:
+    stmt = (
+        select(AgentChatSession)
+        .where(AgentChatSession.id == session_id)
+        .with_for_update()
+        .limit(1)
+    )
+    result = await self._session.execute(stmt)
+    return result.scalar_one_or_none()
+```
+
+`apply_resume_decision` 改为锁内读取、校验、状态迁移，保证并发下单次生效。
+
+**Step 4: Run test to verify it passes**
+
+Run: `uv run pytest backend/tests/unit/v1/agent/test_resume_idempotency.py -v`
+Expected: PASS
+
+**Step 5: Commit**
+
+```bash
+git add backend/src/v1/agent/service.py backend/tests/unit/v1/agent/test_resume_idempotency.py
+git commit -m "fix(agent): add row lock for resume state transition"
+```
+
+---
+
+### Task 4: 增加 expires_at 过期校验（含 EXPIRED 迁移）
+
+**Files:**
+- Modify: `backend/src/v1/agent/service.py`
+- Test: `backend/tests/unit/v1/agent/test_resume_idempotency.py`
+
+**Step 1: Write the failing test**
+
+```python
+@pytest.mark.asyncio
+async def test_resume_expired_pending_returns_not_applied_and_marks_expired(service, session):
+    await service.set_pending_tool_call(..., expires_at=datetime.now(timezone.utc) - timedelta(seconds=1), thread_id="t1", run_id="r1")
+    result = await service.apply_resume_decision(
+        session_id=session.id,
+        interrupt_id="int-1",
+        decision={"decision": "approved"},
+    )
+    assert result.applied is False
+    snapshot = await service.get_state_snapshot(session.id)
+    assert snapshot["pending_tool_call"]["status"] == "EXPIRED"
+```
+
+**Step 2: Run test to verify it fails**
+
+Run: `uv run pytest backend/tests/unit/v1/agent/test_resume_idempotency.py -v`
+Expected: FAIL
+
+**Step 3: Write minimal implementation**
+
+```python
+if pending.expires_at < datetime.now(timezone.utc):
+    pending.status = PendingToolStatus.EXPIRED
+    pending.updated_at = datetime.now(timezone.utc)
+    session.state_snapshot = snapshot.model_dump(mode="json")
+    return ResumeDecisionResult(applied=False, expired=True)
+```
+
+**Step 4: Run test to verify it passes**
+
+Run: `uv run pytest backend/tests/unit/v1/agent/test_resume_idempotency.py -v`
+Expected: PASS
+
+**Step 5: Commit**
+
+```bash
+git add backend/src/v1/agent/service.py backend/tests/unit/v1/agent/test_resume_idempotency.py
+git commit -m "fix(agent): enforce expires_at when applying resume decision"
+```
+
+---
+
+### Task 5: 路由层补齐 v2 快照与过期/冲突错误映射
+
+**Files:**
+- Modify: `backend/src/v1/agent/router.py`
+- Modify: `backend/src/v1/agent/service.py`
+- Test: `backend/tests/integration/v1/agent/test_chat_routes.py`
+- Test: `backend/tests/integration/v1/agent/test_interrupt_resume_flow.py`
+
+**Step 1: Write the failing test**
+
+```python
+def test_resume_route_returns_409_on_run_id_mismatch(client):
+    ...
+
+
+def test_resume_route_returns_410_when_pending_expired(client):
+    ...
+
+
+def test_resume_route_returns_422_for_legacy_snapshot(client):
+    ...
+```
+
+**Step 2: Run test to verify it fails**
+
+Run: `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
+Expected: FAIL
+
+**Step 3: Write minimal implementation**
+
+在 `stream_resume` 或路由调用链里将领域错误映射为：
+
+- 过期 -> `HTTPException(410)`
+- 旧快照/结构错误 -> `HTTPException(422)`
+- 状态冲突/重复消费 -> `HTTPException(409)`
+
+**Step 4: Run test to verify it passes**
+
+Run: `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
+Expected: PASS
+
+**Step 5: Commit**
+
+```bash
+git add backend/src/v1/agent/router.py backend/src/v1/agent/service.py backend/tests/integration/v1/agent/test_chat_routes.py backend/tests/integration/v1/agent/test_interrupt_resume_flow.py
+git commit -m "fix(agent): map resume snapshot errors to 409 410 422"
+```
+
+---
+
+### Task 6: 更新文档并完成验证
+
+**Files:**
+- Modify: `docs/plans/2026-03-03-agent-chat-design.md`
+- Modify: `docs/runtime/runtime-route.md`
+
+**Step 1: Update docs**
+
+- 明确 `state_snapshot version=2` 为唯一支持结构
+- 明确 resume 过期与并发冲突语义（410/409）
+- 明确旧快照拒绝策略（422）
+
+**Step 2: Run unit tests**
+
+Run: `uv run pytest backend/tests/unit/v1/agent -v`
+Expected: PASS
+
+**Step 3: Run integration tests**
+
+Run: `uv run pytest backend/tests/integration/v1/agent/test_chat_routes.py backend/tests/integration/v1/agent/test_interrupt_resume_flow.py -v`
+Expected: PASS
+
+**Step 4: Run static checks**
+
+Run: `cd backend && uv run ruff check src/v1/agent`
+Expected: PASS
+
+Run: `cd backend && uv run basedpyright src/v1/agent`
+Expected: PASS
+
+**Step 5: Commit**
+
+```bash
+git add docs/plans/2026-03-03-agent-chat-design.md docs/runtime/runtime-route.md
+git commit -m "docs(agent): document strict snapshot v2 and resume error semantics"
+```
+
+---
+
+Plan complete and saved to `docs/plans/2026-03-03-interrupt-resume-fixes-implementation-plan.md`.
+
+Execution mode selected by user request: Subagent-Driven (this session), proceed task-by-task immediately.