social-app/docs/bugs/2026-03-07-agent-module-review.md

# Agent Module Review Report

**Date**: 2026-03-07
**Scope**: `backend/src/core/agent`
**Status**: Pending fix

---

## 🔴 HIGH - Blocking Issues

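### 1. Synchronous LLM call blocks the async event loop

**File**: `infrastructure/crewai/runtime.py:126`

**Problem**:
```python
response = run_completion(...)  # synchronous call
```

`run_completion` calls `litellm.completion()`, which is synchronous, while `RunService.run()` is an async method. Each LLM call therefore blocks the entire event loop until it returns, so every other coroutine stalls. Under high concurrency this severely degrades performance.

**Recommendation**: Use `litellm.acompletion()`, or offload the call with `asyncio.to_thread()`.

**Affected files**:
- `infrastructure/litellm/client.py` - needs an async variant
- `infrastructure/crewai/runtime.py` - `_run_stage()` needs to become async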
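
A minimal sketch of the `asyncio.to_thread()` option follows. `blocking_completion`, `run_stage`, and `heartbeat` are hypothetical stand-ins, not the real `run_completion` or `_run_stage()`; the point is only that a concurrent task keeps making progress while the blocking call runs in a worker thread.

```python
import asyncio
import time


def blocking_completion(prompt: str) -> str:
    """Hypothetical stand-in for the synchronous run_completion()."""
    time.sleep(0.3)  # simulates a blocking network round trip
    return f"echo: {prompt}"


async def run_stage(prompt: str) -> str:
    # Offload the blocking call to a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_completion, prompt)


async def heartbeat() -> None:
    # Finishes in roughly 0.1 s, but only if the event loop is not blocked.
    for _ in range(5):
        await asyncio.sleep(0.02)


async def demo() -> tuple[str, bool]:
    hb = asyncio.create_task(heartbeat())
    result = await run_stage("hello")
    # If the loop had been blocked, the heartbeat could not have finished yet.
    heartbeat_finished_first = hb.done()
    await hb
    return result, heartbeat_finished_first


result, heartbeat_finished_first = asyncio.run(demo())
print(result, heartbeat_finished_first)
```

`asyncio.to_thread()` is the smaller change because the client code stays synchronous. Switching to `litellm.acompletion()` avoids tying up a worker thread per request, but requires the async variant in `client.py`.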

---

## 🟡 MEDIUM - Needs Fixing

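### 2. Missing input length validation

**File**: `application/run_service.py:63`

**Problem**:
```python
async def run(self, *, session_id: str, user_input: str) -> dict[str, object]:
```

`user_input` has no length limit, so a malicious user can submit an oversized input that burns tokens and server resources.

**Recommendation**: Enforce a maximum length (for example, 10000 characters).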

```python
MAX_USER_INPUT_LENGTH = 10000

if len(user_input) > MAX_USER_INPUT_LENGTH:
    raise ValueError(f"user_input exceeds maximum length of {MAX_USER_INPUT_LENGTH}")
```

---

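### 3. LLM calls have no timeout

**File**: `infrastructure/crewai/runtime.py:126`

**Problem**: `run_completion` sets no timeout, so if the LLM API hangs, the request blocks indefinitely.

**Recommendation**: Add a `timeout` parameter and forward it to the underlying call.

```python
def run_completion(
    *,
    model: str,
    api_key: str,
    messages: list[dict[str, Any]],
    temperature: float | None = None,
    max_tokens: int | None = None,
    timeout: float | None = None,  # new
) -> Any:
    ...  # existing body, assumed to build `kwargs`
    if timeout is not None:
        kwargs["timeout"] = timeout
    ...
```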
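
Once the call path is async, a call-site guard with `asyncio.wait_for()` can complement the library-level timeout. The sketch below is illustrative only: `slow_completion` and `call_with_timeout` are hypothetical names, and the `"timed_out"` return value is an assumption about how a caller might surface the failure.

```python
import asyncio


async def slow_completion(delay: float) -> str:
    """Hypothetical stand-in for an async LLM call that may hang."""
    await asyncio.sleep(delay)
    return "done"


async def call_with_timeout(delay: float, timeout: float) -> str:
    try:
        # wait_for cancels the awaited coroutine once `timeout` seconds elapse.
        return await asyncio.wait_for(slow_completion(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # Turn the timeout into a result the caller can act on.
        return "timed_out"


fast = asyncio.run(call_with_timeout(delay=0.01, timeout=1.0))
slow = asyncio.run(call_with_timeout(delay=1.0, timeout=0.05))
print(fast, slow)
```

Note that `asyncio.wait_for()` only cancels awaitables. If the blocking call is offloaded with `asyncio.to_thread()`, the worker thread keeps running after the timeout, so the `timeout` argument on the LLM call itself is still needed.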

---

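### 4. Hard-coded tool result

**File**: `application/resume_service.py:52`

**Problem**:
```python
content='{"status":"ok"}',
```

The tool result is hard-coded as `{"status":"ok"}`. This looks like placeholder code: the actual result of executing the tool is never used.

**Recommendation**: Implement real tool execution, or explicitly mark this as not yet implemented.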

---

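### 5. Cache write failures are silently swallowed

**File**: `infrastructure/persistence/user_context_cache.py:95-96`

**Problem**:
```python
async def set(self, *, session_id: UUID, context: UserAgentContext) -> None:
    try:
        ...
    except Exception:
        return None
```

When `set()` fails it silently returns `None`, so the caller cannot tell whether the write succeeded. Stale or missing cache entries then become hard to diagnose.

**Recommendation**: Log the failure or raise the exception.

```python
except Exception as exc:
    # Assumes a structlog-style logger that accepts keyword fields.
    logger.warning("Failed to cache user context", session_id=str(session_id), error=str(exc))
    return None
```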

---

## 🟢 LOW - Suggested Improvements

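### 6. Redis Stream response shape is not validated

**File**: `infrastructure/events/redis_stream.py:62`

**Problem**:
```python
_, entries = response[0]
```

The code assumes the response is well-formed. An empty response raises `IndexError`, and an item that is not a two-element pair fails during unpacking.

**Recommendation**: Add defensive checks.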
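
One possible shape for such a check is sketched below. `extract_entries` is a hypothetical helper, and returning an empty list for an unexpected shape is an assumption; the real code may prefer to log or raise instead.

```python
from typing import Any


def extract_entries(response: Any) -> list[Any]:
    """Return the entries of the first stream, or [] if the shape is unexpected."""
    # Reject None, non-sequences, and empty responses.
    if not isinstance(response, (list, tuple)) or not response:
        return []
    first = response[0]
    # Each item is expected to be a (stream_name, entries) pair.
    if not isinstance(first, (list, tuple)) or len(first) != 2:
        return []
    _, entries = first
    if not isinstance(entries, (list, tuple)):
        return []
    return list(entries)


well_formed = [["events", [("1-0", {"k": "v"})]]]
print(extract_entries(well_formed))
print(extract_entries([]))
```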

---

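### 7. Path restriction does not allow subdirectories

**File**: `infrastructure/crewai/loader.py:47`

**Problem**:
```python
if resolved.parent != base_dir:
```

Only files located directly in `base_dir` are accepted, which will get in the way if templates are later organized into subdirectories.

**Recommendation**: Check that the path is contained within `base_dir` instead, so that subdirectories are allowed.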
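
A containment check along these lines could be sketched as follows, assuming Python 3.9+ for `Path.is_relative_to()`. `is_within` and the example paths are hypothetical and are not taken from `loader.py`.

```python
from pathlib import Path


def is_within(base_dir: Path, candidate: Path) -> bool:
    """True if candidate resolves to a location inside base_dir (subdirectories allowed)."""
    base = base_dir.resolve()
    resolved = candidate.resolve()
    # Compare path components, not string prefixes, and exclude base_dir itself.
    return resolved != base and resolved.is_relative_to(base)


base = Path("/tmp/templates")
direct_child = is_within(base, base / "agent.yaml")
nested_child = is_within(base, base / "sub" / "agent.yaml")
traversal = is_within(base, base / ".." / "secret.yaml")
sibling_prefix = is_within(base, Path("/tmp/templates_evil/agent.yaml"))
print(direct_child, nested_child, traversal, sibling_prefix)
```

Resolving both paths before comparing keeps the existing traversal protection, since `..` segments and symlinks are collapsed first. Comparing components rather than string prefixes also rejects sibling directories such as `templates_evil`.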

---

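### 8. Exception details are lost

**File**: `infrastructure/queue/tasks.py:112`

**Problem**:
```python
except Exception:  # noqa: BLE001
    error_id = "agent_runtime_failed"
    logger.exception(...)
```

All exceptions are caught and identified only by a single `error_id`, so the concrete exception type is lost and failures are hard to triage.

**Recommendation**: Record the exception type in the log entry.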
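
A minimal sketch of this, using the standard `logging` module, is shown below. `run_task`, the logger name, and the returned dictionary are hypothetical; only the `error_id` value comes from the snippet above.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger("agent.tasks")


def run_task(work: Callable[[], None]) -> dict[str, str]:
    try:
        work()
    except Exception as exc:  # noqa: BLE001
        error_id = "agent_runtime_failed"
        exc_type = type(exc).__name__
        # logger.exception also attaches the traceback of the active exception.
        logger.exception("task failed: error_id=%s exc_type=%s", error_id, exc_type)
        return {"error_id": error_id, "exc_type": exc_type}
    return {"error_id": "", "exc_type": ""}


def boom() -> None:
    raise KeyError("missing")


outcome = run_task(boom)
print(outcome)
```

Keeping the generic `error_id` for clients while adding `exc_type` as a separate log field makes it possible to group failures by type without exposing internal details.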

---

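## ✅ Good Practices

The following design choices are worth acknowledging:

- **Clear DDD layering**: domain / application / infrastructure have well-separated responsibilities
- **Repositories do not commit**: the service layer controls transaction boundaries
- **Concurrency control**: `FOR UPDATE` locks guard against concurrent-update problems
- **Sensitive field redaction**: `agui/bridge.py` implements `_redact_sensitive()`
- **Path traversal protection**: `loader.py` uses `_resolve_allowed_path()`
- **Protocol abstraction**: dependencies are decoupled via `Protocol`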

---

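## Suggested Fix Priority

| Priority | Issue | Estimated effort |
|----------|-------|------------------|
| P0 | Synchronous LLM call blocks the event loop | 2h |
| P1 | Input length validation | 0.5h |
| P1 | LLM timeout control | 1h |
| P2 | Hard-coded tool result | TBD |
| P2 | Cache exception handling | 0.5h |
| P3 | Remaining LOW issues | 1h |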