docs(runtime): optimize runbook for ops workflow
@@ -0,0 +1,132 @@
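# Runtime Runbook Optimization Implementation Plan

> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

**Goal:** Restructure `docs/runtime/runtime-runbook.md` into an executable, ops-oriented manual that covers the full flow: gating, startup, verification, incident handling, and rollback.

**Architecture:** Keep the single-document format and reorder sections and commands without changing any scripts or runtime code. First check the commands against a baseline, then restructure the document, and finally run reachability checks and commit. All commands follow the scripts and compose paths that already exist in the repository.

**Tech Stack:** Markdown, Bash, Docker Compose, tmux, uv.

---
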
### Task 1: Baseline Check of Commands and Scripts

**Files:**
- Modify: `docs/runtime/runtime-runbook.md`
- Verify: `infra/scripts/app-up.sh`

**Step 1: Write the failing check (the current runbook still contains TODOs and outdated wording)**

```bash
grep -n "TODO\|dev-app-up" docs/runtime/runtime-runbook.md
```

**Step 2: Run it and confirm it fails**

Run: `grep -n "TODO\|dev-app-up" docs/runtime/runtime-runbook.md`
Expected: at least 1 match (meaning the runbook needs restructuring).

**Step 3: Write the minimal implementation (command mapping checklist)**

```markdown
- Standardize the startup script on infra/scripts/app-up.sh
- Standardize the bootstrap command on docker compose --env-file .env -f infra/docker/docker-compose.yml ...
- Migration/initialization: emphasize init-job --build
```

**Step 4: Run the verification**

Run: `bash -n infra/scripts/app-up.sh`
Expected: exit 0.

### Task 2: Restructure the Document into Ops Layers

**Files:**
- Modify: `docs/runtime/runtime-runbook.md`

**Step 1: Write the failing check (target sections are missing)**

```bash
grep -n "Bootstrap Gate\|Operational Verification\|Incident Playbook\|Rollback" docs/runtime/runtime-runbook.md
```

**Step 2: Run it and confirm it fails**

Run: `grep -n "Bootstrap Gate\|Operational Verification\|Incident Playbook\|Rollback" docs/runtime/runtime-runbook.md`
Expected: incomplete matches or none at all.

**Step 3: Write the minimal implementation (section reordering)**

```markdown
1. Scope & Preconditions
2. Bootstrap Gate (Mandatory)
3. Service Start/Stop
4. Operational Verification (L1/L2/L3)
5. Incident Playbook
6. Rollback Procedure
```

**Step 4: Run the verification**

Run: `grep -n "Bootstrap Gate\|Operational Verification\|Incident Playbook\|Rollback Procedure" docs/runtime/runtime-runbook.md`
Expected: all 4 target sections are matched.
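
A single `grep` with alternation succeeds as soon as any one of the four headings is present, so it cannot prove by itself that all four exist. The sketch below checks each heading separately. The function name `missing_sections` is hypothetical and not part of the repository; the heading strings are taken from the Step 4 command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print every required heading that is
# missing from the given file. Returns 0 if all are present, 1 otherwise.
missing_sections() {
  local file="$1" missing=0 heading
  for heading in "Bootstrap Gate" "Operational Verification" \
                 "Incident Playbook" "Rollback Procedure"; do
    # -q: quiet, -F: treat the heading as a fixed string
    if ! grep -qF -- "$heading" "$file"; then
      printf 'missing: %s\n' "$heading"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run from the repository root):
#   missing_sections docs/runtime/runtime-runbook.md && echo "all sections present"
```

This keeps the verification strict: one absent heading is enough to make the step fail and is named in the output.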

### Task 3: Fill In Operational Verification and Incident Handling Details

**Files:**
- Modify: `docs/runtime/runtime-runbook.md`

**Step 1: Write the failing check (pass criteria are missing)**

```bash
grep -n "Pass criteria\|Verdict" docs/runtime/runtime-runbook.md
```

**Step 2: Run it and confirm it fails**

Run: `grep -n "Pass criteria\|Verdict" docs/runtime/runtime-runbook.md`
Expected: too few matches.

**Step 3: Write the minimal implementation (add a verdict to each section)**

```markdown
- L1 Required: health/compose/smoke + Pass criteria
- L2 Optional: auth/profile + Pass criteria
- L3 Optional: agent_chat tests + Pass criteria
- Incident entries: Symptom/Diagnosis/Fix
```

**Step 4: Run the verification**

Run: `grep -n "L1 Required\|L2 Optional\|L3 Optional\|Pass criteria" docs/runtime/runtime-runbook.md`
Expected: every key section is matched.

### Task 4: Final Checks and Commit

**Files:**
- Modify: `docs/runtime/runtime-runbook.md`

**Step 1: Run the document semantic checks (key commands are reachable)**

```bash
bash -n infra/scripts/app-up.sh
PYTHONPATH=backend/src uv run python -c "import core.runtime.cli"
```

**Step 2: Run them and confirm they pass**

Run: `bash -n infra/scripts/app-up.sh`
Expected: exit 0.

Run: `PYTHONPATH=backend/src uv run python -c "import core.runtime.cli"`
Expected: no errors and exit 0.

**Step 3: Commit**

```bash
git add docs/runtime/runtime-runbook.md \
  docs/plans/2026-02-25-runtime-runbook-optimization-design.md \
  docs/plans/2026-02-25-runtime-runbook-optimization-implementation-plan.md
git commit -m "docs(runtime): optimize runbook for ops workflow"
```

+151
-102
@@ -1,37 +1,84 @@
# Runtime Runbook

**Date:** 2026-02-25
**Status:** Active
**Status:** Active
**Audience:** Ops / backend on-call

## Development Environment Startup
## Scope & Preconditions

### One-Command Startup
This runbook is used for day-to-day on-call duty, pre-release checks, incident handling, and rollback.

### Prerequisites

- `.env` is configured (in the repository root).
- Available on the host: `docker`, `docker compose`, `tmux`, `uv`.
- The latest code has been pulled, and the current branch matches the target release version.

### Red-Line Rules

- Never start web/worker directly without passing the bootstrap gate.
- Always pass `--build` when running the migration/initialization container, so that a stale image cannot keep migrations from taking effect.

---

## Bootstrap Gate (Mandatory)

The steps below must be executed in order.

### Step 1: Start the Infrastructure

```bash
# 1. On first run or after a schema change, run bootstrap
docker compose --env-file .env -f infra/docker/docker-compose.yml run --rm init-job bootstrap

# 2. Start the services for day-to-day use (tmux)
bash infra/scripts/app-up.sh

# List the tmux windows
tmux list-windows -t social-dev

# Attach to the session to watch the logs
tmux attach -t social-dev
docker compose --env-file .env -f infra/docker/docker-compose.yml up -d
```

### tmux Session Management
Pass criteria: in `docker compose ... ps`, the redis/supabase containers are `running`.

### Step 2: Run Migrations and Initialization

```bash
# Kill the session (stops web/workers)
docker compose --env-file .env -f infra/docker/docker-compose.yml run --rm --build init-job bootstrap
```

Pass criteria: the command exits with code 0 and the logs show no migration/init-data errors.

### Step 3: Version Check (Recommended)

```bash
docker compose --env-file .env -f infra/docker/docker-compose.yml exec -T db \
  psql -U postgres -d postgres -c "SELECT version_num FROM public.alembic_version;"
```

Pass criteria: exactly 1 row containing a version number is returned, and it matches the version expected for the release.
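
The pass criteria above can be checked mechanically instead of by eye. The following is a minimal sketch, assuming the query is run with psql's `-At` flags (unaligned, tuples only) so that the output contains only the revision value. The function name `check_alembic_version` is hypothetical, and the expected revision must be supplied by the operator.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo).
# $1: expected revision, $2: raw query output (one revision per line)
check_alembic_version() {
  local expected="$1" actual
  # Trim surrounding whitespace and drop blank lines
  actual=$(printf '%s\n' "$2" \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d')
  # Exactly one row must be returned
  if [ -z "$actual" ] || [ "$(printf '%s\n' "$actual" | grep -c '')" -ne 1 ]; then
    echo "FAIL: expected exactly one version row" >&2
    return 1
  fi
  # The row must match the revision expected for this release
  if [ "$actual" != "$expected" ]; then
    echo "FAIL: got $actual, expected $expected" >&2
    return 1
  fi
  echo "OK: $actual"
}

# Usage (the expected revision is a placeholder to fill in per release):
#   actual=$(docker compose --env-file .env -f infra/docker/docker-compose.yml exec -T db \
#     psql -U postgres -d postgres -At -c "SELECT version_num FROM public.alembic_version;")
#   check_alembic_version "<expected_revision>" "$actual"
```

Passing the query output in as an argument keeps the comparison logic testable without a running database.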

---

## Service Start / Stop (tmux)

### Start the Application Processes

```bash
bash infra/scripts/app-up.sh
```

The script brings up the following in the tmux `social-dev` session:

- web
- worker-critical
- worker-default
- worker-bulk

Pass criteria: `tmux list-windows -t social-dev` shows the windows listed above.
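
As a rough sketch, the window check can be scripted so that a missing worker is reported by name. It assumes that the tmux window names equal the four service names listed above, which the runbook implies but does not state outright; `missing_windows` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo).
# $1: newline-separated tmux window names
# Prints each expected window that is absent; returns 1 if any is missing.
missing_windows() {
  local names="$1" missing=0 w
  for w in web worker-critical worker-default worker-bulk; do
    # -x: match the whole line, -F: fixed string, -q: quiet
    if ! printf '%s\n' "$names" | grep -qxF -- "$w"; then
      printf 'missing window: %s\n' "$w"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   missing_windows "$(tmux list-windows -t social-dev -F '#{window_name}')"
```

The `-F '#{window_name}'` format makes tmux print one bare window name per line, which is what the whole-line match relies on.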

### Common tmux Commands

```bash
tmux list-windows -t social-dev
tmux attach -t social-dev
tmux kill-session -t social-dev
```

### Log Files

Each service automatically writes its own log files:

| Service | Log files |
|---------|-----------|
| Web | `logs/web.log`, `logs/web.error.log` |
@@ -41,121 +88,122 @@ tmux kill-session -t social-dev

---

## Production Environment Startup
## Operational Verification

> TODO: to be added
Run the layers in priority order.

### L1 Required (mandatory before release and after incident recovery)

```bash
# TBD
```
# Load .env first so the port and settings are consistent
set -a
. ./.env
set +a

---
WEB_BASE_URL="http://127.0.0.1:${SOCIAL_WEB__PORT:-5775}"

## Services

| Service | Description |
|---------|-------------|
| redis | Cache and Celery broker |
| supabase-* | Authentication and database services |
| init-job | Database migration and initialization (one-off) |
| web | Web service (gunicorn) |
| worker-* | Celery workers (3 queues) |

## Configuration

### Web Server Configuration

| Environment variable | Description | Default |
|----------------------|-------------|---------|
| `SOCIAL_WEB__HOST` | Listen address | 0.0.0.0 |
| `SOCIAL_WEB__PORT` | Listen port | 8000 |
| `SOCIAL_WEB__GUNICORN__WORKERS` | Number of Gunicorn worker processes | 2 |
| `SOCIAL_WEB__GUNICORN__WORKER_CLASS` | Gunicorn worker class | uvicorn.workers.UvicornWorker |
| `SOCIAL_WEB__GUNICORN__TIMEOUT` | Request timeout in seconds | 60 |

### Celery Queue Routing

| Task prefix | Queue |
|-------------|-------|
| tasks.critical.* | critical |
| tasks.bulk.* | bulk |
| Other | default |

## Health Checks

```bash
# Supabase gateway
# Basic health
curl -fsS http://127.0.0.1:${SOCIAL_SUPABASE__KONG_HTTP_PORT:-8000}/health

# Database migration and initialization
docker compose --env-file .env -f infra/docker/docker-compose.yml --profile job run --rm --build init-job
```

## Checking Service Status

```bash
# compose status
docker compose --env-file .env -f infra/docker/docker-compose.yml ps
docker compose --env-file .env -f infra/docker/docker-compose.yml logs -f db

# init-job is a one-off task (run --rm); re-run it if you need to see its logs:
docker compose --env-file .env -f infra/docker/docker-compose.yml --profile job run --rm --build init-job
# Core API smoke test
curl -sS -X POST "${WEB_BASE_URL}/api/v1/auth/login" \
  -H 'Content-Type: application/json' \
  -d '{"email":"demo@example.com","password":"secret123"}'
```

## Auth/Profile Verification
Pass criteria: health returns 2xx, the key containers are `running`, and the core API returns the expected business status code.

### L2 Optional (Auth/Profile business regression)

```bash
# Note: the default template URL http://mail-templates/* is only reachable inside the Docker Compose network.
# In production, replace it with a template URL that gotrue can reach.

# signup start: username + email + password (sends the verification code)
curl -sS -X POST http://127.0.0.1:8000/api/v1/auth/signup/start \
# signup start
curl -sS -X POST "${WEB_BASE_URL}/api/v1/auth/signup/start" \
  -H 'Content-Type: application/json' \
  -d '{"username":"demo","email":"demo@example.com","password":"secret123"}'

# signup verify: email + token (6-digit verification code)
curl -sS -X POST http://127.0.0.1:8000/api/v1/auth/signup/verify \
# signup verify
curl -sS -X POST "${WEB_BASE_URL}/api/v1/auth/signup/verify" \
  -H 'Content-Type: application/json' \
  -d '{"email":"demo@example.com","token":"123456"}'

# signup resend: email (resends the verification code)
curl -sS -X POST http://127.0.0.1:8000/api/v1/auth/signup/resend \
  -H 'Content-Type: application/json' \
  -d '{"email":"demo@example.com"}'

# login: email + password
curl -sS -X POST http://127.0.0.1:8000/api/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"email":"demo@example.com","password":"secret123"}'

# by-email lookup
curl -sS "http://127.0.0.1:8000/api/v1/auth/users/by-email?email=demo@example.com"

# patch profile: username/avatar_url/bio only
curl -sS -X PATCH http://127.0.0.1:8000/api/v1/profile/me \
# profile patch
curl -sS -X PATCH "${WEB_BASE_URL}/api/v1/profile/me" \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer <access_token>" \
  -d '{"username":"demo2","bio":"hello"}'
```

## Agent Chat Verification
Pass criteria: endpoints return the expected 2xx or a controlled business error, with no 5xx.
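
One way to apply these criteria consistently is to classify the HTTP status code of each call. The sketch below is illustrative only: it assumes that a "controlled business error" surfaces as a 4xx status, which the runbook does not spell out, and `classify_status` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): classify an HTTP status code.
# Returns 0 for acceptable results, 1 for failures.
classify_status() {
  case "$1" in
    2[0-9][0-9]) echo "pass" ;;
    # Assumption: controlled business errors are reported as 4xx
    4[0-9][0-9]) echo "controlled-error" ;;
    5[0-9][0-9]) echo "fail"; return 1 ;;
    # Anything else, including curl's 000 when no response was received
    *)           echo "unknown"; return 1 ;;
  esac
}

# Usage: capture only the status code of a request, then classify it.
#   code=$(curl -sS -o /dev/null -w '%{http_code}' -X POST \
#     "${WEB_BASE_URL}/api/v1/auth/login" \
#     -H 'Content-Type: application/json' \
#     -d '{"email":"demo@example.com","password":"secret123"}')
#   classify_status "$code"
```

A `controlled-error` result still needs a human to confirm that the specific 4xx is the one expected for that request.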

### L3 Optional (Agent Chat regression)

```bash
# 1) Base gate (migration + init-data)
make runtime-bootstrap-gate
PYTHONPATH=backend/src uv run pytest backend/tests/unit -k agent_chat -q
PYTHONPATH=backend/src uv run pytest backend/tests/integration -k agent_chat -q
PYTHONPATH=backend/src uv run pytest backend/tests/e2e/test_agent_chat_flow.py backend/tests/e2e/test_agent_chat_recent_session_home.py -q

# 2) Run the agent_chat unit/integration/E2E tests
PYTHONPATH=backend/src uv run pytest backend/tests/unit/core/agent_chat -v
PYTHONPATH=backend/src uv run pytest backend/tests/integration -k agent_chat -v
PYTHONPATH=backend/src uv run pytest backend/tests/e2e/test_agent_chat_flow.py backend/tests/e2e/test_agent_chat_recent_session_home.py -v

# 3) Core API smoke test
curl -sS -X POST http://127.0.0.1:8000/api/v1/agent-chat/run \
curl -sS -X POST "${WEB_BASE_URL}/api/v1/agent-chat/run" \
  -H 'Content-Type: application/json' \
  -d '{"message":"hello"}'
```

Pass criteria: the tests pass, and `/api/v1/agent-chat/run` returns a valid `session_id` and an event sequence.

---

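## Change Log
## Incident Playbook

### 1) Migration Did Not Take Effect (common with a stale image)

- Symptom: columns or table structure do not match the code, and the API reports schema errors.
- Diagnosis: check `alembic_version` and the build time of the container image.
- Fix: re-run `init-job --build` and re-check the version number.

### 2) Workers Are Not Consuming Tasks

- Symptom: the queue backs up and tasks stay pending for a long time.
- Diagnosis: check the `worker-*` tmux windows and their log files.
- Fix: restart the tmux session, then confirm the concurrency settings and the queue names (critical/default/bulk).

### 3) JWT or Authentication Errors

- Symptom: the API keeps returning 401/403.
- Diagnosis: verify the Supabase JWT configuration and the issuer settings in `.env`.
- Fix: correct the configuration, restart the web process, and run the L1/L2 verification.

### 4) Agent Chat Misbehaves After Startup

- Symptom: `/api/v1/agent-chat/run` returns 5xx or an incomplete event sequence.
- Diagnosis: run the L3 tests first, then check `logs/web.error.log`.
- Fix: first restore a working version, then investigate differences in migrations, configuration, and dependencies.

---
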
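## Rollback Procedure

### Pre-Rollback Checks

- Confirm the target rollback commit or version number.
- Confirm whether any irreversible data changes are involved.

### Executing the Rollback

1. Stop the application processes: `tmux kill-session -t social-dev`
2. Switch the code to the target version.
3. Run the migration rollback required by the target version (if any).
4. Re-run the bootstrap gate and start the services.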
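
The four steps above could be previewed as a dry run before anything is executed. This is only a sketch: `rollback_plan` is a hypothetical name, switching code with `git checkout` is an assumption (the runbook does not say how the version is switched), and the migration rollback is left as a comment because it depends on the target version.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print the rollback commands for a
# target git ref without executing any of them.
rollback_plan() {
  local target="$1"
  local compose="docker compose --env-file .env -f infra/docker/docker-compose.yml"
  if [ -z "$target" ]; then
    echo "usage: rollback_plan <git-ref>" >&2
    return 2
  fi
  # 1. Stop the application processes
  echo "tmux kill-session -t social-dev"
  # 2. Switch the code (assumption: done with git checkout)
  echo "git checkout $target"
  # 3. Migration rollback is version specific, so only a reminder is printed
  echo "# run the migration rollback required by $target, if any"
  # 4. Bootstrap gate, then service start
  echo "$compose up -d"
  echo "$compose run --rm --build init-job bootstrap"
  echo "bash infra/scripts/app-up.sh"
}

# Usage: review the printed plan, then run the commands one by one.
#   rollback_plan <target-ref>
```

Printing rather than executing keeps the operator in control of each step, which matters when irreversible data changes may be involved.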

### Post-Rollback Review

- Run the L1 Required checks.
- Record the rollback reason, time, impact scope, and follow-up remediation plan.

---

## Change Log

| Date | Change |
|------|--------|
@@ -167,4 +215,5 @@ curl -sS -X POST http://127.0.0.1:8000/api/v1/agent-chat/run \
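| 2026-02-25 | Auth signup switched to a three-step OTP flow: signup/start, signup/verify, signup/resend; the email template now shows only the verification code |
| 2026-02-25 | Removed unused settings classes: deleted WebSettings/GunicornSettings/WorkerSettings/WorkerGroupSettings (the scripts still start services via environment variables) |
| 2026-02-25 | Added the Agent Chat verification section: bootstrap gate, layered test commands, and a smoke example for the run endpoint |
| 2026-02-25 | Simplified startup: dev-app-up → app-up; separated bootstrap from service startup |
| 2026-02-25 | Simplified startup: dev-app-up -> app-up; separated bootstrap from service startup |
| 2026-02-25 | Restructured into a layered ops runbook: Bootstrap Gate, layered verification, incident and rollback procedures |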