chore(task): archive 04-15-session-deletion-anonymization

qzl
2026-04-15 18:19:20 +08:00
parent c2b726e7bd
commit 0bb7d77a3f
5 changed files with 53 additions and 21 deletions
@@ -0,0 +1,3 @@
{"file": ".opencode/commands/trellis/finish-work.md", "reason": "Finish work checklist"}
{"file": ".opencode/commands/trellis/check-backend.md", "reason": "Backend check spec"}
{"file": ".opencode/commands/trellis/check-frontend.md", "reason": "Frontend check spec"}
@@ -0,0 +1,2 @@
{"file": ".opencode/commands/trellis/check-backend.md", "reason": "Backend check spec"}
{"file": ".opencode/commands/trellis/check-frontend.md", "reason": "Frontend check spec"}
@@ -0,0 +1,3 @@
{"file": ".trellis/workflow.md", "reason": "Project workflow and conventions"}
{"file": ".trellis/spec/backend/index.md", "reason": "Backend development guide"}
{"file": ".trellis/spec/frontend/index.md", "reason": "Frontend development guide"}
@@ -0,0 +1,213 @@
# PRD: Divination Session Deletion Anonymization (iOS Compliance)
## Background
iOS App Store (US market) requires apps to comply with data deletion regulations. When users request deletion of their divination history, the app must remove all personally identifiable information (PII). However, we still need to retain anonymized usage/tool data for product improvement and algorithm feedback.
Current implementation (`DELETE /api/v1/agent/sessions/{thread_id}`) only performs a **soft-delete** on the `sessions` row (sets `deleted_at`). Associated `messages` rows are not touched at all: their `deleted_at` stays `NULL`, so they remain fully queryable. This is insufficient for iOS compliance.
## Goal
Replace the current soft-delete flow with an **anonymize-then-hard-delete** strategy:
1. **Desensitize/Anonymize** the session data: extract non-PII usage metrics, strip all PII, generate an anonymous snapshot
2. **Save** the anonymized usage data to a new `anonymous_session_snapshots` table
3. **Hard-delete** the original `sessions` and `messages` records from the database
## Scope
- **In scope**: Backend API change for session deletion, new anonymization table and migration, anonymization logic, hard-delete logic
- **Out of scope**: Account deletion flow (separate concern), frontend UI changes (the API contract for deletion stays the same - `DELETE` returns 204)
## Current Architecture
### API Endpoint
| Layer | File | Detail |
|-------|------|--------|
| Router | `backend/src/v1/agent/router.py:307-314` | `DELETE /sessions/{thread_id}` |
| Service | `backend/src/v1/agent/service.py:225-239` | `delete_session()` verifies ownership, soft-deletes |
| Repository | `backend/src/v1/agent/repository.py:99-119` | Sets `session.deleted_at = now()` |
### PII Fields (Must Be Anonymized or Removed)
**High-risk PII:**
| Table | Column | Content |
|-------|--------|---------|
| sessions | title | User's divination question (up to 80 chars) |
| sessions | state_snapshot | Full session state with divination payload |
| messages | content | Full user question and AI response text |
| messages | metadata | JSONB with `user_message_attachments`, `agent_output`, `tool_agent_output` |
**Medium-risk PII:**
| Table | Column | Content |
|-------|--------|---------|
| sessions | user_id | Links to auth.users (UUID, indirect PII) |
| messages | session_id | Links to session (FK) |
**Non-PII (to retain for analytics):**
| Table | Column | Content |
|-------|--------|---------|
| sessions | session_type | 'chat' |
| sessions | status | pending/running/completed/failed |
| sessions | total_tokens | Usage metric |
| sessions | total_cost | Usage metric |
| sessions | message_count | Counter (used for follow-up ratio analysis) |
| sessions | created_at / last_activity_at | Timestamps |
| messages | model_code | LLM model identifier |
| messages | tool_name | Divination tool name |
| messages | latency_ms | Response latency |
| messages->metadata | agent_output.sign_level | Sign level (上上签/中上签/中下签/下下签) |
| messages->metadata | agent_output.keywords | Key insights from reading |
| messages->metadata | agent_output.divination_derived.questionType | Question category (career/love/wealth/health) |
| messages->metadata | agent_output.divination_derived.guaName | Hexagram name |
| messages->metadata | agent_output.divination_derived.guaNameHant | Hexagram name (Traditional Chinese) |
| messages->metadata | agent_output.divination_derived.targetGuaName | Target hexagram name (if changing lines exist) |
| messages->metadata | agent_output.divination_derived.hasChangingYao | Whether session has changing lines |
**Analytics Requirements:**
1. **Question type distribution**: Count by `question_type`
2. **Follow-up ratio**: `message_count > 2` indicates follow-up questions
3. **LLM performance comparison**: Group by `model_code`, analyze `status`, `total_latency_ms`, `total_tokens`
4. **Hexagram accuracy analysis**: Distribution of `sign_level`, `gua_name`, `has_changing_yao`
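The first two analyses reduce to simple counts over snapshot rows. A minimal sketch, assuming rows have already been fetched as dictionaries keyed by the snapshot column names; the helpers `question_type_distribution` and `follow_up_ratio` and the sample rows are hypothetical and only illustrate the intended calculation:

```python
from collections import Counter


def question_type_distribution(rows: list[dict]) -> Counter:
    """Count snapshots per question_type; missing values are bucketed as 'unknown'."""
    return Counter(row.get("question_type") or "unknown" for row in rows)


def follow_up_ratio(rows: list[dict]) -> float:
    """Share of sessions with message_count > 2, i.e. at least one follow-up."""
    counted = [row for row in rows if row.get("message_count") is not None]
    if not counted:
        return 0.0
    return sum(1 for row in counted if row["message_count"] > 2) / len(counted)


# Illustrative rows only, not real data.
sample_rows = [
    {"question_type": "career", "message_count": 2},
    {"question_type": "love", "message_count": 4},
    {"question_type": None, "message_count": 6},
    {"question_type": "career", "message_count": None},
]
```

Rows without a `message_count` are excluded from the ratio rather than treated as zero, so nullable columns do not skew the result.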
## Technical Design
### 1. New Table: `anonymous_session_snapshots`
```sql
CREATE TABLE anonymous_session_snapshots (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
anonymous_id UUID NOT NULL, -- Random UUID, no link to real user
-- Session metadata
session_type VARCHAR(20) NOT NULL, -- 'chat'
message_count INTEGER, -- Used for follow-up ratio analysis
status VARCHAR(20), -- Session final status
-- Question & divination
question_type VARCHAR(50), -- Career/love/wealth/health etc.
tool_name VARCHAR(100), -- Divination tool name
-- Hexagram details (for accuracy analysis)
gua_name VARCHAR(50), -- Hexagram name
gua_name_hant VARCHAR(50), -- Hexagram name (Traditional Chinese)
target_gua_name VARCHAR(50), -- Target hexagram (if changing lines exist)
has_changing_yao BOOLEAN, -- Whether session has changing lines
sign_level VARCHAR(20), -- 上上签/中上签/中下签/下下签
keywords TEXT[], -- Key insights from reading
-- Model & usage metrics
model_code VARCHAR(50), -- LLM model used
total_tokens INTEGER, -- Token usage
total_cost NUMERIC, -- Cost metric
total_latency_ms INTEGER, -- Aggregated latency
-- Timestamps (truncated to day precision before insert to prevent re-identification)
created_at TIMESTAMPTZ NOT NULL, -- Original session creation time, truncated to day
last_activity_at TIMESTAMPTZ, -- Original last activity, truncated to day
anonymized_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- RLS: service role only, no user access
ALTER TABLE anonymous_session_snapshots ENABLE ROW LEVEL SECURITY;
CREATE POLICY "Service role can manage anonymous snapshots"
ON anonymous_session_snapshots FOR ALL
USING (auth.role() = 'service_role');
```
Design notes:
- `anonymous_id` is a randomly generated UUID with **no mapping** back to the original user
- Timestamps are stored with **date-only precision** (day granularity) to prevent re-identification via time correlation
- `session_type` only supports 'chat' (AUTOMATION is legacy from reused database schema, not used in this project)
- All structured non-PII fields are retained for flexible future analysis (principle: "complete retention, filter on analysis")
- No `user_id`, no question text, no AI response text - only structured/aggregate metrics
- RLS ensures no user (even authenticated) can access this table, only service_role
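Because the columns are `TIMESTAMPTZ`, day precision has to be enforced by the code that writes the snapshot. A minimal sketch of that truncation, assuming the day boundary is taken in UTC and that naive timestamps are already UTC (both are assumptions, not decisions recorded in this PRD); `truncate_to_day` is a hypothetical helper name:

```python
from __future__ import annotations

from datetime import datetime, timezone


def truncate_to_day(ts: datetime | None) -> datetime | None:
    """Return midnight UTC of the day containing ts; None passes through."""
    if ts is None:
        return None
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
```

Converting to UTC before truncating means a session created at 03:00 in UTC+8 is recorded under the previous UTC day. If analytics should follow the user's local calendar day instead, the conversion step would need to change.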
### 2. Anonymization Service
New module: `backend/src/v1/agent/anonymizer.py`
```python
class SessionAnonymizer:
    """Anonymizes session data per iOS compliance requirements."""
    def anonymize(self, session: AgentChatSession, messages: list[AgentChatMessage]) -> AnonymousSessionSnapshot:
        """
        Extract non-PII data from session+messages into an anonymous snapshot.
        - Generates a random anonymous_id (no mapping to user)
        - Truncates timestamps to day precision
        - Extracts question_type from message metadata (category only)
        - Aggregates latency metrics
        - Strips all PII: user_id, title, content, state_snapshot, attachments
        """
```
Key anonymization rules:
- **Strip entirely**: `user_id`, `title`, `state_snapshot`, `content` (all message content), `question` (user's original text), `answer` (AI response text), `user_message_attachments`, raw `agent_output` / `tool_agent_output` objects
- **Retain structured fields**:
- Session: `session_type`, `status`, `total_tokens`, `total_cost`, `message_count`
- Divination: `question_type`, `tool_name`, `gua_name`, `gua_name_hant`, `target_gua_name`, `has_changing_yao`, `sign_level`, `keywords`
- Model: `model_code`
- **Transform**: timestamps truncated to day precision
- **Aggregate**: sum `latency_ms` across all messages into `total_latency_ms`
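The retain and aggregate rules can be sketched as an allow-list: the snapshot is built only from named fields, so anything not listed (including `user_id`, `title`, and message `content`) is dropped by construction. This sketch uses plain dictionaries in place of the ORM models, omits timestamp truncation, and assumes that the first non-null value wins when several messages carry the same field; `build_snapshot` and `DERIVED_FIELDS` are hypothetical names:

```python
from __future__ import annotations

import uuid
from typing import Any

# Maps camelCase keys under agent_output.divination_derived to snapshot columns.
DERIVED_FIELDS = {
    "questionType": "question_type",
    "guaName": "gua_name",
    "guaNameHant": "gua_name_hant",
    "targetGuaName": "target_gua_name",
    "hasChangingYao": "has_changing_yao",
}


def build_snapshot(session: dict[str, Any], messages: list[dict[str, Any]]) -> dict[str, Any]:
    """Copy allow-listed fields into a new dict; everything else is left behind."""
    snapshot: dict[str, Any] = {
        "anonymous_id": uuid.uuid4(),  # fresh random ID, never derived from user_id
        "session_type": session["session_type"],
        "status": session.get("status"),
        "message_count": session.get("message_count"),
        "total_tokens": session.get("total_tokens"),
        "total_cost": session.get("total_cost"),
        # Sum per-message latency; missing values count as 0.
        "total_latency_ms": sum(m.get("latency_ms") or 0 for m in messages),
        "model_code": None,
        "tool_name": None,
        "sign_level": None,
        "keywords": None,
        **{column: None for column in DERIVED_FIELDS.values()},
    }
    for message in messages:
        # Assumption: the first non-null value wins for per-message fields.
        for key in ("model_code", "tool_name"):
            if snapshot[key] is None:
                snapshot[key] = message.get(key)
        agent_output = (message.get("metadata") or {}).get("agent_output") or {}
        for key in ("sign_level", "keywords"):
            if snapshot[key] is None:
                snapshot[key] = agent_output.get(key)
        derived = agent_output.get("divination_derived") or {}
        for source_key, column in DERIVED_FIELDS.items():
            if snapshot[column] is None:
                snapshot[column] = derived.get(source_key)
    return snapshot
```

Sessions whose metadata lacks `divination_derived` simply yield `None` for those columns, which matches the nullable column definitions.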
### 3. Modified Deletion Flow
The `DELETE /api/v1/agent/sessions/{thread_id}` endpoint changes from soft-delete to:
```
1. Verify session ownership (existing logic)
2. Load session + all associated messages
3. Call SessionAnonymizer.anonymize() → create AnonymousSessionSnapshot
4. Insert anonymous snapshot into DB
5. Hard-delete messages (DELETE FROM messages WHERE session_id = ?)
6. Hard-delete session (DELETE FROM sessions WHERE id = ?)
7. Delete associated storage objects (user_message_attachments from metadata)
8. Return 204 (same as before)
```
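Steps 2-6 must run in a **single database transaction** to ensure atomicity:
- If anonymization or the snapshot insert fails, nothing is deleted
- If either hard-delete fails, the snapshot insert is rolled back and the original rows remain intact
Step 7 (storage cleanup) runs only after the transaction commits, as described in the next section.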
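The all-or-nothing behavior can be illustrated with the standard-library `sqlite3` module standing in for the project's real database layer, which this PRD does not show. The function name `delete_session_atomically` and the reduced three-column snapshot are assumptions made for brevity:

```python
import sqlite3


def delete_session_atomically(
    conn: sqlite3.Connection, session_id: str, snapshot: dict
) -> None:
    """Insert the snapshot and hard-delete the originals in one transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO anonymous_session_snapshots "
            "(anonymous_id, session_type, message_count) "
            "VALUES (:anonymous_id, :session_type, :message_count)",
            snapshot,
        )
        # Messages first, then the session they belong to.
        conn.execute("DELETE FROM messages WHERE session_id = ?", (session_id,))
        conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))
```

If either `DELETE` raises, the snapshot insert is rolled back with it, so a failed deletion never leaves an orphaned snapshot next to the still-existing session.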
### 4. Storage Object Cleanup
Extract `user_message_attachments` paths from message metadata before anonymization, then delete those storage objects after the DB transaction commits (best-effort, non-blocking - storage cleanup failure should not roll back the DB operation).
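A rough sketch of the two halves of this step: collecting paths before the rows are deleted, then deleting objects after commit without letting failures propagate. The shape of each attachment (a dict with a `path` key) and the injected `delete_object` callable are assumptions; the real storage client is not specified in this PRD:

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)


def collect_attachment_paths(messages: Iterable[dict]) -> list[str]:
    """Gather storage paths from metadata.user_message_attachments, de-duplicated in order."""
    paths: list[str] = []
    for message in messages:
        metadata = message.get("metadata") or {}
        for attachment in metadata.get("user_message_attachments") or []:
            # Assumption: each attachment is a dict carrying a "path" key.
            path = attachment.get("path") if isinstance(attachment, dict) else None
            if path and path not in paths:
                paths.append(path)
    return paths


def cleanup_storage(
    paths: Iterable[str], delete_object: Callable[[str], None]
) -> list[str]:
    """Delete each object; log and collect failures instead of raising."""
    failed: list[str] = []
    for path in paths:
        try:
            delete_object(path)
        except Exception:
            logger.warning("Storage cleanup failed for %s", path, exc_info=True)
            failed.append(path)
    return failed
```

Returning the failed paths gives a periodic garbage-collection job something concrete to retry, while the `DELETE` request itself still completes with 204.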
### 5. Frontend Impact
**None.** The API contract remains identical:
- `DELETE /api/v1/agent/sessions/{thread_id}` still returns 204 on success
- Frontend already does optimistic deletion with rollback on failure
No frontend changes required.
## Migration Plan
1. Create migration: `anonymous_session_snapshots` table + RLS policies
2. Add `SessionAnonymizer` module
3. Modify `AgentRepository.delete_session()` to: anonymize → save snapshot → hard-delete
4. Add unit tests for anonymization logic
5. Add integration test for the full deletion flow
## Risks & Mitigations
| Risk | Mitigation |
|------|-----------|
| Data loss if anonymization fails mid-transaction | Wrap in single DB transaction; rollback on any error |
| Storage objects remain after hard-delete | Best-effort async cleanup; add periodic garbage collection |
| Anonymous data could be re-identified via time correlation | Truncate timestamps to day precision; no per-message timestamps |
| Existing soft-deleted sessions still have PII | Out of scope; handled separately via data cleanup script |
| question_type may not exist for all sessions | Make field nullable; skip if metadata lacks questionType |
## Open Questions
1. Should we also anonymize and hard-delete **already soft-deleted** sessions retroactively? (Recommended: yes, as a separate data cleanup task)
2. Should `points_ledger` entries linked to the session also be cleaned up on deletion? (Out of scope for this task, but worth noting)
3. Date precision: is day-level sufficient, or should we use week/month? (Proposing day-level as default)
@@ -0,0 +1,44 @@
{
"id": "session-deletion-anonymization",
"name": "session-deletion-anonymization",
"title": "Session deletion anonymization for iOS compliance",
"description": "Implement iOS-compliant data anonymization for divination session deletion: desensitize PII, retain anonymized usage data, hard-delete original records",
"status": "completed",
"dev_type": null,
"scope": null,
"priority": "P1",
"creator": "zl-q",
"assignee": "zl-q",
"createdAt": "2026-04-15",
"completedAt": "2026-04-15",
"branch": null,
"base_branch": "dev",
"worktree_path": null,
"current_phase": 0,
"next_action": [
{
"phase": 1,
"action": "implement"
},
{
"phase": 2,
"action": "check"
},
{
"phase": 3,
"action": "finish"
},
{
"phase": 4,
"action": "create-pr"
}
],
"commit": null,
"pr_url": null,
"subtasks": [],
"children": [],
"parent": null,
"relatedFiles": [],
"notes": "",
"meta": {}
}