PRD: Divination Session Deletion Anonymization (iOS Compliance)
Background
The iOS App Store (US market) requires apps to comply with data deletion regulations. When a user requests deletion of their divination history, the app must remove all personally identifiable information (PII). We still need to retain anonymized usage and tool data for product improvement and algorithm feedback.
The current implementation (DELETE /api/v1/agent/sessions/{thread_id}) only soft-deletes the sessions row by setting deleted_at. Associated messages rows are not touched at all and remain fully queryable with deleted_at = NULL. This is insufficient for iOS compliance.
Goal
Replace the current soft-delete flow with an anonymize-then-hard-delete strategy:
- Desensitize/Anonymize the session data: extract non-PII usage metrics, strip all PII, generate an anonymous snapshot
- Save the anonymized usage data to a new `anonymous_session_snapshots` table
- Hard-delete the original `sessions` and `messages` records from the database
Scope
- In scope: Backend API change for session deletion, new anonymization table and migration, anonymization logic, hard-delete logic
- Out of scope: Account deletion flow (separate concern), frontend UI changes (the API contract for deletion stays the same - `DELETE` returns 204)
Current Architecture
API Endpoint
| Layer | File | Detail |
|---|---|---|
| Router | backend/src/v1/agent/router.py:307-314 | DELETE /sessions/{thread_id} |
| Service | backend/src/v1/agent/service.py:225-239 | delete_session() verifies ownership, soft-deletes |
| Repository | backend/src/v1/agent/repository.py:99-119 | Sets session.deleted_at = now() |
PII Fields (Must Be Anonymized or Removed)
High-risk PII:
| Table | Column | Content |
|---|---|---|
| sessions | title | User's divination question (up to 80 chars) |
| sessions | state_snapshot | Full session state with divination payload |
| messages | content | Full user question and AI response text |
| messages | metadata | JSONB with user_message_attachments, agent_output, tool_agent_output |
Medium-risk PII:
| Table | Column | Content |
|---|---|---|
| sessions | user_id | Links to auth.users (UUID, indirect PII) |
| messages | session_id | Links to session (FK) |
Non-PII (to retain for analytics):
| Table | Column | Content |
|---|---|---|
| sessions | session_type | 'chat' |
| sessions | status | pending/running/completed/failed |
| sessions | total_tokens | Usage metric |
| sessions | total_cost | Usage metric |
| sessions | message_count | Counter (used for follow-up ratio analysis) |
| sessions | created_at / last_activity_at | Timestamps |
| messages | model_code | LLM model identifier |
| messages | tool_name | Divination tool name |
| messages | latency_ms | Response latency |
| messages->metadata | agent_output.sign_level | Sign level (上上签/中上签/中下签/下下签) |
| messages->metadata | agent_output.keywords | Key insights from reading |
| messages->metadata | agent_output.divination_derived.questionType | Question category (career/love/wealth/health) |
| messages->metadata | agent_output.divination_derived.guaName | Hexagram name |
| messages->metadata | agent_output.divination_derived.guaNameHant | Hexagram name (Traditional Chinese) |
| messages->metadata | agent_output.divination_derived.targetGuaName | Target hexagram name (if changing lines exist) |
| messages->metadata | agent_output.divination_derived.hasChangingYao | Whether session has changing lines |
Analytics Requirements:
- Question type distribution: Count by `question_type`
- Follow-up ratio: `message_count > 2` indicates follow-up questions
- LLM performance comparison: Group by `model_code`, analyze `status`, `total_latency_ms`, `total_tokens`
- Hexagram accuracy analysis: Distribution of `sign_level`, `gua_name`, `has_changing_yao`
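
As a rough illustration of how these metrics could be computed from snapshot rows, here is a minimal sketch. It assumes rows arrive as plain dicts keyed by the snapshot column names. The function names and the sample values (`model-a`, `model-b`) are hypothetical and not part of the existing codebase.

```python
from collections import Counter


def question_type_distribution(snapshots):
    """Count snapshots per question_type; missing values are grouped as 'unknown'."""
    return Counter(s.get("question_type") or "unknown" for s in snapshots)


def follow_up_ratio(snapshots):
    """Share of snapshots with message_count > 2, i.e. at least one follow-up."""
    counted = [s for s in snapshots if s.get("message_count") is not None]
    if not counted:
        return 0.0
    return sum(1 for s in counted if s["message_count"] > 2) / len(counted)


def mean_latency_by_model(snapshots):
    """Average total_latency_ms per model_code, skipping rows without latency."""
    totals, counts = {}, {}
    for s in snapshots:
        model, latency = s.get("model_code"), s.get("total_latency_ms")
        if model is None or latency is None:
            continue
        totals[model] = totals.get(model, 0) + latency
        counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}


# Hypothetical sample rows, shaped like anonymous_session_snapshots records.
rows = [
    {"question_type": "career", "message_count": 2, "model_code": "model-a", "total_latency_ms": 1200},
    {"question_type": "love", "message_count": 4, "model_code": "model-a", "total_latency_ms": 1800},
    {"question_type": "career", "message_count": 6, "model_code": "model-b", "total_latency_ms": 900},
    {"question_type": None, "message_count": 2, "model_code": "model-b", "total_latency_ms": None},
]
```

In practice these aggregations would more likely run as SQL against the snapshot table. The Python form is only meant to pin down the definitions, in particular that nullable fields have to be handled explicitly.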
Technical Design
1. New Table: anonymous_session_snapshots
    CREATE TABLE anonymous_session_snapshots (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        anonymous_id UUID NOT NULL,            -- Random UUID, no link to real user

        -- Session metadata
        session_type VARCHAR(20) NOT NULL,     -- 'chat'
        message_count INTEGER,                 -- Used for follow-up ratio analysis
        status VARCHAR(20),                    -- Session final status

        -- Question & divination
        question_type VARCHAR(50),             -- Career/love/wealth/health etc.
        tool_name VARCHAR(100),                -- Divination tool name

        -- Hexagram details (for accuracy analysis)
        gua_name VARCHAR(50),                  -- Hexagram name
        gua_name_hant VARCHAR(50),             -- Hexagram name (Traditional Chinese)
        target_gua_name VARCHAR(50),           -- Target hexagram (if changing lines exist)
        has_changing_yao BOOLEAN,              -- Whether session has changing lines
        sign_level VARCHAR(20),                -- 上上签/中上签/中下签/下下签
        keywords TEXT[],                       -- Key insights from reading

        -- Model & usage metrics
        model_code VARCHAR(50),                -- LLM model used
        total_tokens INTEGER,                  -- Token usage
        total_cost NUMERIC,                    -- Cost metric
        total_latency_ms INTEGER,              -- Aggregated latency

        -- Timestamps (day precision to prevent re-identification)
        created_at TIMESTAMPTZ NOT NULL,       -- Original session creation time
        last_activity_at TIMESTAMPTZ,          -- Original last activity
        anonymized_at TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    -- RLS: service role only, no user access
    ALTER TABLE anonymous_session_snapshots ENABLE ROW LEVEL SECURITY;

    CREATE POLICY "Service role can manage anonymous snapshots"
        ON anonymous_session_snapshots FOR ALL
        USING (auth.role() = 'service_role');
Design notes:
- `anonymous_id` is a randomly generated UUID with no mapping back to the original user
- Timestamps are stored with date-only precision (day granularity) to prevent re-identification via time correlation
- `session_type` only supports 'chat' (AUTOMATION is legacy from the reused database schema and is not used in this project)
- All structured non-PII fields are retained for flexible future analysis (principle: "complete retention, filter on analysis")
- No `user_id`, no question text, no AI response text - only structured/aggregate metrics
- RLS ensures no user (even authenticated) can access this table, only service_role
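
Because the columns are `TIMESTAMPTZ`, day granularity has to be enforced at write time. The sketch below shows one way to do that, under the assumption that truncation happens in UTC. The PRD does not specify a time zone, and the helper name `truncate_to_day` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def truncate_to_day(ts):
    """Convert ts to UTC and zero out the time of day; None passes through."""
    if ts is None:
        return None
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )


# A session created at 03:00 in UTC+8 falls on the previous UTC day.
local = datetime(2024, 5, 17, 3, 0, tzinfo=timezone(timedelta(hours=8)))
truncated = truncate_to_day(local)
```

Fixing a single zone for truncation keeps the stored day independent of the user's locale, which would otherwise leak a coarse location signal.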
2. Anonymization Service
New module: backend/src/v1/agent/anonymizer.py
    class SessionAnonymizer:
        """Anonymizes session data per iOS compliance requirements."""

        def anonymize(
            self, session: AgentChatSession, messages: list[AgentChatMessage]
        ) -> AnonymousSessionSnapshot:
            """
            Extract non-PII data from session+messages into an anonymous snapshot.

            - Generates a random anonymous_id (no mapping to user)
            - Truncates timestamps to day precision
            - Extracts question_type from message metadata (category only)
            - Aggregates latency metrics
            - Strips all PII: user_id, title, content, state_snapshot, attachments
            """
Key anonymization rules:
- Strip entirely: `user_id`, `title`, `state_snapshot`, `content` (all message content), `question` (user's original text), `answer` (AI response text), `user_message_attachments`, raw `agent_output`/`tool_agent_output` objects
- Retain structured fields:
  - Session: `session_type`, `status`, `total_tokens`, `total_cost`, `message_count`
  - Divination: `question_type`, `tool_name`, `gua_name`, `gua_name_hant`, `target_gua_name`, `has_changing_yao`, `sign_level`, `keywords`
  - Model: `model_code`
- Transform: timestamps truncated to day precision
- Aggregate: sum `latency_ms` across all messages into `total_latency_ms`
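
The rules above amount to an allowlist: the snapshot is built only from named fields, so anything not listed is dropped by construction. A minimal sketch follows, assuming sessions and messages are plain dicts with timezone-aware timestamps. Taking divination fields from the first message that carries `agent_output`, and `model_code`/`tool_name` from the first message that has them, are assumptions of this sketch, not decisions made in this PRD.

```python
import uuid
from datetime import datetime, timezone


def _day(ts):
    """Truncate a timezone-aware timestamp to day precision in UTC."""
    if ts is None:
        return None
    return ts.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)


def _first(messages, key):
    """First non-empty value of `key` across messages, or None."""
    return next((m[key] for m in messages if m.get(key)), None)


def anonymize(session, messages):
    """Build a snapshot dict from an allowlist of non-PII fields."""
    agent_output = next(
        (m["metadata"]["agent_output"] for m in messages
         if (m.get("metadata") or {}).get("agent_output")),
        {},
    )
    derived = agent_output.get("divination_derived") or {}
    latencies = [m["latency_ms"] for m in messages if m.get("latency_ms") is not None]
    return {
        "anonymous_id": str(uuid.uuid4()),  # random, never derived from user_id
        "session_type": session["session_type"],
        "status": session.get("status"),
        "message_count": session.get("message_count"),
        "total_tokens": session.get("total_tokens"),
        "total_cost": session.get("total_cost"),
        "question_type": derived.get("questionType"),  # None if metadata lacks it
        "tool_name": _first(messages, "tool_name"),
        "gua_name": derived.get("guaName"),
        "gua_name_hant": derived.get("guaNameHant"),
        "target_gua_name": derived.get("targetGuaName"),
        "has_changing_yao": derived.get("hasChangingYao"),
        "sign_level": agent_output.get("sign_level"),
        "keywords": agent_output.get("keywords"),
        "model_code": _first(messages, "model_code"),
        "total_latency_ms": sum(latencies) if latencies else None,
        "created_at": _day(session["created_at"]),
        "last_activity_at": _day(session.get("last_activity_at")),
    }
```

Building from an allowlist rather than deleting known PII keys means a new PII field added to `sessions` or `messages` later does not silently end up in the snapshot.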
3. Modified Deletion Flow
The DELETE /api/v1/agent/sessions/{thread_id} endpoint changes from soft-delete to:
1. Verify session ownership (existing logic)
2. Load session + all associated messages
3. Call SessionAnonymizer.anonymize() → create AnonymousSessionSnapshot
4. Insert anonymous snapshot into DB
5. Hard-delete messages (DELETE FROM messages WHERE session_id = ?)
6. Hard-delete session (DELETE FROM sessions WHERE id = ?)
7. Delete associated storage objects (user_message_attachments from metadata)
8. Return 204 (same as before)
Steps 1-6 must run in a single database transaction to ensure atomicity (step 7 runs only after the transaction commits, see Storage Object Cleanup):
- If anonymization fails, nothing is deleted
- If deletion fails, no data is lost
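
The atomicity requirement can be sketched as below. This uses the standard-library `sqlite3` module purely as a stand-in so the example is runnable. The real backend's database driver, its session handling, and the trimmed-down columns here are all assumptions, and `make_db`, `count` and `delete_session_atomically` are hypothetical names.

```python
import sqlite3


def make_db():
    """In-memory stand-in for the real schema (columns trimmed for brevity)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sessions (id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE messages (id INTEGER PRIMARY KEY, session_id TEXT, content TEXT);
        CREATE TABLE anonymous_session_snapshots (
            anonymous_id TEXT NOT NULL,
            session_type TEXT NOT NULL
        );
        INSERT INTO sessions VALUES ('s1', 'example title');
        INSERT INTO messages (session_id, content) VALUES ('s1', 'example content');
        """
    )
    conn.commit()
    return conn


def count(conn, table):
    """Row count helper; table names here are fixed literals, not user input."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def delete_session_atomically(conn, session_id, snapshot):
    """Insert the snapshot and hard-delete the originals in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "INSERT INTO anonymous_session_snapshots (anonymous_id, session_type) "
            "VALUES (?, ?)",
            (snapshot["anonymous_id"], snapshot["session_type"]),
        )
        conn.execute("DELETE FROM messages WHERE session_id = ?", (session_id,))
        conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))
```

If any statement raises, the snapshot insert and both deletes are rolled back together, so the database never holds a snapshot alongside the undeleted originals, and never loses the originals without a snapshot.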
4. Storage Object Cleanup
Extract user_message_attachments paths from message metadata before anonymization, then delete those storage objects after the DB transaction commits (best-effort, non-blocking - storage cleanup failure should not roll back the DB operation).
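
A minimal sketch of the best-effort cleanup described above. The shape of `user_message_attachments` is not specified in this document, so the sketch assumes each entry is either a path string or a dict with a `path` key. `delete_object` is a hypothetical callable standing in for the real storage client.

```python
import logging

logger = logging.getLogger(__name__)


def collect_attachment_paths(messages):
    """Gather storage paths from each message's metadata, before rows are deleted."""
    paths = []
    for message in messages:
        metadata = message.get("metadata") or {}
        for item in metadata.get("user_message_attachments") or []:
            # Assumption: an attachment is a path string or a dict with a "path" key.
            path = item if isinstance(item, str) else item.get("path")
            if path:
                paths.append(path)
    return paths


def cleanup_storage(paths, delete_object):
    """Try to delete every path; log failures and return the paths that remain."""
    failed = []
    for path in paths:
        try:
            delete_object(path)
        except Exception:
            # Best-effort: a storage error must not undo the committed DB deletion.
            logger.warning("Storage cleanup failed for one attachment", exc_info=True)
            failed.append(path)
    return failed
```

The returned list of failed paths could feed the periodic garbage collection mentioned under Risks & Mitigations, since the message rows that referenced those objects no longer exist after the commit.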
5. Frontend Impact
None. The API contract remains identical:
- `DELETE /api/v1/agent/sessions/{thread_id}` returns 204 regardless of the internal change
- Frontend already does optimistic deletion with rollback on failure
No frontend changes required.
Migration Plan
- Create migration: `anonymous_session_snapshots` table + RLS policies
- Add `SessionAnonymizer` module
- Modify `AgentRepository.delete_session()` to: anonymize → save snapshot → hard-delete
- Add unit tests for anonymization logic
- Add integration test for the full deletion flow
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Data loss if anonymization fails mid-transaction | Wrap in single DB transaction; rollback on any error |
| Storage objects remain after hard-delete | Best-effort async cleanup; add periodic garbage collection |
| Anonymous data could be re-identified via time correlation | Truncate timestamps to day precision; no per-message timestamps |
| Existing soft-deleted sessions still have PII | Out of scope; handled separately via data cleanup script |
| question_type may not exist for all sessions | Make field nullable; skip if metadata lacks questionType |
Open Questions
- Should we also anonymize and hard-delete already soft-deleted sessions retroactively? (Recommended: yes, as a separate data cleanup task)
- Should `points_ledger` entries linked to the session also be cleaned up on deletion? (Out of scope for this task, but worth noting)
- Date precision: is day-level sufficient, or should we use week/month? (Proposing day-level as default)