qzl/eryao

qzl 0bb7d77a3f chore(task): archive 04-15-session-deletion-anonymization

2026-04-15 18:19:20 +08:00

PRD: Divination Session Deletion Anonymization (iOS Compliance)

Background

The iOS App Store (US market) requires apps to comply with data deletion regulations. When a user requests deletion of their divination history, the app must remove all personally identifiable information (PII). We still need to retain anonymized usage and tool data for product improvement and algorithm feedback.

The current implementation (DELETE /api/v1/agent/sessions/{thread_id}) only soft-deletes the sessions row by setting deleted_at. The associated messages rows are not touched at all: their deleted_at stays NULL and they remain fully queryable. This is insufficient for iOS compliance.

Goal

Replace the current soft-delete flow with an anonymize-then-hard-delete strategy:

Desensitize/Anonymize the session data: extract non-PII usage metrics, strip all PII, generate an anonymous snapshot
Save the anonymized usage data to a new anonymous_session_snapshots table
Hard-delete the original sessions and messages records from the database

Scope

In scope: Backend API change for session deletion, new anonymization table and migration, anonymization logic, hard-delete logic
Out of scope: Account deletion flow (separate concern), frontend UI changes (the API contract for deletion stays the same - DELETE returns 204)

Current Architecture

API Endpoint

Layer	File	Detail
Router	`backend/src/v1/agent/router.py:307-314`	`DELETE /sessions/{thread_id}`
Service	`backend/src/v1/agent/service.py:225-239`	`delete_session()` verifies ownership, soft-deletes
Repository	`backend/src/v1/agent/repository.py:99-119`	Sets `session.deleted_at = now()`

PII Fields (Must Be Anonymized or Removed)

High-risk PII:

Table	Column	Content
sessions	title	User's divination question (up to 80 chars)
sessions	state_snapshot	Full session state with divination payload
messages	content	Full user question and AI response text
messages	metadata	JSONB with `user_message_attachments`, `agent_output`, `tool_agent_output`

Medium-risk PII:

Table	Column	Content
sessions	user_id	Links to auth.users (UUID, indirect PII)
messages	session_id	Links to session (FK)

Non-PII (to retain for analytics):

Table	Column	Content
sessions	session_type	'chat'
sessions	status	pending/running/completed/failed
sessions	total_tokens	Usage metric
sessions	total_cost	Usage metric
sessions	message_count	Counter (used for follow-up ratio analysis)
sessions	created_at / last_activity_at	Timestamps
messages	model_code	LLM model identifier
messages	tool_name	Divination tool name
messages	latency_ms	Response latency
messages->metadata	agent_output.sign_level	Sign level (上上签/中上签/中下签/下下签, from most to least auspicious)
messages->metadata	agent_output.keywords	Key insights from reading
messages->metadata	agent_output.divination_derived.questionType	Question category (career/love/wealth/health)
messages->metadata	agent_output.divination_derived.guaName	Hexagram name
messages->metadata	agent_output.divination_derived.guaNameHant	Hexagram name (Traditional Chinese)
messages->metadata	agent_output.divination_derived.targetGuaName	Target hexagram name (if changing lines exist)
messages->metadata	agent_output.divination_derived.hasChangingYao	Whether session has changing lines

Analytics Requirements:

Question type distribution: Count by question_type
Follow-up ratio: message_count > 2 indicates follow-up questions
LLM performance comparison: Group by model_code, analyze status, total_latency_ms, total_tokens
Hexagram accuracy analysis: Distribution of sign_level, gua_name, has_changing_yao
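
To illustrate how these analyses could run against anonymized rows only, here is a minimal Python sketch. It assumes each snapshot is available as a plain dictionary keyed by the snapshot column names proposed in the Technical Design; the function names and sample values are hypothetical and not part of the existing codebase.

```python
from collections import Counter, defaultdict


def question_type_distribution(rows):
    """Count snapshots per question_type; rows without one are grouped under None."""
    return Counter(row.get("question_type") for row in rows)


def follow_up_ratio(rows):
    """Share of sessions with message_count > 2, the follow-up indicator used above."""
    counted = [row for row in rows if row.get("message_count") is not None]
    if not counted:
        return 0.0
    return sum(1 for row in counted if row["message_count"] > 2) / len(counted)


def mean_latency_by_model(rows):
    """Average total_latency_ms per model_code, skipping rows that lack either value."""
    totals = defaultdict(lambda: [0, 0])  # model_code -> [latency sum, row count]
    for row in rows:
        model, latency = row.get("model_code"), row.get("total_latency_ms")
        if model is None or latency is None:
            continue
        totals[model][0] += latency
        totals[model][1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


# Hypothetical sample rows, for illustration only.
SAMPLE_ROWS = [
    {"question_type": "career", "message_count": 2, "model_code": "model-a", "total_latency_ms": 1000},
    {"question_type": "love", "message_count": 4, "model_code": "model-a", "total_latency_ms": 3000},
    {"question_type": "career", "message_count": 6, "model_code": "model-b", "total_latency_ms": 500},
    {"question_type": None, "message_count": 2, "model_code": "model-b", "total_latency_ms": None},
]
```

In production these aggregations would more likely be SQL GROUP BY queries; the Python form is only meant to pin down the definitions.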

Technical Design

1. New Table: `anonymous_session_snapshots`

CREATE TABLE anonymous_session_snapshots (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  anonymous_id UUID NOT NULL,              -- Random UUID, no link to real user
  
  -- Session metadata
  session_type VARCHAR(20) NOT NULL,       -- 'chat'
  message_count INTEGER,                    -- Used for follow-up ratio analysis
  status VARCHAR(20),                       -- Session final status
  
  -- Question & divination
  question_type VARCHAR(50),                -- Career/love/wealth/health etc.
  tool_name VARCHAR(100),                   -- Divination tool name
  
  -- Hexagram details (for accuracy analysis)
  gua_name VARCHAR(50),                     -- Hexagram name
  gua_name_hant VARCHAR(50),                -- Hexagram name (Traditional Chinese)
  target_gua_name VARCHAR(50),              -- Target hexagram (if changing lines exist)
  has_changing_yao BOOLEAN,                 -- Whether session has changing lines
  sign_level VARCHAR(20),                   -- 上上签/中上签/中下签/下下签
  keywords TEXT[],                          -- Key insights from reading
  
  -- Model & usage metrics
  model_code VARCHAR(50),                   -- LLM model used
  total_tokens INTEGER,                     -- Token usage
  total_cost NUMERIC,                       -- Cost metric
  total_latency_ms INTEGER,                 -- Aggregated latency
  
  -- Timestamps (day precision to prevent re-identification)
  created_at TIMESTAMPTZ NOT NULL,          -- Original session creation time
  last_activity_at TIMESTAMPTZ,             -- Original last activity
  anonymized_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- RLS: service role only, no user access
ALTER TABLE anonymous_session_snapshots ENABLE ROW LEVEL SECURITY;
CREATE POLICY "Service role can manage anonymous snapshots"
  ON anonymous_session_snapshots FOR ALL
  USING (auth.role() = 'service_role');

Design notes:

anonymous_id is a randomly generated UUID with no mapping back to the original user
Timestamps are truncated to day precision before being stored (the columns remain TIMESTAMPTZ) to prevent re-identification via time correlation
session_type only supports 'chat' (AUTOMATION is legacy from reused database schema, not used in this project)
All structured non-PII fields are retained for flexible future analysis (principle: "complete retention, filter on analysis")
No user_id, no question text, no AI response text - only structured/aggregate metrics
RLS ensures no user (even authenticated) can access this table, only service_role
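
The first two design notes can be sketched in Python as follows. This is an illustration under stated assumptions, not the final implementation: the helper names are hypothetical, and truncating in UTC (and treating naive datetimes as UTC) is an assumption this PRD does not specify.

```python
import uuid
from datetime import datetime, timedelta, timezone


def truncate_to_day(ts):
    """Drop the time-of-day component after normalizing to UTC.

    Assumption: naive datetimes are treated as already being in UTC.
    """
    if ts is None:
        return None
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)


def new_anonymous_id():
    """Random UUIDv4 that is never derived from, or stored next to, the real user_id."""
    return uuid.uuid4()


# Example: a session created at 18:19:20 in UTC+8 falls on the same UTC day.
example = datetime(2026, 4, 15, 18, 19, 20, tzinfo=timezone(timedelta(hours=8)))
truncated = truncate_to_day(example)  # 2026-04-15 00:00:00+00:00
```

Note that a timestamp shortly after midnight in UTC+8 lands on the previous UTC day, so the choice of reference timezone should be settled together with the date-precision open question.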

2. Anonymization Service

New module: backend/src/v1/agent/anonymizer.py

class SessionAnonymizer:
    """Anonymizes session data per iOS compliance requirements."""

    def anonymize(self, session: AgentChatSession, messages: list[AgentChatMessage]) -> AnonymousSessionSnapshot:
        """
        Extract non-PII data from session+messages into an anonymous snapshot.

        - Generates a random anonymous_id (no mapping to user)
        - Truncates timestamps to day precision
        - Extracts question_type from message metadata (category only)
        - Aggregates latency metrics
        - Strips all PII: user_id, title, content, state_snapshot, attachments
        """

Key anonymization rules:

Strip entirely: user_id, title, state_snapshot, content (all message content), question (user's original text), answer (AI response text), user_message_attachments, raw agent_output / tool_agent_output objects
Retain structured fields:
- Session: session_type, status, total_tokens, total_cost, message_count
- Divination: question_type, tool_name, gua_name, gua_name_hant, target_gua_name, has_changing_yao, sign_level, keywords
- Model: model_code
Transform: timestamps truncated to day precision
Aggregate: sum latency_ms across all messages into total_latency_ms
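
The rules above could be applied roughly as in the sketch below. It is not the SessionAnonymizer implementation: it works on plain dictionaries instead of the ORM models, assumes timestamps are timezone-aware, and assumes the divination fields are taken from the first message that carries each one (the PRD does not say which message is authoritative). The metadata key paths follow the Non-PII table; the sample data is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def _day(ts):
    """Truncate a timezone-aware datetime to midnight UTC (assumption: UTC days)."""
    if ts is None:
        return None
    return ts.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)


def _first(values):
    """First value that is not None, so a legitimate False is preserved."""
    return next((v for v in values if v is not None), None)


def anonymize(session, messages):
    """Build a snapshot dict by copying allow-listed fields only; PII is never read."""
    outputs = [(m.get("metadata") or {}).get("agent_output") or {} for m in messages]
    derived = [o.get("divination_derived") or {} for o in outputs]
    return {
        "anonymous_id": str(uuid.uuid4()),  # random, no mapping to user_id
        "session_type": session["session_type"],
        "status": session.get("status"),
        "message_count": session.get("message_count"),
        "total_tokens": session.get("total_tokens"),
        "total_cost": session.get("total_cost"),
        "question_type": _first(d.get("questionType") for d in derived),
        "tool_name": _first(m.get("tool_name") for m in messages),
        "gua_name": _first(d.get("guaName") for d in derived),
        "gua_name_hant": _first(d.get("guaNameHant") for d in derived),
        "target_gua_name": _first(d.get("targetGuaName") for d in derived),
        "has_changing_yao": _first(d.get("hasChangingYao") for d in derived),
        "sign_level": _first(o.get("sign_level") for o in outputs),
        "keywords": _first(o.get("keywords") for o in outputs),
        "model_code": _first(m.get("model_code") for m in messages),
        "total_latency_ms": sum(m.get("latency_ms") or 0 for m in messages),
        "created_at": _day(session.get("created_at")),
        "last_activity_at": _day(session.get("last_activity_at")),
    }


# Hypothetical input, for illustration only.
SAMPLE_SESSION = {
    "user_id": "11111111-1111-1111-1111-111111111111",
    "title": "Should I change jobs?",
    "session_type": "chat",
    "status": "completed",
    "message_count": 2,
    "total_tokens": 900,
    "created_at": datetime(2026, 4, 15, 10, 19, 20, tzinfo=timezone.utc),
}
SAMPLE_MESSAGES = [
    {"content": "Should I change jobs?", "latency_ms": None, "metadata": None},
    {
        "content": "The reading suggests...",
        "model_code": "model-a",
        "tool_name": "tool-x",
        "latency_ms": 1500,
        "metadata": {"agent_output": {"divination_derived": {"questionType": "career", "hasChangingYao": False}}},
    },
]
```

Building the snapshot from an allow-list, rather than copying the session and deleting PII keys, means a newly added PII column cannot leak into the snapshot by default.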

3. Modified Deletion Flow

The DELETE /api/v1/agent/sessions/{thread_id} endpoint changes from soft-delete to:

1. Verify session ownership (existing logic)
2. Load session + all associated messages
3. Call SessionAnonymizer.anonymize() → create AnonymousSessionSnapshot
4. Insert anonymous snapshot into DB
5. Hard-delete messages (DELETE FROM messages WHERE session_id = ?)
6. Hard-delete session (DELETE FROM sessions WHERE id = ?)
7. Delete associated storage objects (user_message_attachments from metadata)
8. Return 204 (same as before)

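Steps 2-6 must run in a single database transaction to ensure atomicity; step 7 runs only after that transaction commits (see Storage Object Cleanup):

If anonymization fails, nothing is deleted
If deletion fails, the snapshot insert is rolled back and the original rows remain intact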
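
The transactional part of the flow can be sketched as follows. To keep the example self-contained it uses Python's built-in sqlite3 with a heavily reduced schema; the real backend presumably targets PostgreSQL through its own repository layer, so the table definitions and the function names here are assumptions made for illustration.

```python
import sqlite3
import uuid


def create_schema(conn):
    """Reduced, hypothetical versions of the three tables involved."""
    conn.executescript(
        """
        CREATE TABLE sessions (id TEXT PRIMARY KEY, user_id TEXT, title TEXT, session_type TEXT);
        CREATE TABLE messages (id TEXT PRIMARY KEY, session_id TEXT, content TEXT, latency_ms INTEGER);
        CREATE TABLE anonymous_session_snapshots (
            id TEXT PRIMARY KEY, anonymous_id TEXT, session_type TEXT, total_latency_ms INTEGER
        );
        """
    )


def delete_session(conn, session_id, user_id):
    """Anonymize, then hard-delete, all in one transaction. Returns False if not owned."""
    with conn:  # commits on success, rolls back if any statement raises
        row = conn.execute(
            "SELECT session_type FROM sessions WHERE id = ? AND user_id = ?",
            (session_id, user_id),
        ).fetchone()
        if row is None:
            return False  # ownership check failed; nothing is touched
        latency = conn.execute(
            "SELECT COALESCE(SUM(latency_ms), 0) FROM messages WHERE session_id = ?",
            (session_id,),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO anonymous_session_snapshots (id, anonymous_id, session_type, total_latency_ms)"
            " VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), str(uuid.uuid4()), row[0], latency),
        )
        conn.execute("DELETE FROM messages WHERE session_id = ?", (session_id,))
        conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))
    return True
```

If either DELETE raises, the snapshot insert is rolled back with it, so a failed deletion never leaves behind a snapshot for a session that still exists.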

4. Storage Object Cleanup

Extract user_message_attachments paths from message metadata before anonymization, then delete those storage objects after the DB transaction commits (best-effort, non-blocking - storage cleanup failure should not roll back the DB operation).
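
A minimal sketch of this two-phase cleanup follows. The shape of user_message_attachments is not specified in this PRD, so the code assumes each entry is either a path string or a dictionary with a "path" key; `delete_object` is a hypothetical callable wrapping whatever storage client the backend uses.

```python
import logging

logger = logging.getLogger(__name__)


def collect_attachment_paths(messages):
    """Gather unique storage paths from message metadata, preserving first-seen order.

    Call this before the rows are deleted, since the metadata disappears with them.
    """
    paths, seen = [], set()
    for message in messages:
        attachments = (message.get("metadata") or {}).get("user_message_attachments") or []
        for item in attachments:
            path = item.get("path") if isinstance(item, dict) else item
            if path and path not in seen:
                seen.add(path)
                paths.append(path)
    return paths


def cleanup_storage(paths, delete_object):
    """Best-effort deletion after commit: log failures and return the paths that failed."""
    failed = []
    for path in paths:
        try:
            delete_object(path)
        except Exception:  # a storage failure must not propagate to the caller
            logger.warning("storage cleanup failed for %s", path, exc_info=True)
            failed.append(path)
    return failed
```

Returning the failed paths gives a later garbage-collection pass something to retry, without ever affecting the committed database state.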

5. Frontend Impact

None. The API contract remains identical:

DELETE /api/v1/agent/sessions/{thread_id} still returns 204 on success, regardless of the internal change
Frontend already does optimistic deletion with rollback on failure

No frontend changes required.

Migration Plan

Create migration: anonymous_session_snapshots table + RLS policies
Add SessionAnonymizer module
Modify AgentRepository.delete_session() to: anonymize → save snapshot → hard-delete
Add unit tests for anonymization logic
Add integration test for the full deletion flow
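
One check the unit tests could include is an allow-list assertion on the snapshot's keys, sketched below. The allowed set is copied from the anonymous_session_snapshots columns defined above; the helper name is hypothetical.

```python
# Columns of anonymous_session_snapshots as defined in the Technical Design.
ALLOWED_SNAPSHOT_COLUMNS = frozenset({
    "id", "anonymous_id", "session_type", "message_count", "status",
    "question_type", "tool_name", "gua_name", "gua_name_hant",
    "target_gua_name", "has_changing_yao", "sign_level", "keywords",
    "model_code", "total_tokens", "total_cost", "total_latency_ms",
    "created_at", "last_activity_at", "anonymized_at",
})


def find_unexpected_columns(snapshot):
    """Return the sorted keys of a snapshot dict that are not allow-listed columns."""
    return sorted(set(snapshot) - ALLOWED_SNAPSHOT_COLUMNS)
```

A test would then assert that `find_unexpected_columns(snapshot)` is empty, which catches fields such as user_id or title slipping into the snapshot.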

Risks & Mitigations

Risk	Mitigation
Data loss if anonymization fails mid-transaction	Wrap in single DB transaction; rollback on any error
Storage objects remain after hard-delete	Best-effort async cleanup; add periodic garbage collection
Anonymous data could be re-identified via time correlation	Truncate timestamps to day precision; no per-message timestamps
Existing soft-deleted sessions still have PII	Out of scope; handled separately via data cleanup script
question_type may not exist for all sessions	Make field nullable; skip if metadata lacks questionType

Open Questions

Should we also anonymize and hard-delete already soft-deleted sessions retroactively? (Recommended: yes, as a separate data cleanup task)
Should points_ledger entries linked to the session also be cleaned up on deletion? (Out of scope for this task, but worth noting)
Date precision: is day-level sufficient, or should we use week/month? (Proposing day-level as default)
