PRD: Divination Session Deletion Anonymization (iOS Compliance)

Background

The iOS App Store (US market) requires apps to comply with data deletion regulations. When a user requests deletion of their divination history, the app must remove all personally identifiable information (PII). We still need to retain anonymized usage and tool data for product improvement and algorithm feedback.

Current implementation (DELETE /api/v1/agent/sessions/{thread_id}) only performs a soft-delete on the sessions row (sets deleted_at). Associated messages rows are not touched at all and remain fully queryable with deleted_at = NULL. This is insufficient for iOS compliance.

Goal

Replace the current soft-delete flow with an anonymize-then-hard-delete strategy:

  1. Desensitize/Anonymize the session data: extract non-PII usage metrics, strip all PII, generate an anonymous snapshot
  2. Save the anonymized usage data to a new anonymous_session_snapshots table
  3. Hard-delete the original sessions and messages records from the database

Scope

  • In scope: Backend API change for session deletion, new anonymization table and migration, anonymization logic, hard-delete logic
  • Out of scope: Account deletion flow (separate concern), frontend UI changes (the API contract for deletion stays the same - DELETE returns 204)

Current Architecture

API Endpoint

| Layer | File | Detail |
| --- | --- | --- |
| Router | backend/src/v1/agent/router.py:307-314 | DELETE /sessions/{thread_id} |
| Service | backend/src/v1/agent/service.py:225-239 | delete_session() verifies ownership, soft-deletes |
| Repository | backend/src/v1/agent/repository.py:99-119 | Sets session.deleted_at = now() |

PII Fields (Must Be Anonymized or Removed)

High-risk PII:

| Table | Column | Content |
| --- | --- | --- |
| sessions | title | User's divination question (up to 80 chars) |
| sessions | state_snapshot | Full session state with divination payload |
| messages | content | Full user question and AI response text |
| messages | metadata | JSONB with user_message_attachments, agent_output, tool_agent_output |

Medium-risk PII:

| Table | Column | Content |
| --- | --- | --- |
| sessions | user_id | Links to auth.users (UUID, indirect PII) |
| messages | session_id | Links to session (FK) |

Non-PII (to retain for analytics):

| Table | Column | Content |
| --- | --- | --- |
| sessions | session_type | 'chat' |
| sessions | status | pending/running/completed/failed |
| sessions | total_tokens | Usage metric |
| sessions | total_cost | Usage metric |
| sessions | message_count | Counter (used for follow-up ratio analysis) |
| sessions | created_at / last_activity_at | Timestamps |
| messages | model_code | LLM model identifier |
| messages | tool_name | Divination tool name |
| messages | latency_ms | Response latency |
| messages->metadata | agent_output.sign_level | Sign level (上上签/中上签/中下签/下下签) |
| messages->metadata | agent_output.keywords | Key insights from reading |
| messages->metadata | agent_output.divination_derived.questionType | Question category (career/love/wealth/health) |
| messages->metadata | agent_output.divination_derived.guaName | Hexagram name |
| messages->metadata | agent_output.divination_derived.guaNameHant | Hexagram name (Traditional Chinese) |
| messages->metadata | agent_output.divination_derived.targetGuaName | Target hexagram name (if changing lines exist) |
| messages->metadata | agent_output.divination_derived.hasChangingYao | Whether session has changing lines |

Analytics Requirements:

  1. Question type distribution: Count by question_type
  2. Follow-up ratio: message_count > 2 indicates follow-up questions
  3. LLM performance comparison: Group by model_code, analyze status, total_latency_ms, total_tokens
  4. Hexagram accuracy analysis: Distribution of sign_level, gua_name, has_changing_yao
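
As an illustration of how the first two requirements could be computed from snapshot rows, here is a minimal sketch. It assumes rows have already been fetched as plain dicts keyed by the snapshot column names; the helper names and the 'unknown' bucket for missing question types are hypothetical, not part of the existing codebase.

```python
from collections import Counter


def question_type_distribution(snapshots):
    """Count snapshots per question_type; missing values go into 'unknown' (assumed bucket)."""
    return Counter(s.get("question_type") or "unknown" for s in snapshots)


def follow_up_ratio(snapshots):
    """Fraction of snapshots with message_count > 2, i.e. sessions with follow-up questions."""
    counted = [s for s in snapshots if s.get("message_count") is not None]
    if not counted:
        return 0.0
    return sum(1 for s in counted if s["message_count"] > 2) / len(counted)


# Illustrative rows only; real rows would come from anonymous_session_snapshots.
sample_snapshots = [
    {"question_type": "career", "message_count": 2},
    {"question_type": "love", "message_count": 4},
    {"question_type": "career", "message_count": 6},
    {"question_type": None, "message_count": 2},
]

distribution = question_type_distribution(sample_snapshots)
ratio = follow_up_ratio(sample_snapshots)
```

Rows with a NULL message_count are excluded from the ratio's denominator here; whether they should count as "no follow-up" instead is a product decision this sketch does not settle.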

Technical Design

1. New Table: anonymous_session_snapshots

CREATE TABLE anonymous_session_snapshots (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  anonymous_id UUID NOT NULL,              -- Random UUID, no link to real user
  
  -- Session metadata
  session_type VARCHAR(20) NOT NULL,       -- 'chat'
  message_count INTEGER,                    -- Used for follow-up ratio analysis
  status VARCHAR(20),                       -- Session final status
  
  -- Question & divination
  question_type VARCHAR(50),                -- Career/love/wealth/health etc.
  tool_name VARCHAR(100),                   -- Divination tool name
  
  -- Hexagram details (for accuracy analysis)
  gua_name VARCHAR(50),                     -- Hexagram name
  gua_name_hant VARCHAR(50),                -- Hexagram name (Traditional Chinese)
  target_gua_name VARCHAR(50),              -- Target hexagram (if changing lines exist)
  has_changing_yao BOOLEAN,                 -- Whether session has changing lines
  sign_level VARCHAR(20),                   -- 上上签/中上签/中下签/下下签
  keywords TEXT[],                          -- Key insights from reading
  
  -- Model & usage metrics
  model_code VARCHAR(50),                   -- LLM model used
  total_tokens INTEGER,                     -- Token usage
  total_cost NUMERIC,                       -- Cost metric
  total_latency_ms INTEGER,                 -- Aggregated latency
  
  -- Timestamps (day precision to prevent re-identification)
  created_at TIMESTAMPTZ NOT NULL,          -- Original session creation time
  last_activity_at TIMESTAMPTZ,             -- Original last activity
  anonymized_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- RLS: service role only, no user access
ALTER TABLE anonymous_session_snapshots ENABLE ROW LEVEL SECURITY;
CREATE POLICY "Service role can manage anonymous snapshots"
  ON anonymous_session_snapshots FOR ALL
  USING (auth.role() = 'service_role');

Design notes:

  • anonymous_id is a randomly generated UUID with no mapping back to the original user
  • Timestamps are stored with date-only precision (day granularity) to prevent re-identification via time correlation
  • session_type only supports 'chat' (AUTOMATION is legacy from reused database schema, not used in this project)
  • All structured non-PII fields are retained for flexible future analysis (principle: "complete retention, filter on analysis")
  • No user_id, no question text, no AI response text - only structured/aggregate metrics
  • RLS ensures no user (even authenticated) can access this table, only service_role
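
The day-granularity rule for timestamps could look like the following sketch. Normalizing to UTC before truncating is an assumption (the PRD does not say which timezone defines a "day"), and truncate_to_day is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def truncate_to_day(ts: datetime) -> datetime:
    """Convert an aware datetime to UTC and zero out the time of day.

    Assumption: 'day' means the UTC calendar day. Naive datetimes are
    rejected, because astimezone() would silently treat them as local time.
    """
    if ts.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return ts.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)


# A session created at 01:30 in UTC+8 falls on the previous UTC day.
created_local = datetime(2026, 4, 15, 1, 30, tzinfo=timezone(timedelta(hours=8)))
created_day = truncate_to_day(created_local)
```

Because the column type stays TIMESTAMPTZ, the stored value is still a full timestamp; only its time-of-day component is fixed at midnight.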

2. Anonymization Service

New module: backend/src/v1/agent/anonymizer.py

class SessionAnonymizer:
    """Anonymizes session data per iOS compliance requirements."""

    def anonymize(self, session: AgentChatSession, messages: list[AgentChatMessage]) -> AnonymousSessionSnapshot:
        """
        Extract non-PII data from session+messages into an anonymous snapshot.

        - Generates a random anonymous_id (no mapping to user)
        - Truncates timestamps to day precision
        - Extracts question_type from message metadata (category only)
        - Aggregates latency metrics
        - Strips all PII: user_id, title, content, state_snapshot, attachments
        """

Key anonymization rules:

  • Strip entirely: user_id, title, state_snapshot, content (all message content), question (user's original text), answer (AI response text), user_message_attachments, raw agent_output / tool_agent_output objects
  • Retain structured fields:
    • Session: session_type, status, total_tokens, total_cost, message_count
    • Divination: question_type, tool_name, gua_name, gua_name_hant, target_gua_name, has_changing_yao, sign_level, keywords
    • Model: model_code
  • Transform: timestamps truncated to day precision
  • Aggregate: sum latency_ms across all messages into total_latency_ms
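
The rules above amount to an allow-list: only named fields are copied, and everything else is dropped by default. The sketch below illustrates that idea with plain dicts standing in for AgentChatSession and AgentChatMessage. Taking the reading from the first message that carries agent_output, taking model_code and tool_name from the first message that has them, and truncating days in UTC are all assumptions made for illustration.

```python
import uuid
from datetime import timezone


def _to_day(ts):
    """Truncate an aware datetime to midnight UTC (assumed day boundary); pass None through."""
    if ts is None:
        return None
    return ts.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)


def _first(messages, key):
    """First non-null value of `key` across messages (selection rule is an assumption)."""
    return next((m[key] for m in messages if m.get(key) is not None), None)


def anonymize(session: dict, messages: list[dict]) -> dict:
    """Build a snapshot from an allow-list; any field not copied here is stripped."""
    agent_output = {}
    for m in messages:
        candidate = (m.get("metadata") or {}).get("agent_output")
        if candidate:
            agent_output = candidate  # assumption: first reading found is the one to keep
            break
    derived = agent_output.get("divination_derived") or {}

    return {
        "anonymous_id": str(uuid.uuid4()),  # random, never derived from user_id
        "session_type": session["session_type"],
        "status": session.get("status"),
        "message_count": session.get("message_count"),
        "total_tokens": session.get("total_tokens"),
        "total_cost": session.get("total_cost"),
        "question_type": derived.get("questionType"),  # None when metadata lacks it
        "tool_name": _first(messages, "tool_name"),
        "gua_name": derived.get("guaName"),
        "gua_name_hant": derived.get("guaNameHant"),
        "target_gua_name": derived.get("targetGuaName"),
        "has_changing_yao": derived.get("hasChangingYao"),
        "sign_level": agent_output.get("sign_level"),
        "keywords": agent_output.get("keywords"),
        "model_code": _first(messages, "model_code"),
        "total_latency_ms": sum(m.get("latency_ms") or 0 for m in messages),
        "created_at": _to_day(session.get("created_at")),
        "last_activity_at": _to_day(session.get("last_activity_at")),
    }
```

Building the snapshot from an allow-list rather than deleting known PII keys means a new PII-bearing column added later is excluded unless someone deliberately adds it here.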

3. Modified Deletion Flow

The DELETE /api/v1/agent/sessions/{thread_id} endpoint changes from soft-delete to:

1. Verify session ownership (existing logic)
2. Load session + all associated messages
3. Call SessionAnonymizer.anonymize() → create AnonymousSessionSnapshot
4. Insert anonymous snapshot into DB
5. Hard-delete messages (DELETE FROM messages WHERE session_id = ?)
6. Hard-delete session (DELETE FROM sessions WHERE id = ?)
7. Delete associated storage objects (user_message_attachments from metadata)
8. Return 204 (same as before)

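Steps 2-6 must run in a single database transaction to ensure atomicity; step 7 runs only after that transaction commits (see Storage Object Cleanup):

  • If anonymization or the snapshot insert fails, nothing is deleted
  • If a hard-delete fails, the snapshot insert is rolled back and the original rows remain intact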
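
The atomicity requirement can be demonstrated with a small, self-contained sketch. It uses Python's built-in sqlite3 and a heavily simplified schema purely so the example runs anywhere; the real backend would use its own database session and the full table definitions, and the function names here are hypothetical.

```python
import sqlite3
import uuid

# Simplified stand-in schema (assumption: only the columns needed for the demo).
SCHEMA = """
CREATE TABLE sessions (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE messages (id INTEGER PRIMARY KEY, session_id TEXT, content TEXT);
CREATE TABLE anonymous_session_snapshots (
    anonymous_id TEXT NOT NULL,
    session_type TEXT NOT NULL,
    message_count INTEGER
);
"""


def make_demo_db():
    """Create an in-memory database holding one session with two messages."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO sessions VALUES ('s1', 'private question')")
    conn.executemany(
        "INSERT INTO messages (session_id, content) VALUES (?, ?)",
        [("s1", "question"), ("s1", "answer")],
    )
    conn.commit()
    return conn


def anonymize_and_hard_delete(conn, session_id, snapshot):
    """Insert the snapshot and hard-delete the originals as one unit of work."""
    with conn:  # commits on success; rolls back all three statements if any raises
        conn.execute(
            "INSERT INTO anonymous_session_snapshots VALUES (?, ?, ?)",
            (str(uuid.uuid4()), snapshot["session_type"], snapshot["message_count"]),
        )
        conn.execute("DELETE FROM messages WHERE session_id = ?", (session_id,))
        conn.execute("DELETE FROM sessions WHERE id = ?", (session_id,))


def count(conn, table):
    """Row count helper; `table` is a trusted literal in this demo."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

With this shape, a snapshot that violates a NOT NULL constraint aborts the whole operation before any DELETE runs, and a failing DELETE discards the already-inserted snapshot, which matches the two bullets above.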

4. Storage Object Cleanup

Extract user_message_attachments paths from message metadata before anonymization, then delete those storage objects after the DB transaction commits (best-effort, non-blocking - storage cleanup failure should not roll back the DB operation).
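
A minimal sketch of this two-phase cleanup follows. The shape of user_message_attachments is not specified in this PRD, so the code assumes each entry is either a path string or a dict with a "path" key; delete_object is a hypothetical callable standing in for whatever storage client the backend uses.

```python
import logging

logger = logging.getLogger(__name__)


def extract_attachment_paths(messages):
    """Collect storage paths from message metadata before the rows are deleted.

    Assumption: each attachment is a path string or a dict with a 'path' key.
    """
    paths = []
    for message in messages:
        attachments = (message.get("metadata") or {}).get("user_message_attachments") or []
        for attachment in attachments:
            path = attachment.get("path") if isinstance(attachment, dict) else attachment
            if path:
                paths.append(path)
    return paths


def cleanup_storage(paths, delete_object):
    """Best-effort deletion: log and collect failures instead of raising.

    Call this only after the DB transaction has committed, so a storage
    error can never roll back the anonymize-and-delete operation.
    """
    failed = []
    for path in paths:
        try:
            delete_object(path)
        except Exception:
            logger.warning("storage cleanup failed for %s", path, exc_info=True)
            failed.append(path)
    return failed
```

Returning the failed paths gives the caller something to hand to a later garbage-collection pass instead of losing track of orphaned objects.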

5. Frontend Impact

None. The API contract remains identical:

  • DELETE /api/v1/agent/sessions/{thread_id} still returns 204 on success
  • Frontend already does optimistic deletion with rollback on failure

No frontend changes required.

Migration Plan

  1. Create migration: anonymous_session_snapshots table + RLS policies
  2. Add SessionAnonymizer module
  3. Modify AgentRepository.delete_session() to: anonymize → save snapshot → hard-delete
  4. Add unit tests for anonymization logic
  5. Add integration test for the full deletion flow

Risks & Mitigations

| Risk | Mitigation |
| --- | --- |
| Data loss if anonymization fails mid-transaction | Wrap in single DB transaction; rollback on any error |
| Storage objects remain after hard-delete | Best-effort async cleanup; add periodic garbage collection |
| Anonymous data could be re-identified via time correlation | Truncate timestamps to day precision; no per-message timestamps |
| Existing soft-deleted sessions still have PII | Out of scope; handled separately via data cleanup script |
| question_type may not exist for all sessions | Make field nullable; skip if metadata lacks questionType |

Open Questions

  1. Should we also anonymize and hard-delete already soft-deleted sessions retroactively? (Recommended: yes, as a separate data cleanup task)
  2. Should points_ledger entries linked to the session also be cleaned up on deletion? (Out of scope for this task, but worth noting)
  3. Date precision: is day-level sufficient, or should we use week/month? (Proposing day-level as default)