Files
eryao/.trellis/spec/backend/logging-guidelines.md
T

7.8 KiB

Logging Guidelines

How logging is done in this project.


Overview

This project uses structlog for structured, JSON-formatted logging:

  • Library: structlog with standard library integration
  • Interface: get_logger(name) from core.logging
  • Format: JSON logs in production, readable logs in development
  • Sensitive fields: Automatically redacted (passwords, tokens, etc.)

Log Levels

Use appropriate log levels based on event importance:

Level When to Use Examples Noise Level
ERROR All exceptions and failures Database connection failed, unhandled exception Required, never skip
WARNING Degraded behavior, retry, fallback Cache miss, using fallback data, retry attempt Minimal, only when action taken
INFO Key business events User login, message sent, run started Minimal, only milestones
DEBUG Detailed flow tracing (dev only) Function entry/exit, variable values High, avoid in release

Structured Logging

Logger Setup

from core.logging import get_logger

logger = get_logger("v1.agent.service")

Module naming convention:

  • Feature modules: v1.<feature>.<component> (e.g., v1.agent.service)
  • Core modules: core.<module>.<submodule> (e.g., core.logging.middleware)

Log Format

JSON structure:

{
  "event": "user_logged_in",
  "user_id": "123e4567-e89b-12d3-a456-426614174000",
  "timestamp": "2026-04-10T12:34:56.789Z",
  "level": "info",
  "logger": "v1.agent.service",
  "process": 12345,
  "thread": 140123456789024
}

Logging Methods

# Simple event
logger.info("user_logged_in", user_id=user.id)

# With extra context
logger.info(
    "run_enqueued",
    run_id=run_id,
    thread_id=thread_id,
    user_id=current_user.id,
)

# Error with exception info
try:
    await repository.create_session(...)
except IntegrityError as exc:
    logger.error(
        "session_creation_failed",
        error=str(exc),
        exc_info=True,  # Include stack trace
        user_id=user_id,
    )
    raise

What to Log

Always Log

  1. Exceptions: Every catch block that handles an exception must log it
  2. Business milestones: Key domain events (login, run started, message sent)
  3. Failed operations: Any operation that fails or requires retry
  4. Auth events: Login/logout, token refresh, auth failures
  5. External calls: LLM API calls, database queries (at DEBUG level)

Log with Context

# Include relevant identifiers
logger.info(
    "run_completed",
    run_id=run_id,
    thread_id=thread_id,
    duration_seconds=duration,
)

# Include error details
logger.error(
    "database_error",
    operation="create_session",
    error=str(exc),
    exc_info=True,
)

Log Levels by Event Type

INFO - Business Milestones:

logger.info("user_logged_in", user_id=user.id)
logger.info("run_started", run_id=run_id, thread_id=thread_id)
logger.info("run_completed", run_id=run_id, duration_seconds=duration)
logger.info("message_sent", message_id=msg_id, conversation_id=conv_id)

WARNING - Degraded Behavior:

logger.warning("cache_miss", key=cache_key, fallback="reload_from_db")
logger.warning("retry_attempt", operation="llm_call", attempt=2, max_attempts=3)
logger.warning("fallback_data_used", reason="external_api_timeout")

ERROR - Failures:

logger.error("database_connection_failed", error=str(exc), exc_info=True)
logger.error("validation_failed", field="email", reason="invalid_format")
logger.error("auth_failed", user_id=user_id, reason="invalid_token")
try:
    await service.do_operation()
except Exception as exc:
    logger.error("operation_failed", error=str(exc), exc_info=True)
    raise

What NOT to Log

Forbidden: Sensitive Data

Never log:

  • Passwords (even hashed)
  • API keys (ERYAO__* environment variables)
  • JWT tokens
  • Access tokens / refresh tokens
  • Session IDs (except for tracing)
  • PII (personally identifiable information) - names, emails, phone numbers
  • Credit card numbers
  • Social security numbers
  • Secret keys
  • Authorization headers
  • Cookies

Forbidden: Debug in Production

# WRONG: Debug logs in release builds
import os
if os.getenv("DEBUG"):
    logger.debug("variable_value", var=some_data)  # Never do this

Right way: Use runtime log level configuration:

# Config handles log level based on environment
# runtime:
#   log_level: "INFO"  # production
#   log_level: "DEBUG"  # development
logger.debug("variable_value", var=some_data)  # Will be filtered in production

Forbidden: Iteration Logs

# WRONG: Log every iteration
for item in items:
    logger.debug("processing_item", item_id=item.id)  # Too noisy
    process(item)

Right way: Log only failures:

for item in items:
    try:
        process(item)
    except Exception as exc:
        logger.error("item_processing_failed", item_id=item.id, error=str(exc))

Configuration

Settings

# core.config.settings.RuntimeSettings
class RuntimeSettings(BaseModel):
    log_level: str = "INFO"  # DEBUG, INFO, WARNING, ERROR
    log_json: bool = True  # JSON format (production) vs readable (dev)
    log_rotation: Literal["time", "size", "none"] = "time"
    log_rotation_when: str = "midnight"  # Daily rotation
    log_rotation_interval: int = 1
    log_rotation_backup_count: int = 14  # Keep 14 days
    log_file_name: str = ""  # Auto-generated: "{service_name}.log"
    log_sensitive_fields: list[str] = [
        "password", "secret", "token", "api_key",
        "authorization", "cookie", "client_ip", "user_id",
    ]

Log Files

Location: logs/{service_name}.log and logs/errors/{service_name}.error.log

Example:

  • logs/app.log - All logs
  • logs/errors/app.error.log - Errors only

Common Mistakes

Using print() instead of logging

# WRONG: Never use print in runtime code
print(f"User {user_id} logged in")

Right: Use logger.info:

logger.info("user_logged_in", user_id=user_id)

Logging sensitive data

# WRONG: Logging password
logger.info("user_login", username=username, password=password)

Right: Exclude sensitive fields:

logger.info("user_login", username=username)  # Password excluded

Skipping error logging

# WRONG: Exception is suppressed
try:
    await service.do_something()
except Exception:
    pass  # No logging, no re-raise

Right: Log and re-raise:

try:
    await service.do_something()
except Exception as exc:
    logger.error("operation_failed", error=str(exc), exc_info=True)
    raise

Info flooding

# WRONG: Log every database query
results = await repository.get_all()
logger.info("query_completed", count=len(results))  # Too noisy

Right: Log only significant operations:

results = await repository.get_all()
# Only log if something notable happens
if len(results) > MAX_THRESHOLD:
    logger.warning("high_result_count", count=len(results))

Examples from Codebase

Service Layer Logging

# v1/agent/service.py
logger = get_logger(__name__)

async def enqueue_run(self, ...) -> TaskAccepted:
    logger.info(
        "run_enqueued",
        run_id=run_id,
        thread_id=thread_id,
        user_id=current_user.id,
    )
    # ...

Router Layer Logging

# v1/agent/router.py
logger = get_logger("v1.agent.router")

async def _acquire_sse_slot(*, user_id: str) -> bool:
    try:
        redis = await get_or_init_redis_client()
        # ...
    except Exception as exc:  # noqa: BLE001
        logger.warning(
            "sse_slot_acquire_failed",
            user_id=user_id,
            reason=str(exc),
        )
        return True  # Graceful degradation