Files
eryao/.trellis/spec/backend/logging-guidelines.md
T

340 lines
7.8 KiB
Markdown

# Logging Guidelines
> How logging is done in this project.
---
## Overview
This project uses **structlog** for structured, JSON-formatted logging:
- **Library**: `structlog` with standard library integration
- **Interface**: `get_logger(name)` from `core.logging`
- **Format**: JSON logs in production, readable logs in development
- **Sensitive fields**: Automatically redacted (passwords, tokens, etc.)
---
## Log Levels
Use appropriate log levels based on event importance:
| Level | When to Use | Examples | Noise Level |
|-------|-------------|----------|-------------|
| **ERROR** | All exceptions and failures | Database connection failed, unhandled exception | Required, never skip |
| **WARNING** | Degraded behavior, retry, fallback | Cache miss, using fallback data, retry attempt | Minimal, only when action taken |
| **INFO** | Key business events | User login, message sent, run started | Minimal, only milestones |
| **DEBUG** | Detailed flow tracing (dev only) | Function entry/exit, variable values | High, avoid in release |
---
## Structured Logging
### Logger Setup
```python
from core.logging import get_logger
logger = get_logger("v1.agent.service")
```
**Module naming convention:**
- Feature modules: `v1.<feature>.<component>` (e.g., `v1.agent.service`)
- Core modules: `core.<module>.<submodule>` (e.g., `core.logging.middleware`)
### Log Format
**JSON structure:**
```json
{
"event": "user_logged_in",
"user_id": "123e4567-e89b-12d3-a456-426614174000",
"timestamp": "2026-04-10T12:34:56.789Z",
"level": "info",
"logger": "v1.agent.service",
"process": 12345,
"thread": 140123456789024
}
```
### Logging Methods
```python
# Simple event
logger.info("user_logged_in", user_id=user.id)
# With extra context
logger.info(
"run_enqueued",
run_id=run_id,
thread_id=thread_id,
user_id=current_user.id,
)
# Error with exception info
try:
await repository.create_session(...)
except IntegrityError as exc:
logger.error(
"session_creation_failed",
error=str(exc),
exc_info=True, # Include stack trace
user_id=user_id,
)
raise
```
---
## What to Log
### Always Log
1. **Exceptions**: Every catch block that handles an exception must log it
2. **Business milestones**: Key domain events (login, run started, message sent)
3. **Failed operations**: Any operation that fails or requires retry
4. **Auth events**: Login/logout, token refresh, auth failures
5. **External calls**: LLM API calls, database queries (at DEBUG level)
### Log with Context
```python
# Include relevant identifiers
logger.info(
"run_completed",
run_id=run_id,
thread_id=thread_id,
duration_seconds=duration,
)
# Include error details
logger.error(
"database_error",
operation="create_session",
error=str(exc),
exc_info=True,
)
```
### Log Levels by Event Type
**INFO - Business Milestones:**
```python
logger.info("user_logged_in", user_id=user.id)
logger.info("run_started", run_id=run_id, thread_id=thread_id)
logger.info("run_completed", run_id=run_id, duration_seconds=duration)
logger.info("message_sent", message_id=msg_id, conversation_id=conv_id)
```
**WARNING - Degraded Behavior:**
```python
logger.warning("cache_miss", key=cache_key, fallback="reload_from_db")
logger.warning("retry_attempt", operation="llm_call", attempt=2, max_attempts=3)
logger.warning("fallback_data_used", reason="external_api_timeout")
```
**ERROR - Failures:**
```python
logger.error("database_connection_failed", error=str(exc), exc_info=True)
logger.error("validation_failed", field="email", reason="invalid_format")
logger.error("auth_failed", user_id=user_id, reason="invalid_token")
try:
await service.do_operation()
except Exception as exc:
logger.error("operation_failed", error=str(exc), exc_info=True)
raise
```
---
## What NOT to Log
### Forbidden: Sensitive Data
**Never log:**
- Passwords (even hashed)
- API keys (`ERYAO__*` environment variables)
- JWT tokens
- Access tokens / refresh tokens
- Session IDs (except for tracing)
- PII (personally identifiable information) - names, emails, phone numbers
- Credit card numbers
- Social security numbers
- Secret keys
- Authorization headers
- Cookies
### Forbidden: Debug in Production
```python
# WRONG: Debug logs in release builds
import os
if os.getenv("DEBUG"):
logger.debug("variable_value", var=some_data) # Never do this
```
**Right way:** Use runtime log level configuration:
```python
# Config handles log level based on environment
# runtime:
# log_level: "INFO" # production
# log_level: "DEBUG" # development
logger.debug("variable_value", var=some_data) # Will be filtered in production
```
### Forbidden: Iteration Logs
```python
# WRONG: Log every iteration
for item in items:
logger.debug("processing_item", item_id=item.id) # Too noisy
process(item)
```
**Right way:** Log only failures:
```python
for item in items:
try:
process(item)
except Exception as exc:
logger.error("item_processing_failed", item_id=item.id, error=str(exc))
```
---
## Configuration
### Settings
```python
# core.config.settings.RuntimeSettings
class RuntimeSettings(BaseModel):
log_level: str = "INFO" # DEBUG, INFO, WARNING, ERROR
log_json: bool = True # JSON format (production) vs readable (dev)
log_rotation: Literal["time", "size", "none"] = "time"
log_rotation_when: str = "midnight" # Daily rotation
log_rotation_interval: int = 1
log_rotation_backup_count: int = 14 # Keep 14 days
log_file_name: str = "" # Auto-generated: "{service_name}.log"
log_sensitive_fields: list[str] = [
"password", "secret", "token", "api_key",
"authorization", "cookie", "client_ip", "user_id",
]
```
### Log Files
**Location:** `logs/{service_name}.log` and `logs/errors/{service_name}.error.log`
**Example:**
- `logs/app.log` - All logs
- `logs/errors/app.error.log` - Errors only
---
## Common Mistakes
### ❌ Using `print()` instead of logging
```python
# WRONG: Never use print in runtime code
print(f"User {user_id} logged in")
```
**Right:** Use `logger.info`:
```python
logger.info("user_logged_in", user_id=user_id)
```
### ❌ Logging sensitive data
```python
# WRONG: Logging password
logger.info("user_login", username=username, password=password)
```
**Right:** Exclude sensitive fields:
```python
logger.info("user_login", username=username) # Password excluded
```
### ❌ Skipping error logging
```python
# WRONG: Exception is suppressed
try:
await service.do_something()
except Exception:
pass # No logging, no re-raise
```
**Right:** Log and re-raise:
```python
try:
await service.do_something()
except Exception as exc:
logger.error("operation_failed", error=str(exc), exc_info=True)
raise
```
### ❌ Info flooding
```python
# WRONG: Log every database query
results = await repository.get_all()
logger.info("query_completed", count=len(results)) # Too noisy
```
**Right:** Log only significant operations:
```python
results = await repository.get_all()
# Only log if something notable happens
if len(results) > MAX_THRESHOLD:
logger.warning("high_result_count", count=len(results))
```
---
## Examples from Codebase
### Service Layer Logging
```python
# v1/agent/service.py
logger = get_logger(__name__)
async def enqueue_run(self, ...) -> TaskAccepted:
logger.info(
"run_enqueued",
run_id=run_id,
thread_id=thread_id,
user_id=current_user.id,
)
# ...
```
### Router Layer Logging
```python
# v1/agent/router.py
logger = get_logger("v1.agent.router")
async def _acquire_sse_slot(*, user_id: str) -> bool:
try:
redis = await get_or_init_redis_client()
# ...
except Exception as exc: # noqa: BLE001
logger.warning(
"sse_slot_acquire_failed",
user_id=user_id,
reason=str(exc),
)
return True # Graceful degradation
```