Phase 6: Performance & Production Hardening - Discussion Log

Audit trail only. Do not use as input to planning, research, or execution agents. Decisions are captured in CONTEXT.md — this log preserves the alternatives considered.

Date: 2026-05-30
Phase: 6-performance-production-hardening
Areas discussed: Observability stack, Load testing & SLA targets, Container hardening depth, Rate limit header bypass prevention


Observability Stack

Structured Logging Library

| Option | Description | Selected |
| --- | --- | --- |
| structlog | Purpose-built for structured logging; processor pipeline makes correlation IDs trivial; plays well with FastAPI middleware | ✓ |
| Standard logging + python-json-logger | Minimal change: configure the stdlib root logger with a JSON formatter. Less powerful but zero new dependencies | |
| loguru | Simple API, good defaults, supports structured output via sink config | |

User's choice: structlog
Notes: No follow-up notes.
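To make the "processor pipeline" idea concrete, here is a minimal stdlib-only sketch of how a chain of processors can attach a correlation ID to every log event. It deliberately does not use structlog's API; the names `correlation_id_var`, `add_correlation_id`, `add_level`, and `render` are hypothetical and exist only for illustration. The actual processor chain is left to Claude's discretion.

```python
import contextvars
import json

# Hypothetical context variable; middleware would set it once per request.
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)


def add_correlation_id(event):
    """Processor: copy the current correlation ID (if any) into the event."""
    cid = correlation_id_var.get()
    if cid is not None:
        event["correlation_id"] = cid
    return event


def add_level(level):
    """Processor factory: returns a processor that stamps a log level."""
    def processor(event):
        event["level"] = level
        return event
    return processor


def render(event_name, processors, **fields):
    """Run the event dict through each processor, then render one JSON line."""
    event = {"event": event_name, **fields}
    for processor in processors:
        event = processor(event)
    return json.dumps(event, sort_keys=True)


# Simulate one request: bind the ID, log, then restore the previous context.
token = correlation_id_var.set("req-123")
try:
    line = render("document.uploaded", [add_level("info"), add_correlation_id], size=42)
finally:
    correlation_id_var.reset(token)
```

Because the ID lives in a context variable, handlers never pass it explicitly; every event emitted while the request is active picks it up.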


Log Aggregation

| Option | Description | Selected |
| --- | --- | --- |
| Loki + Grafana in docker-compose | Matches success criteria literally. Adds 2 services; queries logs via Grafana UI at localhost | ✓ |
| stdout JSON only, no aggregation service | Simpler: just emit JSON to stdout, rely on docker compose logs | |
| Promtail + Loki + Grafana full stack | Full Grafana stack with Promtail log shipper. More production-realistic but heavier | |

User's choice: Loki + Grafana in docker-compose
Notes: No follow-up notes.


Distributed Tracing

| Option | Description | Selected |
| --- | --- | --- |
| Skip for now: correlation IDs in logs are enough | Simpler; stays in scope for v1 | ✓ |
| OpenTelemetry with Tempo (add to Grafana stack) | More complete observability but heavier setup | |
| OpenTelemetry spans to stdout only (no backend) | Lightweight but not queryable | |

User's choice: Skip; correlation IDs in logs are enough
Notes: No follow-up notes.


Load Testing & SLA Targets

Load Testing Tool

| Option | Description | Selected |
| --- | --- | --- |
| Locust | Python-native, fits the existing stack. Test scenarios reuse auth helpers. Lives in backend/load_tests/ | ✓ |
| k6 | JavaScript-based, excellent HTML reports. Separate language from the rest of the stack | |
| pytest-benchmark + httpx | Minimal setup, reuses existing test infrastructure. Not realistic for concurrent load | |

User's choice: Locust
Notes: No follow-up notes.


Latency Targets

| Option | Description | Selected |
| --- | --- | --- |
| Strict: p95 < 200ms, p99 < 500ms | Reasonable for a local Docker stack. Clear pass/fail criteria | ✓ |
| Relaxed: p95 < 500ms, p99 < 1s | More lenient; appropriate if cloud backend latency is included in scope | |
| You decide based on profiling | Run a baseline first, then set targets at 2x observed p95 | |

User's choice: Strict: p95 < 200ms, p99 < 500ms
Notes: No follow-up notes.
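The strict targets translate into a simple pass/fail check over recorded response times. The sketch below assumes latencies are available as a list of milliseconds and uses the nearest-rank percentile definition; Locust's own percentile reporting may round differently, and the helper names `percentile` and `meets_sla` are hypothetical.

```python
import math

# Targets taken from the selected option above.
P95_TARGET_MS = 200
P99_TARGET_MS = 500


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # Rank of the smallest sample with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(samples_ms):
    """Return observed p95/p99 and whether both are strictly under target."""
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    passed = p95 < P95_TARGET_MS and p99 < P99_TARGET_MS
    return {"p95_ms": p95, "p99_ms": p99, "passed": passed}


# 100 illustrative samples: mostly fast, with a slow tail.
samples = [50] * 94 + [150] * 4 + [450, 900]
result = meets_sla(samples)
# result == {"p95_ms": 150, "p99_ms": 450, "passed": True}
```

Note that a single 900 ms outlier in 100 samples does not fail the run, since it sits above the 99th percentile rank.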


Load Test Endpoint Scope

| Option | Description | Selected |
| --- | --- | --- |
| Auth + document list + document get + upload | Covers the critical read/write path. Excludes cloud backends | ✓ |
| Auth only | Focus on rate limiting under load. Misses the storage I/O path | |
| All endpoints including cloud proxy | Comprehensive but cloud latency makes p95 targets meaningless | |

User's choice: Auth + document list/get/upload (no cloud backends)
Notes: No follow-up notes.


Container Hardening Depth

Non-root User Setup

| Option | Description | Selected |
| --- | --- | --- |
| Create appuser (uid 1000), chown /app, switch USER | Standard pattern. Works with read-only rootfs | |
| Multi-stage build: builder as root, runtime as appuser | Cleaner security boundary. pip install in builder, copy only packages to runtime. Reduces attack surface | ✓ |
| Distroless base image | Minimal image with no shell. Breaks pytesseract (needs system deps) | |

User's choice: Multi-stage build with appuser
Notes: No follow-up notes.


Read-only Filesystem

| Option | Description | Selected |
| --- | --- | --- |
| tmpfs for /tmp + named volume for /app/data in docker-compose | read_only: true + tmpfs for temp files + named volume for data. Correct pattern | ✓ |
| tmpfs for /tmp only, data paths via env var | Simpler but less strict | |
| Skip read-only filesystem for Celery worker | Read-only only on FastAPI service; worker stays writable | |

User's choice: tmpfs for /tmp + named volume for /app/data (full read-only rootfs on both services)
Notes: No follow-up notes.


Linux Capability Dropping

| Option | Description | Selected |
| --- | --- | --- |
| drop ALL capabilities, no cap_add | cap_drop: [ALL] with no cap_add. Port 8000 needs no capabilities | ✓ |
| drop ALL, add back CAP_NET_BIND_SERVICE | Only needed if binding to port 80/443; unnecessary for port 8000 | |
| drop only dangerous caps (SYS_ADMIN, SYS_PTRACE, NET_RAW) | Less strict than CLAUDE.md mandate | |

User's choice: drop ALL, no cap_add
Notes: No follow-up notes.
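The read-only filesystem and capability choices above can be verified mechanically. The following is a sketch of such a check, assuming a Compose service definition has already been parsed into a plain dict (for example by a YAML loader). The function name `hardening_violations` and the example service are hypothetical; the keys `read_only`, `tmpfs`, `cap_drop`, and `cap_add` are the Compose options named in the tables.

```python
def hardening_violations(service):
    """Return a list of ways a Compose service dict departs from the chosen hardening."""
    problems = []

    # Full read-only root filesystem.
    if service.get("read_only") is not True:
        problems.append("read_only must be true")

    # tmpfs may be a single string or a list; entries may carry ":options".
    tmpfs = service.get("tmpfs", [])
    if isinstance(tmpfs, str):
        tmpfs = [tmpfs]
    if "/tmp" not in [str(entry).split(":")[0] for entry in tmpfs]:
        problems.append("tmpfs must include /tmp")

    # Drop every capability and add none back.
    if [str(cap).upper() for cap in service.get("cap_drop", [])] != ["ALL"]:
        problems.append("cap_drop must be [ALL]")
    if service.get("cap_add"):
        problems.append("cap_add must be empty")

    return problems


# Illustrative service matching the selected options (volume name is assumed).
api_service = {
    "read_only": True,
    "tmpfs": ["/tmp"],
    "cap_drop": ["ALL"],
    "volumes": ["app_data:/app/data"],
}
violations = hardening_violations(api_service)  # empty list when compliant
```

Since the decision applies to both the FastAPI service and the Celery worker, the same check would be run against each service entry.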


Rate Limit Header Bypass Prevention

IP Extraction Strategy

| Option | Description | Selected |
| --- | --- | --- |
| Custom key_func: trust X-Forwarded-For only from known proxy IPs | Replace get_remote_address with trusted-proxy check. Prevents header spoofing from external clients | ✓ |
| Never trust forwarded headers; always use request.client.host | Simplest and most secure for Docker Compose. Breaks if a proxy is added later | |
| Redis-backed rate limiter with per-account AND per-IP limits | More resilient for horizontal scaling but adds Redis dependency | |

User's choice: Custom key_func with trusted-proxy CIDR check
Notes: No follow-up notes.
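The trusted-proxy check can be sketched as a pure function over the TCP peer address and the raw X-Forwarded-For value, which keeps it testable without a running app. This is an illustration under assumptions: the CIDR `172.18.0.0/16` is a placeholder for whatever network the proxy actually lives on, and the names `TRUSTED_PROXY_NETWORKS`, `is_trusted_proxy`, and `client_ip` are hypothetical.

```python
import ipaddress
from typing import Optional

# Assumed placeholder; the real proxy CIDRs would come from configuration.
TRUSTED_PROXY_NETWORKS = [ipaddress.ip_network("172.18.0.0/16")]


def is_trusted_proxy(host: str) -> bool:
    """True if host parses as an IP inside one of the trusted proxy networks."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False
    return any(addr in net for net in TRUSTED_PROXY_NETWORKS)


def client_ip(peer_host: str, forwarded_for: Optional[str]) -> str:
    """Pick the rate-limit key: honor X-Forwarded-For only behind a trusted proxy."""
    if not forwarded_for or not is_trusted_proxy(peer_host):
        # Direct connection or untrusted sender: ignore the header entirely.
        return peer_host
    # Walk hops right to left; entries appended by trusted proxies are skipped.
    for hop in reversed([h.strip() for h in forwarded_for.split(",")]):
        if not is_trusted_proxy(hop):
            try:
                return str(ipaddress.ip_address(hop))
            except ValueError:
                return peer_host  # malformed entry: fall back to the peer
    return peer_host


spoofed = client_ip("203.0.113.9", "10.0.0.1")
proxied = client_ip("172.18.0.2", "1.2.3.4, 198.51.100.7")
```

Here `spoofed` stays `203.0.113.9` because the sender is not a trusted proxy, and `proxied` resolves to `198.51.100.7`, the rightmost untrusted hop, so a client-supplied `1.2.3.4` prefix has no effect. In the app, a `key_func` would presumably call this with `request.client.host` and the request's `X-Forwarded-For` header.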


Per-Account Rate Limiting

| Option | Description | Selected |
| --- | --- | --- |
| Yes: add per-account limits on authenticated endpoints | Second limiter keyed by user_id on document/cloud endpoints (100 req/min per user) | ✓ |
| No: per-IP is sufficient for now | Document endpoints don't need additional per-user limits | |
| Per-account on auth endpoints only | Match Phase 2 intent exactly | |

User's choice: Yes; per-account limits on authenticated document/cloud endpoints
Notes: No follow-up notes.
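A second limiter keyed by user_id could look roughly like the fixed-window counter below. This is a sketch, not the project's limiter: it keeps counts in process memory (consistent with Redis-backed counters being deferred), uses a fixed rather than sliding window, and the class name `PerAccountLimiter` is hypothetical. The clock is injectable so the behavior can be tested without sleeping.

```python
import time


class PerAccountLimiter:
    """Fixed-window request counter keyed by user ID (in-memory, single process)."""

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}  # user_id -> (window_start, request_count)

    def allow(self, user_id):
        """Record one request for user_id; return False once the limit is hit."""
        now = self.clock()
        start, count = self._windows.get(user_id, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has expired: start a fresh one.
            start, count = now, 0
        if count >= self.limit:
            self._windows[user_id] = (start, count)
            return False
        self._windows[user_id] = (start, count + 1)
        return True


# Demonstration with a fake clock and a small limit.
fake_now = [0.0]
limiter = PerAccountLimiter(limit=3, window_seconds=60, clock=lambda: fake_now[0])
first_four = [limiter.allow("user-1") for _ in range(4)]  # fourth call is rejected
other_user = limiter.allow("user-2")  # separate budget per account
fake_now[0] = 60.0
after_reset = limiter.allow("user-1")  # new window, allowed again
```

Per-account and per-IP limits are independent checks, so a request on a document or cloud endpoint would need to pass both.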


Claude's Discretion

  • Exact structlog processor chain configuration
  • Loki Docker Compose service version and loki-config.yaml — use official Grafana example as base
  • Promtail vs. Docker log driver for shipping to Loki
  • Locust user class structure and task weight distribution
  • Grafana dashboard panel layout (basic request rate + latency + error rate panels)

Deferred Ideas

  • HTTPS/TLS termination (nginx + Let's Encrypt or Caddy) — out of scope; RUNBOOK.md documents how to add
  • Horizontal scaling + Redis-backed rate limit counters — Phase 7+ concern
  • GitHub Actions CI/CD pipeline for automated load tests and docker scout on every PR
  • Automated backup cron job as a Docker service — RUNBOOK.md documents manual procedure