docs(codebase): refresh codebase map after Phase 06.2 completion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-02 15:32:06 +02:00
parent bd17b4b22f
commit 89f8d5a654
7 changed files with 1829 additions and 621 deletions
@@ -1,144 +1,235 @@
-# INTEGRATIONS — document-scanner
+# External Integrations

-_Last updated: 2026-05-21_
+**Analysis Date:** 2026-06-02

-## Summary
+## AI / ML Classification

-The backend integrates with four interchangeable AI providers for document classification: Anthropic Claude, OpenAI (and any OpenAI-compatible endpoint), Ollama, and LM Studio. There are no external databases, auth services, or cloud storage integrations — all persistence is local filesystem. The active provider is selected at runtime via settings persisted in `backend/data/settings.json`.
+All AI providers implement the `AIProvider` abstract interface in `backend/ai/base.py`. The active provider is selected at classification time via the `DEFAULT_AI_PROVIDER` setting (`backend/config.py`).

---
-
-## AI Providers
-
-All providers implement the `AIProvider` abstract interface defined in `backend/ai/base.py`. The active provider is resolved at request time in `backend/ai/__init__.py:get_provider()`.
-
-### Anthropic
+### Anthropic Claude

 - **SDK:** `anthropic>=0.26` — `backend/ai/anthropic_provider.py`
- **Client:** `anthropic.AsyncAnthropic`
+- **Client:** `anthropic.AsyncAnthropic(api_key=...)`
 - **API:** Messages API (`client.messages.create`)
- **Default model:** `claude-sonnet-4-6`
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.anthropic.api_key`; optionally seeded from env var `ANTHROPIC_API_KEY` (`.env.example`)
+- **Default model:** `claude-sonnet-4-6` (configurable via `DEFAULT_AI_MODEL`)
+- **Auth:** API key passed at provider instantiation; stored in DB per-user or system-wide (storage scope not yet confirmed in code)
 - **Calls made:** `classify` (max_tokens=1024), `suggest_topics` (max_tokens=256), `health_check` (max_tokens=5)
- **Text limit:** 8,000 characters per request (`MAX_AI_CHARS = 8_000`)
+- **Text cap:** 8,000 chars per call (`MAX_AI_CHARS = 8_000` in `backend/ai/anthropic_provider.py`)

 ### OpenAI

 - **SDK:** `openai>=1.30` — `backend/ai/openai_provider.py`
- **Client:** `openai.AsyncOpenAI`
+- **Client:** `openai.AsyncOpenAI(api_key=..., base_url=...)`
 - **API:** Chat Completions (`client.chat.completions.create`)
 - **Default model:** `gpt-4o`
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.openai.api_key`; optionally seeded from env var `OPENAI_API_KEY` (`.env.example`)
- **Custom base URL:** Supported via `providers.openai.base_url` in settings (allows pointing at any OpenAI-compatible endpoint)
+- **Auth:** `api_key` at instantiation; `base_url` override supported for custom endpoints

-### Ollama
+### Ollama (local, OpenAI-compatible)

 - **Provider file:** `backend/ai/ollama_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
+- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
 - **Default base URL:** `http://host.docker.internal:11434/v1`
 - **Default model:** `llama3.2`
- **Auth:** Stub key `"ollama"` (no real auth required)
- **Network path:** Reaches the host machine's Ollama daemon via Docker's `host.docker.internal` DNS alias (configured in `docker-compose.yml` via `extra_hosts`)
+- **Auth:** Stub key `"ollama"` — no real auth
+- **Network path:** Reaches host machine Ollama daemon via Docker `extra_hosts: host.docker.internal:host-gateway`

-### LM Studio
+### LM Studio (local, OpenAI-compatible)

 - **Provider file:** `backend/ai/lmstudio_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
+- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
 - **Default base URL:** `http://host.docker.internal:1234/v1`
 - **Default model:** `gemma-4-e4b-it`
- **Auth:** Stub key `"lm-studio"` (no real auth required)
- **Network path:** Reaches the host machine's LM Studio server via `host.docker.internal` (same `extra_hosts` setting)
- **Default active provider** — the app works out of the box with LM Studio and no API keys
+- **Auth:** Stub key `"lm-studio"` — no real auth
+- **Network path:** Same `host.docker.internal` Docker alias as Ollama

 ---

-## Provider Selection & Settings Persistence
+## Data Storage

- Active provider and all per-provider config (model names, API keys, base URLs) are persisted in `backend/data/settings.json`.
- Settings are loaded fresh on each classification request in `backend/services/classifier.py:classify_document()`.
- API keys returned from the settings API are masked (last 4 chars shown) via `backend/services/storage.py:mask_api_key()`.
- The Settings UI allows switching providers without restart.
+### PostgreSQL (primary database)
+
+- **Image:** `postgres:17-alpine` (Docker Compose)
+- **Driver:** `psycopg[binary]>=3.3.4` (psycopg v3 async)
+- **ORM:** SQLAlchemy 2.0 asyncio — `backend/db/session.py`
+- **Schema migrations:** Alembic — `backend/migrations/`
+- **Connection env vars:** `DATABASE_URL` (app user, DML only), `DATABASE_MIGRATE_URL` (migrate user, DDL)
+- **Role separation:** `docuvault_app` (DML), `docuvault_migrate` (DDL) — `docker/postgres/initdb.d/01-init-users.sql`
+
+### MinIO (object storage)
+
+- **Image:** `minio/minio:latest` (Docker Compose), ports 9000 + 9001
+- **SDK:** `minio>=7.2.20` — `backend/storage/minio_backend.py`
+- **Object key scheme:** `{user_id}/{document_id}/{uuid4()}{ext}` — human filenames stored in DB only
+- **Presigned URLs:** Generated for browser direct-PUT uploads and GET downloads
+- **Auth env vars:** `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_BUCKET`
+- **Public endpoint:** `MINIO_PUBLIC_ENDPOINT` — browser-resolvable hostname for presigned URLs (may differ from internal Docker endpoint)
+- **CORS:** `MINIO_API_CORS_ALLOW_ORIGIN` set to `FRONTEND_URL` to allow browser preflight
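The object key scheme above can be sketched as follows. This is a minimal illustration, not the code in `backend/storage/minio_backend.py`: the function name `build_object_key` and the lower-casing of the extension are assumptions.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Build an opaque key of the form {user_id}/{document_id}/{uuid4}{ext}.

    Hypothetical helper: only the extension of the human filename is kept,
    so the original name never appears in object storage.
    """
    ext = PurePosixPath(filename).suffix.lower()  # lower-casing is an assumption
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


if __name__ == "__main__":
    print(build_object_key("user-1", "doc-9", "Tax Return 2024.PDF"))
```

Because the random UUID replaces the filename, two uploads of the same file under one document still get distinct keys, and the human-readable name has to be looked up in the database.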
+
+### Redis
+
+- **Image:** `redis:7-alpine` (Docker Compose), password-protected
+- **Client:** `redis>=4.6.0` (async via `redis.asyncio`)
+- **Uses:**
+  - Celery broker and result backend (`backend/celery_app.py`)
+  - JTI token revocation store (access + refresh token blacklist)
+  - Per-account rate limiting via slowapi (`backend/main.py`)
+  - TOTP replay prevention (a used TOTP code is rejected if presented again within a 90 s window)
+- **Auth env var:** `REDIS_URL` (includes password in DSN)

 ---

-## Frontend ↔ Backend Communication
+## Cloud Storage Backends

- **Protocol:** HTTP REST over JSON (and multipart form for uploads)
- **Client:** Native browser `fetch` API — `frontend/src/api/client.js`
- **Base path:** All requests go to `/api/*` — no hardcoded backend hostname in the frontend
- **Proxy (dev):** Vite dev server proxies `/api` → `http://backend:8000` — `frontend/vite.config.js`
- **Proxy (prod):** Comment in `frontend/src/api/client.js` notes nginx is expected; no nginx config is present in the repo
+All backends implement `StorageBackend` ABC from `backend/storage/base.py`. Credentials are encrypted at rest with HKDF per-user key derivation using master key from `CLOUD_CREDS_KEY` env var.
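As a rough illustration of per-user key derivation with HKDF (RFC 5869), the sketch below uses only the standard library. The function names, the all-zero salt, and the `cloud-creds:{user_id}` info string are assumptions made for the example; the repository's actual parameters are not shown in this document.

```python
import hashlib
import hmac

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def hkdf_sha256(master_key: bytes, info: bytes, length: int = 32,
                salt: bytes = bytes(HASH_LEN)) -> bytes:
    """HKDF-Extract followed by HKDF-Expand, both with HMAC-SHA256."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output is too long for HKDF")
    # Extract: condense the input keying material into a pseudorandom key.
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output bytes are available.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    """Hypothetical helper: bind the derived key to one user via `info`."""
    return hkdf_sha256(master_key, info=f"cloud-creds:{user_id}".encode())
```

Putting the user identifier in `info` means a leaked per-user key does not reveal the master key or any other user's key.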

-### API Endpoints consumed by the frontend
+### Google Drive v3

-| Method | Path | Purpose |
-|---|---|---|
-| POST | `/api/documents/upload` | Upload file with optional auto-classify flag |
-| GET | `/api/documents` | List documents (paginated, optional topic filter) |
-| GET | `/api/documents/:id` | Get single document metadata |
-| DELETE | `/api/documents/:id` | Delete document |
-| POST | `/api/documents/:id/classify` | (Re)classify document, optional topic list |
-| GET | `/api/topics` | List all topics |
-| POST | `/api/topics` | Create topic |
-| PATCH | `/api/topics/:id` | Update topic |
-| DELETE | `/api/topics/:id` | Delete topic |
-| POST | `/api/topics/suggest` | AI topic suggestions for a document |
-| GET | `/api/settings` | Get settings (keys masked) |
-| PATCH | `/api/settings` | Update settings |
-| POST | `/api/settings/test-provider` | Health-check the active or named provider |
-| GET | `/api/settings/default-prompt` | Retrieve the default classification system prompt |
+- **SDK:** `google-auth-oauthlib>=1.3.1` + `google-api-python-client>=2.196.0`
+- **Backend file:** `backend/storage/google_drive_backend.py`
+- **Auth:** OAuth2 flow; tokens stored encrypted in DB; `token_uri`, `client_id`, `client_secret`, `access_token`, `refresh_token` in credentials dict
+- **Scope:** `https://www.googleapis.com/auth/drive.file`
+- **Note:** All `googleapiclient` calls are synchronous and wrapped in `asyncio.to_thread()` to avoid blocking the event loop; `cache_discovery=False` prevents `/tmp` writes (path traversal mitigation)
+- **Auth env vars:** `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`
+- **OAuth callback:** `{BACKEND_URL}/api/cloud/google/callback`

---
+### Microsoft OneDrive (Graph API)

-## Docker Services
+- **SDK:** `msal>=1.36.0` (token management) + `httpx>=0.27` (async Graph API calls)
+- **Backend file:** `backend/storage/onedrive_backend.py`
+- **API base:** `https://graph.microsoft.com/v1.0`
+- **Auth:** OAuth2 via MSAL; tokens stored encrypted in DB; credentials dict contains `access_token`, `refresh_token`, `expires_at`
+- **Upload strategy:** Resumable upload sessions (`createUploadSession`) for all files; chunk size 10 MB
+- **Auth env vars:** `ONEDRIVE_CLIENT_ID`, `ONEDRIVE_CLIENT_SECRET`, `ONEDRIVE_TENANT_ID` (default: `"common"`)

-Defined in `docker-compose.yml`:
+### Nextcloud

-| Service | Image | Port | Notes |
-|---|---|---|---|
-| `backend` | Built from `./backend/Dockerfile` | `8000:8000` | Mounts `./backend/data:/app/data` for persistence; `./backend:/app` for hot-reload |
-| `frontend` | Built from `./frontend/Dockerfile` | `5173:5173` | Mounts `./frontend/src` and `index.html` for hot-reload; depends on `backend` |
+- **Backend file:** `backend/storage/nextcloud_backend.py`
+- **Inheritance:** `NextcloudBackend → WebDAVBackend → StorageBackend`
+- **Protocol:** WebDAV via `webdavclient3>=3.14.7`
+- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
+- **SSRF prevention:** `validate_cloud_url()` called at construction time and before every outbound request (`backend/storage/cloud_utils.py`)
+- **No OAuth:** Credential-based only (username + password)

-Both services use `extra_hosts: host.docker.internal:host-gateway` on the backend to allow Ollama/LM Studio connections to the host machine.
+### Generic WebDAV

---
-
-## Environment Variables
-
-| Variable | Required | Where used | Notes |
-|---|---|---|---|
-| `DATA_DIR` | No | `backend/config.py` | Root path for uploads/metadata/settings; defaults to `/app/data` |
-| `ANTHROPIC_API_KEY` | No | `.env.example` | Bootstrap only — app manages keys via settings UI |
-| `OPENAI_API_KEY` | No | `.env.example` | Bootstrap only — app manages keys via settings UI |
-| `PYTHONDONTWRITEBYTECODE` | No | `docker-compose.yml` | Set to `1` to suppress `.pyc` files in Docker |
+- **Backend file:** `backend/storage/webdav_backend.py`
+- **SDK:** `webdavclient3>=3.14.7`
+- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
+- **SSRF prevention:** Same dual-call `validate_cloud_url()` pattern as Nextcloud
+- **Path encoding:** `urllib.parse.quote()` per path segment to handle non-ASCII filenames
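Per-segment encoding can be sketched as below. The helper name `encode_webdav_path` is hypothetical; the point is that each segment is quoted on its own so the `/` separators survive while everything else unsafe is percent-encoded.

```python
from urllib.parse import quote


def encode_webdav_path(path: str) -> str:
    """Percent-encode every path segment but keep the '/' separators."""
    # safe="" makes quote() encode reserved characters inside a segment too.
    return "/".join(quote(segment, safe="") for segment in path.split("/"))


if __name__ == "__main__":
    print(encode_webdav_path("/docs/résumé 2024.pdf"))
    # prints /docs/r%C3%A9sum%C3%A9%202024.pdf
```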

 ---

 ## Authentication & Identity

- No user authentication. The application has no login system, sessions, or identity provider.
- API keys for AI providers are stored in plain text in `backend/data/settings.json` (masked only when returned via the settings API).
+No external auth provider (SSO, Auth0, Cognito, etc.). Authentication is custom-built:
+
+- **Password hashing:** Argon2id via `pwdlib[argon2]` — `backend/services/auth.py`
+- **JWT access tokens:** PyJWT `>=2.8.0`; ES256 (ECDSA P-256) algorithm; 15-minute TTL; JTI claim for revocation; fingerprint claim (`fgp`) bound to `User-Agent + Accept-Language`
+- **Refresh tokens:** 30-day httpOnly cookie with `SameSite=Strict`; rotated on every use; family revocation on reuse
+- **JTI store:** Redis (TTL matching token lifetime)
+- **TOTP (2FA):** `pyotp>=2.9.0`; replay prevention via Redis within 90 s window; QR codes generated in frontend with `qrcode ^1.5.4`
+- **Backup codes:** Generated, hashed (Argon2id), stored in DB — `backend/db/models.py:BackupCode`
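One way the `fgp` fingerprint claim could be computed and checked is sketched below. The hashing scheme (SHA-256 over the two header values joined by a newline) and both function names are assumptions; the document only states which headers the claim is bound to.

```python
import hashlib
import hmac


def compute_fingerprint(user_agent: str, accept_language: str) -> str:
    """Hash the two request headers into a short, opaque claim value."""
    material = f"{user_agent}\n{accept_language}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def fingerprint_matches(claim: str, user_agent: str, accept_language: str) -> bool:
    """Compare the token's claim with the current request in constant time."""
    expected = compute_fingerprint(user_agent, accept_language)
    return hmac.compare_digest(claim, expected)
```

A token replayed from a client that sends different headers would fail this check even though its signature is still valid.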
+
+---
+
+## External HTTP APIs
+
+### HaveIBeenPwned (HIBP)
+
+- **Purpose:** k-anonymity password breach check on registration and password change
+- **Client:** `httpx` async GET to `https://api.pwnedpasswords.com/range/{prefix}`
+- **Implementation:** `backend/services/auth.py:check_hibp()` — sends first 5 chars of SHA-1 hash only; fail-open (check failures are logged and do not block registration)
+- **Auth:** None required (public API)
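The k-anonymity check can be illustrated without any network access. In this sketch the helper names are hypothetical and the HTTP call is left out; it shows only the hash split and the parsing of a range response, whose lines have the form `SUFFIX:COUNT`.

```python
import hashlib


def split_sha1(password: str) -> tuple[str, str]:
    """Return (5-char prefix, 35-char suffix) of the upper-case SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def breach_count(suffix: str, range_body: str) -> int:
    """Look up the suffix in a range response; 0 means it was not listed."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix.upper():
            return int(count)
    return 0


if __name__ == "__main__":
    prefix, suffix = split_sha1("password")
    print(prefix)  # only this prefix would be sent to the range endpoint
```

Only the prefix leaves the server; the suffix comparison happens locally, so the remote API never sees the full hash.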
+
+---
+
+## Email / Notifications
+
+- **Protocol:** SMTP via Python stdlib `smtplib` — `backend/services/email.py`
+- **Transport security:** STARTTLS (port 587 default)
+- **Auth:** Optional SMTP username + password
+- **Auth env vars:** `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASSWORD`, `SMTP_FROM`
+- **Dev fallback:** When `SMTP_HOST` is empty, email content is logged to stdout instead of sent
+- **Emails sent:**
+  - Password reset link (1-hour validity) — triggered from `backend/tasks/email_tasks.py`
+  - Security alert (suspicious refresh token reuse / session family revocation) — triggered from `backend/services/auth.py` via Celery
+- **Celery queue:** `email` queue, separate from `documents` queue
+
+---
+
+## Frontend ↔ Backend Communication
+
+- **Protocol:** HTTP REST over JSON; multipart/form-data for document upload
+- **Client:** Native browser `fetch` API — `frontend/src/api/` directory
+- **Base path:** All requests use relative `/api/*` — no hardcoded backend hostname
+- **Dev proxy:** Vite proxies `/api` → `http://backend:8000` (`frontend/vite.config.js`)
+- **Auth flow:** Access token stored in Pinia store (memory only); refresh token in httpOnly cookie; token refresh handled transparently in API client
+
+---
+
+## Background Task Queues (Celery)
+
+- **Broker + result backend:** Redis (`REDIS_URL`)
+- **Serialization:** JSON only (no pickle)
+- **Queues and task modules:**
+  - `documents` — `backend/tasks/document_tasks.py` (extraction, classification, cleanup)
+  - `email` — `backend/tasks/email_tasks.py` (password reset, security alert)
+  - `documents` (reused) — `backend/tasks/audit_tasks.py` (audit log export)
+- **Scheduled tasks (Celery Beat):**
+  - `cleanup-abandoned-uploads` — every 30 minutes
+  - `audit-log-daily-export` — midnight UTC daily

 ---

 ## Monitoring & Observability

- No error tracking service (no Sentry, Datadog, etc.).
- No structured logging framework — FastAPI default stdout logging only.
- A `/health` endpoint exists at `backend/main.py` returning `{"status": "ok"}`.
- Provider connectivity tested on demand via `POST /api/settings/test-provider`.
+- **Error tracking:** None (no Sentry, Datadog, etc.)
+- **Logging:** Python stdlib `logging`; stdout; no structured logging framework
+- **Health endpoint:** `GET /health` — probes PostgreSQL (`SELECT 1`) and MinIO (bucket exists check); always returns HTTP 200 with `status: ok | degraded`
+- **Audit log:** All auth events, quota violations, and admin actions written to DB audit log (no document content) — `backend/services/audit.py`, `backend/api/audit.py`

 ---

-## Webhooks & Callbacks
+## CI/CD & Deployment

- None — the application makes no outbound webhook calls and exposes no webhook receiver endpoints.
+- **Hosting:** Docker Compose only; no cloud provider manifests detected
+- **CI pipeline:** None detected in repository
+- **Container registry:** None configured
+- **Secrets management:** Environment variables only; `.env` file for local dev (not committed)

 ---

-## Gaps / Unknowns
+## Required Environment Variables Summary

- No nginx or reverse-proxy config present for production deployments; the client-side comment references it but no config exists.
- No container registry or CI/CD pipeline configuration detected.
- API keys are stored in a plain JSON file on disk with no encryption at rest.
- The `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` env vars from `.env.example` are noted as bootstrap helpers but no code in the repo reads them directly — they appear to be manual seeding hints only.
+| Variable | Required | Service | Purpose |
+|---|---|---|---|
+| `DATABASE_URL` | Yes | backend | App DB connection (DML user) |
+| `DATABASE_MIGRATE_URL` | Yes | migrations | Alembic DDL connection |
+| `MINIO_ENDPOINT` | Yes | backend, workers | MinIO S3 API endpoint |
+| `MINIO_ACCESS_KEY` | Yes | backend, workers | MinIO credentials |
+| `MINIO_SECRET_KEY` | Yes | backend, workers | MinIO credentials |
+| `MINIO_BUCKET` | Yes | backend, workers | Object storage bucket name |
+| `REDIS_URL` | Yes | backend, workers, beat | Redis DSN (broker + JTI store) |
+| `SECRET_KEY` | Yes | backend | JWT signing secret |
+| `CLOUD_CREDS_KEY` | Yes | celery-worker | 32-byte master key for HKDF |
+| `POSTGRES_PASSWORD` | Yes | postgres service | Docker postgres init |
+| `MINIO_ROOT_USER` | Yes | minio service | MinIO root credentials |
+| `MINIO_ROOT_PASSWORD` | Yes | minio service | MinIO root credentials |
+| `REDIS_PASSWORD` | Yes | redis service | Redis auth password |
+| `SMTP_HOST` | No | backend | Transactional email (dev: logs to stdout) |
+| `GOOGLE_CLIENT_ID` | No | backend | Google Drive OAuth |
+| `GOOGLE_CLIENT_SECRET` | No | backend | Google Drive OAuth |
+| `ONEDRIVE_CLIENT_ID` | No | backend | OneDrive OAuth |
+| `ONEDRIVE_CLIENT_SECRET` | No | backend | OneDrive OAuth |
+| `ADMIN_EMAIL` | No | backend | Bootstrap admin account |
+| `ADMIN_PASSWORD` | No | backend | Bootstrap admin account |
+| `DEFAULT_AI_PROVIDER` | No | backend | AI provider selection (default: `ollama`) |
+| `DEFAULT_AI_MODEL` | No | backend | AI model selection (default: `llama3.2`) |
+| `CORS_ORIGINS` | No | backend | Allowed CORS origins |
+| `FRONTEND_URL` | No | backend, minio | Password reset links + MinIO CORS |
+| `BACKEND_URL` | No | backend | OAuth callback URL construction |
+
+---
+
+*Integration audit: 2026-06-02*