docs(codebase): refresh codebase map after Phase 06.2 completion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
curo1305
2026-06-02 15:32:06 +02:00
parent bd17b4b22f
commit 89f8d5a654
7 changed files with 1829 additions and 621 deletions
+177 -86
@@ -1,144 +1,235 @@
# INTEGRATIONS — document-scanner
# External Integrations
_Last updated: 2026-05-21_
**Analysis Date:** 2026-06-02
## Summary
## AI / ML Classification
The backend integrates with four interchangeable AI providers for document classification: Anthropic Claude, OpenAI (and any OpenAI-compatible endpoint), Ollama, and LM Studio. There are no external databases, auth services, or cloud storage integrations — all persistence is local filesystem. The active provider is selected at runtime via settings persisted in `backend/data/settings.json`.
All AI providers implement the `AIProvider` abstract interface in `backend/ai/base.py`. The active provider is selected at classification time via the `DEFAULT_AI_PROVIDER` setting (`backend/config.py`).
---
## AI Providers
All providers implement the `AIProvider` abstract interface defined in `backend/ai/base.py`. The active provider is resolved at request time in `backend/ai/__init__.py:get_provider()`.
### Anthropic
### Anthropic Claude
- **SDK:** `anthropic>=0.26` (`backend/ai/anthropic_provider.py`)
- **Client:** `anthropic.AsyncAnthropic`
- **Client:** `anthropic.AsyncAnthropic(api_key=...)`
- **API:** Messages API (`client.messages.create`)
- **Default model:** `claude-sonnet-4-6`
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.anthropic.api_key`; optionally seeded from env var `ANTHROPIC_API_KEY` (`.env.example`)
- **Default model:** `claude-sonnet-4-6` (configurable via `DEFAULT_AI_MODEL`)
- **Auth:** API key passed at provider instantiation; stored in DB per-user or system-wide (storage location not yet confirmed in code)
- **Calls made:** `classify` (max_tokens=1024), `suggest_topics` (max_tokens=256), `health_check` (max_tokens=5)
- **Text limit:** 8,000 characters per request (`MAX_AI_CHARS = 8_000`)
- **Text cap:** 8,000 chars per call (`MAX_AI_CHARS = 8_000` in `backend/ai/anthropic_provider.py`)
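
As an illustration of the provider contract described above, here is a minimal sketch of what an `AIProvider`-style interface could look like. The method names `classify`, `suggest_topics`, and `health_check` and the `MAX_AI_CHARS` constant come from this document; the signatures, return types, and the `EchoProvider` stand-in are assumptions made for the example and are not taken from `backend/ai/base.py`.

```python
import asyncio
from abc import ABC, abstractmethod

MAX_AI_CHARS = 8_000  # text cap named in this document


class AIProvider(ABC):
    """Hypothetical shape of the abstract interface; signatures are assumed."""

    @abstractmethod
    async def classify(self, text: str, topics: list[str]) -> str: ...

    @abstractmethod
    async def suggest_topics(self, text: str) -> list[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class EchoProvider(AIProvider):
    """Offline stand-in (not a real provider) used to exercise the interface."""

    async def classify(self, text: str, topics: list[str]) -> str:
        capped = text[:MAX_AI_CHARS]  # truncate before any model call
        # Return the first topic that literally appears in the capped text.
        return next((t for t in topics if t in capped), "unknown")

    async def suggest_topics(self, text: str) -> list[str]:
        # Up to three distinct words from the capped text, sorted.
        return sorted(set(text[:MAX_AI_CHARS].split()))[:3]

    async def health_check(self) -> bool:
        return True
```

A caller would use it as `asyncio.run(EchoProvider().classify(text, topics))`; a real provider would replace the method bodies with SDK calls.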
### OpenAI
- **SDK:** `openai>=1.30` (`backend/ai/openai_provider.py`)
- **Client:** `openai.AsyncOpenAI`
- **Client:** `openai.AsyncOpenAI(api_key=..., base_url=...)`
- **API:** Chat Completions (`client.chat.completions.create`)
- **Default model:** `gpt-4o`
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.openai.api_key`; optionally seeded from env var `OPENAI_API_KEY` (`.env.example`)
- **Custom base URL:** Supported via `providers.openai.base_url` in settings (allows pointing at any OpenAI-compatible endpoint)
- **Auth:** `api_key` at instantiation; `base_url` override supported for custom endpoints
### Ollama
### Ollama (local, OpenAI-compatible)
- **Provider file:** `backend/ai/ollama_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
- **Default base URL:** `http://host.docker.internal:11434/v1`
- **Default model:** `llama3.2`
- **Auth:** Stub key `"ollama"` (no real auth required)
- **Network path:** Reaches the host machine's Ollama daemon via Docker's `host.docker.internal` DNS alias (configured in `docker-compose.yml` via `extra_hosts`)
- **Auth:** Stub key `"ollama"`; no real auth
- **Network path:** Reaches host machine Ollama daemon via Docker `extra_hosts: host.docker.internal:host-gateway`
### LM Studio
### LM Studio (local, OpenAI-compatible)
- **Provider file:** `backend/ai/lmstudio_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
- **Default base URL:** `http://host.docker.internal:1234/v1`
- **Default model:** `gemma-4-e4b-it`
- **Auth:** Stub key `"lm-studio"` (no real auth required)
- **Network path:** Reaches the host machine's LM Studio server via `host.docker.internal` (same `extra_hosts` setting)
- **Default active provider** — the app works out of the box with LM Studio and no API keys
- **Auth:** Stub key `"lm-studio"`; no real auth
- **Network path:** Same `host.docker.internal` Docker alias as Ollama
---
## Provider Selection & Settings Persistence
## Data Storage
- Active provider and all per-provider config (model names, API keys, base URLs) are persisted in `backend/data/settings.json`.
- Settings are loaded fresh on each classification request in `backend/services/classifier.py:classify_document()`.
- API keys returned from the settings API are masked (last 4 chars shown) via `backend/services/storage.py:mask_api_key()`.
- The Settings UI allows switching providers without restart.
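
The masking behaviour mentioned above (last 4 characters shown) can be sketched as follows. This is an assumed implementation for illustration only; the real `mask_api_key()` in `backend/services/storage.py` may differ, and the handling of short or empty keys is a guess.

```python
def mask_api_key(key: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if not key:
        return ""
    if len(key) <= visible:
        # Assumption: very short keys are masked entirely rather than revealed.
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```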
### PostgreSQL (primary database)
- **Image:** `postgres:17-alpine` (Docker Compose)
- **Driver:** `psycopg[binary]>=3.3.4` (psycopg v3 async)
- **ORM:** SQLAlchemy 2.0 asyncio — `backend/db/session.py`
- **Schema migrations:** Alembic — `backend/migrations/`
- **Connection env vars:** `DATABASE_URL` (app user, DML only), `DATABASE_MIGRATE_URL` (migrate user, DDL)
- **Role separation:** `docuvault_app` (DML), `docuvault_migrate` (DDL) — `docker/postgres/initdb.d/01-init-users.sql`
### MinIO (object storage)
- **Image:** `minio/minio:latest` (Docker Compose), ports 9000 + 9001
- **SDK:** `minio>=7.2.20` (`backend/storage/minio_backend.py`)
- **Object key scheme:** `{user_id}/{document_id}/{uuid4()}{ext}` — human filenames stored in DB only
- **Presigned URLs:** Generated for browser direct-PUT uploads and GET downloads
- **Auth env vars:** `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_BUCKET`
- **Public endpoint:** `MINIO_PUBLIC_ENDPOINT` — browser-resolvable hostname for presigned URLs (may differ from internal Docker endpoint)
- **CORS:** `MINIO_API_CORS_ALLOW_ORIGIN` set to `FRONTEND_URL` to allow browser preflight
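
To make the object key scheme concrete, the sketch below builds a key of the form `{user_id}/{document_id}/{uuid4()}{ext}`. The function name `build_object_key` is hypothetical, and keeping the extension's original case is an assumption; only the key layout comes from this document.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Build an opaque object key; only the file extension leaks from the name."""
    ext = PurePosixPath(filename).suffix  # e.g. ".pdf", or "" if there is none
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"
```

Because the human-readable filename never appears in the key, renaming a document only touches the database row, not the stored object.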
### Redis
- **Image:** `redis:7-alpine` (Docker Compose), password-protected
- **Client:** `redis>=4.6.0` (async via `redis.asyncio`)
- **Uses:**
  - Celery broker and result backend (`backend/celery_app.py`)
  - JTI token revocation store (access + refresh token blacklist)
  - Per-account rate limiting via slowapi (`backend/main.py`)
  - TOTP replay prevention (used TOTP codes invalidated within 90 s window)
- **Auth env var:** `REDIS_URL` (includes password in DSN)
---
## Frontend ↔ Backend Communication
## Cloud Storage Backends
- **Protocol:** HTTP REST over JSON (and multipart form for uploads)
- **Client:** Native browser `fetch` API — `frontend/src/api/client.js`
- **Base path:** All requests go to `/api/*` — no hardcoded backend hostname in the frontend
- **Proxy (dev):** Vite dev server proxies `/api` → `http://backend:8000` (`frontend/vite.config.js`)
- **Proxy (prod):** Comment in `frontend/src/api/client.js` notes nginx is expected; no nginx config is present in the repo
All backends implement `StorageBackend` ABC from `backend/storage/base.py`. Credentials are encrypted at rest with HKDF per-user key derivation using master key from `CLOUD_CREDS_KEY` env var.
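
Per-user key derivation with HKDF can be sketched using only the standard library. The code below implements HKDF-SHA256 as defined in RFC 5869; the `derive_user_key` helper, the empty salt, and the `cloud-creds:` info prefix are assumptions for illustration, not the project's actual parameters.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand to `length` bytes."""
    # Extract step: an empty salt is replaced by HashLen zero bytes, per the RFC.
    prk = hmac.new(salt or b"\x00" * 32, master_key, hashlib.sha256).digest()
    # Expand step: T(i) = HMAC(PRK, T(i-1) | info | i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    """Hypothetical helper: bind the derived key to one user via the info field."""
    return hkdf_sha256(master_key, salt=b"", info=f"cloud-creds:{user_id}".encode())
```

Deriving a separate key per user means that a key recovered for one account does not decrypt another account's stored credentials, while only the single master key needs to be provisioned.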
### API Endpoints consumed by the frontend
### Google Drive v3
| Method | Path | Purpose |
|---|---|---|
| POST | `/api/documents/upload` | Upload file with optional auto-classify flag |
| GET | `/api/documents` | List documents (paginated, optional topic filter) |
| GET | `/api/documents/:id` | Get single document metadata |
| DELETE | `/api/documents/:id` | Delete document |
| POST | `/api/documents/:id/classify` | (Re)classify document, optional topic list |
| GET | `/api/topics` | List all topics |
| POST | `/api/topics` | Create topic |
| PATCH | `/api/topics/:id` | Update topic |
| DELETE | `/api/topics/:id` | Delete topic |
| POST | `/api/topics/suggest` | AI topic suggestions for a document |
| GET | `/api/settings` | Get settings (keys masked) |
| PATCH | `/api/settings` | Update settings |
| POST | `/api/settings/test-provider` | Health-check the active or named provider |
| GET | `/api/settings/default-prompt` | Retrieve the default classification system prompt |
- **SDK:** `google-auth-oauthlib>=1.3.1` + `google-api-python-client>=2.196.0`
- **Backend file:** `backend/storage/google_drive_backend.py`
- **Auth:** OAuth2 flow; tokens stored encrypted in DB; `token_uri`, `client_id`, `client_secret`, `access_token`, `refresh_token` in credentials dict
- **Scope:** `https://www.googleapis.com/auth/drive.file`
- **Note:** All `googleapiclient` calls are synchronous and wrapped in `asyncio.to_thread()` to avoid blocking the event loop; `cache_discovery=False` prevents `/tmp` writes (path traversal mitigation)
- **Auth env vars:** `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`
- **OAuth callback:** `{BACKEND_URL}/api/cloud/google/callback`
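
The `asyncio.to_thread()` wrapping noted above follows a general pattern, sketched here with a stand-in blocking function. `blocking_list_files` and `list_files` are hypothetical names; in the real backend the blocking call would be a `googleapiclient` request.

```python
import asyncio
import time


def blocking_list_files(folder_id: str) -> list[str]:
    """Stand-in for a synchronous SDK call that would block the event loop."""
    time.sleep(0.05)  # simulate network latency
    return [f"{folder_id}/a.pdf", f"{folder_id}/b.pdf"]


async def list_files(folder_id: str) -> list[str]:
    # Run the blocking call in a worker thread so other coroutines keep running.
    return await asyncio.to_thread(blocking_list_files, folder_id)
```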
---
### Microsoft OneDrive (Graph API)
## Docker Services
- **SDK:** `msal>=1.36.0` (token management) + `httpx>=0.27` (async Graph API calls)
- **Backend file:** `backend/storage/onedrive_backend.py`
- **API base:** `https://graph.microsoft.com/v1.0`
- **Auth:** OAuth2 via MSAL; tokens stored encrypted in DB; credentials dict contains `access_token`, `refresh_token`, `expires_at`
- **Upload strategy:** Resumable upload sessions (`createUploadSession`) for all files; chunk size 10 MB
- **Auth env vars:** `ONEDRIVE_CLIENT_ID`, `ONEDRIVE_CLIENT_SECRET`, `ONEDRIVE_TENANT_ID` (default: `"common"`)
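
For the resumable upload strategy, each chunk sent to an upload session carries a `Content-Range` header. The sketch below computes those header values for a given file size; the function name `content_ranges` is hypothetical, and the 10 MiB interpretation of "10 MB" is an assumption (it is a multiple of 320 KiB, which Graph upload sessions require for fragment sizes).

```python
CHUNK_SIZE = 10 * 1024 * 1024  # assumed 10 MiB; a multiple of 320 KiB


def content_ranges(total_size: int, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return the Content-Range header value for each chunk of an upload."""
    ranges = []
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1  # inclusive end offset
        ranges.append(f"bytes {start}-{end}/{total_size}")
    return ranges
```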
Defined in `docker-compose.yml`:
### Nextcloud
| Service | Image | Port | Notes |
|---|---|---|---|
| `backend` | Built from `./backend/Dockerfile` | `8000:8000` | Mounts `./backend/data:/app/data` for persistence; `./backend:/app` for hot-reload |
| `frontend` | Built from `./frontend/Dockerfile` | `5173:5173` | Mounts `./frontend/src` and `index.html` for hot-reload; depends on `backend` |
- **Backend file:** `backend/storage/nextcloud_backend.py`
- **Inheritance:** `NextcloudBackend → WebDAVBackend → StorageBackend`
- **Protocol:** WebDAV via `webdavclient3>=3.14.7`
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
- **SSRF prevention:** `validate_cloud_url()` called at construction time and before every outbound request (`backend/storage/cloud_utils.py`)
- **No OAuth:** Credential-based only (username + password)
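
One common way to implement this kind of SSRF check is to resolve the hostname and reject any address that is not globally routable. The sketch below is an assumption about how a function like `validate_cloud_url()` might work; it is not taken from `backend/storage/cloud_utils.py`, and a production check would also need to guard against DNS rebinding between validation and the request.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def validate_cloud_url(url: str) -> str:
    """Raise ValueError unless the URL is http(s) and resolves only to public IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("only http(s) URLs with a hostname are allowed")
    # Check every resolved address; IP literals resolve to themselves.
    for info in socket.getaddrinfo(parsed.hostname, None):
        addr = ipaddress.ip_address(info[4][0])
        if not addr.is_global:
            raise ValueError(f"{addr} is not a public address")
    return url
```

Calling the check both at construction and before each request, as described above, narrows the window in which a hostname could be re-pointed at an internal address.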
Both services use `extra_hosts: host.docker.internal:host-gateway` on the backend to allow Ollama/LM Studio connections to the host machine.
### Generic WebDAV
---
## Environment Variables
| Variable | Required | Where used | Notes |
|---|---|---|---|
| `DATA_DIR` | No | `backend/config.py` | Root path for uploads/metadata/settings; defaults to `/app/data` |
| `ANTHROPIC_API_KEY` | No | `.env.example` | Bootstrap only — app manages keys via settings UI |
| `OPENAI_API_KEY` | No | `.env.example` | Bootstrap only — app manages keys via settings UI |
| `PYTHONDONTWRITEBYTECODE` | No | `docker-compose.yml` | Set to `1` to suppress `.pyc` files in Docker |
- **Backend file:** `backend/storage/webdav_backend.py`
- **SDK:** `webdavclient3>=3.14.7`
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
- **SSRF prevention:** Same dual-call `validate_cloud_url()` pattern as Nextcloud
- **Path encoding:** `urllib.parse.quote()` per path segment to handle non-ASCII filenames
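
Per-segment encoding with `urllib.parse.quote()` can be sketched as below. The helper name `encode_webdav_path` is hypothetical; passing `safe=""` is an assumption that ensures characters such as `#` and `?` inside a filename are escaped, while the `/` separators between segments are preserved by the split and join.

```python
from urllib.parse import quote


def encode_webdav_path(path: str) -> str:
    """Percent-encode each path segment separately, keeping '/' as the separator."""
    return "/".join(quote(segment, safe="") for segment in path.split("/"))
```

For example, `encode_webdav_path("/docs/résumé 2024.pdf")` yields `/docs/r%C3%A9sum%C3%A9%202024.pdf`.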
---
## Authentication & Identity
- No user authentication. The application has no login system, sessions, or identity provider.
- API keys for AI providers are stored in plain text in `backend/data/settings.json` (masked only when returned via the settings API).
No external auth provider (SSO, Auth0, Cognito, etc.). Authentication is custom-built:
- **Password hashing:** Argon2id via `pwdlib[argon2]` (`backend/services/auth.py`)
- **JWT access tokens:** PyJWT `>=2.8.0`; ES256 (ECDSA P-256) algorithm; 15-minute TTL; JTI claim for revocation; fingerprint claim (`fgp`) bound to `User-Agent + Accept-Language`
- **Refresh tokens:** 30-day httpOnly cookie with `SameSite=Strict`; rotated on every use; family revocation on reuse
- **JTI store:** Redis (TTL matching token lifetime)
- **TOTP (2FA):** `pyotp>=2.9.0`; replay prevention via Redis within 90 s window; QR codes generated in frontend with `qrcode ^1.5.4`
- **Backup codes:** Generated, hashed (Argon2id), stored in DB — `backend/db/models.py:BackupCode`
---
## External HTTP APIs
### HaveIBeenPwned (HIBP)
- **Purpose:** k-anonymity password breach check on registration and password change
- **Client:** `httpx` async GET to `https://api.pwnedpasswords.com/range/{prefix}`
- **Implementation:** `backend/services/auth.py:check_hibp()` — sends first 5 chars of SHA-1 hash only; fail-open (check failures are logged and do not block registration)
- **Auth:** None required (public API)
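
The k-anonymity scheme can be illustrated offline: only the first 5 hex characters of the SHA-1 hash leave the process, and the remaining 35 are matched locally against the `SUFFIX:COUNT` lines returned by the range endpoint. The helper names below are hypothetical and the HTTP call itself is omitted; this is a sketch, not the body of `check_hibp()`.

```python
import hashlib


def split_sha1(password: str) -> tuple[str, str]:
    """Return (prefix, suffix): 5 chars to send, 35 chars to compare locally."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def breach_count(range_body: str, suffix: str) -> int:
    """Scan a range response (lines of 'SUFFIX:COUNT') for our suffix."""
    for line in range_body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0  # suffix absent: no known breach for this password
```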
---
## Email / Notifications
- **Protocol:** SMTP via Python stdlib `smtplib` (`backend/services/email.py`)
- **Transport security:** STARTTLS (port 587 default)
- **Auth:** Optional SMTP username + password
- **Auth env vars:** `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASSWORD`, `SMTP_FROM`
- **Dev fallback:** When `SMTP_HOST` is empty, email content is logged to stdout instead of sent
- **Emails sent:**
  - Password reset link (1-hour validity), triggered from `backend/tasks/email_tasks.py`
  - Security alert (suspicious refresh token reuse / session family revocation), triggered from `backend/services/auth.py` via Celery
- **Celery queue:** `email` queue, separate from `documents` queue
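
The STARTTLS transport and the dev fallback described above can be sketched with the standard library. The `send_email` signature, the logger name, and the default sender address are assumptions for illustration; in the application the values would come from the `SMTP_*` environment variables.

```python
import logging
import smtplib
from email.message import EmailMessage

logger = logging.getLogger("email")


def send_email(to: str, subject: str, body: str, *, host: str = "", port: int = 587,
               user: str = "", password: str = "",
               sender: str = "noreply@example.com") -> str:
    """Send via SMTP with STARTTLS, or log the message when no host is configured."""
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = sender, to, subject
    msg.set_content(body)
    if not host:
        # Dev fallback: no SMTP host configured, so log instead of sending.
        logger.info("email to %s: %s\n%s", to, subject, body)
        return "logged"
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()  # upgrade the connection before authenticating
        if user:
            smtp.login(user, password)
        smtp.send_message(msg)
    return "sent"
```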
---
## Frontend ↔ Backend Communication
- **Protocol:** HTTP REST over JSON; multipart/form-data for document upload
- **Client:** Native browser `fetch` API — `frontend/src/api/` directory
- **Base path:** All requests use relative `/api/*` — no hardcoded backend hostname
- **Dev proxy:** Vite proxies `/api` → `http://backend:8000` (`frontend/vite.config.js`)
- **Auth flow:** Access token stored in Pinia store (memory only); refresh token in httpOnly cookie; token refresh handled transparently in API client
---
## Background Task Queues (Celery)
- **Broker + result backend:** Redis (`REDIS_URL`)
- **Serialization:** JSON only (no pickle)
- **Queues and task modules:**
  - `documents`: `backend/tasks/document_tasks.py` (extraction, classification, cleanup)
  - `email`: `backend/tasks/email_tasks.py` (password reset, security alert)
  - `documents` (reused): `backend/tasks/audit_tasks.py` (audit log export)
- **Scheduled tasks (Celery Beat):**
  - `cleanup-abandoned-uploads`: every 30 minutes
  - `audit-log-daily-export`: midnight UTC daily
---
## Monitoring & Observability
- No error tracking service (no Sentry, Datadog, etc.).
- No structured logging framework — FastAPI default stdout logging only.
- A `/health` endpoint exists at `backend/main.py` returning `{"status": "ok"}`.
- Provider connectivity tested on demand via `POST /api/settings/test-provider`.
- **Error tracking:** None (no Sentry, Datadog, etc.)
- **Logging:** Python stdlib `logging`; stdout; no structured logging framework
- **Health endpoint:** `GET /health` — probes PostgreSQL (`SELECT 1`) and MinIO (bucket exists check); always returns HTTP 200 with `status: ok | degraded`
- **Audit log:** All auth events, quota violations, and admin actions written to DB audit log (no document content) — `backend/services/audit.py`, `backend/api/audit.py`
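
The `ok | degraded` health semantics can be sketched as an aggregator over named probes. The `health` function, the `checks` mapping, and the response shape beyond the `status` field are assumptions; the real endpoint probes PostgreSQL and MinIO, which are replaced here by arbitrary async callables.

```python
import asyncio
from typing import Awaitable, Callable


async def health(checks: dict[str, Callable[[], Awaitable[None]]]) -> dict:
    """Run each probe; report 'ok' only if all succeed, otherwise 'degraded'."""
    results = {}
    for name, probe in checks.items():
        try:
            await probe()
            results[name] = "ok"
        except Exception as exc:  # a failing dependency must not crash the endpoint
            results[name] = f"error: {type(exc).__name__}"
    status = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": status, "checks": results}
```

Returning HTTP 200 in both cases, as this document describes, leaves it to the caller to inspect the `status` field rather than the status code.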
---
## Webhooks & Callbacks
## CI/CD & Deployment
- None — the application makes no outbound webhook calls and exposes no webhook receiver endpoints.
- **Hosting:** Docker Compose only; no cloud provider manifests detected
- **CI pipeline:** None detected in repository
- **Container registry:** None configured
- **Secrets management:** Environment variables only; `.env` file for local dev (not committed)
---
## Gaps / Unknowns
## Required Environment Variables Summary
- No nginx or reverse-proxy config present for production deployments; the client-side comment references it but no config exists.
- No container registry or CI/CD pipeline configuration detected.
- API keys are stored in a plain JSON file on disk with no encryption at rest.
- The `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` env vars from `.env.example` are noted as bootstrap helpers but no code in the repo reads them directly — they appear to be manual seeding hints only.
| Variable | Required | Service | Purpose |
|---|---|---|---|
| `DATABASE_URL` | Yes | backend | App DB connection (DML user) |
| `DATABASE_MIGRATE_URL` | Yes | migrations | Alembic DDL connection |
| `MINIO_ENDPOINT` | Yes | backend, workers | MinIO S3 API endpoint |
| `MINIO_ACCESS_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_SECRET_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_BUCKET` | Yes | backend, workers | Object storage bucket name |
| `REDIS_URL` | Yes | backend, workers, beat | Redis DSN (broker + JTI store) |
| `SECRET_KEY` | Yes | backend | JWT signing secret |
| `CLOUD_CREDS_KEY` | Yes | celery-worker | 32-byte master key for HKDF |
| `POSTGRES_PASSWORD` | Yes | postgres service | Docker postgres init |
| `MINIO_ROOT_USER` | Yes | minio service | MinIO root credentials |
| `MINIO_ROOT_PASSWORD` | Yes | minio service | MinIO root credentials |
| `REDIS_PASSWORD` | Yes | redis service | Redis auth password |
| `SMTP_HOST` | No | backend | Transactional email (dev: logs to stdout) |
| `GOOGLE_CLIENT_ID` | No | backend | Google Drive OAuth |
| `GOOGLE_CLIENT_SECRET` | No | backend | Google Drive OAuth |
| `ONEDRIVE_CLIENT_ID` | No | backend | OneDrive OAuth |
| `ONEDRIVE_CLIENT_SECRET` | No | backend | OneDrive OAuth |
| `ADMIN_EMAIL` | No | backend | Bootstrap admin account |
| `ADMIN_PASSWORD` | No | backend | Bootstrap admin account |
| `DEFAULT_AI_PROVIDER` | No | backend | AI provider selection (default: `ollama`) |
| `DEFAULT_AI_MODEL` | No | backend | AI model selection (default: `llama3.2`) |
| `CORS_ORIGINS` | No | backend | Allowed CORS origins |
| `FRONTEND_URL` | No | backend, minio | Password reset links + MinIO CORS |
| `BACKEND_URL` | No | backend | OAuth callback URL construction |
---
*Integration audit: 2026-06-02*