# External Integrations
**Analysis Date:** 2026-06-02
## AI / ML Classification
All AI providers implement the `AIProvider` abstract interface in `backend/ai/base.py`. The active provider is selected at classification time via the `DEFAULT_AI_PROVIDER` setting (`backend/config.py`).
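
The selection described above can be sketched as an abstract base class plus a name-to-class lookup. This is only an illustration: the method signatures, the `EchoProvider` stand-in, the `PROVIDERS` registry, and `get_provider` are assumptions and are not taken from `backend/ai/base.py`.

```python
import asyncio
from abc import ABC, abstractmethod


class AIProvider(ABC):
    """Hypothetical shape of the abstract interface; real signatures may differ."""

    @abstractmethod
    async def classify(self, text: str) -> str: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class EchoProvider(AIProvider):
    """Stand-in provider that exists only for this illustration."""

    async def classify(self, text: str) -> str:
        return "uncategorized"

    async def health_check(self) -> bool:
        return True


# Hypothetical registry keyed by the value of DEFAULT_AI_PROVIDER.
PROVIDERS: dict[str, type[AIProvider]] = {"echo": EchoProvider}


def get_provider(name: str) -> AIProvider:
    """Resolve the provider when a classification is requested, not at import time."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown AI provider: {name!r}") from None
```

Looking the provider up per call means a changed setting would take effect on the next classification without reloading the module, which is one plausible reading of "selected at classification time".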
### Anthropic Claude
- **SDK:** `anthropic>=0.26` (`backend/ai/anthropic_provider.py`)
- **Client:** `anthropic.AsyncAnthropic(api_key=...)`
- **API:** Messages API (`client.messages.create`)
- **Default model:** `claude-sonnet-4-6` (configurable via `DEFAULT_AI_MODEL`)
- **Auth:** No dedicated env var identified; the API key is passed at provider instantiation and stored in the DB, either per user or system-wide (storage scope not yet confirmed in code)
- **Calls made:** `classify` (max_tokens=1024), `suggest_topics` (max_tokens=256), `health_check` (max_tokens=5)
- **Text cap:** 8,000 chars per call (`MAX_AI_CHARS = 8_000` in `backend/ai/anthropic_provider.py`)
### OpenAI
- **SDK:** `openai>=1.30` (`backend/ai/openai_provider.py`)
- **Client:** `openai.AsyncOpenAI(api_key=..., base_url=...)`
- **API:** Chat Completions (`client.chat.completions.create`)
- **Default model:** `gpt-4o`
- **Auth:** `api_key` at instantiation; `base_url` override supported for custom endpoints
### Ollama (local, OpenAI-compatible)
- **Provider file:** `backend/ai/ollama_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
- **Default base URL:** `http://host.docker.internal:11434/v1`
- **Default model:** `llama3.2`
- **Auth:** Stub key `"ollama"` — no real auth
- **Network path:** Reaches host machine Ollama daemon via Docker `extra_hosts: host.docker.internal:host-gateway`
### LM Studio (local, OpenAI-compatible)
- **Provider file:** `backend/ai/lmstudio_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
- **Default base URL:** `http://host.docker.internal:1234/v1`
- **Default model:** `gemma-4-e4b-it`
- **Auth:** Stub key `"lm-studio"` — no real auth
- **Network path:** Same `host.docker.internal` Docker alias as Ollama
---
## Data Storage
### PostgreSQL (primary database)
- **Image:** `postgres:17-alpine` (Docker Compose)
- **Driver:** `psycopg[binary]>=3.3.4` (psycopg v3 async)
- **ORM:** SQLAlchemy 2.0 asyncio — `backend/db/session.py`
- **Schema migrations:** Alembic — `backend/migrations/`
- **Connection env vars:** `DATABASE_URL` (app user, DML only), `DATABASE_MIGRATE_URL` (migrate user, DDL)
- **Role separation:** `docuvault_app` (DML), `docuvault_migrate` (DDL) — `docker/postgres/initdb.d/01-init-users.sql`
### MinIO (object storage)
- **Image:** `minio/minio:latest` (Docker Compose), ports 9000 + 9001
- **SDK:** `minio>=7.2.20` (`backend/storage/minio_backend.py`)
- **Object key scheme:** `{user_id}/{document_id}/{uuid4()}{ext}` — human filenames stored in DB only
- **Presigned URLs:** Generated for browser direct-PUT uploads and GET downloads
- **Auth env vars:** `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_BUCKET`
- **Public endpoint:** `MINIO_PUBLIC_ENDPOINT` — browser-resolvable hostname for presigned URLs (may differ from internal Docker endpoint)
- **CORS:** `MINIO_API_CORS_ALLOW_ORIGIN` set to `FRONTEND_URL` to allow browser preflight
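
The object key scheme listed above can be illustrated with a small helper. This is a sketch under assumptions: `build_object_key` is a hypothetical name, and taking the extension from the uploaded filename via `PurePosixPath.suffix` is a guess at how `{ext}` is obtained.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Build `{user_id}/{document_id}/{uuid4()}{ext}` without reusing the human filename."""
    # Only the extension survives; the readable name is expected to live in the DB.
    ext = PurePosixPath(filename).suffix
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"
```

Because only the suffix is copied, a name such as `Tax Return 2024.pdf` never appears in the bucket listing, and path separators in the filename cannot add extra key segments.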
### Redis
- **Image:** `redis:7-alpine` (Docker Compose), password-protected
- **Client:** `redis>=4.6.0` (async via `redis.asyncio`)
- **Uses:**
  - Celery broker and result backend (`backend/celery_app.py`)
  - JTI token revocation store (access + refresh token blacklist)
  - Per-account rate limiting via slowapi (`backend/main.py`)
  - TOTP replay prevention (a used code is rejected if presented again within a 90 s window)
- **Auth env var:** `REDIS_URL` (includes password in DSN)
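
The TOTP replay prevention listed above maps naturally onto Redis `SET` with the `NX` and `EX` options. The sketch below assumes that approach; `accept_totp_once`, the key layout, and the `InMemoryRedis` demo stand-in are hypothetical and not taken from the codebase. With a real `redis.asyncio` client the same `set(..., nx=True, ex=...)` call applies.

```python
import asyncio


async def accept_totp_once(redis_client, user_id: str, code: str, window_s: int = 90) -> bool:
    """Return True only the first time this (user, code) pair is seen in the window."""
    key = f"totp:used:{user_id}:{code}"  # hypothetical key layout
    # SET key value NX EX window_s: creates the key only if it does not exist yet.
    created = await redis_client.set(key, "1", nx=True, ex=window_s)
    return bool(created)


class InMemoryRedis:
    """Demo stand-in for a redis.asyncio client; it ignores expiry."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    async def set(self, name, value, nx=False, ex=None):
        if nx and name in self._data:
            return None  # redis-py returns None when NX blocks the write
        self._data[name] = value
        return True


async def demo() -> list[bool]:
    client = InMemoryRedis()
    return [
        await accept_totp_once(client, "user-1", "123456"),  # first use is accepted
        await accept_totp_once(client, "user-1", "123456"),  # replay is rejected
    ]
```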
---
## Cloud Storage Backends
All backends implement `StorageBackend` ABC from `backend/storage/base.py`. Credentials are encrypted at rest with HKDF per-user key derivation using master key from `CLOUD_CREDS_KEY` env var.
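
To make the per-user key derivation concrete, here is a minimal HKDF (RFC 5869) sketch using only the standard library. The choice of SHA-256, the empty salt, and the `cloud-creds:` info prefix are assumptions; the source only states that HKDF is used with a master key from `CLOUD_CREDS_KEY`.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 digest size in bytes


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF per RFC 5869: extract a pseudorandom key, then expand it to `length` bytes."""
    if not 0 < length <= 255 * HASH_LEN:
        raise ValueError("length out of range for HKDF-SHA256")
    # Extract: an empty salt is replaced by HASH_LEN zero bytes, as the RFC specifies.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(prk, T(i-1) | info | i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    """Hypothetical helper: bind the derived key to one user through the `info` input."""
    return hkdf_sha256(master_key, salt=b"", info=b"cloud-creds:" + user_id.encode("utf-8"))
```

With a derivation like this, the same master key yields a different 32-byte key for each user, so a ciphertext copied between accounts would not decrypt under the other user's key.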
### Google Drive v3
- **SDK:** `google-auth-oauthlib>=1.3.1` + `google-api-python-client>=2.196.0`
- **Backend file:** `backend/storage/google_drive_backend.py`
- **Auth:** OAuth2 flow; tokens stored encrypted in DB; `token_uri`, `client_id`, `client_secret`, `access_token`, `refresh_token` in credentials dict
- **Scope:** `https://www.googleapis.com/auth/drive.file`
- **Note:** All `googleapiclient` calls are synchronous and wrapped in `asyncio.to_thread()` to avoid blocking the event loop; `cache_discovery=False` prevents `/tmp` writes (path traversal mitigation)
- **Auth env vars:** `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`
- **OAuth callback:** `{BACKEND_URL}/api/cloud/google/callback`
### Microsoft OneDrive (Graph API)
- **SDK:** `msal>=1.36.0` (token management) + `httpx>=0.27` (async Graph API calls)
- **Backend file:** `backend/storage/onedrive_backend.py`
- **API base:** `https://graph.microsoft.com/v1.0`
- **Auth:** OAuth2 via MSAL; tokens stored encrypted in DB; credentials dict contains `access_token`, `refresh_token`, `expires_at`
- **Upload strategy:** Resumable upload sessions (`createUploadSession`) for all files; chunk size 10 MB
- **Auth env vars:** `ONEDRIVE_CLIENT_ID`, `ONEDRIVE_CLIENT_SECRET`, `ONEDRIVE_TENANT_ID` (default: `"common"`)
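
For the resumable upload strategy above, each chunk is sent with a `Content-Range` header describing its byte span. The sketch below computes those spans. It assumes "10 MB" means 10 MiB (10,485,760 bytes), which is a multiple of the 320 KiB fragment granularity that Microsoft Graph requires; `content_ranges` is a hypothetical helper name.

```python
from typing import Iterator

CHUNK_SIZE = 10 * 1024 * 1024  # assumed 10 MiB = 32 * 320 KiB


def content_ranges(total_size: int, chunk_size: int = CHUNK_SIZE) -> Iterator[tuple[int, int, str]]:
    """Yield (start, end, Content-Range header value) for each chunk; `end` is inclusive."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end, f"bytes {start}-{end}/{total_size}"
        start = end + 1
```

For a 25 MiB file this produces three ranges, with the last one covering the remaining 5 MiB.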
### Nextcloud
- **Backend file:** `backend/storage/nextcloud_backend.py`
- **Inheritance:** `NextcloudBackend → WebDAVBackend → StorageBackend`
- **Protocol:** WebDAV via `webdavclient3>=3.14.7`
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
- **SSRF prevention:** `validate_cloud_url()` called at construction time and before every outbound request (`backend/storage/cloud_utils.py`)
- **No OAuth:** Credential-based only (username + password)
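
The SSRF check can be pictured roughly as follows. This is not the implementation in `backend/storage/cloud_utils.py`: the function name `validate_cloud_url_sketch`, the accepted schemes, and the "reject anything that is not a global address" rule are all assumptions made for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def validate_cloud_url_sketch(url: str, resolve=socket.getaddrinfo) -> str:
    """Raise ValueError unless the URL is http(s) and resolves only to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        raise ValueError("URL must use http or https and include a hostname")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Check every resolved address; a single private or loopback result is enough to refuse.
    for *_, sockaddr in resolve(parts.hostname, port, type=socket.SOCK_STREAM):
        address = ipaddress.ip_address(sockaddr[0])
        if not address.is_global:
            raise ValueError(f"{parts.hostname} resolves to a non-public address: {address}")
    return url
```

One reason to repeat such a check before every outbound request, rather than only at construction time, is that DNS answers can change between the two moments. The `resolve` parameter exists here only so the lookup can be replaced in tests.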
### Generic WebDAV
- **Backend file:** `backend/storage/webdav_backend.py`
- **SDK:** `webdavclient3>=3.14.7`
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
- **SSRF prevention:** Same dual-call `validate_cloud_url()` pattern as Nextcloud
- **Path encoding:** `urllib.parse.quote()` per path segment to handle non-ASCII filenames
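
Per-segment encoding, as described in the last item above, can be sketched like this. `encode_webdav_path` is a hypothetical helper name, and passing `safe=""` is an assumption: it keeps the `/` separators produced by the join while escaping everything else that is not an unreserved character.

```python
from urllib.parse import quote


def encode_webdav_path(path: str) -> str:
    """Percent-encode each path segment separately so the `/` separators stay intact."""
    return "/".join(quote(segment, safe="") for segment in path.split("/"))
```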
---
## Authentication & Identity
No external auth provider (SSO, Auth0, Cognito, etc.). Authentication is custom-built:
- **Password hashing:** Argon2id via `pwdlib[argon2]` (`backend/services/auth.py`)
- **JWT access tokens:** PyJWT `>=2.8.0`; ES256 (ECDSA P-256) algorithm; 15-minute TTL; JTI claim for revocation; fingerprint claim (`fgp`) bound to `User-Agent + Accept-Language`
- **Refresh tokens:** 30-day httpOnly cookie with `SameSite=Strict`; rotated on every use; family revocation on reuse
- **JTI store:** Redis (TTL matching token lifetime)
- **TOTP (2FA):** `pyotp>=2.9.0`; replay prevention via Redis within 90 s window; QR codes generated in frontend with `qrcode ^1.5.4`
- **Backup codes:** Generated, hashed (Argon2id), stored in DB — `backend/db/models.py:BackupCode`
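
Refresh token rotation with family revocation can be sketched with an in-memory store. Everything here is illustrative: the class name, its fields, and the use of `PermissionError` are assumptions, and the real service presumably keeps this state in Redis or the database rather than in process memory. Presenting an already-rotated token revokes the whole family, so the newest token stops working too.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RefreshTokenStore:
    """In-memory stand-in for rotation state (hypothetical; not the project's storage)."""

    current: dict[str, str] = field(default_factory=dict)    # family id -> active token
    family_of: dict[str, str] = field(default_factory=dict)  # any issued token -> family id
    revoked: set[str] = field(default_factory=set)           # revoked family ids

    def issue(self, family_id: Optional[str] = None) -> str:
        """Issue a token, starting a new family unless one is given."""
        family_id = family_id or secrets.token_hex(8)
        token = secrets.token_urlsafe(32)
        self.current[family_id] = token
        self.family_of[token] = family_id
        return token

    def rotate(self, presented: str) -> str:
        """Exchange the active token for a new one; reuse of an old token revokes the family."""
        family_id = self.family_of.get(presented)
        if family_id is None or family_id in self.revoked:
            raise PermissionError("unknown or revoked refresh token")
        if self.current[family_id] != presented:
            self.revoked.add(family_id)
            raise PermissionError("refresh token reuse detected; family revoked")
        return self.issue(family_id)
```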
---
## External HTTP APIs
### HaveIBeenPwned (HIBP)
- **Purpose:** k-anonymity password breach check on registration and password change
- **Client:** `httpx` async GET to `https://api.pwnedpasswords.com/range/{prefix}`
- **Implementation:** `backend/services/auth.py:check_hibp()` — sends first 5 chars of SHA-1 hash only; fail-open (check failures are logged and do not block registration)
- **Auth:** None required (public API)
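
The k-anonymity exchange described above splits the SHA-1 digest into a 5-character prefix, which is sent, and a 35-character suffix, which is matched locally against the `SUFFIX:COUNT` lines of the response. The helpers below sketch that split and match; their names are hypothetical and the HTTP call itself is left out, so this does not mirror `check_hibp()` directly.

```python
import hashlib


def hibp_prefix_suffix(password: str) -> tuple[str, str]:
    """Return (prefix sent to the API, suffix kept local) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def breach_count(range_body: str, suffix: str) -> int:
    """Scan a range response for the local suffix and return its count, or 0 if absent."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix and count.strip().isdigit():
            return int(count)
    return 0
```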
---
## Email / Notifications
- **Protocol:** SMTP via Python stdlib `smtplib` (`backend/services/email.py`)
- **Transport security:** STARTTLS (port 587 default)
- **Auth:** Optional SMTP username + password
- **Auth env vars:** `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASSWORD`, `SMTP_FROM`
- **Dev fallback:** When `SMTP_HOST` is empty, email content is logged to stdout instead of sent
- **Emails sent:**
  - Password reset link (1-hour validity), triggered from `backend/tasks/email_tasks.py`
  - Security alert (suspicious refresh token reuse / session family revocation), triggered from `backend/services/auth.py` via Celery
- **Celery queue:** `email` queue, separate from `documents` queue
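
The STARTTLS transport and the dev fallback above can be sketched with `smtplib` and `email.message`. The function name, its parameters, the placeholder sender address, and the 10-second timeout are assumptions; in the real service the values would come from the `SMTP_*` settings. The fallback goes through `logging`, so the configured handler decides whether it ends up on stdout.

```python
import logging
import smtplib
from email.message import EmailMessage

logger = logging.getLogger("email")


def send_email(to: str, subject: str, body: str, *, host: str = "", port: int = 587,
               user: str = "", password: str = "",
               sender: str = "noreply@example.invalid") -> bool:
    """Send via SMTP with STARTTLS; return False when only the dev fallback ran."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = to
    msg["Subject"] = subject
    msg.set_content(body)
    if not host:
        # Dev fallback: no SMTP host configured, so log the message instead of sending it.
        logger.info("SMTP host not set; email not sent:\n%s", msg)
        return False
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()  # upgrade the connection before any credentials are sent
        if user:
            smtp.login(user, password)
        smtp.send_message(msg)
    return True
```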
---
## Frontend ↔ Backend Communication
- **Protocol:** HTTP REST over JSON; multipart/form-data for document upload
- **Client:** Native browser `fetch` API — `frontend/src/api/` directory
- **Base path:** All requests use relative `/api/*` — no hardcoded backend hostname
- **Dev proxy:** Vite proxies `/api` → `http://backend:8000` (`frontend/vite.config.js`)
- **Auth flow:** Access token stored in Pinia store (memory only); refresh token in httpOnly cookie; token refresh handled transparently in API client
---
## Background Task Queues (Celery)
- **Broker + result backend:** Redis (`REDIS_URL`)
- **Serialization:** JSON only (no pickle)
- **Queues and task modules:**
  - `documents`: `backend/tasks/document_tasks.py` (extraction, classification, cleanup)
  - `email`: `backend/tasks/email_tasks.py` (password reset, security alert)
  - `documents` (reused): `backend/tasks/audit_tasks.py` (audit log export)
- **Scheduled tasks (Celery Beat):**
  - `cleanup-abandoned-uploads`: every 30 minutes
  - `audit-log-daily-export`: midnight UTC daily
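
A Celery configuration matching the queues and schedules above might look like the fragment below. It needs the `celery` package installed and is a sketch only: the app name, the fallback Redis URL, the route patterns, and the dotted task names are hypothetical, and `backend/celery_app.py` may be organised differently.

```python
import os

from celery import Celery
from celery.schedules import crontab

redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379/0")  # fallback for local runs
app = Celery("docuvault", broker=redis_url, backend=redis_url)  # app name is an assumption

app.conf.update(
    # JSON only: pickle payloads are never accepted.
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],
    timezone="UTC",
    task_routes={
        "backend.tasks.document_tasks.*": {"queue": "documents"},
        "backend.tasks.audit_tasks.*": {"queue": "documents"},
        "backend.tasks.email_tasks.*": {"queue": "email"},
    },
    beat_schedule={
        "cleanup-abandoned-uploads": {
            "task": "backend.tasks.document_tasks.cleanup_abandoned_uploads",  # hypothetical
            "schedule": 30 * 60,  # seconds
        },
        "audit-log-daily-export": {
            "task": "backend.tasks.audit_tasks.export_audit_log",  # hypothetical
            "schedule": crontab(minute=0, hour=0),  # midnight in the configured timezone
        },
    },
)
```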
---
## Monitoring & Observability
- **Error tracking:** None (no Sentry, Datadog, etc.)
- **Logging:** Python stdlib `logging`; stdout; no structured logging framework
- **Health endpoint:** `GET /health` — probes PostgreSQL (`SELECT 1`) and MinIO (bucket exists check); always returns HTTP 200 with `status: ok | degraded`
- **Audit log:** All auth events, quota violations, and admin actions written to DB audit log (no document content) — `backend/services/audit.py`, `backend/api/audit.py`
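
The health endpoint behaviour above, where the HTTP status stays 200 and the body reports `ok` or `degraded`, can be sketched as a probe aggregator. The `health_report` name, the per-probe timeout, and the shape of the `checks` field are assumptions; the real probes would run `SELECT 1` against PostgreSQL and a bucket-exists check against MinIO.

```python
import asyncio
from typing import Awaitable, Callable

Probe = Callable[[], Awaitable[None]]


async def health_report(probes: dict[str, Probe], timeout_s: float = 2.0) -> dict:
    """Run all probes concurrently; any failure or timeout yields status 'degraded'."""

    async def run(probe: Probe) -> bool:
        try:
            await asyncio.wait_for(probe(), timeout_s)
            return True
        except Exception:
            # A failing dependency degrades the report instead of raising to the caller.
            return False

    results = await asyncio.gather(*(run(probe) for probe in probes.values()))
    return {
        "status": "ok" if all(results) else "degraded",
        "checks": dict(zip(probes, results)),
    }
```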
---
## CI/CD & Deployment
- **Hosting:** Docker Compose only; no cloud provider manifests detected
- **CI pipeline:** None detected in repository
- **Container registry:** None configured
- **Secrets management:** Environment variables only; `.env` file for local dev (not committed)
---
## Required Environment Variables Summary
| Variable | Required | Service | Purpose |
|---|---|---|---|
| `DATABASE_URL` | Yes | backend | App DB connection (DML user) |
| `DATABASE_MIGRATE_URL` | Yes | migrations | Alembic DDL connection |
| `MINIO_ENDPOINT` | Yes | backend, workers | MinIO S3 API endpoint |
| `MINIO_ACCESS_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_SECRET_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_BUCKET` | Yes | backend, workers | Object storage bucket name |
| `REDIS_URL` | Yes | backend, workers, beat | Redis DSN (broker + JTI store) |
| `SECRET_KEY` | Yes | backend | JWT signing secret |
| `CLOUD_CREDS_KEY` | Yes | celery-worker | 32-byte master key for HKDF |
| `POSTGRES_PASSWORD` | Yes | postgres service | Docker postgres init |
| `MINIO_ROOT_USER` | Yes | minio service | MinIO root credentials |
| `MINIO_ROOT_PASSWORD` | Yes | minio service | MinIO root credentials |
| `REDIS_PASSWORD` | Yes | redis service | Redis auth password |
| `SMTP_HOST` | No | backend | Transactional email (dev: logs to stdout) |
| `GOOGLE_CLIENT_ID` | No | backend | Google Drive OAuth |
| `GOOGLE_CLIENT_SECRET` | No | backend | Google Drive OAuth |
| `ONEDRIVE_CLIENT_ID` | No | backend | OneDrive OAuth |
| `ONEDRIVE_CLIENT_SECRET` | No | backend | OneDrive OAuth |
| `ADMIN_EMAIL` | No | backend | Bootstrap admin account |
| `ADMIN_PASSWORD` | No | backend | Bootstrap admin account |
| `DEFAULT_AI_PROVIDER` | No | backend | AI provider selection (default: `ollama`) |
| `DEFAULT_AI_MODEL` | No | backend | AI model selection (default: `llama3.2`) |
| `CORS_ORIGINS` | No | backend | Allowed CORS origins |
| `FRONTEND_URL` | No | backend, minio | Password reset links + MinIO CORS |
| `BACKEND_URL` | No | backend | OAuth callback URL construction |
---
*Integration audit: 2026-06-02*