# External Integrations

**Analysis Date:** 2026-06-02

## AI / ML Classification

All AI providers implement the `AIProvider` abstract interface in `backend/ai/base.py`. The active provider is selected at classification time via the `DEFAULT_AI_PROVIDER` setting (`backend/config.py`).

### Anthropic Claude

- **SDK:** `anthropic>=0.26` — `backend/ai/anthropic_provider.py`
- **Client:** `anthropic.AsyncAnthropic(api_key=...)`
- **API:** Messages API (`client.messages.create`)
- **Default model:** `claude-sonnet-4-6` (configurable via `DEFAULT_AI_MODEL`)
- **Auth:** API key passed at provider instantiation; stored in DB per-user or system-wide (not yet confirmed in code)
- **Calls made:** `classify` (max_tokens=1024), `suggest_topics` (max_tokens=256), `health_check` (max_tokens=5)
- **Text cap:** 8,000 chars per call (`MAX_AI_CHARS = 8_000` in `backend/ai/anthropic_provider.py`)

### OpenAI

- **SDK:** `openai>=1.30` — `backend/ai/openai_provider.py`
- **Client:** `openai.AsyncOpenAI(api_key=..., base_url=...)`
- **API:** Chat Completions (`client.chat.completions.create`)
- **Default model:** `gpt-4o`
- **Auth:** `api_key` at instantiation; `base_url` override supported for custom endpoints

### Ollama (local, OpenAI-compatible)

- **Provider file:** `backend/ai/ollama_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
- **Default base URL:** `http://host.docker.internal:11434/v1`
- **Default model:** `llama3.2`
- **Auth:** Stub key `"ollama"` — no real auth
- **Network path:** Reaches host machine Ollama daemon via Docker `extra_hosts: host.docker.internal:host-gateway`

### LM Studio (local, OpenAI-compatible)

- **Provider file:** `backend/ai/lmstudio_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
- **Default base URL:** `http://host.docker.internal:1234/v1`
- **Default model:** `gemma-4-e4b-it`
- **Auth:** Stub key `"lm-studio"` — no real auth
- **Network path:** Same `host.docker.internal`
Docker alias as Ollama

---

## Data Storage

### PostgreSQL (primary database)

- **Image:** `postgres:17-alpine` (Docker Compose)
- **Driver:** `psycopg[binary]>=3.3.4` (psycopg v3 async)
- **ORM:** SQLAlchemy 2.0 asyncio — `backend/db/session.py`
- **Schema migrations:** Alembic — `backend/migrations/`
- **Connection env vars:** `DATABASE_URL` (app user, DML only), `DATABASE_MIGRATE_URL` (migrate user, DDL)
- **Role separation:** `docuvault_app` (DML), `docuvault_migrate` (DDL) — `docker/postgres/initdb.d/01-init-users.sql`

### MinIO (object storage)

- **Image:** `minio/minio:latest` (Docker Compose), ports 9000 + 9001
- **SDK:** `minio>=7.2.20` — `backend/storage/minio_backend.py`
- **Object key scheme:** `{user_id}/{document_id}/{uuid4()}{ext}` — human filenames stored in DB only
- **Presigned URLs:** Generated for browser direct-PUT uploads and GET downloads
- **Auth env vars:** `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_BUCKET`
- **Public endpoint:** `MINIO_PUBLIC_ENDPOINT` — browser-resolvable hostname for presigned URLs (may differ from internal Docker endpoint)
- **CORS:** `MINIO_API_CORS_ALLOW_ORIGIN` set to `FRONTEND_URL` to allow browser preflight

### Redis

- **Image:** `redis:7-alpine` (Docker Compose), password-protected
- **Client:** `redis>=4.6.0` (async via `redis.asyncio`)
- **Uses:**
  - Celery broker and result backend (`backend/celery_app.py`)
  - JTI token revocation store (access + refresh token blacklist)
  - Per-account rate limiting via slowapi (`backend/main.py`)
  - TOTP replay prevention (used TOTP codes invalidated within 90 s window)
- **Auth env var:** `REDIS_URL` (includes password in DSN)

---

## Cloud Storage Backends

All backends implement the `StorageBackend` ABC from `backend/storage/base.py`. Credentials are encrypted at rest with a per-user key derived via HKDF from the master key in the `CLOUD_CREDS_KEY` env var.
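The per-user key derivation described above can be sketched with the standard library alone. This is a minimal illustration of HKDF (RFC 5869) with HMAC-SHA256, not the project's actual code: the function names, the `cloud-creds:` context prefix, the empty salt, and the 32-byte output length are all assumptions, and the cipher that consumes the derived key is not covered here.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, info: bytes, salt: bytes = b"", length: int = 32) -> bytes:
    """HKDF (RFC 5869) with HMAC-SHA256: extract, then expand."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: condense the master key into a pseudorandom key (PRK).
    # RFC 5869 uses a zero-filled salt of hash length when none is given.
    prk = hmac.new(salt or b"\x00" * hash_len, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output bytes exist.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    # Hypothetical context string; the real `info` layout is not documented here.
    return hkdf_sha256(master_key, info=b"cloud-creds:" + user_id.encode("utf-8"))


# Stand-in for the decoded CLOUD_CREDS_KEY value (assumed to be 32 raw bytes).
master = bytes(range(32))
key_a = derive_user_key(master, "user-a")
key_b = derive_user_key(master, "user-b")
```

Binding the user ID into the `info` parameter yields a distinct key per user from a single master secret, so in a scheme like this no per-user key would need to be stored alongside the encrypted credentials.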
### Google Drive v3

- **SDK:** `google-auth-oauthlib>=1.3.1` + `google-api-python-client>=2.196.0`
- **Backend file:** `backend/storage/google_drive_backend.py`
- **Auth:** OAuth2 flow; tokens stored encrypted in DB; `token_uri`, `client_id`, `client_secret`, `access_token`, `refresh_token` in credentials dict
- **Scope:** `https://www.googleapis.com/auth/drive.file`
- **Note:** All `googleapiclient` calls are synchronous and wrapped in `asyncio.to_thread()` to avoid blocking the event loop; `cache_discovery=False` prevents `/tmp` writes (path traversal mitigation)
- **Auth env vars:** `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`
- **OAuth callback:** `{BACKEND_URL}/api/cloud/google/callback`

### Microsoft OneDrive (Graph API)

- **SDK:** `msal>=1.36.0` (token management) + `httpx>=0.27` (async Graph API calls)
- **Backend file:** `backend/storage/onedrive_backend.py`
- **API base:** `https://graph.microsoft.com/v1.0`
- **Auth:** OAuth2 via MSAL; tokens stored encrypted in DB; credentials dict contains `access_token`, `refresh_token`, `expires_at`
- **Upload strategy:** Resumable upload sessions (`createUploadSession`) for all files; chunk size 10 MB
- **Auth env vars:** `ONEDRIVE_CLIENT_ID`, `ONEDRIVE_CLIENT_SECRET`, `ONEDRIVE_TENANT_ID` (default: `"common"`)

### Nextcloud

- **Backend file:** `backend/storage/nextcloud_backend.py`
- **Inheritance:** `NextcloudBackend → WebDAVBackend → StorageBackend`
- **Protocol:** WebDAV via `webdavclient3>=3.14.7`
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
- **SSRF prevention:** `validate_cloud_url()` called at construction time and before every outbound request (`backend/storage/cloud_utils.py`)
- **No OAuth:** Credential-based only (username + password)

### Generic WebDAV

- **Backend file:** `backend/storage/webdav_backend.py`
- **SDK:** `webdavclient3>=3.14.7`
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
- **SSRF prevention:** Same dual-call
`validate_cloud_url()` pattern as Nextcloud
- **Path encoding:** `urllib.parse.quote()` per path segment to handle non-ASCII filenames

---

## Authentication & Identity

No external auth provider (SSO, Auth0, Cognito, etc.). Authentication is custom-built:

- **Password hashing:** Argon2id via `pwdlib[argon2]` — `backend/services/auth.py`
- **JWT access tokens:** PyJWT `>=2.8.0`; ES256 (ECDSA P-256) algorithm; 15-minute TTL; JTI claim for revocation; fingerprint claim (`fgp`) bound to `User-Agent + Accept-Language`
- **Refresh tokens:** 30-day httpOnly, SameSite=Strict cookie; rotated on every use; family revocation on reuse
- **JTI store:** Redis (TTL matching token lifetime)
- **TOTP (2FA):** `pyotp>=2.9.0`; replay prevention via Redis within 90 s window; QR codes generated in frontend with `qrcode ^1.5.4`
- **Backup codes:** Generated, hashed (Argon2id), stored in DB — `backend/db/models.py:BackupCode`

---

## External HTTP APIs

### HaveIBeenPwned (HIBP)

- **Purpose:** k-anonymity password breach check on registration and password change
- **Client:** `httpx` async GET to `https://api.pwnedpasswords.com/range/{prefix}`
- **Implementation:** `backend/services/auth.py:check_hibp()` — sends first 5 chars of SHA-1 hash only; fail-open (check failures are logged and do not block registration)
- **Auth:** None required (public API)

---

## Email / Notifications

- **Protocol:** SMTP via Python stdlib `smtplib` — `backend/services/email.py`
- **Transport security:** STARTTLS (port 587 default)
- **Auth:** Optional SMTP username + password
- **Auth env vars:** `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASSWORD`, `SMTP_FROM`
- **Dev fallback:** When `SMTP_HOST` is empty, email content is logged to stdout instead of sent
- **Emails sent:**
  - Password reset link (1-hour validity) — triggered from `backend/tasks/email_tasks.py`
  - Security alert (suspicious refresh token reuse / session family revocation) — triggered from `backend/services/auth.py` via Celery
- **Celery queue:** `email` queue, separate from `documents` queue

---

## Frontend ↔ Backend Communication

- **Protocol:** REST over HTTP with JSON bodies; multipart/form-data for document upload
- **Client:** Native browser `fetch` API — `frontend/src/api/` directory
- **Base path:** All requests use relative `/api/*` — no hardcoded backend hostname
- **Dev proxy:** Vite proxies `/api` → `http://backend:8000` (`frontend/vite.config.js`)
- **Auth flow:** Access token stored in Pinia store (memory only); refresh token in httpOnly cookie; token refresh handled transparently in API client

---

## Background Task Queues (Celery)

- **Broker + result backend:** Redis (`REDIS_URL`)
- **Serialization:** JSON only (no pickle)
- **Queues and task modules:**
  - `documents` — `backend/tasks/document_tasks.py` (extraction, classification, cleanup)
  - `email` — `backend/tasks/email_tasks.py` (password reset, security alert)
  - `documents` (reused) — `backend/tasks/audit_tasks.py` (audit log export)
- **Scheduled tasks (Celery Beat):**
  - `cleanup-abandoned-uploads` — every 30 minutes
  - `audit-log-daily-export` — midnight UTC daily

---

## Monitoring & Observability

- **Error tracking:** None (no Sentry, Datadog, etc.)
- **Logging:** Python stdlib `logging`; stdout; no structured logging framework
- **Health endpoint:** `GET /health` — probes PostgreSQL (`SELECT 1`) and MinIO (bucket exists check); always returns HTTP 200 with `status: ok | degraded`
- **Audit log:** All auth events, quota violations, and admin actions written to DB audit log (no document content) — `backend/services/audit.py`, `backend/api/audit.py`

---

## CI/CD & Deployment

- **Hosting:** Docker Compose only; no cloud provider manifests detected
- **CI pipeline:** None detected in repository
- **Container registry:** None configured
- **Secrets management:** Environment variables only; `.env` file for local dev (not committed)

---

## Required Environment Variables Summary

| Variable | Required | Service | Purpose |
|---|---|---|---|
| `DATABASE_URL` | Yes | backend | App DB connection (DML user) |
| `DATABASE_MIGRATE_URL` | Yes | migrations | Alembic DDL connection |
| `MINIO_ENDPOINT` | Yes | backend, workers | MinIO S3 API endpoint |
| `MINIO_ACCESS_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_SECRET_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_BUCKET` | Yes | backend, workers | Object storage bucket name |
| `REDIS_URL` | Yes | backend, workers, beat | Redis DSN (broker + JTI store) |
| `SECRET_KEY` | Yes | backend | JWT signing secret |
| `CLOUD_CREDS_KEY` | Yes | celery-worker | 32-byte master key for HKDF |
| `POSTGRES_PASSWORD` | Yes | postgres service | Docker postgres init |
| `MINIO_ROOT_USER` | Yes | minio service | MinIO root credentials |
| `MINIO_ROOT_PASSWORD` | Yes | minio service | MinIO root credentials |
| `REDIS_PASSWORD` | Yes | redis service | Redis auth password |
| `SMTP_HOST` | No | backend | Transactional email (dev: logs to stdout) |
| `GOOGLE_CLIENT_ID` | No | backend | Google Drive OAuth |
| `GOOGLE_CLIENT_SECRET` | No | backend | Google Drive OAuth |
| `ONEDRIVE_CLIENT_ID` | No | backend | OneDrive OAuth |
| `ONEDRIVE_CLIENT_SECRET` | No | backend | OneDrive OAuth |
| `ADMIN_EMAIL` | No | backend | Bootstrap admin account |
| `ADMIN_PASSWORD` | No | backend | Bootstrap admin account |
| `DEFAULT_AI_PROVIDER` | No | backend | AI provider selection (default: `ollama`) |
| `DEFAULT_AI_MODEL` | No | backend | AI model selection (default: `llama3.2`) |
| `CORS_ORIGINS` | No | backend | Allowed CORS origins |
| `FRONTEND_URL` | No | backend, minio | Password reset links + MinIO CORS |
| `BACKEND_URL` | No | backend | OAuth callback URL construction |

---

*Integration audit: 2026-06-02*