# External Integrations

**Analysis Date:** 2026-06-02

## AI / ML Classification

All AI providers implement the `AIProvider` abstract interface in `backend/ai/base.py`. The active provider is selected at classification time via the `DEFAULT_AI_PROVIDER` setting (`backend/config.py`).

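
The provider abstraction described above might look roughly like the following. This is a minimal sketch, not the code in `backend/ai/base.py`: the method names `classify`, `suggest_topics`, and `health_check` come from this audit, but their signatures, the `EchoProvider` stand-in, the `PROVIDERS` registry, and `get_provider` are all assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod

MAX_AI_CHARS = 8_000  # text cap this audit lists for the Anthropic provider


class AIProvider(ABC):
    """Assumed shape of the interface; the real signatures may differ."""

    @abstractmethod
    async def classify(self, text: str) -> dict: ...

    @abstractmethod
    async def suggest_topics(self, text: str) -> list[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class EchoProvider(AIProvider):
    """Hypothetical stand-in provider; it calls no external API."""

    async def classify(self, text: str) -> dict:
        clipped = text[:MAX_AI_CHARS]  # enforce the per-call text cap
        return {"category": "unknown", "chars_seen": len(clipped)}

    async def suggest_topics(self, text: str) -> list[str]:
        return []

    async def health_check(self) -> bool:
        return True


# Hypothetical registry keyed by the value of DEFAULT_AI_PROVIDER.
PROVIDERS: dict[str, type[AIProvider]] = {"echo": EchoProvider}


def get_provider(name: str) -> AIProvider:
    """Instantiate the provider registered under `name`."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown AI provider: {name!r}") from None


result = asyncio.run(get_provider("echo").classify("x" * 10_000))
```

Keeping provider lookup in one function means a new backend only needs a class and a registry entry; callers never branch on the provider name.
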
### Anthropic Claude

- **SDK:** `anthropic>=0.26` — `backend/ai/anthropic_provider.py`
- **Client:** `anthropic.AsyncAnthropic(api_key=...)`
- **API:** Messages API (`client.messages.create`)
- **Default model:** `claude-sonnet-4-6` (configurable via `DEFAULT_AI_MODEL`)
- **Auth:** API key passed at provider instantiation; stored in DB per-user or system-wide (not yet confirmed in code)
- **Calls made:** `classify` (max_tokens=1024), `suggest_topics` (max_tokens=256), `health_check` (max_tokens=5)
- **Text cap:** 8,000 chars per call (`MAX_AI_CHARS = 8_000` in `backend/ai/anthropic_provider.py`)

### OpenAI

- **SDK:** `openai>=1.30` — `backend/ai/openai_provider.py`
- **Client:** `openai.AsyncOpenAI(api_key=..., base_url=...)`
- **API:** Chat Completions (`client.chat.completions.create`)
- **Default model:** `gpt-4o`
- **Auth:** `api_key` at instantiation; `base_url` override supported for custom endpoints

### Ollama (local, OpenAI-compatible)

- **Provider file:** `backend/ai/ollama_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
- **Default base URL:** `http://host.docker.internal:11434/v1`
- **Default model:** `llama3.2`
- **Auth:** Stub key `"ollama"` — no real auth
- **Network path:** Reaches host machine Ollama daemon via Docker `extra_hosts: host.docker.internal:host-gateway`

### LM Studio (local, OpenAI-compatible)

- **Provider file:** `backend/ai/lmstudio_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
- **Default base URL:** `http://host.docker.internal:1234/v1`
- **Default model:** `gemma-4-e4b-it`
- **Auth:** Stub key `"lm-studio"` — no real auth
- **Network path:** Same `host.docker.internal` Docker alias as Ollama

---

## Data Storage

### PostgreSQL (primary database)

- **Image:** `postgres:17-alpine` (Docker Compose)
- **Driver:** `psycopg[binary]>=3.3.4` (psycopg v3 async)
- **ORM:** SQLAlchemy 2.0 asyncio — `backend/db/session.py`
- **Schema migrations:** Alembic — `backend/migrations/`
- **Connection env vars:** `DATABASE_URL` (app user, DML only), `DATABASE_MIGRATE_URL` (migrate user, DDL)
- **Role separation:** `docuvault_app` (DML), `docuvault_migrate` (DDL) — `docker/postgres/initdb.d/01-init-users.sql`

### MinIO (object storage)

- **Image:** `minio/minio:latest` (Docker Compose), ports 9000 + 9001
- **SDK:** `minio>=7.2.20` — `backend/storage/minio_backend.py`
- **Object key scheme:** `{user_id}/{document_id}/{uuid4()}{ext}` — human filenames stored in DB only
- **Presigned URLs:** Generated for browser direct-PUT uploads and GET downloads
- **Auth env vars:** `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_BUCKET`
- **Public endpoint:** `MINIO_PUBLIC_ENDPOINT` — browser-resolvable hostname for presigned URLs (may differ from internal Docker endpoint)
- **CORS:** `MINIO_API_CORS_ALLOW_ORIGIN` set to `FRONTEND_URL` to allow browser preflight

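
The object key scheme above can be illustrated with a small helper. This is a sketch under assumptions: the function name `build_object_key` is hypothetical, and lowercasing the extension is a choice made here, not something the audit confirms.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Build a key of the form {user_id}/{document_id}/{uuid4()}{ext}.

    Only the extension of the uploaded filename is kept; the human-readable
    name never reaches object storage.
    """
    ext = PurePosixPath(filename).suffix.lower()  # lowercasing is an assumption
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


key = build_object_key("u1", "d1", "Tax Return 2024.PDF")
```

Because the random UUID replaces the filename, two uploads of the same file under one document never collide, and keys reveal nothing about document titles.
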
### Redis

- **Image:** `redis:7-alpine` (Docker Compose), password-protected
- **Client:** `redis>=4.6.0` (async via `redis.asyncio`)
- **Uses:**
  - Celery broker and result backend (`backend/celery_app.py`)
  - JTI token revocation store (access + refresh token blacklist)
  - Per-account rate limiting via slowapi (`backend/main.py`)
  - TOTP replay prevention (used TOTP codes invalidated within 90 s window)
- **Auth env var:** `REDIS_URL` (includes password in DSN)

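
One common way to implement the TOTP replay prevention listed above is an atomic `SET` with `nx=True` and a 90-second expiry, which redis-py supports. The sketch below assumes that approach; the key layout, the `consume_totp_code` name, and the in-memory `FakeRedis` (used so the example runs without a server) are illustrative and not taken from the codebase.

```python
import asyncio
import time

TOTP_REPLAY_WINDOW_S = 90  # window stated in this audit


class FakeRedis:
    """In-memory stand-in for the one redis.asyncio call used below."""

    def __init__(self) -> None:
        self._expiry: dict[str, float] = {}

    async def set(self, name, value, ex=None, nx=False):
        now = time.monotonic()
        expires_at = self._expiry.get(name)
        if nx and expires_at is not None and expires_at > now:
            return None  # redis-py returns None when NX blocks the write
        self._expiry[name] = now + ex if ex is not None else float("inf")
        return True


async def consume_totp_code(redis, user_id: str, code: str) -> bool:
    """Return True the first time a code is seen, False on a replay."""
    key = f"totp:used:{user_id}:{code}"  # hypothetical key layout
    stored = await redis.set(key, "1", ex=TOTP_REPLAY_WINDOW_S, nx=True)
    return bool(stored)


async def _demo() -> tuple[bool, bool]:
    redis = FakeRedis()
    first = await consume_totp_code(redis, "u1", "123456")
    second = await consume_totp_code(redis, "u1", "123456")
    return first, second


first, second = asyncio.run(_demo())
```
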
---

## Cloud Storage Backends

All backends implement the `StorageBackend` ABC from `backend/storage/base.py`. Credentials are encrypted at rest, with per-user keys derived via HKDF from a master key in the `CLOUD_CREDS_KEY` env var.

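
Per-user key derivation with HKDF can be sketched with the standard library alone. This follows RFC 5869 with SHA-256; the `info` label `cloud-creds:{user_id}`, the empty salt, and the function names are assumptions, and the real code may well use a library implementation instead.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(master_key: bytes, info: bytes, length: int = 32,
                salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: an empty salt is replaced by HASH_LEN zero bytes.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    """Derive a 32-byte key bound to one user (info label is hypothetical)."""
    return hkdf_sha256(master_key, info=f"cloud-creds:{user_id}".encode())


master = bytes(range(32))  # placeholder; never hardcode a real master key
key_a = derive_user_key(master, "user-a")
key_b = derive_user_key(master, "user-b")
```

Binding the user ID into `info` means a leaked per-user key does not expose other users' credentials or the master key itself.
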
### Google Drive v3

- **SDK:** `google-auth-oauthlib>=1.3.1` + `google-api-python-client>=2.196.0`
- **Backend file:** `backend/storage/google_drive_backend.py`
- **Auth:** OAuth2 flow; tokens stored encrypted in DB; `token_uri`, `client_id`, `client_secret`, `access_token`, `refresh_token` in credentials dict
- **Scope:** `https://www.googleapis.com/auth/drive.file`
- **Note:** All `googleapiclient` calls are synchronous and wrapped in `asyncio.to_thread()` to avoid blocking the event loop; `cache_discovery=False` prevents `/tmp` writes (path traversal mitigation)
- **Auth env vars:** `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`
- **OAuth callback:** `{BACKEND_URL}/api/cloud/google/callback`

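
The `asyncio.to_thread()` wrapping noted above follows a simple pattern. In this sketch `list_files_blocking` is a hypothetical stand-in for a synchronous `googleapiclient` request; no Google API is called.

```python
import asyncio
import time


def list_files_blocking(folder_id: str) -> list[str]:
    """Stand-in for a blocking SDK call such as a request's .execute()."""
    time.sleep(0.05)  # simulates network latency
    return [f"{folder_id}/a.pdf", f"{folder_id}/b.pdf"]


async def list_files(folder_id: str) -> list[str]:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(list_files_blocking, folder_id)


async def _demo() -> list[list[str]]:
    # Both calls run in separate threads and overlap in time.
    return await asyncio.gather(list_files("x"), list_files("y"))


files_x, files_y = asyncio.run(_demo())
```

Without the thread hop, each blocking call would stall every other coroutine on the loop for its full duration.
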
### Microsoft OneDrive (Graph API)

- **SDK:** `msal>=1.36.0` (token management) + `httpx>=0.27` (async Graph API calls)
- **Backend file:** `backend/storage/onedrive_backend.py`
- **API base:** `https://graph.microsoft.com/v1.0`
- **Auth:** OAuth2 via MSAL; tokens stored encrypted in DB; credentials dict contains `access_token`, `refresh_token`, `expires_at`
- **Upload strategy:** Resumable upload sessions (`createUploadSession`) for all files; chunk size 10 MB
- **Auth env vars:** `ONEDRIVE_CLIENT_ID`, `ONEDRIVE_CLIENT_SECRET`, `ONEDRIVE_TENANT_ID` (default: `"common"`)

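
In a resumable upload session, each chunk is sent with a `Content-Range` header. The sketch below computes those headers; it assumes "10 MB" means 10 MiB (10,485,760 bytes), which is a multiple of the 320 KiB fragment size that Microsoft Graph requires. The function name is hypothetical.

```python
CHUNK_SIZE = 10 * 1024 * 1024  # assumed 10 MiB; a multiple of 320 KiB


def content_ranges(total_size: int, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Return the Content-Range header value for each chunk of an upload."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    ranges = []
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1  # byte ranges are inclusive
        ranges.append(f"bytes {start}-{end}/{total_size}")
    return ranges


ranges = content_ranges(25 * 1024 * 1024)  # a 25 MiB file yields three chunks
```
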
### Nextcloud

- **Backend file:** `backend/storage/nextcloud_backend.py`
- **Inheritance:** `NextcloudBackend → WebDAVBackend → StorageBackend`
- **Protocol:** WebDAV via `webdavclient3>=3.14.7`
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
- **SSRF prevention:** `validate_cloud_url()` called at construction time and before every outbound request (`backend/storage/cloud_utils.py`)
- **No OAuth:** Credential-based only (username + password)

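
The audit does not show what `validate_cloud_url()` checks. As a rough illustration of the kind of test an SSRF guard performs, the sketch below rejects non-HTTP schemes, `localhost`, and literal IP addresses outside the public range. It is an assumption-laden simplification: it does not resolve hostnames, so on its own it would not stop a DNS name that points at an internal address.

```python
import ipaddress
from urllib.parse import urlsplit


def validate_cloud_url(url: str) -> str:
    """Return the URL if it passes basic SSRF checks, else raise ValueError.

    Illustrative only; the real function in cloud_utils.py may differ.
    """
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"}:
        raise ValueError("only http(s) URLs are allowed")
    host = parts.hostname
    if not host:
        raise ValueError("URL has no host")
    if host == "localhost" or host.endswith(".localhost"):
        raise ValueError("localhost is not allowed")
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return url  # a hostname, not an IP literal; resolution is omitted here
    if not addr.is_global:
        raise ValueError(f"non-public address rejected: {addr}")
    return url
```

Calling the check both at construction and before each request, as the audit describes, narrows the window in which a stored URL could be swapped for an internal one.
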
### Generic WebDAV

- **Backend file:** `backend/storage/webdav_backend.py`
- **SDK:** `webdavclient3>=3.14.7`
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
- **SSRF prevention:** Same dual-call `validate_cloud_url()` pattern as Nextcloud
- **Path encoding:** `urllib.parse.quote()` per path segment to handle non-ASCII filenames

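
Per-segment path encoding can be sketched as follows. Quoting each segment separately keeps the `/` separators intact while escaping everything else; the helper name `encode_webdav_path` and the choice of `safe=""` are assumptions.

```python
from urllib.parse import quote


def encode_webdav_path(path: str) -> str:
    """Percent-encode each path segment, leaving the '/' separators as-is."""
    return "/".join(quote(segment, safe="") for segment in path.split("/"))


encoded = encode_webdav_path("/Dokumente/Übersicht 2024.pdf")
# encoded == "/Dokumente/%C3%9Cbersicht%202024.pdf"
```
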
---

## Authentication & Identity

No external auth provider (SSO, Auth0, Cognito, etc.). Authentication is custom-built:

- **Password hashing:** Argon2id via `pwdlib[argon2]` — `backend/services/auth.py`
- **JWT access tokens:** PyJWT `>=2.8.0`; ES256 (ECDSA P-256) algorithm; 15-minute TTL; JTI claim for revocation; fingerprint claim (`fgp`) bound to `User-Agent + Accept-Language`
- **Refresh tokens:** 30-day httpOnly cookie with `SameSite=Strict`; rotated on every use; family revocation on reuse
- **JTI store:** Redis (TTL matching token lifetime)
- **TOTP (2FA):** `pyotp>=2.9.0`; replay prevention via Redis within 90 s window; QR codes generated in frontend with `qrcode ^1.5.4`
- **Backup codes:** Generated, hashed (Argon2id), stored in DB — `backend/db/models.py:BackupCode`

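
The `fgp` claim binds a token to the client's `User-Agent` and `Accept-Language` headers. The audit does not say how the two are combined, so the sketch below assumes a SHA-256 digest over both values joined by a newline; the function names are hypothetical.

```python
import hashlib
import hmac


def fingerprint(user_agent: str, accept_language: str) -> str:
    """Hash the two headers into a fixed-length value for the fgp claim."""
    material = f"{user_agent}\n{accept_language}".encode("utf-8")
    return hashlib.sha256(material).hexdigest()


def fingerprint_matches(claim: str, user_agent: str, accept_language: str) -> bool:
    """Compare a token's fgp claim against the current request's headers."""
    expected = fingerprint(user_agent, accept_language)
    return hmac.compare_digest(claim, expected)  # constant-time comparison


fgp = fingerprint("Mozilla/5.0", "en-US,en;q=0.9")
same_client = fingerprint_matches(fgp, "Mozilla/5.0", "en-US,en;q=0.9")
other_client = fingerprint_matches(fgp, "curl/8.5.0", "en-US,en;q=0.9")
```
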
---

## External HTTP APIs

### HaveIBeenPwned (HIBP)

- **Purpose:** k-anonymity password breach check on registration and password change
- **Client:** `httpx` async GET to `https://api.pwnedpasswords.com/range/{prefix}`
- **Implementation:** `backend/services/auth.py:check_hibp()` — sends first 5 chars of SHA-1 hash only; fail-open (check failures are logged and do not block registration)
- **Auth:** None required (public API)

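
The k-anonymity check splits the password's SHA-1 digest into a 5-character prefix, which is sent to the API, and a 35-character suffix, which is matched locally against the returned `SUFFIX:COUNT` lines. The sketch below shows only that local logic; the helper names are hypothetical, and the response body is a made-up sample, not real breach data.

```python
import hashlib


def split_sha1(password: str) -> tuple[str, str]:
    """Return (prefix, suffix) of the uppercase SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def breach_count(suffix: str, range_body: str) -> int:
    """Look up a suffix in a range response; 0 means no match was found."""
    for line in range_body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0


prefix, suffix = split_sha1("password")
# Fabricated sample response: one unrelated line, then the matching suffix.
sample_body = f"0018A45C4D1DEF81644B54AB7F969B88D65:1\r\n{suffix}:42"
hits = breach_count(suffix, sample_body)
```
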
---

## Email / Notifications

- **Protocol:** SMTP via Python stdlib `smtplib` — `backend/services/email.py`
- **Transport security:** STARTTLS (port 587 default)
- **Auth:** Optional SMTP username + password
- **Auth env vars:** `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASSWORD`, `SMTP_FROM`
- **Dev fallback:** When `SMTP_HOST` is empty, email content is logged to stdout instead of sent
- **Emails sent:**
  - Password reset link (1-hour validity) — triggered from `backend/tasks/email_tasks.py`
  - Security alert (suspicious refresh token reuse / session family revocation) — triggered from `backend/services/auth.py` via Celery
- **Celery queue:** `email` queue, separate from `documents` queue

---

## Frontend ↔ Backend Communication

- **Protocol:** HTTP REST over JSON; multipart/form-data for document upload
- **Client:** Native browser `fetch` API — `frontend/src/api/` directory
- **Base path:** All requests use relative `/api/*` — no hardcoded backend hostname
- **Dev proxy:** Vite proxies `/api` → `http://backend:8000` (`frontend/vite.config.js`)
- **Auth flow:** Access token stored in Pinia store (memory only); refresh token in httpOnly cookie; token refresh handled transparently in API client

---

## Background Task Queues (Celery)

- **Broker + result backend:** Redis (`REDIS_URL`)
- **Serialization:** JSON only (no pickle)
- **Queues and task modules:**
  - `documents` — `backend/tasks/document_tasks.py` (extraction, classification, cleanup)
  - `email` — `backend/tasks/email_tasks.py` (password reset, security alert)
  - `documents` (reused) — `backend/tasks/audit_tasks.py` (audit log export)
- **Scheduled tasks (Celery Beat):**
  - `cleanup-abandoned-uploads` — every 30 minutes
  - `audit-log-daily-export` — midnight UTC daily

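
The queue layout above maps onto standard Celery setting names. The dictionary below is a sketch of what such a configuration could contain, suitable for `app.conf.update(**CELERY_SETTINGS)`; the glob route patterns and the task path in the beat entry are assumptions, and the daily midnight export would normally use `celery.schedules.crontab(hour=0, minute=0)`, omitted here to keep the example free of third-party imports. `queue_for` is a hypothetical helper that mimics how glob routes resolve.

```python
from datetime import timedelta
from fnmatch import fnmatch

CELERY_SETTINGS = {
    # JSON only: pickle payloads are never accepted.
    "task_serializer": "json",
    "result_serializer": "json",
    "accept_content": ["json"],
    # Route tasks to queues by module (patterns are assumptions).
    "task_routes": {
        "backend.tasks.document_tasks.*": {"queue": "documents"},
        "backend.tasks.audit_tasks.*": {"queue": "documents"},
        "backend.tasks.email_tasks.*": {"queue": "email"},
    },
    "beat_schedule": {
        "cleanup-abandoned-uploads": {
            # Hypothetical task path; the real name is not shown in the audit.
            "task": "backend.tasks.document_tasks.cleanup_abandoned_uploads",
            "schedule": timedelta(minutes=30),
        },
    },
}


def queue_for(task_name: str, default: str = "celery") -> str:
    """Resolve a task name against the glob routes, first match wins."""
    for pattern, route in CELERY_SETTINGS["task_routes"].items():
        if fnmatch(task_name, pattern):
            return route["queue"]
    return default
```
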
---

## Monitoring & Observability

- **Error tracking:** None (no Sentry, Datadog, etc.)
- **Logging:** Python stdlib `logging`; stdout; no structured logging framework
- **Health endpoint:** `GET /health` — probes PostgreSQL (`SELECT 1`) and MinIO (bucket exists check); always returns HTTP 200 with `status: ok | degraded`
- **Audit log:** All auth events, quota violations, and admin actions written to DB audit log (no document content) — `backend/services/audit.py`, `backend/api/audit.py`

---

## CI/CD & Deployment

- **Hosting:** Docker Compose only; no cloud provider manifests detected
- **CI pipeline:** None detected in repository
- **Container registry:** None configured
- **Secrets management:** Environment variables only; `.env` file for local dev (not committed)

---

## Required Environment Variables Summary

| Variable | Required | Service | Purpose |
|---|---|---|---|
| `DATABASE_URL` | Yes | backend | App DB connection (DML user) |
| `DATABASE_MIGRATE_URL` | Yes | migrations | Alembic DDL connection |
| `MINIO_ENDPOINT` | Yes | backend, workers | MinIO S3 API endpoint |
| `MINIO_ACCESS_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_SECRET_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_BUCKET` | Yes | backend, workers | Object storage bucket name |
| `REDIS_URL` | Yes | backend, workers, beat | Redis DSN (broker + JTI store) |
| `SECRET_KEY` | Yes | backend | JWT signing secret |
| `CLOUD_CREDS_KEY` | Yes | celery-worker | 32-byte master key for HKDF |
| `POSTGRES_PASSWORD` | Yes | postgres service | Docker postgres init |
| `MINIO_ROOT_USER` | Yes | minio service | MinIO root credentials |
| `MINIO_ROOT_PASSWORD` | Yes | minio service | MinIO root credentials |
| `REDIS_PASSWORD` | Yes | redis service | Redis auth password |
| `SMTP_HOST` | No | backend | Transactional email (dev: logs to stdout) |
| `GOOGLE_CLIENT_ID` | No | backend | Google Drive OAuth |
| `GOOGLE_CLIENT_SECRET` | No | backend | Google Drive OAuth |
| `ONEDRIVE_CLIENT_ID` | No | backend | OneDrive OAuth |
| `ONEDRIVE_CLIENT_SECRET` | No | backend | OneDrive OAuth |
| `ADMIN_EMAIL` | No | backend | Bootstrap admin account |
| `ADMIN_PASSWORD` | No | backend | Bootstrap admin account |
| `DEFAULT_AI_PROVIDER` | No | backend | AI provider selection (default: `ollama`) |
| `DEFAULT_AI_MODEL` | No | backend | AI model selection (default: `llama3.2`) |
| `CORS_ORIGINS` | No | backend | Allowed CORS origins |
| `FRONTEND_URL` | No | backend, minio | Password reset links + MinIO CORS |
| `BACKEND_URL` | No | backend | OAuth callback URL construction |

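
A startup check against the table above can catch a missing variable before the first request fails. The sketch below covers only the variables the table marks as required for the `backend` service; the function name and the fail-fast behavior are assumptions, not something the audit documents.

```python
import os
from typing import Mapping

# Variables marked "Yes" for the backend service in the table above.
BACKEND_REQUIRED = (
    "DATABASE_URL",
    "MINIO_ENDPOINT",
    "MINIO_ACCESS_KEY",
    "MINIO_SECRET_KEY",
    "MINIO_BUCKET",
    "REDIS_URL",
    "SECRET_KEY",
)


def missing_required(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return required variable names that are unset or empty."""
    return [name for name in BACKEND_REQUIRED if not env.get(name)]


sample_env = {name: "set" for name in BACKEND_REQUIRED}
sample_env["SECRET_KEY"] = ""  # empty values count as missing
del sample_env["REDIS_URL"]
missing = missing_required(sample_env)
```
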
---
*Integration audit: 2026-06-02*