External Integrations
Analysis Date: 2026-06-02
AI / ML Classification
All AI providers implement the AIProvider abstract interface in backend/ai/base.py. The active provider is selected at classification time via the DEFAULT_AI_PROVIDER setting (backend/config.py).
Anthropic Claude
- SDK: anthropic>=0.26 (backend/ai/anthropic_provider.py)
- Client: anthropic.AsyncAnthropic(api_key=...)
- API: Messages API (client.messages.create)
- Default model: claude-sonnet-4-6 (configurable via DEFAULT_AI_MODEL)
- Auth: API key passed at provider instantiation; stored in DB per-user or system-wide (not yet confirmed in code)
- Calls made: classify (max_tokens=1024), suggest_topics (max_tokens=256), health_check (max_tokens=5)
- Text cap: 8,000 chars per call (MAX_AI_CHARS = 8_000 in backend/ai/anthropic_provider.py)
OpenAI
- SDK: openai>=1.30 (backend/ai/openai_provider.py)
- Client: openai.AsyncOpenAI(api_key=..., base_url=...)
- API: Chat Completions (client.chat.completions.create)
- Default model: gpt-4o
- Auth: api_key at instantiation; base_url override supported for custom endpoints
Ollama (local, OpenAI-compatible)
- Provider file: backend/ai/ollama_provider.py
- Implementation: Subclass of OpenAIProvider with fixed base_url
- Default base URL: http://host.docker.internal:11434/v1
- Default model: llama3.2
- Auth: Stub key "ollama"; no real auth
- Network path: Reaches the host machine's Ollama daemon via Docker extra_hosts: host.docker.internal:host-gateway
LM Studio (local, OpenAI-compatible)
- Provider file: backend/ai/lmstudio_provider.py
- Implementation: Subclass of OpenAIProvider with fixed base_url
- Default base URL: http://host.docker.internal:1234/v1
- Default model: gemma-4-e4b-it
- Auth: Stub key "lm-studio"; no real auth
- Network path: Same host.docker.internal Docker alias as Ollama
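The provider layout described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the method signatures, the PROVIDERS registry, and get_provider are hypothetical, and the OpenAI client is replaced by a stub so the sketch runs without the SDK. Only the class names, base URLs, stub keys, and default models come from the notes above.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class AIProvider(ABC):
    """Hypothetical shape of the abstract interface in backend/ai/base.py."""

    @abstractmethod
    async def classify(self, text: str) -> str: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class OpenAIProvider(AIProvider):
    """Stand-in provider: records its settings instead of creating a real client."""

    def __init__(self, api_key: str, base_url: Optional[str] = None, model: str = "gpt-4o"):
        self.api_key = api_key
        self.base_url = base_url
        self.model = model

    async def classify(self, text: str) -> str:
        # A real provider would call client.chat.completions.create here.
        return f"{self.model}:{len(text)}"

    async def health_check(self) -> bool:
        return True


class OllamaProvider(OpenAIProvider):
    """Local provider: same protocol, fixed base URL, stub API key."""

    def __init__(self, model: str = "llama3.2"):
        super().__init__("ollama", "http://host.docker.internal:11434/v1", model)


class LMStudioProvider(OpenAIProvider):
    def __init__(self, model: str = "gemma-4-e4b-it"):
        super().__init__("lm-studio", "http://host.docker.internal:1234/v1", model)


# Hypothetical registry keyed by the DEFAULT_AI_PROVIDER value.
PROVIDERS = {"openai": OpenAIProvider, "ollama": OllamaProvider, "lmstudio": LMStudioProvider}


def get_provider(name: str, **kwargs) -> AIProvider:
    if name not in PROVIDERS:
        raise ValueError(f"unknown AI provider: {name!r}")
    return PROVIDERS[name](**kwargs)
```

Because the local providers only override constructor defaults, any fix to the OpenAI-compatible request path applies to all three at once.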
Data Storage
PostgreSQL (primary database)
- Image: postgres:17-alpine (Docker Compose)
- Driver: psycopg[binary]>=3.3.4 (psycopg v3 async)
- ORM: SQLAlchemy 2.0 asyncio (backend/db/session.py)
- Schema migrations: Alembic (backend/migrations/)
- Connection env vars: DATABASE_URL (app user, DML only), DATABASE_MIGRATE_URL (migrate user, DDL)
- Role separation: docuvault_app (DML), docuvault_migrate (DDL), defined in docker/postgres/initdb.d/01-init-users.sql
MinIO (object storage)
- Image: minio/minio:latest (Docker Compose), ports 9000 + 9001
- SDK: minio>=7.2.20 (backend/storage/minio_backend.py)
- Object key scheme: {user_id}/{document_id}/{uuid4()}{ext}; human filenames stored in DB only
- Presigned URLs: Generated for browser direct-PUT uploads and GET downloads
- Auth env vars: MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, MINIO_BUCKET
- Public endpoint: MINIO_PUBLIC_ENDPOINT, a browser-resolvable hostname for presigned URLs (may differ from internal Docker endpoint)
- CORS: MINIO_API_CORS_ALLOW_ORIGIN set to FRONTEND_URL to allow browser preflight
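The object key scheme can be illustrated with a short sketch. The function name build_object_key and its parameters are hypothetical; only the {user_id}/{document_id}/{uuid4()}{ext} layout is taken from the notes above.

```python
import os
import uuid


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Build a storage key that keeps the extension but drops the human filename."""
    # Only the extension survives; the original name is assumed to live in the DB.
    ext = os.path.splitext(filename)[1]
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"
```

Keeping user-supplied names out of the key means no sanitising or escaping is needed on the storage side, and two uploads with the same filename cannot collide.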
Redis
- Image: redis:7-alpine (Docker Compose), password-protected
- Client: redis>=4.6.0 (async via redis.asyncio)
- Uses:
  - Celery broker and result backend (backend/celery_app.py)
  - JTI token revocation store (access + refresh token blacklist)
  - Per-account rate limiting via slowapi (backend/main.py)
  - TOTP replay prevention (used TOTP codes invalidated within 90 s window)
- Auth env var: REDIS_URL (includes password in DSN)
Cloud Storage Backends
All backends implement the StorageBackend ABC from backend/storage/base.py. Credentials are encrypted at rest with per-user keys derived via HKDF from the master key in the CLOUD_CREDS_KEY env var.
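As a rough illustration of per-user key derivation, the sketch below implements HKDF (RFC 5869) with HMAC-SHA256 from the standard library. The choice of SHA-256, the empty salt, and the "cloud-creds:" info prefix are assumptions made for the example; the project's actual parameters and the cipher used with the derived key are not stated in this document.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF per RFC 5869: extract a pseudorandom key, then expand it to `length` bytes."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: an empty salt is replaced by HASH_LEN zero bytes, as the RFC specifies.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, master_key, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    """Hypothetical helper: bind the derived key to one user via the info parameter."""
    return hkdf_sha256(master_key, salt=b"", info=b"cloud-creds:" + user_id.encode())
```

With this construction a leaked per-user key does not reveal the master key or any other user's key, which is the usual motivation for deriving keys rather than using the master key directly.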
Google Drive v3
- SDK: google-auth-oauthlib>=1.3.1 + google-api-python-client>=2.196.0
- Backend file: backend/storage/google_drive_backend.py
- Auth: OAuth2 flow; tokens stored encrypted in DB; token_uri, client_id, client_secret, access_token, refresh_token in credentials dict
- Scope: https://www.googleapis.com/auth/drive.file
- Note: All googleapiclient calls are synchronous and wrapped in asyncio.to_thread() to avoid blocking the event loop; cache_discovery=False prevents /tmp writes (path traversal mitigation)
- Auth env vars: GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET
- OAuth callback: {BACKEND_URL}/api/cloud/google/callback
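The asyncio.to_thread() wrapping mentioned in the note can be sketched as below. The blocking function is a stand-in for a synchronous googleapiclient call; its name, arguments, and return value are invented for the example.

```python
import asyncio
import time


def list_files_blocking(folder_id: str) -> list[str]:
    """Stand-in for a synchronous SDK call that would block the event loop."""
    time.sleep(0.05)  # simulates network latency
    return [f"{folder_id}/a.pdf", f"{folder_id}/b.pdf"]


async def list_files(folder_id: str) -> list[str]:
    # Run the blocking call in a worker thread so other coroutines keep running.
    return await asyncio.to_thread(list_files_blocking, folder_id)


async def main() -> list[list[str]]:
    # Both listings proceed concurrently even though each call blocks its own thread.
    return await asyncio.gather(list_files("x"), list_files("y"))
```

asyncio.to_thread() uses the event loop's default thread pool, so very high fan-out may call for a bounded semaphore around these calls.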
Microsoft OneDrive (Graph API)
- SDK: msal>=1.36.0 (token management) + httpx>=0.27 (async Graph API calls)
- Backend file: backend/storage/onedrive_backend.py
- API base: https://graph.microsoft.com/v1.0
- Auth: OAuth2 via MSAL; tokens stored encrypted in DB; credentials dict contains access_token, refresh_token, expires_at
- Upload strategy: Resumable upload sessions (createUploadSession) for all files; chunk size 10 MB
- Auth env vars: ONEDRIVE_CLIENT_ID, ONEDRIVE_CLIENT_SECRET, ONEDRIVE_TENANT_ID (default: "common")
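To make the chunked upload concrete, the sketch below computes the byte ranges and Content-Range header values for an upload session. It assumes that "10 MB" means 10 MiB (10,485,760 bytes), which is a multiple of the 320 KiB granularity that Microsoft Graph documents for upload-session chunks; the function name is hypothetical.

```python
CHUNK_SIZE = 10 * 1024 * 1024  # assumed 10 MiB, a multiple of 320 KiB


def content_ranges(total_size: int, chunk_size: int = CHUNK_SIZE):
    """Yield (start, end, header) per chunk; `end` is inclusive, as in Content-Range."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end, f"bytes {start}-{end}/{total_size}"
        start = end + 1
```

Each header value would accompany a PUT of the corresponding byte slice to the session's upload URL; only the final chunk may be shorter than the chunk size.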
Nextcloud
- Backend file: backend/storage/nextcloud_backend.py
- Inheritance: NextcloudBackend → WebDAVBackend → StorageBackend
- Protocol: WebDAV via webdavclient3>=3.14.7
- Credentials dict: {"server_url": str, "username": str, "password": str}
- SSRF prevention: validate_cloud_url() called at construction time and before every outbound request (backend/storage/cloud_utils.py)
- No OAuth: Credential-based only (username + password)
Generic WebDAV
- Backend file: backend/storage/webdav_backend.py
- SDK: webdavclient3>=3.14.7
- Credentials dict: {"server_url": str, "username": str, "password": str}
- SSRF prevention: Same dual-call validate_cloud_url() pattern as Nextcloud
- Path encoding: urllib.parse.quote() per path segment to handle non-ASCII filenames
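Per-segment path encoding can be sketched in a few lines. The helper name encode_webdav_path is hypothetical, and passing safe="" (so that reserved characters inside a segment are also escaped) is an assumption about the intended behaviour.

```python
from urllib.parse import quote


def encode_webdav_path(path: str) -> str:
    """Percent-encode each segment separately so the '/' separators stay intact."""
    return "/".join(quote(segment, safe="") for segment in path.split("/"))
```

For example, encode_webdav_path("/docs/résumé 2024.pdf") yields "/docs/r%C3%A9sum%C3%A9%202024.pdf". Encoding the whole path in one call with safe="" would escape the separators as well, which is why the split is needed.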
Authentication & Identity
No external auth provider (SSO, Auth0, Cognito, etc.). Authentication is custom-built:
- Password hashing: Argon2id via pwdlib[argon2] (backend/services/auth.py)
- JWT access tokens: PyJWT >=2.8.0; ES256 (ECDSA P-256) algorithm; 15-minute TTL; JTI claim for revocation; fingerprint claim (fgp) bound to User-Agent + Accept-Language
- Refresh tokens: 30-day httpOnly cookie with SameSite=Strict; rotated on every use; family revocation on reuse
- JTI store: Redis (TTL matching token lifetime)
- TOTP (2FA): pyotp>=2.9.0; replay prevention via Redis within 90 s window; QR codes generated in frontend with qrcode ^1.5.4
- Backup codes: Generated, hashed (Argon2id), stored in DB (backend/db/models.py:BackupCode)
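The TOTP replay window can be illustrated with a set-if-absent-with-expiry operation, which is what the Redis command SET key value NX EX 90 provides. The sketch below swaps Redis for a small in-memory stand-in so it runs on its own; the key format, class, and function names are hypothetical, and the check is assumed to run only after the code has already been verified.

```python
import time


class InMemoryStore:
    """Tiny stand-in for Redis that supports only 'SET key NX EX ttl' semantics."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def set_nx_ex(self, key: str, ttl_seconds: int) -> bool:
        """Store the key if it is absent or expired; return True when it was stored."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True


TOTP_REPLAY_WINDOW = 90  # seconds, from the notes above


def accept_totp_once(store: InMemoryStore, user_id: str, code: str) -> bool:
    """Return False if this already-verified code was used within the replay window."""
    return store.set_nx_ex(f"totp:used:{user_id}:{code}", TOTP_REPLAY_WINDOW)
```

Doing the check and the write in one atomic operation matters: a separate "exists, then set" pair would let two concurrent logins with the same code both succeed.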
External HTTP APIs
HaveIBeenPwned (HIBP)
- Purpose: k-anonymity password breach check on registration and password change
- Client: httpx async GET to https://api.pwnedpasswords.com/range/{prefix}
- Implementation: backend/services/auth.py:check_hibp() sends only the first 5 chars of the SHA-1 hash; fail-open (check failures are logged and do not block registration)
- Auth: None required (public API)
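The k-anonymity exchange works as follows: the client hashes the password with SHA-1, sends only the first five hex characters, and matches the remaining 35 characters locally against the SUFFIX:COUNT lines in the response. The sketch below covers the hashing and matching steps only, with no network call; the function names are hypothetical and are not the project's check_hibp().

```python
import hashlib


def hibp_prefix_suffix(password: str) -> tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into the 5-char prefix and 35-char suffix."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def breach_count(range_body: str, suffix: str) -> int:
    """Scan a range response ('SUFFIX:COUNT' per line) for our suffix; 0 if absent."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0
```

Only the prefix ever leaves the process, so the service cannot tell which of the hashes sharing that prefix was being checked.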
Email / Notifications
- Protocol: SMTP via Python stdlib smtplib (backend/services/email.py)
- Transport security: STARTTLS (port 587 default)
- Auth: Optional SMTP username + password
- Auth env vars: SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASSWORD, SMTP_FROM
- Dev fallback: When SMTP_HOST is empty, email content is logged to stdout instead of sent
- Emails sent:
  - Password reset link (1-hour validity), triggered from backend/tasks/email_tasks.py
  - Security alert (suspicious refresh token reuse / session family revocation), triggered from backend/services/auth.py via Celery
- Celery queue: email queue, separate from documents queue
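A minimal sketch of the send path with the dev fallback might look like this. The function signature, the return values, and the example sender address are assumptions; the STARTTLS step, optional login, and log-instead-of-send behaviour follow the notes above.

```python
import logging
import smtplib
from email.message import EmailMessage

logger = logging.getLogger("email")


def send_email(to: str, subject: str, body: str, *, host: str = "", port: int = 587,
               user: str = "", password: str = "",
               sender: str = "noreply@example.com") -> str:
    """Send via SMTP with STARTTLS, or log the message when no host is configured."""
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = sender, to, subject
    msg.set_content(body)

    if not host:
        # Dev fallback: nothing is sent; the message goes to the log stream instead.
        logger.info("SMTP host not set; would send:\n%s", msg.as_string())
        return "logged"

    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()  # upgrade the connection before any credentials are sent
        if user:
            smtp.login(user, password)
        smtp.send_message(msg)
    return "sent"
```

Since smtplib is blocking, running this inside a Celery task (as the email queue suggests) keeps it off the web process's event loop.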
Frontend ↔ Backend Communication
- Protocol: REST over HTTP with JSON bodies; multipart/form-data for document upload
- Client: Native browser fetch API (frontend/src/api/ directory)
- Base path: All requests use relative /api/*; no hardcoded backend hostname
- Dev proxy: Vite proxies /api → http://backend:8000 (frontend/vite.config.js)
- Auth flow: Access token stored in Pinia store (memory only); refresh token in httpOnly cookie; token refresh handled transparently in API client
Background Task Queues (Celery)
- Broker + result backend: Redis (REDIS_URL)
- Serialization: JSON only (no pickle)
- Queues and task modules:
  - documents: backend/tasks/document_tasks.py (extraction, classification, cleanup)
  - email: backend/tasks/email_tasks.py (password reset, security alert)
  - documents (reused): backend/tasks/audit_tasks.py (audit log export)
- Scheduled tasks (Celery Beat):
  - cleanup-abandoned-uploads: every 30 minutes
  - audit-log-daily-export: midnight UTC daily
Monitoring & Observability
- Error tracking: None (no Sentry, Datadog, etc.)
- Logging: Python stdlib logging; stdout; no structured logging framework
- Health endpoint: GET /health probes PostgreSQL (SELECT 1) and MinIO (bucket exists check); always returns HTTP 200 with status: ok | degraded
- Audit log: All auth events, quota violations, and admin actions written to DB audit log (no document content); see backend/services/audit.py, backend/api/audit.py
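The "always 200, status in the body" behaviour of the health endpoint can be sketched as an aggregation over independent probes. The probe functions here are placeholders for the real PostgreSQL and MinIO checks, and the response shape beyond the status field is an assumption.

```python
import asyncio


async def probe_ok() -> None:
    """Placeholder for a passing check such as SELECT 1."""


async def probe_down() -> None:
    """Placeholder for a failing check such as an unreachable bucket."""
    raise ConnectionError("dependency unreachable")


async def run_probe(probe) -> bool:
    # Any exception counts as a failed probe; it must not fail the whole endpoint.
    try:
        await probe()
        return True
    except Exception:
        return False


async def health(probes: dict) -> tuple[int, dict]:
    """Run all probes concurrently; always report HTTP 200 with ok or degraded."""
    names = list(probes)
    results = await asyncio.gather(*(run_probe(probes[name]) for name in names))
    status = "ok" if all(results) else "degraded"
    return 200, {"status": status, "checks": dict(zip(names, results))}
```

One consequence of this design is that orchestrators which only inspect the status code will never see a failure, so any alerting has to parse the body.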
CI/CD & Deployment
- Hosting: Docker Compose only; no cloud provider manifests detected
- CI pipeline: None detected in repository
- Container registry: None configured
- Secrets management: Environment variables only; .env file for local dev (not committed)
Required Environment Variables Summary
| Variable | Required | Service | Purpose |
|---|---|---|---|
| DATABASE_URL | Yes | backend | App DB connection (DML user) |
| DATABASE_MIGRATE_URL | Yes | migrations | Alembic DDL connection |
| MINIO_ENDPOINT | Yes | backend, workers | MinIO S3 API endpoint |
| MINIO_ACCESS_KEY | Yes | backend, workers | MinIO credentials |
| MINIO_SECRET_KEY | Yes | backend, workers | MinIO credentials |
| MINIO_BUCKET | Yes | backend, workers | Object storage bucket name |
| REDIS_URL | Yes | backend, workers, beat | Redis DSN (broker + JTI store) |
| SECRET_KEY | Yes | backend | JWT signing secret |
| CLOUD_CREDS_KEY | Yes | celery-worker | 32-byte master key for HKDF |
| POSTGRES_PASSWORD | Yes | postgres service | Docker postgres init |
| MINIO_ROOT_USER | Yes | minio service | MinIO root credentials |
| MINIO_ROOT_PASSWORD | Yes | minio service | MinIO root credentials |
| REDIS_PASSWORD | Yes | redis service | Redis auth password |
| SMTP_HOST | No | backend | Transactional email (dev: logs to stdout) |
| GOOGLE_CLIENT_ID | No | backend | Google Drive OAuth |
| GOOGLE_CLIENT_SECRET | No | backend | Google Drive OAuth |
| ONEDRIVE_CLIENT_ID | No | backend | OneDrive OAuth |
| ONEDRIVE_CLIENT_SECRET | No | backend | OneDrive OAuth |
| ADMIN_EMAIL | No | backend | Bootstrap admin account |
| ADMIN_PASSWORD | No | backend | Bootstrap admin account |
| DEFAULT_AI_PROVIDER | No | backend | AI provider selection (default: ollama) |
| DEFAULT_AI_MODEL | No | backend | AI model selection (default: llama3.2) |
| CORS_ORIGINS | No | backend | Allowed CORS origins |
| FRONTEND_URL | No | backend, minio | Password reset links + MinIO CORS |
| BACKEND_URL | No | backend | OAuth callback URL construction |
Integration audit: 2026-06-02