kite/.planning/codebase/INTEGRATIONS.md

External Integrations

Analysis Date: 2026-06-02

AI / ML Classification

All AI providers implement the AIProvider abstract interface in backend/ai/base.py. The active provider is selected at classification time via the DEFAULT_AI_PROVIDER setting (backend/config.py).
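
A rough sketch of what such an abstraction could look like follows. The three method names mirror the calls listed for the Anthropic provider below; the signatures, the `EchoProvider` stand-in, and the `PROVIDERS` registry are hypothetical and not taken from backend/ai/base.py.

```python
from abc import ABC, abstractmethod


class AIProvider(ABC):
    """Hypothetical shape of the abstract provider interface."""

    @abstractmethod
    async def classify(self, text: str) -> dict: ...

    @abstractmethod
    async def suggest_topics(self, text: str) -> list: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class EchoProvider(AIProvider):
    """Stand-in provider used only to exercise the interface."""

    async def classify(self, text: str) -> dict:
        return {"category": "unknown", "chars": len(text)}

    async def suggest_topics(self, text: str) -> list:
        return text.split()[:3]

    async def health_check(self) -> bool:
        return True


# Hypothetical registry keyed by the DEFAULT_AI_PROVIDER value.
PROVIDERS = {"echo": EchoProvider}


def get_provider(name: str) -> AIProvider:
    """Resolve the provider when a classification starts, not at import time."""
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown AI provider: {name!r}") from None
```

Resolving the provider per call, rather than once at startup, is what would let a settings change take effect without a restart.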

Anthropic Claude

  • SDK: anthropic>=0.26 (backend/ai/anthropic_provider.py)
  • Client: anthropic.AsyncAnthropic(api_key=...)
  • API: Messages API (client.messages.create)
  • Default model: claude-sonnet-4-6 (configurable via DEFAULT_AI_MODEL)
  • Auth: API key passed at provider instantiation; the key is stored in the DB, either per user or system-wide (not yet confirmed in code)
  • Calls made: classify (max_tokens=1024), suggest_topics (max_tokens=256), health_check (max_tokens=5)
  • Text cap: 8,000 chars per call (MAX_AI_CHARS = 8_000 in backend/ai/anthropic_provider.py)

OpenAI

  • SDK: openai>=1.30 (backend/ai/openai_provider.py)
  • Client: openai.AsyncOpenAI(api_key=..., base_url=...)
  • API: Chat Completions (client.chat.completions.create)
  • Default model: gpt-4o
  • Auth: api_key at instantiation; base_url override supported for custom endpoints

Ollama (local, OpenAI-compatible)

  • Provider file: backend/ai/ollama_provider.py
  • Implementation: Subclass of OpenAIProvider with fixed base_url
  • Default base URL: http://host.docker.internal:11434/v1
  • Default model: llama3.2
  • Auth: Stub key "ollama" — no real auth
  • Network path: Reaches host machine Ollama daemon via Docker extra_hosts: host.docker.internal:host-gateway

LM Studio (local, OpenAI-compatible)

  • Provider file: backend/ai/lmstudio_provider.py
  • Implementation: Subclass of OpenAIProvider with fixed base_url
  • Default base URL: http://host.docker.internal:1234/v1
  • Default model: gemma-4-e4b-it
  • Auth: Stub key "lm-studio" — no real auth
  • Network path: Same host.docker.internal Docker alias as Ollama
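
The subclassing pattern shared by the two local providers can be sketched as follows. `OpenAIProvider` here is a simplified stand-in that only stores its arguments (the real class wraps openai.AsyncOpenAI), and the constructor signatures are assumptions; the URLs, stub keys, and default models are the ones listed above.

```python
from typing import Optional


class OpenAIProvider:
    """Simplified stand-in for the real provider class."""

    def __init__(self, api_key: str, base_url: Optional[str] = None, model: str = "gpt-4o"):
        self.api_key = api_key
        self.base_url = base_url
        self.model = model


class OllamaProvider(OpenAIProvider):
    def __init__(self, model: str = "llama3.2"):
        # The stub key satisfies the client's non-empty api_key requirement.
        super().__init__(
            api_key="ollama",
            base_url="http://host.docker.internal:11434/v1",
            model=model,
        )


class LMStudioProvider(OpenAIProvider):
    def __init__(self, model: str = "gemma-4-e4b-it"):
        super().__init__(
            api_key="lm-studio",
            base_url="http://host.docker.internal:1234/v1",
            model=model,
        )
```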

Data Storage

PostgreSQL (primary database)

  • Image: postgres:17-alpine (Docker Compose)
  • Driver: psycopg[binary]>=3.3.4 (psycopg v3 async)
  • ORM: SQLAlchemy 2.0 asyncio — backend/db/session.py
  • Schema migrations: Alembic — backend/migrations/
  • Connection env vars: DATABASE_URL (app user, DML only), DATABASE_MIGRATE_URL (migrate user, DDL)
  • Role separation: docuvault_app (DML), docuvault_migrate (DDL) — docker/postgres/initdb.d/01-init-users.sql

MinIO (object storage)

  • Image: minio/minio:latest (Docker Compose), ports 9000 + 9001
  • SDK: minio>=7.2.20 (backend/storage/minio_backend.py)
  • Object key scheme: {user_id}/{document_id}/{uuid4()}{ext} — human filenames stored in DB only
  • Presigned URLs: Generated for browser direct-PUT uploads and GET downloads
  • Auth env vars: MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, MINIO_BUCKET
  • Public endpoint: MINIO_PUBLIC_ENDPOINT — browser-resolvable hostname for presigned URLs (may differ from internal Docker endpoint)
  • CORS: MINIO_API_CORS_ALLOW_ORIGIN set to FRONTEND_URL to allow browser preflight
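
To illustrate the object key scheme, here is a minimal sketch. The function name `build_object_key` is hypothetical, and taking the extension from the last suffix of the uploaded filename is an assumption about how `{ext}` is derived.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Build an opaque key; only the extension of the original name survives."""
    ext = PurePosixPath(filename).suffix  # "" when the name has no extension
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"
```

Because the human-readable filename never reaches the key, object listings and presigned URLs do not leak document titles.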

Redis

  • Image: redis:7-alpine (Docker Compose), password-protected
  • Client: redis>=4.6.0 (async via redis.asyncio)
  • Uses:
    • Celery broker and result backend (backend/celery_app.py)
    • JTI token revocation store (access + refresh token blacklist)
    • Per-account rate limiting via slowapi (backend/main.py)
    • TOTP replay prevention (used TOTP codes are rejected within a 90 s window)
  • Auth env var: REDIS_URL (includes password in DSN)
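
As an illustration of the TOTP replay guard, the sketch below relies on Redis `SET` with `nx=True` and `ex=...`, which is atomic. The key layout and the function name are hypothetical; `FakeRedis` is an in-memory stand-in (it ignores expiry) so the example runs without a server.

```python
TOTP_REPLAY_WINDOW_S = 90  # window stated above


async def mark_totp_used(redis, user_id: str, code: str) -> bool:
    """Return True on first use, False if the code was already seen.

    `redis` is expected to behave like redis.asyncio.Redis.
    """
    key = f"totp:used:{user_id}:{code}"  # hypothetical key layout
    created = await redis.set(key, "1", nx=True, ex=TOTP_REPLAY_WINDOW_S)
    return bool(created)  # redis-py returns None when NX prevents the write


class FakeRedis:
    """In-memory stand-in that mimics SET NX semantics only."""

    def __init__(self):
        self._data = {}

    async def set(self, key, value, nx=False, ex=None):
        if nx and key in self._data:
            return None
        self._data[key] = value
        return True
```

The same set-if-absent-with-TTL pattern would also fit the JTI blacklist, with the TTL matching the remaining token lifetime.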

Cloud Storage Backends

All backends implement StorageBackend ABC from backend/storage/base.py. Credentials are encrypted at rest with HKDF per-user key derivation using master key from CLOUD_CREDS_KEY env var.
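
Per-user key derivation with HKDF might look like the sketch below, which implements HKDF-SHA256 (RFC 5869) with the standard library only. How the real code binds the user to the derived key is not documented here: passing the user id as `info` and using a fixed application salt are both assumptions, as is the name `derive_user_key`.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    if length > 255 * 32:
        raise ValueError("requested length too large")
    prk = hmac.new(salt or b"\x00" * 32, master_key, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    # Assumption: user id goes into `info`; the salt value is illustrative.
    return hkdf_sha256(master_key, salt=b"cloud-creds", info=user_id.encode("utf-8"))
```

With this layout, a leaked per-user key does not expose the master key or any other user's key.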

Google Drive v3

  • SDK: google-auth-oauthlib>=1.3.1 + google-api-python-client>=2.196.0
  • Backend file: backend/storage/google_drive_backend.py
  • Auth: OAuth2 flow; tokens stored encrypted in DB; token_uri, client_id, client_secret, access_token, refresh_token in credentials dict
  • Scope: https://www.googleapis.com/auth/drive.file
  • Note: All googleapiclient calls are synchronous and wrapped in asyncio.to_thread() to avoid blocking the event loop; cache_discovery=False prevents /tmp writes (path traversal mitigation)
  • Auth env vars: GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET
  • OAuth callback: {BACKEND_URL}/api/cloud/google/callback
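
The `asyncio.to_thread()` wrapping mentioned in the note can be sketched as follows. `blocking_list_files` is a hypothetical stand-in for a synchronous googleapiclient call; it returns the name of the thread it ran on so the offloading is visible.

```python
import asyncio
import threading


def blocking_list_files(folder_id: str) -> dict:
    """Stand-in for a synchronous Drive API request."""
    return {"folder": folder_id, "thread": threading.current_thread().name}


async def list_files(folder_id: str) -> dict:
    # The blocking call runs in the default thread pool, so the event loop
    # stays free to serve other requests in the meantime.
    return await asyncio.to_thread(blocking_list_files, folder_id)
```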

Microsoft OneDrive (Graph API)

  • SDK: msal>=1.36.0 (token management) + httpx>=0.27 (async Graph API calls)
  • Backend file: backend/storage/onedrive_backend.py
  • API base: https://graph.microsoft.com/v1.0
  • Auth: OAuth2 via MSAL; tokens stored encrypted in DB; credentials dict contains access_token, refresh_token, expires_at
  • Upload strategy: Resumable upload sessions (createUploadSession) for all files; chunk size 10 MB
  • Auth env vars: ONEDRIVE_CLIENT_ID, ONEDRIVE_CLIENT_SECRET, ONEDRIVE_TENANT_ID (default: "common")
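
For the resumable upload strategy, each chunk sent to the upload session URL carries a `Content-Range` header. The sketch below computes those ranges. It reads the "10 MB" above as 10 MiB, which is an assumption (10 MiB is a multiple of 320 KiB, the granularity Graph requires for chunk sizes); the function name is hypothetical.

```python
CHUNK_SIZE = 10 * 1024 * 1024  # assumed: 10 MiB


def content_ranges(total_size: int, chunk_size: int = CHUNK_SIZE):
    """Yield (start, end, header) per chunk; `end` is inclusive."""
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        yield start, end, f"bytes {start}-{end}/{total_size}"
        start = end + 1
```

A 25 MiB file, for example, yields three chunks, the last one 5 MiB long.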

Nextcloud

  • Backend file: backend/storage/nextcloud_backend.py
  • Inheritance: NextcloudBackend → WebDAVBackend → StorageBackend
  • Protocol: WebDAV via webdavclient3>=3.14.7
  • Credentials dict: {"server_url": str, "username": str, "password": str}
  • SSRF prevention: validate_cloud_url() called at construction time and before every outbound request (backend/storage/cloud_utils.py)
  • No OAuth: Credential-based only (username + password)
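
A check in the spirit of `validate_cloud_url()` might look like the sketch below. The actual rules in backend/storage/cloud_utils.py are not documented here, so the allowed schemes and the "public addresses only" policy are assumptions; `_resolve` is a hypothetical helper.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def _resolve(host: str) -> list:
    """Return the IP addresses for a host; IP literals need no DNS lookup."""
    try:
        return [ipaddress.ip_address(host)]
    except ValueError:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
        # Strip any IPv6 zone id ("%eth0") before parsing.
        return [ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos]


def validate_cloud_url(url: str) -> str:
    """Return the URL unchanged, or raise ValueError if it looks unsafe."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("only http(s) URLs with a host are allowed")
    for addr in _resolve(parts.hostname):
        if not addr.is_global:  # loopback, private, link-local, ...
            raise ValueError(f"blocked non-public address: {addr}")
    return url
```

Repeating the check before every outbound request, not only at construction, plausibly guards against a hostname whose DNS record changes after the backend was created.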

Generic WebDAV

  • Backend file: backend/storage/webdav_backend.py
  • SDK: webdavclient3>=3.14.7
  • Credentials dict: {"server_url": str, "username": str, "password": str}
  • SSRF prevention: Same dual-call validate_cloud_url() pattern as Nextcloud
  • Path encoding: urllib.parse.quote() per path segment to handle non-ASCII filenames
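
Per-segment encoding can be sketched as below. The function name `encode_remote_path` is hypothetical; the point is that each segment is quoted with `safe=""` so that characters such as `#` or `?` inside a filename are escaped while `/` remains the separator.

```python
from urllib.parse import quote


def encode_remote_path(path: str) -> str:
    """Percent-encode each segment while keeping '/' as the separator."""
    return "/".join(quote(segment, safe="") for segment in path.split("/"))
```

For example, `/Dokumente/Rechnung März.pdf` becomes `/Dokumente/Rechnung%20M%C3%A4rz.pdf`.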

Authentication & Identity

No external auth provider (SSO, Auth0, Cognito, etc.). Authentication is custom-built:

  • Password hashing: Argon2id via pwdlib[argon2] (backend/services/auth.py)
  • JWT access tokens: PyJWT >=2.8.0; ES256 (ECDSA P-256) algorithm; 15-minute TTL; JTI claim for revocation; fingerprint claim (fgp) bound to User-Agent + Accept-Language
  • Refresh tokens: 30-day httpOnly cookie with SameSite=Strict; rotated on every use; the whole token family is revoked on reuse
  • JTI store: Redis (TTL matching token lifetime)
  • TOTP (2FA): pyotp>=2.9.0; replay prevention via Redis within a 90 s window; QR codes generated in the frontend with qrcode ^1.5.4
  • Backup codes: Generated, hashed (Argon2id), stored in DB — backend/db/models.py:BackupCode
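
Refresh token rotation with family revocation can be illustrated with the in-memory sketch below. The class and method names are hypothetical, and a real implementation would persist this state rather than hold it in dictionaries; the sketch only shows the state transitions.

```python
import secrets


class RefreshTokenStore:
    """In-memory sketch of rotate-on-use with reuse detection."""

    def __init__(self):
        self._active = {}      # token -> family id
        self._used = {}        # rotated-out token -> family id
        self._revoked = set()  # revoked family ids

    def issue(self, family_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[token] = family_id
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one; reuse revokes the family."""
        if token in self._used:
            family = self._used[token]
            self._revoked.add(family)
            # Drop every still-active token in the compromised family.
            self._active = {t: f for t, f in self._active.items() if f != family}
            raise PermissionError("refresh token reuse detected; family revoked")
        family = self._active.pop(token, None)
        if family is None or family in self._revoked:
            raise PermissionError("unknown or revoked refresh token")
        self._used[token] = family
        return self.issue(family)
```

If a stolen token is replayed after the legitimate client has already rotated it, both parties lose their session and the user has to log in again, which is the event that triggers the security alert email.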

External HTTP APIs

HaveIBeenPwned (HIBP)

  • Purpose: k-anonymity password breach check on registration and password change
  • Client: httpx async GET to https://api.pwnedpasswords.com/range/{prefix}
  • Implementation: backend/services/auth.py:check_hibp() — sends first 5 chars of SHA-1 hash only; fail-open (check failures are logged and do not block registration)
  • Auth: None required (public API)
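
The k-anonymity exchange works as follows: only the first five hex characters of the SHA-1 digest leave the server, and the response lists `SUFFIX:COUNT` lines for every hash sharing that prefix. The helper names below are hypothetical, and the HTTP request itself is omitted.

```python
import hashlib


def hibp_range_query(password: str) -> tuple:
    """Split the uppercase SHA-1 hex digest into (prefix, suffix)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range_response(body: str, suffix: str) -> int:
    """Return the breach count for our suffix, or 0 if it is absent."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count.strip())
    return 0
```

The prefix goes into the `/range/{prefix}` URL; the suffix is compared locally and never sent.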

Email / Notifications

  • Protocol: SMTP via Python stdlib smtplib (backend/services/email.py)
  • Transport security: STARTTLS (port 587 default)
  • Auth: Optional SMTP username + password
  • Auth env vars: SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASSWORD, SMTP_FROM
  • Dev fallback: When SMTP_HOST is empty, email content is logged to stdout instead of sent
  • Emails sent:
    • Password reset link (1-hour validity) — triggered from backend/tasks/email_tasks.py
    • Security alert (suspicious refresh token reuse / session family revocation) — triggered from backend/services/auth.py via Celery
  • Celery queue: email queue, separate from documents queue
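
The send path with its dev fallback could be sketched as below. The function signature and the default sender address are hypothetical; in the application the values would come from the SMTP_* environment variables.

```python
import logging
import smtplib
from email.message import EmailMessage

logger = logging.getLogger(__name__)


def send_email(to: str, subject: str, body: str, *, host: str = "", port: int = 587,
               user: str = "", password: str = "",
               sender: str = "noreply@example.com") -> bool:
    """Return True if handed to an SMTP server, False if only logged."""
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = sender, to, subject
    msg.set_content(body)

    if not host:  # dev fallback: no SMTP_HOST configured
        logger.info("Email not sent (no SMTP host configured):\n%s", msg)
        return False

    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.starttls()  # upgrade to TLS before authenticating
        if user:         # auth is optional
            smtp.login(user, password)
        smtp.send_message(msg)
    return True
```

Because smtplib is blocking, running this inside a Celery task (as the email queue does) keeps it off the web server's event loop.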

Frontend ↔ Backend Communication

  • Protocol: HTTP REST over JSON; multipart/form-data for document upload
  • Client: Native browser fetch API — frontend/src/api/ directory
  • Base path: All requests use relative /api/* — no hardcoded backend hostname
  • Dev proxy: Vite proxies /api → http://backend:8000 (frontend/vite.config.js)
  • Auth flow: Access token stored in Pinia store (memory only); refresh token in httpOnly cookie; token refresh handled transparently in API client

Background Task Queues (Celery)

  • Broker + result backend: Redis (REDIS_URL)
  • Serialization: JSON only (no pickle)
  • Queues and task modules:
    • documents: backend/tasks/document_tasks.py (extraction, classification, cleanup)
    • email: backend/tasks/email_tasks.py (password reset, security alert)
    • documents (reused): backend/tasks/audit_tasks.py (audit log export)
  • Scheduled tasks (Celery Beat):
    • cleanup-abandoned-uploads — every 30 minutes
    • audit-log-daily-export — midnight UTC daily
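
A Celery configuration matching the description above might look like the fragment below. The queue names, schedule entry names, and JSON-only serialization come from this section; the app name, the placeholder Redis URL, the route patterns, and the task function names are hypothetical and may not match backend/celery_app.py.

```python
from celery import Celery
from celery.schedules import crontab

REDIS_URL = "redis://:password@redis:6379/0"  # placeholder; read from env in practice

app = Celery("app", broker=REDIS_URL, backend=REDIS_URL)
app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],  # refuse pickle payloads
    timezone="UTC",
    task_routes={
        "backend.tasks.document_tasks.*": {"queue": "documents"},
        "backend.tasks.audit_tasks.*": {"queue": "documents"},
        "backend.tasks.email_tasks.*": {"queue": "email"},
    },
    beat_schedule={
        "cleanup-abandoned-uploads": {
            # hypothetical task name
            "task": "backend.tasks.document_tasks.cleanup_abandoned_uploads",
            "schedule": crontab(minute="*/30"),
        },
        "audit-log-daily-export": {
            # hypothetical task name
            "task": "backend.tasks.audit_tasks.export_audit_log",
            "schedule": crontab(hour=0, minute=0),
        },
    },
)
```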

Monitoring & Observability

  • Error tracking: None (no Sentry, Datadog, etc.)
  • Logging: Python stdlib logging; stdout; no structured logging framework
  • Health endpoint: GET /health — probes PostgreSQL (SELECT 1) and MinIO (bucket exists check); always returns HTTP 200 with status: ok | degraded
  • Audit log: All auth events, quota violations, and admin actions written to DB audit log (no document content) — backend/services/audit.py, backend/api/audit.py
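
The aggregation behind the health endpoint can be sketched as follows. The function name and the response shape beyond the `status` field are assumptions; each probe is an async callable that raises on failure.

```python
import asyncio


async def health(probes: dict) -> dict:
    """Run all probes concurrently; any failure degrades the overall status."""
    names = list(probes)
    results = await asyncio.gather(
        *(probes[name]() for name in names), return_exceptions=True
    )
    checks = {name: not isinstance(res, Exception) for name, res in zip(names, results)}
    status = "ok" if all(checks.values()) else "degraded"
    # The HTTP layer would return this body with status code 200 either way.
    return {"status": status, "checks": checks}
```

Since the status code is always 200, a monitor has to inspect the `status` field in the body to detect a degraded dependency.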

CI/CD & Deployment

  • Hosting: Docker Compose only; no cloud provider manifests detected
  • CI pipeline: None detected in repository
  • Container registry: None configured
  • Secrets management: Environment variables only; .env file for local dev (not committed)

Required Environment Variables Summary

| Variable | Required | Service | Purpose |
|---|---|---|---|
| DATABASE_URL | Yes | backend | App DB connection (DML user) |
| DATABASE_MIGRATE_URL | Yes | migrations | Alembic DDL connection |
| MINIO_ENDPOINT | Yes | backend, workers | MinIO S3 API endpoint |
| MINIO_ACCESS_KEY | Yes | backend, workers | MinIO credentials |
| MINIO_SECRET_KEY | Yes | backend, workers | MinIO credentials |
| MINIO_BUCKET | Yes | backend, workers | Object storage bucket name |
| REDIS_URL | Yes | backend, workers, beat | Redis DSN (broker + JTI store) |
| SECRET_KEY | Yes | backend | JWT signing secret |
| CLOUD_CREDS_KEY | Yes | celery-worker | 32-byte master key for HKDF |
| POSTGRES_PASSWORD | Yes | postgres service | Docker postgres init |
| MINIO_ROOT_USER | Yes | minio service | MinIO root credentials |
| MINIO_ROOT_PASSWORD | Yes | minio service | MinIO root credentials |
| REDIS_PASSWORD | Yes | redis service | Redis auth password |
| SMTP_HOST | No | backend | Transactional email (dev: logs to stdout) |
| GOOGLE_CLIENT_ID | No | backend | Google Drive OAuth |
| GOOGLE_CLIENT_SECRET | No | backend | Google Drive OAuth |
| ONEDRIVE_CLIENT_ID | No | backend | OneDrive OAuth |
| ONEDRIVE_CLIENT_SECRET | No | backend | OneDrive OAuth |
| ADMIN_EMAIL | No | backend | Bootstrap admin account |
| ADMIN_PASSWORD | No | backend | Bootstrap admin account |
| DEFAULT_AI_PROVIDER | No | backend | AI provider selection (default: ollama) |
| DEFAULT_AI_MODEL | No | backend | AI model selection (default: llama3.2) |
| CORS_ORIGINS | No | backend | Allowed CORS origins |
| FRONTEND_URL | No | backend, minio | Password reset links + MinIO CORS |
| BACKEND_URL | No | backend | OAuth callback URL construction |

Integration audit: 2026-06-02