Architecture

Analysis Date: 2026-06-02

System Overview

┌──────────────────────────────────────────────────────────────────────────┐
│                        Browser (Vue 3 SPA)                               │
│  Pinia stores: auth · documents · folders · topics · cloudConnections   │
│  Router: /  /folders/:id  /document/:id  /cloud  /admin  /shared        │
└─────────────────────┬──────────────────────────────────┬────────────────┘
                      │ fetch() + Bearer JWT              │ PUT (presigned)
                      ▼                                   ▼
┌──────────────────────────────────┐     ┌───────────────────────────────┐
│   FastAPI Backend  :8000         │     │   MinIO  :9000                │
│   api/auth  api/documents        │     │   Bucket: docuvault           │
│   api/folders  api/shares        │     │   Keys: {uid}/{did}/{uuid}{e} │
│   api/cloud  api/admin           │     └───────────────────────────────┘
│   api/audit  api/topics          │
│                                  │     ┌───────────────────────────────┐
│   Middleware stack (per request):│     │   Cloud Backends              │
│     OriginValidation (first)     │     │   Google Drive / OneDrive     │
│     CORS                         │     │   Nextcloud / WebDAV          │
│     SecurityHeaders (CSP, etc.)  │     └───────────────────────────────┘
│     SlowAPI rate limiter         │
│                                  │     ┌───────────────────────────────┐
│   Deps layer:                    │     │   Celery Worker               │
│     get_db (AsyncSession)        │◄────►   tasks/document_tasks.py     │
│     get_current_user (JWT)       │     │   tasks/email_tasks.py        │
│     get_current_admin            │     │   tasks/audit_tasks.py        │
│     get_regular_user             │     └───────────────────────────────┘
└────────────┬─────────────────────┘
             │ SQLAlchemy async          ┌───────────────────────────────┐
             ▼                          │   Redis  :6379                │
┌──────────────────────────┐           │   Rate limiting (slowapi)     │
│   PostgreSQL  :5432      │           │   TOTP replay cache           │
│   11 tables:             │◄──────────►   Celery broker + results     │
│   users · quotas         │           │   OAuth state tokens (TTL)    │
│   refresh_tokens         │           └───────────────────────────────┘
│   backup_codes · folders │
│   documents · topics     │           ┌───────────────────────────────┐
│   document_topics        │           │   AI Providers (pluggable)    │
│   shares · audit_log     │           │   Ollama · OpenAI · Anthropic │
│   cloud_connections      │           │   LMStudio                    │
│   groups (v2 stub)       │           │   ai/base.py → AIProvider ABC │
└──────────────────────────┘           └───────────────────────────────┘

Component Responsibilities

| Component | Responsibility | Key File |
|---|---|---|
| FastAPI app | ASGI entry point, middleware, router registration | backend/main.py |
| Auth API | Register, login (TOTP/backup), refresh, logout, password reset | backend/api/auth.py |
| Documents API | Upload URL, confirm, list, delete, classify, stream content | backend/api/documents.py |
| Folders API | CRUD folders, move documents between folders | backend/api/folders.py |
| Shares API | Grant/revoke/list document shares between users | backend/api/shares.py |
| Cloud API | OAuth flows, WebDAV connect, folder listing, default storage | backend/api/cloud.py |
| Admin API | User CRUD, quota, AI config, audit log, delete user | backend/api/admin.py |
| Audit API | Paginated audit log viewer + CSV export | backend/api/audit.py |
| Topics API | CRUD topics, topic suggestions | backend/api/topics.py |
| Auth service | Password hashing, JWT, refresh token family, TOTP, HIBP | backend/services/auth.py |
| Audit service | write_audit_log(), flushed within caller's transaction | backend/services/audit.py |
| Classifier service | Selects AI provider, assigns topics, auto-creates suggestions | backend/services/classifier.py |
| Extractor service | PDF/DOCX/image/text extraction | backend/services/extractor.py |
| Storage service | ORM queries for documents + topic resolution | backend/services/storage.py |
| StorageBackend ABC | Interface for all object storage backends | backend/storage/base.py |
| Storage factory | Returns MinIOBackend or cloud backend from document record | backend/storage/__init__.py |
| MinIO backend | Presigned URL, put/get/delete, stat | backend/storage/minio_backend.py |
| Cloud backends | Google Drive, OneDrive, Nextcloud, WebDAV implementations | backend/storage/*_backend.py |
| AIProvider ABC | Interface: classify, suggest_topics, health_check | backend/ai/base.py |
| AI factory | Returns provider instance from string slug | backend/ai/__init__.py |
| Celery app | Task routing, beat schedule, JSON serialization | backend/celery_app.py |
| Document task | extract_and_classify, async bridge from sync Celery worker | backend/tasks/document_tasks.py |
| ORM models | 11-table schema, all UUID PKs, full index set | backend/db/models.py |
| DB session | Async engine, session factory (expire_on_commit=False) | backend/db/session.py |
| FastAPI deps | get_db, get_current_user, get_current_admin, get_regular_user | backend/deps/ |
| Auth store | accessToken (memory only), user, quota, refresh deduplication | frontend/src/stores/auth.js |
| Documents store | CRUD, 3-step MinIO upload with progress, search debounce | frontend/src/stores/documents.js |
| Folders store | CRUD folders, breadcrumb, rootFolders for sidebar | frontend/src/stores/folders.js |
| Topics store | CRUD topics | frontend/src/stores/topics.js |
| CloudConnections store | List/disconnect cloud connections | frontend/src/stores/cloudConnections.js |
| API client | fetch wrapper, Bearer injection, 401→refresh→retry | frontend/src/api/client.js |
| Vue Router | SPA routes, beforeEach guard (silent refresh on reload) | frontend/src/router/index.js |
| FileManagerView | Unified file manager for local folders and documents | frontend/src/views/FileManagerView.vue |
| StorageBrowser | Reusable file listing component (local + cloud modes) | frontend/src/components/storage/StorageBrowser.vue |

Pattern Overview

Overall: Layered REST API + SPA with async background processing

Key Characteristics:

  • API layer is thin — validation via Pydantic, business logic in services/
  • No ORM relationships loaded — explicit queries only (prevents N+1)
  • Async everywhere in FastAPI; Celery workers bridge to async via asyncio.run()
  • Frontend Pinia stores own data-fetching; views delegate to stores; components emit events upward
  • One DB session per request (yielded by get_db dep), one per Celery task invocation
  • All resource ownership checked inline in handlers (resource.user_id == current_user.id)
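
The sync-to-async bridge mentioned in the list above can be sketched as follows. This is a minimal illustration, not the project's task code: `extract_and_classify_async` is a hypothetical coroutine standing in for the real extraction and classification pipeline, and the Celery task decorator is omitted so the example runs with the standard library alone.

```python
import asyncio


async def extract_and_classify_async(document_id: str) -> str:
    """Hypothetical async pipeline; real code would await DB, storage, and AI calls."""
    await asyncio.sleep(0)  # stands in for awaited I/O
    return f"classified:{document_id}"


def extract_and_classify(document_id: str) -> str:
    """Sync entry point, shaped the way a Celery worker would call it.

    asyncio.run() creates a fresh event loop, runs the coroutine to
    completion, and closes the loop, so no loop is shared with FastAPI.
    """
    return asyncio.run(extract_and_classify_async(document_id))


result = extract_and_classify("doc-1")
```

Because each invocation gets its own event loop, any async resources (such as a DB session) need to be created inside the coroutine rather than reused across tasks.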

Layers

API Layer:

  • Purpose: HTTP routing, request validation, response serialization
  • Location: backend/api/
  • Contains: APIRouter instances, Pydantic request/response models, FastAPI dep injection
  • Depends on: services/, deps/, db/models.py
  • Used by: Frontend via HTTP; not called from other backend modules

Service Layer:

  • Purpose: Business logic with no FastAPI coupling (pure Python async functions)
  • Location: backend/services/
  • Contains: auth.py, audit.py, classifier.py, extractor.py, storage.py, cloud_cache.py, email.py
  • Depends on: db/models.py, storage/, ai/, config
  • Used by: api/ layer and Celery tasks

Storage Abstraction Layer:

  • Purpose: Backend-agnostic object storage interface
  • Location: backend/storage/
  • Contains: base.py (ABC), minio_backend.py, google_drive_backend.py, onedrive_backend.py, nextcloud_backend.py, webdav_backend.py, cloud_utils.py (HKDF encryption), exceptions.py
  • Depends on: config, db/models.py (for cloud credential lookup)
  • Used by: services/storage.py, api/documents.py, Celery tasks

AI Abstraction Layer:

  • Purpose: Pluggable AI provider interface for document classification
  • Location: backend/ai/
  • Contains: base.py (ABC), ollama_provider.py, openai_provider.py, anthropic_provider.py, lmstudio_provider.py, utils.py
  • Depends on: External AI APIs via httpx
  • Used by: services/classifier.py

Dependency Layer:

  • Purpose: FastAPI reusable dependencies (DI)
  • Location: backend/deps/
  • Contains: db.py (get_db), auth.py (get_current_user, get_current_admin, get_regular_user), utils.py (get_client_ip)
  • Used by: All api/ handlers

Frontend Store Layer:

  • Purpose: Application state + async API calls
  • Location: frontend/src/stores/
  • Contains: auth.js, documents.js, folders.js, topics.js, cloudConnections.js
  • Depends on: api/client.js
  • Used by: Views and components

Data Flow

Document Upload (MinIO presigned URL path)

  1. User drops file in DropZone → StorageBrowser emits upload → FileManagerView.onFilesSelected (frontend/src/views/FileManagerView.vue)
  2. documentsStore.upload(file, autoClassify, folderId) (frontend/src/stores/documents.js)
  3. POST /api/documents/upload-url → creates pending Document row, returns presigned PUT URL + document_id (backend/api/documents.py)
  4. XHR PUT bytes directly from browser to MinIO presigned URL (no backend proxy, no auth header needed — URL is self-authenticating)
  5. POST /api/documents/{id}/confirm → stat_object() for authoritative size → atomic quota UPDATE … RETURNING → status set to 'ready' (backend/api/documents.py)
  6. If folderId != null: PATCH /api/documents/{id}/folder → places document in folder
  7. Celery task extract_and_classify.delay(document_id) enqueued → text extraction → AI classification → topic assignment (backend/tasks/document_tasks.py)
  8. authStore.fetchQuota() called on frontend to refresh sidebar quota bar
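
The atomic quota step in the confirm call (step 5) can be illustrated with a conditional UPDATE. The sketch below makes several assumptions: it uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, checks `rowcount` instead of using `RETURNING`, and adds a `user_id` filter and table layout that are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES ('u1', 900, 1000)")
conn.commit()


def reserve_quota(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    """Add delta to used_bytes only if the result stays within limit_bytes.

    The limit check lives in the WHERE clause, so the check and the
    increment happen in one statement with no read-then-write in Python.
    """
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means the quota would be exceeded


first = reserve_quota(conn, "u1", 100)   # fits exactly: 900 + 100 <= 1000
second = reserve_quota(conn, "u1", 1)    # would exceed the limit
used = conn.execute("SELECT used_bytes FROM quotas WHERE user_id = 'u1'").fetchone()[0]
```

A rejected reservation leaves `used_bytes` untouched, which is what lets the handler fail the confirm step without any compensating write.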

Authentication Flow

  1. POST /api/auth/login with {email, password} — per-account Redis rate limit checked first (backend/api/auth.py)
  2. Password verified with Argon2 (constant-time via pwdlib)
  3. If TOTP enabled and no code provided → returns {requires_totp: true} challenge
  4. If TOTP code provided → verified against pyotp + Redis replay prevention window
  5. On success: create_access_token() (HS256 JWT, 15-min TTL) + create_refresh_token() (SHA-256 hashed, stored in DB) (backend/services/auth.py)
  6. Access token returned in JSON body; refresh token set as httpOnly; Secure; SameSite=Strict cookie scoped to /api/auth/refresh path only
  7. Frontend stores access token in authStore.accessToken (Pinia ref() — memory only, never localStorage)
  8. On page reload: router beforeEach guard calls authStore.refresh() → POST /api/auth/refresh sends httpOnly cookie → new access token returned
  9. api/client.js intercepts any 401 → calls authStore.refresh() → retries request once (frontend/src/api/client.js)
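
To make the access-token step concrete, here is a minimal HS256 JWT sketch built only on the standard library. It is an illustration of the token format and the 15-minute TTL, not the project's `create_access_token()`: the claim names, the `verify_access_token` helper, and the `now` parameter (added for testability) are assumptions, and production code would normally use a maintained JWT library.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    return _b64url(hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest())


def create_access_token(user_id: str, secret: bytes, ttl_seconds: int = 15 * 60, now=None) -> str:
    issued = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "iat": issued, "exp": issued + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(secret, f'{header}.{payload}')}"


def verify_access_token(token: str, secret: bytes, now=None) -> dict:
    """Return the claims, or raise ValueError (the service-layer convention)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header, payload, signature = parts
    # The signature is always recomputed as HS256; the header's alg is never trusted.
    if not hmac.compare_digest(signature, _sign(secret, f"{header}.{payload}")):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if (time.time() if now is None else now) >= claims["exp"]:
        raise ValueError("token expired")
    return claims


token = create_access_token("user-1", b"demo-secret", now=1_000)
claims = verify_access_token(token, b"demo-secret", now=1_000 + 899)
```

Raising `ValueError` here matches the document's error-handling strategy, where the API layer maps service errors to HTTP status codes.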

Refresh Token Rotation + Family Revocation

  1. POST /api/auth/refresh reads httpOnly cookie, looks up RefreshToken row by SHA-256 hash
  2. If token already revoked → all user's refresh tokens revoked → 401 + security alert email enqueued via Celery
  3. If valid: old token marked revoked=True, new raw token generated and stored (hashed), rotated cookie set
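
The rotation and reuse-detection rules above can be sketched with an in-memory store. This is an assumption-laden illustration: the `RefreshTokenStore` class and its dictionary layout are hypothetical stand-ins for the `RefreshToken` table, and the cookie handling and security alert email are left out.

```python
import hashlib
import secrets


class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table (hypothetical layout)."""

    def __init__(self) -> None:
        self._rows: dict[str, dict] = {}  # sha256 hex digest -> row

    @staticmethod
    def _hash(raw_token: str) -> str:
        return hashlib.sha256(raw_token.encode("ascii")).hexdigest()

    def issue(self, user_id: str) -> str:
        raw = secrets.token_urlsafe(32)
        # Only the hash is stored; the raw token goes to the client once.
        self._rows[self._hash(raw)] = {"user_id": user_id, "revoked": False}
        return raw

    def rotate(self, raw_token: str) -> str:
        row = self._rows.get(self._hash(raw_token))
        if row is None:
            raise ValueError("unknown refresh token")
        if row["revoked"]:
            # Reuse of a rotated token: revoke every token the user holds.
            for other in self._rows.values():
                if other["user_id"] == row["user_id"]:
                    other["revoked"] = True
            raise ValueError("refresh token reuse detected")
        row["revoked"] = True
        return self.issue(row["user_id"])


store = RefreshTokenStore()
first = store.issue("user-1")
second = store.rotate(first)  # normal rotation returns a new raw token
```

Replaying `first` after this point raises and also invalidates `second`, which is the property that limits the damage from a stolen refresh cookie.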

Cloud Storage OAuth Flow

  1. GET /api/cloud/oauth/initiate/{provider} → state token stored in Redis (TTL 1800s, single-use) → authorization URL returned
  2. Browser navigates to OAuth provider → callback to GET /api/cloud/oauth/callback/{provider}
  3. State token validated (single-use consumed from Redis), authorization code exchanged for credentials
  4. Credentials encrypted with HKDF-derived per-user Fernet key → stored in cloud_connections.credentials_enc
  5. On document operations: get_storage_backend_for_document() decrypts credentials, instantiates cloud backend — transparent to API handlers (backend/storage/__init__.py)
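
The single-use state token from steps 1 and 3 can be sketched as a small store with expiry. The example below is hypothetical: a dictionary stands in for Redis, binding the state to a `user_id` is an assumption, and the injectable `clock` exists only to make expiry testable.

```python
import secrets
import time


class OAuthStateStore:
    """Dictionary-backed stand-in for Redis keys with a TTL (illustrative only)."""

    def __init__(self, ttl_seconds: int = 1800, clock=time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._states: dict[str, tuple[str, float]] = {}

    def create(self, user_id: str) -> str:
        state = secrets.token_urlsafe(32)
        self._states[state] = (user_id, self._clock() + self._ttl)
        return state

    def consume(self, state: str):
        """Return the bound user_id once, then never again; None if invalid or expired."""
        entry = self._states.pop(state, None)  # pop makes the token single-use
        if entry is None:
            return None
        user_id, expires_at = entry
        return user_id if self._clock() < expires_at else None


fake_now = [0.0]
store = OAuthStateStore(clock=lambda: fake_now[0])

state = store.create("user-1")
first_use = store.consume(state)    # valid callback
second_use = store.consume(state)   # replayed callback

stale = store.create("user-1")
fake_now[0] = 1800.0                # TTL elapsed
expired_use = store.consume(stale)
```

With a real Redis client, the read-and-delete would need to be a single atomic operation so that two concurrent callbacks cannot both consume the same state.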

State Management (frontend):

  • Access token: authStore.accessToken — Pinia ref(null), JS memory only, cleared on logout/error
  • User profile: authStore.user — Pinia ref(null)
  • Quota: authStore.quota — fetched after upload/delete, displayed in QuotaBar
  • Documents: documentsStore.documents — local array, kept in sync via explicit fetchDocuments() calls
  • Folder tree: foldersStore.rootFolders (sidebar) + foldersStore.folders (current level)
  • Upload progress: documentsStore.uploadProgress, keyed by ${filename}__${Date.now()} to prevent key collisions

Key Abstractions

StorageBackend ABC (backend/storage/base.py):

  • Purpose: Uniform interface over MinIO and all cloud providers
  • Methods: put_object, get_object, delete_object, presigned_get_url, health_check, generate_presigned_put_url, stat_object
  • Implementations: MinIOBackend, GoogleDriveBackend, OneDriveBackend, NextcloudBackend, WebDAVBackend
  • Selected by: get_storage_backend_for_document() in backend/storage/__init__.py
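
A reduced sketch of this abstraction is shown below. Only four of the listed methods appear, the signatures are assumptions (the source names the methods but not their parameters), and `InMemoryBackend` is a hypothetical implementation used purely to exercise the interface.

```python
import asyncio
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Subset of the interface; parameter and return types are assumed."""

    @abstractmethod
    async def put_object(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    async def get_object(self, key: str) -> bytes: ...

    @abstractmethod
    async def delete_object(self, key: str) -> None: ...

    @abstractmethod
    async def stat_object(self, key: str) -> int:
        """Return the stored size in bytes."""


class InMemoryBackend(StorageBackend):
    """Hypothetical backend that keeps objects in a dict."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def put_object(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    async def get_object(self, key: str) -> bytes:
        if key not in self._objects:
            raise FileNotFoundError(key)
        return self._objects[key]

    async def delete_object(self, key: str) -> None:
        self._objects.pop(key, None)

    async def stat_object(self, key: str) -> int:
        return len(await self.get_object(key))


async def _demo() -> tuple[int, bytes]:
    backend: StorageBackend = InMemoryBackend()
    await backend.put_object("u1/d1/k.pdf", b"hello")
    size = await backend.stat_object("u1/d1/k.pdf")
    data = await backend.get_object("u1/d1/k.pdf")
    await backend.delete_object("u1/d1/k.pdf")
    return size, data


size, data = asyncio.run(_demo())
```

Callers that depend only on `StorageBackend` never need to know which provider holds the bytes, which is what keeps the API handlers backend-agnostic.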

AIProvider ABC (backend/ai/base.py):

  • Purpose: Pluggable classification backend
  • Methods: classify, suggest_topics, health_check
  • Returns: ClassificationResult(topics, suggested_new_topics, reasoning)
  • Implementations: OllamaProvider, OpenAIProvider, AnthropicProvider, LMStudioProvider
  • Selected by: ai/__init__.py factory, keyed to per-user ai_provider/ai_model from DB
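
The provider interface and slug-keyed factory might look roughly like this. The method signatures, the `KeywordProvider` class, the `"keyword"` slug, and the `get_provider` name are all assumptions for illustration; only the method names and the three `ClassificationResult` fields come from the description above.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    topics: list[str] = field(default_factory=list)
    suggested_new_topics: list[str] = field(default_factory=list)
    reasoning: str = ""


class AIProvider(ABC):
    @abstractmethod
    async def classify(self, text: str, topics: list[str]) -> ClassificationResult: ...

    @abstractmethod
    async def suggest_topics(self, text: str) -> list[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class KeywordProvider(AIProvider):
    """Hypothetical offline provider: assigns a topic when its name occurs in the text."""

    async def classify(self, text: str, topics: list[str]) -> ClassificationResult:
        lowered = text.lower()
        matched = [topic for topic in topics if topic.lower() in lowered]
        return ClassificationResult(topics=matched, reasoning="keyword match")

    async def suggest_topics(self, text: str) -> list[str]:
        return []

    async def health_check(self) -> bool:
        return True


_PROVIDERS: dict[str, type[AIProvider]] = {"keyword": KeywordProvider}


def get_provider(slug: str) -> AIProvider:
    """Factory keyed by string slug; unknown slugs raise ValueError."""
    try:
        return _PROVIDERS[slug]()
    except KeyError:
        raise ValueError(f"unknown AI provider: {slug}") from None


result = asyncio.run(
    get_provider("keyword").classify("Invoice for tax year 2025", ["Tax", "Travel"])
)
```

Adding a provider then amounts to one new subclass and one registry entry, with no change to the classifier service.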

Dependency Chain:

  • get_current_user → parses Bearer JWT → loads User from DB, checks is_active
  • get_current_admin → wraps get_current_user + role == 'admin' check (raises 403)
  • get_regular_user → wraps get_current_user + rejects role == 'admin' (admins get 403 on document endpoints)

Entry Points

Backend:

  • Location: backend/main.py
  • Triggers: uvicorn main:app
  • Responsibilities: FastAPI app factory, lifespan (MinIO bucket init, Redis connection, admin bootstrap), middleware registration in correct order, router inclusion

Celery Worker:

  • Location: backend/celery_app.py (factory) + backend/tasks/
  • Triggers: celery -A celery_app worker -Q documents
  • Responsibilities: Async document text extraction + classification, email delivery, scheduled nightly audit CSV export

Frontend:

  • Location: frontend/src/main.js
  • Triggers: Vite dev server (npm run dev) or built static files served by frontend container
  • Responsibilities: Mount Vue app with Pinia and Router

Architectural Constraints

  • Threading: FastAPI runs on a single-threaded asyncio event loop (uvicorn). Blocking MinIO SDK calls use asyncio.to_thread(). Celery workers are separate sync processes that bridge to async via asyncio.run() — they never share an event loop with FastAPI.
  • Global state: backend/services/storage.py holds a module-level _storage singleton for the default MinIO backend. backend/main.py stores MinIO client on app.state.minio and Redis client on app.state.redis.
  • Circular imports: Celery task modules must never import from main.py or router modules. backend/celery_app.py intentionally avoids importing config — reads REDIS_URL directly from os.environ to avoid pydantic-settings side effects.
  • Admin isolation: Admin accounts cannot access document content — enforced by get_regular_user dep on all document/folder/share endpoints. No impersonation code path exists (backend/deps/auth.py).
  • Quota atomicity: Quota enforcement uses a single atomic UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes — no read-then-write in Python.
  • Object key privacy: MinIO keys are {user_id}/{document_id}/{uuid4()}{ext} — original filenames stored only in the DB filename column, never in the storage key.
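
The object key format in the last bullet can be sketched as a small helper. The function name `build_object_key` and the lower-casing of the extension are assumptions; the key layout itself follows the format stated above.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: uuid.UUID, document_id: uuid.UUID, filename: str) -> str:
    """Build {user_id}/{document_id}/{uuid4}{ext}; only the extension of the filename is used."""
    ext = PurePosixPath(filename).suffix.lower()  # lower-casing is an assumption
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


user_id = uuid.uuid4()
document_id = uuid.uuid4()
key = build_object_key(user_id, document_id, "Tax Report 2025.PDF")
```

Since the random segment is generated per call, two uploads of the same file produce different keys, and nothing in the key reveals the original name.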

Anti-Patterns

Accessing document content via unauthenticated iframe src

What happens: Setting <iframe src="/api/documents/{id}/content"> directly makes the browser request the content without the Bearer token, so the request never authenticates.
Why it's wrong: The document content endpoint requires an Authorization: Bearer header, and browser src= attributes cannot send custom headers.
Do this instead: Use fetchDocumentContent(docId) in frontend/src/api/client.js. It injects the Bearer token, handles 401-refresh-retry, and then builds an object URL from the Blob response.

Committing inside write_audit_log

What happens: Calling session.commit() inside write_audit_log creates a separate transaction for the audit entry.
Why it's wrong: The audit entry would commit even if the primary operation subsequently fails, creating phantom audit records.
Do this instead: write_audit_log calls session.flush() only. The caller owns session.commit() (backend/services/audit.py).
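
The effect of leaving the commit to the caller can be demonstrated with a small transaction example. This sketch uses the standard-library `sqlite3` module instead of SQLAlchemy, so an uncommitted INSERT plays the role of `session.flush()`; the table layout and the `create_document` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE audit_log (action TEXT, target TEXT)")
conn.commit()


def write_audit_log(conn: sqlite3.Connection, action: str, target: str) -> None:
    """Write the audit row inside the caller's open transaction; never commit here."""
    conn.execute("INSERT INTO audit_log VALUES (?, ?)", (action, target))


def create_document(conn: sqlite3.Connection, doc_id: str) -> bool:
    try:
        write_audit_log(conn, "document.create", doc_id)
        conn.execute("INSERT INTO documents VALUES (?)", (doc_id,))
        conn.commit()  # the caller owns the commit: both rows land together
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # the pending audit row is discarded with the failed insert
        return False


created = create_document(conn, "d1")
duplicate = create_document(conn, "d1")  # primary key conflict, whole unit rolls back
audit_rows = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
```

Only the successful operation leaves an audit row behind; the failed attempt leaves none, which is the behavior the anti-pattern would break.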

CloudConnection query without user scope

What happens: Querying CloudConnection without filtering user_id == current_user.id would allow one user's cloud credentials to service another user's request.
Why it's wrong: It is an insecure direct object reference (IDOR) that exposes credentials across users.
Do this instead: Always filter CloudConnection.user_id == user.id, as enforced in get_storage_backend_for_document() in backend/storage/__init__.py.

Error Handling

Strategy: Services raise ValueError; API handlers catch and re-raise as HTTPException. No service module imports FastAPI.

Patterns:

  • Auth service raises ValueError → API layer maps to 401/422/400
  • Storage errors (S3Error, cloud provider errors) wrapped in backend/storage/exceptions.py → 503 or 404
  • write_audit_log never raises: it logs and swallows its own exceptions so that an audit failure cannot abort the primary operation
  • CloudConnectionError (backend/storage/exceptions.py) used for cloud-specific failures
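
The service-to-handler mapping can be sketched without importing FastAPI. Everything named here is hypothetical: `HTTPError` stands in for FastAPI's HTTPException, and `validate_new_password` with its 12-character rule is an invented service function used only to show the pattern.

```python
class HTTPError(Exception):
    """Stand-in for an HTTP exception carrying a status code (hypothetical)."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def validate_new_password(password: str) -> str:
    """Service-layer function: plain Python, raises ValueError, no web framework import."""
    if len(password) < 12:
        raise ValueError("password must be at least 12 characters")
    return password


def change_password_handler(payload: dict) -> dict:
    """API-layer handler: catches the service error and maps it to an HTTP status."""
    try:
        validate_new_password(payload["new_password"])
    except ValueError as exc:
        raise HTTPError(400, str(exc)) from exc
    return {"status": "ok"}


ok_response = change_password_handler({"new_password": "correct horse battery"})

try:
    change_password_handler({"new_password": "short"})
    error = None
except HTTPError as exc:
    error = exc
```

Keeping the translation in the handler is what allows the same service function to be reused from a Celery task, where an HTTP exception would be meaningless.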

Cross-Cutting Concerns

Logging: Python logging module with logger = logging.getLogger(__name__) in each module. No structured logging framework.

Validation: Pydantic models at API boundary. Field validators on sensitive fields (filename rejects path separators, permission allowlists, non-negative quota). No model accepts **kwargs.

Authentication: Every non-public endpoint injects get_current_user, get_current_admin, or get_regular_user via FastAPI Depends. No endpoint bypasses the dependency chain.

Rate Limiting: slowapi (built on the limits library) on all auth endpoints. Per-IP limits via @limiter.limit("10/minute"). Per-account Redis counter on login: login_attempts:{email}, 10 attempts per 15-minute window.
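
The per-account login counter can be approximated with a fixed-window counter. The sketch below is an assumption about the mechanics: a dictionary stands in for Redis, the fixed-window behavior (counter created on the first attempt and expiring 15 minutes later) is inferred from the description, and the `LoginAttemptLimiter` class and its `clock` parameter are hypothetical.

```python
import time


class LoginAttemptLimiter:
    """Fixed-window counter; a dict stands in for a Redis key with an expiry."""

    def __init__(self, max_attempts: int = 10, window_seconds: int = 15 * 60,
                 clock=time.monotonic) -> None:
        self._max = max_attempts
        self._window = window_seconds
        self._clock = clock
        self._counters: dict[str, tuple[int, float]] = {}  # key -> (count, expires_at)

    def allow(self, email: str) -> bool:
        key = f"login_attempts:{email}"
        now = self._clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # Window elapsed (or first attempt): start a new window.
            count, expires_at = 0, now + self._window
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self._max


fake_now = [0.0]
limiter = LoginAttemptLimiter(clock=lambda: fake_now[0])

first_ten = [limiter.allow("a@example.com") for _ in range(10)]
eleventh = limiter.allow("a@example.com")
other_account = limiter.allow("b@example.com")

fake_now[0] = 900.0  # the 15-minute window has elapsed
after_window = limiter.allow("a@example.com")
```

Counting per account complements the per-IP limit: an attacker rotating IP addresses still hits the same `login_attempts:{email}` counter.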

Audit Logging: write_audit_log() called inline in API handlers for all auth events, document operations, admin actions, and cloud connections. Written within the handler's transaction via session.flush().

HKDF Credential Encryption: Cloud credentials encrypted with Fernet(HKDF-SHA256(master_key, salt=user_id, purpose="cloud-creds")) before DB storage. Implementation in backend/storage/cloud_utils.py.
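
The key derivation can be sketched with the standard library alone. This is an illustration under assumptions, not the code in cloud_utils.py: HKDF is implemented by hand following RFC 5869, the `purpose` value is passed as the HKDF `info` parameter, the user ID is encoded as UTF-8 for the salt, and the Fernet step is reduced to producing a key in the format Fernet expects (32 bytes, URL-safe base64).

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256 as described in RFC 5869: extract, then expand."""
    if length > 255 * 32:
        raise ValueError("requested length too large for HKDF-SHA256")
    # Extract: concentrate the input key material into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * 32, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output has been produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_fernet_key(master_key: bytes, user_id: str) -> bytes:
    """Derive a per-user key in Fernet's key format (URL-safe base64 of 32 bytes)."""
    raw = hkdf_sha256(master_key, salt=user_id.encode("utf-8"), info=b"cloud-creds")
    return base64.urlsafe_b64encode(raw)


master = b"\x01" * 32  # demo value only; a real master key comes from configuration
key_a = derive_fernet_key(master, "user-a")
key_b = derive_fernet_key(master, "user-b")
```

Because the user ID feeds the derivation, a row copied from one user's cloud_connections record cannot be decrypted with another user's derived key.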


Architecture analysis: 2026-06-02