<!-- refreshed: 2026-06-02 -->
# Architecture

**Analysis Date:** 2026-06-02

## System Overview

```text
┌──────────────────────────────────────────────────────────────────────────┐
│                           Browser (Vue 3 SPA)                            │
│  Pinia stores: auth · documents · folders · topics · cloudConnections    │
│  Router: /  /folders/:id  /document/:id  /cloud  /admin  /shared         │
└─────────────────────┬──────────────────────────────────┬─────────────────┘
                      │ fetch() + Bearer JWT             │ PUT (presigned)
                      ▼                                  ▼
┌──────────────────────────────────┐     ┌───────────────────────────────┐
│  FastAPI Backend  :8000          │     │  MinIO  :9000                 │
│  api/auth      api/documents     │     │  Bucket: docuvault            │
│  api/folders   api/shares        │     │  Keys: {uid}/{did}/{uuid}{e}  │
│  api/cloud     api/admin         │     └───────────────────────────────┘
│  api/audit     api/topics        │
│                                  │     ┌───────────────────────────────┐
│  Middleware stack (per request): │     │  Cloud Backends               │
│    OriginValidation (first)      │     │  Google Drive / OneDrive      │
│    CORS                          │     │  Nextcloud / WebDAV           │
│    SecurityHeaders (CSP, etc.)   │     └───────────────────────────────┘
│    SlowAPI rate limiter          │
│                                  │     ┌───────────────────────────────┐
│  Deps layer:                     │     │  Celery Worker                │
│    get_db (AsyncSession)         │◄────►  tasks/document_tasks.py      │
│    get_current_user (JWT)        │     │  tasks/email_tasks.py         │
│    get_current_admin             │     │  tasks/audit_tasks.py         │
│    get_regular_user              │     └───────────────────────────────┘
└────────────┬─────────────────────┘
             │ SQLAlchemy async          ┌───────────────────────────────┐
             ▼                           │  Redis  :6379                 │
┌────────────────────────────┐           │  Rate limiting (slowapi)      │
│  PostgreSQL  :5432         │           │  TOTP replay cache            │
│  11 tables:                │◄──────────►  Celery broker + results      │
│    users · quotas          │           │  OAuth state tokens (TTL)     │
│    refresh_tokens          │           └───────────────────────────────┘
│    backup_codes · folders  │
│    documents · topics      │           ┌───────────────────────────────┐
│    document_topics         │           │  AI Providers (pluggable)     │
│    shares · audit_log      │           │  Ollama · OpenAI · Anthropic  │
│    cloud_connections       │           │  LMStudio                     │
│    groups (v2 stub)        │           │  ai/base.py → AIProvider ABC  │
└────────────────────────────┘           └───────────────────────────────┘
```

## Component Responsibilities

| Component | Responsibility | Key File |
|-----------|----------------|----------|
| FastAPI app | ASGI entry point, middleware, router registration | `backend/main.py` |
| Auth API | Register, login (TOTP/backup), refresh, logout, password reset | `backend/api/auth.py` |
| Documents API | Upload URL, confirm, list, delete, classify, stream content | `backend/api/documents.py` |
| Folders API | CRUD folders, move documents between folders | `backend/api/folders.py` |
| Shares API | Grant/revoke/list document shares between users | `backend/api/shares.py` |
| Cloud API | OAuth flows, WebDAV connect, folder listing, default storage | `backend/api/cloud.py` |
| Admin API | User CRUD, quota, AI config, audit log, delete user | `backend/api/admin.py` |
| Audit API | Paginated audit log viewer + CSV export | `backend/api/audit.py` |
| Topics API | CRUD topics, topic suggestions | `backend/api/topics.py` |
| Auth service | Password hashing, JWT, refresh token family, TOTP, HIBP | `backend/services/auth.py` |
| Audit service | `write_audit_log()`, flushed within the caller's transaction | `backend/services/audit.py` |
| Classifier service | Selects AI provider, assigns topics, auto-creates suggestions | `backend/services/classifier.py` |
| Extractor service | PDF/DOCX/image/text extraction | `backend/services/extractor.py` |
| Storage service | ORM queries for documents + topic resolution | `backend/services/storage.py` |
| StorageBackend ABC | Interface for all object storage backends | `backend/storage/base.py` |
| Storage factory | Returns MinIOBackend or cloud backend from document record | `backend/storage/__init__.py` |
| MinIO backend | Presigned URL, put/get/delete, stat | `backend/storage/minio_backend.py` |
| Cloud backends | Google Drive, OneDrive, Nextcloud, WebDAV implementations | `backend/storage/*_backend.py` |
| AIProvider ABC | Interface: classify, suggest_topics, health_check | `backend/ai/base.py` |
| AI factory | Returns provider instance from string slug | `backend/ai/__init__.py` |
| Celery app | Task routing, beat schedule, JSON serialization | `backend/celery_app.py` |
| Document task | `extract_and_classify`: async bridge from the sync Celery worker | `backend/tasks/document_tasks.py` |
| ORM models | 11-table schema, all UUID PKs, full index set | `backend/db/models.py` |
| DB session | Async engine, session factory (expire_on_commit=False) | `backend/db/session.py` |
| FastAPI deps | get_db, get_current_user, get_current_admin, get_regular_user | `backend/deps/` |
| Auth store | accessToken (memory only), user, quota, refresh deduplication | `frontend/src/stores/auth.js` |
| Documents store | CRUD, 3-step MinIO upload with progress, search debounce | `frontend/src/stores/documents.js` |
| Folders store | CRUD folders, breadcrumb, rootFolders for sidebar | `frontend/src/stores/folders.js` |
| Topics store | CRUD topics | `frontend/src/stores/topics.js` |
| CloudConnections store | List/disconnect cloud connections | `frontend/src/stores/cloudConnections.js` |
| API client | fetch wrapper, Bearer injection, 401→refresh→retry | `frontend/src/api/client.js` |
| Vue Router | SPA routes, beforeEach guard (silent refresh on reload) | `frontend/src/router/index.js` |
| FileManagerView | Unified file manager for local folders and documents | `frontend/src/views/FileManagerView.vue` |
| StorageBrowser | Reusable file listing component (local + cloud modes) | `frontend/src/components/storage/StorageBrowser.vue` |

## Pattern Overview

**Overall:** Layered REST API + SPA with async background processing

**Key Characteristics:**
- The API layer is thin: validation via Pydantic, business logic in `services/`
- No ORM relationships are loaded; all queries are explicit (prevents N+1)
- Async everywhere in FastAPI; Celery workers bridge to async via `asyncio.run()`
- Frontend Pinia stores own data fetching; views delegate to stores; components emit events upward
- One DB session per request (yielded by the `get_db` dep) and one per Celery task invocation
- All resource ownership is checked inline in handlers (`resource.user_id == current_user.id`)
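
The sync-to-async bridge named in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's task code: `_extract_and_classify_async` and its return value are hypothetical, and the real task would additionally be registered with Celery and open its own DB session inside the coroutine.

```python
import asyncio


async def _extract_and_classify_async(document_id: str) -> dict:
    """Hypothetical async pipeline: extract text, then classify it."""
    await asyncio.sleep(0)  # stands in for awaited DB, storage, and AI calls
    return {"document_id": document_id, "status": "classified"}


def extract_and_classify(document_id: str) -> dict:
    """Sync entry point, shaped the way a Celery worker would call it.

    asyncio.run() creates a fresh event loop, runs the coroutine to
    completion, and closes the loop, so nothing is shared with the
    FastAPI process's loop.
    """
    return asyncio.run(_extract_and_classify_async(document_id))
```

Because each call owns its loop, any async resource (session, HTTP client) should be created inside the coroutine rather than at module import time.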

## Layers

**API Layer:**
- Purpose: HTTP routing, request validation, response serialization
- Location: `backend/api/`
- Contains: APIRouter instances, Pydantic request/response models, FastAPI dep injection
- Depends on: `services/`, `deps/`, `db/models.py`
- Used by: Frontend via HTTP; not called from other backend modules

**Service Layer:**
- Purpose: Business logic with no FastAPI coupling (pure Python async functions)
- Location: `backend/services/`
- Contains: `auth.py`, `audit.py`, `classifier.py`, `extractor.py`, `storage.py`, `cloud_cache.py`, `email.py`
- Depends on: `db/models.py`, `storage/`, `ai/`, `config`
- Used by: `api/` layer and Celery tasks

**Storage Abstraction Layer:**
- Purpose: Backend-agnostic object storage interface
- Location: `backend/storage/`
- Contains: `base.py` (ABC), `minio_backend.py`, `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`, `cloud_utils.py` (HKDF encryption), `exceptions.py`
- Depends on: `config`, `db/models.py` (for cloud credential lookup)
- Used by: `services/storage.py`, `api/documents.py`, Celery tasks

**AI Abstraction Layer:**
- Purpose: Pluggable AI provider interface for document classification
- Location: `backend/ai/`
- Contains: `base.py` (ABC), `ollama_provider.py`, `openai_provider.py`, `anthropic_provider.py`, `lmstudio_provider.py`, `utils.py`
- Depends on: External AI APIs via httpx
- Used by: `services/classifier.py`

**Dependency Layer:**
- Purpose: FastAPI reusable dependencies (DI)
- Location: `backend/deps/`
- Contains: `db.py` (get_db), `auth.py` (get_current_user, get_current_admin, get_regular_user), `utils.py` (get_client_ip)
- Used by: All `api/` handlers

**Frontend Store Layer:**
- Purpose: Application state + async API calls
- Location: `frontend/src/stores/`
- Contains: `auth.js`, `documents.js`, `folders.js`, `topics.js`, `cloudConnections.js`
- Depends on: `api/client.js`
- Used by: Views and components

## Data Flow

### Document Upload (MinIO presigned URL path)

1. User drops file in `DropZone` → `StorageBrowser` emits `upload` → `FileManagerView.onFilesSelected` (`frontend/src/views/FileManagerView.vue`)
2. `documentsStore.upload(file, autoClassify, folderId)` (`frontend/src/stores/documents.js`)
3. `POST /api/documents/upload-url` → creates pending `Document` row, returns presigned PUT URL + `document_id` (`backend/api/documents.py`)
4. XHR `PUT` sends the bytes directly from the browser to the MinIO presigned URL (no backend proxy and no auth header needed, because the URL is self-authenticating)
5. `POST /api/documents/{id}/confirm` → `stat_object()` for authoritative size → atomic quota `UPDATE … RETURNING` → status set to `'ready'` (`backend/api/documents.py`)
6. If `folderId != null`: `PATCH /api/documents/{id}/folder` → places document in folder
7. Celery task `extract_and_classify.delay(document_id)` enqueued → text extraction → AI classification → topic assignment (`backend/tasks/document_tasks.py`)
8. `authStore.fetchQuota()` called on frontend to refresh sidebar quota bar
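
The atomic quota update in step 5 can be illustrated with a self-contained sketch. It uses SQLite from the standard library rather than PostgreSQL, and checks the affected row count instead of `RETURNING` so it runs on any SQLite version; the `reserve_quota` helper and the `user_id` column are assumptions made for the example.

```python
import sqlite3


def reserve_quota(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    """Atomically add `delta` bytes; return False if the limit would be exceeded.

    The limit check and the increment are one UPDATE statement, so two
    concurrent confirms cannot both pass on a stale read of used_bytes.
    """
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means the quota was refused


# Demo schema: a hypothetical single-user quota of 100 bytes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas ("
    "user_id TEXT PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES ('u1', 0, 100)")
conn.commit()
```

In this sketch a refused reservation leaves `used_bytes` untouched, which is the property the confirm step relies on before marking a document `'ready'`.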

### Authentication Flow

1. `POST /api/auth/login` with `{email, password}`; the per-account Redis rate limit is checked first (`backend/api/auth.py`)
2. Password verified with Argon2 (constant-time via pwdlib)
3. If TOTP enabled and no code provided → returns `{requires_totp: true}` challenge
4. If TOTP code provided → verified against pyotp + Redis replay prevention window
5. On success: `create_access_token()` (HS256 JWT, 15-min TTL) + `create_refresh_token()` (SHA-256 hashed, stored in DB) (`backend/services/auth.py`)
6. Access token returned in JSON body; refresh token set as `httpOnly; Secure; SameSite=Strict` cookie scoped to `/api/auth/refresh` path only
7. Frontend stores access token in `authStore.accessToken` (Pinia `ref()`; memory only, never localStorage)
8. On page reload: router `beforeEach` guard calls `authStore.refresh()` → `POST /api/auth/refresh` sends httpOnly cookie → new access token returned
9. `api/client.js` intercepts any 401 → calls `authStore.refresh()` → retries request once (`frontend/src/api/client.js`)
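
To make step 5 concrete, here is a stdlib-only sketch of what an HS256 access token with a 15-minute TTL involves. The function signatures, the `sub`/`exp` claim layout, and the `now` parameter are assumptions for illustration; the source does not say which JWT library `backend/services/auth.py` uses, and production code should rely on a maintained library rather than hand-rolled signing.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

ACCESS_TTL_SECONDS = 15 * 60  # 15-minute TTL, as stated in step 5


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, header: str, payload: str) -> bytes:
    return hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()


def create_access_token(user_id: str, secret: bytes, now: Optional[float] = None) -> str:
    issued = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": user_id, "exp": issued + ACCESS_TTL_SECONDS}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_b64url(_sign(secret, header, payload))}"


def decode_access_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    header, payload, signature = token.split(".")  # ValueError if malformed
    # Constant-time comparison of the expected and presented signatures.
    if not hmac.compare_digest(_sign(secret, header, payload), _b64url_decode(signature)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload))
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims
```

Raising `ValueError` here mirrors the document's convention that services raise `ValueError` and the API layer maps it to an HTTP status.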

### Refresh Token Rotation + Family Revocation

1. `POST /api/auth/refresh` reads httpOnly cookie, looks up `RefreshToken` row by SHA-256 hash
2. If token already revoked → all user's refresh tokens revoked → 401 + security alert email enqueued via Celery
3. If valid: old token marked `revoked=True`, new raw token generated and stored (hashed), rotated cookie set
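
The rotation and reuse-detection rules above can be sketched with an in-memory stand-in for the `refresh_tokens` table. The class and method names are hypothetical, and the email alert from step 2 is omitted; only the hash lookup, rotation, and family revocation are shown.

```python
import hashlib
import secrets
from dataclasses import dataclass


@dataclass
class RefreshTokenRow:
    user_id: str
    revoked: bool = False


class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table, keyed by SHA-256 hash."""

    def __init__(self) -> None:
        self.rows: dict[str, RefreshTokenRow] = {}

    @staticmethod
    def _hash(raw: str) -> str:
        return hashlib.sha256(raw.encode()).hexdigest()

    def issue(self, user_id: str) -> str:
        raw = secrets.token_urlsafe(32)
        self.rows[self._hash(raw)] = RefreshTokenRow(user_id)
        return raw  # only the raw value leaves the server (in the cookie)

    def rotate(self, raw: str) -> str:
        row = self.rows.get(self._hash(raw))
        if row is None:
            raise ValueError("unknown refresh token")
        if row.revoked:
            # Reuse of an already-rotated token: revoke the user's whole family.
            for other in self.rows.values():
                if other.user_id == row.user_id:
                    other.revoked = True
            raise ValueError("refresh token reuse detected")
        row.revoked = True
        return self.issue(row.user_id)
```

Storing only the hash means a leaked table does not yield usable cookies, and replaying a rotated token invalidates the attacker's and the victim's sessions alike.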

### Cloud Storage OAuth Flow

1. `GET /api/cloud/oauth/initiate/{provider}` → state token stored in Redis (TTL 1800s, single-use) → authorization URL returned
2. Browser navigates to OAuth provider → callback to `GET /api/cloud/oauth/callback/{provider}`
3. State token validated (single-use consumed from Redis), authorization code exchanged for credentials
4. Credentials encrypted with HKDF-derived per-user Fernet key → stored in `cloud_connections.credentials_enc`
5. On document operations: `get_storage_backend_for_document()` decrypts credentials and instantiates the cloud backend, transparently to API handlers (`backend/storage/__init__.py`)
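
The single-use, TTL-bound state token from steps 1 and 3 can be sketched with an in-memory store standing in for Redis. The `StateStore` class and its injectable clock are assumptions for the example; against Redis, an atomic get-and-delete would play the role that `dict.pop` plays here.

```python
import secrets
import time
from typing import Callable, Optional

STATE_TTL_SECONDS = 1800  # TTL from step 1


class StateStore:
    """In-memory stand-in for Redis keys that expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def create(self, user_id: str) -> str:
        state = secrets.token_urlsafe(32)
        self._entries[state] = (user_id, self._clock() + STATE_TTL_SECONDS)
        return state

    def consume(self, state: str) -> Optional[str]:
        """Return the user id once; None if unknown, expired, or already used."""
        entry = self._entries.pop(state, None)  # pop makes the token single-use
        if entry is None:
            return None
        user_id, expires_at = entry
        return user_id if self._clock() < expires_at else None
```

Binding the token to the initiating user lets the callback reject a state value that was issued to a different session.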

**State Management (frontend):**
- Access token: `authStore.accessToken`; Pinia `ref(null)`, JS memory only, cleared on logout/error
- User profile: `authStore.user`; Pinia `ref(null)`
- Quota: `authStore.quota`; fetched after upload/delete, displayed in `QuotaBar`
- Documents: `documentsStore.documents`; local array, kept in sync via explicit `fetchDocuments()` calls
- Folder tree: `foldersStore.rootFolders` (sidebar) + `foldersStore.folders` (current level)
- Upload progress: `documentsStore.uploadProgress`; keyed `${filename}__${Date.now()}` to prevent key collision

## Key Abstractions

**StorageBackend ABC (`backend/storage/base.py`):**
- Purpose: Uniform interface over MinIO and all cloud providers
- Methods: `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check`, `generate_presigned_put_url`, `stat_object`
- Implementations: `MinIOBackend`, `GoogleDriveBackend`, `OneDriveBackend`, `NextcloudBackend`, `WebDAVBackend`
- Selected by: `get_storage_backend_for_document()` in `backend/storage/__init__.py`
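
A reduced sketch of the ABC pattern described above, covering four of the seven listed methods. The method signatures, the async style, and the `InMemoryBackend` test double are assumptions; the actual definitions in `backend/storage/base.py` may differ.

```python
import asyncio
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Subset of the interface; every signature here is an assumption."""

    @abstractmethod
    async def put_object(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    async def get_object(self, key: str) -> bytes: ...

    @abstractmethod
    async def delete_object(self, key: str) -> None: ...

    @abstractmethod
    async def stat_object(self, key: str) -> int:
        """Return the stored size in bytes."""


class InMemoryBackend(StorageBackend):
    """Hypothetical test double; not one of the listed implementations."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def put_object(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    async def get_object(self, key: str) -> bytes:
        return self._objects[key]

    async def delete_object(self, key: str) -> None:
        self._objects.pop(key, None)

    async def stat_object(self, key: str) -> int:
        return len(self._objects[key])
```

Callers written against `StorageBackend` do not need to know which implementation they were handed, which is what lets the factory swap MinIO for a cloud backend per document.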

**AIProvider ABC (`backend/ai/base.py`):**
- Purpose: Pluggable classification backend
- Methods: `classify`, `suggest_topics`, `health_check`
- Returns: `ClassificationResult(topics, suggested_new_topics, reasoning)`
- Implementations: `OllamaProvider`, `OpenAIProvider`, `AnthropicProvider`, `LMStudioProvider`
- Selected by: `ai/__init__.py` factory, keyed to per-user `ai_provider`/`ai_model` from DB
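
The provider interface and slug-keyed factory can be sketched as below. Only `classify` is modeled, its signature is assumed, and `KeywordProvider` is a hypothetical offline provider invented for the example so that the block runs without any AI service; the field names of `ClassificationResult` follow the list above.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    topics: list[str] = field(default_factory=list)
    suggested_new_topics: list[str] = field(default_factory=list)
    reasoning: str = ""


class AIProvider(ABC):
    @abstractmethod
    def classify(self, text: str, topics: list[str]) -> ClassificationResult: ...


class KeywordProvider(AIProvider):
    """Hypothetical provider: a topic matches if its name appears in the text."""

    def classify(self, text: str, topics: list[str]) -> ClassificationResult:
        lowered = text.lower()
        matched = [t for t in topics if t.lower() in lowered]
        return ClassificationResult(topics=matched, reasoning="keyword match")


# Registry keyed by slug; the real factory would map the four listed providers.
_PROVIDERS: dict[str, type[AIProvider]] = {"keyword": KeywordProvider}


def get_provider(slug: str) -> AIProvider:
    try:
        return _PROVIDERS[slug]()
    except KeyError:
        raise ValueError(f"unknown AI provider: {slug!r}") from None
```

A registry dictionary keeps provider selection data-driven, so adding a provider means adding one entry rather than another `if` branch.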

**Dependency Chain:**
- `get_current_user` → parses Bearer JWT → loads `User` from DB, checks `is_active`
- `get_current_admin` → wraps `get_current_user` + `role == 'admin'` check (raises 403)
- `get_regular_user` → wraps `get_current_user` + rejects `role == 'admin'` (admins get 403 on document endpoints)
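
The role checks in this chain can be sketched without FastAPI. `HTTPError` is a stand-in for FastAPI's `HTTPException`, the token lookup replaces JWT parsing and the DB query, and the 401 for missing or inactive users is an assumption; the two 403 cases come from the list above.

```python
from dataclasses import dataclass


class HTTPError(Exception):
    """Stand-in for FastAPI's HTTPException."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code


@dataclass
class User:
    id: str
    role: str
    is_active: bool = True


def get_current_user(token: str, users_by_token: dict[str, User]) -> User:
    # The dict lookup stands in for "parse Bearer JWT, load User from DB".
    user = users_by_token.get(token)
    if user is None or not user.is_active:
        raise HTTPError(401, "invalid credentials")
    return user


def get_current_admin(user: User) -> User:
    if user.role != "admin":
        raise HTTPError(403, "admin only")
    return user


def get_regular_user(user: User) -> User:
    if user.role == "admin":
        raise HTTPError(403, "admins cannot access document endpoints")
    return user
```

In FastAPI the two wrappers would declare `get_current_user` as their own dependency, so every handler gets authentication and the role check from a single `Depends`.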

## Entry Points

**Backend:**
- Location: `backend/main.py`
- Triggers: `uvicorn main:app`
- Responsibilities: FastAPI app factory, lifespan (MinIO bucket init, Redis connection, admin bootstrap), middleware registration in correct order, router inclusion

**Celery Worker:**
- Location: `backend/celery_app.py` (factory) + `backend/tasks/`
- Triggers: `celery -A celery_app worker -Q documents`
- Responsibilities: Async document text extraction + classification, email delivery, scheduled nightly audit CSV export

**Frontend:**
- Location: `frontend/src/main.js`
- Triggers: Vite dev server (`npm run dev`) or built static files served by frontend container
- Responsibilities: Mount Vue app with Pinia and Router

## Architectural Constraints

- **Threading:** FastAPI runs on a single-threaded asyncio event loop (uvicorn). Blocking MinIO SDK calls use `asyncio.to_thread()`. Celery workers are separate sync processes that bridge to async via `asyncio.run()`, so they never share an event loop with FastAPI.
- **Global state:** `backend/services/storage.py` holds a module-level `_storage` singleton for the default MinIO backend. `backend/main.py` stores the MinIO client on `app.state.minio` and the Redis client on `app.state.redis`.
- **Circular imports:** Celery task modules must never import from `main.py` or router modules. `backend/celery_app.py` intentionally avoids importing `config` and reads `REDIS_URL` directly from `os.environ` to avoid pydantic-settings side effects.
- **Admin isolation:** Admin accounts cannot access document content; this is enforced by the `get_regular_user` dep on all document/folder/share endpoints. No impersonation code path exists (`backend/deps/auth.py`).
- **Quota atomicity:** Quota enforcement uses a single atomic `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes`, with no read-then-write in Python.
- **Object key privacy:** MinIO keys are `{user_id}/{document_id}/{uuid4()}{ext}`; original filenames are stored only in the DB `filename` column, never in the storage key.
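
The object key layout from the last constraint can be sketched as a small helper. The function name `build_object_key` is hypothetical, and lowercasing the extension is an assumption the source does not state.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Return '{user_id}/{document_id}/{uuid4}{ext}'.

    Only the extension of the original filename survives; the name itself
    never reaches the storage key.
    """
    ext = PurePosixPath(filename).suffix.lower()  # "" when there is no extension
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"
```

For example, `build_object_key("u1", "d1", "Tax Return 2025.PDF")` yields a key of the form `u1/d1/<random-uuid>.pdf`, so bucket listings reveal nothing about document titles.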

## Anti-Patterns

### Accessing document content via unauthenticated iframe src

**What happens:** Setting `<iframe src="/api/documents/{id}/content">` directly makes the browser request the content without an `Authorization` header, so the backend rejects the request and the document does not render.
**Why it's wrong:** The document content endpoint requires an `Authorization: Bearer` header, and browser `src=` attributes cannot send custom headers.
**Do this instead:** Use `fetchDocumentContent(docId)` in `frontend/src/api/client.js`. It injects the Bearer token, handles 401-refresh-retry, and then builds an object URL from the Blob response.

### Committing inside `write_audit_log`

**What happens:** Calling `session.commit()` inside `write_audit_log` creates a separate transaction for the audit entry.
**Why it's wrong:** The audit entry would commit even if the primary operation subsequently fails, creating phantom audit records.
**Do this instead:** `write_audit_log` calls `session.flush()` only. The caller owns `session.commit()` (`backend/services/audit.py`).

### CloudConnection query without user scope

**What happens:** Querying `CloudConnection` without filtering `user_id == current_user.id` would allow one user's cloud credentials to service another user's request.
**Why it's wrong:** It is an IDOR (insecure direct object reference): cross-user credential access.
**Do this instead:** Always filter `CloudConnection.user_id == user.id` as enforced in `get_storage_backend_for_document()` in `backend/storage/__init__.py`.

## Error Handling

**Strategy:** Services raise `ValueError`; API handlers catch and re-raise as `HTTPException`. No service module imports FastAPI.

**Patterns:**
- Auth service raises `ValueError` → API layer maps to 401/422/400
- Storage errors (`S3Error`, cloud provider errors) wrapped in `backend/storage/exceptions.py` → 503 or 404
- `write_audit_log` never raises: it logs its own failures and swallows them to protect primary operations
- `CloudConnectionError` (`backend/storage/exceptions.py`) used for cloud-specific failures

## Cross-Cutting Concerns

**Logging:** Python `logging` module with `logger = logging.getLogger(__name__)` in each module. No structured logging framework.

**Validation:** Pydantic models at API boundary. Field validators on sensitive fields (filename rejects path separators, permission allowlists, non-negative quota). No model accepts `**kwargs`.

**Authentication:** Every non-public endpoint injects `get_current_user`, `get_current_admin`, or `get_regular_user` via FastAPI `Depends`. No endpoint bypasses the dependency chain.

**Rate Limiting:** slowapi (a wrapper around the `limits` library) on all auth endpoints. Per-IP limits via `@limiter.limit("10/minute")`. Per-account Redis counter on login: `login_attempts:{email}`, 10 attempts per 15-minute window.
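
The per-account login counter can be sketched as a fixed-window limiter. The source gives only the key format and the 10-per-15-minutes budget; the fixed-window behavior, the class name, and the in-memory dictionary standing in for Redis are assumptions.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 10
WINDOW_SECONDS = 15 * 60


class LoginAttemptLimiter:
    """Fixed-window counter; a dict stands in for Redis keys with expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._counters: dict[str, tuple[int, float]] = {}  # key -> (count, window_end)

    def allow(self, email: str) -> bool:
        key = f"login_attempts:{email}"
        now = self._clock()
        count, window_end = self._counters.get(key, (0, now + WINDOW_SECONDS))
        if now >= window_end:  # window expired: start a new one
            count, window_end = 0, now + WINDOW_SECONDS
        count += 1
        self._counters[key] = (count, window_end)
        return count <= MAX_ATTEMPTS
```

Counting per email rather than per IP is what stops a distributed guessing attack against one account, which the per-IP slowapi limit alone would not catch.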

**Audit Logging:** `write_audit_log()` called inline in API handlers for all auth events, document operations, admin actions, and cloud connections. Written within the handler's transaction via `session.flush()`.

**HKDF Credential Encryption:** Cloud credentials encrypted with `Fernet(HKDF-SHA256(master_key, salt=user_id, purpose="cloud-creds"))` before DB storage. Implementation in `backend/storage/cloud_utils.py`.
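
The key derivation can be illustrated with a stdlib-only HKDF (RFC 5869) that produces a key in the format Fernet expects. This is a sketch under assumptions: the function names are hypothetical, the source's `purpose` is mapped to HKDF's `info` parameter, and the real `cloud_utils.py` presumably uses a cryptography library rather than hand-written HKDF.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256: extract a pseudorandom key, then expand it."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step: T(i) = HMAC(PRK, T(i-1) | info | i)
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_fernet_key(master_key: bytes, user_id: str) -> bytes:
    """Derive a per-user key: 32 raw bytes, url-safe base64 encoded."""
    raw = hkdf_sha256(master_key, salt=user_id.encode(), info=b"cloud-creds")
    return base64.urlsafe_b64encode(raw)
```

Because the user id feeds the derivation, one user's ciphertext cannot be decrypted with another user's derived key even though both come from the same master key.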

---

*Architecture analysis: 2026-06-02*