diff --git a/.planning/codebase/ARCHITECTURE.md b/.planning/codebase/ARCHITECTURE.md
index 0bf746c..360e53b 100644
--- a/.planning/codebase/ARCHITECTURE.md
+++ b/.planning/codebase/ARCHITECTURE.md
@@ -1,116 +1,284 @@
-# ARCHITECTURE — document-scanner
+
+# Architecture
 
-_Last updated: 2026-05-21_
-
-## Summary
-
-Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.
-
----
+**Analysis Date:** 2026-06-02
 
 ## System Overview
 
-```
-Browser (Vue 3 SPA)
-    │  HTTP/JSON + multipart
-    ▼
-FastAPI (port 8000)
-    ├── api/documents.py   – upload, list, get, delete, reclassify
-    ├── api/topics.py      – CRUD for topic list
-    ├── api/settings.py    – AI provider config + system prompt
-    │
-    ├── services/
-    │   ├── extractor.py   – text extraction dispatch
-    │   ├── classifier.py  – orchestrates AI call + topic creation
-    │   └── storage.py     – flat-file JSON + filesystem persistence
-    │
-    └── ai/                – provider abstraction layer
-        ├── base.py        – AIProvider ABC + ClassificationResult
-        ├── __init__.py    – get_provider() factory
-        ├── anthropic_provider.py
-        ├── openai_provider.py
-        ├── ollama_provider.py    (subclasses OpenAIProvider)
-        └── lmstudio_provider.py  (subclasses OpenAIProvider)
-              │
-              ▼
-        External AI service (Anthropic API / OpenAI API /
-                             Ollama / LM Studio — host.docker.internal)
+```text
+┌──────────────────────────────────────────────────────────────────────────┐
+│  Browser (Vue 3 SPA)                                                     │
+│  Pinia stores: auth · documents · folders · topics · cloudConnections    │
+│  Router: / /folders/:id /document/:id /cloud /admin /shared              │
+└─────────────────────┬──────────────────────────────────┬────────────────┘
+                      │ fetch() + Bearer JWT             │ PUT (presigned)
+                      ▼                                  ▼
+┌──────────────────────────────────┐     ┌───────────────────────────────┐
+│  FastAPI Backend :8000           │     │  MinIO :9000                  │
+│  api/auth      api/documents     │     │  Bucket: docuvault            │
+│  api/folders   api/shares        │     │  Keys: {uid}/{did}/{uuid}{e}  │
+│  api/cloud     api/admin         │     └───────────────────────────────┘
+│  api/audit     api/topics        │
+│                                  │     ┌───────────────────────────────┐
+│  Middleware stack (per request): │     │  Cloud Backends               │
+│    OriginValidation (first)      │     │  Google Drive / OneDrive      │
+│    CORS                          │     │  Nextcloud / WebDAV           │
+│    SecurityHeaders (CSP, etc.)   │     └───────────────────────────────┘
+│    SlowAPI rate limiter          │
+│                                  │     ┌───────────────────────────────┐
+│  Deps layer:                     │     │  Celery Worker                │
+│    get_db (AsyncSession)         │◄────►  tasks/document_tasks.py      │
+│    get_current_user (JWT)        │     │  tasks/email_tasks.py         │
+│    get_current_admin             │     │  tasks/audit_tasks.py         │
+│    get_regular_user              │     └───────────────────────────────┘
+└────────────┬─────────────────────┘
+             │ SQLAlchemy async        ┌───────────────────────────────┐
+             ▼                         │  Redis :6379                  │
+┌──────────────────────────┐           │  Rate limiting (slowapi)      │
+│  PostgreSQL :5432        │           │  TOTP replay cache            │
+│  11 tables:              │◄──────────►  Celery broker + results      │
+│   users · quotas         │           │  OAuth state tokens (TTL)     │
+│   refresh_tokens         │           └───────────────────────────────┘
+│   backup_codes · folders │
+│   documents · topics     │           ┌───────────────────────────────┐
+│   document_topics        │           │  AI Providers (pluggable)     │
+│   shares · audit_log     │           │  Ollama · OpenAI · Anthropic  │
+│   cloud_connections      │           │  LMStudio                     │
+│   groups (v2 stub)       │           │  ai/base.py → AIProvider ABC  │
+└──────────────────────────┘           └───────────────────────────────┘
 ```
 
----
+## Component Responsibilities
 
-## Request Flow — Document Upload + Classification
+| Component | Responsibility | Key File |
+|-----------|----------------|----------|
+| FastAPI app | ASGI entry point, middleware, router registration | `backend/main.py` |
+| Auth API | Register, login (TOTP/backup), refresh, logout, password reset | `backend/api/auth.py` |
+| Documents API | Upload URL, confirm, list, delete, classify, stream content | `backend/api/documents.py` |
+| Folders API | CRUD folders, move documents between folders | `backend/api/folders.py` |
+| Shares API | Grant/revoke/list document shares between users | `backend/api/shares.py` |
+| Cloud API | OAuth flows, WebDAV connect, folder listing, default storage | `backend/api/cloud.py` |
+| Admin API | User CRUD, quota, AI config, audit log, delete user | `backend/api/admin.py` |
+| Audit API | Paginated audit log viewer + CSV export | `backend/api/audit.py` |
+| Topics API | CRUD topics, topic suggestions | `backend/api/topics.py` |
+| Auth service | Password hashing, JWT, refresh token family, TOTP, HIBP | `backend/services/auth.py` |
+| Audit service | `write_audit_log()` — flushed within caller's transaction | `backend/services/audit.py` |
+| Classifier service | Selects AI provider, assigns topics, auto-creates suggestions | `backend/services/classifier.py` |
+| Extractor service | PDF/DOCX/image/text extraction | `backend/services/extractor.py` |
+| Storage service | ORM queries for documents + topic resolution | `backend/services/storage.py` |
+| StorageBackend ABC | Interface for all object storage backends | `backend/storage/base.py` |
+| Storage factory | Returns MinIOBackend or cloud backend from document record | `backend/storage/__init__.py` |
+| MinIO backend | Presigned URL, put/get/delete, stat | `backend/storage/minio_backend.py` |
+| Cloud backends | Google Drive, OneDrive, Nextcloud, WebDAV implementations | `backend/storage/*_backend.py` |
+| AIProvider ABC | Interface: classify, suggest_topics, health_check | `backend/ai/base.py` |
+| AI factory | Returns provider instance from string slug | `backend/ai/__init__.py` |
+| Celery app | Task routing, beat schedule, JSON serialization | `backend/celery_app.py` |
+| Document task | extract_and_classify — async bridge from sync Celery worker | `backend/tasks/document_tasks.py` |
+| ORM models | 11-table schema, all UUID PKs, full index set | `backend/db/models.py` |
+| DB session | Async engine, session factory (expire_on_commit=False) | `backend/db/session.py` |
+| FastAPI deps | get_db, get_current_user, get_current_admin, get_regular_user | `backend/deps/` |
+| Auth store | accessToken (memory only), user, quota, refresh deduplication | `frontend/src/stores/auth.js` |
+| Documents store | CRUD, 3-step MinIO upload with progress, search debounce | `frontend/src/stores/documents.js` |
+| Folders store | CRUD folders, breadcrumb, rootFolders for sidebar | `frontend/src/stores/folders.js` |
+| Topics store | CRUD topics | `frontend/src/stores/topics.js` |
+| CloudConnections store | List/disconnect cloud connections | `frontend/src/stores/cloudConnections.js` |
+| API client | fetch wrapper, Bearer injection, 401→refresh→retry | `frontend/src/api/client.js` |
+| Vue Router | SPA routes, beforeEach guard (silent refresh on reload) | `frontend/src/router/index.js` |
+| FileManagerView | Unified file manager for local folders and documents | `frontend/src/views/FileManagerView.vue` |
+| StorageBrowser | Reusable file listing component (local + cloud modes) | `frontend/src/components/storage/StorageBrowser.vue` |
 
-1. Frontend POSTs `multipart/form-data` to `POST /api/documents/upload`
-2. `documents.py` saves the file to `data/uploads/`, calls `extractor.extract_text()`
-3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/.json`
-4. If `auto_classify=true`, `classifier.classify_document()` is called:
-   a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
-   b. Passes document text + existing topics to `provider.classify()`
-   c. Any suggested new topics are created via `storage.add_topic()`
-   d. Document metadata is updated with assigned topics
-5. Full document metadata JSON is returned to the frontend
+## Pattern Overview
+
+**Overall:** Layered REST API + SPA with async background processing
+
+**Key Characteristics:**
+- API layer is thin — validation via Pydantic, business logic in `services/`
+- No ORM relationships loaded — explicit queries only (prevents N+1)
+- Async everywhere in FastAPI; Celery workers bridge to async via `asyncio.run()`
+- Frontend Pinia stores own data-fetching; views delegate to stores; components emit events upward
+- One DB session per request (yielded by `get_db` dep), one per Celery task invocation
+- All resource ownership checked inline in handlers (`resource.user_id == current_user.id`)
+
+## Layers
+
+**API Layer:**
+- Purpose: HTTP routing, request validation, response serialization
+- Location: `backend/api/`
+- Contains: APIRouter instances, Pydantic request/response models, FastAPI dep injection
+- Depends on: `services/`, `deps/`, `db/models.py`
+- Used by: Frontend via HTTP; not called from other backend modules
+
+**Service Layer:**
+- Purpose: Business logic with no FastAPI coupling (pure Python async functions)
+- Location: `backend/services/`
+- Contains: `auth.py`, `audit.py`, `classifier.py`, `extractor.py`, `storage.py`, `cloud_cache.py`, `email.py`
+- Depends on: `db/models.py`, `storage/`, `ai/`, `config`
+- Used by: `api/` layer and Celery tasks
+
+**Storage Abstraction Layer:**
+- Purpose: Backend-agnostic object storage interface
+- Location: `backend/storage/`
+- Contains: `base.py` (ABC), `minio_backend.py`, `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`, `cloud_utils.py` (HKDF encryption), `exceptions.py`
+- Depends on: `config`, `db/models.py` (for cloud credential lookup)
+- Used by: `services/storage.py`, `api/documents.py`, Celery tasks
+
+**AI Abstraction Layer:**
+- Purpose: Pluggable AI provider interface for document classification
+- Location: `backend/ai/`
+- Contains: `base.py` (ABC), `ollama_provider.py`, `openai_provider.py`, `anthropic_provider.py`, `lmstudio_provider.py`, `utils.py`
+- Depends on: External AI APIs via httpx
+- Used by: `services/classifier.py`
+
+**Dependency Layer:**
+- Purpose: FastAPI reusable dependencies (DI)
+- Location: `backend/deps/`
+- Contains: `db.py` (get_db), `auth.py` (get_current_user, get_current_admin, get_regular_user), `utils.py` (get_client_ip)
+- Used by: All `api/` handlers
+
+**Frontend Store Layer:**
+- Purpose: Application state + async API calls
+- Location: `frontend/src/stores/`
+- Contains: `auth.js`, `documents.js`, `folders.js`, `topics.js`, `cloudConnections.js`
+- Depends on: `api/client.js`
+- Used by: Views and components
+
+## Data Flow
+
+### Document Upload (MinIO presigned URL path)
+
+1. User drops file in `DropZone` → `StorageBrowser` emits `upload` → `FileManagerView.onFilesSelected` (`frontend/src/views/FileManagerView.vue`)
+2. `documentsStore.upload(file, autoClassify, folderId)` (`frontend/src/stores/documents.js`)
+3. `POST /api/documents/upload-url` → creates pending `Document` row, returns presigned PUT URL + `document_id` (`backend/api/documents.py`)
+4. XHR `PUT` bytes directly from browser to MinIO presigned URL (no backend proxy, no auth header needed — URL is self-authenticating)
+5. `POST /api/documents/{id}/confirm` → `stat_object()` for authoritative size → atomic quota `UPDATE … RETURNING` → status set to `'ready'` (`backend/api/documents.py`)
+6. If `folderId != null`: `PATCH /api/documents/{id}/folder` → places document in folder
+7. Celery task `extract_and_classify.delay(document_id)` enqueued → text extraction → AI classification → topic assignment (`backend/tasks/document_tasks.py`)
+8. `authStore.fetchQuota()` called on frontend to refresh sidebar quota bar
+
+### Authentication Flow
+
+1. `POST /api/auth/login` with `{email, password}` — per-account Redis rate limit checked first (`backend/api/auth.py`)
+2. Password verified with Argon2 (constant-time via pwdlib)
+3. If TOTP enabled and no code provided → returns `{requires_totp: true}` challenge
+4. If TOTP code provided → verified against pyotp + Redis replay prevention window
+5. On success: `create_access_token()` (HS256 JWT, 15-min TTL) + `create_refresh_token()` (SHA-256 hashed, stored in DB) (`backend/services/auth.py`)
+6. Access token returned in JSON body; refresh token set as `httpOnly; Secure; SameSite=Strict` cookie scoped to `/api/auth/refresh` path only
+7. Frontend stores access token in `authStore.accessToken` (Pinia `ref()` — memory only, never localStorage)
+8. On page reload: router `beforeEach` guard calls `authStore.refresh()` → `POST /api/auth/refresh` sends httpOnly cookie → new access token returned
+9. `api/client.js` intercepts any 401 → calls `authStore.refresh()` → retries request once (`frontend/src/api/client.js`)
+
+### Refresh Token Rotation + Family Revocation
+
+1. `POST /api/auth/refresh` reads httpOnly cookie, looks up `RefreshToken` row by SHA-256 hash
+2. If token already revoked → all user's refresh tokens revoked → 401 + security alert email enqueued via Celery
+3. If valid: old token marked `revoked=True`, new raw token generated and stored (hashed), rotated cookie set
+
+### Cloud Storage OAuth Flow
+
+1. `GET /api/cloud/oauth/initiate/{provider}` → state token stored in Redis (TTL 1800s, single-use) → authorization URL returned
+2. Browser navigates to OAuth provider → callback to `GET /api/cloud/oauth/callback/{provider}`
+3. State token validated (single-use consumed from Redis), authorization code exchanged for credentials
+4. Credentials encrypted with HKDF-derived per-user Fernet key → stored in `cloud_connections.credentials_enc`
+5. On document operations: `get_storage_backend_for_document()` decrypts credentials, instantiates cloud backend — transparent to API handlers (`backend/storage/__init__.py`)
+
+**State Management (frontend):**
+- Access token: `authStore.accessToken` — Pinia `ref(null)`, JS memory only, cleared on logout/error
+- User profile: `authStore.user` — Pinia `ref(null)`
+- Quota: `authStore.quota` — fetched after upload/delete, displayed in `QuotaBar`
+- Documents: `documentsStore.documents` — local array, kept in sync via explicit `fetchDocuments()` calls
+- Folder tree: `foldersStore.rootFolders` (sidebar) + `foldersStore.folders` (current level)
+- Upload progress: `documentsStore.uploadProgress` — keyed `${filename}__${Date.now()}` to prevent key collision
+
+## Key Abstractions
+
+**StorageBackend ABC (`backend/storage/base.py`):**
+- Purpose: Uniform interface over MinIO and all cloud providers
+- Methods: `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check`, `generate_presigned_put_url`, `stat_object`
+- Implementations: `MinIOBackend`, `GoogleDriveBackend`, `OneDriveBackend`, `NextcloudBackend`, `WebDAVBackend`
+- Selected by: `get_storage_backend_for_document()` in `backend/storage/__init__.py`
+
+**AIProvider ABC (`backend/ai/base.py`):**
+- Purpose: Pluggable classification backend
+- Methods: `classify`, `suggest_topics`, `health_check`
+- Returns: `ClassificationResult(topics, suggested_new_topics, reasoning)`
+- Implementations: `OllamaProvider`, `OpenAIProvider`, `AnthropicProvider`, `LMStudioProvider`
+- Selected by: `ai/__init__.py` factory, keyed to per-user `ai_provider`/`ai_model` from DB
+
+**Dependency Chain:**
+- `get_current_user` → parses Bearer JWT → loads `User` from DB, checks `is_active`
+- `get_current_admin` → wraps `get_current_user` + `role == 'admin'` check (raises 403)
+- `get_regular_user` → wraps `get_current_user` + rejects `role == 'admin'` (admins get 403 on document endpoints)
+
+## Entry Points
+
+**Backend:**
+- Location: `backend/main.py`
+- Triggers: `uvicorn main:app`
+- Responsibilities: FastAPI app factory, lifespan (MinIO bucket init, Redis connection, admin bootstrap), middleware registration in correct order, router inclusion
+
+**Celery Worker:**
+- Location: `backend/celery_app.py` (factory) + `backend/tasks/`
+- Triggers: `celery -A celery_app worker -Q documents`
+- Responsibilities: Async document text extraction + classification, email delivery, scheduled nightly audit CSV export
+
+**Frontend:**
+- Location: `frontend/src/main.js`
+- Triggers: Vite dev server (`npm run dev`) or built static files served by frontend container
+- Responsibilities: Mount Vue app with Pinia and Router
+
+## Architectural Constraints
+
+- **Threading:** FastAPI runs on a single-threaded asyncio event loop (uvicorn). Blocking MinIO SDK calls use `asyncio.to_thread()`. Celery workers are separate sync processes that bridge to async via `asyncio.run()` — they never share an event loop with FastAPI.
+- **Global state:** `backend/services/storage.py` holds a module-level `_storage` singleton for the default MinIO backend. `backend/main.py` stores MinIO client on `app.state.minio` and Redis client on `app.state.redis`.
+- **Circular imports:** Celery task modules must never import from `main.py` or router modules. `backend/celery_app.py` intentionally avoids importing `config` — reads `REDIS_URL` directly from `os.environ` to avoid pydantic-settings side effects.
+- **Admin isolation:** Admin accounts cannot access document content — enforced by `get_regular_user` dep on all document/folder/share endpoints. No impersonation code path exists (`backend/deps/auth.py`).
+- **Quota atomicity:** Quota enforcement uses a single atomic `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes` — no read-then-write in Python.
+- **Object key privacy:** MinIO keys are `{user_id}/{document_id}/{uuid4()}{ext}` — original filenames stored only in the DB `filename` column, never in the storage key.
+
+## Anti-Patterns
+
+### Accessing document content via unauthenticated iframe src
+
+**What happens:** Setting `