docs(codebase): refresh codebase map after Phase 06.2 completion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
curo1305
2026-06-02 15:32:06 +02:00
parent bd17b4b22f
commit 89f8d5a654
7 changed files with 1829 additions and 621 deletions
+273 -105
@@ -1,116 +1,284 @@
# ARCHITECTURE — document-scanner
<!-- refreshed: 2026-06-02 -->
# Architecture
_Last updated: 2026-05-21_
## Summary
Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.
---
**Analysis Date:** 2026-06-02
## System Overview
```
Browser (Vue 3 SPA)
│ HTTP/JSON + multipart
FastAPI (port 8000)
├── api/documents.py upload, list, get, delete, reclassify
├── api/topics.py CRUD for topic list
├── api/settings.py AI provider config + system prompt
├── services/
│ ├── extractor.py text extraction dispatch
│ ├── classifier.py orchestrates AI call + topic creation
│ └── storage.py flat-file JSON + filesystem persistence
└── ai/ provider abstraction layer
├── base.py AIProvider ABC + ClassificationResult
├── __init__.py get_provider() factory
├── anthropic_provider.py
├── openai_provider.py
├── ollama_provider.py (subclasses OpenAIProvider)
└── lmstudio_provider.py (subclasses OpenAIProvider)
External AI service (Anthropic API / OpenAI API /
Ollama / LM Studio — host.docker.internal)
```text
┌──────────────────────────────────────────────────────────────────────────┐
│ Browser (Vue 3 SPA) │
│ Pinia stores: auth · documents · folders · topics · cloudConnections │
│ Router: / /folders/:id /document/:id /cloud /admin /shared │
└─────────────────────┬──────────────────────────────────┬────────────────┘
│ fetch() + Bearer JWT │ PUT (presigned)
▼ ▼
┌──────────────────────────────────┐ ┌───────────────────────────────┐
│ FastAPI Backend :8000 │ │ MinIO :9000 │
│ api/auth api/documents │ │ Bucket: docuvault │
api/folders api/shares │ │ Keys: {uid}/{did}/{uuid}{e} │
│ api/cloud api/admin │ └───────────────────────────────┘
│ api/audit api/topics │
│ │ ┌───────────────────────────────┐
│ Middleware stack (per request):│ │ Cloud Backends │
│ OriginValidation (first) │ │ Google Drive / OneDrive │
│ CORS │ │ Nextcloud / WebDAV │
│ SecurityHeaders (CSP, etc.) │ └───────────────────────────────┘
│ SlowAPI rate limiter │
│ │ ┌───────────────────────────────┐
│ Deps layer: │ │ Celery Worker │
│ get_db (AsyncSession) │◄────► tasks/document_tasks.py │
│ get_current_user (JWT) │ │ tasks/email_tasks.py │
│ get_current_admin │ │ tasks/audit_tasks.py │
│ get_regular_user │ └───────────────────────────────┘
└────────────┬─────────────────────┘
│ SQLAlchemy async ┌───────────────────────────────┐
▼ │ Redis :6379 │
┌──────────────────────────┐ │ Rate limiting (slowapi) │
│ PostgreSQL :5432 │ │ TOTP replay cache │
│ 11 tables: │◄──────────► Celery broker + results │
│ users · quotas │ │ OAuth state tokens (TTL) │
│ refresh_tokens │ └───────────────────────────────┘
│ backup_codes · folders │
│ documents · topics │ ┌───────────────────────────────┐
│ document_topics │ │ AI Providers (pluggable) │
│ shares · audit_log │ │ Ollama · OpenAI · Anthropic │
│ cloud_connections │ │ LMStudio │
│ groups (v2 stub) │ │ ai/base.py → AIProvider ABC │
└──────────────────────────┘ └───────────────────────────────┘
```
---
## Component Responsibilities
## Request Flow — Document Upload + Classification
| Component | Responsibility | Key File |
|-----------|----------------|----------|
| FastAPI app | ASGI entry point, middleware, router registration | `backend/main.py` |
| Auth API | Register, login (TOTP/backup), refresh, logout, password reset | `backend/api/auth.py` |
| Documents API | Upload URL, confirm, list, delete, classify, stream content | `backend/api/documents.py` |
| Folders API | CRUD folders, move documents between folders | `backend/api/folders.py` |
| Shares API | Grant/revoke/list document shares between users | `backend/api/shares.py` |
| Cloud API | OAuth flows, WebDAV connect, folder listing, default storage | `backend/api/cloud.py` |
| Admin API | User CRUD, quota, AI config, audit log, delete user | `backend/api/admin.py` |
| Audit API | Paginated audit log viewer + CSV export | `backend/api/audit.py` |
| Topics API | CRUD topics, topic suggestions | `backend/api/topics.py` |
| Auth service | Password hashing, JWT, refresh token family, TOTP, HIBP | `backend/services/auth.py` |
| Audit service | `write_audit_log()` — flushed within caller's transaction | `backend/services/audit.py` |
| Classifier service | Selects AI provider, assigns topics, auto-creates suggestions | `backend/services/classifier.py` |
| Extractor service | PDF/DOCX/image/text extraction | `backend/services/extractor.py` |
| Storage service | ORM queries for documents + topic resolution | `backend/services/storage.py` |
| StorageBackend ABC | Interface for all object storage backends | `backend/storage/base.py` |
| Storage factory | Returns MinIOBackend or cloud backend from document record | `backend/storage/__init__.py` |
| MinIO backend | Presigned URL, put/get/delete, stat | `backend/storage/minio_backend.py` |
| Cloud backends | Google Drive, OneDrive, Nextcloud, WebDAV implementations | `backend/storage/*_backend.py` |
| AIProvider ABC | Interface: classify, suggest_topics, health_check | `backend/ai/base.py` |
| AI factory | Returns provider instance from string slug | `backend/ai/__init__.py` |
| Celery app | Task routing, beat schedule, JSON serialization | `backend/celery_app.py` |
| Document task | extract_and_classify — async bridge from sync Celery worker | `backend/tasks/document_tasks.py` |
| ORM models | 11-table schema, all UUID PKs, full index set | `backend/db/models.py` |
| DB session | Async engine, session factory (expire_on_commit=False) | `backend/db/session.py` |
| FastAPI deps | get_db, get_current_user, get_current_admin, get_regular_user | `backend/deps/` |
| Auth store | accessToken (memory only), user, quota, refresh deduplication | `frontend/src/stores/auth.js` |
| Documents store | CRUD, 3-step MinIO upload with progress, search debounce | `frontend/src/stores/documents.js` |
| Folders store | CRUD folders, breadcrumb, rootFolders for sidebar | `frontend/src/stores/folders.js` |
| Topics store | CRUD topics | `frontend/src/stores/topics.js` |
| CloudConnections store | List/disconnect cloud connections | `frontend/src/stores/cloudConnections.js` |
| API client | fetch wrapper, Bearer injection, 401→refresh→retry | `frontend/src/api/client.js` |
| Vue Router | SPA routes, beforeEach guard (silent refresh on reload) | `frontend/src/router/index.js` |
| FileManagerView | Unified file manager for local folders and documents | `frontend/src/views/FileManagerView.vue` |
| StorageBrowser | Reusable file listing component (local + cloud modes) | `frontend/src/components/storage/StorageBrowser.vue` |
1. Frontend POSTs `multipart/form-data` to `POST /api/documents/upload`
2. `documents.py` saves the file to `data/uploads/`, calls `extractor.extract_text()`
3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/<id>.json`
4. If `auto_classify=true`, `classifier.classify_document()` is called:
a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
b. Passes document text + existing topics to `provider.classify()`
c. Any suggested new topics are created via `storage.add_topic()`
d. Document metadata is updated with assigned topics
5. Full document metadata JSON is returned to the frontend
## Pattern Overview
**Overall:** Layered REST API + SPA with async background processing
**Key Characteristics:**
- API layer is thin — validation via Pydantic, business logic in `services/`
- No ORM relationships loaded — explicit queries only (prevents N+1)
- Async everywhere in FastAPI; Celery workers bridge to async via `asyncio.run()`
- Frontend Pinia stores own data-fetching; views delegate to stores; components emit events upward
- One DB session per request (yielded by `get_db` dep), one per Celery task invocation
- All resource ownership checked inline in handlers (`resource.user_id == current_user.id`)
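The sync-to-async bridge mentioned above can be sketched as follows. This is a minimal illustration that runs without Celery; `extract_and_classify_async` and its return shape are hypothetical stand-ins, not the project's actual task code.

```python
import asyncio


async def extract_and_classify_async(document_id: str) -> dict:
    """Hypothetical async pipeline; a real one would open its own DB session."""
    await asyncio.sleep(0)  # stand-in for awaited I/O (DB, storage, AI call)
    return {"document_id": document_id, "status": "classified"}


def extract_and_classify(document_id: str) -> dict:
    """Sync entry point, shaped like the body of a Celery task.

    asyncio.run() creates a fresh event loop, runs the coroutine to
    completion, and closes the loop, so no loop is shared with the API process.
    """
    return asyncio.run(extract_and_classify_async(document_id))
```

Because each call gets its own loop, any async resource (session, HTTP client) has to be created inside the coroutine rather than at module import time.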
## Layers
**API Layer:**
- Purpose: HTTP routing, request validation, response serialization
- Location: `backend/api/`
- Contains: APIRouter instances, Pydantic request/response models, FastAPI dep injection
- Depends on: `services/`, `deps/`, `db/models.py`
- Used by: Frontend via HTTP; not called from other backend modules
**Service Layer:**
- Purpose: Business logic with no FastAPI coupling (pure Python async functions)
- Location: `backend/services/`
- Contains: `auth.py`, `audit.py`, `classifier.py`, `extractor.py`, `storage.py`, `cloud_cache.py`, `email.py`
- Depends on: `db/models.py`, `storage/`, `ai/`, `config`
- Used by: `api/` layer and Celery tasks
**Storage Abstraction Layer:**
- Purpose: Backend-agnostic object storage interface
- Location: `backend/storage/`
- Contains: `base.py` (ABC), `minio_backend.py`, `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`, `cloud_utils.py` (HKDF encryption), `exceptions.py`
- Depends on: `config`, `db/models.py` (for cloud credential lookup)
- Used by: `services/storage.py`, `api/documents.py`, Celery tasks
**AI Abstraction Layer:**
- Purpose: Pluggable AI provider interface for document classification
- Location: `backend/ai/`
- Contains: `base.py` (ABC), `ollama_provider.py`, `openai_provider.py`, `anthropic_provider.py`, `lmstudio_provider.py`, `utils.py`
- Depends on: External AI APIs via httpx
- Used by: `services/classifier.py`
**Dependency Layer:**
- Purpose: FastAPI reusable dependencies (DI)
- Location: `backend/deps/`
- Contains: `db.py` (get_db), `auth.py` (get_current_user, get_current_admin, get_regular_user), `utils.py` (get_client_ip)
- Used by: All `api/` handlers
**Frontend Store Layer:**
- Purpose: Application state + async API calls
- Location: `frontend/src/stores/`
- Contains: `auth.js`, `documents.js`, `folders.js`, `topics.js`, `cloudConnections.js`
- Depends on: `api/client.js`
- Used by: Views and components
## Data Flow
### Document Upload (MinIO presigned URL path)
1. User drops file in `DropZone` → `StorageBrowser` emits `upload` → `FileManagerView.onFilesSelected` (`frontend/src/views/FileManagerView.vue`)
2. `documentsStore.upload(file, autoClassify, folderId)` (`frontend/src/stores/documents.js`)
3. `POST /api/documents/upload-url` → creates pending `Document` row, returns presigned PUT URL + `document_id` (`backend/api/documents.py`)
4. XHR `PUT` bytes directly from browser to MinIO presigned URL (no backend proxy, no auth header needed — URL is self-authenticating)
5. `POST /api/documents/{id}/confirm` → `stat_object()` for authoritative size → atomic quota `UPDATE … RETURNING` → status set to `'ready'` (`backend/api/documents.py`)
6. If `folderId != null`: `PATCH /api/documents/{id}/folder` → places document in folder
7. Celery task `extract_and_classify.delay(document_id)` enqueued → text extraction → AI classification → topic assignment (`backend/tasks/document_tasks.py`)
8. `authStore.fetchQuota()` called on frontend to refresh sidebar quota bar
### Authentication Flow
1. `POST /api/auth/login` with `{email, password}` — per-account Redis rate limit checked first (`backend/api/auth.py`)
2. Password verified with Argon2 (constant-time via pwdlib)
3. If TOTP enabled and no code provided → returns `{requires_totp: true}` challenge
4. If TOTP code provided → verified against pyotp + Redis replay prevention window
5. On success: `create_access_token()` (HS256 JWT, 15-min TTL) + `create_refresh_token()` (SHA-256 hashed, stored in DB) (`backend/services/auth.py`)
6. Access token returned in JSON body; refresh token set as `httpOnly; Secure; SameSite=Strict` cookie scoped to `/api/auth/refresh` path only
7. Frontend stores access token in `authStore.accessToken` (Pinia `ref()` — memory only, never localStorage)
8. On page reload: router `beforeEach` guard calls `authStore.refresh()` → `POST /api/auth/refresh` sends httpOnly cookie → new access token returned
9. `api/client.js` intercepts any 401 → calls `authStore.refresh()` → retries request once (`frontend/src/api/client.js`)
### Refresh Token Rotation + Family Revocation
1. `POST /api/auth/refresh` reads httpOnly cookie, looks up `RefreshToken` row by SHA-256 hash
2. If token already revoked → all user's refresh tokens revoked → 401 + security alert email enqueued via Celery
3. If valid: old token marked `revoked=True`, new raw token generated and stored (hashed), rotated cookie set
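The rotation and reuse-detection steps above can be sketched with an in-memory table. The names `TOKENS`, `issue`, `rotate`, and `ReuseDetected` are hypothetical; the real implementation stores rows in the database and also enqueues an alert email, which is omitted here.

```python
import hashlib
import secrets
from dataclasses import dataclass


@dataclass
class RefreshTokenRow:
    user_id: str
    revoked: bool = False


class ReuseDetected(Exception):
    """Raised when an already-revoked token is presented again."""


# In-memory stand-in for the refresh_tokens table, keyed by SHA-256 hex digest.
TOKENS = {}


def _hash(raw: str) -> str:
    return hashlib.sha256(raw.encode()).hexdigest()


def issue(user_id: str) -> str:
    """Create a raw token, store only its hash, return the raw value."""
    raw = secrets.token_urlsafe(32)
    TOKENS[_hash(raw)] = RefreshTokenRow(user_id)
    return raw


def rotate(raw: str) -> str:
    """Exchange a valid token for a new one; revoke all on reuse."""
    row = TOKENS.get(_hash(raw))
    if row is None:
        raise KeyError("unknown refresh token")
    if row.revoked:
        # A revoked token came back: revoke every token for this user.
        for other in TOKENS.values():
            if other.user_id == row.user_id:
                other.revoked = True
        raise ReuseDetected(row.user_id)
    row.revoked = True
    return issue(row.user_id)
```

Storing only the hash means a leaked table does not expose usable tokens, and the reuse branch invalidates the attacker's and the victim's tokens alike.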
### Cloud Storage OAuth Flow
1. `GET /api/cloud/oauth/initiate/{provider}` → state token stored in Redis (TTL 1800s, single-use) → authorization URL returned
2. Browser navigates to OAuth provider → callback to `GET /api/cloud/oauth/callback/{provider}`
3. State token validated (single-use consumed from Redis), authorization code exchanged for credentials
4. Credentials encrypted with HKDF-derived per-user Fernet key → stored in `cloud_connections.credentials_enc`
5. On document operations: `get_storage_backend_for_document()` decrypts credentials, instantiates cloud backend — transparent to API handlers (`backend/storage/__init__.py`)
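The single-use state token from steps 1 and 3 can be sketched as below. A plain dict stands in for Redis, and `create_state` / `consume_state` are hypothetical names; only the 1800-second TTL comes from the flow above.

```python
import secrets
import time
from typing import Optional

STATE_TTL_SECONDS = 1800  # TTL stated in the flow above

# state -> (user_id, expiry timestamp); a stand-in for a Redis key with TTL
_states = {}


def create_state(user_id: str, now: Optional[float] = None) -> str:
    now = time.time() if now is None else now
    state = secrets.token_urlsafe(32)
    _states[state] = (user_id, now + STATE_TTL_SECONDS)
    return state


def consume_state(state: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user_id once; later calls or expired tokens return None."""
    now = time.time() if now is None else now
    entry = _states.pop(state, None)  # pop makes the token single-use
    if entry is None:
        return None
    user_id, expires_at = entry
    return user_id if now < expires_at else None
```

Removing the entry before checking expiry keeps the lookup and the invalidation in one step, which is the property an atomic get-and-delete in Redis would provide.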
**State Management (frontend):**
- Access token: `authStore.accessToken` — Pinia `ref(null)`, JS memory only, cleared on logout/error
- User profile: `authStore.user` — Pinia `ref(null)`
- Quota: `authStore.quota` — fetched after upload/delete, displayed in `QuotaBar`
- Documents: `documentsStore.documents` — local array, kept in sync via explicit `fetchDocuments()` calls
- Folder tree: `foldersStore.rootFolders` (sidebar) + `foldersStore.folders` (current level)
- Upload progress: `documentsStore.uploadProgress` — keyed `${filename}__${Date.now()}` to prevent key collision
## Key Abstractions
**StorageBackend ABC (`backend/storage/base.py`):**
- Purpose: Uniform interface over MinIO and all cloud providers
- Methods: `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check`, `generate_presigned_put_url`, `stat_object`
- Implementations: `MinIOBackend`, `GoogleDriveBackend`, `OneDriveBackend`, `NextcloudBackend`, `WebDAVBackend`
- Selected by: `get_storage_backend_for_document()` in `backend/storage/__init__.py`
**AIProvider ABC (`backend/ai/base.py`):**
- Purpose: Pluggable classification backend
- Methods: `classify`, `suggest_topics`, `health_check`
- Returns: `ClassificationResult(topics, suggested_new_topics, reasoning)`
- Implementations: `OllamaProvider`, `OpenAIProvider`, `AnthropicProvider`, `LMStudioProvider`
- Selected by: `ai/__init__.py` factory, keyed to per-user `ai_provider`/`ai_model` from DB
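A rough sketch of the provider interface and slug-based factory described above. The method names and the `ClassificationResult` fields follow this document; `KeywordProvider`, the `PROVIDERS` registry, and the exact parameter defaults are hypothetical and exist only so the example runs offline.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    topics: list = field(default_factory=list)
    suggested_new_topics: list = field(default_factory=list)
    reasoning: str = ""


class AIProvider(ABC):
    @abstractmethod
    async def classify(self, document_text, existing_topics, system_prompt=""):
        """Return a ClassificationResult for the document."""

    @abstractmethod
    async def suggest_topics(self, document_text, system_prompt=""):
        """Return a list of topic names."""

    @abstractmethod
    async def health_check(self):
        """Return True if the provider is reachable."""


class KeywordProvider(AIProvider):
    """Hypothetical offline provider: matches topic names found in the text."""

    async def classify(self, document_text, existing_topics, system_prompt=""):
        lowered = document_text.lower()
        hits = [t for t in existing_topics if t.lower() in lowered]
        return ClassificationResult(topics=hits, reasoning="substring match")

    async def suggest_topics(self, document_text, system_prompt=""):
        return []

    async def health_check(self):
        return True


PROVIDERS = {"keyword": KeywordProvider}  # slug -> class (illustrative registry)


def get_provider(slug: str) -> AIProvider:
    try:
        return PROVIDERS[slug]()
    except KeyError:
        raise ValueError(f"unknown AI provider: {slug}") from None
```

A caller such as the classifier service would only depend on `AIProvider`, so adding a backend means registering one more class under a new slug.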
**Dependency Chain:**
- `get_current_user` → parses Bearer JWT → loads `User` from DB, checks `is_active`
- `get_current_admin` → wraps `get_current_user` + `role == 'admin'` check (raises 403)
- `get_regular_user` → wraps `get_current_user` + rejects `role == 'admin'` (admins get 403 on document endpoints)
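The role checks in this chain can be illustrated without FastAPI. In this sketch `HTTPError` is a hypothetical stand-in for the framework exception, JWT parsing and the DB lookup are skipped, and the 401 for inactive accounts is an assumption (the document only specifies the 403 cases).

```python
from dataclasses import dataclass


class HTTPError(Exception):
    """Hypothetical stand-in for a framework HTTP exception."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code


@dataclass
class User:
    id: str
    role: str
    is_active: bool = True


def get_current_user(user: User) -> User:
    # The real dependency parses the Bearer JWT and loads the user first.
    if not user.is_active:
        raise HTTPError(401, "inactive account")  # status code is an assumption
    return user


def get_current_admin(user: User) -> User:
    user = get_current_user(user)
    if user.role != "admin":
        raise HTTPError(403, "admin role required")
    return user


def get_regular_user(user: User) -> User:
    user = get_current_user(user)
    if user.role == "admin":
        raise HTTPError(403, "admins cannot access document endpoints")
    return user
```

The two wrappers are deliberately disjoint: an admin passes `get_current_admin` and fails `get_regular_user`, and a regular user does the opposite.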
## Entry Points
**Backend:**
- Location: `backend/main.py`
- Triggers: `uvicorn main:app`
- Responsibilities: FastAPI app factory, lifespan (MinIO bucket init, Redis connection, admin bootstrap), middleware registration in correct order, router inclusion
**Celery Worker:**
- Location: `backend/celery_app.py` (factory) + `backend/tasks/`
- Triggers: `celery -A celery_app worker -Q documents`
- Responsibilities: Async document text extraction + classification, email delivery, scheduled nightly audit CSV export
**Frontend:**
- Location: `frontend/src/main.js`
- Triggers: Vite dev server (`npm run dev`) or built static files served by frontend container
- Responsibilities: Mount Vue app with Pinia and Router
## Architectural Constraints
- **Threading:** FastAPI runs on a single-threaded asyncio event loop (uvicorn). Blocking MinIO SDK calls use `asyncio.to_thread()`. Celery workers are separate sync processes that bridge to async via `asyncio.run()` — they never share an event loop with FastAPI.
- **Global state:** `backend/services/storage.py` holds a module-level `_storage` singleton for the default MinIO backend. `backend/main.py` stores MinIO client on `app.state.minio` and Redis client on `app.state.redis`.
- **Circular imports:** Celery task modules must never import from `main.py` or router modules. `backend/celery_app.py` intentionally avoids importing `config` — reads `REDIS_URL` directly from `os.environ` to avoid pydantic-settings side effects.
- **Admin isolation:** Admin accounts cannot access document content — enforced by `get_regular_user` dep on all document/folder/share endpoints. No impersonation code path exists (`backend/deps/auth.py`).
- **Quota atomicity:** Quota enforcement uses a single atomic `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes` — no read-then-write in Python.
- **Object key privacy:** MinIO keys are `{user_id}/{document_id}/{uuid4()}{ext}` — original filenames stored only in the DB `filename` column, never in the storage key.
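The quota constraint above can be demonstrated with SQLite from the standard library. This is a sketch under stated assumptions: the table layout and the `user_id` filter are hypothetical, and it checks `rowcount` instead of using `RETURNING` so it runs on older SQLite builds.

```python
import sqlite3


def reserve_quota(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    """Add delta to used_bytes only if the result stays within limit_bytes.

    The check and the increment happen in one UPDATE statement, so there is
    no read-then-write window in Python.
    """
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means the quota would overflow


# Illustrative schema and data (not the project's actual DDL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES ('u1', 0, 100)")
conn.commit()
```

With a 100-byte limit, a first 60-byte reservation succeeds and a second one is rejected without changing `used_bytes`.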
## Anti-Patterns
### Accessing document content via unauthenticated iframe src
**What happens:** Setting `<iframe src="/api/documents/{id}/content">` directly makes the browser request the content without the Bearer token, so the request cannot be authenticated.
**Why it's wrong:** The document content endpoint requires `Authorization: Bearer` header; browser `src=` attributes do not send custom headers.
**Do this instead:** Use `fetchDocumentContent(docId)` in `frontend/src/api/client.js` — it injects Bearer + handles 401-refresh-retry, then builds an object URL from the Blob response.
### Committing inside `write_audit_log`
**What happens:** Calling `session.commit()` inside `write_audit_log` creates a separate transaction for the audit entry.
**Why it's wrong:** The audit entry would commit even if the primary operation subsequently fails, creating phantom audit records.
**Do this instead:** `write_audit_log` calls `session.flush()` only. The caller owns `session.commit()` (`backend/services/audit.py`).
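The effect of leaving the commit to the caller can be shown with SQLite from the standard library. This is only an analogy for the SQLAlchemy `flush()` behavior: the table names, `create_document`, and the `fail` flag are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE audit_log (action TEXT, target TEXT)")
conn.commit()


def write_audit_log(conn: sqlite3.Connection, action: str, target: str) -> None:
    # No commit here: the row joins whatever transaction the caller has open.
    conn.execute("INSERT INTO audit_log VALUES (?, ?)", (action, target))


def create_document(conn: sqlite3.Connection, doc_id: str, fail: bool = False) -> None:
    try:
        write_audit_log(conn, "document.create", doc_id)
        conn.execute("INSERT INTO documents VALUES (?)", (doc_id,))
        if fail:
            raise RuntimeError("simulated failure after audit write")
        conn.commit()  # the caller owns the commit
    except Exception:
        conn.rollback()  # discards the audit row together with the document
        raise
```

If the primary operation fails, the rollback removes the audit row too, so no phantom audit record is left behind.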
### CloudConnection query without user scope
**What happens:** Querying `CloudConnection` without filtering `user_id == current_user.id` would allow one user's cloud credentials to service another user's request.
**Why it's wrong:** IDOR — cross-user credential access.
**Do this instead:** Always filter `CloudConnection.user_id == user.id` as enforced in `get_storage_backend_for_document()` in `backend/storage/__init__.py`.
## Error Handling
**Strategy:** Services raise `ValueError`; API handlers catch and re-raise as `HTTPException`. No service module imports FastAPI.
**Patterns:**
- Auth service raises `ValueError` → API layer maps to 401/422/400
- Storage errors (`S3Error`, cloud provider errors) wrapped in `backend/storage/exceptions.py` → 503 or 404
- `write_audit_log` never raises — silently logs and swallows to protect primary operations
- `CloudConnectionError` (`backend/storage/exceptions.py`) used for cloud-specific failures
## Cross-Cutting Concerns
**Logging:** Python `logging` module with `logger = logging.getLogger(__name__)` in each module. No structured logging framework.
**Validation:** Pydantic models at API boundary. Field validators on sensitive fields (filename rejects path separators, permission allowlists, non-negative quota). No model accepts `**kwargs`.
**Authentication:** Every non-public endpoint injects `get_current_user`, `get_current_admin`, or `get_regular_user` via FastAPI `Depends`. No endpoint bypasses the dependency chain.
**Rate Limiting:** slowapi (a wrapper around the `limits` library) on all auth endpoints. Per-IP limits via `@limiter.limit("10/minute")`. Per-account Redis counter on login: `login_attempts:{email}`, 10 attempts per 15-minute window.
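The per-account counter can be approximated in memory as a fixed window. This sketch assumes fixed-window semantics (the document does not say whether the window is fixed or sliding); the dict stands in for a Redis counter with an expiry, and `register_attempt` is a hypothetical name.

```python
import time
from typing import Optional

MAX_ATTEMPTS = 10
WINDOW_SECONDS = 15 * 60

# key -> (attempt count, window expiry); stand-in for a Redis counter with TTL
_counters = {}


def register_attempt(email: str, now: Optional[float] = None) -> bool:
    """Record one login attempt; return True while it is within the limit."""
    now = time.time() if now is None else now
    key = f"login_attempts:{email}"
    count, expires_at = _counters.get(key, (0, now + WINDOW_SECONDS))
    if now >= expires_at:
        # Window elapsed: start a new one, as an expired Redis key would.
        count, expires_at = 0, now + WINDOW_SECONDS
    count += 1
    _counters[key] = (count, expires_at)
    return count <= MAX_ATTEMPTS
```

Counting happens before the password check, so failed and successful attempts both consume the budget for that account.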
**Audit Logging:** `write_audit_log()` called inline in API handlers for all auth events, document operations, admin actions, and cloud connections. Written within the handler's transaction via `session.flush()`.
**HKDF Credential Encryption:** Cloud credentials encrypted with `Fernet(HKDF-SHA256(master_key, salt=user_id, purpose="cloud-creds"))` before DB storage. Implementation in `backend/storage/cloud_utils.py`.
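The key-derivation half of this scheme can be sketched with the standard library alone. The function below follows RFC 5869 (HKDF with SHA-256); `derive_user_key` is a hypothetical helper, and treating `purpose` as the HKDF `info` parameter is an assumption. The encryption step itself (Fernet) is left out because it needs a third-party package.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        # expand step: T(i) = HMAC(PRK, T(i-1) | info | i)
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    """Derive a per-user 32-byte key, URL-safe base64 encoded (Fernet key format)."""
    raw = hkdf_sha256(master_key, salt=user_id.encode(), info=b"cloud-creds")
    return base64.urlsafe_b64encode(raw)
```

Because the user ID feeds the derivation, each user gets a distinct key from the same master key, and credentials encrypted for one user cannot be decrypted with another user's key.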
---
## AI Provider Abstraction
- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
- `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
- `suggest_topics(document_text, system_prompt) → list[str]`
- `health_check() → bool`
- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
- Provider is re-instantiated on every request (stateless; no connection pooling)
---
## Data Persistence
All state is stored on the local filesystem — no database:
| Store | Path | Format | Access |
|---|---|---|---|
| Uploaded files | `data/uploads/<id>.<ext>` | Original binary | Direct filesystem |
| Document metadata | `data/metadata/<id>.json` | JSON per document | `filelock` protected |
| Topic list | `data/topics.json` | `{"topics": [...]}` | `filelock` protected |
| Settings | `data/settings.json` | JSON object | `filelock` protected |
`filelock` is used to prevent concurrent write corruption on JSON files.
---
## Frontend Architecture
- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer — components never call the API directly
- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
- Vite proxies `/api/*` to `http://localhost:8000` in dev mode
---
## Key Patterns
- **Provider Pattern** — AI backends are interchangeable at runtime via settings
- **Service Layer** — `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
- **Pinia-as-Facade** — stores encapsulate all async API calls; views stay declarative
---
## Constraints & Notable Decisions
- All CORS origins allowed (`allow_origins=["*"]`) — suitable for local dev, not production
- **Auth dependency chain (Phase 2+):** `get_current_user` (validates JWT, returns User) → `get_current_admin` (requires role=admin) / `get_regular_user` (requires role!=admin, 403 for admin accounts on document endpoints). `get_regular_user` enforces SEC-04: admin accounts cannot read document content (CLAUDE.md).
- **Ownership assertion pattern (Phase 3+):** Every `/api/documents/*` handler asserts `doc.user_id == current_user.id` before returning — raises 404 (not 403) to prevent information leakage (D-16, T-03-11). Cross-user access and non-existence are indistinguishable.
- **Topic namespace model (Phase 3+):** `user_id=NULL` = system topic (visible to all); `user_id=<uuid>` = per-user topic. `load_topics_for_user(session, user_id)` returns union via `or_(Topic.user_id == user_id, Topic.user_id.is_(None))`. Admin creates system topics via `POST /api/admin/topics`.
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
- AI provider re-instantiated per request (no connection reuse)
- Data directory is volume-mounted in Docker; no backup or migration strategy
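The ownership assertion pattern from the list above can be sketched as follows. `NotFound`, `DOCUMENTS`, and `get_owned_document` are hypothetical; the point is only that a missing document and another user's document produce the same 404.

```python
from dataclasses import dataclass


class NotFound(Exception):
    """Hypothetical stand-in for an HTTP 404 response."""

    status_code = 404


@dataclass
class Document:
    id: str
    user_id: str


# In-memory stand-in for the documents table.
DOCUMENTS = {"d1": Document("d1", "alice")}


def get_owned_document(doc_id: str, current_user_id: str) -> Document:
    doc = DOCUMENTS.get(doc_id)
    # Same error for "missing" and "not yours" so existence is not leaked.
    if doc is None or doc.user_id != current_user_id:
        raise NotFound("Document not found")
    return doc
```

Returning 403 for the foreign-owner case would confirm that the ID exists, which is exactly what this pattern avoids.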
---
## Gaps / Unknowns
- No API versioning strategy visible
- Frontend has no error boundary or global error handling component
- No pagination on document list endpoint (could be a scaling concern)
*Architecture analysis: 2026-06-02*