docs(codebase): refresh codebase map after Phase 06.2 completion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
+273
-105
@@ -1,116 +1,284 @@
|
||||
# ARCHITECTURE — document-scanner
|
||||
<!-- refreshed: 2026-06-02 -->
|
||||
# Architecture
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.
|
||||
|
||||
---
|
||||
**Analysis Date:** 2026-06-02
|
||||
|
||||
## System Overview
|
||||
|
||||
```
|
||||
Browser (Vue 3 SPA)
|
||||
│ HTTP/JSON + multipart
|
||||
▼
|
||||
FastAPI (port 8000)
|
||||
├── api/documents.py – upload, list, get, delete, reclassify
|
||||
├── api/topics.py – CRUD for topic list
|
||||
├── api/settings.py – AI provider config + system prompt
|
||||
│
|
||||
├── services/
|
||||
│ ├── extractor.py – text extraction dispatch
|
||||
│ ├── classifier.py – orchestrates AI call + topic creation
|
||||
│ └── storage.py – flat-file JSON + filesystem persistence
|
||||
│
|
||||
└── ai/ – provider abstraction layer
|
||||
├── base.py – AIProvider ABC + ClassificationResult
|
||||
├── __init__.py – get_provider() factory
|
||||
├── anthropic_provider.py
|
||||
├── openai_provider.py
|
||||
├── ollama_provider.py (subclasses OpenAIProvider)
|
||||
└── lmstudio_provider.py (subclasses OpenAIProvider)
|
||||
│
|
||||
▼
|
||||
External AI service (Anthropic API / OpenAI API /
|
||||
Ollama / LM Studio — host.docker.internal)
|
||||
```text
|
||||
┌──────────────────────────────────────────────────────────────────────────┐
|
||||
│ Browser (Vue 3 SPA) │
|
||||
│ Pinia stores: auth · documents · folders · topics · cloudConnections │
|
||||
│ Router: / /folders/:id /document/:id /cloud /admin /shared │
|
||||
└─────────────────────┬──────────────────────────────────┬────────────────┘
|
||||
│ fetch() + Bearer JWT │ PUT (presigned)
|
||||
▼ ▼
|
||||
┌──────────────────────────────────┐ ┌───────────────────────────────┐
|
||||
│ FastAPI Backend :8000 │ │ MinIO :9000 │
|
||||
│ api/auth api/documents │ │ Bucket: docuvault │
|
||||
│ api/folders api/shares │ │ Keys: {uid}/{did}/{uuid}{e} │
|
||||
│ api/cloud api/admin │ └───────────────────────────────┘
|
||||
│ api/audit api/topics │
|
||||
│ │ ┌───────────────────────────────┐
|
||||
│ Middleware stack (per request):│ │ Cloud Backends │
|
||||
│ OriginValidation (first) │ │ Google Drive / OneDrive │
|
||||
│ CORS │ │ Nextcloud / WebDAV │
|
||||
│ SecurityHeaders (CSP, etc.) │ └───────────────────────────────┘
|
||||
│ SlowAPI rate limiter │
|
||||
│ │ ┌───────────────────────────────┐
|
||||
│ Deps layer: │ │ Celery Worker │
|
||||
│ get_db (AsyncSession) │◄────► tasks/document_tasks.py │
|
||||
│ get_current_user (JWT) │ │ tasks/email_tasks.py │
|
||||
│ get_current_admin │ │ tasks/audit_tasks.py │
|
||||
│ get_regular_user │ └───────────────────────────────┘
|
||||
└────────────┬─────────────────────┘
|
||||
│ SQLAlchemy async ┌───────────────────────────────┐
|
||||
▼ │ Redis :6379 │
|
||||
┌──────────────────────────┐ │ Rate limiting (slowapi) │
|
||||
│ PostgreSQL :5432 │ │ TOTP replay cache │
|
||||
│ 11 tables: │◄──────────► Celery broker + results │
|
||||
│ users · quotas │ │ OAuth state tokens (TTL) │
|
||||
│ refresh_tokens │ └───────────────────────────────┘
|
||||
│ backup_codes · folders │
|
||||
│ documents · topics │ ┌───────────────────────────────┐
|
||||
│ document_topics │ │ AI Providers (pluggable) │
|
||||
│ shares · audit_log │ │ Ollama · OpenAI · Anthropic │
|
||||
│ cloud_connections │ │ LMStudio │
|
||||
│ groups (v2 stub) │ │ ai/base.py → AIProvider ABC │
|
||||
└──────────────────────────┘ └───────────────────────────────┘
|
||||
```
|
||||
|
||||
---
|
||||
## Component Responsibilities
|
||||
|
||||
## Request Flow — Document Upload + Classification
|
||||
| Component | Responsibility | Key File |
|-----------|----------------|----------|
| FastAPI app | ASGI entry point, middleware, router registration | `backend/main.py` |
| Auth API | Register, login (TOTP/backup), refresh, logout, password reset | `backend/api/auth.py` |
| Documents API | Upload URL, confirm, list, delete, classify, stream content | `backend/api/documents.py` |
| Folders API | CRUD folders, move documents between folders | `backend/api/folders.py` |
| Shares API | Grant/revoke/list document shares between users | `backend/api/shares.py` |
| Cloud API | OAuth flows, WebDAV connect, folder listing, default storage | `backend/api/cloud.py` |
| Admin API | User CRUD, quota, AI config, audit log, delete user | `backend/api/admin.py` |
| Audit API | Paginated audit log viewer + CSV export | `backend/api/audit.py` |
| Topics API | CRUD topics, topic suggestions | `backend/api/topics.py` |
| Auth service | Password hashing, JWT, refresh token family, TOTP, HIBP | `backend/services/auth.py` |
| Audit service | `write_audit_log()` — flushed within caller's transaction | `backend/services/audit.py` |
| Classifier service | Selects AI provider, assigns topics, auto-creates suggestions | `backend/services/classifier.py` |
| Extractor service | PDF/DOCX/image/text extraction | `backend/services/extractor.py` |
| Storage service | ORM queries for documents + topic resolution | `backend/services/storage.py` |
| StorageBackend ABC | Interface for all object storage backends | `backend/storage/base.py` |
| Storage factory | Returns MinIOBackend or cloud backend from document record | `backend/storage/__init__.py` |
| MinIO backend | Presigned URL, put/get/delete, stat | `backend/storage/minio_backend.py` |
| Cloud backends | Google Drive, OneDrive, Nextcloud, WebDAV implementations | `backend/storage/*_backend.py` |
| AIProvider ABC | Interface: classify, suggest_topics, health_check | `backend/ai/base.py` |
| AI factory | Returns provider instance from string slug | `backend/ai/__init__.py` |
| Celery app | Task routing, beat schedule, JSON serialization | `backend/celery_app.py` |
| Document task | extract_and_classify — async bridge from sync Celery worker | `backend/tasks/document_tasks.py` |
| ORM models | 11-table schema, all UUID PKs, full index set | `backend/db/models.py` |
| DB session | Async engine, session factory (expire_on_commit=False) | `backend/db/session.py` |
| FastAPI deps | get_db, get_current_user, get_current_admin, get_regular_user | `backend/deps/` |
| Auth store | accessToken (memory only), user, quota, refresh deduplication | `frontend/src/stores/auth.js` |
| Documents store | CRUD, 3-step MinIO upload with progress, search debounce | `frontend/src/stores/documents.js` |
| Folders store | CRUD folders, breadcrumb, rootFolders for sidebar | `frontend/src/stores/folders.js` |
| Topics store | CRUD topics | `frontend/src/stores/topics.js` |
| CloudConnections store | List/disconnect cloud connections | `frontend/src/stores/cloudConnections.js` |
| API client | fetch wrapper, Bearer injection, 401→refresh→retry | `frontend/src/api/client.js` |
| Vue Router | SPA routes, beforeEach guard (silent refresh on reload) | `frontend/src/router/index.js` |
| FileManagerView | Unified file manager for local folders and documents | `frontend/src/views/FileManagerView.vue` |
| StorageBrowser | Reusable file listing component (local + cloud modes) | `frontend/src/components/storage/StorageBrowser.vue` |
|
||||
|
||||
1. Frontend POSTs `multipart/form-data` to `POST /api/documents/upload`
|
||||
2. `documents.py` saves the file to `data/uploads/`, calls `extractor.extract_text()`
|
||||
3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/<id>.json`
|
||||
4. If `auto_classify=true`, `classifier.classify_document()` is called:
   a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
   b. Passes document text + existing topics to `provider.classify()`
   c. Any suggested new topics are created via `storage.add_topic()`
   d. Document metadata is updated with assigned topics
|
||||
5. Full document metadata JSON is returned to the frontend
|
||||
## Pattern Overview
|
||||
|
||||
**Overall:** Layered REST API + SPA with async background processing
|
||||
|
||||
**Key Characteristics:**
|
||||
- API layer is thin — validation via Pydantic, business logic in `services/`
|
||||
- No ORM relationships loaded — explicit queries only (prevents N+1)
|
||||
- Async everywhere in FastAPI; Celery workers bridge to async via `asyncio.run()`
|
||||
- Frontend Pinia stores own data-fetching; views delegate to stores; components emit events upward
|
||||
- One DB session per request (yielded by `get_db` dep), one per Celery task invocation
|
||||
- All resource ownership checked inline in handlers (`resource.user_id == current_user.id`)
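
The sync-to-async bridge mentioned in the characteristics above could look roughly like the sketch below. It is only an illustration: `extract_and_classify_async` and its return value are hypothetical stand-ins, and the Celery task decorator is omitted so the snippet runs with the standard library alone.

```python
import asyncio


async def extract_and_classify_async(document_id: str) -> dict:
    """Hypothetical async pipeline; a real one would open its own DB session."""
    await asyncio.sleep(0)  # stands in for awaited extraction and AI calls
    return {"document_id": document_id, "status": "classified"}


def extract_and_classify(document_id: str) -> dict:
    """Sync entry point, shaped the way a Celery worker would invoke it.

    asyncio.run() creates a fresh event loop, runs the coroutine to
    completion, and closes the loop, so nothing is shared with the
    event loop of the FastAPI process.
    """
    return asyncio.run(extract_and_classify_async(document_id))
```

Because each invocation gets its own loop, any async resources (sessions, HTTP clients) would need to be created inside the coroutine rather than at module import time.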
|
||||
|
||||
## Layers
|
||||
|
||||
**API Layer:**
|
||||
- Purpose: HTTP routing, request validation, response serialization
|
||||
- Location: `backend/api/`
|
||||
- Contains: APIRouter instances, Pydantic request/response models, FastAPI dep injection
|
||||
- Depends on: `services/`, `deps/`, `db/models.py`
|
||||
- Used by: Frontend via HTTP; not called from other backend modules
|
||||
|
||||
**Service Layer:**
|
||||
- Purpose: Business logic with no FastAPI coupling (pure Python async functions)
|
||||
- Location: `backend/services/`
|
||||
- Contains: `auth.py`, `audit.py`, `classifier.py`, `extractor.py`, `storage.py`, `cloud_cache.py`, `email.py`
|
||||
- Depends on: `db/models.py`, `storage/`, `ai/`, `config`
|
||||
- Used by: `api/` layer and Celery tasks
|
||||
|
||||
**Storage Abstraction Layer:**
|
||||
- Purpose: Backend-agnostic object storage interface
|
||||
- Location: `backend/storage/`
|
||||
- Contains: `base.py` (ABC), `minio_backend.py`, `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`, `cloud_utils.py` (HKDF encryption), `exceptions.py`
|
||||
- Depends on: `config`, `db/models.py` (for cloud credential lookup)
|
||||
- Used by: `services/storage.py`, `api/documents.py`, Celery tasks
|
||||
|
||||
**AI Abstraction Layer:**
|
||||
- Purpose: Pluggable AI provider interface for document classification
|
||||
- Location: `backend/ai/`
|
||||
- Contains: `base.py` (ABC), `ollama_provider.py`, `openai_provider.py`, `anthropic_provider.py`, `lmstudio_provider.py`, `utils.py`
|
||||
- Depends on: External AI APIs via httpx
|
||||
- Used by: `services/classifier.py`
|
||||
|
||||
**Dependency Layer:**
|
||||
- Purpose: FastAPI reusable dependencies (DI)
|
||||
- Location: `backend/deps/`
|
||||
- Contains: `db.py` (get_db), `auth.py` (get_current_user, get_current_admin, get_regular_user), `utils.py` (get_client_ip)
|
||||
- Used by: All `api/` handlers
|
||||
|
||||
**Frontend Store Layer:**
|
||||
- Purpose: Application state + async API calls
|
||||
- Location: `frontend/src/stores/`
|
||||
- Contains: `auth.js`, `documents.js`, `folders.js`, `topics.js`, `cloudConnections.js`
|
||||
- Depends on: `api/client.js`
|
||||
- Used by: Views and components
|
||||
|
||||
## Data Flow
|
||||
|
||||
### Document Upload (MinIO presigned URL path)
|
||||
|
||||
1. User drops file in `DropZone` → `StorageBrowser` emits `upload` → `FileManagerView.onFilesSelected` (`frontend/src/views/FileManagerView.vue`)
|
||||
2. `documentsStore.upload(file, autoClassify, folderId)` (`frontend/src/stores/documents.js`)
|
||||
3. `POST /api/documents/upload-url` → creates pending `Document` row, returns presigned PUT URL + `document_id` (`backend/api/documents.py`)
|
||||
4. XHR `PUT` bytes directly from browser to MinIO presigned URL (no backend proxy, no auth header needed — URL is self-authenticating)
|
||||
5. `POST /api/documents/{id}/confirm` → `stat_object()` for authoritative size → atomic quota `UPDATE … RETURNING` → status set to `'ready'` (`backend/api/documents.py`)
|
||||
6. If `folderId != null`: `PATCH /api/documents/{id}/folder` → places document in folder
|
||||
7. Celery task `extract_and_classify.delay(document_id)` enqueued → text extraction → AI classification → topic assignment (`backend/tasks/document_tasks.py`)
|
||||
8. `authStore.fetchQuota()` called on frontend to refresh sidebar quota bar
|
||||
|
||||
### Authentication Flow
|
||||
|
||||
1. `POST /api/auth/login` with `{email, password}` — per-account Redis rate limit checked first (`backend/api/auth.py`)
|
||||
2. Password verified with Argon2 (constant-time via pwdlib)
|
||||
3. If TOTP enabled and no code provided → returns `{requires_totp: true}` challenge
|
||||
4. If TOTP code provided → verified against pyotp + Redis replay prevention window
|
||||
5. On success: `create_access_token()` (HS256 JWT, 15-min TTL) + `create_refresh_token()` (SHA-256 hashed, stored in DB) (`backend/services/auth.py`)
|
||||
6. Access token returned in JSON body; refresh token set as `httpOnly; Secure; SameSite=Strict` cookie scoped to `/api/auth/refresh` path only
|
||||
7. Frontend stores access token in `authStore.accessToken` (Pinia `ref()` — memory only, never localStorage)
|
||||
8. On page reload: router `beforeEach` guard calls `authStore.refresh()` → `POST /api/auth/refresh` sends httpOnly cookie → new access token returned
|
||||
9. `api/client.js` intercepts any 401 → calls `authStore.refresh()` → retries request once (`frontend/src/api/client.js`)
|
||||
|
||||
### Refresh Token Rotation + Family Revocation
|
||||
|
||||
1. `POST /api/auth/refresh` reads httpOnly cookie, looks up `RefreshToken` row by SHA-256 hash
|
||||
2. If token already revoked → all user's refresh tokens revoked → 401 + security alert email enqueued via Celery
|
||||
3. If valid: old token marked `revoked=True`, new raw token generated and stored (hashed), rotated cookie set
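
The rotation and family-revocation steps above can be sketched with an in-memory store. This is a minimal illustration, not the project's code: `RefreshTokenStore` and `RefreshTokenRow` are hypothetical names, a dict stands in for the `refresh_tokens` table, and the 401 response and alert email are left out.

```python
import hashlib
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RefreshTokenRow:
    user_id: str
    revoked: bool = False


class RefreshTokenStore:
    """In-memory stand-in for the refresh token table, keyed by SHA-256 hash."""

    def __init__(self) -> None:
        self.rows: Dict[str, RefreshTokenRow] = {}

    @staticmethod
    def _hash(raw: str) -> str:
        # Only the hash is stored; the raw token lives in the cookie.
        return hashlib.sha256(raw.encode()).hexdigest()

    def issue(self, user_id: str) -> str:
        raw = secrets.token_urlsafe(32)
        self.rows[self._hash(raw)] = RefreshTokenRow(user_id)
        return raw

    def rotate(self, raw: str) -> Optional[str]:
        """Return a new raw token, or None if the caller should answer 401."""
        row = self.rows.get(self._hash(raw))
        if row is None:
            return None
        if row.revoked:
            # Reuse of a revoked token: revoke every token of that user.
            for other in self.rows.values():
                if other.user_id == row.user_id:
                    other.revoked = True
            return None
        row.revoked = True
        return self.issue(row.user_id)
```

Presenting an already-rotated token therefore invalidates the token that replaced it as well, which is the point of family revocation.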
|
||||
|
||||
### Cloud Storage OAuth Flow
|
||||
|
||||
1. `GET /api/cloud/oauth/initiate/{provider}` → state token stored in Redis (TTL 1800s, single-use) → authorization URL returned
|
||||
2. Browser navigates to OAuth provider → callback to `GET /api/cloud/oauth/callback/{provider}`
|
||||
3. State token validated (single-use consumed from Redis), authorization code exchanged for credentials
|
||||
4. Credentials encrypted with HKDF-derived per-user Fernet key → stored in `cloud_connections.credentials_enc`
|
||||
5. On document operations: `get_storage_backend_for_document()` decrypts credentials, instantiates cloud backend — transparent to API handlers (`backend/storage/__init__.py`)
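
The single-use state token from steps 1 and 3 can be illustrated with a small in-memory store. The sketch assumes a dict in place of Redis and a hypothetical `OAuthStateStore` class; with Redis, an atomic read-and-delete (for example `GETDEL`) plus a key TTL would play the same role.

```python
import secrets
import time

STATE_TTL_SECONDS = 1800  # TTL taken from the flow description above


class OAuthStateStore:
    """In-memory stand-in for the Redis-backed state store (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._states = {}  # state -> (user_id, expires_at)

    def create(self, user_id: str) -> str:
        state = secrets.token_urlsafe(32)
        self._states[state] = (user_id, self._clock() + STATE_TTL_SECONDS)
        return state

    def consume(self, state: str):
        """Return the user id once; later or expired lookups return None."""
        entry = self._states.pop(state, None)  # pop makes the token single-use
        if entry is None:
            return None
        user_id, expires_at = entry
        return user_id if self._clock() < expires_at else None
```

The injectable `clock` exists only so the expiry path can be exercised without waiting 30 minutes.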
|
||||
|
||||
**State Management (frontend):**
|
||||
- Access token: `authStore.accessToken` — Pinia `ref(null)`, JS memory only, cleared on logout/error
|
||||
- User profile: `authStore.user` — Pinia `ref(null)`
|
||||
- Quota: `authStore.quota` — fetched after upload/delete, displayed in `QuotaBar`
|
||||
- Documents: `documentsStore.documents` — local array, kept in sync via explicit `fetchDocuments()` calls
|
||||
- Folder tree: `foldersStore.rootFolders` (sidebar) + `foldersStore.folders` (current level)
|
||||
- Upload progress: `documentsStore.uploadProgress` — keyed `${filename}__${Date.now()}` to prevent key collision
|
||||
|
||||
## Key Abstractions
|
||||
|
||||
**StorageBackend ABC (`backend/storage/base.py`):**
|
||||
- Purpose: Uniform interface over MinIO and all cloud providers
|
||||
- Methods: `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check`, `generate_presigned_put_url`, `stat_object`
|
||||
- Implementations: `MinIOBackend`, `GoogleDriveBackend`, `OneDriveBackend`, `NextcloudBackend`, `WebDAVBackend`
|
||||
- Selected by: `get_storage_backend_for_document()` in `backend/storage/__init__.py`
|
||||
|
||||
**AIProvider ABC (`backend/ai/base.py`):**
|
||||
- Purpose: Pluggable classification backend
|
||||
- Methods: `classify`, `suggest_topics`, `health_check`
|
||||
- Returns: `ClassificationResult(topics, suggested_new_topics, reasoning)`
|
||||
- Implementations: `OllamaProvider`, `OpenAIProvider`, `AnthropicProvider`, `LMStudioProvider`
|
||||
- Selected by: `ai/__init__.py` factory, keyed to per-user `ai_provider`/`ai_model` from DB
|
||||
|
||||
**Dependency Chain:**
|
||||
- `get_current_user` → parses Bearer JWT → loads `User` from DB, checks `is_active`
|
||||
- `get_current_admin` → wraps `get_current_user` + `role == 'admin'` check (raises 403)
|
||||
- `get_regular_user` → wraps `get_current_user` + rejects `role == 'admin'` (admins get 403 on document endpoints)
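
As a framework-free illustration of this chain, the sketch below models the three dependencies as plain functions. It assumes a hypothetical `AuthError` in place of FastAPI's `HTTPException`, skips JWT parsing and the DB lookup, and uses `"user"` as a placeholder name for the non-admin role.

```python
from dataclasses import dataclass
from typing import Optional


class AuthError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code


@dataclass
class User:
    id: str
    role: str
    is_active: bool = True


def get_current_user(user: Optional[User]) -> User:
    # In the real dependency the user would come from a decoded Bearer JWT.
    if user is None or not user.is_active:
        raise AuthError(401, "not authenticated")
    return user


def get_current_admin(user: Optional[User]) -> User:
    user = get_current_user(user)
    if user.role != "admin":
        raise AuthError(403, "admin role required")
    return user


def get_regular_user(user: Optional[User]) -> User:
    user = get_current_user(user)
    if user.role == "admin":
        raise AuthError(403, "admins cannot access document endpoints")
    return user
```

Both role checks build on `get_current_user`, so an inactive account fails with 401 before any role logic runs.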
|
||||
|
||||
## Entry Points
|
||||
|
||||
**Backend:**
|
||||
- Location: `backend/main.py`
|
||||
- Triggers: `uvicorn main:app`
|
||||
- Responsibilities: FastAPI app factory, lifespan (MinIO bucket init, Redis connection, admin bootstrap), middleware registration in correct order, router inclusion
|
||||
|
||||
**Celery Worker:**
|
||||
- Location: `backend/celery_app.py` (factory) + `backend/tasks/`
|
||||
- Triggers: `celery -A celery_app worker -Q documents`
|
||||
- Responsibilities: Async document text extraction + classification, email delivery, scheduled nightly audit CSV export
|
||||
|
||||
**Frontend:**
|
||||
- Location: `frontend/src/main.js`
|
||||
- Triggers: Vite dev server (`npm run dev`) or built static files served by frontend container
|
||||
- Responsibilities: Mount Vue app with Pinia and Router
|
||||
|
||||
## Architectural Constraints
|
||||
|
||||
- **Threading:** FastAPI runs on a single-threaded asyncio event loop (uvicorn). Blocking MinIO SDK calls use `asyncio.to_thread()`. Celery workers are separate sync processes that bridge to async via `asyncio.run()` — they never share an event loop with FastAPI.
|
||||
- **Global state:** `backend/services/storage.py` holds a module-level `_storage` singleton for the default MinIO backend. `backend/main.py` stores MinIO client on `app.state.minio` and Redis client on `app.state.redis`.
|
||||
- **Circular imports:** Celery task modules must never import from `main.py` or router modules. `backend/celery_app.py` intentionally avoids importing `config` — reads `REDIS_URL` directly from `os.environ` to avoid pydantic-settings side effects.
|
||||
- **Admin isolation:** Admin accounts cannot access document content — enforced by `get_regular_user` dep on all document/folder/share endpoints. No impersonation code path exists (`backend/deps/auth.py`).
|
||||
- **Quota atomicity:** Quota enforcement uses a single atomic `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes` — no read-then-write in Python.
|
||||
- **Object key privacy:** MinIO keys are `{user_id}/{document_id}/{uuid4()}{ext}` — original filenames stored only in the DB `filename` column, never in the storage key.
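
The quota-atomicity constraint can be demonstrated with a conditional `UPDATE`. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite table with a hypothetical schema, filters by `user_id` (not shown in the statement above), and checks `rowcount` rather than `RETURNING` so it runs on older SQLite builds.

```python
import sqlite3


def make_quota_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for the quotas table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, "
        "used_bytes INTEGER NOT NULL, limit_bytes INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO quotas VALUES ('u1', 0, 100)")
    conn.commit()
    return conn


def reserve_quota(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    """Add delta to used_bytes only if the result stays within the limit.

    The check and the increment happen in one statement, so there is no
    read-then-write window in Python for a concurrent upload to slip through.
    """
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means the quota is exhausted
```

A rejected reservation leaves `used_bytes` untouched, so the caller can simply report the failure without any compensating write.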
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
### Accessing document content via unauthenticated iframe src
|
||||
|
||||
**What happens:** Setting `<iframe src="/api/documents/{id}/content">` directly makes the browser request the content without an `Authorization` header, so the request is rejected and the preview never loads.
|
||||
**Why it's wrong:** The document content endpoint requires `Authorization: Bearer` header; browser `src=` attributes do not send custom headers.
|
||||
**Do this instead:** Use `fetchDocumentContent(docId)` in `frontend/src/api/client.js` — it injects Bearer + handles 401-refresh-retry, then builds an object URL from the Blob response.
|
||||
|
||||
### Committing inside `write_audit_log`
|
||||
|
||||
**What happens:** Calling `session.commit()` inside `write_audit_log` creates a separate transaction for the audit entry.
|
||||
**Why it's wrong:** The audit entry would commit even if the primary operation subsequently fails, creating phantom audit records.
|
||||
**Do this instead:** `write_audit_log` calls `session.flush()` only. The caller owns `session.commit()` — `backend/services/audit.py`.
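
The effect of leaving the commit to the caller can be shown with a plain `sqlite3` transaction. This is a sketch under assumptions, not the SQLAlchemy code: the table layout, `delete_document`, and the `fail` flag are hypothetical, and an uncommitted `INSERT` plays the role of `session.flush()`.

```python
import sqlite3


def make_audit_db() -> sqlite3.Connection:
    """In-memory stand-in with one document and an empty audit log."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE audit_log (action TEXT NOT NULL)")
    conn.execute("INSERT INTO documents VALUES ('d1')")
    conn.commit()
    return conn


def write_audit_log(conn: sqlite3.Connection, action: str) -> None:
    # No commit here: the row joins whatever transaction the caller has open.
    conn.execute("INSERT INTO audit_log (action) VALUES (?)", (action,))


def delete_document(conn: sqlite3.Connection, doc_id: str, fail: bool = False) -> None:
    try:
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
        write_audit_log(conn, f"document.delete:{doc_id}")
        if fail:
            raise RuntimeError("simulated failure after the audit write")
        conn.commit()  # the caller owns the commit
    except Exception:
        conn.rollback()  # the audit row disappears with the primary change
        raise
```

When the simulated failure fires, neither the delete nor the audit row persists, which is the behaviour a commit inside `write_audit_log` would break.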
|
||||
|
||||
### CloudConnection query without user scope
|
||||
|
||||
**What happens:** Querying `CloudConnection` without filtering `user_id == current_user.id` would allow one user's cloud credentials to service another user's request.
|
||||
**Why it's wrong:** IDOR — cross-user credential access.
|
||||
**Do this instead:** Always filter `CloudConnection.user_id == user.id` as enforced in `get_storage_backend_for_document()` in `backend/storage/__init__.py`.
|
||||
|
||||
## Error Handling
|
||||
|
||||
**Strategy:** Services raise `ValueError`; API handlers catch and re-raise as `HTTPException`. No service module imports FastAPI.
|
||||
|
||||
**Patterns:**
|
||||
- Auth service raises `ValueError` → API layer maps to 401/422/400
|
||||
- Storage errors (`S3Error`, cloud provider errors) wrapped in `backend/storage/exceptions.py` → 503 or 404
|
||||
- `write_audit_log` never raises: it logs the failure and swallows the exception so that an audit error cannot abort the primary operation
|
||||
- `CloudConnectionError` (`backend/storage/exceptions.py`) used for cloud-specific failures
|
||||
|
||||
## Cross-Cutting Concerns
|
||||
|
||||
**Logging:** Python `logging` module with `logger = logging.getLogger(__name__)` in each module. No structured logging framework.
|
||||
|
||||
**Validation:** Pydantic models at API boundary. Field validators on sensitive fields (filename rejects path separators, permission allowlists, non-negative quota). No model accepts `**kwargs`.
|
||||
|
||||
**Authentication:** Every non-public endpoint injects `get_current_user`, `get_current_admin`, or `get_regular_user` via FastAPI `Depends`. No endpoint bypasses the dependency chain.
|
||||
|
||||
**Rate Limiting:** slowapi (a wrapper around the `limits` library) on all auth endpoints. Per-IP limits via `@limiter.limit("10/minute")`. Per-account Redis counter on login: `login_attempts:{email}`, 10 attempts per 15-minute window.
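
The per-account login counter can be sketched as a fixed-window limiter. The version below is an in-memory approximation with a hypothetical `LoginAttemptLimiter` class; against Redis the same idea is usually expressed as `INCR` on the key plus `EXPIRE` when the counter is first created.

```python
import time

MAX_ATTEMPTS = 10
WINDOW_SECONDS = 15 * 60


class LoginAttemptLimiter:
    """Fixed-window counter keyed like the Redis key named above."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._counters = {}  # key -> (count, window_expires_at)

    def allow(self, email: str) -> bool:
        key = f"login_attempts:{email}"
        now = self._clock()
        entry = self._counters.get(key)
        if entry is None or now >= entry[1]:
            # No window yet, or the old one ran out: start a fresh window.
            entry = (0, now + WINDOW_SECONDS)
        count, expires_at = entry[0] + 1, entry[1]
        self._counters[key] = (count, expires_at)
        return count <= MAX_ATTEMPTS
```

Counting the attempt before checking the limit means rejected attempts also consume the window, so hammering a locked account does not reset it.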
|
||||
|
||||
**Audit Logging:** `write_audit_log()` called inline in API handlers for all auth events, document operations, admin actions, and cloud connections. Written within the handler's transaction via `session.flush()`.
|
||||
|
||||
**HKDF Credential Encryption:** Cloud credentials encrypted with `Fernet(HKDF-SHA256(master_key, salt=user_id, purpose="cloud-creds"))` before DB storage. Implementation in `backend/storage/cloud_utils.py`.
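
To make the key derivation concrete, here is a standard-library sketch of HKDF-SHA256 (RFC 5869) producing a Fernet-shaped key. It rests on assumptions: the `purpose` value is mapped to HKDF's `info` parameter, `derive_fernet_key` is a hypothetical helper, and the real `cloud_utils.py` may well use a library HKDF instead.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF extract-then-expand with SHA-256, following RFC 5869."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step: T(i) = HMAC(PRK, T(i-1) | info | i)
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_fernet_key(master_key: bytes, user_id: str) -> bytes:
    """Derive a per-user key in the format Fernet expects.

    Fernet keys are 32 random-looking bytes, URL-safe base64-encoded.
    Using the user id as salt and b"cloud-creds" as info is an assumption
    based on the notation in the paragraph above.
    """
    raw = hkdf_sha256(master_key, salt=user_id.encode(), info=b"cloud-creds")
    return base64.urlsafe_b64encode(raw)
```

Since the user id feeds the derivation, a ciphertext copied into another user's `cloud_connections` row would fail to decrypt under that user's key.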
|
||||
|
||||
---
|
||||
|
||||
## AI Provider Abstraction
|
||||
|
||||
- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
  - `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
  - `suggest_topics(document_text, system_prompt) → list[str]`
  - `health_check() → bool`
|
||||
- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
|
||||
- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
|
||||
- Provider is re-instantiated on every request (stateless; no connection pooling)
|
||||
|
||||
---
|
||||
|
||||
## Data Persistence
|
||||
|
||||
All state is stored on the local filesystem — no database:
|
||||
|
||||
| Store | Path | Format | Access |
|---|---|---|---|
| Uploaded files | `data/uploads/<id>.<ext>` | Original binary | Direct filesystem |
| Document metadata | `data/metadata/<id>.json` | JSON per document | `filelock` protected |
| Topic list | `data/topics.json` | `{"topics": [...]}` | `filelock` protected |
| Settings | `data/settings.json` | JSON object | `filelock` protected |
|
||||
|
||||
`filelock` is used to prevent concurrent write corruption on JSON files.
|
||||
|
||||
---
|
||||
|
||||
## Frontend Architecture
|
||||
|
||||
- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
|
||||
- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer — components never call the API directly
|
||||
- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
|
||||
- Vite proxies `/api/*` to `http://localhost:8000` in dev mode
|
||||
|
||||
---
|
||||
|
||||
## Key Patterns
|
||||
|
||||
- **Provider Pattern** — AI backends are interchangeable at runtime via settings
|
||||
- **Service Layer** — `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
|
||||
- **Pinia-as-Facade** — stores encapsulate all async API calls; views stay declarative
|
||||
|
||||
---
|
||||
|
||||
## Constraints & Notable Decisions
|
||||
|
||||
- All CORS origins allowed (`allow_origins=["*"]`) — suitable for local dev, not production
|
||||
- **Auth dependency chain (Phase 2+):** `get_current_user` (validates JWT, returns User) → `get_current_admin` (requires role=admin) / `get_regular_user` (requires role!=admin, 403 for admin accounts on document endpoints). `get_regular_user` enforces SEC-04: admin accounts cannot read document content (CLAUDE.md).
|
||||
- **Ownership assertion pattern (Phase 3+):** Every `/api/documents/*` handler asserts `doc.user_id == current_user.id` before returning — raises 404 (not 403) to prevent information leakage (D-16, T-03-11). Cross-user access and non-existence are indistinguishable.
|
||||
- **Topic namespace model (Phase 3+):** `user_id=NULL` = system topic (visible to all); `user_id=<uuid>` = per-user topic. `load_topics_for_user(session, user_id)` returns union via `or_(Topic.user_id == user_id, Topic.user_id.is_(None))`. Admin creates system topics via `POST /api/admin/topics`.
|
||||
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
|
||||
- AI provider re-instantiated per request (no connection reuse)
|
||||
- Data directory is volume-mounted in Docker; no backup or migration strategy
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- No API versioning strategy visible
|
||||
- Frontend has no error boundary or global error handling component
|
||||
- No pagination on document list endpoint (could be a scaling concern)
|
||||
*Architecture analysis: 2026-06-02*
|
||||
|
||||
+397
-69
@@ -1,87 +1,415 @@
|
||||
# CONCERNS — document-scanner
|
||||
# Codebase Concerns
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
The codebase is a well-structured local-first prototype. The main concerns are security issues that matter if exposed beyond localhost (open CORS, no file validation, plain-text key storage), several blocking I/O calls in async handlers, and a handful of code duplication issues in the AI provider layer. Overall health is good for a local dev tool; requires hardening before any networked deployment.
|
||||
**Analysis Date:** 2026-06-02
|
||||
|
||||
---
|
||||
|
||||
## Concerns by Severity
|
||||
## Security Concerns
|
||||
|
||||
### HIGH
|
||||
### JWT Algorithm Downgrade: HS256 Instead of ES256
|
||||
|
||||
**1. File type validation is defined but never enforced**
|
||||
`ALLOWED_MIME_TYPES` is defined in `backend/api/documents.py` but the upload handler never checks it — any file type is accepted. An attacker could upload executable files or crafted archives.
|
||||
|
||||
**2. No file size limit on uploads**
|
||||
The entire uploaded file is read before any cap is applied. A large file could exhaust memory or disk. No `MAX_UPLOAD_SIZE` check exists at the HTTP boundary.
|
||||
|
||||
**3. API keys stored in plain-text JSON**
|
||||
`backend/data/settings.json` stores API keys in plaintext. The volume mount in `docker-compose.yml` (`./backend/data:/app/data`) means any process with Docker access can read them. Masking only applies to API responses, not to disk.
|
||||
|
||||
**4. CORS fully open**
|
||||
`allow_origins=["*"]` in `main.py` means any website can make cross-origin requests to the API, including with credentials if ever added.
|
||||
|
||||
**5. Docker Compose mounts entire backend source as writable volume**
|
||||
`./backend:/app` gives the container write access to the host source tree. A path traversal or code execution bug in the app could overwrite source files.
|
||||
- **Risk:** CLAUDE.md specifies ES256 (asymmetric ECDSA P-256) as the required algorithm, but the implementation uses HS256 (symmetric HMAC-SHA256).
|
||||
- **Files:** `backend/services/auth.py` lines 99, 109, 132, 141
|
||||
- **Impact:** A leaked `SECRET_KEY` allows arbitrary token forgery. With HS256 any party that has the secret can forge access tokens, impersonate admin users, and bypass all auth checks. ES256 would require the private key for forgery while the public key could safely be distributed for verification.
|
||||
- **Fix approach:** Generate an ECDSA P-256 key pair, store the private key in an env var (`JWT_PRIVATE_KEY`), store the public key as `JWT_PUBLIC_KEY`. Update `create_access_token` to use `algorithm="ES256"` and `decode_access_token` / `decode_password_reset_token` to use the public key. Rotate all active refresh tokens after deploy.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### MEDIUM
|
||||
### No JTI Claim and No JTI Revocation in Redis
|
||||
|
||||
**6. Blocking I/O in async FastAPI handlers**
|
||||
`storage.py` uses synchronous file reads/writes and `filelock` blocking calls inside `async def` endpoints. This blocks the uvicorn event loop during every request. Should use `asyncio.to_thread()` or `aiofiles` (which is already in requirements but unused).
|
||||
|
||||
**7. Topic rename does not cascade to documents**
|
||||
Deleting a topic removes it from document metadata, but renaming is not implemented — there is no rename endpoint. Users have no way to rename a topic without losing document associations.
|
||||
|
||||
**8. `list_metadata` loads all documents before filtering**
|
||||
`storage.list_metadata()` reads all metadata JSON files on every list request. No pagination at the storage layer — O(N) disk reads per page request as the document count grows.
|
||||
|
||||
**9. `topic_doc_counts()` scans all metadata on every topic request**
|
||||
Every `GET /api/topics` call triggers a full scan of all metadata files to count documents per topic. Not cached; will degrade linearly.
|
||||
|
||||
**10. `MAX_AI_CHARS` duplicated across 3 files**
|
||||
The character truncation limit for AI input is duplicated as a magic constant in multiple provider files. The provider-level truncation is effectively dead code since `extractor.py` already truncates to `MAX_STORED_CHARS` (50,000).
|
||||
|
||||
**11. `_parse_classification` / `_parse_suggestions` duplicated between providers**
|
||||
`anthropic_provider.py` and `openai_provider.py` each define their own JSON parsing helpers for AI responses. `test_classifier.py` only imports from `openai_provider`, meaning the Anthropic variants are untested.
|
||||
|
||||
**12. `health_check()` makes real billed API calls**
|
||||
The "Test Connection" UI action calls `provider.health_check()`, which makes a real API call to Anthropic/OpenAI — incurring cost and latency every time the user tests connectivity. Should use a cheaper probe (e.g., list models endpoint or a cached status).
|
||||
- **Risk:** CLAUDE.md mandates JTI (JWT ID) in every access token stored in Redis for revocation, but the `create_access_token` function emits no `jti` claim and there is no check in `get_current_user`.
|
||||
- **Files:** `backend/services/auth.py` (create_access_token), `backend/deps/auth.py` (get_current_user)
|
||||
- **Impact:** Deactivated users can continue using valid access tokens until TTL expiry (up to 15 minutes). Password changes and account deactivations do not immediately invalidate active sessions (only refresh tokens are revoked — not the live access token in the client's Pinia store).
|
||||
- **Fix approach:** Add `jti=str(uuid.uuid4())` to the access token payload. In `get_current_user`, after successful decode, check `await redis.get(f"jti_revoked:{jti}")` and raise 401 if set. Add a `revoke_access_token(jti, ttl)` helper called from account deactivation and password change.
|
||||
- **Priority:** HIGH
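
The JTI fix described in the bullets above could look roughly like the following. This is a minimal sketch, not the project's implementation: `InMemoryRevocationStore`, `build_access_claims`, and `is_revoked` are hypothetical names, and the in-memory store stands in for Redis so the example runs without external services. Signing the claims with the JWT library is omitted.

```python
import time
import uuid
from typing import Optional

ACCESS_TTL_SECONDS = 15 * 60  # 15-minute access-token TTL, as stated above


class InMemoryRevocationStore:
    """Hypothetical stand-in for Redis: key -> monotonic expiry time."""

    def __init__(self) -> None:
        self._expiry: dict[str, float] = {}

    def setex(self, key: str, ttl_seconds: int) -> None:
        self._expiry[key] = time.monotonic() + ttl_seconds

    def exists(self, key: str) -> bool:
        expires_at = self._expiry.get(key)
        if expires_at is None:
            return False
        if expires_at <= time.monotonic():
            del self._expiry[key]  # lazily drop expired entries
            return False
        return True


def build_access_claims(user_id: str, now: Optional[float] = None) -> dict:
    """Return the claims to be signed; every token gets a unique jti."""
    issued_at = int(now if now is not None else time.time())
    return {
        "sub": user_id,
        "iat": issued_at,
        "exp": issued_at + ACCESS_TTL_SECONDS,
        "jti": str(uuid.uuid4()),
    }


def revoke_access_token(store: InMemoryRevocationStore, jti: str,
                        ttl: int = ACCESS_TTL_SECONDS) -> None:
    # The entry only needs to outlive the token itself.
    store.setex(f"jti_revoked:{jti}", ttl)


def is_revoked(store: InMemoryRevocationStore, claims: dict) -> bool:
    jti = claims.get("jti")
    # A token without a jti cannot be revoked, so treat it as invalid.
    return jti is None or store.exists(f"jti_revoked:{jti}")
```

In `get_current_user`, a `True` result from `is_revoked` would map to a 401 response.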
|
||||
|
||||
---
|
||||
|
||||
### LOW
|
||||
### No Token Fingerprint / Token Binding
|
||||
|
||||
**13. `uvicorn --reload` hardcoded in docker-compose.yml**
|
||||
Hot-reload is hardcoded in the production compose file. There is no separate `docker-compose.prod.yml` or build-arg to disable it.
|
||||
|
||||
**14. Unused `shutil` import in `storage.py`**
|
||||
`import shutil` appears in `storage.py` but is never used.
|
||||
|
||||
**15. Topic IDs are 8-character UUID prefixes**
|
||||
`str(uuid.uuid4())[:8]` generates IDs with ~4 billion combinations — low collision risk for personal use but not safe at scale or for security-sensitive identifiers.
|
||||
|
||||
**16. `classify_document` request body uses raw `dict`, not a Pydantic model**
|
||||
The reclassify endpoint accepts an unvalidated `dict` body. Invalid input causes an unformatted 500 rather than a clean 422 validation error.
|
||||
|
||||
**17. No global frontend error handling**
|
||||
There is no Vue error boundary or global `window.onerror` / `app.config.errorHandler`. Failed API calls in stores may surface as silent failures or unhandled promise rejections.
|
||||
|
||||
**18. No document download endpoint**
|
||||
Uploaded files are stored in `data/uploads/` but there is no `GET /api/documents/:id/file` endpoint to retrieve the original binary. Files are effectively write-only through the UI.
|
||||
|
||||
**19. `aiofiles` in requirements but never used**
|
||||
`aiofiles>=23.2` is listed in `requirements.txt` but no code imports it. The blocking I/O concern (item 6) should use it.
|
||||
- **Risk:** CLAUDE.md requires a `fgp` (fingerprint) claim = HMAC of `User-Agent + Accept-Language`, validated on every request. This is absent.
|
||||
- **Files:** `backend/services/auth.py`, `backend/deps/auth.py`
|
||||
- **Impact:** Stolen access tokens can be replayed from any device or browser. Token binding would restrict a stolen token to clients that present the same `User-Agent` and `Accept-Language` headers.
|
||||
- **Fix approach:** On login, compute `fgp = hmac.new(key, (user_agent + accept_lang).encode(), sha256).hexdigest()[:16]`. Embed in JWT payload. In `get_current_user`, recompute and compare with `hmac.compare_digest`.
|
||||
- **Priority:** MEDIUM
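
The fingerprint computation from the fix approach above can be sketched with the standard library alone. The function names are hypothetical, and the newline separator between the two header values is an assumption added here to avoid ambiguous concatenations; it is not part of the original proposal.

```python
import hashlib
import hmac


def compute_fingerprint(key: bytes, user_agent: str, accept_language: str) -> str:
    # Assumption: a separator keeps ("ab", "c") distinct from ("a", "bc").
    message = f"{user_agent}\n{accept_language}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()[:16]


def fingerprint_matches(key: bytes, claimed: str,
                        user_agent: str, accept_language: str) -> bool:
    expected = compute_fingerprint(key, user_agent, accept_language)
    # Constant-time comparison, as recommended in the fix approach.
    return hmac.compare_digest(expected, claimed)
```

At login the result of `compute_fingerprint` would be embedded as the `fgp` claim; on each request `fingerprint_matches` would be called with the claim and the current request headers.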
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
### Password Change Does Not Revoke Active Sessions
|
||||
|
||||
- Production deployment path is undefined (no nginx, no TLS, no auth)
|
||||
- OCR language support for pytesseract is not configured (defaults to English only)
|
||||
- `suggest_topics` method on all providers is untested — unclear if it is used in the current UI flow
|
||||
- No backup or recovery strategy for `data/` volume
|
||||
- **Risk:** `POST /api/auth/change-password` updates `password_hash` and writes an audit log but never calls `revoke_all_refresh_tokens`. CLAUDE.md mandates "Password change… immediately revoke all active sessions."
|
||||
- **Files:** `backend/api/auth.py` lines 446–495
|
||||
- **Impact:** An attacker who has a valid refresh cookie can continue rotating tokens even after the account owner changes their password.
|
||||
- **Fix approach:** Add `await auth_service.revoke_all_refresh_tokens(session, current_user.id)` after the password hash update, before `session.commit()`, and also invalidate all JTIs for that user in Redis (once JTI is implemented).
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### TOTP Disable Does Not Revoke Active Sessions
|
||||
|
||||
- **Risk:** `DELETE /api/auth/totp` clears the TOTP secret and disables TOTP but does not call `revoke_all_refresh_tokens`. CLAUDE.md mandates revocation on "TOTP enroll/revoke."
|
||||
- **Files:** `backend/api/auth.py` lines 587–616
|
||||
- **Impact:** An attacker who triggered TOTP removal (via CSRF or compromised session) and has a refresh token continues to operate as an authenticated user with no second factor.
|
||||
- **Fix approach:** Add `await auth_service.revoke_all_refresh_tokens(session, current_user.id)` in `disable_totp` before `session.commit()`.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### Health Endpoint Exposes Internal Error Details Without Auth
|
||||
|
||||
- **Risk:** `GET /health` returns full Python exception class names and messages (e.g. `"error: OperationalError: (psycopg.OperationalError) …"`) with no authentication requirement. The comment at line 144 (T-01-05-03) acknowledges this but defers the fix to "Phase 2."
|
||||
- **Files:** `backend/main.py` lines 136–167
|
||||
- **Impact:** Exposes DB driver versions, hostnames, and connection string fragments to unauthenticated callers. Information useful for targeted attacks.
|
||||
- **Fix approach:** Replace `f"error: {type(e).__name__}: {e}"` with `"error"` in non-debug mode. Log the detail server-side only. Optionally require admin Bearer token for the detailed form.
|
||||
- **Priority:** MEDIUM
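
One way to express the fix above is a small helper that always logs the detail server-side and only returns it to the caller in debug mode. `describe_failure` is a hypothetical name; how the `debug` flag is derived from settings is left open.

```python
import logging

logger = logging.getLogger("health")


def describe_failure(exc: Exception, debug: bool) -> str:
    """Return the status string for a failed dependency check."""
    # Full detail always goes to the server log.
    logger.error("health check failed", exc_info=exc)
    if debug:
        return f"error: {type(exc).__name__}: {exc}"
    # Unauthenticated callers only learn that something is wrong.
    return "error"
```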
|
||||
|
||||
---
|
||||
|
||||
### Default Secrets Shipped in Code
|
||||
|
||||
- **Risk:** `backend/config.py` hardcodes `secret_key = "CHANGEME"`, `cloud_creds_key = "CHANGEME-32-bytes-padded!!"`, and `minio_secret_key = "changeme_minio_app"` as Pydantic field defaults.
|
||||
- **Files:** `backend/config.py` lines 31, 61, 21
|
||||
- **Impact:** If deployed without overriding env vars, production tokens are signed with the known `CHANGEME` key, all cloud credentials can be decrypted by anyone with the source code, and MinIO uses a known password. Critical misconfiguration vector.
|
||||
- **Fix approach:** Change the defaults to `""` and add a startup validator (`@model_validator(mode="after")`) that raises `ValueError` in production (`DEBUG=false`) when any of these fields is empty or still equals its placeholder value. In dev, log a WARNING when a default is detected.
|
||||
- **Priority:** HIGH
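
The startup check could be sketched as a plain function, independent of Pydantic, so the logic is easy to test. `PLACEHOLDER_SECRETS` and `check_secrets` are hypothetical names; in the real settings class the same logic would sit inside the validator mentioned above.

```python
import logging

logger = logging.getLogger("config")

# Placeholder values quoted in the Risk bullet above.
PLACEHOLDER_SECRETS = {
    "secret_key": "CHANGEME",
    "cloud_creds_key": "CHANGEME-32-bytes-padded!!",
    "minio_secret_key": "changeme_minio_app",
}


def check_secrets(values: dict, debug: bool) -> list:
    """Return insecure field names; raise in production if any exist."""
    insecure = sorted(
        name
        for name, placeholder in PLACEHOLDER_SECRETS.items()
        if values.get(name, "") in ("", placeholder)
    )
    if insecure and not debug:
        raise ValueError(f"insecure default secrets: {', '.join(insecure)}")
    for name in insecure:
        logger.warning("default secret in use for %s", name)
    return insecure
```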
|
||||
|
||||
---
|
||||
|
||||
### `default_storage_backend` Not Validated Against Allowlist
|
||||
|
||||
- **Risk:** `PATCH /api/users/me/default-storage` accepts `body.backend` as a free string and writes it directly to the DB with no allowlist validation.
|
||||
- **Files:** `backend/api/cloud.py` lines 927–946
|
||||
- **Impact:** A user can set `default_storage_backend` to any arbitrary string. A future code path using it as a routing key could allow bypassing the `_CLOUD_PROVIDERS` allowlist.
|
||||
- **Fix approach:** Validate `body.backend in {"minio", "google_drive", "onedrive", "nextcloud", "webdav"}` before the DB write. Use a `Literal` type or `@field_validator` on `DefaultStorageRequest`.
|
||||
- **Priority:** MEDIUM
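
A minimal sketch of the allowlist check, assuming the five backend names listed in the fix approach are the complete set. `StorageBackend` and `validate_backend` are hypothetical names; with Pydantic, annotating the request field with the same `Literal` type would give equivalent validation.

```python
from typing import Literal, get_args

StorageBackend = Literal["minio", "google_drive", "onedrive", "nextcloud", "webdav"]
ALLOWED_BACKENDS = frozenset(get_args(StorageBackend))


def validate_backend(value: str) -> str:
    """Reject any backend name that is not on the allowlist."""
    if value not in ALLOWED_BACKENDS:
        raise ValueError(f"unsupported storage backend: {value!r}")
    return value
```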
|
||||
|
||||
---
|
||||
|
||||
### `X-Forwarded-For` Trusted for IP Rate Limiting Without Proxy Enforcement
|
||||
|
||||
- **Risk:** The IP-level rate limiter (`slowapi`) uses `get_remote_address` which reads `X-Forwarded-For`. Without a trusted reverse proxy normalizing this header, an attacker can bypass the IP rate limit.
|
||||
- **Files:** `backend/api/auth.py` line 44; `backend/deps/utils.py`
|
||||
- **Impact:** Attackers can bypass the 10 req/min IP-level limit on login, register, and TOTP endpoints by spoofing the forwarded IP on each request.
|
||||
- **Fix approach:** In Docker Compose, front the backend with nginx configured to set `X-Forwarded-For` from `$remote_addr`, stripping any client-supplied value. Document this as a mandatory production requirement.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### Email HTML Body Uses Unsanitized Server-Supplied Link
|
||||
|
||||
- **Risk:** `send_password_reset_email` builds an HTML body via f-string with `reset_link` directly in an `<a href='…'>` attribute without HTML-escaping.
|
||||
- **Files:** `backend/services/email.py` line 47; line ~105 (security alert email)
|
||||
- **Impact:** If `reset_link` contains a single-quote (possible under certain URL encoding), the HTML attribute breaks. Low-severity HTML injection risk that violates defense-in-depth.
|
||||
- **Fix approach:** Use `html.escape(reset_link, quote=True)` when embedding the link in the HTML body.
|
||||
- **Priority:** LOW
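
The escaping step can be illustrated as follows. `build_reset_email_html` is a hypothetical, simplified stand-in for the real email builder; only the `html.escape` call is the point.

```python
import html


def build_reset_email_html(reset_link: str) -> str:
    # quote=True also escapes single and double quotes, so the link
    # cannot terminate the href attribute early.
    safe_link = html.escape(reset_link, quote=True)
    return f"<p><a href='{safe_link}'>Reset your password</a></p>"
```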
|
||||
|
||||
---
|
||||
|
||||
### Audit Log Written After Commit in `delete_folder`
|
||||
|
||||
- **Risk:** In `DELETE /api/folders/{folder_id}`, `session.commit()` is called at line 424 and `write_audit_log()` is called at line 426 — after the commit, in a separate implicit transaction.
|
||||
- **Files:** `backend/api/folders.py` lines 424–435
|
||||
- **Impact:** If the audit log write fails (DB error, constraint violation), the folder is already deleted with no audit record. Inconsistent with the WR-08 pattern used by `delete_document` (`auto_commit=False`).
|
||||
- **Fix approach:** Move `write_audit_log()` before `session.commit()`, following the pattern used in `api/documents.py::delete_document`.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### OAuth Callback Error Redirect May Leak Internal Exception Details
|
||||
|
||||
- **Risk:** In `oauth_callback`, the `except Exception as exc` block redirects to `frontend_url/settings?cloud_error={urllib.parse.quote(str(exc))}`. Exception strings from google-auth-oauthlib or msal may include OAuth client secrets, state values, or internal URL fragments.
|
||||
- **Files:** `backend/api/cloud.py` lines 541–546
|
||||
- **Impact:** Exception details appear in the browser URL bar, referrer headers, browser history, and server access logs.
|
||||
- **Fix approach:** Map exception types to user-safe generic messages (`"auth_failed"`, `"connection_error"`). Log the real exception server-side at ERROR level. Only pass an opaque error code in the redirect.
|
||||
- **Priority:** MEDIUM
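
The mapping from exceptions to opaque error codes might be sketched like this. The exception-to-code table is illustrative only (the real provider libraries raise their own exception types), and `error_redirect_url` is a hypothetical name.

```python
import logging
from urllib.parse import urlencode

logger = logging.getLogger("cloud.oauth")

# Illustrative mapping; unknown exception types fall back to "auth_failed".
_ERROR_CODES = (
    (TimeoutError, "connection_error"),
    (ConnectionError, "connection_error"),
)


def error_redirect_url(frontend_url: str, exc: Exception) -> str:
    # The real exception stays in the server log only.
    logger.error("OAuth callback failed", exc_info=exc)
    code = next(
        (c for exc_type, c in _ERROR_CODES if isinstance(exc, exc_type)),
        "auth_failed",
    )
    return f"{frontend_url}/settings?{urlencode({'cloud_error': code})}"
```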
|
||||
|
||||
---
|
||||
|
||||
## Performance Concerns
|
||||
|
||||
### N+1 Query Pattern in `list_metadata` / `list_documents`
|
||||
|
||||
- **Risk:** `services/storage.py::list_metadata` loads all documents then calls `_load_topic_names(session, doc.id)` in a Python loop — one DB round-trip per document. The same pattern repeats in the `list_documents` handler's non-legacy code path.
|
||||
- **Files:** `backend/services/storage.py` lines 136–139; `backend/api/documents.py` lines 501–506
|
||||
- **Impact:** For a user with 100 documents, a single list request issues 101 DB queries. At 1000 documents, 1001 queries. Response time degrades linearly as the library grows.
|
||||
- **Fix approach:** Replace with a single JOIN query using PostgreSQL's `array_agg(t.name)` grouped by document. Or use a subquery fetching all document-topic associations for the user in one query and merging in Python.
|
||||
- **Priority:** HIGH
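
The second option in the fix approach (fetch all associations in one query, merge in Python) can be sketched with `sqlite3` so it runs standalone. Table and column names here are assumptions for illustration, not the project's schema; the point is that the query count stays at two regardless of library size.

```python
import sqlite3
from collections import defaultdict


def list_documents_with_topics(conn: sqlite3.Connection, user_id: int) -> list:
    # Query 1: the user's documents.
    docs = conn.execute(
        "SELECT id, filename FROM documents WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()
    # Query 2: every (document, topic name) pair for that user at once.
    pairs = conn.execute(
        "SELECT dt.document_id, t.name "
        "FROM document_topics dt "
        "JOIN topics t ON t.id = dt.topic_id "
        "JOIN documents d ON d.id = dt.document_id "
        "WHERE d.user_id = ? ORDER BY t.name",
        (user_id,),
    ).fetchall()
    topics_by_doc = defaultdict(list)
    for doc_id, name in pairs:
        topics_by_doc[doc_id].append(name)
    return [
        {"id": doc_id, "filename": filename, "topics": topics_by_doc.get(doc_id, [])}
        for doc_id, filename in docs
    ]
```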
|
||||
|
||||
---
|
||||
|
||||
### Entire File Loaded into Memory for Download and Task Processing
|
||||
|
||||
- **Risk:** `GET /api/documents/{id}/content` calls `await storage_backend.get_object(…)` which returns full `bytes`, loads them into a list, and returns `StreamingResponse(iter([file_bytes]))`. The Celery extraction task also buffers the full file.
|
||||
- **Files:** `backend/api/documents.py` lines 792, 827–831; `backend/tasks/document_tasks.py` line 74
|
||||
- **Impact:** A 100 MB file consumes 100 MB of heap per concurrent request. With 10 simultaneous downloads, the worker needs 1 GB just for file buffers. The 100 MB quota mitigates this today but does not scale.
|
||||
- **Fix approach:** For MinIO, return presigned GET URLs with short TTL instead of proxying through FastAPI. For cloud backends, pipe the provider HTTP response stream directly. For Celery extraction, stream text extraction from bytes in chunks.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### No Upload Size Pre-Validation
|
||||
|
||||
- **Risk:** `POST /api/documents/upload` (cloud path) reads the entire file via `await file.read()` before any quota or size check. FastAPI has no global `max_upload_size` configured.
|
||||
- **Files:** `backend/api/documents.py` line 207
|
||||
- **Impact:** A malicious user can upload a multi-gigabyte file, exhausting FastAPI worker memory before the quota check fires.
|
||||
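- **Fix approach:** Check the `Content-Length` header at endpoint entry; reject with 413 if it exceeds a configurable `MAX_UPLOAD_BYTES` limit. Also enforce a body-size cap in middleware or at the reverse proxy (for nginx, `client_max_body_size`). Note that uvicorn's `--limit-max-requests` caps the number of requests a worker serves before restarting, not the body size, so it does not help here.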
|
||||
- **Priority:** MEDIUM
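
A rough sketch of the two checks involved: a cheap pre-check on the declared length, plus a capped read for requests that omit or understate `Content-Length`. `precheck_status`, `read_capped`, and `PayloadTooLarge` are hypothetical names, and the 100 MB limit is an assumed value.

```python
import io
from typing import Optional

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumed limit for illustration


class PayloadTooLarge(Exception):
    """Raised when the body exceeds the configured limit."""


def precheck_status(content_length: Optional[str],
                    limit: int = MAX_UPLOAD_BYTES) -> Optional[int]:
    """Return an HTTP status to reject with, or None to continue."""
    if content_length is None:
        return None  # no declared length: rely on the capped read below
    try:
        declared = int(content_length)
    except ValueError:
        return 400
    if declared < 0:
        return 400
    return 413 if declared > limit else None


def read_capped(stream: io.BufferedIOBase, limit: int = MAX_UPLOAD_BYTES,
                chunk_size: int = 64 * 1024) -> bytes:
    """Read in chunks and stop as soon as the limit is exceeded."""
    chunks, total = [], 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            raise PayloadTooLarge(f"body exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```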
|
||||
|
||||
---
|
||||
|
||||
### `revoke_all_refresh_tokens` Issues One UPDATE per Token
|
||||
|
||||
- **Risk:** `services/auth.py::revoke_all_refresh_tokens` loads all active refresh token rows into Python, then marks each `revoked=True` individually via ORM, issuing one UPDATE statement per token.
|
||||
- **Files:** `backend/services/auth.py` lines 218–237
|
||||
- **Impact:** A user with many active sessions (e.g. 50 devices) causes 50 individual UPDATE statements on sign-out-all. Could be replaced with a single bulk UPDATE.
|
||||
- **Fix approach:** Replace with `UPDATE refresh_tokens SET revoked = true WHERE user_id = :uid AND revoked = false` and count affected rows via `result.rowcount`.
|
||||
- **Priority:** LOW
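
The bulk UPDATE can be demonstrated with `sqlite3`. This is a sketch under assumptions: the table layout is simplified, and `1`/`0` replace `true`/`false` for SQLite portability. In the real service the same statement would go through the async SQLAlchemy session.

```python
import sqlite3


def revoke_all_refresh_tokens(conn: sqlite3.Connection, user_id: int) -> int:
    """Revoke every active token in one statement; return rows affected."""
    cursor = conn.execute(
        "UPDATE refresh_tokens SET revoked = 1 "
        "WHERE user_id = ? AND revoked = 0",
        (user_id,),
    )
    return cursor.rowcount
```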
|
||||
|
||||
---
|
||||
|
||||
### FTS Falls Back Silently on Any Exception
|
||||
|
||||
- **Risk:** The FTS code path in `list_documents` wraps the FTS query in `except Exception:` and falls back to an unfiltered query.
|
||||
- **Files:** `backend/api/documents.py` lines 486–489
|
||||
- **Impact:** Any PostgreSQL error causes silent fallback — the user sees all their documents when they searched for a term, with no indication of failure.
|
||||
- **Fix approach:** Narrow the catch to `sqlalchemy.exc.OperationalError` (for SQLite compat in tests only) and log all other exceptions at ERROR level before re-raising.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
## Reliability Concerns
|
||||
|
||||
### Email Queue Worker Missing From Docker Compose
|
||||
|
||||
- **Risk:** Celery routes email tasks to the `email` queue but `docker-compose.yml` defines only one Celery worker consuming `-Q documents`. No worker processes the `email` queue.
|
||||
- **Files:** `backend/celery_app.py` line 36; `docker-compose.yml` line 96
|
||||
- **Impact:** Password reset emails, security alert emails (refresh token reuse detection), and backup code emails are silently enqueued but never delivered. Callers receive 202 but emails never arrive.
|
||||
- **Fix approach:** Add a `celery-worker-email` service in `docker-compose.yml` consuming `-Q email`, or update the existing worker command to `-Q documents,email`.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### `documents.updated_at` Not Auto-Updated on Row Changes
|
||||
|
||||
- **Risk:** `Document.updated_at` is declared with `server_default=func.now()` but no `onupdate` trigger. When `extracted_text`, `status`, or `filename` is changed, `updated_at` stays as the creation timestamp.
|
||||
- **Files:** `backend/db/models.py` lines 192–194
|
||||
- **Impact:** `classified_at` in `_doc_to_dict` is computed from `doc.updated_at` when `status == "classified"` — if `updated_at` is stale, the displayed timestamp is incorrect. Sort-by-date after reclassification is also wrong.
|
||||
- **Fix approach:** Add a PostgreSQL `BEFORE UPDATE` trigger that sets `updated_at = now()`, or add `onupdate=func.now()` to the mapped column. The `onupdate` default only applies to UPDATE statements issued through SQLAlchemy, so raw SQL updates would still need the trigger.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### Celery Task Result Backend Accumulates Without Expiry
|
||||
|
||||
- **Risk:** Celery is configured to use Redis as result backend with no `result_expires` setting. Task results accumulate in Redis indefinitely.
|
||||
- **Files:** `backend/celery_app.py` lines 23–24
|
||||
- **Impact:** Redis memory grows unboundedly over time, potentially causing OOM which would also break rate limiting and TOTP replay prevention.
|
||||
- **Fix approach:** Add `celery_app.conf.result_expires = 3600` or disable the result backend entirely since no code reads task results.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### Breadcrumb Builder in `get_folder` Has No Depth Limit
|
||||
|
||||
- **Risk:** The breadcrumb builder in `GET /api/folders/{folder_id}` walks up the parent chain iteratively with a `visited` set but no maximum depth cap.
|
||||
- **Files:** `backend/api/folders.py` lines 234–247
|
||||
- **Impact:** With a deeply nested folder tree (e.g. 200 levels of nesting), the loop issues 200 sequential DB round-trips before terminating.
|
||||
- **Fix approach:** Add `if len(crumbs) >= 20: break` to cap at a reasonable depth.
|
||||
- **Priority:** LOW
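
The depth cap combined with the existing `visited` guard could look like this. It is a sketch: the `parents` dict is a hypothetical stand-in for the per-level parent lookup that the real handler performs against the database.

```python
MAX_BREADCRUMB_DEPTH = 20  # cap suggested in the fix approach


def build_breadcrumbs(folder_id, parents: dict,
                      max_depth: int = MAX_BREADCRUMB_DEPTH) -> list:
    """Walk up the parent chain, stopping on cycles or at max_depth."""
    crumbs, visited = [], set()
    current = folder_id
    while current is not None and current not in visited:
        if len(crumbs) >= max_depth:
            break
        visited.add(current)
        crumbs.append(current)
        current = parents.get(current)  # one lookup per level
    crumbs.reverse()  # root-most ancestor first
    return crumbs
```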
|
||||
|
||||
---
|
||||
|
||||
## Code Quality Concerns
|
||||
|
||||
### Duplicate Inline IP Extraction (Not Using `get_client_ip`)
|
||||
|
||||
- **Risk:** Several endpoints extract the client IP inline instead of using `deps/utils.py::get_client_ip()`.
|
||||
- **Files:** `backend/api/documents.py` lines 269–271, 376; `backend/api/cloud.py` lines 624, 753
|
||||
- **Impact:** If the trusted-proxy logic changes, all inline copies must be updated individually.
|
||||
- **Fix approach:** Replace all inline `request.headers.get("X-Forwarded-For") or request.client.host` with `get_client_ip(request)`.
|
||||
- **Priority:** LOW
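
For illustration, a centralized helper might take the following shape. The actual `get_client_ip` in `deps/utils.py` is not shown in this document, so the signature and the trusted-proxy logic here are assumptions; the sketch only shows why a single definition is easier to change than several inline copies.

```python
from typing import Optional


def get_client_ip(peer_ip: str, forwarded_for: Optional[str],
                  trusted_proxies: frozenset) -> str:
    """Honor X-Forwarded-For only when the peer is a trusted proxy."""
    if peer_ip in trusted_proxies and forwarded_for:
        # The right-most entry is the one appended by the trusted proxy.
        return forwarded_for.split(",")[-1].strip()
    return peer_ip
```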
|
||||
|
||||
---
|
||||
|
||||
### `Document.status` Is an Unconstrained String Column
|
||||
|
||||
- **Risk:** `Document.status` is `String, nullable=False, default="pending"` with no DB-level CHECK constraint or Python enum. Values `"pending"`, `"uploaded"`, `"classified"`, `"classification_failed"` are used in code but not enforced.
|
||||
- **Files:** `backend/db/models.py` line 188
|
||||
- **Impact:** A typo in a task or direct DB write silently sets an invalid status, causing silent bugs in status-checking code (e.g. `classified_at` timestamp never shown).
|
||||
- **Fix approach:** Add a migration with `ALTER TABLE documents ADD CONSTRAINT ck_documents_status CHECK (status IN ('pending', 'uploaded', 'classified', 'classification_failed'))`.
|
||||
- **Priority:** LOW
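
The effect of the CHECK constraint can be shown with `sqlite3`, which supports the same clause. The table is reduced to two columns for illustration, and `create_documents_table` and `set_status` are hypothetical helpers; the production change would be an Alembic-style migration against PostgreSQL.

```python
import sqlite3

STATUSES = ("pending", "uploaded", "classified", "classification_failed")


def create_documents_table(conn: sqlite3.Connection) -> None:
    allowed = ", ".join(f"'{s}'" for s in STATUSES)
    conn.execute(
        "CREATE TABLE documents ("
        "id INTEGER PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending' "
        f"CONSTRAINT ck_documents_status CHECK (status IN ({allowed})))"
    )


def set_status(conn: sqlite3.Connection, doc_id: int, status: str) -> bool:
    """Return False when the database rejects an invalid status."""
    try:
        conn.execute(
            "UPDATE documents SET status = ? WHERE id = ?", (status, doc_id)
        )
        return True
    except sqlite3.IntegrityError:
        return False
```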
|
||||
|
||||
---
|
||||
|
||||
### Stale Wave-2 Comment in `documents.py` Module Docstring
|
||||
|
||||
- **Risk:** `backend/api/documents.py` lines 19–20 contain `"NOTE (Wave 2): No auth guards on any endpoint yet — Plan 03-03 adds get_current_user…"` — this is false; all handlers use `get_regular_user`.
|
||||
- **Files:** `backend/api/documents.py` lines 19–20
|
||||
- **Impact:** Misleads reviewers into thinking auth is not applied, potentially causing incorrect security assessments.
|
||||
- **Fix approach:** Remove or replace the stale NOTE comment.
|
||||
- **Priority:** LOW
|
||||
|
||||
---
|
||||
|
||||
### `classify_document` Endpoint Uses Mutable Default and Unvalidated Dict Body
|
||||
|
||||
- **Risk:** `POST /api/documents/{doc_id}/classify` has `body: dict = {}` — mutable default argument antipattern and no Pydantic validation.
|
||||
- **Files:** `backend/api/documents.py` line 695
|
||||
- **Impact:** Static analysis confusion; unvalidated request body accepts arbitrary JSON keys.
|
||||
- **Fix approach:** Define `class ClassifyRequest(BaseModel): topics: Optional[list[str]] = None` and replace `body: dict = {}`.
|
||||
- **Priority:** LOW
|
||||
|
||||
---
|
||||
|
||||
## Missing Tests / Coverage Gaps
|
||||
|
||||
### No Tests for JWT Algorithm, JTI, or Token Binding
|
||||
|
||||
- **Risk:** No tests verify the JWT algorithm, JTI presence/validation, or token binding.
|
||||
- **Files:** `backend/tests/test_auth_deps.py`, `backend/tests/test_auth_api.py`
|
||||
- **Impact:** Algorithm or claim changes would not be caught. The security invariants are untested.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### Password Change Has No Session-Revocation Test
|
||||
|
||||
- **Risk:** No test verifies that changing a password invalidates existing refresh tokens.
|
||||
- **Files:** `backend/tests/test_auth_api.py`
|
||||
- **Fix approach:** Add test: register → login (obtain refresh cookie) → change password → assert old refresh cookie returns 401.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### No Frontend E2E Tests
|
||||
|
||||
- **Risk:** The frontend has Vitest unit tests for 3 stores and a small set of components, but no Playwright or Cypress E2E tests for critical user flows.
|
||||
- **Files:** `frontend/src/stores/__tests__/`, `frontend/src/views/__tests__/`
|
||||
- **Impact:** Breaking changes in API contract, router guards, or component interactions are not caught until manual testing. The upload flow, TOTP enrollment, and admin operations have no automated coverage.
|
||||
- **Fix approach:** Add Playwright E2E tests for: login → upload → view → share → recipient download.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### No Regression Test for `delete_folder` Audit Log Ordering
|
||||
|
||||
- **Risk:** The audit log after-commit ordering issue in `delete_folder` has no test to prevent regression after fixing.
|
||||
- **Files:** `backend/tests/test_folders.py`
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### Quota Concurrency Tests Run Against SQLite, Not PostgreSQL
|
||||
|
||||
- **Risk:** Quota enforcement tests run against SQLite in the default test config. CLAUDE.md specifies "integration tests against real PostgreSQL (not SQLite for quota/UUID tests)."
|
||||
- **Files:** `backend/tests/test_quota.py`; `backend/tests/conftest.py`
|
||||
- **Impact:** A race condition in the atomic quota UPDATE would only be detectable with concurrent clients on real PostgreSQL.
|
||||
- **Fix approach:** Mark quota atomicity tests with `@pytest.mark.skipif(not live_services_available, ...)` and add a concurrent-upload test using `asyncio.gather`.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### Pinia Stores `documents.js` and `topics.js` Have No Unit Tests
|
||||
|
||||
- **Risk:** The stores for documents and topics — which implement pagination, filtering, and topic assignment logic — have no tests in `frontend/src/stores/__tests__/`.
|
||||
- **Files:** `frontend/src/stores/documents.js`, `frontend/src/stores/topics.js`
|
||||
- **Priority:** LOW
|
||||
|
||||
---
|
||||
|
||||
## Dependency Risks
|
||||
|
||||
### All Backend Dependencies Use Floor `>=` Version Pins
|
||||
|
||||
- **Risk:** `backend/requirements.txt` uses `>=` for all packages including security-critical ones: `PyJWT>=2.8.0`, `pwdlib[argon2]>=0.2.1`, `cryptography>=41.0.0`, `fastapi>=0.111`.
|
||||
- **Files:** `backend/requirements.txt`
|
||||
- **Impact:** `pip install` resolves to the latest available version at build time. A breaking change or vulnerability in any dependency silently takes effect on the next Docker build. CLAUDE.md mandates exact version pinning for security-critical packages.
|
||||
- **Fix approach:** Run `pip freeze > requirements.lock` to generate an exact pinned lockfile. Use `pip-tools` or `uv lock` to manage upgrades. At minimum, pin `PyJWT`, `pwdlib`, `cryptography`, and `fastapi` to exact versions.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### `minio/minio:latest` Tag in Docker Compose
|
||||
|
||||
- **Risk:** `docker-compose.yml` uses `image: minio/minio:latest` — a floating tag that pulls a new release on `docker compose pull`.
|
||||
- **Files:** `docker-compose.yml` line 19
|
||||
- **Impact:** Breaking MinIO API changes or security regressions in a new release could break file storage without warning.
|
||||
- **Fix approach:** Pin to a specific MinIO release tag (e.g. `minio/minio:RELEASE.2024-11-07T00-52-20Z`).
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
## Infrastructure and Operational Concerns
|
||||
|
||||
### No Reverse Proxy / TLS Termination in Production Setup
|
||||
|
||||
- **Risk:** `docker-compose.yml` exposes the FastAPI backend on port 8000 and frontend on port 5173 directly, with no nginx or Caddy container for TLS termination or `X-Forwarded-For` normalization.
|
||||
- **Files:** `docker-compose.yml`
|
||||
- **Impact:** (1) The refresh cookie is set with `secure=True`, and browsers only send `Secure` cookies over HTTPS (localhost aside), so without TLS the flag cannot protect anything and token refresh may fail on plain-HTTP origins. (2) IP rate limiting is spoofable. (3) Credentials and access tokens travel in cleartext.
|
||||
- **Fix approach:** Add an nginx service to `docker-compose.yml` that terminates TLS (Let's Encrypt or self-signed), proxies `/api/` to the backend, and sets `proxy_set_header X-Forwarded-For $remote_addr`.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### MinIO Uses Plain HTTP Between Containers
|
||||
|
||||
- **Risk:** `Minio(…, secure=False)` — all object data travels over HTTP between FastAPI and MinIO containers.
|
||||
- **Files:** `backend/main.py` line 82; `backend/storage/__init__.py` line 48
|
||||
- **Impact:** An attacker with access to the Docker network can intercept document bytes in transit. Critical if containers share a host with untrusted workloads.
|
||||
- **Fix approach:** Enable TLS on MinIO (`secure=True`) or document the trust model explicitly. For shared-host deployments, configure mTLS between containers.
|
||||
- **Priority:** MEDIUM (acceptable on isolated Docker bridge; critical on shared host)
|
||||
|
||||
---
|
||||
|
||||
### No Backup Strategy for PostgreSQL or MinIO Data
|
||||
|
||||
- **Risk:** `docker-compose.yml` uses named volumes (`postgres_data`, `minio_data`) with no backup tooling, retention policy, or point-in-time recovery.
|
||||
- **Files:** `docker-compose.yml` lines 138–140
|
||||
- **Impact:** A disk failure, container wipe, or accidental `docker volume rm` causes permanent loss of all user documents, credentials, audit logs, and accounts.
|
||||
- **Fix approach:** Add a `backup` service running `pg_dump` on a schedule (e.g. via `ofelia` or a cron sidecar), compressing and shipping to an off-site store. Configure MinIO `mc mirror` to a second bucket or provider. Document RTO/RPO targets.
|
||||
- **Priority:** HIGH
|
||||
|
||||
---
|
||||
|
||||
### Redis Has No Persistence Configuration
|
||||
|
||||
- **Risk:** Redis is started with only `--requirepass`. No `--save` or `--appendonly yes` flags are set, making all Redis data ephemeral.
|
||||
- **Files:** `docker-compose.yml` line 42
|
||||
- **Impact:** A Redis restart clears all rate-limit counters (brief brute-force window on auth endpoints), TOTP replay prevention keys (30-second replay window reopens), and pending OAuth state tokens.
|
||||
- **Fix approach:** Add `--save 60 1 --appendonly yes` to the Redis command and mount a Redis data volume. Document that Redis restart is a brief security event requiring monitoring.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### Docker Compose Mounts Source Code as Live Volume
|
||||
|
||||
- **Risk:** `docker-compose.yml` mounts `./backend:/app` and `./frontend/src:/app/src` as live volumes (appropriate for dev hot-reload but dangerous in production if the same file is used).
|
||||
- **Files:** `docker-compose.yml` lines 53–54, 131–132
|
||||
- **Impact:** In production, host filesystem modifications immediately affect the running container without a deploy cycle.
|
||||
- **Fix approach:** Create a `docker-compose.prod.yml` that omits the volume mounts and uses the Dockerfile `COPY . .` layer only. Document the two-file strategy clearly.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### Dockerfile Runs Application as Root
|
||||
|
||||
- **Risk:** `backend/Dockerfile` uses `FROM python:3.12-slim` with no `USER` directive. FastAPI and Celery run as root inside the container.
|
||||
- **Files:** `backend/Dockerfile`
|
||||
- **Impact:** A container escape vulnerability or SSRF leading to RCE gives the attacker root-equivalent access to the container filesystem.
|
||||
- **Fix approach:** Add `RUN adduser --disabled-password --gecos "" appuser && chown -R appuser /app` and `USER appuser` before `EXPOSE 8000`.
|
||||
- **Priority:** MEDIUM
|
||||
|
||||
---
|
||||
|
||||
### No Structured Logging, Metrics, or Alerting
|
||||
|
||||
- **Risk:** All logging uses Python's stdlib logger with no structured format, no Prometheus/StatsD metrics endpoint, no error aggregation service, and no alerting on security events in the audit log.
|
||||
- **Files:** All backend files
|
||||
- **Impact:** Silent failures — email queue not processing, repeated TOTP replay attempts, brute-force login spikes — go undetected. Failed Celery tasks log to stderr with no aggregation. The security alert email on refresh token reuse is the only active notification mechanism.
|
||||
- **Fix approach:** Add `structlog` for JSON-formatted structured logs. Add a `/metrics` endpoint with `prometheus-fastapi-instrumentator`. Configure alerting on `auth.login_failed` count spikes in the audit log.
|
||||
- **Priority:** MEDIUM
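
To show what JSON-formatted log lines would look like, here is a sketch using only the standard library. It is not `structlog`, which the fix approach recommends; `JsonFormatter` and the `event` attribute are hypothetical names chosen for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional machine-readable event name, e.g. "auth.login_failed".
        event = getattr(record, "event", None)
        if event is not None:
            payload["event"] = event
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, sort_keys=True)
```

A handler configured with this formatter would let an aggregator count `auth.login_failed` events without parsing free-form text.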
|
||||
|
||||
---
|
||||
|
||||
*Concerns audit: 2026-06-02*
|
||||
|
||||
@@ -1,94 +1,216 @@
|
||||
# CONVENTIONS — document-scanner
|
||||
# Coding Conventions
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
**Analysis Date:** 2026-06-02
|
||||
|
||||
## Summary
|
||||
## Naming Patterns
|
||||
|
||||
The codebase follows standard Python and Vue 3 conventions without heavy tooling enforcement. Backend uses async/await throughout with type hints on public interfaces. Frontend uses Vue Options API with Pinia stores as the data layer. No linter or formatter configuration is committed.
|
||||
**Python files:**
|
||||
- `snake_case` throughout — `auth.py`, `cloud_utils.py`, `document_tasks.py`
|
||||
- Modules named for their responsibility, not their layer (e.g., `services/auth.py`, `services/audit.py`)
|
||||
|
||||
**Python functions:**
|
||||
- `snake_case` for all functions and methods: `hash_password`, `verify_password`, `create_access_token`, `write_audit_log`
|
||||
- Private helpers prefixed with underscore: `_set_refresh_cookie`, `_port_open`, `_set_doc_user_id`
|
||||
- Async functions use same convention — no `async_` prefix
|
||||
|
||||
**Python classes:**
|
||||
- `PascalCase` for ORM models and Pydantic models: `User`, `Document`, `RegisterRequest`, `DocumentPatch`
|
||||
- Request/response models end in `Request` or `Response`: `RegisterRequest`, `LoginRequest`, `ChangePasswordRequest`
|
||||
|
||||
**Python variables:**
|
||||
- `snake_case`: `user_id`, `access_token`, `used_bytes`, `credentials_enc`
|
||||
- Constants use `UPPER_SNAKE_CASE`: `_PASSWORD_DETAIL` (underscore prefix when module-private)
|
||||
- Module-level singletons are prefixed with an underscore: `_pwd`, `_CLOUD_PROVIDERS`
|
||||
|
||||
**DB column naming:**
|
||||
- `snake_case` for all columns: `user_id`, `password_hash`, `is_active`, `created_at`
|
||||
- Exception: ORM attribute `metadata_` maps to DB column `metadata` (reserved SQLAlchemy name)
|
||||
- Timestamp columns use `_at` suffix: `created_at`, `used_at`
|
||||
- Boolean columns use `is_` or no prefix: `is_active`, `totp_enabled`, `password_must_change`
|
||||
|
||||
**Frontend files:**
|
||||
- Vue components: `PascalCase` — `DocumentCard.vue`, `FolderTreeItem.vue`, `StorageBrowser.vue`
|
||||
- Stores: `camelCase.js` — `auth.js`, `documents.js`, `cloudConnections.js`
|
||||
- Utilities: `camelCase.js` — `formatters.js`
|
||||
- API client: single file `src/api/client.js`
|
||||
- Test files: `ComponentName.test.js` or `storeName.test.js` inside `__tests__/` subdirectory
|
||||
|
||||
**Frontend functions and variables:**
|
||||
- `camelCase`: `formatDate`, `formatSize`, `providerColor`, `fetchDocuments`, `uploadToMinIO`
|
||||
- Store composables use `use` prefix: `useAuthStore`, `useFoldersStore`, `useDocumentsStore`
|
||||
- Private helpers are prefixed with an underscore: `_refreshInFlight`
|
||||
- Event names emitted from components: `kebab-case` — `'breadcrumb-navigate'`, `'folder-create'`, `'file-open'`
|
||||
|
||||
## Code Style
|
||||
|
||||
**Formatting:**
|
||||
- No Prettier, ESLint, Black, or Ruff config committed — style maintained by convention only
|
||||
- Backend follows PEP 8 organically; 4-space indentation
|
||||
- Tailwind CSS utility classes applied inline in Vue templates; no scoped `<style>` blocks used
|
||||
|
||||
**Python style specifics:**
|
||||
- `from __future__ import annotations` at top of all `api/` and `services/` files (all 8 api/ files confirmed)
|
||||
- `Optional[X]` used instead of `X | None` union syntax — maintained for Python < 3.10 compatibility even though runtime is 3.12
|
||||
- Type annotations on all function signatures and ORM `Mapped[...]` column declarations
|
||||
- Docstrings present on all public functions and modules; module docstrings explain invariants and phase context
|
||||
|
||||
**Vue/JS style specifics:**
|
||||
- `<script setup>` Composition API used for ALL Vue components — no Options API exists (all 30+ components confirmed)
|
||||
- Pinia stores use setup function syntax (not options syntax): `defineStore('name', () => { ... })`
|
||||
- `ref()` for all reactive state; `computed()` for derived values; `watch()` for side effects
|
||||
- Props always explicitly typed: `{ type: Object, required: true }`
|
||||
- `emits` declared on components that emit events
|
||||
|
||||
## Import Organization
|
||||
|
||||
**Python imports (consistent order across all api/ and services/ files):**
|
||||
1. `from __future__ import annotations` (first line, when present)
|
||||
2. Standard library (`import uuid`, `import hashlib`, `import logging`)
|
||||
3. Third-party (`from fastapi import ...`, `from sqlalchemy import ...`, `from pydantic import ...`)
|
||||
4. Internal (`from config import settings`, `from db.models import ...`, `from deps.auth import ...`, `from services import ...`)
|
||||
|
||||
Example from `backend/api/auth.py`:
|
||||
```python
from __future__ import annotations

import uuid
from typing import Literal, Optional

from fastapi import APIRouter, Depends, HTTPException, Request, Response, status
from pydantic import BaseModel, EmailStr
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

from config import settings
from db.models import BackupCode, Quota, RefreshToken, User
from deps.auth import get_current_user
from deps.db import get_db
from services import auth as auth_service
```
|
||||
|
||||
**Frontend imports (consistent order):**
|
||||
1. `import { ... } from 'vue'` — Vue composables
|
||||
2. `import { ... } from 'vue-router'` — router composables
|
||||
3. `import { useXStore } from '../stores/x.js'` — Pinia stores
|
||||
4. `import * as api from '../../api/client.js'` — API client (namespace import)
|
||||
5. `import ChildComponent from './ChildComponent.vue'` — child components
|
||||
6. `import { formatDate } from '../../utils/formatters.js'` — shared utilities
|
||||
|
||||
**Path resolution:** Relative paths throughout — no `@/` alias configured.
|
||||
|
||||
## Error Handling
|
||||
|
||||
**Backend — service vs API layer separation (strict pattern):**
|
||||
- `services/` functions raise `ValueError` with descriptive messages — NEVER `HTTPException`
|
||||
- `api/` handlers catch `ValueError` and map to HTTP status codes
|
||||
- Pattern from `api/auth.py`:
|
||||
  ```python
  try:
      auth_service.validate_password_strength(body.new_password)
  except ValueError as exc:
      raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, detail=str(exc))
  ```
|
||||
|
||||
**HTTP status codes used:**
|
||||
- `201` — resource created (register, share, folder)
|
||||
- `401` — unauthenticated or wrong credentials
|
||||
- `403` — forbidden (wrong role, wrong owner, admin blocked from document content)
|
||||
- `404` — not found
|
||||
- `409` — conflict (duplicate email/handle)
|
||||
- `413` — quota exceeded
|
||||
- `422` — validation failure (weak password, invalid field value)
|
||||
- `429` — rate limited
|
||||
|
||||
**Audit log exceptions:**
|
||||
- `services/audit.py` `write_audit_log()` catches all exceptions and calls `logger.warning()`
|
||||
- Audit failure MUST NOT abort the primary operation — no re-raise under any circumstance
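
The swallow-and-log rule for audit writes can be sketched as follows. This is a minimal synchronous illustration, not the code in `services/audit.py`: the `persist` callable, the boolean return value, and the event shape are assumptions made for the example, and the real function presumably writes through an `AsyncSession`.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def write_audit_log(persist: Callable[[dict], None], event: dict) -> bool:
    """Try to persist an audit event; never let a failure escape.

    `persist` is a hypothetical stand-in for the real DB write.
    Returns True when the event was stored, False when it was dropped.
    """
    try:
        persist(event)
        return True
    except Exception as exc:  # deliberately broad: audit must not abort the caller
        logger.warning("audit log write failed: %s", exc)
        return False


def failing_persist(event: dict) -> None:
    """Simulate an unavailable audit table."""
    raise RuntimeError("database unavailable")


stored: list = []
ok = write_audit_log(stored.append, {"action": "login"})
dropped = write_audit_log(failing_persist, {"action": "login"})
```

In the second call the primary operation would simply continue; only a warning is logged.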
|
||||
|
||||
**Frontend error handling:**
|
||||
- Stores catch errors and set `error.value = e.message`; `loading.value` always reset in `finally`
|
||||
- `api/client.js` `request()` throws `Error` with `.status` and optional `.payload` properties
|
||||
- On 401: automatic single-retry after `authStore.refresh()`; on refresh failure throws `'Session expired'`
|
||||
|
||||
## Logging
|
||||
|
||||
**Framework:** Python `logging` module with `logger = logging.getLogger(__name__)` per module.
|
||||
|
||||
**Patterns:**
|
||||
- `%`-style format strings (never f-strings in log calls): `logger.warning("audit log write failed: %s", exc)`
|
||||
- `logger.info` for successful notable operations; `logger.warning` for non-fatal failures; `logger.error` for operation failures
|
||||
- Never log secrets, tokens, passwords, or PII
|
||||
- Auth events, quota violations, and admin actions are written to the `AuditLog` DB table via `write_audit_log()` — not the Python logger
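
A short sketch of the `%`-style logging pattern listed above. The logger name, the `record_upload` helper, and the `ListHandler` class are hypothetical and exist only to make the example self-contained.

```python
import logging

logger = logging.getLogger("docs.example")
logger.setLevel(logging.INFO)


class ListHandler(logging.Handler):
    """Collect formatted messages in memory so the output can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def record_upload(document_id: str, size_bytes: int) -> str:
    """Log a notable success using lazy %-style arguments."""
    # Arguments are passed separately, so the message is only formatted
    # when a handler actually emits the record.
    logger.info("document stored: id=%s size=%d", document_id, size_bytes)
    return document_id


handler = ListHandler()
logger.addHandler(handler)
record_upload("doc-1", 2048)
```

Passing arguments instead of an f-string defers formatting until the record is emitted and keeps the message template constant for log aggregation.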
|
||||
|
||||
**Frontend:** No logging framework — `console.*` not used in production code.
|
||||
|
||||
## Comments
|
||||
|
||||
**Module docstrings — every backend module has:**
|
||||
- Summary of what it implements (with HTTP endpoint paths)
|
||||
- Security invariants it enforces (with REQ-IDs: `SEC-02`, `AUTH-07`, `D-04`)
|
||||
- Plan/phase traceability note
|
||||
|
||||
**Inline comments:**
|
||||
- Security-sensitive lines carry rationale: `# CLAUDE.md constraint`, `# SEC-06`, `# T-03-22`
|
||||
- SQLAlchemy quirks explained inline where non-obvious
|
||||
- `# ── Section Name ──────` horizontal rules separate logical sections within long files
|
||||
|
||||
**Test docstrings:**
|
||||
- Every test function has a one-line docstring describing what it asserts: `"""POST /api/auth/register with valid data returns 201 with id and handle."""`
|
||||
|
||||
## Function Design
|
||||
|
||||
**Backend:**
|
||||
- Single responsibility per function — auth service functions do exactly one thing
|
||||
- DB-touching functions are `async` and take `AsyncSession` as a parameter
|
||||
- Pydantic `@field_validator` used for complex field constraints (e.g., `filename_no_path_separators`)
|
||||
|
||||
**Frontend:**
|
||||
- Store actions are `async` functions defined inside `defineStore` setup
|
||||
- Utility functions in `src/utils/formatters.js` are pure — no side effects, no imports
|
||||
- Test factory helpers follow `makeFolder(overrides = {})` pattern — spread overrides over defaults
|
||||
|
||||
## Module Design
|
||||
|
||||
**Backend:**
|
||||
- All routers named `router`: `router = APIRouter(prefix="/api/...", tags=[...])`
|
||||
- Settings singleton: `settings = Settings()` at bottom of `config.py`; imported as `from config import settings`
|
||||
- No `__all__` declarations — convention limits what callers import
|
||||
|
||||
**Frontend:**
|
||||
- Named exports from stores: `export const useAuthStore = defineStore(...)`
|
||||
- Named exports from utilities: `export function formatDate(iso) { ... }`
|
||||
- Default exports from Vue components (implicit via `<script setup>`)
|
||||
- `src/api/client.js`: named exports only; `request()` is unexported internal helper
|
||||
|
||||
## Backend Dependency Injection
|
||||
|
||||
FastAPI `Depends()` is used for all cross-cutting concerns. Three standard dependencies in `backend/deps/`:
|
||||
|
||||
- `get_db` (`deps/db.py`) — yields `AsyncSession`; overridden in tests with in-memory SQLite session
|
||||
- `get_current_user` (`deps/auth.py`) — validates Bearer JWT, returns `User`; raises 401
|
||||
- `get_current_admin` (`deps/auth.py`) — delegates to `get_current_user`, checks `role == 'admin'`; raises 403
|
||||
- `get_regular_user` (`deps/auth.py`) — delegates to `get_current_user`, blocks `role == 'admin'`; raises 403
|
||||
|
||||
Usage pattern in route handlers:
|
||||
```python
@router.get("/protected")
async def protected_endpoint(
    current_user: User = Depends(get_regular_user),
    session: AsyncSession = Depends(get_db),
):
    ...
```
|
||||
|
||||
## Security-Enforced Invariants in Code
|
||||
|
||||
The following patterns are mandatory and must not be deviated from:
|
||||
- **Token storage:** `accessToken` lives only in Pinia `ref()` — never `localStorage`, never `sessionStorage`
|
||||
- **Refresh cookie:** `httponly=True, secure=True, samesite="strict"` on every `set_cookie` call
|
||||
- **Ownership check:** every document/folder/share endpoint asserts `resource.user_id == current_user.id`
|
||||
- **Object keys:** `{user_id}/{document_id}/{uuid4()}{ext}` — human filename stored in DB only
|
||||
- **Quota:** atomic `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes` — never read-then-write
|
||||
- **Admin exclusion:** admin accounts blocked from all `/api/documents/*` endpoints via `get_regular_user`
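
The quota invariant in the list above (check and increment in a single `UPDATE`, never read-then-write) can be illustrated with the standard-library `sqlite3` module. This is a sketch under assumptions, not the project's PostgreSQL code: it uses an in-memory SQLite table, adds a `user_id` filter that the abbreviated statement above omits, and checks `rowcount` instead of using `RETURNING`.

```python
import sqlite3


def reserve_quota(conn: sqlite3.Connection, user_id: int, delta: int) -> bool:
    """Atomically add `delta` bytes to a user's usage if it fits the limit.

    The check and the increment happen in one UPDATE statement, so two
    concurrent uploads cannot both pass a stale read.
    """
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    # rowcount is 1 when the row was updated, 0 when the limit would be exceeded
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id INTEGER PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES (1, 900, 1000)")

first = reserve_quota(conn, 1, 100)  # fits exactly
second = reserve_quota(conn, 1, 1)   # would exceed the limit
used = conn.execute("SELECT used_bytes FROM quotas WHERE user_id = 1").fetchone()[0]
```

A rejected reservation leaves `used_bytes` untouched, which is the point at which an API handler would map the failure to `413`.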
|
||||
|
||||
---
|
||||
|
||||
## Python Conventions (Backend)
|
||||
|
||||
### Naming
|
||||
- Files: `snake_case.py`
|
||||
- Classes: `PascalCase` (e.g., `AnthropicProvider`, `ClassificationResult`)
|
||||
- Functions/variables: `snake_case`
|
||||
- Constants: `UPPER_SNAKE_CASE` (e.g., `MAX_STORED_CHARS`, `DATA_DIR`)
|
||||
- Private helpers: leading underscore (e.g., `_extract_pdf`, `_parse_classification`)
|
||||
|
||||
### Async
|
||||
- All API endpoint functions are `async def`
|
||||
- All `AIProvider` methods are `async def`
|
||||
- `pytest-asyncio` with `asyncio_mode=auto` (set in `pytest.ini`)
|
||||
|
||||
### Type Hints
|
||||
- Used on public function signatures in `ai/` layer and `services/`
|
||||
- Dataclass used for `ClassificationResult` (`@dataclass` with `field(default_factory=...)`)
|
||||
- Not used consistently in `api/` routers (rely on FastAPI/Pydantic implicit validation)
|
||||
|
||||
### Error Handling
|
||||
- `extractor.py` wraps all extraction in `try/except Exception` and returns error strings (never raises)
|
||||
- AI providers raise on hard failures; caller (`classifier.py`) is responsible for propagating
|
||||
- No global exception handler registered in `main.py`
|
||||
|
||||
### Imports
|
||||
- Standard library first, then third-party, then local — not enforced by isort
|
||||
- Heavy library imports (`fitz`, `pytesseract`, `docx`) are deferred inside functions to avoid import-time cost when unused
|
||||
|
||||
### Module Docstrings
|
||||
- Present on `extractor.py` and `test_classifier.py`; absent elsewhere
|
||||
|
||||
---
|
||||
|
||||
## JavaScript / Vue Conventions (Frontend)
|
||||
|
||||
### Naming
|
||||
- Vue files: `PascalCase.vue` (e.g., `DocumentCard.vue`, `AppSidebar.vue`)
|
||||
- Pinia stores: `camelCase` filename matching store ID (e.g., `documents.js` → `useDocumentsStore`)
|
||||
- Views: `<Name>View.vue` suffix
|
||||
- Components grouped by domain in subdirectories: `documents/`, `topics/`, `upload/`, `layout/`
|
||||
|
||||
### Vue Style
|
||||
- Options API used throughout (not Composition API)
|
||||
- Props defined with type and default; no `defineProps` (Options API syntax)
|
||||
- `v-model`, `v-for`, `v-if` used directly in templates
|
||||
|
||||
### Pinia Pattern
|
||||
- Each store encapsulates `state`, `getters`, and `actions`
|
||||
- Actions call `src/api/client.js` — components never import `client.js` directly
|
||||
- Stores are the single source of truth; views read from store state
|
||||
|
||||
### API Client
|
||||
- `src/api/client.js` is the sole HTTP adapter
|
||||
- All paths are prefixed `/api/` (proxied to backend in dev via Vite config)
|
||||
|
||||
### Styling
|
||||
- Tailwind CSS utility classes used directly in templates
|
||||
- No scoped `<style>` blocks observed in component list
|
||||
- Global styles in `src/style.css`
|
||||
|
||||
---
|
||||
|
||||
## API Design Conventions (Backend)
|
||||
|
||||
- All endpoints prefixed `/api/` (set per router)
|
||||
- JSON responses; multipart for file upload
|
||||
- HTTP verbs follow REST: GET list, GET by ID, POST create, PUT/PATCH update, DELETE remove
|
||||
- No versioning (`/api/v1/`) — flat namespace
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
- Runtime paths controlled entirely by `DATA_DIR` env var (defaults to `/app/data`)
|
||||
- AI settings persisted in `data/settings.json` — no env var overrides at runtime for provider config (except `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` noted in `.env.example`)
|
||||
- No `.env` loading in backend code — env vars passed via Docker Compose `environment:` block
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- No ESLint, Prettier, Black, or Ruff configuration committed
|
||||
- No pre-commit hooks
|
||||
- No consistent JSDoc or Python docstring coverage
|
||||
*Convention analysis: 2026-06-02*
|
||||
|
||||
@@ -1,144 +1,235 @@
|
||||
# INTEGRATIONS — document-scanner
|
||||
# External Integrations
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
**Analysis Date:** 2026-06-02
|
||||
|
||||
## Summary
|
||||
## AI / ML Classification
|
||||
|
||||
The backend integrates with four interchangeable AI providers for document classification: Anthropic Claude, OpenAI (and any OpenAI-compatible endpoint), Ollama, and LM Studio. There are no external databases, auth services, or cloud storage integrations — all persistence is local filesystem. The active provider is selected at runtime via settings persisted in `backend/data/settings.json`.
|
||||
All AI providers implement the `AIProvider` abstract interface in `backend/ai/base.py`. The active provider is selected at classification time via the `DEFAULT_AI_PROVIDER` setting (`backend/config.py`).
|
||||
|
||||
---
|
||||
|
||||
## AI Providers
|
||||
|
||||
All providers implement the `AIProvider` abstract interface defined in `backend/ai/base.py`. The active provider is resolved at request time in `backend/ai/__init__.py:get_provider()`.
|
||||
|
||||
### Anthropic
|
||||
### Anthropic Claude
|
||||
|
||||
- **SDK:** `anthropic>=0.26` — `backend/ai/anthropic_provider.py`
|
||||
- **Client:** `anthropic.AsyncAnthropic`
|
||||
- **Client:** `anthropic.AsyncAnthropic(api_key=...)`
|
||||
- **API:** Messages API (`client.messages.create`)
|
||||
- **Default model:** `claude-sonnet-4-6`
|
||||
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.anthropic.api_key`; optionally seeded from env var `ANTHROPIC_API_KEY` (`.env.example`)
|
||||
- **Default model:** `claude-sonnet-4-6` (configurable via `DEFAULT_AI_MODEL`)
|
||||
- **Auth:** API key passed at provider instantiation; whether it is stored in the DB per user or system-wide is not yet confirmed in code
|
||||
- **Calls made:** `classify` (max_tokens=1024), `suggest_topics` (max_tokens=256), `health_check` (max_tokens=5)
|
||||
- **Text limit:** 8,000 characters per request (`MAX_AI_CHARS = 8_000`)
|
||||
- **Text cap:** 8,000 chars per call (`MAX_AI_CHARS = 8_000` in `backend/ai/anthropic_provider.py`)
|
||||
|
||||
### OpenAI
|
||||
|
||||
- **SDK:** `openai>=1.30` — `backend/ai/openai_provider.py`
|
||||
- **Client:** `openai.AsyncOpenAI`
|
||||
- **Client:** `openai.AsyncOpenAI(api_key=..., base_url=...)`
|
||||
- **API:** Chat Completions (`client.chat.completions.create`)
|
||||
- **Default model:** `gpt-4o`
|
||||
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.openai.api_key`; optionally seeded from env var `OPENAI_API_KEY` (`.env.example`)
|
||||
- **Custom base URL:** Supported via `providers.openai.base_url` in settings (allows pointing at any OpenAI-compatible endpoint)
|
||||
- **Auth:** `api_key` at instantiation; `base_url` override supported for custom endpoints
|
||||
|
||||
### Ollama
|
||||
### Ollama (local, OpenAI-compatible)
|
||||
|
||||
- **Provider file:** `backend/ai/ollama_provider.py`
|
||||
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
|
||||
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
|
||||
- **Default base URL:** `http://host.docker.internal:11434/v1`
|
||||
- **Default model:** `llama3.2`
|
||||
- **Auth:** Stub key `"ollama"` (no real auth required)
|
||||
- **Network path:** Reaches the host machine's Ollama daemon via Docker's `host.docker.internal` DNS alias (configured in `docker-compose.yml` via `extra_hosts`)
|
||||
- **Auth:** Stub key `"ollama"` — no real auth
|
||||
- **Network path:** Reaches host machine Ollama daemon via Docker `extra_hosts: host.docker.internal:host-gateway`
|
||||
|
||||
### LM Studio
|
||||
### LM Studio (local, OpenAI-compatible)
|
||||
|
||||
- **Provider file:** `backend/ai/lmstudio_provider.py`
|
||||
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
|
||||
- **Implementation:** Subclass of `OpenAIProvider` with fixed `base_url`
|
||||
- **Default base URL:** `http://host.docker.internal:1234/v1`
|
||||
- **Default model:** `gemma-4-e4b-it`
|
||||
- **Auth:** Stub key `"lm-studio"` (no real auth required)
|
||||
- **Network path:** Reaches the host machine's LM Studio server via `host.docker.internal` (same `extra_hosts` setting)
|
||||
- **Default active provider** — the app works out of the box with LM Studio and no API keys
|
||||
- **Auth:** Stub key `"lm-studio"` — no real auth
|
||||
- **Network path:** Same `host.docker.internal` Docker alias as Ollama
|
||||
|
||||
---
|
||||
|
||||
## Provider Selection & Settings Persistence
|
||||
## Data Storage
|
||||
|
||||
- Active provider and all per-provider config (model names, API keys, base URLs) are persisted in `backend/data/settings.json`.
|
||||
- Settings are loaded fresh on each classification request in `backend/services/classifier.py:classify_document()`.
|
||||
- API keys returned from the settings API are masked (last 4 chars shown) via `backend/services/storage.py:mask_api_key()`.
|
||||
- The Settings UI allows switching providers without restart.
|
||||
### PostgreSQL (primary database)
|
||||
|
||||
- **Image:** `postgres:17-alpine` (Docker Compose)
|
||||
- **Driver:** `psycopg[binary]>=3.3.4` (psycopg v3 async)
|
||||
- **ORM:** SQLAlchemy 2.0 asyncio — `backend/db/session.py`
|
||||
- **Schema migrations:** Alembic — `backend/migrations/`
|
||||
- **Connection env vars:** `DATABASE_URL` (app user, DML only), `DATABASE_MIGRATE_URL` (migrate user, DDL)
|
||||
- **Role separation:** `docuvault_app` (DML), `docuvault_migrate` (DDL) — `docker/postgres/initdb.d/01-init-users.sql`
|
||||
|
||||
### MinIO (object storage)
|
||||
|
||||
- **Image:** `minio/minio:latest` (Docker Compose), ports 9000 + 9001
|
||||
- **SDK:** `minio>=7.2.20` — `backend/storage/minio_backend.py`
|
||||
- **Object key scheme:** `{user_id}/{document_id}/{uuid4()}{ext}` — human filenames stored in DB only
|
||||
- **Presigned URLs:** Generated for browser direct-PUT uploads and GET downloads
|
||||
- **Auth env vars:** `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_BUCKET`
|
||||
- **Public endpoint:** `MINIO_PUBLIC_ENDPOINT` — browser-resolvable hostname for presigned URLs (may differ from internal Docker endpoint)
|
||||
- **CORS:** `MINIO_API_CORS_ALLOW_ORIGIN` set to `FRONTEND_URL` to allow browser preflight
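
The object key scheme listed above can be sketched like this. The function name `build_object_key` and the sample IDs are hypothetical; the only behaviour taken from this document is the `{user_id}/{document_id}/{uuid4()}{ext}` layout and the rule that the human filename stays in the DB.

```python
import os
import uuid


def build_object_key(user_id: str, document_id: str, filename: str) -> str:
    """Return `{user_id}/{document_id}/{uuid4}{ext}` for an uploaded file.

    Only the extension of the human filename is reused; the name itself
    never reaches object storage.
    """
    # basename() drops any directory components a client may have sent
    ext = os.path.splitext(os.path.basename(filename))[1]
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


key = build_object_key("u-42", "d-7", "Tax Return 2025.pdf")
```

Because the last segment is random, two uploads with the same filename cannot collide, and the key reveals nothing about the document's title.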
|
||||
|
||||
### Redis
|
||||
|
||||
- **Image:** `redis:7-alpine` (Docker Compose), password-protected
|
||||
- **Client:** `redis>=4.6.0` (async via `redis.asyncio`)
|
||||
- **Uses:**
|
||||
- Celery broker and result backend (`backend/celery_app.py`)
|
||||
- JTI token revocation store (access + refresh token blacklist)
|
||||
- Per-account rate limiting via slowapi (`backend/main.py`)
|
||||
- TOTP replay prevention (used TOTP codes invalidated within 90 s window)
|
||||
- **Auth env var:** `REDIS_URL` (includes password in DSN)
|
||||
|
||||
---
|
||||
|
||||
## Frontend ↔ Backend Communication
|
||||
## Cloud Storage Backends
|
||||
|
||||
- **Protocol:** HTTP REST over JSON (and multipart form for uploads)
|
||||
- **Client:** Native browser `fetch` API — `frontend/src/api/client.js`
|
||||
- **Base path:** All requests go to `/api/*` — no hardcoded backend hostname in the frontend
|
||||
- **Proxy (dev):** Vite dev server proxies `/api` → `http://backend:8000` — `frontend/vite.config.js`
|
||||
- **Proxy (prod):** Comment in `frontend/src/api/client.js` notes nginx is expected; no nginx config is present in the repo
|
||||
All backends implement `StorageBackend` ABC from `backend/storage/base.py`. Credentials are encrypted at rest with HKDF per-user key derivation using master key from `CLOUD_CREDS_KEY` env var.
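
Per-user key derivation with HKDF might look like the sketch below. It implements HKDF-SHA256 as defined in RFC 5869 using only the standard library. The choice of SHA-256, the empty salt, the `cloud-creds:` info prefix, and the `derive_user_key` helper are all assumptions for illustration; the document does not say how the project parameterises HKDF.

```python
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """Derive `length` bytes from `master_key` using HKDF (RFC 5869) with SHA-256."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: concentrate the input key material into a pseudorandom key
    prk = hmac.new(salt or b"\x00" * hash_len, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks T(1), T(2), ... until enough output exists
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_key(master_key: bytes, user_id: str) -> bytes:
    """Hypothetical per-user key: the user id is bound in via the `info` field."""
    return hkdf_sha256(master_key, salt=b"", info=b"cloud-creds:" + user_id.encode())


master = bytes(range(32))  # placeholder for the 32-byte CLOUD_CREDS_KEY
key_a = derive_user_key(master, "user-a")
key_b = derive_user_key(master, "user-b")
```

Each user gets a distinct key from the same master secret, so a leaked derived key exposes only that user's credentials. Production code would more likely use a maintained implementation such as the HKDF class in the `cryptography` package than hand-rolled HMAC chaining.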
|
||||
|
||||
### API Endpoints consumed by the frontend
|
||||
### Google Drive v3
|
||||
|
||||
| Method | Path | Purpose |
|---|---|---|
| POST | `/api/documents/upload` | Upload file with optional auto-classify flag |
| GET | `/api/documents` | List documents (paginated, optional topic filter) |
| GET | `/api/documents/:id` | Get single document metadata |
| DELETE | `/api/documents/:id` | Delete document |
| POST | `/api/documents/:id/classify` | (Re)classify document, optional topic list |
| GET | `/api/topics` | List all topics |
| POST | `/api/topics` | Create topic |
| PATCH | `/api/topics/:id` | Update topic |
| DELETE | `/api/topics/:id` | Delete topic |
| POST | `/api/topics/suggest` | AI topic suggestions for a document |
| GET | `/api/settings` | Get settings (keys masked) |
| PATCH | `/api/settings` | Update settings |
| POST | `/api/settings/test-provider` | Health-check the active or named provider |
| GET | `/api/settings/default-prompt` | Retrieve the default classification system prompt |
|
||||
- **SDK:** `google-auth-oauthlib>=1.3.1` + `google-api-python-client>=2.196.0`
|
||||
- **Backend file:** `backend/storage/google_drive_backend.py`
|
||||
- **Auth:** OAuth2 flow; tokens stored encrypted in DB; `token_uri`, `client_id`, `client_secret`, `access_token`, `refresh_token` in credentials dict
|
||||
- **Scope:** `https://www.googleapis.com/auth/drive.file`
|
||||
- **Note:** All `googleapiclient` calls are synchronous and wrapped in `asyncio.to_thread()` to avoid blocking the event loop; `cache_discovery=False` prevents `/tmp` writes (path traversal mitigation)
|
||||
- **Auth env vars:** `GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`
|
||||
- **OAuth callback:** `{BACKEND_URL}/api/cloud/google/callback`
|
||||
|
||||
---
|
||||
### Microsoft OneDrive (Graph API)
|
||||
|
||||
## Docker Services
|
||||
- **SDK:** `msal>=1.36.0` (token management) + `httpx>=0.27` (async Graph API calls)
|
||||
- **Backend file:** `backend/storage/onedrive_backend.py`
|
||||
- **API base:** `https://graph.microsoft.com/v1.0`
|
||||
- **Auth:** OAuth2 via MSAL; tokens stored encrypted in DB; credentials dict contains `access_token`, `refresh_token`, `expires_at`
|
||||
- **Upload strategy:** Resumable upload sessions (`createUploadSession`) for all files; chunk size 10 MB
|
||||
- **Auth env vars:** `ONEDRIVE_CLIENT_ID`, `ONEDRIVE_CLIENT_SECRET`, `ONEDRIVE_TENANT_ID` (default: `"common"`)
|
||||
|
||||
Defined in `docker-compose.yml`:
|
||||
### Nextcloud
|
||||
|
||||
| Service | Image | Port | Notes |
|---|---|---|---|
| `backend` | Built from `./backend/Dockerfile` | `8000:8000` | Mounts `./backend/data:/app/data` for persistence; `./backend:/app` for hot-reload |
| `frontend` | Built from `./frontend/Dockerfile` | `5173:5173` | Mounts `./frontend/src` and `index.html` for hot-reload; depends on `backend` |
|
||||
- **Backend file:** `backend/storage/nextcloud_backend.py`
|
||||
- **Inheritance:** `NextcloudBackend → WebDAVBackend → StorageBackend`
|
||||
- **Protocol:** WebDAV via `webdavclient3>=3.14.7`
|
||||
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
|
||||
- **SSRF prevention:** `validate_cloud_url()` called at construction time and before every outbound request (`backend/storage/cloud_utils.py`)
|
||||
- **No OAuth:** Credential-based only (username + password)
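
The SSRF check above could take roughly the following shape. This is a hedged sketch and not the contents of `backend/storage/cloud_utils.py`: the rules shown (scheme allow-list, rejection of `localhost` and of private, loopback, link-local, and reserved literal IPs) are assumptions, and hostnames are deliberately not resolved here, which a fuller check would need to do.

```python
import ipaddress
from urllib.parse import urlparse


def validate_cloud_url(url: str) -> str:
    """Reject URLs that are obviously unsafe to call from the backend.

    Sketch only: literal IP hosts are checked, hostnames are not resolved.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("URL scheme must be http or https")
    host = parsed.hostname
    if not host:
        raise ValueError("URL has no host")
    if host == "localhost":
        raise ValueError("URL points at a non-public address")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return url  # a hostname; a fuller check would resolve and re-test it
    if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
        raise ValueError("URL points at a non-public address")
    return url


accepted = validate_cloud_url("https://cloud.example.com/remote.php/dav")
```

Raising `ValueError` follows the service-layer convention described earlier, leaving the HTTP status mapping to the API handler. Calling the check again before each request, as the list above describes, guards against a stored URL being changed after construction.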
|
||||
|
||||
Both services use `extra_hosts: host.docker.internal:host-gateway` on the backend to allow Ollama/LM Studio connections to the host machine.
|
||||
### Generic WebDAV
|
||||
|
||||
---
|
||||
|
||||
## Environment Variables
|
||||
|
||||
| Variable | Required | Where used | Notes |
|---|---|---|---|
| `DATA_DIR` | No | `backend/config.py` | Root path for uploads/metadata/settings; defaults to `/app/data` |
| `ANTHROPIC_API_KEY` | No | `.env.example` | Bootstrap only; app manages keys via settings UI |
| `OPENAI_API_KEY` | No | `.env.example` | Bootstrap only; app manages keys via settings UI |
| `PYTHONDONTWRITEBYTECODE` | No | `docker-compose.yml` | Set to `1` to suppress `.pyc` files in Docker |
|
||||
- **Backend file:** `backend/storage/webdav_backend.py`
|
||||
- **SDK:** `webdavclient3>=3.14.7`
|
||||
- **Credentials dict:** `{"server_url": str, "username": str, "password": str}`
|
||||
- **SSRF prevention:** Same dual-call `validate_cloud_url()` pattern as Nextcloud
|
||||
- **Path encoding:** `urllib.parse.quote()` per path segment to handle non-ASCII filenames
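
Per-segment encoding with `urllib.parse.quote()` can be sketched as below. The helper name `encode_webdav_path` and the sample path are hypothetical.

```python
from urllib.parse import quote


def encode_webdav_path(path: str) -> str:
    """Percent-encode each path segment while keeping `/` separators intact."""
    # The path is split on `/` first, so safe="" can encode everything
    # else in a segment (spaces, non-ASCII characters, `#`, `?`).
    return "/".join(quote(segment, safe="") for segment in path.split("/"))


encoded = encode_webdav_path("/Dokumente/Übersicht 2024.pdf")
# encoded == "/Dokumente/%C3%9Cbersicht%202024.pdf"
```

Encoding the whole path in one call with the default `safe="/"` would give the same result here, but splitting first makes the intent explicit and lets characters such as `#` or `?` inside a filename be escaped reliably.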
|
||||
|
||||
---
|
||||
|
||||
## Authentication & Identity
|
||||
|
||||
- No user authentication. The application has no login system, sessions, or identity provider.
|
||||
- API keys for AI providers are stored in plain text in `backend/data/settings.json` (masked only when returned via the settings API).
|
||||
No external auth provider (SSO, Auth0, Cognito, etc.). Authentication is custom-built:
|
||||
|
||||
- **Password hashing:** Argon2id via `pwdlib[argon2]` — `backend/services/auth.py`
|
||||
- **JWT access tokens:** PyJWT `>=2.8.0`; ES256 (ECDSA P-256) algorithm; 15-minute TTL; JTI claim for revocation; fingerprint claim (`fgp`) bound to `User-Agent + Accept-Language`
|
||||
- **Refresh tokens:** 30-day httpOnly cookie with `SameSite=Strict`; rotated on every use; family revocation on reuse
|
||||
- **JTI store:** Redis (TTL matching token lifetime)
|
||||
- **TOTP (2FA):** `pyotp>=2.9.0`; replay prevention via Redis within 90 s window; QR codes generated in frontend with `qrcode ^1.5.4`
|
||||
- **Backup codes:** Generated, hashed (Argon2id), stored in DB — `backend/db/models.py:BackupCode`
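
Refresh token rotation with family revocation, as listed above, can be sketched with an in-memory store. The `RefreshStore` class and its fields are hypothetical stand-ins; the project keeps this state in the `RefreshToken` table and Redis, and its actual logic may differ.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class RefreshStore:
    """In-memory stand-in for the refresh-token table (hypothetical)."""

    tokens: dict = field(default_factory=dict)  # token -> (family_id, used)
    revoked_families: set = field(default_factory=set)

    def issue(self, family_id: str) -> str:
        """Create a fresh, unused token inside the given family."""
        token = secrets.token_urlsafe(32)
        self.tokens[token] = (family_id, False)
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one; revoke the family on reuse."""
        if token not in self.tokens:
            raise ValueError("unknown refresh token")
        family_id, used = self.tokens[token]
        if family_id in self.revoked_families:
            raise ValueError("session family revoked")
        if used:
            # A used token showing up again suggests theft: kill the whole family
            self.revoked_families.add(family_id)
            raise ValueError("refresh token reuse detected")
        self.tokens[token] = (family_id, True)
        return self.issue(family_id)


store = RefreshStore()
t1 = store.issue("family-1")
t2 = store.rotate(t1)  # normal rotation: t1 is now spent, t2 is valid
```

If `t1` is presented a second time, the whole family is revoked and `t2` stops working as well, which is the event that would trigger the security alert email described under Email / Notifications.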
|
||||
|
||||
---
|
||||
|
||||
## External HTTP APIs
|
||||
|
||||
### HaveIBeenPwned (HIBP)
|
||||
|
||||
- **Purpose:** k-anonymity password breach check on registration and password change
|
||||
- **Client:** `httpx` async GET to `https://api.pwnedpasswords.com/range/{prefix}`
|
||||
- **Implementation:** `backend/services/auth.py:check_hibp()` — sends first 5 chars of SHA-1 hash only; fail-open (check failures are logged and do not block registration)
|
||||
- **Auth:** None required (public API)
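
The k-anonymity split can be sketched without any network access. The helper names below are hypothetical and are not the `check_hibp()` implementation. The sketch assumes the range endpoint returns one `SUFFIX:COUNT` pair per line, and the sample response body is made up for the example.

```python
import hashlib


def hibp_prefix_suffix(password: str) -> tuple:
    """Split the uppercase SHA-1 hex digest into the 5-char prefix and the rest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def breach_count(range_body: str, suffix: str) -> int:
    """Look up `suffix` in a range response made of `SUFFIX:COUNT` lines."""
    for line in range_body.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0


prefix, suffix = hibp_prefix_suffix("correct horse battery staple")
# Fabricated response body for illustration; only `prefix` would be sent to the API.
sample_body = f"0018A45C4D1DEF81644B54AB7F969B88D65:1\r\n{suffix}:42"
count = breach_count(sample_body, suffix)
```

Only the five-character prefix leaves the server; the comparison against the remaining 35 characters happens locally.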
|
||||
|
||||
---
|
||||
|
||||
## Email / Notifications
|
||||
|
||||
- **Protocol:** SMTP via Python stdlib `smtplib` — `backend/services/email.py`
|
||||
- **Transport security:** STARTTLS (port 587 default)
|
||||
- **Auth:** Optional SMTP username + password
|
||||
- **Auth env vars:** `SMTP_HOST`, `SMTP_PORT`, `SMTP_USER`, `SMTP_PASSWORD`, `SMTP_FROM`
|
||||
- **Dev fallback:** When `SMTP_HOST` is empty, email content is logged to stdout instead of sent
|
||||
- **Emails sent:**
|
||||
- Password reset link (1-hour validity) — triggered from `backend/tasks/email_tasks.py`
|
||||
- Security alert (suspicious refresh token reuse / session family revocation) — triggered from `backend/services/auth.py` via Celery
|
||||
- **Celery queue:** `email` queue, separate from `documents` queue
|
||||
|
||||
---
|
||||
|
||||
## Frontend ↔ Backend Communication
|
||||
|
||||
- **Protocol:** HTTP REST over JSON; multipart/form-data for document upload
|
||||
- **Client:** Native browser `fetch` API — `frontend/src/api/` directory
|
||||
- **Base path:** All requests use relative `/api/*` — no hardcoded backend hostname
|
||||
- **Dev proxy:** Vite proxies `/api` → `http://backend:8000` (`frontend/vite.config.js`)
|
||||
- **Auth flow:** Access token stored in Pinia store (memory only); refresh token in httpOnly cookie; token refresh handled transparently in API client
|
||||
|
||||
---
|
||||
|
||||
## Background Task Queues (Celery)
|
||||
|
||||
- **Broker + result backend:** Redis (`REDIS_URL`)
|
||||
- **Serialization:** JSON only (no pickle)
|
||||
- **Queues and task modules:**
|
||||
- `documents` — `backend/tasks/document_tasks.py` (extraction, classification, cleanup)
|
||||
- `email` — `backend/tasks/email_tasks.py` (password reset, security alert)
|
||||
- `documents` (reused) — `backend/tasks/audit_tasks.py` (audit log export)
|
||||
- **Scheduled tasks (Celery Beat):**
|
||||
- `cleanup-abandoned-uploads` — every 30 minutes
|
||||
- `audit-log-daily-export` — midnight UTC daily
|
||||
|
||||
---
|
||||
|
||||
## Monitoring & Observability
|
||||
|
||||
- No error tracking service (no Sentry, Datadog, etc.).
|
||||
- No structured logging framework — FastAPI default stdout logging only.
|
||||
- A `/health` endpoint exists at `backend/main.py` returning `{"status": "ok"}`.
|
||||
- Provider connectivity tested on demand via `POST /api/settings/test-provider`.
|
||||
- **Error tracking:** None (no Sentry, Datadog, etc.)
|
||||
- **Logging:** Python stdlib `logging`; stdout; no structured logging framework
|
||||
- **Health endpoint:** `GET /health` — probes PostgreSQL (`SELECT 1`) and MinIO (bucket exists check); always returns HTTP 200 with `status: ok | degraded`
|
||||
- **Audit log:** All auth events, quota violations, and admin actions written to DB audit log (no document content) — `backend/services/audit.py`, `backend/api/audit.py`
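
The "always 200, `ok` or `degraded`" behaviour of the health endpoint can be sketched as a probe aggregator. The `health_report` function, the `checks` key in the result, and the probe callables are assumptions; the document only states that PostgreSQL and MinIO are probed and that the status is `ok | degraded`.

```python
from typing import Callable, Dict


def health_report(probes: Dict[str, Callable[[], bool]]) -> dict:
    """Run each probe and summarise; a failing probe degrades but never raises."""
    checks = {}
    for name, probe in probes.items():
        try:
            checks[name] = bool(probe())
        except Exception:
            # A crashed probe counts as a failed dependency, not a failed endpoint
            checks[name] = False
    status = "ok" if all(checks.values()) else "degraded"
    return {"status": status, "checks": checks}


def broken_minio_probe() -> bool:
    """Simulate an unreachable object store."""
    raise ConnectionError("minio unreachable")


report = health_report({"postgres": lambda: True, "minio": broken_minio_probe})
```

Because a dependency failure changes the body rather than the status code, monitoring has to inspect `status` instead of relying on a non-200 response.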
|
||||
|
||||
---
|
||||
|
||||
## Webhooks & Callbacks
|
||||
## CI/CD & Deployment
|
||||
|
||||
- None — the application makes no outbound webhook calls and exposes no webhook receiver endpoints.
|
||||
- **Hosting:** Docker Compose only; no cloud provider manifests detected
|
||||
- **CI pipeline:** None detected in repository
|
||||
- **Container registry:** None configured
|
||||
- **Secrets management:** Environment variables only; `.env` file for local dev (not committed)
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
## Required Environment Variables Summary
|
||||
|
||||
- No nginx or reverse-proxy config present for production deployments; the client-side comment references it but no config exists.
|
||||
- No container registry or CI/CD pipeline configuration detected.
|
||||
- API keys are stored in a plain JSON file on disk with no encryption at rest.
|
||||
- The `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` env vars from `.env.example` are noted as bootstrap helpers but no code in the repo reads them directly — they appear to be manual seeding hints only.
|
||||
| Variable | Required | Service | Purpose |
|---|---|---|---|
| `DATABASE_URL` | Yes | backend | App DB connection (DML user) |
| `DATABASE_MIGRATE_URL` | Yes | migrations | Alembic DDL connection |
| `MINIO_ENDPOINT` | Yes | backend, workers | MinIO S3 API endpoint |
| `MINIO_ACCESS_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_SECRET_KEY` | Yes | backend, workers | MinIO credentials |
| `MINIO_BUCKET` | Yes | backend, workers | Object storage bucket name |
| `REDIS_URL` | Yes | backend, workers, beat | Redis DSN (broker + JTI store) |
| `SECRET_KEY` | Yes | backend | JWT signing secret |
| `CLOUD_CREDS_KEY` | Yes | celery-worker | 32-byte master key for HKDF |
| `POSTGRES_PASSWORD` | Yes | postgres service | Docker postgres init |
| `MINIO_ROOT_USER` | Yes | minio service | MinIO root credentials |
| `MINIO_ROOT_PASSWORD` | Yes | minio service | MinIO root credentials |
| `REDIS_PASSWORD` | Yes | redis service | Redis auth password |
| `SMTP_HOST` | No | backend | Transactional email (dev: logs to stdout) |
| `GOOGLE_CLIENT_ID` | No | backend | Google Drive OAuth |
| `GOOGLE_CLIENT_SECRET` | No | backend | Google Drive OAuth |
| `ONEDRIVE_CLIENT_ID` | No | backend | OneDrive OAuth |
| `ONEDRIVE_CLIENT_SECRET` | No | backend | OneDrive OAuth |
| `ADMIN_EMAIL` | No | backend | Bootstrap admin account |
| `ADMIN_PASSWORD` | No | backend | Bootstrap admin account |
| `DEFAULT_AI_PROVIDER` | No | backend | AI provider selection (default: `ollama`) |
| `DEFAULT_AI_MODEL` | No | backend | AI model selection (default: `llama3.2`) |
| `CORS_ORIGINS` | No | backend | Allowed CORS origins |
| `FRONTEND_URL` | No | backend, minio | Password reset links + MinIO CORS |
| `BACKEND_URL` | No | backend | OAuth callback URL construction |
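A startup check for the required rows could look roughly like the sketch below. It is not the project's `backend/config.py` (which the docs say uses `pydantic-settings`); it is a standard-library illustration. The helper name `missing_required` is hypothetical, and the variable list is limited to rows in the table whose Service column includes `backend`.

```python
import os

# Names copied from the "Required = Yes" rows whose Service column includes `backend`.
REQUIRED_BACKEND_VARS = (
    "DATABASE_URL",
    "MINIO_ENDPOINT",
    "MINIO_ACCESS_KEY",
    "MINIO_SECRET_KEY",
    "MINIO_BUCKET",
    "REDIS_URL",
    "SECRET_KEY",
)


def missing_required(env=None):
    """Return the required variable names that are unset or empty, in table order.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise in isolation. (Hypothetical helper.)
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_BACKEND_VARS if not env.get(name)]
```

Calling `missing_required()` before the app starts and failing fast on a non-empty result gives a clearer error than a connection failure later on.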
|
||||
|
||||
---
|
||||
|
||||
*Integration audit: 2026-06-02*
|
||||
|
||||
+130
-76
@@ -1,129 +1,183 @@
|
||||
# STACK — document-scanner
|
||||
# Technology Stack
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
Document Scanner is a full-stack application with a Python/FastAPI backend and a Vue 3 frontend, containerised with Docker Compose. The backend handles document ingestion, text extraction, and AI-powered topic classification; the frontend is a single-page app served by Vite. No external database is used — all state is persisted to the local filesystem.
|
||||
|
||||
---
|
||||
**Analysis Date:** 2026-06-02
|
||||
|
||||
## Languages
|
||||
|
||||
| Language | Version | Where used |
|
||||
|---|---|---|
|
||||
| Python | 3.12 (pinned in `backend/Dockerfile`) | Backend API, AI providers, services |
|
||||
| JavaScript (ES modules) | ES2022+ (`"type": "module"` in `frontend/package.json`) | Frontend SPA |
|
||||
**Primary:**
|
||||
- Python 3.12 — backend API, services, Celery tasks, storage backends
|
||||
- JavaScript (ES Modules, ES2022+) — Vue 3 frontend SPA
|
||||
|
||||
---
|
||||
**Secondary:**
|
||||
- SQL — PostgreSQL schema via Alembic migrations (`backend/migrations/`)
|
||||
- HTML/CSS — Vue SFC templates, Tailwind utility classes
|
||||
|
||||
## Runtime
|
||||
|
||||
**Backend:**
|
||||
- CPython 3.12 (Docker image: `python:3.12-slim`)
|
||||
- ASGI server: Uvicorn `>=0.29` with standard extras (websockets, httptools)
|
||||
- CPython 3.12 (pinned: `FROM python:3.12-slim` in `backend/Dockerfile`)
|
||||
- ASGI server: Uvicorn `>=0.29` with `[standard]` extras
|
||||
- Entry point: `backend/main.py` — `uvicorn main:app`
|
||||
|
||||
**Frontend:**
|
||||
- Node.js 20 (Docker image: `node:20-alpine`)
|
||||
- Dev server: Vite 5 on port 5173
|
||||
- Node.js 20 (pinned: `FROM node:20-alpine` in `frontend/Dockerfile`)
|
||||
- Dev server: Vite 5 on port 5173, proxies `/api` → `http://backend:8000`
|
||||
- Entry point: `frontend/index.html` → `frontend/src/main.js`
|
||||
|
||||
**Package Manager:**
|
||||
- Backend: `pip` — lockfile: none (ranges only in `backend/requirements.txt`)
|
||||
- Frontend: `npm` — lockfile: `frontend/package-lock.json` (present but not committed, generated on `npm install`)
|
||||
|
||||
---
|
||||
- Backend: `pip` — `backend/requirements.txt`; no lockfile (floating `>=` ranges used throughout — see CONCERNS.md)
|
||||
- Frontend: `npm` — lockfile: `frontend/package-lock.json`
|
||||
|
||||
## Frameworks
|
||||
|
||||
### Backend
|
||||
### Backend Core
|
||||
|
||||
| Package | Version | Purpose |
|
||||
|---|---|---|
|
||||
| `fastapi` | `>=0.111` | REST API framework — `backend/main.py` |
|
||||
| `fastapi` | `>=0.111` | Async REST API framework — `backend/main.py` |
|
||||
| `uvicorn[standard]` | `>=0.29` | ASGI server |
|
||||
| `pydantic-settings` | `>=2.2` | Settings/config validation |
|
||||
| `python-multipart` | latest | Multipart file upload parsing |
|
||||
| `pydantic` | `>=2.0` with `[email]` | Request/response validation |
|
||||
| `pydantic-settings` | `>=2.2` | Environment-based config — `backend/config.py` |
|
||||
| `python-multipart` | `>=0.0.27` | Multipart file upload parsing |
|
||||
|
||||
### ORM / Database
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `sqlalchemy[asyncio]` | `>=2.0.49` | Async ORM (`backend/db/session.py`, `backend/db/models.py`) |
| `psycopg[binary]` | `>=3.3.4` | psycopg v3 async PostgreSQL driver |
| `alembic` | `>=1.18.4` | Schema migrations (`backend/migrations/`) |
| `aiosqlite` | `>=0.20.0` | SQLite async driver (test isolation only) |
|
||||
|
||||
### Background Tasks
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `celery[redis]` | `>=5.5.0` | Async task queue (`backend/celery_app.py`) |
| `redis` | `>=4.6.0` | Redis async client; Celery broker + result backend + JTI token store |
|
||||
|
||||
### Auth / Security
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `PyJWT` | `>=2.8.0` | JWT access token creation and verification (`backend/services/auth.py`) |
| `pwdlib[argon2]` | `>=0.2.1` | Argon2id password hashing |
| `pyotp` | `>=2.9.0` | TOTP provisioning and verification (2FA) |
| `cryptography` | `>=41.0.0` | HKDF per-user key derivation; Fernet encryption for cloud credentials |
| `slowapi` | `>=0.1.9` | Rate limiting middleware on auth endpoints |
| `httpx` | `>=0.27` | Async HTTP client (HIBP k-anonymity checks, OneDrive Graph API) |
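The HIBP k-anonymity check mentioned for `httpx` works by sending only the first five hex characters of the password's SHA-1 digest and comparing the returned suffixes locally. The sketch below shows the hashing and matching steps with the standard library only. The function names are hypothetical, and the HTTP call itself (which the project apparently makes with `httpx`) is left out.

```python
import hashlib


def hibp_range_query(password: str) -> tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into (5-char prefix, 35-char suffix).

    Only the prefix would be sent to the range endpoint; the suffix stays local.
    """
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def is_breached(password: str, range_response: str) -> bool:
    """Check a range response body (lines of `SUFFIX:COUNT`) for our suffix."""
    _, suffix = hibp_range_query(password)
    for line in range_response.splitlines():
        candidate, _, _count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return True
    return False
```

Because the service only ever sees the prefix, it cannot tell which of the many matching passwords was being checked.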
|
||||
|
||||
### Document Processing
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `PyMuPDF` | `>=1.26.7` | PDF text extraction (`backend/services/extractor.py`) |
| `python-docx` | `>=1.1` | DOCX text extraction (`backend/services/extractor.py`) |
| `pytesseract` | `>=0.3` | OCR for image files (`backend/services/extractor.py`) |
| `Pillow` | `>=10.3` | Image loading for OCR pipeline |
| `aiofiles` | `>=23.2` | Async file I/O |
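Since one module handles PDF, DOCX, image, and plain-text inputs, some form of dispatch on file type is implied. The sketch below is one minimal way to express it. The suffix table and the name `pick_extractor` are assumptions for illustration, not taken from `backend/services/extractor.py`, and the returned labels stand in for the real extraction functions.

```python
from pathlib import Path

# Hypothetical suffix-to-extractor mapping; the real module may key on MIME type instead.
EXTRACTOR_BY_SUFFIX = {
    ".pdf": "pdf",      # would be handled with PyMuPDF
    ".docx": "docx",    # would be handled with python-docx
    ".png": "image",    # would be handled with Pillow + pytesseract
    ".jpg": "image",
    ".jpeg": "image",
    ".txt": "text",
}


def pick_extractor(filename: str) -> str:
    """Return the extractor label for a filename, case-insensitively."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTOR_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
```

Keeping the mapping in one table means that supporting a new document type is a one-line change plus the new extraction function.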
|
||||
|
||||
### AI Classification
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `anthropic` | `>=0.26` | Anthropic Claude SDK (`backend/ai/anthropic_provider.py`) |
| `openai` | `>=1.30` | OpenAI SDK; also used as shim for Ollama and LM Studio (`backend/ai/openai_provider.py`) |
|
||||
|
||||
### Cloud Storage SDKs
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `minio` | `>=7.2.20` | MinIO/S3 object storage SDK (`backend/storage/minio_backend.py`) |
| `google-auth-oauthlib` | `>=1.3.1` | Google OAuth2 flow (`backend/storage/google_drive_backend.py`) |
| `google-api-python-client` | `>=2.196.0` | Google Drive v3 API (`backend/storage/google_drive_backend.py`) |
| `msal` | `>=1.36.0` | Microsoft Auth Library for OneDrive (`backend/storage/onedrive_backend.py`) |
| `webdavclient3` | `>=3.14.7` | Generic WebDAV + Nextcloud (`backend/storage/webdav_backend.py`) |
| `cachetools` | `>=5.3.0` | Cloud connection caching (`backend/services/cloud_cache.py`) |
|
||||
|
||||
### Frontend
|
||||
|
||||
| Package | Version | Purpose |
|
||||
|---|---|---|
|
||||
| `vue` | `^3.4.0` | UI framework — `frontend/src/App.vue` and all components |
|
||||
| `vue-router` | `^4.3.0` | Client-side routing — `frontend/src/router/index.js` |
|
||||
| `pinia` | `^2.1.0` | State management — `frontend/src/stores/` |
|
||||
| `vue` | `^3.4.0` | UI framework (Options API) — `frontend/src/` |
|
||||
| `vue-router` | `^4.3.0` | Client-side routing — `frontend/src/router/` |
|
||||
| `pinia` | `^2.1.0` | State management (JWT access token stored in memory only) — `frontend/src/stores/` |
|
||||
| `qrcode` | `^1.5.4` | TOTP QR code generation for 2FA enrollment UI |
|
||||
| `tailwindcss` | `^3.4.0` | Utility-first CSS — `frontend/tailwind.config.js` |
|
||||
|
||||
### Build / Dev Tooling
|
||||
### Frontend Dev / Build
|
||||
|
||||
| Tool | Version | Purpose |
|
||||
|---|---|---|
|
||||
| `vite` | `^5.2.0` | Frontend bundler and dev server — `frontend/vite.config.js` |
|
||||
| `@vitejs/plugin-vue` | `^5.0.0` | Vue SFC support in Vite |
|
||||
| `tailwindcss` | `^3.4.0` | Utility-first CSS — `frontend/tailwind.config.js` |
|
||||
| `vite` | `^5.2.0` | Dev server and bundler — `frontend/vite.config.js` |
|
||||
| `@vitejs/plugin-vue` | `^5.0.0` | Vue SFC compilation |
|
||||
| `postcss` | `^8.4.0` | CSS processing — `frontend/postcss.config.js` |
|
||||
| `autoprefixer` | `^10.4.0` | CSS vendor prefixing |
|
||||
|
||||
---
|
||||
|
||||
## Key Backend Dependencies
|
||||
|
||||
| Package | Version | Purpose |
|
||||
|---|---|---|
|
||||
| `anthropic` | `>=0.26` | Anthropic Claude API client — `backend/ai/anthropic_provider.py` |
|
||||
| `openai` | `>=1.30` | OpenAI / OpenAI-compatible API client — `backend/ai/openai_provider.py`, also used for Ollama and LM Studio via `base_url` override |
|
||||
| `PyMuPDF` (`fitz`) | `>=1.24` | PDF text extraction — `backend/services/extractor.py` |
|
||||
| `python-docx` | `>=1.1` | DOCX text extraction — `backend/services/extractor.py` |
|
||||
| `pytesseract` | `>=0.3` | OCR for image files — `backend/services/extractor.py` |
|
||||
| `Pillow` | `>=10.3` | Image handling for OCR — `backend/services/extractor.py` |
|
||||
| `filelock` | `>=3.14` | File-based concurrency locks — `backend/services/storage.py` |
|
||||
| `aiofiles` | `>=23.2` | Async file I/O support |
|
||||
| `httpx` | `>=0.27` | Async HTTP client (used internally by `anthropic` and `openai` SDKs) |
|
||||
|
||||
---
|
||||
|
||||
## Testing
|
||||
### Testing
|
||||
|
||||
| Tool | Version | Purpose |
|
||||
|---|---|---|
|
||||
| `pytest` | `>=8.2` | Test runner — `backend/pytest.ini`, `backend/tests/` |
|
||||
| `pytest-asyncio` | `>=0.23` | Async test support; `asyncio_mode = auto` set in `backend/pytest.ini` |
|
||||
| `pytest` | `>=8.2` | Backend test runner — `backend/pytest.ini` |
|
||||
| `pytest-asyncio` | `>=1.3.0` | Async test support (`asyncio_mode = auto`) |
|
||||
| `vitest` | `^4.1.7` | Frontend test runner — `frontend/vitest.config.js` |
|
||||
| `@vue/test-utils` | `^2.4.10` | Vue component test utilities |
|
||||
| `happy-dom` | `^20.9.0` | DOM environment for Vitest |
|
||||
|
||||
No frontend test framework is present.
|
||||
## Infrastructure
|
||||
|
||||
---
|
||||
### Docker Compose Services (`docker-compose.yml`)
|
||||
|
||||
## Storage
|
||||
| Service | Image | Port(s) | Notes |
|---|---|---|---|
| `postgres` | `postgres:17-alpine` | internal | Persistent `postgres_data` volume |
| `minio` | `minio/minio:latest` | `9000`, `9001` | S3-compatible object store; persistent `minio_data` volume |
| `redis` | `redis:7-alpine` | internal | Password-protected; Celery broker + JTI revocation store |
| `backend` | Built from `./backend` | `8000` | Hot-reload via volume mount; depends on postgres, minio, redis |
| `celery-worker` | Built from `./backend` | none | Processes `documents` queue |
| `celery-beat` | Built from `./backend` | none | Periodic task scheduler |
| `frontend` | Built from `./frontend` | `5173` | Vite dev server; proxies `/api` → `backend:8000` |
|
||||
|
||||
- **File system only** — no database engine.
|
||||
- Upload files stored at `backend/data/uploads/` (UUID-named).
|
||||
- Document metadata stored as per-document JSON files at `backend/data/metadata/`.
|
||||
- Topics registry: `backend/data/topics.json`.
|
||||
- App settings: `backend/data/settings.json`.
|
||||
- File-level concurrency managed via `filelock` (`backend/services/storage.py`).
|
||||
### Database Role Separation
|
||||
|
||||
---
|
||||
- `docuvault_app` — DML only (SELECT/INSERT/UPDATE/DELETE); used by FastAPI app
|
||||
- `docuvault_migrate` — DDL; used by Alembic migrations only
|
||||
- Init script: `docker/postgres/initdb.d/01-init-users.sql`
|
||||
|
||||
## System Dependencies (backend Docker image)
|
||||
### System Dependencies (backend Docker image)
|
||||
|
||||
Installed via `apt-get` in `backend/Dockerfile`:
|
||||
- `tesseract-ocr` — OCR binary for `pytesseract`
|
||||
- `libgl1`, `libglib2.0-0` — shared libraries required by PyMuPDF
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
- Environment variable `DATA_DIR` sets the root data path (default: `/app/data`).
|
||||
- AI provider settings (models, API keys, base URLs) are stored in `backend/data/settings.json` and managed through the in-app Settings UI.
|
||||
- Optional bootstrap via `.env` (see `.env.example`): only `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` are referenced.
|
||||
- Default active provider is `lmstudio` (no API key required).
|
||||
**Environment variables** are the single source of truth, read by `pydantic-settings` in `backend/config.py`.
|
||||
|
||||
Required for core operation:
|
||||
- `DATABASE_URL` — psycopg v3 async DSN for app user
|
||||
- `DATABASE_MIGRATE_URL` — psycopg v3 DSN for migrate user
|
||||
- `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_BUCKET`
|
||||
- `REDIS_URL` — used by both FastAPI (JTI store) and Celery
|
||||
- `SECRET_KEY` — JWT signing secret
|
||||
- `CLOUD_CREDS_KEY` — 32-byte master key for HKDF cloud credential encryption
|
||||
|
||||
Optional:
|
||||
- `SMTP_HOST/PORT/USER/PASSWORD/FROM` — transactional email
|
||||
- `GOOGLE_CLIENT_ID/SECRET`, `ONEDRIVE_CLIENT_ID/SECRET` — OAuth cloud storage
|
||||
- `ADMIN_EMAIL`, `ADMIN_PASSWORD` — bootstrap admin account
|
||||
- `SYSTEM_PROMPT`, `DEFAULT_AI_PROVIDER`, `DEFAULT_AI_MODEL` — AI defaults
|
||||
- `CORS_ORIGINS`, `FRONTEND_URL`, `BACKEND_URL`
|
||||
|
||||
## Platform Requirements
|
||||
|
||||
**Development:**
|
||||
- Docker + Docker Compose (preferred), or
|
||||
- Python 3.12 and Node.js 20, plus locally running PostgreSQL 17, MinIO, and Redis instances
|
||||
|
||||
**Production:**
|
||||
- Containerised via Docker Compose; no cloud-native manifests or reverse-proxy config detected in repo
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- No Python version pinning file (`.python-version`, `pyproject.toml`) outside the Dockerfile — local dev outside Docker may use a different Python version.
|
||||
- No frontend lockfile committed; exact transitive dependency versions are non-deterministic until `npm install` is run.
|
||||
- No linter or formatter config detected (no `.eslintrc`, `.prettierrc`, `biome.json`, `ruff.toml`, `mypy.ini`, etc.).
|
||||
- No production deployment config beyond Docker Compose (no nginx config, no cloud provider manifests).
|
||||
*Stack analysis: 2026-06-02*
|
||||
|
||||
+324
-123
@@ -1,144 +1,345 @@
|
||||
# STRUCTURE — document-scanner
|
||||
<!-- refreshed: 2026-06-02 -->
|
||||
# Codebase Structure
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
**Analysis Date:** 2026-06-02
|
||||
|
||||
## Summary
|
||||
|
||||
The project is a monorepo with two top-level service directories (`backend/`, `frontend/`) and Docker Compose at the root. Backend is a Python/FastAPI app; frontend is a Vue 3 SPA built with Vite. All persistent data lives under `backend/data/`.
|
||||
|
||||
---
|
||||
|
||||
## Top-Level Layout
|
||||
## Directory Layout
|
||||
|
||||
```
|
||||
document_scanner/
|
||||
├── backend/ Python FastAPI service
|
||||
├── frontend/ Vue 3 SPA
|
||||
├── docker-compose.yml Two-service compose (backend + frontend)
|
||||
├── .env.example Optional env vars (API keys)
|
||||
└── .claude/ Claude Code settings
|
||||
document_scanner/ # Repo root
|
||||
├── backend/ # FastAPI Python backend
|
||||
│ ├── main.py # App factory, middleware, router registration
|
||||
│ ├── config.py # Pydantic Settings (all env vars)
|
||||
│ ├── celery_app.py # Celery factory, task routing, beat schedule
|
||||
│ ├── alembic.ini # Alembic migration config
|
||||
│ ├── requirements.txt # Python dependencies (floating >= ranges, no lockfile)
|
||||
│ ├── Dockerfile # Backend container image
|
||||
│ ├── pytest.ini # pytest config
|
||||
│ ├── api/ # HTTP route handlers (thin — no business logic)
|
||||
│ │ ├── auth.py # /api/auth/* — register, login, TOTP, refresh
|
||||
│ │ ├── documents.py # /api/documents/* — upload, confirm, list, stream
|
||||
│ │ ├── folders.py # /api/folders/* — CRUD + document move
|
||||
│ │ ├── shares.py # /api/shares/* — share grants and revocation
|
||||
│ │ ├── cloud.py # /api/cloud/* + /api/users/me/default-storage
|
||||
│ │ ├── admin.py # /api/admin/* — user management, quota, AI config
|
||||
│ │ ├── audit.py # /api/admin/audit-log — viewer + CSV export
|
||||
│ │ └── topics.py # /api/topics/* — CRUD topics + suggest
|
||||
│ ├── services/ # Business logic (no FastAPI coupling)
|
||||
│ │ ├── auth.py # Argon2, JWT, refresh tokens, TOTP, HIBP
|
||||
│ │ ├── audit.py # write_audit_log() helper
|
||||
│ │ ├── classifier.py # AI classification orchestration
|
||||
│ │ ├── extractor.py # PDF/DOCX/image/text extraction
|
||||
│ │ ├── storage.py # ORM document queries + topic resolution
|
||||
│ │ ├── cloud_cache.py # TTL-cached cloud folder listing
|
||||
│ │ └── email.py # Email composition helpers
|
||||
│ ├── storage/ # Pluggable object storage backends
|
||||
│ │ ├── base.py # StorageBackend ABC
|
||||
│ │ ├── __init__.py # Factory: get_storage_backend(), get_storage_backend_for_document()
|
||||
│ │ ├── minio_backend.py # MinIO/S3 implementation (primary)
|
||||
│ │ ├── google_drive_backend.py
|
||||
│ │ ├── onedrive_backend.py
|
||||
│ │ ├── nextcloud_backend.py
|
||||
│ │ ├── webdav_backend.py
|
||||
│ │ ├── cloud_utils.py # HKDF encryption/decryption, URL validation
|
||||
│ │ └── exceptions.py # CloudConnectionError
|
||||
│ ├── ai/ # Pluggable AI classification providers
|
||||
│ │ ├── base.py # AIProvider ABC + ClassificationResult dataclass
|
||||
│ │ ├── __init__.py # Factory: get_provider()
|
||||
│ │ ├── ollama_provider.py
|
||||
│ │ ├── openai_provider.py
|
||||
│ │ ├── anthropic_provider.py
|
||||
│ │ ├── lmstudio_provider.py
|
||||
│ │ └── utils.py # Shared AI utilities
|
||||
│ ├── db/ # Database layer
|
||||
│ │ ├── models.py # SQLAlchemy ORM — 11 tables, all UUID PKs
|
||||
│ │ └── session.py # Async engine + AsyncSessionLocal factory
|
||||
│ ├── deps/ # FastAPI dependency injection
|
||||
│ │ ├── auth.py # get_current_user, get_current_admin, get_regular_user
|
||||
│ │ ├── db.py # get_db (per-request AsyncSession)
|
||||
│ │ └── utils.py # get_client_ip
|
||||
│ ├── tasks/ # Celery async task modules
|
||||
│ │ ├── document_tasks.py # extract_and_classify, cleanup_abandoned_uploads
|
||||
│ │ ├── email_tasks.py # send_reset_email, send_security_alert_email
|
||||
│ │ └── audit_tasks.py # audit_log_daily_export (nightly Celery beat)
|
||||
│ ├── migrations/ # Alembic migration scripts
|
||||
│ │ ├── versions/
|
||||
│ │ │ ├── 0001_initial_schema.py
|
||||
│ │ │ ├── 0002_add_backup_codes_and_password_must_change.py
|
||||
│ │ │ ├── 0003_multi_user_isolation.py
|
||||
│ │ │ └── 0004_phase4_pdf_open_mode_tsvector.py
|
||||
│ │ └── env.py # Alembic async migration runner
|
||||
│ ├── tests/ # Backend test suite (pytest + httpx)
|
||||
│ │ ├── conftest.py # Shared fixtures (async engine, client, users)
|
||||
│ │ ├── test_auth_api.py
|
||||
│ │ ├── test_documents.py
|
||||
│ │ ├── test_folders.py
|
||||
│ │ ├── test_shares.py
|
||||
│ │ ├── test_cloud.py
|
||||
│ │ ├── test_admin_api.py
|
||||
│ │ ├── test_audit.py
|
||||
│ │ ├── test_quota.py
|
||||
│ │ ├── test_security.py
|
||||
│ │ └── ... # 28 test files total
|
||||
│ └── data/ # Static data files (topic seed data etc.)
|
||||
│
|
||||
├── frontend/ # Vue 3 SPA
|
||||
│ ├── src/
|
||||
│ │ ├── main.js # Vue app mount, Pinia + Router registration
|
||||
│ │ ├── App.vue # Root component — layout switcher (auth vs app)
|
||||
│ │ ├── style.css # Global Tailwind CSS entry
|
||||
│ │ ├── api/
|
||||
│ │ │ └── client.js # fetch wrapper, Bearer injection, 401→refresh→retry
|
||||
│ │ ├── stores/ # Pinia state stores
|
||||
│ │ │ ├── auth.js # accessToken (memory), user, quota, refresh
|
||||
│ │ │ ├── documents.js # documents list, upload flow, search/sort
|
||||
│ │ │ ├── folders.js # folder tree, breadcrumb, rootFolders
|
||||
│ │ │ ├── topics.js # topics list CRUD
|
||||
│ │ │ └── cloudConnections.js # cloud connection list
|
||||
│ │ ├── router/
|
||||
│ │ │ └── index.js # Routes + beforeEach auth guard (silent refresh)
|
||||
│ │ ├── layouts/
|
||||
│ │ │ └── AuthLayout.vue # Centered card layout for login/register pages
|
||||
│ │ ├── views/ # Page-level components (one per route)
|
||||
│ │ │ ├── FileManagerView.vue # / and /folders/:id — unified file manager
|
||||
│ │ │ ├── DocumentView.vue # /document/:id — document detail + preview
|
||||
│ │ │ ├── TopicsView.vue # /topics — topic management
|
||||
│ │ │ ├── SettingsView.vue # /settings — user settings + TOTP
|
||||
│ │ │ ├── AdminView.vue # /admin — admin panel (users, audit log)
|
||||
│ │ │ ├── SharedView.vue # /shared — documents shared with me
|
||||
│ │ │ ├── CloudStorageView.vue # /cloud — cloud connections overview
|
||||
│ │ │ ├── CloudFolderView.vue # /cloud/:provider/:folderId — cloud folder browser
|
||||
│ │ │ └── auth/ # Auth flow pages
|
||||
│ │ │ ├── LoginView.vue
|
||||
│ │ │ ├── RegisterView.vue
|
||||
│ │ │ ├── PasswordResetView.vue
|
||||
│ │ │ └── NewPasswordView.vue
|
||||
│ │ ├── components/ # Reusable UI components
|
||||
│ │ │ ├── storage/
|
||||
│ │ │ │ └── StorageBrowser.vue # Core file manager widget (local + cloud modes)
|
||||
│ │ │ ├── layout/
|
||||
│ │ │ │ ├── AppSidebar.vue # Navigation sidebar with folder tree + quota bar
|
||||
│ │ │ │ └── QuotaBar.vue # Storage quota progress bar
|
||||
│ │ │ ├── documents/
|
||||
│ │ │ │ └── DocumentCard.vue # Single document row in file manager
|
||||
│ │ │ ├── folders/
|
||||
│ │ │ │ ├── FolderTreeItem.vue # Recursive sidebar folder tree node
|
||||
│ │ │ │ └── FolderDeleteModal.vue
|
||||
│ │ │ ├── cloud/
|
||||
│ │ │ │ ├── CloudProviderTreeItem.vue
|
||||
│ │ │ │ └── CloudFolderTreeItem.vue
|
||||
│ │ │ ├── sharing/
|
||||
│ │ │ │ └── ShareModal.vue # Share document with another user
|
||||
│ │ │ ├── upload/
|
||||
│ │ │ │ └── DropZone.vue # Drag-and-drop file upload zone
|
||||
│ │ │ ├── auth/ # Auth form components
|
||||
│ │ │ ├── admin/ # Admin panel sub-components
|
||||
│ │ │ ├── settings/ # Settings page sub-components
|
||||
│ │ │ ├── topics/ # Topic chip/badge components
|
||||
│ │ │ └── ui/ # Generic UI primitives (TreeItem.vue, etc.)
|
||||
│ │ └── utils/ # Frontend utility functions
|
||||
│ ├── index.html # Vite HTML entry
|
||||
│ ├── vite.config.js # Vite config (proxy /api → :8000)
|
||||
│ ├── tailwind.config.js # Tailwind CSS config
|
||||
│ ├── vitest.config.js # Vitest test config
|
||||
│ └── package.json # npm dependencies
|
||||
│
|
||||
├── docker/
|
||||
│ └── postgres/
|
||||
│ └── initdb.d/ # PostgreSQL init scripts (DB user + role setup)
|
||||
│
|
||||
├── docker-compose.yml # All services: postgres, minio, redis, backend,
|
||||
│ # celery-worker, celery-beat, frontend
|
||||
├── .env.example # Documented env var template (safe to commit)
|
||||
├── .env # Local secrets (gitignored)
|
||||
├── CLAUDE.md # Project instructions for Claude agents
|
||||
├── SECURITY.md # Security audit findings and mitigations
|
||||
└── .planning/ # GSD workflow planning artifacts
|
||||
├── ROADMAP.md
|
||||
├── REQUIREMENTS.md
|
||||
├── STATE.md
|
||||
├── PROJECT.md
|
||||
└── codebase/ # Codebase map (this directory)
|
||||
```
|
||||
|
||||
---
|
||||
## Directory Purposes
|
||||
|
||||
## Backend
|
||||
**`backend/api/`:**
|
||||
- Purpose: HTTP endpoint handlers — thin layer only. No business logic.
|
||||
- Contains: One module per resource (`auth.py`, `documents.py`, `folders.py`, etc.)
|
||||
- Key files: `backend/api/documents.py` (presigned upload flow), `backend/api/auth.py` (JWT issuance)
|
||||
|
||||
```
|
||||
backend/
|
||||
├── main.py FastAPI app: CORS, lifespan, router registration
|
||||
├── config.py Path constants, DEFAULT_SETTINGS, ensure_data_dirs()
|
||||
├── requirements.txt Python dependencies
|
||||
├── pytest.ini pytest config (asyncio_mode=auto)
|
||||
├── Dockerfile
|
||||
│
|
||||
├── api/ FastAPI routers (thin HTTP layer)
|
||||
│ ├── documents.py Upload, list, get, delete, reclassify endpoints
|
||||
│ ├── topics.py Topic CRUD endpoints
|
||||
│ └── settings.py AI provider settings endpoints
|
||||
│
|
||||
├── ai/ AI provider abstraction
|
||||
│ ├── base.py AIProvider ABC + ClassificationResult dataclass
|
||||
│ ├── __init__.py get_provider() factory
|
||||
│ ├── anthropic_provider.py
|
||||
│ ├── openai_provider.py
|
||||
│ ├── ollama_provider.py extends OpenAIProvider
|
||||
│ └── lmstudio_provider.py extends OpenAIProvider
|
||||
│
|
||||
├── services/ Business logic (no FastAPI dependency)
|
||||
│ ├── extractor.py Text extraction: PDF/DOCX/image/text dispatch
|
||||
│ ├── classifier.py Orchestrates AI call + topic auto-creation
|
||||
│ └── storage.py Flat-file JSON CRUD + filelock
|
||||
│
|
||||
├── data/ Runtime data (volume-mounted in Docker)
|
||||
│ ├── uploads/ Uploaded document files
|
||||
│ ├── metadata/ Per-document JSON metadata files
|
||||
│ ├── topics.json Global topic list
|
||||
│ └── settings.json Active AI provider + system prompt config
|
||||
│
|
||||
└── tests/
|
||||
├── conftest.py Fixtures: isolated tmp data dir, TestClient, sample files
|
||||
├── test_health.py
|
||||
├── test_documents.py
|
||||
├── test_topics.py
|
||||
├── test_settings.py
|
||||
├── test_extractor.py
|
||||
├── test_classifier.py
|
||||
└── test_lmstudio.py
|
||||
```
|
||||
**`backend/services/`:**
|
||||
- Purpose: Business logic decoupled from FastAPI. Functions are pure async Python.
|
||||
- Contains: `auth.py` (crypto, TOTP, HIBP), `classifier.py` (AI orchestration), `extractor.py` (text extraction), `storage.py` (ORM queries), `audit.py` (audit log writer), `cloud_cache.py` (TTL cache), `email.py` (email helpers)
|
||||
- Rule: No module in `services/` may import from `fastapi` or `api/`
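The import rule for `services/` lends itself to an automated check. The following sketch uses the standard `ast` module to flag offending imports in a source string. The function name `forbidden_imports` is hypothetical, and it assumes the API layer is imported as the top-level package `api` (consistent with the `uvicorn main:app` entry point, but not confirmed by the source). Relative imports are deliberately ignored.

```python
import ast

# Top-level packages that services/ modules must not import (per the rule above).
FORBIDDEN_ROOTS = {"fastapi", "api"}


def forbidden_imports(source: str) -> list[str]:
    """Return the dotted names of absolute imports whose root package is forbidden."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        found.extend(n for n in names if n.split(".")[0] in FORBIDDEN_ROOTS)
    return found
```

Running this over every file in `backend/services/` from a test would turn the convention into something the suite can enforce.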
|
||||
|
||||
---
|
||||
**`backend/storage/`:**
|
||||
- Purpose: All object storage interaction behind the `StorageBackend` ABC
|
||||
- Contains: `base.py` (interface), factory `__init__.py`, one file per backend, `cloud_utils.py` (HKDF encrypt/decrypt), `exceptions.py`
|
||||
- Key invariant: `get_storage_backend_for_document()` is the only place cloud credentials are decrypted
|
||||
|
||||
## Frontend
|
||||
**`backend/ai/`:**
|
||||
- Purpose: AI classification providers behind the `AIProvider` ABC
|
||||
- Contains: `base.py` (interface + `ClassificationResult`), factory `__init__.py`, one file per provider
|
||||
- Selected per-user via `users.ai_provider` + `users.ai_model` DB columns
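The ABC-plus-factory arrangement described here can be sketched as follows. Only the names `AIProvider`, `ClassificationResult`, and `get_provider()` come from the docs. The `classify` signature, the `topics` field, the registry dict, and the toy `KeywordProvider` are assumptions made for illustration, and the real providers are likely async and call an external model.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ClassificationResult:
    topics: list[str]  # hypothetical field; the real dataclass may carry more


class AIProvider(ABC):
    @abstractmethod
    def classify(self, text: str, known_topics: list[str]) -> ClassificationResult:
        """Assign topics to a document's extracted text (assumed signature)."""


class KeywordProvider(AIProvider):
    """Toy provider for illustration: matches known topics by substring."""

    def __init__(self, model: str):
        self.model = model

    def classify(self, text: str, known_topics: list[str]) -> ClassificationResult:
        lowered = text.lower()
        return ClassificationResult([t for t in known_topics if t.lower() in lowered])


# Hypothetical registry; adding a provider means adding one entry here.
_PROVIDERS = {"keyword": KeywordProvider}


def get_provider(name: str, model: str) -> AIProvider:
    """Build the provider selected by the user's ai_provider / ai_model values."""
    try:
        return _PROVIDERS[name](model)
    except KeyError:
        raise ValueError(f"unknown AI provider: {name!r}") from None
```

With this shape, the classifier service depends only on `AIProvider`, so switching a user from one provider to another changes the arguments to `get_provider()` and nothing else.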
|
||||
|
||||
```
|
||||
frontend/
|
||||
├── index.html Vite entry HTML
|
||||
├── vite.config.js Vite config (Vue plugin, /api proxy)
|
||||
├── tailwind.config.js
|
||||
├── postcss.config.js
|
||||
├── package.json Vue 3, Vue Router 4, Pinia; no test framework
|
||||
├── Dockerfile
|
||||
│
|
||||
└── src/
|
||||
├── main.js App bootstrap: Vue + Pinia + Router
|
||||
├── App.vue Root component (sidebar layout wrapper)
|
||||
├── style.css Global Tailwind imports
|
||||
│
|
||||
├── api/
|
||||
│ └── client.js fetch wrapper; all API calls go through here
|
||||
│
|
||||
├── stores/ Pinia stores (data + actions layer)
|
||||
│ ├── documents.js Document list, upload, classify state
|
||||
│ ├── topics.js Topic list CRUD state
|
||||
│ └── settings.js AI provider settings state
|
||||
│
|
||||
├── router/
|
||||
│ └── index.js Routes: /, /topics, /topics/:name, /document/:id, /settings
|
||||
│
|
||||
├── views/ Page-level components (one per route)
|
||||
│ ├── HomeView.vue
|
||||
│ ├── TopicsView.vue
|
||||
│ ├── DocumentView.vue
|
||||
│ └── SettingsView.vue
|
||||
│
|
||||
└── components/ Reusable UI components
|
||||
├── layout/
|
||||
│ └── AppSidebar.vue
|
||||
├── documents/
|
||||
│ └── DocumentCard.vue
|
||||
├── topics/
|
||||
│ ├── TopicBadge.vue
|
||||
│ └── TopicManager.vue
|
||||
└── upload/
|
||||
├── DropZone.vue
|
||||
└── UploadProgress.vue
|
||||
```
|
||||
**`backend/db/`:**
|
||||
- Purpose: ORM schema and session management
|
||||
- Contains: `models.py` (11 tables, all UUID PKs, full index declarations), `session.py` (async engine, `AsyncSessionLocal`)
|
||||
- Note: Two DB users — `docuvault_app` (DML only, used at runtime) and `docuvault_migrate` (DDL, used by Alembic only)
|
||||
|
||||
---
|
||||
**`backend/deps/`:**
|
||||
- Purpose: FastAPI `Depends()` callables — shared dependency injection
|
||||
- Contains: `get_db` (per-request session), `get_current_user`, `get_current_admin`, `get_regular_user`, `get_client_ip`
|
||||
|
||||
## Key Entry Points
|
||||
**`backend/tasks/`:**
|
||||
- Purpose: Celery task definitions for async background work
|
||||
- Contains: `document_tasks.py` (extraction + classification + cleanup), `email_tasks.py` (password reset + security alerts), `audit_tasks.py` (nightly CSV export)
|
||||
|
||||
| File | Purpose |
|
||||
|---|---|
|
||||
| `backend/main.py` | FastAPI app instantiation, middleware, router registration |
|
||||
| `backend/config.py` | All path constants and default settings — change storage paths here |
|
||||
| `backend/ai/__init__.py` | Add a new AI provider here |
|
||||
| `frontend/src/main.js` | Vue app bootstrap |
|
||||
| `frontend/src/api/client.js` | All HTTP calls originate here |
|
||||
**`backend/migrations/versions/`:**
|
||||
- Purpose: Alembic migration history
|
||||
- Contains: Sequentially numbered migration scripts (`0001_` → `0004_`)
|
||||
- Generated: Each script is manually reviewed; autogenerated output is never committed as-is
|
||||
|
||||
---
|
||||
**`backend/tests/`:**
|
||||
- Purpose: pytest test suite using `httpx.AsyncClient` with real PostgreSQL
|
||||
- Contains: 28 test files covering all endpoints, security invariants, and services
|
||||
- Key files: `conftest.py` (shared fixtures), `test_security.py` (IDOR, admin block, CSRF tests)
|
||||
|
||||
**`frontend/src/stores/`:**
|
||||
- Purpose: Pinia stores — application state + API calls
|
||||
- Contains: `auth.js`, `documents.js`, `folders.js`, `topics.js`, `cloudConnections.js`
|
||||
- Rule: Stores are the only place `api/client.js` is called from. Views do not call `api/` directly.
|
||||
|
||||
**`frontend/src/api/`:**
|
||||
- Purpose: Thin HTTP client wrapper
|
||||
- Contains: `client.js` — all `fetch()` calls, Bearer header injection, 401→refresh→retry logic, all exported API functions
|
||||
- Rule: No business logic here — purely request/response translation

**`frontend/src/views/`:**
- Purpose: Route-level page components
- Contains: One `.vue` file per route. Views wire stores to components via event delegation.
- Key file: `FileManagerView.vue` — root view, delegates to `StorageBrowser` component

**`frontend/src/components/storage/`:**
- Purpose: Reusable file manager widget
- Contains: `StorageBrowser.vue` — unified listing component for local folder mode and cloud folder mode

**`frontend/src/components/layout/`:**
- Purpose: Persistent app shell
- Contains: `AppSidebar.vue` (navigation, folder tree, cloud links, quota bar), `QuotaBar.vue` (storage progress)

## Key File Locations

**Entry Points:**
- `backend/main.py`: FastAPI app — start here for any backend investigation
- `backend/celery_app.py`: Celery factory — start here for task routing investigation
- `frontend/src/main.js`: Vue app mount
- `frontend/src/router/index.js`: All routes + auth guard

**Configuration:**
- `backend/config.py`: All env vars with defaults (Pydantic Settings)
- `.env.example`: Documented env var template
- `docker-compose.yml`: Full service topology with env var wiring
- `frontend/vite.config.js`: Dev proxy config (`/api` → `:8000`)

**Core Logic:**
- `backend/db/models.py`: Full ORM schema — reference for all table structures
- `backend/services/auth.py`: JWT, Argon2, TOTP, HIBP — all auth primitives
- `backend/storage/__init__.py`: Storage backend factory — entry point for understanding storage routing
- `backend/storage/cloud_utils.py`: HKDF credential encryption/decryption
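
`cloud_utils.py` is described as using HKDF for credential encryption, but its code is not shown here. The sketch below only illustrates HKDF key derivation (RFC 5869 with HMAC-SHA256) using the standard library. The function name, the master secret and the per-user `info` labels are assumptions, and the cipher that would consume the derived key is out of scope.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """Derive `length` bytes from input key material per RFC 5869."""
    hash_len = hashlib.sha256().digest_size
    if length > 255 * hash_len:
        raise ValueError("requested key is too long for HKDF-SHA256")
    # Extract: an empty salt is replaced by hash_len zero bytes.
    prk = hmac.new(salt or b"\x00" * hash_len, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks T(1), T(2), ... until enough output exists.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Hypothetical usage: one derived key per user from a single master secret.
master = b"example-master-secret"  # assumption: would come from configuration
key_a = hkdf_sha256(master, salt=b"", info=b"cloud-creds:user-a")
key_b = hkdf_sha256(master, salt=b"", info=b"cloud-creds:user-b")
```

Varying `info` per user or per connection yields independent keys, so a leaked derived key does not expose the master secret or other users' keys.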

**Testing:**
- `backend/tests/conftest.py`: Test fixtures — DB setup, user creation, auth helpers
- `backend/tests/test_security.py`: Security invariant tests (IDOR, admin block, CSRF, timing)

## Naming Conventions

**Backend files:**
- Modules: `snake_case.py`
- One module per resource/concern in `api/` (matches the resource noun: `documents.py`, `folders.py`)
- One module per backend in `storage/` (`{provider}_backend.py`)
- One module per provider in `ai/` (`{provider}_provider.py`)

**Frontend files:**
- Vue components: `PascalCase.vue`
- Stores: `camelCase.js` matching the resource noun (`documents.js`, `folders.js`)
- Views: `{Name}View.vue` pattern

**Database:**
- All tables: `snake_case` plural (`users`, `refresh_tokens`, `cloud_connections`)
- All PKs: UUID type
- FKs: `{table_singular}_id` pattern (`user_id`, `folder_id`, `document_id`)

## Where to Add New Code

- **New API endpoint**: add router in `backend/api/`, register in `backend/main.py`
- **New AI provider**: implement `AIProvider` ABC in `backend/ai/`, add case in `get_provider()`
- **New document type**: add extraction branch in `backend/services/extractor.py`
- **New frontend page**: add view in `src/views/`, add route in `src/router/index.js`
- **New shared UI component**: add to relevant `src/components/<category>/` subdirectory
**New API endpoint (new resource):**
- Create `backend/api/{resource}.py` with `APIRouter(prefix="/api/{resource}")`
- Add service logic to `backend/services/{resource}.py` (or extend existing service)
- Register router in `backend/main.py` with `app.include_router()`
- Add corresponding `export function {action}{Resource}()` calls to `frontend/src/api/client.js`

**New Vue page (new route):**
- Create `frontend/src/views/{Name}View.vue`
- Add route to `frontend/src/router/index.js`
- If it needs auth: add `meta: { requiresAuth: true }` (or `requiresAdmin: true`)

**New Pinia store:**
- Create `frontend/src/stores/{resource}.js` using Composition API pattern (`defineStore('name', () => { ... })`)
- Export named: `export const use{Resource}Store`

**New storage backend:**
- Implement `StorageBackend` ABC from `backend/storage/base.py`
- Create `backend/storage/{provider}_backend.py`
- Add lazy import branch in `get_storage_backend_for_document()` in `backend/storage/__init__.py`

**New AI provider:**
- Implement `AIProvider` ABC from `backend/ai/base.py`
- Create `backend/ai/{provider}_provider.py`
- Register in `backend/ai/__init__.py` factory
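
The three steps for a new AI provider can be sketched as follows. Only the names `AIProvider`, `ClassificationResult` and `get_provider()` come from the codebase map. The `classify` method, the result fields, the toy `KeywordProvider` and the registry dictionary are assumptions made for illustration, and the real interface may well be asynchronous.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ClassificationResult:
    """Assumed shape; the real fields in backend/ai/base.py may differ."""
    topic: str
    confidence: float


class AIProvider(ABC):
    """Assumed interface: a single `classify` method (hypothetical name)."""

    @abstractmethod
    def classify(self, text: str, topics: list[str]) -> ClassificationResult:
        ...


class KeywordProvider(AIProvider):
    """Toy provider for illustration: picks the first topic named in the text."""

    def classify(self, text: str, topics: list[str]) -> ClassificationResult:
        lowered = text.lower()
        for topic in topics:
            if topic.lower() in lowered:
                return ClassificationResult(topic=topic, confidence=1.0)
        return ClassificationResult(topic="uncategorized", confidence=0.0)


# Registry keyed by provider name; the real factory may use if/elif branches instead.
_PROVIDERS: dict[str, type[AIProvider]] = {"keyword": KeywordProvider}


def get_provider(name: str) -> AIProvider:
    """Return a provider instance for `name`, or raise for unknown names."""
    try:
        return _PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown AI provider: {name!r}") from None
```

With a registry, adding a provider touches one new module plus one dictionary entry, which matches the "create module, then register in the factory" steps above.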

**New Celery task:**
- Add task function to appropriate `backend/tasks/*.py` module
- Decorate with `@celery_app.task(name="tasks.{module}.{task_name}")`
- If periodic: add to `celery_app.conf.beat_schedule` in `backend/celery_app.py`

**New DB table:**
- Add ORM model class to `backend/db/models.py` extending `Base`
- Create new Alembic migration: `alembic revision --autogenerate -m "description"`
- Review and test the generated migration before committing

**New tests:**
- Backend: add `backend/tests/test_{resource}.py`
- Use fixtures from `backend/tests/conftest.py` (async session, auth client, test users)
- Security invariant tests belong in `backend/tests/test_security.py`

## Special Directories

**`.planning/`:**
- Purpose: GSD workflow planning artifacts (roadmap, requirements, phase plans, codebase maps)
- Generated: Partially (codebase maps regenerated by mapper agents)
- Committed: Yes

**`backend/data/`:**
- Purpose: Static data files (topic seed data, fixture CSVs)
- Generated: No
- Committed: Yes

**`frontend/dist/`:**
- Purpose: Vite production build output
- Generated: Yes (`npm run build`)
- Committed: No (gitignored)

**`backend/migrations/versions/`:**
- Purpose: Alembic migration history — one file per schema change
- Generated: Via `alembic revision` then manually reviewed
- Committed: Yes — each migration is a permanent historical artifact

**`.claude/worktrees/`:**
- Purpose: Isolated git worktrees used by Claude Code agent subprocesses
- Generated: Yes (by `/gsd:execute-phase` and related commands)
- Committed: No

---

## Gaps / Unknowns

- No `src/components/settings/` subdirectory — settings UI is entirely in `SettingsView.vue`
- No migration or schema versioning for `topics.json` / `settings.json` flat files
*Structure analysis: 2026-06-02*

+318
-74
@@ -1,87 +1,331 @@
# TESTING — document-scanner
# Testing Patterns

_Last updated: 2026-05-21_
**Analysis Date:** 2026-06-02

## Summary
## Test Framework

The backend has solid integration test coverage across all API surfaces and services using pytest + FastAPI TestClient. Each test runs in a fully isolated temporary data directory, so there is no shared state between tests. The frontend has no test framework configured at all.
**Backend Runner:**
- pytest 8.2+ with pytest-asyncio
- Config: `backend/pytest.ini` — `asyncio_mode = auto`, `testpaths = tests`
- `asyncio_mode = auto` means all `async def test_*` functions run as coroutines automatically

---
**Backend Assertion Library:**
- pytest built-in `assert`
- `unittest.mock` for `AsyncMock`, `MagicMock`, `patch`

## Backend Testing

### Framework
- **pytest** + **pytest-asyncio** (`asyncio_mode = auto` in `pytest.ini`)
- **FastAPI TestClient** (Starlette's synchronous test client, built on `httpx`)
- No mocking library — AI calls are either tested with real parsing logic or the AI layer is swapped via provider mocking

### Test Isolation Strategy (conftest.py)
- `isolated_data_dir` fixture is `autouse=True` — every test automatically gets:
  - A fresh `tmp_path/data/` directory with `uploads/`, `metadata/`
  - Clean `topics.json` and `settings.json` initialized from `DEFAULT_SETTINGS`
  - Monkeypatched `DATA_DIR` env var and all module-level path constants in `config` and `services.storage`
  - New `FileLock` instances pointing to the tmp dir
- `client` fixture wraps FastAPI `TestClient` with the isolated data dir active

### Test Files

| File | What it covers |
|---|---|
| `test_health.py` | `GET /health` returns `{"status": "ok"}` |
| `test_documents.py` | Upload TXT/PDF (no-classify), list, get, delete; extracts text correctly |
| `test_topics.py` | Create, list, delete topics via API |
| `test_settings.py` | Read default settings, update provider config |
| `test_extractor.py` | Unit tests for `extract_text()` on TXT, PDF, DOCX, image paths |
| `test_classifier.py` | Unit tests for JSON parsing helpers (`_parse_classification`, `_parse_suggestions`, `_strip_code_fences`) — no real AI calls |
| `test_lmstudio.py` | LMStudio provider-specific behaviour (likely mocked or uses a local endpoint) |

### Fixtures Available

| Fixture | Provides |
|---|---|
| `isolated_data_dir` | Autouse — clean tmp data dir |
| `client` | FastAPI TestClient with isolated data |
| `sample_txt` | A `.txt` file with test content |
| `sample_pdf` | A minimal valid PDF created with PyMuPDF |

### What Is NOT Tested

- Auto-classification flow end-to-end (requires a live AI provider)
- Document reclassify endpoint
- Anthropic, OpenAI, Ollama provider implementations directly
- Any concurrent write / filelock contention scenarios
- File size / type validation edge cases
- Frontend — no tests exist

---

## Frontend Testing

- **No test framework installed** — `package.json` has no `vitest`, `jest`, or `@testing-library/vue`
- No test files found under `frontend/src/`
- No Cypress or Playwright configuration

---

## Running Tests
**Frontend Runner:**
- Vitest 4.1.7
- Config: `frontend/vitest.config.js` — `environment: 'happy-dom'`, `globals: true`
- `@vue/test-utils` 2.4.10 for component mounting

**Run Commands:**
```bash
# From backend/
pytest
# Backend — from backend/ directory
pytest -v                        # Run all tests
pytest tests/test_auth_api.py    # Single file
INTEGRATION=1 pytest -v          # Run with live Docker services (PostgreSQL + MinIO + Redis)

# With verbose output
pytest -v
# Frontend — from frontend/ directory
npm test       # vitest run (one-shot)
npx vitest     # watch mode
```

# Single file
pytest tests/test_documents.py
## Test File Organization

**Backend location:** All tests in `backend/tests/`; flat structure, one file per concern.

**Naming:**
- `test_<area>.py` — `test_auth_api.py`, `test_documents.py`, `test_shares.py`
- `test_<layer>_<area>.py` for unit tests: `test_task2_auth_service.py`, `test_cloud_backends.py`

**Frontend location:** Co-located in `__tests__/` subdirectories next to the code they test:
- `frontend/src/stores/__tests__/auth.test.js`
- `frontend/src/components/folders/__tests__/FolderTreeItem.test.js`
- `frontend/src/views/__tests__/FileManagerView.test.js`
- `frontend/src/router/__tests__/router.guard.test.js`

## Backend Test Structure

**Standard async test (most common pattern):**
```python
@pytest.mark.asyncio
async def test_register_success(authed_client):
    """POST /api/auth/register with valid data returns 201 with id and handle."""
    resp = await _register(authed_client)
    assert resp.status_code == 201, resp.text
    data = resp.json()
    assert "id" in data
    assert data["handle"] == "testuser"
```

**Module-level async mark (newer pattern, avoids per-function decorator):**
```python
pytestmark = pytest.mark.asyncio  # at module top — used in test_shares.py, test_audit.py
```

**Shared helper functions:** Each test file defines async helper functions (not fixtures) for setup operations:
```python
async def _register(async_client, handle="testuser", email="t@example.com", password="ValidPass12!"):
    return await async_client.post("/api/auth/register", json={...})
```

**ORM-direct test data creation:** Tests often insert data via ORM rather than API to test specific states:
```python
doc = Document(id=doc_id, user_id=auth_user["user"].id, ...)
db_session.add(doc)
await db_session.commit()
```

## Backend Fixtures (conftest.py)

All fixtures are async (`@pytest_asyncio.fixture`) unless purely synchronous.

**Session fixture:**
```python
@pytest_asyncio.fixture
async def db_session():
    # In-memory SQLite with PostgreSQL type shims (INET, JSONB patched to TEXT)
    # Used for all unit/integration tests without live services
```

**HTTP client fixtures:**
```python
@pytest_asyncio.fixture
async def async_client(db_session):
    # httpx.AsyncClient + ASGITransport wrapping the real FastAPI app
    # DB dependency overridden via app.dependency_overrides[get_db]
```

**Auth fixtures (shared across all API tests):**
```python
@pytest_asyncio.fixture
async def auth_user(db_session):
    # Creates User + Quota, issues JWT, returns:
    # { "user": User, "token": str, "headers": {"Authorization": "Bearer ..."} }

@pytest_asyncio.fixture
async def second_auth_user(db_session):
    # Same shape as auth_user — used for sharing tests (owner + recipient)

@pytest_asyncio.fixture
async def admin_user(db_session):
    # Same shape, role="admin"
```

**Infrastructure mocks:**
```python
@pytest.fixture
def mock_minio_presigned(monkeypatch):
    # Patches MinIOBackend.generate_presigned_put_url with AsyncMock

@pytest.fixture
def mock_minio_stat(monkeypatch):
    # Patches MinIOBackend.stat_object with AsyncMock returning 1024 bytes
    # Override per-test: mock_minio_stat.return_value = 50_000_000
```

**Cloud fixtures:**
```python
@pytest.fixture
def mock_google_drive_creds():  # Fake OAuth credential dict

@pytest.fixture
def mock_onedrive_creds():  # Fake MSAL credential dict

@pytest.fixture
async def cloud_connection_factory(db_session):
    # Factory: creates CloudConnection ORM rows
    # Usage: conn = await cloud_connection_factory(session, user_id, provider="google_drive")
```

**File fixtures:**
```python
@pytest.fixture
def sample_txt(tmp_path):  # Creates "sample.txt" in tmp_path

@pytest.fixture
def sample_pdf(tmp_path):  # Creates minimal PDF via PyMuPDF
```

## Service Availability and Integration Mode

Tests default to **in-memory SQLite** (no live services required):
- PostgreSQL-specific types (UUID, INET, JSONB) are patched via `SQLiteTypeCompiler` monkey-patching
- Tests that require PostgreSQL row-level locking semantics are marked `@pytest.mark.xfail(strict=False)`

For **live service testing**, set `INTEGRATION=1` or have Docker services running on their default ports (PostgreSQL:5432, MinIO:9000, Redis:6379). The `live_services_available()` fixture detects this.
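
How such detection might work can be sketched with the standard library. The real `live_services_available()` is not shown here, so everything below is an assumption apart from the `INTEGRATION=1` switch and the three default ports: the helper `port_open`, the loopback host, the short timeout and the rule that all ports must answer are illustrative choices.

```python
import os
import socket

# Default ports named in the text above; the host is an assumption.
DEFAULT_SERVICES = {"postgresql": 5432, "minio": 9000, "redis": 6379}


def port_open(host: str, port: int, timeout: float = 0.2) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def live_services_available(host: str = "127.0.0.1", services=DEFAULT_SERVICES) -> bool:
    """True when INTEGRATION=1 is set or every service port accepts connections."""
    if os.environ.get("INTEGRATION") == "1":
        return True
    return all(port_open(host, port) for port in services.values())
```

A TCP probe only shows that something is listening on the port, not that it is the expected service, which is one reason an explicit `INTEGRATION=1` switch is useful.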

## Mocking

**Backend mocking:**
- `unittest.mock.patch` for external service calls: `patch("services.auth.check_hibp", return_value=True)`
- `AsyncMock` for async methods: `monkeypatch.setattr(MinIOBackend, "stat_object", mock, raising=False)`
- `FakeRedis` class defined inline in test files that need it (test_auth_api.py, test_security_headers.py, test_totp_replay.py) — in-memory dict with TTL support, mirrors Redis get/set/incr/expire interface
- Celery tasks mocked with `MagicMock`: `monkeypatch.setattr("api.documents.extract_and_classify.delay", MagicMock())`
- `app.dependency_overrides[get_db] = lambda: db_session` for DB substitution
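
An in-memory stand-in of the kind described for `FakeRedis` could look like the sketch below. The project's own class is not shown, so this is an assumption-laden illustration: it is synchronous, stores Python values rather than bytes, and takes an injectable `clock` (a hypothetical parameter) so that TTL expiry can be tested without sleeping.

```python
import time


class FakeRedis:
    """In-memory stand-in for the get/set/incr/expire subset of a Redis client."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so tests can fake the passage of time
        self._data = {}            # key -> value
        self._expires_at = {}      # key -> absolute deadline on self._clock

    def _purge(self, key):
        """Drop the key if its TTL has elapsed."""
        deadline = self._expires_at.get(key)
        if deadline is not None and self._clock() >= deadline:
            self._data.pop(key, None)
            self._expires_at.pop(key, None)

    def get(self, key):
        self._purge(key)
        return self._data.get(key)

    def set(self, key, value, ex=None):
        self._data[key] = value
        self._expires_at.pop(key, None)   # SET clears any previous TTL
        if ex is not None:
            self._expires_at[key] = self._clock() + ex
        return True

    def incr(self, key):
        self._purge(key)
        value = int(self._data.get(key, 0)) + 1
        self._data[key] = value
        return value

    def expire(self, key, seconds):
        self._purge(key)
        if key not in self._data:
            return False
        self._expires_at[key] = self._clock() + seconds
        return True
```

Expiry is evaluated lazily on access, which is enough for rate-limit and replay tests that only read keys back.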

**Frontend mocking:**
- `vi.mock('../../api/client.js', () => ({ login: vi.fn(), ... }))` — mock entire API module
- Individual function mocks: `const mockListFolders = vi.fn()` then `vi.mock(...)` referencing the mock
- Store mocks for component tests: `vi.mock('../../stores/auth.js', () => ({ useAuthStore: () => ({ user: {...} }) }))`
- Heavy child component stubs: `vi.mock('../../components/X.vue', () => ({ default: { template: '<div/>' } }))`
- Browser storage stubs: `Object.defineProperty(globalThis, 'localStorage', { value: fakeLocalStorage })`

## Frontend Test Structure

**Store tests (primary coverage):**
```javascript
import { describe, it, expect, vi, beforeEach } from 'vitest'
import { setActivePinia, createPinia } from 'pinia'

beforeEach(() => {
  setActivePinia(createPinia()) // fresh Pinia before each test
  vi.clearAllMocks()
})

describe('useAuthStore — behavior group', () => {
  it('describes exactly one assertion', async () => {
    api.login.mockResolvedValue({ access_token: 'tok', user: {...} })
    const store = useAuthStore()
    await store.login('u@x.com', 'pass')
    expect(store.accessToken).toBe('tok')
  })
})
```

**Component tests (mount-based):**
```javascript
import { mount, flushPromises } from '@vue/test-utils'
// ...
const wrapper = mount(ComponentName, {
  props: { item: makeItem() },
  global: { plugins: [router] }
})
await flushPromises()
expect(wrapper.find('button').exists()).toBe(false)
```

## Coverage by Area

### Backend Coverage (329 test functions across 26 test files)

| Area | Test file(s) | Coverage |
|------|-------------|----------|
| Auth API (register, login, TOTP, backup codes, refresh, logout, change-password) | `test_auth_api.py` (498 lines) | High |
| Auth service unit tests (JWT, password, TOTP, backup codes) | `test_task2_auth_service.py` | High |
| Auth dependencies (get_current_user, get_current_admin) | `test_auth_deps.py` | High |
| TOTP replay prevention (AUTH-08) | `test_totp_replay.py` (239 lines) | High |
| Per-account rate limiting (SEC-02) | `test_auth_api.py` | High |
| Documents API (list, filter, confirm, delete, PATCH, content) | `test_documents.py` (925 lines) | High |
| Quota enforcement (atomic increment, concurrent race, delete decrement) | `test_quota.py` (239 lines) | Medium — concurrent race xfail on SQLite |
| Folder API (CRUD, breadcrumb, IDOR) | `test_folders.py` (494 lines) | High |
| Sharing API (SHARE-01 through SHARE-05) | `test_shares.py` (454 lines) | High |
| Admin API (users, quotas, AI config, ADMIN-07 no-impersonation) | `test_admin_api.py` (431 lines) | High |
| Audit log (SHARE events, AUTH events, CSV export) | `test_audit.py` (355 lines) | High |
| Security headers (CSP, X-Frame-Options, nosniff) | `test_security_headers.py` | High |
| Security invariants (credentials_enc not exposed, IDOR) | `test_security.py` | High |
| Constant-time comparisons (SEC-03, hmac.compare_digest) | `test_constant_time_auth.py` | High |
| Cloud storage (CLOUD-01 through CLOUD-07, SSRF, IDOR) | `test_cloud.py` (855 lines) | High |
| Cloud backends (Google Drive, OneDrive, WebDAV, Nextcloud) | `test_cloud_backends.py`, `test_webdav_backend.py` | Medium |
| Cloud credential encryption/decryption | `test_cloud_utils.py` (273 lines) | High |
| AI classifier JSON parsing | `test_classifier.py` (266 lines) | High |
| Text extraction | `test_extractor.py` | High |
| MinIO object key schema | `test_storage.py` (277 lines) | Medium |
| Settings API | `test_settings.py` | Medium |
| Topics API | `test_topics.py` (204 lines) | High |
| Health endpoint | `test_health.py` | Low (smoke test) |
| Alembic migrations | `test_alembic.py` (246 lines) | Medium |
| LM Studio provider | `test_lmstudio.py` | Conditional — `@pytest.mark.skipif` unless reachable |

### Frontend Coverage (14 test files, ~163 test cases)

| Area | Test file | Coverage |
|------|-----------|----------|
| Auth store (login, logout, TOTP, no-browser-storage invariant) | `stores/__tests__/auth.test.js` | High |
| Folders store (fetchFolders, createFolder, rename, delete) | `stores/__tests__/folders.test.js` | High |
| Cloud connections store | `stores/__tests__/cloudConnections.test.js` | Medium |
| Router guards (meta.public, meta.layout, redirect on unauthenticated) | `router/__tests__/router.guard.test.js` | High |
| FileManagerView (folder navigation, search, sort, move, delete) | `views/__tests__/FileManagerView.test.js` | Medium |
| FolderTreeItem (expand arrow, active state) | `components/folders/__tests__/FolderTreeItem.test.js` | Medium |
| FolderBreadcrumb | `components/folders/__tests__/FolderBreadcrumb.test.js` | Medium |
| TotpEnrollment component | `components/auth/__tests__/TotpEnrollment.test.js` | Medium |
| PasswordStrengthBar component | `components/auth/__tests__/PasswordStrengthBar.test.js` | Medium |
| AdminUsersTab component | `components/admin/__tests__/AdminUsersTab.test.js` | Medium |
| AdminQuotasTab component | `components/admin/__tests__/AdminQuotasTab.test.js` | Medium |
| AdminAiConfigTab component | `components/admin/__tests__/AdminAiConfigTab.test.js` | Medium |
| SettingsAccountTab component | `components/settings/__tests__/SettingsAccountTab.test.js` | Medium |
| SettingsCloudTab component | `components/settings/__tests__/SettingsCloudTab.test.js` | Medium |

## Test Gaps

**Backend gaps:**
- `test_storage.py` — MinIO object key tests are largely `xfail(strict=False)` waiting for module implementation
- Concurrent quota race (`test_concurrent_quota_race`) is `xfail(strict=False)` — requires PostgreSQL row-level locking
- Delete quota decrement (`test_delete_decrements_quota`) is `xfail(strict=False)` on SQLite
- No `pytest-cov` — no coverage measurement enforced
- No CI configuration (no GitHub Actions yaml)
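
For context on the quota items above: the project is said to rely on PostgreSQL row-level locking, and its quota code is not shown. The sketch below illustrates a different but common technique, a single conditional `UPDATE` that increments and checks the limit in one statement. The table and column names (`quotas`, `used_bytes`, `limit_bytes`) and both function names are hypothetical, and `sqlite3` is used only to keep the example self-contained.

```python
import sqlite3


def try_reserve(conn: sqlite3.Connection, user_id: str, size: int) -> bool:
    """Atomically add `size` to the user's usage if it stays within the limit."""
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (size, user_id, size),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows updated means the quota would be exceeded


def release(conn: sqlite3.Connection, user_id: str, size: int) -> None:
    """Give bytes back on delete, never dropping below zero."""
    conn.execute(
        "UPDATE quotas SET used_bytes = MAX(used_bytes - ?, 0) WHERE user_id = ?",
        (size, user_id),
    )
    conn.commit()


# Hypothetical schema and seed row for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES ('u1', 0, 100)")
```

Because the check and the increment happen in one statement, two racing uploads cannot both pass a separate "read, compare, write" sequence.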

**Frontend gaps:**
- `src/components/documents/` — `DocumentCard.vue`, `DocumentPreviewModal.vue`, `SearchBar.vue`, `SortControls.vue` have **no tests**
- `src/components/cloud/` — `CloudFolderTreeItem.vue`, `CloudProviderTreeItem.vue`, `CloudCredentialModal.vue` have **no tests**
- `src/components/sharing/` — `ShareModal.vue` has **no tests**
- `src/components/upload/` — `DropZone.vue`, `UploadProgress.vue` have **no tests**
- `src/components/layout/` — `AppSidebar.vue`, `QuotaBar.vue` have **no tests**
- `src/stores/documents.js` — documents store has **no tests**
- No E2E tests (no Playwright or Cypress)

## Security-Specific Tests

These test files exist specifically to enforce security invariants:

- `test_constant_time_auth.py` — asserts `hmac.compare_digest` used (source inspection + behavioral)
- `test_security.py` — asserts `credentials_enc` never appears in API responses (SEC-08); asserts admin DELETE calls `storage.delete_object` (SEC-09)
- `test_security_headers.py` — asserts CSP, X-Frame-Options, X-Content-Type-Options on every response (SEC-05)
- `test_totp_replay.py` — asserts same TOTP code rejected on second use (AUTH-08)
- `test_auth_api.py` — includes `test_origin_rejected` (CSRF), `test_per_account_rate_limit` (SEC-02)
- `test_auth_deps.py` — includes wrong-owner 403, deactivated user 401, admin-blocked 403
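
Two of the invariants above, constant-time comparison and TOTP replay rejection, can be illustrated together. This is a standard-library sketch and not the project's implementation: `hotp` follows RFC 4226, while `TotpVerifier`, its in-process `set` and the lack of a clock-skew window are assumptions made for brevity. Since `FakeRedis` appears in `test_totp_replay.py`, the real replay store is presumably Redis-backed.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP per RFC 4226: HMAC-SHA1 over the 8-byte counter, dynamically truncated."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


class TotpVerifier:
    """Accepts each (user, time step) at most once. Hypothetical helper for illustration."""

    def __init__(self, step: int = 30):
        self.step = step
        self._used = set()   # assumption: production would use a shared store with a TTL

    def verify(self, user_id: str, secret: bytes, code: str, now: float) -> bool:
        counter = int(now) // self.step
        expected = hotp(secret, counter)
        # Constant-time comparison avoids leaking how many digits matched.
        if not hmac.compare_digest(expected.encode(), code.encode()):
            return False
        if (user_id, counter) in self._used:
            return False     # this time step was already consumed: replay
        self._used.add((user_id, counter))
        return True
```

A replay test then reduces to calling `verify` twice with the same code and timestamp and asserting that only the first call succeeds.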

## Common Patterns

**Async testing:**
```python
# Option 1 — per-test decorator
@pytest.mark.asyncio
async def test_something(async_client, auth_user):
    resp = await async_client.get("/api/documents", headers=auth_user["headers"])
    assert resp.status_code == 200

# Option 2 — module-level mark
pytestmark = pytest.mark.asyncio
async def test_something(async_client, auth_user):
    ...
```

**Security negative tests (wrong owner → 403/404):**
```python
async def test_cannot_access_other_users_document(async_client, auth_user, second_auth_user, db_session):
    doc_id = await _make_doc(db_session, auth_user)
    resp = await async_client.get(f"/api/documents/{doc_id}", headers=second_auth_user["headers"])
    assert resp.status_code in (403, 404)
```

**Patching external calls:**
```python
with patch("services.auth.check_hibp", return_value=True) as mock_hibp:
    resp = await authed_client.post("/api/auth/change-password", ...)
    assert resp.status_code == 422
```

**Frontend security invariant testing:**
```javascript
it('login() never writes accessToken to localStorage', async () => {
  api.login.mockResolvedValue({ access_token: 'tok', user: {...} })
  const store = useAuthStore()
  await store.login('alice@example.com', 'password')
  expect(fakeLocalStorage.setItem).not.toHaveBeenCalled()
})
```

---

## Gaps / Unknowns

- No test coverage measurement (no `pytest-cov` in `requirements.txt`)
- `test_lmstudio.py` content not inspected — unclear if it hits a real local endpoint
- No CI configuration (no GitHub Actions, no Dockerfile for test runner)
- No snapshot or contract tests for API response shapes
- Frontend is completely untested
*Testing analysis: 2026-06-02*
