docs(codebase): refresh codebase map after Phase 06.2 completion
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
+273
-105
@@ -1,116 +1,284 @@
|
||||
# ARCHITECTURE — document-scanner
|
||||
<!-- refreshed: 2026-06-02 -->
|
||||
# Architecture
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.
|
||||
|
||||
---
|
||||
**Analysis Date:** 2026-06-02
|
||||
|
||||
## System Overview
|
||||
|
||||
```
|
||||
Browser (Vue 3 SPA)
|
||||
│ HTTP/JSON + multipart
|
||||
▼
|
||||
FastAPI (port 8000)
|
||||
├── api/documents.py – upload, list, get, delete, reclassify
|
||||
├── api/topics.py – CRUD for topic list
|
||||
├── api/settings.py – AI provider config + system prompt
|
||||
│
|
||||
├── services/
|
||||
│ ├── extractor.py – text extraction dispatch
|
||||
│ ├── classifier.py – orchestrates AI call + topic creation
|
||||
│ └── storage.py – flat-file JSON + filesystem persistence
|
||||
│
|
||||
└── ai/ – provider abstraction layer
|
||||
├── base.py – AIProvider ABC + ClassificationResult
|
||||
├── __init__.py – get_provider() factory
|
||||
├── anthropic_provider.py
|
||||
├── openai_provider.py
|
||||
├── ollama_provider.py (subclasses OpenAIProvider)
|
||||
└── lmstudio_provider.py (subclasses OpenAIProvider)
|
||||
│
|
||||
▼
|
||||
External AI service (Anthropic API / OpenAI API /
|
||||
Ollama / LM Studio — host.docker.internal)
|
||||
```text
┌──────────────────────────────────────────────────────────────────────────┐
│ Browser (Vue 3 SPA) │
│ Pinia stores: auth · documents · folders · topics · cloudConnections │
│ Router: / /folders/:id /document/:id /cloud /admin /shared │
└─────────────────────┬──────────────────────────────────┬────────────────┘
│ fetch() + Bearer JWT │ PUT (presigned)
▼ ▼
┌──────────────────────────────────┐ ┌───────────────────────────────┐
│ FastAPI Backend :8000 │ │ MinIO :9000 │
│ api/auth api/documents │ │ Bucket: docuvault │
│ api/folders api/shares │ │ Keys: {uid}/{did}/{uuid}{e} │
│ api/cloud api/admin │ └───────────────────────────────┘
│ api/audit api/topics │
│ │ ┌───────────────────────────────┐
│ Middleware stack (per request):│ │ Cloud Backends │
│ OriginValidation (first) │ │ Google Drive / OneDrive │
│ CORS │ │ Nextcloud / WebDAV │
│ SecurityHeaders (CSP, etc.) │ └───────────────────────────────┘
│ SlowAPI rate limiter │
│ │ ┌───────────────────────────────┐
│ Deps layer: │ │ Celery Worker │
│ get_db (AsyncSession) │◄────► tasks/document_tasks.py │
│ get_current_user (JWT) │ │ tasks/email_tasks.py │
│ get_current_admin │ │ tasks/audit_tasks.py │
│ get_regular_user │ └───────────────────────────────┘
└────────────┬─────────────────────┘
│ SQLAlchemy async ┌───────────────────────────────┐
▼ │ Redis :6379 │
┌──────────────────────────┐ │ Rate limiting (slowapi) │
│ PostgreSQL :5432 │ │ TOTP replay cache │
│ 11 tables: │◄──────────► Celery broker + results │
│ users · quotas │ │ OAuth state tokens (TTL) │
│ refresh_tokens │ └───────────────────────────────┘
│ backup_codes · folders │
│ documents · topics │ ┌───────────────────────────────┐
│ document_topics │ │ AI Providers (pluggable) │
│ shares · audit_log │ │ Ollama · OpenAI · Anthropic │
│ cloud_connections │ │ LMStudio │
│ groups (v2 stub) │ │ ai/base.py → AIProvider ABC │
└──────────────────────────┘ └───────────────────────────────┘
```
|
||||
|
||||
---
|
||||
## Component Responsibilities
|
||||
|
||||
## Request Flow — Document Upload + Classification
|
||||
| Component | Responsibility | Key File |
|-----------|----------------|----------|
| FastAPI app | ASGI entry point, middleware, router registration | `backend/main.py` |
| Auth API | Register, login (TOTP/backup), refresh, logout, password reset | `backend/api/auth.py` |
| Documents API | Upload URL, confirm, list, delete, classify, stream content | `backend/api/documents.py` |
| Folders API | CRUD folders, move documents between folders | `backend/api/folders.py` |
| Shares API | Grant/revoke/list document shares between users | `backend/api/shares.py` |
| Cloud API | OAuth flows, WebDAV connect, folder listing, default storage | `backend/api/cloud.py` |
| Admin API | User CRUD, quota, AI config, audit log, delete user | `backend/api/admin.py` |
| Audit API | Paginated audit log viewer + CSV export | `backend/api/audit.py` |
| Topics API | CRUD topics, topic suggestions | `backend/api/topics.py` |
| Auth service | Password hashing, JWT, refresh token family, TOTP, HIBP | `backend/services/auth.py` |
| Audit service | `write_audit_log()`, flushed within the caller's transaction | `backend/services/audit.py` |
| Classifier service | Selects AI provider, assigns topics, auto-creates suggestions | `backend/services/classifier.py` |
| Extractor service | PDF/DOCX/image/text extraction | `backend/services/extractor.py` |
| Storage service | ORM queries for documents + topic resolution | `backend/services/storage.py` |
| StorageBackend ABC | Interface for all object storage backends | `backend/storage/base.py` |
| Storage factory | Returns MinIOBackend or cloud backend from document record | `backend/storage/__init__.py` |
| MinIO backend | Presigned URL, put/get/delete, stat | `backend/storage/minio_backend.py` |
| Cloud backends | Google Drive, OneDrive, Nextcloud, WebDAV implementations | `backend/storage/*_backend.py` |
| AIProvider ABC | Interface: classify, suggest_topics, health_check | `backend/ai/base.py` |
| AI factory | Returns provider instance from string slug | `backend/ai/__init__.py` |
| Celery app | Task routing, beat schedule, JSON serialization | `backend/celery_app.py` |
| Document task | `extract_and_classify`: async bridge from sync Celery worker | `backend/tasks/document_tasks.py` |
| ORM models | 11-table schema, all UUID PKs, full index set | `backend/db/models.py` |
| DB session | Async engine, session factory (expire_on_commit=False) | `backend/db/session.py` |
| FastAPI deps | get_db, get_current_user, get_current_admin, get_regular_user | `backend/deps/` |
| Auth store | accessToken (memory only), user, quota, refresh deduplication | `frontend/src/stores/auth.js` |
| Documents store | CRUD, 3-step MinIO upload with progress, search debounce | `frontend/src/stores/documents.js` |
| Folders store | CRUD folders, breadcrumb, rootFolders for sidebar | `frontend/src/stores/folders.js` |
| Topics store | CRUD topics | `frontend/src/stores/topics.js` |
| CloudConnections store | List/disconnect cloud connections | `frontend/src/stores/cloudConnections.js` |
| API client | fetch wrapper, Bearer injection, 401→refresh→retry | `frontend/src/api/client.js` |
| Vue Router | SPA routes, beforeEach guard (silent refresh on reload) | `frontend/src/router/index.js` |
| FileManagerView | Unified file manager for local folders and documents | `frontend/src/views/FileManagerView.vue` |
| StorageBrowser | Reusable file listing component (local + cloud modes) | `frontend/src/components/storage/StorageBrowser.vue` |
|
||||
|
||||
1. Frontend POSTs `multipart/form-data` to `POST /api/documents/upload`
|
||||
2. `documents.py` saves the file to `data/uploads/`, calls `extractor.extract_text()`
|
||||
3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/<id>.json`
|
||||
4. If `auto_classify=true`, `classifier.classify_document()` is called:
|
||||
a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
|
||||
b. Passes document text + existing topics to `provider.classify()`
|
||||
c. Any suggested new topics are created via `storage.add_topic()`
|
||||
d. Document metadata is updated with assigned topics
|
||||
5. Full document metadata JSON is returned to the frontend
|
||||
## Pattern Overview
|
||||
|
||||
**Overall:** Layered REST API + SPA with async background processing
|
||||
|
||||
**Key Characteristics:**
|
||||
- API layer is thin — validation via Pydantic, business logic in `services/`
|
||||
- No ORM relationships loaded — explicit queries only (prevents N+1)
|
||||
- Async everywhere in FastAPI; Celery workers bridge to async via `asyncio.run()`
|
||||
- Frontend Pinia stores own data-fetching; views delegate to stores; components emit events upward
|
||||
- One DB session per request (yielded by `get_db` dep), one per Celery task invocation
|
||||
- All resource ownership checked inline in handlers (`resource.user_id == current_user.id`)
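
The two async bridges named in the list above (`asyncio.to_thread()` for blocking calls inside the event loop, `asyncio.run()` inside a sync worker) can be sketched as follows. This is an illustration only: `blocking_stat`, `confirm_upload`, and `extract_and_classify_task` are placeholder names invented for the example, and the blocking SDK call is simulated with `time.sleep`.

```python
import asyncio
import time


def blocking_stat(object_key: str) -> int:
    """Stand-in for a blocking SDK call (hypothetical, not the MinIO client)."""
    time.sleep(0.01)
    return len(object_key)


async def confirm_upload(object_key: str) -> int:
    # Inside the event loop: push the blocking call onto a worker thread
    # so other requests keep being served while it runs.
    return await asyncio.to_thread(blocking_stat, object_key)


def extract_and_classify_task(object_key: str) -> int:
    # Inside a sync worker process: start a fresh event loop for this one
    # invocation. Nothing is shared with the web server's loop.
    return asyncio.run(confirm_upload(object_key))
```

Because each task invocation creates and tears down its own loop, anything bound to a loop (sessions, clients) would have to be created inside the coroutine rather than at import time.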
|
||||
|
||||
## Layers
|
||||
|
||||
**API Layer:**
|
||||
- Purpose: HTTP routing, request validation, response serialization
|
||||
- Location: `backend/api/`
|
||||
- Contains: APIRouter instances, Pydantic request/response models, FastAPI dep injection
|
||||
- Depends on: `services/`, `deps/`, `db/models.py`
|
||||
- Used by: Frontend via HTTP; not called from other backend modules
|
||||
|
||||
**Service Layer:**
|
||||
- Purpose: Business logic with no FastAPI coupling (pure Python async functions)
|
||||
- Location: `backend/services/`
|
||||
- Contains: `auth.py`, `audit.py`, `classifier.py`, `extractor.py`, `storage.py`, `cloud_cache.py`, `email.py`
|
||||
- Depends on: `db/models.py`, `storage/`, `ai/`, `config`
|
||||
- Used by: `api/` layer and Celery tasks
|
||||
|
||||
**Storage Abstraction Layer:**
|
||||
- Purpose: Backend-agnostic object storage interface
|
||||
- Location: `backend/storage/`
|
||||
- Contains: `base.py` (ABC), `minio_backend.py`, `google_drive_backend.py`, `onedrive_backend.py`, `nextcloud_backend.py`, `webdav_backend.py`, `cloud_utils.py` (HKDF encryption), `exceptions.py`
|
||||
- Depends on: `config`, `db/models.py` (for cloud credential lookup)
|
||||
- Used by: `services/storage.py`, `api/documents.py`, Celery tasks
|
||||
|
||||
**AI Abstraction Layer:**
|
||||
- Purpose: Pluggable AI provider interface for document classification
|
||||
- Location: `backend/ai/`
|
||||
- Contains: `base.py` (ABC), `ollama_provider.py`, `openai_provider.py`, `anthropic_provider.py`, `lmstudio_provider.py`, `utils.py`
|
||||
- Depends on: External AI APIs via httpx
|
||||
- Used by: `services/classifier.py`
|
||||
|
||||
**Dependency Layer:**
|
||||
- Purpose: FastAPI reusable dependencies (DI)
|
||||
- Location: `backend/deps/`
|
||||
- Contains: `db.py` (get_db), `auth.py` (get_current_user, get_current_admin, get_regular_user), `utils.py` (get_client_ip)
|
||||
- Used by: All `api/` handlers
|
||||
|
||||
**Frontend Store Layer:**
|
||||
- Purpose: Application state + async API calls
|
||||
- Location: `frontend/src/stores/`
|
||||
- Contains: `auth.js`, `documents.js`, `folders.js`, `topics.js`, `cloudConnections.js`
|
||||
- Depends on: `api/client.js`
|
||||
- Used by: Views and components
|
||||
|
||||
## Data Flow
|
||||
|
||||
### Document Upload (MinIO presigned URL path)
|
||||
|
||||
1. User drops file in `DropZone` → `StorageBrowser` emits `upload` → `FileManagerView.onFilesSelected` (`frontend/src/views/FileManagerView.vue`)
|
||||
2. `documentsStore.upload(file, autoClassify, folderId)` (`frontend/src/stores/documents.js`)
|
||||
3. `POST /api/documents/upload-url` → creates pending `Document` row, returns presigned PUT URL + `document_id` (`backend/api/documents.py`)
|
||||
4. XHR `PUT` bytes directly from browser to MinIO presigned URL (no backend proxy, no auth header needed — URL is self-authenticating)
|
||||
5. `POST /api/documents/{id}/confirm` → `stat_object()` for authoritative size → atomic quota `UPDATE … RETURNING` → status set to `'ready'` (`backend/api/documents.py`)
|
||||
6. If `folderId != null`: `PATCH /api/documents/{id}/folder` → places document in folder
|
||||
7. Celery task `extract_and_classify.delay(document_id)` enqueued → text extraction → AI classification → topic assignment (`backend/tasks/document_tasks.py`)
|
||||
8. `authStore.fetchQuota()` called on frontend to refresh sidebar quota bar
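
As a rough illustration of the object key that step 3 would sign, the sketch below builds a key in the `{user_id}/{document_id}/{uuid4()}{ext}` shape this document gives for MinIO keys. `build_object_key` is a hypothetical helper, and lowercasing the extension is an assumption of the example rather than documented behavior.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: uuid.UUID, document_id: uuid.UUID, filename: str) -> str:
    # Only the extension is carried over; the original filename never
    # appears in the storage key.
    ext = PurePosixPath(filename).suffix.lower()  # lowercasing is an assumption
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"
```

A call such as `build_object_key(user.id, doc.id, "Tax Return 2025.PDF")` yields three slash-separated UUID segments ending in `.pdf`, so listing the bucket reveals nothing about document names.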
|
||||
|
||||
### Authentication Flow
|
||||
|
||||
1. `POST /api/auth/login` with `{email, password}` — per-account Redis rate limit checked first (`backend/api/auth.py`)
|
||||
2. Password verified with Argon2 (constant-time via pwdlib)
|
||||
3. If TOTP enabled and no code provided → returns `{requires_totp: true}` challenge
|
||||
4. If TOTP code provided → verified against pyotp + Redis replay prevention window
|
||||
5. On success: `create_access_token()` (HS256 JWT, 15-min TTL) + `create_refresh_token()` (SHA-256 hashed, stored in DB) (`backend/services/auth.py`)
|
||||
6. Access token returned in JSON body; refresh token set as `httpOnly; Secure; SameSite=Strict` cookie scoped to `/api/auth/refresh` path only
|
||||
7. Frontend stores access token in `authStore.accessToken` (Pinia `ref()` — memory only, never localStorage)
|
||||
8. On page reload: router `beforeEach` guard calls `authStore.refresh()` → `POST /api/auth/refresh` sends httpOnly cookie → new access token returned
|
||||
9. `api/client.js` intercepts any 401 → calls `authStore.refresh()` → retries request once (`frontend/src/api/client.js`)
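
Step 9 describes a retry-once policy. The real client is JavaScript; the Python sketch below only models the control flow, with `Unauthorized`, `send`, and `refresh` as hypothetical stand-ins for a 401 response, the HTTP call, and `authStore.refresh()`.

```python
from typing import Callable


class Unauthorized(Exception):
    """Hypothetical stand-in for an HTTP 401 response."""


def request_with_refresh(
    send: Callable[[str], str],
    refresh: Callable[[], str],
    token: str,
) -> str:
    try:
        return send(token)
    except Unauthorized:
        # One silent refresh, then exactly one retry. A second 401 is not
        # caught here and propagates to the caller.
        return send(refresh())
```

Limiting the retry to a single attempt keeps a permanently invalid session from looping between the request and the refresh endpoint.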
|
||||
|
||||
### Refresh Token Rotation + Family Revocation
|
||||
|
||||
1. `POST /api/auth/refresh` reads httpOnly cookie, looks up `RefreshToken` row by SHA-256 hash
|
||||
2. If token already revoked → all user's refresh tokens revoked → 401 + security alert email enqueued via Celery
|
||||
3. If valid: old token marked `revoked=True`, new raw token generated and stored (hashed), rotated cookie set
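
The rotation and reuse-detection rules above can be sketched with an in-memory dictionary standing in for the `RefreshToken` table. All names below (`issue`, `rotate`, `ReuseDetected`, the `store` dict) are illustrative assumptions; the email alert and cookie handling are left out.

```python
import hashlib
import secrets
from dataclasses import dataclass


@dataclass
class RefreshToken:
    user_id: str
    token_hash: str
    revoked: bool = False


class ReuseDetected(Exception):
    """Raised when an already-revoked token is presented again."""


def _hash(raw: str) -> str:
    return hashlib.sha256(raw.encode()).hexdigest()


def issue(store: dict, user_id: str) -> str:
    raw = secrets.token_urlsafe(32)
    # Only the hash is stored; the raw value goes to the client.
    store[_hash(raw)] = RefreshToken(user_id, _hash(raw))
    return raw


def rotate(store: dict, raw: str) -> str:
    row = store.get(_hash(raw))
    if row is None:
        raise KeyError("unknown refresh token")
    if row.revoked:
        # A rotated token came back: revoke every token this user holds.
        for other in store.values():
            if other.user_id == row.user_id:
                other.revoked = True
        raise ReuseDetected(row.user_id)
    row.revoked = True
    return issue(store, row.user_id)
```

Presenting an old token after it has been rotated therefore invalidates the attacker's copy and the legitimate user's current token alike, forcing a fresh login.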
|
||||
|
||||
### Cloud Storage OAuth Flow
|
||||
|
||||
1. `GET /api/cloud/oauth/initiate/{provider}` → state token stored in Redis (TTL 1800s, single-use) → authorization URL returned
|
||||
2. Browser navigates to OAuth provider → callback to `GET /api/cloud/oauth/callback/{provider}`
|
||||
3. State token validated (single-use consumed from Redis), authorization code exchanged for credentials
|
||||
4. Credentials encrypted with HKDF-derived per-user Fernet key → stored in `cloud_connections.credentials_enc`
|
||||
5. On document operations: `get_storage_backend_for_document()` decrypts credentials, instantiates cloud backend — transparent to API handlers (`backend/storage/__init__.py`)
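
Steps 1 and 3 rely on a state token that expires and can be consumed only once. A minimal sketch of that behavior follows, using an in-memory dictionary in place of Redis. `StateStore` and the idea of binding the state to a user id are assumptions made for the example; only the 1800-second TTL comes from the flow above.

```python
import secrets
import time

STATE_TTL_SECONDS = 1800  # TTL quoted in step 1


class StateStore:
    """In-memory stand-in for the Redis keys, not the project's implementation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def create(self, user_id: str) -> str:
        state = secrets.token_urlsafe(32)
        self._entries[state] = (user_id, self._clock() + STATE_TTL_SECONDS)
        return state

    def consume(self, state: str):
        # pop() makes the token single-use: a replayed callback finds nothing.
        entry = self._entries.pop(state, None)
        if entry is None:
            return None
        user_id, expires_at = entry
        return user_id if self._clock() < expires_at else None
```

The injectable `clock` exists only so the expiry path can be exercised without waiting; a Redis-backed version would get the same effect from a key TTL plus an atomic get-and-delete.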
|
||||
|
||||
**State Management (frontend):**
|
||||
- Access token: `authStore.accessToken` — Pinia `ref(null)`, JS memory only, cleared on logout/error
|
||||
- User profile: `authStore.user` — Pinia `ref(null)`
|
||||
- Quota: `authStore.quota` — fetched after upload/delete, displayed in `QuotaBar`
|
||||
- Documents: `documentsStore.documents` — local array, kept in sync via explicit `fetchDocuments()` calls
|
||||
- Folder tree: `foldersStore.rootFolders` (sidebar) + `foldersStore.folders` (current level)
|
||||
- Upload progress: `documentsStore.uploadProgress` — keyed `${filename}__${Date.now()}` to prevent key collision
|
||||
|
||||
## Key Abstractions
|
||||
|
||||
**StorageBackend ABC (`backend/storage/base.py`):**
|
||||
- Purpose: Uniform interface over MinIO and all cloud providers
|
||||
- Methods: `put_object`, `get_object`, `delete_object`, `presigned_get_url`, `health_check`, `generate_presigned_put_url`, `stat_object`
|
||||
- Implementations: `MinIOBackend`, `GoogleDriveBackend`, `OneDriveBackend`, `NextcloudBackend`, `WebDAVBackend`
|
||||
- Selected by: `get_storage_backend_for_document()` in `backend/storage/__init__.py`
|
||||
|
||||
**AIProvider ABC (`backend/ai/base.py`):**
|
||||
- Purpose: Pluggable classification backend
|
||||
- Methods: `classify`, `suggest_topics`, `health_check`
|
||||
- Returns: `ClassificationResult(topics, suggested_new_topics, reasoning)`
|
||||
- Implementations: `OllamaProvider`, `OpenAIProvider`, `AnthropicProvider`, `LMStudioProvider`
|
||||
- Selected by: `ai/__init__.py` factory, keyed to per-user `ai_provider`/`ai_model` from DB
|
||||
|
||||
**Dependency Chain:**
|
||||
- `get_current_user` → parses Bearer JWT → loads `User` from DB, checks `is_active`
|
||||
- `get_current_admin` → wraps `get_current_user` + `role == 'admin'` check (raises 403)
|
||||
- `get_regular_user` → wraps `get_current_user` + rejects `role == 'admin'` (admins get 403 on document endpoints)
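
The wrapping relationship between the three dependencies might look like the sketch below. It deliberately avoids FastAPI: `HTTPError` is a hypothetical stand-in for `HTTPException`, JWT parsing and the DB lookup are elided, and returning 401 for a missing or inactive user is an assumption (the source only states the 403 cases).

```python
from dataclasses import dataclass


class HTTPError(Exception):
    """Hypothetical stand-in for fastapi.HTTPException."""

    def __init__(self, status_code: int):
        super().__init__(status_code)
        self.status_code = status_code


@dataclass
class User:
    id: str
    role: str
    is_active: bool = True


def get_current_user(user) -> User:
    # `user` is the row loaded from the token's subject, or None.
    if user is None or not user.is_active:
        raise HTTPError(401)  # status code assumed for this sketch
    return user


def get_current_admin(user) -> User:
    current = get_current_user(user)
    if current.role != "admin":
        raise HTTPError(403)
    return current


def get_regular_user(user) -> User:
    current = get_current_user(user)
    if current.role == "admin":
        raise HTTPError(403)  # admins are kept away from document endpoints
    return current
```

Because both role checks call `get_current_user` first, an endpoint that declares either one inherits the token and `is_active` checks without repeating them.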
|
||||
|
||||
## Entry Points
|
||||
|
||||
**Backend:**
|
||||
- Location: `backend/main.py`
|
||||
- Triggers: `uvicorn main:app`
|
||||
- Responsibilities: FastAPI app factory, lifespan (MinIO bucket init, Redis connection, admin bootstrap), middleware registration in correct order, router inclusion
|
||||
|
||||
**Celery Worker:**
|
||||
- Location: `backend/celery_app.py` (factory) + `backend/tasks/`
|
||||
- Triggers: `celery -A celery_app worker -Q documents`
|
||||
- Responsibilities: Async document text extraction + classification, email delivery, scheduled nightly audit CSV export
|
||||
|
||||
**Frontend:**
|
||||
- Location: `frontend/src/main.js`
|
||||
- Triggers: Vite dev server (`npm run dev`) or built static files served by frontend container
|
||||
- Responsibilities: Mount Vue app with Pinia and Router
|
||||
|
||||
## Architectural Constraints
|
||||
|
||||
- **Threading:** FastAPI runs on a single-threaded asyncio event loop (uvicorn). Blocking MinIO SDK calls use `asyncio.to_thread()`. Celery workers are separate sync processes that bridge to async via `asyncio.run()` — they never share an event loop with FastAPI.
|
||||
- **Global state:** `backend/services/storage.py` holds a module-level `_storage` singleton for the default MinIO backend. `backend/main.py` stores MinIO client on `app.state.minio` and Redis client on `app.state.redis`.
|
||||
- **Circular imports:** Celery task modules must never import from `main.py` or router modules. `backend/celery_app.py` intentionally avoids importing `config` — reads `REDIS_URL` directly from `os.environ` to avoid pydantic-settings side effects.
|
||||
- **Admin isolation:** Admin accounts cannot access document content — enforced by `get_regular_user` dep on all document/folder/share endpoints. No impersonation code path exists (`backend/deps/auth.py`).
|
||||
- **Quota atomicity:** Quota enforcement uses a single atomic `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes` — no read-then-write in Python.
|
||||
- **Object key privacy:** MinIO keys are `{user_id}/{document_id}/{uuid4()}{ext}` — original filenames stored only in the DB `filename` column, never in the storage key.
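
The quota rule above puts the limit check in the `WHERE` clause so that check and increment happen in one statement. The sketch below demonstrates the idea with the standard-library `sqlite3` module standing in for PostgreSQL; it uses the affected-row count instead of `RETURNING`, and the `user_id` column and the `reserve_bytes` helper are assumptions made for the example.

```python
import sqlite3


def reserve_bytes(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    # No SELECT beforehand: the row is updated only if the new total fits.
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means the quota would be exceeded


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES ('alice', 0, 100)")
conn.commit()
```

With a 100-byte limit, a first `reserve_bytes(conn, "alice", 60)` succeeds and a second identical call is refused, leaving `used_bytes` at 60. Two concurrent uploads cannot both pass a stale read, because there is no separate read.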
|
||||
|
||||
## Anti-Patterns
|
||||
|
||||
### Accessing document content via unauthenticated iframe src
|
||||
|
||||
**What happens:** Setting `<iframe src="/api/documents/{id}/content">` directly makes the browser request the content without an `Authorization` header, so the request reaches the endpoint unauthenticated and is rejected.
|
||||
**Why it's wrong:** The document content endpoint requires `Authorization: Bearer` header; browser `src=` attributes do not send custom headers.
|
||||
**Do this instead:** Use `fetchDocumentContent(docId)` in `frontend/src/api/client.js` — it injects Bearer + handles 401-refresh-retry, then builds an object URL from the Blob response.
|
||||
|
||||
### Committing inside `write_audit_log`
|
||||
|
||||
**What happens:** Calling `session.commit()` inside `write_audit_log` creates a separate transaction for the audit entry.
|
||||
**Why it's wrong:** The audit entry would commit even if the primary operation subsequently fails, creating phantom audit records.
|
||||
**Do this instead:** `write_audit_log` calls `session.flush()` only. The caller owns `session.commit()` — `backend/services/audit.py`.
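
A small demonstration of why the audit write should share the caller's transaction is given below. It uses the standard-library `sqlite3` module rather than the SQLAlchemy async session, so an uncommitted `execute` plays the role of `session.flush()`. The table layout and `delete_document` are invented for the example.

```python
import sqlite3


def write_audit_log(conn: sqlite3.Connection, action: str) -> None:
    # No commit here: the row joins whatever transaction the caller has open.
    conn.execute("INSERT INTO audit_log (action) VALUES (?)", (action,))


def delete_document(conn: sqlite3.Connection, doc_id: int, fail: bool = False) -> None:
    try:
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
        write_audit_log(conn, f"document.delete:{doc_id}")
        if fail:
            raise RuntimeError("primary operation failed after the audit write")
        conn.commit()  # the caller owns the commit
    except Exception:
        conn.rollback()  # the audit row is rolled back together with the delete
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE audit_log (action TEXT)")
conn.execute("INSERT INTO documents VALUES (1)")
conn.commit()
```

Calling `delete_document(conn, 1, fail=True)` leaves both tables unchanged, whereas a commit inside `write_audit_log` would have persisted the delete and the audit row before the failure was known.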
|
||||
|
||||
### CloudConnection query without user scope
|
||||
|
||||
**What happens:** Querying `CloudConnection` without filtering `user_id == current_user.id` would allow one user's cloud credentials to service another user's request.
|
||||
**Why it's wrong:** IDOR — cross-user credential access.
|
||||
**Do this instead:** Always filter `CloudConnection.user_id == user.id` as enforced in `get_storage_backend_for_document()` in `backend/storage/__init__.py`.
|
||||
|
||||
## Error Handling
|
||||
|
||||
**Strategy:** Services raise `ValueError`; API handlers catch and re-raise as `HTTPException`. No service module imports FastAPI.
|
||||
|
||||
**Patterns:**
|
||||
- Auth service raises `ValueError` → API layer maps to 401/422/400
|
||||
- Storage errors (`S3Error`, cloud provider errors) wrapped in `backend/storage/exceptions.py` → 503 or 404
|
||||
- `write_audit_log` never raises: it logs the failure and swallows the exception so that the primary operation is not interrupted
|
||||
- `CloudConnectionError` (`backend/storage/exceptions.py`) used for cloud-specific failures
|
||||
|
||||
## Cross-Cutting Concerns
|
||||
|
||||
**Logging:** Python `logging` module with `logger = logging.getLogger(__name__)` in each module. No structured logging framework.
|
||||
|
||||
**Validation:** Pydantic models at API boundary. Field validators on sensitive fields (filename rejects path separators, permission allowlists, non-negative quota). No model accepts `**kwargs`.
|
||||
|
||||
**Authentication:** Every non-public endpoint injects `get_current_user`, `get_current_admin`, or `get_regular_user` via FastAPI `Depends`. No endpoint bypasses the dependency chain.
|
||||
|
||||
**Rate Limiting:** slowapi (wraps limits-library) on all auth endpoints. Per-IP limits via `@limiter.limit("10/minute")`. Per-account Redis counter on login: `login_attempts:{email}`, 10 attempts per 15-minute window.
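
The per-account counter can be pictured with the sketch below, which keeps the `login_attempts:{email}` key format and the 10-per-15-minutes figures from the paragraph above but replaces Redis with a dictionary. Treating the window as fixed (starting at the first attempt) is an assumption; the source does not say whether the window is fixed or sliding.

```python
import time


class LoginAttemptLimiter:
    """In-memory stand-in for the Redis counter described above."""

    def __init__(self, limit: int = 10, window_seconds: int = 900, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._counters = {}

    def allow(self, email: str) -> bool:
        key = f"login_attempts:{email}"
        now = self._clock()
        count, expires_at = self._counters.get(key, (0, now + self.window))
        if now >= expires_at:
            # Window elapsed: start a new one, as a key TTL expiry would.
            count, expires_at = 0, now + self.window
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self.limit
```

Keying on the account rather than the IP address means an attacker rotating source addresses still hits the same counter for a given email.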
|
||||
|
||||
**Audit Logging:** `write_audit_log()` called inline in API handlers for all auth events, document operations, admin actions, and cloud connections. Written within the handler's transaction via `session.flush()`.
|
||||
|
||||
**HKDF Credential Encryption:** Cloud credentials encrypted with `Fernet(HKDF-SHA256(master_key, salt=user_id, purpose="cloud-creds"))` before DB storage. Implementation in `backend/storage/cloud_utils.py`.
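
The key-derivation half of that formula can be sketched with the standard library alone. The code below implements HKDF-SHA256 as specified in RFC 5869 and encodes the 32-byte output in the url-safe base64 form that Fernet keys use. It does not perform any encryption, and `derive_user_fernet_key` is a hypothetical name; the actual code in `backend/storage/cloud_utils.py` is not shown in this document and may differ.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    # Extract: condense the input key material into a pseudorandom key.
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output bytes are available.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_fernet_key(master_key: bytes, user_id: str) -> bytes:
    raw = hkdf_sha256(master_key, salt=user_id.encode(), info=b"cloud-creds")
    return base64.urlsafe_b64encode(raw)  # 32 raw bytes in Fernet key encoding
```

Using the user id as salt gives every account a different key from the same master secret, so a leaked per-user key does not expose other users' stored credentials.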
|
||||
|
||||
---
|
||||
|
||||
## AI Provider Abstraction
|
||||
|
||||
- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
|
||||
- `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
|
||||
- `suggest_topics(document_text, system_prompt) → list[str]`
|
||||
- `health_check() → bool`
|
||||
- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
|
||||
- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
|
||||
- Provider is re-instantiated on every request (stateless; no connection pooling)
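
A compact sketch of the abstraction described above follows. The method signatures and the `ClassificationResult` fields are taken from this document; `KeywordProvider` and the `_PROVIDERS` registry are hypothetical and exist only so the interface can be exercised without a network call.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    topics: list = field(default_factory=list)
    suggested_new_topics: list = field(default_factory=list)
    reasoning: str = ""


class AIProvider(ABC):
    @abstractmethod
    async def classify(self, document_text, existing_topics, system_prompt) -> ClassificationResult: ...

    @abstractmethod
    async def suggest_topics(self, document_text, system_prompt) -> list: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class KeywordProvider(AIProvider):
    """Hypothetical offline provider used only to exercise the interface."""

    async def classify(self, document_text, existing_topics, system_prompt):
        text = document_text.lower()
        hits = [t for t in existing_topics if t.lower() in text]
        return ClassificationResult(topics=hits, reasoning="substring match")

    async def suggest_topics(self, document_text, system_prompt):
        return []

    async def health_check(self):
        return True


_PROVIDERS = {"keyword": KeywordProvider}


def get_provider(settings: dict) -> AIProvider:
    # A new instance per call, matching the stateless design described above.
    return _PROVIDERS[settings["active_provider"]]()
```

A caller would run, for example, `asyncio.run(get_provider({"active_provider": "keyword"}).classify(text, topics, prompt))`. Adding a backend then means registering one more class rather than touching the classifier.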
|
||||
|
||||
---
|
||||
|
||||
## Data Persistence
|
||||
|
||||
All state is stored on the local filesystem — no database:
|
||||
|
||||
| Store | Path | Format | Access |
|
||||
|---|---|---|---|
|
||||
| Uploaded files | `data/uploads/<id>.<ext>` | Original binary | Direct filesystem |
|
||||
| Document metadata | `data/metadata/<id>.json` | JSON per document | `filelock` protected |
|
||||
| Topic list | `data/topics.json` | `{"topics": [...]}` | `filelock` protected |
|
||||
| Settings | `data/settings.json` | JSON object | `filelock` protected |
|
||||
|
||||
`filelock` is used to prevent concurrent write corruption on JSON files.
|
||||
|
||||
---
|
||||
|
||||
## Frontend Architecture
|
||||
|
||||
- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
|
||||
- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer — components never call the API directly
|
||||
- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
|
||||
- Vite proxies `/api/*` to `http://localhost:8000` in dev mode
|
||||
|
||||
---
|
||||
|
||||
## Key Patterns
|
||||
|
||||
- **Provider Pattern** — AI backends are interchangeable at runtime via settings
|
||||
- **Service Layer** — `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
|
||||
- **Pinia-as-Facade** — stores encapsulate all async API calls; views stay declarative
|
||||
|
||||
---
|
||||
|
||||
## Constraints & Notable Decisions
|
||||
|
||||
- All CORS origins allowed (`allow_origins=["*"]`) — suitable for local dev, not production
|
||||
- **Auth dependency chain (Phase 2+):** `get_current_user` (validates JWT, returns User) → `get_current_admin` (requires role=admin) / `get_regular_user` (requires role!=admin, 403 for admin accounts on document endpoints). `get_regular_user` enforces SEC-04: admin accounts cannot read document content (CLAUDE.md).
|
||||
- **Ownership assertion pattern (Phase 3+):** Every `/api/documents/*` handler asserts `doc.user_id == current_user.id` before returning — raises 404 (not 403) to prevent information leakage (D-16, T-03-11). Cross-user access and non-existence are indistinguishable.
|
||||
- **Topic namespace model (Phase 3+):** `user_id=NULL` = system topic (visible to all); `user_id=<uuid>` = per-user topic. `load_topics_for_user(session, user_id)` returns union via `or_(Topic.user_id == user_id, Topic.user_id.is_(None))`. Admin creates system topics via `POST /api/admin/topics`.
|
||||
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
|
||||
- AI provider re-instantiated per request (no connection reuse)
|
||||
- Data directory is volume-mounted in Docker; no backup or migration strategy
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- No API versioning strategy visible
|
||||
- Frontend has no error boundary or global error handling component
|
||||
- No pagination on document list endpoint (could be a scaling concern)
|
||||
*Architecture analysis: 2026-06-02*
|
||||
|
||||