aadc69fea0
- 03-03-SUMMARY.md: documents all endpoint auth guards, ownership assertions, namespace isolation pattern, and SQLite compat deviations - STATE.md: advance to Plan 3/5 complete, add 6 key decisions (get_regular_user, 404-not-403, CASE WHEN, or_/is_(None), AI user namespace) - ROADMAP.md: mark 03-03-PLAN.md complete - REQUIREMENTS.md: mark SEC-04 and DOC-04 complete
117 lines
5.4 KiB
Markdown
117 lines
5.4 KiB
Markdown
# ARCHITECTURE — document-scanner
|
||
|
||
_Last updated: 2026-05-21_
|
||
|
||
## Summary
|
||
|
||
Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.
|
||
|
||
---
|
||
|
||
## System Overview
|
||
|
||
```
|
||
Browser (Vue 3 SPA)
|
||
│ HTTP/JSON + multipart
|
||
▼
|
||
FastAPI (port 8000)
|
||
├── api/documents.py – upload, list, get, delete, reclassify
|
||
├── api/topics.py – CRUD for topic list
|
||
├── api/settings.py – AI provider config + system prompt
|
||
│
|
||
├── services/
|
||
│ ├── extractor.py – text extraction dispatch
|
||
│ ├── classifier.py – orchestrates AI call + topic creation
|
||
│ └── storage.py – flat-file JSON + filesystem persistence
|
||
│
|
||
└── ai/ – provider abstraction layer
|
||
├── base.py – AIProvider ABC + ClassificationResult
|
||
├── __init__.py – get_provider() factory
|
||
├── anthropic_provider.py
|
||
├── openai_provider.py
|
||
├── ollama_provider.py (subclasses OpenAIProvider)
|
||
└── lmstudio_provider.py (subclasses OpenAIProvider)
|
||
│
|
||
▼
|
||
External AI service (Anthropic API / OpenAI API /
|
||
Ollama / LM Studio — host.docker.internal)
|
||
```
|
||
|
||
---
|
||
|
||
## Request Flow — Document Upload + Classification
|
||
|
||
1. Frontend POSTs `multipart/form-data` to `POST /api/documents/upload`
|
||
2. `documents.py` saves the file to `data/uploads/`, calls `extractor.extract_text()`
|
||
3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/<id>.json`
|
||
4. If `auto_classify=true`, `classifier.classify_document()` is called:
|
||
a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
|
||
b. Passes document text + existing topics to `provider.classify()`
|
||
c. Any suggested new topics are created via `storage.add_topic()`
|
||
d. Document metadata is updated with assigned topics
|
||
5. Full document metadata JSON is returned to the frontend
|
||
|
||
---
|
||
|
||
## AI Provider Abstraction
|
||
|
||
- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
|
||
- `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
|
||
- `suggest_topics(document_text, system_prompt) → list[str]`
|
||
- `health_check() → bool`
|
||
- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
|
||
- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
|
||
- Provider is re-instantiated on every request (stateless; no connection pooling)
|
||
|
||
---
|
||
|
||
## Data Persistence
|
||
|
||
All state is stored on the local filesystem — no database:
|
||
|
||
| Store | Path | Format | Access |
|
||
|---|---|---|---|
|
||
| Uploaded files | `data/uploads/<id>.<ext>` | Original binary | Direct filesystem |
|
||
| Document metadata | `data/metadata/<id>.json` | JSON per document | `filelock` protected |
|
||
| Topic list | `data/topics.json` | `{"topics": [...]}` | `filelock` protected |
|
||
| Settings | `data/settings.json` | JSON object | `filelock` protected |
|
||
|
||
`filelock` is used to prevent concurrent write corruption on JSON files.
|
||
|
||
---
|
||
|
||
## Frontend Architecture
|
||
|
||
- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
|
||
- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer — components never call the API directly
|
||
- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
|
||
- Vite proxies `/api/*` to `http://localhost:8000` in dev mode
|
||
|
||
---
|
||
|
||
## Key Patterns
|
||
|
||
- **Provider Pattern** — AI backends are interchangeable at runtime via settings
|
||
- **Service Layer** — `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
|
||
- **Pinia-as-Facade** — stores encapsulate all async API calls; views stay declarative
|
||
|
||
---
|
||
|
||
## Constraints & Notable Decisions
|
||
|
||
- All CORS origins allowed (`allow_origins=["*"]`) — suitable for local dev, not production
|
||
- **Auth dependency chain (Phase 2+):** `get_current_user` (validates JWT, returns User) → `get_current_admin` (requires role=admin) / `get_regular_user` (requires role!=admin, 403 for admin accounts on document endpoints). `get_regular_user` enforces SEC-04: admin accounts cannot read document content (CLAUDE.md).
|
||
- **Ownership assertion pattern (Phase 3+):** Every `/api/documents/*` handler asserts `doc.user_id == current_user.id` before returning — raises 404 (not 403) to prevent information leakage (D-16, T-03-11). Cross-user access and non-existence are indistinguishable.
|
||
- **Topic namespace model (Phase 3+):** `user_id=NULL` = system topic (visible to all); `user_id=<uuid>` = per-user topic. `load_topics_for_user(session, user_id)` returns union via `or_(Topic.user_id == user_id, Topic.user_id.is_(None))`. Admin creates system topics via `POST /api/admin/topics`.
|
||
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
|
||
- AI provider re-instantiated per request (no connection reuse)
|
||
- Data directory is volume-mounted in Docker; no backup or migration strategy
|
||
|
||
---
|
||
|
||
## Gaps / Unknowns
|
||
|
||
- No API versioning strategy visible
|
||
- Frontend has no error boundary or global error handling component
|
||
- No pagination on document list endpoint (could be a scaling concern)
|