ARCHITECTURE — document-scanner
Last updated: 2026-05-21
Summary
Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.
System Overview
Browser (Vue 3 SPA)
│ HTTP/JSON + multipart
▼
FastAPI (port 8000)
├── api/documents.py – upload, list, get, delete, reclassify
├── api/topics.py – CRUD for topic list
├── api/settings.py – AI provider config + system prompt
│
├── services/
│   ├── extractor.py – text extraction dispatch
│   ├── classifier.py – orchestrates AI call + topic creation
│   └── storage.py – flat-file JSON + filesystem persistence
│
└── ai/ – provider abstraction layer
    ├── base.py – AIProvider ABC + ClassificationResult
    ├── __init__.py – get_provider() factory
    ├── anthropic_provider.py
    ├── openai_provider.py
    ├── ollama_provider.py (subclasses OpenAIProvider)
    └── lmstudio_provider.py (subclasses OpenAIProvider)
    │
    ▼
External AI service (Anthropic API / OpenAI API /
Ollama / LM Studio — host.docker.internal)
Request Flow — Document Upload + Classification
- Frontend sends multipart/form-data to POST /api/documents/upload
- documents.py saves the file to data/uploads/ and calls extractor.extract_text()
- Extracted text (truncated to 50,000 chars) is stored in data/metadata/<id>.json
- If auto_classify=true, classifier.classify_document() is called:
  a. Loads current settings from data/settings.json → calls get_provider(settings)
  b. Passes document text + existing topics to provider.classify()
  c. Any suggested new topics are created via storage.add_topic()
  d. Document metadata is updated with assigned topics
- Full document metadata JSON is returned to the frontend
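The flow above can be sketched as a single framework-free function. This is only an illustration: handle_upload, FakeProvider, the ClassificationResult fields, and the in-memory dict are hypothetical stand-ins for documents.py, a real provider, and the files under data/. Only the order of the steps and the 50,000-character truncation come from the description above.

```python
import asyncio
import uuid
from dataclasses import dataclass, field

MAX_TEXT_CHARS = 50_000  # truncation limit stated in the flow above


@dataclass
class ClassificationResult:
    # Assumed fields; the source names the class but not its attributes.
    topics: list[str] = field(default_factory=list)
    new_topics: list[str] = field(default_factory=list)


class FakeProvider:
    """Hypothetical stand-in for a real AIProvider implementation."""

    async def classify(self, document_text, existing_topics, system_prompt):
        # Keep the first existing topic and suggest one new topic.
        return ClassificationResult(
            topics=existing_topics[:1] + ["invoices"],
            new_topics=["invoices"],
        )


async def handle_upload(raw_text, topics, metadata_store, provider, auto_classify=True):
    """Hypothetical handler: store truncated text, then optionally classify."""
    doc_id = uuid.uuid4().hex
    meta = {"id": doc_id, "text": raw_text[:MAX_TEXT_CHARS], "topics": []}
    metadata_store[doc_id] = meta  # stands in for data/metadata/<id>.json
    if auto_classify:
        result = await provider.classify(meta["text"], list(topics), system_prompt="")
        for name in result.new_topics:  # stands in for storage.add_topic()
            if name not in topics:
                topics.append(name)
        meta["topics"] = result.topics
    return meta


if __name__ == "__main__":
    store, topics = {}, ["tax"]
    meta = asyncio.run(handle_upload("x" * 60_000, topics, store, FakeProvider()))
    print(len(meta["text"]), meta["topics"], topics)
```

Passing auto_classify=False skips the provider call entirely, so the returned metadata keeps an empty topic list.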
AI Provider Abstraction
- AIProvider (ABC in ai/base.py) defines three async methods:
  - classify(document_text, existing_topics, system_prompt) → ClassificationResult
  - suggest_topics(document_text, system_prompt) → list[str]
  - health_check() → bool
- get_provider(settings: dict) factory in ai/__init__.py reads settings["active_provider"] and instantiates the correct class
- OllamaProvider and LMStudioProvider extend OpenAIProvider (both expose OpenAI-compatible endpoints)
- Provider is re-instantiated on every request (stateless; no connection pooling)
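A minimal sketch of how the abstract base class and the factory could fit together. The three method signatures and the settings["active_provider"] lookup follow the description above. The provider key strings ("openai", "ollama"), the "providers" settings sub-dictionary, the constructor argument, and the placeholder method bodies are assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    # Assumed field; the source only names the class.
    topics: list[str] = field(default_factory=list)


class AIProvider(ABC):
    def __init__(self, config: dict):
        self.config = config  # assumed: per-provider settings

    @abstractmethod
    async def classify(self, document_text: str, existing_topics: list[str],
                       system_prompt: str) -> ClassificationResult: ...

    @abstractmethod
    async def suggest_topics(self, document_text: str, system_prompt: str) -> list[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class OpenAIProvider(AIProvider):
    # Placeholder bodies; a real provider would call the remote API here.
    async def classify(self, document_text, existing_topics, system_prompt):
        return ClassificationResult()

    async def suggest_topics(self, document_text, system_prompt):
        return []

    async def health_check(self):
        return True


class OllamaProvider(OpenAIProvider):
    """Inherits everything; only connection details would differ."""


# Hypothetical registry; the key strings are assumptions.
_PROVIDERS = {"openai": OpenAIProvider, "ollama": OllamaProvider}


def get_provider(settings: dict) -> AIProvider:
    name = settings["active_provider"]
    try:
        provider_cls = _PROVIDERS[name]
    except KeyError:
        raise ValueError(f"Unknown provider: {name!r}") from None
    # A fresh instance on every call matches the stateless design noted above.
    return provider_cls(settings.get("providers", {}).get(name, {}))


if __name__ == "__main__":
    provider = get_provider({"active_provider": "ollama"})
    print(type(provider).__name__, asyncio.run(provider.health_check()))
```

Because the factory returns a new object on each call, switching active_provider in the settings takes effect on the next request without a restart.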
Data Persistence
All state is stored on the local filesystem — no database:
| Store | Path | Format | Access |
|---|---|---|---|
| Uploaded files | data/uploads/<id>.<ext> | Original binary | Direct filesystem |
| Document metadata | data/metadata/<id>.json | JSON per document | filelock protected |
| Topic list | data/topics.json | {"topics": [...]} | filelock protected |
| Settings | data/settings.json | JSON object | filelock protected |
filelock is used to prevent concurrent write corruption on JSON files.
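A lock-protected read-modify-write on one of these JSON files might look like the sketch below. It uses FileLock from the filelock package. The helper names (update_json, add_topic), the ".lock" suffix, and the 5-second timeout are assumptions, and the temp-file rename is an extra precaution that the source does not mention. If filelock is not installed, the sketch falls back to an in-process lock so that it still runs; that fallback gives no protection across processes.

```python
import json
import os
import tempfile
import threading

try:
    from filelock import FileLock  # third-party package named in the text
except ImportError:
    class FileLock:
        """In-process fallback only; NOT a substitute for real file locking."""
        _locks = {}

        def __init__(self, lock_file, timeout=-1):
            self._lock = self._locks.setdefault(str(lock_file), threading.RLock())

        def __enter__(self):
            self._lock.acquire()
            return self

        def __exit__(self, *exc_info):
            self._lock.release()


def update_json(path, mutate, default):
    """Load a JSON file, apply mutate(data), and write it back under a lock."""
    with FileLock(path + ".lock", timeout=5):
        if os.path.exists(path):
            with open(path, encoding="utf-8") as fh:
                data = json.load(fh)
        else:
            data = default
        mutate(data)
        # Write to a temp file, then rename, so readers never see a partial file.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
        os.replace(tmp_path, path)
        return data


def add_topic(path, name):
    """Hypothetical helper: append a topic name if it is not already present."""
    def mutate(data):
        if name not in data["topics"]:
            data["topics"].append(name)
    return update_json(path, mutate, {"topics": []})


if __name__ == "__main__":
    topics_path = os.path.join(tempfile.mkdtemp(), "topics.json")
    add_topic(topics_path, "tax")
    print(add_topic(topics_path, "invoices"))
```

Holding the lock across the whole read-modify-write, rather than only the write, is what prevents two concurrent updates from overwriting each other.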
Frontend Architecture
- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
- Three Pinia stores (documents, topics, settings) act as the sole data access layer; components never call the API directly
- src/api/client.js is the single HTTP adapter (wraps fetch)
- Vite proxies /api/* to http://localhost:8000 in dev mode
Key Patterns
- Provider Pattern: AI backends are interchangeable at runtime via settings
- Service Layer: extractor, classifier, storage are pure Python modules; no FastAPI coupling
- Pinia-as-Facade: stores encapsulate all async API calls; views stay declarative
Constraints & Notable Decisions
- All CORS origins allowed (allow_origins=["*"]); suitable for local dev, not production
- Auth dependency chain (Phase 2+): get_current_user (validates JWT, returns User) → get_current_admin (requires role=admin) / get_regular_user (requires role!=admin, 403 for admin accounts on document endpoints). get_regular_user enforces SEC-04: admin accounts cannot read document content (CLAUDE.md).
- Ownership assertion pattern (Phase 3+): Every /api/documents/* handler asserts doc.user_id == current_user.id before returning and raises 404 (not 403) on mismatch to prevent information leakage (D-16, T-03-11). Cross-user access and non-existence are indistinguishable.
- Topic namespace model (Phase 3+): user_id=NULL = system topic (visible to all); user_id=<uuid> = per-user topic. load_topics_for_user(session, user_id) returns the union via or_(Topic.user_id == user_id, Topic.user_id.is_(None)). Admin creates system topics via POST /api/admin/topics.
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
- AI provider re-instantiated per request (no connection reuse)
- Data directory is volume-mounted in Docker; no backup or migration strategy
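The role guards, the 404-not-403 ownership check, and the topic visibility rule listed above can be illustrated without any framework. In this sketch HTTPError stands in for FastAPI's HTTPException, the dataclasses stand in for the ORM models, and get_owned_document and topics_for_user are hypothetical helper names. The role string "user" and the error messages are also assumptions.

```python
from dataclasses import dataclass
from typing import Optional


class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException so the sketch has no dependencies."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


@dataclass
class User:
    id: str
    role: str  # "admin" or "user" (the non-admin role name is assumed)


@dataclass
class Document:
    id: str
    user_id: str


@dataclass
class Topic:
    name: str
    user_id: Optional[str]  # None marks a system topic


def get_current_admin(user: User) -> User:
    if user.role != "admin":
        raise HTTPError(403, "Admin role required")
    return user


def get_regular_user(user: User) -> User:
    # SEC-04: admin accounts may not reach document endpoints.
    if user.role == "admin":
        raise HTTPError(403, "Admin accounts cannot access documents")
    return user


def get_owned_document(docs: dict, doc_id: str, user: User) -> Document:
    doc = docs.get(doc_id)
    # Same 404 for "does not exist" and "belongs to someone else".
    if doc is None or doc.user_id != user.id:
        raise HTTPError(404, "Document not found")
    return doc


def topics_for_user(topics: list, user_id: str) -> list:
    # In-memory equivalent of
    # or_(Topic.user_id == user_id, Topic.user_id.is_(None)).
    return [t for t in topics if t.user_id == user_id or t.user_id is None]


if __name__ == "__main__":
    docs = {"d1": Document("d1", "alice")}
    try:
        get_owned_document(docs, "d1", User("bob", "user"))
    except HTTPError as err:
        print(err.status_code, err.detail)
```

Returning the same status and message in both failure cases means a client cannot probe for valid document IDs that belong to other users.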
Gaps / Unknowns
- No API versioning strategy visible
- Frontend has no error boundary or global error handling component
- No pagination on document list endpoint (could be a scaling concern)