Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ARCHITECTURE — document-scanner
Last updated: 2026-05-21
Summary
Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend through the Vite dev proxy in development, or directly in production. The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. The AI provider is selected at runtime through a provider abstraction.
System Overview
Browser (Vue 3 SPA)
    │  HTTP/JSON + multipart
    ▼
FastAPI (port 8000)
    ├── api/documents.py    – upload, list, get, delete, reclassify
    ├── api/topics.py       – CRUD for topic list
    ├── api/settings.py     – AI provider config + system prompt
    │
    ├── services/
    │   ├── extractor.py    – text extraction dispatch
    │   ├── classifier.py   – orchestrates AI call + topic creation
    │   └── storage.py      – flat-file JSON + filesystem persistence
    │
    └── ai/                 – provider abstraction layer
        ├── base.py         – AIProvider ABC + ClassificationResult
        ├── __init__.py     – get_provider() factory
        ├── anthropic_provider.py
        ├── openai_provider.py
        ├── ollama_provider.py    (subclasses OpenAIProvider)
        └── lmstudio_provider.py  (subclasses OpenAIProvider)
            │
            ▼
        External AI service (Anthropic API / OpenAI API /
        Ollama / LM Studio at host.docker.internal)
Request Flow — Document Upload + Classification
- Frontend POSTs `multipart/form-data` to `POST /api/documents/upload`
- `documents.py` saves the file to `data/uploads/`, calls `extractor.extract_text()`
- Extracted text (truncated to 50,000 chars) is stored in `data/metadata/<id>.json`
- If `auto_classify=true`, `classifier.classify_document()` is called:
  a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
  b. Passes document text + existing topics to `provider.classify()`
  c. Any suggested new topics are created via `storage.add_topic()`
  d. Document metadata is updated with assigned topics
- Full document metadata JSON is returned to the frontend
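The upload flow above can be sketched in plain Python. This is an illustrative outline only, not the project's code: `handle_upload`, its parameters, the metadata field names, and the stubbed `extract_text` / `classify_document` bodies are assumptions. Only the directory layout, the 50,000-character limit, and the order of steps come from the description.

```python
import json
import uuid
from pathlib import Path

MAX_TEXT_CHARS = 50_000  # truncation limit stated in the request flow


def extract_text(path: Path) -> str:
    # Stand-in for extractor.extract_text(); this stub only reads plain text.
    return path.read_text(encoding="utf-8", errors="replace")


def classify_document(text: str, existing_topics: list[str]) -> list[str]:
    # Stand-in for classifier.classify_document(); the real one calls an AI provider.
    return [t for t in existing_topics if t.lower() in text.lower()]


def handle_upload(data_dir: Path, filename: str, content: bytes,
                  auto_classify: bool, existing_topics: list[str]) -> dict:
    doc_id = uuid.uuid4().hex
    uploads = data_dir / "uploads"
    metadata_dir = data_dir / "metadata"
    uploads.mkdir(parents=True, exist_ok=True)
    metadata_dir.mkdir(parents=True, exist_ok=True)

    # Save the original file as data/uploads/<id>.<ext>.
    stored = uploads / f"{doc_id}{Path(filename).suffix}"
    stored.write_bytes(content)

    # Extract the text, truncate it, and store data/metadata/<id>.json.
    text = extract_text(stored)[:MAX_TEXT_CHARS]
    metadata = {"id": doc_id, "filename": filename, "text": text, "topics": []}
    metadata_path = metadata_dir / f"{doc_id}.json"
    metadata_path.write_text(json.dumps(metadata), encoding="utf-8")

    # Optionally classify, then update the stored metadata.
    if auto_classify:
        metadata["topics"] = classify_document(text, existing_topics)
        metadata_path.write_text(json.dumps(metadata), encoding="utf-8")

    # The full metadata is what the frontend receives.
    return metadata
```

In the real backend the same steps would sit behind the FastAPI route and go through `storage.py` rather than writing files directly.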
AI Provider Abstraction
- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
  - `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
  - `suggest_topics(document_text, system_prompt) → list[str]`
  - `health_check() → bool`
- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
- Provider is re-instantiated on every request (stateless; no connection pooling)
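A minimal sketch of how such an abstraction could be laid out follows. The class names, method names, and the `settings["active_provider"]` lookup come from the description above. The `ClassificationResult` fields, the constructor argument, the `_REGISTRY` dictionary and its keys, and the stubbed method bodies are assumptions made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    # Field names are assumptions; the source only names the class.
    topics: list[str] = field(default_factory=list)
    suggested_topics: list[str] = field(default_factory=list)


class AIProvider(ABC):
    """Interface that every AI backend implements."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    async def classify(self, document_text: str, existing_topics: list[str],
                       system_prompt: str) -> ClassificationResult: ...

    @abstractmethod
    async def suggest_topics(self, document_text: str,
                             system_prompt: str) -> list[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class OpenAIProvider(AIProvider):
    # Stub bodies: a real provider would call the remote API here.
    async def classify(self, document_text, existing_topics, system_prompt):
        return ClassificationResult(topics=existing_topics[:1])

    async def suggest_topics(self, document_text, system_prompt):
        return []

    async def health_check(self):
        return True


class OllamaProvider(OpenAIProvider):
    """Inherits the OpenAI-compatible logic; only the endpoint would differ."""


# Hypothetical registry keys; the real settings values are not shown in the source.
_REGISTRY = {"openai": OpenAIProvider, "ollama": OllamaProvider}


def get_provider(settings: dict) -> AIProvider:
    name = settings["active_provider"]
    try:
        provider_cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown AI provider: {name!r}") from None
    # A fresh instance per call mirrors the stateless, per-request behaviour.
    return provider_cls(settings.get(name, {}))


provider = get_provider({"active_provider": "ollama"})
print(asyncio.run(provider.health_check()))
```

Because callers depend only on `AIProvider`, switching backends would be a settings change rather than a code change.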
Data Persistence
All state is stored on the local filesystem — no database:
| Store | Path | Format | Access |
|---|---|---|---|
| Uploaded files | `data/uploads/<id>.<ext>` | Original binary | Direct filesystem |
| Document metadata | `data/metadata/<id>.json` | JSON per document | filelock protected |
| Topic list | `data/topics.json` | `{"topics": [...]}` | filelock protected |
| Settings | `data/settings.json` | JSON object | filelock protected |
filelock is used to prevent concurrent write corruption on JSON files.
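As a rough illustration of lock-protected JSON access, the sketch below appends a topic to a `{"topics": [...]}` file using the third-party `filelock` package. The function signature, the `.lock` suffix for the lock file, and the timeout value are assumptions; the project's actual `storage.add_topic()` may differ.

```python
import json
from pathlib import Path

from filelock import FileLock  # third-party: pip install filelock


def add_topic(topics_path: Path, topic: str, timeout: float = 5.0) -> list[str]:
    """Append a topic to the topics file while holding an exclusive lock."""
    lock = FileLock(str(topics_path) + ".lock", timeout=timeout)
    with lock:
        # The whole read-modify-write cycle runs inside the lock, so two
        # concurrent requests cannot overwrite each other's additions.
        if topics_path.exists():
            data = json.loads(topics_path.read_text(encoding="utf-8"))
        else:
            data = {"topics": []}
        if topic not in data["topics"]:
            data["topics"].append(topic)
            topics_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
        return data["topics"]
```

Holding the lock only around the write would not be enough here, since a stale read taken before the lock could still discard another request's topic.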
Frontend Architecture
- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer; components never call the API directly
- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
- Vite proxies `/api/*` to `http://localhost:8000` in dev mode
Key Patterns
- Provider Pattern: AI backends are interchangeable at runtime via settings
- Service Layer: `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
- Pinia-as-Facade: stores encapsulate all async API calls; views stay declarative
Constraints & Notable Decisions
- All CORS origins allowed (`allow_origins=["*"]`); suitable for local dev, not production
- No authentication or user model
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
- AI provider re-instantiated per request (no connection reuse)
- Data directory is volume-mounted in Docker; no backup or migration strategy
Gaps / Unknowns
- No API versioning strategy visible
- Frontend has no error boundary or global error handling component
- No pagination on document list endpoint (could be a scaling concern)