# ARCHITECTURE — document-scanner

_Last updated: 2026-05-21_

## Summary

Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.

---

## System Overview

```
Browser (Vue 3 SPA)
        │  HTTP/JSON + multipart
        ▼
FastAPI (port 8000)
  ├── api/documents.py   – upload, list, get, delete, reclassify
  ├── api/topics.py      – CRUD for topic list
  ├── api/settings.py    – AI provider config + system prompt
  │
  ├── services/
  │     ├── extractor.py   – text extraction dispatch
  │     ├── classifier.py  – orchestrates AI call + topic creation
  │     └── storage.py     – flat-file JSON + filesystem persistence
  │
  └── ai/                  – provider abstraction layer
        ├── base.py        – AIProvider ABC + ClassificationResult
        ├── __init__.py    – get_provider() factory
        ├── anthropic_provider.py
        ├── openai_provider.py
        ├── ollama_provider.py   (subclasses OpenAIProvider)
        └── lmstudio_provider.py (subclasses OpenAIProvider)
              │
              ▼
External AI service (Anthropic API / OpenAI API / Ollama / LM Studio — host.docker.internal)
```

---

## Request Flow — Document Upload + Classification

1. Frontend POSTs `multipart/form-data` to `POST /api/documents/upload`
2. `documents.py` saves the file to `data/uploads/`, calls `extractor.extract_text()`
3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/.json`
4. If `auto_classify=true`, `classifier.classify_document()` is called:
   a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
   b. Passes document text + existing topics to `provider.classify()`
   c. Any suggested new topics are created via `storage.add_topic()`
   d. Document metadata is updated with assigned topics
5. Full document metadata JSON is returned to the frontend

---

## AI Provider Abstraction

- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
  - `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
  - `suggest_topics(document_text, system_prompt) → list[str]`
  - `health_check() → bool`
- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
- Provider is re-instantiated on every request (stateless; no connection pooling)

---

## Data Persistence

All state is stored on the local filesystem — no database:

| Store | Path | Format | Access |
|---|---|---|---|
| Uploaded files | `data/uploads/.` | Original binary | Direct filesystem |
| Document metadata | `data/metadata/.json` | JSON per document | `filelock` protected |
| Topic list | `data/topics.json` | `{"topics": [...]}` | `filelock` protected |
| Settings | `data/settings.json` | JSON object | `filelock` protected |

`filelock` is used to prevent concurrent write corruption on JSON files.

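
The lock-guarded read-modify-write that the table implies for `data/topics.json` might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's `storage.py`: the function names and signatures are hypothetical, and a `threading.Lock` stands in for `filelock` so the example runs on the standard library alone. A `threading.Lock` only coordinates threads inside one process, so it is not a drop-in replacement for a file lock.

```python
import json
import threading
from pathlib import Path

# Assumption: stand-in for the project's filelock-based lock.
# It only serialises threads within a single process.
_topics_lock = threading.Lock()


def load_topics(path: Path) -> list[str]:
    """Return the stored topic list, or [] if the file does not exist yet."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)["topics"]


def add_topic(path: Path, topic: str) -> list[str]:
    """Append a topic if it is new and rewrite the {"topics": [...]} file.

    The read and the write happen under the same lock, so two concurrent
    callers cannot both read the old list and overwrite each other's change.
    """
    with _topics_lock:
        topics = load_topics(path)
        if topic not in topics:
            topics.append(topic)
            with path.open("w", encoding="utf-8") as fh:
                json.dump({"topics": topics}, fh, indent=2)
        return topics
```

Holding the lock across both the read and the write is the important part: locking only the write would still allow a lost update when two requests add topics at the same time.
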

---

## Frontend Architecture

- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer — components never call the API directly
- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
- Vite proxies `/api/*` to `http://localhost:8000` in dev mode

---

## Key Patterns

- **Provider Pattern** — AI backends are interchangeable at runtime via settings
- **Service Layer** — `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
- **Pinia-as-Facade** — stores encapsulate all async API calls; views stay declarative

---

## Constraints & Notable Decisions

- All CORS origins allowed (`allow_origins=["*"]`) — suitable for local dev, not production
- No authentication or user model
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
- AI provider re-instantiated per request (no connection reuse)
- Data directory is volume-mounted in Docker; no backup or migration strategy

---

## Gaps / Unknowns

- No API versioning strategy visible
- Frontend has no error boundary or global error handling component
- No pagination on document list endpoint (could be a scaling concern)
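
As an appendix-style illustration of the Provider Pattern listed under Key Patterns, the sketch below shows one way an abstract `AIProvider`, a `get_provider(settings)` factory, and per-request instantiation could fit together. Only the three method names, the `ClassificationResult` class name, and the `settings["active_provider"]` key come from this document. The result fields, the registry dict, and `KeywordProvider` (an offline stand-in so the example runs without any AI service) are assumptions made for the sketch.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    # Assumption: field names are illustrative; the document names only the class.
    topics: list[str] = field(default_factory=list)
    suggested_topics: list[str] = field(default_factory=list)


class AIProvider(ABC):
    """Interface with the three async methods the document lists."""

    @abstractmethod
    async def classify(self, document_text: str, existing_topics: list[str],
                       system_prompt: str) -> ClassificationResult: ...

    @abstractmethod
    async def suggest_topics(self, document_text: str,
                             system_prompt: str) -> list[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class KeywordProvider(AIProvider):
    """Hypothetical offline provider: assigns topics whose name occurs in the text."""

    async def classify(self, document_text, existing_topics, system_prompt):
        text = document_text.lower()
        matched = [t for t in existing_topics if t.lower() in text]
        return ClassificationResult(topics=matched)

    async def suggest_topics(self, document_text, system_prompt):
        return []

    async def health_check(self):
        return True


# Assumption: a plain dict registry; the real factory may dispatch differently.
_PROVIDERS = {"keyword": KeywordProvider}


def get_provider(settings: dict) -> AIProvider:
    """Build a fresh provider instance from settings["active_provider"]."""
    name = settings["active_provider"]
    if name not in _PROVIDERS:
        raise ValueError(f"Unknown AI provider: {name!r}")
    return _PROVIDERS[name]()


async def main() -> list[str]:
    # A new provider is created for this call, mirroring per-request instantiation.
    provider = get_provider({"active_provider": "keyword"})
    result = await provider.classify(
        "Invoice for March rent", ["invoice", "contract"], system_prompt=""
    )
    return result.topics


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints ['invoice']
```

Because callers depend only on the `AIProvider` interface, changing `active_provider` in the settings swaps the backend without touching the calling code. The cost is that each call constructs a new object, which matches the "no connection reuse" constraint noted above.
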