chore: initial commit — existing single-user document scanner codebase

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-22 08:53:28 +02:00
parent 6fed5ba531
commit 7a34807fa0
71 changed files with 16408 additions and 0 deletions
@@ -0,0 +1,114 @@
+# ARCHITECTURE — document-scanner
+
+_Last updated: 2026-05-21_
+
+## Summary
+
+Document Scanner is a two-tier web application: a Vue 3 SPA talks to a FastAPI backend, through the Vite dev proxy in development and directly in production. The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. The AI provider is selected at runtime from settings through a provider abstraction.
+
+---
+
+## System Overview
+
+```
+Browser (Vue 3 SPA)
+      │  HTTP/JSON + multipart
+      ▼
+FastAPI  (port 8000)
+  ├── api/documents.py   – upload, list, get, delete, reclassify
+  ├── api/topics.py      – CRUD for topic list
+  ├── api/settings.py    – AI provider config + system prompt
+  │
+  ├── services/
+  │   ├── extractor.py   – text extraction dispatch
+  │   ├── classifier.py  – orchestrates AI call + topic creation
+  │   └── storage.py     – flat-file JSON + filesystem persistence
+  │
+  └── ai/                – provider abstraction layer
+      ├── base.py        – AIProvider ABC + ClassificationResult
+      ├── __init__.py    – get_provider() factory
+      ├── anthropic_provider.py
+      ├── openai_provider.py
+      ├── ollama_provider.py   (subclasses OpenAIProvider)
+      └── lmstudio_provider.py (subclasses OpenAIProvider)
+                │
+                ▼
+     External AI service (Anthropic API / OpenAI API /
+                          Ollama / LM Studio — host.docker.internal)
+```
+
+---
+
+## Request Flow — Document Upload + Classification
+
+1. The frontend sends a `multipart/form-data` request to `POST /api/documents/upload`
+2. `documents.py` saves the file to `data/uploads/`, then calls `extractor.extract_text()`
+3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/<id>.json`
+4. If `auto_classify=true`, `classifier.classify_document()` is called:
+   a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
+   b. Passes document text + existing topics to `provider.classify()`
+   c. Any suggested new topics are created via `storage.add_topic()`
+   d. Document metadata is updated with assigned topics
+5. Full document metadata JSON is returned to the frontend
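+
+The steps above can be sketched roughly as follows. This is a minimal, framework-free illustration rather than the code in `documents.py`: `handle_upload`, `classify_stub`, the metadata field names, and the plain-text-only `extract_text` are hypothetical stand-ins, and the real AI call is replaced by a substring match so the sketch runs offline.
+
+```python
+import asyncio
+import json
+import tempfile
+import uuid
+from pathlib import Path
+
+MAX_TEXT_CHARS = 50_000  # truncation limit mentioned in step 3
+
+
+def extract_text(path: Path) -> str:
+    # Stand-in for extractor.extract_text(); only handles plain text here.
+    return path.read_text(encoding="utf-8", errors="replace")
+
+
+async def classify_stub(text: str, topics: list[str]) -> dict:
+    # Stand-in for provider.classify(): substring match, never suggests new topics.
+    assigned = [t for t in topics if t.lower() in text.lower()]
+    return {"topics": assigned, "new_topics": []}
+
+
+def write_metadata(metadata_dir: Path, meta: dict) -> None:
+    path = metadata_dir / f"{meta['id']}.json"
+    path.write_text(json.dumps(meta), encoding="utf-8")
+
+
+async def handle_upload(data_dir: Path, filename: str, content: bytes,
+                        topics: list[str], auto_classify: bool = True) -> dict:
+    doc_id = uuid.uuid4().hex
+    uploads = data_dir / "uploads"
+    metadata_dir = data_dir / "metadata"
+    uploads.mkdir(parents=True, exist_ok=True)
+    metadata_dir.mkdir(parents=True, exist_ok=True)
+
+    # Step 2: save the original file, then extract its text.
+    stored = uploads / f"{doc_id}{Path(filename).suffix}"
+    stored.write_bytes(content)
+    text = extract_text(stored)[:MAX_TEXT_CHARS]
+
+    # Step 3: persist the truncated text as per-document metadata.
+    meta = {"id": doc_id, "filename": filename, "text": text, "topics": []}
+    write_metadata(metadata_dir, meta)
+
+    # Step 4: optional classification, topic creation, metadata update.
+    if auto_classify:
+        result = await classify_stub(text, topics)
+        for topic in result["new_topics"]:
+            if topic not in topics:
+                topics.append(topic)
+        meta["topics"] = result["topics"] + result["new_topics"]
+        write_metadata(metadata_dir, meta)
+
+    # Step 5: the full metadata is what the caller gets back.
+    return meta
+
+
+with tempfile.TemporaryDirectory() as tmp:
+    payload = b"Invoice 2024 " + b"x" * 60_000
+    meta = asyncio.run(handle_upload(Path(tmp), "note.txt", payload, ["Invoice"]))
+```
+
+Because truncation happens before classification, only the first 50,000 characters can influence the assigned topics.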
+
+---
+
+## AI Provider Abstraction
+
+- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
+  - `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
+  - `suggest_topics(document_text, system_prompt) → list[str]`
+  - `health_check() → bool`
+- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
+- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
+- Provider is re-instantiated on every request (stateless; no connection pooling)
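+
+The interface and factory described above might look roughly like this. It is a sketch under assumptions, not the contents of `ai/base.py`: the fields of `ClassificationResult`, the `EchoProvider` class, the `_PROVIDERS` registry, and the `settings["providers"]` layout are all hypothetical. Only the three method signatures and the `settings["active_provider"]` lookup come from the description.
+
+```python
+import asyncio
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+
+
+@dataclass
+class ClassificationResult:
+    # Field names are assumptions; the source only names the class.
+    topics: list[str] = field(default_factory=list)
+    suggested_topics: list[str] = field(default_factory=list)
+
+
+class AIProvider(ABC):
+    """Interface with the three async methods listed above."""
+
+    @abstractmethod
+    async def classify(self, document_text: str, existing_topics: list[str],
+                       system_prompt: str) -> ClassificationResult: ...
+
+    @abstractmethod
+    async def suggest_topics(self, document_text: str, system_prompt: str) -> list[str]: ...
+
+    @abstractmethod
+    async def health_check(self) -> bool: ...
+
+
+class EchoProvider(AIProvider):
+    """Hypothetical offline stand-in: matches existing topics by substring."""
+
+    def __init__(self, config: dict):
+        self.config = config
+
+    async def classify(self, document_text, existing_topics, system_prompt):
+        text = document_text.lower()
+        matched = [t for t in existing_topics if t.lower() in text]
+        return ClassificationResult(topics=matched)
+
+    async def suggest_topics(self, document_text, system_prompt):
+        return []
+
+    async def health_check(self):
+        return True
+
+
+# Registry keyed by the value of settings["active_provider"].
+_PROVIDERS: dict[str, type[AIProvider]] = {"echo": EchoProvider}
+
+
+def get_provider(settings: dict) -> AIProvider:
+    name = settings["active_provider"]
+    try:
+        provider_cls = _PROVIDERS[name]
+    except KeyError:
+        raise ValueError(f"Unknown AI provider: {name!r}") from None
+    # A fresh instance per call mirrors the stateless, per-request design.
+    return provider_cls(settings.get("providers", {}).get(name, {}))
+
+
+provider = get_provider({"active_provider": "echo"})
+result = asyncio.run(provider.classify("Invoice for March", ["Invoice", "Tax"], ""))
+```
+
+Since callers depend only on `AIProvider`, changing `active_provider` in the settings swaps the backend without touching the classifier.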
+
+---
+
+## Data Persistence
+
+All state is stored on the local filesystem — no database:
+
+| Store | Path | Format | Access |
+|---|---|---|---|
+| Uploaded files | `data/uploads/<id>.<ext>` | Original binary | Direct filesystem |
+| Document metadata | `data/metadata/<id>.json` | JSON per document | `filelock` protected |
+| Topic list | `data/topics.json` | `{"topics": [...]}` | `filelock` protected |
+| Settings | `data/settings.json` | JSON object | `filelock` protected |
+
+`filelock` is used to prevent concurrent write corruption on JSON files.
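+
+A locked read-modify-write on `data/topics.json` could be sketched as below. The `add_topic` signature here is an assumption and may differ from `storage.add_topic()`. `FileLock` comes from the third-party `filelock` package, and the `threading.Lock` fallback is included only so the sketch still runs where that package is not installed.
+
+```python
+import json
+import tempfile
+from pathlib import Path
+
+try:
+    from filelock import FileLock  # third-party: pip install filelock
+except ImportError:
+    # Fallback for illustration only: guards threads in one process, not other processes.
+    import threading
+
+    _fallback_lock = threading.Lock()
+
+    def FileLock(_path):
+        return _fallback_lock
+
+
+def add_topic(topics_file: Path, topic: str) -> list[str]:
+    """Append a topic to {"topics": [...]} while holding a lock."""
+    with FileLock(str(topics_file) + ".lock"):
+        if topics_file.exists():
+            data = json.loads(topics_file.read_text(encoding="utf-8"))
+        else:
+            data = {"topics": []}
+        if topic not in data["topics"]:
+            data["topics"].append(topic)
+            topics_file.write_text(json.dumps(data, indent=2), encoding="utf-8")
+        return data["topics"]
+
+
+with tempfile.TemporaryDirectory() as tmp:
+    path = Path(tmp) / "topics.json"
+    add_topic(path, "Invoices")
+    add_topic(path, "Invoices")  # duplicate, leaves the file unchanged
+    topics = add_topic(path, "Tax")
+```
+
+The lock is held across the read and the write, not just the write, so two concurrent callers cannot both read the old list and overwrite each other's addition.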
+
+---
+
+## Frontend Architecture
+
+- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
+- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer — components never call the API directly
+- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
+- Vite proxies `/api/*` to `http://localhost:8000` in dev mode
+
+---
+
+## Key Patterns
+
+- **Provider Pattern** — AI backends are interchangeable at runtime via settings
+- **Service Layer** — `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
+- **Pinia-as-Facade** — stores encapsulate all async API calls; views stay declarative
+
+---
+
+## Constraints & Notable Decisions
+
+- All CORS origins allowed (`allow_origins=["*"]`) — suitable for local dev, not production
+- No authentication or user model
+- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
+- AI provider re-instantiated per request (no connection reuse)
+- Data directory is volume-mounted in Docker; no backup or migration strategy
+
+---
+
+## Gaps / Unknowns
+
+- No API versioning strategy visible
+- Frontend has no error boundary or global error handling component
+- No pagination on the document list endpoint, so the response size grows with the number of stored documents