chore: initial commit — existing single-user document scanner codebase
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,114 @@
|
||||
# ARCHITECTURE — document-scanner
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.
|
||||
|
||||
---
|
||||
|
||||
## System Overview
|
||||
|
||||
```
Browser (Vue 3 SPA)
        │  HTTP/JSON + multipart
        ▼
FastAPI (port 8000)
  ├── api/documents.py   – upload, list, get, delete, reclassify
  ├── api/topics.py      – CRUD for topic list
  ├── api/settings.py    – AI provider config + system prompt
  │
  ├── services/
  │     ├── extractor.py   – text extraction dispatch
  │     ├── classifier.py  – orchestrates AI call + topic creation
  │     └── storage.py     – flat-file JSON + filesystem persistence
  │
  └── ai/                  – provider abstraction layer
        ├── base.py        – AIProvider ABC + ClassificationResult
        ├── __init__.py    – get_provider() factory
        ├── anthropic_provider.py
        ├── openai_provider.py
        ├── ollama_provider.py   (subclasses OpenAIProvider)
        └── lmstudio_provider.py (subclasses OpenAIProvider)
              │
              ▼
     External AI service (Anthropic API / OpenAI API /
                          Ollama / LM Studio — host.docker.internal)
```
|
||||
|
||||
---
|
||||
|
||||
## Request Flow — Document Upload + Classification
|
||||
|
||||
1. Frontend sends a `multipart/form-data` request to `POST /api/documents/upload`
2. `documents.py` saves the file to `data/uploads/`, then calls `extractor.extract_text()`
3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/<id>.json`
4. If `auto_classify=true`, `classifier.classify_document()` is called:
   a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
   b. Passes document text + existing topics to `provider.classify()`
   c. Any suggested new topics are created via `storage.add_topic()`
   d. Document metadata is updated with assigned topics
5. Full document metadata JSON is returned to the frontend
|
||||
|
||||
---
|
||||
|
||||
## AI Provider Abstraction
|
||||
|
||||
- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
  - `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
  - `suggest_topics(document_text, system_prompt) → list[str]`
  - `health_check() → bool`
- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
- Provider is re-instantiated on every request (stateless; no connection pooling)
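
The abstraction above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the `ClassificationResult` field names, the `EchoProvider` class, the `_REGISTRY` dict, and the `providers` settings layout used for lookup are all assumptions made for the example.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    # Field names are assumptions; the real dataclass may differ.
    topics: list[str] = field(default_factory=list)
    suggested_new_topics: list[str] = field(default_factory=list)


class AIProvider(ABC):
    """Interface with the three async methods listed above."""

    @abstractmethod
    async def classify(self, document_text: str, existing_topics: list[str],
                       system_prompt: str) -> ClassificationResult: ...

    @abstractmethod
    async def suggest_topics(self, document_text: str,
                             system_prompt: str) -> list[str]: ...

    @abstractmethod
    async def health_check(self) -> bool: ...


class EchoProvider(AIProvider):
    """Hypothetical stand-in provider that makes no network calls."""

    def __init__(self, config: dict):
        self.config = config

    async def classify(self, document_text, existing_topics, system_prompt):
        # Assign every existing topic whose name appears in the text.
        lowered = document_text.lower()
        matched = [t for t in existing_topics if t.lower() in lowered]
        return ClassificationResult(topics=matched)

    async def suggest_topics(self, document_text, system_prompt):
        return []

    async def health_check(self):
        return True


# Hypothetical registry; a real factory would map the four provider names.
_REGISTRY = {"echo": EchoProvider}


def get_provider(settings: dict) -> AIProvider:
    """Instantiate the provider named by settings["active_provider"]."""
    name = settings["active_provider"]
    try:
        provider_cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown AI provider: {name!r}") from None
    # A fresh instance per call mirrors the stateless, per-request behaviour.
    return provider_cls(settings.get("providers", {}).get(name, {}))
```

Because the factory is called on every request, switching `active_provider` in settings takes effect on the next call without a restart.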
|
||||
|
||||
---
|
||||
|
||||
## Data Persistence
|
||||
|
||||
All state is stored on the local filesystem — no database:
|
||||
|
||||
| Store | Path | Format | Access |
|---|---|---|---|
| Uploaded files | `data/uploads/<id>.<ext>` | Original binary | Direct filesystem |
| Document metadata | `data/metadata/<id>.json` | JSON per document | `filelock` protected |
| Topic list | `data/topics.json` | `{"topics": [...]}` | `filelock` protected |
| Settings | `data/settings.json` | JSON object | `filelock` protected |
|
||||
|
||||
`filelock` is used to prevent concurrent write corruption on JSON files.
|
||||
|
||||
---
|
||||
|
||||
## Frontend Architecture
|
||||
|
||||
- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
|
||||
- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer — components never call the API directly
|
||||
- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
|
||||
- Vite proxies `/api/*` to the backend on port 8000 in dev mode (proxy target set in `vite.config.js`)
|
||||
|
||||
---
|
||||
|
||||
## Key Patterns
|
||||
|
||||
- **Provider Pattern** — AI backends are interchangeable at runtime via settings
|
||||
- **Service Layer** — `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
|
||||
- **Pinia-as-Facade** — stores encapsulate all async API calls; views stay declarative
|
||||
|
||||
---
|
||||
|
||||
## Constraints & Notable Decisions
|
||||
|
||||
- All CORS origins allowed (`allow_origins=["*"]`) — suitable for local dev, not production
|
||||
- No authentication or user model
|
||||
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
|
||||
- AI provider re-instantiated per request (no connection reuse)
|
||||
- Data directory is volume-mounted in Docker; no backup or migration strategy
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- No API versioning strategy visible
|
||||
- Frontend has no error boundary or global error handling component
|
||||
- No pagination at the storage layer behind the document list endpoint: every list request reads all metadata files (could be a scaling concern)
|
||||
@@ -0,0 +1,87 @@
|
||||
# CONCERNS — document-scanner
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
The codebase is a well-structured local-first prototype. The main concerns are security issues that matter if exposed beyond localhost (open CORS, no file validation, plain-text key storage), several blocking I/O calls in async handlers, and a handful of code duplication issues in the AI provider layer. Overall health is good for a local dev tool; requires hardening before any networked deployment.
|
||||
|
||||
---
|
||||
|
||||
## Concerns by Severity
|
||||
|
||||
### HIGH
|
||||
|
||||
**1. File type validation is defined but never enforced**
|
||||
`ALLOWED_MIME_TYPES` is defined in `backend/api/documents.py` but the upload handler never checks it — any file type is accepted. An attacker could upload executable files or crafted archives.
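
A minimal sketch of what enforcing the allow-list could look like, using only the standard library. The set contents and the `is_allowed_upload` helper are hypothetical; the real `ALLOWED_MIME_TYPES` may hold different values. Both the client-declared type and the file extension can be spoofed, so this is a first filter rather than full content inspection.

```python
import mimetypes
from typing import Optional

# Hypothetical allow-list for illustration only.
ALLOWED_MIME_TYPES = {
    "application/pdf",
    "image/png",
    "image/jpeg",
    "text/plain",
}


def is_allowed_upload(filename: str, declared_type: Optional[str]) -> bool:
    """Accept a file only if its declared type is allow-listed and
    agrees with the type implied by the filename extension."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return declared_type in ALLOWED_MIME_TYPES and guessed == declared_type
```

An upload handler could call this before writing anything to `data/uploads/` and reject the request with a 4xx status when it returns `False`.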
|
||||
|
||||
**2. No file size limit on uploads**
|
||||
The entire uploaded file is read before any cap is applied. A large file could exhaust memory or disk. No `MAX_UPLOAD_SIZE` check exists at the HTTP boundary.
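
One way to apply a cap while reading, sketched under assumptions: the 20 MiB limit, the chunk size, and the `read_capped` / `UploadTooLarge` names are all hypothetical. The idea is to stop as soon as the limit is crossed instead of buffering the whole body first.

```python
import io
from typing import BinaryIO

MAX_UPLOAD_SIZE = 20 * 1024 * 1024  # hypothetical 20 MiB cap
CHUNK_SIZE = 64 * 1024              # read in 64 KiB chunks


class UploadTooLarge(Exception):
    """Raised when an upload exceeds the configured size limit."""


def read_capped(stream: BinaryIO, limit: int = MAX_UPLOAD_SIZE) -> bytes:
    """Read a binary stream chunk by chunk, aborting once `limit` is exceeded."""
    buffer = bytearray()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        buffer.extend(chunk)
        if len(buffer) > limit:
            raise UploadTooLarge(f"upload exceeds {limit} bytes")
    return bytes(buffer)
```

At most `limit + CHUNK_SIZE` bytes are held in memory before the read is abandoned.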
|
||||
|
||||
**3. API keys stored in plain-text JSON**
|
||||
`backend/data/settings.json` stores API keys in plaintext. The volume mount in `docker-compose.yml` (`./backend/data:/app/data`) means any process with Docker access can read them. Masking only applies to API responses, not to disk.
|
||||
|
||||
**4. CORS fully open**
|
||||
`allow_origins=["*"]` in `main.py` means any website can make cross-origin requests to the API, including with credentials if ever added.
|
||||
|
||||
**5. Docker Compose mounts entire backend source as writable volume**
|
||||
`./backend:/app` gives the container write access to the host source tree. A path traversal or code execution bug in the app could overwrite source files.
|
||||
|
||||
---
|
||||
|
||||
### MEDIUM
|
||||
|
||||
**6. Blocking I/O in async FastAPI handlers**
|
||||
`storage.py` uses synchronous file reads/writes and `filelock` blocking calls inside `async def` endpoints. This blocks the uvicorn event loop during every request. Should use `asyncio.to_thread()` or `aiofiles` (which is already in requirements but unused).
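
The `asyncio.to_thread()` option can be sketched as below. The helper names `read_json_blocking` and `read_json` are hypothetical and not taken from `storage.py`; the point is only that the blocking call runs in a worker thread while the event loop stays free.

```python
import asyncio
import json
from pathlib import Path


def read_json_blocking(path: Path) -> dict:
    """Ordinary synchronous read, as a storage module might do today."""
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


async def read_json(path: Path) -> dict:
    """Async wrapper: off-load the blocking read to the default thread pool."""
    return await asyncio.to_thread(read_json_blocking, path)
```

The same wrapping works for a function that acquires a `filelock` lock internally, since the whole blocking section then runs off the event loop.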
|
||||
|
||||
**7. Topic rename does not cascade to documents**
|
||||
Deleting a topic removes it from document metadata, but renaming is not implemented — there is no rename endpoint. Users have no way to rename a topic without losing document associations.
|
||||
|
||||
**8. `list_metadata` loads all documents before filtering**
|
||||
`storage.list_metadata()` reads all metadata JSON files on every list request. No pagination at the storage layer — O(N) disk reads per page request as the document count grows.
|
||||
|
||||
**9. `topic_doc_counts()` scans all metadata on every topic request**
|
||||
Every `GET /api/topics` call triggers a full scan of all metadata files to count documents per topic. Not cached; will degrade linearly.
|
||||
|
||||
**10. `MAX_AI_CHARS` duplicated across 3 files**
|
||||
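The character truncation limit for AI input is duplicated as a magic constant in multiple provider files. `extractor.py` already truncates stored text to `MAX_STORED_CHARS` (50,000), so the provider-level truncation is redundant only where `MAX_AI_CHARS` is at least that large; where it is lower (the Anthropic limit is documented as 8,000), it remains the effective cap, and the copies must be kept in sync.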
|
||||
|
||||
**11. `_parse_classification` / `_parse_suggestions` duplicated between providers**
|
||||
`anthropic_provider.py` and `openai_provider.py` each define their own JSON parsing helpers for AI responses. `test_classifier.py` only imports from `openai_provider`, meaning the Anthropic variants are untested.
|
||||
|
||||
**12. `health_check()` makes real billed API calls**
|
||||
The "Test Connection" UI action calls `provider.health_check()`, which makes a real API call to Anthropic/OpenAI — incurring cost and latency every time the user tests connectivity. Should use a cheaper probe (e.g., list models endpoint or a cached status).
|
||||
|
||||
---
|
||||
|
||||
### LOW
|
||||
|
||||
**13. `uvicorn --reload` hardcoded in docker-compose.yml**
|
||||
Hot-reload is hardcoded in the only compose file. There is no separate `docker-compose.prod.yml` or build-arg to disable it.
|
||||
|
||||
**14. Unused `shutil` import in `storage.py`**
|
||||
`import shutil` appears in `storage.py` but is never used.
|
||||
|
||||
**15. Topic IDs are 8-character UUID prefixes**
|
||||
`str(uuid.uuid4())[:8]` generates IDs with ~4 billion combinations — low collision risk for personal use but not safe at scale or for security-sensitive identifiers.
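
To put a number on the collision risk, the standard birthday approximation can be applied to the 8-hex-character ID space. This is a back-of-the-envelope sketch; `collision_probability` is a hypothetical helper, not part of the codebase.

```python
import math

ID_SPACE = 16 ** 8  # 8 hex characters: 4,294,967,296 possible IDs


def collision_probability(n_ids: int, space: int = ID_SPACE) -> float:
    """Birthday approximation: P(collision) ~ 1 - exp(-n(n-1) / (2 * space))."""
    return 1.0 - math.exp(-n_ids * (n_ids - 1) / (2 * space))
```

With a few thousand topics the probability stays far below one percent, while it reaches roughly 50% at around 77,000 IDs, which is why the prefix is acceptable for personal use but not at scale.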
|
||||
|
||||
**16. `classify_document` request body uses raw `dict`, not a Pydantic model**
|
||||
The reclassify endpoint accepts an unvalidated `dict` body. Invalid input causes an unformatted 500 rather than a clean 422 validation error.
|
||||
|
||||
**17. No global frontend error handling**
|
||||
There is no Vue error boundary or global `window.onerror` / `app.config.errorHandler`. Failed API calls in stores may surface as silent failures or unhandled promise rejections.
|
||||
|
||||
**18. No document download endpoint**
|
||||
Uploaded files are stored in `data/uploads/` but there is no `GET /api/documents/:id/file` endpoint to retrieve the original binary. Files are effectively write-only through the UI.
|
||||
|
||||
**19. `aiofiles` in requirements but never used**
|
||||
`aiofiles>=23.2` is listed in `requirements.txt` but no code imports it. It should either be used to address the blocking I/O concern (item 6) or be removed.
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- Production deployment path is undefined (no nginx, no TLS, no auth)
|
||||
- OCR language support for pytesseract is not configured (defaults to English only)
|
||||
- `suggest_topics` method on all providers is untested — unclear if it is used in the current UI flow
|
||||
- No backup or recovery strategy for `data/` volume
|
||||
@@ -0,0 +1,94 @@
|
||||
# CONVENTIONS — document-scanner
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
The codebase follows standard Python and Vue 3 conventions without heavy tooling enforcement. Backend uses async/await throughout with type hints on public interfaces. Frontend uses Vue Options API with Pinia stores as the data layer. No linter or formatter configuration is committed.
|
||||
|
||||
---
|
||||
|
||||
## Python Conventions (Backend)
|
||||
|
||||
### Naming
|
||||
- Files: `snake_case.py`
|
||||
- Classes: `PascalCase` (e.g., `AnthropicProvider`, `ClassificationResult`)
|
||||
- Functions/variables: `snake_case`
|
||||
- Constants: `UPPER_SNAKE_CASE` (e.g., `MAX_STORED_CHARS`, `DATA_DIR`)
|
||||
- Private helpers: leading underscore (e.g., `_extract_pdf`, `_parse_classification`)
|
||||
|
||||
### Async
|
||||
- All API endpoint functions are `async def`
|
||||
- All `AIProvider` methods are `async def`
|
||||
- `pytest-asyncio` with `asyncio_mode=auto` (set in `pytest.ini`)
|
||||
|
||||
### Type Hints
|
||||
- Used on public function signatures in `ai/` layer and `services/`
|
||||
- Dataclass used for `ClassificationResult` (`@dataclass` with `field(default_factory=...)`)
|
||||
- Not used consistently in `api/` routers (rely on FastAPI/Pydantic implicit validation)
|
||||
|
||||
### Error Handling
|
||||
- `extractor.py` wraps all extraction in `try/except Exception` and returns error strings (never raises)
|
||||
- AI providers raise on hard failures; caller (`classifier.py`) is responsible for propagating
|
||||
- No global exception handler registered in `main.py`
|
||||
|
||||
### Imports
|
||||
- Standard library first, then third-party, then local — not enforced by isort
|
||||
- Heavy library imports (`fitz`, `pytesseract`, `docx`) are deferred inside functions to avoid import-time cost when unused
|
||||
|
||||
### Module Docstrings
|
||||
- Present on `extractor.py` and `test_classifier.py`; absent elsewhere
|
||||
|
||||
---
|
||||
|
||||
## JavaScript / Vue Conventions (Frontend)
|
||||
|
||||
### Naming
|
||||
- Vue files: `PascalCase.vue` (e.g., `DocumentCard.vue`, `AppSidebar.vue`)
|
||||
- Pinia stores: `camelCase` filename matching store ID (e.g., `documents.js` → `useDocumentsStore`)
|
||||
- Views: `<Name>View.vue` suffix
|
||||
- Components grouped by domain in subdirectories: `documents/`, `topics/`, `upload/`, `layout/`
|
||||
|
||||
### Vue Style
|
||||
- Options API used throughout (not Composition API)
|
||||
- Props defined with type and default; no `defineProps` (Options API syntax)
|
||||
- `v-model`, `v-for`, `v-if` used directly in templates
|
||||
|
||||
### Pinia Pattern
|
||||
- Each store encapsulates `state`, `getters`, and `actions`
|
||||
- Actions call `src/api/client.js` — components never import `client.js` directly
|
||||
- Stores are the single source of truth; views read from store state
|
||||
|
||||
### API Client
|
||||
- `src/api/client.js` is the sole HTTP adapter
|
||||
- All paths are prefixed `/api/` (proxied to backend in dev via Vite config)
|
||||
|
||||
### Styling
|
||||
- Tailwind CSS utility classes used directly in templates
|
||||
- No scoped `<style>` blocks observed in component list
|
||||
- Global styles in `src/style.css`
|
||||
|
||||
---
|
||||
|
||||
## API Design Conventions (Backend)
|
||||
|
||||
- All endpoints prefixed `/api/` (set per router)
|
||||
- JSON responses; multipart for file upload
|
||||
- HTTP verbs follow REST: GET list, GET by ID, POST create, PUT/PATCH update, DELETE remove
|
||||
- No versioning (`/api/v1/`) — flat namespace
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
- Runtime paths controlled entirely by `DATA_DIR` env var (defaults to `/app/data`)
|
||||
- AI settings persisted in `data/settings.json` — no env var overrides at runtime for provider config (except `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` noted in `.env.example`)
|
||||
- No `.env` loading in backend code — env vars passed via Docker Compose `environment:` block
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- No ESLint, Prettier, Black, or Ruff configuration committed
|
||||
- No pre-commit hooks
|
||||
- No consistent JSDoc or Python docstring coverage
|
||||
@@ -0,0 +1,144 @@
|
||||
# INTEGRATIONS — document-scanner
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
The backend integrates with four interchangeable AI providers for document classification: Anthropic Claude, OpenAI (and any OpenAI-compatible endpoint), Ollama, and LM Studio. There are no external databases, auth services, or cloud storage integrations — all persistence is local filesystem. The active provider is selected at runtime via settings persisted in `backend/data/settings.json`.
|
||||
|
||||
---
|
||||
|
||||
## AI Providers
|
||||
|
||||
All providers implement the `AIProvider` abstract interface defined in `backend/ai/base.py`. The active provider is resolved at request time in `backend/ai/__init__.py:get_provider()`.
|
||||
|
||||
### Anthropic
|
||||
|
||||
- **SDK:** `anthropic>=0.26` — `backend/ai/anthropic_provider.py`
|
||||
- **Client:** `anthropic.AsyncAnthropic`
|
||||
- **API:** Messages API (`client.messages.create`)
|
||||
- **Default model:** `claude-sonnet-4-6`
|
||||
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.anthropic.api_key`; optionally seeded from env var `ANTHROPIC_API_KEY` (`.env.example`)
|
||||
- **Calls made:** `classify` (max_tokens=1024), `suggest_topics` (max_tokens=256), `health_check` (max_tokens=5)
|
||||
- **Text limit:** 8,000 characters per request (`MAX_AI_CHARS = 8_000`)
|
||||
|
||||
### OpenAI
|
||||
|
||||
- **SDK:** `openai>=1.30` — `backend/ai/openai_provider.py`
|
||||
- **Client:** `openai.AsyncOpenAI`
|
||||
- **API:** Chat Completions (`client.chat.completions.create`)
|
||||
- **Default model:** `gpt-4o`
|
||||
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.openai.api_key`; optionally seeded from env var `OPENAI_API_KEY` (`.env.example`)
|
||||
- **Custom base URL:** Supported via `providers.openai.base_url` in settings (allows pointing at any OpenAI-compatible endpoint)
|
||||
|
||||
### Ollama
|
||||
|
||||
- **Provider file:** `backend/ai/ollama_provider.py`
|
||||
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
|
||||
- **Default base URL:** `http://host.docker.internal:11434/v1`
|
||||
- **Default model:** `llama3.2`
|
||||
- **Auth:** Stub key `"ollama"` (no real auth required)
|
||||
- **Network path:** Reaches the host machine's Ollama daemon via Docker's `host.docker.internal` DNS alias (configured in `docker-compose.yml` via `extra_hosts`)
|
||||
|
||||
### LM Studio
|
||||
|
||||
- **Provider file:** `backend/ai/lmstudio_provider.py`
|
||||
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
|
||||
- **Default base URL:** `http://host.docker.internal:1234/v1`
|
||||
- **Default model:** `gemma-4-e4b-it`
|
||||
- **Auth:** Stub key `"lm-studio"` (no real auth required)
|
||||
- **Network path:** Reaches the host machine's LM Studio server via `host.docker.internal` (same `extra_hosts` setting)
|
||||
- **Default active provider** — the app works out of the box with LM Studio and no API keys
|
||||
|
||||
---
|
||||
|
||||
## Provider Selection & Settings Persistence
|
||||
|
||||
- Active provider and all per-provider config (model names, API keys, base URLs) are persisted in `backend/data/settings.json`.
|
||||
- Settings are loaded fresh on each classification request in `backend/services/classifier.py:classify_document()`.
|
||||
- API keys returned from the settings API are masked (last 4 chars shown) via `backend/services/storage.py:mask_api_key()`.
|
||||
- The Settings UI allows switching providers without restart.
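
The masking behaviour described above (only the last 4 characters shown) might look like the following. This is a hedged sketch: the real `mask_api_key()` signature, mask character, and handling of short or empty keys are not documented here, so those details are assumptions.

```python
def mask_api_key(key: str, visible: int = 4, mask_char: str = "*") -> str:
    """Return the key with all but the last `visible` characters masked."""
    if not key:
        return ""
    if len(key) <= visible:
        # Assumption: keys too short to partially reveal are fully masked.
        return mask_char * len(key)
    return mask_char * (len(key) - visible) + key[-visible:]
```

Masking applies only to values returned by the settings API; the file on disk still holds the full key.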
|
||||
|
||||
---
|
||||
|
||||
## Frontend ↔ Backend Communication
|
||||
|
||||
- **Protocol:** HTTP REST over JSON (and multipart form for uploads)
|
||||
- **Client:** Native browser `fetch` API — `frontend/src/api/client.js`
|
||||
- **Base path:** All requests go to `/api/*` — no hardcoded backend hostname in the frontend
|
||||
- **Proxy (dev):** Vite dev server proxies `/api` → `http://backend:8000` — `frontend/vite.config.js`
|
||||
- **Proxy (prod):** Comment in `frontend/src/api/client.js` notes nginx is expected; no nginx config is present in the repo
|
||||
|
||||
### API Endpoints consumed by the frontend
|
||||
|
||||
| Method | Path | Purpose |
|---|---|---|
| POST | `/api/documents/upload` | Upload file with optional auto-classify flag |
| GET | `/api/documents` | List documents (paginated, optional topic filter) |
| GET | `/api/documents/:id` | Get single document metadata |
| DELETE | `/api/documents/:id` | Delete document |
| POST | `/api/documents/:id/classify` | (Re)classify document, optional topic list |
| GET | `/api/topics` | List all topics |
| POST | `/api/topics` | Create topic |
| PATCH | `/api/topics/:id` | Update topic |
| DELETE | `/api/topics/:id` | Delete topic |
| POST | `/api/topics/suggest` | AI topic suggestions for a document |
| GET | `/api/settings` | Get settings (keys masked) |
| PATCH | `/api/settings` | Update settings |
| POST | `/api/settings/test-provider` | Health-check the active or named provider |
| GET | `/api/settings/default-prompt` | Retrieve the default classification system prompt |
|
||||
|
||||
---
|
||||
|
||||
## Docker Services
|
||||
|
||||
Defined in `docker-compose.yml`:
|
||||
|
||||
| Service | Image | Port | Notes |
|---|---|---|---|
| `backend` | Built from `./backend/Dockerfile` | `8000:8000` | Mounts `./backend/data:/app/data` for persistence; `./backend:/app` for hot-reload |
| `frontend` | Built from `./frontend/Dockerfile` | `5173:5173` | Mounts `./frontend/src` and `index.html` for hot-reload; depends on `backend` |
|
||||
|
||||
The `backend` service sets `extra_hosts: host.docker.internal:host-gateway` so that it can reach Ollama or LM Studio running on the host machine.
|
||||
|
||||
---
|
||||
|
||||
## Environment Variables
|
||||
|
||||
| Variable | Required | Where used | Notes |
|---|---|---|---|
| `DATA_DIR` | No | `backend/config.py` | Root path for uploads/metadata/settings; defaults to `/app/data` |
| `ANTHROPIC_API_KEY` | No | `.env.example` | Bootstrap only — app manages keys via settings UI |
| `OPENAI_API_KEY` | No | `.env.example` | Bootstrap only — app manages keys via settings UI |
| `PYTHONDONTWRITEBYTECODE` | No | `docker-compose.yml` | Set to `1` to suppress `.pyc` files in Docker |
|
||||
|
||||
---
|
||||
|
||||
## Authentication & Identity
|
||||
|
||||
- No user authentication. The application has no login system, sessions, or identity provider.
|
||||
- API keys for AI providers are stored in plain text in `backend/data/settings.json` (masked only when returned via the settings API).
|
||||
|
||||
---
|
||||
|
||||
## Monitoring & Observability
|
||||
|
||||
- No error tracking service (no Sentry, Datadog, etc.).
|
||||
- No structured logging framework — FastAPI default stdout logging only.
|
||||
- A `/health` endpoint exists at `backend/main.py` returning `{"status": "ok"}`.
|
||||
- Provider connectivity tested on demand via `POST /api/settings/test-provider`.
|
||||
|
||||
---
|
||||
|
||||
## Webhooks & Callbacks
|
||||
|
||||
- None — the application makes no outbound webhook calls and exposes no webhook receiver endpoints.
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- No nginx or reverse-proxy config present for production deployments; the client-side comment references it but no config exists.
|
||||
- No container registry or CI/CD pipeline configuration detected.
|
||||
- API keys are stored in a plain JSON file on disk with no encryption at rest.
|
||||
- The `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` env vars from `.env.example` are noted as bootstrap helpers but no code in the repo reads them directly — they appear to be manual seeding hints only.
|
||||
@@ -0,0 +1,129 @@
|
||||
# STACK — document-scanner
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
Document Scanner is a full-stack application with a Python/FastAPI backend and a Vue 3 frontend, containerised with Docker Compose. The backend handles document ingestion, text extraction, and AI-powered topic classification; the frontend is a single-page app served by Vite. No external database is used — all state is persisted to the local filesystem.
|
||||
|
||||
---
|
||||
|
||||
## Languages
|
||||
|
||||
| Language | Version | Where used |
|---|---|---|
| Python | 3.12 (pinned in `backend/Dockerfile`) | Backend API, AI providers, services |
| JavaScript (ES modules) | ES2022+ (`"type": "module"` in `frontend/package.json`) | Frontend SPA |
|
||||
|
||||
---
|
||||
|
||||
## Runtime
|
||||
|
||||
**Backend:**
|
||||
- CPython 3.12 (Docker image: `python:3.12-slim`)
|
||||
- ASGI server: Uvicorn `>=0.29` with standard extras (websockets, httptools)
|
||||
- Entry point: `backend/main.py` — `uvicorn main:app`
|
||||
|
||||
**Frontend:**
|
||||
- Node.js 20 (Docker image: `node:20-alpine`)
|
||||
- Dev server: Vite 5 on port 5173
|
||||
- Entry point: `frontend/index.html` → `frontend/src/main.js`
|
||||
|
||||
**Package Manager:**
|
||||
- Backend: `pip` — lockfile: none (ranges only in `backend/requirements.txt`)
|
||||
- Frontend: `npm` — lockfile: `frontend/package-lock.json` (present but not committed, generated on `npm install`)
|
||||
|
||||
---
|
||||
|
||||
## Frameworks
|
||||
|
||||
### Backend
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `fastapi` | `>=0.111` | REST API framework — `backend/main.py` |
| `uvicorn[standard]` | `>=0.29` | ASGI server |
| `pydantic-settings` | `>=2.2` | Settings/config validation |
| `python-multipart` | latest | Multipart file upload parsing |
|
||||
|
||||
### Frontend
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `vue` | `^3.4.0` | UI framework — `frontend/src/App.vue` and all components |
| `vue-router` | `^4.3.0` | Client-side routing — `frontend/src/router/index.js` |
| `pinia` | `^2.1.0` | State management — `frontend/src/stores/` |
|
||||
|
||||
### Build / Dev Tooling
|
||||
|
||||
| Tool | Version | Purpose |
|---|---|---|
| `vite` | `^5.2.0` | Frontend bundler and dev server — `frontend/vite.config.js` |
| `@vitejs/plugin-vue` | `^5.0.0` | Vue SFC support in Vite |
| `tailwindcss` | `^3.4.0` | Utility-first CSS — `frontend/tailwind.config.js` |
| `postcss` | `^8.4.0` | CSS processing — `frontend/postcss.config.js` |
| `autoprefixer` | `^10.4.0` | CSS vendor prefixing |
|
||||
|
||||
---
|
||||
|
||||
## Key Backend Dependencies
|
||||
|
||||
| Package | Version | Purpose |
|---|---|---|
| `anthropic` | `>=0.26` | Anthropic Claude API client — `backend/ai/anthropic_provider.py` |
| `openai` | `>=1.30` | OpenAI / OpenAI-compatible API client — `backend/ai/openai_provider.py`, also used for Ollama and LM Studio via `base_url` override |
| `PyMuPDF` (`fitz`) | `>=1.24` | PDF text extraction — `backend/services/extractor.py` |
| `python-docx` | `>=1.1` | DOCX text extraction — `backend/services/extractor.py` |
| `pytesseract` | `>=0.3` | OCR for image files — `backend/services/extractor.py` |
| `Pillow` | `>=10.3` | Image handling for OCR — `backend/services/extractor.py` |
| `filelock` | `>=3.14` | File-based concurrency locks — `backend/services/storage.py` |
| `aiofiles` | `>=23.2` | Async file I/O support |
| `httpx` | `>=0.27` | Async HTTP client (used internally by `anthropic` and `openai` SDKs) |
|
||||
|
||||
---
|
||||
|
||||
## Testing
|
||||
|
||||
| Tool | Version | Purpose |
|---|---|---|
| `pytest` | `>=8.2` | Test runner — `backend/pytest.ini`, `backend/tests/` |
| `pytest-asyncio` | `>=0.23` | Async test support; `asyncio_mode = auto` set in `backend/pytest.ini` |
|
||||
|
||||
No frontend test framework is present.
|
||||
|
||||
---
|
||||
|
||||
## Storage
|
||||
|
||||
- **File system only** — no database engine.
|
||||
- Upload files stored at `backend/data/uploads/` (UUID-named).
|
||||
- Document metadata stored as per-document JSON files at `backend/data/metadata/`.
|
||||
- Topics registry: `backend/data/topics.json`.
|
||||
- App settings: `backend/data/settings.json`.
|
||||
- File-level concurrency managed via `filelock` (`backend/services/storage.py`).
|
||||
|
||||
---
|
||||
|
||||
## System Dependencies (backend Docker image)
|
||||
|
||||
Installed via `apt-get` in `backend/Dockerfile`:
|
||||
- `tesseract-ocr` — OCR binary for `pytesseract`
|
||||
- `libgl1`, `libglib2.0-0` — shared libraries required by PyMuPDF
|
||||
|
||||
---
|
||||
|
||||
## Configuration
|
||||
|
||||
- Environment variable `DATA_DIR` sets the root data path (default: `/app/data`).
|
||||
- AI provider settings (models, API keys, base URLs) are stored in `backend/data/settings.json` and managed through the in-app Settings UI.
|
||||
- Optional bootstrap via `.env` (see `.env.example`): only `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` are referenced.
|
||||
- Default active provider is `lmstudio` (no API key required).
|
||||
|
||||
---
|
||||
|
||||
## Gaps / Unknowns
|
||||
|
||||
- No Python version pinning file (`.python-version`, `pyproject.toml`) outside the Dockerfile — local dev outside Docker may use a different Python version.
|
||||
- No frontend lockfile committed; exact transitive dependency versions are non-deterministic until `npm install` is run.
|
||||
- No linter or formatter config detected (no `.eslintrc`, `.prettierrc`, `biome.json`, `ruff.toml`, `mypy.ini`, etc.).
|
||||
- No production deployment config beyond Docker Compose (no nginx config, no cloud provider manifests).
|
||||
@@ -0,0 +1,144 @@
|
||||
# STRUCTURE — document-scanner
|
||||
|
||||
_Last updated: 2026-05-21_
|
||||
|
||||
## Summary
|
||||
|
||||
The project is a monorepo with two top-level service directories (`backend/`, `frontend/`) and Docker Compose at the root. Backend is a Python/FastAPI app; frontend is a Vue 3 SPA built with Vite. All persistent data lives under `backend/data/`.
|
||||
|
||||
---
|
||||
|
||||
## Top-Level Layout
|
||||
|
||||
```
document_scanner/
├── backend/              Python FastAPI service
├── frontend/             Vue 3 SPA
├── docker-compose.yml    Two-service compose (backend + frontend)
├── .env.example          Optional env vars (API keys)
└── .claude/              Claude Code settings
```
|
||||
|
||||
---
|
||||
|
||||
## Backend
|
||||
|
||||
```
backend/
├── main.py              FastAPI app: CORS, lifespan, router registration
├── config.py            Path constants, DEFAULT_SETTINGS, ensure_data_dirs()
├── requirements.txt     Python dependencies
├── pytest.ini           pytest config (asyncio_mode=auto)
├── Dockerfile
│
├── api/                 FastAPI routers (thin HTTP layer)
│   ├── documents.py     Upload, list, get, delete, reclassify endpoints
│   ├── topics.py        Topic CRUD endpoints
│   └── settings.py      AI provider settings endpoints
│
├── ai/                  AI provider abstraction
│   ├── base.py          AIProvider ABC + ClassificationResult dataclass
│   ├── __init__.py      get_provider() factory
│   ├── anthropic_provider.py
│   ├── openai_provider.py
│   ├── ollama_provider.py       extends OpenAIProvider
│   └── lmstudio_provider.py     extends OpenAIProvider
│
├── services/            Business logic (no FastAPI dependency)
│   ├── extractor.py     Text extraction: PDF/DOCX/image/text dispatch
│   ├── classifier.py    Orchestrates AI call + topic auto-creation
│   └── storage.py       Flat-file JSON CRUD + filelock
│
├── data/                Runtime data (volume-mounted in Docker)
│   ├── uploads/         Uploaded document files
│   ├── metadata/        Per-document JSON metadata files
│   ├── topics.json      Global topic list
│   └── settings.json    Active AI provider + system prompt config
│
└── tests/
    ├── conftest.py      Fixtures: isolated tmp data dir, TestClient, sample files
    ├── test_health.py
    ├── test_documents.py
    ├── test_topics.py
    ├── test_settings.py
    ├── test_extractor.py
    ├── test_classifier.py
    └── test_lmstudio.py
```
|
||||
|
||||
---
|
||||
|
||||
## Frontend
|
||||
|
||||
```
frontend/
├── index.html           Vite entry HTML
├── vite.config.js       Vite config (Vue plugin, /api proxy)
├── tailwind.config.js
├── postcss.config.js
├── package.json         Vue 3, Vue Router 4, Pinia; no test framework
├── Dockerfile
│
└── src/
    ├── main.js          App bootstrap: Vue + Pinia + Router
    ├── App.vue          Root component (sidebar layout wrapper)
    ├── style.css        Global Tailwind imports
    │
    ├── api/
    │   └── client.js    fetch wrapper; all API calls go through here
    │
    ├── stores/          Pinia stores (data + actions layer)
    │   ├── documents.js Document list, upload, classify state
    │   ├── topics.js    Topic list CRUD state
    │   └── settings.js  AI provider settings state
    │
    ├── router/
    │   └── index.js     Routes: /, /topics, /topics/:name, /document/:id, /settings
    │
    ├── views/           Page-level components (one per route)
    │   ├── HomeView.vue
    │   ├── TopicsView.vue
    │   ├── DocumentView.vue
    │   └── SettingsView.vue
    │
    └── components/      Reusable UI components
        ├── layout/
        │   └── AppSidebar.vue
        ├── documents/
        │   └── DocumentCard.vue
        ├── topics/
        │   ├── TopicBadge.vue
        │   └── TopicManager.vue
        └── upload/
            ├── DropZone.vue
            └── UploadProgress.vue
```
|
||||
|
||||
---

## Key Entry Points

| File | Purpose |
|---|---|
| `backend/main.py` | FastAPI app instantiation, middleware, router registration |
| `backend/config.py` | All path constants and default settings; change storage paths here |
| `backend/ai/__init__.py` | Add a new AI provider here |
| `frontend/src/main.js` | Vue app bootstrap |
| `frontend/src/api/client.js` | All HTTP calls originate here |

---

## Where to Add New Code

- **New API endpoint**: add router in `backend/api/`, register in `backend/main.py`
- **New AI provider**: implement `AIProvider` ABC in `backend/ai/`, add case in `get_provider()`
- **New document type**: add extraction branch in `backend/services/extractor.py`
- **New frontend page**: add view in `src/views/`, add route in `src/router/index.js`
- **New shared UI component**: add to relevant `src/components/<category>/` subdirectory
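
The "New AI provider" step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `classify` signature, the `topics` field on `ClassificationResult`, and the `KeywordProvider` class are all assumptions, and the real provider methods may well be async and read their configuration from `settings.json`.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ClassificationResult:
    # Field name is illustrative; the real dataclass may carry more data.
    topics: list[str] = field(default_factory=list)


class AIProvider(ABC):
    """Assumed shape of the provider interface defined in ai/base.py."""

    @abstractmethod
    def classify(self, text: str, known_topics: list[str]) -> ClassificationResult:
        """Return the topics that apply to `text`."""


class KeywordProvider(AIProvider):
    """Hypothetical provider: matches known topic names against the text."""

    def classify(self, text: str, known_topics: list[str]) -> ClassificationResult:
        lowered = text.lower()
        matches = [t for t in known_topics if t.lower() in lowered]
        return ClassificationResult(topics=matches)


def get_provider(name: str) -> AIProvider:
    # A dict stands in for the per-provider cases of the real factory.
    providers = {"keyword": KeywordProvider}
    try:
        return providers[name]()
    except KeyError:
        raise ValueError(f"Unknown AI provider: {name!r}") from None
```

Under these assumptions, adding a provider means writing one subclass and registering one name; callers such as the classifier service only ever see the `AIProvider` interface.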

---

## Gaps / Unknowns

- No `src/components/settings/` subdirectory; the settings UI lives entirely in `SettingsView.vue`
- No migration or schema versioning for `topics.json` / `settings.json` flat files
@@ -0,0 +1,87 @@
# TESTING — document-scanner

_Last updated: 2026-05-21_

## Summary

The backend has solid integration test coverage across most API surfaces and services using pytest + FastAPI TestClient; the exceptions are listed under "What Is NOT Tested". Each test runs in a fully isolated temporary data directory, so there is no shared state between tests. The frontend has no test framework configured at all.

---

## Backend Testing

### Framework
- **pytest** + **pytest-asyncio** (`asyncio_mode = auto` in `pytest.ini`)
- **FastAPI TestClient** (synchronous test client re-exported from Starlette and built on `httpx`)
- No dedicated mocking library: tests either exercise the real response-parsing logic without calling an AI service, or swap out the AI layer with a stand-in provider

### Test Isolation Strategy (conftest.py)
- `isolated_data_dir` fixture is `autouse=True`, so every test automatically gets:
  - A fresh `tmp_path/data/` directory with `uploads/`, `metadata/`
  - Clean `topics.json` and `settings.json` initialized from `DEFAULT_SETTINGS`
  - Monkeypatched `DATA_DIR` env var and all module-level path constants in `config` and `services.storage`
  - New `FileLock` instances pointing to the tmp dir
- `client` fixture wraps FastAPI `TestClient` with the isolated data dir active
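
The directory setup behind this isolation strategy might look something like the sketch below. It uses only the standard library so it can run on its own; the `make_isolated_data_dir` helper, the contents of `DEFAULT_SETTINGS`, and the empty-list shape of `topics.json` are assumptions, since the real `conftest.py` is not shown here.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for config.DEFAULT_SETTINGS; the real keys are not shown in this document.
DEFAULT_SETTINGS = {"provider": "anthropic", "system_prompt": ""}


def make_isolated_data_dir(root: Path) -> Path:
    """Create a fresh data/ tree under `root` and return its path."""
    data_dir = root / "data"
    (data_dir / "uploads").mkdir(parents=True)
    (data_dir / "metadata").mkdir()
    # Assumed initial state: no topics, default settings.
    (data_dir / "topics.json").write_text(json.dumps([]))
    (data_dir / "settings.json").write_text(json.dumps(DEFAULT_SETTINGS))
    return data_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = make_isolated_data_dir(Path(tmp))
        print(sorted(p.name for p in data_dir.iterdir()))
```

In a pytest setup, an autouse fixture would typically call a helper like this with pytest's `tmp_path`, then use `monkeypatch.setenv("DATA_DIR", ...)` and `monkeypatch.setattr(...)` to point the module-level path constants at the new directory.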

### Test Files

| File | What it covers |
|---|---|
| `test_health.py` | `GET /health` returns `{"status": "ok"}` |
| `test_documents.py` | Upload TXT/PDF (no-classify), list, get, delete; extracts text correctly |
| `test_topics.py` | Create, list, delete topics via API |
| `test_settings.py` | Read default settings, update provider config |
| `test_extractor.py` | Unit tests for `extract_text()` on TXT, PDF, DOCX, image paths |
| `test_classifier.py` | Unit tests for JSON parsing helpers (`_parse_classification`, `_parse_suggestions`, `_strip_code_fences`); no real AI calls |
| `test_lmstudio.py` | LMStudio provider-specific behaviour (likely mocked or uses a local endpoint) |
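
To illustrate the kind of parsing logic that `test_classifier.py` can cover without a live AI service, here is a hypothetical sketch of a fence-stripping helper and a parser built on it. The function names, the regular expression, and the `{"topics": [...]}` response shape are assumptions for illustration; the project's own `_strip_code_fences` and `_parse_classification` may behave differently.

```python
import json
import re

# Build the fence marker programmatically so this example nests cleanly in Markdown.
FENCE = "`" * 3
_FENCE_RE = re.compile(rf"^{FENCE}[\w-]*\s*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def strip_code_fences(raw: str) -> str:
    """Remove a single surrounding Markdown code fence, if one is present."""
    text = raw.strip()
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text


def parse_topics(raw: str) -> list[str]:
    """Parse an assumed {"topics": [...]} JSON payload from a model reply."""
    payload = json.loads(strip_code_fences(raw))
    return [str(topic) for topic in payload.get("topics", [])]


if __name__ == "__main__":
    reply = FENCE + 'json\n{"topics": ["Taxes", "Travel"]}\n' + FENCE
    print(parse_topics(reply))
```

Helpers of this shape are pure functions of a string, which is why they can be unit tested with canned model replies rather than real provider calls.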

### Fixtures Available

| Fixture | Provides |
|---|---|
| `isolated_data_dir` | Autouse; clean tmp data dir |
| `client` | FastAPI TestClient with isolated data |
| `sample_txt` | A `.txt` file with test content |
| `sample_pdf` | A minimal valid PDF created with PyMuPDF |

### What Is NOT Tested

- Auto-classification flow end-to-end (requires a live AI provider)
- Document reclassify endpoint
- Anthropic, OpenAI, Ollama provider implementations directly
- Any concurrent write / filelock contention scenarios
- File size / type validation edge cases
- Frontend: no tests exist

---

## Frontend Testing

- **No test framework installed**: `package.json` has no `vitest`, `jest`, or `@testing-library/vue`
- No test files found under `frontend/src/`
- No Cypress or Playwright configuration

---

## Running Tests

```bash
# From backend/
pytest

# With verbose output
pytest -v

# Single file
pytest tests/test_documents.py
```

---

## Gaps / Unknowns

- No test coverage measurement (no `pytest-cov` in `requirements.txt`)
- `test_lmstudio.py` content not inspected, so it is unclear whether it hits a real local endpoint
- No CI configuration (no GitHub Actions workflow, no Dockerfile for a test runner)
- No snapshot or contract tests for API response shapes
- Frontend is completely untested