chore: initial commit — existing single-user document scanner codebase

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
curo1305
2026-05-22 08:53:28 +02:00
parent 6fed5ba531
commit 7a34807fa0
71 changed files with 16408 additions and 0 deletions
+114
@@ -0,0 +1,114 @@
# ARCHITECTURE — document-scanner
_Last updated: 2026-05-21_
## Summary
Document Scanner is a two-tier web application: a Vue 3 SPA communicates with a FastAPI backend via a Vite dev-proxy (or directly in production). The backend handles document ingestion, text extraction, AI-based classification, and flat-file persistence. AI provider selection is fully runtime-configurable via a provider pattern abstraction.
---
## System Overview
```
Browser (Vue 3 SPA)
│ HTTP/JSON + multipart
FastAPI (port 8000)
├── api/documents.py upload, list, get, delete, reclassify
├── api/topics.py CRUD for topic list
├── api/settings.py AI provider config + system prompt
├── services/
│ ├── extractor.py text extraction dispatch
│ ├── classifier.py orchestrates AI call + topic creation
│ └── storage.py flat-file JSON + filesystem persistence
└── ai/ provider abstraction layer
├── base.py AIProvider ABC + ClassificationResult
├── __init__.py get_provider() factory
├── anthropic_provider.py
├── openai_provider.py
├── ollama_provider.py (subclasses OpenAIProvider)
└── lmstudio_provider.py (subclasses OpenAIProvider)
External AI service (Anthropic API / OpenAI API /
Ollama / LM Studio — host.docker.internal)
```
---
## Request Flow — Document Upload + Classification
1. Frontend POSTs `multipart/form-data` to `POST /api/documents/upload`
2. `documents.py` saves the file to `data/uploads/`, calls `extractor.extract_text()`
3. Extracted text (truncated to 50,000 chars) is stored in `data/metadata/<id>.json`
4. If `auto_classify=true`, `classifier.classify_document()` is called:
a. Loads current settings from `data/settings.json` → calls `get_provider(settings)`
b. Passes document text + existing topics to `provider.classify()`
c. Any suggested new topics are created via `storage.add_topic()`
d. Document metadata is updated with assigned topics
5. Full document metadata JSON is returned to the frontend
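
Steps 3 and 4 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's code: `FakeProvider`, the `topics` / `new_topics` field names on `ClassificationResult`, and the in-memory `topics` list standing in for `storage.add_topic()` are all hypothetical.

```python
import asyncio
from dataclasses import dataclass, field

MAX_STORED_CHARS = 50_000  # truncation limit named in the docs


@dataclass
class ClassificationResult:
    # Field names are assumptions made for this sketch.
    topics: list[str] = field(default_factory=list)
    new_topics: list[str] = field(default_factory=list)


class FakeProvider:
    """Hypothetical stand-in for a real AI provider."""

    async def classify(self, document_text, existing_topics, system_prompt):
        # Pretend the model picked one existing topic and proposed a new one.
        return ClassificationResult(topics=existing_topics[:1], new_topics=["invoices"])


async def classify_document(text, topics, provider, system_prompt=""):
    """Steps 4b to 4d: call the provider, create new topics, return metadata."""
    stored_text = text[:MAX_STORED_CHARS]  # step 3: truncate before storing
    result = await provider.classify(stored_text, list(topics), system_prompt)
    for name in result.new_topics:  # step 4c: stand-in for storage.add_topic()
        if name not in topics:
            topics.append(name)
    return {"text": stored_text, "topics": result.topics + result.new_topics}


topics = ["contracts"]
metadata = asyncio.run(classify_document("x" * 60_000, topics, FakeProvider()))
```

The provider is passed in as an argument here only to keep the sketch self-contained; per step 4a, the real code resolves it from settings.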
---
## AI Provider Abstraction
- `AIProvider` (ABC in `ai/base.py`) defines three async methods:
- `classify(document_text, existing_topics, system_prompt) → ClassificationResult`
- `suggest_topics(document_text, system_prompt) → list[str]`
- `health_check() → bool`
- `get_provider(settings: dict)` factory in `ai/__init__.py` reads `settings["active_provider"]` and instantiates the correct class
- `OllamaProvider` and `LMStudioProvider` extend `OpenAIProvider` (both expose OpenAI-compatible endpoints)
- Provider is re-instantiated on every request (stateless; no connection pooling)
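
As a rough sketch of the abstraction described above, assuming a registry-based factory: `EchoProvider`, the `_PROVIDERS` registry, and the return values are hypothetical and exist only to make the example runnable. The method names and the `settings["active_provider"]` / `providers.<name>` keys come from the documentation.

```python
import asyncio
from abc import ABC, abstractmethod


class AIProvider(ABC):
    """Interface with the three async methods listed above."""

    @abstractmethod
    async def classify(self, document_text, existing_topics, system_prompt): ...

    @abstractmethod
    async def suggest_topics(self, document_text, system_prompt): ...

    @abstractmethod
    async def health_check(self): ...


class EchoProvider(AIProvider):
    """Hypothetical provider used only to make the sketch runnable."""

    def __init__(self, config):
        self.model = config.get("model", "demo-model")

    async def classify(self, document_text, existing_topics, system_prompt):
        return existing_topics[:1]

    async def suggest_topics(self, document_text, system_prompt):
        return ["demo"]

    async def health_check(self):
        return True


# Registry keyed by the value of settings["active_provider"] (an assumption).
_PROVIDERS = {"echo": EchoProvider}


def get_provider(settings: dict) -> AIProvider:
    """Instantiate the provider named in settings; a fresh object per call."""
    name = settings["active_provider"]
    try:
        cls = _PROVIDERS[name]
    except KeyError:
        raise ValueError(f"Unknown provider: {name}") from None
    return cls(settings.get("providers", {}).get(name, {}))


settings = {"active_provider": "echo", "providers": {"echo": {"model": "m1"}}}
provider = get_provider(settings)
healthy = asyncio.run(provider.health_check())
```

Because the factory builds a new object on every call, switching `active_provider` in settings takes effect on the next request without a restart.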
---
## Data Persistence
All state is stored on the local filesystem — no database:
| Store | Path | Format | Access |
|---|---|---|---|
| Uploaded files | `data/uploads/<id>.<ext>` | Original binary | Direct filesystem |
| Document metadata | `data/metadata/<id>.json` | JSON per document | `filelock` protected |
| Topic list | `data/topics.json` | `{"topics": [...]}` | `filelock` protected |
| Settings | `data/settings.json` | JSON object | `filelock` protected |
`filelock` is used to prevent concurrent write corruption on JSON files.
---
## Frontend Architecture
- Vue 3 SPA (Options API), Pinia stores, Vue Router 4
- Three Pinia stores (`documents`, `topics`, `settings`) act as the sole data access layer — components never call the API directly
- `src/api/client.js` is the single HTTP adapter (wraps `fetch`)
- Vite proxies `/api/*` to `http://localhost:8000` in dev mode
---
## Key Patterns
- **Provider Pattern** — AI backends are interchangeable at runtime via settings
- **Service Layer** — `extractor`, `classifier`, `storage` are pure Python modules; no FastAPI coupling
- **Pinia-as-Facade** — stores encapsulate all async API calls; views stay declarative
---
## Constraints & Notable Decisions
- All CORS origins allowed (`allow_origins=["*"]`) — suitable for local dev, not production
- No authentication or user model
- Single-worker assumption for file locking (does not scale to multiple uvicorn workers)
- AI provider re-instantiated per request (no connection reuse)
- Data directory is volume-mounted in Docker; no backup or migration strategy
---
## Gaps / Unknowns
- No API versioning strategy visible
- Frontend has no error boundary or global error handling component
- No pagination at the storage layer behind the document list endpoint: every list request reads all metadata files (could be a scaling concern)
+87
@@ -0,0 +1,87 @@
# CONCERNS — document-scanner
_Last updated: 2026-05-21_
## Summary
The codebase is a well-structured local-first prototype. The main concerns are security issues that matter if exposed beyond localhost (open CORS, no file validation, plain-text key storage), several blocking I/O calls in async handlers, and a handful of code duplication issues in the AI provider layer. Overall health is good for a local dev tool; requires hardening before any networked deployment.
---
## Concerns by Severity
### HIGH
**1. File type validation is defined but never enforced**
`ALLOWED_MIME_TYPES` is defined in `backend/api/documents.py` but the upload handler never checks it — any file type is accepted. An attacker could upload executable files or crafted archives.
**2. No file size limit on uploads**
The entire uploaded file is read before any cap is applied. A large file could exhaust memory or disk. No `MAX_UPLOAD_SIZE` check exists at the HTTP boundary.
**3. API keys stored in plain-text JSON**
`backend/data/settings.json` stores API keys in plaintext. The volume mount in `docker-compose.yml` (`./backend/data:/app/data`) means any process with Docker access can read them. Masking only applies to API responses, not to disk.
**4. CORS fully open**
`allow_origins=["*"]` in `main.py` means any website can make cross-origin requests to the API, including with credentials if ever added.
**5. Docker Compose mounts entire backend source as writable volume**
`./backend:/app` gives the container write access to the host source tree. A path traversal or code execution bug in the app could overwrite source files.
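
One possible shape for the checks missing in items 1 and 2 is sketched below. The set contents, the 10 MiB cap, the chunk size, and the `UploadRejected` / `read_validated` names are assumptions for illustration; the real `ALLOWED_MIME_TYPES` values are not shown in this document.

```python
import io

# Hypothetical values; the real ALLOWED_MIME_TYPES contents are not shown here.
ALLOWED_MIME_TYPES = {"application/pdf", "text/plain", "image/png"}
MAX_UPLOAD_SIZE = 10 * 1024 * 1024  # assumed 10 MiB cap
CHUNK_SIZE = 64 * 1024


class UploadRejected(Exception):
    """Raised when an upload fails validation."""


def read_validated(stream, content_type, max_size=MAX_UPLOAD_SIZE):
    """Check the MIME type, then read in chunks and stop once the cap is exceeded."""
    if content_type not in ALLOWED_MIME_TYPES:
        raise UploadRejected(f"unsupported type: {content_type}")
    chunks, total = [], 0
    while chunk := stream.read(CHUNK_SIZE):
        total += len(chunk)
        if total > max_size:
            raise UploadRejected("file too large")
        chunks.append(chunk)
    return b"".join(chunks)


data = read_validated(io.BytesIO(b"hello"), "text/plain")
```

In an HTTP handler these two rejections would typically map to 415 and 413 responses. The declared content type comes from the client, so a hardened version would also inspect the file contents.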
---
### MEDIUM
**6. Blocking I/O in async FastAPI handlers**
`storage.py` uses synchronous file reads/writes and `filelock` blocking calls inside `async def` endpoints. This blocks the uvicorn event loop during every request. Should use `asyncio.to_thread()` or `aiofiles` (which is already in requirements but unused).
**7. Topic rename does not cascade to documents**
Deleting a topic removes it from document metadata, but renaming is not implemented — there is no rename endpoint. Users have no way to rename a topic without losing document associations.
**8. `list_metadata` loads all documents before filtering**
`storage.list_metadata()` reads all metadata JSON files on every list request. No pagination at the storage layer — O(N) disk reads per page request as the document count grows.
**9. `topic_doc_counts()` scans all metadata on every topic request**
Every `GET /api/topics` call triggers a full scan of all metadata files to count documents per topic. Not cached; will degrade linearly.
**10. `MAX_AI_CHARS` duplicated across 3 files**
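The character truncation limit for AI input is duplicated as a magic constant in multiple provider files. Note that the provider-level truncation is not redundant with the extractor's: `extractor.py` caps stored text at `MAX_STORED_CHARS` (50,000), while the AI limit documented for the Anthropic provider is 8,000 characters, so stored text can still exceed it and be cut again.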
**11. `_parse_classification` / `_parse_suggestions` duplicated between providers**
`anthropic_provider.py` and `openai_provider.py` each define their own JSON parsing helpers for AI responses. `test_classifier.py` only imports from `openai_provider`, meaning the Anthropic variants are untested.
**12. `health_check()` makes real billed API calls**
The "Test Connection" UI action calls `provider.health_check()`, which makes a real API call to Anthropic/OpenAI — incurring cost and latency every time the user tests connectivity. Should use a cheaper probe (e.g., list models endpoint or a cached status).
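
For item 6, a minimal sketch of the `asyncio.to_thread()` option is shown below. The function names and the temporary `topics.json` file are hypothetical; only the pattern of moving a blocking read off the event loop is the point.

```python
import asyncio
import json
import tempfile
from pathlib import Path


def read_json_blocking(path: Path) -> dict:
    """Synchronous read, similar in spirit to what storage.py is described as doing."""
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


async def read_json(path: Path) -> dict:
    """Run the blocking read in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(read_json_blocking, path)


async def main() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "topics.json"
        path.write_text(json.dumps({"topics": ["contracts"]}), encoding="utf-8")
        return await read_json(path)


loaded = asyncio.run(main())
```

If a file lock guards the read, acquiring it inside the worker function keeps the blocking wait off the event loop as well.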
---
### LOW
**13. `uvicorn --reload` hardcoded in docker-compose.yml**
Hot-reload is hardcoded in the only compose file, so it would also be active in any production use of that file. There is no separate `docker-compose.prod.yml` or build-arg to disable it.
**14. Unused `shutil` import in `storage.py`**
`import shutil` appears in `storage.py` but is never used.
**15. Topic IDs are 8-character UUID prefixes**
`str(uuid.uuid4())[:8]` generates IDs with ~4 billion combinations — low collision risk for personal use but not safe at scale or for security-sensitive identifiers.
**16. `classify_document` request body uses raw `dict`, not a Pydantic model**
The reclassify endpoint accepts an unvalidated `dict` body. Invalid input causes an unformatted 500 rather than a clean 422 validation error.
**17. No global frontend error handling**
There is no Vue error boundary or global `window.onerror` / `app.config.errorHandler`. Failed API calls in stores may surface as silent failures or unhandled promise rejections.
**18. No document download endpoint**
Uploaded files are stored in `data/uploads/` but there is no `GET /api/documents/:id/file` endpoint to retrieve the original binary. Files are effectively write-only through the UI.
**19. `aiofiles` in requirements but never used**
`aiofiles>=23.2` is listed in `requirements.txt` but no code imports it. A fix for the blocking I/O concern (item 6) could make use of it.
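
The collision risk mentioned in item 15 can be estimated with the standard birthday-bound approximation. The sketch below is a back-of-the-envelope calculation; `new_topic_id` and `collision_probability` are hypothetical helper names, and only the `str(uuid.uuid4())[:8]` expression comes from the codebase.

```python
import math
import uuid

ID_SPACE = 16 ** 8  # 8 hex characters give 4,294,967,296 possible IDs


def new_topic_id() -> str:
    """Same expression as quoted in item 15."""
    return str(uuid.uuid4())[:8]


def collision_probability(n: int, space: int = ID_SPACE) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2 * space))."""
    return -math.expm1(-n * (n - 1) / (2 * space))


p_thousand = collision_probability(1_000)      # personal-use scale
p_hundred_k = collision_probability(100_000)   # much larger deployment
```

Under this approximation the chance of any collision is on the order of 1 in 10,000 for 1,000 topics, but exceeds one half for 100,000 topics.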
---
## Gaps / Unknowns
- Production deployment path is undefined (no nginx, no TLS, no auth)
- OCR language support for pytesseract is not configured (defaults to English only)
- `suggest_topics` method on all providers is untested — unclear if it is used in the current UI flow
- No backup or recovery strategy for `data/` volume
+94
@@ -0,0 +1,94 @@
# CONVENTIONS — document-scanner
_Last updated: 2026-05-21_
## Summary
The codebase follows standard Python and Vue 3 conventions without heavy tooling enforcement. Backend uses async/await throughout with type hints on public interfaces. Frontend uses Vue Options API with Pinia stores as the data layer. No linter or formatter configuration is committed.
---
## Python Conventions (Backend)
### Naming
- Files: `snake_case.py`
- Classes: `PascalCase` (e.g., `AnthropicProvider`, `ClassificationResult`)
- Functions/variables: `snake_case`
- Constants: `UPPER_SNAKE_CASE` (e.g., `MAX_STORED_CHARS`, `DATA_DIR`)
- Private helpers: leading underscore (e.g., `_extract_pdf`, `_parse_classification`)
### Async
- All API endpoint functions are `async def`
- All `AIProvider` methods are `async def`
- `pytest-asyncio` with `asyncio_mode=auto` (set in `pytest.ini`)
### Type Hints
- Used on public function signatures in `ai/` layer and `services/`
- Dataclass used for `ClassificationResult` (`@dataclass` with `field(default_factory=...)`)
- Not used consistently in `api/` routers (rely on FastAPI/Pydantic implicit validation)
### Error Handling
- `extractor.py` wraps all extraction in `try/except Exception` and returns error strings (never raises)
- AI providers raise on hard failures; caller (`classifier.py`) is responsible for propagating
- No global exception handler registered in `main.py`
### Imports
- Standard library first, then third-party, then local — not enforced by isort
- Heavy library imports (`fitz`, `pytesseract`, `docx`) are deferred inside functions to avoid import-time cost when unused
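
The error-handling and deferred-import conventions above might look roughly like this. It is a sketch, not the contents of `extractor.py`: the `_HANDLERS` table, the `_extract_txt` helper, and the bracketed error-string format are assumptions, and the PDF branch is included only to show where a deferred `fitz` import would sit.

```python
import tempfile
from pathlib import Path


def _extract_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


def _extract_pdf(path: Path) -> str:
    # Deferred import: PyMuPDF is only loaded when a PDF is actually processed.
    import fitz

    with fitz.open(str(path)) as doc:
        return "\n".join(page.get_text() for page in doc)


_HANDLERS = {".txt": _extract_txt, ".pdf": _extract_pdf}


def extract_text(path: Path) -> str:
    """Dispatch on file suffix; return an error string instead of raising."""
    handler = _HANDLERS.get(path.suffix.lower())
    if handler is None:
        return f"[unsupported file type: {path.suffix}]"
    try:
        return handler(path)
    except Exception as exc:  # mirrors the broad catch described above
        return f"[extraction failed: {exc}]"


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("hello", encoding="utf-8")
    ok_text = extract_text(sample)
    missing = extract_text(Path(tmp) / "gone.txt")
```

A side effect of the broad `except` is that a missing optional dependency also surfaces as an error string rather than an import failure at startup.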
### Module Docstrings
- Present on `extractor.py` and `test_classifier.py`; absent elsewhere
---
## JavaScript / Vue Conventions (Frontend)
### Naming
- Vue files: `PascalCase.vue` (e.g., `DocumentCard.vue`, `AppSidebar.vue`)
- Pinia stores: `camelCase` filename matching store ID (e.g., `documents.js` → `useDocumentsStore`)
- Views: `<Name>View.vue` suffix
- Components grouped by domain in subdirectories: `documents/`, `topics/`, `upload/`, `layout/`
### Vue Style
- Options API used throughout (not Composition API)
- Props defined with type and default; no `defineProps` (Options API syntax)
- `v-model`, `v-for`, `v-if` used directly in templates
### Pinia Pattern
- Each store encapsulates `state`, `getters`, and `actions`
- Actions call `src/api/client.js` — components never import `client.js` directly
- Stores are the single source of truth; views read from store state
### API Client
- `src/api/client.js` is the sole HTTP adapter
- All paths are prefixed `/api/` (proxied to backend in dev via Vite config)
### Styling
- Tailwind CSS utility classes used directly in templates
- No scoped `<style>` blocks observed in component list
- Global styles in `src/style.css`
---
## API Design Conventions (Backend)
- All endpoints prefixed `/api/` (set per router)
- JSON responses; multipart for file upload
- HTTP verbs follow REST: GET list, GET by ID, POST create, PUT/PATCH update, DELETE remove
- No versioning (`/api/v1/`) — flat namespace
---
## Configuration
- Runtime paths controlled entirely by `DATA_DIR` env var (defaults to `/app/data`)
- AI settings persisted in `data/settings.json` — no env var overrides at runtime for provider config (except `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` noted in `.env.example`)
- No `.env` loading in backend code — env vars passed via Docker Compose `environment:` block
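
A minimal sketch of how `DATA_DIR` could drive all runtime paths is shown below. `resolve_paths` and the parameterised `ensure_data_dirs(paths)` signature are assumptions for illustration; the actual `config.py` may expose module-level constants instead.

```python
import os
import tempfile
from pathlib import Path


def resolve_paths(env=os.environ) -> dict:
    """Derive all storage paths from DATA_DIR, defaulting to /app/data."""
    data_dir = Path(env.get("DATA_DIR", "/app/data"))
    return {
        "data": data_dir,
        "uploads": data_dir / "uploads",
        "metadata": data_dir / "metadata",
        "topics": data_dir / "topics.json",
        "settings": data_dir / "settings.json",
    }


def ensure_data_dirs(paths: dict) -> None:
    """Create the directories that must exist before the app serves requests."""
    for key in ("uploads", "metadata"):
        paths[key].mkdir(parents=True, exist_ok=True)


default_paths = resolve_paths(env={})
with tempfile.TemporaryDirectory() as tmp:
    paths = resolve_paths(env={"DATA_DIR": tmp})
    ensure_data_dirs(paths)
    created = sorted(p.name for p in Path(tmp).iterdir())
```

Passing the environment as a parameter makes the path logic easy to exercise in tests without touching the real process environment.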
---
## Gaps / Unknowns
- No ESLint, Prettier, Black, or Ruff configuration committed
- No pre-commit hooks
- No consistent JSDoc or Python docstring coverage
+144
@@ -0,0 +1,144 @@
# INTEGRATIONS — document-scanner
_Last updated: 2026-05-21_
## Summary
The backend integrates with four interchangeable AI providers for document classification: Anthropic Claude, OpenAI (and any OpenAI-compatible endpoint), Ollama, and LM Studio. There are no external databases, auth services, or cloud storage integrations — all persistence is local filesystem. The active provider is selected at runtime via settings persisted in `backend/data/settings.json`.
---
## AI Providers
All providers implement the `AIProvider` abstract interface defined in `backend/ai/base.py`. The active provider is resolved at request time in `backend/ai/__init__.py:get_provider()`.
### Anthropic
- **SDK:** `anthropic>=0.26` (`backend/ai/anthropic_provider.py`)
- **Client:** `anthropic.AsyncAnthropic`
- **API:** Messages API (`client.messages.create`)
- **Default model:** `claude-sonnet-4-6`
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.anthropic.api_key`; optionally seeded from env var `ANTHROPIC_API_KEY` (`.env.example`)
- **Calls made:** `classify` (max_tokens=1024), `suggest_topics` (max_tokens=256), `health_check` (max_tokens=5)
- **Text limit:** 8,000 characters per request (`MAX_AI_CHARS = 8_000`)
### OpenAI
- **SDK:** `openai>=1.30` (`backend/ai/openai_provider.py`)
- **Client:** `openai.AsyncOpenAI`
- **API:** Chat Completions (`client.chat.completions.create`)
- **Default model:** `gpt-4o`
- **Auth:** `api_key` stored in `backend/data/settings.json` under `providers.openai.api_key`; optionally seeded from env var `OPENAI_API_KEY` (`.env.example`)
- **Custom base URL:** Supported via `providers.openai.base_url` in settings (allows pointing at any OpenAI-compatible endpoint)
### Ollama
- **Provider file:** `backend/ai/ollama_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
- **Default base URL:** `http://host.docker.internal:11434/v1`
- **Default model:** `llama3.2`
- **Auth:** Stub key `"ollama"` (no real auth required)
- **Network path:** Reaches the host machine's Ollama daemon via Docker's `host.docker.internal` DNS alias (configured in `docker-compose.yml` via `extra_hosts`)
### LM Studio
- **Provider file:** `backend/ai/lmstudio_provider.py`
- **Implementation:** Subclass of `OpenAIProvider` — uses the OpenAI SDK with a custom `base_url`
- **Default base URL:** `http://host.docker.internal:1234/v1`
- **Default model:** `gemma-4-e4b-it`
- **Auth:** Stub key `"lm-studio"` (no real auth required)
- **Network path:** Reaches the host machine's LM Studio server via `host.docker.internal` (same `extra_hosts` setting)
- **Default active provider** — the app works out of the box with LM Studio and no API keys
---
## Provider Selection & Settings Persistence
- Active provider and all per-provider config (model names, API keys, base URLs) are persisted in `backend/data/settings.json`.
- Settings are loaded fresh on each classification request in `backend/services/classifier.py:classify_document()`.
- API keys returned from the settings API are masked (last 4 chars shown) via `backend/services/storage.py:mask_api_key()`.
- The Settings UI allows switching providers without restart.
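
The masking behaviour described above (last 4 characters shown) could be implemented along these lines. This is a hedged sketch: the real `mask_api_key()` signature and its handling of short or empty keys are not documented here, and `mask_settings` is a hypothetical helper.

```python
def mask_api_key(key: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if not key:
        return ""
    if len(key) <= visible:
        return "*" * len(key)  # too short to reveal anything safely
    return "*" * (len(key) - visible) + key[-visible:]


def mask_settings(settings: dict) -> dict:
    """Return a copy of settings with every providers.<name>.api_key masked."""
    masked = {**settings, "providers": {}}
    for name, cfg in settings.get("providers", {}).items():
        cfg = dict(cfg)  # copy so the stored settings stay untouched
        if "api_key" in cfg:
            cfg["api_key"] = mask_api_key(cfg["api_key"])
        masked["providers"][name] = cfg
    return masked


settings = {"active_provider": "openai",
            "providers": {"openai": {"api_key": "sk-test-12345678"}}}
public_view = mask_settings(settings)
```

Masking a copy rather than the stored object matters here, since the same settings dict is what gets written back to `settings.json`.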
---
## Frontend ↔ Backend Communication
- **Protocol:** HTTP REST over JSON (and multipart form for uploads)
- **Client:** Native browser `fetch` API — `frontend/src/api/client.js`
- **Base path:** All requests go to `/api/*` — no hardcoded backend hostname in the frontend
- **Proxy (dev):** Vite dev server proxies `/api` → `http://backend:8000` (`frontend/vite.config.js`)
- **Proxy (prod):** Comment in `frontend/src/api/client.js` notes nginx is expected; no nginx config is present in the repo
### API Endpoints consumed by the frontend
| Method | Path | Purpose |
|---|---|---|
| POST | `/api/documents/upload` | Upload file with optional auto-classify flag |
| GET | `/api/documents` | List documents (paginated, optional topic filter) |
| GET | `/api/documents/:id` | Get single document metadata |
| DELETE | `/api/documents/:id` | Delete document |
| POST | `/api/documents/:id/classify` | (Re)classify document, optional topic list |
| GET | `/api/topics` | List all topics |
| POST | `/api/topics` | Create topic |
| PATCH | `/api/topics/:id` | Update topic |
| DELETE | `/api/topics/:id` | Delete topic |
| POST | `/api/topics/suggest` | AI topic suggestions for a document |
| GET | `/api/settings` | Get settings (keys masked) |
| PATCH | `/api/settings` | Update settings |
| POST | `/api/settings/test-provider` | Health-check the active or named provider |
| GET | `/api/settings/default-prompt` | Retrieve the default classification system prompt |
---
## Docker Services
Defined in `docker-compose.yml`:
| Service | Image | Port | Notes |
|---|---|---|---|
| `backend` | Built from `./backend/Dockerfile` | `8000:8000` | Mounts `./backend/data:/app/data` for persistence; `./backend:/app` for hot-reload |
| `frontend` | Built from `./frontend/Dockerfile` | `5173:5173` | Mounts `./frontend/src` and `index.html` for hot-reload; depends on `backend` |
The `backend` service sets `extra_hosts: host.docker.internal:host-gateway` so that Ollama and LM Studio running on the host machine are reachable from the container.
---
## Environment Variables
| Variable | Required | Where used | Notes |
|---|---|---|---|
| `DATA_DIR` | No | `backend/config.py` | Root path for uploads/metadata/settings; defaults to `/app/data` |
| `ANTHROPIC_API_KEY` | No | `.env.example` | Bootstrap only — app manages keys via settings UI |
| `OPENAI_API_KEY` | No | `.env.example` | Bootstrap only — app manages keys via settings UI |
| `PYTHONDONTWRITEBYTECODE` | No | `docker-compose.yml` | Set to `1` to suppress `.pyc` files in Docker |
---
## Authentication & Identity
- No user authentication. The application has no login system, sessions, or identity provider.
- API keys for AI providers are stored in plain text in `backend/data/settings.json` (masked only when returned via the settings API).
---
## Monitoring & Observability
- No error tracking service (no Sentry, Datadog, etc.).
- No structured logging framework — FastAPI default stdout logging only.
- A `/health` endpoint exists at `backend/main.py` returning `{"status": "ok"}`.
- Provider connectivity tested on demand via `POST /api/settings/test-provider`.
---
## Webhooks & Callbacks
- None — the application makes no outbound webhook calls and exposes no webhook receiver endpoints.
---
## Gaps / Unknowns
- No nginx or reverse-proxy config present for production deployments; the client-side comment references it but no config exists.
- No container registry or CI/CD pipeline configuration detected.
- API keys are stored in a plain JSON file on disk with no encryption at rest.
- The `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` env vars from `.env.example` are noted as bootstrap helpers but no code in the repo reads them directly — they appear to be manual seeding hints only.
+129
@@ -0,0 +1,129 @@
# STACK — document-scanner
_Last updated: 2026-05-21_
## Summary
Document Scanner is a full-stack application with a Python/FastAPI backend and a Vue 3 frontend, containerised with Docker Compose. The backend handles document ingestion, text extraction, and AI-powered topic classification; the frontend is a single-page app served by Vite. No external database is used — all state is persisted to the local filesystem.
---
## Languages
| Language | Version | Where used |
|---|---|---|
| Python | 3.12 (pinned in `backend/Dockerfile`) | Backend API, AI providers, services |
| JavaScript (ES modules) | ES2022+ (`"type": "module"` in `frontend/package.json`) | Frontend SPA |
---
## Runtime
**Backend:**
- CPython 3.12 (Docker image: `python:3.12-slim`)
- ASGI server: Uvicorn `>=0.29` with standard extras (websockets, httptools)
- Entry point: `backend/main.py` → `uvicorn main:app`
**Frontend:**
- Node.js 20 (Docker image: `node:20-alpine`)
- Dev server: Vite 5 on port 5173
- Entry point: `frontend/index.html` → `frontend/src/main.js`
**Package Manager:**
- Backend: `pip` — lockfile: none (ranges only in `backend/requirements.txt`)
- Frontend: `npm` — lockfile: `frontend/package-lock.json` (present but not committed, generated on `npm install`)
---
## Frameworks
### Backend
| Package | Version | Purpose |
|---|---|---|
| `fastapi` | `>=0.111` | REST API framework — `backend/main.py` |
| `uvicorn[standard]` | `>=0.29` | ASGI server |
| `pydantic-settings` | `>=2.2` | Settings/config validation |
| `python-multipart` | latest | Multipart file upload parsing |
### Frontend
| Package | Version | Purpose |
|---|---|---|
| `vue` | `^3.4.0` | UI framework — `frontend/src/App.vue` and all components |
| `vue-router` | `^4.3.0` | Client-side routing — `frontend/src/router/index.js` |
| `pinia` | `^2.1.0` | State management — `frontend/src/stores/` |
### Build / Dev Tooling
| Tool | Version | Purpose |
|---|---|---|
| `vite` | `^5.2.0` | Frontend bundler and dev server — `frontend/vite.config.js` |
| `@vitejs/plugin-vue` | `^5.0.0` | Vue SFC support in Vite |
| `tailwindcss` | `^3.4.0` | Utility-first CSS — `frontend/tailwind.config.js` |
| `postcss` | `^8.4.0` | CSS processing — `frontend/postcss.config.js` |
| `autoprefixer` | `^10.4.0` | CSS vendor prefixing |
---
## Key Backend Dependencies
| Package | Version | Purpose |
|---|---|---|
| `anthropic` | `>=0.26` | Anthropic Claude API client — `backend/ai/anthropic_provider.py` |
| `openai` | `>=1.30` | OpenAI / OpenAI-compatible API client — `backend/ai/openai_provider.py`, also used for Ollama and LM Studio via `base_url` override |
| `PyMuPDF` (`fitz`) | `>=1.24` | PDF text extraction — `backend/services/extractor.py` |
| `python-docx` | `>=1.1` | DOCX text extraction — `backend/services/extractor.py` |
| `pytesseract` | `>=0.3` | OCR for image files — `backend/services/extractor.py` |
| `Pillow` | `>=10.3` | Image handling for OCR — `backend/services/extractor.py` |
| `filelock` | `>=3.14` | File-based concurrency locks — `backend/services/storage.py` |
| `aiofiles` | `>=23.2` | Async file I/O support |
| `httpx` | `>=0.27` | Async HTTP client (used internally by `anthropic` and `openai` SDKs) |
---
## Testing
| Tool | Version | Purpose |
|---|---|---|
| `pytest` | `>=8.2` | Test runner — `backend/pytest.ini`, `backend/tests/` |
| `pytest-asyncio` | `>=0.23` | Async test support; `asyncio_mode = auto` set in `backend/pytest.ini` |
No frontend test framework is present.
---
## Storage
- **File system only** — no database engine.
- Upload files stored at `backend/data/uploads/` (UUID-named).
- Document metadata stored as per-document JSON files at `backend/data/metadata/`.
- Topics registry: `backend/data/topics.json`.
- App settings: `backend/data/settings.json`.
- File-level concurrency managed via `filelock` (`backend/services/storage.py`).
---
## System Dependencies (backend Docker image)
Installed via `apt-get` in `backend/Dockerfile`:
- `tesseract-ocr` — OCR binary for `pytesseract`
- `libgl1`, `libglib2.0-0` — shared libraries required by PyMuPDF
---
## Configuration
- Environment variable `DATA_DIR` sets the root data path (default: `/app/data`).
- AI provider settings (models, API keys, base URLs) are stored in `backend/data/settings.json` and managed through the in-app Settings UI.
- Optional bootstrap via `.env` (see `.env.example`): only `ANTHROPIC_API_KEY` and `OPENAI_API_KEY` are referenced.
- Default active provider is `lmstudio` (no API key required).
---
## Gaps / Unknowns
- No Python version pinning file (`.python-version`, `pyproject.toml`) outside the Dockerfile — local dev outside Docker may use a different Python version.
- No frontend lockfile committed; exact transitive dependency versions are non-deterministic until `npm install` is run.
- No linter or formatter config detected (no `.eslintrc`, `.prettierrc`, `biome.json`, `ruff.toml`, `mypy.ini`, etc.).
- No production deployment config beyond Docker Compose (no nginx config, no cloud provider manifests).
+144
@@ -0,0 +1,144 @@
# STRUCTURE — document-scanner
_Last updated: 2026-05-21_
## Summary
The project is a monorepo with two top-level service directories (`backend/`, `frontend/`) and Docker Compose at the root. Backend is a Python/FastAPI app; frontend is a Vue 3 SPA built with Vite. All persistent data lives under `backend/data/`.
---
## Top-Level Layout
```
document_scanner/
├── backend/ Python FastAPI service
├── frontend/ Vue 3 SPA
├── docker-compose.yml Two-service compose (backend + frontend)
├── .env.example Optional env vars (API keys)
└── .claude/ Claude Code settings
```
---
## Backend
```
backend/
├── main.py FastAPI app: CORS, lifespan, router registration
├── config.py Path constants, DEFAULT_SETTINGS, ensure_data_dirs()
├── requirements.txt Python dependencies
├── pytest.ini pytest config (asyncio_mode=auto)
├── Dockerfile
├── api/ FastAPI routers (thin HTTP layer)
│ ├── documents.py Upload, list, get, delete, reclassify endpoints
│ ├── topics.py Topic CRUD endpoints
│ └── settings.py AI provider settings endpoints
├── ai/ AI provider abstraction
│ ├── base.py AIProvider ABC + ClassificationResult dataclass
│ ├── __init__.py get_provider() factory
│ ├── anthropic_provider.py
│ ├── openai_provider.py
│ ├── ollama_provider.py extends OpenAIProvider
│ └── lmstudio_provider.py extends OpenAIProvider
├── services/ Business logic (no FastAPI dependency)
│ ├── extractor.py Text extraction: PDF/DOCX/image/text dispatch
│ ├── classifier.py Orchestrates AI call + topic auto-creation
│ └── storage.py Flat-file JSON CRUD + filelock
├── data/ Runtime data (volume-mounted in Docker)
│ ├── uploads/ Uploaded document files
│ ├── metadata/ Per-document JSON metadata files
│ ├── topics.json Global topic list
│ └── settings.json Active AI provider + system prompt config
└── tests/
├── conftest.py Fixtures: isolated tmp data dir, TestClient, sample files
├── test_health.py
├── test_documents.py
├── test_topics.py
├── test_settings.py
├── test_extractor.py
├── test_classifier.py
└── test_lmstudio.py
```
---
## Frontend
```
frontend/
├── index.html Vite entry HTML
├── vite.config.js Vite config (Vue plugin, /api proxy)
├── tailwind.config.js
├── postcss.config.js
├── package.json Vue 3, Vue Router 4, Pinia; no test framework
├── Dockerfile
└── src/
├── main.js App bootstrap: Vue + Pinia + Router
├── App.vue Root component (sidebar layout wrapper)
├── style.css Global Tailwind imports
├── api/
│ └── client.js fetch wrapper; all API calls go through here
├── stores/ Pinia stores (data + actions layer)
│ ├── documents.js Document list, upload, classify state
│ ├── topics.js Topic list CRUD state
│ └── settings.js AI provider settings state
├── router/
│ └── index.js Routes: /, /topics, /topics/:name, /document/:id, /settings
├── views/ Page-level components (one per route)
│ ├── HomeView.vue
│ ├── TopicsView.vue
│ ├── DocumentView.vue
│ └── SettingsView.vue
└── components/ Reusable UI components
├── layout/
│ └── AppSidebar.vue
├── documents/
│ └── DocumentCard.vue
├── topics/
│ ├── TopicBadge.vue
│ └── TopicManager.vue
└── upload/
├── DropZone.vue
└── UploadProgress.vue
```
---
## Key Entry Points
| File | Purpose |
|---|---|
| `backend/main.py` | FastAPI app instantiation, middleware, router registration |
| `backend/config.py` | All path constants and default settings — change storage paths here |
| `backend/ai/__init__.py` | Add a new AI provider here |
| `frontend/src/main.js` | Vue app bootstrap |
| `frontend/src/api/client.js` | All HTTP calls originate here |
---
## Where to Add New Code
- **New API endpoint**: add router in `backend/api/`, register in `backend/main.py`
- **New AI provider**: implement `AIProvider` ABC in `backend/ai/`, add case in `get_provider()`
- **New document type**: add extraction branch in `backend/services/extractor.py`
- **New frontend page**: add view in `src/views/`, add route in `src/router/index.js`
- **New shared UI component**: add to relevant `src/components/<category>/` subdirectory
---
## Gaps / Unknowns
- No `src/components/settings/` subdirectory — settings UI is entirely in `SettingsView.vue`
- No migration or schema versioning for `topics.json` / `settings.json` flat files
+87
@@ -0,0 +1,87 @@
# TESTING — document-scanner
_Last updated: 2026-05-21_
## Summary
The backend has solid integration test coverage across all API surfaces and services using pytest + FastAPI TestClient. Each test runs in a fully isolated temporary data directory, so there is no shared state between tests. The frontend has no test framework configured at all.
---
## Backend Testing
### Framework
- **pytest** + **pytest-asyncio** (`asyncio_mode = auto` in `pytest.ini`)
- **FastAPI TestClient** (synchronous test client re-exported from Starlette and built on `httpx`)
- No mocking library — AI calls are either tested with real parsing logic or the AI layer is swapped via provider mocking
### Test Isolation Strategy (conftest.py)
- `isolated_data_dir` fixture is `autouse=True` — every test automatically gets:
- A fresh `tmp_path/data/` directory with `uploads/`, `metadata/`
- Clean `topics.json` and `settings.json` initialized from `DEFAULT_SETTINGS`
- Monkeypatched `DATA_DIR` env var and all module-level path constants in `config` and `services.storage`
- New `FileLock` instances pointing to the tmp dir
- `client` fixture wraps FastAPI `TestClient` with the isolated data dir active
### Test Files
| File | What it covers |
|---|---|
| `test_health.py` | `GET /health` returns `{"status": "ok"}` |
| `test_documents.py` | Upload TXT/PDF (no-classify), list, get, delete; extracts text correctly |
| `test_topics.py` | Create, list, delete topics via API |
| `test_settings.py` | Read default settings, update provider config |
| `test_extractor.py` | Unit tests for `extract_text()` on TXT, PDF, DOCX, image paths |
| `test_classifier.py` | Unit tests for JSON parsing helpers (`_parse_classification`, `_parse_suggestions`, `_strip_code_fences`) — no real AI calls |
| `test_lmstudio.py` | LMStudio provider-specific behaviour (likely mocked or uses a local endpoint) |
### Fixtures Available
| Fixture | Provides |
|---|---|
| `isolated_data_dir` | Autouse — clean tmp data dir |
| `client` | FastAPI TestClient with isolated data |
| `sample_txt` | A `.txt` file with test content |
| `sample_pdf` | A minimal valid PDF created with PyMuPDF |
### What Is NOT Tested
- Auto-classification flow end-to-end (requires a live AI provider)
- Document reclassify endpoint
- Anthropic, OpenAI, Ollama provider implementations directly
- Any concurrent write / filelock contention scenarios
- File size / type validation edge cases
- Frontend — no tests exist
---
## Frontend Testing
- **No test framework installed** — `package.json` has no `vitest`, `jest`, or `@testing-library/vue`
- No test files found under `frontend/src/`
- No Cypress or Playwright configuration
---
## Running Tests
```bash
# From backend/
pytest
# With verbose output
pytest -v
# Single file
pytest tests/test_documents.py
```
---
## Gaps / Unknowns
- No test coverage measurement (no `pytest-cov` in `requirements.txt`)
- `test_lmstudio.py` content not inspected — unclear if it hits a real local endpoint
- No CI configuration (no GitHub Actions, no Dockerfile for test runner)
- No snapshot or contract tests for API response shapes
- Frontend is completely untested