# CONCERNS — document-scanner _Last updated: 2026-05-21_ ## Summary The codebase is a well-structured local-first prototype. The main concerns are security issues that matter if exposed beyond localhost (open CORS, no file validation, plain-text key storage), several blocking I/O calls in async handlers, and a handful of code duplication issues in the AI provider layer. Overall health is good for a local dev tool; requires hardening before any networked deployment. --- ## Concerns by Severity ### HIGH **1. File type validation is defined but never enforced** `ALLOWED_MIME_TYPES` is defined in `backend/api/documents.py` but the upload handler never checks it — any file type is accepted. An attacker could upload executable files or crafted archives. **2. No file size limit on uploads** The entire uploaded file is read before any cap is applied. A large file could exhaust memory or disk. No `MAX_UPLOAD_SIZE` check exists at the HTTP boundary. **3. API keys stored in plain-text JSON** `backend/data/settings.json` stores API keys in plaintext. The volume mount in `docker-compose.yml` (`./backend/data:/app/data`) means any process with Docker access can read them. Masking only applies to API responses, not to disk. **4. CORS fully open** `allow_origins=["*"]` in `main.py` means any website can make cross-origin requests to the API, including with credentials if ever added. **5. Docker Compose mounts entire backend source as writable volume** `./backend:/app` gives the container write access to the host source tree. A path traversal or code execution bug in the app could overwrite source files. --- ### MEDIUM **6. Blocking I/O in async FastAPI handlers** `storage.py` uses synchronous file reads/writes and `filelock` blocking calls inside `async def` endpoints. This blocks the uvicorn event loop during every request. Should use `asyncio.to_thread()` or `aiofiles` (which is already in requirements but unused). **7. Topic rename does not cascade to documents** Deleting a topic removes it from document metadata, but renaming is not implemented — there is no rename endpoint. Users have no way to rename a topic without losing document associations. **8. `list_metadata` loads all documents before filtering** `storage.list_metadata()` reads all metadata JSON files on every list request. No pagination at the storage layer — O(N) disk reads per page request as the document count grows. **9. `topic_doc_counts()` scans all metadata on every topic request** Every `GET /api/topics` call triggers a full scan of all metadata files to count documents per topic. Not cached; will degrade linearly. **10. `MAX_AI_CHARS` duplicated across 3 files** The character truncation limit for AI input is duplicated as a magic constant in multiple provider files. The provider-level truncation is effectively dead code since `extractor.py` already truncates to `MAX_STORED_CHARS` (50,000). **11. `_parse_classification` / `_parse_suggestions` duplicated between providers** `anthropic_provider.py` and `openai_provider.py` each define their own JSON parsing helpers for AI responses. `test_classifier.py` only imports from `openai_provider`, meaning the Anthropic variants are untested. **12. `health_check()` makes real billed API calls** The "Test Connection" UI action calls `provider.health_check()`, which makes a real API call to Anthropic/OpenAI — incurring cost and latency every time the user tests connectivity. Should use a cheaper probe (e.g., list models endpoint or a cached status). --- ### LOW **13. `uvicorn --reload` hardcoded in docker-compose.yml** Hot-reload is hardcoded in the production compose file. There is no separate `docker-compose.prod.yml` or build-arg to disable it. **14. Unused `shutil` import in `storage.py`** `import shutil` appears in `storage.py` but is never used. **15. Topic IDs are 8-character UUID prefixes** `str(uuid.uuid4())[:8]` generates IDs with ~4 billion combinations — low collision risk for personal use but not safe at scale or for security-sensitive identifiers. **16. `classify_document` request body uses raw `dict`, not a Pydantic model** The reclassify endpoint accepts an unvalidated `dict` body. Invalid input causes an unformatted 500 rather than a clean 422 validation error. **17. No global frontend error handling** There is no Vue error boundary or global `window.onerror` / `app.config.errorHandler`. Failed API calls in stores may surface as silent failures or unhandled promise rejections. **18. No document download endpoint** Uploaded files are stored in `data/uploads/` but there is no `GET /api/documents/:id/file` endpoint to retrieve the original binary. Files are effectively write-only through the UI. **19. `aiofiles` in requirements but never used** `aiofiles>=23.2` is listed in `requirements.txt` but no code imports it. The blocking I/O concern (item 6) should use it. --- ## Gaps / Unknowns - Production deployment path is undefined (no nginx, no TLS, no auth) - OCR language support for pytesseract is not configured (defaults to English only) - `suggest_topics` method on all providers is untested — unclear if it is used in the current UI flow - No backup or recovery strategy for `data/` volume