7a34807fa0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
88 lines
5.1 KiB
Markdown
88 lines
5.1 KiB
Markdown
# CONCERNS — document-scanner
|
|
|
|
_Last updated: 2026-05-21_
|
|
|
|
## Summary
|
|
|
|
The codebase is a well-structured local-first prototype. The main concerns are security issues that matter if exposed beyond localhost (open CORS, no file validation, plain-text key storage), several blocking I/O calls in async handlers, and a handful of code duplication issues in the AI provider layer. Overall health is good for a local dev tool; requires hardening before any networked deployment.
|
|
|
|
---
|
|
|
|
## Concerns by Severity
|
|
|
|
### HIGH
|
|
|
|
**1. File type validation is defined but never enforced**
|
|
`ALLOWED_MIME_TYPES` is defined in `backend/api/documents.py` but the upload handler never checks it — any file type is accepted. An attacker could upload executable files or crafted archives.
|
|
|
|
**2. No file size limit on uploads**
|
|
The entire uploaded file is read before any cap is applied. A large file could exhaust memory or disk. No `MAX_UPLOAD_SIZE` check exists at the HTTP boundary.
|
|
|
|
**3. API keys stored in plain-text JSON**
|
|
`backend/data/settings.json` stores API keys in plaintext. The volume mount in `docker-compose.yml` (`./backend/data:/app/data`) means any process with Docker access can read them. Masking only applies to API responses, not to disk.
|
|
|
|
**4. CORS fully open**
|
|
`allow_origins=["*"]` in `main.py` means any website can make cross-origin requests to the API, including with credentials if ever added.
|
|
|
|
**5. Docker Compose mounts entire backend source as writable volume**
|
|
`./backend:/app` gives the container write access to the host source tree. A path traversal or code execution bug in the app could overwrite source files.
|
|
|
|
---
|
|
|
|
### MEDIUM
|
|
|
|
**6. Blocking I/O in async FastAPI handlers**
|
|
`storage.py` uses synchronous file reads/writes and `filelock` blocking calls inside `async def` endpoints. This blocks the uvicorn event loop during every request. Should use `asyncio.to_thread()` or `aiofiles` (which is already in requirements but unused).
|
|
|
|
**7. Topic rename does not cascade to documents**
|
|
Deleting a topic removes it from document metadata, but renaming is not implemented — there is no rename endpoint. Users have no way to rename a topic without losing document associations.
|
|
|
|
**8. `list_metadata` loads all documents before filtering**
|
|
`storage.list_metadata()` reads all metadata JSON files on every list request. No pagination at the storage layer — O(N) disk reads per page request as the document count grows.
|
|
|
|
**9. `topic_doc_counts()` scans all metadata on every topic request**
|
|
Every `GET /api/topics` call triggers a full scan of all metadata files to count documents per topic. Not cached; will degrade linearly.
|
|
|
|
**10. `MAX_AI_CHARS` duplicated across 3 files**
|
|
The character truncation limit for AI input is duplicated as a magic constant in multiple provider files. The provider-level truncation is effectively dead code since `extractor.py` already truncates to `MAX_STORED_CHARS` (50,000).
|
|
|
|
**11. `_parse_classification` / `_parse_suggestions` duplicated between providers**
|
|
`anthropic_provider.py` and `openai_provider.py` each define their own JSON parsing helpers for AI responses. `test_classifier.py` only imports from `openai_provider`, meaning the Anthropic variants are untested.
|
|
|
|
**12. `health_check()` makes real billed API calls**
|
|
The "Test Connection" UI action calls `provider.health_check()`, which makes a real API call to Anthropic/OpenAI — incurring cost and latency every time the user tests connectivity. Should use a cheaper probe (e.g., list models endpoint or a cached status).
|
|
|
|
---
|
|
|
|
### LOW
|
|
|
|
**13. `uvicorn --reload` hardcoded in docker-compose.yml**
|
|
Hot-reload is hardcoded in the production compose file. There is no separate `docker-compose.prod.yml` or build-arg to disable it.
|
|
|
|
**14. Unused `shutil` import in `storage.py`**
|
|
`import shutil` appears in `storage.py` but is never used.
|
|
|
|
**15. Topic IDs are 8-character UUID prefixes**
|
|
`str(uuid.uuid4())[:8]` generates IDs with ~4 billion combinations — low collision risk for personal use but not safe at scale or for security-sensitive identifiers.
|
|
|
|
**16. `classify_document` request body uses raw `dict`, not a Pydantic model**
|
|
The reclassify endpoint accepts an unvalidated `dict` body. Invalid input causes an unformatted 500 rather than a clean 422 validation error.
|
|
|
|
**17. No global frontend error handling**
|
|
There is no Vue error boundary or global `window.onerror` / `app.config.errorHandler`. Failed API calls in stores may surface as silent failures or unhandled promise rejections.
|
|
|
|
**18. No document download endpoint**
|
|
Uploaded files are stored in `data/uploads/` but there is no `GET /api/documents/:id/file` endpoint to retrieve the original binary. Files are effectively write-only through the UI.
|
|
|
|
**19. `aiofiles` in requirements but never used**
|
|
`aiofiles>=23.2` is listed in `requirements.txt` but no code imports it. The blocking I/O concern (item 6) should use it.
|
|
|
|
---
|
|
|
|
## Gaps / Unknowns
|
|
|
|
- Production deployment path is undefined (no nginx, no TLS, no auth)
|
|
- OCR language support for pytesseract is not configured (defaults to English only)
|
|
- `suggest_topics` method on all providers is untested — unclear if it is used in the current UI flow
|
|
- No backup or recovery strategy for `data/` volume
|