Files
kite/.planning/codebase/CONCERNS.md
T
2026-05-22 08:53:28 +02:00

88 lines
5.1 KiB
Markdown

# CONCERNS — document-scanner
_Last updated: 2026-05-21_
## Summary
The codebase is a well-structured local-first prototype. The main concerns are security issues that matter if exposed beyond localhost (open CORS, no file validation, plain-text key storage), several blocking I/O calls in async handlers, and a handful of code duplication issues in the AI provider layer. Overall health is good for a local dev tool; requires hardening before any networked deployment.
---
## Concerns by Severity
### HIGH
**1. File type validation is defined but never enforced**
`ALLOWED_MIME_TYPES` is defined in `backend/api/documents.py` but the upload handler never checks it — any file type is accepted. An attacker could upload executable files or crafted archives.
**2. No file size limit on uploads**
The entire uploaded file is read before any cap is applied. A large file could exhaust memory or disk. No `MAX_UPLOAD_SIZE` check exists at the HTTP boundary.
**3. API keys stored in plain-text JSON**
`backend/data/settings.json` stores API keys in plaintext. The volume mount in `docker-compose.yml` (`./backend/data:/app/data`) means any process with Docker access can read them. Masking only applies to API responses, not to disk.
**4. CORS fully open**
`allow_origins=["*"]` in `main.py` means any website can make cross-origin requests to the API, including with credentials if ever added.
**5. Docker Compose mounts entire backend source as writable volume**
`./backend:/app` gives the container write access to the host source tree. A path traversal or code execution bug in the app could overwrite source files.
---
### MEDIUM
**6. Blocking I/O in async FastAPI handlers**
`storage.py` uses synchronous file reads/writes and `filelock` blocking calls inside `async def` endpoints. This blocks the uvicorn event loop during every request. Should use `asyncio.to_thread()` or `aiofiles` (which is already in requirements but unused).
**7. Topic rename does not cascade to documents**
Deleting a topic removes it from document metadata, but renaming is not implemented — there is no rename endpoint. Users have no way to rename a topic without losing document associations.
**8. `list_metadata` loads all documents before filtering**
`storage.list_metadata()` reads all metadata JSON files on every list request. No pagination at the storage layer — O(N) disk reads per page request as the document count grows.
**9. `topic_doc_counts()` scans all metadata on every topic request**
Every `GET /api/topics` call triggers a full scan of all metadata files to count documents per topic. Not cached; will degrade linearly.
**10. `MAX_AI_CHARS` duplicated across 3 files**
The character truncation limit for AI input is duplicated as a magic constant in multiple provider files. The provider-level truncation is effectively dead code since `extractor.py` already truncates to `MAX_STORED_CHARS` (50,000).
**11. `_parse_classification` / `_parse_suggestions` duplicated between providers**
`anthropic_provider.py` and `openai_provider.py` each define their own JSON parsing helpers for AI responses. `test_classifier.py` only imports from `openai_provider`, meaning the Anthropic variants are untested.
**12. `health_check()` makes real billed API calls**
The "Test Connection" UI action calls `provider.health_check()`, which makes a real API call to Anthropic/OpenAI — incurring cost and latency every time the user tests connectivity. Should use a cheaper probe (e.g., list models endpoint or a cached status).
---
### LOW
**13. `uvicorn --reload` hardcoded in docker-compose.yml**
Hot-reload is hardcoded in the production compose file. There is no separate `docker-compose.prod.yml` or build-arg to disable it.
**14. Unused `shutil` import in `storage.py`**
`import shutil` appears in `storage.py` but is never used.
**15. Topic IDs are 8-character UUID prefixes**
`str(uuid.uuid4())[:8]` generates IDs with ~4 billion combinations — low collision risk for personal use but not safe at scale or for security-sensitive identifiers.
**16. `classify_document` request body uses raw `dict`, not a Pydantic model**
The reclassify endpoint accepts an unvalidated `dict` body. Invalid input causes an unformatted 500 rather than a clean 422 validation error.
**17. No global frontend error handling**
There is no Vue error boundary or global `window.onerror` / `app.config.errorHandler`. Failed API calls in stores may surface as silent failures or unhandled promise rejections.
**18. No document download endpoint**
Uploaded files are stored in `data/uploads/` but there is no `GET /api/documents/:id/file` endpoint to retrieve the original binary. Files are effectively write-only through the UI.
**19. `aiofiles` in requirements but never used**
`aiofiles>=23.2` is listed in `requirements.txt` but no code imports it. The blocking I/O concern (item 6) should use it.
---
## Gaps / Unknowns
- Production deployment path is undefined (no nginx, no TLS, no auth)
- OCR language support for pytesseract is not configured (defaults to English only)
- `suggest_topics` method on all providers is untested — unclear if it is used in the current UI flow
- No backup or recovery strategy for `data/` volume