Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5.1 KiB
CONCERNS — document-scanner
Last updated: 2026-05-21
Summary
The codebase is a well-structured local-first prototype. The main concerns are security issues that matter if exposed beyond localhost (open CORS, no file validation, plain-text key storage), several blocking I/O calls in async handlers, and a handful of code duplication issues in the AI provider layer. Overall health is good for a local dev tool; requires hardening before any networked deployment.
Concerns by Severity
HIGH
1. File type validation is defined but never enforced
ALLOWED_MIME_TYPES is defined in backend/api/documents.py but the upload handler never checks it — any file type is accepted. An attacker could upload executable files or crafted archives.
2. No file size limit on uploads
The entire uploaded file is read before any cap is applied. A large file could exhaust memory or disk. No MAX_UPLOAD_SIZE check exists at the HTTP boundary.
3. API keys stored in plain-text JSON
backend/data/settings.json stores API keys in plaintext. The volume mount in docker-compose.yml (./backend/data:/app/data) means any process with Docker access can read them. Masking only applies to API responses, not to disk.
4. CORS fully open
allow_origins=["*"] in main.py means any website can make cross-origin requests to the API, including with credentials if ever added.
5. Docker Compose mounts entire backend source as writable volume
./backend:/app gives the container write access to the host source tree. A path traversal or code execution bug in the app could overwrite source files.
MEDIUM
6. Blocking I/O in async FastAPI handlers
storage.py uses synchronous file reads/writes and filelock blocking calls inside async def endpoints. This blocks the uvicorn event loop during every request. Should use asyncio.to_thread() or aiofiles (which is already in requirements but unused).
7. Topic rename does not cascade to documents Deleting a topic removes it from document metadata, but renaming is not implemented — there is no rename endpoint. Users have no way to rename a topic without losing document associations.
8. list_metadata loads all documents before filtering
storage.list_metadata() reads all metadata JSON files on every list request. No pagination at the storage layer — O(N) disk reads per page request as the document count grows.
9. topic_doc_counts() scans all metadata on every topic request
Every GET /api/topics call triggers a full scan of all metadata files to count documents per topic. Not cached; will degrade linearly.
10. MAX_AI_CHARS duplicated across 3 files
The character truncation limit for AI input is duplicated as a magic constant in multiple provider files. The provider-level truncation is effectively dead code since extractor.py already truncates to MAX_STORED_CHARS (50,000).
11. _parse_classification / _parse_suggestions duplicated between providers
anthropic_provider.py and openai_provider.py each define their own JSON parsing helpers for AI responses. test_classifier.py only imports from openai_provider, meaning the Anthropic variants are untested.
12. health_check() makes real billed API calls
The "Test Connection" UI action calls provider.health_check(), which makes a real API call to Anthropic/OpenAI — incurring cost and latency every time the user tests connectivity. Should use a cheaper probe (e.g., list models endpoint or a cached status).
LOW
13. uvicorn --reload hardcoded in docker-compose.yml
Hot-reload is hardcoded in the production compose file. There is no separate docker-compose.prod.yml or build-arg to disable it.
14. Unused shutil import in storage.py
import shutil appears in storage.py but is never used.
15. Topic IDs are 8-character UUID prefixes
str(uuid.uuid4())[:8] generates IDs with ~4 billion combinations — low collision risk for personal use but not safe at scale or for security-sensitive identifiers.
16. classify_document request body uses raw dict, not a Pydantic model
The reclassify endpoint accepts an unvalidated dict body. Invalid input causes an unformatted 500 rather than a clean 422 validation error.
17. No global frontend error handling
There is no Vue error boundary or global window.onerror / app.config.errorHandler. Failed API calls in stores may surface as silent failures or unhandled promise rejections.
18. No document download endpoint
Uploaded files are stored in data/uploads/ but there is no GET /api/documents/:id/file endpoint to retrieve the original binary. Files are effectively write-only through the UI.
19. aiofiles in requirements but never used
aiofiles>=23.2 is listed in requirements.txt but no code imports it. The blocking I/O concern (item 6) should use it.
Gaps / Unknowns
- Production deployment path is undefined (no nginx, no TLS, no auth)
- OCR language support for pytesseract is not configured (defaults to English only)
suggest_topicsmethod on all providers is untested — unclear if it is used in the current UI flow- No backup or recovery strategy for
data/volume