DocuVault — Research Synthesis
Last updated: 2026-05-21
Executive Summary
DocuVault is a brownfield migration of a functional single-user document scanner into a privacy-first, multi-user SaaS platform. The existing system already handles document upload, text extraction, and AI-based topic classification via a well-designed provider abstraction. This milestone replaces the flat-file JSON + filesystem persistence layer with PostgreSQL + MinIO, adds full multi-user authentication (JWT with httpOnly cookies, TOTP 2FA, refresh token rotation), per-user quota enforcement, folder organization, document sharing, and pluggable cloud storage backends — following the same adapter pattern already used for AI providers.
Confirmed Stack
Use
| Package | Version | Purpose |
|---|---|---|
| `pyjwt[crypto]` | ≥2.12.1 | JWT; current FastAPI docs recommendation, replaces python-jose |
| `pwdlib[argon2]` | ≥0.2.0 | Password hashing; Argon2 is memory-hard (OWASP 2025) |
| `pyotp` | ≥2.9.0 | TOTP 2FA; RFC 6238 reference |
| `cryptography` (Fernet) | ≥44.0.0 | Credential encryption; AES-128-CBC + HMAC-SHA256 |
| `sqlalchemy[asyncio]` | ≥2.0.36 | ORM; async-native, better brownfield fit than SQLModel |
| `psycopg[asyncio,binary]` | ≥3.2.0 | PostgreSQL driver; single driver for async FastAPI + sync Alembic |
| `alembic` | ≥1.14.0 | DB migrations |
| `minio` | ≥7.2.0 | Object storage; presigned URL flow (FastAPI never proxies bytes) |
| `msgraph-sdk` + `azure-identity` | ≥1.0.0 / ≥1.19.0 | OneDrive; official Microsoft SDK |
| `google-api-python-client` + `google-auth-oauthlib` | ≥2.150.0 / ≥1.2.0 | Google Drive v3 |
| `webdav4` | ≥0.9.8 | Nextcloud + generic WebDAV |
Do NOT Use
- `python-jose`: FastAPI dropped it; use PyJWT
- `passlib[bcrypt]` for new hashes: maintenance mode; keep only for migrating existing hashes
- `tiangolo/uvicorn-gunicorn-fastapi` Docker image: deprecated; use `python:3.12-slim`
- `localStorage` for any auth token: XSS-accessible; httpOnly cookie for refresh, Pinia memory for access token
- Single platform Fernet key for all users: HKDF per-user derivation required (catastrophic blast radius otherwise)
- `SQLModel` for this migration: async story is thin; SQLAlchemy 2.0 async is better for brownfield
Table-Stakes Features for v1
Confirmed (from PROJECT.md)
- Email + password registration + JWT sessions with refresh tokens
- TOTP 2FA + backup codes (see gap below)
- Password reset via email
- Per-user isolated storage (100 MB free tier)
- Quota tracking, enforcement at upload, visible indicator
- Folder CRUD, move documents, "Shared with me" folder
- Share by handle, view-only default, immediate revoke
- Cloud OAuth2 connect flow + credential encryption
- Admin: user management, quota adjustment, AI provider assignment
- Audit log (append-only, metadata only) + admin viewer
- In-browser PDF preview
Gaps — Items PROJECT.md Missed
- TOTP backup codes: Every competitor ships these. Without them, a lost phone permanently locks users out. Issue 8–10 single-use codes, stored hashed, and require the user to acknowledge them before TOTP is activated.
- Quota warnings at 80% and 95%: PROJECT.md specifies rejection at 100% only. Pre-emptive warnings are table stakes (Google Drive and Dropbox both do this). Show an in-app banner at 80% (amber) and 95% (red), plus a specific error at 100% showing current usage, rejected file size, and a link to storage settings.
- "Sign out all devices" / session revocation: Users who believe their account is compromised need forced logout everywhere. The `refresh_tokens` table already supports this; it needs only an endpoint and a UI control.
- Breadcrumb navigation: Folder CRUD is in PROJECT.md but the navigation UX is not. Breadcrumbs are required for nested folder usability.
- Cloud storage connection status indicator: PROJECT.md doesn't specify what happens when cloud storage is unreachable. Silent failure = data loss. The UI must show an `ACTIVE | REQUIRES_REAUTH | ERROR` state and fall back to local storage with a clear message.
- Admin impersonation is an explicit architectural exclusion: It must be documented as excluded, not just left unbuilt, because it directly contradicts the privacy-first core value.
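The backup-code item in the list above (single-use codes, stored hashed) could look roughly like the sketch below. The function names, the code length, and the use of plain SHA-256 are assumptions for illustration only; because short codes are brute-forceable against a fast hash, a real implementation would probably reuse the slow password hasher from the stack instead.

```python
import hashlib
import secrets


def generate_backup_codes(count: int = 10):
    """Return (plaintext_codes, hashed_codes).

    Plaintext is shown to the user once; only the hashes are persisted.
    SHA-256 is an illustrative stand-in for a slow password hash.
    """
    codes = [secrets.token_hex(5) for _ in range(count)]  # 10 hex chars each
    hashed = {hashlib.sha256(code.encode()).hexdigest() for code in codes}
    return codes, hashed


def redeem_backup_code(stored_hashes: set, candidate: str) -> bool:
    """Consume a backup code: succeeds at most once per code."""
    digest = hashlib.sha256(candidate.strip().lower().encode()).hexdigest()
    if digest in stored_hashes:
        stored_hashes.remove(digest)  # single use: delete on redemption
        return True
    return False


codes, stored = generate_backup_codes()
first_try = redeem_backup_code(stored, codes[0])
second_try = redeem_backup_code(stored, codes[0])
```

Here `first_try` succeeds and `second_try` fails, which is the single-use property the gap description asks for.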
Critical Architectural Decisions (Lock Before Building)
These cannot be safely retrofitted:
1. JWT in httpOnly cookies
Refresh token: httpOnly; Secure; SameSite=Strict cookie. Access token: Pinia memory only. Never localStorage. Vue Router guard silently refreshes before redirecting to login. Axios withCredentials: true.
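As a rough illustration of the cookie attributes named above, the following standard-library sketch builds a `Set-Cookie` header value for the refresh token. The cookie name `refresh_token`, the `/auth` path, and the 14-day lifetime are assumptions made for the example; the application itself would presumably set the cookie through its web framework's response object.

```python
from http.cookies import SimpleCookie


def build_refresh_cookie(token: str, max_age_seconds: int = 14 * 24 * 3600) -> str:
    """Return a Set-Cookie header value for an opaque refresh token.

    Cookie name, path, and lifetime are illustrative assumptions.
    """
    cookie = SimpleCookie()
    cookie["refresh_token"] = token
    morsel = cookie["refresh_token"]
    morsel["httponly"] = True       # not readable from JavaScript
    morsel["secure"] = True         # sent over HTTPS only
    morsel["samesite"] = "Strict"   # never attached to cross-site requests
    morsel["path"] = "/auth"        # assumed: scope the cookie to auth endpoints
    morsel["max-age"] = max_age_seconds
    return morsel.OutputString()


header_value = build_refresh_cookie("3f2c9a0e-opaque-token")
print(header_value)
```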
2. HKDF per-user key derivation for cloud credentials
HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials"). Master key in CLOUD_CREDS_KEY env var only. Salt in users table. Design before writing the first line of credential storage — cannot be added later without re-encrypting everything.
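A minimal sketch of the derivation above, written against the standard library so it runs without extra packages. It implements HKDF-SHA256 as described in RFC 5869 and base64url-encodes the 32-byte result, which is the key format Fernet accepts. The function names and the placeholder master key are assumptions; in the real code the HKDF class from the `cryptography` package would be the more natural choice.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand to `length` bytes."""
    # Extract: condense the input key material into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * 32, master_key, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is available.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_fernet_key(master_key: bytes, user_id_bytes: bytes) -> bytes:
    """Derive a per-user key, base64url-encoded as Fernet expects (44 chars)."""
    raw = hkdf_sha256(master_key, salt=user_id_bytes, info=b"cloud-credentials")
    return base64.urlsafe_b64encode(raw)


# Placeholder master key; the document keeps the real one in CLOUD_CREDS_KEY.
master = b"\x01" * 32
key_a = derive_user_fernet_key(master, b"user-a")
key_b = derive_user_fernet_key(master, b"user-b")
```

Because each user's key depends on their own salt, a leaked derived key exposes one user's credentials only, not the whole platform.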
3. Presigned MinIO URL flow
FastAPI generates signed PUT URL → browser uploads directly to MinIO → FastAPI confirms object and commits quota atomically. FastAPI handles metadata only; bytes never pass through the API layer. Object keys: {user_id}/{document_id}/{uuid4()}{ext}. Human-readable filename in DB only.
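The object-key layout above might be built as in this sketch. The function name and the extension whitelist rule are assumptions; the point is that the key is assembled only from server-generated UUIDs plus a sanitized extension, so nothing from the user-supplied filename can steer the path.

```python
import uuid
from pathlib import PurePosixPath


def build_object_key(user_id: uuid.UUID, document_id: uuid.UUID, original_filename: str) -> str:
    """Build '{user_id}/{document_id}/{uuid4()}{ext}' from trusted parts only."""
    ext = PurePosixPath(original_filename).suffix.lower()
    body = ext[1:]
    # Keep short ASCII alphanumeric extensions; drop anything else entirely.
    if not (body.isascii() and body.isalnum() and len(body) <= 10):
        ext = ""
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


key = build_object_key(uuid.UUID(int=1), uuid.UUID(int=2), "../../Quarterly Report.PDF")
print(key)
```

The human-readable name ("Quarterly Report.PDF" here) would live in the database row only, as the decision states.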
4. Atomic PostgreSQL quota enforcement
UPDATE quotas SET used_bytes = used_bytes + $delta WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes RETURNING used_bytes. If 0 rows returned, delete the MinIO object and return 413. Never perform quota arithmetic in Python between two DB statements.
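The conditional update above can be exercised with a small sketch. To stay self-contained it uses an in-memory SQLite database as a stand-in for PostgreSQL and checks the affected row count instead of `RETURNING`; the table layout and function name are assumptions for illustration.

```python
import sqlite3


def try_reserve_quota(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    """Reserve `delta` bytes in one conditional UPDATE; True if it fit."""
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    # Zero affected rows means the upload would exceed the limit.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES ('u1', 99, 100)")

first = try_reserve_quota(conn, "u1", 1)   # fits: 99 + 1 <= 100
second = try_reserve_quota(conn, "u1", 1)  # rejected: 100 + 1 > 100
```

The check and the increment happen in the same statement, so there is no window in which two uploads can both read the old `used_bytes` value. On a rejected reservation the caller would delete the uploaded object and return 413.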
5. BackgroundTasks replacement before horizontal scaling
FastAPI BackgroundTasks is per-instance — classification tasks cannot distribute across containers. Replace with Celery + Redis or pgqueuer (PostgreSQL-backed, no Redis dependency) before scaling to N instances. Decide during Phase 3 planning.
Additional locked decisions:
- Refresh tokens are opaque UUIDs stored hashed in DB (not JWTs); access tokens are short-lived JWTs (15 min).
- `refresh_tokens` table has `family_id`; on reuse of a rotated token, revoke entire family and emit security alert.
- Audit log uses `BIGSERIAL` PK; app DB user has INSERT + SELECT only (no UPDATE/DELETE).
- Admin endpoints for cloud connections return only `provider, display_name, connected_at, status`, never `credentials_enc`.
- Every document/folder endpoint asserts `resource.user_id == current_user.id` via centralized `assert_document_access()`.
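The refresh-token rules in the list above (opaque tokens stored hashed, rotation, family revocation on reuse) can be sketched with an in-memory store. The class, field names other than `family_id`, and the alert list are hypothetical stand-ins for the `refresh_tokens` table and the security alert mechanism.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TokenRow:
    family_id: str
    user_id: str
    rotated: bool = False
    revoked: bool = False


class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table (illustrative only)."""

    def __init__(self) -> None:
        self.rows: Dict[str, TokenRow] = {}  # keyed by SHA-256 of the token
        self.alerts: List[str] = []          # family_ids with detected reuse

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, user_id: str, family_id: Optional[str] = None) -> str:
        token = str(uuid.uuid4())  # opaque value; only its hash is stored
        self.rows[self._hash(token)] = TokenRow(family_id or str(uuid.uuid4()), user_id)
        return token

    def rotate(self, token: str) -> Optional[str]:
        row = self.rows.get(self._hash(token))
        if row is None or row.revoked:
            return None
        if row.rotated:
            # Reuse of an already-rotated token: revoke the whole family.
            for other in self.rows.values():
                if other.family_id == row.family_id:
                    other.revoked = True
            self.alerts.append(row.family_id)
            return None
        row.rotated = True
        return self.issue(row.user_id, row.family_id)


store = RefreshTokenStore()
t1 = store.issue("user-1")
t2 = store.rotate(t1)      # normal rotation: t1 retired, t2 issued
replay = store.rotate(t1)  # replay of t1: family revoked, alert recorded
after = store.rotate(t2)   # t2 is now revoked as well
```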
5-Phase Migration Sequence
Phase 1 — Infrastructure Foundation
Wire PostgreSQL + MinIO into Docker Compose. Create db/models.py with full schema. Alembic initial migration. Async session dependency. No API changes — flat-file code still runs. Gate: all services boot cleanly; migrations apply; no behavior change.
Phase 2 — Users and Authentication
Users, refresh_tokens, quotas tables. Auth endpoints (register, login, refresh, TOTP, password reset, forced logout). TOTP with backup codes. Password reset does NOT auto-login (routes through TOTP gate). get_current_user + get_current_admin FastAPI dependencies. Admin user management endpoints. Vue auth store (Pinia memory + httpOnly cookie), Router guard, Axios interceptors. Gate: admin JWT returns 403 on document endpoints; backup codes issued and acknowledged at enrollment.
Phase 3 — Document Migration to PostgreSQL + MinIO
Dual-write window: new uploads write to both stores. Migration script copies historical flat-file data to PostgreSQL + MinIO. Count reconciliation assertion (go/no-go gate). Flip read source to PostgreSQL. Remove JSON write path. Presigned URL flow for all uploads/downloads. asyncio.to_thread() wrapping all MinIO SDK calls. Gate: concurrent upload test at 99% quota — only one succeeds.
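The `asyncio.to_thread()` wrapping mentioned above could follow the pattern in this sketch. `blocking_stat_object` is a hypothetical stand-in for a synchronous storage SDK call (it just sleeps), and the wrapper name is likewise an assumption; the sketch only shows that blocking calls run in worker threads while the event loop stays free.

```python
import asyncio
import time


def blocking_stat_object(bucket: str, key: str) -> dict:
    """Hypothetical stand-in for a synchronous object-storage SDK call."""
    time.sleep(0.05)  # simulate network latency that would block the loop
    return {"bucket": bucket, "key": key, "size": 1234}


async def stat_object(bucket: str, key: str) -> dict:
    """Run the blocking call in a worker thread so the event loop is not stalled."""
    return await asyncio.to_thread(blocking_stat_object, bucket, key)


async def main() -> list:
    # Three lookups overlap instead of running back to back.
    return await asyncio.gather(*(stat_object("documents", f"key-{i}") for i in range(3)))


results = asyncio.run(main())
print(results)
```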
Phase 4 — Multi-User Isolation, Quotas, Folders, Sharing
All queries gain WHERE user_id = current_user.id. Quota bar (80%/95% warnings). Folder CRUD + breadcrumbs. Document move + sort. Share by handle + "Shared with me" folder. Audit log wired to all events. Admin audit viewer. In-browser PDF preview. Gate: negative-access test (admin cannot retrieve any document content); quota reconciliation drift <1%.
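The quota bar thresholds referenced above (amber at 80%, red at 95%, rejection at 100%) reduce to a small classification function. The function name and the returned labels are assumptions; integer arithmetic is used so the boundaries are exact.

```python
def quota_level(used_bytes: int, limit_bytes: int) -> str:
    """Classify usage as 'ok', 'amber' (>=80%), 'red' (>=95%), or 'full' (>=100%)."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    if used_bytes >= limit_bytes:
        return "full"
    if used_bytes * 100 >= limit_bytes * 95:
        return "red"
    if used_bytes * 100 >= limit_bytes * 80:
        return "amber"
    return "ok"


FREE_TIER = 100 * 1024 * 1024  # 100 MB free tier from the feature list
level = quota_level(85 * 1024 * 1024, FREE_TIER)
```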
Phase 5 — Cloud Storage Backends
StorageBackend ABC + factory (mirrors ai/ pattern). MinIOBackend, OneDriveBackend, GoogleDriveBackend, NextcloudBackend, WebDAVBackend. OAuth2 connect/disconnect flows. Connection status UX. HKDF key derivation for all credentials. delete_user_files() on account deletion. Gate: mock invalid_grant → REQUIRES_REAUTH (not 500); account deletion asserts delete_user_files() per connection.
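One possible shape for the `StorageBackend` ABC and factory is sketched below. Only `StorageBackend` and `delete_user_files()` come from the plan; the `put`/`get` methods, the in-memory backend, and the registry are hypothetical, and the real interface would mirror whatever the existing `ai/` provider abstraction looks like.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class StorageBackend(ABC):
    """Interface every storage adapter would implement (method set is assumed)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def delete_user_files(self, user_id: str) -> int: ...


class InMemoryBackend(StorageBackend):
    """Toy backend used here in place of MinIO, OneDrive, and the others."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete_user_files(self, user_id: str) -> int:
        doomed = [k for k in self._objects if k.startswith(f"{user_id}/")]
        for k in doomed:
            del self._objects[k]
        return len(doomed)


_REGISTRY: Dict[str, Callable[[], StorageBackend]] = {"memory": InMemoryBackend}


def get_storage_backend(name: str) -> StorageBackend:
    """Factory: look up a backend by name and instantiate it."""
    factory = _REGISTRY.get(name)
    if factory is None:
        raise ValueError(f"unknown storage backend: {name}")
    return factory()


backend = get_storage_backend("memory")
backend.put("user-1/doc-1/a.pdf", b"%PDF")
backend.put("user-2/doc-9/b.pdf", b"%PDF")
deleted = backend.delete_user_files("user-1")
```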
Top 5 Pitfalls by Risk
| # | Pitfall | Severity | Fix |
|---|---|---|---|
| 1 | JWT in localStorage — XSS bypasses TOTP entirely | CRITICAL | httpOnly cookie for refresh, Pinia memory for access token |
| 2 | Quota race condition — concurrent uploads bypass limit | DATA INTEGRITY | Atomic PostgreSQL UPDATE ... RETURNING |
| 3 | TOTP bypass via password reset — full 2FA bypass via email compromise | SECURITY | Reset issues password_reset_pending state, not a full session |
| 4 | Single Fernet key for all cloud credentials — catastrophic on key leak | CATASTROPHIC | HKDF per-user derivation before first credential is stored |
| 5 | Path traversal in MinIO keys — cross-user data access | SECURITY | UUID-only MinIO keys; human filename in DB only; never reconstruct key from request parameters |
Confidence Assessment
| Area | Confidence | Notes |
|---|---|---|
| Stack | MEDIUM-HIGH | Core libraries confirmed from FastAPI official release notes (PyJWT, pwdlib, SQLAlchemy 2.0, psycopg v3). Cloud SDK minor versions — verify on PyPI before pinning. |
| Features | MEDIUM | Based on Google Drive, Dropbox, Box, Paperless-ngx knowledge through Aug 2025. |
| Architecture | HIGH | FastAPI DI pattern from official docs; S3 presigned URLs and atomic PostgreSQL quota update are industry standards. |
| Pitfalls | HIGH | OWASP cheat sheets; RFC 9700 refresh token rotation; GDPR Article 17 stable regulatory text. |
Overall: MEDIUM-HIGH
Gaps to Resolve During Planning
- Verify cloud SDK minor versions on PyPI before pinning
- Confirm PyOTP `valid_window` default in current docs (recommend `valid_window=1` for ±30s clock drift)
- Decide Celery + Redis vs pgqueuer during Phase 3 (depends on Redis availability in deployment target)
- Audit existing codebase for any existing bcrypt hashes before removing `passlib`
- Validate MinIO Docker Compose public endpoint in Phase 3 acceptance testing (presigned URLs must use host-accessible address, not internal Docker network name)
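To make the `valid_window=1` recommendation in the list above concrete, here is a standard-library sketch of TOTP verification with a tolerance window, following RFC 4226 and RFC 6238. It is not PyOTP's implementation; the function names and the `valid_window` parameter of this sketch are illustrative, and the shared secret is the ASCII test secret from the RFCs.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over the counter, then dynamic truncation."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, code: str, at_time: int,
                valid_window: int = 1, step: int = 30) -> bool:
    """Accept codes from the current time step and `valid_window` steps either side."""
    counter = int(at_time) // step
    first = max(0, counter - valid_window)
    return any(
        hmac.compare_digest(hotp(secret, c), code)
        for c in range(first, counter + valid_window + 1)
    )


secret = b"12345678901234567890"  # RFC test secret, not a real credential
code = hotp(secret, 1)            # code for the time step covering t = 30..59

on_time = verify_totp(secret, code, at_time=59)
one_step_late = verify_totp(secret, code, at_time=89)                 # within ±1 step
strict_late = verify_totp(secret, code, at_time=89, valid_window=0)   # no tolerance
two_steps_late = verify_totp(secret, code, at_time=119)               # outside window
```

With a window of 1, a code that is one 30-second step stale still verifies, while a code two steps old does not.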