
DocuVault — Research Synthesis

Last updated: 2026-05-21

Executive Summary

DocuVault is a brownfield migration of a functional single-user document scanner into a privacy-first, multi-user SaaS platform. The existing system already handles document upload, text extraction, and AI-based topic classification via a well-designed provider abstraction. This milestone replaces the flat-file JSON + filesystem persistence layer with PostgreSQL + MinIO, adds full multi-user authentication (JWT with httpOnly cookies, TOTP 2FA, refresh token rotation), per-user quota enforcement, folder organization, document sharing, and pluggable cloud storage backends — following the same adapter pattern already used for AI providers.


Confirmed Stack

Use

| Package | Version | Purpose |
|---|---|---|
| pyjwt[crypto] | ≥2.12.1 | JWT — current FastAPI docs recommendation; replaces python-jose |
| pwdlib[argon2] | ≥0.2.0 | Password hashing — Argon2 is memory-hard (OWASP 2025) |
| pyotp | ≥2.9.0 | TOTP 2FA — RFC 6238 reference |
| cryptography (Fernet) | ≥44.0.0 | Credential encryption — AES-128-CBC + HMAC-SHA256 |
| sqlalchemy[asyncio] | ≥2.0.36 | ORM — async-native; better brownfield fit than SQLModel |
| psycopg[asyncio,binary] | ≥3.2.0 | PostgreSQL driver — single driver for async FastAPI + sync Alembic |
| alembic | ≥1.14.0 | DB migrations |
| minio | ≥7.2.0 | Object storage — presigned URL flow (FastAPI never proxies bytes) |
| msgraph-sdk + azure-identity | ≥1.0.0 / ≥1.19.0 | OneDrive — official Microsoft SDK |
| google-api-python-client + google-auth-oauthlib | ≥2.150.0 / ≥1.2.0 | Google Drive v3 |
| webdav4 | ≥0.9.8 | Nextcloud + generic WebDAV |

Do NOT Use

  • python-jose — FastAPI dropped it; use PyJWT
  • passlib[bcrypt] for new hashes — maintenance mode; keep only for migrating existing hashes
  • tiangolo/uvicorn-gunicorn-fastapi Docker image — deprecated; use python:3.12-slim
  • localStorage for any auth token — XSS-accessible; httpOnly cookie for refresh, Pinia memory for access token
  • Single platform Fernet key for all users — HKDF per-user derivation required (catastrophic blast radius otherwise)
  • SQLModel for this migration — async story is thin; SQLAlchemy 2.0 async is better for brownfield

Table-Stakes Features for v1

Confirmed (from PROJECT.md)

  • Email + password registration + JWT sessions with refresh tokens
  • TOTP 2FA + backup codes (see gap below)
  • Password reset via email
  • Per-user isolated storage (100 MB free tier)
  • Quota tracking, enforcement at upload, visible indicator
  • Folder CRUD, move documents, "Shared with me" folder
  • Share by handle, view-only default, immediate revoke
  • Cloud OAuth2 connect flow + credential encryption
  • Admin: user management, quota adjustment, AI provider assignment
  • Audit log (append-only, metadata only) + admin viewer
  • In-browser PDF preview

Gaps — Items PROJECT.md Missed

  1. TOTP backup codes — Every competitor ships these. Without them, a lost phone permanently locks users out. Issue 8 to 10 single-use codes, stored hashed, and require the user to acknowledge them before TOTP is activated.

  2. Quota warnings at 80% and 95% — PROJECT.md specifies rejection at 100% only. Pre-emptive warnings are table stakes (Google Drive, Dropbox both do this). In-app banner at 80% (amber) and 95% (red), plus a specific error at 100% showing current usage, rejected file size, and a link to storage settings.

  3. "Sign out all devices" / session revocation — Users who believe their account is compromised need forced logout everywhere. Already handled by the refresh_tokens table — requires only an endpoint and a UI control.

  4. Breadcrumb navigation — Folder CRUD is in PROJECT.md but not the navigation UX. Required for nested folder usability.

  5. Cloud storage connection status indicator — PROJECT.md doesn't specify what happens when cloud storage is unreachable. Silent failure = data loss. Must show ACTIVE | REQUIRES_REAUTH | ERROR state and fall back to local storage with a clear message.

  6. Admin impersonation as an explicit architectural exclusion — Impersonation directly contradicts the privacy-first core value, so it must be documented as excluded by design, not just left unbuilt.
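
The thresholds in gap 2 can be sketched as a small pure function. This is an illustration only: the names `QuotaStatus`, `quota_status`, and `rejection_message`, the level labels, and the `/settings/storage` path are assumptions, not part of PROJECT.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaStatus:
    level: str            # "ok", "amber", "red", or "full"
    percent_used: float


def quota_status(used_bytes: int, limit_bytes: int) -> QuotaStatus:
    """Map usage to a banner level: amber at 80%, red at 95%, full at 100%."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    # Integer comparisons avoid floating-point surprises at the boundaries.
    if used_bytes >= limit_bytes:
        level = "full"
    elif used_bytes * 100 >= limit_bytes * 95:
        level = "red"
    elif used_bytes * 100 >= limit_bytes * 80:
        level = "amber"
    else:
        level = "ok"
    return QuotaStatus(level, round(100.0 * used_bytes / limit_bytes, 1))


def rejection_message(used_bytes: int, limit_bytes: int, file_bytes: int) -> str:
    """Error text for a rejected upload: usage, rejected size, settings link."""
    mib = 1024 * 1024
    return (
        f"Upload rejected: {file_bytes / mib:.1f} MiB does not fit. "
        f"You are using {used_bytes / mib:.1f} of {limit_bytes / mib:.1f} MiB. "
        "Manage storage at /settings/storage."  # hypothetical route
    )
```

Keeping the thresholds in one function would let the API response and the Vue quota bar agree on the same levels.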


Critical Architectural Decisions (Lock Before Building)

These cannot be safely retrofitted:

1. JWT in httpOnly cookies
   Refresh token: httpOnly; Secure; SameSite=Strict cookie. Access token: Pinia memory only. Never localStorage. Vue Router guard silently refreshes before redirecting to login. Axios withCredentials: true.

2. HKDF per-user key derivation for cloud credentials
   HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials"). Master key in CLOUD_CREDS_KEY env var only. Salt in users table. Design before writing the first line of credential storage — cannot be added later without re-encrypting everything.

3. Presigned MinIO URL flow
   FastAPI generates signed PUT URL → browser uploads directly to MinIO → FastAPI confirms object and commits quota atomically. FastAPI handles metadata only; bytes never pass through the API layer. Object keys: {user_id}/{document_id}/{uuid4()}{ext}. Human-readable filename in DB only.

4. Atomic PostgreSQL quota enforcement
   UPDATE quotas SET used_bytes = used_bytes + $delta WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes RETURNING used_bytes.
   If 0 rows returned, delete the MinIO object and return 413. Never perform quota arithmetic in Python between two DB statements.

5. BackgroundTasks replacement before horizontal scaling
   FastAPI BackgroundTasks is per-instance — classification tasks cannot distribute across containers. Replace with Celery + Redis or pgqueuer (PostgreSQL-backed, no Redis dependency) before scaling to N instances. Decide during Phase 3 planning.
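
As an illustration of decision 2, the per-user derivation could look roughly like the sketch below. It is a standard-library rendering of HKDF-SHA256 (RFC 5869) so that it runs without dependencies; in the real code base the HKDF class from the cryptography package listed in the stack would normally be used instead. The function names and the sample master key are assumptions.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256 (RFC 5869): extract a PRK, then expand it."""
    if length > 255 * 32:
        raise ValueError("requested length too large for HKDF-SHA256")
    # Extract: the salt is the HMAC key, the master key is the message.
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def user_fernet_key(master_key: bytes, user_salt: bytes) -> bytes:
    """Derive a per-user key in the format Fernet expects:
    32 raw bytes, URL-safe base64 encoded."""
    raw = hkdf_sha256(master_key, salt=user_salt, info=b"cloud-credentials")
    return base64.urlsafe_b64encode(raw)


# Hypothetical values; the real master key would come from CLOUD_CREDS_KEY.
master = b"\x01" * 32
key_a = user_fernet_key(master, b"user-a-salt")
key_b = user_fernet_key(master, b"user-b-salt")
```

Because each user's key depends on that user's salt, a leaked derived key exposes only one user's credentials, and the master key never encrypts anything directly.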

Additional locked decisions:

  • Refresh tokens are opaque UUIDs stored hashed in DB (not JWTs); access tokens are short-lived JWTs (15 min).
  • refresh_tokens table has family_id — on reuse of a rotated token, revoke entire family and emit security alert.
  • Audit log uses BIGSERIAL PK; app DB user has INSERT + SELECT only (no UPDATE/DELETE).
  • Admin endpoints for cloud connections return only provider, display_name, connected_at, status — never credentials_enc.
  • Every document/folder endpoint asserts resource.user_id == current_user.id via centralized assert_document_access().
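
The rotation and reuse rule for refresh tokens can be sketched with an in-memory stand-in for the refresh_tokens table. Everything here apart from family_id is an assumption made for illustration: the class names, the SHA-256 token hash, and the use of an exception in place of the security alert.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class RefreshRow:
    family_id: str
    user_id: str
    rotated: bool = False   # True once exchanged for a successor
    revoked: bool = False


class ReuseDetected(Exception):
    """Stands in for 'revoke family and emit security alert'."""


class RefreshStore:
    def __init__(self) -> None:
        self.rows: dict[str, RefreshRow] = {}   # keyed by token hash

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, user_id: str, family_id: Optional[str] = None) -> str:
        token = str(uuid.uuid4())               # opaque, not a JWT
        family = family_id or str(uuid.uuid4())
        self.rows[self._hash(token)] = RefreshRow(family, user_id)
        return token                            # only the hash is stored

    def rotate(self, token: str) -> str:
        row = self.rows.get(self._hash(token))
        if row is None or row.revoked:
            raise PermissionError("unknown or revoked refresh token")
        if row.rotated:
            # An already-rotated token was presented again: revoke the family.
            for other in self.rows.values():
                if other.family_id == row.family_id:
                    other.revoked = True
            raise ReuseDetected(row.family_id)
        row.rotated = True
        return self.issue(row.user_id, row.family_id)
```

A "sign out all devices" endpoint would amount to marking every row for a user as revoked, which is why the table already covers that feature.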

5-Phase Migration Sequence

Phase 1 — Infrastructure Foundation

Wire PostgreSQL + MinIO into Docker Compose. Create db/models.py with full schema. Alembic initial migration. Async session dependency. No API changes — flat-file code still runs. Gate: all services boot cleanly; migrations apply; no behavior change.

Phase 2 — Users and Authentication

Users, refresh_tokens, quotas tables. Auth endpoints (register, login, refresh, TOTP, password reset, forced logout). TOTP with backup codes. Password reset does NOT auto-login (routes through TOTP gate). get_current_user + get_current_admin FastAPI dependencies. Admin user management endpoints. Vue auth store (Pinia memory + httpOnly cookie), Router guard, Axios interceptors. Gate: admin JWT returns 403 on document endpoints; backup codes issued and acknowledged at enrollment.
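
A minimal sketch of the backup-code handling in this phase, assuming 10 codes of 10 characters and a plain SHA-256 hash at rest. The alphabet, the lengths, and the function names are illustrative choices; the Argon2 hasher from pwdlib could be substituted for SHA-256.

```python
import hashlib
import hmac
import secrets

# 32 symbols, omitting lookalikes (0/O, 1/I): about 5 bits per character.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def generate_backup_codes(count: int = 10, length: int = 10) -> list[str]:
    """Return plaintext codes to show the user once, at enrollment."""
    return [
        "".join(secrets.choice(ALPHABET) for _ in range(length))
        for _ in range(count)
    ]


def hash_code(code: str) -> str:
    """Normalize user input, then hash; only this value is stored."""
    normalized = code.replace("-", "").replace(" ", "").upper()
    return hashlib.sha256(normalized.encode()).hexdigest()


def redeem(code: str, stored_hashes: set[str]) -> bool:
    """Consume a code: True at most once per issued code."""
    candidate = hash_code(code)
    match = next(
        (h for h in stored_hashes if hmac.compare_digest(candidate, h)), None
    )
    if match is None:
        return False
    stored_hashes.remove(match)   # single use
    return True
```

In the database version, the removal would be a DELETE (or a used_at update) in the same transaction that completes the login.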

Phase 3 — Document Migration to PostgreSQL + MinIO

Dual-write window: new uploads write to both stores. Migration script copies historical flat-file data to PostgreSQL + MinIO. Count reconciliation assertion (go/no-go gate). Flip read source to PostgreSQL. Remove JSON write path. Presigned URL flow for all uploads/downloads. asyncio.to_thread() wrapping all MinIO SDK calls. Gate: concurrent upload test at 99% quota — only one succeeds.
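
The Phase 3 gate relies on the quota check and the increment happening in one statement. The sketch below shows that shape using sqlite3 from the standard library as a stand-in for PostgreSQL, with the row count in place of RETURNING. The table layout and the `reserve_quota` name are assumptions, and the two calls run sequentially, so this illustrates the conditional update, not a true concurrency test.

```python
import sqlite3


def reserve_quota(conn: sqlite3.Connection, user_id: int, delta: int) -> bool:
    """Add delta to used_bytes only if the result stays within limit_bytes.

    The comparison and the increment are one UPDATE, so no other writer
    can slip in between a read and a write.
    """
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    return cur.rowcount == 1   # 0 rows changed means the quota would overflow


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas ("
    "user_id INTEGER PRIMARY KEY, "
    "used_bytes INTEGER NOT NULL, "
    "limit_bytes INTEGER NOT NULL)"
)
conn.execute("INSERT INTO quotas VALUES (1, 99, 100)")   # user at 99%

first = reserve_quota(conn, 1, 1)    # fits exactly
second = reserve_quota(conn, 1, 1)   # would exceed the limit
print(first, second)                 # True False
```

When the reservation fails, the caller would delete the already-uploaded MinIO object and return 413, as described in the locked decisions.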

Phase 4 — Multi-User Isolation, Quotas, Folders, Sharing

All queries gain WHERE user_id = current_user.id. Quota bar (80%/95% warnings). Folder CRUD + breadcrumbs. Document move + sort. Share by handle + "Shared with me" folder. Audit log wired to all events. Admin audit viewer. In-browser PDF preview. Gate: negative-access test (admin cannot retrieve any document content); quota reconciliation drift <1%.
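
The negative-access gate can be pictured with a bare-bones version of the centralized ownership check. The `User` and `Document` shapes, the `AccessDenied` exception, and the function signature are assumptions; share grants are left out to keep the sketch short.

```python
from dataclasses import dataclass


class AccessDenied(Exception):
    """Stand-in for the HTTP error the API layer would return."""


@dataclass(frozen=True)
class User:
    id: int
    is_admin: bool = False


@dataclass(frozen=True)
class Document:
    id: int
    user_id: int


def assert_document_access(document: Document, current_user: User) -> Document:
    """Return the document only if the caller owns it.

    Admin status is deliberately ignored: admins manage accounts and
    quotas but never read document content.
    """
    if document.user_id != current_user.id:
        raise AccessDenied(
            f"user {current_user.id} may not access document {document.id}"
        )
    return document


owner = User(id=1)
admin = User(id=2, is_admin=True)
doc = Document(id=10, user_id=1)
```

Calling `assert_document_access(doc, admin)` raises `AccessDenied`, which is the behavior the Phase 4 gate is meant to verify end to end.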

Phase 5 — Cloud Storage Backends

StorageBackend ABC + factory (mirrors ai/ pattern). MinIOBackend, OneDriveBackend, GoogleDriveBackend, NextcloudBackend, WebDAVBackend. OAuth2 connect/disconnect flows. Connection status UX. HKDF key derivation for all credentials. delete_user_files() on account deletion. Gate: mock invalid_grant → REQUIRES_REAUTH (not 500); account deletion asserts delete_user_files() per connection.
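
One possible shape for the StorageBackend ABC and its factory is sketched below. Only the StorageBackend name and delete_user_files() come from the plan; the other methods, the in-memory backend, and the registry are hypothetical, and the real interface would likely be async.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Common interface every storage provider adapter implements."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def delete_user_files(self, user_id: str) -> int:
        """Remove everything stored for a user; return the object count."""


class InMemoryBackend(StorageBackend):
    """Test double standing in for MinIOBackend and the cloud adapters."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def delete_user_files(self, user_id: str) -> int:
        doomed = [k for k in self._objects if k.startswith(f"{user_id}/")]
        for k in doomed:
            del self._objects[k]
        return len(doomed)


# Hypothetical registry; real entries would map provider names to adapters.
_REGISTRY: dict[str, type[StorageBackend]] = {"memory": InMemoryBackend}


def get_backend(provider: str) -> StorageBackend:
    """Factory: look up the adapter class by provider name."""
    backend_cls = _REGISTRY.get(provider)
    if backend_cls is None:
        raise ValueError(f"unknown storage provider: {provider}")
    return backend_cls()
```

With this layout, the account-deletion gate reduces to calling delete_user_files() once per connected backend and asserting on the results.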


Top 5 Pitfalls by Risk

| # | Pitfall | Severity | Fix |
|---|---|---|---|
| 1 | JWT in localStorage — XSS bypasses TOTP entirely | CRITICAL | httpOnly cookie for refresh, Pinia memory for access token |
| 2 | Quota race condition — concurrent uploads bypass limit | DATA INTEGRITY | Atomic PostgreSQL UPDATE ... RETURNING |
| 3 | TOTP bypass via password reset — full 2FA bypass via email compromise | SECURITY | Reset issues password_reset_pending state, not a full session |
| 4 | Single Fernet key for all cloud credentials — catastrophic on key leak | CATASTROPHIC | HKDF per-user derivation before first credential is stored |
| 5 | Path traversal in MinIO keys — cross-user data access | SECURITY | UUID-only MinIO keys; human filename in DB only; never reconstruct key from request parameters |
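
For pitfall 5, a key builder following the {user_id}/{document_id}/{uuid4()}{ext} layout might look like this sketch. Treating the IDs as UUIDs, the extension allow-list, and the `build_object_key` name are all assumptions.

```python
import uuid
from pathlib import PurePosixPath

# Hypothetical allow-list; anything else is stored without an extension.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".txt"}


def build_object_key(user_id: uuid.UUID, document_id: uuid.UUID, filename: str) -> str:
    """Build a storage key that contains no user-controlled path segments.

    Only the file extension is taken from the uploaded filename, and only
    when it is on the allow-list. The readable name belongs in the database.
    """
    ext = PurePosixPath(filename.replace("\\", "/")).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        ext = ""
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


uid, did = uuid.uuid4(), uuid.uuid4()
key = build_object_key(uid, did, "../../etc/passwd.PDF")
```

The key is generated once at upload time and stored with the document row, so later requests look it up by document ID instead of rebuilding it from request parameters.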

Confidence Assessment

| Area | Confidence | Notes |
|---|---|---|
| Stack | MEDIUM-HIGH | Core libraries confirmed from FastAPI official release notes (PyJWT, pwdlib, SQLAlchemy 2.0, psycopg v3). Cloud SDK minor versions — verify on PyPI before pinning. |
| Features | MEDIUM | Based on Google Drive, Dropbox, Box, Paperless-ngx knowledge through Aug 2025. |
| Architecture | HIGH | FastAPI DI pattern from official docs; S3 presigned URLs and atomic PostgreSQL quota update are industry standards. |
| Pitfalls | HIGH | OWASP cheat sheets; RFC 9700 refresh token rotation; GDPR Article 17 stable regulatory text. |

Overall: MEDIUM-HIGH


Gaps to Resolve During Planning

  • Verify cloud SDK minor versions on PyPI before pinning
  • Confirm PyOTP valid_window default in current docs (recommend valid_window=1 for ±30s clock drift)
  • Decide Celery + Redis vs pgqueuer during Phase 3 (depends on Redis availability in deployment target)
  • Audit existing codebase for any existing bcrypt hashes before removing passlib
  • Validate MinIO Docker Compose public endpoint in Phase 3 acceptance testing (presigned URLs must use host-accessible address, not internal Docker network name)