# DocuVault — Research Synthesis

_Last updated: 2026-05-21_

## Executive Summary

DocuVault is a brownfield migration of a working single-user document scanner into a privacy-first, multi-user SaaS platform. The existing system already handles document upload, text extraction, and AI-based topic classification through a well-designed provider abstraction. This milestone replaces the flat-file JSON and filesystem persistence layer with PostgreSQL and MinIO. It also adds full multi-user authentication (JWT with httpOnly cookies, TOTP 2FA, refresh token rotation), per-user quota enforcement, folder organization, document sharing, and pluggable cloud storage backends that follow the same adapter pattern already used for AI providers.

---

## Confirmed Stack
### Use

| Package | Version | Purpose |
|---|---|---|
| `pyjwt[crypto]` | ≥2.12.1 | JWT: current FastAPI docs recommendation; replaces python-jose |
| `pwdlib[argon2]` | ≥0.2.0 | Password hashing: Argon2 is memory-hard (OWASP 2025) |
| `pyotp` | ≥2.9.0 | TOTP 2FA: implements RFC 6238 |
| `cryptography` (Fernet) | ≥44.0.0 | Credential encryption: AES-128-CBC + HMAC-SHA256 |
| `sqlalchemy[asyncio]` | ≥2.0.36 | ORM: async-native; better brownfield fit than SQLModel |
| `psycopg[binary]` | ≥3.2.0 | PostgreSQL driver: single driver for async FastAPI + sync Alembic (async support is built into psycopg 3) |
| `alembic` | ≥1.14.0 | DB migrations |
| `minio` | ≥7.2.0 | Object storage: presigned URL flow (FastAPI never proxies bytes) |
| `msgraph-sdk` + `azure-identity` | ≥1.0.0 / ≥1.19.0 | OneDrive: official Microsoft SDK |
| `google-api-python-client` + `google-auth-oauthlib` | ≥2.150.0 / ≥1.2.0 | Google Drive v3 |
| `webdav4` | ≥0.9.8 | Nextcloud + generic WebDAV |

### Do NOT Use

- `python-jose`: FastAPI dropped it; use PyJWT
- `passlib[bcrypt]` for new hashes: maintenance mode; keep only for migrating existing hashes
- `tiangolo/uvicorn-gunicorn-fastapi` Docker image: deprecated; use `python:3.12-slim`
- `localStorage` for any auth token: XSS-accessible; use an httpOnly cookie for the refresh token and Pinia memory for the access token
- A single platform-wide Fernet key for all users: HKDF per-user derivation is required (the blast radius of a key leak is catastrophic otherwise)
- `SQLModel` for this migration: its async story is thin; SQLAlchemy 2.0 async is a better fit for brownfield work

---

## Table-Stakes Features for v1
### Confirmed (from PROJECT.md)

- Email + password registration + JWT sessions with refresh tokens
- TOTP 2FA + backup codes *(see gap below)*
- Password reset via email
- Per-user isolated storage (100 MB free tier)
- Quota tracking, enforcement at upload, visible indicator
- Folder CRUD, move documents, "Shared with me" folder
- Share by handle, view-only default, immediate revoke
- Cloud OAuth2 connect flow + credential encryption
- Admin: user management, quota adjustment, AI provider assignment
- Audit log (append-only, metadata only) + admin viewer
- In-browser PDF preview

### Gaps — Items PROJECT.md Missed

1. **TOTP backup codes**: Every competitor ships these. Without them, a lost phone permanently locks the user out. Issue 8–10 single-use codes, store them hashed, and require the user to acknowledge them before TOTP is activated.

2. **Quota warnings at 80% and 95%**: PROJECT.md specifies rejection at 100% only. Pre-emptive warnings are table stakes (Google Drive and Dropbox both show them). Show an in-app banner at 80% (amber) and 95% (red), plus a specific error at 100% that reports current usage, the rejected file size, and a link to storage settings.

3. **"Sign out all devices" / session revocation**: Users who believe their account is compromised need to force a logout everywhere. The `refresh_tokens` table already supports this; it needs only an endpoint and a UI control.

4. **Breadcrumb navigation**: Folder CRUD is in PROJECT.md, but the navigation UX is not. Breadcrumbs are required for nested folders to be usable.

5. **Cloud storage connection status indicator**: PROJECT.md doesn't specify what happens when cloud storage is unreachable. Silent failure means data loss. The UI must show an `ACTIVE | REQUIRES_REAUTH | ERROR` state and fall back to local storage with a clear message.

6. **Admin impersonation is an explicit architectural exclusion**: It must be documented as excluded, not just left unbuilt, because it directly contradicts the privacy-first core value.
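
The backup-code requirement in item 1 can be sketched with the standard library alone. This is an illustration under stated assumptions, not DocuVault's implementation: the names `generate_backup_codes` and `redeem_backup_code`, the 10-character code format, and the use of salted PBKDF2 are all hypothetical choices. In the real stack, the Argon2 hasher already chosen for passwords would be a natural alternative.

```python
import hashlib
import hmac
import secrets

PBKDF2_ITERATIONS = 100_000  # assumption: tune for the deployment hardware


def _hash_code(code: str, salt: bytes) -> bytes:
    """Derive a slow, salted hash so a leaked table is costly to brute-force."""
    return hashlib.pbkdf2_hmac("sha256", code.encode(), salt, PBKDF2_ITERATIONS)


def generate_backup_codes(count: int = 10) -> tuple[list[str], bytes, set[bytes]]:
    """Return (plaintext codes to show once, salt, hashes to persist)."""
    salt = secrets.token_bytes(16)
    codes = [secrets.token_hex(5) for _ in range(count)]  # 10 hex characters each
    hashes = {_hash_code(code, salt) for code in codes}
    return codes, salt, hashes


def redeem_backup_code(code: str, salt: bytes, stored: set[bytes]) -> bool:
    """Consume a code; each code succeeds at most once."""
    candidate = _hash_code(code.strip().lower(), salt)
    for stored_hash in list(stored):
        if hmac.compare_digest(candidate, stored_hash):
            stored.discard(stored_hash)  # single use: remove after redemption
            return True
    return False
```

Only the hashes and salt would be persisted; the plaintext codes are displayed once at enrollment, which is where the user acknowledgement step fits.
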

---

## Critical Architectural Decisions (Lock Before Building)

These cannot be safely retrofitted:

**1. JWT in httpOnly cookies**
Refresh token: `httpOnly; Secure; SameSite=Strict` cookie. Access token: Pinia memory only. Never `localStorage`. The Vue Router guard attempts a silent refresh before redirecting to login. Axios is configured with `withCredentials: true`.
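
The cookie attributes above can be illustrated with Python's standard `http.cookies` module. This is a sketch only: the cookie name `refresh_token` and the `/auth` path scope are assumptions, not part of the decision. In the FastAPI app itself, `Response.set_cookie` accepts equivalent `httponly`, `secure`, and `samesite` keyword arguments.

```python
from http.cookies import SimpleCookie

REFRESH_COOKIE_NAME = "refresh_token"  # hypothetical cookie name
REFRESH_COOKIE_PATH = "/auth"          # hypothetical: send the cookie to auth routes only


def build_refresh_cookie(token: str, max_age_seconds: int) -> str:
    """Return a Set-Cookie header value for the refresh token."""
    cookie = SimpleCookie()
    cookie[REFRESH_COOKIE_NAME] = token
    morsel = cookie[REFRESH_COOKIE_NAME]
    morsel["httponly"] = True      # not readable from JavaScript
    morsel["secure"] = True        # sent over HTTPS only
    morsel["samesite"] = "Strict"  # not sent on cross-site requests
    morsel["path"] = REFRESH_COOKIE_PATH
    morsel["max-age"] = max_age_seconds
    return morsel.OutputString()


header_value = build_refresh_cookie("abc", 7 * 24 * 3600)
```
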

**2. HKDF per-user key derivation for cloud credentials**
`HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials")`. Master key in `CLOUD_CREDS_KEY` env var only. Salt in users table. Design this before writing the first line of credential storage; it cannot be added later without re-encrypting everything.
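
To make the derivation concrete, here is a standard-library sketch of HKDF with SHA-256 as defined in RFC 5869. The helper names `hkdf_sha256` and `derive_user_fernet_key` are hypothetical. The production code would more plausibly use the `HKDF` class from the `cryptography` package already in the stack and pass the result to `Fernet`; this version only shows what the formula computes.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_fernet_key(master_key: bytes, user_id_bytes: bytes) -> bytes:
    """Derive a per-user 32-byte key in the url-safe base64 form Fernet keys use."""
    raw = hkdf_sha256(master_key, salt=user_id_bytes, info=b"cloud-credentials")
    return base64.urlsafe_b64encode(raw)
```

Because each user's key depends on that user's salt, decrypting one user's credentials gives no direct way to decrypt another's. The master key itself remains the single secret that must never leak.
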

**3. Presigned MinIO URL flow**
FastAPI generates a signed PUT URL → browser uploads directly to MinIO → FastAPI confirms the object and commits quota atomically. FastAPI handles metadata only; bytes never pass through the API layer. Object keys: `{user_id}/{document_id}/{uuid4()}{ext}`. Human-readable filename in DB only.
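
The object-key format can be sketched as follows. The function name `build_object_key`, the assumption that user and document IDs are UUIDs, and the extension allow-list are all hypothetical. The point is that nothing from the user-supplied filename except a vetted extension reaches the key.

```python
import uuid
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".txt"}  # hypothetical allow-list


def build_object_key(user_id: uuid.UUID, document_id: uuid.UUID, filename: str) -> str:
    """Build `{user_id}/{document_id}/{uuid4()}{ext}` without reusing the filename."""
    ext = PurePosixPath(filename.replace("\\", "/")).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        ext = ""  # unknown or suspicious extension: store the object without one
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"
```

The human-readable filename would be stored in the database row for the document and used only for display and download headers.
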

**4. Atomic PostgreSQL quota enforcement**
`UPDATE quotas SET used_bytes = used_bytes + $delta WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes RETURNING used_bytes`. If no rows are returned, delete the MinIO object and return HTTP 413. Never perform quota arithmetic in Python between two DB statements.
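
The conditional update can be demonstrated end to end with an in-memory SQLite database standing in for PostgreSQL. That substitution is an assumption made so the sketch runs anywhere; it checks `rowcount` instead of `RETURNING`, and the function name `reserve_quota` is hypothetical.

```python
import sqlite3


def reserve_quota(conn: sqlite3.Connection, user_id: int, delta: int) -> bool:
    """Atomically add `delta` bytes; return False if that would exceed the limit."""
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means the upload must be rejected


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id INTEGER PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES (1, 99, 100)")  # user sits at 99% of quota

first = reserve_quota(conn, 1, 1)   # fits exactly
second = reserve_quota(conn, 1, 1)  # would exceed the limit, so no row is updated
```

The check and the increment happen in one statement, so two concurrent uploads cannot both pass a stale read of `used_bytes`.
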

**5. BackgroundTasks replacement before horizontal scaling**
FastAPI `BackgroundTasks` is per-instance, so classification tasks cannot be distributed across containers. Replace it with Celery + Redis or pgqueuer (PostgreSQL-backed, no Redis dependency) before scaling to N instances. Decide during Phase 3 planning.

**Additional locked decisions:**
- Refresh tokens are opaque UUIDs stored hashed in DB (not JWTs); access tokens are short-lived JWTs (15 min).
- `refresh_tokens` table has `family_id`; on reuse of a rotated token, revoke the entire family and emit a security alert.
- Audit log uses `BIGSERIAL` PK; app DB user has INSERT + SELECT only (no UPDATE/DELETE).
- Admin endpoints for cloud connections return only `provider, display_name, connected_at, status`, never `credentials_enc`.
- Every document/folder endpoint asserts `resource.user_id == current_user.id` via centralized `assert_document_access()`.
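
The refresh-token rules in the first two bullets can be sketched with an in-memory stand-in for the `refresh_tokens` table. The class and method names are hypothetical, and the security alert is reduced to a comment; only the rotation and family-revocation logic is shown.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class RefreshRecord:
    family_id: str
    revoked: bool = False  # set when the token is rotated or its family is revoked


class RefreshTokenStore:
    """In-memory stand-in for the `refresh_tokens` table, keyed by token hash."""

    def __init__(self) -> None:
        self._records: dict[str, RefreshRecord] = {}

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, family_id: Optional[str] = None) -> str:
        """Create an opaque token; only its hash is stored."""
        token = str(uuid.uuid4())
        record = RefreshRecord(family_id=family_id or str(uuid.uuid4()))
        self._records[self._hash(token)] = record
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Exchange a token for a new one in the same family."""
        record = self._records.get(self._hash(token))
        if record is None:
            return None
        if record.revoked:
            # Reuse of a rotated token: revoke the family (an alert would fire here).
            self.revoke_family(record.family_id)
            return None
        record.revoked = True
        return self.issue(record.family_id)

    def revoke_family(self, family_id: str) -> None:
        for record in self._records.values():
            if record.family_id == family_id:
                record.revoked = True
```

A "sign out all devices" endpoint could reuse the same idea by revoking every family that belongs to the user.
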

---

## 5-Phase Migration Sequence
### Phase 1 — Infrastructure Foundation
Wire PostgreSQL + MinIO into Docker Compose. Create `db/models.py` with full schema. Alembic initial migration. Async session dependency. No API changes; the flat-file code still runs. Gate: all services boot cleanly; migrations apply; no behavior change.

### Phase 2 — Users and Authentication
`users`, `refresh_tokens`, and `quotas` tables. Auth endpoints (register, login, refresh, TOTP, password reset, forced logout). TOTP with backup codes. Password reset does NOT auto-login (it routes through the TOTP gate). `get_current_user` + `get_current_admin` FastAPI dependencies. Admin user management endpoints. Vue auth store (Pinia memory + httpOnly cookie), Router guard, Axios interceptors. Gate: admin JWT returns 403 on document endpoints; backup codes issued and acknowledged at enrollment.

### Phase 3 — Document Migration to PostgreSQL + MinIO
Dual-write window: new uploads write to both stores. Migration script copies historical flat-file data to PostgreSQL + MinIO. Count reconciliation assertion (go/no-go gate). Flip read source to PostgreSQL. Remove JSON write path. Presigned URL flow for all uploads/downloads. `asyncio.to_thread()` wrapping all MinIO SDK calls. Gate: concurrent upload test at 99% quota — only one succeeds.
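
The `asyncio.to_thread()` wrapping mentioned above can be sketched with a fake blocking client. `FakeBlockingClient` is a hypothetical stand-in whose `stat_object` method mimics the shape of a synchronous SDK call; the wrapper pattern is the part that carries over.

```python
import asyncio
import time


class FakeBlockingClient:
    """Hypothetical stand-in for a synchronous object-storage SDK client."""

    def stat_object(self, bucket: str, key: str) -> dict:
        time.sleep(0.05)  # simulate a blocking network round trip
        return {"bucket": bucket, "key": key, "size": 1024}


async def stat_object_async(client: FakeBlockingClient, bucket: str, key: str) -> dict:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(client.stat_object, bucket, key)


async def main() -> list[dict]:
    client = FakeBlockingClient()
    # The three blocking calls overlap instead of running back to back.
    return await asyncio.gather(
        *(stat_object_async(client, "documents", f"key-{i}") for i in range(3))
    )


results = asyncio.run(main())
```
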
### Phase 4 — Multi-User Isolation, Quotas, Folders, Sharing
All queries gain `WHERE user_id = current_user.id`. Quota bar (80%/95% warnings). Folder CRUD + breadcrumbs. Document move + sort. Share by handle + "Shared with me" folder. Audit log wired to all events. Admin audit viewer. In-browser PDF preview. Gate: negative-access test (admin cannot retrieve any document content); quota reconciliation drift <1%.
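
The 80%/95% quota bar reduces to a small classification function. The names `QuotaLevel` and `quota_level` are hypothetical, and the colour mapping follows the amber/red banners described in the gaps list.

```python
from enum import Enum


class QuotaLevel(Enum):
    OK = "ok"
    AMBER = "amber"  # 80% of quota or more
    RED = "red"      # 95% of quota or more
    FULL = "full"    # at or over the limit; uploads are rejected


def quota_level(used_bytes: int, limit_bytes: int) -> QuotaLevel:
    """Classify usage for the quota bar."""
    if limit_bytes <= 0 or used_bytes >= limit_bytes:
        return QuotaLevel.FULL
    # Integer arithmetic avoids floating-point edge cases at the thresholds.
    if used_bytes * 100 >= limit_bytes * 95:
        return QuotaLevel.RED
    if used_bytes * 100 >= limit_bytes * 80:
        return QuotaLevel.AMBER
    return QuotaLevel.OK
```

This function only drives the display. Enforcement still belongs to the atomic database update, since a value read for display can be stale by the time an upload completes.
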
### Phase 5 — Cloud Storage Backends
`StorageBackend` ABC + factory (mirrors `ai/` pattern). `MinIOBackend`, `OneDriveBackend`, `GoogleDriveBackend`, `NextcloudBackend`, `WebDAVBackend`. OAuth2 connect/disconnect flows. Connection status UX. HKDF key derivation for all credentials. `delete_user_files()` on account deletion. Gate: mock `invalid_grant` → REQUIRES_REAUTH (not 500); account deletion asserts `delete_user_files()` per connection.
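
A minimal sketch of the ABC, factory, and status mapping might look like the following. Apart from `StorageBackend`, `delete_user_files()`, and the three status names, everything here is an assumption: the `InMemoryBackend` test double, the registry dictionary, and the helper `status_for_oauth_error` are illustrative only.

```python
from abc import ABC, abstractmethod
from enum import Enum


class ConnectionStatus(Enum):
    ACTIVE = "ACTIVE"
    REQUIRES_REAUTH = "REQUIRES_REAUTH"
    ERROR = "ERROR"


class StorageBackend(ABC):
    @abstractmethod
    def delete_user_files(self, user_id: str) -> int:
        """Delete every object owned by `user_id`; return the number removed."""


class InMemoryBackend(StorageBackend):
    """Hypothetical test double; real backends would wrap MinIO, OneDrive, and so on."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def delete_user_files(self, user_id: str) -> int:
        doomed = [key for key in self.objects if key.startswith(f"{user_id}/")]
        for key in doomed:
            del self.objects[key]
        return len(doomed)


_BACKENDS: dict[str, type[StorageBackend]] = {"memory": InMemoryBackend}


def get_backend(provider: str) -> StorageBackend:
    """Factory: look up a backend class by provider name."""
    try:
        return _BACKENDS[provider]()
    except KeyError:
        raise ValueError(f"unknown storage provider: {provider}") from None


def status_for_oauth_error(error_code: str) -> ConnectionStatus:
    """Map an OAuth2 token-endpoint error code to a connection status."""
    if error_code == "invalid_grant":
        return ConnectionStatus.REQUIRES_REAUTH  # the user must reconnect; not a 500
    return ConnectionStatus.ERROR
```
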

---

## Top 5 Pitfalls by Risk

| # | Pitfall | Severity | Fix |
|---|---|---|---|
| 1 | JWT in localStorage: XSS bypasses TOTP entirely | CRITICAL | httpOnly cookie for refresh token, Pinia memory for access token |
| 2 | Quota race condition: concurrent uploads bypass the limit | DATA INTEGRITY | Atomic PostgreSQL `UPDATE ... RETURNING` |
| 3 | TOTP bypass via password reset: email compromise becomes a full 2FA bypass | SECURITY | Reset issues a `password_reset_pending` state, not a full session |
| 4 | Single Fernet key for all cloud credentials: catastrophic on key leak | CATASTROPHIC | HKDF per-user derivation before the first credential is stored |
| 5 | Path traversal in MinIO keys: cross-user data access | SECURITY | UUID-only MinIO keys; human filename in DB only; never reconstruct a key from request parameters |

---

## Confidence Assessment

| Area | Confidence | Notes |
|---|---|---|
| Stack | MEDIUM-HIGH | Core libraries confirmed from FastAPI official release notes (PyJWT, pwdlib, SQLAlchemy 2.0, psycopg v3). Cloud SDK minor versions: verify on PyPI before pinning. |
| Features | MEDIUM | Based on Google Drive, Dropbox, Box, Paperless-ngx knowledge through Aug 2025. |
| Architecture | HIGH | FastAPI DI pattern from official docs; S3 presigned URLs and atomic PostgreSQL quota update are industry standards. |
| Pitfalls | HIGH | OWASP cheat sheets; RFC 9700 refresh token rotation; GDPR Article 17 stable regulatory text. |

**Overall: MEDIUM-HIGH**

---

## Gaps to Resolve During Planning

- Verify cloud SDK minor versions on PyPI before pinning
- Confirm PyOTP `valid_window` default in current docs (recommend `valid_window=1` for ±30s clock drift)
- Decide Celery + Redis vs pgqueuer during Phase 3 (depends on Redis availability in deployment target)
- Audit existing codebase for any existing bcrypt hashes before removing `passlib`
- Validate MinIO Docker Compose public endpoint in Phase 3 acceptance testing (presigned URLs must use host-accessible address, not internal Docker network name)
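
To show what a one-step verification window means for the `valid_window` item, here is a standard-library sketch of RFC 6238 TOTP with HMAC-SHA1. The function names `totp_at` and `verify_totp` are hypothetical. PyOTP exposes the same idea through the `valid_window` argument of `TOTP.verify`, which is what the application would call.

```python
import hashlib
import hmac
import struct


def totp_at(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute the RFC 6238 TOTP code (HMAC-SHA1) for a given Unix time."""
    counter = struct.pack(">Q", unix_time // step)
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation from RFC 4226
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10**digits).zfill(digits)


def verify_totp(secret: bytes, code: str, unix_time: int, window: int = 1, step: int = 30) -> bool:
    """Accept codes from up to `window` time steps before or after the current one."""
    return any(
        hmac.compare_digest(
            totp_at(secret, unix_time + drift * step, step).encode(), code.encode()
        )
        for drift in range(-window, window + 1)
    )


SECRET = b"12345678901234567890"  # the shared secret used in the RFC test vectors
code_at_59 = totp_at(SECRET, 59)
```

With `window=1`, a code generated one 30-second step earlier or later than the server's clock is still accepted; with `window=0`, only the current step matches.
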