# DocuVault — Research Synthesis
_Last updated: 2026-05-21_
## Executive Summary
DocuVault is a brownfield migration of a functional single-user document scanner into a privacy-first, multi-user SaaS platform. The existing system already handles document upload, text extraction, and AI-based topic classification via a well-designed provider abstraction. This milestone replaces the flat-file JSON + filesystem persistence layer with PostgreSQL + MinIO, adds full multi-user authentication (JWT with httpOnly cookies, TOTP 2FA, refresh token rotation), per-user quota enforcement, folder organization, document sharing, and pluggable cloud storage backends — following the same adapter pattern already used for AI providers.
---
## Confirmed Stack
### Use
| Package | Version | Purpose |
|---|---|---|
| `pyjwt[crypto]` | ≥2.12.1 | JWT — current FastAPI docs recommendation; replaces python-jose |
| `pwdlib[argon2]` | ≥0.2.0 | Password hashing — Argon2 is memory-hard (OWASP 2025) |
| `pyotp` | ≥2.9.0 | TOTP 2FA — RFC 6238 reference |
| `cryptography` (Fernet) | ≥44.0.0 | Credential encryption — AES-128-CBC + HMAC-SHA256 |
| `sqlalchemy[asyncio]` | ≥2.0.36 | ORM — async-native; better brownfield fit than SQLModel |
| `psycopg[asyncio,binary]` | ≥3.2.0 | PostgreSQL driver — single driver for async FastAPI + sync Alembic |
| `alembic` | ≥1.14.0 | DB migrations |
| `minio` | ≥7.2.0 | Object storage — presigned URL flow (FastAPI never proxies bytes) |
| `msgraph-sdk` + `azure-identity` | ≥1.0.0 / ≥1.19.0 | OneDrive — official Microsoft SDK |
| `google-api-python-client` + `google-auth-oauthlib` | ≥2.150.0 / ≥1.2.0 | Google Drive v3 |
| `webdav4` | ≥0.9.8 | Nextcloud + generic WebDAV |
### Do NOT Use
- `python-jose` — FastAPI dropped it; use PyJWT
- `passlib[bcrypt]` for new hashes — maintenance mode; keep only for migrating existing hashes
- `tiangolo/uvicorn-gunicorn-fastapi` Docker image — deprecated; use `python:3.12-slim`
- `localStorage` for any auth token — XSS-accessible; httpOnly cookie for refresh, Pinia memory for access token
- Single platform Fernet key for all users — HKDF per-user derivation required (catastrophic blast radius otherwise)
- `SQLModel` for this migration — async story is thin; SQLAlchemy 2.0 async is better for brownfield
---
## Table-Stakes Features for v1
### Confirmed (from PROJECT.md)
- Email + password registration + JWT sessions with refresh tokens
- TOTP 2FA + backup codes *(see gap below)*
- Password reset via email
- Per-user isolated storage (100 MB free tier)
- Quota tracking, enforcement at upload, visible indicator
- Folder CRUD, move documents, "Shared with me" folder
- Share by handle, view-only default, immediate revoke
- Cloud OAuth2 connect flow + credential encryption
- Admin: user management, quota adjustment, AI provider assignment
- Audit log (append-only, metadata only) + admin viewer
- In-browser PDF preview
### Gaps — Items PROJECT.md Missed
1. **TOTP backup codes** — Every competitor ships these. Without them, a lost phone permanently locks users out. Issue 8 to 10 single-use codes, store them hashed, and require the user to acknowledge them before TOTP is activated.
2. **Quota warnings at 80% and 95%** — PROJECT.md specifies rejection at 100% only. Pre-emptive warnings are table stakes (Google Drive, Dropbox both do this). In-app banner at 80% (amber) and 95% (red), plus a specific error at 100% showing current usage, rejected file size, and a link to storage settings.
3. **"Sign out all devices" / session revocation** — Users who believe their account is compromised need forced logout everywhere. The `refresh_tokens` table already supports this; it needs only an endpoint and a UI control.
4. **Breadcrumb navigation** — Folder CRUD is in PROJECT.md but not the navigation UX. Required for nested folder usability.
5. **Cloud storage connection status indicator** — PROJECT.md doesn't specify what happens when cloud storage is unreachable. Silent failure = data loss. Must show `ACTIVE | REQUIRES_REAUTH | ERROR` state and fall back to local storage with a clear message.
6. **Admin impersonation is an explicit architectural exclusion** — It must be documented as excluded, not just left unbuilt, because impersonation directly contradicts the privacy-first core value.
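
The backup-code gap in item 1 can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function names, the code length, the alphabet, and the choice of SHA-256 for storage are all assumptions (a password hasher would also work; a fast hash is tolerable here only because the codes are high-entropy random strings).

```python
import hashlib
import hmac
import secrets

CODE_COUNT = 10  # assumed; any small fixed count works the same way
CODE_ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"  # skips 0/O and 1/I lookalikes


def generate_backup_codes(count: int = CODE_COUNT, length: int = 10) -> list[str]:
    """Return plaintext codes to display to the user exactly once."""
    return [
        "".join(secrets.choice(CODE_ALPHABET) for _ in range(length))
        for _ in range(count)
    ]


def hash_code(code: str) -> str:
    """Normalize and hash a code; only this digest would be persisted."""
    return hashlib.sha256(code.strip().upper().encode()).hexdigest()


def redeem(code: str, stored_hashes: set[str]) -> bool:
    """Consume a code. A code that matched once never matches again."""
    candidate = hash_code(code)
    match = next(
        (h for h in stored_hashes if hmac.compare_digest(candidate, h)), None
    )
    if match is None:
        return False
    stored_hashes.discard(match)  # single use: remove on success
    return True
```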
---
## Critical Architectural Decisions (Lock Before Building)
These cannot be safely retrofitted:
**1. JWT in httpOnly cookies**
Refresh token: `httpOnly; Secure; SameSite=Strict` cookie. Access token: Pinia memory only. Never `localStorage`. Vue Router guard silently refreshes before redirecting to login. Axios `withCredentials: true`.
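
As a rough sketch of the cookie attributes named above, the following renders a `Set-Cookie` value with the standard library. The cookie name `refresh_token`, the `/api/auth` path, and the 7-day lifetime are illustrative assumptions, not values from PROJECT.md. In a FastAPI handler the same attributes would typically be passed to `response.set_cookie(...)` instead.

```python
from http.cookies import SimpleCookie


def refresh_cookie_header(token: str, max_age_seconds: int = 7 * 24 * 3600) -> str:
    """Build the Set-Cookie value for an opaque refresh token."""
    cookie = SimpleCookie()
    cookie["refresh_token"] = token          # hypothetical cookie name
    morsel = cookie["refresh_token"]
    morsel["httponly"] = True                # not readable from JavaScript
    morsel["secure"] = True                  # sent over HTTPS only
    morsel["samesite"] = "Strict"            # never sent on cross-site requests
    morsel["path"] = "/api/auth"             # assumed: limit to auth endpoints
    morsel["max-age"] = max_age_seconds
    return morsel.OutputString()


if __name__ == "__main__":
    print(refresh_cookie_header("opaque-token"))
```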
**2. HKDF per-user key derivation for cloud credentials**
`HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials")`. Master key in `CLOUD_CREDS_KEY` env var only. Salt in users table. Design before writing the first line of credential storage — cannot be added later without re-encrypting everything.
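
To make the derivation concrete, here is a minimal sketch of HKDF-SHA256 (RFC 5869) written with the standard library so it runs without dependencies. In the real service the `cryptography` package's own HKDF and Fernet classes would presumably take its place. The function names and the use of the user ID string as salt bytes are assumptions for illustration.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF per RFC 5869: extract a pseudorandom key, then expand it."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                   # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def user_fernet_key(master_key: bytes, user_id: str) -> bytes:
    """Derive a per-user key in the url-safe base64 form that Fernet expects."""
    raw = hkdf_sha256(master_key, salt=user_id.encode(), info=b"cloud-credentials")
    return base64.urlsafe_b64encode(raw)


if __name__ == "__main__":
    master = b"\x00" * 32  # placeholder; the real value comes from CLOUD_CREDS_KEY
    print(user_fernet_key(master, "user-a"))
```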
**3. Presigned MinIO URL flow**
FastAPI generates signed PUT URL → browser uploads directly to MinIO → FastAPI confirms object and commits quota atomically. FastAPI handles metadata only; bytes never pass through the API layer. Object keys: `{user_id}/{document_id}/{uuid4()}{ext}`. Human-readable filename in DB only.
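
The object key layout can be sketched as below. The point is that nothing from the client-supplied filename reaches the key except a vetted extension. The function name and the extension allow-list are assumptions for illustration.

```python
import uuid
from pathlib import PurePosixPath

# Assumed allow-list; anything else is stored without an extension.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".txt"}


def build_object_key(user_id: uuid.UUID, document_id: uuid.UUID, filename: str) -> str:
    """Return `{user_id}/{document_id}/{uuid4()}{ext}` for a new upload."""
    # Normalize Windows separators so only the final path component is inspected.
    ext = PurePosixPath(filename.replace("\\", "/")).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        ext = ""
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


if __name__ == "__main__":
    print(build_object_key(uuid.uuid4(), uuid.uuid4(), "../../etc/passwd.pdf"))
```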
**4. Atomic PostgreSQL quota enforcement**
`UPDATE quotas SET used_bytes = used_bytes + $delta WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes RETURNING used_bytes`. If 0 rows returned, delete the MinIO object and return 413. Never perform quota arithmetic in Python between two DB statements.
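
The conditional update can be exercised locally with a small stand-in. This sketch swaps PostgreSQL for an in-memory SQLite database and checks `rowcount` instead of `RETURNING`, purely so it runs without a server; the table layout and function name are assumptions. The property it demonstrates is the same: the check and the increment happen in one statement.

```python
import sqlite3

RESERVE_SQL = """
UPDATE quotas
   SET used_bytes = used_bytes + :delta
 WHERE user_id = :uid
   AND used_bytes + :delta <= limit_bytes
"""


def reserve_quota(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    """Atomically add `delta` bytes; False means the upload must be rejected."""
    cur = conn.execute(RESERVE_SQL, {"uid": user_id, "delta": delta})
    conn.commit()
    return cur.rowcount == 1  # zero rows updated: quota would be exceeded


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with one user at 99 of 100 bytes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
    )
    conn.execute("INSERT INTO quotas VALUES ('u1', 99, 100)")
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print(reserve_quota(db, "u1", 2), reserve_quota(db, "u1", 1))
```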
**5. BackgroundTasks replacement before horizontal scaling**
FastAPI `BackgroundTasks` is per-instance — classification tasks cannot distribute across containers. Replace with Celery + Redis or pgqueuer (PostgreSQL-backed, no Redis dependency) before scaling to N instances. Decide during Phase 3 planning.
**Additional locked decisions:**
- Refresh tokens are opaque UUIDs stored hashed in DB (not JWTs); access tokens are short-lived JWTs (15 min).
- `refresh_tokens` table has `family_id` — on reuse of a rotated token, revoke entire family and emit security alert.
- Audit log uses `BIGSERIAL` PK; app DB user has INSERT + SELECT only (no UPDATE/DELETE).
- Admin endpoints for cloud connections return only `provider, display_name, connected_at, status` — never `credentials_enc`.
- Every document/folder endpoint asserts `resource.user_id == current_user.id` via centralized `assert_document_access()`.
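
The refresh-token rules in the list above (opaque UUIDs, stored hashed, `family_id` revocation on reuse) can be sketched with an in-memory stand-in for the `refresh_tokens` table. The class and field names other than `family_id` are hypothetical, and the `alerts` list merely stands in for whatever security alert mechanism the project adopts.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class RefreshRow:
    family_id: str
    user_id: str
    rotated: bool = False
    revoked: bool = False


class RefreshStore:
    """In-memory stand-in for the refresh_tokens table, keyed by token hash."""

    def __init__(self) -> None:
        self.rows: dict[str, RefreshRow] = {}
        self.alerts: list[str] = []

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, user_id: str, family_id: Optional[str] = None) -> str:
        """Create a new opaque token; only its hash is kept."""
        token = str(uuid.uuid4())
        self.rows[self._hash(token)] = RefreshRow(family_id or str(uuid.uuid4()), user_id)
        return token

    def rotate(self, token: str) -> Optional[str]:
        """Exchange a token for its successor, or return None if rejected."""
        row = self.rows.get(self._hash(token))
        if row is None or row.revoked:
            return None
        if row.rotated:
            # Reuse of an already-rotated token: revoke the whole family.
            for other in self.rows.values():
                if other.family_id == row.family_id:
                    other.revoked = True
            self.alerts.append(f"refresh token reuse for user {row.user_id}")
            return None
        row.rotated = True
        return self.issue(row.user_id, row.family_id)
```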
---
## 5-Phase Migration Sequence
### Phase 1 — Infrastructure Foundation
Wire PostgreSQL + MinIO into Docker Compose. Create `db/models.py` with full schema. Alembic initial migration. Async session dependency. No API changes — flat-file code still runs. Gate: all services boot cleanly; migrations apply; no behavior change.
### Phase 2 — Users and Authentication
Users, refresh_tokens, quotas tables. Auth endpoints (register, login, refresh, TOTP, password reset, forced logout). TOTP with backup codes. Password reset does NOT auto-login (routes through TOTP gate). `get_current_user` + `get_current_admin` FastAPI dependencies. Admin user management endpoints. Vue auth store (Pinia memory + httpOnly cookie), Router guard, Axios interceptors. Gate: admin JWT returns 403 on document endpoints; backup codes issued and acknowledged at enrollment.
### Phase 3 — Document Migration to PostgreSQL + MinIO
Dual-write window: new uploads write to both stores. Migration script copies historical flat-file data to PostgreSQL + MinIO. Count reconciliation assertion (go/no-go gate). Flip read source to PostgreSQL. Remove JSON write path. Presigned URL flow for all uploads/downloads. `asyncio.to_thread()` wrapping all MinIO SDK calls. Gate: concurrent upload test at 99% quota — only one succeeds.
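
The `asyncio.to_thread()` wrapping mentioned above might look like the following sketch. `blocking_stat` is a hypothetical stand-in for a synchronous MinIO SDK call; no real SDK method is used here. Each blocking call runs in a worker thread, so the event loop stays free to serve other requests.

```python
import asyncio
import time


def blocking_stat(object_key: str) -> dict:
    """Stand-in for a synchronous SDK call that would block the event loop."""
    time.sleep(0.05)
    return {"key": object_key, "size": 1234}


async def stat_many(keys: list[str]) -> list[dict]:
    """Run the blocking calls in worker threads and gather results in order."""
    return await asyncio.gather(*(asyncio.to_thread(blocking_stat, k) for k in keys))


if __name__ == "__main__":
    print(asyncio.run(stat_many(["a", "b", "c"])))
```

`asyncio.gather` preserves argument order, so results line up with the input keys even though the threads may finish in any order.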
### Phase 4 — Multi-User Isolation, Quotas, Folders, Sharing
All queries gain `WHERE user_id = current_user.id`. Quota bar (80%/95% warnings). Folder CRUD + breadcrumbs. Document move + sort. Share by handle + "Shared with me" folder. Audit log wired to all events. Admin audit viewer. In-browser PDF preview. Gate: negative-access test (admin cannot retrieve any document content); quota reconciliation drift <1%.
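
The 80%/95% thresholds for the quota bar could be computed as in this sketch. The enum and function names are hypothetical; the thresholds come from the gap list earlier in this document. Integer arithmetic is used so that exactly 80% or 95% is never misclassified by float rounding.

```python
from enum import Enum


class QuotaLevel(str, Enum):
    OK = "ok"
    WARNING = "warning"    # amber banner at 80%
    CRITICAL = "critical"  # red banner at 95%
    FULL = "full"          # uploads rejected at 100%


def quota_level(used_bytes: int, limit_bytes: int) -> QuotaLevel:
    """Map current usage to the banner state shown in the UI."""
    if limit_bytes <= 0 or used_bytes >= limit_bytes:
        return QuotaLevel.FULL
    if used_bytes * 100 >= limit_bytes * 95:
        return QuotaLevel.CRITICAL
    if used_bytes * 100 >= limit_bytes * 80:
        return QuotaLevel.WARNING
    return QuotaLevel.OK


if __name__ == "__main__":
    mb = 1024 * 1024
    print(quota_level(85 * mb, 100 * mb))
```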
### Phase 5 — Cloud Storage Backends
`StorageBackend` ABC + factory (mirrors `ai/` pattern). `MinIOBackend`, `OneDriveBackend`, `GoogleDriveBackend`, `NextcloudBackend`, `WebDAVBackend`. OAuth2 connect/disconnect flows. Connection status UX. HKDF key derivation for all credentials. `delete_user_files()` on account deletion. Gate: mock `invalid_grant` → REQUIRES_REAUTH (not 500); account deletion asserts `delete_user_files()` per connection.
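
One possible shape for the `StorageBackend` ABC and its factory is sketched below. Only `StorageBackend`, `delete_user_files()`, and the three status values come from this document; `put`, `get`, `register`, `get_backend`, `status_for_oauth_error`, and the in-memory backend are hypothetical names used to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict, Type


class ConnectionStatus(str, Enum):
    ACTIVE = "ACTIVE"
    REQUIRES_REAUTH = "REQUIRES_REAUTH"
    ERROR = "ERROR"


class StorageBackend(ABC):
    """Interface each backend implements (put/get are assumed method names)."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def delete_user_files(self, user_id: str) -> int:
        """Remove every object under the user's prefix; return the count."""


_REGISTRY: Dict[str, Type[StorageBackend]] = {}


def register(name: str):
    """Class decorator that adds a backend to the factory registry."""
    def wrap(cls: Type[StorageBackend]) -> Type[StorageBackend]:
        _REGISTRY[name] = cls
        return cls
    return wrap


@register("memory")
class InMemoryBackend(StorageBackend):
    """Test double standing in for MinIO or a cloud provider."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def delete_user_files(self, user_id: str) -> int:
        doomed = [k for k in self.objects if k.startswith(f"{user_id}/")]
        for k in doomed:
            del self.objects[k]
        return len(doomed)


def get_backend(name: str) -> StorageBackend:
    """Factory: instantiate a registered backend by name."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown storage backend: {name}")
    return _REGISTRY[name]()


def status_for_oauth_error(error_code: str) -> ConnectionStatus:
    """Map an OAuth2 error code to a connection state instead of a 500."""
    if error_code == "invalid_grant":
        return ConnectionStatus.REQUIRES_REAUTH
    return ConnectionStatus.ERROR
```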
---
## Top 5 Pitfalls by Risk
| # | Pitfall | Severity | Fix |
|---|---|---|---|
| 1 | JWT in localStorage — XSS bypasses TOTP entirely | CRITICAL | httpOnly cookie for refresh, Pinia memory for access token |
| 2 | Quota race condition — concurrent uploads bypass limit | DATA INTEGRITY | Atomic PostgreSQL `UPDATE ... RETURNING` |
| 3 | TOTP bypass via password reset — full 2FA bypass via email compromise | SECURITY | Reset issues `password_reset_pending` state, not a full session |
| 4 | Single Fernet key for all cloud credentials — catastrophic on key leak | CATASTROPHIC | HKDF per-user derivation before first credential is stored |
| 5 | Path traversal in MinIO keys — cross-user data access | SECURITY | UUID-only MinIO keys; human filename in DB only; never reconstruct key from request parameters |
---
## Confidence Assessment
| Area | Confidence | Notes |
|---|---|---|
| Stack | MEDIUM-HIGH | Core libraries confirmed from FastAPI official release notes (PyJWT, pwdlib, SQLAlchemy 2.0, psycopg v3). Cloud SDK minor versions — verify on PyPI before pinning. |
| Features | MEDIUM | Based on knowledge of Google Drive, Dropbox, Box, and Paperless-ngx through Aug 2025. |
| Architecture | HIGH | FastAPI DI pattern from official docs; S3 presigned URLs and atomic PostgreSQL quota update are industry standards. |
| Pitfalls | HIGH | OWASP cheat sheets; RFC 9700 (refresh token rotation); GDPR Article 17 (stable regulatory text). |
**Overall: MEDIUM-HIGH**
---
## Gaps to Resolve During Planning
- Verify cloud SDK minor versions on PyPI before pinning
- Confirm PyOTP `valid_window` default in current docs (recommend `valid_window=1` for ±30s clock drift)
- Decide Celery + Redis vs pgqueuer during Phase 3 (depends on Redis availability in deployment target)
- Audit existing codebase for any existing bcrypt hashes before removing `passlib`
- Validate MinIO Docker Compose public endpoint in Phase 3 acceptance testing (presigned URLs must use host-accessible address, not internal Docker network name)
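
To illustrate what a verification window of 1 means for the TOTP item in the list above, here is a standard-library sketch of RFC 6238 verification. It does not use PyOTP and does not settle what PyOTP's default is; the function names are hypothetical, and the `valid_window` parameter is only modeled on the idea of accepting that many 30-second steps on either side of the current one.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """RFC 4226 HOTP: HMAC-SHA1 over the counter, dynamically truncated."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, code: str, at: int, valid_window: int = 0, step: int = 30) -> bool:
    """Accept the code for the current step or up to `valid_window` steps either side."""
    counter = int(at) // step
    return any(
        hmac.compare_digest(hotp(secret, counter + drift), code)
        for drift in range(-valid_window, valid_window + 1)
        if counter + drift >= 0  # no time steps before the epoch
    )


if __name__ == "__main__":
    secret = b"12345678901234567890"  # RFC 4226 test secret, not a real key
    code = hotp(secret, 59 // 30)
    print(verify_totp(secret, code, at=89, valid_window=0))
    print(verify_totp(secret, code, at=89, valid_window=1))
```

A code generated at second 59 is rejected at second 89 with a window of 0 but accepted with a window of 1, which is the clock-drift tolerance the recommendation is after.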