docs: add domain research (4 dimensions + synthesis)

curo1305
2026-05-21 20:42:16 +02:00
parent 2a298a4276
commit daa7e0f289
5 changed files with 2406 additions and 0 deletions
+704
@@ -0,0 +1,704 @@
# Architecture Research
**Domain:** Multi-user SaaS document management platform (FastAPI + Vue 3 brownfield migration)
**Researched:** 2026-05-21
**Confidence:** HIGH (auth DI pattern confirmed via official FastAPI docs; storage/DB patterns are well-established S3/PostgreSQL engineering standards cross-verified against official MinIO and SQLAlchemy docs)
---
## Standard Architecture
### System Overview
```
┌──────────────────────────────────────────────────────────────────────┐
│ Browser (Vue 3 SPA) │
│ ┌───────────┐ ┌──────────────┐ ┌───────────┐ ┌──────────────┐ │
│ │ auth store│ │ docs store │ │quota store│ │settings store│ │
│ └─────┬─────┘ └──────┬───────┘ └─────┬─────┘ └──────┬───────┘ │
│ └───────────────┴────────────────┴────────────────┘ │
│ api/client.js (Bearer token injected) │
└───────────────────────────────────┬──────────────────────────────────┘
│ HTTPS/JSON + multipart
┌─────────▼─────────┐
│ Load Balancer │ (future; optional now)
└────────┬──────────┘
┌───────────────────┼───────────────────┐
│ │ │
┌──────────▼──────┐ ┌─────────▼──────┐ ┌────────▼───────┐
│ FastAPI inst 1 │ │ FastAPI inst 2 │ │ FastAPI inst N │
│ (stateless) │ │ (stateless) │ │ (stateless) │
└──────────┬───────┘ └────────┬────────┘ └────────┬───────┘
└───────────────────┼───────────────────┘
┌─────────▼──────────┐
│ Shared Services │
┌──────────┴──────────────────────┴─────────┐
│ │
┌──────────▼──────────┐ ┌────────────────▼──────┐
│ PostgreSQL │ │ MinIO │
│ (users, docs, meta, │ │ (object storage, │
│ quotas, audit) │ │ single bucket, │
└──────────────────────┘ │ prefix-per-user keys) │
└───────────────────────┘
┌──────────────────────────┼─────────────────┐
│ │ │
┌──────────▼────────┐ ┌────────────▼───────┐ ┌────▼──────┐
│ Cloud Storage │ │ OneDrive Adapter │ │ WebDAV │
│ Adapter (base) │ │ Google Drive │ │ Adapter │
└───────────────────┘ └────────────────────┘ └───────────┘
```
---
## Component Boundaries
| Component | Responsibility | Communicates With |
|-----------|---------------|-------------------|
| `api/auth.py` | Registration, login, token refresh, TOTP enroll/verify | `services/user_service.py`, DB |
| `api/documents.py` | Upload, list, get, delete, reclassify, share | `services/document_service.py`, quota dep |
| `api/folders.py` | Folder CRUD, move | `services/folder_service.py` |
| `api/storage_backends.py` | Connect/disconnect cloud accounts, list/browse | `services/cloud_service.py` |
| `api/admin.py` | User CRUD, quota adjustments, audit log, AI config | `services/admin_service.py` |
| `deps/auth.py` | `get_current_user` — verifies JWT, returns `User` model | DB, `jose`/`PyJWT` |
| `deps/quota.py` | `check_upload_quota` — reads the user's usage, raises 413 if the upload would exceed the limit | DB |
| `deps/db.py` | `get_db` — yields async SQLAlchemy session | PostgreSQL |
| `services/document_service.py` | Orchestrates extract → classify → store flow | `extractor`, `classifier`, `storage_service` |
| `services/storage_service.py` | Routes to MinIO or cloud adapter; enforces object key namespacing | MinIO, cloud adapters |
| `services/user_service.py` | Password hashing, TOTP provisioning, breach check | DB, `bcrypt`, `pyotp` |
| `services/quota_service.py` | Compute used bytes from DB, update after upload/delete | DB |
| `services/audit_service.py` | Append-only audit log writes | DB |
| `services/cloud_service.py` | Manage encrypted cloud credentials, proxy operations | Cloud adapters, DB |
| `storage/base.py` | `StorageBackend` ABC (mirrors `ai/base.py` pattern) | — |
| `storage/minio_backend.py` | MinIO S3 implementation | MinIO |
| `storage/onedrive_backend.py` | OneDrive Graph API implementation | Microsoft Graph |
| `storage/gdrive_backend.py` | Google Drive API implementation | Google Drive API |
| `storage/nextcloud_backend.py` | Nextcloud WebDAV implementation | WebDAV |
| `db/models.py` | SQLAlchemy ORM models | PostgreSQL |
| `db/migrations/` | Alembic migration history | — |
---
## Recommended Project Structure
```
backend/
├── main.py # FastAPI app factory, middleware, router registration
├── config.py # pydantic-settings: DB URL, MinIO creds, secret keys
├── deps/
│ ├── auth.py # get_current_user, get_current_admin
│ ├── db.py # get_db (async session dependency)
│ └── quota.py # check_upload_quota (raises 413 if exceeded)
├── api/
│ ├── auth.py # /auth/register, /auth/login, /auth/refresh, /auth/totp/*
│ ├── documents.py # /documents/* (existing routes, now user-scoped)
│ ├── folders.py # /folders/*
│ ├── storage_backends.py # /storage-backends/* (cloud account management)
│ └── admin.py # /admin/* (users, quotas, audit, AI config)
├── services/
│ ├── document_service.py # upload orchestration (extract → classify → store → quota)
│ ├── storage_service.py # routes uploads to correct StorageBackend
│ ├── quota_service.py # read/write quota usage
│ ├── user_service.py # user creation, password, TOTP
│ ├── audit_service.py # audit log writes
│ └── cloud_service.py # cloud backend credential management
├── storage/ # cloud storage adapter layer (mirrors ai/)
│ ├── base.py # StorageBackend ABC
│ ├── __init__.py # get_storage_backend() factory
│ ├── minio_backend.py # default local-S3 backend
│ ├── onedrive_backend.py
│ ├── gdrive_backend.py
│ └── nextcloud_backend.py # WebDAV-based
├── ai/ # unchanged — existing provider abstraction
│ └── ...
├── db/
│ ├── models.py # all SQLAlchemy ORM models
│ ├── session.py # async engine + sessionmaker
│ └── migrations/ # Alembic env + version scripts
└── tests/
```
### Structure Rationale
- **`deps/`:** FastAPI dependency functions isolated from service logic. Auth, DB session, and quota are injected independently — routes compose them without coupling.
- **`storage/`:** Direct mirror of `ai/` module. Same ABC + factory pattern. Existing team mental model applies immediately.
- **`db/`:** ORM models and session config separated from services, ensuring migrations can be run independently of app startup.
---
## Architectural Patterns
### Pattern 1: JWT Verification via Dependency Injection (not middleware)
**What:** JWT parsing and user lookup happens in `deps/auth.py::get_current_user`, injected via `Depends()` per route.
**When to use:** All authenticated routes. Admin routes additionally inject `get_current_admin` which calls `get_current_user` then checks `user.role == "admin"`.
**Trade-offs:** Unauthenticated routes (health check, login, register) require no special exclusion logic. Middleware-based auth forces you to maintain an allowlist of public routes — that list inevitably drifts. DI is opt-in per route, which is safer.
**Example:**
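```python
# deps/auth.py
from typing import Annotated

from fastapi import Depends, HTTPException, status
from fastapi.security import OAuth2PasswordBearer
from jose import JWTError, jwt  # python-jose; PyJWT raises jwt.InvalidTokenError instead
from sqlalchemy.ext.asyncio import AsyncSession

from config import settings  # assumed name of the pydantic-settings object
from db.models import User
from deps.db import get_db

oauth2_scheme = OAuth2PasswordBearer(tokenUrl="/auth/login")

# One 401 reused for every failure, so callers cannot tell which check failed.
credentials_exception = HTTPException(
    status_code=status.HTTP_401_UNAUTHORIZED,
    detail="Could not validate credentials",
    headers={"WWW-Authenticate": "Bearer"},
)

async def get_current_user(
    token: Annotated[str, Depends(oauth2_scheme)],
    db: Annotated[AsyncSession, Depends(get_db)],
) -> User:
    try:
        payload = jwt.decode(token, settings.jwt_secret, algorithms=["HS256"])
        user_id: str = payload.get("sub")
        if user_id is None:
            raise credentials_exception
    except JWTError:
        raise credentials_exception
    # Load the user so deactivated accounts are rejected even with a valid token.
    user = await db.get(User, user_id)
    if user is None or not user.is_active:
        raise credentials_exception
    return user

# api/documents.py
@router.get("/documents")
async def list_documents(
    current_user: Annotated[User, Depends(get_current_user)],
    db: Annotated[AsyncSession, Depends(get_db)],
):
    ...
```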
**Confirmed:** HIGH confidence — FastAPI official documentation explicitly recommends this pattern over middleware for auth.
---
### Pattern 2: Refresh Token Rotation
**What:** Short-lived access tokens (15 min) + long-lived refresh tokens (30 days) stored in `refresh_tokens` table. On every `/auth/refresh` call, the old token is invalidated and a new pair is issued.
**When to use:** Always, for multi-user SaaS. Prevents a stolen token from granting indefinite access.
**Trade-offs:** Requires `refresh_tokens` DB table and one extra DB write per refresh. The alternative (long-lived JWTs) cannot be revoked without a blocklist, which has the same cost.
**Implementation notes:**
- Refresh token = opaque random UUID (not JWT) — store hashed in DB alongside `user_id`, `expires_at`, `revoked`
- Access token = JWT with `sub=user_id`, `exp=now+15m`, `jti=uuid` (for an optional future blocklist)
- On logout or password change: set `revoked=true` on all user's refresh tokens
- On TOTP failure after password success: do not issue any token; log failed_mfa audit event
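
The rotation rules above can be sketched with an in-memory stand-in for the `refresh_tokens` table. This is a minimal illustration, not the service code: `RefreshTokenStore`, `RefreshRow`, and the method names are hypothetical, and a real implementation would do the same steps in one DB transaction.

```python
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta

REFRESH_TTL = timedelta(days=30)


@dataclass
class RefreshRow:
    user_id: str
    expires_at: datetime
    revoked: bool = False


class RefreshTokenStore:
    """Hypothetical in-memory stand-in for the refresh_tokens table, keyed by token hash."""

    def __init__(self) -> None:
        self._rows: dict[str, RefreshRow] = {}

    @staticmethod
    def _hash(token: str) -> str:
        # Only the SHA-256 digest is stored, never the token itself.
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, user_id: str, now: datetime) -> str:
        token = uuid.uuid4().hex  # opaque random value, not a JWT
        self._rows[self._hash(token)] = RefreshRow(user_id, now + REFRESH_TTL)
        return token

    def rotate(self, token: str, now: datetime) -> str:
        row = self._rows.get(self._hash(token))
        if row is None or row.revoked or row.expires_at <= now:
            raise PermissionError("invalid refresh token")
        row.revoked = True  # the presented token can never be used again
        return self.issue(row.user_id, now)

    def revoke_all(self, user_id: str) -> None:
        # Logout or password change: invalidate every token of this user.
        for row in self._rows.values():
            if row.user_id == user_id:
                row.revoked = True
```

Presenting an already-rotated token raises `PermissionError`, which is the point at which the API would answer 401 and could log a token-reuse audit event.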
---
### Pattern 3: MinIO Presigned URL Flow (preferred over streaming proxy)
**What:** FastAPI generates a short-lived presigned PUT URL from MinIO; the browser uploads directly to MinIO. For downloads, FastAPI generates a presigned GET URL and redirects.
**When to use:** All document uploads and downloads where the client is a browser on the same network as MinIO (typical Docker Compose deployment). Use streaming proxy only when MinIO is not reachable from the browser (e.g., MinIO is behind an internal network).
**Trade-offs:**
- Presigned URL avoids buffering the file through FastAPI — reduces memory pressure and latency significantly for large files.
- The FastAPI instance must be able to reach MinIO to generate the URL, but does not need to handle the byte stream.
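- For Docker Compose: FastAPI reaches MinIO over the internal Docker network, but the presigned URL is consumed by the browser, so it must point at a MinIO host and port the browser can reach (the published MinIO port). The signature covers the host, so generate the URL against that public endpoint rather than the internal service name.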
**Flow:**
```
1. POST /documents/upload-url (FastAPI, authenticated)
   → quota check
   → generate presigned PUT URL (expires 5 min)
   → return { upload_url, object_key, document_id }
2. PUT <upload_url> (browser → MinIO directly)
   → no FastAPI involvement
3. POST /documents/{id}/confirm (FastAPI, authenticated)
   → verify object exists in MinIO
   → trigger text extraction + classification (background task)
   → update document status to "processing"
   → return document record
**Object key namespace:** `{user_id}/{document_id}/{filename}` — ensures per-user isolation without separate buckets. One bucket (`docuvault-documents`) is sufficient; IAM policies or object key prefix checks enforce isolation in code.
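
A minimal sketch of how such a key could be built, assuming the client-supplied filename is untrusted. `build_object_key` is a hypothetical helper name; the only rule it adds to the text above is that directory parts in the filename are dropped.

```python
from pathlib import PurePosixPath
from uuid import UUID


def build_object_key(user_id: UUID, document_id: UUID, filename: str) -> str:
    """Return '{user_id}/{document_id}/{filename}' using only the final path component."""
    # Normalise Windows separators, then drop any directory parts the client sent.
    safe_name = PurePosixPath(filename.replace("\\", "/")).name
    if safe_name in ("", ".", ".."):
        raise ValueError("filename must contain a usable final component")
    return f"{user_id}/{document_id}/{safe_name}"
```

Because the prefix comes from the authenticated user and the generated document ID, nothing the client sends can place the object outside `{user_id}/`.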
**Presigned GET for downloads:**
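```python
from datetime import timedelta

from fastapi.responses import RedirectResponse

# Inside the download route, after the ownership/share check has passed.
# The minio SDK is synchronous; see Integration Points for the asyncio.to_thread() wrapper.
url = minio_client.presigned_get_object(
    bucket_name="docuvault-documents",
    object_name=f"{user_id}/{document_id}/{filename}",
    expires=timedelta(minutes=30),
)
return RedirectResponse(url)
```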
**Confidence:** HIGH for the S3 presigned URL pattern (standard across all S3-compatible stores). MinIO Python SDK `presigned_put_object` and `presigned_get_object` methods confirmed as stable API.
---
### Pattern 4: Cloud Storage Adapter (StorageBackend ABC)
**What:** A `StorageBackend` ABC in `storage/base.py` defines the interface. Each cloud integration implements it. `storage_service.py` routes to the correct backend based on the user's `default_storage_backend` setting.
**When to use:** Any operation that reads or writes document bytes. The service layer never calls MinIO or Google Drive directly — always via the adapter.
**Interface:**
```python
# storage/base.py
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    @abstractmethod
    async def put_object(self, key: str, data: bytes, content_type: str) -> str:
        """Store object, return canonical reference (URL or key)."""

    @abstractmethod
    async def get_object(self, key: str) -> bytes:
        """Retrieve object bytes."""

    @abstractmethod
    async def delete_object(self, key: str) -> None:
        """Delete object."""

    @abstractmethod
    async def get_presigned_url(self, key: str, expires_seconds: int = 3600) -> str | None:
        """Return a time-limited direct URL, or None if backend doesn't support it."""

    @abstractmethod
    async def list_objects(self, prefix: str) -> list[str]:
        """List keys under prefix."""

    @abstractmethod
    async def health_check(self) -> bool:
        """Verify connectivity."""
```
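
One benefit of the ABC is that services can be tested without MinIO. The sketch below is a hypothetical test double, `InMemoryBackend`, not part of the planned module list; it repeats a condensed copy of the interface so it runs on its own.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional


class StorageBackend(ABC):
    """Condensed copy of the interface above so this sketch is self-contained."""

    @abstractmethod
    async def put_object(self, key: str, data: bytes, content_type: str) -> str: ...
    @abstractmethod
    async def get_object(self, key: str) -> bytes: ...
    @abstractmethod
    async def delete_object(self, key: str) -> None: ...
    @abstractmethod
    async def get_presigned_url(self, key: str, expires_seconds: int = 3600) -> Optional[str]: ...
    @abstractmethod
    async def list_objects(self, prefix: str) -> list[str]: ...
    @abstractmethod
    async def health_check(self) -> bool: ...


class InMemoryBackend(StorageBackend):
    """Hypothetical test double: keeps objects in a dict instead of calling MinIO."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def put_object(self, key, data, content_type):
        self._objects[key] = data
        return key

    async def get_object(self, key):
        return self._objects[key]

    async def delete_object(self, key):
        self._objects.pop(key, None)

    async def get_presigned_url(self, key, expires_seconds=3600):
        return None  # no direct URL; callers fall back to proxying bytes

    async def list_objects(self, prefix):
        return sorted(k for k in self._objects if k.startswith(prefix))

    async def health_check(self):
        return True
```

Returning `None` from `get_presigned_url` exercises the same fallback path that WebDAV-style backends without signed URLs would take.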
**Factory:**
```python
# storage/__init__.py
def get_storage_backend(user: User, credentials: dict | None) -> StorageBackend:
    backend_type = user.default_storage_backend  # "minio" | "onedrive" | "gdrive" | ...
    if backend_type == "minio":
        return MinIOBackend(settings.minio_endpoint, ...)
    elif backend_type == "onedrive":
        return OneDriveBackend(credentials)  # decrypted before passing in
    ...
```
**Credentials encryption:** Cloud OAuth tokens and refresh tokens are stored encrypted with Fernet symmetric encryption. The key is in `CLOUD_CREDS_KEY` env var. Encryption/decryption happens in `cloud_service.py` before the credentials are passed to the backend constructor — the adapter itself always receives plaintext credentials and never touches the DB.
---
### Pattern 5: Storage Quota Enforcement via Service Layer (not middleware, not DB constraint)
**What:** Quota is checked in `deps/quota.py::check_upload_quota` — a FastAPI dependency injected on upload routes. After successful upload, `quota_service.increment_usage(user_id, bytes)` is called.
**Where NOT to enforce:**
- **Not in middleware:** Middleware runs before routing and dependency resolution, so it does not know the authenticated user (or their quota) without re-implementing auth. In the presigned-URL flow the file body never passes through FastAPI at all.
- **Not as a DB constraint:** `CHECK (used_bytes <= limit_bytes)` would make the DB reject the commit after the object has already been uploaded to MinIO, leaving an orphaned object with no committed metadata.
**Correct sequence:**
```
1. Pre-upload: deps/quota.py reads quotas.used_bytes and the declared upload size
   (the `size` field of the upload-url request, or Content-Length on the proxy path)
   → if (used + incoming) > limit_bytes: raise HTTP 413 with quota detail
2. Upload proceeds to MinIO (presigned URL or proxy)
3. Post-upload: quota_service.increment_usage atomically:
     UPDATE quotas SET used_bytes = used_bytes + $delta
     WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes
     RETURNING used_bytes
   → if no rows returned: another concurrent upload exceeded quota; delete from MinIO + 413
```
**Why atomic update with check:** Two simultaneous uploads can both pass the pre-check. The atomic UPDATE with WHERE guard prevents double-spend. This is the correct pattern for optimistic quota enforcement under concurrency.
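
The guarded UPDATE can be tried out locally. The sketch below uses SQLite from the standard library purely as a stand-in for PostgreSQL, and checks `rowcount` instead of `RETURNING`; the function name matches `quota_service.increment_usage` from the text, but the signature is an assumption.

```python
import sqlite3


def increment_usage(conn: sqlite3.Connection, user_id: str, delta: int) -> bool:
    """Add delta to used_bytes only if the result stays within limit_bytes."""
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (delta, user_id, delta),
    )
    conn.commit()
    # rowcount == 0 means the guard rejected the write:
    # the caller deletes the uploaded object and returns 413.
    return cur.rowcount == 1


# Demo table with a 100-byte limit (SQLite stand-in for the quotas table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, limit_bytes INTEGER, used_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES ('u1', 100, 0)")
```

With this setup, a first 60-byte increment succeeds and a second one is rejected, because check and write happen in a single statement rather than as a read followed by a write.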
---
## PostgreSQL Schema Design
### Core Tables
```sql
-- Users
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
handle TEXT UNIQUE NOT NULL, -- @username for sharing
email TEXT UNIQUE NOT NULL,
password_hash TEXT NOT NULL, -- bcrypt
totp_secret TEXT, -- NULL = TOTP not enabled
totp_enabled BOOLEAN NOT NULL DEFAULT FALSE,
role TEXT NOT NULL DEFAULT 'user', -- 'user' | 'admin'
is_active BOOLEAN NOT NULL DEFAULT TRUE,
ai_provider TEXT, -- NULL = use system default
ai_model TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Quotas (1:1 with users; separate for clean admin queries)
CREATE TABLE quotas (
user_id UUID PRIMARY KEY REFERENCES users(id) ON DELETE CASCADE,
limit_bytes BIGINT NOT NULL DEFAULT 104857600, -- 100 MB
used_bytes BIGINT NOT NULL DEFAULT 0,
CONSTRAINT no_negative_usage CHECK (used_bytes >= 0)
);
-- Refresh tokens
CREATE TABLE refresh_tokens (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
token_hash TEXT NOT NULL UNIQUE, -- SHA-256 of the opaque token
expires_at TIMESTAMPTZ NOT NULL,
revoked BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON refresh_tokens(user_id, revoked);
-- Folders
CREATE TABLE folders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
parent_id UUID REFERENCES folders(id) ON DELETE CASCADE, -- NULL = root
name TEXT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
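UNIQUE (user_id, parent_id, name) -- NULL parent_id values compare as distinct, so root-level names need NULLS NOT DISTINCT (PostgreSQL 15+) or a partial unique index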
);
-- Documents
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
folder_id UUID REFERENCES folders(id) ON DELETE SET NULL,
filename TEXT NOT NULL,
content_type TEXT NOT NULL,
size_bytes BIGINT NOT NULL DEFAULT 0,
storage_backend TEXT NOT NULL DEFAULT 'minio', -- 'minio' | 'onedrive' | ...
object_key TEXT NOT NULL, -- backend-specific reference
extracted_text TEXT, -- NULL until extraction complete
status TEXT NOT NULL DEFAULT 'pending', -- pending | processing | ready | error
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON documents(user_id, folder_id);
CREATE INDEX ON documents(user_id, created_at DESC);
-- Topics (per-user; admin sets defaults)
-- Created before document_topics, which references topics(id)
CREATE TABLE topics (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID REFERENCES users(id) ON DELETE CASCADE, -- NULL = system default
name TEXT NOT NULL,
UNIQUE (user_id, name) -- does not deduplicate system defaults, because NULL user_id values compare as distinct
);
-- Document topics (M:N)
CREATE TABLE document_topics (
document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
topic_id UUID NOT NULL REFERENCES topics(id) ON DELETE CASCADE,
PRIMARY KEY (document_id, topic_id)
);
-- Document shares
CREATE TABLE document_shares (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
owner_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
recipient_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
permission TEXT NOT NULL DEFAULT 'view', -- 'view' | 'download' (future)
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (document_id, recipient_id)
);
CREATE INDEX ON document_shares(recipient_id);
-- Cloud storage backends per user
CREATE TABLE cloud_backends (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
backend_type TEXT NOT NULL, -- 'onedrive' | 'gdrive' | 'nextcloud' | 'webdav'
display_name TEXT NOT NULL,
credentials_enc TEXT NOT NULL, -- Fernet-encrypted JSON blob
is_default BOOLEAN NOT NULL DEFAULT FALSE,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON cloud_backends(user_id);
-- Audit log (append-only)
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
user_id UUID REFERENCES users(id) ON DELETE SET NULL,
actor_id UUID REFERENCES users(id) ON DELETE SET NULL, -- admin acting on behalf
event_type TEXT NOT NULL, -- login | login_failed | upload | delete | share | quota_change | ...
resource_id UUID, -- document_id / folder_id / user_id depending on context
ip_address INET,
metadata JSONB, -- event-specific extra fields (no document content)
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON audit_log(user_id, created_at DESC);
CREATE INDEX ON audit_log(event_type, created_at DESC);
-- NOTE: no UPDATE or DELETE grants on audit_log for app user; only INSERT + SELECT
```
**Schema design notes:**
- `topics.user_id IS NULL` = system-wide default topics visible to all users; per-user topics shadow them.
- `documents.object_key` stores the backend-relative reference — for MinIO it is `{user_id}/{document_id}/{filename}`; for OneDrive it is the Drive item ID. The `storage_backend` column tells the service which adapter to use.
- `cloud_backends.credentials_enc` is never returned in any API response; only the adapter factory decrypts it server-side.
- Audit log uses `BIGSERIAL` (not UUID) for append-ordered natural scan and to discourage random access patterns.
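
The topic-shadowing rule in the first note can be sketched as follows. This assumes "shadow" means a per-user topic replaces a system default with exactly the same name; `Topic` and `effective_topics` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Topic:
    id: str
    user_id: Optional[str]  # None marks a system default
    name: str


def effective_topics(rows: list[Topic], user_id: str) -> list[Topic]:
    """System defaults plus the user's own topics; a user topic wins on a name clash."""
    by_name: dict[str, Topic] = {}
    # First pass: system-wide defaults.
    for topic in rows:
        if topic.user_id is None:
            by_name.setdefault(topic.name, topic)
    # Second pass: the user's own rows overwrite defaults of the same name.
    for topic in rows:
        if topic.user_id == user_id:
            by_name[topic.name] = topic
    return sorted(by_name.values(), key=lambda t: t.name)
```

Other users' topics are never included, which keeps the classifier's candidate list scoped to the requesting user.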
---
## Data Flow
### Document Upload Flow (MinIO presigned URL path)
```
Browser
├─[1] POST /documents/upload-url {filename, size, content_type, folder_id?}
│ → get_current_user dep (JWT verify → load User from DB)
│ → check_upload_quota dep (reads quotas table, compares size)
│ → document_service.prepare_upload()
│ → INSERT documents row (status='pending')
│ → minio_backend.generate_presigned_put(object_key, expires=300s)
│ ← {upload_url, object_key, document_id}
├─[2] PUT <upload_url> (browser → MinIO, no FastAPI)
├─[3] POST /documents/{id}/confirm
│ → get_current_user dep
│ → document_service.confirm_upload()
│ → verify object exists in MinIO (HEAD request)
│ → quota_service.increment_usage(user_id, size_bytes) [atomic]
│ → UPDATE documents SET status='processing'
│ → enqueue background task: extract_and_classify(document_id)
│ ← {document_id, status: "processing"}
└─[4] Background: extract_and_classify(document_id)
→ extractor.extract_text(object bytes from MinIO)
→ classifier.classify(text, user_topics)
→ UPDATE documents SET extracted_text=..., status='ready'
→ UPDATE document_topics
→ audit_service.log(event='upload', ...)
```
### Authentication Flow
```
Browser
├─[1] POST /auth/login {email, password, totp_code?}
│ → user_service.verify_password(email, password)
│ → if totp_enabled: pyotp.TOTP(secret).verify(totp_code)
│ → issue access_token (JWT, 15 min) + refresh_token (opaque UUID)
│ → store hash(refresh_token) in refresh_tokens table
│ ← {access_token, refresh_token, expires_in}
├─[2] Any authenticated request
│ Authorization: Bearer <access_token>
│ → get_current_user dep verifies the JWT signature locally, then loads the User by primary key (no refresh_tokens lookup)
└─[3] POST /auth/refresh {refresh_token}
→ look up hash(refresh_token) in refresh_tokens table
→ verify not revoked, not expired
→ set revoked=true on old token
→ issue new access_token + new refresh_token (rotation)
← {access_token, refresh_token, expires_in}
```
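
The service delegates TOTP checks to `pyotp`. To show what a ±1 period tolerance means in step [1], here is a standard-library sketch of the RFC 6238 computation (HMAC-SHA1, 30-second step). It is a hand-rolled illustration and not a replacement for `pyotp`; `totp` and `verify_totp` are hypothetical names.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, at: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 code for Unix time `at`, using HMAC-SHA1."""
    counter = struct.pack(">Q", at // step)
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, code: str, at: int, window: int = 1, step: int = 30) -> bool:
    """Accept the code for the current period or up to `window` periods either side."""
    return any(
        hmac.compare_digest(totp(secret, at + i * step, step, len(code)), code)
        for i in range(-window, window + 1)
    )
```

With `window=1`, a code generated in one 30-second period is still accepted during the next one, which absorbs modest clock drift between the phone and the server.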
### Shared Document Access Flow
```
Recipient accesses "Shared with me"
├─ GET /documents/shared-with-me
│ → SELECT d.* FROM documents d
│ JOIN document_shares s ON s.document_id = d.id
│ WHERE s.recipient_id = :current_user_id
│ ← list of document records (owner's documents, recipient has view access)
└─ GET /documents/{id}/download (recipient, shared document)
→ verify document_shares row exists for (document_id, current_user_id)
→ generate presigned GET URL using owner's object_key
← 302 redirect to presigned URL
(file bytes flow from MinIO → browser, never through FastAPI)
```
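
The two queries in this flow can be exercised against an in-memory database. The sketch uses SQLite as a stand-in for PostgreSQL with a reduced column set; `shared_with_me` and `resolve_download_key` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id TEXT PRIMARY KEY, user_id TEXT, filename TEXT, object_key TEXT);
CREATE TABLE document_shares (document_id TEXT, recipient_id TEXT,
                              UNIQUE (document_id, recipient_id));
INSERT INTO documents VALUES ('d1', 'alice', 'a.pdf', 'alice/d1/a.pdf');
INSERT INTO documents VALUES ('d2', 'alice', 'b.pdf', 'alice/d2/b.pdf');
INSERT INTO document_shares VALUES ('d1', 'bob');
""")


def shared_with_me(conn: sqlite3.Connection, user_id: str) -> list[str]:
    """IDs of documents other users have shared with user_id."""
    rows = conn.execute(
        "SELECT d.id FROM documents d "
        "JOIN document_shares s ON s.document_id = d.id "
        "WHERE s.recipient_id = ? ORDER BY d.id",
        (user_id,),
    )
    return [row[0] for row in rows]


def resolve_download_key(conn: sqlite3.Connection, document_id: str, user_id: str):
    """Return the owner's object_key if user_id owns the document or has a share row."""
    row = conn.execute(
        "SELECT d.object_key FROM documents d "
        "LEFT JOIN document_shares s "
        "  ON s.document_id = d.id AND s.recipient_id = ? "
        "WHERE d.id = ? AND (d.user_id = ? OR s.recipient_id IS NOT NULL)",
        (user_id, document_id, user_id),
    ).fetchone()
    return row[0] if row else None
```

A `None` result maps naturally to a 404 rather than a 403, so the API does not reveal whether a document ID exists to users without access.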
---
## Migration Path: Flat-File → PostgreSQL + MinIO
### Principle: parallel-run, not flag-day cutover
The safest approach is to keep the existing flat-file code running and introduce the new stack incrementally, in a sequence that never breaks the existing API contract from the Vue frontend's perspective.
### Phase 1 — Infrastructure, no behavior change
1. Add PostgreSQL and MinIO services to `docker-compose.yml`
2. Create `db/models.py` with initial schema (users, documents, quotas — no auth yet)
3. Add Alembic, run initial migration
4. Add `deps/db.py` with async session dependency
5. No API changes. Existing flat-file code still runs.
**Validation:** `docker-compose up` boots all services without errors. Alembic migrations apply cleanly.
### Phase 2 — Auth layer (new endpoints, existing endpoints temporarily open)
1. Add the `refresh_tokens` table and any auth columns still missing from the Phase 1 `users` table
2. Implement `/auth/register`, `/auth/login`, `/auth/refresh`
3. Add `get_current_user` dependency to `deps/auth.py`
4. Add a single test authenticated route (`GET /auth/me`)
5. Existing document endpoints remain unauthenticated (guarded by feature flag or separate router prefix)
**Validation:** Auth endpoints work independently. Existing UI still calls existing routes without tokens.
### Phase 3 — Document storage migration (dual-write period)
1. Add MinIO `minio_backend.py`, integrate into `storage_service.py`
2. Create a one-time migration script:
- Reads each `data/metadata/<id>.json`
- Inserts a `documents` row with `user_id = SYSTEM_USER_ID` (a single placeholder user)
- Uploads `data/uploads/<id>.<ext>` to MinIO
3. New uploads go to MinIO + PostgreSQL; old flat-file data has been migrated
4. Update document API routes to read from PostgreSQL + MinIO, guarded behind `get_current_user`
**Critical:** Run migration script in a transaction; if any file fails, roll back DB inserts (MinIO objects can be cleaned up separately). Do not delete flat-file data until validation is complete.
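
A dry-run pass is a cheap way to validate the flat-file data before anything is written. The sketch below only plans the migration: it pairs metadata with upload files and reports gaps. The `filename` metadata field, the `SYSTEM_USER_ID` value, and the function name `plan_migration` are assumptions; the real metadata schema is not shown in this document.

```python
import json
from pathlib import Path

SYSTEM_USER_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical placeholder user


def plan_migration(data_dir: Path) -> tuple[list[dict], list[str]]:
    """Pair each metadata/<id>.json with uploads/<id>.<ext> under data_dir.

    Returns (rows to insert, ids whose upload file is missing).
    Nothing is written, uploaded, or deleted.
    """
    rows: list[dict] = []
    missing: list[str] = []
    for meta_path in sorted((data_dir / "metadata").glob("*.json")):
        doc_id = meta_path.stem
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        uploads = sorted((data_dir / "uploads").glob(f"{doc_id}.*"))
        if not uploads:
            missing.append(doc_id)
            continue
        src = uploads[0]
        # 'filename' is an assumed metadata field; fall back to the name on disk.
        filename = meta.get("filename", src.name)
        rows.append({
            "id": doc_id,
            "user_id": SYSTEM_USER_ID,
            "filename": filename,
            "size_bytes": src.stat().st_size,
            "object_key": f"{SYSTEM_USER_ID}/{doc_id}/{filename}",
            "source_path": str(src),
        })
    return rows, missing
```

Running the real migration only when `missing` is empty keeps the all-or-nothing rule above simple to enforce.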
### Phase 4 — Multi-user isolation
1. Add per-user `quotas` rows, `folders` table
2. Enforce user-scoped queries: all document queries include `WHERE user_id = :current_user_id`
3. Add quota enforcement dependency on upload routes
4. Add `document_shares`, `cloud_backends` tables
### Phase 5 — Cloud storage backends
1. Extract the `StorageBackend` ABC and make the Phase 3 `MinIOBackend` implement it
2. Implement first cloud adapter (OneDrive or Google Drive)
3. Add `/storage-backends/*` API endpoints
4. Add frontend UI for connecting cloud accounts
### Frontend changes during migration
- Add `Authorization: Bearer <token>` header injection in `src/api/client.js` (single change point — all API calls go through this module already)
- Add login/register views and auth Pinia store
- Redirect to `/login` on 401 responses
- No other frontend changes required until Phase 4 (user-scoped UI)
---
## Horizontal Scaling Concerns
| Concern | What to Share | What Can Be Instance-Local |
|---------|--------------|---------------------------|
| DB connections | PostgreSQL (shared) — use connection pooling (asyncpg pool size 10-20 per instance) | None |
| Object storage | MinIO (shared) — all instances use same endpoint | None |
| Refresh token state | PostgreSQL `refresh_tokens` table (shared) | JWT validation (CPU-only, no shared state needed) |
| Quota state | PostgreSQL `quotas` table with atomic UPDATE (shared) | Pre-flight Content-Length check (instance-local read, final write shared) |
| Background tasks | Cannot use `BackgroundTasks` across instances; use Celery + Redis OR a Postgres-backed queue (e.g. pgqueuer) | Single-instance: `BackgroundTasks` is fine for Phase 1 |
| File upload temp buffers | If streaming proxy pattern used: RAM per instance | Use presigned URLs to avoid this entirely |
| AI provider instances | Re-instantiated per request already — no shared state | Per-instance re-instantiation is fine |
| CORS / session | Stateless JWT — no sticky sessions needed | — |
**First bottleneck:** Background task queue. `FastAPI BackgroundTasks` runs in the same process. When classification is slow or multiple uploads arrive simultaneously, workers block. Introduce a task queue (Celery + Redis, or pgqueuer) before scaling to N instances — otherwise each instance has its own queue and tasks are not distributed.
**Second bottleneck:** DB connection count. N instances with a pool of 20 connections each open N×20 PostgreSQL connections. Add PgBouncer in transaction mode in front of PostgreSQL before N gets large.
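
As a worked example of that arithmetic, the sketch below estimates how many instances fit before direct pools exhaust the server. It assumes stock PostgreSQL settings (`max_connections = 100`, `superuser_reserved_connections = 3`); a given deployment may differ, and `max_instances` is a hypothetical helper.

```python
def max_instances(max_connections: int, reserved: int, pool_size: int) -> int:
    """How many app instances fit before direct pools exhaust PostgreSQL's connection slots."""
    return (max_connections - reserved) // pool_size


# Stock PostgreSQL defaults with the 20-connection pool from the table above.
print(max_instances(max_connections=100, reserved=3, pool_size=20))  # 4
```

Under those assumptions a fifth instance would already fail to fill its pool, which is why PgBouncer belongs in the plan before the 2-4 instance range is exceeded.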
---
## Anti-Patterns
### Anti-Pattern 1: Per-Instance File Locks for Quota
**What people do:** Carry forward the `filelock` pattern into the multi-instance world, using a lock file on a shared volume.
**Why it's wrong:** Shared NFS/volume file locking has undefined behavior under Docker Compose networking, requires a shared filesystem mount (kills stateless instances), and is slower than a DB atomic update.
**Do this instead:** Atomic `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes` in PostgreSQL. Single round-trip, correct under concurrency, no shared filesystem required.
---
### Anti-Pattern 2: Streaming All File Traffic Through FastAPI
**What people do:** `POST /upload` receives the multipart body into memory, then POSTs it to MinIO from FastAPI.
**Why it's wrong:** Doubles memory usage (once in FastAPI, once in MinIO client buffer). Saturates FastAPI worker threads during large uploads. Introduces FastAPI as a bottleneck for byte transfer.
**Do this instead:** Two-step presigned URL flow (Pattern 3 above). FastAPI only handles metadata; bytes flow browser → MinIO directly.
---
### Anti-Pattern 3: Auth in Middleware Instead of Dependencies
**What people do:** Write a custom ASGI middleware that reads the `Authorization` header and either passes or rejects requests.
**Why it's wrong:** Middleware runs before FastAPI routing. To allow public routes (login, register, health), you must maintain an exclusion list in the middleware. This list inevitably goes stale when new public routes are added. Middleware cannot easily populate `request.state.user` in a way that's type-safe for path operations.
**Do this instead:** `Depends(get_current_user)` on each protected router. Optional auth uses `Depends(get_optional_user)` returning `User | None`. Explicit, type-safe, co-located with the route it protects. Confirmed as FastAPI's recommended pattern.
---
### Anti-Pattern 4: Storing Cloud Credentials Unencrypted (or Relying on DB-Level Encryption Alone)
**What people do:** Store OAuth tokens in plaintext DB columns, assuming DB-level TLS or disk encryption is sufficient.
**Why it's wrong:** Any user with DB read access (admin, compromised migration, backup leak) can extract all users' cloud tokens. Violates the privacy-first admin model requirement.
**Do this instead:** Fernet-encrypt the credential JSON blob in `cloud_service.py` before writing to `cloud_backends.credentials_enc`. The Fernet key lives in `CLOUD_CREDS_KEY` env var only — never in the DB. Admin queries on `cloud_backends` return only `id`, `backend_type`, `display_name`, `is_default` — the `credentials_enc` column is excluded from all admin-facing serializers.
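
The serializer rule can be enforced with an explicit allowlist, so a newly added column is hidden by default. A minimal sketch with plain dataclasses; `CloudBackendRow`, `ADMIN_FIELDS`, and `admin_view` are hypothetical names, and the real project would more likely express this as a Pydantic response model.

```python
from dataclasses import asdict, dataclass

# Only these columns may ever appear in admin-facing responses.
ADMIN_FIELDS = ("id", "backend_type", "display_name", "is_default")


@dataclass
class CloudBackendRow:
    id: str
    user_id: str
    backend_type: str
    display_name: str
    credentials_enc: str
    is_default: bool


def admin_view(row: CloudBackendRow) -> dict:
    """Project a row onto the allowlisted fields; everything else is dropped."""
    data = asdict(row)
    return {name: data[name] for name in ADMIN_FIELDS}
```

An allowlist fails closed: forgetting to update it hides a new field, whereas a denylist would leak it.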
---
### Anti-Pattern 5: One MinIO Bucket Per User
**What people do:** Create a new MinIO bucket for each registered user to enforce isolation.
**Why it's wrong:** MinIO is not designed for millions of buckets. Bucket creation is a management operation. IAM policies per bucket become complex to manage at scale.
**Do this instead:** Single bucket, key prefix isolation: `{user_id}/{document_id}/{filename}`. Enforce prefix scoping in `storage_service.py` — never let a user-supplied key escape their `{user_id}/` prefix. Verify in every `get_object` and `delete_object` call that the resolved key starts with the authenticated user's ID.
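
The prefix check described above might look like the following. This is a sketch under the assumption that keys are POSIX-style paths and that traversal segments such as `..` must be resolved before comparing; `assert_key_in_user_prefix` is a hypothetical name.

```python
import posixpath


def assert_key_in_user_prefix(user_id: str, key: str) -> str:
    """Return the normalised key, or raise if it falls outside '{user_id}/'."""
    # Collapse '..' and duplicate slashes before comparing prefixes.
    normalised = posixpath.normpath(key)
    if normalised.startswith("/") or not normalised.startswith(f"{user_id}/"):
        raise PermissionError("object key is outside the caller's prefix")
    return normalised
```

Comparing against `f"{user_id}/"` with the trailing slash matters: without it, a user whose ID is a prefix of another ID would pass the check.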
---
## Integration Points
### External Services
| Service | Integration Pattern | Notes |
|---------|---------------------|-------|
| PostgreSQL | SQLAlchemy 2.0 async (`asyncpg` driver), sessions via `Depends(get_db)` | Use `asyncpg` pool, not per-request connections |
| MinIO | `minio` Python SDK (sync) wrapped in `asyncio.to_thread()`, or `aiobotocore` for async S3 | Presigned URL generation is CPU-bound, not I/O-bound — `to_thread` is fine |
| OneDrive | Microsoft Graph API via `httpx` async client + OAuth2 PKCE flow | Refresh tokens stored encrypted in `cloud_backends` |
| Google Drive | Google Drive API v3 via `httpx` or `google-auth` library | Same credential model as OneDrive |
| Nextcloud | WebDAV via `httpx` (PUT/GET/DELETE) or `webdavclient3` library | Basic auth or app password — simpler than OAuth |
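| PyOTP | TOTP generation/verification (`pyotp.TOTP(secret).verify(code, valid_window=1)`) | Time-window tolerance: `valid_window=1` accepts ±1 period (±30 sec), which is sufficient; the pyotp default of `valid_window=0` accepts only the current period |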
| `python-jose` or `PyJWT` | JWT encode/decode | Use `HS256` with a 256-bit secret. `python-jose` has broader algorithm support; `PyJWT` is simpler and more actively maintained |
| `cryptography` (Fernet) | Cloud credential encryption/decryption | `Fernet.generate_key()` at setup; store in `CLOUD_CREDS_KEY` env var |
| `passlib[bcrypt]` | Password hashing | bcrypt work factor 12 minimum |
### Internal Boundaries
| Boundary | Communication | Notes |
|----------|---------------|-------|
| `api/` → `services/` | Direct async function calls | Services never import from `api/`; dependency is one-directional |
| `services/` → `storage/` | `StorageBackend` ABC interface | Services import from `storage/__init__.py` factory only |
| `services/` → `ai/` | Existing `get_provider()` factory, unchanged | AI provider is still re-instantiated per call |
| `deps/` → `services/` | Services can be called from deps (e.g., quota_service from quota dep) | Keep deps thin: prefer passing a DB session to the dep and calling service functions |
| `db/models.py` ↔ everywhere | Import models directly | No repository pattern needed at this scale; SQLAlchemy session + models is sufficient |
---
## Scaling Considerations
| Scale | Architecture | Notes |
|-------|-------------|-------|
| 1-100 users | Single FastAPI instance, `BackgroundTasks`, no queue | This milestone's target; simplest path |
| 100-10k users | Add Celery + Redis for background tasks; add PgBouncer; scale FastAPI to 2-4 instances | Background task queue is the first change needed |
| 10k-100k users | Read replica for PostgreSQL (document listing queries), MinIO multi-node cluster | Document metadata reads dominate; separate read/write paths |
| 100k+ users | Consider separate microservice for classification (GPU workers); CDN in front of MinIO presigned URLs | Classification latency becomes user-facing bottleneck |
---
## Sources
- FastAPI official docs — Security / OAuth2 with JWT: https://fastapi.tiangolo.com/tutorial/security/oauth2-jwt/ (HIGH confidence — directly confirmed pattern)
- FastAPI official docs — Advanced Middleware: https://fastapi.tiangolo.com/advanced/middleware/ (HIGH confidence — confirms DI > middleware for auth)
- FastAPI official docs — SQL Databases: https://fastapi.tiangolo.com/tutorial/sql-databases/ (HIGH confidence — session-per-request via Depends confirmed)
- MinIO S3 presigned URL pattern: S3-compatible standard, documented in AWS S3 and MinIO docs (HIGH confidence — industry-standard pattern)
- PostgreSQL atomic UPDATE for quota enforcement: standard single-statement conditional-update (check-and-increment) pattern (HIGH confidence)
- Fernet symmetric encryption (`cryptography` library): well-documented Python standard for symmetric key encryption (HIGH confidence)
- Refresh token rotation pattern: IETF OAuth 2.0 Security BCP (RFC 9700 / draft-ietf-oauth-security-topics) (HIGH confidence)
---
*Architecture research for: DocuVault multi-user SaaS document management (FastAPI + Vue 3 brownfield)*
*Researched: 2026-05-21*
# Feature Research
**Domain:** SaaS Document Management Platform (multi-user, quota-enforced, privacy-first)
**Researched:** 2026-05-21
**Confidence:** MEDIUM (web fetch/search unavailable; based on training knowledge of Google Drive, OneDrive, Dropbox, Box, Notion, Paperless-ngx, DocuWare through Aug 2025 — all mature platforms with stable, well-documented feature sets)
---
## Research Scope
Platforms surveyed: Google Drive, Microsoft OneDrive, Dropbox, Notion, Box, Paperless-ngx (self-hosted), DocuWare (enterprise DMS).
Eight areas analyzed per the brief:
1. Auth & access control
2. Storage & quota UX
3. Folder/organization UX
4. Sharing model
5. Cloud storage integration UX
6. Admin panel
7. Audit / compliance
8. Document viewer features
---
## Feature Landscape
### Table Stakes (Users Expect These)
Features users assume exist. Missing these = product feels incomplete or broken.
#### 1. Auth & Access Control
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Email + password registration with validation | Every SaaS has this | LOW | Strength rules (length, complexity) expected; breach-check (HaveIBeenPwned API) is a notable addition |
| Persistent sessions with "remember me" | Users hate logging in every visit | LOW | JWT with refresh token; sliding expiry. Pure stateless JWT without refresh feels cheap. |
| Password reset via email | Users forget passwords constantly | LOW | Time-limited signed token; mandatory |
| TOTP 2FA (authenticator app) | Expected at any security-conscious SaaS | MEDIUM | PyOTP; TOTP is RFC 6238 (built on HOTP, RFC 4226). Users expect QR code setup + backup codes. Missing backup codes = major UX gap. |
| Forced logout / session revocation | Power users and security-conscious users expect this | MEDIUM | "Sign out all devices" is table stakes at Box and Google. Requires server-side session tracking (defeats pure JWT — use a token revocation list or short-lived JWTs + refresh token table). |
| Per-user isolated data space | Every cloud storage product does this | LOW | Violated = catastrophic trust failure. MinIO prefix isolation per user ID. |
| Account deletion with data wipe | GDPR + user trust | MEDIUM | Must cascade: documents, quotas, shares, cloud credentials, audit entries scoped to user. |
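
The "sign out all devices" row depends on server-side tracking of refresh tokens. A minimal sketch of such a table follows, using `sqlite3` as a stand-in for PostgreSQL; the table name `refresh_tokens`, its columns, and all function names are hypothetical. Tokens are stored hashed so a database leak does not expose usable tokens.

```python
import hashlib
import secrets
import sqlite3
import time
from typing import Optional


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE refresh_tokens ("
        " token_hash TEXT PRIMARY KEY,"
        " user_id INTEGER NOT NULL,"
        " expires_at REAL NOT NULL,"
        " revoked INTEGER NOT NULL DEFAULT 0)"
    )


def _hash(token: str) -> str:
    return hashlib.sha256(token.encode()).hexdigest()


def issue_refresh_token(conn: sqlite3.Connection, user_id: int,
                        ttl_seconds: float = 7 * 24 * 3600,
                        now: Optional[float] = None) -> str:
    """Create a random token, store only its hash, return the raw token once."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    conn.execute(
        "INSERT INTO refresh_tokens (token_hash, user_id, expires_at) VALUES (?, ?, ?)",
        (_hash(token), user_id, now + ttl_seconds),
    )
    return token


def is_refresh_token_valid(conn: sqlite3.Connection, token: str,
                           now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT 1 FROM refresh_tokens"
        " WHERE token_hash = ? AND revoked = 0 AND expires_at > ?",
        (_hash(token), now),
    ).fetchone()
    return row is not None


def revoke_all_sessions(conn: sqlite3.Connection, user_id: int) -> int:
    """'Sign out all devices': revoke every live refresh token for the user."""
    cur = conn.execute(
        "UPDATE refresh_tokens SET revoked = 1 WHERE user_id = ? AND revoked = 0",
        (user_id,),
    )
    return cur.rowcount
```

Already issued access JWTs remain valid until they expire, which is why this approach is paired with short-lived access tokens.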
#### 2. Storage & Quota UX
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Visible quota indicator (used / total) | Google Drive, Dropbox, OneDrive all show this prominently | LOW | Progress bar + "X MB of Y MB used" in sidebar or settings. Missing = users feel blind. |
| Quota check at upload time with clear error | Every platform rejects over-quota uploads | LOW | Error must say "You've used X of Y MB — free up space or upgrade." Not a generic 500. |
| Per-file size shown in file list | Dropbox, Drive show sizes in list view | LOW | Users make quota decisions based on file size visibility. |
| Sort/filter by size | Users cleaning up quota expect to find large files | LOW | "Largest files first" sort is a common quota-management action. |
| Quota warning at 80% and 95% | Drive warns at 80%; Dropbox emails at 90% | LOW | In-app banner + optionally email. Users are surprised when they hit the wall without warning. |
| Storage usage breakdown by folder/category | Drive shows breakdown by file type | MEDIUM | "X MB in Documents, Y MB in Images" helps users understand what to delete. Folder-level usage is lower priority but expected by power users. |
| Delete confirmation that shows freed space | Minor but Dropbox/Drive do this | LOW | "Deleting this file will free 24 MB." Reinforces value of the action. |
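
Quota enforcement at upload time can be done in one conditional `UPDATE`, so two concurrent uploads cannot both pass a separate check. The sketch below uses `sqlite3` as a stand-in for PostgreSQL; the `quotas` table, the `used_bytes` column, and the `reserve_quota` and `QuotaExceeded` names are hypothetical (`limit_bytes` is the column named elsewhere in this document).

```python
import sqlite3


class QuotaExceeded(Exception):
    """Carries the numbers needed for a specific, non-generic error message."""

    def __init__(self, used: int, limit: int, requested: int) -> None:
        self.used, self.limit, self.requested = used, limit, requested
        mb = 1024 * 1024
        super().__init__(
            f"You've used {used / mb:.1f} of {limit / mb:.1f} MB; "
            f"this file needs {requested / mb:.1f} MB. Free up space or upgrade."
        )


def reserve_quota(conn: sqlite3.Connection, user_id: int, size_bytes: int) -> None:
    """Atomically add size_bytes to the user's usage, or raise QuotaExceeded."""
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ?"
        " WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (size_bytes, user_id, size_bytes),
    )
    if cur.rowcount == 1:
        return  # reservation succeeded
    row = conn.execute(
        "SELECT used_bytes, limit_bytes FROM quotas WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no quota row for user {user_id}")
    raise QuotaExceeded(used=row[0], limit=row[1], requested=size_bytes)
```

An API layer would translate `QuotaExceeded` into a client error carrying these numbers rather than a generic 500. If the upload later fails, the reservation has to be released with a matching decrement.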
#### 3. Folder / Organization UX
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Create folder | Universal | LOW | Standard. |
| Rename folder | Universal | LOW | Inline rename (click-to-edit) is expected; modal is acceptable. |
| Delete folder (with contents confirmation) | Universal | LOW | Must warn if non-empty: "This will delete N documents." Hard delete after confirmation in v1; soft-delete (trash) is deferred (see Anti-Features). |
| Move document to folder (drag-and-drop or context menu) | Drive, Dropbox have both | MEDIUM | Drag-and-drop is a strong UX expectation in desktop-class web apps. Context menu "Move to..." is the fallback minimum. |
| Move folder into another folder (nesting) | Drive, OneDrive support arbitrary depth | MEDIUM | Users expect arbitrary depth. Cap at 5–8 levels deep to avoid pathological nesting; unlimited is fine for v1. |
| Breadcrumb navigation | Universal in file managers | LOW | Clickable breadcrumb showing full path: Root > Invoices > 2025. Without this users get lost in nested folders. |
| Sort documents within folder (name, date, size, type) | All platforms have this | LOW | Default: date descending (newest first). |
| "Recent" / "Last accessed" virtual folder | Drive, OneDrive surface this prominently | LOW | Not a real folder — a filtered view. Users navigating back to recent work expect it. |
| Search across all documents | Every platform has global search | MEDIUM | Full-text or metadata search. For DocuVault, text is already extracted — expose it in search. Missing = product feels like a filing cabinet with no index. |
| Multi-select for batch operations | Drive, Dropbox support shift-click + checkbox | MEDIUM | Batch delete, batch move. Without this, managing many files is tedious. |
| "Shared with me" virtual folder | Drive has this; Box has it; Dropbox has it | LOW | Auto-populated when another user shares a document. Already in PROJECT.md. |
#### 4. Sharing Model
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Share with named user (by handle or email) | Drive, Box, Dropbox all do this | MEDIUM | Core sharing primitive. Must show: who has access, their permission level, revoke button. |
| View-only vs. edit permission distinction | Box, Drive, Dropbox all distinguish view/edit | LOW | For DocuVault, edit means "re-upload / annotate" — since no in-app editing, view-only is the primary mode. Still must distinguish the concept for future extensibility. |
| Revoke share immediately | All platforms do this | LOW | No delay. If the user is currently viewing the document, their next request should 403. |
| Share notification to recipient | Drive/Dropbox send email; Box shows in-app | LOW | At minimum: in-app notification. Email is table stakes for platforms where users may not be logged in. |
| See list of "what I've shared" | Drive, Box surface this | LOW | Users need to audit their own shares. A "Shared by me" list per document + a global shares list in settings. |
| Accept / decline incoming share | Box requires acceptance; Drive shows silently | MEDIUM | DocuVault's "Shared with me" folder appearing is implicit acceptance. An explicit accept/decline adds trust. Drive skips it; Box requires it for external shares. Given privacy-first positioning, explicit accept is the better default. |
| Share expiry date | Box supports this; Drive (with Workspace) does; Dropbox Business does | MEDIUM | Not universal at consumer tier but expected in security-conscious / business contexts. |
#### 5. Cloud Storage Integration UX
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| OAuth2 connect flow for Google Drive / OneDrive | Standard for these integrations | HIGH | Redirect-based OAuth2 PKCE. Users expect "Connect" → browser popup → granted → back to app. No credential entry. |
| Connection status indicator (connected / disconnected / error) | Any integration page shows this | LOW | Green dot = connected, red = error with message. Dropbox Business sync indicators are the model. |
| Disconnect / re-authenticate option | Users rotate tokens; credentials expire | LOW | "Disconnect" button. Re-auth if token expires (silent token refresh where possible). |
| Default storage selector | Users need to know where new uploads go | LOW | Clear radio/dropdown: "Local storage" vs "Google Drive" vs "OneDrive." Show current default prominently. |
| Upload routing confirmation | When cloud is default, users want to see it confirmed | LOW | "Stored in Google Drive" in upload success message. |
| Error state when cloud storage unreachable | Cloud services go down; credentials expire | MEDIUM | Graceful: queue locally, show warning, retry. Hard fail (lose document) = catastrophic. Minimum: "Storage unavailable — document saved locally." |
| Basic usage from cloud backend | Show how much cloud storage is used vs available | MEDIUM | Google Drive has 15 GB free; OneDrive 5 GB. Surfacing "You've used 3 GB of your 15 GB Google Drive" alongside local quota gives users a full picture. |
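
The "error state when cloud storage unreachable" row describes a fallback to local storage. One way to sketch that control flow is shown below. The function `store_with_fallback`, the `Uploader` callable type, and the choice to treat `OSError` subclasses (such as `ConnectionError` and `TimeoutError`) as "unreachable" are assumptions; a real adapter built on `httpx` would catch that library's own exceptions instead.

```python
from typing import Callable, Optional, Tuple

# Hypothetical uploader signature: (object_key, payload) -> None
Uploader = Callable[[str, bytes], None]

FALLBACK_WARNING = "Storage unavailable; document saved locally."


def store_with_fallback(key: str, data: bytes,
                        cloud_put: Uploader,
                        local_put: Uploader) -> Tuple[str, Optional[str]]:
    """Try the cloud backend first; on a connectivity error, keep the document locally.

    Returns (backend_used, warning_for_user). The document is never dropped:
    if the local write also fails, that exception propagates to the caller.
    """
    try:
        cloud_put(key, data)
        return "cloud", None
    except OSError:
        # Assumption: network-level failures surface as OSError subclasses.
        local_put(key, data)
        return "local", FALLBACK_WARNING
```

The returned backend name can feed the upload success message ("Stored in Google Drive" versus the warning), and a later retry job could move locally held documents to the cloud once the connection recovers.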
#### 6. Admin Panel
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| User list with search and pagination | Every admin panel has this | LOW | Show: username, email, registration date, last login, quota used/total, active/inactive status. |
| Create user account | Admin needs to onboard users manually (invite flow) | LOW | Form: email, temp password (force reset on first login), quota assignment. |
| Deactivate / reactivate user | Admin must be able to remove access without deleting data | LOW | Soft disable: user cannot log in; data preserved. Separate from delete. |
| Force password reset | Standard admin action | LOW | Invalidates current sessions; user gets reset email. |
| Quota adjustment per user | Already in PROJECT.md | LOW | Input: new quota in MB. Show current usage. Prevent setting quota below current usage (warn, not block). |
| System-wide default quota setting | Admins want to change the free tier baseline | LOW | Global default that new users inherit. Existing users keep their individually-set quota. |
| AI provider/model assignment per user | Already in PROJECT.md | MEDIUM | Dropdown of configured providers + models. Show "using system default" until overridden. |
| System-wide AI provider default | Already in PROJECT.md | LOW | Global fallback when no user override exists. |
| Audit log viewer | All enterprise DMS products have this | MEDIUM | See Audit section below. |
| Platform health indicators | Admins need to know if things are broken | MEDIUM | Storage backend connectivity, database connection, queue depth. A simple status page at /admin/health. |
#### 7. Audit / Compliance
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Audit log of security events | GDPR, SOC2 baseline, enterprise buyers expect this | MEDIUM | Mandatory events: login, failed login, password reset, 2FA enable/disable, session revoke. |
| Audit log of data events | Same compliance baseline | MEDIUM | Mandatory events: document upload, document delete, document move/rename, share create, share revoke, quota change. |
| Metadata-only audit (no content) | Privacy requirement — DocuVault's stated constraint | LOW | Log: actor_user_id, action, target_resource_id, timestamp, IP address. Never log document content or filenames in the log (filename is sensitive metadata). |
| Admin-viewable audit log with filters | Already in PROJECT.md | MEDIUM | Filter by: user, action type, date range. Export to CSV is expected by compliance-oriented admins. |
| Immutable audit log | Audit logs lose their value if they can be edited | MEDIUM | Append-only table. No UPDATE/DELETE on audit rows. Admin UI shows only read operations. |
| IP address logging | Standard for login events; expected by security teams | LOW | Log IP on all auth events. GDPR note: IP is PII — include in retention policy. |
| Log retention policy | GDPR requires defined retention | LOW | Configurable in admin settings. Default: 1 year. Automated purge of older entries. |
| GDPR data export (user's own data) | GDPR Article 20 right to portability | HIGH | User can request export of all their data: document list + metadata (not necessarily content), account info, audit log of their own actions. Full content export is optional but the metadata export is required. |
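
The "metadata-only" and "immutable" rows can be combined in one table definition. The sketch below uses `sqlite3` triggers as a stand-in; in PostgreSQL a similar effect is commonly achieved by revoking `UPDATE` and `DELETE` on the table from the application role. The table name `audit_log`, the trigger names, and `log_event` are hypothetical; the columns mirror the fields listed in the table above.

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    actor_user_id INTEGER NOT NULL,
    action TEXT NOT NULL,
    target_resource_id TEXT,
    ip_address TEXT,
    created_at REAL NOT NULL
);
CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit_log is append-only');
END;
CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit_log is append-only');
END;
"""


def init_audit(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def log_event(conn: sqlite3.Connection, actor_user_id: int, action: str,
              target_resource_id: Optional[str] = None,
              ip_address: Optional[str] = None,
              now: Optional[float] = None) -> int:
    """Append one metadata-only event; no filename or content is ever passed in."""
    cur = conn.execute(
        "INSERT INTO audit_log"
        " (actor_user_id, action, target_resource_id, ip_address, created_at)"
        " VALUES (?, ?, ?, ?, ?)",
        (actor_user_id, action, target_resource_id, ip_address,
         time.time() if now is None else now),
    )
    return cur.lastrowid
```

A retention purge conflicts with a blanket no-delete rule, so it would need a separate privileged path (for example a maintenance role) that the normal application connection cannot use.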
#### 8. Document Viewer Features
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| In-browser PDF preview | Drive, Dropbox, Box all have this | MEDIUM | Use browser's native PDF viewer or PDF.js. Without preview, users must download to read — major friction. |
| Document metadata panel | All DMS platforms show this | LOW | Show: upload date, file size, MIME type, AI-assigned topics, uploader, storage backend. |
| AI topics display | Core DocuVault feature | LOW | Show assigned topics with confidence if available. Allow manual topic override from document detail. |
| Download original file | Universal | LOW | Always available. Even if preview fails, download must work. |
| Document rename | Drive, Dropbox allow rename from detail view | LOW | Inline rename in detail view or list view. |
| Delete from detail view | Universal | LOW | With confirmation. Returns to parent folder. |
| "Shared with" indicator on document | Drive shows a "people" icon on shared items | LOW | Visual indicator in list view that document is shared. Click to see who. |
| Basic image preview (JPG, PNG) | All platforms preview images natively | LOW | Browser `<img>` tag is sufficient. |
| Text file preview | Drive, Dropbox show .txt inline | LOW | Simple `<pre>` render of extracted text is adequate. |
---
### Differentiators (Competitive Advantage)
Features that align with DocuVault's core value (privacy-first, self-hosted, AI-classified).
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Privacy-first admin model (admin cannot read documents) | Unique trust proposition vs Google Drive/OneDrive where the operator can read everything | HIGH | Encryption-at-rest with user-scoped keys; admin queries explicitly exclude content. No other mainstream cloud platform offers this by default. Strong differentiator for privacy-conscious users. |
| Bring-your-own cloud storage backend | Users keep their existing cloud storage; DocuVault is the intelligent layer on top | HIGH | Google Drive/OneDrive are the storage, DocuVault is the classifier/organizer. Removes the "I have to trust another cloud with my docs" objection. |
| AI topic classification on upload (automatic) | No other mainstream DMS auto-classifies by topic on upload | MEDIUM | Paperless-ngx has auto-tagging; DocuWare has rules-based classification. AI-driven flexible topics is differentiated for non-enterprise users. |
| Multiple AI provider support (local/private inference) | Users who want fully on-premises AI (Ollama) get privacy guarantee no cloud DMS offers | MEDIUM | Ollama / LMStudio means documents never leave the user's infrastructure. Strong appeal to privacy-first and regulated-industry users. |
| Per-user topic customization on top of system defaults | Users get personalized classification without admin overhead | MEDIUM | System topics + per-user overrides. No mainstream cloud DMS supports per-user AI taxonomy customization. |
| Share expiry with automatic revocation | Goes beyond basic sharing; prevents forgotten shares | MEDIUM | Dropbox Business and Box have this at paid tiers. Including it at v1 is differentiating for a free-tier self-hosted platform. |
| Explicit share accept/decline | Recipient has control; aligns with privacy-first positioning | LOW | Box has this; Drive doesn't require it. Gives recipients agency. |
| Storage backend per-document routing | Some documents go to Drive, others to local — user decides per-upload | HIGH | No mainstream platform does this. Users with mixed sensitivity needs can route sensitive docs to local and bulk docs to Drive. Complex to implement but uniquely valuable. |
| In-app audit log visible to the user (not just admin) | Users can see their own activity history | LOW | GDPR-aligned; builds trust. Google Activity Dashboard is the model. Most self-hosted DMS don't surface this to users. |
---
### Anti-Features (Deliberately Excluded)
Features that seem good but would create problems for this project.
| Feature | Why Requested | Why Problematic | Alternative |
|---------|---------------|-----------------|-------------|
| Public link sharing (unauthenticated) | Users want to share with people not on the platform | Creates a public attack surface; quota abuse; legal exposure if abused; hard to revoke at scale | Out of scope for v1 per PROJECT.md. Named-user sharing with accept/decline serves legitimate use cases. |
| In-app document editing | Users want a full office suite | Massive scope expansion (collaborative editing = CRDT, OT, conflict resolution); not core to document management; every vendor locks you into their editor | "View and organize" is the value; editing stays in the originating app. |
| Real-time collaborative editing (Google Docs model) | Feels modern | Requires WebSocket infrastructure, OT/CRDT algorithms, presence indicators — easily 3x the codebase complexity of everything else combined | Explicit non-goal; DocuVault stores files, not live documents. |
| Mobile app (iOS/Android) | Users want mobile access | React Native or Flutter doubles the implementation surface; mobile OAuth2 and background sync are non-trivial | Responsive web app is the minimum. PWA capabilities (offline-capable via service worker) are a future v2 differentiator. |
| SSO / SAML / OAuth enterprise federation | Enterprise buyers ask for it | Premature: adds Keycloak or similar dependency, requires session model changes, needs testing against multiple IdPs | TOTP 2FA first; SSO when subscription billing lands. Schema already designed for extension. |
| Subscription billing in-platform | Users want self-service upgrades | Payment processing is a separate product (Stripe integration, dunning, invoicing, tax). Doing it half-way creates billing bugs that destroy trust | Quota model is billing-ready (limit_bytes column); add Stripe when subscription tier is validated. |
| Soft-delete / Trash with restore | Users accidentally delete things | Reasonable feature, but complicates quota accounting (does trash count against quota?), storage lifecycle, and purge timing. Drive's 30-day trash has caused user confusion about "why is my quota still full?" | Clear confirmation dialog before hard delete is the v1 approach. Trash can be added in v2. |
| Document version history | Users want to see old versions | Correct for collaborative tools; for a personal document store where users own their files, versioning explodes storage use. Paperless-ngx doesn't have it. DocuWare does but it's enterprise | Hash-based deduplication (don't store the same file twice) is a better v1 primitive. Version history in v2 for users who need it. |
| AI-generated document summaries | Seems valuable given AI integration | Every summary requires an AI call → cost scales with uploads; summaries become stale when topics change; storing summaries consumes quota | AI is used for classification only. Summary generation can be a user-triggered on-demand action later. |
| Admin impersonation / "log in as user" | Admins want to debug user issues | Directly contradicts the privacy-first core value. If admin can impersonate, admin can access user documents. Trust collapses. | Structured audit logs give admins enough to diagnose issues without impersonation. Document this as an explicit architectural decision. |
---
## Feature Dependencies
```
[User Registration + Auth]
└──requires──> [Email verification] (optional at v1, needed for password reset)
└──requires──> [JWT session management]
└──requires──> [Token revocation list] (for forced logout)
[TOTP 2FA]
└──requires──> [Backup codes] (users get locked out without these)
└──requires──> [User Auth base]
[Document Upload]
└──requires──> [User Auth]
└──requires──> [Quota enforcement]
└──requires──> [Quota tracking (DB)]
[Folder Structure]
└──requires──> [Document Upload]
└──enhances──> [Search] (search within folder scope)
[Document Sharing]
└──requires──> [User accounts] (both sender and recipient)
└──requires──> [Folder Structure] ("Shared with me" virtual folder)
└──requires──> [Permission model in DB]
[Share expiry]
└──requires──> [Document Sharing base]
└──requires──> [Background job / cron] (to revoke expired shares)
[Cloud Storage Integration]
└──requires──> [User Auth] (OAuth2 tokens stored per user)
└──requires──> [Credential encryption] (Fernet / pgcrypto)
└──requires──> [StorageBackend adapter interface] (already planned)
└──enhances──> [Quota tracking] (cloud quota shown separately)
[Per-document storage routing]
└──requires──> [Cloud Storage Integration]
└──requires──> [Multiple backends connected]
[Admin Panel]
└──requires──> [User accounts]
└──requires──> [Quota DB model]
└──requires──> [Audit log]
[Audit Log]
└──requires──> [User Auth] (actor_user_id)
└──requires──> [All logged operations exist]
└──conflicts──> [Admin impersonation] (impersonation breaks log integrity)
[GDPR data export]
└──requires──> [Audit log]
└──requires──> [Document metadata model]
└──requires──> [Background export job] (large exports are async)
[AI classification]
└──requires──> [Document upload + text extraction] (already exists)
└──requires──> [Topic model] (already exists)
└──enhances──> [Search] (topic-filtered search)
[In-browser PDF preview]
└──requires──> [Document stored accessibly] (MinIO presigned URL or proxy endpoint)
└──independent of──> [Cloud storage] (preview proxied through app, not direct cloud URL)
```
### Dependency Notes
- **TOTP requires backup codes:** Without backup codes, users who lose their phone lose their account permanently. This is a support nightmare and a documented UX failure in many 2FA implementations.
- **Forced logout requires server-side token tracking:** Pure stateless JWT cannot support "sign out all devices." A revocation list (Redis or DB table) or short-lived JWTs (15 min) + refresh token table is required. This is a non-trivial architecture decision that must be resolved before auth is implemented.
- **Share expiry requires a background job or lazy evaluation:** A stored expiry timestamp does nothing by itself; something must act on it. Either a cron job or Celery task scans for and revokes expired shares, or every share check compares the expiry against the current time at access (lazy revocation, which is simpler and acceptable for v1).
- **Per-document storage routing is independent of the default storage selector:** Default storage is "which backend do new uploads go to by default?" Routing is "for this specific upload, override the default." Routing is a differentiator; default selector is table stakes.
- **Admin impersonation conflicts with privacy-first model:** These two features are architecturally incompatible. Document this as an explicit decision in PROJECT.md.
- **GDPR export is async:** Exporting all user data (documents + metadata) can take minutes for large accounts. Must be a background job with "email me when ready" — not a synchronous HTTP download.
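
Lazy revocation, as mentioned in the share-expiry note, amounts to evaluating expiry and revocation on every access check. A minimal sketch follows; the `Share` dataclass, its field names, and `share_grants_access` are hypothetical, not the project's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Share:
    document_id: str
    recipient_id: int
    expires_at: Optional[datetime] = None  # None means the share never expires
    revoked_at: Optional[datetime] = None  # set when the owner revokes access


def share_grants_access(share: Share, user_id: int,
                        now: Optional[datetime] = None) -> bool:
    """Evaluate a share at access time; no background job is needed for correctness."""
    now = now or datetime.now(timezone.utc)
    if share.recipient_id != user_id:
        return False
    if share.revoked_at is not None and share.revoked_at <= now:
        return False  # revocation takes effect on the very next request
    if share.expires_at is not None and share.expires_at <= now:
        return False  # expired shares are simply never honoured
    return True
```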
---
## MVP Definition (This Milestone)
This is a subsequent milestone — the core document + AI system exists. MVP for this milestone is everything in PROJECT.md Active section, implemented in a way that doesn't create rework.
### Must Ship (v1 — this milestone)
- [ ] Email + password registration with strength validation — table stakes, nothing works without it
- [ ] JWT sessions with refresh tokens + forced logout capability — security baseline
- [ ] TOTP 2FA with backup codes — PROJECT.md requires it; backup codes prevent lockouts
- [ ] Password reset via email — users will forget passwords
- [ ] Per-user isolated storage (MinIO prefix per user ID) — without this, multi-user is unsafe
- [ ] Quota tracking (DB) + enforcement at upload + visible indicator — table stakes for any quota system
- [ ] Quota warnings at 80% and 95% — prevents users hitting the wall blindly
- [ ] Create / rename / delete folders with breadcrumb navigation — table stakes file management
- [ ] Move document to folder (context menu minimum; drag-drop in v1.x) — without this folder structure is useless
- [ ] Sort by name/date/size — basic list management
- [ ] Global search (metadata + extracted text) — without this the product is a dumb file cabinet
- [ ] Share document with named user by handle (view-only default) — core sharing primitive
- [ ] Revoke share with immediate effect — security; users must be able to undo shares
- [ ] "Shared with me" virtual folder — PROJECT.md requires it
- [ ] Cloud storage connect flow (Google Drive + OneDrive OAuth2) — PROJECT.md requires it
- [ ] Connection status indicator + error state with local fallback — users need to know if cloud is broken
- [ ] Default storage backend selector — users need to control where uploads go
- [ ] Credential encryption (Fernet + env key) — PROJECT.md hard requirement
- [ ] Admin: user list, create, deactivate, quota adjustment — PROJECT.md requires it
- [ ] Admin: AI provider/model assignment per user + system default — PROJECT.md requires it
- [ ] Audit log (append-only): auth events + data events — PROJECT.md requires it
- [ ] Admin audit log viewer with date/user/action filters — PROJECT.md requires it
- [ ] In-browser PDF preview (PDF.js or native) — without preview, users must download to read
- [ ] Document metadata panel (size, date, topics, storage backend) — contextual information users expect
### Add After Core Validated (v1.x)
- [ ] Share expiry date — valuable for security-conscious users; requires background job
- [ ] Explicit share accept/decline — adds trust; low complexity
- [ ] "What I've shared" global list in settings — users need to audit their shares
- [ ] In-app share notification — without email infrastructure, in-app is the fallback
- [ ] Storage usage breakdown by folder — power user feature; quota bar is sufficient at v1
- [ ] CSV export of audit log — compliance teams want this; not day-one critical
- [ ] Log retention policy configuration — needed for GDPR compliance; default 1yr is fine for v1
- [ ] Drag-and-drop document move — UX polish; context menu is the functional minimum
- [ ] "Recent documents" virtual view — convenience; not blocking
### Future (v2+)
- [ ] GDPR data export (Article 20) — required eventually; complex (async job); defer until user base exists
- [ ] Per-document storage routing (override default per upload) — strong differentiator but complex; v2
- [ ] User-facing activity log (own audit trail) — GDPR-aligned trust feature; v2
- [ ] Soft-delete / Trash with restore — adds quota accounting complexity; solve properly in v2
- [ ] Document version history — storage-intensive; needs a clear retention policy first
- [ ] SSO / SAML — per PROJECT.md, after subscription billing
- [ ] Public link sharing — per PROJECT.md, explicitly out of scope for v1
- [ ] Platform health dashboard at /admin/health — operational convenience; Docker healthchecks handle v1
---
## Feature Prioritization Matrix
| Feature | User Value | Implementation Cost | Priority |
|---------|------------|---------------------|----------|
| Per-user auth (registration, login, JWT) | HIGH | LOW | P1 |
| TOTP 2FA + backup codes | HIGH | MEDIUM | P1 |
| Password reset | HIGH | LOW | P1 |
| Quota tracking + enforcement + indicator | HIGH | LOW | P1 |
| Quota warnings (80%, 95%) | HIGH | LOW | P1 |
| Folder CRUD + breadcrumb | HIGH | LOW | P1 |
| Move document to folder | HIGH | LOW | P1 |
| Global search | HIGH | MEDIUM | P1 |
| Share with named user (view-only) | HIGH | MEDIUM | P1 |
| Revoke share | HIGH | LOW | P1 |
| "Shared with me" folder | HIGH | LOW | P1 |
| Cloud connect flow (OAuth2) | HIGH | HIGH | P1 |
| Connection status + error state + local fallback | HIGH | MEDIUM | P1 |
| Default storage selector | HIGH | LOW | P1 |
| Credential encryption | HIGH | MEDIUM | P1 |
| Admin user management (list/create/deactivate/quota) | HIGH | LOW | P1 |
| Admin AI config | MEDIUM | LOW | P1 |
| Audit log (append-only) | HIGH | MEDIUM | P1 |
| Admin audit log viewer | MEDIUM | MEDIUM | P1 |
| In-browser PDF preview | HIGH | MEDIUM | P1 |
| Document metadata panel | MEDIUM | LOW | P1 |
| Forced logout / session revocation | MEDIUM | MEDIUM | P1 |
| Share expiry date | MEDIUM | MEDIUM | P2 |
| Explicit share accept/decline | MEDIUM | LOW | P2 |
| "What I've shared" list | MEDIUM | LOW | P2 |
| Storage usage breakdown by folder | MEDIUM | MEDIUM | P2 |
| CSV export of audit log | MEDIUM | LOW | P2 |
| Log retention policy | MEDIUM | LOW | P2 |
| Drag-and-drop move | MEDIUM | MEDIUM | P2 |
| Recent documents view | LOW | LOW | P2 |
| Per-document storage routing | HIGH | HIGH | P3 |
| GDPR data export | MEDIUM | HIGH | P3 |
| Soft-delete / Trash | MEDIUM | HIGH | P3 |
| User-facing activity log | MEDIUM | LOW | P3 |
| Document version history | LOW | HIGH | P3 |
| Platform health dashboard | LOW | MEDIUM | P3 |
---
## Competitor Feature Analysis
| Feature | Google Drive | Dropbox | Box | Paperless-ngx | DocuVault Plan |
|---------|--------------|---------|-----|----------------|----------------|
| Share by named user | Yes (email) | Yes (email) | Yes (email + internal) | No (single-user) | Yes (by handle) |
| Share permission levels | View / Comment / Edit | View / Edit | View / Edit / Co-owner | N/A | View-only v1; edit later |
| Share expiry | Google Workspace only | Business+ only | Paid tiers only | N/A | Include in v1.x; differentiator at free tier |
| Public link sharing | Yes | Yes | Yes | No | Explicitly excluded v1 |
| Quota indicator | Yes (prominent) | Yes | Yes | N/A (local disk) | Yes — progress bar + text |
| Quota warnings | Yes (email at 80%) | Yes (email at 90%) | Yes | N/A | In-app banner at 80%, 95% |
| Folder organization | Yes (arbitrary depth) | Yes | Yes | Yes | Yes (arbitrary depth) |
| Drag-and-drop move | Yes | Yes | Yes | Yes | v1.x (context menu at v1) |
| Global search | Yes (full-text) | Yes (full-text) | Yes (full-text) | Yes (extracted text) | Yes (extracted text — already extracted) |
| AI classification | No (manual labels) | No | No | Basic rules-based | Yes — core differentiator |
| Audit log (admin) | Google Workspace | Business+ | All tiers | No | Yes — all tiers |
| TOTP 2FA | Yes | Yes | Yes | Yes | Yes |
| Backup codes | Yes | Yes | Yes | Yes | Yes — required with TOTP |
| Bring-your-own storage | No | No | No | Partially (local FS) | Yes — core differentiator |
| Privacy-first admin model | No (Google can read) | No | No | N/A (self-hosted) | Yes — core differentiator |
| In-browser PDF preview | Yes | Yes | Yes | Yes | Yes (PDF.js) |
| Document version history | Yes | Yes (30 days free) | Yes (30 days free) | No | v2 |
| GDPR data export | Yes (Google Takeout) | Yes | Yes | N/A | v2 |
---
## Critical UX Patterns to Follow
### Quota UX (from Drive / Dropbox study)
**Sidebar quota bar pattern** (Drive): Always-visible storage indicator at the bottom of the left sidebar. Shows "X.X MB of Y MB used" with a color-coded bar (green → yellow → red as quota fills). This is the pattern users are conditioned to expect.
**Upload rejection UX**: Never show a generic error. Show: current usage, quota limit, how much the rejected file would have added, and a direct link to storage settings. Dropbox does this well; many self-hosted tools fail here.
**Quota warning banner**: Non-modal, dismissible, but persistent. Appears at 80% in amber and at 95% in red. At 100%, uploads are blocked with an inline error (not just a banner).
### Folder UX (from Drive / Explorer pattern)
**Breadcrumb is mandatory**: Any folder deeper than one level without a breadcrumb creates navigation confusion. Users instinctively hit "back" in their browser and are surprised to leave the app.
**Empty folder state**: Show a clear empty state ("This folder has no documents yet. Upload one to get started.") — not a blank white space.
**Delete folder confirmation must list contents**: "Are you sure? This will permanently delete 47 documents." — not just a generic "are you sure?"
### Sharing UX (from Box / Drive)
**Share dialog anti-pattern to avoid**: Auto-sending share notifications without preview. Drive's "A notification will be sent" with an optional message is the correct pattern — user controls whether to notify.
**Shared document visual indicator**: An icon overlay (people icon) in list view on shared items. Without this, users lose track of what they've shared. Drive and Box both do this.
**Revoke is immediate and feedback is given**: "Access revoked for user@handle" toast. Not silent.
### Cloud Integration UX (from Dropbox Business / Google Drive)
**Connection health must be persistent**: Don't only show errors at connect time. Show ongoing status in storage settings. A disconnected cloud backend that silently fails uploads is a data-loss scenario.
**Token expiry is silent without handling**: OAuth2 tokens expire. If the app doesn't handle refresh silently (or at least alert the user clearly), users will think their uploads succeeded when they didn't. **This is a critical pitfall.**
---
## What We Might Miss (Gap Analysis)
Items that competitors have which aren't in PROJECT.md Active requirements and are easy to overlook:
1. **Backup codes for TOTP** — PROJECT.md mentions TOTP but not backup codes. Without backup codes, a lost phone = permanently locked-out account. Every 2FA implementation must include these. HIGH severity omission.
2. **Quota warning thresholds** — PROJECT.md says "quota enforced; uploads rejected." It doesn't mention pre-emptive warnings at 80%/95%. Users who hit the wall without warning give negative reviews. Easy to implement; easy to forget.
3. **Session revocation / forced logout** — Not in PROJECT.md. JWT-based auth has no built-in revocation. If a user believes their account is compromised, they need "sign out everywhere." Requires either short-lived JWTs + refresh token table, or a server-side revocation list.
4. **Breadcrumb navigation** — Easy to forget in folder implementation. The folder CRUD is in PROJECT.md but the navigation UX isn't. Without breadcrumbs, nested folders become unusable.
5. **"What I've shared" list** — PROJECT.md covers sharing mechanics but doesn't cover share auditability from the sharer's perspective. Users who've shared many documents need a way to manage them all.
6. **Upload error when cloud backend is unreachable** — PROJECT.md says "documents stored in cloud backend accessed via app." What happens if the cloud backend is down at upload time? Needs explicit handling: local fallback with a flag, or queue with retry, or reject with explanation. Silence = data loss.
7. **MinIO presigned URL vs proxy for document access** — Not a feature gap but an architecture gap that affects features: if documents in cloud backends are accessed via presigned URLs, the URL leaks the storage path. If proxied through the app, privacy is preserved but adds load to the backend. For a privacy-first platform, proxy is the correct choice — but it must be a conscious decision before the cloud integration is built.
8. **Empty folder state and confirmation dialogs** — Micro-UX that competitors all get right. Easy to skip in implementation. Users notice.
9. **Shared document icon in list view** — Small visual indicator but prevents users from losing track of what they've shared. Three lines of CSS; easy to miss in sprint planning.
10. **Admin cannot set quota below current usage without warning** — Obvious in hindsight but easy to leave as a silent truncation. Must warn and require explicit confirmation.
---
## Sources
- Google Drive feature set: training knowledge through Aug 2025 (stable, well-documented product)
- Microsoft OneDrive feature set: training knowledge through Aug 2025
- Dropbox feature set: training knowledge through Aug 2025
- Box feature set: training knowledge through Aug 2025 (enterprise DMS reference)
- Notion: training knowledge through Aug 2025 (organization patterns)
- Paperless-ngx: training knowledge through Aug 2025 (self-hosted DMS reference — open source, well-documented)
- DocuWare: training knowledge through Aug 2025 (enterprise DMS reference)
- Web fetch and web search unavailable during this research session; confidence reduced to MEDIUM for specific UI detail claims; HIGH for feature existence claims (these are all mature, stable platforms)
---
*Feature research for: DocuVault — SaaS Document Management Platform*
*Researched: 2026-05-21*
+550
View File
@@ -0,0 +1,550 @@
# Pitfalls Research
**Domain:** Multi-user SaaS document management platform (FastAPI + Vue 3, PostgreSQL + MinIO, cloud storage integrations)
**Researched:** 2026-05-21
**Confidence:** HIGH (auth/storage/multi-tenancy patterns are well-established; specific FastAPI + MinIO combination is MEDIUM — no web search available)
---
## Critical Pitfalls
### Pitfall 1: JWT in localStorage — XSS Gives Full Account Takeover
**What goes wrong:**
The Vue 3 SPA stores the JWT access token in `localStorage`. Any JavaScript injected via XSS (in file names, document content previewed in the UI, a compromised dependency) can call `localStorage.getItem('token')` and exfiltrate a long-lived credential. The attacker then impersonates the user from any origin, bypasses TOTP entirely (the token is post-authentication), and can access all documents, including cloud storage credentials.
**Why it happens:**
`localStorage` is the path of least resistance in SPAs. It survives page reloads, works with Axios interceptors trivially, and requires no server-side session state. FastAPI tutorials almost universally use `Authorization: Bearer` headers set from localStorage.
**How to avoid:**
- Issue the JWT as an **httpOnly, SameSite=Strict, Secure cookie** — JavaScript cannot read it.
- Use a short-lived access token (15 minutes) in the httpOnly cookie.
- Issue a separate refresh token (httpOnly cookie, longer TTL, `/auth/refresh` path-scoped) to rotate access tokens silently.
- The Vue frontend never holds the raw token string. Axios is configured with `withCredentials: true`; the browser attaches cookies automatically.
- CSRF protection: because `SameSite=Strict` blocks cross-site cookie submission, CSRF tokens are not strictly required for same-origin SPAs, but add a CSRF header check (`X-Requested-With: XMLHttpRequest`) as defence-in-depth.
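The cookie attributes listed above can be sketched with only the standard library's `http.cookies`. The cookie names (`access_token`, `refresh_token`) and the 7-day refresh lifetime are illustrative assumptions, not values from the project; in FastAPI the same attributes would be passed to `Response.set_cookie`.

```python
from http.cookies import SimpleCookie


def build_auth_cookies(access_token: str, refresh_token: str) -> list[str]:
    """Return Set-Cookie header values for the two auth cookies."""
    jar = SimpleCookie()

    # Short-lived access token, sent on every same-site request.
    jar["access_token"] = access_token
    jar["access_token"]["path"] = "/"
    jar["access_token"]["max-age"] = 15 * 60  # 15 minutes

    # Longer-lived refresh token, only sent to the refresh endpoint.
    jar["refresh_token"] = refresh_token
    jar["refresh_token"]["path"] = "/auth/refresh"
    jar["refresh_token"]["max-age"] = 7 * 24 * 3600  # assumed lifetime

    for morsel in jar.values():
        morsel["httponly"] = True      # not readable from JavaScript
        morsel["secure"] = True        # HTTPS only
        morsel["samesite"] = "Strict"  # never sent on cross-site requests

    return [morsel.OutputString() for morsel in jar.values()]
```

Scoping the refresh cookie to `/auth/refresh` keeps it out of ordinary API requests, so it is exposed far less often than the access token.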
**Warning signs:**
- `localStorage.getItem` anywhere in auth-related frontend code
- JWT decode functions in frontend code (the frontend should not need to decode a token it can't read)
- `Authorization: Bearer ${token}` header set manually in Axios interceptor
**Phase to address:** Auth phase (Phase 1 — Users & Auth)
---
### Pitfall 2: TOTP Bypass via Password Reset Flow
**What goes wrong:**
The password reset flow issues a one-time token by email. After the user resets their password, the system logs them in and redirects to the dashboard — skipping the TOTP prompt because the session was created through the reset path, not the login path. An attacker who compromises a user's email account can therefore completely bypass TOTP.
**Why it happens:**
Password reset and login are treated as separate code paths. The TOTP check lives in the login handler; the reset handler creates a session directly after credential update without going through the TOTP gate.
**How to avoid:**
- After a successful password reset, issue a partial session or a `password_reset_pending` state, not a full authenticated session.
- Force the user to complete the full login flow (including TOTP if enabled) from the new credentials.
- Alternatively, on password reset completion, invalidate all existing sessions and send the user to the login page (no auto-login).
- Log password resets in the audit trail with IP and user-agent.
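A minimal sketch of the "no auto-login" variant, using an in-memory stand-in for the session table. `SessionStore` and `complete_password_reset` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SessionStore:
    """In-memory stand-in for a server-side session table."""
    sessions: dict[str, str] = field(default_factory=dict)  # session_id -> user_id

    def revoke_all(self, user_id: str) -> int:
        doomed = [sid for sid, uid in self.sessions.items() if uid == user_id]
        for sid in doomed:
            del self.sessions[sid]
        return len(doomed)


def complete_password_reset(store: SessionStore, password_hashes: dict,
                            user_id: str, new_hash: str) -> dict:
    """Update the credential, revoke every session, and create no new one."""
    password_hashes[user_id] = new_hash
    revoked = store.revoke_all(user_id)
    # The reset path never issues a session. The client is sent to the normal
    # login flow, which is the only place the TOTP gate lives.
    return {"next": "/login", "sessions_revoked": revoked, "session_id": None}
```

Because the reset handler cannot return a session, there is no code path that reaches protected endpoints without passing through login and TOTP.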
**Warning signs:**
- Password reset handler calls the same session-creation function as login but omits 2FA state checks
- No `mfa_verified` flag on sessions (only `authenticated`)
- Users can reach protected endpoints via a token created from the reset path
**Phase to address:** Auth phase (Phase 1 — Users & Auth)
---
### Pitfall 3: Refresh Token Rotation — Stolen Refresh Token Not Detected
**What goes wrong:**
Refresh tokens are issued as single-use rotating credentials. If an attacker steals a refresh token and uses it before the legitimate user does, the server rotates the token and the attacker has a valid new one. The legitimate user's next refresh request fails — but without a detection mechanism the failure just looks like a session expiry, no alert is raised, and the attacker continues with the stolen session indefinitely.
**Why it happens:**
Teams implement the "rotate on use" mechanic without implementing the "revoke family on reuse" detection. The `refresh_tokens` table lacks a `family_id` column linking reissued tokens.
**How to avoid:**
- Store refresh tokens in the database with a `family_id` (UUID for the original issuance chain) and a `revoked` flag.
- When a refresh token is presented: if it is already marked `revoked` (i.e., a previously rotated token), revoke the **entire family** — force logout of all sessions for that user.
- Emit a security alert (audit log + optionally email) when a reuse attempt is detected.
- Refresh tokens should be hashed before storage (bcrypt or SHA-256 with a per-row salt), same as passwords.
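The reuse-detection rule can be sketched with an in-memory stand-in for the `refresh_tokens` table. `RefreshTokenStore` is a hypothetical name. As a simplification, the sketch hashes tokens with plain SHA-256 so that a row can be looked up by its hash, and it revokes only the affected family rather than every session of the user.

```python
import hashlib
import secrets
import uuid


class RefreshTokenStore:
    """In-memory stand-in for a refresh_tokens table."""

    def __init__(self):
        self.rows = {}  # token_hash -> {"family_id": str, "revoked": bool}

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, family_id=None) -> str:
        """Create a token; a new family starts when no family_id is given."""
        token = secrets.token_urlsafe(32)
        self.rows[self._hash(token)] = {
            "family_id": family_id or str(uuid.uuid4()),
            "revoked": False,
        }
        return token

    def rotate(self, presented: str):
        """Return a new token, or None if the presented one is invalid."""
        row = self.rows.get(self._hash(presented))
        if row is None:
            return None
        if row["revoked"]:
            # A rotated token was presented again: treat as theft and
            # revoke the whole family so neither party keeps a session.
            for other in self.rows.values():
                if other["family_id"] == row["family_id"]:
                    other["revoked"] = True
            return None
        row["revoked"] = True
        return self.issue(row["family_id"])
```

A real implementation would also write the audit-log entry at the point where the family is revoked.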
**Warning signs:**
- Refresh tokens stored as plain values in DB with no `revoked` column
- No `family_id` linking related rotation chains
- Stolen refresh token detection treated as "v2 feature"
**Phase to address:** Auth phase (Phase 1 — Users & Auth)
---
### Pitfall 4: TOTP — Timing Attack on Code Verification
**What goes wrong:**
TOTP verification uses Python `==` string comparison. Python's `==` on strings is not constant-time — it short-circuits on the first differing character. A sufficiently sophisticated timing oracle (millions of requests from a local network) can distinguish valid from invalid codes, reducing the 6-digit brute-force space. More practically: without rate limiting, an attacker can brute-force all 1,000,000 possible 6-digit codes during a 30-second window.
**Why it happens:**
TOTP libraries (PyOTP) return a string; developers do `if provided == expected`. Rate limiting is added "later" and often never lands.
**How to avoid:**
- Use `hmac.compare_digest(provided, expected)` instead of `==` for TOTP comparison.
- Rate-limit TOTP attempts: 5 attempts per 30-second window, 15-minute lockout on excess.
- The lockout must be stored server-side (Redis or DB), not client-side.
- Accept only the current window and optionally ±1 window for clock drift — do not accept wider ranges.
- Log every failed TOTP attempt with IP.
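A minimal sketch of constant-time comparison combined with a server-side attempt limit. `TotpAttemptLimiter` is a hypothetical helper; it keeps its state in memory, whereas the text above calls for Redis or the database, and `expected` stands for the code produced by the TOTP library. The sketch locks the account once the fifth failure lands inside the window.

```python
import hmac
import time

MAX_ATTEMPTS = 5
WINDOW_SECONDS = 30
LOCKOUT_SECONDS = 15 * 60


class TotpAttemptLimiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}      # user_id -> timestamps of recent failures
        self._locked_until = {}  # user_id -> time the lockout ends

    def check(self, user_id: str, provided: str, expected: str) -> bool:
        now = self._clock()
        if self._locked_until.get(user_id, 0) > now:
            return False  # locked out: reject without comparing
        # compare_digest takes the same time wherever the strings differ.
        if hmac.compare_digest(provided.encode(), expected.encode()):
            self._failures.pop(user_id, None)
            return True
        recent = [t for t in self._failures.get(user_id, [])
                  if now - t < WINDOW_SECONDS]
        recent.append(now)
        self._failures[user_id] = recent
        if len(recent) >= MAX_ATTEMPTS:
            self._locked_until[user_id] = now + LOCKOUT_SECONDS
        return False
```

Injecting the clock keeps the lockout logic testable without sleeping.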
**Warning signs:**
- `if totp.verify(code):` without inspecting PyOTP's internal comparison method
- No rate limit on `POST /auth/totp/verify`
- `valid_window` set to 2 or more (PyOTP's default is 0, which accepts only the current code; ±1 already keeps a code valid for 90 seconds, and anything wider is more than needed)
**Phase to address:** Auth phase (Phase 1 — Users & Auth)
---
### Pitfall 5: Path Traversal in User File Access (MinIO Object Keys)
**What goes wrong:**
MinIO object keys are constructed from user input: `f"users/{user_id}/{filename}"`. If `filename` contains `../`, `../../`, or URL-encoded equivalents (`%2e%2e%2f`), the resulting key may escape the user's prefix and land in another user's namespace. A request for `../../other_user_id/secret.pdf` resolves to `users/other_user_id/secret.pdf`.
**Why it happens:**
Developers trust that MinIO will sanitize paths. It does not — S3-compatible APIs treat object keys as arbitrary strings. The prefix-based isolation is only as safe as the key construction code.
**How to avoid:**
- Never use raw filenames as object key components. Generate a UUID (or UUID + original extension) as the stored key: `f"users/{user_id}/{uuid4()}{ext}"`.
- Store the human-readable filename in the database metadata row, completely decoupled from the storage key.
- If filenames must appear in keys for any reason, strip or reject any `/`, `\`, `..`, `%` characters before key construction.
- On retrieval, look up the object key from the database row (which is owned by `user_id`) rather than constructing it from user input.
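A minimal sketch of UUID-based key construction. `make_object_key` is a hypothetical helper, and `user_id` is assumed to come from the authenticated session, never from the request body.

```python
import re
import uuid
from pathlib import PurePosixPath

# Only short alphanumeric extensions are carried over into the key.
_SAFE_EXT = re.compile(r"\.[A-Za-z0-9]{1,10}")


def make_object_key(user_id: str, original_filename: str) -> str:
    """Build a storage key that contains no user-controlled path segments."""
    # Normalise Windows separators, then keep only the final extension.
    ext = PurePosixPath(original_filename.replace("\\", "/")).suffix.lower()
    if not _SAFE_EXT.fullmatch(ext):
        ext = ""
    return f"users/{user_id}/{uuid.uuid4()}{ext}"
```

The original filename is stored only in the metadata row, so a name such as `../../other_user_id/secret.pdf` contributes nothing to the key except `.pdf`.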
**Warning signs:**
- `object_key = f"users/{user_id}/{request.filename}"`
- File download endpoint accepts a `path` or `filename` query parameter and constructs the key from it
- No database lookup intermediating between "user requests file" and "MinIO key used"
**Phase to address:** Storage migration phase (Phase 2 — DB + MinIO migration)
---
### Pitfall 6: Quota Race Condition — Concurrent Uploads Bypass the Limit
**What goes wrong:**
Two upload requests, each carrying a 1 MB file, arrive simultaneously for a user at 99 MB of a 100 MB quota. Both read quota usage as 99 MB, both pass the `99 + 1 <= 100` check, and both proceed to upload, so the user ends at 101 MB. At larger scale (many simultaneous large uploads) the overage can be significant.
**Why it happens:**
Quota enforcement is implemented as: read current usage → check → write file → update usage. This is a classic check-then-act race when the check and the write are not atomic.
**How to avoid:**
- Enforce quota atomically using a `SELECT ... FOR UPDATE` on the quota row before uploading, or use a PostgreSQL advisory lock keyed on `user_id`.
- Better: use a single conditional update: `UPDATE quotas SET used_bytes = used_bytes + $new WHERE user_id = $uid AND used_bytes + $new <= limit_bytes RETURNING used_bytes`. If 0 rows are updated, the quota would be exceeded; reject before touching MinIO.
- Treat that increment as a reservation: reserve bytes → upload → confirm. If the MinIO upload fails, release the reserved bytes, and run a cleanup job for stuck reservations. If `used_bytes` is instead updated only after the upload succeeds, hold the row lock through the upload.
- Never read quota, do arithmetic in Python, then write back as two separate statements.
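The atomic check-and-increment can be sketched as follows. SQLite is used only so the example runs with the standard library; the project targets PostgreSQL, where `RETURNING used_bytes` would also be available. The `quotas` table layout and the function names are assumptions.

```python
import sqlite3


def reserve_bytes(conn: sqlite3.Connection, user_id: str, size: int) -> bool:
    """Atomically add `size` to used_bytes if the result stays within the limit."""
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + ? "
        "WHERE user_id = ? AND used_bytes + ? <= limit_bytes",
        (size, user_id, size),
    )
    conn.commit()
    # Zero rows updated means the quota would be exceeded (or no such user).
    return cur.rowcount == 1


def release_bytes(conn: sqlite3.Connection, user_id: str, size: int) -> None:
    """Give back a reservation, for example after a failed upload."""
    conn.execute(
        "UPDATE quotas SET used_bytes = MAX(used_bytes - ?, 0) WHERE user_id = ?",
        (size, user_id),
    )
    conn.commit()
```

Because the comparison and the increment happen in one statement, two concurrent requests cannot both observe the old value.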
**Warning signs:**
- `current = get_quota(user_id); if current + size <= limit: upload()`
- No database transaction wrapping quota check and update
- Quota table updated with `SET used_bytes = $computed_value` (full overwrite) rather than `used_bytes + delta`
**Phase to address:** Storage migration phase (Phase 2 — DB + MinIO migration) and Quotas phase
---
### Pitfall 7: Admin Privilege Escalation via Missing Ownership Checks
**What goes wrong:**
An admin API endpoint is gated on `is_admin=True` but document-access endpoints only check `is_authenticated`. An admin user calling `GET /api/documents/{id}` with any document ID can read any user's document because the handler checks authentication but not `document.owner_id == current_user.id`. The privacy-first model is violated without any special exploit — just a correctly authenticated request.
**Why it happens:**
Authorization is implemented as authentication + role checks ("is this user an admin?") without resource-level ownership verification ("does this user own this resource?"). FastAPI dependency injection makes it easy to write `current_user = Depends(get_current_user)` and forget to check `resource.user_id == current_user.id`.
**How to avoid:**
- Every document/file/folder endpoint must assert `resource.user_id == current_user.id` (or check share grants). This check cannot be optional or deferred.
- Admins access user account metadata via separate admin-scoped endpoints that explicitly exclude document content, file URLs, and cloud credentials.
- Write a test: log in as admin, attempt `GET /documents/{document_owned_by_other_user}`, assert `403`.
- Use a centralized `assert_document_access(document, current_user)` function rather than inline checks to prevent omissions.
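A minimal sketch of the centralized check. The `User` and `Document` shapes and the `AccessDenied` exception are assumptions for illustration; the API layer would map the exception to a `403`.

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when the current user may not read the document."""


@dataclass
class User:
    id: int
    is_admin: bool = False


@dataclass
class Document:
    id: int
    user_id: int
    shared_with: set[int] = field(default_factory=set)


def assert_document_access(document: Document, current_user: User) -> None:
    """Allow the owner and explicit share recipients, nobody else."""
    if document.user_id == current_user.id:
        return
    if current_user.id in document.shared_with:
        return
    # is_admin is deliberately not consulted: admins get no document access.
    raise AccessDenied(f"user {current_user.id} cannot access document {document.id}")
```

Keeping the rule in one function means the cross-user test described above exercises the same code every endpoint uses.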
**Warning signs:**
- Document endpoints that check `if not current_user` but not `if document.user_id != current_user.id`
- Admin endpoints that return document content or presigned URLs
- No explicit tests for cross-user access attempts
**Phase to address:** Auth phase (Phase 1) and every subsequent phase that adds resource-access endpoints
---
### Pitfall 8: Cloud Credential Leakage via Admin Query
**What goes wrong:**
Cloud storage credentials (OAuth tokens, Nextcloud passwords) are encrypted at rest. But a query like `SELECT * FROM cloud_connections WHERE user_id = $uid` returns the ciphertext. If an admin dashboard endpoint runs this query and serializes the full row to JSON, the ciphertext ships to the admin browser. An admin cannot decrypt it without the key — but the ciphertext is now in browser history, proxy logs, and admin audit records. If the encryption key is later exposed, all credentials decrypt retroactively.
**Why it happens:**
ORM models serialize all columns by default. Developers add `encrypted_credentials` to the model and forget to exclude it from admin-facing serializers.
**How to avoid:**
- The `cloud_connections` table's credential column (`encrypted_token`, `encrypted_refresh_token`) must be **excluded** from all serialization by default.
- Use explicit Pydantic response models for every endpoint; do not pass full ORM objects through a `from_attributes` (`orm_mode` in Pydantic v1) schema that mirrors every column.
- Admin endpoints for cloud connections return only: `provider`, `account_label`, `connected_at`, `last_used_at`, `status` — never the credential column, not even ciphertext.
- Audit log cloud credential access separately: the only code path that should ever read the encrypted column is the storage adapter, not admin or user-info endpoints.
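The allow-list approach can be sketched with a plain dataclass standing in for the Pydantic response model, so the example has no dependencies. `CloudConnectionAdminView` and `to_admin_view` are hypothetical names; the field list is the one given above.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class CloudConnectionAdminView:
    """Everything an admin endpoint is allowed to return for a connection."""
    provider: str
    account_label: str
    connected_at: datetime
    last_used_at: Optional[datetime]
    status: str


_ADMIN_FIELDS = tuple(f.name for f in fields(CloudConnectionAdminView))


def to_admin_view(row: dict) -> CloudConnectionAdminView:
    """Copy only allow-listed columns; anything else in the row is dropped."""
    return CloudConnectionAdminView(**{name: row[name] for name in _ADMIN_FIELDS})
```

An allow-list fails safe: a credential column added to the table later is excluded until someone deliberately adds it to the view.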
**Warning signs:**
- `CloudConnection` Pydantic schema includes `encrypted_token` field
- Admin user-detail endpoint returns full cloud connection rows
- ORM model uses `model_config = ConfigDict(from_attributes=True)` without explicit field exclusions
**Phase to address:** Cloud storage integration phase (Phase 3 or 4)
---
### Pitfall 9: Flat-file to PostgreSQL Migration — Data Loss During Cutover
**What goes wrong:**
The migration script reads all JSON files, transforms them to DB rows, and then a flag switches the app to use PostgreSQL. Documents uploaded during the migration window are written to the old JSON store (because the flag has not flipped yet) and are missed by the migration script, which ran before they arrived. After cutover, those documents are invisible.
**Why it happens:**
Migrations are planned as offline events ("we'll take the app down for 5 minutes") and then discovered to be impractical — the app is used 24/7 or downtime feels risky. The team runs the migration online without dual-write.
**How to avoid:**
- Plan the migration in three phases:
1. **Dual-write**: deploy code that writes to both JSON and PostgreSQL, reads from JSON. All new documents land in both stores.
2. **Backfill**: run migration script to copy historical JSON records to PostgreSQL. New records are already there.
3. **Cutover**: flip read source to PostgreSQL, verify counts match, remove JSON write path.
- The dual-write window can be as short as one deployment cycle.
- Include a reconciliation check: `assert doc_count_json == doc_count_db` before cutting over.
- Keep the old JSON store read-only for 1 week post-cutover as a rollback option.
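The three phases can be sketched with two dictionaries standing in for the JSON store and PostgreSQL. `DualWriteStore`, `backfill`, and `reconcile` are hypothetical names; a real reconciliation would probably compare checksums as well as IDs.

```python
class DualWriteStore:
    """Phase 1: write to both stores, read from JSON until cutover."""

    def __init__(self, json_store: dict, db_store: dict, read_from_db: bool = False):
        self.json_store = json_store
        self.db_store = db_store
        self.read_from_db = read_from_db  # flipped at cutover (phase 3)

    def save(self, doc_id: str, doc: dict) -> None:
        self.json_store[doc_id] = doc
        self.db_store[doc_id] = doc

    def get(self, doc_id: str) -> dict:
        source = self.db_store if self.read_from_db else self.json_store
        return source[doc_id]


def backfill(json_store: dict, db_store: dict) -> int:
    """Phase 2: copy historical records the dual-write path never saw."""
    copied = 0
    for doc_id, doc in json_store.items():
        if doc_id not in db_store:
            db_store[doc_id] = doc
            copied += 1
    return copied


def reconcile(json_store: dict, db_store: dict) -> list:
    """Return IDs that are missing from, or differ in, the database."""
    return sorted(k for k, v in json_store.items() if db_store.get(k) != v)
```

Cutover is safe only when `reconcile` returns an empty list; documents uploaded while the backfill runs are already in both stores because of the dual-write path.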
**Warning signs:**
- Migration is planned as "take app down, run script, bring back up"
- No count-reconciliation step
- No rollback plan documented before migration begins
**Phase to address:** DB + MinIO migration phase (Phase 2)
---
### Pitfall 10: Blocking I/O Inside Async FastAPI Handlers (Existing Issue)
**What goes wrong:**
The codebase already uses synchronous `filelock` and `open()` inside `async def` handlers (CONCERNS.md item 6). After migration to PostgreSQL + MinIO, if synchronous DB drivers (psycopg2) or synchronous MinIO client calls replace the file I/O without wrapping in `asyncio.to_thread()`, the event loop stalls on every I/O operation. Under concurrent load (multiple users uploading), requests queue behind each other even though the hardware is idle.
**Why it happens:**
SQLAlchemy sync engine + psycopg2 is the default FastAPI tutorial stack. The MinIO Python SDK (`minio` package) is synchronous. Developers add `await` in front of calls that are not coroutines, get a type error, remove the `await`, and ship blocking code.
**How to avoid:**
- Use SQLAlchemy async engine with `asyncpg` driver: `create_async_engine("postgresql+asyncpg://...")`.
- Wrap all MinIO SDK calls in `asyncio.to_thread()` since there is no official async MinIO client: `await asyncio.to_thread(minio_client.put_object, ...)`.
- Alternatively use `aioboto3` (async boto3) which works with MinIO's S3-compatible API.
- `aiofiles` is already in `requirements.txt` — use it for any remaining local file operations.
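- Run the `pytest-asyncio` test suite with asyncio debug mode enabled (`PYTHONASYNCIODEBUG=1`, or `asyncio.run(..., debug=True)`); debug mode logs any callback or task step that holds the event loop for longer than `loop.slow_callback_duration` (100 ms by default).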
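A minimal sketch of moving a synchronous SDK call off the event loop with `asyncio.to_thread`. `blocking_put_object` is a stand-in that only sleeps; it is not the MinIO client.

```python
import asyncio
import time


def blocking_put_object(name: str) -> str:
    """Stand-in for a synchronous SDK call such as minio_client.put_object."""
    time.sleep(0.05)  # simulates network I/O that would block the loop
    return f"stored:{name}"


async def upload_many(names: list[str]) -> list[str]:
    """Run each blocking call in a worker thread; the loop stays responsive."""
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_put_object, name) for name in names)
    )
```

`asyncio.gather` returns results in the order the awaitables were passed, so callers can still pair each result with its input.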
**Warning signs:**
- `from sqlalchemy import create_engine` (not `create_async_engine`)
- `import psycopg2` anywhere in application code
- `minio_client.put_object(...)` not wrapped in `asyncio.to_thread`
- Uvicorn logs show high request latency even with low CPU usage
**Phase to address:** DB + MinIO migration phase (Phase 2)
---
### Pitfall 11: N+1 Queries on Document Listing
**What goes wrong:**
`GET /api/documents` returns a list of documents, each including folder name, topic names, and share count. The handler fetches the document list, then for each document issues separate queries for folder, topics, and shares. 100 documents = 301+ queries per page load. The existing codebase already has an O(N) disk scan equivalent (CONCERNS.md items 8, 9); the PostgreSQL migration preserves this pattern if not corrected.
**Why it happens:**
SQLAlchemy lazy loading is the default. `document.folder.name` triggers a query. Developers don't see it until production load hits.
**How to avoid:**
- Use SQLAlchemy `joinedload` or `selectinload` options on the document list query to eagerly load related entities.
- The list endpoint should be a single SQL query with JOINs, not a loop.
- Add `EXPLAIN ANALYZE` checks as part of the phase acceptance criteria.
- Enable SQLAlchemy `echo=True` in development to log every SQL statement.
- Pagination at the database level: `LIMIT 50 OFFSET $n` — not "fetch all, slice in Python."
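The shape of a single-query listing can be sketched in raw SQL. SQLite is used so the example runs with the standard library, the table and column names are assumptions, and topic names are left out for brevity. With the ORM, `selectinload` or `joinedload` would produce the equivalent eager loading.

```python
import sqlite3

# One statement returns the page, folder names, and share counts together.
LIST_SQL = """
SELECT d.id, d.title, f.name AS folder_name, COUNT(s.id) AS share_count
FROM documents d
LEFT JOIN folders f ON f.id = d.folder_id
LEFT JOIN shares s ON s.document_id = d.id
WHERE d.owner_id = ?
GROUP BY d.id, d.title, f.name
ORDER BY d.id
LIMIT ? OFFSET ?
"""


def list_documents(conn: sqlite3.Connection, owner_id: int,
                   limit: int = 50, offset: int = 0) -> list:
    """Fetch one page of documents in a single round trip."""
    return conn.execute(LIST_SQL, (owner_id, limit, offset)).fetchall()
```

The query count stays at one regardless of page size, which is the property the fixture-size warning sign below checks for.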
**Warning signs:**
- SQLAlchemy models using default relationship loading without explicit `lazy=` options
- `for doc in documents: doc.folder.name` pattern
- No `joinedload` or `selectinload` in list query definitions
- Query count in tests grows linearly with fixture size
**Phase to address:** DB + MinIO migration phase (Phase 2); enforce in document listing feature phase
---
### Pitfall 12: MinIO Presigned URL Expiry — Stale Links in the UI
**What goes wrong:**
The document preview and download UI displays links generated as MinIO presigned URLs with a 1-hour expiry. The Vue frontend fetches the document list on page load, stores the URLs in Pinia state, and renders them as `<img src>` or `<a href>`. If the user leaves the tab open for 2 hours and then clicks a link, they get a 403 (presigned URL expired). No error is shown; the image just fails to load or the download silently fails.
**Why it happens:**
Presigned URLs feel like permanent links. Teams generate them at list time for convenience and cache them in frontend state without expiry awareness.
**How to avoid:**
- Do not embed presigned URLs in list responses. Return only document metadata.
- Generate presigned URLs **on demand**: a separate `GET /api/documents/{id}/download-url` endpoint generates a short-lived URL (5 to 15 minutes) at the moment of user intent.
- Alternatively, proxy document bytes through the FastAPI backend (`GET /api/documents/{id}/file` streams from MinIO) — eliminates presigned URL complexity at the cost of bandwidth.
- If presigned URLs are cached in frontend state, include the expiry timestamp and regenerate before expiry.
**Warning signs:**
- Presigned URLs included in document list response JSON
- Frontend stores presigned URLs in Pinia state without TTL tracking
- No `GET /api/documents/{id}/download-url` endpoint (or equivalent)
- `presigned_url_expiry = 3600` with no frontend refresh logic
**Phase to address:** DB + MinIO migration phase (Phase 2) and document access phase
---
### Pitfall 13: OAuth Token for Cloud Storage — Revocation Not Handled
**What goes wrong:**
A user connects Google Drive via OAuth. The app stores the encrypted access token + refresh token. Six months later, the user revokes the app's access in their Google account settings. The next time DocuVault tries to access their Drive, it gets a 401. The refresh token exchange also fails with `invalid_grant`. The app has no handler for this — it retries, logs an error, and the user sees a generic "storage error" with no path to reconnect.
**Why it happens:**
OAuth happy-path is well-documented. Revocation and `invalid_grant` handling are in footnotes. Teams handle 401 → refresh → retry but don't handle refresh failure → user notification.
**How to avoid:**
- Wrap all cloud storage adapter calls in a two-level error handler:
- Level 1: 401 → attempt refresh → retry.
- Level 2: refresh fails with `invalid_grant` or `token_revoked` → mark connection as `REQUIRES_REAUTH` in DB, **do not retry**, surface a "reconnect required" notice to the user.
- The `cloud_connections` table must have a `status` column: `ACTIVE | REQUIRES_REAUTH | ERROR`.
- The UI must poll or react to `REQUIRES_REAUTH` state and prompt the user to re-authorize, not just show an error toast.
- Never silently swallow a revoked-token error or retry indefinitely.
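The two-level handler can be sketched as a small wrapper. The exception classes and the `operation` / `refresh` callables are hypothetical; a real adapter would translate each provider's `401` and `invalid_grant` responses into them.

```python
class Unauthorized(Exception):
    """Provider rejected the access token (HTTP 401)."""


class RefreshFailed(Exception):
    """Token refresh was rejected, e.g. invalid_grant or token_revoked."""


class ReauthRequired(Exception):
    """The user must reconnect the provider; retrying cannot help."""


def call_with_reauth_handling(connection: dict, operation, refresh):
    """Level 1: refresh once on 401. Level 2: flag the connection and stop."""
    try:
        return operation(connection)
    except Unauthorized:
        pass  # fall through to a single refresh attempt
    try:
        refresh(connection)
    except RefreshFailed:
        connection["status"] = "REQUIRES_REAUTH"  # persisted in a real system
        raise ReauthRequired(connection["provider"]) from None
    return operation(connection)  # exactly one retry after a good refresh
```

Raising a dedicated `ReauthRequired` lets the API layer return a response the UI can turn into a reconnect prompt instead of a generic storage error.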
**Warning signs:**
- Cloud storage adapter raises generic `StorageError` for all OAuth errors without distinguishing revocation
- No `status` or `needs_reauth` column on `cloud_connections`
- Reconnect UI not planned in feature scope
**Phase to address:** Cloud storage integration phase
---
### Pitfall 14: Cloud Storage Rate Limits — No Backoff, No Per-User Throttling
**What goes wrong:**
Multiple users with Google Drive connected trigger simultaneous document uploads. The app hits Google's per-app rate limit (Drive API: 20,000 requests/100 seconds/user, but also global per-project limits). Google returns 429. The app has no retry-with-backoff; uploads fail and the errors are attributed to the individual user, not the platform-level limit.
**Why it happens:**
Project-level rate limits apply per API key, not only per user. A single misbehaving user, or a burst of activity across all users, draws on the same shared quota. Teams only test with a single user.
**How to avoid:**
- Implement exponential backoff with jitter for all cloud storage adapter calls (start 1s, max 32s, 3 retries).
- Use a task queue (Celery or FastAPI BackgroundTasks with a semaphore) for cloud-destined uploads rather than processing inline in the request handler.
- Log 429 responses separately from other errors — they indicate platform-level throttling, not user errors.
- Per-provider rate limit documentation should be captured in adapter docstrings.
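A minimal sketch of exponential backoff with full jitter, using the numbers above (1 s start, 32 s cap, 3 retries). `RateLimited` is a hypothetical exception the adapter would raise on a 429; a library such as `tenacity` could replace the hand-written loop.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical: raised by an adapter when the provider returns 429."""


def backoff_delays(retries: int = 3, base: float = 1.0, cap: float = 32.0,
                   rng=random.random) -> list:
    """Full jitter: each delay is a random fraction of the capped exponential."""
    return [rng() * min(cap, base * 2 ** attempt) for attempt in range(retries)]


def call_with_backoff(operation, retries: int = 3, base: float = 1.0,
                      cap: float = 32.0, sleep=time.sleep):
    """Try once, then retry up to `retries` times on RateLimited."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return operation()
        except RateLimited:
            sleep(delay)
    return operation()  # final attempt; RateLimited propagates to the caller
```

Jitter matters here because many users share one project-level limit: without it, every throttled upload retries at the same instant and triggers the next 429.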
**Warning signs:**
- Cloud upload performed synchronously inside the HTTP request handler with no retry logic
- No `tenacity` or equivalent retry decorator on provider calls
- 429 responses from cloud APIs cause 500 responses to the end user
**Phase to address:** Cloud storage integration phase
---
### Pitfall 15: GDPR Right to Erasure — Cloud Copies Not Deleted
**What goes wrong:**
A user requests account deletion. The app deletes the PostgreSQL rows, the MinIO objects, and the user record. But if the user's documents were stored in Google Drive or OneDrive via the cloud storage adapter, those files remain on the cloud provider. The user's data has not actually been erased from all systems.
**Why it happens:**
Account deletion is implemented against the systems the team controls (PostgreSQL, MinIO). Cloud storage is treated as "the user's own storage" so deletion is skipped. Under GDPR, if the platform wrote files to the user's cloud storage on their behalf as a data processor, erasure obligations may extend there.
**How to avoid:**
- Account deletion must enumerate all `cloud_connections` for the user and call `adapter.delete_all_user_files()` (or equivalent) on each active connection.
- If the cloud connection is already in `REQUIRES_REAUTH` state, the deletion cannot proceed automatically — log a compliance alert and require manual follow-up or notify the user to delete manually.
- Document the data flow in a data map: "files stored in X go to Y" — this is a GDPR Article 30 requirement.
- Right to erasure flow must be tested as part of acceptance criteria, not assumed to work.
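The enumeration step can be sketched as below. `delete_cloud_copies` is a hypothetical helper; `delete_all_user_files` is the adapter method named above, and the connection dictionaries stand in for `cloud_connections` rows.

```python
def delete_cloud_copies(connections: list, adapters: dict) -> list:
    """Delete cloud copies where possible; return connection IDs needing follow-up."""
    needs_follow_up = []
    for conn in connections:
        if conn["status"] != "ACTIVE":
            # No usable token (e.g. REQUIRES_REAUTH): cannot delete automatically.
            needs_follow_up.append(conn["id"])
            continue
        try:
            adapters[conn["provider"]].delete_all_user_files(conn)
        except Exception:
            # Broad on purpose in this sketch: any failure becomes a compliance item.
            needs_follow_up.append(conn["id"])
    return needs_follow_up
```

The returned list is what would feed the compliance alert; account deletion should not be reported as complete while it is non-empty.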
**Warning signs:**
- Account deletion handler only operates on PostgreSQL and MinIO
- No `delete_user_files` method in the `StorageBackend` interface
- No GDPR data map in project documentation
**Phase to address:** Cloud storage integration phase and account management phase
---
### Pitfall 16: Encryption Key Management — Single Key for All Users
**What goes wrong:**
All cloud credentials are encrypted with one Fernet key stored in `CLOUD_ENCRYPTION_KEY` env var. If this key is exposed (leaked `.env`, compromised deployment config, insider threat), every user's cloud credentials decrypt at once. The privacy-first model collapses entirely.
**Why it happens:**
One key is simpler than per-user keys. Fernet key derivation per user is slightly more complex to implement.
**How to avoid:**
- Derive a per-user encryption key from the master key + a user-specific salt: `HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials")`. The salt is stored in the users table.
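- Even if the master key leaks, an attacker still needs each user's salt to decrypt their credentials. Because the salts are stored alongside the ciphertext, treat this as defence-in-depth rather than a substitute for protecting the master key.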
- Rotate the master key on a schedule: re-encrypt all stored credentials with the new key in a background job.
- Audit log any code path that accesses the encryption key — this should only ever happen in the storage adapter, never in API handlers.
- Never log the encryption key or derived key material, even at DEBUG level.
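A minimal sketch of the per-user derivation, with HKDF-SHA256 (RFC 5869) written out using only the standard library. The function names are hypothetical; in production the `cryptography` package's HKDF implementation would be the more usual choice. The output is encoded as URL-safe base64 of 32 bytes, which is the key format Fernet expects.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF with SHA-256: extract a pseudorandom key, then expand it."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_fernet_key(master_key: bytes, user_salt: bytes) -> bytes:
    """Derive a 32-byte per-user key and encode it in Fernet's key format."""
    raw = hkdf_sha256(master_key, user_salt, info=b"cloud-credentials")
    return base64.urlsafe_b64encode(raw)
```

The derivation is deterministic, so derived keys never need to be stored; only the master key and the per-user salt do.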
**Warning signs:**
- Single `ENCRYPTION_KEY` env var with no per-user derivation
- Encryption/decryption called directly in API endpoint handlers (not in the adapter layer)
- No key rotation plan documented
**Phase to address:** Cloud storage integration phase (design the key derivation before writing the first line of credential storage code)
---
### Pitfall 17: Shared Document — Quota Not Charged to Sharer Correctly After Revoke
**What goes wrong:**
User A shares a document with User B. The document is not duplicated — it stays in User A's storage and User A's quota. User A revokes the share. If the quota update on revoke has a bug (or is missing), User A's used bytes may drift from reality over time. At scale, quota inaccuracies accumulate and users can either be locked out prematurely or exceed limits invisibly.
**Why it happens:**
Sharing is implemented as a metadata-only operation (a `shares` table row), but quota accounting is only re-examined on upload and delete. Edge cases (revoke, owner deletion while share is active, cloud storage document quota) are skipped.
**How to avoid:**
- Quota is a property of documents, not shares. Ensure the quota model is: "sum of `file_size` for all documents where `owner_id = user_id`." Shares do not affect quota calculation.
- Recalculate quota from source-of-truth (a `SUM` query) as a periodic background job and reconcile against the cached `used_bytes` value. Alert if drift exceeds 1%.
- The `DELETE /documents/{id}` handler must atomically decrement quota and delete the DB row and MinIO object in a single transaction (DB) + best-effort cleanup (MinIO).
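The periodic reconciliation job can be sketched as a single aggregate query compared against the cached counter. The example below uses `sqlite3` purely as a stand-in so it runs without a server; the `quotas` and `documents` table layouts and the `find_quota_drift` name are assumptions, and the same SQL shape should carry over to PostgreSQL.

```python
import sqlite3


def find_quota_drift(conn: sqlite3.Connection, tolerance: float = 0.01) -> list[tuple[int, int, int]]:
    """Return (user_id, cached_used_bytes, actual_bytes) for every user whose cached
    counter deviates from SUM(file_size) of owned documents by more than `tolerance`."""
    rows = conn.execute(
        """
        SELECT q.user_id, q.used_bytes, COALESCE(SUM(d.file_size), 0) AS actual
        FROM quotas q
        LEFT JOIN documents d ON d.owner_id = q.user_id
        GROUP BY q.user_id, q.used_bytes
        """
    ).fetchall()
    drifted = []
    for user_id, cached, actual in rows:
        baseline = max(actual, 1)  # avoid division by zero for empty accounts
        if abs(cached - actual) / baseline > tolerance:
            drifted.append((user_id, cached, actual))
    return drifted
```

Shares never appear in the query, which matches the rule that quota is owner-only.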
**Warning signs:**
- `used_bytes` stored as a counter updated by application code, never reconciled against a database SUM
- Share revocation handler doesn't touch quota (correct if quota is owner-only, but must be verified)
- Document deletion does not atomically update quota
**Phase to address:** Quotas and sharing phases
---
## Technical Debt Patterns
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Single global encryption key for cloud credentials | Simple env var config | Single point of failure; one leak decrypts all users | Never for a SaaS product |
| localStorage JWT storage | Easy Axios integration | XSS → full account takeover, TOTP bypass | Never; httpOnly cookies have identical DX once set up |
| Synchronous DB driver (psycopg2) with FastAPI | Familiar, simple | Event loop blocking under any load | Never for new greenfield code; acceptable only during an explicitly time-boxed spike |
| Presigned URL in list response | One fewer endpoint | Stale URLs, user-facing failures after expiry | Only for short-lived signed URLs (<5 min) with TTL tracking in frontend |
| Skip per-user key derivation | Fewer moving parts | Catastrophic blast radius on key leak | Never for cloud credential encryption |
| Inline quota check (read-check-write) without atomic update | Simple code | Silent quota bypass under concurrent load | Never for enforced limits |
| Soft-coding 8-char UUID prefix as document IDs (existing) | Shorter IDs | Collision risk at scale; insecure for auth tokens | Replace with full UUID before multi-user goes live |
---
## Integration Gotchas
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Google Drive OAuth | Storing `access_token` only; ignoring `refresh_token` | Store both; refresh_token is issued only on first authorization — store immediately or it's lost |
| OneDrive / MSAL | Treating MSAL token cache as permanent | MSAL token cache can be invalidated by Microsoft; always handle `invalid_grant` and re-auth prompt |
| Nextcloud | Using Basic Auth with plain credentials | Use App Passwords (Nextcloud-specific token), never store the user's Nextcloud master password |
| MinIO | Using `minio.get_object()` for large files without streaming | Stream the response: `response.stream(32768)` or use boto3 streaming; loading full file to memory crashes on large PDFs |
| MinIO | Treating bucket names as security boundaries | Bucket names are not credentials; isolation is enforced by object key prefix + IAM policy. Set MinIO IAM so the app's service account can only access the designated bucket |
| PostgreSQL | Using `autocommit=True` for all connections | Quota updates and document record creation must be transactional; autocommit makes rollback impossible |
| PyOTP (TOTP) | Using default `valid_window=0` then switching to `valid_window=2` for "user convenience" | Window=1 (±30s) tolerates clock skew. Window=2+ increases brute-force surface disproportionately |
---
## Performance Traps
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| O(N) metadata scan on every document list (existing pattern) | List endpoint slows linearly with document count | Paginated SQL query with index on `(user_id, created_at)` | ~500 documents per user |
| Topic count scan on every topic fetch (existing pattern) | `GET /topics` latency spikes as documents grow | Materialized counter column updated on insert/delete, or cached with 30s TTL | ~1,000 documents |
| Synchronous MinIO upload inline in HTTP handler | Requests time out on large files; one upload blocks all other requests | Background task queue; return 202 Accepted with job ID | First user uploading a file >10 MB |
| Eager loading all cloud connections on user login | Login latency grows with number of connected providers | Lazy-load cloud connection status; only check on storage page | >3 cloud providers per user |
| Full-text search via `ILIKE '%term%'` on document content | Full table scan on every search | PostgreSQL `tsvector` full-text index on extracted text column | ~10,000 documents platform-wide |
| JWT validation on every request without caching public key | Repeated public key fetch (if using asymmetric JWT) | Cache the signing key in memory; never refetch on every request | High request volume |
---
## Security Mistakes
| Mistake | Risk | Prevention |
|---------|------|------------|
| CORS `allow_origins=["*"]` retained from existing codebase | Cross-origin requests from attacker-controlled pages | Set exact origin (frontend URL) in production; never wildcard with credentials |
| File type validation defined but not enforced (existing) | Executable or malicious file uploads | Enforce MIME type check at upload boundary; also validate magic bytes (not just Content-Type header) |
| No file size limit at HTTP boundary (existing) | Memory exhaustion, DoS via large upload | Add `max_upload_bytes` limit in FastAPI middleware before reading body |
| API keys in plaintext JSON (existing) | Credential leakage via Docker volume mount | Env vars for all secrets; never serialize keys to disk in application data directory |
| Password reset without invalidating existing sessions | Attacker keeps session after victim resets password | Invalidate all sessions for user on successful password reset |
| Audit log includes document metadata (filenames) | Filename can reveal document content | Audit log stores only document ID, event type, timestamp, IP — no filenames |
| Admin can read `cloud_connections.encrypted_token` column | Ciphertext exposure; retroactive decryption if key leaks | Exclude credential columns from all admin serializers by policy |
| No breach check on registration password | Users reuse passwords from breached services | Integrate HIBP k-anonymity API on registration |
---
## UX Pitfalls
| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| Silent presigned URL expiry (image fails to load, download does nothing) | User thinks the app is broken; no recovery path | On-demand URL generation; show explicit "link expired, regenerating..." state |
| TOTP enrollment without backup codes | User loses phone → permanently locked out of account | Issue 8-10 single-use backup codes at TOTP enrollment; require user to acknowledge |
| Quota error shown only as "Upload failed" | User doesn't know why; may try repeatedly | Return structured quota error with byte counts, e.g. `{"error": "quota_exceeded", "used_bytes": 102760448, "limit_bytes": 104857600}` (98 MB of 100 MB) |
| Cloud storage reconnect required, but no in-app prompt | User's uploads silently fail until they notice the settings page | Banner notification or upload-time prompt: "Your Google Drive connection needs re-authorization" |
| Folder delete with documents inside silently deletes documents | Data loss without confirmation | Require explicit confirmation listing document count; offer "move to root" alternative |
| Share revoke with no notification to recipient | Recipient sees documents disappear without explanation | Audit-trail entry visible to recipient: "Access to [folder] was revoked by [owner]" |
---
## "Looks Done But Isn't" Checklist
- [ ] **TOTP enrollment:** Backup codes issued, stored hashed, and acknowledged by user before TOTP is marked active
- [ ] **Password reset:** Does NOT create a full authenticated session — routes through login with TOTP check
- [ ] **Document delete:** Decrements quota AND removes the DB row in a single DB transaction, then removes the MinIO object as best-effort cleanup (MinIO cannot take part in the DB transaction)
- [ ] **Account delete:** Deletes from MinIO, PostgreSQL, AND calls `adapter.delete_user_files()` for each cloud connection
- [ ] **Cloud disconnect:** Revokes OAuth token at provider (not just deletes local record) — provider refresh tokens remain valid until explicitly revoked
- [ ] **Quota display:** Shows real-time usage, not cached value from registration — must query SUM from DB
- [ ] **Sharing:** Shared documents do NOT copy to recipient's quota — verify with a `SELECT SUM(file_size) FROM documents WHERE owner_id = :recipient_id` before and after share
- [ ] **Migration:** Document count in PostgreSQL equals document count in old JSON store before cutover flag is flipped
- [ ] **Presigned URLs:** Never in list responses — verified by API contract test asserting no `url` field in list endpoint response body
- [ ] **Admin isolation:** Admin token cannot retrieve document content — verified by dedicated negative-access test
---
## Recovery Strategies
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| JWT in localStorage discovered post-launch | HIGH | Rotate all tokens (force logout all users), redeploy with httpOnly cookies, communicate to users |
| Quota race condition caused 10% of users to exceed quota | MEDIUM | Run `UPDATE quotas q SET used_bytes = (SELECT COALESCE(SUM(d.file_size), 0) FROM documents d WHERE d.owner_id = q.user_id)` reconciliation; enforce atomic updates going forward |
| Flat-file migration data loss (documents missing post-cutover) | HIGH | Restore from pre-migration backup of JSON data directory; re-run migration with dual-write; reconcile counts |
| Encryption key leaked (single-key model) | CRITICAL | Immediately rotate key; re-encrypt all credentials with new key; notify affected users; assume all cloud credentials compromised — prompt all users to revoke and reconnect |
| Admin escalation (admin accessed user documents) | HIGH | Audit log review to determine scope; GDPR breach notification if EU users affected (72h window); patch authorization layer; force session invalidation |
| Cloud connection revoked but app retried indefinitely | LOW | Mark all stuck connections as `REQUIRES_REAUTH`; send user notification; add `invalid_grant` handler to adapter |
---
## Pitfall-to-Phase Mapping
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| JWT in localStorage | Phase 1 — Auth | Confirm no `localStorage.setItem('token', ...)` in frontend; auth cookie is httpOnly in browser DevTools |
| TOTP bypass via password reset | Phase 1 — Auth | Integration test: reset password → assert no full session created → assert login still requires TOTP |
| Refresh token reuse detection | Phase 1 — Auth | Test: use a rotated (old) refresh token → assert 401 and full family revocation |
| TOTP timing attack / brute force | Phase 1 — Auth | Rate limit test: 6th TOTP attempt within window → assert 429 |
| Path traversal in MinIO keys | Phase 2 — DB + MinIO migration | Unit test: filename with `../` → assert stored key is UUID-based, not user-provided |
| Quota race condition | Phase 2 — DB + MinIO migration | Concurrent upload test: 2 threads uploading to full quota → assert only one succeeds |
| Admin privilege escalation | Phase 1 + every phase adding resource endpoints | Negative access test: admin JWT → `GET /documents/{other_user_doc_id}` → assert 403 |
| Cloud credential leakage via admin query | Phase 3 — Cloud storage | Test: admin list user cloud connections → assert response contains no `token` or `encrypted_` fields |
| Flat-file to PostgreSQL data loss | Phase 2 — DB + MinIO migration | Count reconciliation: `assert json_doc_count == db_doc_count` in migration script |
| Blocking I/O in async handlers | Phase 2 — DB + MinIO migration | Load test with 10 concurrent requests; assert no event loop warnings; check SQLAlchemy driver is asyncpg |
| N+1 queries on document list | Phase 2 — DB + MinIO migration | Enable SQLAlchemy echo; assert document list query count = 1 regardless of result size |
| Presigned URL expiry | Phase 2 — DB + MinIO migration | API contract test: list endpoint response contains no presigned URL fields |
| OAuth revocation not handled | Phase 3 — Cloud storage | Mock `invalid_grant` from OAuth provider → assert connection status set to `REQUIRES_REAUTH`, not retried |
| Rate limits without backoff | Phase 3 — Cloud storage | Mock 429 from provider → assert exponential backoff, not immediate 500 to user |
| GDPR right to erasure incomplete | Phase 3 — Cloud storage + account mgmt | Account deletion test: assert cloud adapter `delete_user_files` called for each active connection |
| Single encryption key | Phase 3 — Cloud storage | Code review gate: assert per-user HKDF derivation; no single-key decryption in any code path |
| Quota drift from sharing/revoke | Sharing phase | Reconciliation query: `SUM(file_size WHERE owner_id = user)` vs `used_bytes` — assert < 1% drift |
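
Two of the Phase 3 rows above (rate limits with backoff, and `invalid_grant` not being retried) can be illustrated together. In this sketch `RateLimited` and `ReauthRequired` are hypothetical exceptions that an adapter would raise when it sees a 429 or an `invalid_grant` response; the `sleep` parameter is injectable only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical: raised by an adapter when the provider answers 429."""


class ReauthRequired(Exception):
    """Hypothetical: raised on `invalid_grant`; never retried."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on RateLimited with exponential backoff plus jitter.
    Any other exception, including ReauthRequired, propagates immediately."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # retries exhausted, surface the rate limit to the caller
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise ValueError("max_attempts must be at least 1")
```

The caller would catch `ReauthRequired` once and mark the connection `REQUIRES_REAUTH` instead of looping.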
---
## Sources
- Project context: `/Users/nik/Documents/Progamming/document_scanner/.planning/PROJECT.md`
- Codebase audit: `/Users/nik/Documents/Progamming/document_scanner/.planning/codebase/CONCERNS.md`
- Auth pitfalls: OWASP JWT Security Cheat Sheet, OWASP Authentication Cheat Sheet (well-established, HIGH confidence)
- OAuth token lifecycle: RFC 6749 Section 10.4 (refresh token security considerations), Google OAuth error codes documentation (HIGH confidence from training data)
- MinIO/S3 path traversal: AWS S3 object key documentation; known class of vulnerability in multi-tenant S3-prefix isolation (HIGH confidence)
- Quota race conditions: Classic check-then-act concurrency pattern; PostgreSQL SELECT FOR UPDATE documentation (HIGH confidence)
- GDPR Article 17 (right to erasure) and Article 30 (records of processing): well-established regulatory requirements (HIGH confidence)
- N+1 query pattern: SQLAlchemy documentation on relationship loading strategies (HIGH confidence)
- TOTP RFC 6238; PyOTP library behavior from training data (MEDIUM — verify PyOTP `valid_window` defaults against current docs before implementation)
---
*Pitfalls research for: DocuVault — multi-user SaaS document management (FastAPI + Vue 3 + PostgreSQL + MinIO)*
*Researched: 2026-05-21*
+553
@@ -0,0 +1,553 @@
# Stack Research — DocuVault: Multi-User Auth, Storage & Cloud Integrations
**Domain:** SaaS document management — adding multi-user auth, PostgreSQL, MinIO, cloud storage integrations to existing FastAPI + Vue 3 app
**Researched:** 2026-05-21
**Overall Confidence:** MEDIUM-HIGH (most core library choices verified against official FastAPI docs and release notes; cloud SDK versions partially from training data, flagged where unverified)
---
## Existing Stack (Do Not Replace)
| Component | Current | Notes |
|-----------|---------|-------|
| Backend framework | FastAPI 0.136.1 | Latest confirmed from official release notes |
| Frontend framework | Vue 3 | Keep as-is |
| Runtime | Python 3.11+ | FastAPI supports 3.14t as of 0.136.0 |
| Deployment | Docker Compose | Remains primary target |
| ASGI server | Uvicorn (via `fastapi run`) | Starlette 1.0.0 now bundled |
---
## Area 1: Authentication
### JWT — PyJWT 2.12.1
**Confidence: HIGH** (verified from FastAPI release notes: `pyjwt` bumped to `2.12.1` in FastAPI 0.136.1; FastAPI tutorial now uses `import jwt` not `python-jose`)
```
pip install "pyjwt[crypto]>=2.12.1"
```
Use `pyjwt[crypto]` to enable RS256/ES256 if asymmetric keys are ever needed. For this project HS256 with a strong secret is sufficient (single-issuer, stateless).
**Do not use `python-jose`** — the FastAPI tutorial no longer references it, it has had unmaintained periods, and the official docs have migrated entirely to PyJWT.
### Password Hashing — pwdlib 0.2.x with Argon2
**Confidence: HIGH** (verified from current FastAPI security tutorial — `pwdlib[argon2]` is the documented recommendation, replacing the old `passlib[bcrypt]` guidance)
```
pip install "pwdlib[argon2]>=0.2.0"
```
**Why Argon2 over bcrypt:** Argon2 won the Password Hashing Competition (2015); its Argon2id variant is memory-hard (resistant to GPU/ASIC attacks) and is the first-choice algorithm in the OWASP Password Storage Cheat Sheet. `pwdlib` is a thin, modern wrapper; it does not carry `passlib`'s legacy baggage.
**Exception:** If the existing codebase already stores any bcrypt hashes, keep `passlib[bcrypt]` for the migration phase to verify and re-hash on login, then remove it.
### TOTP 2FA — pyotp 2.9.x
**Confidence: MEDIUM** (standard library for RFC 6238 TOTP in Python; no competing library of comparable adoption exists; version from training data — verify on PyPI before pinning)
```
pip install "pyotp>=2.9.0"
```
`pyotp` implements RFC 6238 TOTP and RFC 4226 HOTP. `pyotp.totp.TOTP.provisioning_uri()` builds the `otpauth://` provisioning URI, which is compatible with Google Authenticator, Authy, and any standard TOTP app. Pair it with `qrcode[pil]` or `segno` to render that URI as a QR code PNG for the setup screen.
For TOTP enrollment flow:
1. Generate secret: `pyotp.random_base32()`
2. Store secret encrypted at rest (Fernet — see credential encryption below)
3. Return provisioning URI + QR code to user
4. Verify one TOTP code before marking 2FA active
5. On login: verify password first, then verify TOTP code with a 1-period window (`valid_window=1`)
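
To make the effect of `valid_window=1` in step 5 concrete, here is a standard-library sketch of RFC 6238 verification. It is not a replacement for `pyotp`: the function names are hypothetical, the secret is passed as raw bytes (`pyotp` takes base32 strings), and only HMAC-SHA1 with 30-second steps is covered.

```python
import hashlib
import hmac
import struct


def totp_at(secret: bytes, unix_time: int, digits: int = 6, step: int = 30) -> str:
    """Compute the RFC 6238 TOTP (HMAC-SHA1) for the time step containing `unix_time`."""
    counter = unix_time // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation from RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10**digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, unix_time: int,
                valid_window: int = 1, step: int = 30) -> bool:
    """Accept codes from the current step and `valid_window` steps on either side."""
    for drift in range(-valid_window, valid_window + 1):
        candidate_time = unix_time + drift * step
        if candidate_time < 0:
            continue
        expected = totp_at(secret, candidate_time, digits=6, step=step)
        # Constant-time comparison; encode so non-ASCII input cannot raise TypeError.
        if hmac.compare_digest(expected.encode(), submitted.encode()):
            return True
    return False
```

With `valid_window=1` a code remains acceptable for one step before and after its own, which is the ±30s clock-skew tolerance; each extra step widens the set of accepted codes.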
### Session / Token Strategy
**Confidence: HIGH** (pattern; no external library needed beyond PyJWT)
Use a **dual-token pattern** for stateless horizontal scaling:
- **Access token**: Short-lived JWT (15 min), HS256, payload includes `user_id`, `email`, `roles`, `jti`
- **Refresh token**: Long-lived JWT (7-30 days), stored as `httpOnly` + `Secure` cookie, rotated on use
- **Revocation**: Store the `jti` of revoked refresh tokens in a PostgreSQL `token_blacklist` table with an `expires_at` column (PostgreSQL has no built-in row TTL). Clean up expired entries via a periodic task.
No additional session library is needed. Do not use Redis for token storage — the PROJECT.md requires stateless backends; a PostgreSQL blacklist table is sufficient for this scale and avoids another infrastructure dependency.
FastAPI's `fastapi.security.OAuth2PasswordBearer` handles the Bearer extraction from headers. Implement `get_current_user` as a dependency.
### Credential Encryption (Cloud OAuth Tokens, TOTP Secrets) — cryptography 44.x Fernet
**Confidence: HIGH** (cryptography is a stable, core Python library; Fernet is its symmetric authenticated encryption primitive)
```
pip install "cryptography>=44.0.0"
```
`cryptography.fernet.Fernet` provides AES-128-CBC + HMAC-SHA256 in a single call. The key lives in an env var (`CREDENTIAL_ENCRYPTION_KEY`), never in the database. Encrypt per-user cloud OAuth tokens and TOTP secrets before writing to PostgreSQL. This satisfies the PROJECT.md privacy constraint: admin queries never see plaintext credentials.
**Key management pattern:** Generate one `Fernet.generate_key()` at deploy time, store it in the `CREDENTIAL_ENCRYPTION_KEY` env var, and inject it via Docker Compose secrets. Treat it as a master key: the pitfalls research (Pitfall 16) recommends deriving a per-user key from it with HKDF rather than encrypting every user's credentials with the master key directly. Do not store the key in the database or expose it through any admin endpoint.
---
## Area 2: Database
### ORM — SQLAlchemy 2.0 (async) + psycopg (v3)
**Confidence: HIGH for SQLAlchemy 2.0 async; MEDIUM for psycopg v3 vs asyncpg** (SQLAlchemy 2.0 async confirmed stable; driver choice between asyncpg and psycopg 3 is functionally equivalent — see note below)
```
pip install "sqlalchemy[asyncio]>=2.0.36" "psycopg[binary]>=3.2.0"
```
**Why SQLAlchemy 2.0 over SQLModel for this project:**
SQLModel 0.0.38 (current version per FastAPI release notes) is the official recommendation for greenfield apps, but for this brownfield migration it introduces risk:
1. SQLModel does not yet have first-class async session documentation. Its `AsyncSession` support works but is inherited from SQLAlchemy and not well-documented in SQLModel's own tutorials.
2. The existing codebase already has Pydantic models for all API schemas. Adding SQLModel means maintaining a second model hierarchy (table models vs response models) which increases complexity mid-migration.
3. SQLAlchemy 2.0 `AsyncSession` with `asyncpg` or `psycopg` (v3, async support built in) is battle-tested and the pattern used by the FastAPI full-stack template.
4. Alembic (see below) integrates directly with SQLAlchemy — the migration toolchain is native.
**Recommended pattern:**
```python
from collections.abc import AsyncGenerator

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine

engine = create_async_engine(
    "postgresql+psycopg://user:pass@db:5432/docuvault",
    pool_pre_ping=True,  # validate pooled connections before handing them out
    pool_size=10,
    max_overflow=20,
)
AsyncSessionLocal = async_sessionmaker(engine, expire_on_commit=False)


async def get_db() -> AsyncGenerator[AsyncSession, None]:
    """FastAPI dependency: one session per request, closed when the request ends."""
    async with AsyncSessionLocal() as session:
        yield session
```
**asyncpg vs psycopg 3:** Both work with SQLAlchemy 2.0 async. Prefer `psycopg[binary]` for this project (psycopg 3 ships its async API in the base package; its extras are `binary`, `c`, and `pool`) because:
- psycopg 3 is the PostgreSQL-sanctioned successor to psycopg2, meaning the same package covers both sync (Alembic) and async (FastAPI) paths
- asyncpg is async-only and requires a separate sync driver for Alembic migrations
- psycopg 3 binary wheel has comparable performance to asyncpg in benchmarks
**Conflict note:** psycopg2 is incompatible with psycopg 3 (different import names: `psycopg2` vs `psycopg`). If any existing dependency pins `psycopg2`, update it. Do not install both.
### Migrations — Alembic 1.14.x
**Confidence: HIGH** (Alembic is the standard migration tool for SQLAlchemy and is maintained by the same project; no comparably integrated alternative)
```
pip install "alembic>=1.14.0"
```
**Async migration pattern** — Alembic's `env.py` needs special handling for async engines. Use the `run_sync` pattern:
```python
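# alembic/env.py
import asyncio
from alembic import context
from sqlalchemy.ext.asyncio import create_async_engine

def do_run_migrations(connection):
    # target_metadata and settings are defined earlier in env.py
    context.configure(connection=connection, target_metadata=target_metadata)
    with context.begin_transaction():
        context.run_migrations()

def run_migrations_online():
    connectable = create_async_engine(settings.DATABASE_URL)
    async def run():
        async with connectable.connect() as connection:
            await connection.run_sync(do_run_migrations)
    asyncio.run(run())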
```
**Migration strategy for brownfield migration:**
1. Create initial migration that builds schema from scratch (new install path)
2. Create a separate data migration script that reads flat-file JSON and inserts rows
3. Run both in sequence during the deploy that replaces the existing data
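
Step 2 can be sketched as a script that loads the JSON records, inserts them in one transaction, and refuses to commit unless the row count matches. Everything specific here is an assumption: the one-file-per-document layout, the field names (`id`, `owner_id`, `filename`, `file_size`), an initially empty `documents` table, and `sqlite3` standing in for PostgreSQL so the example runs on its own.

```python
import json
import sqlite3
from pathlib import Path


def migrate_json_documents(json_dir: Path, conn: sqlite3.Connection) -> int:
    """Insert one row per *.json metadata file; commit only if counts reconcile."""
    records = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(json_dir.glob("*.json"))
    ]
    try:
        conn.executemany(
            "INSERT INTO documents (id, owner_id, filename, file_size) VALUES (?, ?, ?, ?)",
            [(r["id"], r["owner_id"], r["filename"], r["file_size"]) for r in records],
        )
        (db_count,) = conn.execute("SELECT COUNT(*) FROM documents").fetchone()
        if db_count != len(records):  # assumes the table was empty before the run
            raise RuntimeError(f"count mismatch: {len(records)} JSON vs {db_count} DB rows")
        conn.commit()
    except Exception:
        conn.rollback()  # leave the target untouched so the migration can be re-run
        raise
    return len(records)
```

Because the JSON files are only read, a failed run leaves the old store intact as the rollback path.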
---
## Area 3: Object Storage (MinIO)
### MinIO Python SDK 7.x
**Confidence: MEDIUM** (MinIO SDK is well-known; exact version from training data — verify on PyPI before pinning)
```
pip install "minio>=7.2.0"
```
The MinIO Python SDK (`minio`) wraps the S3 API. It is synchronous, so calls that perform network I/O should run via `asyncio.to_thread()` inside async FastAPI handlers. Reserve it for short metadata operations and URL generation rather than bulk transfers.
**Important:** Do NOT use the MinIO SDK for high-throughput streaming (uploads/downloads of large documents). Instead, use **pre-signed URLs**:
```python
from minio import Minio
from datetime import timedelta

client = Minio(
    "minio:9000",
    access_key=settings.MINIO_ACCESS_KEY,
    secret_key=settings.MINIO_SECRET_KEY,
    secure=False,  # True in production with TLS
)

# Generate upload URL (client uploads directly to MinIO, bypassing FastAPI)
url = client.presigned_put_object(
    bucket_name="user-documents",
    object_name=f"{user_id}/{document_id}",
    expires=timedelta(minutes=15),
)

# Generate download URL
url = client.presigned_get_object(
    bucket_name="user-documents",
    object_name=f"{user_id}/{document_id}",
    expires=timedelta(minutes=60),
)
```
Pre-signed URLs mean FastAPI never proxies document bytes — only metadata flows through the backend. This is critical for horizontal scaling (no file pinning to a specific backend instance) and for quota enforcement (track bytes at upload-record creation time, not at streaming time).
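
Since the object name above is built from `user_id` and `document_id`, it helps to guarantee that both are real UUIDs before they reach MinIO, so no user-supplied filename or `../` segment can end up in a key. A small sketch, with `build_object_key` as a hypothetical helper:

```python
import uuid


def build_object_key(user_id: str, document_id: str) -> str:
    """Build `<user-uuid>/<document-uuid>`; raises ValueError for non-UUID input."""
    # uuid.UUID() rejects anything that is not a UUID (for example "../other-user")
    # and str() normalizes the result to the canonical lowercase hyphenated form.
    return f"{uuid.UUID(user_id)}/{uuid.UUID(document_id)}"
```

The original filename would then live only in the PostgreSQL metadata row, never in the storage key.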
**Quota enforcement pattern:**
1. Client requests an upload token from FastAPI
2. FastAPI checks that `user.quota_used_bytes` plus the requested upload size does not exceed `user.quota_limit_bytes`
3. If within quota, record tentative size, issue pre-signed PUT URL
4. After successful upload, confirm actual size (via MinIO event or HEAD request) and commit to quota
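
Steps 2 and 3 are safest when the check and the reservation happen in one SQL statement, so two concurrent uploads cannot both pass a separate read-then-write check. The sketch below uses `sqlite3` as a runnable stand-in for PostgreSQL; the `users` table, its `id` column, and the `reserve_quota` name are assumptions, while the two quota column names follow the list above.

```python
import sqlite3


def reserve_quota(conn: sqlite3.Connection, user_id: int, size_bytes: int) -> bool:
    """Atomically add `size_bytes` to the user's usage if it fits within the limit.
    Returns True when the reservation was recorded, False when quota would be exceeded."""
    cursor = conn.execute(
        """
        UPDATE users
        SET quota_used_bytes = quota_used_bytes + :size
        WHERE id = :user_id
          AND quota_used_bytes + :size <= quota_limit_bytes
        """,
        {"size": size_bytes, "user_id": user_id},
    )
    conn.commit()
    return cursor.rowcount == 1  # zero rows updated means the condition failed
```

Only when this returns `True` would the handler go on to issue the pre-signed PUT URL; step 4 can later adjust the reserved amount to the actual object size.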
**boto3 alternative:** `boto3` works against MinIO via `endpoint_url` override. Only use it if you anticipate migrating to AWS S3 — for a MinIO-only deployment the native SDK is simpler and avoids the large boto3 dependency tree.
### aiobotocore / aiominio — Do Not Use
The async MinIO/S3 client libraries (`aiobotocore`, `aiominio`) add significant complexity with uncertain maintenance status. The pre-signed URL pattern renders them unnecessary — the sync SDK is only called in the FastAPI path for URL generation (microseconds), not for streaming.
---
## Area 4: Cloud Storage SDKs
### OneDrive — msgraph-sdk 1.x + azure-identity 1.x
**Confidence: MEDIUM** (Microsoft Graph Python SDK is GA per official Microsoft docs; exact version from training data — verify on PyPI)
```
pip install "msgraph-sdk>=1.0.0" "azure-identity>=1.19.0"
```
Microsoft Graph Python SDK (`msgraph-sdk`) is the official Microsoft library for OneDrive access. It covers:
- Drive item CRUD (`/me/drive/items/{id}`)
- Upload sessions for large files
- Delta sync for listing changes
For server-side (backend-behalf-of-user) flows use the **OAuth 2.0 Authorization Code** flow with `azure-identity`'s `OnBehalfOfCredential` or a custom token provider wrapping stored refresh tokens.
**Important:** Microsoft's OneDrive tokens (access + refresh) must be stored encrypted at rest using the Fernet approach described in Area 1. Refresh tokens are long-lived and grant significant access.
**Package note:** The older `O365` package and `office365-REST-python-client` both wrap Graph API but are community-maintained. Prefer the official `msgraph-sdk` which Microsoft now actively develops and tests against Graph v1.0.
### Google Drive — google-api-python-client 2.x + google-auth-oauthlib 1.x
**Confidence: MEDIUM** (package names confirmed from Google Cloud docs; exact minor versions from training data)
```
pip install "google-api-python-client>=2.150.0" "google-auth-oauthlib>=1.2.0" "google-auth-httplib2>=0.2.0"
```
Use the Drive API v3 (not v2 — v2 is deprecated). For server-side OAuth flows:
- Use `google_auth_oauthlib.flow.Flow` for the authorization redirect
- Store OAuth2 credentials (`Credentials` object JSON) encrypted in PostgreSQL
- Rebuild credentials from stored JSON on each API call: `google.oauth2.credentials.Credentials.from_authorized_user_info(json_data, scopes)`
Required scopes for this project: `https://www.googleapis.com/auth/drive.file` (access only files created by the app — minimum privilege).
### Nextcloud — webdav4 0.x
**Confidence: MEDIUM** (webdav4 is the most actively maintained Python WebDAV client as of 2024; version from training data)
```
pip install "webdav4[fsspec]>=0.9.8"
```
Nextcloud exposes two APIs: WebDAV (for file operations) and OCS (for sharing, users, and metadata). For document upload/download, WebDAV is sufficient. `webdav4` wraps the WebDAV protocol with a clean interface and optional `fsspec` integration.
**Nextcloud-specific paths:**
- WebDAV root: `https://{host}/remote.php/dav/files/{username}/`
- Authentication: Basic auth (username + app password) or Bearer token
For Nextcloud, recommend storing an **app password** (user-generated in Nextcloud settings) rather than OAuth tokens — it's simpler to implement and doesn't require an OAuth app registration.
**webdavclient3 alternative:** An older library with less active maintenance. `webdav4` is preferred.
### Generic WebDAV — webdav4 (same package)
`webdav4` handles generic RFC 4918 WebDAV, so any WebDAV-compatible server (ownCloud, Seafile WebDAV bridge, etc.) is covered by the same adapter.
---
## Area 5: Storage Abstraction
### Pattern — Protocol-based Adapter (no third-party library needed)
**Confidence: HIGH** (this is the architecture mandated by PROJECT.md and mirrors the existing AI provider pattern)
Define a `StorageBackend` Protocol that all adapters implement:
```python
from typing import Protocol, AsyncIterator


class StorageBackend(Protocol):
    async def put_object(
        self,
        path: str,
        data: AsyncIterator[bytes],
        size: int,
        content_type: str,
    ) -> None: ...

    async def get_object(self, path: str) -> AsyncIterator[bytes]: ...

    async def delete_object(self, path: str) -> None: ...

    async def list_objects(self, prefix: str) -> list[str]: ...

    async def get_presigned_url(self, path: str, expires_seconds: int) -> str | None: ...
```
Concrete implementations:
- `MinIOBackend` — uses the MinIO SDK + pre-signed URLs
- `OneDriveBackend` — uses `msgraph-sdk`
- `GoogleDriveBackend` — uses `google-api-python-client`
- `NextcloudBackend` — uses `webdav4`
The `get_presigned_url` method returns `None` for backends that don't support it (Google Drive, Nextcloud). FastAPI then falls back to proxying the stream through the backend for those cases.
**No FSSpec dependency at the protocol layer** — FSSpec (`fsspec`) can be used internally by `webdav4` but should not leak into the storage abstraction interface. The interface must be async-native.
**Per-user backend resolution:** Store `user.storage_backend_type` (enum: `minio`, `onedrive`, `gdrive`, `nextcloud`) and `user.storage_backend_credential_id` (FK to encrypted credentials table) in PostgreSQL. A `StorageBackendFactory` resolves the correct adapter on each request.
---
## Area 6: Vue 3 Auth Patterns
### State Management — Pinia 2.x
**Confidence: HIGH** (Pinia is the official Vue 3 state management library per vuejs.org; Vuex is in maintenance mode and no longer recommended for new Vue 3 projects)
```
npm install pinia@^2.0.0
```
Store auth state in a Pinia store:
```typescript
// stores/auth.ts
import { defineStore } from 'pinia'

export const useAuthStore = defineStore('auth', {
  state: () => ({
    accessToken: null as string | null,
    user: null as User | null,
  }),
  getters: {
    isAuthenticated: (state) => !!state.accessToken,
  },
  actions: {
    setTokens(accessToken: string) {
      this.accessToken = accessToken
      // Refresh token is an httpOnly cookie; it is not stored in JS
    },
    logout() {
      this.accessToken = null
      this.user = null
    },
  },
})
```
### Token Storage Strategy
**Confidence: HIGH** (security best practice, not library-specific)
- **Access token:** Store in Pinia memory state only (not `localStorage`, not `sessionStorage`). Survives tab navigation but is cleared on page refresh — intentional for security.
- **Refresh token:** Store as `httpOnly; Secure; SameSite=Strict` cookie set by FastAPI. Never readable by JavaScript. Refresh is done by hitting a `/auth/refresh` endpoint which reads the cookie server-side.
- **Do not use `localStorage` for tokens** — XSS vulnerability. In a document management app users upload arbitrary files; stored XSS risk is not theoretical.
On page load/refresh, immediately call `/auth/refresh` (the browser sends the httpOnly refresh cookie automatically). If it returns 200, restore the access token from the response. If it returns 401, redirect to login.
### Protected Routes — Vue Router 4.x Navigation Guards
**Confidence: HIGH** (Vue Router 4 is the Vue 3 router; this is a standard pattern)
```
npm install vue-router@^4.0.0
```
```typescript
// router/index.ts
router.beforeEach(async (to) => {
  const auth = useAuthStore()
  if (to.meta.requiresAuth && !auth.isAuthenticated) {
    // Attempt silent refresh before redirecting
    try {
      await auth.silentRefresh() // hits /auth/refresh endpoint
    } catch {
      return { name: 'login', query: { redirect: to.fullPath } }
    }
  }
})
```
Mark routes with `meta: { requiresAuth: true }`. The guard attempts a silent refresh before redirecting — this handles the page-refresh case where the access token is gone but the refresh cookie is still valid.
### Refresh Token Handling — Axios Interceptors
**Confidence: HIGH** (standard pattern for token refresh in SPA + REST API; Axios is already common in Vue 3 projects)
```
npm install axios@^1.0.0
```
```typescript
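// api/client.ts
axiosInstance.interceptors.response.use(
  (response) => response,
  async (error) => {
    // Skip the refresh call itself: if silentRefresh() uses this same instance,
    // a failed refresh would otherwise trigger another refresh and recurse.
    const isRefreshCall = error.config?.url?.includes('/auth/refresh')
    if (error.response?.status === 401 && !error.config._retry && !isRefreshCall) {
      error.config._retry = true
      await authStore.silentRefresh()
      error.config.headers['Authorization'] = `Bearer ${authStore.accessToken}`
      return axiosInstance(error.config)
    }
    return Promise.reject(error)
  }
)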
```
### TOTP UI — No dedicated library needed
The TOTP enrollment flow only requires:
1. Display a QR code image (returned as base64 PNG from FastAPI, rendered via `<img :src="qrDataUrl">`)
2. An OTP input field (6-digit numeric input, `type="text" inputmode="numeric" maxlength="6"`)
No Vue TOTP component library is needed. Avoid heavy auth UI libraries (Auth0 components, etc.) — they assume SSO flows incompatible with this design.
---
## Full Dependency Summary
### Python (backend)
```
# requirements.txt additions for this milestone
# Auth
pyjwt[crypto]>=2.12.1
pwdlib[argon2]>=0.2.0
pyotp>=2.9.0
cryptography>=44.0.0
qrcode[pil]>=8.0.0 # TOTP QR code generation
# Database
sqlalchemy[asyncio]>=2.0.36
psycopg[asyncio,binary]>=3.2.0
alembic>=1.14.0
# Object storage
minio>=7.2.0
# Cloud storage
msgraph-sdk>=1.0.0
azure-identity>=1.19.0
google-api-python-client>=2.150.0
google-auth-oauthlib>=1.2.0
google-auth-httplib2>=0.2.0
webdav4>=0.9.8
```
### JavaScript (frontend)
```json
{
"dependencies": {
"pinia": "^2.0.0",
"vue-router": "^4.0.0",
"axios": "^1.0.0"
}
}
```
---
## Alternatives Considered
| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| JWT | PyJWT 2.12.1 | python-jose | FastAPI docs migrated away from it; python-jose has gone through unmaintained periods; PyJWT is actively maintained and is the library the FastAPI security tutorial now uses |
| Password hashing | pwdlib + Argon2 | passlib + bcrypt | passlib is in maintenance mode; bcrypt is weaker than Argon2 (not memory-hard); pwdlib is the current FastAPI recommendation |
| ORM | SQLAlchemy 2.0 async | SQLModel 0.0.38 | SQLModel is great for greenfield but brownfield migration risk is higher; async SQLModel docs are thin; direct SQLAlchemy gives full control |
| ORM | SQLAlchemy 2.0 async | Tortoise ORM 0.21.x | Tortoise has its own metaclass system that conflicts with Pydantic models; integration with FastAPI requires aerich for migrations (separate toolchain); less ecosystem momentum than SQLAlchemy |
| PostgreSQL driver | psycopg 3 | asyncpg | asyncpg is async-only (a separate sync driver is needed for synchronous Alembic migrations); psycopg 3 covers both paths and is the successor to psycopg2 |
| OneDrive | msgraph-sdk | O365 / office365-REST | Community-maintained; Graph API coverage is incomplete; Microsoft's own documentation points to the official SDK |
| S3 integration | minio native SDK | boto3 | boto3 pulls in botocore (large dep tree); minio SDK is purpose-built and simpler for MinIO-only use; boto3 makes sense only if AWS S3 migration is planned |
| Frontend state | Pinia | Vuex | Vuex is the legacy store from the Vue 2 era and is now in maintenance mode; the official Vue 3 recommendation is Pinia |
| Token storage | Memory (Pinia) | localStorage | localStorage is vulnerable to XSS; document management apps with file upload have non-trivial XSS surface |
---
## What NOT to Use
| Avoid | Why | Use Instead |
|-------|-----|-------------|
| `python-jose` | No longer referenced by FastAPI docs; has had maintenance gaps | `pyjwt[crypto]` |
| `passlib[bcrypt]` for new hashes | In maintenance mode; bcrypt is not memory-hard; weaker than Argon2 against modern GPU attacks | `pwdlib[argon2]` (keep passlib only for migrating existing bcrypt hashes) |
| `Tortoise ORM` | Incompatible metaclass system creates friction with Pydantic v2; aerich migration toolchain is less mature; smaller ecosystem | SQLAlchemy 2.0 async |
| `tiangolo/uvicorn-gunicorn-fastapi` Docker image | **Deprecated** by FastAPI author as of 2024. Official FastAPI docs now recommend building from `python:3.x` base directly | Plain `python:3.12-slim` base image |
| `databases` (encode/databases) | Was an early async DB wrapper; SQLAlchemy 2.0 async has superseded its use case; the project is effectively in maintenance mode | SQLAlchemy 2.0 `AsyncSession` |
| `localStorage` for auth tokens | XSS-accessible; a document management app is an attractive XSS target | httpOnly cookies for refresh tokens; Pinia memory for access tokens |
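| Independently generated and stored per-user Fernet keys | Overly complex key management (one stored secret per user to protect, rotate, and back up); a single platform-level master secret is sufficient to manage, and per-user keys can be derived from it (for example with HKDF) when a smaller blast radius is required. User data isolation is still enforced at the PostgreSQL row level | Single master key in an env var (`CREDENTIAL_ENCRYPTION_KEY`) |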
---
## Stack Compatibility Notes
| Concern | Detail |
|---------|--------|
| Pydantic v2 required | FastAPI 0.136.x requires `pydantic>=2.9.0`. SQLAlchemy 2.0 is Pydantic v2-compatible. The existing app must already be on Pydantic v2 to run FastAPI 0.136. |
| psycopg 3 vs psycopg 2 | There is no import-name conflict: psycopg 2 imports as `import psycopg2` and psycopg 3 as `import psycopg`, so they can technically coexist in the same environment. Still, if the existing codebase (or any dependency) imports `psycopg2`, avoid shipping both drivers. |
| Starlette 1.0.0 | Bumped in FastAPI 0.136.1 — this is a major version. If the existing app uses any Starlette internals directly (middleware, routing), audit for breaking changes before upgrading FastAPI. |
| PyJWT 2.x vs 1.x API | PyJWT 2.x changed `jwt.encode()` to return `str` (not `bytes`). If the existing codebase has any JWT code using the 1.x API, update the call sites. |
| Vue Router 4 + Pinia SSR | Not applicable (no SSR in this project), but worth noting: Pinia's state is per-request in SSR contexts. For this SPA deployment, no issues. |
| Argon2 system dependency | `pwdlib[argon2]` requires `argon2-cffi`, which needs either a C compiler or a prebuilt binary wheel. Wheels are published on PyPI for common platforms, so the official Python Docker image (`python:3.12-slim`) normally needs no `build-essential`. |
---
## Version Compatibility Matrix
| Package | Version | Python | Pydantic | FastAPI |
|---------|---------|--------|---------|--------|
| pyjwt | 2.12.1 | 3.8+ | any | 0.100+ |
| pwdlib | 0.2.x | 3.9+ | v2 | 0.100+ |
| sqlalchemy | 2.0.36+ | 3.8+ | v2 (via fastapi) | 0.100+ |
| psycopg (v3) | 3.2.x | 3.8+ | — | — |
| alembic | 1.14.x | 3.8+ | — | — |
| minio | 7.2.x | 3.7+ | — | — |
| msgraph-sdk | 1.x | 3.8+ | — | — |
| azure-identity | 1.19.x | 3.8+ | — | — |
| pinia | 2.x | — | — | — |
| vue-router | 4.x | — | — | — |
---
## Sources
- FastAPI official release notes (verified 2026-05-21): https://fastapi.tiangolo.com/release-notes/ — PyJWT 2.12.1, SQLModel 0.0.38, Starlette 1.0.0, pydantic>=2.9.0 confirmed
- FastAPI security tutorial (verified 2026-05-21): https://fastapi.tiangolo.com/tutorial/security/oauth2-jwt/ — PyJWT recommended, python-jose absent, pwdlib[argon2] recommended
- FastAPI SQL databases tutorial (verified 2026-05-21): https://fastapi.tiangolo.com/tutorial/sql-databases/ — SQLModel documented as recommended ORM
- FastAPI Docker guide (verified 2026-05-21): https://fastapi.tiangolo.com/deployment/docker/ — tiangolo/uvicorn-gunicorn-fastapi deprecated confirmed
- Microsoft Graph SDK overview (verified 2026-05-21): https://learn.microsoft.com/en-us/graph/sdks/sdks-overview — Python SDK confirmed GA
- pwdlib argon2 version: MEDIUM confidence — training data, verify on PyPI
- pyotp version: MEDIUM confidence — training data, verify on PyPI
- minio Python SDK version: MEDIUM confidence — training data, verify on PyPI
- webdav4 version: MEDIUM confidence — training data, verify on PyPI
- google-api-python-client version: MEDIUM confidence — training data, verify on PyPI
- azure-identity / msgraph-sdk minor versions: MEDIUM confidence — training data, verify on PyPI
---
*Stack research for: DocuVault multi-user auth, PostgreSQL, MinIO, cloud integrations*
*Researched: 2026-05-21*
@@ -0,0 +1,150 @@
# DocuVault — Research Synthesis
_Last updated: 2026-05-21_
## Executive Summary
DocuVault is a brownfield migration of a functional single-user document scanner into a privacy-first, multi-user SaaS platform. The existing system already handles document upload, text extraction, and AI-based topic classification via a well-designed provider abstraction. This milestone replaces the flat-file JSON + filesystem persistence layer with PostgreSQL + MinIO, adds full multi-user authentication (JWT with httpOnly cookies, TOTP 2FA, refresh token rotation), per-user quota enforcement, folder organization, document sharing, and pluggable cloud storage backends — following the same adapter pattern already used for AI providers.
---
## Confirmed Stack
### Use
| Package | Version | Purpose |
|---|---|---|
| `pyjwt[crypto]` | ≥2.12.1 | JWT — current FastAPI docs recommendation; replaces python-jose |
| `pwdlib[argon2]` | ≥0.2.0 | Password hashing — Argon2 is memory-hard (OWASP 2025) |
| `pyotp` | ≥2.9.0 | TOTP 2FA — implements RFC 6238 |
| `cryptography` (Fernet) | ≥44.0.0 | Credential encryption — AES-128-CBC + HMAC-SHA256 |
| `sqlalchemy[asyncio]` | ≥2.0.36 | ORM — async-native; better brownfield fit than SQLModel |
| `psycopg[asyncio,binary]` | ≥3.2.0 | PostgreSQL driver — single driver for async FastAPI + sync Alembic |
| `alembic` | ≥1.14.0 | DB migrations |
| `minio` | ≥7.2.0 | Object storage — presigned URL flow (FastAPI never proxies bytes) |
| `msgraph-sdk` + `azure-identity` | ≥1.0.0 / ≥1.19.0 | OneDrive — official Microsoft SDK |
| `google-api-python-client` + `google-auth-oauthlib` | ≥2.150.0 / ≥1.2.0 | Google Drive v3 |
| `webdav4` | ≥0.9.8 | Nextcloud + generic WebDAV |
### Do NOT Use
- `python-jose` — FastAPI dropped it; use PyJWT
- `passlib[bcrypt]` for new hashes — maintenance mode; keep only for migrating existing hashes
- `tiangolo/uvicorn-gunicorn-fastapi` Docker image — deprecated; use `python:3.12-slim`
- `localStorage` for any auth token — XSS-accessible; httpOnly cookie for refresh, Pinia memory for access token
- Single platform Fernet key for all users — HKDF per-user derivation required (catastrophic blast radius otherwise)
- `SQLModel` for this migration — async story is thin; SQLAlchemy 2.0 async is better for brownfield
---
## Table-Stakes Features for v1
### Confirmed (from PROJECT.md)
- Email + password registration + JWT sessions with refresh tokens
- TOTP 2FA + backup codes *(see gap below)*
- Password reset via email
- Per-user isolated storage (100 MB free tier)
- Quota tracking, enforcement at upload, visible indicator
- Folder CRUD, move documents, "Shared with me" folder
- Share by handle, view-only default, immediate revoke
- Cloud OAuth2 connect flow + credential encryption
- Admin: user management, quota adjustment, AI provider assignment
- Audit log (append-only, metadata only) + admin viewer
- In-browser PDF preview
### Gaps — Items PROJECT.md Missed
1. **TOTP backup codes** — Every competitor ships these. Without them, a lost phone permanently locks users out. Issue 8 to 10 single-use codes, store them hashed, and require the user to acknowledge them before TOTP is activated.
2. **Quota warnings at 80% and 95%** — PROJECT.md specifies rejection at 100% only. Pre-emptive warnings are table stakes (Google Drive, Dropbox both do this). In-app banner at 80% (amber) and 95% (red), plus a specific error at 100% showing current usage, rejected file size, and a link to storage settings.
3. **"Sign out all devices" / session revocation** — Users who believe their account is compromised need forced logout everywhere. Already handled by the `refresh_tokens` table — requires only an endpoint and a UI control.
4. **Breadcrumb navigation** — Folder CRUD is in PROJECT.md but not the navigation UX. Required for nested folder usability.
5. **Cloud storage connection status indicator** — PROJECT.md doesn't specify what happens when cloud storage is unreachable. Silent failure = data loss. Must show `ACTIVE | REQUIRES_REAUTH | ERROR` state and fall back to local storage with a clear message.
6. **Admin impersonation is an explicit architectural exclusion** — It must be documented as excluded, not just left unbuilt, because impersonation directly contradicts the privacy-first core value.
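The backup-code requirement in item 1 (single-use, stored hashed) could look roughly like the sketch below. The code format, the function names, and the use of plain SHA-256 are assumptions chosen to keep the example dependency-free. Because each code carries limited entropy, a slow hash such as the Argon2 hasher already planned for passwords may be the better choice in production.

```python
import hashlib
import hmac
import secrets
from typing import List, Set


def generate_backup_codes(count: int = 10) -> List[str]:
    """Return single-use codes formatted like 'a1b2c-3d4e5'."""
    codes = []
    for _ in range(count):
        raw = secrets.token_hex(5)  # 10 hex characters, 40 bits of entropy
        codes.append(f"{raw[:5]}-{raw[5:]}")
    return codes


def hash_code(code: str) -> str:
    """Normalize user input, then hash it; only the hash is persisted."""
    normalized = code.replace("-", "").strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()


def redeem(code: str, stored_hashes: Set[str]) -> bool:
    """Consume a code: it succeeds once, then its hash is removed."""
    candidate = hash_code(code)
    for stored in list(stored_hashes):
        if hmac.compare_digest(candidate, stored):
            stored_hashes.remove(stored)
            return True
    return False


codes = generate_backup_codes(10)          # shown to the user exactly once
stored = {hash_code(c) for c in codes}     # what the database would keep
```

Showing the plaintext codes only at enrollment and keeping only hashes means a database leak does not directly expose usable codes.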
---
## Critical Architectural Decisions (Lock Before Building)
These cannot be safely retrofitted:
**1. JWT in httpOnly cookies**
Refresh token: `httpOnly; Secure; SameSite=Strict` cookie. Access token: Pinia memory only. Never `localStorage`. Vue Router guard silently refreshes before redirecting to login. Axios `withCredentials: true`.
**2. HKDF per-user key derivation for cloud credentials**
`HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials")`. The master key lives only in the `CLOUD_CREDS_KEY` env var; the salt (the user ID) is already in the users table. Design this before writing the first line of credential storage: it cannot be added later without re-encrypting everything.
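As a rough illustration of the derivation above, the following sketch implements HKDF with SHA-256 (RFC 5869) using only the standard library and encodes the result in the url-safe base64 form that Fernet keys use. The function names and the string form of the user ID are assumptions. In the real codebase the HKDF class from the `cryptography` package, which is already a dependency, would be the more natural choice.

```python
import base64
import hashlib
import hmac


def hkdf_sha256(master_key: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                    # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_user_fernet_key(master_key: bytes, user_id: str) -> bytes:
    """Derive a per-user key; the result is 32 bytes in url-safe base64."""
    raw = hkdf_sha256(master_key, salt=user_id.encode(), info=b"cloud-credentials")
    return base64.urlsafe_b64encode(raw)


master = b"\x01" * 32  # placeholder; the real value would come from the env var
key_alice = derive_user_fernet_key(master, "user-alice")
key_bob = derive_user_fernet_key(master, "user-bob")
```

Because the derivation is deterministic, no per-user key needs to be stored: the same master key and user ID always reproduce the same key, while different users get unrelated keys.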
**3. Presigned MinIO URL flow**
FastAPI generates signed PUT URL → browser uploads directly to MinIO → FastAPI confirms object and commits quota atomically. FastAPI handles metadata only; bytes never pass through the API layer. Object keys: `{user_id}/{document_id}/{uuid4()}{ext}`. Human-readable filename in DB only.
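A minimal sketch of the object key scheme, assuming UUID identifiers and a hypothetical extension allow-list. The helper name `build_object_key` is illustrative only. The point is that nothing from the user-supplied filename except a vetted extension ever reaches the key.

```python
import uuid
from pathlib import PurePosixPath

# Assumption: an allow-list of extensions; the real list is a product decision.
ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".txt"}


def build_object_key(user_id: uuid.UUID, document_id: uuid.UUID, original_filename: str) -> str:
    """Build '{user_id}/{document_id}/{uuid4}{ext}' without trusting the filename."""
    # Only the suffix is taken from the filename; path components are discarded.
    ext = PurePosixPath(original_filename.replace("\\", "/")).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        ext = ""
    return f"{user_id}/{document_id}/{uuid.uuid4()}{ext}"


user = uuid.uuid4()
document = uuid.uuid4()
key = build_object_key(user, document, "../../etc/Report Final.PDF")
print(key)
```

The human-readable name ("Report Final.PDF" here) would be stored in the database row and used only for display and for the download filename.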
**4. Atomic PostgreSQL quota enforcement**
`UPDATE quotas SET used_bytes = used_bytes + $delta WHERE user_id = $uid AND (used_bytes + $delta) <= limit_bytes RETURNING used_bytes`. If 0 rows returned, delete the MinIO object and return 413. Never perform quota arithmetic in Python between two DB statements.
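The conditional update can be tried out with the sketch below. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without any services, and it checks the affected row count instead of using `RETURNING`. The table layout and the `reserve_quota` name are assumptions for illustration.

```python
import sqlite3


def reserve_quota(conn: sqlite3.Connection, user_id: int, delta: int) -> bool:
    """Add delta to used_bytes in one statement, only if the limit still holds."""
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + :delta "
        "WHERE user_id = :uid AND used_bytes + :delta <= limit_bytes",
        {"delta": delta, "uid": user_id},
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means the upload would exceed the quota


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id INTEGER PRIMARY KEY, used_bytes INTEGER, limit_bytes INTEGER)"
)
conn.execute("INSERT INTO quotas VALUES (1, 99, 100)")

first = reserve_quota(conn, 1, 1)    # fits exactly
second = reserve_quota(conn, 1, 1)   # would exceed the limit
used = conn.execute("SELECT used_bytes FROM quotas WHERE user_id = 1").fetchone()[0]
```

Because the check and the increment happen in the same statement, two concurrent uploads cannot both pass a check that was read before either of them wrote.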
**5. BackgroundTasks replacement before horizontal scaling**
FastAPI `BackgroundTasks` is per-instance — classification tasks cannot distribute across containers. Replace with Celery + Redis or pgqueuer (PostgreSQL-backed, no Redis dependency) before scaling to N instances. Decide during Phase 3 planning.
**Additional locked decisions:**
- Refresh tokens are opaque UUIDs stored hashed in DB (not JWTs); access tokens are short-lived JWTs (15 min).
- `refresh_tokens` table has `family_id` — on reuse of a rotated token, revoke entire family and emit security alert.
- Audit log uses `BIGSERIAL` PK; app DB user has INSERT + SELECT only (no UPDATE/DELETE).
- Admin endpoints for cloud connections return only `provider, display_name, connected_at, status` — never `credentials_enc`.
- Every document/folder endpoint asserts `resource.user_id == current_user.id` via centralized `assert_document_access()`.
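The centralized ownership check in the last bullet might be sketched as follows. The `User` and `Document` shapes and the `AccessDenied` exception are hypothetical stand-ins. In the FastAPI layer the exception would presumably be translated into an HTTP error, and view-only sharing would need an additional rule that is not shown here.

```python
from dataclasses import dataclass


class AccessDenied(Exception):
    """Raised when a user touches a resource they do not own."""


@dataclass(frozen=True)
class User:
    id: int
    is_admin: bool = False


@dataclass(frozen=True)
class Document:
    id: int
    user_id: int


def assert_document_access(document: Document, current_user: User) -> Document:
    # Ownership is the only criterion; the admin flag grants nothing here.
    if document.user_id != current_user.id:
        raise AccessDenied(
            f"user {current_user.id} cannot access document {document.id}"
        )
    return document


owner = User(id=1)
admin = User(id=2, is_admin=True)
doc = Document(id=10, user_id=1)
assert_document_access(doc, owner)  # passes silently
```

Routing every endpoint through one function makes the negative-access test in Phase 4 a test of a single code path rather than of each endpoint separately.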
---
## 5-Phase Migration Sequence
### Phase 1 — Infrastructure Foundation
Wire PostgreSQL + MinIO into Docker Compose. Create `db/models.py` with full schema. Alembic initial migration. Async session dependency. No API changes — flat-file code still runs. Gate: all services boot cleanly; migrations apply; no behavior change.
### Phase 2 — Users and Authentication
Users, refresh_tokens, quotas tables. Auth endpoints (register, login, refresh, TOTP, password reset, forced logout). TOTP with backup codes. Password reset does NOT auto-login (routes through TOTP gate). `get_current_user` + `get_current_admin` FastAPI dependencies. Admin user management endpoints. Vue auth store (Pinia memory + httpOnly cookie), Router guard, Axios interceptors. Gate: admin JWT returns 403 on document endpoints; backup codes issued and acknowledged at enrollment.
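The refresh token rotation in this phase, including family revocation on reuse, can be sketched with an in-memory store. The class and method names are assumptions, and the dictionary stands in for the `refresh_tokens` table. Only the token hash is kept, matching the locked decision that tokens are opaque and stored hashed.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


class TokenReuseDetected(Exception):
    """A refresh token that was already rotated has been presented again."""


@dataclass
class TokenRecord:
    user_id: int
    family_id: str
    revoked: bool = False


class RefreshTokenStore:
    """In-memory stand-in for the refresh_tokens table (illustration only)."""

    def __init__(self) -> None:
        self._records: Dict[str, TokenRecord] = {}

    @staticmethod
    def _digest(token: str) -> str:
        return hashlib.sha256(token.encode()).hexdigest()

    def issue(self, user_id: int, family_id: Optional[str] = None) -> str:
        token = str(uuid.uuid4())  # opaque value, never a JWT
        family = family_id or str(uuid.uuid4())
        self._records[self._digest(token)] = TokenRecord(user_id, family)
        return token

    def revoke_family(self, family_id: str) -> None:
        for record in self._records.values():
            if record.family_id == family_id:
                record.revoked = True

    def rotate(self, presented: str) -> str:
        record = self._records.get(self._digest(presented))
        if record is None:
            raise KeyError("unknown refresh token")
        if record.revoked:
            # Reuse of a rotated token: treat as theft, revoke the whole family.
            self.revoke_family(record.family_id)
            raise TokenReuseDetected(record.family_id)
        record.revoked = True
        return self.issue(record.user_id, record.family_id)
```

A "sign out all devices" endpoint would follow the same idea, revoking every family that belongs to the user instead of a single one.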
### Phase 3 — Document Migration to PostgreSQL + MinIO
Dual-write window: new uploads write to both stores. Migration script copies historical flat-file data to PostgreSQL + MinIO. Count reconciliation assertion (go/no-go gate). Flip read source to PostgreSQL. Remove JSON write path. Presigned URL flow for all uploads/downloads. `asyncio.to_thread()` wrapping all MinIO SDK calls. Gate: concurrent upload test at 99% quota — only one succeeds.
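The `asyncio.to_thread()` wrapping mentioned above exists because the MinIO Python client is synchronous, so calling it directly inside an `async` endpoint would block the event loop. The sketch below shows the pattern with a hypothetical blocking function in place of a real SDK call, so it runs without a MinIO server.

```python
import asyncio
import time
from typing import Dict, List


def blocking_stat_object(bucket: str, key: str) -> Dict[str, object]:
    """Hypothetical stand-in for a blocking storage SDK call."""
    time.sleep(0.05)  # simulates network latency
    return {"bucket": bucket, "key": key, "size": 1234}


async def stat_object(bucket: str, key: str) -> Dict[str, object]:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_stat_object, bucket, key)


async def main() -> List[Dict[str, object]]:
    # Three lookups overlap instead of running back to back.
    return await asyncio.gather(
        *(stat_object("documents", f"k{i}") for i in range(3))
    )


results = asyncio.run(main())
```

Keeping the wrapper in one module (rather than sprinkling `to_thread` across endpoints) makes it easier to swap the storage client later.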
### Phase 4 — Multi-User Isolation, Quotas, Folders, Sharing
All queries gain `WHERE user_id = current_user.id`. Quota bar (80%/95% warnings). Folder CRUD + breadcrumbs. Document move + sort. Share by handle + "Shared with me" folder. Audit log wired to all events. Admin audit viewer. In-browser PDF preview. Gate: negative-access test (admin cannot retrieve any document content); quota reconciliation drift <1%.
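The quota bar thresholds could be computed with a helper like the one below. The enum and function names are assumptions. Integer arithmetic is used so that the 80% and 95% boundaries are exact.

```python
from enum import Enum


class QuotaLevel(Enum):
    OK = "ok"
    WARNING = "warning"    # amber banner at 80%
    CRITICAL = "critical"  # red banner at 95%
    FULL = "full"          # uploads rejected at 100%


def quota_level(used_bytes: int, limit_bytes: int) -> QuotaLevel:
    """Map usage to a banner level without floating-point rounding."""
    if limit_bytes <= 0 or used_bytes >= limit_bytes:
        return QuotaLevel.FULL
    if used_bytes * 100 >= limit_bytes * 95:
        return QuotaLevel.CRITICAL
    if used_bytes * 100 >= limit_bytes * 80:
        return QuotaLevel.WARNING
    return QuotaLevel.OK


FREE_TIER = 100 * 1024 * 1024  # 100 MB free tier from the requirements
level = quota_level(85 * 1024 * 1024, FREE_TIER)
```

Returning the level from the API alongside the raw byte counts lets the frontend pick the banner colour without duplicating the thresholds.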
### Phase 5 — Cloud Storage Backends
`StorageBackend` ABC + factory (mirrors `ai/` pattern). `MinIOBackend`, `OneDriveBackend`, `GoogleDriveBackend`, `NextcloudBackend`, `WebDAVBackend`. OAuth2 connect/disconnect flows. Connection status UX. HKDF key derivation for all credentials. `delete_user_files()` on account deletion. Gate: mock `invalid_grant` → REQUIRES_REAUTH (not 500); account deletion asserts `delete_user_files()` per connection.
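One possible shape for the `StorageBackend` abstraction and the connection status mapping is sketched below. Apart from `StorageBackend`, `delete_user_files()`, and the three status names, every identifier is an assumption. `InMemoryBackend` is a toy used only to exercise the interface, and `invalid_grant` is the standard OAuth 2.0 error code returned when a refresh token is no longer valid.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict, Optional, Type


class ConnectionStatus(Enum):
    ACTIVE = "ACTIVE"
    REQUIRES_REAUTH = "REQUIRES_REAUTH"
    ERROR = "ERROR"


class StorageBackend(ABC):
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def delete_user_files(self, user_id: str) -> int: ...


class InMemoryBackend(StorageBackend):
    """Toy backend used only to exercise the interface."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def delete_user_files(self, user_id: str) -> int:
        doomed = [k for k in self.objects if k.startswith(f"{user_id}/")]
        for key in doomed:
            del self.objects[key]
        return len(doomed)


BACKENDS: Dict[str, Type[StorageBackend]] = {"memory": InMemoryBackend}


def get_backend(name: str) -> StorageBackend:
    """Factory: look the backend up by name instead of branching in callers."""
    try:
        return BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown storage backend: {name}") from None


def status_from_oauth_error(error_code: Optional[str]) -> ConnectionStatus:
    """Map a token-refresh outcome to the status shown in the UI."""
    if error_code is None:
        return ConnectionStatus.ACTIVE
    if error_code == "invalid_grant":
        return ConnectionStatus.REQUIRES_REAUTH  # user must reconnect, not a 500
    return ConnectionStatus.ERROR
```

Real backends would register themselves in the same mapping, which keeps the factory identical in spirit to the existing AI provider pattern.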
---
## Top 5 Pitfalls by Risk
| # | Pitfall | Severity | Fix |
|---|---|---|---|
| 1 | JWT in localStorage — XSS bypasses TOTP entirely | CRITICAL | httpOnly cookie for refresh, Pinia memory for access token |
| 2 | Quota race condition — concurrent uploads bypass limit | DATA INTEGRITY | Atomic PostgreSQL `UPDATE ... RETURNING` |
| 3 | TOTP bypass via password reset — full 2FA bypass via email compromise | SECURITY | Reset issues `password_reset_pending` state, not a full session |
| 4 | Single Fernet key for all cloud credentials — catastrophic on key leak | CATASTROPHIC | HKDF per-user derivation before first credential is stored |
| 5 | Path traversal in MinIO keys — cross-user data access | SECURITY | UUID-only MinIO keys; human filename in DB only; never reconstruct key from request parameters |
---
## Confidence Assessment
| Area | Confidence | Notes |
|---|---|---|
| Stack | MEDIUM-HIGH | Core libraries confirmed from FastAPI official release notes (PyJWT, pwdlib, SQLAlchemy 2.0, psycopg v3). Cloud SDK minor versions — verify on PyPI before pinning. |
| Features | MEDIUM | Based on Google Drive, Dropbox, Box, Paperless-ngx knowledge through Aug 2025. |
| Architecture | HIGH | FastAPI DI pattern from official docs; S3 presigned URLs and atomic PostgreSQL quota update are industry standards. |
| Pitfalls | HIGH | Based on OWASP cheat sheets, RFC 9700 (refresh token rotation), and GDPR Article 17 (stable regulatory text). |
**Overall: MEDIUM-HIGH**
---
## Gaps to Resolve During Planning
- Verify cloud SDK minor versions on PyPI before pinning
- Confirm PyOTP `valid_window` default in current docs (recommend `valid_window=1` for ±30s clock drift)
- Decide Celery + Redis vs pgqueuer during Phase 3 (depends on Redis availability in deployment target)
- Audit existing codebase for any existing bcrypt hashes before removing `passlib`
- Validate MinIO Docker Compose public endpoint in Phase 3 acceptance testing (presigned URLs must use host-accessible address, not internal Docker network name)
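To make the `valid_window` item above concrete, here is a standard-library sketch of TOTP (RFC 6238, SHA-1, 30-second step) with a drift window. It is not PyOTP itself: the `totp` and `verify` functions are illustrative, and the window parameter only mirrors the idea behind PyOTP's `valid_window` argument. With a window of 1, the previous, current, and next time steps are all accepted.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, for_time: int, digits: int = 6, step: int = 30) -> str:
    """Compute a TOTP code (RFC 6238) for the given Unix time."""
    counter = for_time // step
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify(secret: bytes, code: str, now: int, valid_window: int = 0, step: int = 30) -> bool:
    """Accept codes from up to valid_window steps before or after 'now'."""
    for drift in range(-valid_window, valid_window + 1):
        if hmac.compare_digest(totp(secret, now + drift * step, step=step), code):
            return True
    return False


secret = b"12345678901234567890"             # RFC 6238 test secret
code = totp(secret, 1111111109)              # generated on the user's phone
strict = verify(secret, code, 1111111111)    # server clock is one step ahead
lenient = verify(secret, code, 1111111111, valid_window=1)
```

A wider window tolerates more clock drift but also lengthens the time an intercepted code stays usable, so a window of 1 is a common compromise.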