docs: add domain research (4 dimensions + synthesis)

This commit is contained in:
curo1305
2026-05-21 20:42:16 +02:00
parent 2a298a4276
commit daa7e0f289
5 changed files with 2406 additions and 0 deletions
+550
View File
@@ -0,0 +1,550 @@
# Pitfalls Research
**Domain:** Multi-user SaaS document management platform (FastAPI + Vue 3, PostgreSQL + MinIO, cloud storage integrations)
**Researched:** 2026-05-21
**Confidence:** HIGH (auth/storage/multi-tenancy patterns are well-established; specific FastAPI + MinIO combination is MEDIUM — no web search available)
---
## Critical Pitfalls
### Pitfall 1: JWT in localStorage — XSS Gives Full Account Takeover
**What goes wrong:**
The Vue 3 SPA stores the JWT access token in `localStorage`. Any JavaScript injected via XSS (in file names, document content previewed in the UI, a compromised dependency) can call `localStorage.getItem('token')` and exfiltrate a long-lived credential. The attacker then impersonates the user from any origin, bypasses TOTP entirely (the token is post-authentication), and can access all documents, including cloud storage credentials.
**Why it happens:**
`localStorage` is the path of least resistance in SPAs. It survives page reloads, works with Axios interceptors trivially, and requires no server-side session state. FastAPI tutorials almost universally use `Authorization: Bearer` headers set from localStorage.
**How to avoid:**
- Issue the JWT as an **httpOnly, SameSite=Strict, Secure cookie** — JavaScript cannot read it.
- Use a short-lived access token (15 minutes) in the httpOnly cookie.
- Issue a separate refresh token (httpOnly cookie, longer TTL, `/auth/refresh` path-scoped) to rotate access tokens silently.
- The Vue frontend never holds the raw token string. Axios is configured with `withCredentials: true`; the browser attaches cookies automatically.
- CSRF protection: because `SameSite=Strict` blocks cross-site cookie submission, CSRF tokens are not strictly required for same-origin SPAs, but add a CSRF header check (`X-Requested-With: XMLHttpRequest`) as defence-in-depth.
**Warning signs:**
- `localStorage.getItem` anywhere in auth-related frontend code
- JWT decode functions in frontend code (the frontend should not need to decode a token it can't read)
- `Authorization: Bearer ${token}` header set manually in Axios interceptor
**Phase to address:** Auth phase (Phase 1 — Users & Auth)
---
### Pitfall 2: TOTP Bypass via Password Reset Flow
**What goes wrong:**
The password reset flow issues a one-time token by email. After the user resets their password, the system logs them in and redirects to the dashboard — skipping the TOTP prompt because the session was created through the reset path, not the login path. An attacker who compromises a user's email account can therefore completely bypass TOTP.
**Why it happens:**
Password reset and login are treated as separate code paths. The TOTP check lives in the login handler; the reset handler creates a session directly after credential update without going through the TOTP gate.
**How to avoid:**
- After a successful password reset, issue a partial session or a `password_reset_pending` state, not a full authenticated session.
- Force the user to complete the full login flow (including TOTP if enabled) from the new credentials.
- Alternatively, on password reset completion, invalidate all existing sessions and send the user to the login page (no auto-login).
- Log password resets in the audit trail with IP and user-agent.
**Warning signs:**
- Password reset handler calls the same session-creation function as login but omits 2FA state checks
- No `mfa_verified` flag on sessions (only `authenticated`)
- Users can reach protected endpoints via a token created from the reset path
**Phase to address:** Auth phase (Phase 1 — Users & Auth)
---
### Pitfall 3: Refresh Token Rotation — Stolen Refresh Token Not Detected
**What goes wrong:**
Refresh tokens are issued as single-use rotating credentials. If an attacker steals a refresh token and uses it before the legitimate user does, the server rotates the token and the attacker has a valid new one. The legitimate user's next refresh request fails — but without a detection mechanism the failure just looks like a session expiry, no alert is raised, and the attacker continues with the stolen session indefinitely.
**Why it happens:**
Teams implement the "rotate on use" mechanic without implementing the "revoke family on reuse" detection. The `refresh_tokens` table lacks a `family_id` column linking reissued tokens.
**How to avoid:**
- Store refresh tokens in the database with a `family_id` (UUID for the original issuance chain) and a `revoked` flag.
- When a refresh token is presented: if it is already marked `revoked` (i.e., a previously rotated token), revoke the **entire family** — force logout of all sessions for that user.
- Emit a security alert (audit log + optionally email) when a reuse attempt is detected.
- Refresh tokens should be hashed before storage (bcrypt or SHA-256 with a per-row salt), same as passwords.
**Warning signs:**
- Refresh tokens stored as plain values in DB with no `revoked` column
- No `family_id` linking related rotation chains
- Stolen refresh token detection treated as "v2 feature"
**Phase to address:** Auth phase (Phase 1 — Users & Auth)
---
### Pitfall 4: TOTP — Timing Attack on Code Verification
**What goes wrong:**
TOTP verification uses Python `==` string comparison. Python's `==` on strings is not constant-time — it short-circuits on the first differing character. A sufficiently sophisticated timing oracle (millions of requests from a local network) can distinguish valid from invalid codes, reducing the 6-digit brute-force space. More practically: without rate limiting, an attacker can brute-force all 1,000,000 possible 6-digit codes during a 30-second window.
**Why it happens:**
TOTP libraries (PyOTP) return a string; developers do `if provided == expected`. Rate limiting is added "later" and often never lands.
**How to avoid:**
- Use `hmac.compare_digest(provided, expected)` instead of `==` for TOTP comparison.
- Rate-limit TOTP attempts: 5 attempts per 30-second window, 15-minute lockout on excess.
- The lockout must be stored server-side (Redis or DB), not client-side.
- Accept only the current window and optionally ±1 window for clock drift — do not accept wider ranges.
- Log every failed TOTP attempt with IP.
**Warning signs:**
- `if totp.verify(code):` without inspecting PyOTP's internal comparison method
- No rate limit on `POST /auth/totp/verify`
- TOTP window set to 2+ (default is ±1 window = 90 seconds valid — wider than needed)
**Phase to address:** Auth phase (Phase 1 — Users & Auth)
---
### Pitfall 5: Path Traversal in User File Access (MinIO Object Keys)
**What goes wrong:**
MinIO object keys are constructed from user input: `f"users/{user_id}/{filename}"`. If `filename` contains `../`, `../../`, or URL-encoded equivalents (`%2e%2e%2f`), the resulting key may escape the user's prefix and land in another user's namespace. A request for `../../other_user_id/secret.pdf` resolves to `users/other_user_id/secret.pdf`.
**Why it happens:**
Developers trust that MinIO will sanitize paths. It does not — S3-compatible APIs treat object keys as arbitrary strings. The prefix-based isolation is only as safe as the key construction code.
**How to avoid:**
- Never use raw filenames as object key components. Generate a UUID (or UUID + original extension) as the stored key: `f"users/{user_id}/{uuid4()}{ext}"`.
- Store the human-readable filename in the database metadata row, completely decoupled from the storage key.
- If filenames must appear in keys for any reason, strip or reject any `/`, `\`, `..`, `%` characters before key construction.
- On retrieval, look up the object key from the database row (which is owned by `user_id`) rather than constructing it from user input.
**Warning signs:**
- `object_key = f"users/{user_id}/{request.filename}"`
- File download endpoint accepts a `path` or `filename` query parameter and constructs the key from it
- No database lookup intermediating between "user requests file" and "MinIO key used"
**Phase to address:** Storage migration phase (Phase 2 — DB + MinIO migration)
---
### Pitfall 6: Quota Race Condition — Concurrent Uploads Bypass the Limit
**What goes wrong:**
Two upload requests arrive simultaneously for a user at 99 MB of a 100 MB quota. Both read quota usage as 99 MB, both pass the `99 + 1 < 100` check, both proceed to upload — the user ends at 101 MB. At larger scale (many simultaneous large uploads) the overage can be significant.
**Why it happens:**
Quota enforcement is implemented as: read current usage → check → write file → update usage. This is a classic check-then-act race when the check and the write are not atomic.
**How to avoid:**
- Enforce quota atomically using a `SELECT ... FOR UPDATE` on the quota row before uploading, or use a PostgreSQL advisory lock keyed on `user_id`.
- Better: use an optimistic update: `UPDATE quotas SET used_bytes = used_bytes + $new WHERE user_id = $uid AND used_bytes + $new <= limit_bytes RETURNING used_bytes`. If 0 rows are updated, the quota was exceeded — reject before touching MinIO.
- Only update `used_bytes` after the MinIO upload succeeds, but hold the lock/reservation through the upload, or use a two-phase: reserve bytes → upload → confirm, with a cleanup job for stuck reservations.
- Never read quota, do arithmetic in Python, then write back as two separate statements.
**Warning signs:**
- `current = get_quota(user_id); if current + size <= limit: upload()`
- No database transaction wrapping quota check and update
- Quota table updated with `SET used_bytes = $computed_value` (full overwrite) rather than `used_bytes + delta`
**Phase to address:** Storage migration phase (Phase 2 — DB + MinIO migration) and Quotas phase
---
### Pitfall 7: Admin Privilege Escalation via Missing Ownership Checks
**What goes wrong:**
An admin API endpoint is gated on `is_admin=True` but document-access endpoints only check `is_authenticated`. An admin user calling `GET /api/documents/{id}` with any document ID can read any user's document because the handler checks authentication but not `document.owner_id == current_user.id`. The privacy-first model is violated without any special exploit — just a correctly authenticated request.
**Why it happens:**
Authorization is implemented as authentication + role checks ("is this user an admin?") without resource-level ownership verification ("does this user own this resource?"). FastAPI dependency injection makes it easy to write `current_user = Depends(get_current_user)` and forget to check `resource.user_id == current_user.id`.
**How to avoid:**
- Every document/file/folder endpoint must assert `resource.user_id == current_user.id` (or check share grants). This check cannot be optional or deferred.
- Admins access user account metadata via separate admin-scoped endpoints that explicitly exclude document content, file URLs, and cloud credentials.
- Write a test: log in as admin, attempt `GET /documents/{document_owned_by_other_user}`, assert `403`.
- Use a centralized `assert_document_access(document, current_user)` function rather than inline checks to prevent omissions.
**Warning signs:**
- Document endpoints that check `if not current_user` but not `if document.user_id != current_user.id`
- Admin endpoints that return document content or presigned URLs
- No explicit tests for cross-user access attempts
**Phase to address:** Auth phase (Phase 1) and every subsequent phase that adds resource-access endpoints
---
### Pitfall 8: Cloud Credential Leakage via Admin Query
**What goes wrong:**
Cloud storage credentials (OAuth tokens, Nextcloud passwords) are encrypted at rest. But a query like `SELECT * FROM cloud_connections WHERE user_id = $uid` returns the ciphertext. If an admin dashboard endpoint runs this query and serializes the full row to JSON, the ciphertext ships to the admin browser. An admin cannot decrypt it without the key — but the ciphertext is now in browser history, proxy logs, and admin audit records. If the encryption key is later exposed, all credentials decrypt retroactively.
**Why it happens:**
ORM models serialize all columns by default. Developers add `encrypted_credentials` to the model and forget to exclude it from admin-facing serializers.
**How to avoid:**
- The `cloud_connections` table's credential column (`encrypted_token`, `encrypted_refresh_token`) must be **excluded** from all serialization by default.
- Use explicit Pydantic response models for every endpoint — no `orm_mode` with full model pass-through.
- Admin endpoints for cloud connections return only: `provider`, `account_label`, `connected_at`, `last_used_at`, `status` — never the credential column, not even ciphertext.
- Audit log cloud credential access separately: the only code path that should ever read the encrypted column is the storage adapter, not admin or user-info endpoints.
**Warning signs:**
- `CloudConnection` Pydantic schema includes `encrypted_token` field
- Admin user-detail endpoint returns full cloud connection rows
- ORM model uses `model_config = ConfigDict(from_attributes=True)` without explicit field exclusions
**Phase to address:** Cloud storage integration phase (Phase 3 or 4)
---
### Pitfall 9: Flat-file to PostgreSQL Migration — Data Loss During Cutover
**What goes wrong:**
The migration script reads all JSON files, transforms them to DB rows, and then a flag switches the app to use PostgreSQL. Documents uploaded during the migration window are written to the old JSON store (because the flag has not flipped yet) and are missed by the migration script, which ran before they arrived. After cutover, those documents are invisible.
**Why it happens:**
Migrations are planned as offline events ("we'll take the app down for 5 minutes") and then discovered to be impractical — the app is used 24/7 or downtime feels risky. The team runs the migration online without dual-write.
**How to avoid:**
- Plan the migration in three phases:
1. **Dual-write**: deploy code that writes to both JSON and PostgreSQL, reads from JSON. All new documents land in both stores.
2. **Backfill**: run migration script to copy historical JSON records to PostgreSQL. New records are already there.
3. **Cutover**: flip read source to PostgreSQL, verify counts match, remove JSON write path.
- The dual-write window can be as short as one deployment cycle.
- Include a reconciliation check: `assert doc_count_json == doc_count_db` before cutting over.
- Keep the old JSON store read-only for 1 week post-cutover as a rollback option.
**Warning signs:**
- Migration is planned as "take app down, run script, bring back up"
- No count-reconciliation step
- No rollback plan documented before migration begins
**Phase to address:** DB + MinIO migration phase (Phase 2)
---
### Pitfall 10: Blocking I/O Inside Async FastAPI Handlers (Existing Issue)
**What goes wrong:**
The codebase already uses synchronous `filelock` and `open()` inside `async def` handlers (CONCERNS.md item 6). After migration to PostgreSQL + MinIO, if synchronous DB drivers (psycopg2) or synchronous MinIO client calls replace the file I/O without wrapping in `asyncio.to_thread()`, the event loop stalls on every I/O operation. Under concurrent load (multiple users uploading), requests queue behind each other even though the hardware is idle.
**Why it happens:**
SQLAlchemy sync engine + psycopg2 is the default FastAPI tutorial stack. The MinIO Python SDK (`minio` package) is synchronous. Developers add `await` in front of calls that are not coroutines, get a type error, remove the `await`, and ship blocking code.
**How to avoid:**
- Use SQLAlchemy async engine with `asyncpg` driver: `create_async_engine("postgresql+asyncpg://...")`.
- Wrap all MinIO SDK calls in `asyncio.to_thread()` since there is no official async MinIO client: `await asyncio.to_thread(minio_client.put_object, ...)`.
- Alternatively use `aioboto3` (async boto3) which works with MinIO's S3-compatible API.
- `aiofiles` is already in `requirements.txt` — use it for any remaining local file operations.
- Run `pytest-asyncio` + `asyncio.get_event_loop().set_debug(True)` in tests; the debug mode logs blocking calls.
**Warning signs:**
- `from sqlalchemy import create_engine` (not `create_async_engine`)
- `import psycopg2` anywhere in application code
- `minio_client.put_object(...)` not wrapped in `asyncio.to_thread`
- Uvicorn logs show high request latency even with low CPU usage
**Phase to address:** DB + MinIO migration phase (Phase 2)
---
### Pitfall 11: N+1 Queries on Document Listing
**What goes wrong:**
`GET /api/documents` returns a list of documents, each including folder name, topic names, and share count. The handler fetches the document list, then for each document issues separate queries for folder, topics, and shares. 100 documents = 301+ queries per page load. The existing codebase already has an O(N) disk scan equivalent (CONCERNS.md items 8, 9); the PostgreSQL migration preserves this pattern if not corrected.
**Why it happens:**
SQLAlchemy lazy loading is the default. `document.folder.name` triggers a query. Developers don't see it until production load hits.
**How to avoid:**
- Use SQLAlchemy `joinedload` or `selectinload` options on the document list query to eagerly load related entities.
- The list endpoint should be a single SQL query with JOINs, not a loop.
- Add `EXPLAIN ANALYZE` checks as part of the phase acceptance criteria.
- Enable SQLAlchemy `echo=True` in development to log every SQL statement.
- Pagination at the database level: `LIMIT 50 OFFSET $n` — not "fetch all, slice in Python."
**Warning signs:**
- SQLAlchemy models using default relationship loading without explicit `lazy=` options
- `for doc in documents: doc.folder.name` pattern
- No `joinedload` or `selectinload` in list query definitions
- Query count in tests grows linearly with fixture size
**Phase to address:** DB + MinIO migration phase (Phase 2); enforce in document listing feature phase
---
### Pitfall 12: MinIO Presigned URL Expiry — Stale Links in the UI
**What goes wrong:**
The document preview and download UI displays links generated as MinIO presigned URLs with a 1-hour expiry. The Vue frontend fetches the document list on page load, stores the URLs in Pinia state, and renders them as `<img src>` or `<a href>`. If the user leaves the tab open for 2 hours and then clicks a link, they get a 403 (presigned URL expired). No error is shown; the image just fails to load or the download silently fails.
**Why it happens:**
Presigned URLs feel like permanent links. Teams generate them at list time for convenience and cache them in frontend state without expiry awareness.
**How to avoid:**
- Do not embed presigned URLs in list responses. Return only document metadata.
- Generate presigned URLs **on demand**: a separate `GET /api/documents/{id}/download-url` endpoint generates a short-lived URL (515 minutes) at the moment of user intent.
- Alternatively, proxy document bytes through the FastAPI backend (`GET /api/documents/{id}/file` streams from MinIO) — eliminates presigned URL complexity at the cost of bandwidth.
- If presigned URLs are cached in frontend state, include the expiry timestamp and regenerate before expiry.
**Warning signs:**
- Presigned URLs included in document list response JSON
- Frontend stores presigned URLs in Pinia state without TTL tracking
- No `GET /api/documents/{id}/download-url` endpoint (or equivalent)
- `presigned_url_expiry = 3600` with no frontend refresh logic
**Phase to address:** DB + MinIO migration phase (Phase 2) and document access phase
---
### Pitfall 13: OAuth Token for Cloud Storage — Revocation Not Handled
**What goes wrong:**
A user connects Google Drive via OAuth. The app stores the encrypted access token + refresh token. Six months later, the user revokes the app's access in their Google account settings. The next time DocuVault tries to access their Drive, it gets a 401. The refresh token exchange also fails with `invalid_grant`. The app has no handler for this — it retries, logs an error, and the user sees a generic "storage error" with no path to reconnect.
**Why it happens:**
OAuth happy-path is well-documented. Revocation and `invalid_grant` handling are in footnotes. Teams handle 401 → refresh → retry but don't handle refresh failure → user notification.
**How to avoid:**
- Wrap all cloud storage adapter calls in a two-level error handler:
- Level 1: 401 → attempt refresh → retry.
- Level 2: refresh fails with `invalid_grant` or `token_revoked` → mark connection as `REQUIRES_REAUTH` in DB, **do not retry**, surface a "reconnect required" notice to the user.
- The `cloud_connections` table must have a `status` column: `ACTIVE | REQUIRES_REAUTH | ERROR`.
- The UI must poll or react to `REQUIRES_REAUTH` state and prompt the user to re-authorize, not just show an error toast.
- Never silently swallow a revoked-token error or retry indefinitely.
**Warning signs:**
- Cloud storage adapter raises generic `StorageError` for all OAuth errors without distinguishing revocation
- No `status` or `needs_reauth` column on `cloud_connections`
- Reconnect UI not planned in feature scope
**Phase to address:** Cloud storage integration phase
---
### Pitfall 14: Cloud Storage Rate Limits — No Backoff, No Per-User Throttling
**What goes wrong:**
Multiple users with Google Drive connected trigger simultaneous document uploads. The app hits Google's per-app rate limit (Drive API: 20,000 requests/100 seconds/user, but also global per-project limits). Google returns 429. The app has no retry-with-backoff; uploads fail and the errors are attributed to the individual user, not the platform-level limit.
**Why it happens:**
Rate limits are per-API-key, not per-user. A single misbehaving user or burst of activity for all users shares the same quota. Teams only test with a single user.
**How to avoid:**
- Implement exponential backoff with jitter for all cloud storage adapter calls (start 1s, max 32s, 3 retries).
- Use a task queue (Celery or FastAPI BackgroundTasks with a semaphore) for cloud-destined uploads rather than processing inline in the request handler.
- Log 429 responses separately from other errors — they indicate platform-level throttling, not user errors.
- Per-provider rate limit documentation should be captured in adapter docstrings.
**Warning signs:**
- Cloud upload performed synchronously inside the HTTP request handler with no retry logic
- No `tenacity` or equivalent retry decorator on provider calls
- 429 responses from cloud APIs cause 500 responses to the end user
**Phase to address:** Cloud storage integration phase
---
### Pitfall 15: GDPR Right to Erasure — Cloud Copies Not Deleted
**What goes wrong:**
A user requests account deletion. The app deletes the PostgreSQL rows, the MinIO objects, and the user record. But if the user's documents were stored in Google Drive or OneDrive via the cloud storage adapter, those files remain on the cloud provider. The user's data has not actually been erased from all systems.
**Why it happens:**
Account deletion is implemented against the systems the team controls (PostgreSQL, MinIO). Cloud storage is treated as "the user's own storage" so deletion is skipped. Under GDPR, if the platform wrote files to the user's cloud storage on their behalf as a data processor, erasure obligations may extend there.
**How to avoid:**
- Account deletion must enumerate all `cloud_connections` for the user and call `adapter.delete_all_user_files()` (or equivalent) on each active connection.
- If the cloud connection is already in `REQUIRES_REAUTH` state, the deletion cannot proceed automatically — log a compliance alert and require manual follow-up or notify the user to delete manually.
- Document the data flow in a data map: "files stored in X go to Y" — this is a GDPR Article 30 requirement.
- Right to erasure flow must be tested as part of acceptance criteria, not assumed to work.
**Warning signs:**
- Account deletion handler only operates on PostgreSQL and MinIO
- No `delete_user_files` method in the `StorageBackend` interface
- No GDPR data map in project documentation
**Phase to address:** Cloud storage integration phase and account management phase
---
### Pitfall 16: Encryption Key Management — Single Key for All Users
**What goes wrong:**
All cloud credentials are encrypted with one Fernet key stored in `CLOUD_ENCRYPTION_KEY` env var. If this key is exposed (leaked `.env`, compromised deployment config, insider threat), every user's cloud credentials decrypt at once. The privacy-first model collapses entirely.
**Why it happens:**
One key is simpler than per-user keys. Fernet key derivation per user is slightly more complex to implement.
**How to avoid:**
- Derive a per-user encryption key from the master key + a user-specific salt: `HKDF(master_key, salt=user_id_bytes, info=b"cloud-credentials")`. The salt is stored in the users table.
- Even if the master key leaks, an attacker still needs each user's salt to decrypt their credentials.
- Rotate the master key on a schedule: re-encrypt all stored credentials with the new key in a background job.
- Audit log any code path that accesses the encryption key — this should only ever happen in the storage adapter, never in API handlers.
- Never log the encryption key or derived key material, even at DEBUG level.
**Warning signs:**
- Single `ENCRYPTION_KEY` env var with no per-user derivation
- Encryption/decryption called directly in API endpoint handlers (not in the adapter layer)
- No key rotation plan documented
**Phase to address:** Cloud storage integration phase (design the key derivation before writing the first line of credential storage code)
---
### Pitfall 17: Shared Document — Quota Not Charged to Sharer Correctly After Revoke
**What goes wrong:**
User A shares a document with User B. The document is not duplicated — it stays in User A's storage and User A's quota. User A revokes the share. If the quota update on revoke has a bug (or is missing), User A's used bytes may drift from reality over time. At scale, quota inaccuracies accumulate and users can either be locked out prematurely or exceed limits invisibly.
**Why it happens:**
Sharing is implemented as a metadata-only operation (a `shares` table row), but quota accounting is only re-examined on upload and delete. Edge cases (revoke, owner deletion while share is active, cloud storage document quota) are skipped.
**How to avoid:**
- Quota is a property of documents, not shares. Ensure the quota model is: "sum of `file_size` for all documents where `owner_id = user_id`." Shares do not affect quota calculation.
- Recalculate quota from source-of-truth (a `SUM` query) as a periodic background job and reconcile against the cached `used_bytes` value. Alert if drift exceeds 1%.
- The `DELETE /documents/{id}` handler must atomically decrement quota and delete the DB row and MinIO object in a single transaction (DB) + best-effort cleanup (MinIO).
**Warning signs:**
- `used_bytes` stored as a counter updated by application code, never reconciled against a database SUM
- Share revocation handler doesn't touch quota (correct if quota is owner-only, but must be verified)
- Document deletion does not atomically update quota
**Phase to address:** Quotas and sharing phases
---
## Technical Debt Patterns
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Single global encryption key for cloud credentials | Simple env var config | Single point of failure; one leak decrypts all users | Never for a SaaS product |
| localStorage JWT storage | Easy Axios integration | XSS → full account takeover, TOTP bypass | Never; httpOnly cookies have identical DX once set up |
| Synchronous DB driver (psycopg2) with FastAPI | Familiar, simple | Event loop blocking under any load | Never for new greenfield code; acceptable only during an explicitly time-boxed spike |
| Presigned URL in list response | One fewer endpoint | Stale URLs, user-facing failures after expiry | Only for short-lived signed URLs (<5 min) with TTL tracking in frontend |
| Skip per-user key derivation | Fewer moving parts | Catastrophic blast radius on key leak | Never for cloud credential encryption |
| Inline quota check (read-check-write) without atomic update | Simple code | Silent quota bypass under concurrent load | Never for enforced limits |
| Soft-coding 8-char UUID prefix as document IDs (existing) | Shorter IDs | Collision risk at scale; insecure for auth tokens | Replace with full UUID before multi-user goes live |
---
## Integration Gotchas
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Google Drive OAuth | Storing `access_token` only; ignoring `refresh_token` | Store both; refresh_token is issued only on first authorization — store immediately or it's lost |
| OneDrive / MSAL | Treating MSAL token cache as permanent | MSAL token cache can be invalidated by Microsoft; always handle `invalid_grant` and re-auth prompt |
| Nextcloud | Using Basic Auth with plain credentials | Use App Passwords (Nextcloud-specific token), never store the user's Nextcloud master password |
| MinIO | Using `minio.get_object()` for large files without streaming | Stream the response: `response.stream(32768)` or use boto3 streaming; loading full file to memory crashes on large PDFs |
| MinIO | Treating bucket names as security boundaries | Bucket names are not credentials; isolation is enforced by object key prefix + IAM policy. Set MinIO IAM so the app's service account can only access the designated bucket |
| PostgreSQL | Using `autocommit=True` for all connections | Quota updates and document record creation must be transactional; autocommit makes rollback impossible |
| PyOTP (TOTP) | Using default `valid_window=0` then switching to `valid_window=2` for "user convenience" | Window=1 (±30s) tolerates clock skew. Window=2+ increases brute-force surface disproportionately |
---
## Performance Traps
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| O(N) metadata scan on every document list (existing pattern) | List endpoint slows linearly with document count | Paginated SQL query with index on `(user_id, created_at)` | ~500 documents per user |
| Topic count scan on every topic fetch (existing pattern) | `GET /topics` latency spikes as documents grow | Materialized counter column updated on insert/delete, or cached with 30s TTL | ~1,000 documents |
| Synchronous MinIO upload inline in HTTP handler | Requests time out on large files; one upload blocks all other requests | Background task queue; return 202 Accepted with job ID | First user uploading a file >10 MB |
| Eager loading all cloud connections on user login | Login latency grows with number of connected providers | Lazy-load cloud connection status; only check on storage page | >3 cloud providers per user |
| Full-text search via `ILIKE '%term%'` on document content | Full table scan on every search | PostgreSQL `tsvector` full-text index on extracted text column | ~10,000 documents platform-wide |
| JWT validation on every request without caching public key | Repeated public key fetch (if using asymmetric JWT) | Cache the signing key in memory; never refetch on every request | High request volume |
---
## Security Mistakes
| Mistake | Risk | Prevention |
|---------|------|------------|
| CORS `allow_origins=["*"]` retained from existing codebase | Cross-origin requests from attacker-controlled pages | Set exact origin (frontend URL) in production; never wildcard with credentials |
| File type validation defined but not enforced (existing) | Executable or malicious file uploads | Enforce MIME type check at upload boundary; also validate magic bytes (not just Content-Type header) |
| No file size limit at HTTP boundary (existing) | Memory exhaustion, DoS via large upload | Add `max_upload_bytes` limit in FastAPI middleware before reading body |
| API keys in plaintext JSON (existing) | Credential leakage via Docker volume mount | Env vars for all secrets; never serialize keys to disk in application data directory |
| Password reset without invalidating existing sessions | Attacker keeps session after victim resets password | Invalidate all sessions for user on successful password reset |
| Audit log includes document metadata (filenames) | Filename can reveal document content | Audit log stores only document ID, event type, timestamp, IP — no filenames |
| Admin can read `cloud_connections.encrypted_token` column | Ciphertext exposure; retroactive decryption if key leaks | Exclude credential columns from all admin serializers by policy |
| No breach check on registration password | Users reuse passwords from breached services | Integrate HIBP k-anonymity API on registration |
---
## UX Pitfalls
| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| Silent presigned URL expiry (image fails to load, download does nothing) | User thinks the app is broken; no recovery path | On-demand URL generation; show explicit "link expired, regenerating..." state |
| TOTP enrollment without backup codes | User loses phone → permanently locked out of account | Issue 810 single-use backup codes at TOTP enrollment; require user to acknowledge |
| Quota error shown only as "Upload failed" | User doesn't know why; may try repeatedly | Return structured quota error: `{"error": "quota_exceeded", "used": 98MB, "limit": 100MB}` |
| Cloud storage reconnect required, but no in-app prompt | User's uploads silently fail until they notice the settings page | Banner notification or upload-time prompt: "Your Google Drive connection needs re-authorization" |
| Folder delete with documents inside silently deletes documents | Data loss without confirmation | Require explicit confirmation listing document count; offer "move to root" alternative |
| Share revoke with no notification to recipient | Recipient sees documents disappear without explanation | Audit-trail entry visible to recipient: "Access to [folder] was revoked by [owner]" |
---
## "Looks Done But Isn't" Checklist
- [ ] **TOTP enrollment:** Backup codes issued, stored hashed, and acknowledged by user before TOTP is marked active
- [ ] **Password reset:** Does NOT create a full authenticated session — routes through login with TOTP check
- [ ] **Document delete:** Atomically decrements quota AND removes MinIO object AND removes DB row in a single transaction
- [ ] **Account delete:** Deletes from MinIO, PostgreSQL, AND calls `adapter.delete_user_files()` for each cloud connection
- [ ] **Cloud disconnect:** Revokes OAuth token at provider (not just deletes local record) — provider refresh tokens remain valid until explicitly revoked
- [ ] **Quota display:** Shows real-time usage, not cached value from registration — must query SUM from DB
- [ ] **Sharing:** Shared documents do NOT copy to recipient's quota — verify with a `SELECT SUM(file_size) WHERE owner_id = recipient_id` before and after share
- [ ] **Migration:** Document count in PostgreSQL equals document count in old JSON store before cutover flag is flipped
- [ ] **Presigned URLs:** Never in list responses — verified by API contract test asserting no `url` field in list endpoint response body
- [ ] **Admin isolation:** Admin token cannot retrieve document content — verified by dedicated negative-access test
---
## Recovery Strategies
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| JWT in localStorage discovered post-launch | HIGH | Rotate all tokens (force logout all users), redeploy with httpOnly cookies, communicate to users |
| Quota race condition caused 10% of users to exceed quota | MEDIUM | Run `UPDATE quotas SET used_bytes = (SELECT SUM(file_size) FROM documents WHERE owner_id = user_id)` reconciliation; enforce atomic updates going forward |
| Flat-file migration data loss (documents missing post-cutover) | HIGH | Restore from pre-migration backup of JSON data directory; re-run migration with dual-write; reconcile counts |
| Encryption key leaked (single-key model) | CRITICAL | Immediately rotate key; re-encrypt all credentials with new key; notify affected users; assume all cloud credentials compromised — prompt all users to revoke and reconnect |
| Admin escalation (admin accessed user documents) | HIGH | Audit log review to determine scope; GDPR breach notification if EU users affected (72h window); patch authorization layer; force session invalidation |
| Cloud connection revoked but app retried indefinitely | LOW | Mark all stuck connections as `REQUIRES_REAUTH`; send user notification; add `invalid_grant` handler to adapter |
---
## Pitfall-to-Phase Mapping
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| JWT in localStorage | Phase 1 — Auth | Confirm no `localStorage.setItem('token', ...)` in frontend; auth cookie is httpOnly in browser DevTools |
| TOTP bypass via password reset | Phase 1 — Auth | Integration test: reset password → assert no full session created → assert login still requires TOTP |
| Refresh token reuse detection | Phase 1 — Auth | Test: use a rotated (old) refresh token → assert 401 and full family revocation |
| TOTP timing attack / brute force | Phase 1 — Auth | Rate limit test: 6th TOTP attempt within window → assert 429 |
| Path traversal in MinIO keys | Phase 2 — DB + MinIO migration | Unit test: filename with `../` → assert stored key is UUID-based, not user-provided |
| Quota race condition | Phase 2 — DB + MinIO migration | Concurrent upload test: 2 threads uploading to full quota → assert only one succeeds |
| Admin privilege escalation | Phase 1 + every phase adding resource endpoints | Negative access test: admin JWT → `GET /documents/{other_user_doc_id}` → assert 403 |
| Cloud credential leakage via admin query | Phase 3 — Cloud storage | Test: admin list user cloud connections → assert response contains no `token` or `encrypted_` fields |
| Flat-file to PostgreSQL data loss | Phase 2 — DB + MinIO migration | Count reconciliation: `assert json_doc_count == db_doc_count` in migration script |
| Blocking I/O in async handlers | Phase 2 — DB + MinIO migration | Load test with 10 concurrent requests; assert no event loop warnings; check SQLAlchemy driver is asyncpg |
| N+1 queries on document list | Phase 2 — DB + MinIO migration | Enable SQLAlchemy echo; assert document list query count = 1 regardless of result size |
| Presigned URL expiry | Phase 2 — DB + MinIO migration | API contract test: list endpoint response contains no presigned URL fields |
| OAuth revocation not handled | Phase 3 — Cloud storage | Mock `invalid_grant` from OAuth provider → assert connection status set to `REQUIRES_REAUTH`, not retried |
| Rate limits without backoff | Phase 3 — Cloud storage | Mock 429 from provider → assert exponential backoff, not immediate 500 to user |
| GDPR right to erasure incomplete | Phase 3 — Cloud storage + account mgmt | Account deletion test: assert cloud adapter `delete_user_files` called for each active connection |
| Single encryption key | Phase 3 — Cloud storage | Code review gate: assert per-user HKDF derivation; no single-key decryption in any code path |
| Quota drift from sharing/revoke | Sharing phase | Reconciliation query: `SUM(file_size WHERE owner_id = user)` vs `used_bytes` — assert < 1% drift |
---
## Sources
- Project context: `/Users/nik/Documents/Progamming/document_scanner/.planning/PROJECT.md`
- Codebase audit: `/Users/nik/Documents/Progamming/document_scanner/.planning/codebase/CONCERNS.md`
- Auth pitfalls: OWASP JWT Security Cheat Sheet, OWASP Authentication Cheat Sheet (well-established, HIGH confidence)
- OAuth token lifecycle: RFC 6749 Section 10.4 (refresh token revocation), Google OAuth error codes documentation (HIGH confidence from training data)
- MinIO/S3 path traversal: AWS S3 object key documentation; known class of vulnerability in multi-tenant S3-prefix isolation (HIGH confidence)
- Quota race conditions: Classic check-then-act concurrency pattern; PostgreSQL SELECT FOR UPDATE documentation (HIGH confidence)
- GDPR Article 17 (right to erasure) and Article 30 (records of processing): well-established regulatory requirements (HIGH confidence)
- N+1 query pattern: SQLAlchemy documentation on relationship loading strategies (HIGH confidence)
- TOTP RFC 6238; PyOTP library behavior from training data (MEDIUM — verify PyOTP `valid_window` defaults against current docs before implementation)
---
*Pitfalls research for: DocuVault — multi-user SaaS document management (FastAPI + Vue 3 + PostgreSQL + MinIO)*
*Researched: 2026-05-21*