# Phase 3: Document Migration & Multi-User Isolation - Context

**Gathered:** 2026-05-23
**Status:** Ready for planning

## Phase Boundary

Enforce per-user ownership on all documents: make `documents.user_id` NOT NULL (Phase 1 D-03 deferred to here), add `get_current_user` guards to all `/api/documents/*` endpoints (Phase 2 D-07 deferred to here), implement presigned PUT URL upload flow, enforce atomic quota on upload and delete, wire per-user AI classification config from DB, and retire the flat-file settings system. Existing document UI continues to work — updated to use the new two-step upload flow.

This phase does NOT include folder navigation, sharing, or PDF preview (Phase 4). It does NOT include cloud storage backends (Phase 5). The quota bar frontend component is included (STORE-04 is scoped here per REQUIREMENTS.md traceability). STORE-08 (Celery+Redis) was completed in Phase 1 — no work needed.

## Implementation Decisions

### Null-User Record Cleanup

- **D-01:** All documents with `user_id=NULL` are deleted (both DB rows and their MinIO objects) before the NOT NULL constraint is added. These are dev/test data only — consistent with Phase 1 D-04 which deleted flat-file test data with the same reasoning. Zero production data loss.
- **D-02:** Cleanup is baked into the Alembic migration's `upgrade()` function — the migration first deletes all null-user Document rows (and calls the storage backend to delete corresponding MinIO objects), then adds the `NOT NULL` constraint to `documents.user_id`. One command, atomic flow.
- **D-03:** After null-user cleanup, reconcile quota `used_bytes` from actual document data: `UPDATE quotas SET used_bytes = (SELECT COALESCE(SUM(size_bytes), 0) FROM documents WHERE documents.user_id = quotas.user_id)`. Phase 3 starts with accurate quota state for all users.

### Presigned Upload Flow

- **D-04:** Phase 3 implements direct-to-MinIO presigned PUT uploads per CLAUDE.md architectural rule ("bytes never pass through the API layer"). The existing multipart POST-to-FastAPI upload endpoint is replaced.
- **D-05:** Two-step upload flow:
  - Step 1 — `POST /api/documents/upload-url`: FastAPI creates a `Document` row (`status='pending'`), generates a presigned PUT URL (15-min TTL), returns `{upload_url, document_id}`. Quota is NOT reserved at this step.
  - Step 2 — Frontend PUTs bytes directly to MinIO using the presigned URL.
  - Step 3 — `POST /api/documents/{id}/confirm`: FastAPI retrieves file size from MinIO stat (authoritative), runs atomic quota UPDATE, updates Document row (`status='uploaded'`), and enqueues `extract_and_classify.delay(document_id)`.
- **D-06:** Abandoned uploads (presigned URL fetched but `/confirm` never called): Celery beat periodic task deletes `Document` rows older than 1 hour with `status='pending'` and their MinIO objects. Quota is never reserved for pending rows — no cleanup of quota needed.
- **D-07:** Quota is enforced atomically at the `/confirm` step using the file size retrieved from MinIO stat (not client-supplied). The atomic SQL pattern (from CLAUDE.md) applies: `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes`. A 413 response is returned if the UPDATE returns no rows (quota exceeded). Document delete atomically decrements: `UPDATE quotas SET used_bytes = GREATEST(0, used_bytes - $delta)`.

### Topics Isolation Model

- **D-08:** Layered topic namespace: system topics (`user_id=NULL`) are visible to all users as defaults; per-user topics (`user_id=current_user.id`) are visible only to that user. A user's topic list is the union of system topics + their own topics.
- **D-09:** Only admin can create, edit, and delete system topics via a new `POST /api/admin/topics` endpoint. Regular users can only CRUD their own per-user topics via `/api/topics/*` (now auth-gated with `get_current_user`).
- **D-10:** All existing topics in the DB (currently `user_id=NULL` from Phase 1/2 test sessions) are deleted in Phase 3 migration — consistent with null-user document cleanup. Admin seeds system topics fresh post-Phase 3.
- **D-11:** AI classification receives system topics + user's own topics as the existing-topics input. New AI-suggested topics are created in the user's namespace (`user_id=current_user.id`), not as system topics.

### Settings Flat-File Retirement

- **D-12:** `/api/settings` endpoint is removed entirely in Phase 3. `services/storage.py` `load_settings()` / `save_settings()` flat-file functions are deleted. `settings.json` is deleted. All AI config comes from DB (`users.ai_provider` / `users.ai_model` set by admin).
- **D-13:** System prompt moves to a `SYSTEM_PROMPT` env var in `config.py` (optional). If not set, `services/classifier.py` uses a hardcoded default prompt string. No DB table needed.
- **D-14:** Celery `extract_and_classify` task resolves AI config via `doc.user_id → users.ai_provider + users.ai_model` (a second DB lookup within the same task session). No `user_id` parameter added to the task signature.
- **D-15:** If `user.ai_provider` is `None` (user has no admin-assigned AI config), classifier falls back to `DEFAULT_AI_PROVIDER` + `DEFAULT_AI_MODEL` env vars (both optional in `config.py`; code default: `"ollama"` / `"llama3.2"`).

### Auth Guards

- **D-16:** All `/api/documents/*` endpoints gain `get_current_user` dependency (Phase 2 D-07 fulfilled). Every handler asserts `document.user_id == current_user.id` before returning — 404 (not 403) for cross-user access to avoid information leakage. Admin role returns 403 on all document endpoints per Phase 3 SC4 (completing Phase 2 SC5 via D-07).
- **D-17:** `/api/topics/*` gains `get_current_user`. Topic queries filter by `user_id = current_user.id OR user_id IS NULL` (a plain `IN (current_user.id, NULL)` would never match the NULL rows in SQL) — user sees their own topics + system topics.
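
The atomic quota pattern from D-07 can be sketched as below. This is a minimal illustration, not the project's implementation: it uses an in-memory SQLite database as a stand-in for PostgreSQL, so it checks `rowcount` instead of using `RETURNING` and uses SQLite's two-argument `MAX` where PostgreSQL would use `GREATEST`. The `user_id` predicate and the helper names (`reserve_quota`, `release_quota`, `used_bytes`, `make_demo_db`) are assumptions added for the example.

```python
import sqlite3


def reserve_quota(conn: sqlite3.Connection, user_id: int, delta: int) -> bool:
    """Add `delta` bytes to the user's usage only if it still fits the limit.

    The limit check and the increment live in a single UPDATE statement, so
    there is no separate read-then-write step. Returns False when no row
    matched; a confirm handler would map that to HTTP 413.
    """
    cur = conn.execute(
        "UPDATE quotas SET used_bytes = used_bytes + :delta "
        "WHERE user_id = :uid AND (used_bytes + :delta) <= limit_bytes",
        {"delta": delta, "uid": user_id},
    )
    conn.commit()
    return cur.rowcount == 1


def release_quota(conn: sqlite3.Connection, user_id: int, delta: int) -> None:
    """Subtract `delta` bytes on delete, clamping at zero.

    SQLite's scalar MAX(a, b) stands in for PostgreSQL's GREATEST(a, b).
    """
    conn.execute(
        "UPDATE quotas SET used_bytes = MAX(0, used_bytes - :delta) "
        "WHERE user_id = :uid",
        {"delta": delta, "uid": user_id},
    )
    conn.commit()


def used_bytes(conn: sqlite3.Connection, user_id: int) -> int:
    """Read the current usage for one user (demo helper)."""
    row = conn.execute(
        "SELECT used_bytes FROM quotas WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0]


def make_demo_db(limit_bytes: int) -> sqlite3.Connection:
    """Create an in-memory quotas table with one user (id=1) and no usage."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quotas ("
        "user_id INTEGER PRIMARY KEY, "
        "used_bytes INTEGER NOT NULL, "
        "limit_bytes INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO quotas VALUES (1, 0, ?)", (limit_bytes,))
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_demo_db(limit_bytes=100)
    print(reserve_quota(db, 1, 60))  # first upload fits under the limit
    print(reserve_quota(db, 1, 60))  # second would exceed it and is rejected
    release_quota(db, 1, 80)         # over-large decrement clamps at zero
    print(used_bytes(db, 1))
```

Because a rejected upload leaves `used_bytes` untouched, the confirm handler would not need to roll anything back before returning the 413 body.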

## Canonical References

**Downstream agents MUST read these before planning or implementing.**

### Requirements

- `.planning/REQUIREMENTS.md` — STORE-03 (atomic quota enforce), STORE-04 (quota bar UI), STORE-05 (upload rejection error), STORE-06 (atomic quota decrement on delete), STORE-08 (Celery+Redis — done in Phase 1), SEC-04 (DB-lookup file access), DOC-03 (per-user AI provider), DOC-04 (system topics + per-user overrides), DOC-05 (classification uses user's assigned provider)

### Roadmap & Success Criteria

- `.planning/ROADMAP.md` — Phase 3 goal and all 5 success criteria (especially SC2: concurrent quota race, SC4: 403 on cross-user access + admin 403, SC5: per-user AI classification)

### Architecture Constraints

- `CLAUDE.md` — Key Architectural Rules: presigned MinIO URL flow (bytes never through API), MinIO key schema, atomic quota UPDATE pattern, SEC-04 enforcement, admin endpoints never return document content

### Prior Phase Decisions

- `.planning/phases/01-infrastructure-foundation/01-CONTEXT.md` — D-03 (documents.user_id nullable in Phase 1), D-05 (storage service replaced), D-06 (MinIO key schema), D-08/D-09 (Celery+Redis wired)
- `.planning/phases/02-users-authentication/02-CONTEXT.md` — D-07 (documents endpoints stay public in Phase 2, gain guards in Phase 3), D-08/D-09 (admin endpoints, CORS)

### Project Decisions

- `.planning/PROJECT.md` — Core Value: per-user isolation; Key Decisions: PostgreSQL+MinIO rationale, atomic quota UPDATE, privacy-first admin model

## Existing Code Insights

### Reusable Assets

- `backend/deps/auth.py` — `get_current_user` and `get_current_admin` FastAPI dependencies ready to inject into document/topic endpoints
- `backend/db/models.py` — `Document`, `Quota`, `Topic`, `DocumentTopic` ORM models complete; `documents.user_id` is nullable (change to NOT NULL in Phase 3 migration); `quotas.used_bytes` and `limit_bytes` are in place
- `backend/storage/minio_backend.py` — `MinIOBackend.put_object()` and `delete_object()` — extend with `generate_presigned_put_url()` for Phase 3 upload flow; add `stat_object()` to retrieve file size after upload
- `backend/storage/base.py` — `StorageBackend` ABC — add `generate_presigned_put_url(...)` abstract method
- `backend/tasks/document_tasks.py` — `extract_and_classify` task; update `_run()` to look up `doc.user_id → user.ai_provider/ai_model` and pass user config to classifier
- `backend/services/classifier.py` — update to accept `ai_provider` and `ai_model` parameters instead of reading from `load_settings()`
- `backend/celery_app.py` — Celery beat schedule: add periodic task for abandoned upload cleanup

### Established Patterns

- **Atomic quota UPDATE** — `UPDATE quotas SET used_bytes = used_bytes + $delta WHERE (used_bytes + $delta) <= limit_bytes RETURNING used_bytes` — use `session.execute(text(...))` with bound params; check `result.rowcount` to detect quota exceeded
- **Service layer boundary** — `services/classifier.py` is pure Python, no FastAPI coupling; call with explicit parameters rather than reading global config
- **`get_current_user` injection** — Phase 2 pattern: `current_user: User = Depends(get_current_user)` in each handler; `current_user: User = Depends(get_current_admin)` for admin-only routes
- **`asyncio.to_thread()`** for MinIO sync SDK calls (established in Phase 1 `storage/minio_backend.py`)

### Integration Points

- `backend/api/documents.py` — replace existing upload handler with upload-url + confirm endpoints; add `get_current_user` to all handlers; add `document.user_id == current_user.id` ownership assertion
- `backend/api/topics.py` — add `get_current_user`; filter all topic queries by `user_id = current_user.id OR user_id IS NULL`
- `backend/services/storage.py` — remove `load_settings()` / `save_settings()`; update `save_upload()` to accept `user_id` parameter; update `delete_document()` to decrement quota
- `backend/config.py` — add `SYSTEM_PROMPT`, `DEFAULT_AI_PROVIDER`, `DEFAULT_AI_MODEL` optional env vars
- `frontend/src/stores/documents.js` (or equivalent) — update upload flow from single multipart POST to two-step: get upload URL, PUT to MinIO, call confirm
- `frontend/src/components/layout/AppSidebar.vue` — add quota bar (current/limit in MB, amber at 80%, red at 95%) — STORE-04

### Constraints from Prior Phases

- MinIO key schema `{user_id}/{document_id}/{uuid4()}{ext}` is locked (Phase 1 D-06) — enforced in `MinIOBackend.put_object()`
- `documents.user_id` is currently nullable — Phase 3 Alembic migration makes it NOT NULL after cleanup
- Celery+Redis already wired and operational — no infrastructure changes needed
- `BackupCode` model and `backup_codes` table exist from Phase 2 — no changes needed

## Specific Ideas

- Phase 3 Alembic migration is `0003_multi_user_isolation.py` — cleanup + NOT NULL + topic cleanup + quota reconciliation in one migration
- Presigned PUT URL TTL: 15 minutes (matches typical upload timeout for large documents)
- Abandoned upload cleanup: Celery beat task running every 30 minutes, deletes `pending` Document rows older than 1 hour
- `stat_object()` for MinIO: use MinIO SDK `stat_object(bucket, key)` → `.size` attribute to get authoritative file size at confirm time
- Quota exceeded response: HTTP 413 with body `{"detail": {"used_bytes": N, "limit_bytes": M, "rejected_bytes": K}}`
- Per-user topic query: `WHERE (topics.user_id = :uid OR topics.user_id IS NULL)` with an index on `topics.user_id`
- Frontend quota bar: fetch from new `GET /api/me/quota` endpoint returning `{used_bytes, limit_bytes}` — add this endpoint to the auth API

## Deferred Ideas

- Presigned GET URLs for document downloads — Phase 4 (DOC-02: PDF preview proxied through app). Phase 3 does not expose presigned GET URLs to the browser.
- Per-user system prompt overrides — out of scope for v1; system prompt is global via env var
- Quota reservation at upload-url initiation with client-supplied size — decided against in favor of confirm-time enforcement
- MinIO event notification webhook approach — deferred; two-step confirm is sufficient for Phase 3

---

*Phase: 3-Document Migration & Multi-User Isolation*
*Context gathered: 2026-05-23*