747303246a
Folders, Sharing, Quotas & Document UX — plans verified (0 blockers, 2 non-blocking warnings). Covers FOLD-01..05, SHARE-01..05, SEC-08/09, ADMIN-06, DOC-01/02. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1089 lines
59 KiB
Markdown
1089 lines
59 KiB
Markdown
# Phase 4: Folders, Sharing, Quotas & Document UX — Research
|
||
|
||
**Researched:** 2026-05-25
|
||
**Domain:** FastAPI folder/share CRUD, PostgreSQL tsvector, MinIO streaming proxy, Celery beat, Vue 3 folder navigation
|
||
**Confidence:** HIGH
|
||
|
||
---
|
||
|
||
<user_constraints>
|
||
## User Constraints (from CONTEXT.md)
|
||
|
||
### Locked Decisions
|
||
|
||
**D-01** Hybrid layout — AppSidebar shows top-level folders only. Sub-folders + breadcrumb in main content area. Top-level folders clickable directly in sidebar.
|
||
|
||
**D-02** Unlimited nesting depth (no API or UI cap). `Folder.parent_id` self-referential FK is authoritative. Breadcrumbs truncate at depth > 4 (show first + "..." + last 2).
|
||
|
||
**D-03** Non-empty folder delete: warning modal with document count. Confirm → cascade-delete all documents (MinIO + DB + quota). Cancel → no action. Documents are NOT moved to root — they are destroyed.
|
||
|
||
**D-04** Exact handle input — no autocomplete. API returns 404 if handle not found; UI shows "User not found" error.
|
||
|
||
**D-05** Share button on DocumentCard (inline icon button). Modal: (a) handle input, (b) Share button, (c) current recipients list with Revoke per row.
|
||
|
||
**D-06** "Shared with me" is a fixed virtual folder entry in AppSidebar, rendered above the user's own folder list. Filtered by `shares.recipient_id = current_user.id`. Zero quota charged to recipient.
|
||
|
||
**D-07** Share permission is `view` only for Phase 4. `edit` deferred.
|
||
|
||
**D-08** Streaming proxy endpoint: `GET /api/documents/{id}/content`. Returns bytes via FastAPI `StreamingResponse`. Supports `Range` headers. `Content-Disposition: inline`. No presigned URL ever generated or exposed. Uses `get_regular_user` dep.
|
||
|
||
**D-09** Native browser PDF rendering — no PDF.js. Zero frontend dependencies added. Content-Type header drives browser rendering.
|
||
|
||
**D-10** `users.pdf_open_mode` column (String, default `'in_app'`). Exposed via `PATCH /api/me/preferences`. `in_app` = modal with `<iframe>`; `new_tab` = `window.open`.
|
||
|
||
**D-11** PostgreSQL tsvector GIN index on `documents.extracted_text`. `plainto_tsquery('english', :q)`. Search scope: user's own docs only. `GET /api/documents?q=<query>`.
|
||
|
||
**D-12** Inline search bar in document list view. Debounced, minimum 2 chars. Folder-context aware.
|
||
|
||
**D-13** All 4 audit event categories logged: auth events, document operations, folder/share operations, admin actions.
|
||
|
||
**D-14** Audit log written inline in handlers (no middleware, no Celery task). Shared helper `write_audit_log(session, event_type, user_id, actor_id, resource_id, ip_address, metadata_dict)`.
|
||
|
||
**D-15** Admin audit log viewer (`/admin` → AuditLog tab): paginated table, filters (date range, user, action type). No document content, filenames, or extracted text in any log entry.
|
||
|
||
**D-16** Manual CSV and JSON export button in admin audit log view. Exports current filtered result set.
|
||
|
||
**D-17** Daily automated CSV backup via Celery beat. Exports previous day's audit rows to MinIO `audit-logs` bucket (`audit-logs/YYYY-MM-DD.csv`). Bucket is private. Admins can download from admin panel. Rows NOT deleted from DB.
|
||
|
||
**D-18** `credentials_enc` excluded from all serializers (SEC-08). Explicit Pydantic response model enforcement on all user-facing endpoints that touch `CloudConnection`.
|
||
|
||
**D-19** Account deletion (admin-triggered) runs `delete_user_files()` before removing DB records (SEC-09). Cloud connection cleanup deferred to Phase 5.
|
||
|
||
### Claude's Discretion
|
||
|
||
None specified — all significant decisions are locked.
|
||
|
||
### Deferred Ideas (OUT OF SCOPE)
|
||
|
||
- Per-day document scan quota tiers (future billing milestone)
|
||
- Share permission levels beyond `view` (`edit`, `comment` deferred)
|
||
- Presigned GET URLs for non-PDF document download
|
||
- Audit log row archival/deletion policy (DB rows are not auto-deleted)
|
||
</user_constraints>
|
||
|
||
---
|
||
|
||
<phase_requirements>
|
||
## Phase Requirements
|
||
|
||
| ID | Description | Research Support |
|
||
|----|-------------|------------------|
|
||
| FOLD-01 | User can create, rename, and delete folders (delete confirms content count before proceeding) | `Folder` model complete in `db/models.py`; cascade-delete pattern via SQLAlchemy + atomic quota decrement established in Phase 3 |
|
||
| FOLD-02 | User can move documents between folders | `Document.folder_id` FK exists and is nullable; PATCH endpoint updates `folder_id` with ownership assertion on both doc and target folder |
|
||
| FOLD-03 | Breadcrumb navigation renders current folder path; each segment clickable | Recursive path query via `WITH RECURSIVE` CTE or iterative parent-chain walk; Vue computed property for breadcrumb array |
|
||
| FOLD-04 | Document list supports sort by name, date uploaded, and file size | SQLAlchemy `order_by()` with enum sort param; frontend `sort` query param |
|
||
| FOLD-05 | Full-text search via PostgreSQL tsvector index on extracted text | Expression-index approach `to_tsvector('english', extracted_text)` + GIN; `plainto_tsquery` in `WHERE` clause; Alembic migration adds index manually (not as Computed column to avoid autogenerate noise) |
|
||
| SHARE-01 | User can share a document with another user by their unique handle | `Share` model complete; `User.handle` unique lookup; exact match, no autocomplete |
|
||
| SHARE-02 | Shared documents appear in "Shared with me" virtual folder; zero quota charged to recipient | Filter `shares.recipient_id = current_user.id`; no quota row touched; virtual folder is a UI/API filter, not a DB folder |
|
||
| SHARE-03 | Shared access view-only by default; owner controls permission level | `Share.permission` defaults to `'view'`; Phase 4 only implements `view`; permission field returned in list responses |
|
||
| SHARE-04 | Owner can revoke share; revocation is immediate | `DELETE /api/shares/{share_id}` with owner assertion; no async cleanup needed |
|
||
| SHARE-05 | Documents shared with others display a "shared" indicator | Document list response includes `is_shared: bool` derived from EXISTS subquery on `shares.owner_id` |
|
||
| SEC-08 | `credentials_enc` excluded from all API serializers | Explicit `CloudConnectionOut` Pydantic response model excluding `credentials_enc` |
|
||
| SEC-09 | Account deletion triggers `delete_user_files()` per cloud connection before DB removal | Implemented in admin delete-user endpoint; iterates documents, deletes MinIO objects, decrements quota |
|
||
| ADMIN-06 | Admin can view audit log filtered by date range, user, and action type (metadata only) | `AuditLog` model complete; paginated query with filters; admin endpoint uses `get_current_admin` |
|
||
| DOC-01 | User can view document metadata and extracted text | Existing `GET /api/documents/{id}` already returns extracted_text; Phase 4 ensures it is included in response |
|
||
| DOC-02 | In-browser PDF preview via PDF.js proxy; no presigned URLs exposed | `GET /api/documents/{id}/content` — StreamingResponse with Range header support, bytes from MinIO via `asyncio.to_thread` |
|
||
</phase_requirements>
|
||
|
||
---
|
||
|
||
## Summary
|
||
|
||
Phase 4 builds entirely on the existing ORM schema (all required models — `Folder`, `Share`, `AuditLog`, `Document.folder_id`, `Share.recipient_id` — are already in `db/models.py`). The work is predominantly plumbing: creating new API router modules for folders, shares, and audit log; adding a single Alembic migration for the `users.pdf_open_mode` column and the tsvector GIN expression index; extending the Celery beat schedule with a daily audit export task; and extending the Vue 3 frontend with folder navigation, sharing modals, and the settings preference toggle.
|
||
|
||
The highest-risk areas are: (1) the PDF streaming proxy with Range header support (needs careful byte-range parsing to handle PDF viewer seeking), (2) the tsvector GIN index and corresponding Alembic migration (Alembic autogenerate has known quirks with `to_tsvector()` expression indexes — the migration must be written manually), and (3) the folder cascade-delete with atomic quota decrement for multiple documents (each document's size_bytes must be summed and decremented in a single atomic UPDATE or via a pre-computed sum).
|
||
|
||
No new Python or npm packages are required for Phase 4. The decision to use native browser PDF rendering (D-09) eliminates the PDF.js dependency. All needed libraries — FastAPI, SQLAlchemy, MinIO SDK, Celery, Pydantic — are already pinned in `requirements.txt`. The frontend needs only Tailwind utility classes; no new npm packages.
|
||
|
||
**Primary recommendation:** Build in 6 plans: Wave 0 test scaffolds + migration; folders backend; shares backend + audit service; streaming proxy + search; audit log admin API + Celery export task; frontend folder navigation + sharing + settings.
|
||
|
||
---
|
||
|
||
## Architectural Responsibility Map
|
||
|
||
| Capability | Primary Tier | Secondary Tier | Rationale |
|
||
|------------|-------------|----------------|-----------|
|
||
| Folder CRUD + breadcrumb path | API / Backend | Database / Storage | Ownership assertion and cascade-delete logic belongs in the API layer; path reconstruction is a DB query |
|
||
| Document move between folders | API / Backend | Database / Storage | Ownership assertions on both document and target folder must be co-located in the API handler |
|
||
| Full-text search (tsvector) | Database / Storage | API / Backend | GIN index and `plainto_tsquery` execute in PostgreSQL; API adds scope filter (`user_id`) |
|
||
| Share grant / revoke | API / Backend | Database / Storage | IDOR prevention requires ownership assertion before DB write; share table is the single source of truth |
|
||
| "Shared with me" virtual folder | API / Backend | — | Pure query filter on `shares.recipient_id`; no DB folder row; quota enforcement is in the API layer |
|
||
| PDF streaming proxy | API / Backend | Database / Storage | Bytes flow from MinIO through FastAPI to the browser; presigned URLs never reach the browser |
|
||
| Per-user PDF open preference | API / Backend | Database / Storage | User preference column (`users.pdf_open_mode`) read/written via `/api/me/preferences` |
|
||
| Audit log write | API / Backend | — | `write_audit_log()` called inline in each handler after successful operation; no middleware |
|
||
| Audit log viewer + export | API / Backend | Database / Storage | Admin-only paginated query with filters; CSV export is a StreamingResponse from a filtered DB query |
|
||
| Daily audit backup (Celery beat) | API / Backend (worker) | Database / Storage | Celery beat task runs in the worker; queries DB and uploads CSV to MinIO `audit-logs` bucket |
|
||
| Folder navigation UI state | Frontend Server (SSR) → Browser / Client | — | Folder ID and breadcrumb path held in Vue `ref()` in `HomeView.vue`; no server-side state |
|
||
| Share modal + indicator | Browser / Client | — | Share button on DocumentCard; modal state is local component `ref()`; share list in documents store |
|
||
| PDF open preference toggle | Browser / Client | API / Backend | Toggle in SettingsView.vue reads/writes preference via `PATCH /api/me/preferences` |
|
||
| `credentials_enc` exclusion (SEC-08) | API / Backend | — | Pydantic response model `CloudConnectionOut` excludes the field; never reaches serializer |
|
||
|
||
---
|
||
|
||
## Standard Stack
|
||
|
||
### Core (all already in requirements.txt / package.json)
|
||
|
||
| Library | Version | Purpose | Why Standard |
|
||
|---------|---------|---------|--------------|
|
||
| FastAPI | 0.136+ | API router for folders, shares, audit, proxy, preferences | Already in use; `APIRouter` + `StreamingResponse` needed for proxy |
|
||
| SQLAlchemy 2.0 async | 2.0+ | ORM queries for all new endpoints | Established pattern in project; async session already wired |
|
||
| Alembic | 1.18+ | Migration 0004 — `users.pdf_open_mode` column + tsvector GIN index | Established migration toolchain |
|
||
| MinIO Python SDK | 7.2+ | `get_object()` for streaming proxy; `put_object()` for audit CSV upload | Already in use; `asyncio.to_thread()` pattern established |
|
||
| Celery + Redis | 5.5+ | Daily audit export beat task | Already in use; `beat_schedule` in `celery_app.py` |
|
||
| Pydantic v2 | 2.0+ | `CloudConnectionOut` response model excluding `credentials_enc` (SEC-08) | Already in use for all request/response models |
|
||
| Vue 3 (Options API) | 3.4+ | Folder nav, share modal, settings toggle | Already in use; `ref()` + `watch()` patterns established |
|
||
|
||
[VERIFIED: codebase — requirements.txt and package.json]
|
||
|
||
### No New Packages Required
|
||
|
||
Phase 4 introduces zero new dependencies. Key rationale:
|
||
- PDF viewer: D-09 chose native browser rendering — no PDF.js
|
||
- Debounce: hand-rolled `setTimeout`/`clearTimeout` in Vue `watch` callback (< 10 lines, no lodash needed)
|
||
- CSV export: Python stdlib `csv.DictWriter` with `io.StringIO` — no pandas
|
||
- Search: PostgreSQL built-in `plainto_tsquery` — no Elasticsearch or external search engine
|
||
|
||
[VERIFIED: codebase — no new deps required]
|
||
|
||
---
|
||
|
||
## Package Legitimacy Audit
|
||
|
||
No new packages are installed in Phase 4. All runtime dependencies are carried forward from Phases 1–3 with pinned versions in `requirements.txt`.
|
||
|
||
**Packages removed due to slopcheck [SLOP] verdict:** none
|
||
**Packages flagged as suspicious [SUS]:** none
|
||
**New packages to install:** none
|
||
|
||
---
|
||
|
||
## Architecture Patterns
|
||
|
||
### System Architecture Diagram
|
||
|
||
```
|
||
Browser
|
||
│
|
||
├── GET /api/documents?q=search&folder_id=X&sort=name
|
||
│ FastAPI → SQLAlchemy → PostgreSQL (tsvector @@ plainto_tsquery, scoped to user)
|
||
│ ← JSON: [{id, filename, size_bytes, is_shared, folder_id, ...}]
|
||
│
|
||
├── POST /api/folders / PATCH /api/folders/{id} / DELETE /api/folders/{id}
|
||
│ FastAPI → ownership assertion → SQLAlchemy → PostgreSQL
|
||
│ DELETE: sum(doc.size_bytes) → atomic quota decrement → MinIO delete_object per doc
|
||
│
|
||
├── POST /api/shares / DELETE /api/shares/{id}
|
||
│ FastAPI → ownership assertion (owner_id = current_user.id) → SQLAlchemy → PostgreSQL
|
||
│ GET /api/shares/received → filter shares.recipient_id = current_user.id
|
||
│
|
||
├── GET /api/documents/{id}/content [Range: bytes=X-Y]
|
||
│ FastAPI (get_regular_user) → ownership OR share-access check → SQLAlchemy (object_key)
|
||
│ → MinIO get_object() via asyncio.to_thread() → StreamingResponse (206 or 200)
|
||
│ ← bytes with Content-Type, Content-Disposition: inline, Accept-Ranges: bytes
|
||
│ NOTE: no presigned URL generated; bytes flow through FastAPI
|
||
│
|
||
├── PATCH /api/me/preferences
|
||
│ FastAPI → SQLAlchemy UPDATE users.pdf_open_mode → ← {pdf_open_mode}
|
||
│
|
||
└── GET /api/admin/audit-log [?start=&end=&user_id=&event_type=&page=]
|
||
FastAPI (get_current_admin) → SQLAlchemy → PostgreSQL
|
||
← JSON: {items: [{id, event_type, user_id, actor_id, ip_address, metadata_, created_at}], total}
|
||
|
||
GET /api/admin/audit-log/export?format=csv|json
|
||
FastAPI (get_current_admin) → query → StreamingResponse (text/csv or application/json)
|
||
|
||
Celery Beat Worker (daily at midnight UTC)
|
||
│
|
||
└── audit_log_daily_export task
|
||
→ AsyncSessionLocal → SELECT audit_log WHERE date = yesterday
|
||
→ csv.DictWriter → io.BytesIO
|
||
→ MinIO put_object("audit-logs", "YYYY-MM-DD.csv", bytes)
|
||
```
|
||
|
||
### Recommended Project Structure (new files only)
|
||
|
||
```
|
||
backend/
|
||
├── api/
|
||
│ ├── folders.py # FOLD-01, FOLD-02, FOLD-03, FOLD-04, FOLD-05
|
||
│ ├── shares.py # SHARE-01..05
|
||
│ └── audit.py # ADMIN-06 (admin viewer + export)
|
||
├── services/
|
||
│ └── audit.py # write_audit_log() helper
|
||
├── tasks/
|
||
│ └── audit_tasks.py # audit_log_daily_export Celery beat task
|
||
└── migrations/versions/
|
||
└── 0004_phase4_pdf_open_mode_tsvector.py
|
||
|
||
frontend/src/
|
||
├── components/
|
||
│ ├── documents/
|
||
│ │ └── ShareModal.vue # D-05
|
||
│ ├── layout/
|
||
│ │ └── BreadcrumbNav.vue # FOLD-03
|
||
│ └── admin/
|
||
│ └── AdminAuditLogTab.vue # ADMIN-06 (D-15, D-16)
|
||
├── stores/
|
||
│ └── folders.js # folder state + actions
|
||
└── views/
|
||
└── SettingsView.vue # extend with PDF preference toggle (D-10)
|
||
```
|
||
|
||
### Pattern 1: Ownership Assertion (established in Phase 3 — D-16)
|
||
|
||
**What:** Every resource endpoint asserts `resource.user_id == current_user.id`. Cross-user access returns 404 (not 403) to prevent attacker enumeration of resource IDs.
|
||
|
||
**When to use:** All folder CRUD, document move, share grant/revoke (owner side), streaming proxy.
|
||
|
||
```python
|
||
# Source: backend/api/documents.py (established pattern)
|
||
doc = await session.get(Document, uid)
|
||
if doc is None or doc.user_id != current_user.id:
|
||
raise HTTPException(status_code=404, detail="Document not found")
|
||
```
|
||
|
||
**Folder variant:**
|
||
```python
|
||
folder = await session.get(Folder, folder_id)
|
||
if folder is None or folder.user_id != current_user.id:
|
||
raise HTTPException(status_code=404, detail="Folder not found")
|
||
```
|
||
|
||
[VERIFIED: codebase — backend/api/documents.py]
|
||
|
||
### Pattern 2: Atomic Quota Decrement for Folder Cascade-Delete
|
||
|
||
**What:** When deleting a folder, collect all document `size_bytes` first, then issue a single atomic quota UPDATE. Never read-then-write in Python.
|
||
|
||
**When to use:** `DELETE /api/folders/{id}` when folder is non-empty.
|
||
|
||
```python
|
||
# Source: established atomic pattern (CLAUDE.md + Phase 3 D-07)
|
||
# Step 1: collect document IDs and total size within the folder subtree
|
||
result = await session.execute(
|
||
select(Document.id, Document.size_bytes, Document.object_key)
|
||
.where(Document.folder_id == folder_id, Document.user_id == current_user.id)
|
||
)
|
||
docs = result.all()
|
||
total_bytes = sum(row.size_bytes for row in docs)
|
||
|
||
# Step 2: atomic quota decrement (CASE WHEN pattern — SQLite compatible)
|
||
await session.execute(
|
||
text(
|
||
"UPDATE quotas SET used_bytes = "
|
||
"CASE WHEN used_bytes > :delta THEN used_bytes - :delta ELSE 0 END "
|
||
"WHERE user_id = :uid"
|
||
),
|
||
{"delta": total_bytes, "uid": str(current_user.id)},
|
||
)
|
||
|
||
# Step 3: delete MinIO objects (best-effort)
|
||
for row in docs:
|
||
try:
|
||
await get_storage_backend().delete_object(row.object_key)
|
||
except Exception:
|
||
pass # MinIO cleanup is best-effort
|
||
|
||
# Step 4: delete documents (cascade via FK) then folder
|
||
await session.execute(
|
||
delete(Document).where(Document.folder_id == folder_id)
|
||
)
|
||
await session.delete(folder)
|
||
await session.commit()
|
||
```
|
||
|
||
[ASSUMED: the CASE WHEN pattern for SQLite compat is established in STATE.md; verified in existing delete_document service function]
|
||
|
||
### Pattern 3: PDF Streaming Proxy with Range Headers
|
||
|
||
**What:** `GET /api/documents/{id}/content` fetches bytes from MinIO and streams them with Range header support. Status 206 for range requests; 200 for full fetches.
|
||
|
||
**When to use:** DOC-02. Required for PDF viewer to seek pages without re-downloading the entire file.
|
||
|
||
```python
|
||
# Source: FastAPI discussion #7718 (CITED: github.com/fastapi/fastapi/discussions/7718)
|
||
from fastapi import Request
|
||
from fastapi.responses import StreamingResponse
|
||
|
||
def _parse_range(range_header: str, file_size: int) -> tuple[int, int]:
|
||
"""Parse 'bytes=start-end' Range header. Raises 416 on invalid range."""
|
||
try:
|
||
h = range_header.replace("bytes=", "").split("-")
|
||
start = int(h[0]) if h[0] != "" else 0
|
||
end = int(h[1]) if h[1] != "" else file_size - 1
|
||
except (ValueError, IndexError):
|
||
raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
|
||
if start > end or start < 0 or end >= file_size:
|
||
raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
|
||
return start, end
|
||
|
||
@router.get("/{doc_id}/content")
|
||
async def stream_document_content(
|
||
doc_id: str,
|
||
request: Request,
|
||
session: AsyncSession = Depends(get_db),
|
||
current_user: User = Depends(get_regular_user),
|
||
):
|
||
# Ownership assertion (DOC-02 also allows shared-with-me access — see Pattern 4)
|
||
doc = await session.get(Document, uid)
|
||
if doc is None or doc.user_id != current_user.id:
|
||
# Also allow if document is shared with current_user
|
||
share = await _get_share_for_user(session, uid, current_user.id)
|
||
if share is None:
|
||
raise HTTPException(404, "Document not found")
|
||
# Use the document from the share's owner context
|
||
# (doc is already loaded; ownership check skipped for shared access)
|
||
|
||
# Fetch all bytes from MinIO (asyncio.to_thread wraps sync SDK)
|
||
file_bytes = await get_storage_backend().get_object(doc.object_key)
|
||
file_size = len(file_bytes)
|
||
|
||
range_header = request.headers.get("range")
|
||
headers = {
|
||
"content-type": doc.content_type,
|
||
"content-disposition": f"inline; filename=\"{doc.filename}\"",
|
||
"accept-ranges": "bytes",
|
||
"content-length": str(file_size),
|
||
}
|
||
|
||
if range_header:
|
||
start, end = _parse_range(range_header, file_size)
|
||
chunk = file_bytes[start:end + 1]
|
||
headers["content-range"] = f"bytes {start}-{end}/{file_size}"
|
||
headers["content-length"] = str(len(chunk))
|
||
return StreamingResponse(
|
||
iter([chunk]),
|
||
status_code=206,
|
||
headers=headers,
|
||
)
|
||
|
||
return StreamingResponse(
|
||
iter([file_bytes]),
|
||
status_code=200,
|
||
headers=headers,
|
||
)
|
||
```
|
||
|
||
**Critical:** `get_regular_user` dep ensures admins cannot access document content (DOC-02 / CLAUDE.md). No presigned URL is generated at any point.
|
||
|
||
[CITED: https://github.com/fastapi/fastapi/discussions/7718]
|
||
[CITED: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/206]
|
||
|
||
### Pattern 4: Shared-Document Access for Streaming Proxy
|
||
|
||
**What:** The streaming proxy must allow both document owners AND recipients to access content.
|
||
|
||
**When to use:** Only in `GET /api/documents/{id}/content` — sharing grants read access to the content stream.
|
||
|
||
```python
|
||
# Source: CONTEXT.md D-08 — recipient can preview shared document
|
||
async def _can_access_document(
|
||
session: AsyncSession, doc: Document, current_user_id: uuid.UUID
|
||
) -> bool:
|
||
"""True if user owns the document OR has an active share as recipient."""
|
||
if doc.user_id == current_user_id:
|
||
return True
|
||
result = await session.execute(
|
||
select(Share).where(
|
||
Share.document_id == doc.id,
|
||
Share.recipient_id == current_user_id,
|
||
)
|
||
)
|
||
return result.scalar_one_or_none() is not None
|
||
```
|
||
|
||
[ASSUMED: the share-based access for the streaming proxy is implied by D-08 and SHARE-02; not explicitly called out in a prior phase decision]
|
||
|
||
### Pattern 5: PostgreSQL tsvector Full-Text Search
|
||
|
||
**What:** GIN expression index on `documents.extracted_text` for full-text search. Query uses `plainto_tsquery` (accepts natural language; no operator syntax required from user).
|
||
|
||
**When to use:** FOLD-05 — `GET /api/documents?q=<query>`.
|
||
|
||
**Migration (manual — do NOT use Computed() to avoid Alembic autogenerate noise):**
|
||
```python
|
||
# Source: CITED: https://www.postgresql.org/docs/current/textsearch-tables.html
|
||
# In migration 0004:
|
||
op.execute(
|
||
"CREATE INDEX ix_documents_fts ON documents "
|
||
"USING GIN (to_tsvector('english', coalesce(extracted_text, '')))"
|
||
)
|
||
```
|
||
|
||
**Query pattern:**
|
||
```python
|
||
# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
|
||
from sqlalchemy import func, text
|
||
|
||
stmt = (
|
||
select(Document)
|
||
.where(
|
||
Document.user_id == current_user.id,
|
||
func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
|
||
func.plainto_tsquery("english", q)
|
||
),
|
||
)
|
||
.order_by(Document.created_at.desc())
|
||
)
|
||
```
|
||
|
||
**Why `plainto_tsquery` not `to_tsquery`:** `plainto_tsquery` accepts unstructured natural language ("invoice march 2024") without requiring the user to supply `&` and `|` operators. Simpler and safer for a search bar input.
|
||
|
||
**Why expression index not generated/stored column:** PostgreSQL generated columns with `TSVECTOR` type require PostgreSQL 12+ and add schema complexity. The expression index approach is simpler and Alembic-compatible (written as raw SQL in the migration; autogenerate will not re-detect it if managed manually).
|
||
|
||
**Important Alembic caveat:** `include_schemas=True` autogenerate WILL repeatedly flag the expression index for recreation (known Alembic issue: github.com/sqlalchemy/alembic/issues/1390). The migration MUST be written manually and committed once. Do not run `alembic revision --autogenerate` after adding the GIN index without reviewing the output.
|
||
|
||
[CITED: https://www.postgresql.org/docs/current/textsearch-tables.html]
|
||
[CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html]
|
||
[CITED: https://github.com/sqlalchemy/alembic/issues/1390]
|
||
|
||
### Pattern 6: write_audit_log() Helper
|
||
|
||
**What:** Shared service function called inline in each handler after a successful operation. Does NOT raise on failure (audit log writing is best-effort — do not fail the primary operation).
|
||
|
||
**When to use:** Every handler in Phase 4 that performs state-changing operations; plus back-filling into `auth.py` and `admin.py`.
|
||
|
||
```python
|
||
# Source: CONTEXT.md D-14
|
||
# backend/services/audit.py
|
||
from __future__ import annotations
|
||
import uuid
|
||
from typing import Optional
|
||
from sqlalchemy.ext.asyncio import AsyncSession
|
||
from db.models import AuditLog
|
||
import logging
|
||
|
||
logger = logging.getLogger(__name__)
|
||
|
||
async def write_audit_log(
|
||
session: AsyncSession,
|
||
event_type: str,
|
||
user_id: Optional[uuid.UUID],
|
||
actor_id: Optional[uuid.UUID],
|
||
resource_id: Optional[uuid.UUID],
|
||
ip_address: Optional[str],
|
||
metadata_: Optional[dict] = None,
|
||
) -> None:
|
||
"""Write an audit log entry. Never raises — audit failure is non-fatal."""
|
||
try:
|
||
entry = AuditLog(
|
||
event_type=event_type,
|
||
user_id=user_id,
|
||
actor_id=actor_id,
|
||
resource_id=resource_id,
|
||
ip_address=ip_address,
|
||
metadata_=metadata_,
|
||
)
|
||
session.add(entry)
|
||
await session.flush() # flush within the handler's existing transaction
|
||
except Exception as exc:
|
||
logger.warning("audit log write failed: %s", exc)
|
||
# Do not re-raise — audit failure must never abort the primary operation
|
||
```
|
||
|
||
**Critical: `session.flush()` not `session.commit()`** — audit log writes within the same transaction as the primary operation, so the primary operation's `session.commit()` commits the audit entry too. This avoids partial writes where the operation succeeds but the audit entry is orphaned.
|
||
|
||
[ASSUMED: flush-not-commit pattern is the safest approach given transactional semantics; not explicitly documented in a prior phase decision]
|
||
|
||
### Pattern 7: CSV Export (admin endpoint)
|
||
|
||
**What:** `GET /api/admin/audit-log/export?format=csv` returns a StreamingResponse with `text/csv` content type and `Content-Disposition: attachment`.
|
||
|
||
```python
|
||
# Source: Python stdlib csv module
|
||
import csv
|
||
import io
|
||
from fastapi.responses import StreamingResponse
|
||
|
||
async def export_audit_log(
|
||
session: AsyncSession,
|
||
start: Optional[datetime],
|
||
end: Optional[datetime],
|
||
user_id: Optional[uuid.UUID],
|
||
event_type: Optional[str],
|
||
format: str = "csv",
|
||
):
|
||
rows = await _query_audit_log(session, start, end, user_id, event_type)
|
||
output = io.StringIO()
|
||
writer = csv.DictWriter(output, fieldnames=[
|
||
"id", "event_type", "user_id", "actor_id",
|
||
"resource_id", "ip_address", "metadata_", "created_at"
|
||
])
|
||
writer.writeheader()
|
||
for row in rows:
|
||
writer.writerow({
|
||
"id": row.id,
|
||
"event_type": row.event_type,
|
||
"user_id": str(row.user_id) if row.user_id else "",
|
||
"actor_id": str(row.actor_id) if row.actor_id else "",
|
||
"resource_id": str(row.resource_id) if row.resource_id else "",
|
||
"ip_address": str(row.ip_address) if row.ip_address else "",
|
||
"metadata_": str(row.metadata_) if row.metadata_ else "",
|
||
"created_at": row.created_at.isoformat(),
|
||
})
|
||
output.seek(0)
|
||
return StreamingResponse(
|
||
iter([output.getvalue()]),
|
||
media_type="text/csv",
|
||
headers={"Content-Disposition": f"attachment; filename=audit-export.csv"},
|
||
)
|
||
```
|
||
|
||
[CITED: Python stdlib csv documentation — https://docs.python.org/3/library/csv.html]
|
||
|
||
### Pattern 8: Celery Beat — Daily Audit Export Task
|
||
|
||
**What:** A new task `audit_log_daily_export` added to `beat_schedule`. Uses `crontab(hour=0, minute=0)` for midnight UTC.
|
||
|
||
```python
|
||
# Source: CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
|
||
from celery.schedules import crontab
|
||
|
||
# In celery_app.py beat_schedule dict — ADD to existing schedule:
|
||
"audit-log-daily-export": {
|
||
"task": "tasks.audit_tasks.audit_log_daily_export",
|
||
"schedule": crontab(hour=0, minute=0), # midnight UTC
|
||
},
|
||
```
|
||
|
||
**Task implementation pattern (mirrors cleanup_abandoned_uploads):**
|
||
```python
|
||
@celery_app.task(name="tasks.audit_tasks.audit_log_daily_export")
|
||
def audit_log_daily_export() -> dict:
|
||
return asyncio.run(_run_daily_export())
|
||
|
||
async def _run_daily_export() -> dict:
|
||
from datetime import date, timedelta
|
||
import csv, io
|
||
from db.session import AsyncSessionLocal
|
||
from db.models import AuditLog
|
||
from storage import get_storage_backend
|
||
from sqlalchemy import select
|
||
|
||
yesterday = date.today() - timedelta(days=1)
|
||
start = datetime(yesterday.year, yesterday.month, yesterday.day, tzinfo=timezone.utc)
|
||
end = start + timedelta(days=1)
|
||
|
||
async with AsyncSessionLocal() as session:
|
||
result = await session.execute(
|
||
select(AuditLog)
|
||
.where(AuditLog.created_at >= start, AuditLog.created_at < end)
|
||
.order_by(AuditLog.created_at)
|
||
)
|
||
rows = result.scalars().all()
|
||
|
||
output = io.StringIO()
|
||
writer = csv.DictWriter(output, fieldnames=[...])
|
||
writer.writeheader()
|
||
for row in rows:
|
||
writer.writerow({...})
|
||
|
||
csv_bytes = output.getvalue().encode("utf-8")
|
||
key = f"audit-logs/{yesterday.isoformat()}.csv"
|
||
backend = get_storage_backend()
|
||
await backend.put_object_raw(
|
||
bucket="audit-logs",
|
||
key=key,
|
||
data=io.BytesIO(csv_bytes),
|
||
length=len(csv_bytes),
|
||
content_type="text/csv",
|
||
)
|
||
return {"exported": len(rows), "key": key}
|
||
```
|
||
|
||
**Note:** `MinIOBackend` may need a `put_object_raw(bucket, key, data, length, content_type)` method (vs the existing `put_object` which constructs the key schema). The audit-logs bucket uses a different key scheme. Add this method in the migration plan.
|
||
|
||
[CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html]
|
||
|
||
### Pattern 9: Pydantic Response Model Exclusion (SEC-08)
|
||
|
||
**What:** Explicit `response_model=CloudConnectionOut` where `CloudConnectionOut` omits `credentials_enc`. This is the safe-by-default Pydantic v2 approach.
|
||
|
||
```python
|
||
# Source: CITED: https://fastapi.tiangolo.com/tutorial/response-model/
|
||
from pydantic import BaseModel
|
||
from datetime import datetime
|
||
|
||
class CloudConnectionOut(BaseModel):
|
||
id: str
|
||
provider: str
|
||
display_name: str
|
||
status: str
|
||
connected_at: datetime
|
||
# credentials_enc is deliberately absent — never serialized (SEC-08)
|
||
|
||
model_config = {"from_attributes": True}
|
||
```
|
||
|
||
**Phase 4 scope:** No cloud connection endpoints exist yet (Phase 5). Phase 4's obligation is to ensure the `CloudConnectionOut` model is defined and used as `response_model` on any admin endpoint that touches `cloud_connections`. Since no cloud connection endpoints exist yet, Phase 4 creates the model in `schemas/cloud.py` (or inline in `api/admin.py`) so Phase 5 cannot accidentally expose the field.
|
||
|
||
[CITED: https://fastapi.tiangolo.com/tutorial/response-model/]
|
||
|
||
### Pattern 10: Vue 3 Debounced Search (no external dependency)
|
||
|
||
**What:** Watch the search query ref with a manual debounce using `setTimeout`/`clearTimeout`. Does not require lodash or VueUse.
|
||
|
||
```javascript
|
||
// Source: Vue 3 Composition API watch pattern (Options API compatible via watch option)
|
||
// frontend/src/stores/documents.js
|
||
import { ref, watch } from 'vue'
|
||
|
||
const searchQuery = ref('')
|
||
let _searchTimer = null
|
||
|
||
watch(searchQuery, (newVal) => {
|
||
clearTimeout(_searchTimer)
|
||
if (newVal.length < 2) {
|
||
// Clear results or show all
|
||
return
|
||
}
|
||
_searchTimer = setTimeout(() => {
|
||
fetchDocuments({ q: newVal, folderId: currentFolderId.value })
|
||
}, 300)
|
||
})
|
||
```
|
||
|
||
[ASSUMED: 300ms is conventional for search debounce; no official Vue doc specifies this value]
|
||
|
||
### Anti-Patterns to Avoid
|
||
|
||
- **Using `session.commit()` inside `write_audit_log()`:** Must use `flush()` to stay within the handler's transaction. A separate commit risks committing the audit entry without the primary operation.
|
||
- **Generating presigned URLs in the streaming proxy:** D-08 explicitly prohibits this. Never call `presigned_get_url()` from the proxy endpoint. The bytes must flow through FastAPI.
|
||
- **`to_tsquery` instead of `plainto_tsquery`:** `to_tsquery` requires the user to supply `&`, `|`, and `!` operators. Search bar input uses `plainto_tsquery` which accepts natural language phrases.
|
||
- **Using `response_model_exclude` on CloudConnection:** Prefer defining an explicit `CloudConnectionOut` model that never includes `credentials_enc`. `response_model_exclude` is fragile (requires remembering to pass the set on every endpoint). Whitelist is safer.
|
||
- **Storing breadcrumb state in the URL query string:** Breadcrumb is derived from the current `folder_id` parameter — only the current folder ID needs to be in the URL (or in Vue state). The full path is reconstructed from the DB on navigation.
|
||
- **Recursive SQL for breadcrumb without depth limit:** `WITH RECURSIVE` CTE is correct but should include a depth guard (`WHERE depth < 50`) to prevent infinite recursion on a corrupted self-referential FK.
|
||
|
||
---
|
||
|
||
## Don't Hand-Roll
|
||
|
||
| Problem | Don't Build | Use Instead | Why |
|
||
|---------|-------------|-------------|-----|
|
||
| Range header parsing | Custom byte-range logic from scratch | Pattern 3 above (from FastAPI discussion #7718) | Browser PDFs use RFC 7233 range requests; edge cases (open-ended ranges `bytes=500-`) must be handled |
|
||
| Full-text search ranking | BM25 in Python | PostgreSQL `ts_rank()` + `plainto_tsquery` | Runs in the DB; GIN index makes it fast; no external service needed |
|
||
| CSV serialization | String concatenation with escaping | Python stdlib `csv.DictWriter` | Handles quoting, escaping, newlines in field values automatically |
|
||
| Audit log middleware | Celery task or FastAPI middleware for audit writes | Inline `write_audit_log()` after successful operation | Middleware fires on request start — cannot know if operation succeeded; Celery task adds latency and potential missed writes on worker failure |
|
||
| Tsvector maintenance | Application-layer `extracted_text` → tsvector sync | PostgreSQL GIN expression index `to_tsvector('english', extracted_text)` | DB recalculates automatically on UPDATE; no trigger or application sync needed |
|
||
| Debounce utility | npm package (lodash, VueUse) | `setTimeout`/`clearTimeout` in watch | < 10 lines; no new dependency justified |
|
||
|
||
**Key insight:** PostgreSQL handles full-text search natively with GIN indexes at query time — no external search infrastructure (Elasticsearch, Typesense, Meilisearch) is needed or justified at this scale.
|
||
|
||
---
|
||
|
||
## Common Pitfalls
|
||
|
||
### Pitfall 1: Folder Cascade-Delete Misses Sub-folder Documents
|
||
|
||
**What goes wrong:** `DELETE /api/folders/{id}` deletes direct child documents but leaves orphaned documents in sub-folders. Quota becomes incorrect.
|
||
|
||
**Why it happens:** A naive `WHERE folder_id = :id` only catches immediate children. Self-referential FK with unlimited nesting requires a recursive query to collect the full subtree.
|
||
|
||
**How to avoid:** Use `WITH RECURSIVE` CTE to collect all folder IDs in the subtree before deleting:
|
||
|
||
```sql
|
||
WITH RECURSIVE subtree AS (
|
||
SELECT id FROM folders WHERE id = :root_id AND user_id = :uid
|
||
UNION ALL
|
||
SELECT f.id FROM folders f
|
||
JOIN subtree s ON f.parent_id = s.id
|
||
WHERE f.user_id = :uid AND depth < 50
|
||
)
|
||
SELECT d.id, d.size_bytes, d.object_key
|
||
FROM documents d
|
||
JOIN subtree s ON d.folder_id = s.id
|
||
WHERE d.user_id = :uid
|
||
```
|
||
|
||
**Warning signs:** After deleting a folder tree, `quotas.used_bytes` does not match `SUM(documents.size_bytes)`.
|
||
|
||
[ASSUMED: SQLite does not support WITH RECURSIVE for test runs — integration tests requiring this pattern must use PostgreSQL (INTEGRATION=1)]
|
||
|
||
### Pitfall 2: Alembic Autogenerate Repeatedly Detects tsvector GIN Index
|
||
|
||
**What goes wrong:** After adding the GIN expression index in migration 0004, running `alembic revision --autogenerate` again generates a migration that drops and recreates the same index.
|
||
|
||
**Why it happens:** Alembic's autogenerate does not understand functional/expression indexes on PostgreSQL when the expression involves function calls like `to_tsvector()`. Known issue in Alembic.
|
||
|
||
**How to avoid:** Write the GIN index migration manually (as raw `op.execute(CREATE INDEX...)`) and do NOT add the index to the ORM model's `__table_args__`. Keep it only in the migration. Document with a comment `# managed manually — do not autogenerate`.
|
||
|
||
**Warning signs:** Repeated `alembic revision --autogenerate` output shows DROP INDEX / CREATE INDEX for `ix_documents_fts` on every run.
|
||
|
||
[CITED: https://github.com/sqlalchemy/alembic/issues/1390]
|
||
|
||
### Pitfall 3: PDF Streaming Proxy — Admin Access
|
||
|
||
**What goes wrong:** Admin user calls `GET /api/documents/{id}/content` and receives document bytes.
|
||
|
||
**Why it happens:** Using `get_current_user` instead of `get_regular_user` on the proxy endpoint.
|
||
|
||
**How to avoid:** The proxy endpoint MUST use `get_regular_user` dep, which raises HTTP 403 for `user.role == 'admin'`. This is established in Phase 3 (D-16) and must be applied to the proxy endpoint.
|
||
|
||
**Warning signs:** Security test `test_admin_cannot_access_document_content` fails or is absent.
|
||
|
||
[VERIFIED: codebase — backend/deps/auth.py `get_regular_user` raises 403 for admin role]
|
||
|
||
### Pitfall 4: Share IDOR — Recipient Accesses Another Recipient's Share
|
||
|
||
**What goes wrong:** User A shares document with User B. User C (also a recipient of a different share on the same document, or just any user) can enumerate share IDs and revoke or view shares they don't own.
|
||
|
||
**Why it happens:** `DELETE /api/shares/{share_id}` checks only that the share exists, not that `share.owner_id == current_user.id`.
|
||
|
||
**How to avoid:** Ownership assertion on all share mutation endpoints:
|
||
```python
|
||
share = await session.get(Share, share_id)
|
||
if share is None or share.owner_id != current_user.id:
|
||
raise HTTPException(404, "Share not found")
|
||
```
|
||
|
||
**Warning signs:** Security test `test_share_revoke_wrong_owner_returns_404` is absent.
|
||
|
||
[ASSUMED: standard IDOR prevention pattern per CLAUDE.md; test must be written]
|
||
|
||
### Pitfall 5: write_audit_log Does Not Include IP Address
|
||
|
||
**What goes wrong:** `ip_address` column is left NULL for all events because the handler does not extract it from the request.
|
||
|
||
**Why it happens:** The `request.client.host` extraction is not wired into `write_audit_log()` calls.
|
||
|
||
**How to avoid:** Inject `request: Request` into each handler and pass `request.client.host` as `ip_address` to `write_audit_log()`. For handlers behind a reverse proxy, use `request.headers.get("X-Forwarded-For", request.client.host)`.
|
||
|
||
[ASSUMED: IP extraction from FastAPI Request is standard; X-Forwarded-For handling depends on deployment]
|
||
|
||
### Pitfall 6: Folder Name Uniqueness Constraint Violation
|
||
|
||
**What goes wrong:** Creating two folders with the same name under the same parent raises an IntegrityError from the `uq_folders_user_parent_name` constraint, which surfaces as HTTP 500.
|
||
|
||
**Why it happens:** The `Folder` model has `UniqueConstraint("user_id", "parent_id", "name")`. The API handler does not check for duplicates before inserting.
|
||
|
||
**How to avoid:** Catch `IntegrityError` from the ORM insert and return HTTP 409 Conflict:
|
||
```python
|
||
from sqlalchemy.exc import IntegrityError
|
||
try:
|
||
session.add(folder)
|
||
await session.commit()
|
||
except IntegrityError:
|
||
await session.rollback()
|
||
raise HTTPException(409, "A folder with that name already exists here")
|
||
```
|
||
|
||
**Warning signs:** POST /api/folders returns 500 on duplicate name.
|
||
|
||
[VERIFIED: codebase — db/models.py Folder `UniqueConstraint("user_id", "parent_id", "name")`]
|
||
|
||
### Pitfall 7: "Shared with Me" Leaks Document Content
|
||
|
||
**What goes wrong:** The `GET /api/documents` list endpoint accidentally returns documents shared with the current user (not just owned by them), exposing filenames and extracted text to recipients.
|
||
|
||
**Why it happens:** A join on the `shares` table is added to the list query without scoping by ownership.
|
||
|
||
**How to avoid:** The standard list endpoint (`GET /api/documents`) MUST filter `WHERE documents.user_id = current_user.id` only. Shared documents are a separate endpoint `GET /api/shares/received`. Extracted text is not returned in the shared documents list view — only metadata.
|
||
|
||
[VERIFIED: codebase — services/storage.py `list_metadata` already scopes by `user_id`]
|
||
|
||
### Pitfall 8: MinIO Audit-Logs Bucket Does Not Exist
|
||
|
||
**What goes wrong:** The daily Celery beat task fails with `NoSuchBucket` error on first run.
|
||
|
||
**Why it happens:** The `audit-logs` bucket is not created in the migration or application lifespan.
|
||
|
||
**How to avoid:** Create the `audit-logs` bucket in migration 0004's `upgrade()` function (same pattern as the main documents bucket, gated on `MINIO_ENDPOINT` env var to allow SQLite test runs):
|
||
```python
|
||
# In migration 0004 post-DDL step
|
||
minio_endpoint = os.environ.get("MINIO_ENDPOINT")
|
||
if minio_endpoint:
|
||
client = Minio(minio_endpoint, ...)
|
||
if not client.bucket_exists("audit-logs"):
|
||
client.make_bucket("audit-logs")
|
||
```
|
||
|
||
[VERIFIED: codebase — migration 0003 shows the `MINIO_ENDPOINT`-gated pattern; audit-logs bucket creation follows same approach]
|
||
|
||
---
|
||
|
||
## Runtime State Inventory
|
||
|
||
> Phase 4 is a feature addition (not a rename/refactor). No runtime state migration is required.
|
||
|
||
| Category | Items Found | Action Required |
|
||
|----------|-------------|------------------|
|
||
| Stored data | No existing `pdf_open_mode` values; column is NEW (default `'in_app'`) | Alembic migration 0004 adds column with server_default — no data migration |
|
||
| Stored data | Existing `audit_log` table is EMPTY (no events written in prior phases) | No migration; Phase 4 backfills audit writes into handlers going forward |
|
||
| Live service config | `audit-logs` MinIO bucket does not exist | Created in migration 0004 post-DDL step (gated on MINIO_ENDPOINT) |
|
||
| OS-registered state | None — verified by codebase scan | None |
|
||
| Secrets/env vars | No new env vars required for Phase 4 features | None |
|
||
| Build artifacts | No stale egg-info or compiled artifacts from prior phases | None |
|
||
|
||
---
|
||
|
||
## Environment Availability
|
||
|
||
| Dependency | Required By | Available | Version | Fallback |
|
||
|------------|------------|-----------|---------|----------|
|
||
| PostgreSQL | tsvector GIN index, CTE folder cascade | [runtime — Docker Compose] | 15+ | SQLite for unit tests (no tsvector; integration tests require INTEGRATION=1) |
|
||
| MinIO | Streaming proxy, audit CSV upload | [runtime — Docker Compose] | latest | Mock in tests (established pattern) |
|
||
| Redis | Celery beat daily export task | [runtime — Docker Compose] | latest | Not mockable for beat tasks; integration tests skip beat scheduling |
|
||
| Python csv (stdlib) | CSV export endpoint + Celery export task | ✓ | stdlib | — |
|
||
|
||
**Missing dependencies with no fallback:**
|
||
- PostgreSQL tsvector: cannot be tested without a real PostgreSQL instance. Tests that exercise FTS must be marked with `pytest.mark.skipif(not live_services_available, ...)` or use the `INTEGRATION=1` pattern.
|
||
|
||
**Missing dependencies with fallback:**
|
||
- SQLite in unit tests: most folder/share CRUD tests run fine on SQLite. Only FTS and `WITH RECURSIVE` CTE tests require PostgreSQL.
|
||
|
||
---
|
||
|
||
## Validation Architecture
|
||
|
||
### Test Framework
|
||
|
||
| Property | Value |
|
||
|----------|-------|
|
||
| Framework | pytest + pytest-asyncio (already configured) |
|
||
| Config file | `pytest.ini` or `pyproject.toml` (check existing) |
|
||
| Quick run command | `pytest backend/tests/test_folders.py backend/tests/test_shares.py -x` |
|
||
| Full suite command | `cd backend && pytest -v` |
|
||
|
||
### Phase Requirements → Test Map
|
||
|
||
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|
||
|--------|----------|-----------|-------------------|-------------|
|
||
| FOLD-01 | Create/rename/delete folder | Integration | `pytest backend/tests/test_folders.py::test_create_folder -x` | ❌ Wave 0 |
|
||
| FOLD-01 | Delete non-empty folder → warning count | Integration | `pytest backend/tests/test_folders.py::test_delete_folder_content_count -x` | ❌ Wave 0 |
|
||
| FOLD-02 | Move document to folder | Integration | `pytest backend/tests/test_folders.py::test_move_document -x` | ❌ Wave 0 |
|
||
| FOLD-02 | Move document — ownership assertion | Integration (negative) | `pytest backend/tests/test_folders.py::test_move_wrong_owner_404 -x` | ❌ Wave 0 |
|
||
| FOLD-03 | Breadcrumb path returned in folder response | Unit | `pytest backend/tests/test_folders.py::test_breadcrumb_path -x` | ❌ Wave 0 |
|
||
| FOLD-04 | Sort by name/date/size | Integration | `pytest backend/tests/test_folders.py::test_document_sort -x` | ❌ Wave 0 |
|
||
| FOLD-05 | tsvector search returns matching docs | Integration (PostgreSQL required) | `pytest backend/tests/test_folders.py::test_fts_search -x -m integration` | ❌ Wave 0 |
|
||
| SHARE-01 | Share by handle — handle not found → 404 | Integration | `pytest backend/tests/test_shares.py::test_share_handle_not_found -x` | ❌ Wave 0 |
|
||
| SHARE-01 | Share by handle — success | Integration | `pytest backend/tests/test_shares.py::test_share_success -x` | ❌ Wave 0 |
|
||
| SHARE-02 | Shared doc in recipient virtual folder | Integration | `pytest backend/tests/test_shares.py::test_shared_with_me -x` | ❌ Wave 0 |
|
||
| SHARE-02 | Shared doc — zero quota charged to recipient | Integration | `pytest backend/tests/test_shares.py::test_share_no_quota_impact -x` | ❌ Wave 0 |
|
||
| SHARE-04 | Revoke share — immediate for recipient | Integration | `pytest backend/tests/test_shares.py::test_revoke_share -x` | ❌ Wave 0 |
|
||
| SHARE-01..04 | Share IDOR — wrong owner cannot revoke | Security (negative) | `pytest backend/tests/test_shares.py::test_share_revoke_wrong_owner_404 -x` | ❌ Wave 0 |
|
||
| SEC-08 | credentials_enc absent from all responses | Security (negative) | `pytest backend/tests/test_security.py::test_credentials_enc_not_in_response -x` | ❌ Wave 0 |
|
||
| SEC-09 | Account deletion deletes user files | Integration | `pytest backend/tests/test_admin_api.py::test_delete_user_cleans_files -x` | ❌ Wave 0 |
|
||
| DOC-02 | PDF proxy returns bytes, 200 or 206 | Integration | `pytest backend/tests/test_documents.py::test_content_stream_200 -x` | ❌ Wave 0 |
|
||
| DOC-02 | PDF proxy — no presigned URL in response | Security (negative) | `pytest backend/tests/test_documents.py::test_content_stream_no_presigned_url -x` | ❌ Wave 0 |
|
||
| DOC-02 | PDF proxy — Range header → 206 | Integration | `pytest backend/tests/test_documents.py::test_content_stream_206_range -x` | ❌ Wave 0 |
|
||
| DOC-02 | PDF proxy — admin blocked (403) | Security (negative) | `pytest backend/tests/test_documents.py::test_content_stream_admin_403 -x` | ❌ Wave 0 |
|
||
| ADMIN-06 | Audit log viewer — no document content in entries | Security (negative) | `pytest backend/tests/test_audit.py::test_audit_log_no_doc_content -x` | ❌ Wave 0 |
|
||
| ADMIN-06 | Audit log viewer — admin only | Security (negative) | `pytest backend/tests/test_audit.py::test_audit_log_regular_user_403 -x` | ❌ Wave 0 |
|
||
|
||
### Sampling Rate
|
||
|
||
- **Per task commit:** `pytest backend/tests/test_folders.py backend/tests/test_shares.py backend/tests/test_audit.py backend/tests/test_documents.py -x`
|
||
- **Per wave merge:** `cd backend && pytest -v`
|
||
- **Phase gate:** Full suite green before `/gsd:verify-work`
|
||
|
||
### Wave 0 Gaps
|
||
|
||
- [ ] `backend/tests/test_folders.py` — covers FOLD-01..05
|
||
- [ ] `backend/tests/test_shares.py` — covers SHARE-01..05 + IDOR security tests
|
||
- [ ] `backend/tests/test_audit.py` — covers ADMIN-06 + no-doc-content security tests
|
||
- [ ] `backend/tests/test_documents.py` — add proxy tests (test_content_stream_*) to existing file
|
||
- [ ] `backend/tests/test_security.py` — add SEC-08, SEC-09 tests (may be in test_admin_api.py)
|
||
|
||
---
|
||
|
||
## Security Domain
|
||
|
||
### Applicable ASVS Categories
|
||
|
||
| ASVS Category | Applies | Standard Control |
|
||
|---------------|---------|-----------------|
|
||
| V2 Authentication | yes | `get_regular_user` / `get_current_admin` deps on all new endpoints |
|
||
| V3 Session Management | no | No new session mechanisms introduced |
|
||
| V4 Access Control | yes | Ownership assertion on all folder/share endpoints; `get_regular_user` rejects admin on proxy |
|
||
| V5 Input Validation | yes | Pydantic models on all request bodies; `doc_id` parsed via `uuid.UUID()` |
|
||
| V6 Cryptography | no | No new cryptographic operations; credentials still encrypted via Phase 2 HKDF pattern |
|
||
|
||
### Known Threat Patterns for this Phase
|
||
|
||
| Pattern | STRIDE | Standard Mitigation |
|
||
|---------|--------|---------------------|
|
||
| Share IDOR — revoke another user's share | Elevation of privilege | `share.owner_id == current_user.id` → 404 on mismatch |
|
||
| Share IDOR — access shared doc content without share | Information disclosure | `_can_access_document()` check in proxy endpoint: owner OR active share |
|
||
| PDF proxy leaking presigned URL | Information disclosure | `get_object()` fetches bytes directly; presigned URL never generated in proxy handler |
|
||
| Admin accessing document content via proxy | Broken access control | `get_regular_user` dep raises 403 for admin role |
|
||
| Folder IDOR — delete another user's folder | Tampering | `folder.user_id == current_user.id` → 404 on mismatch |
|
||
| Audit log containing document content | Sensitive data exposure | `write_audit_log()` metadata_ MUST NOT include `filename`, `extracted_text`, or file bytes |
|
||
| Audit log admin access by regular user | Broken access control | `GET /api/admin/audit-log` uses `get_current_admin` dep |
|
||
| Path traversal via `folder_id` parameter | Information disclosure | All folder/document lookups via DB primary key; `uuid.UUID()` parse validates format |
|
||
| Tsvector search scope leak — user sees others' docs | Information disclosure | FTS query MUST include `Document.user_id == current_user.id` scope filter |
|
||
| credentials_enc in serialized admin response | Sensitive data exposure | `CloudConnectionOut` excludes field; never in any response |
|
||
|
||
### Security Gate Checklist (Phase 4 specific)
|
||
|
||
- [ ] `bandit -r backend/` — zero HIGH findings in new files (`api/folders.py`, `api/shares.py`, `api/audit.py`, `services/audit.py`, `tasks/audit_tasks.py`)
|
||
- [ ] `pip audit` — zero new critical/high CVEs (no new packages)
|
||
- [ ] `npm audit --audit-level=high` — zero (no new npm packages)
|
||
- [ ] `test_share_revoke_wrong_owner_404` passes
|
||
- [ ] `test_content_stream_admin_403` passes
|
||
- [ ] `test_content_stream_no_presigned_url` passes
|
||
- [ ] `test_audit_log_no_doc_content` passes — verifies `filename`, `extracted_text` absent from all audit entries
|
||
- [ ] `test_credentials_enc_not_in_response` passes
|
||
- [ ] `test_fts_search_scoped_to_owner` passes — other user's docs not in search results
|
||
|
||
---
|
||
|
||
## Code Examples
|
||
|
||
### Folder Breadcrumb Path Query (PostgreSQL CTE)
|
||
|
||
```sql
|
||
-- Source: CITED: https://www.postgresql.org/docs/current/queries-with.html
|
||
WITH RECURSIVE breadcrumb AS (
|
||
SELECT id, parent_id, name, 1 AS depth
|
||
FROM folders
|
||
WHERE id = :folder_id AND user_id = :uid
|
||
UNION ALL
|
||
SELECT f.id, f.parent_id, f.name, b.depth + 1
|
||
FROM folders f
|
||
JOIN breadcrumb b ON f.id = b.parent_id
|
||
WHERE f.user_id = :uid AND b.depth < 50
|
||
)
|
||
SELECT id, name, depth FROM breadcrumb ORDER BY depth DESC;
|
||
```
|
||
|
||
Returns path from root to current folder (root = highest depth value after `ORDER BY depth DESC`). Each segment is `{id, name}` for clickable navigation.
|
||
|
||
### Full-text Search Query (SQLAlchemy)
|
||
|
||
```python
|
||
# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
|
||
from sqlalchemy import func
|
||
from sqlalchemy.dialects.postgresql import TSVECTOR
|
||
|
||
stmt = (
|
||
select(Document)
|
||
.where(
|
||
Document.user_id == current_user.id,
|
||
func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
|
||
func.plainto_tsquery("english", q)
|
||
),
|
||
)
|
||
.order_by(Document.created_at.desc())
|
||
.limit(per_page)
|
||
.offset((page - 1) * per_page)
|
||
)
|
||
```
|
||
|
||
### "Shared with me" Virtual Folder Query
|
||
|
||
```python
|
||
# Source: CONTEXT.md D-06
|
||
from sqlalchemy import select
|
||
from db.models import Share, Document
|
||
|
||
stmt = (
|
||
select(Document)
|
||
.join(Share, Share.document_id == Document.id)
|
||
.where(Share.recipient_id == current_user.id)
|
||
.order_by(Document.created_at.desc())
|
||
)
|
||
result = await session.execute(stmt)
|
||
shared_docs = result.scalars().all()
|
||
```
|
||
|
||
### `is_shared` indicator on document list
|
||
|
||
```python
|
||
# Source: CONTEXT.md SHARE-05 — owner's list shows "shared" badge
|
||
from sqlalchemy import exists, select
|
||
|
||
# Add to list_documents query:
|
||
shared_subq = (
|
||
select(Share.document_id)
|
||
.where(Share.owner_id == current_user.id)
|
||
.scalar_subquery()
|
||
)
|
||
# Add to Document select:
|
||
stmt = select(Document, Document.id.in_(shared_subq).label("is_shared"))
|
||
```
|
||
|
||
---
|
||
|
||
## State of the Art
|
||
|
||
| Old Approach | Current Approach | When Changed | Impact |
|
||
|--------------|------------------|--------------|--------|
|
||
| PostgreSQL `to_tsquery` (operator syntax required) | `plainto_tsquery` (natural language input) | PostgreSQL 9.3+ | Search bars should use `plainto_tsquery`; `to_tsquery` for programmatic use only |
|
||
| PDF.js for in-browser PDF viewing | Native browser PDF viewer via `Content-Type: application/pdf` + `Content-Disposition: inline` | Browser support ~2018+ | All modern browsers support native PDF rendering; PDF.js only needed for custom annotation features |
|
||
| Alembic autogenerate for GIN expression indexes | Manual migration SQL | Alembic 1.13+ (known bug) | Must write `op.execute("CREATE INDEX ... USING GIN")` manually |
|
||
| `asyncio.run()` in Celery tasks | `asyncio.run(_async_func())` is the standard bridge | Established in Phase 3 document_tasks.py | Celery sync task calls `asyncio.run()` to enter async context |
|
||
|
||
**Deprecated/outdated:**
|
||
- `to_tsvector` with stored Computed column: adds schema complexity; expression index is simpler and equally performant for this scale
|
||
- Separate audit log middleware: D-14 decision is inline writes; middleware approach not used
|
||
|
||
---
|
||
|
||
## Assumptions Log
|
||
|
||
| # | Claim | Section | Risk if Wrong |
|
||
|---|-------|---------|---------------|
|
||
| A1 | The streaming proxy's shared-document access check (Pattern 4) is implied by D-08 and SHARE-02 | Pattern 4, Security Domain | If recipients cannot access the proxy, "Shared with me" is useless; planner must confirm access rule |
|
||
| A2 | `session.flush()` (not `commit()`) inside `write_audit_log()` is the correct transactional pattern | Pattern 6 | Using `commit()` would create two separate transactions; if primary operation fails after audit commit, audit entry is orphaned with no corresponding operation |
|
||
| A3 | 300ms debounce interval for search | Pattern 10 | If too short, excessive API calls; if too long, UX feels sluggish. Standard convention, low risk |
|
||
| A4 | `WITH RECURSIVE` CTE for subtree cascade-delete is the correct PostgreSQL approach; SQLite tests must use INTEGRATION=1 | Pitfall 1, Common Pitfalls | SQLite does not support `WITH RECURSIVE` — all subtree cascade tests require PostgreSQL |
|
||
| A5 | `MinIOBackend.put_object_raw()` method needs to be added for audit-logs bucket (different key scheme from documents) | Pattern 8 | If not added, Celery export task cannot upload to `audit-logs` bucket using the correct key |
|
||
| A6 | The `audit-logs` MinIO bucket creation is safe to gate on `MINIO_ENDPOINT` env var (same as migration 0003 pattern) | Pitfall 8 | If MinIO is present but bucket creation is skipped, first export task fails |
|
||
|
||
**If this table is empty:** Not applicable — some assumptions remain in this research.
|
||
|
||
---
|
||
|
||
## Open Questions (RESOLVED)
|
||
|
||
1. **Subtree folder delete — CTE vs. iterative in Python**
|
||
- What we know: `WITH RECURSIVE` CTE works in PostgreSQL; SQLite does not support it
|
||
- What's unclear: Should the planner use the CTE (PostgreSQL-only, integration test) or iterative Python (works in SQLite, slower for deep trees)?
|
||
- Recommendation: Use CTE for the implementation (real database is PostgreSQL); mark the cascade-delete tests as `pytest.mark.skipif(not live_services_available, ...)` or use `INTEGRATION=1`
|
||
|
||
2. **Shared document access in proxy — `_can_access_document()` adds a DB query per request**
|
||
- What we know: Every `GET /api/documents/{id}/content` call currently does one DB lookup; adding share check adds a second
|
||
- What's unclear: Is the extra query acceptable given Phase 4 scale (single-user → small multi-user)?
|
||
- Recommendation: Add the share check unconditionally; at Phase 4 scale a second indexed query is negligible
|
||
|
||
3. **`PATCH /api/me/preferences` endpoint path**
|
||
- What we know: D-10 specifies this endpoint for pdf_open_mode
|
||
- What's unclear: Should it be on `/api/auth/me/preferences` (alongside `/api/auth/me`) or `/api/me/preferences` (new router)?
|
||
- Recommendation: Place on `/api/auth/me/preferences` (same router prefix as `/api/auth/me`) to keep auth-related user settings in the same module
|
||
|
||
---
|
||
|
||
## Sources
|
||
|
||
### Primary (HIGH confidence)
|
||
- `backend/db/models.py` — all ORM models verified: `Folder`, `Share`, `AuditLog`, `Document.folder_id`, `User` (missing `pdf_open_mode`), `CloudConnection` (`credentials_enc` column confirmed)
|
||
- `backend/api/documents.py` — ownership assertion pattern, `get_regular_user` dep, quota UPDATE pattern
|
||
- `backend/deps/auth.py` — `get_regular_user` raises 403 for admin; `get_current_admin` pattern
|
||
- `backend/celery_app.py` — existing `beat_schedule` structure; `cleanup_abandoned_uploads` as template
|
||
- `backend/storage/minio_backend.py` — `get_object()` confirmed; `asyncio.to_thread()` pattern confirmed
|
||
- `backend/services/storage.py` — `list_metadata` scoping by `user_id` confirmed
|
||
|
||
### Secondary (MEDIUM confidence)
|
||
- [FastAPI discussions #7718](https://github.com/fastapi/fastapi/discussions/7718) — Range header StreamingResponse pattern verified by multiple community confirmations
|
||
- [PostgreSQL docs — Full-text search tables](https://www.postgresql.org/docs/current/textsearch-tables.html) — GIN expression index SQL verified from official docs
|
||
- [SQLAlchemy PostgreSQL dialect docs](https://docs.sqlalchemy.org/en/20/dialects/postgresql.html) — `func.to_tsvector`, `plainto_tsquery` support confirmed
|
||
- [Celery periodic tasks docs](https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html) — `crontab(hour=0, minute=0)` pattern
|
||
- [FastAPI response model docs](https://fastapi.tiangolo.com/tutorial/response-model/) — `response_model=CloudConnectionOut` exclusion pattern
|
||
|
||
### Tertiary (LOW confidence)
|
||
- [Alembic issue #1390](https://github.com/sqlalchemy/alembic/issues/1390) — tsvector autogenerate bug confirmed by issue thread; marked for validation
|
||
|
||
---
|
||
|
||
## Metadata
|
||
|
||
**Confidence breakdown:**
|
||
- Standard stack: HIGH — all packages already in requirements.txt; codebase verified
|
||
- Architecture: HIGH — all ORM models exist and were verified in codebase
|
||
- Pitfalls: MEDIUM — Pitfalls 1, 4 are confirmed from codebase + security requirements; Pitfall 2 confirmed from Alembic issue tracker
|
||
- Code patterns: MEDIUM — Range header pattern from FastAPI community discussion; PostgreSQL FTS from official docs
|
||
|
||
**Research date:** 2026-05-25
|
||
**Valid until:** 2026-06-25 (stable FastAPI/SQLAlchemy stack; no fast-moving dependencies)
|