Files
curo1305 747303246a docs(04): create phase 4 plan (9 plans, 7 waves)
Folders, Sharing, Quotas & Document UX — plans verified (0 blockers,
2 non-blocking warnings). Covers FOLD-01..05, SHARE-01..05, SEC-08/09,
ADMIN-06, DOC-01/02.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:20:16 +02:00

1089 lines
59 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Phase 4: Folders, Sharing, Quotas & Document UX — Research
**Researched:** 2026-05-25
**Domain:** FastAPI folder/share CRUD, PostgreSQL tsvector, MinIO streaming proxy, Celery beat, Vue 3 folder navigation
**Confidence:** HIGH
---
<user_constraints>
## User Constraints (from CONTEXT.md)
### Locked Decisions
**D-01** Hybrid layout — AppSidebar shows top-level folders only. Sub-folders + breadcrumb in main content area. Top-level folders clickable directly in sidebar.
**D-02** Unlimited nesting depth (no API or UI cap). `Folder.parent_id` self-referential FK is authoritative. Breadcrumbs truncate at depth > 4 (show first + "..." + last 2).
**D-03** Non-empty folder delete: warning modal with document count. Confirm → cascade-delete all documents (MinIO + DB + quota). Cancel → no action. Documents are NOT moved to root — they are destroyed.
**D-04** Exact handle input — no autocomplete. API returns 404 if handle not found; UI shows "User not found" error.
**D-05** Share button on DocumentCard (inline icon button). Modal: (a) handle input, (b) Share button, (c) current recipients list with Revoke per row.
**D-06** "Shared with me" is a fixed virtual folder entry in AppSidebar, rendered above the user's own folder list. Filtered by `shares.recipient_id = current_user.id`. Zero quota charged to recipient.
**D-07** Share permission is `view` only for Phase 4. `edit` deferred.
**D-08** Streaming proxy endpoint: `GET /api/documents/{id}/content`. Returns bytes via FastAPI `StreamingResponse`. Supports `Range` headers. `Content-Disposition: inline`. No presigned URL ever generated or exposed. Uses `get_regular_user` dep.
**D-09** Native browser PDF rendering — no PDF.js. Zero frontend dependencies added. Content-Type header drives browser rendering.
**D-10** `users.pdf_open_mode` column (String, default `'in_app'`). Exposed via `PATCH /api/me/preferences`. `in_app` = modal with `<iframe>`; `new_tab` = `window.open`.
**D-11** PostgreSQL tsvector GIN index on `documents.extracted_text`. `plainto_tsquery('english', :q)`. Search scope: user's own docs only. `GET /api/documents?q=<query>`.
**D-12** Inline search bar in document list view. Debounced, minimum 2 chars. Folder-context aware.
**D-13** All 4 audit event categories logged: auth events, document operations, folder/share operations, admin actions.
**D-14** Audit log written inline in handlers (no middleware, no Celery task). Shared helper `write_audit_log(session, event_type, user_id, actor_id, resource_id, ip_address, metadata_dict)`.
**D-15** Admin audit log viewer (`/admin` → AuditLog tab): paginated table, filters (date range, user, action type). No document content, filenames, or extracted text in any log entry.
**D-16** Manual CSV and JSON export button in admin audit log view. Exports current filtered result set.
**D-17** Daily automated CSV backup via Celery beat. Exports previous day's audit rows to MinIO `audit-logs` bucket (`audit-logs/YYYY-MM-DD.csv`). Bucket is private. Admins can download from admin panel. Rows NOT deleted from DB.
**D-18** `credentials_enc` excluded from all serializers (SEC-08). Explicit Pydantic response model enforcement on all user-facing endpoints that touch `CloudConnection`.
**D-19** Account deletion (admin-triggered) runs `delete_user_files()` before removing DB records (SEC-09). Cloud connection cleanup deferred to Phase 5.
### Claude's Discretion
None specified — all significant decisions are locked.
### Deferred Ideas (OUT OF SCOPE)
- Per-day document scan quota tiers (future billing milestone)
- Share permission levels beyond `view` (`edit`, `comment` deferred)
- Presigned GET URLs for non-PDF document download
- Audit log row archival/deletion policy (DB rows are not auto-deleted)
</user_constraints>
---
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|------------------|
| FOLD-01 | User can create, rename, and delete folders (delete confirms content count before proceeding) | `Folder` model complete in `db/models.py`; cascade-delete pattern via SQLAlchemy + atomic quota decrement established in Phase 3 |
| FOLD-02 | User can move documents between folders | `Document.folder_id` FK exists and is nullable; PATCH endpoint updates `folder_id` with ownership assertion on both doc and target folder |
| FOLD-03 | Breadcrumb navigation renders current folder path; each segment clickable | Recursive path query via `WITH RECURSIVE` CTE or iterative parent-chain walk; Vue computed property for breadcrumb array |
| FOLD-04 | Document list supports sort by name, date uploaded, and file size | SQLAlchemy `order_by()` with enum sort param; frontend `sort` query param |
| FOLD-05 | Full-text search via PostgreSQL tsvector index on extracted text | Expression-index approach `to_tsvector('english', extracted_text)` + GIN; `plainto_tsquery` in `WHERE` clause; Alembic migration adds index manually (not as Computed column to avoid autogenerate noise) |
| SHARE-01 | User can share a document with another user by their unique handle | `Share` model complete; `User.handle` unique lookup; exact match, no autocomplete |
| SHARE-02 | Shared documents appear in "Shared with me" virtual folder; zero quota charged to recipient | Filter `shares.recipient_id = current_user.id`; no quota row touched; virtual folder is a UI/API filter, not a DB folder |
| SHARE-03 | Shared access view-only by default; owner controls permission level | `Share.permission` defaults to `'view'`; Phase 4 only implements `view`; permission field returned in list responses |
| SHARE-04 | Owner can revoke share; revocation is immediate | `DELETE /api/shares/{share_id}` with owner assertion; no async cleanup needed |
| SHARE-05 | Documents shared with others display a "shared" indicator | Document list response includes `is_shared: bool` derived from EXISTS subquery on `shares.owner_id` |
| SEC-08 | `credentials_enc` excluded from all API serializers | Explicit `CloudConnectionOut` Pydantic response model excluding `credentials_enc` |
| SEC-09 | Account deletion triggers `delete_user_files()` per cloud connection before DB removal | Implemented in admin delete-user endpoint; iterates documents, deletes MinIO objects, decrements quota |
| ADMIN-06 | Admin can view audit log filtered by date range, user, and action type (metadata only) | `AuditLog` model complete; paginated query with filters; admin endpoint uses `get_current_admin` |
| DOC-01 | User can view document metadata and extracted text | Existing `GET /api/documents/{id}` already returns extracted_text; Phase 4 ensures it is included in response |
| DOC-02 | In-browser PDF preview via PDF.js proxy; no presigned URLs exposed | `GET /api/documents/{id}/content` — StreamingResponse with Range header support, bytes from MinIO via `asyncio.to_thread` |
</phase_requirements>
---
## Summary
Phase 4 builds entirely on the existing ORM schema (all required models — `Folder`, `Share`, `AuditLog`, `Document.folder_id`, `Share.recipient_id` — are already in `db/models.py`). The work is predominantly plumbing: creating new API router modules for folders, shares, and audit log; adding a single Alembic migration for the `users.pdf_open_mode` column and the tsvector GIN expression index; extending the Celery beat schedule with a daily audit export task; and extending the Vue 3 frontend with folder navigation, sharing modals, and the settings preference toggle.
The highest-risk areas are: (1) the PDF streaming proxy with Range header support (needs careful byte-range parsing to handle PDF viewer seeking), (2) the tsvector GIN index and corresponding Alembic migration (Alembic autogenerate has known quirks with `to_tsvector()` expression indexes — the migration must be written manually), and (3) the folder cascade-delete with atomic quota decrement for multiple documents (each document's size_bytes must be summed and decremented in a single atomic UPDATE or via a pre-computed sum).
No new Python or npm packages are required for Phase 4. The decision to use native browser PDF rendering (D-09) eliminates the PDF.js dependency. All needed libraries — FastAPI, SQLAlchemy, MinIO SDK, Celery, Pydantic — are already pinned in `requirements.txt`. The frontend needs only Tailwind utility classes; no new npm packages.
**Primary recommendation:** Build in 6 plans: Wave 0 test scaffolds + migration; folders backend; shares backend + audit service; streaming proxy + search; audit log admin API + Celery export task; frontend folder navigation + sharing + settings.
---
## Architectural Responsibility Map
| Capability | Primary Tier | Secondary Tier | Rationale |
|------------|-------------|----------------|-----------|
| Folder CRUD + breadcrumb path | API / Backend | Database / Storage | Ownership assertion and cascade-delete logic belongs in the API layer; path reconstruction is a DB query |
| Document move between folders | API / Backend | Database / Storage | Ownership assertions on both document and target folder must be co-located in the API handler |
| Full-text search (tsvector) | Database / Storage | API / Backend | GIN index and `plainto_tsquery` execute in PostgreSQL; API adds scope filter (`user_id`) |
| Share grant / revoke | API / Backend | Database / Storage | IDOR prevention requires ownership assertion before DB write; share table is the single source of truth |
| "Shared with me" virtual folder | API / Backend | — | Pure query filter on `shares.recipient_id`; no DB folder row; quota enforcement is in the API layer |
| PDF streaming proxy | API / Backend | Database / Storage | Bytes flow from MinIO through FastAPI to the browser; presigned URLs never reach the browser |
| Per-user PDF open preference | API / Backend | Database / Storage | User preference column (`users.pdf_open_mode`) read/written via `/api/me/preferences` |
| Audit log write | API / Backend | — | `write_audit_log()` called inline in each handler after successful operation; no middleware |
| Audit log viewer + export | API / Backend | Database / Storage | Admin-only paginated query with filters; CSV export is a StreamingResponse from a filtered DB query |
| Daily audit backup (Celery beat) | API / Backend (worker) | Database / Storage | Celery beat task runs in the worker; queries DB and uploads CSV to MinIO `audit-logs` bucket |
| Folder navigation UI state | Frontend Server (SSR) → Browser / Client | — | Folder ID and breadcrumb path held in Vue `ref()` in `HomeView.vue`; no server-side state |
| Share modal + indicator | Browser / Client | — | Share button on DocumentCard; modal state is local component `ref()`; share list in documents store |
| PDF open preference toggle | Browser / Client | API / Backend | Toggle in SettingsView.vue reads/writes preference via `PATCH /api/me/preferences` |
| `credentials_enc` exclusion (SEC-08) | API / Backend | — | Pydantic response model `CloudConnectionOut` excludes the field; never reaches serializer |
---
## Standard Stack
### Core (all already in requirements.txt / package.json)
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| FastAPI | 0.136+ | API router for folders, shares, audit, proxy, preferences | Already in use; `APIRouter` + `StreamingResponse` needed for proxy |
| SQLAlchemy 2.0 async | 2.0+ | ORM queries for all new endpoints | Established pattern in project; async session already wired |
| Alembic | 1.18+ | Migration 0004 — `users.pdf_open_mode` column + tsvector GIN index | Established migration toolchain |
| MinIO Python SDK | 7.2+ | `get_object()` for streaming proxy; `put_object()` for audit CSV upload | Already in use; `asyncio.to_thread()` pattern established |
| Celery + Redis | 5.5+ | Daily audit export beat task | Already in use; `beat_schedule` in `celery_app.py` |
| Pydantic v2 | 2.0+ | `CloudConnectionOut` response model excluding `credentials_enc` (SEC-08) | Already in use for all request/response models |
| Vue 3 (Options API) | 3.4+ | Folder nav, share modal, settings toggle | Already in use; `ref()` + `watch()` patterns established |
[VERIFIED: codebase — requirements.txt and package.json]
### No New Packages Required
Phase 4 introduces zero new dependencies. Key rationale:
- PDF viewer: D-09 chose native browser rendering — no PDF.js
- Debounce: hand-rolled `setTimeout`/`clearTimeout` in Vue `watch` callback (< 10 lines, no lodash needed)
- CSV export: Python stdlib `csv.DictWriter` with `io.StringIO` — no pandas
- Search: PostgreSQL built-in `plainto_tsquery` — no Elasticsearch or external search engine
[VERIFIED: codebase — no new deps required]
---
## Package Legitimacy Audit
No new packages are installed in Phase 4. All runtime dependencies are carried forward from Phases 13 with pinned versions in `requirements.txt`.
**Packages removed due to slopcheck [SLOP] verdict:** none
**Packages flagged as suspicious [SUS]:** none
**New packages to install:** none
---
## Architecture Patterns
### System Architecture Diagram
```
Browser
├── GET /api/documents?q=search&folder_id=X&sort=name
│ FastAPI → SQLAlchemy → PostgreSQL (tsvector @@ plainto_tsquery, scoped to user)
│ ← JSON: [{id, filename, size_bytes, is_shared, folder_id, ...}]
├── POST /api/folders / PATCH /api/folders/{id} / DELETE /api/folders/{id}
│ FastAPI → ownership assertion → SQLAlchemy → PostgreSQL
│ DELETE: sum(doc.size_bytes) → atomic quota decrement → MinIO delete_object per doc
├── POST /api/shares / DELETE /api/shares/{id}
│ FastAPI → ownership assertion (owner_id = current_user.id) → SQLAlchemy → PostgreSQL
│ GET /api/shares/received → filter shares.recipient_id = current_user.id
├── GET /api/documents/{id}/content [Range: bytes=X-Y]
│ FastAPI (get_regular_user) → ownership OR share-access check → SQLAlchemy (object_key)
│ → MinIO get_object() via asyncio.to_thread() → StreamingResponse (206 or 200)
│ ← bytes with Content-Type, Content-Disposition: inline, Accept-Ranges: bytes
│ NOTE: no presigned URL generated; bytes flow through FastAPI
├── PATCH /api/me/preferences
│ FastAPI → SQLAlchemy UPDATE users.pdf_open_mode → ← {pdf_open_mode}
└── GET /api/admin/audit-log [?start=&end=&user_id=&event_type=&page=]
FastAPI (get_current_admin) → SQLAlchemy → PostgreSQL
← JSON: {items: [{id, event_type, user_id, actor_id, ip_address, metadata_, created_at}], total}
GET /api/admin/audit-log/export?format=csv|json
FastAPI (get_current_admin) → query → StreamingResponse (text/csv or application/json)
Celery Beat Worker (daily at midnight UTC)
└── audit_log_daily_export task
→ AsyncSessionLocal → SELECT audit_log WHERE date = yesterday
→ csv.DictWriter → io.BytesIO
→ MinIO put_object("audit-logs", "YYYY-MM-DD.csv", bytes)
```
### Recommended Project Structure (new files only)
```
backend/
├── api/
│ ├── folders.py # FOLD-01, FOLD-02, FOLD-03, FOLD-04, FOLD-05
│ ├── shares.py # SHARE-01..05
│ └── audit.py # ADMIN-06 (admin viewer + export)
├── services/
│ └── audit.py # write_audit_log() helper
├── tasks/
│ └── audit_tasks.py # audit_log_daily_export Celery beat task
└── migrations/versions/
└── 0004_phase4_pdf_open_mode_tsvector.py
frontend/src/
├── components/
│ ├── documents/
│ │ └── ShareModal.vue # D-05
│ ├── layout/
│ │ └── BreadcrumbNav.vue # FOLD-03
│ └── admin/
│ └── AdminAuditLogTab.vue # ADMIN-06 (D-15, D-16)
├── stores/
│ └── folders.js # folder state + actions
└── views/
└── SettingsView.vue # extend with PDF preference toggle (D-10)
```
### Pattern 1: Ownership Assertion (established in Phase 3 — D-16)
**What:** Every resource endpoint asserts `resource.user_id == current_user.id`. Cross-user access returns 404 (not 403) to prevent attacker enumeration of resource IDs.
**When to use:** All folder CRUD, document move, share grant/revoke (owner side), streaming proxy.
```python
# Source: backend/api/documents.py (established pattern)
doc = await session.get(Document, uid)
if doc is None or doc.user_id != current_user.id:
raise HTTPException(status_code=404, detail="Document not found")
```
**Folder variant:**
```python
folder = await session.get(Folder, folder_id)
if folder is None or folder.user_id != current_user.id:
raise HTTPException(status_code=404, detail="Folder not found")
```
[VERIFIED: codebase — backend/api/documents.py]
### Pattern 2: Atomic Quota Decrement for Folder Cascade-Delete
**What:** When deleting a folder, collect all document `size_bytes` first, then issue a single atomic quota UPDATE. Never read-then-write in Python.
**When to use:** `DELETE /api/folders/{id}` when folder is non-empty.
```python
# Source: established atomic pattern (CLAUDE.md + Phase 3 D-07)
# Step 1: collect document IDs and total size within the folder subtree
result = await session.execute(
select(Document.id, Document.size_bytes, Document.object_key)
.where(Document.folder_id == folder_id, Document.user_id == current_user.id)
)
docs = result.all()
total_bytes = sum(row.size_bytes for row in docs)
# Step 2: atomic quota decrement (CASE WHEN pattern — SQLite compatible)
await session.execute(
text(
"UPDATE quotas SET used_bytes = "
"CASE WHEN used_bytes > :delta THEN used_bytes - :delta ELSE 0 END "
"WHERE user_id = :uid"
),
{"delta": total_bytes, "uid": str(current_user.id)},
)
# Step 3: delete MinIO objects (best-effort)
for row in docs:
try:
await get_storage_backend().delete_object(row.object_key)
except Exception:
pass # MinIO cleanup is best-effort
# Step 4: delete documents (cascade via FK) then folder
await session.execute(
delete(Document).where(Document.folder_id == folder_id)
)
await session.delete(folder)
await session.commit()
```
[ASSUMED: the CASE WHEN pattern for SQLite compat is established in STATE.md; verified in existing delete_document service function]
### Pattern 3: PDF Streaming Proxy with Range Headers
**What:** `GET /api/documents/{id}/content` fetches bytes from MinIO and streams them with Range header support. Status 206 for range requests; 200 for full fetches.
**When to use:** DOC-02. Required for PDF viewer to seek pages without re-downloading the entire file.
```python
# Source: FastAPI discussion #7718 (CITED: github.com/fastapi/fastapi/discussions/7718)
from fastapi import Request
from fastapi.responses import StreamingResponse
def _parse_range(range_header: str, file_size: int) -> tuple[int, int]:
"""Parse 'bytes=start-end' Range header. Raises 416 on invalid range."""
try:
h = range_header.replace("bytes=", "").split("-")
start = int(h[0]) if h[0] != "" else 0
end = int(h[1]) if h[1] != "" else file_size - 1
except (ValueError, IndexError):
raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
if start > end or start < 0 or end >= file_size:
raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
return start, end
@router.get("/{doc_id}/content")
async def stream_document_content(
doc_id: str,
request: Request,
session: AsyncSession = Depends(get_db),
current_user: User = Depends(get_regular_user),
):
# Ownership assertion (DOC-02 also allows shared-with-me access — see Pattern 4)
doc = await session.get(Document, uid)
if doc is None or doc.user_id != current_user.id:
# Also allow if document is shared with current_user
share = await _get_share_for_user(session, uid, current_user.id)
if share is None:
raise HTTPException(404, "Document not found")
# Use the document from the share's owner context
# (doc is already loaded; ownership check skipped for shared access)
# Fetch all bytes from MinIO (asyncio.to_thread wraps sync SDK)
file_bytes = await get_storage_backend().get_object(doc.object_key)
file_size = len(file_bytes)
range_header = request.headers.get("range")
headers = {
"content-type": doc.content_type,
"content-disposition": f"inline; filename=\"{doc.filename}\"",
"accept-ranges": "bytes",
"content-length": str(file_size),
}
if range_header:
start, end = _parse_range(range_header, file_size)
chunk = file_bytes[start:end + 1]
headers["content-range"] = f"bytes {start}-{end}/{file_size}"
headers["content-length"] = str(len(chunk))
return StreamingResponse(
iter([chunk]),
status_code=206,
headers=headers,
)
return StreamingResponse(
iter([file_bytes]),
status_code=200,
headers=headers,
)
```
**Critical:** `get_regular_user` dep ensures admins cannot access document content (DOC-02 / CLAUDE.md). No presigned URL is generated at any point.
[CITED: https://github.com/fastapi/fastapi/discussions/7718]
[CITED: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/206]
### Pattern 4: Shared-Document Access for Streaming Proxy
**What:** The streaming proxy must allow both document owners AND recipients to access content.
**When to use:** Only in `GET /api/documents/{id}/content` — sharing grants read access to the content stream.
```python
# Source: CONTEXT.md D-08 — recipient can preview shared document
async def _can_access_document(
session: AsyncSession, doc: Document, current_user_id: uuid.UUID
) -> bool:
"""True if user owns the document OR has an active share as recipient."""
if doc.user_id == current_user_id:
return True
result = await session.execute(
select(Share).where(
Share.document_id == doc.id,
Share.recipient_id == current_user_id,
)
)
return result.scalar_one_or_none() is not None
```
[ASSUMED: the share-based access for the streaming proxy is implied by D-08 and SHARE-02; not explicitly called out in a prior phase decision]
### Pattern 5: PostgreSQL tsvector Full-Text Search
**What:** GIN expression index on `documents.extracted_text` for full-text search. Query uses `plainto_tsquery` (accepts natural language; no operator syntax required from user).
**When to use:** FOLD-05 — `GET /api/documents?q=<query>`.
**Migration (manual — do NOT use Computed() to avoid Alembic autogenerate noise):**
```python
# Source: CITED: https://www.postgresql.org/docs/current/textsearch-tables.html
# In migration 0004:
op.execute(
"CREATE INDEX ix_documents_fts ON documents "
"USING GIN (to_tsvector('english', coalesce(extracted_text, '')))"
)
```
**Query pattern:**
```python
# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func, text
stmt = (
select(Document)
.where(
Document.user_id == current_user.id,
func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
func.plainto_tsquery("english", q)
),
)
.order_by(Document.created_at.desc())
)
```
**Why `plainto_tsquery` not `to_tsquery`:** `plainto_tsquery` accepts unstructured natural language ("invoice march 2024") without requiring the user to supply `&` and `|` operators. Simpler and safer for a search bar input.
**Why expression index not generated/stored column:** PostgreSQL generated columns with `TSVECTOR` type require PostgreSQL 12+ and add schema complexity. The expression index approach is simpler and Alembic-compatible (written as raw SQL in the migration; autogenerate will not re-detect it if managed manually).
**Important Alembic caveat:** `include_schemas=True` autogenerate WILL repeatedly flag the expression index for recreation (known Alembic issue: github.com/sqlalchemy/alembic/issues/1390). The migration MUST be written manually and committed once. Do not run `alembic revision --autogenerate` after adding the GIN index without reviewing the output.
[CITED: https://www.postgresql.org/docs/current/textsearch-tables.html]
[CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html]
[CITED: https://github.com/sqlalchemy/alembic/issues/1390]
### Pattern 6: write_audit_log() Helper
**What:** Shared service function called inline in each handler after a successful operation. Does NOT raise on failure (audit log writing is best-effort — do not fail the primary operation).
**When to use:** Every handler in Phase 4 that performs state-changing operations; plus back-filling into `auth.py` and `admin.py`.
```python
# Source: CONTEXT.md D-14
# backend/services/audit.py
from __future__ import annotations
import uuid
from typing import Optional
from sqlalchemy.ext.asyncio import AsyncSession
from db.models import AuditLog
import logging
logger = logging.getLogger(__name__)
async def write_audit_log(
session: AsyncSession,
event_type: str,
user_id: Optional[uuid.UUID],
actor_id: Optional[uuid.UUID],
resource_id: Optional[uuid.UUID],
ip_address: Optional[str],
metadata_: Optional[dict] = None,
) -> None:
"""Write an audit log entry. Never raises — audit failure is non-fatal."""
try:
entry = AuditLog(
event_type=event_type,
user_id=user_id,
actor_id=actor_id,
resource_id=resource_id,
ip_address=ip_address,
metadata_=metadata_,
)
session.add(entry)
await session.flush() # flush within the handler's existing transaction
except Exception as exc:
logger.warning("audit log write failed: %s", exc)
# Do not re-raise — audit failure must never abort the primary operation
```
**Critical: `session.flush()` not `session.commit()`** — audit log writes within the same transaction as the primary operation, so the primary operation's `session.commit()` commits the audit entry too. This avoids partial writes where the operation succeeds but the audit entry is orphaned.
[ASSUMED: flush-not-commit pattern is the safest approach given transactional semantics; not explicitly documented in a prior phase decision]
### Pattern 7: CSV Export (admin endpoint)
**What:** `GET /api/admin/audit-log/export?format=csv` returns a StreamingResponse with `text/csv` content type and `Content-Disposition: attachment`.
```python
# Source: Python stdlib csv module
import csv
import io
from fastapi.responses import StreamingResponse
async def export_audit_log(
session: AsyncSession,
start: Optional[datetime],
end: Optional[datetime],
user_id: Optional[uuid.UUID],
event_type: Optional[str],
format: str = "csv",
):
rows = await _query_audit_log(session, start, end, user_id, event_type)
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=[
"id", "event_type", "user_id", "actor_id",
"resource_id", "ip_address", "metadata_", "created_at"
])
writer.writeheader()
for row in rows:
writer.writerow({
"id": row.id,
"event_type": row.event_type,
"user_id": str(row.user_id) if row.user_id else "",
"actor_id": str(row.actor_id) if row.actor_id else "",
"resource_id": str(row.resource_id) if row.resource_id else "",
"ip_address": str(row.ip_address) if row.ip_address else "",
"metadata_": str(row.metadata_) if row.metadata_ else "",
"created_at": row.created_at.isoformat(),
})
output.seek(0)
return StreamingResponse(
iter([output.getvalue()]),
media_type="text/csv",
headers={"Content-Disposition": f"attachment; filename=audit-export.csv"},
)
```
[CITED: Python stdlib csv documentation — https://docs.python.org/3/library/csv.html]
### Pattern 8: Celery Beat — Daily Audit Export Task
**What:** A new task `audit_log_daily_export` added to `beat_schedule`. Uses `crontab(hour=0, minute=0)` for midnight UTC.
```python
# Source: CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
from celery.schedules import crontab
# In celery_app.py beat_schedule dict — ADD to existing schedule:
"audit-log-daily-export": {
"task": "tasks.audit_tasks.audit_log_daily_export",
"schedule": crontab(hour=0, minute=0), # midnight UTC
},
```
**Task implementation pattern (mirrors cleanup_abandoned_uploads):**
```python
@celery_app.task(name="tasks.audit_tasks.audit_log_daily_export")
def audit_log_daily_export() -> dict:
return asyncio.run(_run_daily_export())
async def _run_daily_export() -> dict:
from datetime import date, timedelta
import csv, io
from db.session import AsyncSessionLocal
from db.models import AuditLog
from storage import get_storage_backend
from sqlalchemy import select
yesterday = date.today() - timedelta(days=1)
start = datetime(yesterday.year, yesterday.month, yesterday.day, tzinfo=timezone.utc)
end = start + timedelta(days=1)
async with AsyncSessionLocal() as session:
result = await session.execute(
select(AuditLog)
.where(AuditLog.created_at >= start, AuditLog.created_at < end)
.order_by(AuditLog.created_at)
)
rows = result.scalars().all()
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=[...])
writer.writeheader()
for row in rows:
writer.writerow({...})
csv_bytes = output.getvalue().encode("utf-8")
key = f"audit-logs/{yesterday.isoformat()}.csv"
backend = get_storage_backend()
await backend.put_object_raw(
bucket="audit-logs",
key=key,
data=io.BytesIO(csv_bytes),
length=len(csv_bytes),
content_type="text/csv",
)
return {"exported": len(rows), "key": key}
```
**Note:** `MinIOBackend` may need a `put_object_raw(bucket, key, data, length, content_type)` method (vs the existing `put_object` which constructs the key schema). The audit-logs bucket uses a different key scheme. Add this method in the migration plan.
[CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html]
### Pattern 9: Pydantic Response Model Exclusion (SEC-08)
**What:** Explicit `response_model=CloudConnectionOut` where `CloudConnectionOut` omits `credentials_enc`. This is the safe-by-default Pydantic v2 approach.
```python
# Source: CITED: https://fastapi.tiangolo.com/tutorial/response-model/
from pydantic import BaseModel
from datetime import datetime
class CloudConnectionOut(BaseModel):
id: str
provider: str
display_name: str
status: str
connected_at: datetime
# credentials_enc is deliberately absent — never serialized (SEC-08)
model_config = {"from_attributes": True}
```
**Phase 4 scope:** No cloud connection endpoints exist yet (Phase 5). Phase 4's obligation is to ensure the `CloudConnectionOut` model is defined and used as `response_model` on any admin endpoint that touches `cloud_connections`. Since no cloud connection endpoints exist yet, Phase 4 creates the model in `schemas/cloud.py` (or inline in `api/admin.py`) so Phase 5 cannot accidentally expose the field.
[CITED: https://fastapi.tiangolo.com/tutorial/response-model/]
### Pattern 10: Vue 3 Debounced Search (no external dependency)
**What:** Watch the search query ref with a manual debounce using `setTimeout`/`clearTimeout`. Does not require lodash or VueUse.
```javascript
// Source: Vue 3 Composition API watch pattern (Options API compatible via watch option)
// frontend/src/stores/documents.js
import { ref, watch } from 'vue'
const searchQuery = ref('')
let _searchTimer = null
watch(searchQuery, (newVal) => {
clearTimeout(_searchTimer)
if (newVal.length < 2) {
// Clear results or show all
return
}
_searchTimer = setTimeout(() => {
fetchDocuments({ q: newVal, folderId: currentFolderId.value })
}, 300)
})
```
[ASSUMED: 300ms is conventional for search debounce; no official Vue doc specifies this value]
### Anti-Patterns to Avoid
- **Using `session.commit()` inside `write_audit_log()`:** Must use `flush()` to stay within the handler's transaction. A separate commit risks committing the audit entry without the primary operation.
- **Generating presigned URLs in the streaming proxy:** D-08 explicitly prohibits this. Never call `presigned_get_url()` from the proxy endpoint. The bytes must flow through FastAPI.
- **`to_tsquery` instead of `plainto_tsquery`:** `to_tsquery` requires the user to supply `&`, `|`, and `!` operators. Search bar input uses `plainto_tsquery` which accepts natural language phrases.
- **Using `response_model_exclude` on CloudConnection:** Prefer defining an explicit `CloudConnectionOut` model that never includes `credentials_enc`. `response_model_exclude` is fragile (requires remembering to pass the set on every endpoint). Whitelist is safer.
- **Storing breadcrumb state in the URL query string:** Breadcrumb is derived from the current `folder_id` parameter — only the current folder ID needs to be in the URL (or in Vue state). The full path is reconstructed from the DB on navigation.
- **Recursive SQL for breadcrumb without depth limit:** `WITH RECURSIVE` CTE is correct but should include a depth guard (`WHERE depth < 50`) to prevent infinite recursion on a corrupted self-referential FK.
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Range header parsing | Custom byte-range logic from scratch | Pattern 3 above (from FastAPI discussion #7718) | Browser PDFs use RFC 7233 range requests; edge cases (open-ended ranges `bytes=500-`) must be handled |
| Full-text search ranking | BM25 in Python | PostgreSQL `ts_rank()` + `plainto_tsquery` | Runs in the DB; GIN index makes it fast; no external service needed |
| CSV serialization | String concatenation with escaping | Python stdlib `csv.DictWriter` | Handles quoting, escaping, newlines in field values automatically |
| Audit log middleware | Celery task or FastAPI middleware for audit writes | Inline `write_audit_log()` after successful operation | Middleware fires on request start — cannot know if operation succeeded; Celery task adds latency and potential missed writes on worker failure |
| Tsvector maintenance | Application-layer `extracted_text` → tsvector sync | PostgreSQL GIN expression index `to_tsvector('english', extracted_text)` | DB recalculates automatically on UPDATE; no trigger or application sync needed |
| Debounce utility | npm package (lodash, VueUse) | `setTimeout`/`clearTimeout` in watch | < 10 lines; no new dependency justified |
**Key insight:** PostgreSQL handles full-text search natively with GIN indexes at query time — no external search infrastructure (Elasticsearch, Typesense, Meilisearch) is needed or justified at this scale.
---
## Common Pitfalls
### Pitfall 1: Folder Cascade-Delete Misses Sub-folder Documents
**What goes wrong:** `DELETE /api/folders/{id}` deletes direct child documents but leaves orphaned documents in sub-folders. Quota becomes incorrect.
**Why it happens:** A naive `WHERE folder_id = :id` only catches immediate children. Self-referential FK with unlimited nesting requires a recursive query to collect the full subtree.
**How to avoid:** Use `WITH RECURSIVE` CTE to collect all folder IDs in the subtree before deleting:
```sql
WITH RECURSIVE subtree AS (
SELECT id FROM folders WHERE id = :root_id AND user_id = :uid
UNION ALL
SELECT f.id FROM folders f
JOIN subtree s ON f.parent_id = s.id
WHERE f.user_id = :uid AND depth < 50
)
SELECT d.id, d.size_bytes, d.object_key
FROM documents d
JOIN subtree s ON d.folder_id = s.id
WHERE d.user_id = :uid
```
**Warning signs:** After deleting a folder tree, `quotas.used_bytes` does not match `SUM(documents.size_bytes)`.
[ASSUMED: SQLite does not support WITH RECURSIVE for test runs — integration tests requiring this pattern must use PostgreSQL (INTEGRATION=1)]
### Pitfall 2: Alembic Autogenerate Repeatedly Detects tsvector GIN Index
**What goes wrong:** After adding the GIN expression index in migration 0004, running `alembic revision --autogenerate` again generates a migration that drops and recreates the same index.
**Why it happens:** Alembic's autogenerate does not understand functional/expression indexes on PostgreSQL when the expression involves function calls like `to_tsvector()`. Known issue in Alembic.
**How to avoid:** Write the GIN index migration manually (as raw `op.execute(CREATE INDEX...)`) and do NOT add the index to the ORM model's `__table_args__`. Keep it only in the migration. Document with a comment `# managed manually — do not autogenerate`.
**Warning signs:** Repeated `alembic revision --autogenerate` output shows DROP INDEX / CREATE INDEX for `ix_documents_fts` on every run.
[CITED: https://github.com/sqlalchemy/alembic/issues/1390]
### Pitfall 3: PDF Streaming Proxy — Admin Access
**What goes wrong:** Admin user calls `GET /api/documents/{id}/content` and receives document bytes.
**Why it happens:** Using `get_current_user` instead of `get_regular_user` on the proxy endpoint.
**How to avoid:** The proxy endpoint MUST use `get_regular_user` dep, which raises HTTP 403 for `user.role == 'admin'`. This is established in Phase 3 (D-16) and must be applied to the proxy endpoint.
**Warning signs:** Security test `test_admin_cannot_access_document_content` fails or is absent.
[VERIFIED: codebase — backend/deps/auth.py `get_regular_user` raises 403 for admin role]
### Pitfall 4: Share IDOR — Recipient Accesses Another Recipient's Share
**What goes wrong:** User A shares document with User B. User C (also a recipient of a different share on the same document, or just any user) can enumerate share IDs and revoke or view shares they don't own.
**Why it happens:** `DELETE /api/shares/{share_id}` checks only that the share exists, not that `share.owner_id == current_user.id`.
**How to avoid:** Ownership assertion on all share mutation endpoints:
```python
share = await session.get(Share, share_id)
if share is None or share.owner_id != current_user.id:
raise HTTPException(404, "Share not found")
```
**Warning signs:** Security test `test_share_revoke_wrong_owner_returns_404` is absent.
[ASSUMED: standard IDOR prevention pattern per CLAUDE.md; test must be written]
### Pitfall 5: write_audit_log Does Not Include IP Address
**What goes wrong:** `ip_address` column is left NULL for all events because the handler does not extract it from the request.
**Why it happens:** The `request.client.host` extraction is not wired into `write_audit_log()` calls.
**How to avoid:** Inject `request: Request` into each handler and pass `request.client.host` as `ip_address` to `write_audit_log()`. For handlers behind a reverse proxy, use `request.headers.get("X-Forwarded-For", request.client.host)`.
[ASSUMED: IP extraction from FastAPI Request is standard; X-Forwarded-For handling depends on deployment]
### Pitfall 6: Folder Name Uniqueness Constraint Violation
**What goes wrong:** Creating two folders with the same name under the same parent raises an IntegrityError from the `uq_folders_user_parent_name` constraint, which surfaces as HTTP 500.
**Why it happens:** The `Folder` model has `UniqueConstraint("user_id", "parent_id", "name")`. The API handler does not check for duplicates before inserting.
**How to avoid:** Catch `IntegrityError` from the ORM insert and return HTTP 409 Conflict:
```python
from sqlalchemy.exc import IntegrityError
try:
session.add(folder)
await session.commit()
except IntegrityError:
await session.rollback()
raise HTTPException(409, "A folder with that name already exists here")
```
**Warning signs:** POST /api/folders returns 500 on duplicate name.
[VERIFIED: codebase — db/models.py Folder `UniqueConstraint("user_id", "parent_id", "name")`]
### Pitfall 7: "Shared with Me" Leaks Document Content
**What goes wrong:** The `GET /api/documents` list endpoint accidentally returns documents shared with the current user (not just owned by them), exposing filenames and extracted text to recipients.
**Why it happens:** A join on the `shares` table is added to the list query without scoping by ownership.
**How to avoid:** The standard list endpoint (`GET /api/documents`) MUST filter `WHERE documents.user_id = current_user.id` only. Shared documents are a separate endpoint `GET /api/shares/received`. Extracted text is not returned in the shared documents list view — only metadata.
[VERIFIED: codebase — services/storage.py `list_metadata` already scopes by `user_id`]
### Pitfall 8: MinIO Audit-Logs Bucket Does Not Exist
**What goes wrong:** The daily Celery beat task fails with `NoSuchBucket` error on first run.
**Why it happens:** The `audit-logs` bucket is not created in the migration or application lifespan.
**How to avoid:** Create the `audit-logs` bucket in migration 0004's `upgrade()` function (same pattern as the main documents bucket, gated on `MINIO_ENDPOINT` env var to allow SQLite test runs):
```python
# In migration 0004 post-DDL step
minio_endpoint = os.environ.get("MINIO_ENDPOINT")
if minio_endpoint:
client = Minio(minio_endpoint, ...)
if not client.bucket_exists("audit-logs"):
client.make_bucket("audit-logs")
```
[VERIFIED: codebase — migration 0003 shows the `MINIO_ENDPOINT`-gated pattern; audit-logs bucket creation follows same approach]
---
## Runtime State Inventory
> Phase 4 is a feature addition (not a rename/refactor). No runtime state migration is required.
| Category | Items Found | Action Required |
|----------|-------------|------------------|
| Stored data | No existing `pdf_open_mode` values; column is NEW (default `'in_app'`) | Alembic migration 0004 adds column with server_default — no data migration |
| Stored data | Existing `audit_log` table is EMPTY (no events written in prior phases) | No migration; Phase 4 backfills audit writes into handlers going forward |
| Live service config | `audit-logs` MinIO bucket does not exist | Created in migration 0004 post-DDL step (gated on MINIO_ENDPOINT) |
| OS-registered state | None — verified by codebase scan | None |
| Secrets/env vars | No new env vars required for Phase 4 features | None |
| Build artifacts | No stale egg-info or compiled artifacts from prior phases | None |
---
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| PostgreSQL | tsvector GIN index, CTE folder cascade | [runtime — Docker Compose] | 15+ | SQLite for unit tests (no tsvector; integration tests require INTEGRATION=1) |
| MinIO | Streaming proxy, audit CSV upload | [runtime — Docker Compose] | latest | Mock in tests (established pattern) |
| Redis | Celery beat daily export task | [runtime — Docker Compose] | latest | Not mockable for beat tasks; integration tests skip beat scheduling |
| Python csv (stdlib) | CSV export endpoint + Celery export task | ✓ | stdlib | — |
**Missing dependencies with no fallback:**
- PostgreSQL tsvector: cannot be tested without a real PostgreSQL instance. Tests that exercise FTS must be marked with `pytest.mark.skipif(not live_services_available, ...)` or use the `INTEGRATION=1` pattern.
**Missing dependencies with fallback:**
- SQLite in unit tests: most folder/share CRUD tests run fine on SQLite. Only FTS and `WITH RECURSIVE` CTE tests require PostgreSQL.
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | pytest + pytest-asyncio (already configured) |
| Config file | `pytest.ini` or `pyproject.toml` (check existing) |
| Quick run command | `pytest backend/tests/test_folders.py backend/tests/test_shares.py -x` |
| Full suite command | `cd backend && pytest -v` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| FOLD-01 | Create/rename/delete folder | Integration | `pytest backend/tests/test_folders.py::test_create_folder -x` | ❌ Wave 0 |
| FOLD-01 | Delete non-empty folder → warning count | Integration | `pytest backend/tests/test_folders.py::test_delete_folder_content_count -x` | ❌ Wave 0 |
| FOLD-02 | Move document to folder | Integration | `pytest backend/tests/test_folders.py::test_move_document -x` | ❌ Wave 0 |
| FOLD-02 | Move document — ownership assertion | Integration (negative) | `pytest backend/tests/test_folders.py::test_move_wrong_owner_404 -x` | ❌ Wave 0 |
| FOLD-03 | Breadcrumb path returned in folder response | Unit | `pytest backend/tests/test_folders.py::test_breadcrumb_path -x` | ❌ Wave 0 |
| FOLD-04 | Sort by name/date/size | Integration | `pytest backend/tests/test_folders.py::test_document_sort -x` | ❌ Wave 0 |
| FOLD-05 | tsvector search returns matching docs | Integration (PostgreSQL required) | `pytest backend/tests/test_folders.py::test_fts_search -x -m integration` | ❌ Wave 0 |
| SHARE-01 | Share by handle — handle not found → 404 | Integration | `pytest backend/tests/test_shares.py::test_share_handle_not_found -x` | ❌ Wave 0 |
| SHARE-01 | Share by handle — success | Integration | `pytest backend/tests/test_shares.py::test_share_success -x` | ❌ Wave 0 |
| SHARE-02 | Shared doc in recipient virtual folder | Integration | `pytest backend/tests/test_shares.py::test_shared_with_me -x` | ❌ Wave 0 |
| SHARE-02 | Shared doc — zero quota charged to recipient | Integration | `pytest backend/tests/test_shares.py::test_share_no_quota_impact -x` | ❌ Wave 0 |
| SHARE-04 | Revoke share — immediate for recipient | Integration | `pytest backend/tests/test_shares.py::test_revoke_share -x` | ❌ Wave 0 |
| SHARE-01..04 | Share IDOR — wrong owner cannot revoke | Security (negative) | `pytest backend/tests/test_shares.py::test_share_revoke_wrong_owner_404 -x` | ❌ Wave 0 |
| SEC-08 | credentials_enc absent from all responses | Security (negative) | `pytest backend/tests/test_security.py::test_credentials_enc_not_in_response -x` | ❌ Wave 0 |
| SEC-09 | Account deletion deletes user files | Integration | `pytest backend/tests/test_admin_api.py::test_delete_user_cleans_files -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy returns bytes, 200 or 206 | Integration | `pytest backend/tests/test_documents.py::test_content_stream_200 -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — no presigned URL in response | Security (negative) | `pytest backend/tests/test_documents.py::test_content_stream_no_presigned_url -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — Range header → 206 | Integration | `pytest backend/tests/test_documents.py::test_content_stream_206_range -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — admin blocked (403) | Security (negative) | `pytest backend/tests/test_documents.py::test_content_stream_admin_403 -x` | ❌ Wave 0 |
| ADMIN-06 | Audit log viewer — no document content in entries | Security (negative) | `pytest backend/tests/test_audit.py::test_audit_log_no_doc_content -x` | ❌ Wave 0 |
| ADMIN-06 | Audit log viewer — admin only | Security (negative) | `pytest backend/tests/test_audit.py::test_audit_log_regular_user_403 -x` | ❌ Wave 0 |
### Sampling Rate
- **Per task commit:** `pytest backend/tests/test_folders.py backend/tests/test_shares.py backend/tests/test_audit.py backend/tests/test_documents.py -x`
- **Per wave merge:** `cd backend && pytest -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `backend/tests/test_folders.py` — covers FOLD-01..05
- [ ] `backend/tests/test_shares.py` — covers SHARE-01..05 + IDOR security tests
- [ ] `backend/tests/test_audit.py` — covers ADMIN-06 + no-doc-content security tests
- [ ] `backend/tests/test_documents.py` — add proxy tests (test_content_stream_*) to existing file
- [ ] `backend/tests/test_security.py` — add SEC-08, SEC-09 tests (may be in test_admin_api.py)
---
## Security Domain
### Applicable ASVS Categories
| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | yes | `get_regular_user` / `get_current_admin` deps on all new endpoints |
| V3 Session Management | no | No new session mechanisms introduced |
| V4 Access Control | yes | Ownership assertion on all folder/share endpoints; `get_regular_user` rejects admin on proxy |
| V5 Input Validation | yes | Pydantic models on all request bodies; `doc_id` parsed via `uuid.UUID()` |
| V6 Cryptography | no | No new cryptographic operations; credentials still encrypted via Phase 2 HKDF pattern |
### Known Threat Patterns for this Phase
| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| Share IDOR — revoke another user's share | Elevation of privilege | `share.owner_id == current_user.id` → 404 on mismatch |
| Share IDOR — access shared doc content without share | Information disclosure | `_can_access_document()` check in proxy endpoint: owner OR active share |
| PDF proxy leaking presigned URL | Information disclosure | `get_object()` fetches bytes directly; presigned URL never generated in proxy handler |
| Admin accessing document content via proxy | Broken access control | `get_regular_user` dep raises 403 for admin role |
| Folder IDOR — delete another user's folder | Tampering | `folder.user_id == current_user.id` → 404 on mismatch |
| Audit log containing document content | Sensitive data exposure | `write_audit_log()` metadata_ MUST NOT include `filename`, `extracted_text`, or file bytes |
| Audit log admin access by regular user | Broken access control | `GET /api/admin/audit-log` uses `get_current_admin` dep |
| Path traversal via `folder_id` parameter | Information disclosure | All folder/document lookups via DB primary key; `uuid.UUID()` parse validates format |
| Tsvector search scope leak — user sees others' docs | Information disclosure | FTS query MUST include `Document.user_id == current_user.id` scope filter |
| credentials_enc in serialized admin response | Sensitive data exposure | `CloudConnectionOut` excludes field; never in any response |
### Security Gate Checklist (Phase 4 specific)
- [ ] `bandit -r backend/` — zero HIGH findings in new files (`api/folders.py`, `api/shares.py`, `api/audit.py`, `services/audit.py`, `tasks/audit_tasks.py`)
- [ ] `pip audit` — zero new critical/high CVEs (no new packages)
- [ ] `npm audit --audit-level=high` — zero (no new npm packages)
- [ ] `test_share_revoke_wrong_owner_404` passes
- [ ] `test_content_stream_admin_403` passes
- [ ] `test_content_stream_no_presigned_url` passes
- [ ] `test_audit_log_no_doc_content` passes — verifies `filename`, `extracted_text` absent from all audit entries
- [ ] `test_credentials_enc_not_in_response` passes
- [ ] `test_fts_search_scoped_to_owner` passes — other user's docs not in search results
---
## Code Examples
### Folder Breadcrumb Path Query (PostgreSQL CTE)
```sql
-- Source: CITED: https://www.postgresql.org/docs/current/queries-with.html
WITH RECURSIVE breadcrumb AS (
SELECT id, parent_id, name, 1 AS depth
FROM folders
WHERE id = :folder_id AND user_id = :uid
UNION ALL
SELECT f.id, f.parent_id, f.name, b.depth + 1
FROM folders f
JOIN breadcrumb b ON f.id = b.parent_id
WHERE f.user_id = :uid AND b.depth < 50
)
SELECT id, name, depth FROM breadcrumb ORDER BY depth DESC;
```
Returns path from root to current folder (root = highest depth value after `ORDER BY depth DESC`). Each segment is `{id, name}` for clickable navigation.
### Full-text Search Query (SQLAlchemy)
```python
# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func
from sqlalchemy.dialects.postgresql import TSVECTOR
stmt = (
select(Document)
.where(
Document.user_id == current_user.id,
func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
func.plainto_tsquery("english", q)
),
)
.order_by(Document.created_at.desc())
.limit(per_page)
.offset((page - 1) * per_page)
)
```
### "Shared with me" Virtual Folder Query
```python
# Source: CONTEXT.md D-06
from sqlalchemy import select
from db.models import Share, Document
stmt = (
select(Document)
.join(Share, Share.document_id == Document.id)
.where(Share.recipient_id == current_user.id)
.order_by(Document.created_at.desc())
)
result = await session.execute(stmt)
shared_docs = result.scalars().all()
```
### `is_shared` indicator on document list
```python
# Source: CONTEXT.md SHARE-05 — owner's list shows "shared" badge
from sqlalchemy import exists, select
# Add to list_documents query:
shared_subq = (
select(Share.document_id)
.where(Share.owner_id == current_user.id)
.scalar_subquery()
)
# Add to Document select:
stmt = select(Document, Document.id.in_(shared_subq).label("is_shared"))
```
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| PostgreSQL `to_tsquery` (operator syntax required) | `plainto_tsquery` (natural language input) | PostgreSQL 9.3+ | Search bars should use `plainto_tsquery`; `to_tsquery` for programmatic use only |
| PDF.js for in-browser PDF viewing | Native browser PDF viewer via `Content-Type: application/pdf` + `Content-Disposition: inline` | Browser support ~2018+ | All modern browsers support native PDF rendering; PDF.js only needed for custom annotation features |
| Alembic autogenerate for GIN expression indexes | Manual migration SQL | Alembic 1.13+ (known bug) | Must write `op.execute("CREATE INDEX ... USING GIN")` manually |
| `asyncio.run()` in Celery tasks | `asyncio.run(_async_func())` is the standard bridge | Established in Phase 3 document_tasks.py | Celery sync task calls `asyncio.run()` to enter async context |
**Deprecated/outdated:**
- `to_tsvector` with stored Computed column: adds schema complexity; expression index is simpler and equally performant for this scale
- Separate audit log middleware: D-14 decision is inline writes; middleware approach not used
---
## Assumptions Log
| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | The streaming proxy's shared-document access check (Pattern 4) is implied by D-08 and SHARE-02 | Pattern 4, Security Domain | If recipients cannot access the proxy, "Shared with me" is useless; planner must confirm access rule |
| A2 | `session.flush()` (not `commit()`) inside `write_audit_log()` is the correct transactional pattern | Pattern 6 | Using `commit()` would create two separate transactions; if primary operation fails after audit commit, audit entry is orphaned with no corresponding operation |
| A3 | 300ms debounce interval for search | Pattern 10 | If too short, excessive API calls; if too long, UX feels sluggish. Standard convention, low risk |
| A4 | `WITH RECURSIVE` CTE for subtree cascade-delete is the correct PostgreSQL approach; SQLite tests must use INTEGRATION=1 | Pitfall 1, Common Pitfalls | SQLite does not support `WITH RECURSIVE` — all subtree cascade tests require PostgreSQL |
| A5 | `MinIOBackend.put_object_raw()` method needs to be added for audit-logs bucket (different key scheme from documents) | Pattern 8 | If not added, Celery export task cannot upload to `audit-logs` bucket using the correct key |
| A6 | The `audit-logs` MinIO bucket creation is safe to gate on `MINIO_ENDPOINT` env var (same as migration 0003 pattern) | Pitfall 8 | If MinIO is present but bucket creation is skipped, first export task fails |
**If this table is empty:** Not applicable — some assumptions remain in this research.
---
## Open Questions (RESOLVED)
1. **Subtree folder delete — CTE vs. iterative in Python**
- What we know: `WITH RECURSIVE` CTE works in PostgreSQL; SQLite does not support it
- What's unclear: Should the planner use the CTE (PostgreSQL-only, integration test) or iterative Python (works in SQLite, slower for deep trees)?
- Recommendation: Use CTE for the implementation (real database is PostgreSQL); mark the cascade-delete tests as `pytest.mark.skipif(not live_services_available, ...)` or use `INTEGRATION=1`
2. **Shared document access in proxy — `_can_access_document()` adds a DB query per request**
- What we know: Every `GET /api/documents/{id}/content` call currently does one DB lookup; adding share check adds a second
- What's unclear: Is the extra query acceptable given Phase 4 scale (single-user → small multi-user)?
- Recommendation: Add the share check unconditionally; at Phase 4 scale a second indexed query is negligible
3. **`PATCH /api/me/preferences` endpoint path**
- What we know: D-10 specifies this endpoint for pdf_open_mode
- What's unclear: Should it be on `/api/auth/me/preferences` (alongside `/api/auth/me`) or `/api/me/preferences` (new router)?
- Recommendation: Place on `/api/auth/me/preferences` (same router prefix as `/api/auth/me`) to keep auth-related user settings in the same module
---
## Sources
### Primary (HIGH confidence)
- `backend/db/models.py` — all ORM models verified: `Folder`, `Share`, `AuditLog`, `Document.folder_id`, `User` (missing `pdf_open_mode`), `CloudConnection` (`credentials_enc` column confirmed)
- `backend/api/documents.py` — ownership assertion pattern, `get_regular_user` dep, quota UPDATE pattern
- `backend/deps/auth.py``get_regular_user` raises 403 for admin; `get_current_admin` pattern
- `backend/celery_app.py` — existing `beat_schedule` structure; `cleanup_abandoned_uploads` as template
- `backend/storage/minio_backend.py``get_object()` confirmed; `asyncio.to_thread()` pattern confirmed
- `backend/services/storage.py``list_metadata` scoping by `user_id` confirmed
### Secondary (MEDIUM confidence)
- [FastAPI discussions #7718](https://github.com/fastapi/fastapi/discussions/7718) — Range header StreamingResponse pattern verified by multiple community confirmations
- [PostgreSQL docs — Full-text search tables](https://www.postgresql.org/docs/current/textsearch-tables.html) — GIN expression index SQL verified from official docs
- [SQLAlchemy PostgreSQL dialect docs](https://docs.sqlalchemy.org/en/20/dialects/postgresql.html) — `func.to_tsvector`, `plainto_tsquery` support confirmed
- [Celery periodic tasks docs](https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html) — `crontab(hour=0, minute=0)` pattern
- [FastAPI response model docs](https://fastapi.tiangolo.com/tutorial/response-model/) — `response_model=CloudConnectionOut` exclusion pattern
### Tertiary (LOW confidence)
- [Alembic issue #1390](https://github.com/sqlalchemy/alembic/issues/1390) — tsvector autogenerate bug confirmed by issue thread; marked for validation
---
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH — all packages already in requirements.txt; codebase verified
- Architecture: HIGH — all ORM models exist and were verified in codebase
- Pitfalls: MEDIUM — Pitfalls 1, 4 are confirmed from codebase + security requirements; Pitfall 2 confirmed from Alembic issue tracker
- Code patterns: MEDIUM — Range header pattern from FastAPI community discussion; PostgreSQL FTS from official docs
**Research date:** 2026-05-25
**Valid until:** 2026-06-25 (stable FastAPI/SQLAlchemy stack; no fast-moving dependencies)