# Phase 4: Folders, Sharing, Quotas & Document UX — Research
**Researched:** 2026-05-25
**Domain:** FastAPI folder/share CRUD, PostgreSQL tsvector, MinIO streaming proxy, Celery beat, Vue 3 folder navigation
**Confidence:** HIGH
---
## User Constraints (from CONTEXT.md)
### Locked Decisions
**D-01** Hybrid layout — AppSidebar shows top-level folders only. Sub-folders + breadcrumb in main content area. Top-level folders clickable directly in sidebar.
**D-02** Unlimited nesting depth (no API or UI cap). `Folder.parent_id` self-referential FK is authoritative. Breadcrumbs truncate at depth > 4 (show first + "..." + last 2).
**D-03** Non-empty folder delete: warning modal with document count. Confirm → cascade-delete all documents (MinIO + DB + quota). Cancel → no action. Documents are NOT moved to root — they are destroyed.
**D-04** Exact handle input — no autocomplete. API returns 404 if handle not found; UI shows "User not found" error.
**D-05** Share button on DocumentCard (inline icon button). Modal: (a) handle input, (b) Share button, (c) current recipients list with Revoke per row.
**D-06** "Shared with me" is a fixed virtual folder entry in AppSidebar, rendered above the user's own folder list. Filtered by `shares.recipient_id = current_user.id`. Zero quota charged to recipient.
**D-07** Share permission is `view` only for Phase 4. `edit` deferred.
**D-08** Streaming proxy endpoint: `GET /api/documents/{id}/content`. Returns bytes via FastAPI `StreamingResponse`. Supports `Range` headers. `Content-Disposition: inline`. No presigned URL ever generated or exposed. Uses `get_regular_user` dep.
**D-09** Native browser PDF rendering — no PDF.js. Zero frontend dependencies added. Content-Type header drives browser rendering.
**D-10** `users.pdf_open_mode` column (String, default `'in_app'`). Exposed via `PATCH /api/me/preferences`. `in_app` = modal with `
---
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|------------------|
| FOLD-01 | User can create, rename, and delete folders (delete confirms content count before proceeding) | `Folder` model complete in `db/models.py`; cascade-delete pattern via SQLAlchemy + atomic quota decrement established in Phase 3 |
| FOLD-02 | User can move documents between folders | `Document.folder_id` FK exists and is nullable; PATCH endpoint updates `folder_id` with ownership assertion on both doc and target folder |
| FOLD-03 | Breadcrumb navigation renders current folder path; each segment clickable | Recursive path query via `WITH RECURSIVE` CTE or iterative parent-chain walk; Vue computed property for breadcrumb array |
| FOLD-04 | Document list supports sort by name, date uploaded, and file size | SQLAlchemy `order_by()` with enum sort param; frontend `sort` query param |
| FOLD-05 | Full-text search via PostgreSQL tsvector index on extracted text | Expression-index approach `to_tsvector('english', extracted_text)` + GIN; `plainto_tsquery` in `WHERE` clause; Alembic migration adds index manually (not as Computed column to avoid autogenerate noise) |
| SHARE-01 | User can share a document with another user by their unique handle | `Share` model complete; `User.handle` unique lookup; exact match, no autocomplete |
| SHARE-02 | Shared documents appear in "Shared with me" virtual folder; zero quota charged to recipient | Filter `shares.recipient_id = current_user.id`; no quota row touched; virtual folder is a UI/API filter, not a DB folder |
| SHARE-03 | Shared access view-only by default; owner controls permission level | `Share.permission` defaults to `'view'`; Phase 4 only implements `view`; permission field returned in list responses |
| SHARE-04 | Owner can revoke share; revocation is immediate | `DELETE /api/shares/{share_id}` with owner assertion; no async cleanup needed |
| SHARE-05 | Documents shared with others display a "shared" indicator | Document list response includes `is_shared: bool` derived from EXISTS subquery on `shares.owner_id` |
| SEC-08 | `credentials_enc` excluded from all API serializers | Explicit `CloudConnectionOut` Pydantic response model excluding `credentials_enc` |
| SEC-09 | Account deletion triggers `delete_user_files()` per cloud connection before DB removal | Implemented in admin delete-user endpoint; iterates documents, deletes MinIO objects, decrements quota |
| ADMIN-06 | Admin can view audit log filtered by date range, user, and action type (metadata only) | `AuditLog` model complete; paginated query with filters; admin endpoint uses `get_current_admin` |
| DOC-01 | User can view document metadata and extracted text | Existing `GET /api/documents/{id}` already returns extracted_text; Phase 4 ensures it is included in response |
| DOC-02 | In-browser PDF preview via PDF.js proxy; no presigned URLs exposed | `GET /api/documents/{id}/content`: StreamingResponse with Range header support, bytes from MinIO via `asyncio.to_thread`; per D-09 the preview uses native browser rendering rather than PDF.js |
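
The enum sort parameter described for FOLD-04 could be resolved through a small whitelist, so that a raw query value never reaches `order_by()` directly. This is a minimal sketch: `SortKey`, `SORT_COLUMNS`, `resolve_sort`, the accepted values, and the default directions are all hypothetical; only the column names (`filename`, `created_at`, `size_bytes`) come from fields shown elsewhere in this document.

```python
from enum import Enum


class SortKey(str, Enum):
    """Allowed values for the `sort` query parameter (hypothetical names)."""

    NAME = "name"
    UPLOADED = "uploaded"
    SIZE = "size"


# Whitelist: public sort key -> (column name, descending?).
# The directions are an assumption, not a documented decision.
SORT_COLUMNS = {
    SortKey.NAME: ("filename", False),
    SortKey.UPLOADED: ("created_at", True),
    SortKey.SIZE: ("size_bytes", True),
}


def resolve_sort(raw: str) -> tuple[str, bool]:
    """Return (column, descending) for a raw query value.

    Raises ValueError for any value outside the whitelist.
    """
    return SORT_COLUMNS[SortKey(raw)]
```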
---
## Summary
Phase 4 builds entirely on the existing ORM schema (all required models — `Folder`, `Share`, `AuditLog`, `Document.folder_id`, `Share.recipient_id` — are already in `db/models.py`). The work is predominantly plumbing: creating new API router modules for folders, shares, and audit log; adding a single Alembic migration for the `users.pdf_open_mode` column and the tsvector GIN expression index; extending the Celery beat schedule with a daily audit export task; and extending the Vue 3 frontend with folder navigation, sharing modals, and the settings preference toggle.
The highest-risk areas are: (1) the PDF streaming proxy with Range header support (needs careful byte-range parsing to handle PDF viewer seeking), (2) the tsvector GIN index and corresponding Alembic migration (Alembic autogenerate has known quirks with `to_tsvector()` expression indexes — the migration must be written manually), and (3) the folder cascade-delete with atomic quota decrement for multiple documents (each document's size_bytes must be summed and decremented in a single atomic UPDATE or via a pre-computed sum).
No new Python or npm packages are required for Phase 4. The decision to use native browser PDF rendering (D-09) eliminates the PDF.js dependency. All needed libraries — FastAPI, SQLAlchemy, MinIO SDK, Celery, Pydantic — are already pinned in `requirements.txt`. The frontend needs only Tailwind utility classes; no new npm packages.
**Primary recommendation:** Build in 6 plans: Wave 0 test scaffolds + migration; folders backend; shares backend + audit service; streaming proxy + search; audit log admin API + Celery export task; frontend folder navigation + sharing + settings.
---
## Architectural Responsibility Map
| Capability | Primary Tier | Secondary Tier | Rationale |
|------------|-------------|----------------|-----------|
| Folder CRUD + breadcrumb path | API / Backend | Database / Storage | Ownership assertion and cascade-delete logic belongs in the API layer; path reconstruction is a DB query |
| Document move between folders | API / Backend | Database / Storage | Ownership assertions on both document and target folder must be co-located in the API handler |
| Full-text search (tsvector) | Database / Storage | API / Backend | GIN index and `plainto_tsquery` execute in PostgreSQL; API adds scope filter (`user_id`) |
| Share grant / revoke | API / Backend | Database / Storage | IDOR prevention requires ownership assertion before DB write; share table is the single source of truth |
| "Shared with me" virtual folder | API / Backend | — | Pure query filter on `shares.recipient_id`; no DB folder row; quota enforcement is in the API layer |
| PDF streaming proxy | API / Backend | Database / Storage | Bytes flow from MinIO through FastAPI to the browser; presigned URLs never reach the browser |
| Per-user PDF open preference | API / Backend | Database / Storage | User preference column (`users.pdf_open_mode`) read/written via `/api/me/preferences` |
| Audit log write | API / Backend | — | `write_audit_log()` called inline in each handler after successful operation; no middleware |
| Audit log viewer + export | API / Backend | Database / Storage | Admin-only paginated query with filters; CSV export is a StreamingResponse from a filtered DB query |
| Daily audit backup (Celery beat) | API / Backend (worker) | Database / Storage | Celery beat task runs in the worker; queries DB and uploads CSV to MinIO `audit-logs` bucket |
| Folder navigation UI state | Frontend Server (SSR) → Browser / Client | — | Folder ID and breadcrumb path held in Vue `ref()` in `HomeView.vue`; no server-side state |
| Share modal + indicator | Browser / Client | — | Share button on DocumentCard; modal state is local component `ref()`; share list in documents store |
| PDF open preference toggle | Browser / Client | API / Backend | Toggle in SettingsView.vue reads/writes preference via `PATCH /api/me/preferences` |
| `credentials_enc` exclusion (SEC-08) | API / Backend | — | Pydantic response model `CloudConnectionOut` excludes the field; never reaches serializer |
---
## Standard Stack
### Core (all already in requirements.txt / package.json)
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| FastAPI | 0.136+ | API router for folders, shares, audit, proxy, preferences | Already in use; `APIRouter` + `StreamingResponse` needed for proxy |
| SQLAlchemy 2.0 async | 2.0+ | ORM queries for all new endpoints | Established pattern in project; async session already wired |
| Alembic | 1.18+ | Migration 0004 — `users.pdf_open_mode` column + tsvector GIN index | Established migration toolchain |
| MinIO Python SDK | 7.2+ | `get_object()` for streaming proxy; `put_object()` for audit CSV upload | Already in use; `asyncio.to_thread()` pattern established |
| Celery + Redis | 5.5+ | Daily audit export beat task | Already in use; `beat_schedule` in `celery_app.py` |
| Pydantic v2 | 2.0+ | `CloudConnectionOut` response model excluding `credentials_enc` (SEC-08) | Already in use for all request/response models |
| Vue 3 (Options API) | 3.4+ | Folder nav, share modal, settings toggle | Already in use; `ref()` + `watch()` patterns established |
[VERIFIED: codebase — requirements.txt and package.json]
### No New Packages Required
Phase 4 introduces zero new dependencies. Key rationale:
- PDF viewer: D-09 chose native browser rendering — no PDF.js
- Debounce: hand-rolled `setTimeout`/`clearTimeout` in Vue `watch` callback (< 10 lines, no lodash needed)
- CSV export: Python stdlib `csv.DictWriter` with `io.StringIO` — no pandas
- Search: PostgreSQL built-in `plainto_tsquery` — no Elasticsearch or external search engine
[VERIFIED: codebase — no new deps required]
---
## Package Legitimacy Audit
No new packages are installed in Phase 4. All runtime dependencies are carried forward from Phases 1–3 with pinned versions in `requirements.txt`.
**Packages removed due to slopcheck [SLOP] verdict:** none
**Packages flagged as suspicious [SUS]:** none
**New packages to install:** none
---
## Architecture Patterns
### System Architecture Diagram
```
Browser
│
├── GET /api/documents?q=search&folder_id=X&sort=name
│ FastAPI → SQLAlchemy → PostgreSQL (tsvector @@ plainto_tsquery, scoped to user)
│ ← JSON: [{id, filename, size_bytes, is_shared, folder_id, ...}]
│
├── POST /api/folders / PATCH /api/folders/{id} / DELETE /api/folders/{id}
│ FastAPI → ownership assertion → SQLAlchemy → PostgreSQL
│ DELETE: sum(doc.size_bytes) → atomic quota decrement → MinIO delete_object per doc
│
├── POST /api/shares / DELETE /api/shares/{id}
│ FastAPI → ownership assertion (owner_id = current_user.id) → SQLAlchemy → PostgreSQL
│ GET /api/shares/received → filter shares.recipient_id = current_user.id
│
├── GET /api/documents/{id}/content [Range: bytes=X-Y]
│ FastAPI (get_regular_user) → ownership OR share-access check → SQLAlchemy (object_key)
│ → MinIO get_object() via asyncio.to_thread() → StreamingResponse (206 or 200)
│ ← bytes with Content-Type, Content-Disposition: inline, Accept-Ranges: bytes
│ NOTE: no presigned URL generated; bytes flow through FastAPI
│
├── PATCH /api/me/preferences
│ FastAPI → SQLAlchemy UPDATE users.pdf_open_mode → ← {pdf_open_mode}
│
└── GET /api/admin/audit-log [?start=&end=&user_id=&event_type=&page=]
FastAPI (get_current_admin) → SQLAlchemy → PostgreSQL
← JSON: {items: [{id, event_type, user_id, actor_id, ip_address, metadata_, created_at}], total}
GET /api/admin/audit-log/export?format=csv|json
FastAPI (get_current_admin) → query → StreamingResponse (text/csv or application/json)
Celery Beat Worker (daily at midnight UTC)
│
└── audit_log_daily_export task
→ AsyncSessionLocal → SELECT audit_log WHERE date = yesterday
→ csv.DictWriter → io.BytesIO
→ MinIO put_object("audit-logs", "YYYY-MM-DD.csv", bytes)
```
### Recommended Project Structure (new files only)
```
backend/
├── api/
│ ├── folders.py # FOLD-01, FOLD-02, FOLD-03, FOLD-04, FOLD-05
│ ├── shares.py # SHARE-01..05
│ └── audit.py # ADMIN-06 (admin viewer + export)
├── services/
│ └── audit.py # write_audit_log() helper
├── tasks/
│ └── audit_tasks.py # audit_log_daily_export Celery beat task
└── migrations/versions/
└── 0004_phase4_pdf_open_mode_tsvector.py
frontend/src/
├── components/
│ ├── documents/
│ │ └── ShareModal.vue # D-05
│ ├── layout/
│ │ └── BreadcrumbNav.vue # FOLD-03
│ └── admin/
│ └── AdminAuditLogTab.vue # ADMIN-06 (D-15, D-16)
├── stores/
│ └── folders.js # folder state + actions
└── views/
└── SettingsView.vue # extend with PDF preference toggle (D-10)
```
### Pattern 1: Ownership Assertion (established in Phase 3 — D-16)
**What:** Every resource endpoint asserts `resource.user_id == current_user.id`. Cross-user access returns 404 (not 403) to prevent attacker enumeration of resource IDs.
**When to use:** All folder CRUD, document move, share grant/revoke (owner side), streaming proxy.
```python
# Source: backend/api/documents.py (established pattern)
doc = await session.get(Document, uid)
if doc is None or doc.user_id != current_user.id:
    raise HTTPException(status_code=404, detail="Document not found")
```
**Folder variant:**
```python
folder = await session.get(Folder, folder_id)
if folder is None or folder.user_id != current_user.id:
    raise HTTPException(status_code=404, detail="Folder not found")
```
[VERIFIED: codebase — backend/api/documents.py]
### Pattern 2: Atomic Quota Decrement for Folder Cascade-Delete
**What:** When deleting a folder, collect all document `size_bytes` first, then issue a single atomic quota UPDATE. Never read-then-write in Python.
**When to use:** `DELETE /api/folders/{id}` when folder is non-empty.
```python
# Source: established atomic pattern (CLAUDE.md + Phase 3 D-07)
from sqlalchemy import delete, select, text

# Step 1: collect the folder's direct child documents and their total size.
# Nested sub-folders need the recursive subtree query shown in Pitfall 1.
result = await session.execute(
    select(Document.id, Document.size_bytes, Document.object_key)
    .where(Document.folder_id == folder_id, Document.user_id == current_user.id)
)
docs = result.all()
total_bytes = sum(row.size_bytes for row in docs)

# Step 2: atomic quota decrement (CASE WHEN pattern, SQLite compatible)
await session.execute(
    text(
        "UPDATE quotas SET used_bytes = "
        "CASE WHEN used_bytes > :delta THEN used_bytes - :delta ELSE 0 END "
        "WHERE user_id = :uid"
    ),
    {"delta": total_bytes, "uid": str(current_user.id)},
)

# Step 3: delete MinIO objects (best-effort)
for row in docs:
    try:
        await get_storage_backend().delete_object(row.object_key)
    except Exception:
        pass  # MinIO cleanup is best-effort

# Step 4: delete the collected documents (scoped to the owner), then the folder
await session.execute(
    delete(Document).where(
        Document.folder_id == folder_id, Document.user_id == current_user.id
    )
)
await session.delete(folder)
await session.commit()
```
[ASSUMED: the CASE WHEN pattern for SQLite compat is established in STATE.md; verified in existing delete_document service function]
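
The clamping behaviour of the `CASE WHEN` decrement can be checked in isolation. The sketch below runs the same UPDATE statement against an in-memory SQLite database; the single-row `quotas` table and the `decrement` helper are simplified stand-ins for illustration, not the project's real schema or service code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quotas (user_id TEXT PRIMARY KEY, used_bytes INTEGER NOT NULL)"
)
conn.execute("INSERT INTO quotas VALUES ('u1', 1000)")

# Same statement as Step 2 above: subtract in SQL, never below zero.
DECREMENT = (
    "UPDATE quotas SET used_bytes = "
    "CASE WHEN used_bytes > :delta THEN used_bytes - :delta ELSE 0 END "
    "WHERE user_id = :uid"
)


def decrement(delta: int) -> int:
    """Apply the decrement and return the resulting used_bytes."""
    conn.execute(DECREMENT, {"delta": delta, "uid": "u1"})
    row = conn.execute(
        "SELECT used_bytes FROM quotas WHERE user_id = 'u1'"
    ).fetchone()
    return row[0]


after_normal = decrement(300)    # 1000 - 300
after_clamped = decrement(5000)  # delta exceeds usage, so the value clamps to 0
```

Because the subtraction happens inside the UPDATE, two concurrent deletes cannot both read the same stale `used_bytes` value in Python and overwrite each other.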
### Pattern 3: PDF Streaming Proxy with Range Headers
**What:** `GET /api/documents/{id}/content` fetches bytes from MinIO and streams them with Range header support. Status 206 for range requests; 200 for full fetches.
**When to use:** DOC-02. Required for PDF viewer to seek pages without re-downloading the entire file.
```python
# Source: FastAPI discussion #7718 (CITED: github.com/fastapi/fastapi/discussions/7718)
import uuid

from fastapi import Depends, HTTPException, Request, status
from fastapi.responses import StreamingResponse


def _parse_range(range_header: str, file_size: int) -> tuple[int, int]:
    """Parse a single 'bytes=start-end' Range header. Raises 416 on an invalid range."""
    try:
        first, last = range_header.replace("bytes=", "", 1).split("-", 1)
        if first == "":
            # Suffix form 'bytes=-N': the final N bytes of the file
            start, end = max(file_size - int(last), 0), file_size - 1
        else:
            start = int(first)
            # Open-ended 'bytes=N-' runs to EOF; an end past EOF is clamped
            end = min(int(last), file_size - 1) if last else file_size - 1
    except ValueError:
        raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
    if start < 0 or start > end:
        raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
    return start, end


@router.get("/{doc_id}/content")
async def stream_document_content(
    doc_id: str,
    request: Request,
    session: AsyncSession = Depends(get_db),
    current_user: User = Depends(get_regular_user),
):
    try:
        uid = uuid.UUID(doc_id)
    except ValueError:
        raise HTTPException(404, "Document not found")
    doc = await session.get(Document, uid)
    if doc is None:
        raise HTTPException(404, "Document not found")
    if doc.user_id != current_user.id:
        # Not the owner: allow only if the document is shared with current_user (Pattern 4)
        share = await _get_share_for_user(session, uid, current_user.id)
        if share is None:
            raise HTTPException(404, "Document not found")

    # Fetch all bytes from MinIO (asyncio.to_thread wraps sync SDK)
    file_bytes = await get_storage_backend().get_object(doc.object_key)
    file_size = len(file_bytes)
    range_header = request.headers.get("range")
    headers = {
        "content-type": doc.content_type,
        "content-disposition": f"inline; filename=\"{doc.filename}\"",
        "accept-ranges": "bytes",
        "content-length": str(file_size),
    }
    if range_header:
        start, end = _parse_range(range_header, file_size)
        chunk = file_bytes[start:end + 1]
        headers["content-range"] = f"bytes {start}-{end}/{file_size}"
        headers["content-length"] = str(len(chunk))
        return StreamingResponse(
            iter([chunk]),
            status_code=206,
            headers=headers,
        )
    return StreamingResponse(
        iter([file_bytes]),
        status_code=200,
        headers=headers,
    )
```
**Critical:** `get_regular_user` dep ensures admins cannot access document content (DOC-02 / CLAUDE.md). No presigned URL is generated at any point.
[CITED: https://github.com/fastapi/fastapi/discussions/7718]
[CITED: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/206]
### Pattern 4: Shared-Document Access for Streaming Proxy
**What:** The streaming proxy must allow both document owners AND recipients to access content.
**When to use:** Only in `GET /api/documents/{id}/content` — sharing grants read access to the content stream.
```python
# Source: CONTEXT.md D-08 (recipient can preview shared document)
import uuid

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession


async def _can_access_document(
    session: AsyncSession, doc: Document, current_user_id: uuid.UUID
) -> bool:
    """True if user owns the document OR has an active share as recipient."""
    if doc.user_id == current_user_id:
        return True
    result = await session.execute(
        select(Share).where(
            Share.document_id == doc.id,
            Share.recipient_id == current_user_id,
        )
    )
    return result.scalar_one_or_none() is not None
```
[ASSUMED: the share-based access for the streaming proxy is implied by D-08 and SHARE-02; not explicitly called out in a prior phase decision]
### Pattern 5: PostgreSQL tsvector Full-Text Search
**What:** GIN expression index on `documents.extracted_text` for full-text search. Query uses `plainto_tsquery` (accepts natural language; no operator syntax required from user).
**When to use:** FOLD-05 — `GET /api/documents?q=`.
**Migration (manual — do NOT use Computed() to avoid Alembic autogenerate noise):**
```python
# Source: CITED: https://www.postgresql.org/docs/current/textsearch-tables.html
# In migration 0004:
op.execute(
"CREATE INDEX ix_documents_fts ON documents "
"USING GIN (to_tsvector('english', coalesce(extracted_text, '')))"
)
```
**Query pattern:**
```python
# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func, select

# The expression must match the index definition (including coalesce),
# otherwise PostgreSQL cannot use the GIN expression index.
stmt = (
    select(Document)
    .where(
        Document.user_id == current_user.id,
        func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
            func.plainto_tsquery("english", q)
        ),
    )
    .order_by(Document.created_at.desc())
)
```
**Why `plainto_tsquery` not `to_tsquery`:** `plainto_tsquery` accepts unstructured natural language ("invoice march 2024") without requiring the user to supply `&` and `|` operators. Simpler and safer for a search bar input.
**Why expression index not generated/stored column:** PostgreSQL generated columns with `TSVECTOR` type require PostgreSQL 12+ and add schema complexity. The expression index is simpler and stays Alembic-compatible as long as it is written as raw SQL in the migration and kept out of the ORM metadata; autogenerate output must still be reviewed, as the caveat below explains.
**Important Alembic caveat:** `include_schemas=True` autogenerate WILL repeatedly flag the expression index for recreation (known Alembic issue: github.com/sqlalchemy/alembic/issues/1390). The migration MUST be written manually and committed once. Do not run `alembic revision --autogenerate` after adding the GIN index without reviewing the output.
[CITED: https://www.postgresql.org/docs/current/textsearch-tables.html]
[CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html]
[CITED: https://github.com/sqlalchemy/alembic/issues/1390]
### Pattern 6: write_audit_log() Helper
**What:** Shared service function called inline in each handler after a successful operation. Does NOT raise on failure (audit log writing is best-effort — do not fail the primary operation).
**When to use:** Every handler in Phase 4 that performs state-changing operations; plus back-filling into `auth.py` and `admin.py`.
```python
# Source: CONTEXT.md D-14
# backend/services/audit.py
from __future__ import annotations

import logging
import uuid
from typing import Optional

from sqlalchemy.ext.asyncio import AsyncSession

from db.models import AuditLog

logger = logging.getLogger(__name__)


async def write_audit_log(
    session: AsyncSession,
    event_type: str,
    user_id: Optional[uuid.UUID],
    actor_id: Optional[uuid.UUID],
    resource_id: Optional[uuid.UUID],
    ip_address: Optional[str],
    metadata_: Optional[dict] = None,
) -> None:
    """Write an audit log entry. Never raises: audit failure is non-fatal."""
    try:
        # SAVEPOINT: a failed flush rolls back only the audit insert, so the
        # handler's outer transaction stays usable for its own commit.
        async with session.begin_nested():
            entry = AuditLog(
                event_type=event_type,
                user_id=user_id,
                actor_id=actor_id,
                resource_id=resource_id,
                ip_address=ip_address,
                metadata_=metadata_,
            )
            session.add(entry)
            await session.flush()  # flush within the handler's existing transaction
    except Exception as exc:
        logger.warning("audit log write failed: %s", exc)
        # Do not re-raise: audit failure must never abort the primary operation
```
**Critical: `session.flush()` not `session.commit()`.** The audit entry is written in the same transaction as the primary operation, so the handler's `session.commit()` persists both together and a rollback discards both. This avoids partial writes where an audit entry is committed for an operation that later fails.
[ASSUMED: flush-not-commit pattern is the safest approach given transactional semantics; not explicitly documented in a prior phase decision]
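
The transactional argument can be illustrated without SQLAlchemy. The sketch below uses stdlib `sqlite3` and two simplified tables (`shares` and `audit_log`, reduced to one column each for illustration); `grant_share` and `counts` are hypothetical helpers, not project code. It shows that an audit row written before the commit shares the fate of the primary write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shares (id INTEGER PRIMARY KEY, document_id TEXT)")
conn.execute("CREATE TABLE audit_log (id INTEGER PRIMARY KEY, event_type TEXT)")
conn.commit()


def grant_share(fail: bool) -> None:
    """Write the share and its audit row in one transaction."""
    try:
        conn.execute("INSERT INTO shares (document_id) VALUES ('doc-1')")
        conn.execute("INSERT INTO audit_log (event_type) VALUES ('share.granted')")
        if fail:
            raise RuntimeError("simulated failure before commit")
        conn.commit()  # both rows become durable together
    except RuntimeError:
        conn.rollback()  # both rows are discarded together


def counts() -> tuple[int, int]:
    """Return (share rows, audit rows)."""
    shares = conn.execute("SELECT COUNT(*) FROM shares").fetchone()[0]
    audits = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
    return shares, audits


grant_share(fail=True)
after_failure = counts()
grant_share(fail=False)
after_success = counts()
```

Had the audit insert been committed on its own, the failed call would have left an audit row describing a share that never existed.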
### Pattern 7: CSV Export (admin endpoint)
**What:** `GET /api/admin/audit-log/export?format=csv` returns a StreamingResponse with `text/csv` content type and `Content-Disposition: attachment`.
```python
# Source: Python stdlib csv module
import csv
import io
import json
import uuid
from datetime import datetime
from typing import Optional

from fastapi.responses import StreamingResponse
from sqlalchemy.ext.asyncio import AsyncSession


async def export_audit_log(
    session: AsyncSession,
    start: Optional[datetime],
    end: Optional[datetime],
    user_id: Optional[uuid.UUID],
    event_type: Optional[str],
    format: str = "csv",  # only the CSV branch is shown here
):
    rows = await _query_audit_log(session, start, end, user_id, event_type)
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[
        "id", "event_type", "user_id", "actor_id",
        "resource_id", "ip_address", "metadata_", "created_at"
    ])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "id": row.id,
            "event_type": row.event_type,
            "user_id": str(row.user_id) if row.user_id else "",
            "actor_id": str(row.actor_id) if row.actor_id else "",
            "resource_id": str(row.resource_id) if row.resource_id else "",
            "ip_address": str(row.ip_address) if row.ip_address else "",
            # JSON keeps metadata machine-readable; str() would emit a Python repr
            "metadata_": json.dumps(row.metadata_, default=str) if row.metadata_ else "",
            "created_at": row.created_at.isoformat(),
        })
    return StreamingResponse(
        iter([output.getvalue()]),
        media_type="text/csv",
        headers={"Content-Disposition": "attachment; filename=audit-export.csv"},
    )
```
[CITED: Python stdlib csv documentation — https://docs.python.org/3/library/csv.html]
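
To see why `csv.DictWriter` is preferred over manual string building, the following sketch round-trips a row whose metadata contains commas and quotes. The `rows_to_csv` helper, the reduced field list, and the sample row are illustrative assumptions; encoding the metadata as JSON is one possible choice rather than a documented project decision.

```python
import csv
import io
import json


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize audit-like dicts to CSV text, encoding metadata as JSON."""
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=["event_type", "ip_address", "metadata_"]
    )
    writer.writeheader()
    for row in rows:
        metadata = row.get("metadata_")
        writer.writerow({
            "event_type": row["event_type"],
            "ip_address": row.get("ip_address") or "",
            "metadata_": json.dumps(metadata, sort_keys=True) if metadata else "",
        })
    return output.getvalue()


sample = [{
    "event_type": "share.granted",
    "ip_address": None,
    "metadata_": {"note": "a, b", "n": 1},
}]
csv_text = rows_to_csv(sample)

# Reading the text back recovers the original values despite embedded
# commas and double quotes in the JSON cell.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```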
### Pattern 8: Celery Beat — Daily Audit Export Task
**What:** A new task `audit_log_daily_export` added to `beat_schedule`. Uses `crontab(hour=0, minute=0)` for midnight UTC.
```python
# Source: CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
from celery.schedules import crontab
# In celery_app.py beat_schedule dict — ADD to existing schedule:
"audit-log-daily-export": {
"task": "tasks.audit_tasks.audit_log_daily_export",
"schedule": crontab(hour=0, minute=0), # midnight UTC
},
```
**Task implementation pattern (mirrors cleanup_abandoned_uploads):**
```python
import asyncio
from datetime import datetime, timedelta, timezone


@celery_app.task(name="tasks.audit_tasks.audit_log_daily_export")
def audit_log_daily_export() -> dict:
    return asyncio.run(_run_daily_export())


async def _run_daily_export() -> dict:
    import csv, io
    from db.session import AsyncSessionLocal
    from db.models import AuditLog
    from storage import get_storage_backend
    from sqlalchemy import select

    # Use the UTC date: the beat schedule fires at midnight UTC, while
    # date.today() would follow the worker's local timezone.
    yesterday = datetime.now(timezone.utc).date() - timedelta(days=1)
    start = datetime(yesterday.year, yesterday.month, yesterday.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    async with AsyncSessionLocal() as session:
        result = await session.execute(
            select(AuditLog)
            .where(AuditLog.created_at >= start, AuditLog.created_at < end)
            .order_by(AuditLog.created_at)
        )
        rows = result.scalars().all()
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[...])
    writer.writeheader()
    for row in rows:
        writer.writerow({...})
    csv_bytes = output.getvalue().encode("utf-8")
    # Matches the "YYYY-MM-DD.csv" key in the architecture diagram;
    # the bucket name already namespaces the object.
    key = f"{yesterday.isoformat()}.csv"
    backend = get_storage_backend()
    await backend.put_object_raw(
        bucket="audit-logs",
        key=key,
        data=io.BytesIO(csv_bytes),
        length=len(csv_bytes),
        content_type="text/csv",
    )
    return {"exported": len(rows), "key": key}
```
**Note:** `MinIOBackend` may need a `put_object_raw(bucket, key, data, length, content_type)` method (vs the existing `put_object` which constructs the key schema). The audit-logs bucket uses a different key scheme. Add this method in the migration plan.
[CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html]
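
The half-open window the export task filters on (`created_at >= start` and `created_at < end`) can be computed and tested separately from Celery. This is a minimal sketch; `previous_utc_day` is a hypothetical helper name, and it assumes the caller passes a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def previous_utc_day(now: datetime) -> tuple[datetime, datetime]:
    """Return [start, end) covering the UTC calendar day before `now`.

    `now` must be timezone-aware; a naive datetime would be interpreted
    in the machine's local timezone by astimezone().
    """
    today_start = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return today_start - timedelta(days=1), today_start


# A task firing a few seconds after midnight UTC on 2026-05-25
# should export all of 2026-05-24.
start, end = previous_utc_day(datetime(2026, 5, 25, 0, 0, 5, tzinfo=timezone.utc))
```

Using a half-open interval means an event stamped exactly at midnight belongs to one day's export only.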
### Pattern 9: Pydantic Response Model Exclusion (SEC-08)
**What:** Explicit `response_model=CloudConnectionOut` where `CloudConnectionOut` omits `credentials_enc`. This is the safe-by-default Pydantic v2 approach.
```python
# Source: CITED: https://fastapi.tiangolo.com/tutorial/response-model/
from datetime import datetime

from pydantic import BaseModel


class CloudConnectionOut(BaseModel):
    id: str
    provider: str
    display_name: str
    status: str
    connected_at: datetime
    # credentials_enc is deliberately absent: never serialized (SEC-08)
    model_config = {"from_attributes": True}
```
**Phase 4 scope:** No cloud connection endpoints exist yet (they arrive in Phase 5). Phase 4's obligation is to define `CloudConnectionOut` in `schemas/cloud.py` (or inline in `api/admin.py`) and use it as `response_model` on any admin endpoint that touches `cloud_connections`, so Phase 5 cannot accidentally expose the field.
[CITED: https://fastapi.tiangolo.com/tutorial/response-model/]
### Pattern 10: Vue 3 Debounced Search (no external dependency)
**What:** Watch the search query ref with a manual debounce using `setTimeout`/`clearTimeout`. Does not require lodash or VueUse.
```javascript
// Source: Vue 3 Composition API watch pattern (Options API compatible via watch option)
// frontend/src/stores/documents.js
import { ref, watch } from 'vue'

const searchQuery = ref('')
let _searchTimer = null

watch(searchQuery, (newVal) => {
  clearTimeout(_searchTimer)
  if (newVal.length < 2) {
    // Clear results or show all
    return
  }
  _searchTimer = setTimeout(() => {
    fetchDocuments({ q: newVal, folderId: currentFolderId.value })
  }, 300)
})
```
[ASSUMED: 300ms is conventional for search debounce; no official Vue doc specifies this value]
### Anti-Patterns to Avoid
- **Using `session.commit()` inside `write_audit_log()`:** Must use `flush()` to stay within the handler's transaction. A separate commit risks committing the audit entry without the primary operation.
- **Generating presigned URLs in the streaming proxy:** D-08 explicitly prohibits this. Never call `presigned_get_url()` from the proxy endpoint. The bytes must flow through FastAPI.
- **`to_tsquery` instead of `plainto_tsquery`:** `to_tsquery` requires the user to supply `&`, `|`, and `!` operators. Search bar input uses `plainto_tsquery` which accepts natural language phrases.
- **Using `response_model_exclude` on CloudConnection:** Prefer defining an explicit `CloudConnectionOut` model that never includes `credentials_enc`. `response_model_exclude` is fragile (requires remembering to pass the set on every endpoint). Whitelist is safer.
- **Storing breadcrumb state in the URL query string:** Breadcrumb is derived from the current `folder_id` parameter — only the current folder ID needs to be in the URL (or in Vue state). The full path is reconstructed from the DB on navigation.
- **Recursive SQL for breadcrumb without depth limit:** `WITH RECURSIVE` CTE is correct but should include a depth guard (`WHERE depth < 50`) to prevent infinite recursion on a corrupted self-referential FK.
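
The breadcrumb rules above (the D-02 truncation and the depth guard) can be sketched as two pure functions. Both `folder_path` and `truncate_breadcrumb` are hypothetical helpers: the in-memory `parents` and `names` dicts stand in for the `folders` table, and the guard value of 50 mirrors the limit suggested for the CTE.

```python
from typing import Optional


def folder_path(
    folder_id: str,
    parents: dict[str, Optional[str]],
    names: dict[str, str],
    max_depth: int = 50,
) -> list[str]:
    """Walk the parent chain from a folder up to its root, root first.

    Raises ValueError when the chain is longer than max_depth, which
    also catches a corrupted (cyclic) self-referential parent_id.
    """
    path: list[str] = []
    current: Optional[str] = folder_id
    while current is not None:
        if len(path) >= max_depth:
            raise ValueError("folder chain exceeds depth guard")
        path.append(names[current])
        current = parents[current]
    return path[::-1]


def truncate_breadcrumb(segments: list[str], max_depth: int = 4) -> list[str]:
    """Apply D-02: beyond max_depth, show first + '...' + last 2."""
    if len(segments) <= max_depth:
        return list(segments)
    return [segments[0], "...", *segments[-2:]]
```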
---
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Range header parsing | Custom byte-range logic from scratch | Pattern 3 above (from FastAPI discussion #7718) | Browser PDFs use RFC 7233 range requests; edge cases (open-ended ranges `bytes=500-`) must be handled |
| Full-text search ranking | BM25 in Python | PostgreSQL `ts_rank()` + `plainto_tsquery` | Runs in the DB; GIN index makes it fast; no external service needed |
| CSV serialization | String concatenation with escaping | Python stdlib `csv.DictWriter` | Handles quoting, escaping, newlines in field values automatically |
| Audit log middleware | Celery task or FastAPI middleware for audit writes | Inline `write_audit_log()` after successful operation | Middleware fires on request start — cannot know if operation succeeded; Celery task adds latency and potential missed writes on worker failure |
| Tsvector maintenance | Application-layer `extracted_text` → tsvector sync | PostgreSQL GIN expression index `to_tsvector('english', extracted_text)` | DB recalculates automatically on UPDATE; no trigger or application sync needed |
| Debounce utility | npm package (lodash, VueUse) | `setTimeout`/`clearTimeout` in watch | < 10 lines; no new dependency justified |
**Key insight:** PostgreSQL handles full-text search natively with GIN indexes at query time — no external search infrastructure (Elasticsearch, Typesense, Meilisearch) is needed or justified at this scale.
---
## Common Pitfalls
### Pitfall 1: Folder Cascade-Delete Misses Sub-folder Documents
**What goes wrong:** `DELETE /api/folders/{id}` deletes direct child documents but leaves orphaned documents in sub-folders. Quota becomes incorrect.
**Why it happens:** A naive `WHERE folder_id = :id` only catches immediate children. Self-referential FK with unlimited nesting requires a recursive query to collect the full subtree.
**How to avoid:** Use `WITH RECURSIVE` CTE to collect all folder IDs in the subtree before deleting:
```sql
WITH RECURSIVE subtree (id, depth) AS (
    SELECT id, 0 FROM folders WHERE id = :root_id AND user_id = :uid
    UNION ALL
    SELECT f.id, s.depth + 1 FROM folders f
    JOIN subtree s ON f.parent_id = s.id
    WHERE f.user_id = :uid AND s.depth < 50
)
SELECT d.id, d.size_bytes, d.object_key
FROM documents d
JOIN subtree s ON d.folder_id = s.id
WHERE d.user_id = :uid
```
**Warning signs:** After deleting a folder tree, `quotas.used_bytes` does not match `SUM(documents.size_bytes)`.
[ASSUMED: integration tests requiring this pattern run on PostgreSQL (INTEGRATION=1). SQLite has supported `WITH RECURSIVE` since 3.8.3, so whether these tests could also run on SQLite should be verified rather than assumed]
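The rows returned by the subtree query feed three things in the delete flow: the document count for the warning modal, the quota decrement, and the list of MinIO objects to remove. A minimal sketch of that reduction follows; `SubtreeDoc` and `plan_cascade_delete` are hypothetical names, and the row shape `(id, size_bytes, object_key)` is taken from the `SELECT` list above.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SubtreeDoc(NamedTuple):
    id: str
    size_bytes: int
    object_key: str


def plan_cascade_delete(rows: Iterable[tuple]) -> Tuple[int, int, List[str]]:
    """Return (document_count, bytes_to_release, object_keys) for a subtree."""
    docs = [SubtreeDoc(*row) for row in rows]
    total_bytes = sum(doc.size_bytes for doc in docs)
    object_keys = [doc.object_key for doc in docs]
    return len(docs), total_bytes, object_keys
```

Computing all three values from one result set keeps the quota decrement consistent with the set of rows that is actually deleted.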
### Pitfall 2: Alembic Autogenerate Repeatedly Detects tsvector GIN Index
**What goes wrong:** After adding the GIN expression index in migration 0004, running `alembic revision --autogenerate` again generates a migration that drops and recreates the same index.
**Why it happens:** Alembic's autogenerate does not understand functional/expression indexes on PostgreSQL when the expression involves function calls like `to_tsvector()`. Known issue in Alembic.
**How to avoid:** Write the GIN index migration manually (as raw `op.execute("CREATE INDEX ...")`) and do NOT add the index to the ORM model's `__table_args__`. Keep it only in the migration. Document with a comment `# managed manually, do not autogenerate`.
**Warning signs:** Repeated `alembic revision --autogenerate` output shows DROP INDEX / CREATE INDEX for `ix_documents_fts` on every run.
[CITED: https://github.com/sqlalchemy/alembic/issues/1390]
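As an optional complement to the manual migration, Alembic's `include_object` hook can tell autogenerate to ignore the hand-managed index entirely. This is a sketch, not something the codebase is confirmed to use; `MANUALLY_MANAGED_INDEXES` is a hypothetical name, while `ix_documents_fts` is the index name mentioned in the warning signs above.

```python
# Indexes created by hand-written migrations that autogenerate must skip.
MANUALLY_MANAGED_INDEXES = frozenset({"ix_documents_fts"})


def include_object(object, name, type_, reflected, compare_to):
    """Alembic include_object hook: return False to exclude an object."""
    if type_ == "index" and name in MANUALLY_MANAGED_INDEXES:
        return False
    return True


# In env.py the hook would be passed alongside the existing arguments:
#   context.configure(..., include_object=include_object)
```

With the hook in place, a reflected index that has no counterpart in the ORM metadata is not reported as a pending drop.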
### Pitfall 3: PDF Streaming Proxy — Admin Access
**What goes wrong:** Admin user calls `GET /api/documents/{id}/content` and receives document bytes.
**Why it happens:** Using `get_current_user` instead of `get_regular_user` on the proxy endpoint.
**How to avoid:** The proxy endpoint MUST use `get_regular_user` dep, which raises HTTP 403 for `user.role == 'admin'`. This is established in Phase 3 (D-16) and must be applied to the proxy endpoint.
**Warning signs:** Security test `test_content_stream_admin_403` fails or is absent.
[VERIFIED: codebase — backend/deps/auth.py `get_regular_user` raises 403 for admin role]
### Pitfall 4: Share IDOR — Recipient Accesses Another Recipient's Share
**What goes wrong:** User A shares document with User B. User C (also a recipient of a different share on the same document, or just any user) can enumerate share IDs and revoke or view shares they don't own.
**Why it happens:** `DELETE /api/shares/{share_id}` checks only that the share exists, not that `share.owner_id == current_user.id`.
**How to avoid:** Ownership assertion on all share mutation endpoints:
```python
share = await session.get(Share, share_id)
# Return 404 (not 403) so callers cannot probe which share IDs exist
if share is None or share.owner_id != current_user.id:
    raise HTTPException(404, "Share not found")
```
**Warning signs:** Security test `test_share_revoke_wrong_owner_404` is absent.
[ASSUMED: standard IDOR prevention pattern per CLAUDE.md; test must be written]
### Pitfall 5: write_audit_log Does Not Include IP Address
**What goes wrong:** `ip_address` column is left NULL for all events because the handler does not extract it from the request.
**Why it happens:** The `request.client.host` extraction is not wired into `write_audit_log()` calls.
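**How to avoid:** Inject `request: Request` into each handler and pass `request.client.host` as `ip_address` to `write_audit_log()`. For handlers behind a trusted reverse proxy, prefer the first entry of the `X-Forwarded-For` header (it can carry a comma-separated chain of addresses) and fall back to `request.client.host`. Only trust the header when the proxy is known to set it, since clients can forge it.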
[ASSUMED: IP extraction from FastAPI Request is standard; X-Forwarded-For handling depends on deployment]
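The extraction rule can be kept in one small helper so every handler resolves the address the same way. The sketch below is framework-free for testability: `resolve_client_ip` and its `trust_proxy` flag are hypothetical, and the header lookup uses a lowercase key because a plain dict, unlike Starlette's headers object, is case-sensitive.

```python
from typing import Mapping, Optional


def resolve_client_ip(
    headers: Mapping[str, str],
    client_host: Optional[str],
    trust_proxy: bool = False,
) -> Optional[str]:
    """Pick the address to store in audit_log.ip_address."""
    if trust_proxy:
        # X-Forwarded-For may hold "client, proxy1, proxy2"; the first
        # entry is the original client as reported by the proxy chain.
        forwarded = headers.get("x-forwarded-for", "")
        first_hop = forwarded.split(",")[0].strip()
        if first_hop:
            return first_hop
    # Fall back to the socket peer; this can be None when no client
    # address is available.
    return client_host
```

In a handler this would be called with `request.headers` and `request.client.host if request.client else None`.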
### Pitfall 6: Folder Name Uniqueness Constraint Violation
**What goes wrong:** Creating two folders with the same name under the same parent raises an IntegrityError from the `uq_folders_user_parent_name` constraint, which surfaces as HTTP 500.
**Why it happens:** The `Folder` model has `UniqueConstraint("user_id", "parent_id", "name")`. The API handler does not check for duplicates before inserting.
**How to avoid:** Catch `IntegrityError` from the ORM insert and return HTTP 409 Conflict:
```python
from sqlalchemy.exc import IntegrityError

try:
    session.add(folder)
    await session.commit()
except IntegrityError:
    # Duplicate (user_id, parent_id, name): roll back and report a conflict
    await session.rollback()
    raise HTTPException(409, "A folder with that name already exists here")
```
**Warning signs:** POST /api/folders returns 500 on duplicate name.
[VERIFIED: codebase — db/models.py Folder `UniqueConstraint("user_id", "parent_id", "name")`]
### Pitfall 7: "Shared with Me" Leaks Document Content
**What goes wrong:** The `GET /api/documents` list endpoint accidentally returns documents shared with the current user (not just owned by them), exposing filenames and extracted text to recipients.
**Why it happens:** A join on the `shares` table is added to the list query without scoping by ownership.
**How to avoid:** The standard list endpoint (`GET /api/documents`) MUST filter `WHERE documents.user_id = current_user.id` only. Shared documents are a separate endpoint `GET /api/shares/received`. Extracted text is not returned in the shared documents list view — only metadata.
[VERIFIED: codebase — services/storage.py `list_metadata` already scopes by `user_id`]
### Pitfall 8: MinIO Audit-Logs Bucket Does Not Exist
**What goes wrong:** The daily Celery beat task fails with `NoSuchBucket` error on first run.
**Why it happens:** The `audit-logs` bucket is not created in the migration or application lifespan.
**How to avoid:** Create the `audit-logs` bucket in migration 0004's `upgrade()` function (same pattern as the main documents bucket, gated on `MINIO_ENDPOINT` env var to allow SQLite test runs):
```python
# In migration 0004 post-DDL step
import os
from minio import Minio

minio_endpoint = os.environ.get("MINIO_ENDPOINT")
if minio_endpoint:  # skipped on SQLite-only test runs
    client = Minio(minio_endpoint, ...)
    if not client.bucket_exists("audit-logs"):
        client.make_bucket("audit-logs")
```
[VERIFIED: codebase — migration 0003 shows the `MINIO_ENDPOINT`-gated pattern; audit-logs bucket creation follows same approach]
---
## Runtime State Inventory
> Phase 4 is a feature addition (not a rename/refactor). No runtime state migration is required.
| Category | Items Found | Action Required |
|----------|-------------|------------------|
| Stored data | No existing `pdf_open_mode` values; column is NEW (default `'in_app'`) | Alembic migration 0004 adds column with server_default — no data migration |
| Stored data | Existing `audit_log` table is EMPTY (no events written in prior phases) | No migration; Phase 4 adds audit writes to handlers going forward |
| Live service config | `audit-logs` MinIO bucket does not exist | Created in migration 0004 post-DDL step (gated on MINIO_ENDPOINT) |
| OS-registered state | None — verified by codebase scan | None |
| Secrets/env vars | No new env vars required for Phase 4 features | None |
| Build artifacts | No stale egg-info or compiled artifacts from prior phases | None |
---
## Environment Availability
| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| PostgreSQL | tsvector GIN index, CTE folder cascade | [runtime — Docker Compose] | 15+ | SQLite for unit tests (no tsvector; integration tests require INTEGRATION=1) |
| MinIO | Streaming proxy, audit CSV upload | [runtime — Docker Compose] | latest | Mock in tests (established pattern) |
| Redis | Celery beat daily export task | [runtime — Docker Compose] | latest | Not mockable for beat tasks; integration tests skip beat scheduling |
| Python csv (stdlib) | CSV export endpoint + Celery export task | ✓ | stdlib | — |
**Missing dependencies with no fallback:**
- PostgreSQL tsvector: cannot be tested without a real PostgreSQL instance. Tests that exercise FTS must be marked with `pytest.mark.skipif(not live_services_available, ...)` or use the `INTEGRATION=1` pattern.
**Missing dependencies with fallback:**
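- SQLite in unit tests: most folder/share CRUD tests run fine on SQLite. FTS tests require PostgreSQL; `WITH RECURSIVE` CTE tests are also planned for PostgreSQL (SQLite 3.8.3+ supports recursive CTEs, but this is unverified for the test environment).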
---
## Validation Architecture
### Test Framework
| Property | Value |
|----------|-------|
| Framework | pytest + pytest-asyncio (already configured) |
| Config file | `pytest.ini` or `pyproject.toml` (check existing) |
| Quick run command | `pytest backend/tests/test_folders.py backend/tests/test_shares.py -x` |
| Full suite command | `cd backend && pytest -v` |
### Phase Requirements → Test Map
| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| FOLD-01 | Create/rename/delete folder | Integration | `pytest backend/tests/test_folders.py::test_create_folder -x` | ❌ Wave 0 |
| FOLD-01 | Delete non-empty folder → warning count | Integration | `pytest backend/tests/test_folders.py::test_delete_folder_content_count -x` | ❌ Wave 0 |
| FOLD-02 | Move document to folder | Integration | `pytest backend/tests/test_folders.py::test_move_document -x` | ❌ Wave 0 |
| FOLD-02 | Move document — ownership assertion | Integration (negative) | `pytest backend/tests/test_folders.py::test_move_wrong_owner_404 -x` | ❌ Wave 0 |
| FOLD-03 | Breadcrumb path returned in folder response | Unit | `pytest backend/tests/test_folders.py::test_breadcrumb_path -x` | ❌ Wave 0 |
| FOLD-04 | Sort by name/date/size | Integration | `pytest backend/tests/test_folders.py::test_document_sort -x` | ❌ Wave 0 |
| FOLD-05 | tsvector search returns matching docs | Integration (PostgreSQL required) | `pytest backend/tests/test_folders.py::test_fts_search -x -m integration` | ❌ Wave 0 |
| SHARE-01 | Share by handle — handle not found → 404 | Integration | `pytest backend/tests/test_shares.py::test_share_handle_not_found -x` | ❌ Wave 0 |
| SHARE-01 | Share by handle — success | Integration | `pytest backend/tests/test_shares.py::test_share_success -x` | ❌ Wave 0 |
| SHARE-02 | Shared doc in recipient virtual folder | Integration | `pytest backend/tests/test_shares.py::test_shared_with_me -x` | ❌ Wave 0 |
| SHARE-02 | Shared doc — zero quota charged to recipient | Integration | `pytest backend/tests/test_shares.py::test_share_no_quota_impact -x` | ❌ Wave 0 |
| SHARE-04 | Revoke share — immediate for recipient | Integration | `pytest backend/tests/test_shares.py::test_revoke_share -x` | ❌ Wave 0 |
| SHARE-01..04 | Share IDOR — wrong owner cannot revoke | Security (negative) | `pytest backend/tests/test_shares.py::test_share_revoke_wrong_owner_404 -x` | ❌ Wave 0 |
| SEC-08 | credentials_enc absent from all responses | Security (negative) | `pytest backend/tests/test_security.py::test_credentials_enc_not_in_response -x` | ❌ Wave 0 |
| SEC-09 | Account deletion deletes user files | Integration | `pytest backend/tests/test_admin_api.py::test_delete_user_cleans_files -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy returns bytes, 200 or 206 | Integration | `pytest backend/tests/test_documents.py::test_content_stream_200 -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — no presigned URL in response | Security (negative) | `pytest backend/tests/test_documents.py::test_content_stream_no_presigned_url -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — Range header → 206 | Integration | `pytest backend/tests/test_documents.py::test_content_stream_206_range -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — admin blocked (403) | Security (negative) | `pytest backend/tests/test_documents.py::test_content_stream_admin_403 -x` | ❌ Wave 0 |
| ADMIN-06 | Audit log viewer — no document content in entries | Security (negative) | `pytest backend/tests/test_audit.py::test_audit_log_no_doc_content -x` | ❌ Wave 0 |
| ADMIN-06 | Audit log viewer — admin only | Security (negative) | `pytest backend/tests/test_audit.py::test_audit_log_regular_user_403 -x` | ❌ Wave 0 |
### Sampling Rate
- **Per task commit:** `pytest backend/tests/test_folders.py backend/tests/test_shares.py backend/tests/test_audit.py backend/tests/test_documents.py -x`
- **Per wave merge:** `cd backend && pytest -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`
### Wave 0 Gaps
- [ ] `backend/tests/test_folders.py` — covers FOLD-01..05
- [ ] `backend/tests/test_shares.py` — covers SHARE-01..05 + IDOR security tests
- [ ] `backend/tests/test_audit.py` — covers ADMIN-06 + no-doc-content security tests
- [ ] `backend/tests/test_documents.py` — add proxy tests (test_content_stream_*) to existing file
- [ ] `backend/tests/test_security.py` — add SEC-08, SEC-09 tests (may be in test_admin_api.py)
---
## Security Domain
### Applicable ASVS Categories
| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | yes | `get_regular_user` / `get_current_admin` deps on all new endpoints |
| V3 Session Management | no | No new session mechanisms introduced |
| V4 Access Control | yes | Ownership assertion on all folder/share endpoints; `get_regular_user` rejects admin on proxy |
| V5 Input Validation | yes | Pydantic models on all request bodies; `doc_id` parsed via `uuid.UUID()` |
| V6 Cryptography | no | No new cryptographic operations; credentials still encrypted via Phase 2 HKDF pattern |
### Known Threat Patterns for this Phase
| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| Share IDOR — revoke another user's share | Elevation of privilege | `share.owner_id == current_user.id` → 404 on mismatch |
| Share IDOR — access shared doc content without share | Information disclosure | `_can_access_document()` check in proxy endpoint: owner OR active share |
| PDF proxy leaking presigned URL | Information disclosure | `get_object()` fetches bytes directly; presigned URL never generated in proxy handler |
| Admin accessing document content via proxy | Information disclosure | `get_regular_user` dep raises 403 for admin role |
| Folder IDOR — delete another user's folder | Tampering | `folder.user_id == current_user.id` → 404 on mismatch |
| Audit log containing document content | Information disclosure | `write_audit_log()` metadata_ MUST NOT include `filename`, `extracted_text`, or file bytes |
| Audit log admin access by regular user | Elevation of privilege | `GET /api/admin/audit-log` uses `get_current_admin` dep |
| Path traversal via `folder_id` parameter | Information disclosure | All folder/document lookups via DB primary key; `uuid.UUID()` parse validates format |
| Tsvector search scope leak — user sees others' docs | Information disclosure | FTS query MUST include `Document.user_id == current_user.id` scope filter |
| credentials_enc in serialized admin response | Information disclosure | `CloudConnectionOut` excludes field; never in any response |
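The audit-log row in the table above can be enforced in code rather than by convention. Below is a minimal sketch of a guard that `write_audit_log()` could call before persisting metadata; the function name `assert_safe_audit_metadata` is hypothetical, and the forbidden keys are the two named in the table.

```python
# Keys named in the threat table as never allowed in audit metadata.
FORBIDDEN_AUDIT_KEYS = frozenset({"filename", "extracted_text"})


def assert_safe_audit_metadata(metadata: dict) -> dict:
    """Raise ValueError if metadata carries document content; else return it."""
    for key, value in metadata.items():
        if key in FORBIDDEN_AUDIT_KEYS:
            raise ValueError(f"audit metadata must not include {key!r}")
        # Raw file bytes are rejected regardless of the key they sit under.
        if isinstance(value, (bytes, bytearray, memoryview)):
            raise ValueError(f"audit metadata value for {key!r} is raw bytes")
    return metadata
```

Failing loudly at write time means a handler that accidentally passes a filename is caught by the first test that exercises it, not by a later audit-log review.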
### Security Gate Checklist (Phase 4 specific)
- [ ] `bandit -r backend/` — zero HIGH findings in new files (`api/folders.py`, `api/shares.py`, `api/audit.py`, `services/audit.py`, `tasks/audit_tasks.py`)
- [ ] `pip-audit` — zero new critical/high CVEs (no new packages)
- [ ] `npm audit --audit-level=high` — zero (no new npm packages)
- [ ] `test_share_revoke_wrong_owner_404` passes
- [ ] `test_content_stream_admin_403` passes
- [ ] `test_content_stream_no_presigned_url` passes
- [ ] `test_audit_log_no_doc_content` passes — verifies `filename`, `extracted_text` absent from all audit entries
- [ ] `test_credentials_enc_not_in_response` passes
- [ ] `test_fts_search_scoped_to_owner` passes — other user's docs not in search results
---
## Code Examples
### Folder Breadcrumb Path Query (PostgreSQL CTE)
```sql
-- Source: CITED: https://www.postgresql.org/docs/current/queries-with.html
WITH RECURSIVE breadcrumb AS (
SELECT id, parent_id, name, 1 AS depth
FROM folders
WHERE id = :folder_id AND user_id = :uid
UNION ALL
SELECT f.id, f.parent_id, f.name, b.depth + 1
FROM folders f
JOIN breadcrumb b ON f.id = b.parent_id
WHERE f.user_id = :uid AND b.depth < 50
)
SELECT id, name, depth FROM breadcrumb ORDER BY depth DESC;
```
The CTE walks upward from the current folder (depth 1) to the root (highest depth), so `ORDER BY depth DESC` returns the path root-first. Each segment is `{id, name}` for clickable navigation.
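Decision D-02 truncates breadcrumbs deeper than 4 levels to the first segment, an ellipsis, and the last two. The rule is small enough to state as code. This sketch is in Python for consistency with the other examples, although in practice the truncation may live in the Vue component; `truncate_breadcrumb` is a hypothetical name.

```python
def truncate_breadcrumb(segments, max_depth=4, ellipsis="..."):
    """Apply the D-02 rule: first + ellipsis + last 2 when depth > max_depth."""
    if len(segments) <= max_depth:
        return list(segments)
    return [segments[0], ellipsis, *segments[-2:]]
```

The input is the root-first list produced by the query above; the ellipsis placeholder is returned as a plain string so the caller can render it as a non-clickable element.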
### Full-text Search Query (SQLAlchemy)
```python
# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func, select

# Note: PostgreSQL only uses an expression index when the query expression
# matches the indexed expression, so the coalesce() wrapper here must also
# appear in the GIN index definition (or be dropped from both).
stmt = (
    select(Document)
    .where(
        Document.user_id == current_user.id,  # owner scope, never omit
        func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
            func.plainto_tsquery("english", q)
        ),
    )
    .order_by(Document.created_at.desc())
    .limit(per_page)
    .offset((page - 1) * per_page)
)
```
### "Shared with me" Virtual Folder Query
```python
# Source: CONTEXT.md D-06
from sqlalchemy import select
from db.models import Share, Document
stmt = (
select(Document)
.join(Share, Share.document_id == Document.id)
.where(Share.recipient_id == current_user.id)
.order_by(Document.created_at.desc())
)
result = await session.execute(stmt)
shared_docs = result.scalars().all()
```
### `is_shared` indicator on document list
```python
# Source: CONTEXT.md SHARE-05 — owner's list shows "shared" badge
from sqlalchemy import select
from db.models import Document, Share

# Add to list_documents query: IDs of documents the current user has shared
shared_subq = (
    select(Share.document_id)
    .where(Share.owner_id == current_user.id)
    .scalar_subquery()
)
# Add to Document select: boolean column, true when at least one share exists
stmt = select(Document, Document.id.in_(shared_subq).label("is_shared"))
---
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| PostgreSQL `to_tsquery` (operator syntax required) | `plainto_tsquery` (natural language input) | PostgreSQL 8.3+ | Search bars should use `plainto_tsquery`; `to_tsquery` for programmatic use only |
| PDF.js for in-browser PDF viewing | Native browser PDF viewer via `Content-Type: application/pdf` + `Content-Disposition: inline` | Browser support ~2018+ | All modern browsers support native PDF rendering; PDF.js only needed for custom annotation features |
| Alembic autogenerate for GIN expression indexes | Manual migration SQL | Alembic 1.13+ (known bug) | Must write `op.execute("CREATE INDEX ... USING GIN")` manually |
| `asyncio.run()` in Celery tasks | `asyncio.run(_async_func())` is the standard bridge | Established in Phase 3 document_tasks.py | Celery sync task calls `asyncio.run()` to enter async context |
**Deprecated/outdated:**
- `to_tsvector` with stored Computed column: adds schema complexity; expression index is simpler and equally performant for this scale
- Separate audit log middleware: D-14 decision is inline writes; middleware approach not used
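The `asyncio.run()` bridge listed in the table above can be sketched without Celery itself. The function names and the returned object key are hypothetical; in the real task module the sync function would carry the Celery task decorator and the coroutine would do the database reads and the MinIO upload.

```python
import asyncio


async def _export_audit_log_async(day: str) -> str:
    # Placeholder for the async DB query and the upload step.
    await asyncio.sleep(0)
    return f"audit-{day}.csv"


def export_audit_log(day: str) -> str:
    """Sync entry point: runs the coroutine to completion in a fresh event loop."""
    return asyncio.run(_export_audit_log_async(day))
```

`asyncio.run()` creates and closes its own event loop on every call, so it must not be called from code that is already running inside a loop.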
---
## Assumptions Log
| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | The streaming proxy's shared-document access check (Pattern 4) is implied by D-08 and SHARE-02 | Pattern 4, Security Domain | If recipients cannot access the proxy, "Shared with me" is useless; planner must confirm access rule |
| A2 | `session.flush()` (not `commit()`) inside `write_audit_log()` is the correct transactional pattern | Pattern 6 | Using `commit()` would create two separate transactions; if primary operation fails after audit commit, audit entry is orphaned with no corresponding operation |
| A3 | 300ms debounce interval for search | Pattern 10 | If too short, excessive API calls; if too long, UX feels sluggish. Standard convention, low risk |
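| A4 | `WITH RECURSIVE` CTE for subtree cascade-delete is the correct PostgreSQL approach; subtree cascade tests are planned as PostgreSQL integration tests (INTEGRATION=1) | Pitfall 1, Common Pitfalls | SQLite 3.8.3+ also supports `WITH RECURSIVE`, so the PostgreSQL-only restriction may be unnecessary; if the test environment ships an older SQLite, subtree cascade tests cannot run on it |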
| A5 | `MinIOBackend.put_object_raw()` method needs to be added for audit-logs bucket (different key scheme from documents) | Pattern 8 | If not added, Celery export task cannot upload to `audit-logs` bucket using the correct key |
| A6 | The `audit-logs` MinIO bucket creation is safe to gate on `MINIO_ENDPOINT` env var (same as migration 0003 pattern) | Pitfall 8 | If MinIO is present but bucket creation is skipped, first export task fails |
**If this table is empty:** Not applicable — some assumptions remain in this research.
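Assumption A2 is about keeping the audit write and the primary operation in one transaction. The sketch below illustrates that property with the stdlib `sqlite3` module as a stand-in for the SQLAlchemy session: the audit insert is only pending (the analogue of `flush()`) until the single `commit()` at the end. Table layouts and function names are hypothetical.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE audit_log (event TEXT, doc_id TEXT)")
    conn.execute("INSERT INTO documents VALUES ('d1')")
    conn.commit()
    return conn


def delete_with_audit(conn, doc_id, fail_after_audit=False):
    """Write the audit row and delete the document in one transaction."""
    try:
        # Pending write, not yet committed.
        conn.execute(
            "INSERT INTO audit_log (event, doc_id) VALUES ('document.delete', ?)",
            (doc_id,),
        )
        if fail_after_audit:
            raise RuntimeError("simulated failure in the primary operation")
        conn.execute("DELETE FROM documents WHERE id = ?", (doc_id,))
        conn.commit()  # one commit covers both statements
    except Exception:
        conn.rollback()  # the audit row disappears with the failed operation
        raise
```

Had the audit insert been committed on its own, the simulated failure would leave an audit entry for a delete that never happened, which is the risk A2 describes.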
---
## Open Questions (RESOLVED)
1. **Subtree folder delete — CTE vs. iterative in Python**
- What we know: `WITH RECURSIVE` CTE works in PostgreSQL; SQLite support (added in 3.8.3) has not been verified in the test environment
- What's unclear: Should the planner use the CTE (PostgreSQL-only, integration test) or iterative Python (works in SQLite, slower for deep trees)?
- Recommendation: Use CTE for the implementation (real database is PostgreSQL); mark the cascade-delete tests as `pytest.mark.skipif(not live_services_available, ...)` or use `INTEGRATION=1`
2. **Shared document access in proxy — `_can_access_document()` adds a DB query per request**
- What we know: Every `GET /api/documents/{id}/content` call currently does one DB lookup; adding share check adds a second
- What's unclear: Is the extra query acceptable given Phase 4 scale (single-user → small multi-user)?
- Recommendation: Add the share check unconditionally; at Phase 4 scale a second indexed query is negligible
3. **`PATCH /api/me/preferences` endpoint path**
- What we know: D-10 specifies this endpoint for pdf_open_mode
- What's unclear: Should it be on `/api/auth/me/preferences` (alongside `/api/auth/me`) or `/api/me/preferences` (new router)?
- Recommendation: Place on `/api/auth/me/preferences` (same router prefix as `/api/auth/me`) to keep auth-related user settings in the same module
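For reference, the iterative Python alternative weighed in Open Question 1 could look like the sketch below. It assumes the caller has already loaded the user's folders into a `parent_id -> [child_id, ...]` mapping; `collect_subtree_ids` and that mapping shape are hypothetical and not taken from the codebase.

```python
from collections import deque


def collect_subtree_ids(root_id, children_by_parent):
    """Breadth-first walk returning root_id plus all descendant folder IDs."""
    seen = {root_id}
    order = [root_id]
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        for child in children_by_parent.get(current, ()):
            if child not in seen:  # guards against cycles in corrupt data
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

This trades one recursive query for loading the folder list into memory, which is the cost that led to the CTE recommendation for deep trees.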
---
## Sources
### Primary (HIGH confidence)
- `backend/db/models.py` — all ORM models verified: `Folder`, `Share`, `AuditLog`, `Document.folder_id`, `User` (missing `pdf_open_mode`), `CloudConnection` (`credentials_enc` column confirmed)
- `backend/api/documents.py` — ownership assertion pattern, `get_regular_user` dep, quota UPDATE pattern
- `backend/deps/auth.py` — `get_regular_user` raises 403 for admin; `get_current_admin` pattern
- `backend/celery_app.py` — existing `beat_schedule` structure; `cleanup_abandoned_uploads` as template
- `backend/storage/minio_backend.py` — `get_object()` confirmed; `asyncio.to_thread()` pattern confirmed
- `backend/services/storage.py` — `list_metadata` scoping by `user_id` confirmed
### Secondary (MEDIUM confidence)
- [FastAPI discussions #7718](https://github.com/fastapi/fastapi/discussions/7718) — Range header StreamingResponse pattern verified by multiple community confirmations
- [PostgreSQL docs — Full-text search tables](https://www.postgresql.org/docs/current/textsearch-tables.html) — GIN expression index SQL verified from official docs
- [SQLAlchemy PostgreSQL dialect docs](https://docs.sqlalchemy.org/en/20/dialects/postgresql.html) — `func.to_tsvector`, `plainto_tsquery` support confirmed
- [Celery periodic tasks docs](https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html) — `crontab(hour=0, minute=0)` pattern
- [FastAPI response model docs](https://fastapi.tiangolo.com/tutorial/response-model/) — `response_model=CloudConnectionOut` exclusion pattern
### Tertiary (LOW confidence)
- [Alembic issue #1390](https://github.com/sqlalchemy/alembic/issues/1390) — tsvector autogenerate bug confirmed by issue thread; marked for validation
---
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH — all packages already in requirements.txt; codebase verified
- Architecture: HIGH — all ORM models exist and were verified in codebase
- Pitfalls: MEDIUM — Pitfalls 1, 4 are confirmed from codebase + security requirements; Pitfall 2 confirmed from Alembic issue tracker
- Code patterns: MEDIUM — Range header pattern from FastAPI community discussion; PostgreSQL FTS from official docs
**Research date:** 2026-05-25
**Valid until:** 2026-06-25 (stable FastAPI/SQLAlchemy stack; no fast-moving dependencies)