kite/.planning/phases/04-folders-sharing-quotas-document-ux/04-RESEARCH.md at main

curo/kite

Files

T

curo1305 747303246a docs(04): create phase 4 plan (9 plans, 7 waves)

Folders, Sharing, Quotas & Document UX — plans verified (0 blockers,
2 non-blocking warnings). Covers FOLD-01..05, SHARE-01..05, SEC-08/09,
ADMIN-06, DOC-01/02.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-25 18:20:16 +02:00

59 KiB

Raw Permalink Blame History

Researched: 2026-05-25 Domain: FastAPI folder/share CRUD, PostgreSQL tsvector, MinIO streaming proxy, Celery beat, Vue 3 folder navigation Confidence: HIGH

<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

D-01 Hybrid layout — AppSidebar shows top-level folders only. Sub-folders + breadcrumb in main content area. Top-level folders clickable directly in sidebar.

D-02 Unlimited nesting depth (no API or UI cap). Folder.parent_id self-referential FK is authoritative. Breadcrumbs truncate at depth > 4 (show first + "..." + last 2).

D-03 Non-empty folder delete: warning modal with document count. Confirm → cascade-delete all documents (MinIO + DB + quota). Cancel → no action. Documents are NOT moved to root — they are destroyed.

D-04 Exact handle input — no autocomplete. API returns 404 if handle not found; UI shows "User not found" error.

D-05 Share button on DocumentCard (inline icon button). Modal: (a) handle input, (b) Share button, (c) current recipients list with Revoke per row.

D-06 "Shared with me" is a fixed virtual folder entry in AppSidebar, rendered above the user's own folder list. Filtered by shares.recipient_id = current_user.id. Zero quota charged to recipient.

D-07 Share permission is view only for Phase 4. edit deferred.

D-08 Streaming proxy endpoint: GET /api/documents/{id}/content. Returns bytes via FastAPI StreamingResponse. Supports Range headers. Content-Disposition: inline. No presigned URL ever generated or exposed. Uses get_regular_user dep.

D-09 Native browser PDF rendering — no PDF.js. Zero frontend dependencies added. Content-Type header drives browser rendering.

D-10 users.pdf_open_mode column (String, default 'in_app'). Exposed via PATCH /api/me/preferences. in_app = modal with <iframe>; new_tab = window.open.

D-11 PostgreSQL tsvector GIN index on documents.extracted_text. plainto_tsquery('english', :q). Search scope: user's own docs only. GET /api/documents?q=<query>.

D-12 Inline search bar in document list view. Debounced, minimum 2 chars. Folder-context aware.

D-13 All 4 audit event categories logged: auth events, document operations, folder/share operations, admin actions.

D-14 Audit log written inline in handlers (no middleware, no Celery task). Shared helper write_audit_log(session, event_type, user_id, actor_id, resource_id, ip_address, metadata_dict).

D-15 Admin audit log viewer (/admin → AuditLog tab): paginated table, filters (date range, user, action type). No document content, filenames, or extracted text in any log entry.

D-16 Manual CSV and JSON export button in admin audit log view. Exports current filtered result set.

D-17 Daily automated CSV backup via Celery beat. Exports previous day's audit rows to MinIO audit-logs bucket (audit-logs/YYYY-MM-DD.csv). Bucket is private. Admins can download from admin panel. Rows NOT deleted from DB.

D-18 credentials_enc excluded from all serializers (SEC-08). Explicit Pydantic response model enforcement on all user-facing endpoints that touch CloudConnection.

D-19 Account deletion (admin-triggered) runs delete_user_files() before removing DB records (SEC-09). Cloud connection cleanup deferred to Phase 5.

Claude's Discretion

None specified — all significant decisions are locked.

Deferred Ideas (OUT OF SCOPE)

Per-day document scan quota tiers (future billing milestone)
Share permission levels beyond view (edit, comment deferred)
Presigned GET URLs for non-PDF document download
Audit log row archival/deletion policy (DB rows are not auto-deleted) </user_constraints>

<phase_requirements>

Phase Requirements

ID	Description	Research Support
FOLD-01	User can create, rename, and delete folders (delete confirms content count before proceeding)	`Folder` model complete in `db/models.py`; cascade-delete pattern via SQLAlchemy + atomic quota decrement established in Phase 3
FOLD-02	User can move documents between folders	`Document.folder_id` FK exists and is nullable; PATCH endpoint updates `folder_id` with ownership assertion on both doc and target folder
FOLD-03	Breadcrumb navigation renders current folder path; each segment clickable	Recursive path query via `WITH RECURSIVE` CTE or iterative parent-chain walk; Vue computed property for breadcrumb array
FOLD-04	Document list supports sort by name, date uploaded, and file size	SQLAlchemy `order_by()` with enum sort param; frontend `sort` query param
FOLD-05	Full-text search via PostgreSQL tsvector index on extracted text	Expression-index approach `to_tsvector('english', extracted_text)` + GIN; `plainto_tsquery` in `WHERE` clause; Alembic migration adds index manually (not as Computed column to avoid autogenerate noise)
SHARE-01	User can share a document with another user by their unique handle	`Share` model complete; `User.handle` unique lookup; exact match, no autocomplete
SHARE-02	Shared documents appear in "Shared with me" virtual folder; zero quota charged to recipient	Filter `shares.recipient_id = current_user.id`; no quota row touched; virtual folder is a UI/API filter, not a DB folder
SHARE-03	Shared access view-only by default; owner controls permission level	`Share.permission` defaults to `'view'`; Phase 4 only implements `view`; permission field returned in list responses
SHARE-04	Owner can revoke share; revocation is immediate	`DELETE /api/shares/{share_id}` with owner assertion; no async cleanup needed
SHARE-05	Documents shared with others display a "shared" indicator	Document list response includes `is_shared: bool` derived from EXISTS subquery on `shares.owner_id`
SEC-08	`credentials_enc` excluded from all API serializers	Explicit `CloudConnectionOut` Pydantic response model excluding `credentials_enc`
SEC-09	Account deletion triggers `delete_user_files()` per cloud connection before DB removal	Implemented in admin delete-user endpoint; iterates documents, deletes MinIO objects, decrements quota
ADMIN-06	Admin can view audit log filtered by date range, user, and action type (metadata only)	`AuditLog` model complete; paginated query with filters; admin endpoint uses `get_current_admin`
DOC-01	User can view document metadata and extracted text	Existing `GET /api/documents/{id}` already returns extracted_text; Phase 4 ensures it is included in response
DOC-02	In-browser PDF preview via PDF.js proxy; no presigned URLs exposed	`GET /api/documents/{id}/content` — StreamingResponse with Range header support, bytes from MinIO via `asyncio.to_thread`
</phase_requirements>

Summary

Phase 4 builds entirely on the existing ORM schema (all required models — Folder, Share, AuditLog, Document.folder_id, Share.recipient_id — are already in db/models.py). The work is predominantly plumbing: creating new API router modules for folders, shares, and audit log; adding a single Alembic migration for the users.pdf_open_mode column and the tsvector GIN expression index; extending the Celery beat schedule with a daily audit export task; and extending the Vue 3 frontend with folder navigation, sharing modals, and the settings preference toggle.

The highest-risk areas are: (1) the PDF streaming proxy with Range header support (needs careful byte-range parsing to handle PDF viewer seeking), (2) the tsvector GIN index and corresponding Alembic migration (Alembic autogenerate has known quirks with to_tsvector() expression indexes — the migration must be written manually), and (3) the folder cascade-delete with atomic quota decrement for multiple documents (each document's size_bytes must be summed and decremented in a single atomic UPDATE or via a pre-computed sum).

No new Python or npm packages are required for Phase 4. The decision to use native browser PDF rendering (D-09) eliminates the PDF.js dependency. All needed libraries — FastAPI, SQLAlchemy, MinIO SDK, Celery, Pydantic — are already pinned in requirements.txt. The frontend needs only Tailwind utility classes; no new npm packages.

Primary recommendation: Build in 6 plans: Wave 0 test scaffolds + migration; folders backend; shares backend + audit service; streaming proxy + search; audit log admin API + Celery export task; frontend folder navigation + sharing + settings.

Architectural Responsibility Map

Capability	Primary Tier	Secondary Tier	Rationale
Folder CRUD + breadcrumb path	API / Backend	Database / Storage	Ownership assertion and cascade-delete logic belongs in the API layer; path reconstruction is a DB query
Document move between folders	API / Backend	Database / Storage	Ownership assertions on both document and target folder must be co-located in the API handler
Full-text search (tsvector)	Database / Storage	API / Backend	GIN index and `plainto_tsquery` execute in PostgreSQL; API adds scope filter (`user_id`)
Share grant / revoke	API / Backend	Database / Storage	IDOR prevention requires ownership assertion before DB write; share table is the single source of truth
"Shared with me" virtual folder	API / Backend	—	Pure query filter on `shares.recipient_id`; no DB folder row; quota enforcement is in the API layer
PDF streaming proxy	API / Backend	Database / Storage	Bytes flow from MinIO through FastAPI to the browser; presigned URLs never reach the browser
Per-user PDF open preference	API / Backend	Database / Storage	User preference column (`users.pdf_open_mode`) read/written via `/api/me/preferences`
Audit log write	API / Backend	—	`write_audit_log()` called inline in each handler after successful operation; no middleware
Audit log viewer + export	API / Backend	Database / Storage	Admin-only paginated query with filters; CSV export is a StreamingResponse from a filtered DB query
Daily audit backup (Celery beat)	API / Backend (worker)	Database / Storage	Celery beat task runs in the worker; queries DB and uploads CSV to MinIO `audit-logs` bucket
Folder navigation UI state	Frontend Server (SSR) → Browser / Client	—	Folder ID and breadcrumb path held in Vue `ref()` in `HomeView.vue`; no server-side state
Share modal + indicator	Browser / Client	—	Share button on DocumentCard; modal state is local component `ref()`; share list in documents store
PDF open preference toggle	Browser / Client	API / Backend	Toggle in SettingsView.vue reads/writes preference via `PATCH /api/me/preferences`
`credentials_enc` exclusion (SEC-08)	API / Backend	—	Pydantic response model `CloudConnectionOut` excludes the field; never reaches serializer

Standard Stack

Core (all already in requirements.txt / package.json)

Library	Version	Purpose	Why Standard
FastAPI	0.136+	API router for folders, shares, audit, proxy, preferences	Already in use; `APIRouter` + `StreamingResponse` needed for proxy
SQLAlchemy 2.0 async	2.0+	ORM queries for all new endpoints	Established pattern in project; async session already wired
Alembic	1.18+	Migration 0004 — `users.pdf_open_mode` column + tsvector GIN index	Established migration toolchain
MinIO Python SDK	7.2+	`get_object()` for streaming proxy; `put_object()` for audit CSV upload	Already in use; `asyncio.to_thread()` pattern established
Celery + Redis	5.5+	Daily audit export beat task	Already in use; `beat_schedule` in `celery_app.py`
Pydantic v2	2.0+	`CloudConnectionOut` response model excluding `credentials_enc` (SEC-08)	Already in use for all request/response models
Vue 3 (Options API)	3.4+	Folder nav, share modal, settings toggle	Already in use; `ref()` + `watch()` patterns established

[VERIFIED: codebase — requirements.txt and package.json]

No New Packages Required

Phase 4 introduces zero new dependencies. Key rationale:

PDF viewer: D-09 chose native browser rendering — no PDF.js
Debounce: hand-rolled setTimeout/clearTimeout in Vue watch callback (< 10 lines, no lodash needed)
CSV export: Python stdlib csv.DictWriter with io.StringIO — no pandas
Search: PostgreSQL built-in plainto_tsquery — no Elasticsearch or external search engine

[VERIFIED: codebase — no new deps required]

Package Legitimacy Audit

No new packages are installed in Phase 4. All runtime dependencies are carried forward from Phases 1–3 with pinned versions in requirements.txt.

Packages removed due to slopcheck [SLOP] verdict: none Packages flagged as suspicious [SUS]: none New packages to install: none

Architecture Patterns

System Architecture Diagram

Browser
  │
  ├── GET /api/documents?q=search&folder_id=X&sort=name
  │     FastAPI → SQLAlchemy → PostgreSQL (tsvector @@ plainto_tsquery, scoped to user)
  │     ← JSON: [{id, filename, size_bytes, is_shared, folder_id, ...}]
  │
  ├── POST /api/folders / PATCH /api/folders/{id} / DELETE /api/folders/{id}
  │     FastAPI → ownership assertion → SQLAlchemy → PostgreSQL
  │     DELETE: sum(doc.size_bytes) → atomic quota decrement → MinIO delete_object per doc
  │
  ├── POST /api/shares / DELETE /api/shares/{id}
  │     FastAPI → ownership assertion (owner_id = current_user.id) → SQLAlchemy → PostgreSQL
  │     GET /api/shares/received → filter shares.recipient_id = current_user.id
  │
  ├── GET /api/documents/{id}/content   [Range: bytes=X-Y]
  │     FastAPI (get_regular_user) → ownership OR share-access check → SQLAlchemy (object_key)
  │     → MinIO get_object() via asyncio.to_thread() → StreamingResponse (206 or 200)
  │     ← bytes with Content-Type, Content-Disposition: inline, Accept-Ranges: bytes
  │     NOTE: no presigned URL generated; bytes flow through FastAPI
  │
  ├── PATCH /api/me/preferences
  │     FastAPI → SQLAlchemy UPDATE users.pdf_open_mode → ← {pdf_open_mode}
  │
  └── GET /api/admin/audit-log  [?start=&end=&user_id=&event_type=&page=]
        FastAPI (get_current_admin) → SQLAlchemy → PostgreSQL
        ← JSON: {items: [{id, event_type, user_id, actor_id, ip_address, metadata_, created_at}], total}

      GET /api/admin/audit-log/export?format=csv|json
        FastAPI (get_current_admin) → query → StreamingResponse (text/csv or application/json)

Celery Beat Worker (daily at midnight UTC)
  │
  └── audit_log_daily_export task
        → AsyncSessionLocal → SELECT audit_log WHERE date = yesterday
        → csv.DictWriter → io.BytesIO
        → MinIO put_object("audit-logs", "YYYY-MM-DD.csv", bytes)

Recommended Project Structure (new files only)

backend/
├── api/
│   ├── folders.py          # FOLD-01, FOLD-02, FOLD-03, FOLD-04, FOLD-05
│   ├── shares.py           # SHARE-01..05
│   └── audit.py            # ADMIN-06 (admin viewer + export)
├── services/
│   └── audit.py            # write_audit_log() helper
├── tasks/
│   └── audit_tasks.py      # audit_log_daily_export Celery beat task
└── migrations/versions/
    └── 0004_phase4_pdf_open_mode_tsvector.py

frontend/src/
├── components/
│   ├── documents/
│   │   └── ShareModal.vue          # D-05
│   ├── layout/
│   │   └── BreadcrumbNav.vue       # FOLD-03
│   └── admin/
│       └── AdminAuditLogTab.vue    # ADMIN-06 (D-15, D-16)
├── stores/
│   └── folders.js                  # folder state + actions
└── views/
    └── SettingsView.vue            # extend with PDF preference toggle (D-10)

Pattern 1: Ownership Assertion (established in Phase 3 — D-16)

What: Every resource endpoint asserts resource.user_id == current_user.id. Cross-user access returns 404 (not 403) to prevent attacker enumeration of resource IDs.

When to use: All folder CRUD, document move, share grant/revoke (owner side), streaming proxy.

# Source: backend/api/documents.py (established pattern)
doc = await session.get(Document, uid)
if doc is None or doc.user_id != current_user.id:
    raise HTTPException(status_code=404, detail="Document not found")

Folder variant:

folder = await session.get(Folder, folder_id)
if folder is None or folder.user_id != current_user.id:
    raise HTTPException(status_code=404, detail="Folder not found")

[VERIFIED: codebase — backend/api/documents.py]

Pattern 2: Atomic Quota Decrement for Folder Cascade-Delete

What: When deleting a folder, collect all document size_bytes first, then issue a single atomic quota UPDATE. Never read-then-write in Python.

When to use: DELETE /api/folders/{id} when folder is non-empty.

# Source: established atomic pattern (CLAUDE.md + Phase 3 D-07)
# Step 1: collect document IDs and total size within the folder subtree
result = await session.execute(
    select(Document.id, Document.size_bytes, Document.object_key)
    .where(Document.folder_id == folder_id, Document.user_id == current_user.id)
)
docs = result.all()
total_bytes = sum(row.size_bytes for row in docs)

# Step 2: atomic quota decrement (CASE WHEN pattern — SQLite compatible)
await session.execute(
    text(
        "UPDATE quotas SET used_bytes = "
        "CASE WHEN used_bytes > :delta THEN used_bytes - :delta ELSE 0 END "
        "WHERE user_id = :uid"
    ),
    {"delta": total_bytes, "uid": str(current_user.id)},
)

# Step 3: delete MinIO objects (best-effort)
for row in docs:
    try:
        await get_storage_backend().delete_object(row.object_key)
    except Exception:
        pass  # MinIO cleanup is best-effort

# Step 4: delete documents (cascade via FK) then folder
await session.execute(
    delete(Document).where(Document.folder_id == folder_id)
)
await session.delete(folder)
await session.commit()

[ASSUMED: the CASE WHEN pattern for SQLite compat is established in STATE.md; verified in existing delete_document service function]

Pattern 3: PDF Streaming Proxy with Range Headers

What: GET /api/documents/{id}/content fetches bytes from MinIO and streams them with Range header support. Status 206 for range requests; 200 for full fetches.

When to use: DOC-02. Required for PDF viewer to seek pages without re-downloading the entire file.

# Source: FastAPI discussion #7718 (CITED: github.com/fastapi/fastapi/discussions/7718)
from fastapi import Request
from fastapi.responses import StreamingResponse

def _parse_range(range_header: str, file_size: int) -> tuple[int, int]:
    """Parse 'bytes=start-end' Range header. Raises 416 on invalid range."""
    try:
        h = range_header.replace("bytes=", "").split("-")
        start = int(h[0]) if h[0] != "" else 0
        end = int(h[1]) if h[1] != "" else file_size - 1
    except (ValueError, IndexError):
        raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
    if start > end or start < 0 or end >= file_size:
        raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
    return start, end

@router.get("/{doc_id}/content")
async def stream_document_content(
    doc_id: str,
    request: Request,
    session: AsyncSession = Depends(get_db),
    current_user: User = Depends(get_regular_user),
):
    # Ownership assertion (DOC-02 also allows shared-with-me access — see Pattern 4)
    doc = await session.get(Document, uid)
    if doc is None or doc.user_id != current_user.id:
        # Also allow if document is shared with current_user
        share = await _get_share_for_user(session, uid, current_user.id)
        if share is None:
            raise HTTPException(404, "Document not found")
        # Use the document from the share's owner context
        # (doc is already loaded; ownership check skipped for shared access)

    # Fetch all bytes from MinIO (asyncio.to_thread wraps sync SDK)
    file_bytes = await get_storage_backend().get_object(doc.object_key)
    file_size = len(file_bytes)

    range_header = request.headers.get("range")
    headers = {
        "content-type": doc.content_type,
        "content-disposition": f"inline; filename=\"{doc.filename}\"",
        "accept-ranges": "bytes",
        "content-length": str(file_size),
    }

    if range_header:
        start, end = _parse_range(range_header, file_size)
        chunk = file_bytes[start:end + 1]
        headers["content-range"] = f"bytes {start}-{end}/{file_size}"
        headers["content-length"] = str(len(chunk))
        return StreamingResponse(
            iter([chunk]),
            status_code=206,
            headers=headers,
        )

    return StreamingResponse(
        iter([file_bytes]),
        status_code=200,
        headers=headers,
    )

Critical: get_regular_user dep ensures admins cannot access document content (DOC-02 / CLAUDE.md). No presigned URL is generated at any point.

[CITED: https://github.com/fastapi/fastapi/discussions/7718] [CITED: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/206]

Pattern 4: Shared-Document Access for Streaming Proxy

What: The streaming proxy must allow both document owners AND recipients to access content.

When to use: Only in GET /api/documents/{id}/content — sharing grants read access to the content stream.

# Source: CONTEXT.md D-08 — recipient can preview shared document
async def _can_access_document(
    session: AsyncSession, doc: Document, current_user_id: uuid.UUID
) -> bool:
    """True if user owns the document OR has an active share as recipient."""
    if doc.user_id == current_user_id:
        return True
    result = await session.execute(
        select(Share).where(
            Share.document_id == doc.id,
            Share.recipient_id == current_user_id,
        )
    )
    return result.scalar_one_or_none() is not None

[ASSUMED: the share-based access for the streaming proxy is implied by D-08 and SHARE-02; not explicitly called out in a prior phase decision]

Pattern 5: PostgreSQL tsvector Full-Text Search

What: GIN expression index on documents.extracted_text for full-text search. Query uses plainto_tsquery (accepts natural language; no operator syntax required from user).

When to use: FOLD-05 — GET /api/documents?q=<query>.

Migration (manual — do NOT use Computed() to avoid Alembic autogenerate noise):

# Source: CITED: https://www.postgresql.org/docs/current/textsearch-tables.html
# In migration 0004:
op.execute(
    "CREATE INDEX ix_documents_fts ON documents "
    "USING GIN (to_tsvector('english', coalesce(extracted_text, '')))"
)

Query pattern:

# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func, text

stmt = (
    select(Document)
    .where(
        Document.user_id == current_user.id,
        func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
            func.plainto_tsquery("english", q)
        ),
    )
    .order_by(Document.created_at.desc())
)

Why plainto_tsquery not to_tsquery: plainto_tsquery accepts unstructured natural language ("invoice march 2024") without requiring the user to supply & and | operators. Simpler and safer for a search bar input.

Why expression index not generated/stored column: PostgreSQL generated columns with TSVECTOR type require PostgreSQL 12+ and add schema complexity. The expression index approach is simpler and Alembic-compatible (written as raw SQL in the migration; autogenerate will not re-detect it if managed manually).

Important Alembic caveat: include_schemas=True autogenerate WILL repeatedly flag the expression index for recreation (known Alembic issue: github.com/sqlalchemy/alembic/issues/1390). The migration MUST be written manually and committed once. Do not run alembic revision --autogenerate after adding the GIN index without reviewing the output.

[CITED: https://www.postgresql.org/docs/current/textsearch-tables.html] [CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html] [CITED: https://github.com/sqlalchemy/alembic/issues/1390]

Pattern 6: write_audit_log() Helper

What: Shared service function called inline in each handler after a successful operation. Does NOT raise on failure (audit log writing is best-effort — do not fail the primary operation).

When to use: Every handler in Phase 4 that performs state-changing operations; plus back-filling into auth.py and admin.py.

# Source: CONTEXT.md D-14
# backend/services/audit.py
from __future__ import annotations
import uuid
from typing import Optional
from sqlalchemy.ext.asyncio import AsyncSession
from db.models import AuditLog
import logging

logger = logging.getLogger(__name__)

async def write_audit_log(
    session: AsyncSession,
    event_type: str,
    user_id: Optional[uuid.UUID],
    actor_id: Optional[uuid.UUID],
    resource_id: Optional[uuid.UUID],
    ip_address: Optional[str],
    metadata_: Optional[dict] = None,
) -> None:
    """Write an audit log entry. Never raises — audit failure is non-fatal."""
    try:
        entry = AuditLog(
            event_type=event_type,
            user_id=user_id,
            actor_id=actor_id,
            resource_id=resource_id,
            ip_address=ip_address,
            metadata_=metadata_,
        )
        session.add(entry)
        await session.flush()  # flush within the handler's existing transaction
    except Exception as exc:
        logger.warning("audit log write failed: %s", exc)
        # Do not re-raise — audit failure must never abort the primary operation

Critical: session.flush() not session.commit() — audit log writes within the same transaction as the primary operation, so the primary operation's session.commit() commits the audit entry too. This avoids partial writes where the operation succeeds but the audit entry is orphaned.

[ASSUMED: flush-not-commit pattern is the safest approach given transactional semantics; not explicitly documented in a prior phase decision]

Pattern 7: CSV Export (admin endpoint)

What: GET /api/admin/audit-log/export?format=csv returns a StreamingResponse with text/csv content type and Content-Disposition: attachment.

# Source: Python stdlib csv module
import csv
import io
from fastapi.responses import StreamingResponse

async def export_audit_log(
    session: AsyncSession,
    start: Optional[datetime],
    end: Optional[datetime],
    user_id: Optional[uuid.UUID],
    event_type: Optional[str],
    format: str = "csv",
):
    rows = await _query_audit_log(session, start, end, user_id, event_type)
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[
        "id", "event_type", "user_id", "actor_id",
        "resource_id", "ip_address", "metadata_", "created_at"
    ])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "id": row.id,
            "event_type": row.event_type,
            "user_id": str(row.user_id) if row.user_id else "",
            "actor_id": str(row.actor_id) if row.actor_id else "",
            "resource_id": str(row.resource_id) if row.resource_id else "",
            "ip_address": str(row.ip_address) if row.ip_address else "",
            "metadata_": str(row.metadata_) if row.metadata_ else "",
            "created_at": row.created_at.isoformat(),
        })
    output.seek(0)
    return StreamingResponse(
        iter([output.getvalue()]),
        media_type="text/csv",
        headers={"Content-Disposition": f"attachment; filename=audit-export.csv"},
    )

[CITED: Python stdlib csv documentation — https://docs.python.org/3/library/csv.html]

Pattern 8: Celery Beat — Daily Audit Export Task

What: A new task audit_log_daily_export added to beat_schedule. Uses crontab(hour=0, minute=0) for midnight UTC.

# Source: CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
from celery.schedules import crontab

# In celery_app.py beat_schedule dict — ADD to existing schedule:
"audit-log-daily-export": {
    "task": "tasks.audit_tasks.audit_log_daily_export",
    "schedule": crontab(hour=0, minute=0),  # midnight UTC
},

Task implementation pattern (mirrors cleanup_abandoned_uploads):

@celery_app.task(name="tasks.audit_tasks.audit_log_daily_export")
def audit_log_daily_export() -> dict:
    return asyncio.run(_run_daily_export())

async def _run_daily_export() -> dict:
    from datetime import date, timedelta
    import csv, io
    from db.session import AsyncSessionLocal
    from db.models import AuditLog
    from storage import get_storage_backend
    from sqlalchemy import select

    yesterday = date.today() - timedelta(days=1)
    start = datetime(yesterday.year, yesterday.month, yesterday.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)

    async with AsyncSessionLocal() as session:
        result = await session.execute(
            select(AuditLog)
            .where(AuditLog.created_at >= start, AuditLog.created_at < end)
            .order_by(AuditLog.created_at)
        )
        rows = result.scalars().all()

    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[...])
    writer.writeheader()
    for row in rows:
        writer.writerow({...})

    csv_bytes = output.getvalue().encode("utf-8")
    key = f"audit-logs/{yesterday.isoformat()}.csv"
    backend = get_storage_backend()
    await backend.put_object_raw(
        bucket="audit-logs",
        key=key,
        data=io.BytesIO(csv_bytes),
        length=len(csv_bytes),
        content_type="text/csv",
    )
    return {"exported": len(rows), "key": key}

Note: MinIOBackend may need a put_object_raw(bucket, key, data, length, content_type) method (vs the existing put_object which constructs the key schema). The audit-logs bucket uses a different key scheme. Add this method in the migration plan.

[CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html]

Pattern 9: Pydantic Response Model Exclusion (SEC-08)

What: Explicit response_model=CloudConnectionOut where CloudConnectionOut omits credentials_enc. This is the safe-by-default Pydantic v2 approach.

# Source: CITED: https://fastapi.tiangolo.com/tutorial/response-model/
from pydantic import BaseModel
from datetime import datetime

class CloudConnectionOut(BaseModel):
    id: str
    provider: str
    display_name: str
    status: str
    connected_at: datetime
    # credentials_enc is deliberately absent — never serialized (SEC-08)

    model_config = {"from_attributes": True}

Phase 4 scope: No cloud connection endpoints exist yet (Phase 5). Phase 4's obligation is to ensure the CloudConnectionOut model is defined and used as response_model on any admin endpoint that touches cloud_connections. Since no cloud connection endpoints exist yet, Phase 4 creates the model in schemas/cloud.py (or inline in api/admin.py) so Phase 5 cannot accidentally expose the field.

[CITED: https://fastapi.tiangolo.com/tutorial/response-model/]

Pattern 10: Vue 3 Debounced Search (no external dependency)

What: Watch the search query ref with a manual debounce using setTimeout/clearTimeout. Does not require lodash or VueUse.

// Source: Vue 3 Composition API watch pattern (Options API compatible via watch option)
// frontend/src/stores/documents.js
import { ref, watch } from 'vue'

const searchQuery = ref('')
let _searchTimer = null

watch(searchQuery, (newVal) => {
  clearTimeout(_searchTimer)
  if (newVal.length < 2) {
    // Clear results or show all
    return
  }
  _searchTimer = setTimeout(() => {
    fetchDocuments({ q: newVal, folderId: currentFolderId.value })
  }, 300)
})

[ASSUMED: 300ms is conventional for search debounce; no official Vue doc specifies this value]

Anti-Patterns to Avoid

Using session.commit() inside write_audit_log(): Must use flush() to stay within the handler's transaction. A separate commit risks committing the audit entry without the primary operation.
Generating presigned URLs in the streaming proxy: D-08 explicitly prohibits this. Never call presigned_get_url() from the proxy endpoint. The bytes must flow through FastAPI.
to_tsquery instead of plainto_tsquery: to_tsquery requires the user to supply &, |, and ! operators. Search bar input uses plainto_tsquery which accepts natural language phrases.
Using response_model_exclude on CloudConnection: Prefer defining an explicit CloudConnectionOut model that never includes credentials_enc. response_model_exclude is fragile (requires remembering to pass the set on every endpoint). Whitelist is safer.
Storing breadcrumb state in the URL query string: Breadcrumb is derived from the current folder_id parameter — only the current folder ID needs to be in the URL (or in Vue state). The full path is reconstructed from the DB on navigation.
Recursive SQL for breadcrumb without depth limit: WITH RECURSIVE CTE is correct but should include a depth guard (WHERE depth < 50) to prevent infinite recursion on a corrupted self-referential FK.

Don't Hand-Roll

Problem	Don't Build	Use Instead	Why
Range header parsing	Custom byte-range logic from scratch	Pattern 3 above (from FastAPI discussion #7718)	Browser PDFs use RFC 7233 range requests; edge cases (open-ended ranges `bytes=500-`) must be handled
Full-text search ranking	BM25 in Python	PostgreSQL `ts_rank()` + `plainto_tsquery`	Runs in the DB; GIN index makes it fast; no external service needed
CSV serialization	String concatenation with escaping	Python stdlib `csv.DictWriter`	Handles quoting, escaping, newlines in field values automatically
Audit log middleware	Celery task or FastAPI middleware for audit writes	Inline `write_audit_log()` after successful operation	Middleware fires on request start — cannot know if operation succeeded; Celery task adds latency and potential missed writes on worker failure
Tsvector maintenance	Application-layer `extracted_text` → tsvector sync	PostgreSQL GIN expression index `to_tsvector('english', extracted_text)`	DB recalculates automatically on UPDATE; no trigger or application sync needed
Debounce utility	npm package (lodash, VueUse)	`setTimeout`/`clearTimeout` in watch	< 10 lines; no new dependency justified

Key insight: PostgreSQL handles full-text search natively with GIN indexes at query time — no external search infrastructure (Elasticsearch, Typesense, Meilisearch) is needed or justified at this scale.

Common Pitfalls

Pitfall 1: Folder Cascade-Delete Misses Sub-folder Documents

What goes wrong: DELETE /api/folders/{id} deletes direct child documents but leaves orphaned documents in sub-folders. Quota becomes incorrect.

Why it happens: A naive WHERE folder_id = :id only catches immediate children. Self-referential FK with unlimited nesting requires a recursive query to collect the full subtree.

How to avoid: Use WITH RECURSIVE CTE to collect all folder IDs in the subtree before deleting:

WITH RECURSIVE subtree AS (
    SELECT id FROM folders WHERE id = :root_id AND user_id = :uid
    UNION ALL
    SELECT f.id FROM folders f
    JOIN subtree s ON f.parent_id = s.id
    WHERE f.user_id = :uid AND depth < 50
)
SELECT d.id, d.size_bytes, d.object_key
FROM documents d
JOIN subtree s ON d.folder_id = s.id
WHERE d.user_id = :uid

Warning signs: After deleting a folder tree, quotas.used_bytes does not match SUM(documents.size_bytes).

[ASSUMED: SQLite does not support WITH RECURSIVE for test runs — integration tests requiring this pattern must use PostgreSQL (INTEGRATION=1)]

Pitfall 2: Alembic Autogenerate Repeatedly Detects tsvector GIN Index

What goes wrong: After adding the GIN expression index in migration 0004, running alembic revision --autogenerate again generates a migration that drops and recreates the same index.

Why it happens: Alembic's autogenerate does not understand functional/expression indexes on PostgreSQL when the expression involves function calls like to_tsvector(). Known issue in Alembic.

How to avoid: Write the GIN index migration manually (as raw op.execute(CREATE INDEX...)) and do NOT add the index to the ORM model's __table_args__. Keep it only in the migration. Document with a comment # managed manually — do not autogenerate.

Warning signs: Repeated alembic revision --autogenerate output shows DROP INDEX / CREATE INDEX for ix_documents_fts on every run.

[CITED: https://github.com/sqlalchemy/alembic/issues/1390]

Pitfall 3: PDF Streaming Proxy — Admin Access

What goes wrong: Admin user calls GET /api/documents/{id}/content and receives document bytes.

Why it happens: Using get_current_user instead of get_regular_user on the proxy endpoint.

How to avoid: The proxy endpoint MUST use get_regular_user dep, which raises HTTP 403 for user.role == 'admin'. This is established in Phase 3 (D-16) and must be applied to the proxy endpoint.

Warning signs: Security test test_admin_cannot_access_document_content fails or is absent.

[VERIFIED: codebase — backend/deps/auth.py get_regular_user raises 403 for admin role]

What goes wrong: User A shares document with User B. User C (also a recipient of a different share on the same document, or just any user) can enumerate share IDs and revoke or view shares they don't own.

Why it happens: DELETE /api/shares/{share_id} checks only that the share exists, not that share.owner_id == current_user.id.

How to avoid: Ownership assertion on all share mutation endpoints:

share = await session.get(Share, share_id)
if share is None or share.owner_id != current_user.id:
    raise HTTPException(404, "Share not found")

Warning signs: Security test test_share_revoke_wrong_owner_returns_404 is absent.

[ASSUMED: standard IDOR prevention pattern per CLAUDE.md; test must be written]

Pitfall 5: write_audit_log Does Not Include IP Address

What goes wrong: ip_address column is left NULL for all events because the handler does not extract it from the request.

Why it happens: The request.client.host extraction is not wired into write_audit_log() calls.

How to avoid: Inject request: Request into each handler and pass request.client.host as ip_address to write_audit_log(). For handlers behind a reverse proxy, use request.headers.get("X-Forwarded-For", request.client.host).

[ASSUMED: IP extraction from FastAPI Request is standard; X-Forwarded-For handling depends on deployment]

Pitfall 6: Folder Name Uniqueness Constraint Violation

What goes wrong: Creating two folders with the same name under the same parent raises an IntegrityError from the uq_folders_user_parent_name constraint, which surfaces as HTTP 500.

Why it happens: The Folder model has UniqueConstraint("user_id", "parent_id", "name"). The API handler does not check for duplicates before inserting.

How to avoid: Catch IntegrityError from the ORM insert and return HTTP 409 Conflict:

from sqlalchemy.exc import IntegrityError
try:
    session.add(folder)
    await session.commit()
except IntegrityError:
    await session.rollback()
    raise HTTPException(409, "A folder with that name already exists here")

Warning signs: POST /api/folders returns 500 on duplicate name.

[VERIFIED: codebase — db/models.py Folder UniqueConstraint("user_id", "parent_id", "name")]

Pitfall 7: "Shared with Me" Leaks Document Content

What goes wrong: The GET /api/documents list endpoint accidentally returns documents shared with the current user (not just owned by them), exposing filenames and extracted text to recipients.

Why it happens: A join on the shares table is added to the list query without scoping by ownership.

How to avoid: The standard list endpoint (GET /api/documents) MUST filter WHERE documents.user_id = current_user.id only. Shared documents are a separate endpoint GET /api/shares/received. Extracted text is not returned in the shared documents list view — only metadata.

[VERIFIED: codebase — services/storage.py list_metadata already scopes by user_id]

Pitfall 8: MinIO Audit-Logs Bucket Does Not Exist

What goes wrong: The daily Celery beat task fails with NoSuchBucket error on first run.

Why it happens: The audit-logs bucket is not created in the migration or application lifespan.

How to avoid: Create the audit-logs bucket in migration 0004's upgrade() function (same pattern as the main documents bucket, gated on MINIO_ENDPOINT env var to allow SQLite test runs):

# In migration 0004 post-DDL step
minio_endpoint = os.environ.get("MINIO_ENDPOINT")
if minio_endpoint:
    client = Minio(minio_endpoint, ...)
    if not client.bucket_exists("audit-logs"):
        client.make_bucket("audit-logs")

[VERIFIED: codebase — migration 0003 shows the MINIO_ENDPOINT-gated pattern; audit-logs bucket creation follows same approach]

Runtime State Inventory

Phase 4 is a feature addition (not a rename/refactor). No runtime state migration is required.

Category	Items Found	Action Required
Stored data	No existing `pdf_open_mode` values; column is NEW (default `'in_app'`)	Alembic migration 0004 adds column with server_default — no data migration
Stored data	Existing `audit_log` table is EMPTY (no events written in prior phases)	No migration; Phase 4 backfills audit writes into handlers going forward
Live service config	`audit-logs` MinIO bucket does not exist	Created in migration 0004 post-DDL step (gated on MINIO_ENDPOINT)
OS-registered state	None — verified by codebase scan	None
Secrets/env vars	No new env vars required for Phase 4 features	None
Build artifacts	No stale egg-info or compiled artifacts from prior phases	None

Environment Availability

Dependency	Required By	Available	Version	Fallback
PostgreSQL	tsvector GIN index, CTE folder cascade	[runtime — Docker Compose]	15+	SQLite for unit tests (no tsvector; integration tests require INTEGRATION=1)
MinIO	Streaming proxy, audit CSV upload	[runtime — Docker Compose]	latest	Mock in tests (established pattern)
Redis	Celery beat daily export task	[runtime — Docker Compose]	latest	Not mockable for beat tasks; integration tests skip beat scheduling
Python csv (stdlib)	CSV export endpoint + Celery export task	✓	stdlib	—

Missing dependencies with no fallback:

PostgreSQL tsvector: cannot be tested without a real PostgreSQL instance. Tests that exercise FTS must be marked with pytest.mark.skipif(not live_services_available, ...) or use the INTEGRATION=1 pattern.

Missing dependencies with fallback:

SQLite in unit tests: most folder/share CRUD tests run fine on SQLite. Only FTS and WITH RECURSIVE CTE tests require PostgreSQL.

Validation Architecture

Test Framework

Property	Value
Framework	pytest + pytest-asyncio (already configured)
Config file	`pytest.ini` or `pyproject.toml` (check existing)
Quick run command	`pytest backend/tests/test_folders.py backend/tests/test_shares.py -x`
Full suite command	`cd backend && pytest -v`

Phase Requirements → Test Map

Req ID	Behavior	Test Type	Automated Command	File Exists?
FOLD-01	Create/rename/delete folder	Integration	`pytest backend/tests/test_folders.py::test_create_folder -x`	❌ Wave 0
FOLD-01	Delete non-empty folder → warning count	Integration	`pytest backend/tests/test_folders.py::test_delete_folder_content_count -x`	❌ Wave 0
FOLD-02	Move document to folder	Integration	`pytest backend/tests/test_folders.py::test_move_document -x`	❌ Wave 0
FOLD-02	Move document — ownership assertion	Integration (negative)	`pytest backend/tests/test_folders.py::test_move_wrong_owner_404 -x`	❌ Wave 0
FOLD-03	Breadcrumb path returned in folder response	Unit	`pytest backend/tests/test_folders.py::test_breadcrumb_path -x`	❌ Wave 0
FOLD-04	Sort by name/date/size	Integration	`pytest backend/tests/test_folders.py::test_document_sort -x`	❌ Wave 0
FOLD-05	tsvector search returns matching docs	Integration (PostgreSQL required)	`pytest backend/tests/test_folders.py::test_fts_search -x -m integration`	❌ Wave 0
SHARE-01	Share by handle — handle not found → 404	Integration	`pytest backend/tests/test_shares.py::test_share_handle_not_found -x`	❌ Wave 0
SHARE-01	Share by handle — success	Integration	`pytest backend/tests/test_shares.py::test_share_success -x`	❌ Wave 0
SHARE-02	Shared doc in recipient virtual folder	Integration	`pytest backend/tests/test_shares.py::test_shared_with_me -x`	❌ Wave 0
SHARE-02	Shared doc — zero quota charged to recipient	Integration	`pytest backend/tests/test_shares.py::test_share_no_quota_impact -x`	❌ Wave 0
SHARE-04	Revoke share — immediate for recipient	Integration	`pytest backend/tests/test_shares.py::test_revoke_share -x`	❌ Wave 0
SHARE-01..04	Share IDOR — wrong owner cannot revoke	Security (negative)	`pytest backend/tests/test_shares.py::test_share_revoke_wrong_owner_404 -x`	❌ Wave 0
SEC-08	credentials_enc absent from all responses	Security (negative)	`pytest backend/tests/test_security.py::test_credentials_enc_not_in_response -x`	❌ Wave 0
SEC-09	Account deletion deletes user files	Integration	`pytest backend/tests/test_admin_api.py::test_delete_user_cleans_files -x`	❌ Wave 0
DOC-02	PDF proxy returns bytes, 200 or 206	Integration	`pytest backend/tests/test_documents.py::test_content_stream_200 -x`	❌ Wave 0
DOC-02	PDF proxy — no presigned URL in response	Security (negative)	`pytest backend/tests/test_documents.py::test_content_stream_no_presigned_url -x`	❌ Wave 0
DOC-02	PDF proxy — Range header → 206	Integration	`pytest backend/tests/test_documents.py::test_content_stream_206_range -x`	❌ Wave 0
DOC-02	PDF proxy — admin blocked (403)	Security (negative)	`pytest backend/tests/test_documents.py::test_content_stream_admin_403 -x`	❌ Wave 0
ADMIN-06	Audit log viewer — no document content in entries	Security (negative)	`pytest backend/tests/test_audit.py::test_audit_log_no_doc_content -x`	❌ Wave 0
ADMIN-06	Audit log viewer — admin only	Security (negative)	`pytest backend/tests/test_audit.py::test_audit_log_regular_user_403 -x`	❌ Wave 0

Sampling Rate

Per task commit: pytest backend/tests/test_folders.py backend/tests/test_shares.py backend/tests/test_audit.py backend/tests/test_documents.py -x
Per wave merge: cd backend && pytest -v
Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

backend/tests/test_folders.py — covers FOLD-01..05
backend/tests/test_shares.py — covers SHARE-01..05 + IDOR security tests
backend/tests/test_audit.py — covers ADMIN-06 + no-doc-content security tests
backend/tests/test_documents.py — add proxy tests (test_content_stream_*) to existing file
backend/tests/test_security.py — add SEC-08, SEC-09 tests (may be in test_admin_api.py)

Security Domain

Applicable ASVS Categories

ASVS Category	Applies	Standard Control
V2 Authentication	yes	`get_regular_user` / `get_current_admin` deps on all new endpoints
V3 Session Management	no	No new session mechanisms introduced
V4 Access Control	yes	Ownership assertion on all folder/share endpoints; `get_regular_user` rejects admin on proxy
V5 Input Validation	yes	Pydantic models on all request bodies; `doc_id` parsed via `uuid.UUID()`
V6 Cryptography	no	No new cryptographic operations; credentials still encrypted via Phase 2 HKDF pattern

Known Threat Patterns for this Phase

Pattern	STRIDE	Standard Mitigation
Share IDOR — revoke another user's share	Elevation of privilege	`share.owner_id == current_user.id` → 404 on mismatch
Share IDOR — access shared doc content without share	Information disclosure	`_can_access_document()` check in proxy endpoint: owner OR active share
PDF proxy leaking presigned URL	Information disclosure	`get_object()` fetches bytes directly; presigned URL never generated in proxy handler
Admin accessing document content via proxy	Broken access control	`get_regular_user` dep raises 403 for admin role
Folder IDOR — delete another user's folder	Tampering	`folder.user_id == current_user.id` → 404 on mismatch
Audit log containing document content	Sensitive data exposure	`write_audit_log()` metadata_ MUST NOT include `filename`, `extracted_text`, or file bytes
Audit log admin access by regular user	Broken access control	`GET /api/admin/audit-log` uses `get_current_admin` dep
Path traversal via `folder_id` parameter	Information disclosure	All folder/document lookups via DB primary key; `uuid.UUID()` parse validates format
Tsvector search scope leak — user sees others' docs	Information disclosure	FTS query MUST include `Document.user_id == current_user.id` scope filter
credentials_enc in serialized admin response	Sensitive data exposure	`CloudConnectionOut` excludes field; never in any response

Security Gate Checklist (Phase 4 specific)

bandit -r backend/ — zero HIGH findings in new files (api/folders.py, api/shares.py, api/audit.py, services/audit.py, tasks/audit_tasks.py)
pip audit — zero new critical/high CVEs (no new packages)
npm audit --audit-level=high — zero (no new npm packages)
test_share_revoke_wrong_owner_404 passes
test_content_stream_admin_403 passes
test_content_stream_no_presigned_url passes
test_audit_log_no_doc_content passes — verifies filename, extracted_text absent from all audit entries
test_credentials_enc_not_in_response passes
test_fts_search_scoped_to_owner passes — other user's docs not in search results

Code Examples

-- Source: CITED: https://www.postgresql.org/docs/current/queries-with.html
WITH RECURSIVE breadcrumb AS (
    SELECT id, parent_id, name, 1 AS depth
    FROM folders
    WHERE id = :folder_id AND user_id = :uid
    UNION ALL
    SELECT f.id, f.parent_id, f.name, b.depth + 1
    FROM folders f
    JOIN breadcrumb b ON f.id = b.parent_id
    WHERE f.user_id = :uid AND b.depth < 50
)
SELECT id, name, depth FROM breadcrumb ORDER BY depth DESC;

Returns path from root to current folder (root = highest depth value after ORDER BY depth DESC). Each segment is {id, name} for clickable navigation.

Full-text Search Query (SQLAlchemy)

# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func
from sqlalchemy.dialects.postgresql import TSVECTOR

stmt = (
    select(Document)
    .where(
        Document.user_id == current_user.id,
        func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
            func.plainto_tsquery("english", q)
        ),
    )
    .order_by(Document.created_at.desc())
    .limit(per_page)
    .offset((page - 1) * per_page)
)

"Shared with me" Virtual Folder Query

# Source: CONTEXT.md D-06
from sqlalchemy import select
from db.models import Share, Document

stmt = (
    select(Document)
    .join(Share, Share.document_id == Document.id)
    .where(Share.recipient_id == current_user.id)
    .order_by(Document.created_at.desc())
)
result = await session.execute(stmt)
shared_docs = result.scalars().all()

`is_shared` indicator on document list

# Source: CONTEXT.md SHARE-05 — owner's list shows "shared" badge
from sqlalchemy import exists, select

# Add to list_documents query:
shared_subq = (
    select(Share.document_id)
    .where(Share.owner_id == current_user.id)
    .scalar_subquery()
)
# Add to Document select:
stmt = select(Document, Document.id.in_(shared_subq).label("is_shared"))

State of the Art

Old Approach	Current Approach	When Changed	Impact
PostgreSQL `to_tsquery` (operator syntax required)	`plainto_tsquery` (natural language input)	PostgreSQL 9.3+	Search bars should use `plainto_tsquery`; `to_tsquery` for programmatic use only
PDF.js for in-browser PDF viewing	Native browser PDF viewer via `Content-Type: application/pdf` + `Content-Disposition: inline`	Browser support ~2018+	All modern browsers support native PDF rendering; PDF.js only needed for custom annotation features
Alembic autogenerate for GIN expression indexes	Manual migration SQL	Alembic 1.13+ (known bug)	Must write `op.execute("CREATE INDEX ... USING GIN")` manually
`asyncio.run()` in Celery tasks	`asyncio.run(_async_func())` is the standard bridge	Established in Phase 3 document_tasks.py	Celery sync task calls `asyncio.run()` to enter async context

Deprecated/outdated:

to_tsvector with stored Computed column: adds schema complexity; expression index is simpler and equally performant for this scale
Separate audit log middleware: D-14 decision is inline writes; middleware approach not used

Assumptions Log

#	Claim	Section	Risk if Wrong
A1	The streaming proxy's shared-document access check (Pattern 4) is implied by D-08 and SHARE-02	Pattern 4, Security Domain	If recipients cannot access the proxy, "Shared with me" is useless; planner must confirm access rule
A2	`session.flush()` (not `commit()`) inside `write_audit_log()` is the correct transactional pattern	Pattern 6	Using `commit()` would create two separate transactions; if primary operation fails after audit commit, audit entry is orphaned with no corresponding operation
A3	300ms debounce interval for search	Pattern 10	If too short, excessive API calls; if too long, UX feels sluggish. Standard convention, low risk
A4	`WITH RECURSIVE` CTE for subtree cascade-delete is the correct PostgreSQL approach; SQLite tests must use INTEGRATION=1	Pitfall 1, Common Pitfalls	SQLite does not support `WITH RECURSIVE` — all subtree cascade tests require PostgreSQL
A5	`MinIOBackend.put_object_raw()` method needs to be added for audit-logs bucket (different key scheme from documents)	Pattern 8	If not added, Celery export task cannot upload to `audit-logs` bucket using the correct key
A6	The `audit-logs` MinIO bucket creation is safe to gate on `MINIO_ENDPOINT` env var (same as migration 0003 pattern)	Pitfall 8	If MinIO is present but bucket creation is skipped, first export task fails

If this table is empty: Not applicable — some assumptions remain in this research.

Open Questions (RESOLVED)

Subtree folder delete — CTE vs. iterative in Python
- What we know: WITH RECURSIVE CTE works in PostgreSQL; SQLite does not support it
- What's unclear: Should the planner use the CTE (PostgreSQL-only, integration test) or iterative Python (works in SQLite, slower for deep trees)?
- Recommendation: Use CTE for the implementation (real database is PostgreSQL); mark the cascade-delete tests as pytest.mark.skipif(not live_services_available, ...) or use INTEGRATION=1
Shared document access in proxy — _can_access_document() adds a DB query per request
- What we know: Every GET /api/documents/{id}/content call currently does one DB lookup; adding share check adds a second
- What's unclear: Is the extra query acceptable given Phase 4 scale (single-user → small multi-user)?
- Recommendation: Add the share check unconditionally; at Phase 4 scale a second indexed query is negligible
PATCH /api/me/preferences endpoint path
- What we know: D-10 specifies this endpoint for pdf_open_mode
- What's unclear: Should it be on /api/auth/me/preferences (alongside /api/auth/me) or /api/me/preferences (new router)?
- Recommendation: Place on /api/auth/me/preferences (same router prefix as /api/auth/me) to keep auth-related user settings in the same module

Sources

Primary (HIGH confidence)

backend/db/models.py — all ORM models verified: Folder, Share, AuditLog, Document.folder_id, User (missing pdf_open_mode), CloudConnection (credentials_enc column confirmed)
backend/api/documents.py — ownership assertion pattern, get_regular_user dep, quota UPDATE pattern
backend/deps/auth.py — get_regular_user raises 403 for admin; get_current_admin pattern
backend/celery_app.py — existing beat_schedule structure; cleanup_abandoned_uploads as template
backend/storage/minio_backend.py — get_object() confirmed; asyncio.to_thread() pattern confirmed
backend/services/storage.py — list_metadata scoping by user_id confirmed

Secondary (MEDIUM confidence)

FastAPI discussions #7718 — Range header StreamingResponse pattern verified by multiple community confirmations
PostgreSQL docs — Full-text search tables — GIN expression index SQL verified from official docs
SQLAlchemy PostgreSQL dialect docs — func.to_tsvector, plainto_tsquery support confirmed
Celery periodic tasks docs — crontab(hour=0, minute=0) pattern
FastAPI response model docs — response_model=CloudConnectionOut exclusion pattern

Tertiary (LOW confidence)

Alembic issue #1390 — tsvector autogenerate bug confirmed by issue thread; marked for validation

Metadata

Confidence breakdown:

Standard stack: HIGH — all packages already in requirements.txt; codebase verified
Architecture: HIGH — all ORM models exist and were verified in codebase
Pitfalls: MEDIUM — Pitfalls 1, 4 are confirmed from codebase + security requirements; Pitfall 2 confirmed from Alembic issue tracker
Code patterns: MEDIUM — Range header pattern from FastAPI community discussion; PostgreSQL FTS from official docs

Research date: 2026-05-25 Valid until: 2026-06-25 (stable FastAPI/SQLAlchemy stack; no fast-moving dependencies)

59 KiB Raw Permalink Blame History Unescape Escape

Phase 4: Folders, Sharing, Quotas & Document UX — Research