Files
curo1305 747303246a docs(04): create phase 4 plan (9 plans, 7 waves)
Folders, Sharing, Quotas & Document UX — plans verified (0 blockers,
2 non-blocking warnings). Covers FOLD-01..05, SHARE-01..05, SEC-08/09,
ADMIN-06, DOC-01/02.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:20:16 +02:00

59 KiB
Raw Permalink Blame History

Phase 4: Folders, Sharing, Quotas & Document UX — Research

Researched: 2026-05-25 Domain: FastAPI folder/share CRUD, PostgreSQL tsvector, MinIO streaming proxy, Celery beat, Vue 3 folder navigation Confidence: HIGH


<user_constraints>

User Constraints (from CONTEXT.md)

Locked Decisions

D-01 Hybrid layout — AppSidebar shows top-level folders only. Sub-folders + breadcrumb in main content area. Top-level folders clickable directly in sidebar.

D-02 Unlimited nesting depth (no API or UI cap). Folder.parent_id self-referential FK is authoritative. Breadcrumbs truncate at depth > 4 (show first + "..." + last 2).

D-03 Non-empty folder delete: warning modal with document count. Confirm → cascade-delete all documents (MinIO + DB + quota). Cancel → no action. Documents are NOT moved to root — they are destroyed.

D-04 Exact handle input — no autocomplete. API returns 404 if handle not found; UI shows "User not found" error.

D-05 Share button on DocumentCard (inline icon button). Modal: (a) handle input, (b) Share button, (c) current recipients list with Revoke per row.

D-06 "Shared with me" is a fixed virtual folder entry in AppSidebar, rendered above the user's own folder list. Filtered by shares.recipient_id = current_user.id. Zero quota charged to recipient.

D-07 Share permission is view only for Phase 4. edit deferred.

D-08 Streaming proxy endpoint: GET /api/documents/{id}/content. Returns bytes via FastAPI StreamingResponse. Supports Range headers. Content-Disposition: inline. No presigned URL ever generated or exposed. Uses get_regular_user dep.

D-09 Native browser PDF rendering — no PDF.js. Zero frontend dependencies added. Content-Type header drives browser rendering.

D-10 users.pdf_open_mode column (String, default 'in_app'). Exposed via PATCH /api/me/preferences. in_app = modal with <iframe>; new_tab = window.open.

D-11 PostgreSQL tsvector GIN index on documents.extracted_text. plainto_tsquery('english', :q). Search scope: user's own docs only. GET /api/documents?q=<query>.

D-12 Inline search bar in document list view. Debounced, minimum 2 chars. Folder-context aware.

D-13 All 4 audit event categories logged: auth events, document operations, folder/share operations, admin actions.

D-14 Audit log written inline in handlers (no middleware, no Celery task). Shared helper write_audit_log(session, event_type, user_id, actor_id, resource_id, ip_address, metadata_dict).

D-15 Admin audit log viewer (/admin → AuditLog tab): paginated table, filters (date range, user, action type). No document content, filenames, or extracted text in any log entry.

D-16 Manual CSV and JSON export button in admin audit log view. Exports current filtered result set.

D-17 Daily automated CSV backup via Celery beat. Exports previous day's audit rows to MinIO audit-logs bucket (audit-logs/YYYY-MM-DD.csv). Bucket is private. Admins can download from admin panel. Rows NOT deleted from DB.

D-18 credentials_enc excluded from all serializers (SEC-08). Explicit Pydantic response model enforcement on all user-facing endpoints that touch CloudConnection.

D-19 Account deletion (admin-triggered) runs delete_user_files() before removing DB records (SEC-09). Cloud connection cleanup deferred to Phase 5.

Claude's Discretion

None specified — all significant decisions are locked.

Deferred Ideas (OUT OF SCOPE)

  • Per-day document scan quota tiers (future billing milestone)
  • Share permission levels beyond view (edit, comment deferred)
  • Presigned GET URLs for non-PDF document download
  • Audit log row archival/deletion policy (DB rows are not auto-deleted) </user_constraints>

<phase_requirements>

Phase Requirements

ID Description Research Support
FOLD-01 User can create, rename, and delete folders (delete confirms content count before proceeding) Folder model complete in db/models.py; cascade-delete pattern via SQLAlchemy + atomic quota decrement established in Phase 3
FOLD-02 User can move documents between folders Document.folder_id FK exists and is nullable; PATCH endpoint updates folder_id with ownership assertion on both doc and target folder
FOLD-03 Breadcrumb navigation renders current folder path; each segment clickable Recursive path query via WITH RECURSIVE CTE or iterative parent-chain walk; Vue computed property for breadcrumb array
FOLD-04 Document list supports sort by name, date uploaded, and file size SQLAlchemy order_by() with enum sort param; frontend sort query param
FOLD-05 Full-text search via PostgreSQL tsvector index on extracted text Expression-index approach to_tsvector('english', extracted_text) + GIN; plainto_tsquery in WHERE clause; Alembic migration adds index manually (not as Computed column to avoid autogenerate noise)
SHARE-01 User can share a document with another user by their unique handle Share model complete; User.handle unique lookup; exact match, no autocomplete
SHARE-02 Shared documents appear in "Shared with me" virtual folder; zero quota charged to recipient Filter shares.recipient_id = current_user.id; no quota row touched; virtual folder is a UI/API filter, not a DB folder
SHARE-03 Shared access view-only by default; owner controls permission level Share.permission defaults to 'view'; Phase 4 only implements view; permission field returned in list responses
SHARE-04 Owner can revoke share; revocation is immediate DELETE /api/shares/{share_id} with owner assertion; no async cleanup needed
SHARE-05 Documents shared with others display a "shared" indicator Document list response includes is_shared: bool derived from EXISTS subquery on shares.owner_id
SEC-08 credentials_enc excluded from all API serializers Explicit CloudConnectionOut Pydantic response model excluding credentials_enc
SEC-09 Account deletion triggers delete_user_files() per cloud connection before DB removal Implemented in admin delete-user endpoint; iterates documents, deletes MinIO objects, decrements quota
ADMIN-06 Admin can view audit log filtered by date range, user, and action type (metadata only) AuditLog model complete; paginated query with filters; admin endpoint uses get_current_admin
DOC-01 User can view document metadata and extracted text Existing GET /api/documents/{id} already returns extracted_text; Phase 4 ensures it is included in response
DOC-02 In-browser PDF preview via PDF.js proxy; no presigned URLs exposed GET /api/documents/{id}/content — StreamingResponse with Range header support, bytes from MinIO via asyncio.to_thread
</phase_requirements>

Summary

Phase 4 builds entirely on the existing ORM schema (all required models — Folder, Share, AuditLog, Document.folder_id, Share.recipient_id — are already in db/models.py). The work is predominantly plumbing: creating new API router modules for folders, shares, and audit log; adding a single Alembic migration for the users.pdf_open_mode column and the tsvector GIN expression index; extending the Celery beat schedule with a daily audit export task; and extending the Vue 3 frontend with folder navigation, sharing modals, and the settings preference toggle.

The highest-risk areas are: (1) the PDF streaming proxy with Range header support (needs careful byte-range parsing to handle PDF viewer seeking), (2) the tsvector GIN index and corresponding Alembic migration (Alembic autogenerate has known quirks with to_tsvector() expression indexes — the migration must be written manually), and (3) the folder cascade-delete with atomic quota decrement for multiple documents (each document's size_bytes must be summed and decremented in a single atomic UPDATE or via a pre-computed sum).

No new Python or npm packages are required for Phase 4. The decision to use native browser PDF rendering (D-09) eliminates the PDF.js dependency. All needed libraries — FastAPI, SQLAlchemy, MinIO SDK, Celery, Pydantic — are already pinned in requirements.txt. The frontend needs only Tailwind utility classes; no new npm packages.

Primary recommendation: Build in 6 plans: Wave 0 test scaffolds + migration; folders backend; shares backend + audit service; streaming proxy + search; audit log admin API + Celery export task; frontend folder navigation + sharing + settings.


Architectural Responsibility Map

Capability Primary Tier Secondary Tier Rationale
Folder CRUD + breadcrumb path API / Backend Database / Storage Ownership assertion and cascade-delete logic belongs in the API layer; path reconstruction is a DB query
Document move between folders API / Backend Database / Storage Ownership assertions on both document and target folder must be co-located in the API handler
Full-text search (tsvector) Database / Storage API / Backend GIN index and plainto_tsquery execute in PostgreSQL; API adds scope filter (user_id)
Share grant / revoke API / Backend Database / Storage IDOR prevention requires ownership assertion before DB write; share table is the single source of truth
"Shared with me" virtual folder API / Backend Pure query filter on shares.recipient_id; no DB folder row; quota enforcement is in the API layer
PDF streaming proxy API / Backend Database / Storage Bytes flow from MinIO through FastAPI to the browser; presigned URLs never reach the browser
Per-user PDF open preference API / Backend Database / Storage User preference column (users.pdf_open_mode) read/written via /api/me/preferences
Audit log write API / Backend write_audit_log() called inline in each handler after successful operation; no middleware
Audit log viewer + export API / Backend Database / Storage Admin-only paginated query with filters; CSV export is a StreamingResponse from a filtered DB query
Daily audit backup (Celery beat) API / Backend (worker) Database / Storage Celery beat task runs in the worker; queries DB and uploads CSV to MinIO audit-logs bucket
Folder navigation UI state Frontend Server (SSR) → Browser / Client Folder ID and breadcrumb path held in Vue ref() in HomeView.vue; no server-side state
Share modal + indicator Browser / Client Share button on DocumentCard; modal state is local component ref(); share list in documents store
PDF open preference toggle Browser / Client API / Backend Toggle in SettingsView.vue reads/writes preference via PATCH /api/me/preferences
credentials_enc exclusion (SEC-08) API / Backend Pydantic response model CloudConnectionOut excludes the field; never reaches serializer

Standard Stack

Core (all already in requirements.txt / package.json)

Library Version Purpose Why Standard
FastAPI 0.136+ API router for folders, shares, audit, proxy, preferences Already in use; APIRouter + StreamingResponse needed for proxy
SQLAlchemy 2.0 async 2.0+ ORM queries for all new endpoints Established pattern in project; async session already wired
Alembic 1.18+ Migration 0004 — users.pdf_open_mode column + tsvector GIN index Established migration toolchain
MinIO Python SDK 7.2+ get_object() for streaming proxy; put_object() for audit CSV upload Already in use; asyncio.to_thread() pattern established
Celery + Redis 5.5+ Daily audit export beat task Already in use; beat_schedule in celery_app.py
Pydantic v2 2.0+ CloudConnectionOut response model excluding credentials_enc (SEC-08) Already in use for all request/response models
Vue 3 (Options API) 3.4+ Folder nav, share modal, settings toggle Already in use; ref() + watch() patterns established

[VERIFIED: codebase — requirements.txt and package.json]

No New Packages Required

Phase 4 introduces zero new dependencies. Key rationale:

  • PDF viewer: D-09 chose native browser rendering — no PDF.js
  • Debounce: hand-rolled setTimeout/clearTimeout in Vue watch callback (< 10 lines, no lodash needed)
  • CSV export: Python stdlib csv.DictWriter with io.StringIO — no pandas
  • Search: PostgreSQL built-in plainto_tsquery — no Elasticsearch or external search engine

[VERIFIED: codebase — no new deps required]


Package Legitimacy Audit

No new packages are installed in Phase 4. All runtime dependencies are carried forward from Phases 13 with pinned versions in requirements.txt.

Packages removed due to slopcheck [SLOP] verdict: none Packages flagged as suspicious [SUS]: none New packages to install: none


Architecture Patterns

System Architecture Diagram

Browser
  │
  ├── GET /api/documents?q=search&folder_id=X&sort=name
  │     FastAPI → SQLAlchemy → PostgreSQL (tsvector @@ plainto_tsquery, scoped to user)
  │     ← JSON: [{id, filename, size_bytes, is_shared, folder_id, ...}]
  │
  ├── POST /api/folders / PATCH /api/folders/{id} / DELETE /api/folders/{id}
  │     FastAPI → ownership assertion → SQLAlchemy → PostgreSQL
  │     DELETE: sum(doc.size_bytes) → atomic quota decrement → MinIO delete_object per doc
  │
  ├── POST /api/shares / DELETE /api/shares/{id}
  │     FastAPI → ownership assertion (owner_id = current_user.id) → SQLAlchemy → PostgreSQL
  │     GET /api/shares/received → filter shares.recipient_id = current_user.id
  │
  ├── GET /api/documents/{id}/content   [Range: bytes=X-Y]
  │     FastAPI (get_regular_user) → ownership OR share-access check → SQLAlchemy (object_key)
  │     → MinIO get_object() via asyncio.to_thread() → StreamingResponse (206 or 200)
  │     ← bytes with Content-Type, Content-Disposition: inline, Accept-Ranges: bytes
  │     NOTE: no presigned URL generated; bytes flow through FastAPI
  │
  ├── PATCH /api/me/preferences
  │     FastAPI → SQLAlchemy UPDATE users.pdf_open_mode → ← {pdf_open_mode}
  │
  └── GET /api/admin/audit-log  [?start=&end=&user_id=&event_type=&page=]
        FastAPI (get_current_admin) → SQLAlchemy → PostgreSQL
        ← JSON: {items: [{id, event_type, user_id, actor_id, ip_address, metadata_, created_at}], total}

      GET /api/admin/audit-log/export?format=csv|json
        FastAPI (get_current_admin) → query → StreamingResponse (text/csv or application/json)

Celery Beat Worker (daily at midnight UTC)
  │
  └── audit_log_daily_export task
        → AsyncSessionLocal → SELECT audit_log WHERE date = yesterday
        → csv.DictWriter → io.BytesIO
        → MinIO put_object("audit-logs", "YYYY-MM-DD.csv", bytes)
backend/
├── api/
│   ├── folders.py          # FOLD-01, FOLD-02, FOLD-03, FOLD-04, FOLD-05
│   ├── shares.py           # SHARE-01..05
│   └── audit.py            # ADMIN-06 (admin viewer + export)
├── services/
│   └── audit.py            # write_audit_log() helper
├── tasks/
│   └── audit_tasks.py      # audit_log_daily_export Celery beat task
└── migrations/versions/
    └── 0004_phase4_pdf_open_mode_tsvector.py

frontend/src/
├── components/
│   ├── documents/
│   │   └── ShareModal.vue          # D-05
│   ├── layout/
│   │   └── BreadcrumbNav.vue       # FOLD-03
│   └── admin/
│       └── AdminAuditLogTab.vue    # ADMIN-06 (D-15, D-16)
├── stores/
│   └── folders.js                  # folder state + actions
└── views/
    └── SettingsView.vue            # extend with PDF preference toggle (D-10)

Pattern 1: Ownership Assertion (established in Phase 3 — D-16)

What: Every resource endpoint asserts resource.user_id == current_user.id. Cross-user access returns 404 (not 403) to prevent attacker enumeration of resource IDs.

When to use: All folder CRUD, document move, share grant/revoke (owner side), streaming proxy.

# Source: backend/api/documents.py (established pattern)
doc = await session.get(Document, uid)
if doc is None or doc.user_id != current_user.id:
    raise HTTPException(status_code=404, detail="Document not found")

Folder variant:

folder = await session.get(Folder, folder_id)
if folder is None or folder.user_id != current_user.id:
    raise HTTPException(status_code=404, detail="Folder not found")

[VERIFIED: codebase — backend/api/documents.py]

Pattern 2: Atomic Quota Decrement for Folder Cascade-Delete

What: When deleting a folder, collect all document size_bytes first, then issue a single atomic quota UPDATE. Never read-then-write in Python.

When to use: DELETE /api/folders/{id} when folder is non-empty.

# Source: established atomic pattern (CLAUDE.md + Phase 3 D-07)
# Step 1: collect document IDs and total size within the folder subtree
result = await session.execute(
    select(Document.id, Document.size_bytes, Document.object_key)
    .where(Document.folder_id == folder_id, Document.user_id == current_user.id)
)
docs = result.all()
total_bytes = sum(row.size_bytes for row in docs)

# Step 2: atomic quota decrement (CASE WHEN pattern — SQLite compatible)
await session.execute(
    text(
        "UPDATE quotas SET used_bytes = "
        "CASE WHEN used_bytes > :delta THEN used_bytes - :delta ELSE 0 END "
        "WHERE user_id = :uid"
    ),
    {"delta": total_bytes, "uid": str(current_user.id)},
)

# Step 3: delete MinIO objects (best-effort)
for row in docs:
    try:
        await get_storage_backend().delete_object(row.object_key)
    except Exception:
        pass  # MinIO cleanup is best-effort

# Step 4: delete documents (cascade via FK) then folder
await session.execute(
    delete(Document).where(Document.folder_id == folder_id)
)
await session.delete(folder)
await session.commit()

[ASSUMED: the CASE WHEN pattern for SQLite compat is established in STATE.md; verified in existing delete_document service function]

Pattern 3: PDF Streaming Proxy with Range Headers

What: GET /api/documents/{id}/content fetches bytes from MinIO and streams them with Range header support. Status 206 for range requests; 200 for full fetches.

When to use: DOC-02. Required for PDF viewer to seek pages without re-downloading the entire file.

# Source: FastAPI discussion #7718 (CITED: github.com/fastapi/fastapi/discussions/7718)
from fastapi import Request
from fastapi.responses import StreamingResponse

def _parse_range(range_header: str, file_size: int) -> tuple[int, int]:
    """Parse 'bytes=start-end' Range header. Raises 416 on invalid range."""
    try:
        h = range_header.replace("bytes=", "").split("-")
        start = int(h[0]) if h[0] != "" else 0
        end = int(h[1]) if h[1] != "" else file_size - 1
    except (ValueError, IndexError):
        raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
    if start > end or start < 0 or end >= file_size:
        raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
    return start, end

@router.get("/{doc_id}/content")
async def stream_document_content(
    doc_id: str,
    request: Request,
    session: AsyncSession = Depends(get_db),
    current_user: User = Depends(get_regular_user),
):
    # Ownership assertion (DOC-02 also allows shared-with-me access — see Pattern 4)
    doc = await session.get(Document, uid)
    if doc is None or doc.user_id != current_user.id:
        # Also allow if document is shared with current_user
        share = await _get_share_for_user(session, uid, current_user.id)
        if share is None:
            raise HTTPException(404, "Document not found")
        # Use the document from the share's owner context
        # (doc is already loaded; ownership check skipped for shared access)

    # Fetch all bytes from MinIO (asyncio.to_thread wraps sync SDK)
    file_bytes = await get_storage_backend().get_object(doc.object_key)
    file_size = len(file_bytes)

    range_header = request.headers.get("range")
    headers = {
        "content-type": doc.content_type,
        "content-disposition": f"inline; filename=\"{doc.filename}\"",
        "accept-ranges": "bytes",
        "content-length": str(file_size),
    }

    if range_header:
        start, end = _parse_range(range_header, file_size)
        chunk = file_bytes[start:end + 1]
        headers["content-range"] = f"bytes {start}-{end}/{file_size}"
        headers["content-length"] = str(len(chunk))
        return StreamingResponse(
            iter([chunk]),
            status_code=206,
            headers=headers,
        )

    return StreamingResponse(
        iter([file_bytes]),
        status_code=200,
        headers=headers,
    )

Critical: get_regular_user dep ensures admins cannot access document content (DOC-02 / CLAUDE.md). No presigned URL is generated at any point.

[CITED: https://github.com/fastapi/fastapi/discussions/7718] [CITED: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/206]

Pattern 4: Shared-Document Access for Streaming Proxy

What: The streaming proxy must allow both document owners AND recipients to access content.

When to use: Only in GET /api/documents/{id}/content — sharing grants read access to the content stream.

# Source: CONTEXT.md D-08 — recipient can preview shared document
async def _can_access_document(
    session: AsyncSession, doc: Document, current_user_id: uuid.UUID
) -> bool:
    """True if user owns the document OR has an active share as recipient."""
    if doc.user_id == current_user_id:
        return True
    result = await session.execute(
        select(Share).where(
            Share.document_id == doc.id,
            Share.recipient_id == current_user_id,
        )
    )
    return result.scalar_one_or_none() is not None

[ASSUMED: the share-based access for the streaming proxy is implied by D-08 and SHARE-02; not explicitly called out in a prior phase decision]

What: GIN expression index on documents.extracted_text for full-text search. Query uses plainto_tsquery (accepts natural language; no operator syntax required from user).

When to use: FOLD-05 — GET /api/documents?q=<query>.

Migration (manual — do NOT use Computed() to avoid Alembic autogenerate noise):

# Source: CITED: https://www.postgresql.org/docs/current/textsearch-tables.html
# In migration 0004:
op.execute(
    "CREATE INDEX ix_documents_fts ON documents "
    "USING GIN (to_tsvector('english', coalesce(extracted_text, '')))"
)

Query pattern:

# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func, text

stmt = (
    select(Document)
    .where(
        Document.user_id == current_user.id,
        func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
            func.plainto_tsquery("english", q)
        ),
    )
    .order_by(Document.created_at.desc())
)

Why plainto_tsquery not to_tsquery: plainto_tsquery accepts unstructured natural language ("invoice march 2024") without requiring the user to supply & and | operators. Simpler and safer for a search bar input.

Why expression index not generated/stored column: PostgreSQL generated columns with TSVECTOR type require PostgreSQL 12+ and add schema complexity. The expression index approach is simpler and Alembic-compatible (written as raw SQL in the migration; autogenerate will not re-detect it if managed manually).

Important Alembic caveat: include_schemas=True autogenerate WILL repeatedly flag the expression index for recreation (known Alembic issue: github.com/sqlalchemy/alembic/issues/1390). The migration MUST be written manually and committed once. Do not run alembic revision --autogenerate after adding the GIN index without reviewing the output.

[CITED: https://www.postgresql.org/docs/current/textsearch-tables.html] [CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html] [CITED: https://github.com/sqlalchemy/alembic/issues/1390]

Pattern 6: write_audit_log() Helper

What: Shared service function called inline in each handler after a successful operation. Does NOT raise on failure (audit log writing is best-effort — do not fail the primary operation).

When to use: Every handler in Phase 4 that performs state-changing operations; plus back-filling into auth.py and admin.py.

# Source: CONTEXT.md D-14
# backend/services/audit.py
from __future__ import annotations
import uuid
from typing import Optional
from sqlalchemy.ext.asyncio import AsyncSession
from db.models import AuditLog
import logging

logger = logging.getLogger(__name__)

async def write_audit_log(
    session: AsyncSession,
    event_type: str,
    user_id: Optional[uuid.UUID],
    actor_id: Optional[uuid.UUID],
    resource_id: Optional[uuid.UUID],
    ip_address: Optional[str],
    metadata_: Optional[dict] = None,
) -> None:
    """Write an audit log entry. Never raises — audit failure is non-fatal."""
    try:
        entry = AuditLog(
            event_type=event_type,
            user_id=user_id,
            actor_id=actor_id,
            resource_id=resource_id,
            ip_address=ip_address,
            metadata_=metadata_,
        )
        session.add(entry)
        await session.flush()  # flush within the handler's existing transaction
    except Exception as exc:
        logger.warning("audit log write failed: %s", exc)
        # Do not re-raise — audit failure must never abort the primary operation

Critical: session.flush() not session.commit() — audit log writes within the same transaction as the primary operation, so the primary operation's session.commit() commits the audit entry too. This avoids partial writes where the operation succeeds but the audit entry is orphaned.

[ASSUMED: flush-not-commit pattern is the safest approach given transactional semantics; not explicitly documented in a prior phase decision]

Pattern 7: CSV Export (admin endpoint)

What: GET /api/admin/audit-log/export?format=csv returns a StreamingResponse with text/csv content type and Content-Disposition: attachment.

# Source: Python stdlib csv module
import csv
import io
from fastapi.responses import StreamingResponse

async def export_audit_log(
    session: AsyncSession,
    start: Optional[datetime],
    end: Optional[datetime],
    user_id: Optional[uuid.UUID],
    event_type: Optional[str],
    format: str = "csv",
):
    rows = await _query_audit_log(session, start, end, user_id, event_type)
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[
        "id", "event_type", "user_id", "actor_id",
        "resource_id", "ip_address", "metadata_", "created_at"
    ])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "id": row.id,
            "event_type": row.event_type,
            "user_id": str(row.user_id) if row.user_id else "",
            "actor_id": str(row.actor_id) if row.actor_id else "",
            "resource_id": str(row.resource_id) if row.resource_id else "",
            "ip_address": str(row.ip_address) if row.ip_address else "",
            "metadata_": str(row.metadata_) if row.metadata_ else "",
            "created_at": row.created_at.isoformat(),
        })
    output.seek(0)
    return StreamingResponse(
        iter([output.getvalue()]),
        media_type="text/csv",
        headers={"Content-Disposition": f"attachment; filename=audit-export.csv"},
    )

[CITED: Python stdlib csv documentation — https://docs.python.org/3/library/csv.html]

Pattern 8: Celery Beat — Daily Audit Export Task

What: A new task audit_log_daily_export added to beat_schedule. Uses crontab(hour=0, minute=0) for midnight UTC.

# Source: CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
from celery.schedules import crontab

# In celery_app.py beat_schedule dict — ADD to existing schedule:
"audit-log-daily-export": {
    "task": "tasks.audit_tasks.audit_log_daily_export",
    "schedule": crontab(hour=0, minute=0),  # midnight UTC
},

Task implementation pattern (mirrors cleanup_abandoned_uploads):

@celery_app.task(name="tasks.audit_tasks.audit_log_daily_export")
def audit_log_daily_export() -> dict:
    return asyncio.run(_run_daily_export())

async def _run_daily_export() -> dict:
    from datetime import date, timedelta
    import csv, io
    from db.session import AsyncSessionLocal
    from db.models import AuditLog
    from storage import get_storage_backend
    from sqlalchemy import select

    yesterday = date.today() - timedelta(days=1)
    start = datetime(yesterday.year, yesterday.month, yesterday.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)

    async with AsyncSessionLocal() as session:
        result = await session.execute(
            select(AuditLog)
            .where(AuditLog.created_at >= start, AuditLog.created_at < end)
            .order_by(AuditLog.created_at)
        )
        rows = result.scalars().all()

    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[...])
    writer.writeheader()
    for row in rows:
        writer.writerow({...})

    csv_bytes = output.getvalue().encode("utf-8")
    key = f"audit-logs/{yesterday.isoformat()}.csv"
    backend = get_storage_backend()
    await backend.put_object_raw(
        bucket="audit-logs",
        key=key,
        data=io.BytesIO(csv_bytes),
        length=len(csv_bytes),
        content_type="text/csv",
    )
    return {"exported": len(rows), "key": key}

Note: MinIOBackend may need a put_object_raw(bucket, key, data, length, content_type) method (vs the existing put_object which constructs the key schema). The audit-logs bucket uses a different key scheme. Add this method in the migration plan.

[CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html]

Pattern 9: Pydantic Response Model Exclusion (SEC-08)

What: Explicit response_model=CloudConnectionOut where CloudConnectionOut omits credentials_enc. This is the safe-by-default Pydantic v2 approach.

# Source: CITED: https://fastapi.tiangolo.com/tutorial/response-model/
from pydantic import BaseModel
from datetime import datetime

class CloudConnectionOut(BaseModel):
    id: str
    provider: str
    display_name: str
    status: str
    connected_at: datetime
    # credentials_enc is deliberately absent — never serialized (SEC-08)

    model_config = {"from_attributes": True}

Phase 4 scope: No cloud connection endpoints exist yet (Phase 5). Phase 4's obligation is to ensure the CloudConnectionOut model is defined and used as response_model on any admin endpoint that touches cloud_connections. Since no cloud connection endpoints exist yet, Phase 4 creates the model in schemas/cloud.py (or inline in api/admin.py) so Phase 5 cannot accidentally expose the field.

[CITED: https://fastapi.tiangolo.com/tutorial/response-model/]

Pattern 10: Vue 3 Debounced Search (no external dependency)

What: Watch the search query ref with a manual debounce using setTimeout/clearTimeout. Does not require lodash or VueUse.

// Source: Vue 3 Composition API watch pattern (Options API compatible via watch option)
// frontend/src/stores/documents.js
import { ref, watch } from 'vue'

const searchQuery = ref('')
let _searchTimer = null

watch(searchQuery, (newVal) => {
  clearTimeout(_searchTimer)
  if (newVal.length < 2) {
    // Clear results or show all
    return
  }
  _searchTimer = setTimeout(() => {
    fetchDocuments({ q: newVal, folderId: currentFolderId.value })
  }, 300)
})

[ASSUMED: 300ms is conventional for search debounce; no official Vue doc specifies this value]

Anti-Patterns to Avoid

  • Using session.commit() inside write_audit_log(): Must use flush() to stay within the handler's transaction. A separate commit risks committing the audit entry without the primary operation.
  • Generating presigned URLs in the streaming proxy: D-08 explicitly prohibits this. Never call presigned_get_url() from the proxy endpoint. The bytes must flow through FastAPI.
  • to_tsquery instead of plainto_tsquery: to_tsquery requires the user to supply &, |, and ! operators. Search bar input uses plainto_tsquery which accepts natural language phrases.
  • Using response_model_exclude on CloudConnection: Prefer defining an explicit CloudConnectionOut model that never includes credentials_enc. response_model_exclude is fragile (requires remembering to pass the set on every endpoint). Whitelist is safer.
  • Storing breadcrumb state in the URL query string: Breadcrumb is derived from the current folder_id parameter — only the current folder ID needs to be in the URL (or in Vue state). The full path is reconstructed from the DB on navigation.
  • Recursive SQL for breadcrumb without depth limit: WITH RECURSIVE CTE is correct but should include a depth guard (WHERE depth < 50) to prevent infinite recursion on a corrupted self-referential FK.

Don't Hand-Roll

Problem Don't Build Use Instead Why
Range header parsing Custom byte-range logic from scratch Pattern 3 above (from FastAPI discussion #7718) Browser PDFs use RFC 7233 range requests; edge cases (open-ended ranges bytes=500-) must be handled
Full-text search ranking BM25 in Python PostgreSQL ts_rank() + plainto_tsquery Runs in the DB; GIN index makes it fast; no external service needed
CSV serialization String concatenation with escaping Python stdlib csv.DictWriter Handles quoting, escaping, newlines in field values automatically
Audit log middleware Celery task or FastAPI middleware for audit writes Inline write_audit_log() after successful operation Middleware fires on request start — cannot know if operation succeeded; Celery task adds latency and potential missed writes on worker failure
Tsvector maintenance Application-layer extracted_text → tsvector sync PostgreSQL GIN expression index to_tsvector('english', extracted_text) DB recalculates automatically on UPDATE; no trigger or application sync needed
Debounce utility npm package (lodash, VueUse) setTimeout/clearTimeout in watch < 10 lines; no new dependency justified

Key insight: PostgreSQL handles full-text search natively with GIN indexes at query time — no external search infrastructure (Elasticsearch, Typesense, Meilisearch) is needed or justified at this scale.


Common Pitfalls

Pitfall 1: Folder Cascade-Delete Misses Sub-folder Documents

What goes wrong: DELETE /api/folders/{id} deletes direct child documents but leaves orphaned documents in sub-folders. Quota becomes incorrect.

Why it happens: A naive WHERE folder_id = :id only catches immediate children. Self-referential FK with unlimited nesting requires a recursive query to collect the full subtree.

How to avoid: Use WITH RECURSIVE CTE to collect all folder IDs in the subtree before deleting:

WITH RECURSIVE subtree AS (
    SELECT id FROM folders WHERE id = :root_id AND user_id = :uid
    UNION ALL
    SELECT f.id FROM folders f
    JOIN subtree s ON f.parent_id = s.id
    WHERE f.user_id = :uid AND depth < 50
)
SELECT d.id, d.size_bytes, d.object_key
FROM documents d
JOIN subtree s ON d.folder_id = s.id
WHERE d.user_id = :uid

Warning signs: After deleting a folder tree, quotas.used_bytes does not match SUM(documents.size_bytes).

[ASSUMED: SQLite does not support WITH RECURSIVE for test runs — integration tests requiring this pattern must use PostgreSQL (INTEGRATION=1)]

Pitfall 2: Alembic Autogenerate Repeatedly Detects tsvector GIN Index

What goes wrong: After adding the GIN expression index in migration 0004, running alembic revision --autogenerate again generates a migration that drops and recreates the same index.

Why it happens: Alembic's autogenerate does not understand functional/expression indexes on PostgreSQL when the expression involves function calls like to_tsvector(). Known issue in Alembic.

How to avoid: Write the GIN index migration manually (as raw op.execute(CREATE INDEX...)) and do NOT add the index to the ORM model's __table_args__. Keep it only in the migration. Document with a comment # managed manually — do not autogenerate.

Warning signs: Repeated alembic revision --autogenerate output shows DROP INDEX / CREATE INDEX for ix_documents_fts on every run.

[CITED: https://github.com/sqlalchemy/alembic/issues/1390]

Pitfall 3: PDF Streaming Proxy — Admin Access

What goes wrong: Admin user calls GET /api/documents/{id}/content and receives document bytes.

Why it happens: Using get_current_user instead of get_regular_user on the proxy endpoint.

How to avoid: The proxy endpoint MUST use get_regular_user dep, which raises HTTP 403 for user.role == 'admin'. This is established in Phase 3 (D-16) and must be applied to the proxy endpoint.

Warning signs: Security test test_admin_cannot_access_document_content fails or is absent.

[VERIFIED: codebase — backend/deps/auth.py get_regular_user raises 403 for admin role]

Pitfall 4: Share IDOR — Recipient Accesses Another Recipient's Share

What goes wrong: User A shares document with User B. User C (also a recipient of a different share on the same document, or just any user) can enumerate share IDs and revoke or view shares they don't own.

Why it happens: DELETE /api/shares/{share_id} checks only that the share exists, not that share.owner_id == current_user.id.

How to avoid: Ownership assertion on all share mutation endpoints:

share = await session.get(Share, share_id)
if share is None or share.owner_id != current_user.id:
    raise HTTPException(404, "Share not found")

Warning signs: Security test test_share_revoke_wrong_owner_returns_404 is absent.

[ASSUMED: standard IDOR prevention pattern per CLAUDE.md; test must be written]

Pitfall 5: write_audit_log Does Not Include IP Address

What goes wrong: ip_address column is left NULL for all events because the handler does not extract it from the request.

Why it happens: The request.client.host extraction is not wired into write_audit_log() calls.

How to avoid: Inject request: Request into each handler and pass request.client.host as ip_address to write_audit_log(). For handlers behind a reverse proxy, use request.headers.get("X-Forwarded-For", request.client.host).

[ASSUMED: IP extraction from FastAPI Request is standard; X-Forwarded-For handling depends on deployment]

Pitfall 6: Folder Name Uniqueness Constraint Violation

What goes wrong: Creating two folders with the same name under the same parent raises an IntegrityError from the uq_folders_user_parent_name constraint, which surfaces as HTTP 500.

Why it happens: The Folder model has UniqueConstraint("user_id", "parent_id", "name"). The API handler does not check for duplicates before inserting.

How to avoid: Catch IntegrityError from the ORM insert and return HTTP 409 Conflict:

from sqlalchemy.exc import IntegrityError
try:
    session.add(folder)
    await session.commit()
except IntegrityError:
    await session.rollback()
    raise HTTPException(409, "A folder with that name already exists here")

Warning signs: POST /api/folders returns 500 on duplicate name.

[VERIFIED: codebase — db/models.py Folder UniqueConstraint("user_id", "parent_id", "name")]

Pitfall 7: "Shared with Me" Leaks Document Content

What goes wrong: The GET /api/documents list endpoint accidentally returns documents shared with the current user (not just owned by them), exposing filenames and extracted text to recipients.

Why it happens: A join on the shares table is added to the list query without scoping by ownership.

How to avoid: The standard list endpoint (GET /api/documents) MUST filter WHERE documents.user_id = current_user.id only. Shared documents are a separate endpoint GET /api/shares/received. Extracted text is not returned in the shared documents list view — only metadata.

[VERIFIED: codebase — services/storage.py list_metadata already scopes by user_id]

Pitfall 8: MinIO Audit-Logs Bucket Does Not Exist

What goes wrong: The daily Celery beat task fails with NoSuchBucket error on first run.

Why it happens: The audit-logs bucket is not created in the migration or application lifespan.

How to avoid: Create the audit-logs bucket in migration 0004's upgrade() function (same pattern as the main documents bucket, gated on MINIO_ENDPOINT env var to allow SQLite test runs):

# In migration 0004 post-DDL step
minio_endpoint = os.environ.get("MINIO_ENDPOINT")
if minio_endpoint:
    client = Minio(minio_endpoint, ...)
    if not client.bucket_exists("audit-logs"):
        client.make_bucket("audit-logs")

[VERIFIED: codebase — migration 0003 shows the MINIO_ENDPOINT-gated pattern; audit-logs bucket creation follows same approach]


Runtime State Inventory

Phase 4 is a feature addition (not a rename/refactor). No runtime state migration is required.

Category Items Found Action Required
Stored data No existing pdf_open_mode values; column is NEW (default 'in_app') Alembic migration 0004 adds column with server_default — no data migration
Stored data Existing audit_log table is EMPTY (no events written in prior phases) No migration; Phase 4 backfills audit writes into handlers going forward
Live service config audit-logs MinIO bucket does not exist Created in migration 0004 post-DDL step (gated on MINIO_ENDPOINT)
OS-registered state None — verified by codebase scan None
Secrets/env vars No new env vars required for Phase 4 features None
Build artifacts No stale egg-info or compiled artifacts from prior phases None

Environment Availability

Dependency Required By Available Version Fallback
PostgreSQL tsvector GIN index, CTE folder cascade [runtime — Docker Compose] 15+ SQLite for unit tests (no tsvector; integration tests require INTEGRATION=1)
MinIO Streaming proxy, audit CSV upload [runtime — Docker Compose] latest Mock in tests (established pattern)
Redis Celery beat daily export task [runtime — Docker Compose] latest Not mockable for beat tasks; integration tests skip beat scheduling
Python csv (stdlib) CSV export endpoint + Celery export task stdlib

Missing dependencies with no fallback:

  • PostgreSQL tsvector: cannot be tested without a real PostgreSQL instance. Tests that exercise FTS must be marked with pytest.mark.skipif(not live_services_available, ...) or use the INTEGRATION=1 pattern.

Missing dependencies with fallback:

  • SQLite in unit tests: most folder/share CRUD tests run fine on SQLite. Only FTS and WITH RECURSIVE CTE tests require PostgreSQL.

Validation Architecture

Test Framework

Property Value
Framework pytest + pytest-asyncio (already configured)
Config file pytest.ini or pyproject.toml (check existing)
Quick run command pytest backend/tests/test_folders.py backend/tests/test_shares.py -x
Full suite command cd backend && pytest -v

Phase Requirements → Test Map

Req ID Behavior Test Type Automated Command File Exists?
FOLD-01 Create/rename/delete folder Integration pytest backend/tests/test_folders.py::test_create_folder -x Wave 0
FOLD-01 Delete non-empty folder → warning count Integration pytest backend/tests/test_folders.py::test_delete_folder_content_count -x Wave 0
FOLD-02 Move document to folder Integration pytest backend/tests/test_folders.py::test_move_document -x Wave 0
FOLD-02 Move document — ownership assertion Integration (negative) pytest backend/tests/test_folders.py::test_move_wrong_owner_404 -x Wave 0
FOLD-03 Breadcrumb path returned in folder response Unit pytest backend/tests/test_folders.py::test_breadcrumb_path -x Wave 0
FOLD-04 Sort by name/date/size Integration pytest backend/tests/test_folders.py::test_document_sort -x Wave 0
FOLD-05 tsvector search returns matching docs Integration (PostgreSQL required) pytest backend/tests/test_folders.py::test_fts_search -x -m integration Wave 0
SHARE-01 Share by handle — handle not found → 404 Integration pytest backend/tests/test_shares.py::test_share_handle_not_found -x Wave 0
SHARE-01 Share by handle — success Integration pytest backend/tests/test_shares.py::test_share_success -x Wave 0
SHARE-02 Shared doc in recipient virtual folder Integration pytest backend/tests/test_shares.py::test_shared_with_me -x Wave 0
SHARE-02 Shared doc — zero quota charged to recipient Integration pytest backend/tests/test_shares.py::test_share_no_quota_impact -x Wave 0
SHARE-04 Revoke share — immediate for recipient Integration pytest backend/tests/test_shares.py::test_revoke_share -x Wave 0
SHARE-01..04 Share IDOR — wrong owner cannot revoke Security (negative) pytest backend/tests/test_shares.py::test_share_revoke_wrong_owner_404 -x Wave 0
SEC-08 credentials_enc absent from all responses Security (negative) pytest backend/tests/test_security.py::test_credentials_enc_not_in_response -x Wave 0
SEC-09 Account deletion deletes user files Integration pytest backend/tests/test_admin_api.py::test_delete_user_cleans_files -x Wave 0
DOC-02 PDF proxy returns bytes, 200 or 206 Integration pytest backend/tests/test_documents.py::test_content_stream_200 -x Wave 0
DOC-02 PDF proxy — no presigned URL in response Security (negative) pytest backend/tests/test_documents.py::test_content_stream_no_presigned_url -x Wave 0
DOC-02 PDF proxy — Range header → 206 Integration pytest backend/tests/test_documents.py::test_content_stream_206_range -x Wave 0
DOC-02 PDF proxy — admin blocked (403) Security (negative) pytest backend/tests/test_documents.py::test_content_stream_admin_403 -x Wave 0
ADMIN-06 Audit log viewer — no document content in entries Security (negative) pytest backend/tests/test_audit.py::test_audit_log_no_doc_content -x Wave 0
ADMIN-06 Audit log viewer — admin only Security (negative) pytest backend/tests/test_audit.py::test_audit_log_regular_user_403 -x Wave 0

Sampling Rate

  • Per task commit: pytest backend/tests/test_folders.py backend/tests/test_shares.py backend/tests/test_audit.py backend/tests/test_documents.py -x
  • Per wave merge: cd backend && pytest -v
  • Phase gate: Full suite green before /gsd:verify-work

Wave 0 Gaps

  • backend/tests/test_folders.py — covers FOLD-01..05
  • backend/tests/test_shares.py — covers SHARE-01..05 + IDOR security tests
  • backend/tests/test_audit.py — covers ADMIN-06 + no-doc-content security tests
  • backend/tests/test_documents.py — add proxy tests (test_content_stream_*) to existing file
  • backend/tests/test_security.py — add SEC-08, SEC-09 tests (may be in test_admin_api.py)

Security Domain

Applicable ASVS Categories

ASVS Category Applies Standard Control
V2 Authentication yes get_regular_user / get_current_admin deps on all new endpoints
V3 Session Management no No new session mechanisms introduced
V4 Access Control yes Ownership assertion on all folder/share endpoints; get_regular_user rejects admin on proxy
V5 Input Validation yes Pydantic models on all request bodies; doc_id parsed via uuid.UUID()
V6 Cryptography no No new cryptographic operations; credentials still encrypted via Phase 2 HKDF pattern

Known Threat Patterns for this Phase

Pattern STRIDE Standard Mitigation
Share IDOR — revoke another user's share Elevation of privilege share.owner_id == current_user.id → 404 on mismatch
Share IDOR — access shared doc content without share Information disclosure _can_access_document() check in proxy endpoint: owner OR active share
PDF proxy leaking presigned URL Information disclosure get_object() fetches bytes directly; presigned URL never generated in proxy handler
Admin accessing document content via proxy Broken access control get_regular_user dep raises 403 for admin role
Folder IDOR — delete another user's folder Tampering folder.user_id == current_user.id → 404 on mismatch
Audit log containing document content Sensitive data exposure write_audit_log() metadata_ MUST NOT include filename, extracted_text, or file bytes
Audit log admin access by regular user Broken access control GET /api/admin/audit-log uses get_current_admin dep
Path traversal via folder_id parameter Information disclosure All folder/document lookups via DB primary key; uuid.UUID() parse validates format
Tsvector search scope leak — user sees others' docs Information disclosure FTS query MUST include Document.user_id == current_user.id scope filter
credentials_enc in serialized admin response Sensitive data exposure CloudConnectionOut excludes field; never in any response

Security Gate Checklist (Phase 4 specific)

  • bandit -r backend/ — zero HIGH findings in new files (api/folders.py, api/shares.py, api/audit.py, services/audit.py, tasks/audit_tasks.py)
  • pip audit — zero new critical/high CVEs (no new packages)
  • npm audit --audit-level=high — zero (no new npm packages)
  • test_share_revoke_wrong_owner_404 passes
  • test_content_stream_admin_403 passes
  • test_content_stream_no_presigned_url passes
  • test_audit_log_no_doc_content passes — verifies filename, extracted_text absent from all audit entries
  • test_credentials_enc_not_in_response passes
  • test_fts_search_scoped_to_owner passes — other user's docs not in search results

Code Examples

Folder Breadcrumb Path Query (PostgreSQL CTE)

-- Source: CITED: https://www.postgresql.org/docs/current/queries-with.html
WITH RECURSIVE breadcrumb AS (
    SELECT id, parent_id, name, 1 AS depth
    FROM folders
    WHERE id = :folder_id AND user_id = :uid
    UNION ALL
    SELECT f.id, f.parent_id, f.name, b.depth + 1
    FROM folders f
    JOIN breadcrumb b ON f.id = b.parent_id
    WHERE f.user_id = :uid AND b.depth < 50
)
SELECT id, name, depth FROM breadcrumb ORDER BY depth DESC;

Returns path from root to current folder (root = highest depth value after ORDER BY depth DESC). Each segment is {id, name} for clickable navigation.

Full-text Search Query (SQLAlchemy)

# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func
from sqlalchemy.dialects.postgresql import TSVECTOR

stmt = (
    select(Document)
    .where(
        Document.user_id == current_user.id,
        func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
            func.plainto_tsquery("english", q)
        ),
    )
    .order_by(Document.created_at.desc())
    .limit(per_page)
    .offset((page - 1) * per_page)
)

"Shared with me" Virtual Folder Query

# Source: CONTEXT.md D-06
from sqlalchemy import select
from db.models import Share, Document

stmt = (
    select(Document)
    .join(Share, Share.document_id == Document.id)
    .where(Share.recipient_id == current_user.id)
    .order_by(Document.created_at.desc())
)
result = await session.execute(stmt)
shared_docs = result.scalars().all()

is_shared indicator on document list

# Source: CONTEXT.md SHARE-05 — owner's list shows "shared" badge
from sqlalchemy import exists, select

# Add to list_documents query:
shared_subq = (
    select(Share.document_id)
    .where(Share.owner_id == current_user.id)
    .scalar_subquery()
)
# Add to Document select:
stmt = select(Document, Document.id.in_(shared_subq).label("is_shared"))

State of the Art

Old Approach Current Approach When Changed Impact
PostgreSQL to_tsquery (operator syntax required) plainto_tsquery (natural language input) PostgreSQL 9.3+ Search bars should use plainto_tsquery; to_tsquery for programmatic use only
PDF.js for in-browser PDF viewing Native browser PDF viewer via Content-Type: application/pdf + Content-Disposition: inline Browser support ~2018+ All modern browsers support native PDF rendering; PDF.js only needed for custom annotation features
Alembic autogenerate for GIN expression indexes Manual migration SQL Alembic 1.13+ (known bug) Must write op.execute("CREATE INDEX ... USING GIN") manually
asyncio.run() in Celery tasks asyncio.run(_async_func()) is the standard bridge Established in Phase 3 document_tasks.py Celery sync task calls asyncio.run() to enter async context

Deprecated/outdated:

  • to_tsvector with stored Computed column: adds schema complexity; expression index is simpler and equally performant for this scale
  • Separate audit log middleware: D-14 decision is inline writes; middleware approach not used

Assumptions Log

# Claim Section Risk if Wrong
A1 The streaming proxy's shared-document access check (Pattern 4) is implied by D-08 and SHARE-02 Pattern 4, Security Domain If recipients cannot access the proxy, "Shared with me" is useless; planner must confirm access rule
A2 session.flush() (not commit()) inside write_audit_log() is the correct transactional pattern Pattern 6 Using commit() would create two separate transactions; if primary operation fails after audit commit, audit entry is orphaned with no corresponding operation
A3 300ms debounce interval for search Pattern 10 If too short, excessive API calls; if too long, UX feels sluggish. Standard convention, low risk
A4 WITH RECURSIVE CTE for subtree cascade-delete is the correct PostgreSQL approach; SQLite tests must use INTEGRATION=1 Pitfall 1, Common Pitfalls SQLite does not support WITH RECURSIVE — all subtree cascade tests require PostgreSQL
A5 MinIOBackend.put_object_raw() method needs to be added for audit-logs bucket (different key scheme from documents) Pattern 8 If not added, Celery export task cannot upload to audit-logs bucket using the correct key
A6 The audit-logs MinIO bucket creation is safe to gate on MINIO_ENDPOINT env var (same as migration 0003 pattern) Pitfall 8 If MinIO is present but bucket creation is skipped, first export task fails

If this table is empty: Not applicable — some assumptions remain in this research.


Open Questions (RESOLVED)

  1. Subtree folder delete — CTE vs. iterative in Python

    • What we know: WITH RECURSIVE CTE works in PostgreSQL; SQLite does not support it
    • What's unclear: Should the planner use the CTE (PostgreSQL-only, integration test) or iterative Python (works in SQLite, slower for deep trees)?
    • Recommendation: Use CTE for the implementation (real database is PostgreSQL); mark the cascade-delete tests as pytest.mark.skipif(not live_services_available, ...) or use INTEGRATION=1
  2. Shared document access in proxy — _can_access_document() adds a DB query per request

    • What we know: Every GET /api/documents/{id}/content call currently does one DB lookup; adding share check adds a second
    • What's unclear: Is the extra query acceptable given Phase 4 scale (single-user → small multi-user)?
    • Recommendation: Add the share check unconditionally; at Phase 4 scale a second indexed query is negligible
  3. PATCH /api/me/preferences endpoint path

    • What we know: D-10 specifies this endpoint for pdf_open_mode
    • What's unclear: Should it be on /api/auth/me/preferences (alongside /api/auth/me) or /api/me/preferences (new router)?
    • Recommendation: Place on /api/auth/me/preferences (same router prefix as /api/auth/me) to keep auth-related user settings in the same module

Sources

Primary (HIGH confidence)

  • backend/db/models.py — all ORM models verified: Folder, Share, AuditLog, Document.folder_id, User (missing pdf_open_mode), CloudConnection (credentials_enc column confirmed)
  • backend/api/documents.py — ownership assertion pattern, get_regular_user dep, quota UPDATE pattern
  • backend/deps/auth.pyget_regular_user raises 403 for admin; get_current_admin pattern
  • backend/celery_app.py — existing beat_schedule structure; cleanup_abandoned_uploads as template
  • backend/storage/minio_backend.pyget_object() confirmed; asyncio.to_thread() pattern confirmed
  • backend/services/storage.pylist_metadata scoping by user_id confirmed

Secondary (MEDIUM confidence)

Tertiary (LOW confidence)

  • Alembic issue #1390 — tsvector autogenerate bug confirmed by issue thread; marked for validation

Metadata

Confidence breakdown:

  • Standard stack: HIGH — all packages already in requirements.txt; codebase verified
  • Architecture: HIGH — all ORM models exist and were verified in codebase
  • Pitfalls: MEDIUM — Pitfalls 1, 4 are confirmed from codebase + security requirements; Pitfall 2 confirmed from Alembic issue tracker
  • Code patterns: MEDIUM — Range header pattern from FastAPI community discussion; PostgreSQL FTS from official docs

Research date: 2026-05-25 Valid until: 2026-06-25 (stable FastAPI/SQLAlchemy stack; no fast-moving dependencies)