kite/.planning/phases/04-folders-sharing-quotas-document-ux/04-RESEARCH.md

# Phase 4: Folders, Sharing, Quotas & Document UX — Research

**Researched:** 2026-05-25
**Domain:** FastAPI folder/share CRUD, PostgreSQL tsvector, MinIO streaming proxy, Celery beat, Vue 3 folder navigation
**Confidence:** HIGH

---

<user_constraints>
## User Constraints (from CONTEXT.md)

### Locked Decisions

**D-01** Hybrid layout — AppSidebar shows top-level folders only. Sub-folders + breadcrumb in main content area. Top-level folders clickable directly in sidebar.

**D-02** Unlimited nesting depth (no API or UI cap). `Folder.parent_id` self-referential FK is authoritative. Breadcrumbs truncate at depth > 4 (show first + "..." + last 2).

**D-03** Non-empty folder delete: warning modal with document count. Confirm → cascade-delete all documents (MinIO + DB + quota). Cancel → no action. Documents are NOT moved to root — they are destroyed.

**D-04** Exact handle input — no autocomplete. API returns 404 if handle not found; UI shows "User not found" error.

**D-05** Share button on DocumentCard (inline icon button). Modal: (a) handle input, (b) Share button, (c) current recipients list with Revoke per row.

**D-06** "Shared with me" is a fixed virtual folder entry in AppSidebar, rendered above the user's own folder list. Filtered by `shares.recipient_id = current_user.id`. Zero quota charged to recipient.

**D-07** Share permission is `view` only for Phase 4. `edit` deferred.

**D-08** Streaming proxy endpoint: `GET /api/documents/{id}/content`. Returns bytes via FastAPI `StreamingResponse`. Supports `Range` headers. `Content-Disposition: inline`. No presigned URL ever generated or exposed. Uses `get_regular_user` dep.

**D-09** Native browser PDF rendering — no PDF.js. Zero frontend dependencies added. Content-Type header drives browser rendering.

**D-10** `users.pdf_open_mode` column (String, default `'in_app'`). Exposed via `PATCH /api/me/preferences`. `in_app` = modal with `<iframe>`; `new_tab` = `window.open`.

**D-11** PostgreSQL tsvector GIN index on `documents.extracted_text`. `plainto_tsquery('english', :q)`. Search scope: user's own docs only. `GET /api/documents?q=<query>`.

**D-12** Inline search bar in document list view. Debounced, minimum 2 chars. Folder-context aware.

**D-13** All 4 audit event categories logged: auth events, document operations, folder/share operations, admin actions.

**D-14** Audit log written inline in handlers (no middleware, no Celery task). Shared helper `write_audit_log(session, event_type, user_id, actor_id, resource_id, ip_address, metadata_dict)`.

**D-15** Admin audit log viewer (`/admin` → AuditLog tab): paginated table, filters (date range, user, action type). No document content, filenames, or extracted text in any log entry.

**D-16** Manual CSV and JSON export button in admin audit log view. Exports current filtered result set.

**D-17** Daily automated CSV backup via Celery beat. Exports previous day's audit rows to MinIO `audit-logs` bucket (`audit-logs/YYYY-MM-DD.csv`). Bucket is private. Admins can download from admin panel. Rows NOT deleted from DB.

**D-18** `credentials_enc` excluded from all serializers (SEC-08). Explicit Pydantic response model enforcement on all user-facing endpoints that touch `CloudConnection`.

**D-19** Account deletion (admin-triggered) runs `delete_user_files()` before removing DB records (SEC-09). Cloud connection cleanup deferred to Phase 5.

### Claude's Discretion

None specified — all significant decisions are locked.

### Deferred Ideas (OUT OF SCOPE)

- Per-day document scan quota tiers (future billing milestone)
- Share permission levels beyond `view` (`edit`, `comment` deferred)
- Presigned GET URLs for non-PDF document download
- Audit log row archival/deletion policy (DB rows are not auto-deleted)
</user_constraints>

---

<phase_requirements>
## Phase Requirements

| ID | Description | Research Support |
|----|-------------|------------------|
| FOLD-01 | User can create, rename, and delete folders (delete confirms content count before proceeding) | `Folder` model complete in `db/models.py`; cascade-delete pattern via SQLAlchemy + atomic quota decrement established in Phase 3 |
| FOLD-02 | User can move documents between folders | `Document.folder_id` FK exists and is nullable; PATCH endpoint updates `folder_id` with ownership assertion on both doc and target folder |
| FOLD-03 | Breadcrumb navigation renders current folder path; each segment clickable | Recursive path query via `WITH RECURSIVE` CTE or iterative parent-chain walk; Vue computed property for breadcrumb array |
| FOLD-04 | Document list supports sort by name, date uploaded, and file size | SQLAlchemy `order_by()` with enum sort param; frontend `sort` query param |
| FOLD-05 | Full-text search via PostgreSQL tsvector index on extracted text | Expression-index approach `to_tsvector('english', extracted_text)` + GIN; `plainto_tsquery` in `WHERE` clause; Alembic migration adds index manually (not as Computed column to avoid autogenerate noise) |
| SHARE-01 | User can share a document with another user by their unique handle | `Share` model complete; `User.handle` unique lookup; exact match, no autocomplete |
| SHARE-02 | Shared documents appear in "Shared with me" virtual folder; zero quota charged to recipient | Filter `shares.recipient_id = current_user.id`; no quota row touched; virtual folder is a UI/API filter, not a DB folder |
| SHARE-03 | Shared access view-only by default; owner controls permission level | `Share.permission` defaults to `'view'`; Phase 4 only implements `view`; permission field returned in list responses |
| SHARE-04 | Owner can revoke share; revocation is immediate | `DELETE /api/shares/{share_id}` with owner assertion; no async cleanup needed |
| SHARE-05 | Documents shared with others display a "shared" indicator | Document list response includes `is_shared: bool` derived from EXISTS subquery on `shares.owner_id` |
| SEC-08 | `credentials_enc` excluded from all API serializers | Explicit `CloudConnectionOut` Pydantic response model excluding `credentials_enc` |
| SEC-09 | Account deletion triggers `delete_user_files()` per cloud connection before DB removal | Implemented in admin delete-user endpoint; iterates documents, deletes MinIO objects, decrements quota |
| ADMIN-06 | Admin can view audit log filtered by date range, user, and action type (metadata only) | `AuditLog` model complete; paginated query with filters; admin endpoint uses `get_current_admin` |
| DOC-01 | User can view document metadata and extracted text | Existing `GET /api/documents/{id}` already returns extracted_text; Phase 4 ensures it is included in response |
| DOC-02 | In-browser PDF preview via PDF.js proxy; no presigned URLs exposed | `GET /api/documents/{id}/content` — StreamingResponse with Range header support, bytes from MinIO via `asyncio.to_thread` |
</phase_requirements>

---

## Summary

Phase 4 builds entirely on the existing ORM schema (all required models — `Folder`, `Share`, `AuditLog`, `Document.folder_id`, `Share.recipient_id` — are already in `db/models.py`). The work is predominantly plumbing: creating new API router modules for folders, shares, and audit log; adding a single Alembic migration for the `users.pdf_open_mode` column and the tsvector GIN expression index; extending the Celery beat schedule with a daily audit export task; and extending the Vue 3 frontend with folder navigation, sharing modals, and the settings preference toggle.

The highest-risk areas are: (1) the PDF streaming proxy with Range header support (needs careful byte-range parsing to handle PDF viewer seeking), (2) the tsvector GIN index and corresponding Alembic migration (Alembic autogenerate has known quirks with `to_tsvector()` expression indexes — the migration must be written manually), and (3) the folder cascade-delete with atomic quota decrement for multiple documents (each document's size_bytes must be summed and decremented in a single atomic UPDATE or via a pre-computed sum).

No new Python or npm packages are required for Phase 4. The decision to use native browser PDF rendering (D-09) eliminates the PDF.js dependency. All needed libraries — FastAPI, SQLAlchemy, MinIO SDK, Celery, Pydantic — are already pinned in `requirements.txt`. The frontend needs only Tailwind utility classes; no new npm packages.

**Primary recommendation:** Build in 6 plans: Wave 0 test scaffolds + migration; folders backend; shares backend + audit service; streaming proxy + search; audit log admin API + Celery export task; frontend folder navigation + sharing + settings.

---

## Architectural Responsibility Map

| Capability | Primary Tier | Secondary Tier | Rationale |
|------------|-------------|----------------|-----------|
| Folder CRUD + breadcrumb path | API / Backend | Database / Storage | Ownership assertion and cascade-delete logic belongs in the API layer; path reconstruction is a DB query |
| Document move between folders | API / Backend | Database / Storage | Ownership assertions on both document and target folder must be co-located in the API handler |
| Full-text search (tsvector) | Database / Storage | API / Backend | GIN index and `plainto_tsquery` execute in PostgreSQL; API adds scope filter (`user_id`) |
| Share grant / revoke | API / Backend | Database / Storage | IDOR prevention requires ownership assertion before DB write; share table is the single source of truth |
| "Shared with me" virtual folder | API / Backend | — | Pure query filter on `shares.recipient_id`; no DB folder row; quota enforcement is in the API layer |
| PDF streaming proxy | API / Backend | Database / Storage | Bytes flow from MinIO through FastAPI to the browser; presigned URLs never reach the browser |
| Per-user PDF open preference | API / Backend | Database / Storage | User preference column (`users.pdf_open_mode`) read/written via `/api/me/preferences` |
| Audit log write | API / Backend | — | `write_audit_log()` called inline in each handler after successful operation; no middleware |
| Audit log viewer + export | API / Backend | Database / Storage | Admin-only paginated query with filters; CSV export is a StreamingResponse from a filtered DB query |
| Daily audit backup (Celery beat) | API / Backend (worker) | Database / Storage | Celery beat task runs in the worker; queries DB and uploads CSV to MinIO `audit-logs` bucket |
| Folder navigation UI state | Frontend Server (SSR) → Browser / Client | — | Folder ID and breadcrumb path held in Vue `ref()` in `HomeView.vue`; no server-side state |
| Share modal + indicator | Browser / Client | — | Share button on DocumentCard; modal state is local component `ref()`; share list in documents store |
| PDF open preference toggle | Browser / Client | API / Backend | Toggle in SettingsView.vue reads/writes preference via `PATCH /api/me/preferences` |
| `credentials_enc` exclusion (SEC-08) | API / Backend | — | Pydantic response model `CloudConnectionOut` excludes the field; never reaches serializer |

---

## Standard Stack

### Core (all already in requirements.txt / package.json)

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| FastAPI | 0.136+ | API router for folders, shares, audit, proxy, preferences | Already in use; `APIRouter` + `StreamingResponse` needed for proxy |
| SQLAlchemy 2.0 async | 2.0+ | ORM queries for all new endpoints | Established pattern in project; async session already wired |
| Alembic | 1.18+ | Migration 0004 — `users.pdf_open_mode` column + tsvector GIN index | Established migration toolchain |
| MinIO Python SDK | 7.2+ | `get_object()` for streaming proxy; `put_object()` for audit CSV upload | Already in use; `asyncio.to_thread()` pattern established |
| Celery + Redis | 5.5+ | Daily audit export beat task | Already in use; `beat_schedule` in `celery_app.py` |
| Pydantic v2 | 2.0+ | `CloudConnectionOut` response model excluding `credentials_enc` (SEC-08) | Already in use for all request/response models |
| Vue 3 (Options API) | 3.4+ | Folder nav, share modal, settings toggle | Already in use; `ref()` + `watch()` patterns established |

[VERIFIED: codebase — requirements.txt and package.json]

### No New Packages Required

Phase 4 introduces zero new dependencies. Key rationale:
- PDF viewer: D-09 chose native browser rendering — no PDF.js
- Debounce: hand-rolled `setTimeout`/`clearTimeout` in Vue `watch` callback (< 10 lines, no lodash needed)
- CSV export: Python stdlib `csv.DictWriter` with `io.StringIO` — no pandas
- Search: PostgreSQL built-in `plainto_tsquery` — no Elasticsearch or external search engine

[VERIFIED: codebase — no new deps required]

---

## Package Legitimacy Audit

No new packages are installed in Phase 4. All runtime dependencies are carried forward from Phases 1–3 with pinned versions in `requirements.txt`.

**Packages removed due to slopcheck [SLOP] verdict:** none
**Packages flagged as suspicious [SUS]:** none
**New packages to install:** none

---

## Architecture Patterns

### System Architecture Diagram

```
Browser
  │
  ├── GET /api/documents?q=search&folder_id=X&sort=name
  │     FastAPI → SQLAlchemy → PostgreSQL (tsvector @@ plainto_tsquery, scoped to user)
  │     ← JSON: [{id, filename, size_bytes, is_shared, folder_id, ...}]
  │
  ├── POST /api/folders / PATCH /api/folders/{id} / DELETE /api/folders/{id}
  │     FastAPI → ownership assertion → SQLAlchemy → PostgreSQL
  │     DELETE: sum(doc.size_bytes) → atomic quota decrement → MinIO delete_object per doc
  │
  ├── POST /api/shares / DELETE /api/shares/{id}
  │     FastAPI → ownership assertion (owner_id = current_user.id) → SQLAlchemy → PostgreSQL
  │     GET /api/shares/received → filter shares.recipient_id = current_user.id
  │
  ├── GET /api/documents/{id}/content   [Range: bytes=X-Y]
  │     FastAPI (get_regular_user) → ownership OR share-access check → SQLAlchemy (object_key)
  │     → MinIO get_object() via asyncio.to_thread() → StreamingResponse (206 or 200)
  │     ← bytes with Content-Type, Content-Disposition: inline, Accept-Ranges: bytes
  │     NOTE: no presigned URL generated; bytes flow through FastAPI
  │
  ├── PATCH /api/me/preferences
  │     FastAPI → SQLAlchemy UPDATE users.pdf_open_mode → ← {pdf_open_mode}
  │
  └── GET /api/admin/audit-log  [?start=&end=&user_id=&event_type=&page=]
        FastAPI (get_current_admin) → SQLAlchemy → PostgreSQL
        ← JSON: {items: [{id, event_type, user_id, actor_id, ip_address, metadata_, created_at}], total}

      GET /api/admin/audit-log/export?format=csv|json
        FastAPI (get_current_admin) → query → StreamingResponse (text/csv or application/json)

Celery Beat Worker (daily at midnight UTC)
  │
  └── audit_log_daily_export task
        → AsyncSessionLocal → SELECT audit_log WHERE date = yesterday
        → csv.DictWriter → io.BytesIO
        → MinIO put_object("audit-logs", "YYYY-MM-DD.csv", bytes)
```

### Recommended Project Structure (new files only)

```
backend/
├── api/
│   ├── folders.py          # FOLD-01, FOLD-02, FOLD-03, FOLD-04, FOLD-05
│   ├── shares.py           # SHARE-01..05
│   └── audit.py            # ADMIN-06 (admin viewer + export)
├── services/
│   └── audit.py            # write_audit_log() helper
├── tasks/
│   └── audit_tasks.py      # audit_log_daily_export Celery beat task
└── migrations/versions/
    └── 0004_phase4_pdf_open_mode_tsvector.py

frontend/src/
├── components/
│   ├── documents/
│   │   └── ShareModal.vue          # D-05
│   ├── layout/
│   │   └── BreadcrumbNav.vue       # FOLD-03
│   └── admin/
│       └── AdminAuditLogTab.vue    # ADMIN-06 (D-15, D-16)
├── stores/
│   └── folders.js                  # folder state + actions
└── views/
    └── SettingsView.vue            # extend with PDF preference toggle (D-10)
```

### Pattern 1: Ownership Assertion (established in Phase 3 — D-16)

**What:** Every resource endpoint asserts `resource.user_id == current_user.id`. Cross-user access returns 404 (not 403) to prevent attacker enumeration of resource IDs.

**When to use:** All folder CRUD, document move, share grant/revoke (owner side), streaming proxy.

```python
# Source: backend/api/documents.py (established pattern)
doc = await session.get(Document, uid)
if doc is None or doc.user_id != current_user.id:
    raise HTTPException(status_code=404, detail="Document not found")
```

**Folder variant:**
```python
folder = await session.get(Folder, folder_id)
if folder is None or folder.user_id != current_user.id:
    raise HTTPException(status_code=404, detail="Folder not found")
```

[VERIFIED: codebase — backend/api/documents.py]

### Pattern 2: Atomic Quota Decrement for Folder Cascade-Delete

**What:** When deleting a folder, collect all document `size_bytes` first, then issue a single atomic quota UPDATE. Never read-then-write in Python.

**When to use:** `DELETE /api/folders/{id}` when folder is non-empty.

```python
# Source: established atomic pattern (CLAUDE.md + Phase 3 D-07)
# Step 1: collect document IDs and total size within the folder subtree
result = await session.execute(
    select(Document.id, Document.size_bytes, Document.object_key)
    .where(Document.folder_id == folder_id, Document.user_id == current_user.id)
)
docs = result.all()
total_bytes = sum(row.size_bytes for row in docs)

# Step 2: atomic quota decrement (CASE WHEN pattern — SQLite compatible)
await session.execute(
    text(
        "UPDATE quotas SET used_bytes = "
        "CASE WHEN used_bytes > :delta THEN used_bytes - :delta ELSE 0 END "
        "WHERE user_id = :uid"
    ),
    {"delta": total_bytes, "uid": str(current_user.id)},
)

# Step 3: delete MinIO objects (best-effort)
for row in docs:
    try:
        await get_storage_backend().delete_object(row.object_key)
    except Exception:
        pass  # MinIO cleanup is best-effort

# Step 4: delete documents (cascade via FK) then folder
await session.execute(
    delete(Document).where(Document.folder_id == folder_id)
)
await session.delete(folder)
await session.commit()
```

[ASSUMED: the CASE WHEN pattern for SQLite compat is established in STATE.md; verified in existing delete_document service function]

### Pattern 3: PDF Streaming Proxy with Range Headers

**What:** `GET /api/documents/{id}/content` fetches bytes from MinIO and streams them with Range header support. Status 206 for range requests; 200 for full fetches.

**When to use:** DOC-02. Required for PDF viewer to seek pages without re-downloading the entire file.

```python
# Source: FastAPI discussion #7718 (CITED: github.com/fastapi/fastapi/discussions/7718)
from fastapi import Request
from fastapi.responses import StreamingResponse

def _parse_range(range_header: str, file_size: int) -> tuple[int, int]:
    """Parse 'bytes=start-end' Range header. Raises 416 on invalid range."""
    try:
        h = range_header.replace("bytes=", "").split("-")
        start = int(h[0]) if h[0] != "" else 0
        end = int(h[1]) if h[1] != "" else file_size - 1
    except (ValueError, IndexError):
        raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
    if start > end or start < 0 or end >= file_size:
        raise HTTPException(status.HTTP_416_REQUESTED_RANGE_NOT_SATISFIABLE)
    return start, end

@router.get("/{doc_id}/content")
async def stream_document_content(
    doc_id: str,
    request: Request,
    session: AsyncSession = Depends(get_db),
    current_user: User = Depends(get_regular_user),
):
    # Ownership assertion (DOC-02 also allows shared-with-me access — see Pattern 4)
    doc = await session.get(Document, uid)
    if doc is None or doc.user_id != current_user.id:
        # Also allow if document is shared with current_user
        share = await _get_share_for_user(session, uid, current_user.id)
        if share is None:
            raise HTTPException(404, "Document not found")
        # Use the document from the share's owner context
        # (doc is already loaded; ownership check skipped for shared access)

    # Fetch all bytes from MinIO (asyncio.to_thread wraps sync SDK)
    file_bytes = await get_storage_backend().get_object(doc.object_key)
    file_size = len(file_bytes)

    range_header = request.headers.get("range")
    headers = {
        "content-type": doc.content_type,
        "content-disposition": f"inline; filename=\"{doc.filename}\"",
        "accept-ranges": "bytes",
        "content-length": str(file_size),
    }

    if range_header:
        start, end = _parse_range(range_header, file_size)
        chunk = file_bytes[start:end + 1]
        headers["content-range"] = f"bytes {start}-{end}/{file_size}"
        headers["content-length"] = str(len(chunk))
        return StreamingResponse(
            iter([chunk]),
            status_code=206,
            headers=headers,
        )

    return StreamingResponse(
        iter([file_bytes]),
        status_code=200,
        headers=headers,
    )
```

**Critical:** `get_regular_user` dep ensures admins cannot access document content (DOC-02 / CLAUDE.md). No presigned URL is generated at any point.

[CITED: https://github.com/fastapi/fastapi/discussions/7718]
[CITED: https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/206]

### Pattern 4: Shared-Document Access for Streaming Proxy

**What:** The streaming proxy must allow both document owners AND recipients to access content.

**When to use:** Only in `GET /api/documents/{id}/content` — sharing grants read access to the content stream.

```python
# Source: CONTEXT.md D-08 — recipient can preview shared document
async def _can_access_document(
    session: AsyncSession, doc: Document, current_user_id: uuid.UUID
) -> bool:
    """True if user owns the document OR has an active share as recipient."""
    if doc.user_id == current_user_id:
        return True
    result = await session.execute(
        select(Share).where(
            Share.document_id == doc.id,
            Share.recipient_id == current_user_id,
        )
    )
    return result.scalar_one_or_none() is not None
```

[ASSUMED: the share-based access for the streaming proxy is implied by D-08 and SHARE-02; not explicitly called out in a prior phase decision]

### Pattern 5: PostgreSQL tsvector Full-Text Search

**What:** GIN expression index on `documents.extracted_text` for full-text search. Query uses `plainto_tsquery` (accepts natural language; no operator syntax required from user).

**When to use:** FOLD-05 — `GET /api/documents?q=<query>`.

**Migration (manual — do NOT use Computed() to avoid Alembic autogenerate noise):**
```python
# Source: CITED: https://www.postgresql.org/docs/current/textsearch-tables.html
# In migration 0004:
op.execute(
    "CREATE INDEX ix_documents_fts ON documents "
    "USING GIN (to_tsvector('english', coalesce(extracted_text, '')))"
)
```

**Query pattern:**
```python
# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func, text

stmt = (
    select(Document)
    .where(
        Document.user_id == current_user.id,
        func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
            func.plainto_tsquery("english", q)
        ),
    )
    .order_by(Document.created_at.desc())
)
```

**Why `plainto_tsquery` not `to_tsquery`:** `plainto_tsquery` accepts unstructured natural language ("invoice march 2024") without requiring the user to supply `&` and `|` operators. Simpler and safer for a search bar input.

**Why expression index not generated/stored column:** PostgreSQL generated columns with `TSVECTOR` type require PostgreSQL 12+ and add schema complexity. The expression index approach is simpler and Alembic-compatible (written as raw SQL in the migration; autogenerate will not re-detect it if managed manually).

**Important Alembic caveat:** `include_schemas=True` autogenerate WILL repeatedly flag the expression index for recreation (known Alembic issue: github.com/sqlalchemy/alembic/issues/1390). The migration MUST be written manually and committed once. Do not run `alembic revision --autogenerate` after adding the GIN index without reviewing the output.

[CITED: https://www.postgresql.org/docs/current/textsearch-tables.html]
[CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html]
[CITED: https://github.com/sqlalchemy/alembic/issues/1390]

### Pattern 6: write_audit_log() Helper

**What:** Shared service function called inline in each handler after a successful operation. Does NOT raise on failure (audit log writing is best-effort — do not fail the primary operation).

**When to use:** Every handler in Phase 4 that performs state-changing operations; plus back-filling into `auth.py` and `admin.py`.

```python
# Source: CONTEXT.md D-14
# backend/services/audit.py
from __future__ import annotations
import uuid
from typing import Optional
from sqlalchemy.ext.asyncio import AsyncSession
from db.models import AuditLog
import logging

logger = logging.getLogger(__name__)

async def write_audit_log(
    session: AsyncSession,
    event_type: str,
    user_id: Optional[uuid.UUID],
    actor_id: Optional[uuid.UUID],
    resource_id: Optional[uuid.UUID],
    ip_address: Optional[str],
    metadata_: Optional[dict] = None,
) -> None:
    """Write an audit log entry. Never raises — audit failure is non-fatal."""
    try:
        entry = AuditLog(
            event_type=event_type,
            user_id=user_id,
            actor_id=actor_id,
            resource_id=resource_id,
            ip_address=ip_address,
            metadata_=metadata_,
        )
        session.add(entry)
        await session.flush()  # flush within the handler's existing transaction
    except Exception as exc:
        logger.warning("audit log write failed: %s", exc)
        # Do not re-raise — audit failure must never abort the primary operation
```

**Critical: `session.flush()` not `session.commit()`** — audit log writes within the same transaction as the primary operation, so the primary operation's `session.commit()` commits the audit entry too. This avoids partial writes where the operation succeeds but the audit entry is orphaned.

[ASSUMED: flush-not-commit pattern is the safest approach given transactional semantics; not explicitly documented in a prior phase decision]

### Pattern 7: CSV Export (admin endpoint)

**What:** `GET /api/admin/audit-log/export?format=csv` returns a StreamingResponse with `text/csv` content type and `Content-Disposition: attachment`.

```python
# Source: Python stdlib csv module
import csv
import io
from fastapi.responses import StreamingResponse

async def export_audit_log(
    session: AsyncSession,
    start: Optional[datetime],
    end: Optional[datetime],
    user_id: Optional[uuid.UUID],
    event_type: Optional[str],
    format: str = "csv",
):
    rows = await _query_audit_log(session, start, end, user_id, event_type)
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[
        "id", "event_type", "user_id", "actor_id",
        "resource_id", "ip_address", "metadata_", "created_at"
    ])
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "id": row.id,
            "event_type": row.event_type,
            "user_id": str(row.user_id) if row.user_id else "",
            "actor_id": str(row.actor_id) if row.actor_id else "",
            "resource_id": str(row.resource_id) if row.resource_id else "",
            "ip_address": str(row.ip_address) if row.ip_address else "",
            "metadata_": str(row.metadata_) if row.metadata_ else "",
            "created_at": row.created_at.isoformat(),
        })
    output.seek(0)
    return StreamingResponse(
        iter([output.getvalue()]),
        media_type="text/csv",
        headers={"Content-Disposition": f"attachment; filename=audit-export.csv"},
    )
```

[CITED: Python stdlib csv documentation — https://docs.python.org/3/library/csv.html]

### Pattern 8: Celery Beat — Daily Audit Export Task

**What:** A new task `audit_log_daily_export` added to `beat_schedule`. Uses `crontab(hour=0, minute=0)` for midnight UTC.

```python
# Source: CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html
from celery.schedules import crontab

# In celery_app.py beat_schedule dict — ADD to existing schedule:
"audit-log-daily-export": {
    "task": "tasks.audit_tasks.audit_log_daily_export",
    "schedule": crontab(hour=0, minute=0),  # midnight UTC
},
```

**Task implementation pattern (mirrors cleanup_abandoned_uploads):**
```python
@celery_app.task(name="tasks.audit_tasks.audit_log_daily_export")
def audit_log_daily_export() -> dict:
    return asyncio.run(_run_daily_export())

async def _run_daily_export() -> dict:
    from datetime import date, timedelta
    import csv, io
    from db.session import AsyncSessionLocal
    from db.models import AuditLog
    from storage import get_storage_backend
    from sqlalchemy import select

    yesterday = date.today() - timedelta(days=1)
    start = datetime(yesterday.year, yesterday.month, yesterday.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)

    async with AsyncSessionLocal() as session:
        result = await session.execute(
            select(AuditLog)
            .where(AuditLog.created_at >= start, AuditLog.created_at < end)
            .order_by(AuditLog.created_at)
        )
        rows = result.scalars().all()

    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[...])
    writer.writeheader()
    for row in rows:
        writer.writerow({...})

    csv_bytes = output.getvalue().encode("utf-8")
    key = f"audit-logs/{yesterday.isoformat()}.csv"
    backend = get_storage_backend()
    await backend.put_object_raw(
        bucket="audit-logs",
        key=key,
        data=io.BytesIO(csv_bytes),
        length=len(csv_bytes),
        content_type="text/csv",
    )
    return {"exported": len(rows), "key": key}
```

**Note:** `MinIOBackend` may need a `put_object_raw(bucket, key, data, length, content_type)` method (vs the existing `put_object` which constructs the key schema). The audit-logs bucket uses a different key scheme. Add this method in the migration plan.

[CITED: https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html]

### Pattern 9: Pydantic Response Model Exclusion (SEC-08)

**What:** Explicit `response_model=CloudConnectionOut` where `CloudConnectionOut` omits `credentials_enc`. This is the safe-by-default Pydantic v2 approach.

```python
# Source: CITED: https://fastapi.tiangolo.com/tutorial/response-model/
from pydantic import BaseModel
from datetime import datetime

class CloudConnectionOut(BaseModel):
    id: str
    provider: str
    display_name: str
    status: str
    connected_at: datetime
    # credentials_enc is deliberately absent — never serialized (SEC-08)

    model_config = {"from_attributes": True}
```

**Phase 4 scope:** No cloud connection endpoints exist yet (Phase 5). Phase 4's obligation is to ensure the `CloudConnectionOut` model is defined and used as `response_model` on any admin endpoint that touches `cloud_connections`. Since no cloud connection endpoints exist yet, Phase 4 creates the model in `schemas/cloud.py` (or inline in `api/admin.py`) so Phase 5 cannot accidentally expose the field.

[CITED: https://fastapi.tiangolo.com/tutorial/response-model/]

### Pattern 10: Vue 3 Debounced Search (no external dependency)

**What:** Watch the search query ref with a manual debounce using `setTimeout`/`clearTimeout`. Does not require lodash or VueUse.

```javascript
// Source: Vue 3 Composition API watch pattern (Options API compatible via watch option)
// frontend/src/stores/documents.js
import { ref, watch } from 'vue'

const searchQuery = ref('')
let _searchTimer = null

watch(searchQuery, (newVal) => {
  clearTimeout(_searchTimer)
  if (newVal.length < 2) {
    // Clear results or show all
    return
  }
  _searchTimer = setTimeout(() => {
    fetchDocuments({ q: newVal, folderId: currentFolderId.value })
  }, 300)
})
```

[ASSUMED: 300ms is conventional for search debounce; no official Vue doc specifies this value]

### Anti-Patterns to Avoid

- **Using `session.commit()` inside `write_audit_log()`:** Must use `flush()` to stay within the handler's transaction. A separate commit risks committing the audit entry without the primary operation.
- **Generating presigned URLs in the streaming proxy:** D-08 explicitly prohibits this. Never call `presigned_get_url()` from the proxy endpoint. The bytes must flow through FastAPI.
- **`to_tsquery` instead of `plainto_tsquery`:** `to_tsquery` requires the user to supply `&`, `|`, and `!` operators. Search bar input uses `plainto_tsquery` which accepts natural language phrases.
- **Using `response_model_exclude` on CloudConnection:** Prefer defining an explicit `CloudConnectionOut` model that never includes `credentials_enc`. `response_model_exclude` is fragile (requires remembering to pass the set on every endpoint). Whitelist is safer.
- **Storing breadcrumb state in the URL query string:** Breadcrumb is derived from the current `folder_id` parameter — only the current folder ID needs to be in the URL (or in Vue state). The full path is reconstructed from the DB on navigation.
- **Recursive SQL for breadcrumb without depth limit:** `WITH RECURSIVE` CTE is correct but should include a depth guard (`WHERE depth < 50`) to prevent infinite recursion on a corrupted self-referential FK.

---

## Don't Hand-Roll

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Range header parsing | Custom byte-range logic from scratch | Pattern 3 above (from FastAPI discussion #7718) | Browser PDFs use RFC 7233 range requests; edge cases (open-ended ranges `bytes=500-`) must be handled |
| Full-text search ranking | BM25 in Python | PostgreSQL `ts_rank()` + `plainto_tsquery` | Runs in the DB; GIN index makes it fast; no external service needed |
| CSV serialization | String concatenation with escaping | Python stdlib `csv.DictWriter` | Handles quoting, escaping, newlines in field values automatically |
| Audit log middleware | Celery task or FastAPI middleware for audit writes | Inline `write_audit_log()` after successful operation | Middleware fires on request start — cannot know if operation succeeded; Celery task adds latency and potential missed writes on worker failure |
| Tsvector maintenance | Application-layer `extracted_text` → tsvector sync | PostgreSQL GIN expression index `to_tsvector('english', extracted_text)` | DB recalculates automatically on UPDATE; no trigger or application sync needed |
| Debounce utility | npm package (lodash, VueUse) | `setTimeout`/`clearTimeout` in watch | < 10 lines; no new dependency justified |

**Key insight:** PostgreSQL handles full-text search natively with GIN indexes at query time — no external search infrastructure (Elasticsearch, Typesense, Meilisearch) is needed or justified at this scale.

---

## Common Pitfalls

### Pitfall 1: Folder Cascade-Delete Misses Sub-folder Documents

**What goes wrong:** `DELETE /api/folders/{id}` deletes direct child documents but leaves orphaned documents in sub-folders. Quota becomes incorrect.

**Why it happens:** A naive `WHERE folder_id = :id` only catches immediate children. Self-referential FK with unlimited nesting requires a recursive query to collect the full subtree.

**How to avoid:** Use `WITH RECURSIVE` CTE to collect all folder IDs in the subtree before deleting:

```sql
WITH RECURSIVE subtree AS (
    SELECT id FROM folders WHERE id = :root_id AND user_id = :uid
    UNION ALL
    SELECT f.id FROM folders f
    JOIN subtree s ON f.parent_id = s.id
    WHERE f.user_id = :uid AND depth < 50
)
SELECT d.id, d.size_bytes, d.object_key
FROM documents d
JOIN subtree s ON d.folder_id = s.id
WHERE d.user_id = :uid
```

**Warning signs:** After deleting a folder tree, `quotas.used_bytes` does not match `SUM(documents.size_bytes)`.

[ASSUMED: SQLite does not support WITH RECURSIVE for test runs — integration tests requiring this pattern must use PostgreSQL (INTEGRATION=1)]

### Pitfall 2: Alembic Autogenerate Repeatedly Detects tsvector GIN Index

**What goes wrong:** After adding the GIN expression index in migration 0004, running `alembic revision --autogenerate` again generates a migration that drops and recreates the same index.

**Why it happens:** Alembic's autogenerate does not understand functional/expression indexes on PostgreSQL when the expression involves function calls like `to_tsvector()`. Known issue in Alembic.

**How to avoid:** Write the GIN index migration manually (as raw `op.execute(CREATE INDEX...)`) and do NOT add the index to the ORM model's `__table_args__`. Keep it only in the migration. Document with a comment `# managed manually — do not autogenerate`.

**Warning signs:** Repeated `alembic revision --autogenerate` output shows DROP INDEX / CREATE INDEX for `ix_documents_fts` on every run.

[CITED: https://github.com/sqlalchemy/alembic/issues/1390]

### Pitfall 3: PDF Streaming Proxy — Admin Access

**What goes wrong:** Admin user calls `GET /api/documents/{id}/content` and receives document bytes.

**Why it happens:** Using `get_current_user` instead of `get_regular_user` on the proxy endpoint.

**How to avoid:** The proxy endpoint MUST use `get_regular_user` dep, which raises HTTP 403 for `user.role == 'admin'`. This is established in Phase 3 (D-16) and must be applied to the proxy endpoint.

**Warning signs:** Security test `test_admin_cannot_access_document_content` fails or is absent.

[VERIFIED: codebase — backend/deps/auth.py `get_regular_user` raises 403 for admin role]

### Pitfall 4: Share IDOR — Recipient Accesses Another Recipient's Share

**What goes wrong:** User A shares document with User B. User C (also a recipient of a different share on the same document, or just any user) can enumerate share IDs and revoke or view shares they don't own.

**Why it happens:** `DELETE /api/shares/{share_id}` checks only that the share exists, not that `share.owner_id == current_user.id`.

**How to avoid:** Ownership assertion on all share mutation endpoints:
```python
share = await session.get(Share, share_id)
if share is None or share.owner_id != current_user.id:
    raise HTTPException(404, "Share not found")
```

**Warning signs:** Security test `test_share_revoke_wrong_owner_returns_404` is absent.

[ASSUMED: standard IDOR prevention pattern per CLAUDE.md; test must be written]

### Pitfall 5: write_audit_log Does Not Include IP Address

**What goes wrong:** `ip_address` column is left NULL for all events because the handler does not extract it from the request.

**Why it happens:** The `request.client.host` extraction is not wired into `write_audit_log()` calls.

**How to avoid:** Inject `request: Request` into each handler and pass `request.client.host` as `ip_address` to `write_audit_log()`. For handlers behind a reverse proxy, use `request.headers.get("X-Forwarded-For", request.client.host)`.

[ASSUMED: IP extraction from FastAPI Request is standard; X-Forwarded-For handling depends on deployment]

### Pitfall 6: Folder Name Uniqueness Constraint Violation

**What goes wrong:** Creating two folders with the same name under the same parent raises an IntegrityError from the `uq_folders_user_parent_name` constraint, which surfaces as HTTP 500.

**Why it happens:** The `Folder` model has `UniqueConstraint("user_id", "parent_id", "name")`. The API handler does not check for duplicates before inserting.

**How to avoid:** Catch `IntegrityError` from the ORM insert and return HTTP 409 Conflict:
```python
from sqlalchemy.exc import IntegrityError
try:
    session.add(folder)
    await session.commit()
except IntegrityError:
    await session.rollback()
    raise HTTPException(409, "A folder with that name already exists here")
```

**Warning signs:** POST /api/folders returns 500 on duplicate name.

[VERIFIED: codebase — db/models.py Folder `UniqueConstraint("user_id", "parent_id", "name")`]

### Pitfall 7: "Shared with Me" Leaks Document Content

**What goes wrong:** The `GET /api/documents` list endpoint accidentally returns documents shared with the current user (not just owned by them), exposing filenames and extracted text to recipients.

**Why it happens:** A join on the `shares` table is added to the list query without scoping by ownership.

**How to avoid:** The standard list endpoint (`GET /api/documents`) MUST filter `WHERE documents.user_id = current_user.id` only. Shared documents are a separate endpoint `GET /api/shares/received`. Extracted text is not returned in the shared documents list view — only metadata.

[VERIFIED: codebase — services/storage.py `list_metadata` already scopes by `user_id`]

### Pitfall 8: MinIO Audit-Logs Bucket Does Not Exist

**What goes wrong:** The daily Celery beat task fails with `NoSuchBucket` error on first run.

**Why it happens:** The `audit-logs` bucket is not created in the migration or application lifespan.

**How to avoid:** Create the `audit-logs` bucket in migration 0004's `upgrade()` function (same pattern as the main documents bucket, gated on `MINIO_ENDPOINT` env var to allow SQLite test runs):
```python
# In migration 0004 post-DDL step
minio_endpoint = os.environ.get("MINIO_ENDPOINT")
if minio_endpoint:
    client = Minio(minio_endpoint, ...)
    if not client.bucket_exists("audit-logs"):
        client.make_bucket("audit-logs")
```

[VERIFIED: codebase — migration 0003 shows the `MINIO_ENDPOINT`-gated pattern; audit-logs bucket creation follows same approach]

---

## Runtime State Inventory

> Phase 4 is a feature addition (not a rename/refactor). No runtime state migration is required.

| Category | Items Found | Action Required |
|----------|-------------|------------------|
| Stored data | No existing `pdf_open_mode` values; column is NEW (default `'in_app'`) | Alembic migration 0004 adds column with server_default — no data migration |
| Stored data | Existing `audit_log` table is EMPTY (no events written in prior phases) | No migration; Phase 4 backfills audit writes into handlers going forward |
| Live service config | `audit-logs` MinIO bucket does not exist | Created in migration 0004 post-DDL step (gated on MINIO_ENDPOINT) |
| OS-registered state | None — verified by codebase scan | None |
| Secrets/env vars | No new env vars required for Phase 4 features | None |
| Build artifacts | No stale egg-info or compiled artifacts from prior phases | None |

---

## Environment Availability

| Dependency | Required By | Available | Version | Fallback |
|------------|------------|-----------|---------|----------|
| PostgreSQL | tsvector GIN index, CTE folder cascade | [runtime — Docker Compose] | 15+ | SQLite for unit tests (no tsvector; integration tests require INTEGRATION=1) |
| MinIO | Streaming proxy, audit CSV upload | [runtime — Docker Compose] | latest | Mock in tests (established pattern) |
| Redis | Celery beat daily export task | [runtime — Docker Compose] | latest | Not mockable for beat tasks; integration tests skip beat scheduling |
| Python csv (stdlib) | CSV export endpoint + Celery export task | ✓ | stdlib | — |

**Missing dependencies with no fallback:**
- PostgreSQL tsvector: cannot be tested without a real PostgreSQL instance. Tests that exercise FTS must be marked with `pytest.mark.skipif(not live_services_available, ...)` or use the `INTEGRATION=1` pattern.

**Missing dependencies with fallback:**
- SQLite in unit tests: most folder/share CRUD tests run fine on SQLite. Only FTS and `WITH RECURSIVE` CTE tests require PostgreSQL.

---

## Validation Architecture

### Test Framework

| Property | Value |
|----------|-------|
| Framework | pytest + pytest-asyncio (already configured) |
| Config file | `pytest.ini` or `pyproject.toml` (check existing) |
| Quick run command | `pytest backend/tests/test_folders.py backend/tests/test_shares.py -x` |
| Full suite command | `cd backend && pytest -v` |

### Phase Requirements → Test Map

| Req ID | Behavior | Test Type | Automated Command | File Exists? |
|--------|----------|-----------|-------------------|-------------|
| FOLD-01 | Create/rename/delete folder | Integration | `pytest backend/tests/test_folders.py::test_create_folder -x` | ❌ Wave 0 |
| FOLD-01 | Delete non-empty folder → warning count | Integration | `pytest backend/tests/test_folders.py::test_delete_folder_content_count -x` | ❌ Wave 0 |
| FOLD-02 | Move document to folder | Integration | `pytest backend/tests/test_folders.py::test_move_document -x` | ❌ Wave 0 |
| FOLD-02 | Move document — ownership assertion | Integration (negative) | `pytest backend/tests/test_folders.py::test_move_wrong_owner_404 -x` | ❌ Wave 0 |
| FOLD-03 | Breadcrumb path returned in folder response | Unit | `pytest backend/tests/test_folders.py::test_breadcrumb_path -x` | ❌ Wave 0 |
| FOLD-04 | Sort by name/date/size | Integration | `pytest backend/tests/test_folders.py::test_document_sort -x` | ❌ Wave 0 |
| FOLD-05 | tsvector search returns matching docs | Integration (PostgreSQL required) | `pytest backend/tests/test_folders.py::test_fts_search -x -m integration` | ❌ Wave 0 |
| SHARE-01 | Share by handle — handle not found → 404 | Integration | `pytest backend/tests/test_shares.py::test_share_handle_not_found -x` | ❌ Wave 0 |
| SHARE-01 | Share by handle — success | Integration | `pytest backend/tests/test_shares.py::test_share_success -x` | ❌ Wave 0 |
| SHARE-02 | Shared doc in recipient virtual folder | Integration | `pytest backend/tests/test_shares.py::test_shared_with_me -x` | ❌ Wave 0 |
| SHARE-02 | Shared doc — zero quota charged to recipient | Integration | `pytest backend/tests/test_shares.py::test_share_no_quota_impact -x` | ❌ Wave 0 |
| SHARE-04 | Revoke share — immediate for recipient | Integration | `pytest backend/tests/test_shares.py::test_revoke_share -x` | ❌ Wave 0 |
| SHARE-01..04 | Share IDOR — wrong owner cannot revoke | Security (negative) | `pytest backend/tests/test_shares.py::test_share_revoke_wrong_owner_404 -x` | ❌ Wave 0 |
| SEC-08 | credentials_enc absent from all responses | Security (negative) | `pytest backend/tests/test_security.py::test_credentials_enc_not_in_response -x` | ❌ Wave 0 |
| SEC-09 | Account deletion deletes user files | Integration | `pytest backend/tests/test_admin_api.py::test_delete_user_cleans_files -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy returns bytes, 200 or 206 | Integration | `pytest backend/tests/test_documents.py::test_content_stream_200 -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — no presigned URL in response | Security (negative) | `pytest backend/tests/test_documents.py::test_content_stream_no_presigned_url -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — Range header → 206 | Integration | `pytest backend/tests/test_documents.py::test_content_stream_206_range -x` | ❌ Wave 0 |
| DOC-02 | PDF proxy — admin blocked (403) | Security (negative) | `pytest backend/tests/test_documents.py::test_content_stream_admin_403 -x` | ❌ Wave 0 |
| ADMIN-06 | Audit log viewer — no document content in entries | Security (negative) | `pytest backend/tests/test_audit.py::test_audit_log_no_doc_content -x` | ❌ Wave 0 |
| ADMIN-06 | Audit log viewer — admin only | Security (negative) | `pytest backend/tests/test_audit.py::test_audit_log_regular_user_403 -x` | ❌ Wave 0 |

### Sampling Rate

- **Per task commit:** `pytest backend/tests/test_folders.py backend/tests/test_shares.py backend/tests/test_audit.py backend/tests/test_documents.py -x`
- **Per wave merge:** `cd backend && pytest -v`
- **Phase gate:** Full suite green before `/gsd:verify-work`

### Wave 0 Gaps

- [ ] `backend/tests/test_folders.py` — covers FOLD-01..05
- [ ] `backend/tests/test_shares.py` — covers SHARE-01..05 + IDOR security tests
- [ ] `backend/tests/test_audit.py` — covers ADMIN-06 + no-doc-content security tests
- [ ] `backend/tests/test_documents.py` — add proxy tests (test_content_stream_*) to existing file
- [ ] `backend/tests/test_security.py` — add SEC-08, SEC-09 tests (may be in test_admin_api.py)

---

## Security Domain

### Applicable ASVS Categories

| ASVS Category | Applies | Standard Control |
|---------------|---------|-----------------|
| V2 Authentication | yes | `get_regular_user` / `get_current_admin` deps on all new endpoints |
| V3 Session Management | no | No new session mechanisms introduced |
| V4 Access Control | yes | Ownership assertion on all folder/share endpoints; `get_regular_user` rejects admin on proxy |
| V5 Input Validation | yes | Pydantic models on all request bodies; `doc_id` parsed via `uuid.UUID()` |
| V6 Cryptography | no | No new cryptographic operations; credentials still encrypted via Phase 2 HKDF pattern |

### Known Threat Patterns for this Phase

| Pattern | STRIDE | Standard Mitigation |
|---------|--------|---------------------|
| Share IDOR — revoke another user's share | Elevation of privilege | `share.owner_id == current_user.id` → 404 on mismatch |
| Share IDOR — access shared doc content without share | Information disclosure | `_can_access_document()` check in proxy endpoint: owner OR active share |
| PDF proxy leaking presigned URL | Information disclosure | `get_object()` fetches bytes directly; presigned URL never generated in proxy handler |
| Admin accessing document content via proxy | Broken access control | `get_regular_user` dep raises 403 for admin role |
| Folder IDOR — delete another user's folder | Tampering | `folder.user_id == current_user.id` → 404 on mismatch |
| Audit log containing document content | Sensitive data exposure | `write_audit_log()` metadata_ MUST NOT include `filename`, `extracted_text`, or file bytes |
| Audit log admin access by regular user | Broken access control | `GET /api/admin/audit-log` uses `get_current_admin` dep |
| Path traversal via `folder_id` parameter | Information disclosure | All folder/document lookups via DB primary key; `uuid.UUID()` parse validates format |
| Tsvector search scope leak — user sees others' docs | Information disclosure | FTS query MUST include `Document.user_id == current_user.id` scope filter |
| credentials_enc in serialized admin response | Sensitive data exposure | `CloudConnectionOut` excludes field; never in any response |

### Security Gate Checklist (Phase 4 specific)

- [ ] `bandit -r backend/` — zero HIGH findings in new files (`api/folders.py`, `api/shares.py`, `api/audit.py`, `services/audit.py`, `tasks/audit_tasks.py`)
- [ ] `pip audit` — zero new critical/high CVEs (no new packages)
- [ ] `npm audit --audit-level=high` — zero (no new npm packages)
- [ ] `test_share_revoke_wrong_owner_404` passes
- [ ] `test_content_stream_admin_403` passes
- [ ] `test_content_stream_no_presigned_url` passes
- [ ] `test_audit_log_no_doc_content` passes — verifies `filename`, `extracted_text` absent from all audit entries
- [ ] `test_credentials_enc_not_in_response` passes
- [ ] `test_fts_search_scoped_to_owner` passes — other user's docs not in search results

---

## Code Examples

### Folder Breadcrumb Path Query (PostgreSQL CTE)

```sql
-- Source: CITED: https://www.postgresql.org/docs/current/queries-with.html
WITH RECURSIVE breadcrumb AS (
    SELECT id, parent_id, name, 1 AS depth
    FROM folders
    WHERE id = :folder_id AND user_id = :uid
    UNION ALL
    SELECT f.id, f.parent_id, f.name, b.depth + 1
    FROM folders f
    JOIN breadcrumb b ON f.id = b.parent_id
    WHERE f.user_id = :uid AND b.depth < 50
)
SELECT id, name, depth FROM breadcrumb ORDER BY depth DESC;
```

Returns path from root to current folder (root = highest depth value after `ORDER BY depth DESC`). Each segment is `{id, name}` for clickable navigation.

### Full-text Search Query (SQLAlchemy)

```python
# Source: CITED: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html
from sqlalchemy import func
from sqlalchemy.dialects.postgresql import TSVECTOR

stmt = (
    select(Document)
    .where(
        Document.user_id == current_user.id,
        func.to_tsvector("english", func.coalesce(Document.extracted_text, "")).op("@@")(
            func.plainto_tsquery("english", q)
        ),
    )
    .order_by(Document.created_at.desc())
    .limit(per_page)
    .offset((page - 1) * per_page)
)
```

### "Shared with me" Virtual Folder Query

```python
# Source: CONTEXT.md D-06
from sqlalchemy import select
from db.models import Share, Document

stmt = (
    select(Document)
    .join(Share, Share.document_id == Document.id)
    .where(Share.recipient_id == current_user.id)
    .order_by(Document.created_at.desc())
)
result = await session.execute(stmt)
shared_docs = result.scalars().all()
```

### `is_shared` indicator on document list

```python
# Source: CONTEXT.md SHARE-05 — owner's list shows "shared" badge
from sqlalchemy import exists, select

# Add to list_documents query:
shared_subq = (
    select(Share.document_id)
    .where(Share.owner_id == current_user.id)
    .scalar_subquery()
)
# Add to Document select:
stmt = select(Document, Document.id.in_(shared_subq).label("is_shared"))
```

---

## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| PostgreSQL `to_tsquery` (operator syntax required) | `plainto_tsquery` (natural language input) | PostgreSQL 9.3+ | Search bars should use `plainto_tsquery`; `to_tsquery` for programmatic use only |
| PDF.js for in-browser PDF viewing | Native browser PDF viewer via `Content-Type: application/pdf` + `Content-Disposition: inline` | Browser support ~2018+ | All modern browsers support native PDF rendering; PDF.js only needed for custom annotation features |
| Alembic autogenerate for GIN expression indexes | Manual migration SQL | Alembic 1.13+ (known bug) | Must write `op.execute("CREATE INDEX ... USING GIN")` manually |
| `asyncio.run()` in Celery tasks | `asyncio.run(_async_func())` is the standard bridge | Established in Phase 3 document_tasks.py | Celery sync task calls `asyncio.run()` to enter async context |

**Deprecated/outdated:**
- `to_tsvector` with stored Computed column: adds schema complexity; expression index is simpler and equally performant for this scale
- Separate audit log middleware: D-14 decision is inline writes; middleware approach not used

---

## Assumptions Log

| # | Claim | Section | Risk if Wrong |
|---|-------|---------|---------------|
| A1 | The streaming proxy's shared-document access check (Pattern 4) is implied by D-08 and SHARE-02 | Pattern 4, Security Domain | If recipients cannot access the proxy, "Shared with me" is useless; planner must confirm access rule |
| A2 | `session.flush()` (not `commit()`) inside `write_audit_log()` is the correct transactional pattern | Pattern 6 | Using `commit()` would create two separate transactions; if primary operation fails after audit commit, audit entry is orphaned with no corresponding operation |
| A3 | 300ms debounce interval for search | Pattern 10 | If too short, excessive API calls; if too long, UX feels sluggish. Standard convention, low risk |
| A4 | `WITH RECURSIVE` CTE for subtree cascade-delete is the correct PostgreSQL approach; SQLite tests must use INTEGRATION=1 | Pitfall 1, Common Pitfalls | SQLite does not support `WITH RECURSIVE` — all subtree cascade tests require PostgreSQL |
| A5 | `MinIOBackend.put_object_raw()` method needs to be added for audit-logs bucket (different key scheme from documents) | Pattern 8 | If not added, Celery export task cannot upload to `audit-logs` bucket using the correct key |
| A6 | The `audit-logs` MinIO bucket creation is safe to gate on `MINIO_ENDPOINT` env var (same as migration 0003 pattern) | Pitfall 8 | If MinIO is present but bucket creation is skipped, first export task fails |

**If this table is empty:** Not applicable — some assumptions remain in this research.

---

## Open Questions (RESOLVED)

1. **Subtree folder delete — CTE vs. iterative in Python**
   - What we know: `WITH RECURSIVE` CTE works in PostgreSQL; SQLite does not support it
   - What's unclear: Should the planner use the CTE (PostgreSQL-only, integration test) or iterative Python (works in SQLite, slower for deep trees)?
   - Recommendation: Use CTE for the implementation (real database is PostgreSQL); mark the cascade-delete tests as `pytest.mark.skipif(not live_services_available, ...)` or use `INTEGRATION=1`

2. **Shared document access in proxy — `_can_access_document()` adds a DB query per request**
   - What we know: Every `GET /api/documents/{id}/content` call currently does one DB lookup; adding share check adds a second
   - What's unclear: Is the extra query acceptable given Phase 4 scale (single-user → small multi-user)?
   - Recommendation: Add the share check unconditionally; at Phase 4 scale a second indexed query is negligible

3. **`PATCH /api/me/preferences` endpoint path**
   - What we know: D-10 specifies this endpoint for pdf_open_mode
   - What's unclear: Should it be on `/api/auth/me/preferences` (alongside `/api/auth/me`) or `/api/me/preferences` (new router)?
   - Recommendation: Place on `/api/auth/me/preferences` (same router prefix as `/api/auth/me`) to keep auth-related user settings in the same module

---

## Sources

### Primary (HIGH confidence)
- `backend/db/models.py` — all ORM models verified: `Folder`, `Share`, `AuditLog`, `Document.folder_id`, `User` (missing `pdf_open_mode`), `CloudConnection` (`credentials_enc` column confirmed)
- `backend/api/documents.py` — ownership assertion pattern, `get_regular_user` dep, quota UPDATE pattern
- `backend/deps/auth.py` — `get_regular_user` raises 403 for admin; `get_current_admin` pattern
- `backend/celery_app.py` — existing `beat_schedule` structure; `cleanup_abandoned_uploads` as template
- `backend/storage/minio_backend.py` — `get_object()` confirmed; `asyncio.to_thread()` pattern confirmed
- `backend/services/storage.py` — `list_metadata` scoping by `user_id` confirmed

### Secondary (MEDIUM confidence)
- [FastAPI discussions #7718](https://github.com/fastapi/fastapi/discussions/7718) — Range header StreamingResponse pattern verified by multiple community confirmations
- [PostgreSQL docs — Full-text search tables](https://www.postgresql.org/docs/current/textsearch-tables.html) — GIN expression index SQL verified from official docs
- [SQLAlchemy PostgreSQL dialect docs](https://docs.sqlalchemy.org/en/20/dialects/postgresql.html) — `func.to_tsvector`, `plainto_tsquery` support confirmed
- [Celery periodic tasks docs](https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html) — `crontab(hour=0, minute=0)` pattern
- [FastAPI response model docs](https://fastapi.tiangolo.com/tutorial/response-model/) — `response_model=CloudConnectionOut` exclusion pattern

### Tertiary (LOW confidence)
- [Alembic issue #1390](https://github.com/sqlalchemy/alembic/issues/1390) — tsvector autogenerate bug confirmed by issue thread; marked for validation

---

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH — all packages already in requirements.txt; codebase verified
- Architecture: HIGH — all ORM models exist and were verified in codebase
- Pitfalls: MEDIUM — Pitfalls 1, 4 are confirmed from codebase + security requirements; Pitfall 2 confirmed from Alembic issue tracker
- Code patterns: MEDIUM — Range header pattern from FastAPI community discussion; PostgreSQL FTS from official docs

**Research date:** 2026-05-25
**Valid until:** 2026-06-25 (stable FastAPI/SQLAlchemy stack; no fast-moving dependencies)