chore: merge executor worktree (05-09 cloud doc access)

This commit is contained in:
curo1305
2026-05-30 11:20:41 +02:00
7 changed files with 491 additions and 11 deletions
@@ -0,0 +1,119 @@
---
phase: 05-cloud-storage-backends
plan: "09"
subsystem: cloud-documents
tags: [cloud, documents, patch, celery, frontend, blob-url, authentication]
dependency_graph:
requires: [05-06]
provides: [PATCH /api/documents/{id}, cloud-aware re-analyze, authenticated-preview]
affects: [backend/api/documents.py, backend/tasks/document_tasks.py, frontend/src/api/client.js, frontend/src/components/documents/DocumentPreviewModal.vue, frontend/src/views/DocumentView.vue]
tech_stack:
added: []
patterns: [fetch-blob-url, model_fields_set, cloud-aware-task-routing, tdd-red-green]
key_files:
created: []
modified:
- backend/api/documents.py
- backend/tasks/document_tasks.py
- backend/tests/test_cloud.py
- frontend/src/api/client.js
- frontend/src/components/documents/DocumentPreviewModal.vue
- frontend/src/views/DocumentView.vue
decisions:
- "Test 3 patches storage module (not tasks module): deferred import pattern means get_storage_backend_for_document is not a module-level attribute of document_tasks — patching at storage module level is correct"
- "Test 3 is a pure unit test (no db_session): _run() opens its own AsyncSessionLocal which requires PostgreSQL; mocking the session manager keeps the test fast and infrastructure-free"
- "node_modules symlinked into worktree frontend for build verification: worktree does not have its own node_modules; symlink to main repo preserves isolation while enabling build check"
metrics:
duration: "~25 minutes"
completed: "2026-05-30"
tasks_completed: 2
files_modified: 6
---
# Phase 5 Plan 09: Cloud Document Access Fixes Summary
Fixed three independent root causes that blocked cloud document use: unauthenticated iframe preview, hardcoded MinIO in Celery re-analyze, and missing PATCH endpoint for document metadata.
## Tasks Completed
| Task | Name | Commit | Files |
|------|------|--------|-------|
| 1 | PATCH /documents/{id} + cloud-aware Celery re-analyze | 6d094d1 | backend/api/documents.py, backend/tasks/document_tasks.py, backend/tests/test_cloud.py |
| TDD RED | Failing tests for Task 1 | 9bc0561 | backend/tests/test_cloud.py |
| 2 | Authenticated document preview — fetch + Blob URL | 4a42cce | frontend/src/api/client.js, frontend/src/components/documents/DocumentPreviewModal.vue, frontend/src/views/DocumentView.vue |
## What Was Built
**Task 1: Backend fixes**
- `DocumentPatch` Pydantic model with `Optional[str] filename` and `Optional[uuid.UUID] folder_id`; uses `model_fields_set` to distinguish "not provided" from "explicitly set to null"
- `PATCH /api/documents/{doc_id}` endpoint with ownership guard (non-owner → 404), admin guard (get_regular_user → 403), empty body guard (422), and `storage.get_metadata()` whitelist response
- `_run()` in `document_tasks.py` updated: MinIO path unchanged; non-MinIO path calls `get_storage_backend_for_document(doc, user, session)`; missing user returns `missing_user` status; `CloudConnectionError` returns `extract_failed` with generic "cloud backend error" message (no provider details)
**Task 2: Frontend fixes**
- `fetchDocumentContent(docId)` in `client.js`: authenticated fetch returning raw `Response` (not parsed JSON); single 401 → refresh → retry pattern; mirrors `request()` auth logic without `res.json()`
- `DocumentPreviewModal.vue`: replaced unauthenticated `iframe :src="proxyUrl"` with authenticated fetch → blob → `URL.createObjectURL()`; loading spinner, error state, `URL.revokeObjectURL` on unmount
- `DocumentView.vue` `openPdf()`: replaced `window.open(rawUrl)` with fetch → blob → objectUrl → `window.open(objectUrl)`; 60-second auto-revoke
## Test Results
- 3 new tests in `test_cloud.py`: all pass
- `test_patch_document_filename` — PATCH 200 with updated filename
- `test_patch_document_wrong_owner` — PATCH 404 for non-owner (IDOR guard)
- `test_reanalyze_cloud_document_routes_to_cloud_backend` — cloud backend called, MinIO not called
- Full `test_cloud.py` suite: **23 passed, 0 failed** (no regressions)
- Frontend build: **zero errors** (`vite build` exits 0)
## Deviations from Plan
### Auto-adjusted Issues
**1. [Rule 1 - Bug] Test patching at storage module level, not tasks module level**
- **Found during:** Task 1 GREEN phase
- **Issue:** Plan specified patching `tasks.document_tasks.get_storage_backend_for_document`, but that attribute does not exist at module level — the import is inside `_run()` (deferred import pattern). `unittest.mock.patch` raises `AttributeError` on absent attributes.
- **Fix:** Test patches `storage.get_storage_backend_for_document` and `storage.get_storage_backend` — the canonical source of both functions. Behavior under test is identical.
- **Files modified:** `backend/tests/test_cloud.py`
**2. [Rule 1 - Bug] Test 3 changed to pure unit test (no db_session fixture)**
- **Found during:** Task 1 GREEN phase — second attempt
- **Issue:** `_run()` opens its own `AsyncSessionLocal()` internally which requires a live PostgreSQL connection. Using `db_session` fixture in the test doesn't affect `_run()`'s internal session. Tests run without Docker → connection refused.
- **Fix:** Test mocks `db.session.AsyncSessionLocal` with a fake async context manager that returns a mock session with `session.get()` returning pre-built `MagicMock` Document and User objects. Removed `db_session` from test signature.
- **Files modified:** `backend/tests/test_cloud.py`
**3. [Rule 3 - Blocking] node_modules symlink needed for worktree build**
- **Found during:** Task 2 verification
- **Issue:** Worktree's `frontend/` has no `node_modules` (npm install runs in main repo only). `vite build` failed with `ERR_MODULE_NOT_FOUND`.
- **Fix:** Created a symlink `worktree/frontend/node_modules → main/frontend/node_modules`. Build succeeded.
- **Files modified:** `frontend/node_modules` (symlink, not tracked in git)
## Security Analysis (T-05-09 Threat Register)
| Threat ID | Status | Notes |
|-----------|--------|-------|
| T-05-09-01 | Mitigated | `get_regular_user` enforced on PATCH; admin → 403; wrong owner → 404 |
| T-05-09-02 | Mitigated | `storage.get_metadata()` response whitelist via `_doc_to_dict()` — no `credentials_enc` |
| T-05-09-03 | Mitigated | Credentials loaded inside Celery task's own DB session via `get_storage_backend_for_document` |
| T-05-09-04 | Accepted | Blob URL same-origin, revoked on unmount — no persistent exposure |
| T-05-09-SC | N/A | No new packages installed |
## Known Stubs
None — all plan features fully implemented and wired.
## Self-Check: PASSED
All created/modified files confirmed present on disk. All task commits verified in git log.
| Item | Status |
|------|--------|
| backend/api/documents.py | FOUND |
| backend/tasks/document_tasks.py | FOUND |
| backend/tests/test_cloud.py | FOUND |
| frontend/src/api/client.js | FOUND |
| frontend/src/components/documents/DocumentPreviewModal.vue | FOUND |
| frontend/src/views/DocumentView.vue | FOUND |
| .planning/phases/05-cloud-storage-backends/05-09-SUMMARY.md | FOUND |
| Commit 9bc0561 (RED tests) | FOUND |
| Commit 6d094d1 (GREEN implementation) | FOUND |
| Commit 4a42cce (frontend auth preview) | FOUND |
+60
View File
@@ -69,6 +69,19 @@ class UploadUrlRequest(BaseModel):
content_type: str content_type: str
class DocumentPatch(BaseModel):
"""Pydantic model for PATCH /api/documents/{doc_id}.
Optional fields — model_fields_set distinguishes "not provided" from "set to null".
At least one field must be present in model_fields_set (enforced in the handler).
T-05-09-01: explicit field declaration prevents mass assignment.
T-05-09-02: only filename and folder_id are accepted — no other fields can be set.
"""
filename: Optional[str] = None
folder_id: Optional[uuid.UUID] = None
# ── POST /api/documents/upload-url ─────────────────────────────────────────── # ── POST /api/documents/upload-url ───────────────────────────────────────────
@router.post("/upload-url") @router.post("/upload-url")
@@ -520,6 +533,53 @@ async def get_document(
return meta return meta
# ── PATCH /api/documents/{doc_id} ────────────────────────────────────────────
@router.patch("/{doc_id}")
async def patch_document(
doc_id: str,
body: DocumentPatch,
session: AsyncSession = Depends(get_db),
current_user: User = Depends(get_regular_user),
):
"""Update document metadata (filename and/or folder_id).
T-05-09-01: get_regular_user dep rejects admins (403) and unauthenticated (401).
T-05-09-01: ownership check — non-owner gets 404 to avoid leaking document IDs (D-16).
T-05-09-02: response uses storage.get_metadata() which excludes credentials_enc and
password_hash via the _doc_to_dict whitelist.
At least one field must be provided — empty body returns 422.
folder_id=null moves the document to the root (no folder).
"""
try:
uid = uuid.UUID(doc_id)
except ValueError:
raise HTTPException(404, "Document not found")
doc = await session.get(Document, uid)
if doc is None or doc.user_id != current_user.id:
raise HTTPException(404, "Document not found")
# Require at least one field to be set (model_fields_set tracks provided fields)
if not body.model_fields_set:
raise HTTPException(422, "At least one field (filename, folder_id) must be provided")
if "filename" in body.model_fields_set and body.filename is not None:
doc.filename = body.filename
if "folder_id" in body.model_fields_set:
# folder_id=null → move to root (no folder); folder_id=<uuid> → move to folder
doc.folder_id = body.folder_id
await session.commit()
meta = await storage.get_metadata(session, doc_id)
if meta is None:
raise HTTPException(404, "Document not found")
return meta
# ── DELETE /api/documents/{doc_id} ─────────────────────────────────────────── # ── DELETE /api/documents/{doc_id} ───────────────────────────────────────────
@router.delete("/{doc_id}") @router.delete("/{doc_id}")
+31 -3
View File
@@ -30,13 +30,17 @@ async def _run(document_id: str) -> dict:
Opens its own AsyncSession (not shared with the upload request) to avoid Opens its own AsyncSession (not shared with the upload request) to avoid
cross-thread session contamination. cross-thread session contamination.
Cloud-aware: when doc.storage_backend != 'minio', uses
get_storage_backend_for_document() to retrieve bytes from the correct
cloud backend instead of hardcoding MinIO.
""" """
import uuid as _uuid import uuid as _uuid
from db.session import AsyncSessionLocal from db.session import AsyncSessionLocal
from db.models import Document from db.models import Document
from services import extractor, classifier from services import extractor, classifier
from storage import get_storage_backend from storage import get_storage_backend, get_storage_backend_for_document
async with AsyncSessionLocal() as session: async with AsyncSessionLocal() as session:
# ── Step 1: fetch Document row ───────────────────────────────────────── # ── Step 1: fetch Document row ─────────────────────────────────────────
@@ -59,15 +63,39 @@ async def _run(document_id: str) -> dict:
ai_provider = (user.ai_provider if user else None) or app_settings.default_ai_provider ai_provider = (user.ai_provider if user else None) or app_settings.default_ai_provider
ai_model = (user.ai_model if user else None) or app_settings.default_ai_model ai_model = (user.ai_model if user else None) or app_settings.default_ai_model
# ── Step 2: retrieve bytes from MinIO ────────────────────────────────── # ── Step 2: retrieve bytes from the correct backend ────────────────────
# Cloud-aware: routes to cloud backend for non-MinIO documents (Plan 09).
# T-05-09-03: cloud credentials are loaded from DB inside this task's own
# session — no credentials travel through the Celery broker message.
try: try:
if doc.storage_backend is None or doc.storage_backend == "minio":
backend = get_storage_backend() backend = get_storage_backend()
file_bytes = await backend.get_object(doc.object_key) file_bytes = await backend.get_object(doc.object_key)
else:
# Cloud path: user must be present (doc.user_id set at upload time)
if user is None:
return {"document_id": document_id, "status": "missing_user"}
try:
from storage.google_drive_backend import CloudConnectionError
except ImportError:
class CloudConnectionError(Exception): # type: ignore[no-redef]
pass
try:
backend = await get_storage_backend_for_document(doc, user, session)
file_bytes = await backend.get_object(doc.object_key)
except CloudConnectionError:
return {
"document_id": document_id,
"status": "extract_failed",
"error": "cloud backend error",
}
except Exception as e: except Exception as e:
return { return {
"document_id": document_id, "document_id": document_id,
"status": "extract_failed", "status": "extract_failed",
"error": f"MinIO retrieval failed: {e}", "error": f"retrieval failed: {e}",
} }
# ── Step 3: extract text from bytes ──────────────────────────────────── # ── Step 3: extract text from bytes ────────────────────────────────────
+148
View File
@@ -563,3 +563,151 @@ async def test_cross_user_idor(
) )
assert resp.status_code == 404 assert resp.status_code == 404
# ── Plan 09 tests: PATCH /documents/{id} and cloud-aware re-analyze ──────────
async def test_patch_document_filename(async_client, db_session):
"""PATCH /api/documents/{id} with {filename} returns 200 with updated filename.
Covers T-05-09-01: ownership enforced via get_regular_user.
"""
from db.models import Document
auth = await _create_user_and_token(db_session, role="user")
# Create a document owned by this user
doc_id = _uuid.uuid4()
doc = Document(
id=doc_id,
user_id=auth["user"].id,
filename="original.pdf",
content_type="application/pdf",
size_bytes=1024,
storage_backend="minio",
status="uploaded",
object_key=f"{auth['user'].id}/{doc_id}/some-uuid.pdf",
)
db_session.add(doc)
await db_session.commit()
resp = await async_client.patch(
f"/api/documents/{doc_id}",
json={"filename": "renamed.pdf"},
headers=auth["headers"],
)
assert resp.status_code == 200
data = resp.json()
assert data["filename"] == "renamed.pdf" or data.get("original_name") == "renamed.pdf"
async def test_patch_document_wrong_owner(async_client, db_session):
"""PATCH /api/documents/{id} by a non-owner returns 404 (IDOR protection).
Covers T-05-09-01: cross-user access returns 404, not 403, to avoid leaking
which document IDs exist for other users (D-16, T-03-11).
"""
from db.models import Document
auth1 = await _create_user_and_token(db_session, role="user")
auth2 = await _create_user_and_token(db_session, role="user")
# Create a document owned by user1
doc_id = _uuid.uuid4()
doc = Document(
id=doc_id,
user_id=auth1["user"].id,
filename="private.pdf",
content_type="application/pdf",
size_bytes=512,
storage_backend="minio",
status="uploaded",
object_key=f"{auth1['user'].id}/{doc_id}/some-uuid.pdf",
)
db_session.add(doc)
await db_session.commit()
# User2 tries to rename user1's document
resp = await async_client.patch(
f"/api/documents/{doc_id}",
json={"filename": "hacked.pdf"},
headers=auth2["headers"],
)
assert resp.status_code == 404
async def test_reanalyze_cloud_document_routes_to_cloud_backend():
"""Re-analyze task calls get_storage_backend_for_document for cloud documents.
Verifies that doc.storage_backend != 'minio' causes _run() to use the cloud
backend path instead of the MinIO path (Plan 09, requirement CLOUD-07).
Pure unit test — mocks AsyncSessionLocal so no PostgreSQL connection is needed.
"""
from tasks.document_tasks import _run
from unittest.mock import AsyncMock, patch, MagicMock
doc_id = _uuid.uuid4()
user_id = _uuid.uuid4()
# Build a minimal mock Document and User (no DB)
mock_doc = MagicMock()
mock_doc.id = doc_id
mock_doc.user_id = user_id
mock_doc.storage_backend = "nextcloud"
mock_doc.object_key = "nc_file_id_xyz"
mock_doc.content_type = "application/pdf"
mock_doc.filename = "cloud.pdf"
mock_doc.status = "uploaded"
mock_user = MagicMock()
mock_user.id = user_id
mock_user.ai_provider = None
mock_user.ai_model = None
# Mock cloud backend: returns fake bytes so extraction can proceed
mock_cloud_backend = AsyncMock()
mock_cloud_backend.get_object = AsyncMock(return_value=b"%PDF-1.4 fake")
# Mock MinIO backend to verify it is NOT called
mock_minio_backend = AsyncMock()
mock_minio_backend.get_object = AsyncMock(return_value=b"should not be called")
# Mock the DB session returned by AsyncSessionLocal
mock_session = AsyncMock()
async def _fake_get(model, pk):
if model.__name__ == "Document":
return mock_doc
if model.__name__ == "User":
return mock_user
return None
mock_session.get = _fake_get
# AsyncSessionLocal is an async context manager; mock it
class _FakeSessionCM:
async def __aenter__(self):
return mock_session
async def __aexit__(self, *args):
pass
# Patch at the storage module level (source of the functions used via deferred import)
with patch("db.session.AsyncSessionLocal", return_value=_FakeSessionCM()), \
patch("storage.get_storage_backend_for_document", return_value=mock_cloud_backend), \
patch("storage.get_storage_backend", return_value=mock_minio_backend), \
patch("services.extractor.extract_text_from_bytes", return_value="extracted text"), \
patch("services.classifier.classify_document", return_value=["doc"]):
result = await _run(str(doc_id))
# Cloud backend's get_object must have been called with the document's object_key
mock_cloud_backend.get_object.assert_called_once_with("nc_file_id_xyz")
# MinIO backend's get_object must NOT have been called
mock_minio_backend.get_object.assert_not_called()
# Result must reflect successful classification, not a MinIO error
assert result.get("status") in ("classified", "classification_failed"), \
f"Expected classified/classification_failed, got: {result}"
+43
View File
@@ -365,6 +365,49 @@ export function getDocumentContentUrl(docId) {
return `/api/documents/${docId}/content` return `/api/documents/${docId}/content`
} }
/**
* Fetch document content bytes with authentication, returning the raw Response.
*
* Unlike request(), this function does NOT call res.json() — it returns the raw
* Response so callers can call .blob() to build an object URL for iframe preview
* or window.open() without an unauthenticated src= attribute.
*
* On 401: attempts one token refresh via authStore.refresh() then retries.
* On refresh failure: clears auth state and throws 'Session expired'.
*
* Security: closes the unauthenticated content-access gap where an iframe src=
* or window.open() with a raw /content URL would bypass the Bearer auth check
* in cases where the browser does not send the cookie (cross-origin, incognito).
* See plan 05-09 trust boundary: frontend→/api/documents/{id}/content.
*/
export async function fetchDocumentContent(docId, options = {}) {
const { useAuthStore } = await import('../stores/auth.js')
const authStore = useAuthStore()
const headers = {}
if (authStore.accessToken) {
headers['Authorization'] = `Bearer ${authStore.accessToken}`
}
const res = await fetch(`/api/documents/${docId}/content`, {
headers,
credentials: 'include',
})
if (res.status === 401 && !options._retry) {
try {
await authStore.refresh()
return fetchDocumentContent(docId, { _retry: true })
} catch {
authStore.accessToken = null
authStore.user = null
throw new Error('Session expired')
}
}
return res
}
// ── Cloud Storage ───────────────────────────────────────────────────────────── // ── Cloud Storage ─────────────────────────────────────────────────────────────
export function listCloudConnections() { export function listCloudConnections() {
@@ -23,10 +23,37 @@
</div> </div>
<!-- iframe content --> <!-- iframe content -->
<div class="flex-1 overflow-hidden"> <div class="flex-1 overflow-hidden relative">
<!-- Loading state -->
<div
v-if="loading"
class="absolute inset-0 flex items-center justify-center bg-gray-50"
>
<div class="flex flex-col items-center gap-3 text-gray-400">
<svg class="w-8 h-8 animate-spin" fill="none" viewBox="0 0 24 24">
<circle class="opacity-25" cx="12" cy="12" r="10" stroke="currentColor" stroke-width="4"></circle>
<path class="opacity-75" fill="currentColor" d="M4 12a8 8 0 018-8V0C5.373 0 0 5.373 0 12h4zm2 5.291A7.962 7.962 0 014 12H0c0 3.042 1.135 5.824 3 7.938l3-2.647z"></path>
</svg>
<span class="text-sm">Loading preview</span>
</div>
</div>
<!-- Error state -->
<div
v-else-if="loadError"
class="absolute inset-0 flex items-center justify-center bg-gray-50"
>
<div class="text-center text-red-500">
<p class="font-medium">Preview failed</p>
<p class="text-sm mt-1">{{ loadError }}</p>
</div>
</div>
<!-- iframe: shown only when blobUrl is ready -->
<iframe <iframe
v-if="blobUrl"
class="w-full h-full border-0" class="w-full h-full border-0"
:src="proxyUrl" :src="blobUrl"
title="Document preview" title="Document preview"
></iframe> ></iframe>
</div> </div>
@@ -34,7 +61,8 @@
</template> </template>
<script setup> <script setup>
import { ref, computed, onMounted, onUnmounted } from 'vue' import { ref, watch, onMounted, onUnmounted } from 'vue'
import { fetchDocumentContent } from '../../api/client.js'
const props = defineProps({ const props = defineProps({
doc: { doc: {
@@ -46,8 +74,34 @@ const props = defineProps({
const emit = defineEmits(['close']) const emit = defineEmits(['close'])
const overlayRef = ref(null) const overlayRef = ref(null)
const blobUrl = ref(null)
const loadError = ref(null)
const loading = ref(true)
const proxyUrl = computed(() => `/api/documents/${props.doc.id}/content`) async function loadContent(docId) {
loading.value = true
loadError.value = null
// Revoke previous blob URL to free memory
if (blobUrl.value) {
URL.revokeObjectURL(blobUrl.value)
blobUrl.value = null
}
try {
const res = await fetchDocumentContent(docId)
if (!res.ok) {
loadError.value = `Failed to load document (HTTP ${res.status})`
return
}
const blob = await res.blob()
blobUrl.value = URL.createObjectURL(blob)
} catch (err) {
loadError.value = err.message || 'Failed to load document'
} finally {
loading.value = false
}
}
function handleOverlayClick(e) { function handleOverlayClick(e) {
if (e.target === overlayRef.value) { if (e.target === overlayRef.value) {
@@ -63,9 +117,20 @@ function handleKeydown(e) {
onMounted(() => { onMounted(() => {
document.addEventListener('keydown', handleKeydown) document.addEventListener('keydown', handleKeydown)
loadContent(props.doc.id)
})
// Reload if the document changes while modal is open
watch(() => props.doc.id, (newId) => {
loadContent(newId)
}) })
onUnmounted(() => { onUnmounted(() => {
document.removeEventListener('keydown', handleKeydown) document.removeEventListener('keydown', handleKeydown)
// Release the object URL to free browser memory
if (blobUrl.value) {
URL.revokeObjectURL(blobUrl.value)
blobUrl.value = null
}
}) })
</script> </script>
+19 -2
View File
@@ -119,6 +119,7 @@ import DocumentPreviewModal from '../components/documents/DocumentPreviewModal.v
import { useDocumentsStore } from '../stores/documents.js' import { useDocumentsStore } from '../stores/documents.js'
import { useTopicsStore } from '../stores/topics.js' import { useTopicsStore } from '../stores/topics.js'
import * as api from '../api/client.js' import * as api from '../api/client.js'
import { fetchDocumentContent } from '../api/client.js'
const route = useRoute() const route = useRoute()
const router = useRouter() const router = useRouter()
@@ -157,11 +158,27 @@ onMounted(async () => {
} }
}) })
function openPdf() { async function openPdf() {
if (pdfOpenMode.value === 'in_app') { if (pdfOpenMode.value === 'in_app') {
showPreviewModal.value = true showPreviewModal.value = true
} else { } else {
window.open(api.getDocumentContentUrl(doc.value.id), '_blank') // Fetch with Authorization header → blob → object URL → window.open
// This closes the unauthenticated access gap: window.open(rawUrl) would bypass
// Bearer auth for cloud documents (plan 05-09 trust boundary).
try {
const res = await fetchDocumentContent(doc.value.id)
if (!res.ok) {
console.error('Failed to open document:', res.status)
return
}
const blob = await res.blob()
const objectUrl = URL.createObjectURL(blob)
window.open(objectUrl, '_blank')
// Revoke after a delay to allow the new tab to load the content
setTimeout(() => URL.revokeObjectURL(objectUrl), 60000)
} catch (err) {
console.error('Failed to open document:', err)
}
} }
} }