6fed5ba531
Research, pattern mapping, and verification complete. Walking Skeleton mode active (MVP Phase 1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
605 lines
53 KiB
Markdown
605 lines
53 KiB
Markdown
---
|
|
phase: 01-infrastructure-foundation
|
|
plan: 05
|
|
type: execute
|
|
wave: 4
|
|
depends_on:
|
|
- 01-04
|
|
files_modified:
|
|
- backend/main.py
|
|
- backend/api/documents.py
|
|
- backend/api/topics.py
|
|
- backend/celery_app.py
|
|
- backend/tasks/__init__.py
|
|
- backend/tasks/document_tasks.py
|
|
- backend/services/classifier.py
|
|
- backend/config.py
|
|
- backend/tests/conftest.py
|
|
- backend/tests/test_documents.py
|
|
- backend/data
|
|
autonomous: false
|
|
requirements:
|
|
- STORE-01
|
|
- STORE-07
|
|
user_setup: []
|
|
tags:
|
|
- api-wiring
|
|
- lifespan
|
|
- celery
|
|
- cutover
|
|
- walking-skeleton
|
|
|
|
must_haves:
|
|
truths:
|
|
- "`backend/main.py` lifespan opens the MinIO client, auto-creates the `docuvault` bucket if missing, attaches it to `app.state.minio`, and disposes the SQLAlchemy engine on shutdown"
|
|
- "`backend/main.py` `/health` endpoint returns `{\"status\": \"ok\"|\"degraded\", \"checks\": {\"postgres\": \"ok\"|\"error: ...\", \"minio\": \"ok\"|\"error: ...\"}}` (D-07)"
|
|
- "`backend/api/documents.py` upload, list, get, delete, classify endpoints all inject `session: AsyncSession = Depends(get_db)` and call the async `services.storage.*` functions"
|
|
- "`backend/api/topics.py` list/create/update/delete/suggest endpoints all inject the session dependency"
|
|
- "`backend/celery_app.py` instantiates a Celery app with broker + result_backend from `REDIS_URL`, JSON serialization, and a `documents` queue route"
|
|
- "`backend/tasks/document_tasks.py` declares a sync `def extract_and_classify(document_id: str) -> dict` Celery task that the upload handler calls via `.delay(...)`"
|
|
- "FastAPI `BackgroundTasks` usage is removed (STORE-08 satisfied for Phase 1; was never directly used in current codebase but the `await classifier.classify_document` inline call in the upload handler is replaced with `.delay()`)"
|
|
- "All legacy flat-file constants and helpers (`UPLOADS_DIR`, `METADATA_DIR`, `TOPICS_FILE`, `ensure_data_dirs`) are removed from `backend/config.py`; the `data/` directory contents are deleted (D-04)"
|
|
- "Existing `tests/test_documents.py` sync tests are DELETED (cutover) and the `_async` variants from Plan 02 have their `@pytest.mark.xfail` markers removed; the legacy `client` (sync TestClient) fixture is removed from conftest.py"
|
|
- "Walking-skeleton end-to-end verification passes: `docker compose up` boots all services healthy; a real PDF upload through `POST /api/documents/upload` persists to PostgreSQL + MinIO; Celery worker logs show `extract_and_classify` ran; the document appears in `GET /api/documents`"
|
|
artifacts:
|
|
- path: "backend/main.py"
|
|
provides: "Lifespan with engine + MinIO bucket init; extended /health"
|
|
contains: "app.state.minio"
|
|
- path: "backend/celery_app.py"
|
|
provides: "Celery app with Redis broker + JSON serialization + documents queue"
|
|
contains: "Celery(\"docuvault\")"
|
|
- path: "backend/tasks/document_tasks.py"
|
|
provides: "extract_and_classify Celery task"
|
|
contains: "@celery_app.task"
|
|
- path: "backend/api/documents.py"
|
|
provides: "Async route handlers using AsyncSession + Celery .delay()"
|
|
contains: "session: AsyncSession = Depends(get_db)"
|
|
- path: "backend/api/topics.py"
|
|
provides: "Async route handlers using AsyncSession"
|
|
contains: "session: AsyncSession = Depends(get_db)"
|
|
- path: "backend/services/classifier.py"
|
|
provides: "Updated to accept session and called from a sync wrapper inside Celery tasks"
|
|
contains: "async def classify_document"
|
|
- path: "backend/config.py"
|
|
provides: "Cleaned up Pydantic Settings — legacy data-dir constants removed"
|
|
contains: "class Settings(BaseSettings)"
|
|
key_links:
|
|
- from: "backend/api/documents.py upload handler"
|
|
to: "backend/tasks/document_tasks.extract_and_classify"
|
|
via: "extract_and_classify.delay(str(saved['id']))"
|
|
pattern: "extract_and_classify\\.delay"
|
|
- from: "backend/main.py lifespan"
|
|
to: "MinIO bucket auto-create"
|
|
via: "make_bucket if not bucket_exists"
|
|
pattern: "make_bucket|bucket_exists"
|
|
- from: "backend/main.py /health"
|
|
to: "AsyncSessionLocal + minio_client.bucket_exists"
|
|
via: "probe queries"
|
|
pattern: "AsyncSessionLocal|bucket_exists"
|
|
- from: "backend/celery_app.py"
|
|
to: "REDIS_URL env var"
|
|
via: "os.environ.get('REDIS_URL', ...)"
|
|
pattern: "REDIS_URL"
|
|
---
|
|
|
|
<objective>
|
|
Complete the Phase 1 cutover: wire every API route to the async storage layer, replace inline classification with a Celery `.delay()` call, extend the FastAPI lifespan with MinIO bucket creation + engine disposal, rewrite `/health` to probe PostgreSQL + MinIO (D-07), introduce `celery_app.py` + `tasks/document_tasks.py`, remove every legacy flat-file artifact (D-04), delete the legacy sync tests + sync TestClient fixture, and verify the walking skeleton end-to-end against a live Docker stack.
|
|
|
|
Purpose: This plan closes the loop. After it ships, ROADMAP.md Phase 1 success criteria #1, #3, and #4 are all satisfied (criterion #2 was satisfied in Plan 03). Phase 1 ends with a usable single-user app whose entire internal architecture is the multi-user-ready PostgreSQL + MinIO + Celery stack.
|
|
|
|
Output: Updated `backend/main.py`, `backend/api/documents.py`, `backend/api/topics.py`, `backend/services/classifier.py`; new `backend/celery_app.py` + `backend/tasks/document_tasks.py`; cleaned `backend/config.py`; final test-suite cutover; deletion of `data/` directory; and a passing end-to-end walking-skeleton verification checkpoint.
|
|
</objective>
|
|
|
|
<execution_context>
|
|
@$HOME/.claude/get-shit-done/workflows/execute-plan.md
|
|
@$HOME/.claude/get-shit-done/templates/summary.md
|
|
</execution_context>
|
|
|
|
<context>
|
|
@CLAUDE.md
|
|
@.planning/PROJECT.md
|
|
@.planning/ROADMAP.md
|
|
@.planning/STATE.md
|
|
@.planning/phases/01-infrastructure-foundation/01-CONTEXT.md
|
|
@.planning/phases/01-infrastructure-foundation/01-RESEARCH.md
|
|
@.planning/phases/01-infrastructure-foundation/01-PATTERNS.md
|
|
@.planning/phases/01-infrastructure-foundation/SKELETON.md
|
|
@.planning/phases/01-infrastructure-foundation/01-04-SUMMARY.md
|
|
@backend/services/storage.py
|
|
@backend/db/models.py
|
|
@backend/db/session.py
|
|
@backend/deps/db.py
|
|
@backend/storage/__init__.py
|
|
|
|
<interfaces>
|
|
After Plan 04, the async storage layer is in place. This plan wires consumers.
|
|
|
|
Existing `api/documents.py` consumer points (must be ported to async + session injection):
|
|
- `storage.save_upload(content, file.filename, mime)` → `await storage.save_upload(session, content, file.filename, mime)`
|
|
- `storage.save_metadata(meta)` → `await storage.save_metadata(session, meta)`
|
|
- `storage.list_metadata(topic=topic)` → `await storage.list_metadata(session, topic=topic)`
|
|
- `storage.get_metadata(doc_id)` → `await storage.get_metadata(session, doc_id)`
|
|
- `storage.delete_document(doc_id)` → `await storage.delete_document(session, doc_id)`
|
|
- `await classifier.classify_document(saved["id"])` → `extract_and_classify.delay(saved["id"])` (Celery task — STORE-08)
|
|
|
|
Existing `api/topics.py` consumer points:
|
|
- `storage.load_topics()` → `await storage.load_topics(session)`
|
|
- `storage.topic_doc_counts()` → `await storage.topic_doc_counts(session)`
|
|
- `storage.create_topic(...)` → `await storage.create_topic(session, ...)`
|
|
- `storage.update_topic(...)` → `await storage.update_topic(session, ...)`
|
|
- `storage.delete_topic(...)` → `await storage.delete_topic(session, ...)`
|
|
- `storage.get_metadata(...)` → `await storage.get_metadata(session, ...)` (used by `/suggest`)
|
|
|
|
Existing `services/classifier.py` consumer points (called by both the soon-removed inline upload path and the new Celery task; module signature changes from `async def classify_document(doc_id)` accepting no session to `async def classify_document(session, doc_id)`):
|
|
- Used inside the Celery task wrapper via `asyncio.run(classify_document(session, doc_id))` after manually opening a session
|
|
|
|
`api/settings.py` — KEEP AS-IS. The `settings.json` flat file lives until Phase 2 (D-03 settings deferred); the `services/storage.load_settings()` / `save_settings()` functions remain sync per Plan 04.
|
|
|
|
`main.py` lifespan contract (current → new):
|
|
```python
|
|
# current
|
|
async def lifespan(app):
|
|
ensure_data_dirs()
|
|
yield
|
|
|
|
# new
|
|
async def lifespan(app):
|
|
# MinIO bucket auto-create
|
|
minio_client = Minio(settings.minio_endpoint, access_key=..., secret_key=..., secure=False)
|
|
if not await asyncio.to_thread(minio_client.bucket_exists, settings.minio_bucket):
|
|
await asyncio.to_thread(minio_client.make_bucket, settings.minio_bucket)
|
|
app.state.minio = minio_client
|
|
yield
|
|
await engine.dispose()
|
|
```
|
|
|
|
`/health` response contract (D-07):
|
|
```json
|
|
{
|
|
"status": "ok",
|
|
"checks": {"postgres": "ok", "minio": "ok"}
|
|
}
|
|
```
|
|
Or `"status": "degraded"` if any check is not `"ok"`.
|
|
</interfaces>
|
|
</context>
|
|
|
|
<tasks>
|
|
|
|
<task type="auto" tdd="true">
|
|
<name>Task 1: Introduce backend/celery_app.py + backend/tasks/document_tasks.py and update services/classifier.py</name>
|
|
<files>backend/celery_app.py, backend/tasks/__init__.py, backend/tasks/document_tasks.py, backend/services/classifier.py</files>
|
|
<behavior>
|
|
- `from celery_app import celery_app` imports the configured Celery instance
|
|
- `celery_app.conf.broker_url` and `celery_app.conf.result_backend` both read from `REDIS_URL` env var (falling back to `redis://redis:6379/0` if unset)
|
|
- `celery_app.conf.task_serializer == "json"` and `celery_app.conf.accept_content == ["json"]`
|
|
- `celery_app.conf.task_routes` routes `tasks.document_tasks.*` to the `documents` queue
|
|
- `tasks.document_tasks.extract_and_classify(document_id: str)` is a plain `def` (NOT `async def`) decorated with `@celery_app.task(name="tasks.document_tasks.extract_and_classify")`
|
|
- The task opens a fresh `AsyncSession` via `asyncio.run(...)` around the async body, calls `services.extractor.extract_text(...)` on the bytes pulled from MinIO via `MinIOBackend.get_object`, persists the extracted text via `services.storage.save_metadata`, then calls `services.classifier.classify_document(session, doc_id)` and persists the result
|
|
- Failures in classification do not raise — they update the document's `status` to `"classification_failed"` and store the error string in a `classification_error` field on the returned dict (parity with the existing non-fatal-classification pattern in `api/documents.py`)
|
|
- `services/classifier.py` is updated to accept a `session: AsyncSession` as its first arg; the previous `storage.get_metadata(doc_id)` becomes `await storage.get_metadata(session, doc_id)`; same pattern for `storage.load_settings()` (still sync, no change), `storage.load_topics(session)`, `storage.create_topic(session, name)`, `storage.update_document_topics(session, doc_id, topics)`
|
|
- `tasks/__init__.py` exists (empty file is acceptable) so `tasks/` is recognized as a package
|
|
</behavior>
|
|
<read_first>
|
|
- .planning/phases/01-infrastructure-foundation/01-RESEARCH.md (Pattern 5 — Celery + Redis configuration; Pitfall 7 — keep celery_app.py minimal to avoid circular imports; Anti-Pattern: do not use async def for Celery task functions)
|
|
- .planning/phases/01-infrastructure-foundation/01-PATTERNS.md (backend/celery_app.py + backend/tasks/document_tasks.py sections)
|
|
- .planning/phases/01-infrastructure-foundation/01-CONTEXT.md (D-08 Celery+Redis; D-10 celery-worker service exists per Plan 01)
|
|
- backend/services/classifier.py (read in full — every `storage.*` call site needs an `await ... (session, ...)` rewrite)
|
|
- backend/services/extractor.py (read once — verify which function name is used, then call it from the Celery task; do NOT modify this file)
|
|
- backend/db/session.py (Plan 03 output — confirm `AsyncSessionLocal` is the exported symbol)
|
|
- backend/services/storage.py (Plan 04 output — confirm async function signatures the task will call)
|
|
</read_first>
|
|
<action>
|
|
Create `backend/celery_app.py` with minimal imports per Pitfall 7: `import os`, `from celery import Celery`. Instantiate `celery_app = Celery("docuvault")`. Configure: `celery_app.conf.broker_url = os.environ.get("REDIS_URL", "redis://redis:6379/0")`, `celery_app.conf.result_backend = os.environ.get("REDIS_URL", "redis://redis:6379/0")`, `celery_app.conf.task_serializer = "json"`, `celery_app.conf.result_serializer = "json"`, `celery_app.conf.accept_content = ["json"]`, `celery_app.conf.task_routes = {"tasks.document_tasks.*": {"queue": "documents"}}`. Then `celery_app.autodiscover_tasks(["tasks"], force=True)` so registering tasks under `tasks/` works without an explicit import. Critically, DO NOT import `from config import settings` here — `config.py` triggers Pydantic Settings env-loading that may pull in FastAPI-related side effects in some setups. Read REDIS_URL directly from `os.environ`.
|
|
|
|
Create `backend/tasks/__init__.py` as an empty file.
|
|
|
|
Create `backend/tasks/document_tasks.py`. Imports: `import asyncio`, `from celery_app import celery_app`, `from db.session import AsyncSessionLocal`, `from services import storage, extractor, classifier`, `from storage import get_storage_backend`. Define the task:
|
|
|
|
```python
|
|
@celery_app.task(name="tasks.document_tasks.extract_and_classify")
|
|
def extract_and_classify(document_id: str) -> dict:
|
|
return asyncio.run(_run(document_id))
|
|
|
|
async def _run(document_id: str) -> dict:
|
|
async with AsyncSessionLocal() as session:
|
|
meta = await storage.get_metadata(session, document_id)
|
|
if meta is None:
|
|
return {"document_id": document_id, "status": "not_found"}
|
|
# Fetch the bytes from MinIO so the extractor can read them
|
|
backend = get_storage_backend()
|
|
try:
|
|
obj_key = meta.get("object_key") or meta.get("path")
|
|
# The object_key shape is {user_id}/{doc_id}/{uuid4}{ext} — retrieve via storage_backend
|
|
# We don't have object_key on the metadata dict in v1 — read from DB directly:
|
|
from db.models import Document
|
|
import uuid as _uuid
|
|
doc = await session.get(Document, _uuid.UUID(document_id))
|
|
if doc is None or not doc.object_key:
|
|
return {"document_id": document_id, "status": "missing_object"}
|
|
file_bytes = await backend.get_object(doc.object_key)
|
|
text = extractor.extract_text_from_bytes(file_bytes, doc.content_type) if hasattr(extractor, "extract_text_from_bytes") else extractor.extract_text_bytes(file_bytes, doc.content_type)
|
|
meta["extracted_text"] = text
|
|
await storage.save_metadata(session, meta)
|
|
except Exception as e:
|
|
return {"document_id": document_id, "status": "extract_failed", "error": str(e)}
|
|
try:
|
|
topics = await classifier.classify_document(session, document_id)
|
|
return {"document_id": document_id, "status": "classified", "topics": topics}
|
|
except Exception as e:
|
|
# Non-fatal — preserve the existing convention from api/documents.py line 54-56
|
|
doc.status = "classification_failed"
|
|
await session.commit()
|
|
return {"document_id": document_id, "status": "classification_failed", "error": str(e)}
|
|
```
|
|
|
|
Note: If `services/extractor.py` only exposes `extract_text(path, mime)` (file-path-based), add a new helper `extract_text_from_bytes(file_bytes: bytes, mime: str)` to `services/extractor.py` that writes `file_bytes` to a `tempfile.NamedTemporaryFile(suffix=...)`, calls the existing `extract_text(tmp.name, mime)`, and unlinks the temp file. Do not modify any other behavior in `services/extractor.py`.
|
|
|
|
Update `backend/services/classifier.py`: change `async def classify_document(doc_id: str, topic_names: list[str] | None = None)` to `async def classify_document(session: AsyncSession, doc_id: str, topic_names: list[str] | None = None)`. Add `from sqlalchemy.ext.asyncio import AsyncSession` at the top. Replace `storage.get_metadata(doc_id)` → `await storage.get_metadata(session, doc_id)`. Replace `storage.load_settings()` → `storage.load_settings()` (unchanged — Phase 1 keeps the flat file; this is sync). Replace `storage.load_topics()` → `await storage.load_topics(session)` (note signature change — adapter call). Replace `storage.create_topic(name.strip())` → `await storage.create_topic(session, name.strip())`. Replace `storage.update_document_topics(doc_id, final_topics)` → `await storage.update_document_topics(session, doc_id, final_topics)`. Apply the same session-injection treatment to `suggest_topics_for_document(session, doc_id)`. Preserve `MAX_AI_CHARS = 8_000` and every other line verbatim.
|
|
</action>
|
|
<verify>
|
|
<automated>cd /Users/nik/Documents/Progamming/document_scanner/backend && python3 -c "
|
|
import os
|
|
os.environ.setdefault('REDIS_URL', 'redis://localhost:6379/0')
|
|
from celery_app import celery_app
|
|
assert celery_app.conf.task_serializer == 'json'
|
|
assert celery_app.conf.accept_content == ['json']
|
|
assert 'tasks.document_tasks.*' in celery_app.conf.task_routes
|
|
assert celery_app.conf.task_routes['tasks.document_tasks.*'] == {'queue': 'documents'}
|
|
from tasks.document_tasks import extract_and_classify
|
|
import inspect
|
|
assert not inspect.iscoroutinefunction(extract_and_classify), 'Celery task must be sync def (not async def)'
|
|
# Verify the task is registered with Celery
|
|
registered = celery_app.tasks
|
|
assert 'tasks.document_tasks.extract_and_classify' in registered, f'task not registered; have: {list(registered.keys())[-5:]}'
|
|
import services.classifier as cl
|
|
sig = inspect.signature(cl.classify_document)
|
|
assert list(sig.parameters.keys())[0] == 'session', f'classify_document first param should be session, got: {list(sig.parameters.keys())}'
|
|
print('celery-task-ok')
|
|
"</automated>
|
|
</verify>
|
|
<acceptance_criteria>
|
|
- `backend/celery_app.py` exists and contains `celery_app = Celery("docuvault")`
|
|
- `backend/celery_app.py` does NOT import from `config` (Pitfall 7 — verifiable: `grep -c "from config\|import config" backend/celery_app.py | grep -q "^0$"`)
|
|
- `backend/celery_app.py` contains `task_routes = {"tasks.document_tasks.*": {"queue": "documents"}}`
|
|
- `backend/tasks/__init__.py` exists
|
|
- `backend/tasks/document_tasks.py` contains `@celery_app.task(name="tasks.document_tasks.extract_and_classify")`
|
|
- `backend/tasks/document_tasks.py` defines `def extract_and_classify` as a sync `def` (NOT `async def`) — verifiable via the inline `inspect.iscoroutinefunction` assertion
|
|
- `backend/tasks/document_tasks.py` uses `asyncio.run` to invoke the async body (verifiable: `grep -c "asyncio.run" backend/tasks/document_tasks.py >= 1`)
|
|
- `backend/services/classifier.py` first parameter of `classify_document` is `session` (verified by the inline signature inspection)
|
|
- `backend/services/classifier.py` calls `await storage.get_metadata(session, doc_id)` and `await storage.update_document_topics(session, doc_id, ...)`
|
|
- `services/extractor.py` either already exposes a bytes-based extraction function OR a new `extract_text_from_bytes` helper is added; in either case the Celery task can import and call it without raising on import — verifiable via `python3 -c "from services import extractor; assert hasattr(extractor, 'extract_text_from_bytes') or hasattr(extractor, 'extract_text_bytes') or hasattr(extractor, 'extract_text')"` exits 0
|
|
- The Verify command prints `celery-task-ok`
|
|
</acceptance_criteria>
|
|
<done>Celery is wired with a Redis-backed broker; the `extract_and_classify` task is registered and discoverable; `services/classifier.py` is session-aware; the Phase 1 background worker contract is in place.</done>
|
|
</task>
|
|
|
|
<task type="auto" tdd="true">
|
|
<name>Task 2: Wire backend/main.py lifespan + /health, rewrite backend/api/documents.py and backend/api/topics.py to async session injection</name>
|
|
<files>backend/main.py, backend/api/documents.py, backend/api/topics.py</files>
|
|
<behavior>
|
|
- `GET /health` returns HTTP 200 with body `{"status": "ok", "checks": {"postgres": "ok", "minio": "ok"}}` when both services are healthy
|
|
- `GET /health` returns HTTP 200 with body `{"status": "degraded", "checks": {"postgres": "error: ...", "minio": "ok"}}` (or analogous shape) when one service is unreachable — `/health` never returns 5xx
|
|
- `POST /api/documents/upload` calls `await storage.save_upload(session, ...)` then `extract_and_classify.delay(str(saved["id"]))` if `auto_classify` is true; the response shape preserves `{"id", "original_name", "filename", "mime_type", "size_bytes", "extracted_text", "topics", "created_at", "classified_at"}` so the frontend continues to work
|
|
- `GET /api/documents` calls `await storage.list_metadata(session, topic=topic)` and paginates the result
|
|
- `GET /api/documents/{doc_id}` and `DELETE /api/documents/{doc_id}` use the session dependency
|
|
- `POST /api/documents/{doc_id}/classify` injects the session and either calls the Celery task with `.delay(...)` or `await classifier.classify_document(session, doc_id, topic_names)` synchronously and returns the result; choose the synchronous in-route call for this endpoint because it has historically returned the topic list (preserve behavior — Phase 4 may change this)
|
|
- `GET /api/topics`, `POST /api/topics`, `PATCH /api/topics/{topic_id}`, `DELETE /api/topics/{topic_id}`, `POST /api/topics/suggest` all inject the session dependency
|
|
- `backend/main.py` lifespan creates the MinIO client, auto-creates the `docuvault` bucket if absent, stores it on `app.state.minio`, and disposes `engine` on shutdown
|
|
- `backend/main.py` no longer calls `ensure_data_dirs()` (legacy)
|
|
</behavior>
|
|
<read_first>
|
|
- backend/main.py (current 34-line file — preserve `app = FastAPI(...)`, `app.add_middleware(CORSMiddleware, ...)`, `app.include_router(...)` calls; replace only the lifespan body and the `/health` handler)
|
|
- backend/api/documents.py (current 102 lines — read every route handler; preserve every `HTTPException` message and the `ALLOWED_MIME_TYPES` set verbatim)
|
|
- backend/api/topics.py (current 73 lines — read every route handler; preserve Pydantic models `TopicCreate`, `TopicUpdate`, `SuggestRequest` verbatim)
|
|
- backend/services/storage.py (Plan 04 output — async function signatures)
|
|
- backend/db/session.py (Plan 03 output — `AsyncSessionLocal`, `engine`)
|
|
- backend/deps/db.py (Plan 03 output — `get_db`)
|
|
- backend/tasks/document_tasks.py (Task 1 output — `extract_and_classify`)
|
|
- .planning/phases/01-infrastructure-foundation/01-PATTERNS.md (backend/main.py + backend/api/documents.py + backend/api/topics.py sections)
|
|
- .planning/phases/01-infrastructure-foundation/01-RESEARCH.md (Pattern 4 — MinIO bucket initialization at startup)
|
|
- .planning/phases/01-infrastructure-foundation/01-CONTEXT.md (D-07 /health extended)
|
|
</read_first>
|
|
<action>
|
|
Rewrite `backend/main.py`. Imports: keep `from contextlib import asynccontextmanager`, `from fastapi import FastAPI, Request`, `from fastapi.middleware.cors import CORSMiddleware`, `from api.documents import router as documents_router`, `from api.topics import router as topics_router`, `from api.settings import router as settings_router`. Add `import asyncio`, `from minio import Minio`, `from sqlalchemy import text`, `from db.session import engine, AsyncSessionLocal`, `from config import settings`. DO NOT import `ensure_data_dirs`.
|
|
|
|
Rewrite the lifespan function:
|
|
```python
|
|
@asynccontextmanager
|
|
async def lifespan(app: FastAPI):
|
|
# MinIO bucket initialization (RESEARCH.md Pattern 4)
|
|
minio_client = Minio(
|
|
settings.minio_endpoint,
|
|
access_key=settings.minio_access_key,
|
|
secret_key=settings.minio_secret_key,
|
|
secure=False,
|
|
)
|
|
exists = await asyncio.to_thread(minio_client.bucket_exists, settings.minio_bucket)
|
|
if not exists:
|
|
await asyncio.to_thread(minio_client.make_bucket, settings.minio_bucket)
|
|
app.state.minio = minio_client
|
|
yield
|
|
await engine.dispose()
|
|
```
|
|
|
|
Rewrite the `/health` handler (preserving `@app.get("/health")` and `async def`):
|
|
```python
|
|
@app.get("/health")
|
|
async def health(request: Request):
|
|
checks = {}
|
|
try:
|
|
async with AsyncSessionLocal() as session:
|
|
await session.execute(text("SELECT 1"))
|
|
checks["postgres"] = "ok"
|
|
except Exception as e:
|
|
checks["postgres"] = f"error: {type(e).__name__}: {e}"
|
|
try:
|
|
ok = await asyncio.to_thread(request.app.state.minio.bucket_exists, settings.minio_bucket)
|
|
checks["minio"] = "ok" if ok else "error: bucket missing"
|
|
except Exception as e:
|
|
checks["minio"] = f"error: {type(e).__name__}: {e}"
|
|
status = "ok" if all(v == "ok" for v in checks.values()) else "degraded"
|
|
return {"status": status, "checks": checks}
|
|
```
|
|
|
|
Keep the existing CORS middleware (`allow_origins=["*"]` for Phase 1 — Phase 2 locks down).
|
|
|
|
Rewrite `backend/api/documents.py`. Imports: replace `from services import storage, extractor, classifier` with `from sqlalchemy.ext.asyncio import AsyncSession`, `from deps.db import get_db`, `from services import storage, extractor`, `from tasks.document_tasks import extract_and_classify`, `from services import classifier` (only used by the `/classify` endpoint), and keep `from fastapi import APIRouter, UploadFile, File, Form, HTTPException, Query, Depends` (add `Depends`). Preserve the router definition and `ALLOWED_MIME_TYPES` set.
|
|
|
|
For every route, append `session: AsyncSession = Depends(get_db)` to the signature. For each storage call, prepend `await` and pass `session` as the first arg:
|
|
- `upload_document`: read content, validate empty, generate filename + mime, call `await storage.save_upload(session, content, file.filename or "upload", mime)`, then `text = extractor.extract_text(saved["path"], mime)` — wait: `saved["path"]` is now the object_key, not a filesystem path. CHANGE the extraction step: pull bytes already in memory (`content`) and call the new `extractor.extract_text_from_bytes(content, mime)` helper introduced in Task 1. Build the `meta` dict exactly as before (preserve the keys), call `await storage.save_metadata(session, meta)`. If `auto_classify` is True, call `extract_and_classify.delay(saved["id"])` (NOTE: this re-fetches from MinIO inside the worker — acceptable for Phase 1; Phase 3 will pass the bytes through Redis directly if perf demands). Return `meta`. CHANGE: since classification is now async via Celery, the response no longer includes `topics` populated by the inline classifier call — set `meta["topics"] = []` and `meta["classified_at"] = None` and rely on the worker to update the DB row. Document this in a comment as the Phase 1 cutover behavior.
|
|
- `list_documents`: `docs = await storage.list_metadata(session, topic=topic)` then preserve the existing pagination math.
|
|
- `get_document`: `meta = await storage.get_metadata(session, doc_id)`; raise 404 if `None`.
|
|
- `delete_document`: `ok = await storage.delete_document(session, doc_id)`; raise 404 if `False`.
|
|
- `classify_document` (the `/classify` route): `meta = await storage.get_metadata(session, doc_id)`; raise 404 if None; preserve the inline `await classifier.classify_document(session, doc_id, topic_names)` call (this endpoint historically returned the topic list synchronously — keep that behavior; the upload-time path is the async one).
|
|
|
|
Rewrite `backend/api/topics.py` similarly: add `Depends` import, `from sqlalchemy.ext.asyncio import AsyncSession`, `from deps.db import get_db`. Add `session: AsyncSession = Depends(get_db)` to every route. Wrap every `storage.*` call with `await` and prepend `session`:
|
|
- `list_topics`: `topics = await storage.load_topics(session)`, `counts = await storage.topic_doc_counts(session)`.
|
|
- `create_topic`: `topic = await storage.create_topic(session, body.name, body.description, body.color)`.
|
|
- `update_topic`: `topic = await storage.update_topic(session, topic_id, name=body.name, description=body.description, color=body.color)`.
|
|
- `delete_topic`: `name = await storage.delete_topic(session, topic_id)`.
|
|
- `suggest_topics`: `meta = await storage.get_metadata(session, body.document_id)`; if None, 404; then `await classifier.suggest_topics_for_document(session, body.document_id)`.
|
|
|
|
Preserve every Pydantic model and HTTPException message verbatim.
|
|
</action>
|
|
<verify>
|
|
<automated>cd /Users/nik/Documents/Progamming/document_scanner/backend && python3 -c "
|
|
import os
|
|
os.environ.setdefault('REDIS_URL', 'redis://localhost:6379/0')
|
|
os.environ.setdefault('DATABASE_URL', 'postgresql+psycopg://docuvault_app:changeme_app@localhost:5432/docuvault')
|
|
import inspect
|
|
from main import app
|
|
# Confirm /health route exists with new shape
|
|
routes = {r.path for r in app.routes}
|
|
assert '/health' in routes, 'health route missing'
|
|
# Confirm session injection on documents and topics
|
|
from api.documents import upload_document, list_documents, get_document, delete_document
|
|
from api.topics import list_topics, create_topic, update_topic, delete_topic
|
|
for fn in [upload_document, list_documents, get_document, delete_document, list_topics, create_topic, update_topic, delete_topic]:
|
|
sig = inspect.signature(fn)
|
|
params = list(sig.parameters)
|
|
assert 'session' in params, f'{fn.__name__} missing session param: {params}'
|
|
print('routes-wired-ok')
|
|
"</automated>
|
|
</verify>
|
|
<acceptance_criteria>
|
|
- `backend/main.py` no longer contains `ensure_data_dirs` (verifiable: `grep -c "ensure_data_dirs" backend/main.py | grep -q "^0$"`)
|
|
- `backend/main.py` contains `from minio import Minio`, `from db.session import engine, AsyncSessionLocal`, `from config import settings`
|
|
- `backend/main.py` lifespan contains `app.state.minio = minio_client`
|
|
- `backend/main.py` lifespan contains `await engine.dispose()`
|
|
- `backend/main.py` `/health` contains both `SELECT 1` and `bucket_exists` probes
|
|
- `backend/main.py` `/health` returns the shape `{"status": ..., "checks": {"postgres": ..., "minio": ...}}` (verifiable by inspecting the return statement and by Task 3 live test)
|
|
- `backend/api/documents.py` contains `from deps.db import get_db` and `from tasks.document_tasks import extract_and_classify`
|
|
- `backend/api/documents.py` upload handler contains `extract_and_classify.delay(` (Celery enqueue) — verifiable via `grep -c "extract_and_classify.delay" backend/api/documents.py >= 1`
|
|
- `backend/api/documents.py` upload handler no longer contains `await classifier.classify_document` in the upload path (the only remaining classifier call is on the `/classify` endpoint) — verifiable via `grep -c "await classifier.classify_document" backend/api/documents.py | grep -q "^1$"`
|
|
- Every route in `backend/api/documents.py` and `backend/api/topics.py` contains `session: AsyncSession = Depends(get_db)` (verifiable via `grep -c "session: AsyncSession = Depends(get_db)" backend/api/documents.py >= 5` and similarly `>= 5` for topics.py)
|
|
- The Verify command prints `routes-wired-ok`
|
|
- `cd backend && python3 -m pytest tests/ -v --collect-only` exits 0 (collection succeeds — no import errors)
|
|
</acceptance_criteria>
|
|
<done>FastAPI lifespan creates the MinIO bucket and disposes the engine; `/health` probes both services; all document and topic routes use async session injection; upload-time classification is queued via Celery; the `/classify` endpoint remains synchronous for compatibility.</done>
|
|
</task>
|
|
|
|
<task type="auto" tdd="true">
|
|
<name>Task 3: Final cutover — delete legacy data/, prune config.py, prune conftest.py + test_documents.py legacy fixtures and sync tests, unfail async ports</name>
|
|
<files>backend/config.py, backend/tests/conftest.py, backend/tests/test_documents.py, backend/tests/test_health.py, backend/data</files>
|
|
<behavior>
|
|
- `backend/data/` directory and all its contents are deleted (D-04). The git repo no longer tracks this directory.
|
|
- `backend/config.py` no longer declares `DATA_DIR`, `UPLOADS_DIR`, `METADATA_DIR`, `TOPICS_FILE`, `ensure_data_dirs` (legacy flat-file constants). It retains `DEFAULT_SETTINGS`, `DEFAULT_SYSTEM_PROMPT`, the Pydantic `Settings` class, and the module-level `settings = Settings()`. `SETTINGS_FILE` is RETAINED (still used for the Phase-2-deferred settings JSON file path) but its value is rebased onto the new `settings.data_dir` field rather than a removed module-level constant.
|
|
- `backend/tests/conftest.py` no longer defines the autouse `isolated_data_dir` fixture (the flat-file scaffold). It no longer defines the sync `client` fixture (which built a `TestClient`). The only fixtures remaining are `db_session`, `async_client`, `sample_txt`, `sample_pdf`. Tests are rewired to use `async_client` everywhere.
|
|
- `backend/tests/test_documents.py` no longer contains any of the legacy sync test functions (`test_upload_txt_no_classify`, `test_upload_pdf_no_classify`, `test_list_documents`, `test_list_documents_filter_by_topic`, `test_get_document`, `test_get_document_not_found`, `test_delete_document`, `test_delete_document_not_found`, `test_upload_empty_file`) — they are DELETED. The `_async`-suffixed tests from Plan 02 have their `@pytest.mark.xfail` markers REMOVED and now run as live tests.
|
|
- `backend/tests/test_health.py` has the `@pytest.mark.xfail` marker removed from `test_health_checks_postgres_and_minio`; the old sync `test_health(client)` is REPLACED with `test_health_status_ok_sync_deprecated` that is `@pytest.mark.skip(reason="legacy sync client removed in plan 05")` OR deleted entirely. Choose deletion — cleaner.
|
|
- The full pytest run reports zero XFAIL/SKIPPED for Phase 1 tests except where the underlying service (live PostgreSQL/MinIO/Redis) is unavailable in the test env (in which case the integration tests are skipped via a fixture that probes for service availability — see action below).
|
|
</behavior>
|
|
<read_first>
|
|
- backend/config.py (Plan 01 output — identify exactly which legacy constants must be removed)
|
|
- backend/tests/conftest.py (Plan 02 output — verify which fixtures need to go)
|
|
- backend/tests/test_documents.py (Plan 02 output — verify which test functions to delete vs keep)
|
|
- backend/tests/test_health.py (Plan 02 output — same scrutiny)
|
|
- backend/services/storage.py (Plan 04 output — confirm it still imports `SETTINGS_FILE` from config; we will preserve `SETTINGS_FILE` as a derived path)
|
|
- backend/api/settings.py (read — uses `services.storage.load_settings/save_settings` which depend on `SETTINGS_FILE`)
|
|
- .planning/phases/01-infrastructure-foundation/01-CONTEXT.md (D-04 delete `data/` contents)
|
|
</read_first>
|
|
<action>
|
|
Step 1 — delete legacy data: run `git rm -rf backend/data/` (if tracked) and `rm -rf backend/data/`. Add `backend/data/` to `.gitignore`. If `services/storage.py` or any other file still references `UPLOADS_DIR`/`METADATA_DIR`/`TOPICS_FILE`, fix those references first (per Plan 04 they should already be gone for documents/topics; `SETTINGS_FILE` is the only remaining legacy path).
|
|
|
|
Step 2 — prune `backend/config.py`. Read the current file (post-Plan 01). Remove:
|
|
- `DATA_DIR = Path(os.environ.get("DATA_DIR", "/app/data"))`
|
|
- `UPLOADS_DIR = DATA_DIR / "uploads"`
|
|
- `METADATA_DIR = DATA_DIR / "metadata"`
|
|
- `TOPICS_FILE = DATA_DIR / "topics.json"`
|
|
- the `def ensure_data_dirs():` function
|
|
- the `import json` at the top if no longer used (keep if `DEFAULT_SETTINGS` interpolates from JSON anywhere — currently no)
|
|
- the `import os` (no longer needed since `Settings` reads env via Pydantic)
|
|
Preserve:
|
|
- `from pathlib import Path` (used by `SETTINGS_FILE`)
|
|
- `DEFAULT_SYSTEM_PROMPT` (used by `services/classifier.py`)
|
|
- `DEFAULT_SETTINGS` (used by `services/storage.load_settings` fallback)
|
|
- `class Settings(BaseSettings):` with all Phase 1 fields (Plan 01 output)
|
|
- `settings = Settings()` module instance
|
|
Rebase `SETTINGS_FILE` as a derived path computed from `settings.data_dir`:
|
|
`SETTINGS_FILE = Path(settings.data_dir) / "settings.json"` — placed AFTER the `settings = Settings()` line. Add a comment: `# SETTINGS_FILE: still flat-file in Phase 1; migrates to users.ai_provider in Phase 2`.
|
|
|
|
Step 3 — prune `backend/tests/conftest.py`:
|
|
- DELETE the entire `isolated_data_dir` fixture (the autouse one that monkey-patches `config.DATA_DIR` etc.)
|
|
- DELETE the sync `client` fixture (`with TestClient(app) as c: yield c`)
|
|
- KEEP the `db_session`, `async_client`, `sample_txt`, `sample_pdf` fixtures introduced in Plan 02. Promote `async_client` so the previous behavior — fail gracefully if `deps.db` does not exist — is replaced with a hard dependency: remove the `try/except ImportError: pytest.skip(...)` wrapper inside the async fixtures because `deps.db.get_db` now exists.
|
|
- ADD a new `pytest_asyncio.fixture(scope="session")` named `live_services_available` that probes localhost:5432, localhost:9000, localhost:6379 via `socket.create_connection(..., timeout=1)`; if any probe fails, the fixture yields `False`; otherwise `True`. Update the `async_client` fixture (or add a new `live_async_client` fixture) to use an actual PostgreSQL + MinIO when `live_services_available` is True, falling back to the in-memory aiosqlite engine when False. Use `pytest.mark.skipif(not live_services_available, reason="docker compose not running")` on integration tests that need real MinIO; unit tests using only the in-memory DB do not need the skip marker. (Simpler approach acceptable: detect via env var `INTEGRATION=1`; if unset, skip integration tests.)
|
|
|
|
Step 4 — prune `backend/tests/test_documents.py`:
|
|
- DELETE the legacy sync tests: `test_upload_txt_no_classify`, `test_upload_pdf_no_classify`, `test_list_documents`, `test_list_documents_filter_by_topic`, `test_get_document`, `test_get_document_not_found`, `test_delete_document`, `test_delete_document_not_found`, `test_upload_empty_file`. (Plan 02 left them in place during the cutover; this plan completes the deletion.)
|
|
- On every `_async`-suffixed test added in Plan 02, REMOVE the `@pytest.mark.xfail(strict=False, reason="async storage layer implemented in plan 05")` marker.
|
|
- Update any test that previously referenced `import services.storage as st; st.update_document_topics(...)` to use the async ORM API via the `db_session` fixture: `from db.models import Document, DocumentTopic; from sqlalchemy import insert; ...`. For tests that need a topic-tagged document, build it via the API itself (call `POST /api/topics` then `PATCH /api/documents/.../classify`).
|
|
|
|
Step 5 — prune `backend/tests/test_health.py`:
|
|
- DELETE the legacy `def test_health(client):` (it used the sync TestClient fixture which is gone).
|
|
- REMOVE the `@pytest.mark.xfail` marker from `test_health_checks_postgres_and_minio`.
|
|
- If `live_services_available` is False, this test should be skipped via `pytest.mark.skipif(...)`.
|
|
|
|
Step 6 — run the full suite end-to-end against the in-memory engine: `cd backend && python3 -m pytest tests/ -v` should exit 0 with the storage tests PASSED, the alembic tests PASSED (or SKIPPED if no PostgreSQL available — the in-memory aiosqlite test path covers them), the health test SKIPPED or PASSED, and the async document tests PASSED or SKIPPED depending on `live_services_available`.
|
|
</action>
|
|
<verify>
|
|
<automated>cd /Users/nik/Documents/Progamming/document_scanner && [ ! -d backend/data ] && echo "data-dir-deleted" && grep -c "DATA_DIR\|UPLOADS_DIR\|METADATA_DIR\|TOPICS_FILE\|ensure_data_dirs" backend/config.py | awk '{exit ($1 == 0) ? 0 : 1}' && echo "config-pruned" && cd backend && python3 -m pytest tests/ -v 2>&1 | tail -20</automated>
|
|
</verify>
|
|
<acceptance_criteria>
|
|
- `backend/data/` directory does not exist (verifiable: `[ ! -d backend/data ]` exits 0)
|
|
- `.gitignore` contains `backend/data/` (verifiable: `grep -Fx "backend/data/" .gitignore` exits 0)
|
|
- `backend/config.py` no longer mentions `DATA_DIR`, `UPLOADS_DIR`, `METADATA_DIR`, `TOPICS_FILE`, `ensure_data_dirs` (verifiable via the Verify command's grep-c check)
|
|
- `backend/config.py` still defines `DEFAULT_SETTINGS`, `DEFAULT_SYSTEM_PROMPT`, `class Settings(BaseSettings)`, `settings = Settings()`, and `SETTINGS_FILE = Path(...) / "settings.json"`
|
|
- `backend/tests/conftest.py` no longer defines `isolated_data_dir` (verifiable: `grep -c "def isolated_data_dir" backend/tests/conftest.py | grep -q "^0$"`)
|
|
- `backend/tests/conftest.py` no longer defines a sync `client` fixture using `TestClient` (verifiable: `grep -c "TestClient" backend/tests/conftest.py | grep -q "^0$"`)
|
|
- `backend/tests/test_documents.py` no longer contains the legacy sync test names (verifiable: `grep -cE "^def test_(upload_txt_no_classify|upload_pdf_no_classify|list_documents|get_document|delete_document|upload_empty_file)\b" backend/tests/test_documents.py | grep -q "^0$"`)
|
|
- `backend/tests/test_documents.py` no longer contains `@pytest.mark.xfail` markers (the cutover removes them — verifiable: `grep -c "@pytest.mark.xfail" backend/tests/test_documents.py | grep -q "^0$"`)
|
|
- `backend/tests/test_health.py` no longer contains `@pytest.mark.xfail` (verifiable: `grep -c "@pytest.mark.xfail" backend/tests/test_health.py | grep -q "^0$"`)
|
|
- `cd backend && python3 -m pytest tests/ -v 2>&1` shows 0 FAILED and 0 ERROR lines (verifiable: `python3 -m pytest tests/ 2>&1 | grep -E "^FAILED|^ERROR" | wc -l | grep -q "^0$"`)
|
|
- The Verify command output shows `data-dir-deleted` and `config-pruned`
|
|
</acceptance_criteria>
|
|
<done>The Phase 1 cutover is complete: no flat-file artifacts remain in code or on disk; the test suite uses only async fixtures; the legacy tests have been deleted; the async ports of every legacy test run as first-class tests.</done>
|
|
</task>
|
|
|
|
<task type="checkpoint:human-verify" gate="blocking">
|
|
<name>Task 4: End-to-end walking-skeleton verification — docker compose up + real PDF upload + Celery worker</name>
|
|
<files>(verification only)</files>
|
|
<read_first>
|
|
- .planning/phases/01-infrastructure-foundation/SKELETON.md (the success contract for this checkpoint)
|
|
- .planning/phases/01-infrastructure-foundation/01-03-SUMMARY.md (the Alembic upgrade output from Plan 03)
|
|
- .planning/phases/01-infrastructure-foundation/01-04-SUMMARY.md (the storage rewrite summary from Plan 04)
|
|
</read_first>
|
|
<what-built>
|
|
Plans 01-05 together: a fully wired DocuVault backend running on Docker Compose with PostgreSQL + MinIO + Redis + Celery + FastAPI. This checkpoint verifies the walking-skeleton end-to-end: a real document upload via the rewritten API persists metadata to PostgreSQL, stores bytes in MinIO with a UUID-based object key, enqueues extraction + classification on Redis, and the Celery worker processes the task. `GET /health` returns `postgres: ok` and `minio: ok`. ROADMAP.md Phase 1 success criteria #1, #3, and #4 are verified live here (#2 was verified in Plan 03).
|
|
</what-built>
|
|
<how-to-verify>
|
|
From the project root:
|
|
|
|
1. Ensure `.env` exists with all variables from `.env.example` filled in: `cp .env.example .env` (if not present) and replace each `changeme_*` placeholder with a value of your choice. The DATABASE_URL/DATABASE_MIGRATE_URL passwords MUST match the hardcoded passwords in `docker/postgres/initdb.d/01-init-users.sql` from Plan 01 (which itself was committed during Wave 1). The REDIS_URL password must match REDIS_PASSWORD.
|
|
|
|
2. Tear down any prior state: `docker compose down -v` (the `-v` deletes the postgres_data and minio_data named volumes so the init script will re-run).
|
|
|
|
3. Boot everything: `docker compose up --build -d`. Wait ~30 seconds.
|
|
|
|
4. Verify all services are healthy: `docker compose ps`. The `STATUS` column must show `Up (healthy)` for `postgres`, `minio`, `redis`, `backend`, AND `celery-worker`. If any is `unhealthy`, capture `docker compose logs <service>` and resolve before continuing.
|
|
|
|
5. Apply the migration against the live DB: `docker compose exec backend bash -lc "cd /app && alembic upgrade head"`. Must exit 0 with `Running upgrade -> 0001`.
|
|
|
|
6. Hit the health endpoint: `curl -s http://localhost:8000/health | python3 -m json.tool`. The response MUST be:
|
|
```
|
|
{
|
|
"status": "ok",
|
|
"checks": {
|
|
"postgres": "ok",
|
|
"minio": "ok"
|
|
}
|
|
}
|
|
```
|
|
|
|
7. Upload a real PDF or text file. Pick any small PDF (or use `printf 'Test document about invoices and contracts.' > /tmp/test.txt` first). Then:
|
|
```
|
|
curl -s -X POST http://localhost:8000/api/documents/upload \
|
|
-F "file=@/tmp/test.txt;type=text/plain" \
|
|
-F "auto_classify=true" | python3 -m json.tool
|
|
```
|
|
Confirm the response includes:
|
|
- `"id"` — a 36-character UUID string
|
|
- `"original_name": "test.txt"`
|
|
- `"size_bytes"` matching the file size
|
|
- `"topics": []` (classification is async — the Celery worker fills this in seconds later)
|
|
|
|
8. Confirm the document landed in PostgreSQL:
|
|
`docker compose exec postgres psql -U docuvault_app -d docuvault -c "SELECT id, filename, object_key, status FROM documents ORDER BY created_at DESC LIMIT 1;"`
|
|
— exactly one row; `object_key` starts with `null-user/` (D-03 sentinel from Plan 04); `status` is `pending` initially then `classified` or `classification_failed` after the worker runs.
|
|
|
|
9. Confirm the document landed in MinIO. The object key from step 8 will look like `null-user/<doc-uuid>/<random-uuid>.txt`. Either use the MinIO web console at `http://localhost:9001` (login with `MINIO_ROOT_USER` / `MINIO_ROOT_PASSWORD` from `.env`) and navigate to `docuvault` bucket → confirm the object exists with non-zero size — OR use `mc`:
|
|
`docker compose exec minio mc alias set local http://localhost:9000 $MINIO_ROOT_USER $MINIO_ROOT_PASSWORD` then `docker compose exec minio mc ls local/docuvault/null-user/`.
|
|
|
|
10. Confirm the Celery worker processed the task:
|
|
`docker compose logs celery-worker | tail -30`
|
|
— look for a `Task tasks.document_tasks.extract_and_classify[...] received` line followed by `succeeded` or a structured error. If the task succeeded, run:
|
|
`curl -s http://localhost:8000/api/documents | python3 -m json.tool`
|
|
— the response should show one item with `extracted_text` populated and possibly `topics` populated by the AI classifier (depending on AI provider config; if no `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` is set, classification will fail gracefully and `status` will be `classification_failed` — that is acceptable for this walking-skeleton check; the storage layer worked.).
|
|
|
|
11. Delete the document:
|
|
`curl -s -X DELETE http://localhost:8000/api/documents/<id-from-step-7>` returns `{"success": true}`.
|
|
Then confirm the MinIO object is gone: `docker compose exec minio mc ls local/docuvault/null-user/<doc-uuid>/` returns empty or "Object does not exist".
|
|
|
|
12. Run the test suite against the live stack:
|
|
`docker compose exec -e INTEGRATION=1 backend bash -lc "cd /app && pytest tests/ -v"`
|
|
— every test PASSED, zero FAILED, zero XFAIL (skips for integration tests when INTEGRATION=0 are acceptable on a host-only run; when INTEGRATION=1 inside the container with live services, they must run and pass).
|
|
</how-to-verify>
|
|
<expected-outcome>
|
|
All 12 verification steps succeed. The walking skeleton is live: PDF → API → PostgreSQL + MinIO + Celery → extracted text → classification → DB row. ROADMAP.md Phase 1 is complete.
|
|
</expected-outcome>
|
|
<if-broken>
|
|
Common failures and fixes:
|
|
- `/health` reports `postgres: error: ConnectionRefusedError`: postgres healthcheck didn't gate startup; check `depends_on: condition: service_healthy` is set on `backend` and `celery-worker`. Inspect `docker compose ps` and `docker compose logs postgres`.
|
|
- `/health` reports `minio: error: bucket missing`: the lifespan bucket-create failed. Check `docker compose logs backend` for the `make_bucket` error. Likely cause: `MINIO_ACCESS_KEY` / `MINIO_SECRET_KEY` mismatch — the lifespan client connects with app-level keys but MinIO only knows about the root user on first boot. Workaround for Phase 1: temporarily set `MINIO_ACCESS_KEY=$MINIO_ROOT_USER` and `MINIO_SECRET_KEY=$MINIO_ROOT_PASSWORD` in `.env` (Phase 2 will set up an app-level access policy via `mc admin user add` during MinIO init).
|
|
- Celery worker logs show `[ERROR/MainProcess] consumer: Cannot connect to redis://...`: the REDIS_PASSWORD or REDIS_URL is wrong, or the password contains a special character not URL-encoded. Re-confirm `REDIS_URL` form `redis://:<password>@redis:6379/0`.
|
|
- Upload returns 500 with `MissingGreenlet`: a session attribute access happened after commit; verify `expire_on_commit=False` in `db/session.py`.
|
|
- Task never runs: `docker compose logs celery-worker` shows it can't import `tasks.document_tasks`; verify `tasks/__init__.py` exists and `celery_app.autodiscover_tasks(["tasks"])` is called.
|
|
</if-broken>
|
|
<resume-signal>Type "approved" once steps 1-12 all pass. If any step fails, describe the failure mode and we resume with a fix plan.</resume-signal>
|
|
</task>
|
|
|
|
</tasks>
|
|
|
|
<threat_model>
|
|
## Trust Boundaries
|
|
|
|
| Boundary | Description |
|
|
|----------|-------------|
|
|
| Browser → FastAPI | HTTP/JSON (Phase 1: CORS `*` — Phase 2 locks down); multipart upload bytes traverse this boundary |
|
|
| FastAPI → Celery / Redis | Task payload is the document_id string only; no user input passed |
|
|
| FastAPI lifespan → MinIO | Bucket auto-create at startup; client persists on `app.state.minio` |
|
|
| Celery worker → MinIO + PostgreSQL | Worker re-fetches bytes from MinIO and reads/writes Document row |
|
|
|
|
## STRIDE Threat Register
|
|
|
|
| Threat ID | Category | Component | Disposition | Mitigation Plan |
|
|
|-----------|----------|-----------|-------------|-----------------|
|
|
| T-01-05-01 | Spoofing | Unauthenticated upload endpoint accepting any client | accept | Phase 1 has no auth (D-03 — user_id nullable); upload accessible to anyone reaching `localhost:8000`. Phase 2 adds JWT + CSRF + rate-limit. Documented in SKELETON.md "Out of Scope". |
|
|
| T-01-05-02 | Tampering | MIME-type spoofing on upload | mitigate | The existing `ALLOWED_MIME_TYPES` set in `api/documents.py` is preserved verbatim. Phase 4 (DOC-02) adds magic-byte verification before download/preview. |
|
|
| T-01-05-03 | Information Disclosure | `/health` revealing internal error class names | mitigate | `/health` error strings format as `f"error: {type(e).__name__}: {e}"` — exposes Python exception class name, which is acceptable for an internal/dev endpoint in Phase 1. Phase 2 will trim to `"error"` or `"unhealthy"` once the endpoint is reachable from the internet. Documented note in `main.py`. |
|
|
| T-01-05-04 | Tampering | Celery task receives untrusted document_id and might query arbitrary rows | mitigate | `extract_and_classify` only takes a `document_id` string from the upload path — never from a user query parameter. Task code does `session.get(Document, uuid.UUID(document_id))` which raises `ValueError` for non-UUID input; no SQL injection vector. Document row lookup is single-row by primary key only. |
|
|
| T-01-05-05 | Denial of Service | Lifespan bucket-create on every reboot blocks startup | mitigate | `if not bucket_exists: make_bucket` is idempotent — fast on warm starts. If MinIO is unreachable at startup, lifespan raises and the FastAPI app fails to boot — this is intentional and surfaces the failure to Compose's `depends_on: condition: service_healthy` (which gated startup but cannot catch a post-startup MinIO crash). |
|
|
| T-01-05-06 | Information Disclosure | `app.state.minio` reused across handlers | accept | The client holds connection state but no per-user credentials. All app handlers see the same `app.state.minio` — acceptable since Phase 1 has no per-user isolation. Phase 5 will introduce per-user `StorageBackend` instances for cloud backends. |
|
|
| T-01-05-SC | Tampering | npm/pip installs | N/A | No new package installs in this plan — all dependencies were added in Plan 01 and verified via RESEARCH.md Package Legitimacy Audit. |
|
|
</threat_model>
|
|
|
|
<verification>
|
|
- Tasks 1-3 are autonomous; Task 4 is a blocking human-verify checkpoint.
|
|
- After Task 4 approval, ROADMAP.md Phase 1 success criteria #1 (docker compose up healthy), #3 (extract + classify pipeline works), and #4 (MinIO key schema enforced) are all live-verified. Criterion #2 was verified in Plan 03 Task 3.
|
|
- `docker compose exec -e INTEGRATION=1 backend bash -lc "cd /app && pytest tests/ -v"` exits 0 with zero FAILED.
|
|
</verification>
|
|
|
|
<success_criteria>
|
|
- `backend/main.py` lifespan creates the MinIO bucket and disposes the engine; `/health` returns the postgres+minio shape per D-07.
|
|
- `backend/api/documents.py` and `backend/api/topics.py` are entirely async-session-driven; upload-time classification is queued via Celery `.delay()`.
|
|
- `backend/celery_app.py` and `backend/tasks/document_tasks.py` are wired and discoverable.
|
|
- `backend/services/classifier.py` accepts an `AsyncSession`.
|
|
- `backend/config.py` is pruned of legacy flat-file constants.
|
|
- `backend/data/` is deleted; `tests/test_documents.py` is async-only; `tests/conftest.py` no longer ships a sync TestClient fixture.
|
|
- `docker compose up` boots healthy; the walking skeleton end-to-end check from Task 4 passes.
|
|
</success_criteria>
|
|
|
|
<output>
|
|
Create `.planning/phases/01-infrastructure-foundation/01-05-SUMMARY.md` when done. Include: the exact `/health` JSON response observed at step 6 of Task 4, the actual MinIO object key produced at step 7-9, the Celery task log line from step 10, and any deviations from the plan (e.g., the temporary MinIO-root-as-app-key workaround called out in `if-broken`).
|
|
</output>
|